On Fri, Dec 2, 2016 at 7:57 AM, Savolainen, Petri (Nokia - FI/Espoo)
<petri.savolai...@nokia-bell-labs.com> wrote:
>
>
>> -----Original Message-----
>> From: Bill Fischofer [mailto:bill.fischo...@linaro.org]
>> Sent: Friday, December 02, 2016 3:12 PM
>> To: Savolainen, Petri (Nokia - FI/Espoo) <petri.savolainen@nokia-bell-
>> labs.com>
>> Cc: Elo, Matias (Nokia - FI/Espoo) <matias....@nokia-bell-labs.com>; lng-
>> odp-forward <lng-odp@lists.linaro.org>
>> Subject: Re: [lng-odp] [API-NEXT PATCH v3 4/5] linux-gen: sched: new
>> ordered queue implementation
>>
>> On Fri, Dec 2, 2016 at 4:51 AM, Savolainen, Petri (Nokia - FI/Espoo)
>> <petri.savolai...@nokia-bell-labs.com> wrote:
>> >
>> >>
>> >> This code still has the following in do_schedule():
>> >>
>> >> ordered = sched_cb_queue_is_ordered(qi);
>> >>
>> >> /* Do not cache ordered events locally to improve
>> >> * parallelism. Ordered context can only be released
>> >> * when the local cache is empty. */
>> >> if (ordered && max_num < MAX_DEQ)
>> >>     max_deq = max_num;
>> >>
>> >> To do what the comment says this should be changed to:
>> >>
>> >> if (ordered)
>> >>    max_deq = 1;
>> >>
>> >> Because in the case of ordered queues, you want to schedule
>> >> consecutive events to separate threads to allow them to be processed
>> >> in parallel. Allowing multiple events to be processed by a single
>> >> thread introduces head-of-line blocking.
>> >>
>> >> Of course, if you make this change I suspect some of the performance
>> >> gains measured in the simple test cases we have with this
>> >> implementation will go away since I suspect a good portion of those
>> >> gains is due to effectively turning ordered queues back into atomic
>> >> queues, which is what this sort of event batching with limited numbers
>> >> of events does.
>> >>
>> >
>> > This comment: "Do not cache ordered events locally..." refers to the
>> > scheduler's local event stash (odp_event_t ev_stash[MAX_DEQ]), which is not
>> > used with ordered queues. When the application requests N events, up to N
>> > (or MAX_DEQ if N > MAX_DEQ) events will be dequeued. There is no reason to
>> > fix this to one in the scheduler. If the application is worried about
>> > head-of-line blocking, it can itself limit N to 1. There are various
>> > applications and use cases. An application may use many ordered queues, so
>> > that ordering is guaranteed but parallelism is maximized (as in the common
>> > case each CPU will process events from different queues). Another
>> > application with a single, fat queue (and varying event size) may be more
>> > concerned about latency and ask for only a single event per schedule call.
>>
>> Except that's not what this code does. One can argue that if the
>> application chooses to call odp_schedule_multi() and an ordered queue is
>> selected, then it's OK to dequeue multiple events from that queue and
>> incur the head-of-line blocking that such batching implies, on the
>> assumption that the application knows what it's doing. But in this case
>> the code unconditionally grabs max_deq items from the ordered queue even
>> if the caller calls odp_schedule(), which only returns a single event.
>> The other batched events are held in sched_local.ev_stash, with
>> sched_local.num specifying the count, and will be serviced by this
>> thread only on subsequent scheduler calls.
>
> No, odp_schedule() calls do_schedule(out_queue, out_ev, max_num) with max_num
> set to 1, which is identical to calling odp_schedule_multi() with num set
> to 1.
>
> So, no event stashing in either case: sched() or sched_multi().

OK, I missed that. Thanks.
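
For anyone following along, here is roughly how I now read the two entry
points funnelling into the common path. This is only a sketch based on this
thread, not the actual patch: odp_schedule()/odp_schedule_multi() are the
public API, but the internal schedule_loop()/do_schedule() names, signatures
and the exact clamping are assumptions on my part.

odp_event_t odp_schedule(odp_queue_t *out_queue, uint64_t wait)
{
	odp_event_t ev = ODP_EVENT_INVALID;

	/* Single-event request: the internal path sees max_num == 1, so for
	 * an ordered queue max_deq is also 1 and nothing is left over to
	 * stash in sched_local.ev_stash. */
	schedule_loop(out_queue, wait, &ev, 1);

	return ev;
}

int odp_schedule_multi(odp_queue_t *out_queue, uint64_t wait,
		       odp_event_t events[], int num)
{
	/* Multi-event request: up to 'num' events (capped at MAX_DEQ) may be
	 * dequeued in one call, batching from an ordered queue as well. */
	return schedule_loop(out_queue, wait, events, num);
}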

>
>
>>
>> >
>> > These things should not be speculated about, but measured. That's why the
>> > ordered queue performance test was developed and sent to the list already
>> > over a month ago. It uses a few fat input queues and has a considerable
>> > amount of processing per ordered event. It demonstrates a 1.15-2.6x
>> > speedup (1-12 cores) compared to the old implementation. Additionally,
>> > the new implementation scales almost linearly and much better than the
>> > old one. These are results for N=32 in the application's
>> > schedule_multi() call. When the application limits N to 1, the
>> > throughput is halved, not increased.
>>
>> True, which is why, if you want a valid comparison between these two
>> potential implementations, you have to control this variability. That
>> means you either measure this code with N=1 vs. the base code, or else
>> you relax that restriction in the base code so that you have equivalent
>> batching behavior. We know that most of the speedup in atomic queues
>> comes from this batching, since the fast-path logic in do_schedule()
>> when sched_local.num > 0 is very beneficial. My contention is that
>> most of the measured speedup here is due to that change rather than
>> the implementation change. There are some clear advantages to this
>> new code, but we shouldn't overstate what those are with
>> non-comparable measurements.
>
> New ordered queues are not the same as atomic queues. Everything runs in
> parallel (e.g. 99% of application cycles) except the final synchronization
> phase of the enqueue operations (during the next schedule call).
>
> When setting N=1 ('num' of sched_multi()) in the test case, the new code is 
> about 30% faster than the old code. We tried to increase max_num (from 1) in 
> the old code, but it failed to schedule any packets after that.

Yes, the old code's release_ordered() would have to be changed, since it
assumes that N=1. But good to know.
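
For the record, the application-side trade-off we are both describing looks
roughly like this with the public API. Only a sketch: process_event() is a
placeholder for whatever work the application does, and the BURST value of 32
just mirrors the N=32 used in the test.

#include <odp_api.h>

#define BURST 32

/* Placeholder for per-event application work. */
static void process_event(odp_event_t ev)
{
	odp_event_free(ev);
}

/* Batch up to BURST events per schedule call for throughput, or request
 * one event at a time to keep head-of-line blocking on an ordered queue
 * to a minimum. */
static void worker_loop(int batch)
{
	odp_queue_t from;
	odp_event_t ev[BURST];
	int num, i;

	while (1) {
		if (batch) {
			num = odp_schedule_multi(&from, ODP_SCHED_WAIT,
						 ev, BURST);
		} else {
			ev[0] = odp_schedule(&from, ODP_SCHED_WAIT);
			num = (ev[0] == ODP_EVENT_INVALID) ? 0 : 1;
		}

		for (i = 0; i < num; i++)
			process_event(ev[i]);
	}
}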

>
> -Petri
>
>
