On Fri, Dec 2, 2016 at 7:57 AM, Savolainen, Petri (Nokia - FI/Espoo)
<petri.savolai...@nokia-bell-labs.com> wrote:
>
>> -----Original Message-----
>> From: Bill Fischofer [mailto:bill.fischo...@linaro.org]
>> Sent: Friday, December 02, 2016 3:12 PM
>> To: Savolainen, Petri (Nokia - FI/Espoo) <petri.savolainen@nokia-bell-labs.com>
>> Cc: Elo, Matias (Nokia - FI/Espoo) <matias....@nokia-bell-labs.com>;
>> lng-odp-forward <lng-odp@lists.linaro.org>
>> Subject: Re: [lng-odp] [API-NEXT PATCH v3 4/5] linux-gen: sched: new
>> ordered queue implementation
>>
>> On Fri, Dec 2, 2016 at 4:51 AM, Savolainen, Petri (Nokia - FI/Espoo)
>> <petri.savolai...@nokia-bell-labs.com> wrote:
>> >
>> >> This code still has the following in do_schedule():
>> >>
>> >>     ordered = sched_cb_queue_is_ordered(qi);
>> >>
>> >>     /* Do not cache ordered events locally to improve
>> >>      * parallelism. Ordered context can only be released
>> >>      * when the local cache is empty. */
>> >>     if (ordered && max_num < MAX_DEQ)
>> >>             max_deq = max_num;
>> >>
>> >> To do what the comment says, this should be changed to:
>> >>
>> >>     if (ordered)
>> >>             max_deq = 1;
>> >>
>> >> because in the case of ordered queues you want to schedule
>> >> consecutive events to separate threads so that they can be
>> >> processed in parallel. Allowing multiple events to be processed
>> >> by a single thread introduces head-of-line blocking.
>> >>
>> >> Of course, if you make this change I suspect some of the
>> >> performance gains measured in the simple test cases we have for
>> >> this implementation will go away, since I suspect a good portion
>> >> of those gains comes from effectively turning ordered queues back
>> >> into atomic queues, which is what this sort of event batching with
>> >> limited numbers of events does.
>> >
>> > This comment ("Do not cache ordered events locally...") refers to
>> > the scheduler's local event stash (odp_event_t ev_stash[MAX_DEQ]),
>> > which is not used with ordered queues. When the application
>> > requests N events, up to N (or MAX_DEQ, if N > MAX_DEQ) events
>> > will be dequeued. There is no reason to fix this to one in the
>> > scheduler. If the application is worried about head-of-line
>> > blocking, it can itself limit N to 1. There are various
>> > applications and use cases. An application may use many ordered
>> > queues, so that ordering is guaranteed but parallelism is
>> > maximized (in the common case each CPU will process events from
>> > different queues). Another application, with a single fat queue
>> > (and varying event sizes), may be more concerned about latency and
>> > ask for only a single event per schedule call.
>>
>> Except that's not what this code does. One can argue that if the
>> application chooses to call odp_schedule_multi() and an ordered
>> queue is selected, then it's OK to dequeue multiple events from that
>> queue and incur the head-of-line blocking that such batching
>> implies, on the assumption that the application knows what it's
>> doing. But in this case the code unconditionally grabs max_deq items
>> from the ordered queue even if the caller calls odp_schedule(),
>> which only returns a single event. The other batched events are held
>> in sched_local.ev_stash, with sched_local.num specifying the count,
>> and will be serviced by this thread only on subsequent scheduler
>> calls.
>
> No, odp_schedule() calls do_schedule(out_queue, out_ev, max_num) with
> max_num set to 1, which is an operation identical to
> odp_schedule_multi() with num set to 1.
>
> So, no event stashing in either case: sched() or sched_multi().
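
To make that equivalence concrete, here is a minimal sketch of the
wrapper relationship Petri describes. The do_schedule() name follows
the odp-linux internals quoted above, but the bodies are illustrative
only (wait/timeout handling is omitted), not the actual implementation:

    /* Sketch only: not the actual odp-linux code. Both public calls
     * funnel into the same internal do_schedule(); odp_schedule()
     * simply fixes max_num to 1, so at most one event is dequeued and
     * nothing is left over to stash locally. */

    odp_event_t odp_schedule(odp_queue_t *from, uint64_t wait)
    {
            odp_event_t ev;

            if (do_schedule(from, &ev, 1) > 0)
                    return ev;

            return ODP_EVENT_INVALID;
    }

    int odp_schedule_multi(odp_queue_t *from, uint64_t wait,
                           odp_event_t events[], int num)
    {
            /* Up to 'num' (internally capped at MAX_DEQ) events. */
            return do_schedule(from, events, num);
    }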
OK, I missed that. Thanks.

>
>> >
>> > These things should not be speculated about, but measured. That's
>> > why the ordered queue performance test was developed and sent to
>> > the list over a month ago. It uses a few fat input queues and has
>> > a considerable amount of processing per ordered event. It
>> > demonstrates a 1.15 - 2.6x speedup (1 - 12 cores) compared to the
>> > old implementation. Additionally, the new implementation scales
>> > almost linearly, and much better than the old one. These are
>> > results for N=32 in the application's schedule_multi call. When
>> > the application limits N to 1, the throughput is halved, not
>> > increased.
>>
>> True, which is why, if you want a valid comparison between these two
>> implementations, you have to control this variability. That means
>> you either measure this code with N=1 against the base code, or you
>> relax that restriction in the base code so that you have equivalent
>> batching behavior. We know that most of the speedup in atomic queues
>> comes from this batching, since the fast-path logic in do_schedule()
>> when sched_local.num > 0 is very beneficial. My contention is that
>> most of the measured speedup here is due to that change rather than
>> to the implementation change. There are some clear advantages to
>> this new code, but we shouldn't overstate them with non-comparable
>> measurements.
>
> New ordered queues are not the same as atomic queues. Everything else
> runs in parallel (e.g. 99% of application cycles); only the final
> synchronization phase of enqueue operations (during the next
> schedule call) is serialized.
>
> When setting N=1 ('num' of sched_multi()) in the test case, the new
> code is about 30% faster than the old code. We tried to increase
> max_num (from 1) in the old code, but it failed to schedule any
> packets after that.

Yes, the old code's release_ordered() code would have to be changed,
since it assumes that N=1. But good to know.

> -Petri
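
As a footnote to the N discussion: the batching trade-off really is
under application control via the public ODP API. A sketch of a worker
loop follows; BURST, out_queue, and the processing step are
placeholders, not from the thread. Larger bursts amortize scheduler
overhead, while BURST = 1 minimizes head-of-line blocking on a single
fat ordered queue:

    #include <odp_api.h>

    #define BURST 32  /* application-chosen; use 1 to avoid HOL blocking */

    /* Sketch of an application worker loop. Per the discussion above,
     * the ordered context held by this thread is synchronized/released
     * on the next odp_schedule_multi() call, after the enqueues below. */
    static void worker_loop(odp_queue_t out_queue)
    {
            odp_event_t ev[BURST];
            odp_queue_t from;
            int num, i;

            while (1) {
                    /* Up to BURST events from one (possibly ordered)
                     * queue; 'from' reports which queue was selected. */
                    num = odp_schedule_multi(&from, ODP_SCHED_WAIT,
                                             ev, BURST);

                    for (i = 0; i < num; i++) {
                            /* ... process ev[i] ... */
                            if (odp_queue_enq(out_queue, ev[i]))
                                    odp_event_free(ev[i]);
                    }
            }
    }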