> This code still has the following in do_schedule():
>
> ordered = sched_cb_queue_is_ordered(qi);
>
> /* Do not cache ordered events locally to improve
>  * parallelism. Ordered context can only be released
>  * when the local cache is empty. */
> if (ordered && max_num < MAX_DEQ)
>         max_deq = max_num;
>
> To do what the comment says, this should be changed to:
>
> if (ordered)
>         max_deq = 1;
>
> Because in the case of ordered queues, you want to schedule
> consecutive events to separate threads to allow them to be processed
> in parallel. Allowing multiple events to be processed by a single
> thread introduces head-of-line blocking.
>
> Of course, if you make this change I suspect some of the performance
> gains measured in the simple test cases we have with this
> implementation will go away, since I suspect a good portion of those
> gains is due to effectively turning ordered queues back into atomic
> queues, which is what this sort of event batching with limited numbers
> of events does.
This comment, "Do not cache ordered events locally...", refers to the scheduler's local event stash (odp_event_t ev_stash[MAX_DEQ]), which is not used with ordered queues. When the application requests N events, up to N (or MAX_DEQ, if N > MAX_DEQ) events will be dequeued. There is no reason to fix this to one in the scheduler. If an application is worried about head-of-line blocking, it can itself limit N to 1.

There are various applications and use cases. An application may use many ordered queues, so that ordering is guaranteed but parallelism is maximized (since in the common case each CPU will process events from different queues). Another application, with a single fat queue (and varying event sizes), may be more concerned about latency and ask for only a single event per schedule call.

These things should not be speculated about, but measured. That's why the ordered queue performance test was developed and sent to the list already over a month ago. It uses a few fat input queues and performs a considerable amount of processing per ordered event. It demonstrates a 1.15 - 2.6x speedup (1 - 12 cores) compared to the old implementation. Additionally, the new implementation scales almost linearly and much better than the old one. These are results for N=32 in the application's schedule_multi call. When the application limits N to 1, the throughput is halved, not increased.

-Petri