On Thu, Apr 6, 2017 at 1:46 PM, Bill Fischofer <bill.fischo...@linaro.org> wrote:
> On Thu, Apr 6, 2017 at 1:32 PM, Ola Liljedahl <ola.liljed...@linaro.org>
> wrote:
>> On 6 April 2017 at 13:48, Jerin Jacob <jerin.ja...@caviumnetworks.com> wrote:
>>> -----Original Message-----
>>>> Date: Thu, 6 Apr 2017 12:54:10 +0200
>>>> From: Ola Liljedahl <ola.liljed...@linaro.org>
>>>> To: Brian Brooks <brian.bro...@arm.com>
>>>> Cc: Jerin Jacob <jerin.ja...@caviumnetworks.com>,
>>>>  "lng-odp@lists.linaro.org" <lng-odp@lists.linaro.org>
>>>> Subject: Re: [lng-odp] [API-NEXT PATCH v2 00/16] A scalable software
>>>>  scheduler
>>>>
>>>> On 5 April 2017 at 18:50, Brian Brooks <brian.bro...@arm.com> wrote:
>>>> > On 04/05 21:27:37, Jerin Jacob wrote:
>>>> >> -----Original Message-----
>>>> >> > Date: Tue, 4 Apr 2017 13:47:52 -0500
>>>> >> > From: Brian Brooks <brian.bro...@arm.com>
>>>> >> > To: lng-odp@lists.linaro.org
>>>> >> > Subject: [lng-odp] [API-NEXT PATCH v2 00/16] A scalable software
>>>> >> >  scheduler
>>>> >> > X-Mailer: git-send-email 2.12.2
>>>> >> >
>>>> >> > This work derives from Ola Liljedahl's prototype [1], which introduced
>>>> >> > a scalable scheduler design based primarily on lock-free algorithms
>>>> >> > and data structures designed to decrease contention. A thread searches
>>>> >> > through a data structure containing only queues that are both
>>>> >> > non-empty and allowed to be scheduled to that thread. Strict priority
>>>> >> > scheduling is respected, and (W)RR scheduling may be used within
>>>> >> > queues of the same priority. Lastly, pre-scheduling or stashing is not
>>>> >> > employed, since it is optional functionality that can be implemented
>>>> >> > in the application.
>>>> >> >
>>>> >> > In addition to scalable ring buffers, the algorithm also uses
>>>> >> > unbounded concurrent queues. LL/SC and CAS variants exist in cases
>>>> >> > where absence of the ABA problem cannot be proved, and also in cases
>>>> >> > where the compiler's atomic built-ins may not be lowered to the
>>>> >> > desired instruction(s). Finally, a version of the algorithm that uses
>>>> >> > locks is also provided.
>>>> >> >
>>>> >> > See platform/linux-generic/include/odp_config_internal.h for further
>>>> >> > build time configuration.
>>>> >> >
>>>> >> > Use --enable-schedule-scalable to conditionally compile this scheduler
>>>> >> > into the library.
>>>> >>
>>>> >> This is interesting stuff.
>>>> >>
>>>> >> Do you have any performance/latency numbers in comparison to the
>>>> >> existing scheduler for completing, say, a two-stage (ORDERED->ATOMIC)
>>>> >> or N-stage pipeline on any platform?
>>>> It is still a SW implementation; there is overhead associated with queue
>>>> enqueue/dequeue and the scheduling itself. So for an N-stage pipeline,
>>>> overhead will accumulate.
>>>> If only a subset of threads is associated with each stage (this could be
>>>> beneficial for I-cache hit rate), there will be less need for
>>>> scalability.
>>>> What is the recommended strategy here for OCTEON/ThunderX?
>>>
>>> In the view of portable event-driven applications (ones that work on both
>>> embedded and server-capable chips), the SW scheduler is an important
>>> piece.
>>>
>>>> All threads/cores share all work?
>>>
>>> That is the recommended mode for HW, as it is supported natively. But HW
>>> provides means to partition the workload based on ODP schedule groups.
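
For readers following along, partitioning via schedule groups looks roughly
like the sketch below (untested; the group/queue names and the
one-thread-per-group split are invented for illustration):

#include <odp_api.h>

/* Create a schedule group containing only the calling thread and bind a
 * queue to it, so only this thread is eligible to receive its events. */
static odp_queue_t create_stage1_queue(void)
{
        odp_thrmask_t mask;
        odp_schedule_group_t grp;
        odp_queue_param_t qp;

        odp_thrmask_zero(&mask);
        odp_thrmask_set(&mask, odp_thread_id());
        grp = odp_schedule_group_create("stage1", &mask);

        odp_queue_param_init(&qp);
        qp.type = ODP_QUEUE_TYPE_SCHED;
        qp.sched.prio = ODP_SCHED_PRIO_DEFAULT;
        qp.sched.sync = ODP_SCHED_SYNC_ORDERED;
        qp.sched.group = grp;   /* only threads in grp get these events */

        return odp_queue_create("stage1_q", &qp);
}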
>>>
>>>
>>>>
>>>> >
>>>> > To give an idea, the avg latency reported by odp_sched_latency is down
>>>> > to half that of other schedulers (pre-scheduling/stashing disabled) on
>>>> > 4c A53, 16c A57, and 12c Broadwell. We are still preparing numbers, and
>>>> > I think it's worth mentioning that they are subject to change as this
>>>> > patch series changes over time.
>>>> >
>>>> > I am not aware of an existing benchmark that involves switching between
>>>> > different queue types. Perhaps this is happening in an example app?
>>>> This could be useful in e.g. IPsec termination. Use an atomic stage for
>>>> the replay protection check and update. Now ODP has ordered locks for
>>>> that, so the "atomic" (exclusive) section can be achieved from an
>>>> ordered processing stage. Perhaps Jerin knows some other application
>>>> that utilises two-stage ORDERED->ATOMIC processing.
>>>
>>> We see ORDERED->ATOMIC as the main use case for basic packet forwarding:
>>> Stage 1 (ORDERED) to process on N cores and Stage 2 (ATOMIC) to maintain
>>> the ingress order.
>> Doesn't ORDERED scheduling maintain the ingress packet order all the way
>> to the egress interface? At least that's my understanding of ODP ordered
>> queues.
>> From an ODP perspective, I fail to see how the ATOMIC stage is needed.
>
> As pointed out earlier, ordered locks are another option to avoid a
> separate processing stage simply to do in-sequence operations within an
> ordered flow. I'd be curious to understand the use-case in a bit more
> detail here. Ordered queues preserve the originating queue's order;
> however, achieving end-to-end ordering across multiple processing stages
> requires that flows traverse only ordered or atomic queues. If a parallel
> queue is used, ordering is indeterminate from that point on in the
> pipeline.
>
>>
>>>
>>>
>>>>
>>>> >
>>>> >> When we say scalable scheduler, what application/means were used to
>>>> >> quantify scalability?
>>>> It starts with the design: use non-blocking data structures and try to
>>>> distribute data to threads so that they do not access shared data very
>>>> often. Some of this is a little detrimental to single-threaded
>>>> performance; you need to use more atomic operations. It seems to work
>>>> well on ARM (A53, A57) though; the penalty is higher on x86 (x86 is very
>>>> good with spin locks, and cmpxchg seems to have more overhead compared
>>>> to ldxr/stxr on ARM, which can have fewer memory ordering constraints).
>>>> We actually use different synchronisation strategies on ARM and on x86
>>>> (compile-time configuration).
>>>
>>> Another school of thought is to avoid locking altogether by using only a
>>> single producer and a single consumer, and creating N such channels so
>>> that no locking primitives are needed for communication.

An SPSC queue alleviates the need for an atomic RMW, but does it help with
memory ordering or interconnect behavior?
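
To make that concrete, the kind of channel I have in mind is a minimal C11
SPSC ring along these lines (untested sketch, names invented). There is no
atomic RMW, but the acquire/release pairs remain, and the head, tail, and
slot cache lines still migrate between the two cores:

#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 256 /* must be a power of two */

struct spsc_ring {
        _Atomic uint32_t tail;  /* written by producer only */
        _Atomic uint32_t head;  /* written by consumer only */
        void *slot[RING_SIZE];
};

/* Producer side: relaxed load of own index, acquire load of the peer's. */
static int spsc_enq(struct spsc_ring *r, void *ev)
{
        uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
        uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);

        if (tail - head == RING_SIZE)
                return -1; /* full */
        r->slot[tail & (RING_SIZE - 1)] = ev;
        /* Release: the slot write must be visible before the new tail. */
        atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
        return 0;
}

/* Consumer side: mirror image of the producer. */
static void *spsc_deq(struct spsc_ring *r)
{
        uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
        uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
        void *ev;

        if (head == tail)
                return NULL; /* empty */
        ev = r->slot[head & (RING_SIZE - 1)];
        /* Release: the slot read must complete before the producer may
         * reuse that slot. */
        atomic_store_explicit(&r->head, head + 1, memory_order_release);
        return ev;
}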
>> But such N independent channels will limit per-flow throughput to the
>> single-threaded performance of the slowest stage (e.g. an individual CPU
>> core). It works great when you have the fastest CPUs, not so great when
>> you have the more power/area-efficient CPU cores (which have lower
>> single-threaded performance).
>
> True, however if you have a sufficient number of cores you may be able
> to process dozens (or hundreds) of flows in parallel, and as long as
> each flow doesn't require more than a single core's worth of processing
> power you win. Even if you need more than one core, having 2- or 4-core
> clusters process a single flow, and then having a dozen or more such
> clusters running independently, may be hugely beneficial vs. a smaller
> number of lumbering cores that need to share flows.
>
>>
>>
>>>
>>>>
>>>> You can read more here:
>>>> https://docs.google.com/presentation/d/1BqAdni4aP4aHOqO6fNO39-0MN9zOntI-2ZnVTUXBNSQ
>>>> I also did an internal presentation on the scheduler prototype back in
>>>> Las Vegas; that presentation might also be somewhere on the Linaro web
>>>> site.
>>>
>>> Thanks for the presentation.
>>>
>>>>
>>>>
>>>> >>
>>>> >> Do you have any numbers in comparison to the existing scheduler to
>>>> >> show the magnitude of the scalability on any platform?
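
Coming back to the ordered-lock point above: for something like the IPsec
replay-protection check, my understanding is the ORDERED stage would look
roughly like the sketch below (untested). The SA lookup and replay-window
helpers are invented for illustration, and the source queue is assumed to
have been created with sched.lock_count >= 1. The ordered lock serialises
threads in ingress order within the critical section, so no separate
ATOMIC stage is needed:

#include <odp_api.h>

/* Hypothetical per-SA state and helpers, for illustration only. */
struct sa_state;
extern struct sa_state *sa_lookup(odp_packet_t pkt);
extern uint32_t esp_seq(odp_packet_t pkt);
extern int replay_check_update(struct sa_state *sa, uint32_t seq);

/* One iteration of a worker servicing an ORDERED queue. */
static void ipsec_ordered_stage(void)
{
        odp_event_t ev = odp_schedule(NULL, ODP_SCHED_WAIT);
        odp_packet_t pkt = odp_packet_from_event(ev);
        struct sa_state *sa = sa_lookup(pkt);
        int ok;

        /* Bulk of the work (decrypt, auth) runs out of order here. */

        /* Exclusive section, entered in ingress order, via ordered lock 0. */
        odp_schedule_order_lock(0);
        ok = replay_check_update(sa, esp_seq(pkt));
        odp_schedule_order_unlock(0);

        if (!ok)
                odp_packet_free(pkt); /* replay detected: drop */
        /* else enqueue to the next stage / transmit */
}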