-----Original Message-----
> Date: Thu, 6 Apr 2017 12:54:10 +0200
> From: Ola Liljedahl <ola.liljed...@linaro.org>
> To: Brian Brooks <brian.bro...@arm.com>
> Cc: Jerin Jacob <jerin.ja...@caviumnetworks.com>,
>     "lng-odp@lists.linaro.org" <lng-odp@lists.linaro.org>
> Subject: Re: [lng-odp] [API-NEXT PATCH v2 00/16] A scalable software
>     scheduler
>
> On 5 April 2017 at 18:50, Brian Brooks <brian.bro...@arm.com> wrote:
> > On 04/05 21:27:37, Jerin Jacob wrote:
> >> -----Original Message-----
> >> > Date: Tue, 4 Apr 2017 13:47:52 -0500
> >> > From: Brian Brooks <brian.bro...@arm.com>
> >> > To: lng-odp@lists.linaro.org
> >> > Subject: [lng-odp] [API-NEXT PATCH v2 00/16] A scalable software
> >> >     scheduler
> >> > X-Mailer: git-send-email 2.12.2
> >> >
> >> > This work derives from Ola Liljedahl's prototype [1] which introduced
> >> > a scalable scheduler design based primarily on lock-free algorithms
> >> > and data structures designed to decrease contention. A thread searches
> >> > through a data structure containing only queues that are both
> >> > non-empty and allowed to be scheduled to that thread. Strict priority
> >> > scheduling is respected, and (W)RR scheduling may be used within
> >> > queues of the same priority. Lastly, pre-scheduling or stashing is not
> >> > employed, since it is optional functionality that can be implemented
> >> > in the application.
> >> >
> >> > In addition to scalable ring buffers, the algorithm also uses
> >> > unbounded concurrent queues. LL/SC and CAS variants exist for cases
> >> > where absence of the ABA problem cannot be proved, and also for cases
> >> > where the compiler's atomic built-ins may not be lowered to the
> >> > desired instruction(s). Finally, a version of the algorithm that uses
> >> > locks is also provided.
> >> >
> >> > See platform/linux-generic/include/odp_config_internal.h for further
> >> > build-time configuration.
> >> >
> >> > Use --enable-schedule-scalable to conditionally compile this scheduler
> >> > into the library.
> >>
> >> This is interesting stuff.
> >>
> >> Do you have any performance/latency numbers, in comparison to the
> >> existing scheduler, for completing say a two-stage (ORDERED->ATOMIC)
> >> or N-stage pipeline on any platform?
>
> It is still a SW implementation; there is overhead associated with queue
> enqueue/dequeue and the scheduling itself. So for an N-stage pipeline,
> overhead will accumulate. If only a subset of threads is associated with
> each stage (this could be beneficial for I-cache hit rate), there will be
> less need for scalability. What is the recommended strategy here for
> OCTEON/ThunderX?
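(One way to express "a subset of threads per stage" in ODP, for any
scheduler implementation, is the schedule group mechanism that comes up in
the reply below. A minimal sketch, assuming the standard
odp_schedule_group_create()/odp_thrmask API; the thread IDs, group name,
and queue name are illustrative:)

    #include <odp_api.h>

    /* Restrict a stage-1 queue to worker threads 0..3 (illustrative).
     * Assumes ODP is initialized and the workers are ODP threads. */
    static odp_queue_t create_stage1_queue(void)
    {
            odp_thrmask_t mask;
            odp_schedule_group_t grp;
            odp_queue_param_t qparam;
            int thr;

            odp_thrmask_zero(&mask);
            for (thr = 0; thr < 4; thr++)
                    odp_thrmask_set(&mask, thr);

            /* Only threads in 'mask' are handed events from this group */
            grp = odp_schedule_group_create("stage1_grp", &mask);

            odp_queue_param_init(&qparam);
            qparam.type = ODP_QUEUE_TYPE_SCHED;
            qparam.sched.prio = ODP_SCHED_PRIO_DEFAULT;
            qparam.sched.sync = ODP_SCHED_SYNC_ORDERED;
            qparam.sched.group = grp;

            return odp_queue_create("stage1_q", &qparam);
    }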
From the point of view of portable event-driven applications (ones that
work on both embedded and server-class chips), the SW scheduler is an
important piece.

> All threads/cores share all work?

That is the recommended mode with a HW scheduler, as it supports this
natively. But HW also provides the means to partition the workload, based
on ODP schedule groups.

> >
> > To give an idea, the avg latency reported by odp_sched_latency is down
> > to half that of other schedulers (pre-scheduling/stashing disabled) on
> > 4c A53, 16c A57, and 12c Broadwell. We are still preparing numbers, and
> > I think it's worth mentioning that they are subject to change as this
> > patch series changes over time.
> >
> > I am not aware of an existing benchmark that involves switching between
> > different queue types. Perhaps this is happening in an example app?

> This could be useful in e.g. IPsec termination. Use an atomic stage for
> the replay protection check and update. Now ODP has ordered locks for
> that, so the "atomic" (exclusive) section can be achieved from an
> ordered processing stage. Perhaps Jerin knows some other application
> that utilises two-stage ORDERED->ATOMIC processing.

We see ORDERED->ATOMIC as the main use case for basic packet forwarding:
stage 1 (ORDERED) to process on N cores, and stage 2 (ATOMIC) to maintain
the ingress order.

> >> When we say scalable scheduler, what application/means is used to
> >> quantify scalability?

> It starts with the design: use non-blocking data structures and try to
> distribute data to threads so that they do not access shared data very
> often. Some of this is a little detrimental to single-threaded
> performance; you need to use more atomic operations. It seems to work
> well on ARM (A53, A57) though; the penalty is higher on x86 (x86 is very
> good with spin locks, and cmpxchg seems to have more overhead compared
> to ldxr/stxr on ARM, which can have weaker memory ordering constraints).
> We actually use different synchronisation strategies on ARM and on x86
> (compile-time configuration).

Another school of thought is to avoid all locking by using only
single-producer/single-consumer channels, and to create N such channels so
that no locking primitives at all are needed for communication.

> You can read more here:
> https://docs.google.com/presentation/d/1BqAdni4aP4aHOqO6fNO39-0MN9zOntI-2ZnVTUXBNSQ
> I also did an internal presentation on the scheduler prototype back at
> Las Vegas; that presentation might also be somewhere on the Linaro web
> site.

Thanks for the presentation.

> >>
> >> Do you have any numbers, in comparison to the existing scheduler, to
> >> show the magnitude of the scalability on any platform?
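(For reference, a minimal sketch of the two-stage ORDERED->ATOMIC worker
loop discussed above, assuming the standard ODP scheduler API;
process_packet()/finish_packet() are illustrative helpers, not ODP calls:)

    #include <odp_api.h>

    void process_packet(odp_event_t ev); /* stage-1 work, illustrative */
    void finish_packet(odp_event_t ev);  /* stage-2 work, illustrative */

    /* Stage 1 runs on an ORDERED queue in parallel on N cores; stage 2
     * runs on an ATOMIC queue so the per-flow section is exclusive and
     * ingress order is maintained. */
    int worker(void *arg)
    {
            odp_queue_t atomic_q = *(odp_queue_t *)arg; /* stage-2 queue */
            odp_queue_t from;
            odp_event_t ev;

            for (;;) {
                    ev = odp_schedule(&from, ODP_SCHED_WAIT);

                    if (from == atomic_q) {
                            finish_packet(ev); /* stage 2: exclusive, in order */
                    } else {
                            process_packet(ev); /* stage 1: parallel */
                            /* Enqueue from the ordered context; the
                             * scheduler restores ingress order here. */
                            odp_queue_enq(atomic_q, ev);
                    }
            }
            return 0;
    }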
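(And a minimal sketch of the single-producer/single-consumer channel idea
mentioned above, in plain C11 atomics; the ring size and element type are
illustrative. With exactly one writer per index, no CAS or LL/SC is
needed, only acquire/release pairs:)

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define RING_SIZE 256 /* power of two */

    struct spsc_ring {
            _Atomic uint32_t head; /* written by the producer only */
            _Atomic uint32_t tail; /* written by the consumer only */
            void *slot[RING_SIZE];
    };

    static bool spsc_enq(struct spsc_ring *r, void *ev)
    {
            uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
            uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

            if (head - tail == RING_SIZE)
                    return false; /* full */
            r->slot[head & (RING_SIZE - 1)] = ev;
            /* Release makes the slot write visible before the new head */
            atomic_store_explicit(&r->head, head + 1, memory_order_release);
            return true;
    }

    static bool spsc_deq(struct spsc_ring *r, void **ev)
    {
            uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
            uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);

            if (head == tail)
                    return false; /* empty */
            *ev = r->slot[tail & (RING_SIZE - 1)];
            atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
            return true;
    }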