On Mon, May 06, 2019 at 07:10:40PM +0100, Will Deacon wrote:
> On Mon, May 06, 2019 at 06:13:12AM +0000, Jayachandran Chandrasekharan Nair wrote:
> > Perhaps someone from ARM can chime in here how the cas/yield combo
> > is expected to work when there is contention. ThunderX2 does not
> > do much with the yield, but I don't expect any ARM implementation
> > to treat YIELD as a hint not to yield, but to get/keep exclusive
> > access to the last failed CAS location.
> 
> Just picking up on this as "someone from ARM".
> 
> The yield instruction in our implementation of cpu_relax() is *only* there
> as a scheduling hint to QEMU so that it can treat it as an internal
> scheduling hint and run some other thread; see 1baa82f48030 ("arm64:
> Implement cpu_relax as yield"). We can't use WFE or WFI blindly here, as it
> could be a long time before we see a wake-up event such as an interrupt. Our
> implementation of smp_cond_load_acquire() is much better for that kind of
> thing, but doesn't help at all for a contended CAS loop where the variable
> is actually changing constantly.

Looking through the perf output of this case (open/close of a file from
multiple CPUs), I see that refcount is a significant factor in most
kernel configurations - and that too uses cmpxchg (without yield).
x86 has an optimized inline version of refcount that helps
significantly. Do you think this is worth looking at for arm64?
 
> Implementing yield in the CPU may generally be beneficial for SMT designs so
> that the hardware resources aren't wasted when spinning round a busy loop.

Yield is probably used in sub-optimal implementations of delay or wait.
Its behaviour is going to differ across implementations and revisions
(given the description in the ARM spec). Having a more yielding(?)
implementation would be equally problematic, especially in the lockref
case.

> For this particular discussion (i.e. lockref), however, it seems as though
> the cpu_relax() call is questionable to start with.

In the case of lockref, taking out the yield/pause and dropping to the
queued spinlock after some cycles appears to me to be a better approach.
Relying on the quality of cpu_relax() on the specific processor to
mitigate contention is going to be tricky anyway.
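
A rough illustration of the idea above, as a userspace C11 sketch rather
than kernel code. Everything named sketch_* is hypothetical, as are the
retry budget and the bit layout (bit 0 as the lock, the count above it).
The test-and-set loop only stands in for the queued spinlock; this is not
how lib/lockref.c is written.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_CAS_RETRIES 100          /* hypothetical retry budget */
#define SKETCH_LOCKED      UINT64_C(1)  /* bit 0: lock held */
#define SKETCH_ONE         UINT64_C(2)  /* count lives in bits 1..63 */

struct sketch_lockref {
	_Atomic uint64_t word;
};

static void sketch_lockref_init(struct sketch_lockref *ref, uint64_t count)
{
	atomic_init(&ref->word, count * SKETCH_ONE);
}

static uint64_t sketch_lockref_count(struct sketch_lockref *ref)
{
	return atomic_load(&ref->word) / SKETCH_ONE;
}

/* Returns true if the CAS fast path won, false if the lock was taken. */
static bool sketch_lockref_get(struct sketch_lockref *ref)
{
	uint64_t old = atomic_load_explicit(&ref->word, memory_order_relaxed);
	int retries = SKETCH_CAS_RETRIES;

	/* Fast path: bounded CAS loop, no cpu_relax()/yield between tries. */
	while (!(old & SKETCH_LOCKED) && retries-- > 0) {
		if (atomic_compare_exchange_weak_explicit(&ref->word, &old,
				old + SKETCH_ONE,
				memory_order_relaxed, memory_order_relaxed))
			return true;
		/* On failure 'old' holds the current value; try again. */
	}

	/* Slow path: take the lock bit, then update the count under it. */
	while (atomic_fetch_or_explicit(&ref->word, SKETCH_LOCKED,
			memory_order_acquire) & SKETCH_LOCKED)
		;	/* test-and-set spin; the kernel would queue here */

	old = atomic_load_explicit(&ref->word, memory_order_relaxed);
	/* Storing the new count with bit 0 clear also drops the lock. */
	atomic_store_explicit(&ref->word, (old & ~SKETCH_LOCKED) + SKETCH_ONE,
			memory_order_release);
	return false;
}
```

While the lock bit is set, every fast-path CAS fails, because its expected
value has bit 0 clear. That lets the lock holder publish the new count and
release the lock with a single store.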

We will do some more work here, but would appreciate any pointers
based on your experience here.

Thanks,
JC
