Re: [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro

2022-05-27 Thread Heiko Carstens
On Thu, May 26, 2022 at 01:42:35PM +0100, Mark Rutland wrote:
> On Thu, May 26, 2022 at 10:14:59PM +1000, Michael Ellerman wrote:
> > Linus Torvalds writes:
> > > On Wed, May 25, 2022 at 7:40 AM Uros Bizjak wrote:
> > >>
> > >> Use try_cmpxchg64 instead of cmpxchg64 in the CMPXCHG_LOOP macro.
> > >> The x86 CMPXCHG instruction returns success in the ZF flag, so this
> > >> change saves a compare after the cmpxchg (and a related move instruction
> > >> in front of the cmpxchg). The main loop of lockref_get improves from:
> > >
> > > Ack on this one regardless of the 32-bit x86 question.
> > >
> > > HOWEVER.
> > >
> > > I'd like other architectures to pipe up too, because I think right now
> > > x86 is the only one that implements that "arch_try_cmpxchg()" family
> > > of operations natively, and I think the generic fallback for when it
> > > is missing might be kind of nasty.
> > >
> > > Maybe it ends up generating ok code, but it's also possible that it
> > > just didn't matter when it was only used in one place in the
> > > scheduler.
> > 
> > This patch seems to generate slightly *better* code on powerpc.
> > 
> > I see one register-to-register move that gets shifted slightly later, so
> > that it's skipped on the path that returns directly via the SUCCESS
> > case.
> 
> FWIW, I see the same on arm64; a register-to-register move gets moved out of
> the success path. That changes the register allocation, resulting in one
> fewer move, but otherwise the code generation is the same.

Just for the record: s390 code generation changes in the same way as on
powerpc, so this looks good.
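
For context, the generic fallback that non-x86 architectures currently
use is, roughly, a wrapper around cmpxchg64 that hands the freshly
observed value back to the caller on failure. A simplified kernel-style
sketch (the real fallback is a type-generic macro under
include/linux/atomic/, so the details differ):

    /* Simplified sketch of a generic try_cmpxchg64 fallback for
     * architectures without a native flag-returning cmpxchg.  On
     * failure it updates *oldp with the value actually found, so the
     * caller's retry loop does not need to re-read *ptr itself. */
    static inline bool try_cmpxchg64_sketch(u64 *ptr, u64 *oldp, u64 new)
    {
            u64 old = *oldp;
            u64 ret = cmpxchg64(ptr, old, new); /* returns value found */

            if (ret != old)
                    *oldp = ret;
            return ret == old;
    }

The compiler can usually fold the trailing comparison into the caller's
loop condition, which is consistent with the powerpc, arm64 and s390
reports in this thread that the fallback is no worse than the
open-coded cmpxchg64 pattern.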


Re: [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro

2022-05-26 Thread Linus Torvalds
On Thu, May 26, 2022 at 5:15 AM Michael Ellerman wrote:
>
> Do you know of a benchmark that shows it up? I tried a few things but
> couldn't get lockref_get() to count for more than 1-2%.

Heh. 1% for a small sequence that is only a handful of instructions
and is used in just a couple of places counts as "very hot" for me.

I assume you did the natural thing: threaded pathname lookup (with
paths being of the longer variety, so as not to be dominated by system
call etc. costs).

   Linus
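
As a concrete illustration, a minimal user-space sketch of that kind of
benchmark follows; the path, thread count and iteration count are
made-up placeholders, and the deep directory chain has to be created
beforehand. Each stat() walks every path component, so with enough
threads the per-component lockref traffic should show up in a profile.

    /* Threaded pathname-lookup microbenchmark sketch.
     * Build with: cc -O2 -pthread bench.c */
    #include <pthread.h>
    #include <stdio.h>
    #include <sys/stat.h>

    #define NTHREADS 8              /* placeholder */
    #define ITERS    1000000        /* placeholder */

    /* Hypothetical long path; mkdir -p the chain before running. */
    static const char *path = "/tmp/a/b/c/d/e/f/g/h/i/j/file";

    static void *worker(void *arg)
    {
            struct stat st;

            for (long i = 0; i < ITERS; i++)
                    stat(path, &st); /* lookup walks every component */
            return NULL;
    }

    int main(void)
    {
            pthread_t tid[NTHREADS];

            for (int i = 0; i < NTHREADS; i++)
                    pthread_create(&tid[i], NULL, worker, NULL);
            for (int i = 0; i < NTHREADS; i++)
                    pthread_join(tid[i], NULL);
            printf("done: %d threads x %d lookups\n", NTHREADS, ITERS);
            return 0;
    }

Something like "perf top" while this runs is then enough to see whether
lockref_get registers more than the 1-2% reported above.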


Re: [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro

2022-05-26 Thread Mark Rutland
On Thu, May 26, 2022 at 10:14:59PM +1000, Michael Ellerman wrote:
> Linus Torvalds writes:
> > On Wed, May 25, 2022 at 7:40 AM Uros Bizjak wrote:
> >>
> >> Use try_cmpxchg64 instead of cmpxchg64 in the CMPXCHG_LOOP macro.
> >> The x86 CMPXCHG instruction returns success in the ZF flag, so this
> >> change saves a compare after the cmpxchg (and a related move instruction
> >> in front of the cmpxchg). The main loop of lockref_get improves from:
> >
> > Ack on this one regardless of the 32-bit x86 question.
> >
> > HOWEVER.
> >
> > I'd like other architectures to pipe up too, because I think right now
> > x86 is the only one that implements that "arch_try_cmpxchg()" family
> > of operations natively, and I think the generic fallback for when it
> > is missing might be kind of nasty.
> >
> > Maybe it ends up generating ok code, but it's also possible that it
> > just didn't matter when it was only used in one place in the
> > scheduler.
> 
> This patch seems to generate slightly *better* code on powerpc.
> 
> I see one register-to-register move that gets shifted slightly later, so
> that it's skipped on the path that returns directly via the SUCCESS
> case.

FWIW, I see the same on arm64; a register-to-register move gets moved out of
the success path. That changes the register allocation, resulting in one
fewer move, but otherwise the code generation is the same.

Thanks,
Mark.
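
To make the codegen observation concrete: the patch replaces the
value-returning cmpxchg64 pattern with the flag-returning try_cmpxchg64
one. A simplified before/after sketch of the loop body in lib/lockref.c
(the retry bound and the spinlock check of the real CMPXCHG_LOOP macro
are omitted, and lockref_get's "CODE" step is inlined as new.count++):

    /* Before: cmpxchg64 returns the value it found, so the loop keeps
     * a copy of the expected value and compares afterwards. */
    old.lock_count = READ_ONCE(lockref->lock_count);
    for (;;) {
            struct lockref new = old, prev = old;
            new.count++;
            old.lock_count = cmpxchg64_relaxed(&lockref->lock_count,
                                               old.lock_count,
                                               new.lock_count);
            if (likely(old.lock_count == prev.lock_count))
                    break;                  /* SUCCESS */
    }

    /* After: try_cmpxchg64 reports success directly and updates 'old'
     * on failure, so both the 'prev' copy and the explicit compare
     * disappear. */
    old.lock_count = READ_ONCE(lockref->lock_count);
    for (;;) {
            struct lockref new = old;
            new.count++;
            if (likely(try_cmpxchg64_relaxed(&lockref->lock_count,
                                             &old.lock_count,
                                             new.lock_count)))
                    break;                  /* SUCCESS */
    }

The register-to-register move that powerpc and arm64 see leaving the
success path is presumably the copy that used to feed that explicit
compare.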


Re: [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro

2022-05-26 Thread Michael Ellerman
Linus Torvalds writes:
> On Wed, May 25, 2022 at 7:40 AM Uros Bizjak wrote:
>>
>> Use try_cmpxchg64 instead of cmpxchg64 in the CMPXCHG_LOOP macro.
>> The x86 CMPXCHG instruction returns success in the ZF flag, so this
>> change saves a compare after the cmpxchg (and a related move instruction
>> in front of the cmpxchg). The main loop of lockref_get improves from:
>
> Ack on this one regardless of the 32-bit x86 question.
>
> HOWEVER.
>
> I'd like other architectures to pipe up too, because I think right now
> x86 is the only one that implements that "arch_try_cmpxchg()" family
> of operations natively, and I think the generic fallback for when it
> is missing might be kind of nasty.
>
> Maybe it ends up generating ok code, but it's also possible that it
> just didn't matter when it was only used in one place in the
> scheduler.

This patch seems to generate slightly *better* code on powerpc.

I see one register-to-register move that gets shifted slightly later, so
that it's skipped on the path that returns directly via the SUCCESS
case.

So LGTM.

> The lockref_get() case can be quite hot under some loads, it would be
> sad if this made other architectures worse.

Do you know of a benchmark that shows it up? I tried a few things but
couldn't get lockref_get() to count for more than 1-2%.

cheers
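
For reference, the ZF-flag point in the patch description can be shown
with a standalone, hand-written sketch of an x86 try_cmpxchg (the
kernel's real implementation is the try_cmpxchg macro family in
arch/x86/include/asm/cmpxchg.h; this only illustrates the idea):

    #include <stdbool.h>
    #include <stdint.h>
    typedef uint64_t u64;   /* user-space stand-in for the kernel type */

    /* CMPXCHG leaves success/failure in ZF, so a flag-output asm
     * operand ("=@ccz") turns it directly into a bool -- no separate
     * CMP instruction is needed after the cmpxchg. */
    static inline bool x86_try_cmpxchg64_sketch(u64 *ptr, u64 *oldp, u64 new)
    {
            bool success;
            u64 old = *oldp;        /* expected value, must sit in RAX */

            asm volatile("lock cmpxchgq %[new], %[ptr]"
                         : "=@ccz" (success),   /* success = ZF */
                           [ptr] "+m" (*ptr),
                           "+a" (old)           /* RAX: expected in, actual out */
                         : [new] "r" (new)
                         : "memory");
            if (!success)
                    *oldp = old;    /* report what was actually there */
            return success;
    }

The old cmpxchg64-based caller instead had to keep a copy of the
expected value and compare it against the returned one, which is
exactly the compare (and the move feeding it) that the patch
description says go away.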