Re: [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro
On Thu, May 26, 2022 at 01:42:35PM +0100, Mark Rutland wrote: > On Thu, May 26, 2022 at 10:14:59PM +1000, Michael Ellerman wrote: > > Linus Torvalds writes: > > > On Wed, May 25, 2022 at 7:40 AM Uros Bizjak wrote: > > >> > > >> Use try_cmpxchg64 instead of cmpxchg64 in CMPXCHG_LOOP macro. > > >> x86 CMPXCHG instruction returns success in ZF flag, so this > > >> change saves a compare after cmpxchg (and related move instruction > > >> in front of cmpxchg). The main loop of lockref_get improves from: > > > > > > Ack on this one regardless of the 32-bit x86 question. > > > > > > HOWEVER. > > > > > > I'd like other architectures to pipe up too, because I think right now > > > x86 is the only one that implements that "arch_try_cmpxchg()" family > > > of operations natively, and I think the generic fallback for when it > > > is missing might be kind of nasty. > > > > > > Maybe it ends up generating ok code, but it's also possible that it > > > just didn't matter when it was only used in one place in the > > > scheduler. > > > > This patch seems to generate slightly *better* code on powerpc. > > > > I see one register-to-register move that gets shifted slightly later, so > > that it's skipped on the path that returns directly via the SUCCESS > > case. > > FWIW, I see the same on arm64; a register-to-register move gets moved out of > the success path. That changes the register allocation, and resulting in one > fewer move, but otherwise the code generation is the same. Just for the records: s390 code generation changes the same like on powerpc; so looks good.
Re: [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro
On Thu, May 26, 2022 at 5:15 AM Michael Ellerman wrote: > > Do you know of a benchmark that shows it up? I tried a few things but > couldn't get lockref_get() to count for more than 1-2%. Heh. 1% for a small instruction sequence that is only handful of instructions and is used in just a couple of places counts as "very hot" for me. I assume you did the natural thing: threaded pathname lookup (with paths being of the longer variety to not be dominated by system call etc costs). Linus
Re: [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro
On Thu, May 26, 2022 at 10:14:59PM +1000, Michael Ellerman wrote: > Linus Torvalds writes: > > On Wed, May 25, 2022 at 7:40 AM Uros Bizjak wrote: > >> > >> Use try_cmpxchg64 instead of cmpxchg64 in CMPXCHG_LOOP macro. > >> x86 CMPXCHG instruction returns success in ZF flag, so this > >> change saves a compare after cmpxchg (and related move instruction > >> in front of cmpxchg). The main loop of lockref_get improves from: > > > > Ack on this one regardless of the 32-bit x86 question. > > > > HOWEVER. > > > > I'd like other architectures to pipe up too, because I think right now > > x86 is the only one that implements that "arch_try_cmpxchg()" family > > of operations natively, and I think the generic fallback for when it > > is missing might be kind of nasty. > > > > Maybe it ends up generating ok code, but it's also possible that it > > just didn't matter when it was only used in one place in the > > scheduler. > > This patch seems to generate slightly *better* code on powerpc. > > I see one register-to-register move that gets shifted slightly later, so > that it's skipped on the path that returns directly via the SUCCESS > case. FWIW, I see the same on arm64; a register-to-register move gets moved out of the success path. That changes the register allocation, and resulting in one fewer move, but otherwise the code generation is the same. Thanks, Mark.
Re: [PATCH 1/2] locking/lockref: Use try_cmpxchg64 in CMPXCHG_LOOP macro
Linus Torvalds writes: > On Wed, May 25, 2022 at 7:40 AM Uros Bizjak wrote: >> >> Use try_cmpxchg64 instead of cmpxchg64 in CMPXCHG_LOOP macro. >> x86 CMPXCHG instruction returns success in ZF flag, so this >> change saves a compare after cmpxchg (and related move instruction >> in front of cmpxchg). The main loop of lockref_get improves from: > > Ack on this one regardless of the 32-bit x86 question. > > HOWEVER. > > I'd like other architectures to pipe up too, because I think right now > x86 is the only one that implements that "arch_try_cmpxchg()" family > of operations natively, and I think the generic fallback for when it > is missing might be kind of nasty. > > Maybe it ends up generating ok code, but it's also possible that it > just didn't matter when it was only used in one place in the > scheduler. This patch seems to generate slightly *better* code on powerpc. I see one register-to-register move that gets shifted slightly later, so that it's skipped on the path that returns directly via the SUCCESS case. So LGTM. > The lockref_get() case can be quite hot under some loads, it would be > sad if this made other architectures worse. Do you know of a benchmark that shows it up? I tried a few things but couldn't get lockref_get() to count for more than 1-2%. cheers