[PATCH v5 3/3] locking/rwsem: Optimize down_read_trylock()
Modify __down_read_trylock() to optimize for an unlocked rwsem and make
it generate slightly better code.

Before this patch, down_read_trylock:

   0x0000 <+0>:  callq  0x5
   0x0005 <+5>:  jmp    0x18
   0x0007 <+7>:  lea    0x1(%rdx),%rcx
   0x000b <+11>: mov    %rdx,%rax
   0x000e <+14>: lock cmpxchg %rcx,(%rdi)
   0x0013 <+19>: cmp    %rax,%rdx
   0x0016 <+22>: je     0x23
   0x0018 <+24>: mov    (%rdi),%rdx
   0x001b <+27>: test   %rdx,%rdx
   0x001e <+30>: jns    0x7
   0x0020 <+32>: xor    %eax,%eax
   0x0022 <+34>: retq
   0x0023 <+35>: mov    %gs:0x0,%rax
   0x002c <+44>: or     $0x3,%rax
   0x0030 <+48>: mov    %rax,0x20(%rdi)
   0x0034 <+52>: mov    $0x1,%eax
   0x0039 <+57>: retq

After patch, down_read_trylock:

   0x0000 <+0>:  callq  0x5
   0x0005 <+5>:  xor    %eax,%eax
   0x0007 <+7>:  lea    0x1(%rax),%rdx
   0x000b <+11>: lock cmpxchg %rdx,(%rdi)
   0x0010 <+16>: jne    0x29
   0x0012 <+18>: mov    %gs:0x0,%rax
   0x001b <+27>: or     $0x3,%rax
   0x001f <+31>: mov    %rax,0x20(%rdi)
   0x0023 <+35>: mov    $0x1,%eax
   0x0028 <+40>: retq
   0x0029 <+41>: test   %rax,%rax
   0x002c <+44>: jns    0x7
   0x002e <+46>: xor    %eax,%eax
   0x0030 <+48>: retq

By using a rwsem microbenchmark, the down_read_trylock() rate (with a
load of 10 to lengthen the lock critical section) on an x86-64 system
before and after the patch were:

                  Before Patch    After Patch
   # of Threads      rlock           rlock
   ------------      -----           -----
        1            14,496          14,716
        2             8,644           8,453
        4             6,799           6,983
        8             5,664           7,190

On an ARM64 system, the performance results were:

                  Before Patch    After Patch
   # of Threads      rlock           rlock
   ------------      -----           -----
        1            23,676          24,488
        2             7,697           9,502
        4             4,945           3,440
        8             2,641           1,603

For the uncontended case (1 thread), the new down_read_trylock() is a
little bit faster. For the contended cases, the new down_read_trylock()
performs pretty well on x86-64, but performance degrades at high
contention levels on ARM64.

Suggested-by: Linus Torvalds
Signed-off-by: Waiman Long
---
 kernel/locking/rwsem.h | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h
index 45ee00236e03..1f5775aa6a1d 100644
--- a/kernel/locking/rwsem.h
+++ b/kernel/locking/rwsem.h
@@ -174,14 +174,17 @@ static inline int __down_read_killable(struct rw_semaphore *sem)
 
 static inline int __down_read_trylock(struct rw_semaphore *sem)
 {
-	long tmp;
+	/*
+	 * Optimize for the case when the rwsem is not locked at all.
+	 */
+	long tmp = RWSEM_UNLOCKED_VALUE;
 
-	while ((tmp = atomic_long_read(&sem->count)) >= 0) {
-		if (tmp == atomic_long_cmpxchg_acquire(&sem->count, tmp,
-				tmp + RWSEM_ACTIVE_READ_BIAS)) {
+	do {
+		if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
+					tmp + RWSEM_ACTIVE_READ_BIAS)) {
 			return 1;
 		}
-	}
+	} while (tmp >= 0);
 
 	return 0;
 }
-- 
2.18.1
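The improvement comes from the try_cmpxchg semantics: atomic_long_cmpxchg_acquire()
returns the old value, so the caller needs an up-front atomic_long_read() plus a
separate compare after the exchange, whereas atomic_long_try_cmpxchg_acquire()
returns a success flag and, on failure, writes the value it observed back into
its "expected" argument, letting the compiler reuse the flags set by the locked
instruction. A minimal user-space sketch of the two loop shapes, using C11
atomics as a stand-in for the kernel helpers (the names rwsem_count,
trylock_old and trylock_new plus the +1 read bias are illustrative, and the
default seq_cst ordering stands in for the kernel's acquire ordering):

	/* Sketch only -- not the kernel code. */
	#include <stdatomic.h>
	#include <stdbool.h>

	static atomic_long rwsem_count;		/* stand-in for sem->count */

	/* Old shape: read, cmpxchg, then compare the returned value. */
	bool trylock_old(void)
	{
		long tmp;

		while ((tmp = atomic_load(&rwsem_count)) >= 0) {
			long old = tmp;

			if (atomic_compare_exchange_strong(&rwsem_count,
							   &old, tmp + 1))
				return true;
			/* failure: loop back and reload from memory */
		}
		return false;
	}

	/* New shape: optimistically assume the unlocked value (0) and go
	 * straight to the exchange; on failure, tmp already holds the
	 * value that was found, so no separate reload or compare is
	 * needed before retrying or giving up. */
	bool trylock_new(void)
	{
		long tmp = 0;	/* RWSEM_UNLOCKED_VALUE */

		do {
			if (atomic_compare_exchange_strong(&rwsem_count,
							   &tmp, tmp + 1))
				return true;
		} while (tmp >= 0);
		return false;
	}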
Re: [PATCH v5 3/3] locking/rwsem: Optimize down_read_trylock()
On Fri, Mar 22, 2019 at 7:30 AM Waiman Long wrote:
>
> Modify __down_read_trylock() to optimize for an unlocked rwsem and make
> it generate slightly better code.

Oh, that should teach me to read all patches in the series before
starting to comment on them.

So ignore my comment on #1.

               Linus
Re: [PATCH v5 3/3] locking/rwsem: Optimize down_read_trylock()
On Fri, Mar 22, 2019 at 10:30:08AM -0400, Waiman Long wrote:
> Modify __down_read_trylock() to optimize for an unlocked rwsem and make
> it generate slightly better code.
>
> Before this patch, down_read_trylock:
>
>    0x0000 <+0>:  callq  0x5
>    0x0005 <+5>:  jmp    0x18
>    0x0007 <+7>:  lea    0x1(%rdx),%rcx
>    0x000b <+11>: mov    %rdx,%rax
>    0x000e <+14>: lock cmpxchg %rcx,(%rdi)
>    0x0013 <+19>: cmp    %rax,%rdx
>    0x0016 <+22>: je     0x23
>    0x0018 <+24>: mov    (%rdi),%rdx
>    0x001b <+27>: test   %rdx,%rdx
>    0x001e <+30>: jns    0x7
>    0x0020 <+32>: xor    %eax,%eax
>    0x0022 <+34>: retq
>    0x0023 <+35>: mov    %gs:0x0,%rax
>    0x002c <+44>: or     $0x3,%rax
>    0x0030 <+48>: mov    %rax,0x20(%rdi)
>    0x0034 <+52>: mov    $0x1,%eax
>    0x0039 <+57>: retq
>
> After patch, down_read_trylock:
>
>    0x0000 <+0>:  callq  0x5
>    0x0005 <+5>:  xor    %eax,%eax
>    0x0007 <+7>:  lea    0x1(%rax),%rdx
>    0x000b <+11>: lock cmpxchg %rdx,(%rdi)
>    0x0010 <+16>: jne    0x29
>    0x0012 <+18>: mov    %gs:0x0,%rax
>    0x001b <+27>: or     $0x3,%rax
>    0x001f <+31>: mov    %rax,0x20(%rdi)
>    0x0023 <+35>: mov    $0x1,%eax
>    0x0028 <+40>: retq
>    0x0029 <+41>: test   %rax,%rax
>    0x002c <+44>: jns    0x7
>    0x002e <+46>: xor    %eax,%eax
>    0x0030 <+48>: retq
>
> By using a rwsem microbenchmark, the down_read_trylock() rate (with a
> load of 10 to lengthen the lock critical section) on an x86-64 system
> before and after the patch were:
>
>                   Before Patch    After Patch
>    # of Threads      rlock           rlock
>    ------------      -----           -----
>         1            14,496          14,716
>         2             8,644           8,453
>         4             6,799           6,983
>         8             5,664           7,190
>
> On an ARM64 system, the performance results were:
>
>                   Before Patch    After Patch
>    # of Threads      rlock           rlock
>    ------------      -----           -----
>         1            23,676          24,488
>         2             7,697           9,502
>         4             4,945           3,440
>         8             2,641           1,603
>
> For the uncontended case (1 thread), the new down_read_trylock() is a
> little bit faster. For the contended cases, the new down_read_trylock()
> performs pretty well on x86-64, but performance degrades at high
> contention levels on ARM64.

So, 70% for 4 threads, 61% for 8 threads - does this trend
continue tailing off as the number of threads (and cores)
increase?

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up
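(Those percentages follow from the ARM64 table above: 3,440 / 4,945 is
roughly 70% of the pre-patch rate at 4 threads, and 1,603 / 2,641 is
roughly 61% at 8 threads.)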
Re: [PATCH v5 3/3] locking/rwsem: Optimize down_read_trylock()
On 03/22/2019 01:25 PM, Russell King - ARM Linux admin wrote:
> On Fri, Mar 22, 2019 at 10:30:08AM -0400, Waiman Long wrote:
>> Modify __down_read_trylock() to optimize for an unlocked rwsem and make
>> it generate slightly better code.
>>
>> Before this patch, down_read_trylock:
>>
>>    0x0000 <+0>:  callq  0x5
>>    0x0005 <+5>:  jmp    0x18
>>    0x0007 <+7>:  lea    0x1(%rdx),%rcx
>>    0x000b <+11>: mov    %rdx,%rax
>>    0x000e <+14>: lock cmpxchg %rcx,(%rdi)
>>    0x0013 <+19>: cmp    %rax,%rdx
>>    0x0016 <+22>: je     0x23
>>    0x0018 <+24>: mov    (%rdi),%rdx
>>    0x001b <+27>: test   %rdx,%rdx
>>    0x001e <+30>: jns    0x7
>>    0x0020 <+32>: xor    %eax,%eax
>>    0x0022 <+34>: retq
>>    0x0023 <+35>: mov    %gs:0x0,%rax
>>    0x002c <+44>: or     $0x3,%rax
>>    0x0030 <+48>: mov    %rax,0x20(%rdi)
>>    0x0034 <+52>: mov    $0x1,%eax
>>    0x0039 <+57>: retq
>>
>> After patch, down_read_trylock:
>>
>>    0x0000 <+0>:  callq  0x5
>>    0x0005 <+5>:  xor    %eax,%eax
>>    0x0007 <+7>:  lea    0x1(%rax),%rdx
>>    0x000b <+11>: lock cmpxchg %rdx,(%rdi)
>>    0x0010 <+16>: jne    0x29
>>    0x0012 <+18>: mov    %gs:0x0,%rax
>>    0x001b <+27>: or     $0x3,%rax
>>    0x001f <+31>: mov    %rax,0x20(%rdi)
>>    0x0023 <+35>: mov    $0x1,%eax
>>    0x0028 <+40>: retq
>>    0x0029 <+41>: test   %rax,%rax
>>    0x002c <+44>: jns    0x7
>>    0x002e <+46>: xor    %eax,%eax
>>    0x0030 <+48>: retq
>>
>> By using a rwsem microbenchmark, the down_read_trylock() rate (with a
>> load of 10 to lengthen the lock critical section) on an x86-64 system
>> before and after the patch were:
>>
>>                   Before Patch    After Patch
>>    # of Threads      rlock           rlock
>>    ------------      -----           -----
>>         1            14,496          14,716
>>         2             8,644           8,453
>>         4             6,799           6,983
>>         8             5,664           7,190
>>
>> On an ARM64 system, the performance results were:
>>
>>                   Before Patch    After Patch
>>    # of Threads      rlock           rlock
>>    ------------      -----           -----
>>         1            23,676          24,488
>>         2             7,697           9,502
>>         4             4,945           3,440
>>         8             2,641           1,603
>>
>> For the uncontended case (1 thread), the new down_read_trylock() is a
>> little bit faster. For the contended cases, the new down_read_trylock()
>> performs pretty well on x86-64, but performance degrades at high
>> contention levels on ARM64.
> So, 70% for 4 threads, 61% for 8 threads - does this trend
> continue tailing off as the number of threads (and cores)
> increase?
>
I didn't try a higher number of contending threads. I won't worry too
much about contention as trylock is a one-off event. The chance of
having more than one trylock happening simultaneously is very small.

Cheers,
Longman
Re: [PATCH v5 3/3] locking/rwsem: Optimize down_read_trylock()
Hi,

Could you share the microbenchmark you are using? I'd like to test the
series on powerpc.

Thanks
Christophe

On 22/03/2019 at 15:30, Waiman Long wrote:
> Modify __down_read_trylock() to optimize for an unlocked rwsem and make
> it generate slightly better code.
>
> Before this patch, down_read_trylock:
>
>    0x0000 <+0>:  callq  0x5
>    0x0005 <+5>:  jmp    0x18
>    0x0007 <+7>:  lea    0x1(%rdx),%rcx
>    0x000b <+11>: mov    %rdx,%rax
>    0x000e <+14>: lock cmpxchg %rcx,(%rdi)
>    0x0013 <+19>: cmp    %rax,%rdx
>    0x0016 <+22>: je     0x23
>    0x0018 <+24>: mov    (%rdi),%rdx
>    0x001b <+27>: test   %rdx,%rdx
>    0x001e <+30>: jns    0x7
>    0x0020 <+32>: xor    %eax,%eax
>    0x0022 <+34>: retq
>    0x0023 <+35>: mov    %gs:0x0,%rax
>    0x002c <+44>: or     $0x3,%rax
>    0x0030 <+48>: mov    %rax,0x20(%rdi)
>    0x0034 <+52>: mov    $0x1,%eax
>    0x0039 <+57>: retq
>
> After patch, down_read_trylock:
>
>    0x0000 <+0>:  callq  0x5
>    0x0005 <+5>:  xor    %eax,%eax
>    0x0007 <+7>:  lea    0x1(%rax),%rdx
>    0x000b <+11>: lock cmpxchg %rdx,(%rdi)
>    0x0010 <+16>: jne    0x29
>    0x0012 <+18>: mov    %gs:0x0,%rax
>    0x001b <+27>: or     $0x3,%rax
>    0x001f <+31>: mov    %rax,0x20(%rdi)
>    0x0023 <+35>: mov    $0x1,%eax
>    0x0028 <+40>: retq
>    0x0029 <+41>: test   %rax,%rax
>    0x002c <+44>: jns    0x7
>    0x002e <+46>: xor    %eax,%eax
>    0x0030 <+48>: retq
>
> By using a rwsem microbenchmark, the down_read_trylock() rate (with a
> load of 10 to lengthen the lock critical section) on an x86-64 system
> before and after the patch were:
>
>                   Before Patch    After Patch
>    # of Threads      rlock           rlock
>    ------------      -----           -----
>         1            14,496          14,716
>         2             8,644           8,453
>         4             6,799           6,983
>         8             5,664           7,190
>
> On an ARM64 system, the performance results were:
>
>                   Before Patch    After Patch
>    # of Threads      rlock           rlock
>    ------------      -----           -----
>         1            23,676          24,488
>         2             7,697           9,502
>         4             4,945           3,440
>         8             2,641           1,603
>
> For the uncontended case (1 thread), the new down_read_trylock() is a
> little bit faster. For the contended cases, the new down_read_trylock()
> performs pretty well on x86-64, but performance degrades at high
> contention levels on ARM64.
>
> Suggested-by: Linus Torvalds
> Signed-off-by: Waiman Long
> ---
>  kernel/locking/rwsem.h | 13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h
> index 45ee00236e03..1f5775aa6a1d 100644
> --- a/kernel/locking/rwsem.h
> +++ b/kernel/locking/rwsem.h
> @@ -174,14 +174,17 @@ static inline int __down_read_killable(struct rw_semaphore *sem)
>  
>  static inline int __down_read_trylock(struct rw_semaphore *sem)
>  {
> -	long tmp;
> +	/*
> +	 * Optimize for the case when the rwsem is not locked at all.
> +	 */
> +	long tmp = RWSEM_UNLOCKED_VALUE;
>  
> -	while ((tmp = atomic_long_read(&sem->count)) >= 0) {
> -		if (tmp == atomic_long_cmpxchg_acquire(&sem->count, tmp,
> -				tmp + RWSEM_ACTIVE_READ_BIAS)) {
> +	do {
> +		if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
> +					tmp + RWSEM_ACTIVE_READ_BIAS)) {
>  			return 1;
>  		}
> -	}
> +	} while (tmp >= 0);
>  
>  	return 0;
>  }
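Waiman's in-kernel microbenchmark is not attached to the thread, but the
methodology described in the commit message (N threads hammering read-trylock,
with a small load loop inside the critical section) can be approximated in
user space. Here is a rough sketch with POSIX rwlocks -- it exercises
pthread_rwlock_tryrdlock() rather than the kernel's down_read_trylock(), so
it only illustrates the setup and its absolute numbers are not comparable;
LOAD, RUN_SECONDS and worker() are invented for this sketch:

	/*
	 * Rough user-space approximation -- NOT the actual kernel-side
	 * microbenchmark. N threads spin on read-trylock; each success
	 * holds the lock for a fixed "load" of iterations, mirroring
	 * the "load of 10" in the commit message.
	 *
	 * Build: gcc -O2 -pthread rwsem_bench.c -o rwsem_bench
	 * Run:   ./rwsem_bench <nthreads>
	 */
	#include <pthread.h>
	#include <stdatomic.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	#define LOAD		10	/* critical-section lengthener */
	#define RUN_SECONDS	10

	static pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;
	static atomic_bool stop;
	static atomic_long total_acquired;

	static void *worker(void *arg)
	{
		long acquired = 0;
		volatile int sink = 0;

		while (!atomic_load_explicit(&stop, memory_order_relaxed)) {
			if (pthread_rwlock_tryrdlock(&lock) == 0) {
				for (int i = 0; i < LOAD; i++)
					sink += i;	/* the "load" */
				pthread_rwlock_unlock(&lock);
				acquired++;
			}
		}
		atomic_fetch_add(&total_acquired, acquired);
		return NULL;
	}

	int main(int argc, char **argv)
	{
		int nthreads = argc > 1 ? atoi(argv[1]) : 1;
		pthread_t threads[nthreads];

		for (int i = 0; i < nthreads; i++)
			pthread_create(&threads[i], NULL, worker, NULL);

		sleep(RUN_SECONDS);
		atomic_store(&stop, true);

		for (int i = 0; i < nthreads; i++)
			pthread_join(threads[i], NULL);

		printf("%d threads: %ld trylocks/sec (total)\n", nthreads,
		       atomic_load(&total_acquired) / RUN_SECONDS);
		return 0;
	}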