[PATCH v5 3/3] locking/rwsem: Optimize down_read_trylock()

2019-03-22 Thread Waiman Long
Modify __down_read_trylock() to optimize for an unlocked rwsem and make
it generate slightly better code.

Before this patch, down_read_trylock:

   0x0000 <+0>:     callq  0x5
   0x0005 <+5>:     jmp    0x18
   0x0007 <+7>:     lea    0x1(%rdx),%rcx
   0x000b <+11>:    mov    %rdx,%rax
   0x000e <+14>:    lock cmpxchg %rcx,(%rdi)
   0x0013 <+19>:    cmp    %rax,%rdx
   0x0016 <+22>:    je     0x23
   0x0018 <+24>:    mov    (%rdi),%rdx
   0x001b <+27>:    test   %rdx,%rdx
   0x001e <+30>:    jns    0x7
   0x0020 <+32>:    xor    %eax,%eax
   0x0022 <+34>:    retq
   0x0023 <+35>:    mov    %gs:0x0,%rax
   0x002c <+44>:    or     $0x3,%rax
   0x0030 <+48>:    mov    %rax,0x20(%rdi)
   0x0034 <+52>:    mov    $0x1,%eax
   0x0039 <+57>:    retq

After patch, down_read_trylock:

   0x0000 <+0>:     callq  0x5
   0x0005 <+5>:     xor    %eax,%eax
   0x0007 <+7>:     lea    0x1(%rax),%rdx
   0x000b <+11>:    lock cmpxchg %rdx,(%rdi)
   0x0010 <+16>:    jne    0x29
   0x0012 <+18>:    mov    %gs:0x0,%rax
   0x001b <+27>:    or     $0x3,%rax
   0x001f <+31>:    mov    %rax,0x20(%rdi)
   0x0023 <+35>:    mov    $0x1,%eax
   0x0028 <+40>:    retq
   0x0029 <+41>:    test   %rax,%rax
   0x002c <+44>:    jns    0x7
   0x002e <+46>:    xor    %eax,%eax
   0x0030 <+48>:    retq

Using a rwsem microbenchmark, the down_read_trylock() rates (with a
load of 10 to lengthen the lock critical section) on an x86-64 system
before and after the patch were:

                Before Patch    After Patch
  # of Threads      rlock           rlock
  ------------      ------          ------
        1           14,496          14,716
        2            8,644           8,453
        4            6,799           6,983
        8            5,664           7,190

On an ARM64 system, the performance results were:

                Before Patch    After Patch
  # of Threads      rlock           rlock
  ------------      ------          ------
        1           23,676          24,488
        2            7,697           9,502
        4            4,945           3,440
        8            2,641           1,603

For the uncontended case (1 thread), the new down_read_trylock() is a
little bit faster. For the contended cases, the new down_read_trylock()
performs pretty well on x86-64, but performance degrades at high
contention levels on ARM64.
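
The codegen difference can be reproduced outside the kernel. The
standalone sketch below is illustrative only: GCC's __atomic builtins
stand in for the kernel's atomic_long_* API, a bare long stands in for
sem->count, and +1 stands in for RWSEM_ACTIVE_READ_BIAS. It contrasts
the two loop shapes:

   #include <stdbool.h>

   /* Old shape: load first, cmpxchg second; even the uncontended path
    * begins with a plain read and a conditional jump. */
   static int trylock_old(long *count)
   {
           long tmp;

           while ((tmp = __atomic_load_n(count, __ATOMIC_RELAXED)) >= 0) {
                   if (__atomic_compare_exchange_n(count, &tmp, tmp + 1,
                                                   false, __ATOMIC_ACQUIRE,
                                                   __ATOMIC_RELAXED))
                           return 1;
           }
           return 0;
   }

   /* New shape: optimistically assume the rwsem is unlocked (count == 0)
    * and go straight to the cmpxchg; a failed compare-exchange writes the
    * observed value back into tmp, so the retry loop needs no reload. */
   static int trylock_new(long *count)
   {
           long tmp = 0;   /* RWSEM_UNLOCKED_VALUE */

           do {
                   if (__atomic_compare_exchange_n(count, &tmp, tmp + 1,
                                                   false, __ATOMIC_ACQUIRE,
                                                   __ATOMIC_RELAXED))
                           return 1;
           } while (tmp >= 0);
           return 0;
   }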

Suggested-by: Linus Torvalds 
Signed-off-by: Waiman Long 
---
 kernel/locking/rwsem.h | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h
index 45ee00236e03..1f5775aa6a1d 100644
--- a/kernel/locking/rwsem.h
+++ b/kernel/locking/rwsem.h
@@ -174,14 +174,17 @@ static inline int __down_read_killable(struct rw_semaphore *sem)
 
 static inline int __down_read_trylock(struct rw_semaphore *sem)
 {
-	long tmp;
+	/*
+	 * Optimize for the case when the rwsem is not locked at all.
+	 */
+	long tmp = RWSEM_UNLOCKED_VALUE;
 
-	while ((tmp = atomic_long_read(&sem->count)) >= 0) {
-		if (tmp == atomic_long_cmpxchg_acquire(&sem->count, tmp,
-				       tmp + RWSEM_ACTIVE_READ_BIAS)) {
+	do {
+		if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
+					tmp + RWSEM_ACTIVE_READ_BIAS)) {
 			return 1;
 		}
-	}
+	} while (tmp >= 0);
 	return 0;
 }
 
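
For reference, the win comes from the semantics of
atomic_long_try_cmpxchg_acquire(): it returns a boolean and, on failure,
stores the value it actually observed back through its second argument,
so the loop condition always tests a fresh copy of sem->count without an
extra atomic_long_read(). A plain-C model of those semantics (an
illustrative sketch, not the kernel implementation):

   /* Model of try_cmpxchg: a cmpxchg that refreshes the caller's
    * expected value on failure. Illustrative only. */
   static inline bool try_cmpxchg_acquire_model(long *ptr, long *old, long new)
   {
           long seen = cmpxchg_acquire(ptr, *old, new); /* returns prior value */

           if (seen == *old)
                   return true;    /* swap happened */
           *old = seen;            /* swap failed: refresh expected value */
           return false;
   }

On x86 the boolean result maps directly onto the ZF flag set by the
cmpxchg instruction, which is why the cmp/je pair in the old disassembly
collapses into the single jne in the new one.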
 
-- 
2.18.1



Re: [PATCH v5 3/3] locking/rwsem: Optimize down_read_trylock()

2019-03-22 Thread Linus Torvalds
On Fri, Mar 22, 2019 at 7:30 AM Waiman Long  wrote:
>
> Modify __down_read_trylock() to optimize for an unlocked rwsem and make
> it generate slightly better code.

Oh, that should teach me to read all patches in the series before
starting to comment on them.

So ignore my comment on #1.

Linus


Re: [PATCH v5 3/3] locking/rwsem: Optimize down_read_trylock()

2019-03-22 Thread Russell King - ARM Linux admin
On Fri, Mar 22, 2019 at 10:30:08AM -0400, Waiman Long wrote:
> Modify __down_read_trylock() to optimize for an unlocked rwsem and make
> it generate slightly better code.
> 
> [ ... assembly listings and x86-64 results trimmed ... ]
> 
> On an ARM64 system, the performance results were:
> 
>                 Before Patch    After Patch
>   # of Threads      rlock           rlock
>   ------------      ------          ------
>         1           23,676          24,488
>         2            7,697           9,502
>         4            4,945           3,440
>         8            2,641           1,603
> 
> For the uncontended case (1 thread), the new down_read_trylock() is a
> little bit faster. For the contended cases, the new down_read_trylock()
> performs pretty well on x86-64, but performance degrades at high
> contention levels on ARM64.

So, 70% for 4 threads, 61% for 8 threads - does this trend
continue tailing off as the number of threads (and cores)
increases?

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up


Re: [PATCH v5 3/3] locking/rwsem: Optimize down_read_trylock()

2019-03-22 Thread Waiman Long
On 03/22/2019 01:25 PM, Russell King - ARM Linux admin wrote:
> On Fri, Mar 22, 2019 at 10:30:08AM -0400, Waiman Long wrote:
>> Modify __down_read_trylock() to optimize for an unlocked rwsem and make
>> it generate slightly better code.
>>
>> [ ... assembly listings and benchmark results trimmed ... ]
>>
>> For the uncontended case (1 thread), the new down_read_trylock() is a
>> little bit faster. For the contended cases, the new down_read_trylock()
>> performs pretty well on x86-64, but performance degrades at high
>> contention levels on ARM64.
> So, 70% for 4 threads, 61% for 8 threads - does this trend
> continue tailing off as the number of threads (and cores)
> increases?
>
I didn't try a higher number of contending threads. I wouldn't worry too
much about contention, as trylock is a one-off event. The chance of having
more than one trylock happening simultaneously is very small.

Cheers,
Longman



Re: [PATCH v5 3/3] locking/rwsem: Optimize down_read_trylock()

2019-03-25 Thread Christophe Leroy

Hi,

Could you share the microbenchmark you are using?

I'd like to test the series on powerpc.

Thanks
Christophe

On 22/03/2019 15:30, Waiman Long wrote:

> Modify __down_read_trylock() to optimize for an unlocked rwsem and make
> it generate slightly better code.
>
> [ ... full patch trimmed ... ]