Re: rw locks vs memory barriers and adaptive spinning
On Wed, Oct 11, 2017 at 03:15:42PM +0200, Martin Pieuchot wrote:
> On 10/10/17(Tue) 20:19, Mateusz Guzik wrote:
> > On Tue, Oct 10, 2017 at 10:15:48AM +0200, Martin Pieuchot wrote:
> > > Hello Mateusz,
> > >
> > > On 09/10/17(Mon) 21:43, Mateusz Guzik wrote:
> > > > I was looking at rw lock code out of curiosity and noticed you always do
> > > > membar_enter which on an MP-enabled amd64 kernel translates to mfence.
> > > > This makes the entire business a little bit slower.
> > > >
> > > > Interestingly you already have the relevant macros for amd64:
> > > > #define membar_enter_after_atomic() __membar("")
> > > > #define membar_exit_before_atomic() __membar("")
> > >
> > > If you look at the history, you'll see that these macros didn't exist
> > > when membar_* were added to rw locks.
> >
> > I know. The addition of such a macro sounds like an excellent opportunity
> > to examine existing users. I figured rw locks were omitted by accident,
> > as the original posting explicitly mentions mutexes.
> >
> > https://marc.info/?l=openbsd-tech&m=149555411827749&w=2
> >
> > > > And there is even a default variant for archs which don't provide one.
> > > > I guess the switch should be easy.
> > >
> > > Then please do it :) A lot of stuff is easy on OpenBSD but we are not
> > > enough.
> > >
> > > Do it, test it, explain it, get oks and commit it 8)
> >
> > As noted earlier the patch is trivial. I have no easy means to even
> > compile-test it though as I'm not using OpenBSD. I only had a look out
> > of curiosity.
>
> Why don't you start using it? Testing is what makes the difference.
> Many of us have ideas of how to improve the kernel but we're lacking
> the time to test & debug them all.

I'm playing with concurrency as a hobby and smpifying single-threaded
code is above the pain threshold I'm willing to accept in my spare time.

> Sometimes a "correct" change exposes a bug and we try not to break
> -current, so we cannot afford to commit a non-tested diff.

I definitely don't advocate for just committing stuff. I do state though
that there should be no binary change for archs other than i386 and
amd64 (== nothing to test there). Given the nature of the change for
these two, it really seemed it just fell through the cracks earlier, and
I just wanted to point it out as perhaps someone will find the cycles to
pick it up and test it on the relevant platforms.

> > Nonetheless, here is the diff which can be split in 2. One part deals
> > with barriers, another one cleans up rw_status.
> >
> > [...]
> >
> > Read from the lock only once in rw_status. The lock word is marked as
> > volatile which forces a re-read on each access. There is no correctness
> > issue here, but the assembly is worse than it needs to be and in case
> > of contention this adds avoidable cache traffic.
>
> Here's the diff that cleans up rw_status. The difference on amd64 with
> clang 4.0 is the following:
>
>    48 8b 0f                mov    (%rdi),%rcx
>    f6 c1 04                test   $0x4,%cl
>    75 0c                   jne    738
>    31 c0                   xor    %eax,%eax
>  - 48 83 3f 00             cmpq   $0x0,(%rdi)
>  + 48 85 c9                test   %rcx,%rcx
>
> ok?

There should be another change past rrw_status+0x18:

812557e8: 65 48 8b 04 25 08 00    mov    %gs:0x8,%rax
812557ef: 00 00
812557f1: 48 8b 80 90 02 00 00    mov    0x290(%rax),%rax
812557f8: 48 33 07                xor    (%rdi),%rax

> Index: kern/kern_rwlock.c
> ===================================================================
> RCS file: /cvs/src/sys/kern/kern_rwlock.c,v
> retrieving revision 1.30
> diff -u -p -r1.30 kern_rwlock.c
> --- kern/kern_rwlock.c	12 Aug 2017 23:27:44 -0000	1.30
> +++ kern/kern_rwlock.c	11 Oct 2017 12:59:19 -0000
> @@ -312,13 +312,15 @@ _rw_exit(struct rwlock *rwl LOCK_FL_VARS
>  int
>  rw_status(struct rwlock *rwl)
>  {
> -	if (rwl->rwl_owner & RWLOCK_WRLOCK) {
> -		if (RW_PROC(curproc) == RW_PROC(rwl->rwl_owner))
> +	unsigned long owner = rwl->rwl_owner;
> +
> +	if (owner & RWLOCK_WRLOCK) {
> +		if (RW_PROC(curproc) == RW_PROC(owner))
>  			return RW_WRITE;
>  		else
>  			return RW_WRITE_OTHER;
>  	}
> -	if (rwl->rwl_owner)
> +	if (owner)
>  		return RW_READ;
>  	return (0);
>  }

-- 
Mateusz Guzik
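As a minimal userland illustration of the rw_status change above (the struct, constants, and function name here are simplified stand-ins, not the kernel's): copying the volatile lock word into a local forces a single load, after which the compiler can test a cached register instead of re-reading the lock word from memory for each check.

```c
#include <assert.h>

/*
 * Sketch only: simplified stand-ins for the kernel's rwlock types.
 * The single read of the volatile rwl_owner into a local is the point;
 * every later test operates on the cached value in a register.
 */
struct rwlock { volatile unsigned long rwl_owner; };

#define RWLOCK_WRLOCK	0x04UL
#define RW_READ		1
#define RW_WRITE	2

static int
rw_status_sketch(struct rwlock *rwl)
{
	unsigned long owner = rwl->rwl_owner;	/* one load from memory */

	if (owner & RWLOCK_WRLOCK)
		return RW_WRITE;
	if (owner)
		return RW_READ;
	return 0;
}
```

Reading the word once is also what makes the generated code above shrink: the second `cmpq $0x0,(%rdi)` becomes a `test %rcx,%rcx` on the value already loaded.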
Re: rw locks vs memory barriers and adaptive spinning
On 10/10/17(Tue) 20:19, Mateusz Guzik wrote:
> On Tue, Oct 10, 2017 at 10:15:48AM +0200, Martin Pieuchot wrote:
> > Hello Mateusz,
> >
> > On 09/10/17(Mon) 21:43, Mateusz Guzik wrote:
> > > I was looking at rw lock code out of curiosity and noticed you always do
> > > membar_enter which on an MP-enabled amd64 kernel translates to mfence.
> > > This makes the entire business a little bit slower.
> > >
> > > Interestingly you already have the relevant macros for amd64:
> > > #define membar_enter_after_atomic() __membar("")
> > > #define membar_exit_before_atomic() __membar("")
> >
> > If you look at the history, you'll see that these macros didn't exist
> > when membar_* were added to rw locks.
>
> I know. The addition of such a macro sounds like an excellent opportunity
> to examine existing users. I figured rw locks were omitted by accident,
> as the original posting explicitly mentions mutexes.
>
> https://marc.info/?l=openbsd-tech&m=149555411827749&w=2
>
> > > And there is even a default variant for archs which don't provide one.
> > > I guess the switch should be easy.
> >
> > Then please do it :) A lot of stuff is easy on OpenBSD but we are not
> > enough.
> >
> > Do it, test it, explain it, get oks and commit it 8)
>
> As noted earlier the patch is trivial. I have no easy means to even
> compile-test it though as I'm not using OpenBSD. I only had a look out
> of curiosity.

Why don't you start using it? Testing is what makes the difference.
Many of us have ideas of how to improve the kernel but we're lacking
the time to test & debug them all.

Sometimes a "correct" change exposes a bug and we try not to break
-current, so we cannot afford to commit a non-tested diff.

> Nonetheless, here is the diff which can be split in 2. One part deals
> with barriers, another one cleans up rw_status.
>
> [...]
>
> Read from the lock only once in rw_status. The lock word is marked as
> volatile which forces a re-read on each access. There is no correctness
> issue here, but the assembly is worse than it needs to be and in case
> of contention this adds avoidable cache traffic.

Here's the diff that cleans up rw_status. The difference on amd64 with
clang 4.0 is the following:

   48 8b 0f                mov    (%rdi),%rcx
   f6 c1 04                test   $0x4,%cl
   75 0c                   jne    738
   31 c0                   xor    %eax,%eax
 - 48 83 3f 00             cmpq   $0x0,(%rdi)
 + 48 85 c9                test   %rcx,%rcx

ok?

Index: kern/kern_rwlock.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_rwlock.c,v
retrieving revision 1.30
diff -u -p -r1.30 kern_rwlock.c
--- kern/kern_rwlock.c	12 Aug 2017 23:27:44 -0000	1.30
+++ kern/kern_rwlock.c	11 Oct 2017 12:59:19 -0000
@@ -312,13 +312,15 @@ _rw_exit(struct rwlock *rwl LOCK_FL_VARS
 int
 rw_status(struct rwlock *rwl)
 {
-	if (rwl->rwl_owner & RWLOCK_WRLOCK) {
-		if (RW_PROC(curproc) == RW_PROC(rwl->rwl_owner))
+	unsigned long owner = rwl->rwl_owner;
+
+	if (owner & RWLOCK_WRLOCK) {
+		if (RW_PROC(curproc) == RW_PROC(owner))
 			return RW_WRITE;
 		else
 			return RW_WRITE_OTHER;
 	}
-	if (rwl->rwl_owner)
+	if (owner)
 		return RW_READ;
 	return (0);
 }
Re: rw locks vs memory barriers and adaptive spinning
On Tue, Oct 10, 2017 at 10:15:48AM +0200, Martin Pieuchot wrote:
> Hello Mateusz,
>
> On 09/10/17(Mon) 21:43, Mateusz Guzik wrote:
> > I was looking at rw lock code out of curiosity and noticed you always do
> > membar_enter which on an MP-enabled amd64 kernel translates to mfence.
> > This makes the entire business a little bit slower.
> >
> > Interestingly you already have the relevant macros for amd64:
> > #define membar_enter_after_atomic() __membar("")
> > #define membar_exit_before_atomic() __membar("")
>
> If you look at the history, you'll see that these macros didn't exist
> when membar_* were added to rw locks.

I know. The addition of such a macro sounds like an excellent opportunity
to examine existing users. I figured rw locks were omitted by accident,
as the original posting explicitly mentions mutexes.

https://marc.info/?l=openbsd-tech&m=149555411827749&w=2

> > And there is even a default variant for archs which don't provide one.
> > I guess the switch should be easy.
>
> Then please do it :) A lot of stuff is easy on OpenBSD but we are not
> enough.
>
> Do it, test it, explain it, get oks and commit it 8)

As noted earlier the patch is trivial. I have no easy means to even
compile-test it though as I'm not using OpenBSD. I only had a look out
of curiosity.

Nonetheless, here is the diff, which can be split in 2. One part deals
with barriers, another one cleans up rw_status.

This should not result in a binary change on any arch apart from i386
and amd64. On these 2, if there is breakage, it's a braino of some sort;
the change in principle is definitely fine. So:

Utilize membar_*{before,after}_atomic in rw locks. On architectures
whose atomic operations provide the relevant barriers there is no need
for an additional membar. In particular this is the case on amd64 and
i386.

Read from the lock only once in rw_status. The lock word is marked as
volatile which forces a re-read on each access. There is no correctness
issue here, but the assembly is worse than it needs to be and in case
of contention this adds avoidable cache traffic.

diff --git a/sys/kern/kern_rwlock.c b/sys/kern/kern_rwlock.c
index e2c3cae..5551835 100644
--- a/sys/kern/kern_rwlock.c
+++ b/sys/kern/kern_rwlock.c
@@ -96,7 +96,7 @@ _rw_enter_read(struct rwlock *rwl LOCK_FL_VARS)
 	    rw_cas(&rwl->rwl_owner, owner, owner + RWLOCK_READ_INCR)))
 		_rw_enter(rwl, RW_READ LOCK_FL_ARGS);
 	else {
-		membar_enter();
+		membar_enter_after_atomic();
 		WITNESS_CHECKORDER(&rwl->rwl_lock_obj, LOP_NEWORDER,
 		    file, line, NULL);
 		WITNESS_LOCK(&rwl->rwl_lock_obj, 0, file, line);
@@ -112,7 +112,7 @@ _rw_enter_write(struct rwlock *rwl LOCK_FL_VARS)
 	    RW_PROC(p) | RWLOCK_WRLOCK)))
 		_rw_enter(rwl, RW_WRITE LOCK_FL_ARGS);
 	else {
-		membar_enter();
+		membar_enter_after_atomic();
 		WITNESS_CHECKORDER(&rwl->rwl_lock_obj, LOP_EXCLUSIVE |
 		    LOP_NEWORDER, file, line, NULL);
 		WITNESS_LOCK(&rwl->rwl_lock_obj, LOP_EXCLUSIVE, file, line);
@@ -126,7 +126,7 @@ _rw_exit_read(struct rwlock *rwl LOCK_FL_VARS)
 
 	rw_assert_rdlock(rwl);
 
-	membar_exit();
+	membar_exit_before_atomic();
 	if (__predict_false((owner & RWLOCK_WAIT) ||
 	    rw_cas(&rwl->rwl_owner, owner, owner - RWLOCK_READ_INCR)))
 		_rw_exit(rwl LOCK_FL_ARGS);
@@ -141,7 +141,7 @@ _rw_exit_write(struct rwlock *rwl LOCK_FL_VARS)
 
 	rw_assert_wrlock(rwl);
 
-	membar_exit();
+	membar_exit_before_atomic();
 	if (__predict_false((owner & RWLOCK_WAIT) ||
 	    rw_cas(&rwl->rwl_owner, owner, 0)))
 		_rw_exit(rwl LOCK_FL_ARGS);
@@ -261,7 +261,7 @@ retry:
 	if (__predict_false(rw_cas(&rwl->rwl_owner, o, o + inc)))
 		goto retry;
-	membar_enter();
+	membar_enter_after_atomic();
 
 	/*
 	 * If old lock had RWLOCK_WAIT and RWLOCK_WRLOCK set, it means we
@@ -295,7 +295,7 @@ _rw_exit(struct rwlock *rwl LOCK_FL_VARS)
 	WITNESS_UNLOCK(&rwl->rwl_lock_obj, wrlock ? LOP_EXCLUSIVE : 0,
 	    file, line);
 
-	membar_exit();
+	membar_exit_before_atomic();
 	do {
 		owner = rwl->rwl_owner;
 		if (wrlock)
@@ -312,13 +312,15 @@ _rw_exit(struct rwlock *rwl LOCK_FL_VARS)
 int
 rw_status(struct rwlock *rwl)
 {
-	if (rwl->rwl_owner & RWLOCK_WRLOCK) {
-		if (RW_PROC(curproc) == RW_PROC(rwl->rwl_owner))
+	unsigned long owner = rwl->rwl_owner;
+
+	if (owner & RWLOCK_WRLOCK) {
+		if (RW_PROC(curproc) == RW_PROC(owner))
 			return RW_WRITE;
 		else
 			return RW_WRITE_OTHER;
 	}
-	if (rwl->rwl_owner)
+	if (owner)
 		return RW_READ;
 	return (0);
 }
-- 
Mateusz Guzik
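The idea behind the barrier half of the diff can be sketched in userland (C11 atomics stand in for the kernel's rw_cas, and the macros are redefined locally for illustration; this is not the kernel code): on amd64 and i386 a lock-prefixed compare-and-swap already orders memory accesses fully, so the membar adjacent to the atomic only needs to stop the compiler from reordering, which costs nothing at runtime.

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Sketch, assuming GCC/clang: on x86 the locked CAS is itself a full
 * memory barrier, so the membars degrade to pure compiler barriers
 * (an empty asm with a "memory" clobber). On weaker architectures
 * these would have to emit real fence instructions.
 */
#define membar_enter_after_atomic()	__asm__ volatile("" ::: "memory")
#define membar_exit_before_atomic()	__asm__ volatile("" ::: "memory")

static _Atomic unsigned long rwl_owner;

/* Try to take the write lock if free; return 1 on success, 0 if held. */
static int
enter_write(unsigned long me)
{
	unsigned long expected = 0;

	if (!atomic_compare_exchange_strong(&rwl_owner, &expected, me))
		return 0;
	membar_enter_after_atomic();	/* acquire side: after the atomic */
	return 1;
}

static void
exit_write(void)
{
	membar_exit_before_atomic();	/* release side: before the store */
	atomic_store(&rwl_owner, 0);
}
```

This is why the patch is expected to produce no binary change outside i386/amd64: on every other arch the *_atomic variants fall back to the plain membars.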
Re: rw locks vs memory barriers and adaptive spinning
Hello Mateusz,

On 09/10/17(Mon) 21:43, Mateusz Guzik wrote:
> I was looking at rw lock code out of curiosity and noticed you always do
> membar_enter which on an MP-enabled amd64 kernel translates to mfence.
> This makes the entire business a little bit slower.
>
> Interestingly you already have the relevant macros for amd64:
> #define membar_enter_after_atomic() __membar("")
> #define membar_exit_before_atomic() __membar("")

If you look at the history, you'll see that these macros didn't exist
when membar_* were added to rw locks.

> And there is even a default variant for archs which don't provide one.
> I guess the switch should be easy.

Then please do it :) A lot of stuff is easy on OpenBSD but we are not
enough.

Do it, test it, explain it, get oks and commit it 8)
rw locks vs memory barriers and adaptive spinning
Hello,

I was looking at rw lock code out of curiosity and noticed you always do
membar_enter which on an MP-enabled amd64 kernel translates to mfence.
This makes the entire business a little bit slower.

Interestingly you already have the relevant macros for amd64:
#define membar_enter_after_atomic() __membar("")
#define membar_exit_before_atomic() __membar("")

And there is even a default variant for archs which don't provide one.
I guess the switch should be easy.

Grabbing for reading is rw_cas to a higher value. On failure you
explicitly re-read from the lock. This is slower than necessary in the
presence of concurrent read lock/unlock, since cas returns the found
value and you can use that instead. Also the read lock fast path does
not have to descend to the slow path after a single cas failure, but
this probably does not matter right now.

The actual question I have here is whether you have played with adaptive
spinning instead of instantly putting threads to sleep, at least for
cases when the kernel lock is not held. This can be as simplistic as
just spinning as long as the lock is owned by a running thread. For
cases where curproc holds the kernel lock you can perhaps drop it for
spinning purposes and reacquire it later. I have no idea if this one is
going to help anything, but it is definitely worth testing imo.

A side note is that the global locks I found are not annotated in any
manner with respect to exclusivity of cacheline placement. In particular
netlock in a 6.2 kernel shares its cacheline with if_input_task_locked:

81aca608 D if_input_task_locked
81aca630 D netlock

The standard boilerplate to deal with it is to annotate the lock with
aligned() and place the variable in a dedicated section. FreeBSD and
NetBSD contain the necessary crappery to copy-paste, including linker
script support.

Cheers,
-- 
Mateusz Guzik
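The adaptive-spinning idea above can be sketched as follows. This is a hypothetical userland model, not OpenBSD code: C11 atomics stand in for rw_cas, and owner_running()/go_to_sleep() are made-up stand-ins for kernel facilities (a "owner is on CPU" check and the sleep queue machinery). It also shows the related point about reusing the value the CAS found instead of re-reading the lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static _Atomic unsigned long rwl_owner;
static int sleeps;	/* counts how often we gave up and slept */

/* Stand-in: would ask the scheduler whether the owning thread is on a CPU. */
static bool
owner_running(unsigned long owner)
{
	(void)owner;
	return false;	/* stub: pretend the owner is off-CPU */
}

/* Stand-in: a real kernel would block here until woken by the unlocker. */
static void
go_to_sleep(void)
{
	sleeps++;
}

static void
enter_write(unsigned long me)
{
	for (;;) {
		unsigned long found = 0;

		/* On failure the CAS writes the value it found into
		 * 'found', so no separate re-read of the lock is needed. */
		if (atomic_compare_exchange_strong(&rwl_owner, &found, me))
			return;			/* lock acquired */
		if (owner_running(found))
			continue;		/* owner on CPU: keep spinning */
		go_to_sleep();			/* owner blocked: sleeping wins */
	}
}
```

The heuristic is that a lock held by a running thread is likely to be released within the spin budget, so a short spin is cheaper than a sleep/wakeup round trip; once the owner is off-CPU, spinning can only waste cycles.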