Re: [PATCH 1/2] brw_mutex: big read-write mutex
On Fri, 26 Oct 2012, Oleg Nesterov wrote:

> > The code is different, but it can be changed to use percpu rw semaphores
> > (if we add percpu_down_write_trylock).
>
> I don't really understand how you can make percpu_down_write_trylock()
> atomic so that it can be called under br_write_lock(vfsmount_lock) in
> sb_prepare_remount_readonly(). So I guess you also need to replace
> vfsmount_lock at least. Or _trylock needs the barriers in _down_read.
> Or I missed something.
>
> Oleg.

That's true - that code runs under a spinlock, so you can't implement a
non-blocking percpu_down_write_trylock there.

Mikulas

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/2] brw_mutex: big read-write mutex
On 10/26, Mikulas Patocka wrote:
>
> On Fri, 26 Oct 2012, Oleg Nesterov wrote:
>
> I didn't know about it. The code is not reusable, and it doesn't really
> do locking.

That was my main point. As for changing fs/namespace.c to use
percpu_rwsem, I am not sure it is that simple or even worthwhile, but I
won't argue; I do not pretend I understand this code.

> > I don't understand why you both think that __mnt_want_write()
> > and mnt_make_readonly() provide the same functionality. I looked
> > at this code before I started this patch, and unless I completely
> > misread it this does very different things. It is not a "lock" at all.
> >
> > Oleg.
>
> mnt_want_write uses a percpu array of counters, just like percpu
> semaphores.

and this is all, imo ;)

> The code is different, but it can be changed to use percpu rw semaphores
> (if we add percpu_down_write_trylock).

I don't really understand how you can make percpu_down_write_trylock()
atomic so that it can be called under br_write_lock(vfsmount_lock) in
sb_prepare_remount_readonly(). So I guess you also need to replace
vfsmount_lock at least. Or _trylock needs the barriers in _down_read.
Or I missed something.

Oleg.
Re: [PATCH 1/2] brw_mutex: big read-write mutex
On Fri, 26 Oct 2012, Oleg Nesterov wrote:

> On 10/26, Dave Chinner wrote:
> >
> > On Thu, Oct 25, 2012 at 10:09:31AM -0400, Mikulas Patocka wrote:
> > >
> > > Yes, mnt_want_write()/mnt_make_readonly() do the same thing as percpu rw
> > > semaphores. I think you can convert mnt_want_write()/mnt_make_readonly()
> > > to use percpu rw semaphores and remove the duplicated code.
> >
> > I think you misunderstood my point - that rather than re-inventing
> > the wheel, why didn't you just copy something that is known to
> > work?

I didn't know about it. The code is not reusable, and it doesn't really
do locking. And it has two barriers on the read path, while percpu rw
semaphores have none.

> I don't understand why you both think that __mnt_want_write()
> and mnt_make_readonly() provide the same functionality. I looked
> at this code before I started this patch, and unless I completely
> misread it this does very different things. It is not a "lock" at all.
>
> Oleg.

mnt_want_write uses a percpu array of counters, just like percpu
semaphores. The code is different, but it can be changed to use percpu rw
semaphores (if we add percpu_down_write_trylock):

__mnt_want_write could call percpu_down_read and check if the mount is
readonly (if it is, drop the lock and return -EROFS).

__mnt_drop_write could call percpu_up_read.

mnt_make_readonly and sb_prepare_remount_readonly could call
percpu_down_write_trylock instead of mnt_get_writers (if they get the
write lock, set the mount to readonly and drop the write lock).

... and that's it. Then you can remove MNT_WRITE_HOLD, the barriers,
spinning and other complexity from fs/namespace.c.

Mikulas
Re: [PATCH 1/2] brw_mutex: big read-write mutex
On 10/26, Dave Chinner wrote:
>
> On Thu, Oct 25, 2012 at 10:09:31AM -0400, Mikulas Patocka wrote:
> >
> > Yes, mnt_want_write()/mnt_make_readonly() do the same thing as percpu rw
> > semaphores. I think you can convert mnt_want_write()/mnt_make_readonly()
> > to use percpu rw semaphores and remove the duplicated code.
>
> I think you misunderstood my point - that rather than re-inventing
> the wheel, why didn't you just copy something that is known to
> work?

I don't understand why you both think that __mnt_want_write()
and mnt_make_readonly() provide the same functionality. I looked
at this code before I started this patch, and unless I completely
misread it this does very different things. It is not a "lock" at all.

Oleg.
Re: [PATCH 1/2] brw_mutex: big read-write mutex
On Thu, Oct 25, 2012 at 10:09:31AM -0400, Mikulas Patocka wrote:
>
> On Wed, 24 Oct 2012, Dave Chinner wrote:
>
> > On Fri, Oct 19, 2012 at 06:54:41PM -0400, Mikulas Patocka wrote:
> > >
> > > On Fri, 19 Oct 2012, Peter Zijlstra wrote:
> > >
> > > > > Yes, I tried this approach - it involves doing a LOCK instruction
> > > > > on read lock, remembering the cpu and doing another LOCK
> > > > > instruction on read unlock (which will hopefully be on the same
> > > > > CPU, so no cacheline bouncing happens in the common case). It was
> > > > > slower than the approach without any LOCK instructions (43.3
> > > > > seconds for the implementation with per-cpu LOCKed access, 42.7
> > > > > seconds for this implementation without atomic instructions; the
> > > > > benchmark involved doing 512-byte direct-io reads and writes on a
> > > > > ramdisk with 8 processes on an 8-core machine).
> > > >
> > > > So why is that a problem? Surely that's already tons better than
> > > > what you've currently got.
> > >
> > > Percpu rw-semaphores do not improve performance at all. I put them
> > > there to avoid a performance regression, not to improve performance.
> > >
> > > All Linux kernels have a race condition - when you change the block
> > > size of a block device and you read or write the device at the same
> > > time, a crash may happen. This bug has been there forever. Recently,
> > > it started to cause major trouble - multiple high-profile business
> > > sites report crashes because of this race condition.
> > >
> > > You can fix this race by using a read lock around I/O paths and a
> > > write lock around block size changing, but a normal rw semaphore
> > > causes cache line bouncing when taken for read by multiple
> > > processors, and the resulting I/O performance degradation is
> > > measurable.
> >
> > This doesn't sound like a new problem. Hasn't this global access,
> > single modifier exclusion problem been solved before in the VFS?
> > e.g. mnt_want_write()/mnt_make_readonly()
> >
> > Cheers,
> >
> > Dave.
>
> Yes, mnt_want_write()/mnt_make_readonly() do the same thing as percpu rw
> semaphores. I think you can convert mnt_want_write()/mnt_make_readonly()
> to use percpu rw semaphores and remove the duplicated code.

I think you misunderstood my point - that rather than re-inventing
the wheel, why didn't you just copy something that is known to
work?

Cheers,

Dave.

--
Dave Chinner
da...@fromorbit.com
Re: [PATCH 1/2] brw_mutex: big read-write mutex
On Wed, 24 Oct 2012, Dave Chinner wrote:

> On Fri, Oct 19, 2012 at 06:54:41PM -0400, Mikulas Patocka wrote:
> >
> > On Fri, 19 Oct 2012, Peter Zijlstra wrote:
> >
> > > > Yes, I tried this approach - it involves doing a LOCK instruction
> > > > on read lock, remembering the cpu and doing another LOCK instruction
> > > > on read unlock (which will hopefully be on the same CPU, so no
> > > > cacheline bouncing happens in the common case). It was slower than
> > > > the approach without any LOCK instructions (43.3 seconds for the
> > > > implementation with per-cpu LOCKed access, 42.7 seconds for this
> > > > implementation without atomic instructions; the benchmark involved
> > > > doing 512-byte direct-io reads and writes on a ramdisk with 8
> > > > processes on an 8-core machine).
> > >
> > > So why is that a problem? Surely that's already tons better than what
> > > you've currently got.
> >
> > Percpu rw-semaphores do not improve performance at all. I put them
> > there to avoid a performance regression, not to improve performance.
> >
> > All Linux kernels have a race condition - when you change the block
> > size of a block device and you read or write the device at the same
> > time, a crash may happen. This bug has been there forever. Recently, it
> > started to cause major trouble - multiple high-profile business sites
> > report crashes because of this race condition.
> >
> > You can fix this race by using a read lock around I/O paths and a write
> > lock around block size changing, but a normal rw semaphore causes cache
> > line bouncing when taken for read by multiple processors, and the
> > resulting I/O performance degradation is measurable.
>
> This doesn't sound like a new problem. Hasn't this global access,
> single modifier exclusion problem been solved before in the VFS?
> e.g. mnt_want_write()/mnt_make_readonly()
>
> Cheers,
>
> Dave.

Yes, mnt_want_write()/mnt_make_readonly() do the same thing as percpu rw
semaphores. I think you can convert mnt_want_write()/mnt_make_readonly()
to use percpu rw semaphores and remove the duplicated code.

Mikulas
Re: [PATCH 1/2] brw_mutex: big read-write mutex
On Fri, Oct 19, 2012 at 06:54:41PM -0400, Mikulas Patocka wrote:
>
> On Fri, 19 Oct 2012, Peter Zijlstra wrote:
>
> > > Yes, I tried this approach - it involves doing a LOCK instruction on
> > > read lock, remembering the cpu and doing another LOCK instruction on
> > > read unlock (which will hopefully be on the same CPU, so no cacheline
> > > bouncing happens in the common case). It was slower than the approach
> > > without any LOCK instructions (43.3 seconds for the implementation
> > > with per-cpu LOCKed access, 42.7 seconds for this implementation
> > > without atomic instructions; the benchmark involved doing 512-byte
> > > direct-io reads and writes on a ramdisk with 8 processes on an 8-core
> > > machine).
> >
> > So why is that a problem? Surely that's already tons better than what
> > you've currently got.
>
> Percpu rw-semaphores do not improve performance at all. I put them there
> to avoid a performance regression, not to improve performance.
>
> All Linux kernels have a race condition - when you change the block size
> of a block device and you read or write the device at the same time, a
> crash may happen. This bug has been there forever. Recently, it started
> to cause major trouble - multiple high-profile business sites report
> crashes because of this race condition.
>
> You can fix this race by using a read lock around I/O paths and a write
> lock around block size changing, but a normal rw semaphore causes cache
> line bouncing when taken for read by multiple processors, and the
> resulting I/O performance degradation is measurable.

This doesn't sound like a new problem. Hasn't this global access,
single modifier exclusion problem been solved before in the VFS?
e.g. mnt_want_write()/mnt_make_readonly()

Cheers,

Dave.

--
Dave Chinner
da...@fromorbit.com
Re: [PATCH 1/2] brw_mutex: big read-write mutex
Hi Mikulas,

On 10/22, Mikulas Patocka wrote:
>
> On Fri, 19 Oct 2012, Oleg Nesterov wrote:
>
> > On 10/19, Mikulas Patocka wrote:
> > >
> > > synchronize_rcu() is way slower than msleep(1) -
> >
> > This depends, I guess. But this doesn't matter,
> >
> > > so I don't see a reason
> > > why it should be complicated to avoid msleep(1).
> >
> > I don't think this really needs complications. Please look at this
> > patch for example. Or the initial (single writer) version below. It is
> > not finished and lacks the barriers too, but I do not think it is
              ^^^
              please note the comment above ;)
> > more complex.
>
> Hi
>
> My implementation has a smaller structure (it doesn't have
> wait_queue_head_t).

Oh, I don't think sizeof() really matters in this case.

> Your implementation is prone to starvation - if the writer has a high
> priority and if it is doing back-to-back write unlocks/locks, it may
> happen that the readers have no chance to run.

Yes, it is write-biased, this was the intent. Writers should be rare.

> The use of a mutex instead of a wait queue in my implementation is
> unusual, but I don't see anything wrong with it

Neither do I. Mikulas, apart from the _rcu/_sched change, my only point
was that msleep() can (and imho should) be avoided.

> > static inline long brw_read_ctr(struct brw_sem *brw)
> > {
> > 	long sum = 0;
> > 	int cpu;
> >
> > 	for_each_possible_cpu(cpu)
> > 		sum += per_cpu(*brw->read_ctr, cpu);
>
> Integer overflow on signed types is undefined - you should use unsigned
> long - you can use the -fwrapv option to gcc to make signed overflow
> defined, but Linux doesn't use it.

I don't think -fwrapv can make any difference in this case, but I agree
that "unsigned long" makes more sense.

> > void brw_up_write(struct brw_sem *brw)
> > {
> > 	brw->writer = NULL;
> > 	synchronize_sched();
>
> That synchronize_sched should be put before brw->writer = NULL.

Yes, I know. I mentioned this at the start: it lacks the necessary
barrier between this writer and the next reader.

> I had this bug in my implementation too.

Yes, exactly. And this is why I cc'ed you initially ;)

Oleg.
Re: [PATCH 1/2] brw_mutex: big read-write mutex
On Fri, 19 Oct 2012, Oleg Nesterov wrote:

> On 10/19, Mikulas Patocka wrote:
> >
> > synchronize_rcu() is way slower than msleep(1) -
>
> This depends, I guess. But this doesn't matter,
>
> > so I don't see a reason
> > why it should be complicated to avoid msleep(1).
>
> I don't think this really needs complications. Please look at this
> patch for example. Or the initial (single writer) version below. It is
> not finished and lacks the barriers too, but I do not think it is
> more complex.

Hi

My implementation has a smaller structure (it doesn't have
wait_queue_head_t).

Using preempt_disable()/synchronize_sched() instead of RCU seems like a
good idea. Here, the locked region is so small that it doesn't make sense
to play tricks with preemptible RCU.

Your implementation is prone to starvation - if the writer has a high
priority and if it is doing back-to-back write unlocks/locks, it may
happen that the readers have no chance to run.

The use of a mutex instead of a wait queue in my implementation is
unusual, but I don't see anything wrong with it - it makes the structure
smaller and it solves the starvation problem (which would otherwise be
complicated to solve).

Mikulas

> Oleg.
>
> struct brw_sem {
> 	long __percpu		*read_ctr;
> 	wait_queue_head_t	read_waitq;
> 	struct mutex		writer_mutex;
> 	struct task_struct	*writer;
> };
>
> int brw_init(struct brw_sem *brw)
> {
> 	brw->writer = NULL;
> 	mutex_init(&brw->writer_mutex);
> 	init_waitqueue_head(&brw->read_waitq);
> 	brw->read_ctr = alloc_percpu(long);
> 	return brw->read_ctr ? 0 : -ENOMEM;
> }
>
> void brw_down_read(struct brw_sem *brw)
> {
> 	for (;;) {
> 		bool done = false;
>
> 		preempt_disable();
> 		if (likely(!brw->writer)) {
> 			__this_cpu_inc(*brw->read_ctr);
> 			done = true;
> 		}
> 		preempt_enable();
>
> 		if (likely(done))
> 			break;
>
> 		__wait_event(brw->read_waitq, !brw->writer);
> 	}
> }
>
> void brw_up_read(struct brw_sem *brw)
> {
> 	struct task_struct *writer;
>
> 	preempt_disable();
> 	__this_cpu_dec(*brw->read_ctr);
> 	writer = ACCESS_ONCE(brw->writer);
> 	if (unlikely(writer))
> 		wake_up_process(writer);
> 	preempt_enable();
> }
>
> static inline long brw_read_ctr(struct brw_sem *brw)
> {
> 	long sum = 0;
> 	int cpu;
>
> 	for_each_possible_cpu(cpu)
> 		sum += per_cpu(*brw->read_ctr, cpu);

Integer overflow on signed types is undefined - you should use unsigned
long - you can use the -fwrapv option to gcc to make signed overflow
defined, but Linux doesn't use it.

>
> 	return sum;
> }
>
> void brw_down_write(struct brw_sem *brw)
> {
> 	mutex_lock(&brw->writer_mutex);
> 	brw->writer = current;
> 	synchronize_sched();
> 	/*
> 	 * Thereafter brw_*_read() must see ->writer != NULL,
> 	 * and we should see the result of __this_cpu_inc().
> 	 */
> 	for (;;) {
> 		set_current_state(TASK_UNINTERRUPTIBLE);
> 		if (brw_read_ctr(brw) == 0)
> 			break;
> 		schedule();
> 	}
> 	__set_current_state(TASK_RUNNING);
> 	/*
> 	 * We can add another synchronize_sched() to avoid the
> 	 * spurious wakeups from brw_up_read() after return.
> 	 */
> }
>
> void brw_up_write(struct brw_sem *brw)
> {
> 	brw->writer = NULL;
> 	synchronize_sched();

That synchronize_sched should be put before brw->writer = NULL. This is
incorrect, because brw->writer = NULL may be reordered with previous
writes done by this process, and the other CPU may see brw->writer == NULL
(and think that the lock is unlocked) while it doesn't see the previous
writes done by the writer. I had this bug in my implementation too.

> 	wake_up_all(&brw->read_waitq);
> 	mutex_unlock(&brw->writer_mutex);
> }

Mikulas
Re: [PATCH 1/2] brw_mutex: big read-write mutex
On Fri, 19 Oct 2012, Peter Zijlstra wrote:

> > Yes, I tried this approach - it involves doing a LOCK instruction on
> > read lock, remembering the cpu and doing another LOCK instruction on
> > read unlock (which will hopefully be on the same CPU, so no cacheline
> > bouncing happens in the common case). It was slower than the approach
> > without any LOCK instructions (43.3 seconds for the implementation with
> > per-cpu LOCKed access, 42.7 seconds for this implementation without
> > atomic instructions; the benchmark involved doing 512-byte direct-io
> > reads and writes on a ramdisk with 8 processes on an 8-core machine).
>
> So why is that a problem? Surely that's already tons better than what
> you've currently got.

Percpu rw-semaphores do not improve performance at all. I put them there
to avoid a performance regression, not to improve performance.

All Linux kernels have a race condition - when you change the block size
of a block device and you read or write the device at the same time, a
crash may happen. This bug has been there forever. Recently, it started to
cause major trouble - multiple high-profile business sites report crashes
because of this race condition.

You can fix this race by using a read lock around I/O paths and a write
lock around block size changing, but a normal rw semaphore causes cache
line bouncing when taken for read by multiple processors, and the
resulting I/O performance degradation is measurable.

So I put this percpu-rw-semaphore there to fix the crashes and minimize
the performance impact - on x86 it doesn't take any interlocked
instructions in the read path.

I don't quite understand why people are opposing this and what they want
to do instead. If you pull percpu-rw-semaphores out of the kernel, you
introduce a performance regression (raw device I/O will be slower on 3.7
than on 3.6, because on 3.6 it doesn't take any lock at all and on 3.7 it
takes a read lock).

So you have these options:

1) don't lock I/O, just like on 3.6 and previous versions - you get a
   fast kernel that randomly crashes

2) lock I/O with a normal rw semaphore - you get a kernel that doesn't
   crash, but that is slower than previous versions

3) lock I/O with a percpu rw semaphore - you get a kernel that is almost
   as fast as previous kernels and that doesn't crash

For the users, option 3) is the best. The users don't care whether it
looks ugly or not, they care about correctness and performance, that's
all.

Obviously, you can improve the rw semaphores by adding lockdep
annotations, or by other things (turning rcu_read_lock/synchronize_rcu
into preempt_disable/synchronize_sched, using
barrier()/synchronize_sched() instead of smp_mb()...), but I don't see a
reason why you would want to hurt users' experience by pulling it out,
reverting to state 1) or 2), and then, two kernel cycles later, coming up
with percpu-rw-semaphores again.

Mikulas
Re: [PATCH 1/2] brw_mutex: big read-write mutex
On Thu, Oct 18, 2012 at 07:57:47PM +0200, Oleg Nesterov wrote:
> On 10/18, Paul E. McKenney wrote:
> >
> > On Thu, Oct 18, 2012 at 06:24:09PM +0200, Oleg Nesterov wrote:
> > >
> > > I thought that you meant that without mb() brw_start_write() can
> > > race with brw_end_read() and hang forever.
> > >
> > > But probably you meant that we need the barriers to ensure that,
> > > say, if the reader does
> > >
> > > 	brw_start_read();
> > > 	CONDITION = 1;
> > > 	brw_end_read();
> > >
> > > then the writer must see CONDITION != 0 after brw_start_write() ?
> > > (or vice-versa)
> >
> > Yes, this is exactly my concern.
>
> Oh, thanks a lot Paul (as always).

Glad it helped. ;-)

> > > In this case we need the barrier, yes. Obviously brw_start_write()
> > > can return right after this_cpu_dec() and before wake_up_all().
> > >
> > > 2/2 doesn't need this guarantee but I agree, this doesn't look
> > > sane in general...
> >
> > Or name it something not containing "lock". And clearly document
> > the behavior and how it is to be used. ;-)
>
> this would be insane, I guess ;)

Well, I suppose you could call it a "phase": brw_start_phase_1() and so
on.

> So. Ignoring the possible optimization you mentioned before,
> brw_end_read() should do:
>
> 	smp_mb();
> 	this_cpu_dec();
>
> 	wake_up_all();
>
> And yes, we need the full mb(). wmb() is enough to ensure that the
> writer will see the memory modifications done by the reader. But we
> also need to ensure that any LOAD inside start_read/end_read can not
> be moved outside of the critical section.
>
> But we should also ensure that "read" will see all modifications
> which were done under start_write/end_write. This means that
> brw_end_write() needs another synchronize_sched() before
> atomic_dec_and_test(), or brw_start_read() needs mb() in the
> fast-path.
>
> Correct?

Good point, I missed the need for synchronize_sched() to avoid readers
sleeping through the next write cycle due to racing with an exiting
writer. But yes, this sounds correct.

> Ooooh. And I just noticed include/linux/percpu-rwsem.h which does
> something similar. Certainly it was not in my tree when I started
> this patch... percpu_down_write() doesn't allow multiple writers,
> but the main problem is that it uses msleep(1). It should not, I think.
>
> But. It seems that percpu_up_write() is equally wrong? Doesn't
> it need synchronize_rcu() before "p->locked = false"?
>
> (add Mikulas)

Mikulas said something about doing an updated patch, so I figured I would
look at his next version.

							Thanx, Paul
Re: [PATCH 1/2] brw_mutex: big read-write mutex
On 10/19, Peter Zijlstra wrote:
>
> But using preempt_{disable,enable} and using synchronize_sched() would
> be better (for PREEMPT_RCU) although it wouldn't fix anything
> fundamental.

BTW, I agree. I didn't even notice percpu-rwsem.h uses _rcu, not _sched.

> Fine goal, although somewhat arch specific. Also note that there's a
> relation between atomics and memory barriers, one isn't necessarily
> worse than the other, they all require synchronization of sorts.

As Paul pointed out, the fast path can avoid mb(). It is only needed
when "up_read" detects the writer.

Oleg.
Re: [PATCH 1/2] brw_mutex: big read-write mutex
On 10/19, Mikulas Patocka wrote:
>
> synchronize_rcu() is way slower than msleep(1) -

This depends, I guess. But this doesn't matter,

> so I don't see a reason
> why it should be complicated to avoid msleep(1).

I don't think this really needs complications. Please look at this
patch for example. Or the initial (single writer) version below. It is
not finished and lacks the barriers too, but I do not think it is
more complex.

Oleg.

struct brw_sem {
	long __percpu		*read_ctr;
	wait_queue_head_t	read_waitq;
	struct mutex		writer_mutex;
	struct task_struct	*writer;
};

int brw_init(struct brw_sem *brw)
{
	brw->writer = NULL;
	mutex_init(&brw->writer_mutex);
	init_waitqueue_head(&brw->read_waitq);
	brw->read_ctr = alloc_percpu(long);
	return brw->read_ctr ? 0 : -ENOMEM;
}

void brw_down_read(struct brw_sem *brw)
{
	for (;;) {
		bool done = false;

		preempt_disable();
		if (likely(!brw->writer)) {
			__this_cpu_inc(*brw->read_ctr);
			done = true;
		}
		preempt_enable();

		if (likely(done))
			break;

		__wait_event(brw->read_waitq, !brw->writer);
	}
}

void brw_up_read(struct brw_sem *brw)
{
	struct task_struct *writer;

	preempt_disable();
	__this_cpu_dec(*brw->read_ctr);
	writer = ACCESS_ONCE(brw->writer);
	if (unlikely(writer))
		wake_up_process(writer);
	preempt_enable();
}

static inline long brw_read_ctr(struct brw_sem *brw)
{
	long sum = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		sum += per_cpu(*brw->read_ctr, cpu);

	return sum;
}

void brw_down_write(struct brw_sem *brw)
{
	mutex_lock(&brw->writer_mutex);
	brw->writer = current;
	synchronize_sched();
	/*
	 * Thereafter brw_*_read() must see ->writer != NULL,
	 * and we should see the result of __this_cpu_inc().
	 */
	for (;;) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		if (brw_read_ctr(brw) == 0)
			break;
		schedule();
	}
	__set_current_state(TASK_RUNNING);
	/*
	 * We can add another synchronize_sched() to avoid the
	 * spurious wakeups from brw_up_read() after return.
	 */
}

void brw_up_write(struct brw_sem *brw)
{
	brw->writer = NULL;
	synchronize_sched();
	wake_up_all(&brw->read_waitq);
	mutex_unlock(&brw->writer_mutex);
}
Re: [PATCH 1/2] brw_mutex: big read-write mutex
On Fri, 2012-10-19 at 11:32 -0400, Mikulas Patocka wrote:

> So if you can do an alternative implementation without RCU, show it.

Uhm... no, that's not how it works. You just don't push through crap like
this and then demand someone else does it better.

But using preempt_{disable,enable} and using synchronize_sched() would
be better (for PREEMPT_RCU) although it wouldn't fix anything
fundamental.

> The
> goal is - there should be no LOCK instructions on the read path and as
> few barriers as possible.

Fine goal, although somewhat arch specific. Also note that there's a
relation between atomics and memory barriers, one isn't necessarily
worse than the other, they all require synchronization of sorts.

> > So did you consider keeping the inc/dec on the same per-cpu variable?
> > Yes this adds a potential remote access to dec and requires you to use
> > atomics, but I would not be surprised if the inc/dec were mostly on the
> > same cpu most of the times -- which might be plenty fast for what you
> > want.
>
> Yes, I tried this approach - it involves doing a LOCK instruction on
> read lock, remembering the cpu and doing another LOCK instruction on
> read unlock (which will hopefully be on the same CPU, so no cacheline
> bouncing happens in the common case). It was slower than the approach
> without any LOCK instructions (43.3 seconds for the implementation with
> per-cpu LOCKed access, 42.7 seconds for this implementation without
> atomic instructions; the benchmark involved doing 512-byte direct-io
> reads and writes on a ramdisk with 8 processes on an 8-core machine).

So why is that a problem? Surely that's already tons better than what
you've currently got.

Also, uncontended LOCK is something all x86 vendors keep optimizing;
they'll have to if they want to keep adding CPUs.

> > If you've got coherent per-cpu counts, you can better do the
> > waitqueue/wake condition for write_down.
>
> synchronize_rcu() is way slower than msleep(1) - so I don't see a reason
> why it should be complicated to avoid msleep(1).

It's not about slow, a polling write side is just fscking ugly. Also, if
you're already polling, that *_sync() is bloody pointless.

> > It might also make sense to do away with the mutex, there's no point in
> > serializing the wakeups in the p->locked case of down_read.
>
> The advantage of a mutex is that it is already protected against
> starvation. If I replace the mutex with a wait queue and retry, there is
> no starvation protection.

Which starvation? Writer-writer order? What stops you from adding a list
there yourself? Also, writers had better be rare for this thing, so who
gives a crap?

> > Furthermore,
> > p->locked seems a complete duplicate of the mutex state, so removing
> > the mutex also removes that duplication.
>
> We could replace if (p->locked) with if (mutex_is_locked(p->mtx))

Quite so..

You're also still lacking lockdep annotations...
Re: [PATCH 1/2] brw_mutex: big read-write mutex
On Fri, 19 Oct 2012, Peter Zijlstra wrote: > On Thu, 2012-10-18 at 15:28 -0400, Mikulas Patocka wrote: > > > > On Thu, 18 Oct 2012, Oleg Nesterov wrote: > > > > > Ooooh. And I just noticed include/linux/percpu-rwsem.h which does > > > something similar. Certainly it was not in my tree when I started > > > this patch... percpu_down_write() doesn't allow multiple writers, > > > but the main problem it uses msleep(1). It should not, I think. > > > > synchronize_rcu() can sleep for hundred milliseconds, so msleep(1) is not > > a big problem. > > That code is beyond ugly though.. it should really not have been merged. > > There's absolutely no reason for it to use RCU except to make it more So if you can do an alternative implementation without RCU, show it. The goal is - there should be no LOCK instructions on the read path and as few barriers as possible. > complicated. And as Oleg pointed out that msleep() is very ill > considered. > > The very worst part of it seems to be that nobody who's usually involved > with locking primitives was ever consulted (Linus, PaulMck, Oleg, Ingo, > tglx, dhowells and me). It doesn't even have lockdep annotations :/ > > So the only reason you appear to use RCU is because you don't actually > have a sane way to wait for count==0. And I'm contesting rcu_sync() is > sane here -- for the very simple reason you still need while (count) > loop right after it. > > So it appears you want an totally reader biased, sleepable rw-lock like > thing? Yes. > So did you consider keeping the inc/dec on the same per-cpu variable? > Yes this adds a potential remote access to dec and requires you to use > atomics, but I would not be surprised if the inc/dec were mostly on the > same cpu most of the times -- which might be plenty fast for what you > want. 
Yes, I tried this approach - it involves doing a LOCK instruction on read lock, remembering the cpu and doing another LOCK instruction on read unlock (which will hopefully be on the same CPU, so no cacheline bouncing happens in the common case). It was slower than the approach without any LOCK instructions (43.3 seconds for the implementation with per-cpu LOCKed access, 42.7 seconds for this implementation without atomic instructions; the benchmark involved doing 512-byte direct-io reads and writes on a ramdisk with 8 processes on an 8-core machine).

> If you've got coherent per-cpu counts, you can better do the
> waitqueue/wake condition for write_down.

synchronize_rcu() is way slower than msleep(1) - so I don't see a reason why it should be complicated to avoid msleep(1).

> It might also make sense to do away with the mutex, there's no point in
> serializing the wakeups in the p->locked case of down_read.

The advantage of a mutex is that it is already protected against starvation. If I replace the mutex with a wait queue and retry, there is no starvation protection.

> Furthermore,
> p->locked seems a complete duplicate of the mutex state, so removing the
> mutex also removes that duplication.

We could replace if (p->locked) with if (mutex_is_locked(p->mtx))

> Also, that CONFIG_x86 thing.. *shudder*...

Mikulas
Re: [PATCH 1/2] brw_mutex: big read-write mutex
On Thu, 2012-10-18 at 15:28 -0400, Mikulas Patocka wrote:
>
> On Thu, 18 Oct 2012, Oleg Nesterov wrote:
>
> > Ooooh. And I just noticed include/linux/percpu-rwsem.h which does
> > something similar. Certainly it was not in my tree when I started
> > this patch... percpu_down_write() doesn't allow multiple writers,
> > but the main problem is that it uses msleep(1). It should not, I think.
>
> synchronize_rcu() can sleep for a hundred milliseconds, so msleep(1) is not
> a big problem.

That code is beyond ugly though.. it should really not have been merged.

There's absolutely no reason for it to use RCU except to make it more complicated. And as Oleg pointed out, that msleep() is very ill-considered.

The very worst part of it seems to be that nobody who's usually involved with locking primitives was ever consulted (Linus, PaulMck, Oleg, Ingo, tglx, dhowells and me). It doesn't even have lockdep annotations :/

So the only reason you appear to use RCU is because you don't actually have a sane way to wait for count==0. And I'm contesting rcu_sync() is sane here -- for the very simple reason you still need a while (count) loop right after it.

So it appears you want a totally reader-biased, sleepable rw-lock-like thing?

So did you consider keeping the inc/dec on the same per-cpu variable? Yes this adds a potential remote access to dec and requires you to use atomics, but I would not be surprised if the inc/dec were mostly on the same cpu most of the time -- which might be plenty fast for what you want.

If you've got coherent per-cpu counts, you can better do the waitqueue/wake condition for write_down.

It might also make sense to do away with the mutex, there's no point in serializing the wakeups in the p->locked case of down_read. Furthermore, p->locked seems a complete duplicate of the mutex state, so removing the mutex also removes that duplication.

Also, that CONFIG_x86 thing.. *shudder*...
Re: [PATCH 1/2] brw_mutex: big read-write mutex
On Thu, 18 Oct 2012, Oleg Nesterov wrote:

> Ooooh. And I just noticed include/linux/percpu-rwsem.h which does
> something similar. Certainly it was not in my tree when I started
> this patch... percpu_down_write() doesn't allow multiple writers,
> but the main problem is that it uses msleep(1). It should not, I think.

synchronize_rcu() can sleep for a hundred milliseconds, so msleep(1) is not a big problem.

> But. It seems that percpu_up_write() is equally wrong? Doesn't
> it need synchronize_rcu() before "p->locked = false" ?

Yes, it does ... and I sent a patch for that to Linus.

> (add Mikulas)
>
> Oleg.

Mikulas
Re: [PATCH 1/2] brw_mutex: big read-write mutex
On 10/18, Paul E. McKenney wrote:
>
> On Thu, Oct 18, 2012 at 06:24:09PM +0200, Oleg Nesterov wrote:
> >
> > I thought that you meant that without mb() brw_start_write() can
> > race with brw_end_read() and hang forever.
> >
> > But probably you meant that we need the barriers to ensure that,
> > say, if the reader does
> >
> > 	brw_start_read();
> > 	CONDITION = 1;
> > 	brw_end_read();
> >
> > then the writer must see CONDITION != 0 after brw_start_write() ?
> > (or vice-versa)
>
> Yes, this is exactly my concern.

Oh, thanks a lot Paul (as always).

> > In this case we need the barrier, yes. Obviously brw_start_write()
> > can return right after this_cpu_dec() and before wake_up_all().
> >
> > 2/2 doesn't need this guarantee but I agree, this doesn't look
> > sane in general...
>
> Or name it something not containing "lock". And clearly document
> the behavior and how it is to be used. ;-)

this would be insane, I guess ;)

So. Ignoring the possible optimization you mentioned before, brw_end_read() should do:

	smp_mb();
	this_cpu_dec();
	wake_up_all();

And yes, we need the full mb(). wmb() is enough to ensure that the writer will see the memory modifications done by the reader. But we also need to ensure that any LOAD inside start_read/end_read cannot be moved outside of the critical section.

But we should also ensure that "read" will see all modifications which were done under start_write/end_write. This means that brw_end_write() needs another synchronize_sched() before atomic_dec_and_test(), or brw_start_read() needs mb() in the fast-path.

Correct?

Ooooh. And I just noticed include/linux/percpu-rwsem.h which does something similar. Certainly it was not in my tree when I started this patch... percpu_down_write() doesn't allow multiple writers, but the main problem is that it uses msleep(1). It should not, I think.

But. It seems that percpu_up_write() is equally wrong? Doesn't it need synchronize_rcu() before "p->locked = false" ?

(add Mikulas)

Oleg.
Re: [PATCH 1/2] brw_mutex: big read-write mutex
On Thu, Oct 18, 2012 at 06:24:09PM +0200, Oleg Nesterov wrote: > On 10/17, Paul E. McKenney wrote: > > > > On Wed, Oct 17, 2012 at 06:37:02PM +0200, Oleg Nesterov wrote: > > > On 10/16, Paul E. McKenney wrote: > > > > > > > > Suppose that the writer arrives and sees that the value of the counter > > > > is zero, > > > > > > after synchronize_sched(). So there are no readers (but perhaps there > > > are brw_end_read's in flight which already decremented read_ctr) > > > > But the preempt_disable() region only covers read acquisition. So > > synchronize_sched() waits only for all the brw_start_read() calls to > > reach the preempt_enable() > > Yes. > > > -- it cannot wait for all the resulting > > readers to reach the corresponding brw_end_read(). > > Indeed. > > > > > and thus never sleeps, and so is also not awakened? > > > > > > and why do we need wakeup in this case? > > > > To get the memory barriers required to keep the critical sections > > ordered -- to ensure that everyone sees the reader's critical section > > as ending before the writer's critical section starts. > > And now I am starting to think I misunderstood your concern from > the very beginning. > > I thought that you meant that without mb() brw_start_write() can > race with brw_end_read() and hang forever. > > But probably you meant that we need the barriers to ensure that, > say, if the reader does > > brw_start_read(); > CONDITION = 1; > brw_end_read(); > > then the writer must see CONDITION != 0 after brw_start_write() ? > (or vice-versa) Yes, this is exactly my concern. > In this case we need the barrier, yes. Obviously brw_start_write() > can return right after this_cpu_dec() and before wake_up_all(). > > 2/2 doesn't need this guarantee but I agree, this doesn't look > sane in gerenal... Or name it something not containing "lock". And clearly document the behavior and how it is to be used. ;-) Otherwise, someone will get confused and introduce bugs. > Or I misunderstood you again? 
No, this was indeed my concern.

	Thanx, Paul
Re: [PATCH 1/2] brw_mutex: big read-write mutex
On 10/17, Paul E. McKenney wrote:
>
> On Wed, Oct 17, 2012 at 06:37:02PM +0200, Oleg Nesterov wrote:
> > On 10/16, Paul E. McKenney wrote:
> > >
> > > Suppose that the writer arrives and sees that the value of the counter
> > > is zero,
> >
> > after synchronize_sched(). So there are no readers (but perhaps there
> > are brw_end_read's in flight which already decremented read_ctr)
>
> But the preempt_disable() region only covers read acquisition. So
> synchronize_sched() waits only for all the brw_start_read() calls to
> reach the preempt_enable()

Yes.

> -- it cannot wait for all the resulting
> readers to reach the corresponding brw_end_read().

Indeed.

> > > and thus never sleeps, and so is also not awakened?
> >
> > and why do we need wakeup in this case?
>
> To get the memory barriers required to keep the critical sections
> ordered -- to ensure that everyone sees the reader's critical section
> as ending before the writer's critical section starts.

And now I am starting to think I misunderstood your concern from the very beginning.

I thought that you meant that without mb() brw_start_write() can race with brw_end_read() and hang forever.

But probably you meant that we need the barriers to ensure that, say, if the reader does

	brw_start_read();
	CONDITION = 1;
	brw_end_read();

then the writer must see CONDITION != 0 after brw_start_write() ? (or vice-versa)

In this case we need the barrier, yes. Obviously brw_start_write() can return right after this_cpu_dec() and before wake_up_all().

2/2 doesn't need this guarantee but I agree, this doesn't look sane in general...

Or I misunderstood you again?

Oleg.
Re: [PATCH 1/2] brw_mutex: big read-write mutex
On Wed, Oct 17, 2012 at 06:59:02PM +0200, Oleg Nesterov wrote:
> On 10/16, Linus Torvalds wrote:
> >
> > On Mon, Oct 15, 2012 at 12:10 PM, Oleg Nesterov wrote:
> > > This patch adds the new sleeping lock, brw_mutex. Unlike rw_semaphore
> > > it allows multiple writers too, just "read" and "write" are mutually
> > > exclusive.
> >
> > So those semantics just don't sound sane. It's also not what any kind
> > of normal "rw" lock ever does.
>
> Yes, this is not usual.
>
> And initially I made brw_sem which allows only 1 writer, but then
> I changed this patch.
>
> > So can you explain why these particular insane semantics are useful,
> > and what for?
>
> To allow multiple uprobe_register/unregister at the same time. Mostly
> to not add the "regression", currently this is possible.
>
> It is not that I think this is terribly important, but still. And
> personally I think that "multiple writers" is not necessarily insane
> in general. Suppose you have the complex object/subsystem, the readers
> can use a single brw_mutex to access it "lockless", start_read() is
> very cheap.
>
> But start_write() is slow. Multiple writers can use fine-grained locking
> inside the start_write/end_write section and do not block each other.

Strangely enough, the old VAXCluster locking primitives allowed this sort of thing. The brw_start_read() would be a "protected read", and brw_start_write() would be a "concurrent write". Even more interesting, they gave the same advice you give -- concurrent writes should use fine-grained locking to protect the actual accesses.

It seems like it should be possible to come up with better names, but I cannot think of any at the moment.

	Thanx, Paul

PS. For the sufficiently masochistic, here is the exclusion table for the six VAXCluster locking modes:

	     NL  CR  CW  PR  PW  EX
	NL
	CR                       X
	CW               X   X   X
	PR           X       X   X
	PW           X   X   X   X
	EX       X   X   X   X   X

"X" means that the pair of modes exclude each other, otherwise the lock may be held in both of the modes simultaneously.
Modes:

	NL: Null, or "not held".
	CR: Concurrent read.
	CW: Concurrent write.
	PR: Protected read.
	PW: Protected write.
	EX: Exclusive.

A reader-writer lock could use protected read for readers and either of protected write or exclusive for writers, the difference between protected write and exclusive being irrelevant in the absence of concurrent readers.
Re: [PATCH 1/2] brw_mutex: big read-write mutex
On Wed, Oct 17, 2012 at 06:37:02PM +0200, Oleg Nesterov wrote: > On 10/16, Paul E. McKenney wrote: > > > > On Tue, Oct 16, 2012 at 05:56:23PM +0200, Oleg Nesterov wrote: > > > > > > > > I believe that you need smp_mb() here. > > > > > > I don't understand why... > > > > > > > The wake_up_all()'s memory barriers > > > > do not suffice because some other reader might have awakened the writer > > > > between this_cpu_dec() and wake_up_all(). > > > > > > But __wake_up(q) takes q->lock? And the same lock is taken by > > > prepare_to_wait(), so how can the writer miss the result of _dec? > > > > Suppose that the writer arrives and sees that the value of the counter > > is zero, > > after synchronize_sched(). So there are no readers (but perhaps there > are brw_end_read's in flight which already decremented read_ctr) But the preempt_disable() region only covers read acquisition. So synchronize_sched() waits only for all the brw_start_read() calls to reach the preempt_enable() -- it cannot wait for all the resulting readers to reach the corresponding brw_end_read(). > > and thus never sleeps, and so is also not awakened? > > and why do we need wakeup in this case? To get the memory barriers required to keep the critical sections ordered -- to ensure that everyone sees the reader's critical section as ending before the writer's critical section starts. > > > > void brw_end_read(struct brw_mutex *brw) > > > > { > > > > if (unlikely(atomic_read(&brw->write_ctr))) { > > > > smp_mb(); > > > > this_cpu_dec(*brw->read_ctr); > > > > wake_up_all(&brw->write_waitq); > > > > > > Hmm... still can't understand. > > > > > > It seems that this mb() is needed to ensure that brw_end_read() can't > > > miss write_ctr != 0. > > > > > > But we do not care unless the writer already does wait_event(). And > > > before it does wait_event() it calls synchronize_sched() after it sets > > > write_ctr != 0. 
Doesn't this mean that after that any preempt-disabled > > > section must see write_ctr != 0 ? > > > > > > This code actually checks write_ctr after preempt_disable + enable, > > > but I think this doesn't matter? > > > > > > Paul, most probably I misunderstood you. Could you spell please? > > > > Let me try outlining the sequence of events that I am worried about... > > > > 1. Task A invokes brw_start_read(). There is no writer, so it > > takes the fastpath. > > > > 2. Task B invokes brw_start_write(), atomically increments > > &brw->write_ctr, and executes synchronize_sched(). > > > > 3. Task A invokes brw_end_read() and does this_cpu_dec(). > > OK. And to simplify this discussion, suppose that A invoked > brw_start_read() on CPU_0 and thus incremented read_ctr[0], and > then it migrates to CPU_1 and brw_end_read() uses read_ctr[1]. > > My understanding was, brw_start_write() must see read_ctr[0] == 1 > after synchronize_sched(). Yep. But it makes absolutely no guarantee about ordering of the decrement of read_ctr[1]. > > 4. Task B invokes wait_event(), which invokes brw_read_ctr() > > and sees the result as zero. > > So my understanding is completely wrong? I thought that after > synchronize_sched() we should see the result of any operation > which were done inside the preempt-disable section. We should indeed. But the decrement of read_ctr[1] is not done within the preempt_disable() section, and the guarantee therefore does not apply to it. This means that there is no guarantee that Task A's read-side critical section will be ordered before Task B's read-side critical section. Now, maybe you don't need that guarantee, but if you don't, I am missing what exactly these primitives are doing for you. > No? > > Hmm. 
Suppose that we have long A = B = STOP = 0, and > > void func(void) > { > preempt_disable(); > if (!STOP) { > A = 1; > B = 1; > } > preempt_enable(); > } > > Now, you are saying that this code > > STOP = 1; > > synchronize_sched(); > > BUG_ON(A != B); > > is not correct? (yes, yes, this example is not very good). Yep. Assuming no other modifications to A and B, at the point of the BUG_ON(), we should have A==1 and B==1. The thing is that the preempt_disable() in your patch only covers brw_start_read(), but not brw_end_read(). So the decrement (along with the rest of the read-side critical section) is unordered with respect to the write-side critical section started by the brw_start_write(). > The comment above synchronize_sched() says: > > return ... after all currently executing > rcu-sched read-side critical sections have completed. > > But if this code is wrong, then what "completed" actually means? > I thought that it also means "all memory operations have completed", > but this is not true? >From what I can see, your interpretation of synchronize_sched() is correct. The p
Re: [PATCH 1/2] brw_mutex: big read-write mutex
On 10/16, Linus Torvalds wrote:
>
> On Mon, Oct 15, 2012 at 12:10 PM, Oleg Nesterov wrote:
> > This patch adds the new sleeping lock, brw_mutex. Unlike rw_semaphore
> > it allows multiple writers too, just "read" and "write" are mutually
> > exclusive.
>
> So those semantics just don't sound sane. It's also not what any kind
> of normal "rw" lock ever does.

Yes, this is not usual.

And initially I made brw_sem which allows only 1 writer, but then I changed this patch.

> So can you explain why these particular insane semantics are useful,
> and what for?

To allow multiple uprobe_register/unregister at the same time. Mostly to not add the "regression", currently this is possible.

It is not that I think this is terribly important, but still. And personally I think that "multiple writers" is not necessarily insane in general. Suppose you have the complex object/subsystem, the readers can use a single brw_mutex to access it "lockless", start_read() is very cheap.

But start_write() is slow. Multiple writers can use fine-grained locking inside the start_write/end_write section and do not block each other.

Oleg.
Re: [PATCH 1/2] brw_mutex: big read-write mutex
On 10/16, Paul E. McKenney wrote: > > On Tue, Oct 16, 2012 at 05:56:23PM +0200, Oleg Nesterov wrote: > > > > > > I believe that you need smp_mb() here. > > > > I don't understand why... > > > > > The wake_up_all()'s memory barriers > > > do not suffice because some other reader might have awakened the writer > > > between this_cpu_dec() and wake_up_all(). > > > > But __wake_up(q) takes q->lock? And the same lock is taken by > > prepare_to_wait(), so how can the writer miss the result of _dec? > > Suppose that the writer arrives and sees that the value of the counter > is zero, after synchronize_sched(). So there are no readers (but perhaps there are brw_end_read's in flight which already decremented read_ctr) > and thus never sleeps, and so is also not awakened? and why do we need wakeup in this case? > > > void brw_end_read(struct brw_mutex *brw) > > > { > > > if (unlikely(atomic_read(&brw->write_ctr))) { > > > smp_mb(); > > > this_cpu_dec(*brw->read_ctr); > > > wake_up_all(&brw->write_waitq); > > > > Hmm... still can't understand. > > > > It seems that this mb() is needed to ensure that brw_end_read() can't > > miss write_ctr != 0. > > > > But we do not care unless the writer already does wait_event(). And > > before it does wait_event() it calls synchronize_sched() after it sets > > write_ctr != 0. Doesn't this mean that after that any preempt-disabled > > section must see write_ctr != 0 ? > > > > This code actually checks write_ctr after preempt_disable + enable, > > but I think this doesn't matter? > > > > Paul, most probably I misunderstood you. Could you spell please? > > Let me try outlining the sequence of events that I am worried about... > > 1.Task A invokes brw_start_read(). There is no writer, so it > takes the fastpath. > > 2.Task B invokes brw_start_write(), atomically increments > &brw->write_ctr, and executes synchronize_sched(). > > 3.Task A invokes brw_end_read() and does this_cpu_dec(). OK. 
And to simplify this discussion, suppose that A invoked brw_start_read() on CPU_0 and thus incremented read_ctr[0], and then it migrates to CPU_1 and brw_end_read() uses read_ctr[1].

My understanding was, brw_start_write() must see read_ctr[0] == 1 after synchronize_sched().

> 4. Task B invokes wait_event(), which invokes brw_read_ctr()
> and sees the result as zero.

So my understanding is completely wrong? I thought that after synchronize_sched() we should see the result of any operation which was done inside the preempt-disable section.

No?

Hmm. Suppose that we have long A = B = STOP = 0, and

	void func(void)
	{
		preempt_disable();
		if (!STOP) {
			A = 1;
			B = 1;
		}
		preempt_enable();
	}

Now, you are saying that this code

	STOP = 1;

	synchronize_sched();

	BUG_ON(A != B);

is not correct? (yes, yes, this example is not very good).

The comment above synchronize_sched() says:

	return ... after all currently executing
	rcu-sched read-side critical sections have completed.

But if this code is wrong, then what does "completed" actually mean? I thought that it also means "all memory operations have completed", but this is not true?

Oleg.
Re: [PATCH 1/2] brw_mutex: big read-write mutex
On Mon, Oct 15, 2012 at 12:10 PM, Oleg Nesterov wrote:
> This patch adds the new sleeping lock, brw_mutex. Unlike rw_semaphore
> it allows multiple writers too, just "read" and "write" are mutually
> exclusive.

So those semantics just don't sound sane. It's also not what any kind of normal "rw" lock ever does.

So can you explain why these particular insane semantics are useful, and what for?

	Linus
Re: [PATCH 1/2] brw_mutex: big read-write mutex
On Tue, Oct 16, 2012 at 05:56:23PM +0200, Oleg Nesterov wrote: > Paul, thanks for looking! > > On 10/15, Paul E. McKenney wrote: > > > > > +void brw_start_read(struct brw_mutex *brw) > > > +{ > > > + for (;;) { > > > + bool done = false; > > > + > > > + preempt_disable(); > > > + if (likely(!atomic_read(&brw->write_ctr))) { > > > + __this_cpu_inc(*brw->read_ctr); > > > + done = true; > > > + } > > > > brw_start_read() is not recursive -- attempting to call it recursively > > can result in deadlock if a writer has shown up in the meantime. > > Yes, yes, it is not recursive. Like rw_semaphore. > > > Which is often OK, but not sure what you intended. > > I forgot to document this in the changelog. Hey, I had to ask. ;-) > > > +void brw_end_read(struct brw_mutex *brw) > > > +{ > > > > I believe that you need smp_mb() here. > > I don't understand why... > > > The wake_up_all()'s memory barriers > > do not suffice because some other reader might have awakened the writer > > between this_cpu_dec() and wake_up_all(). > > But __wake_up(q) takes q->lock? And the same lock is taken by > prepare_to_wait(), so how can the writer miss the result of _dec? Suppose that the writer arrives and sees that the value of the counter is zero, and thus never sleeps, and so is also not awakened? Unless I am missing something, there are no memory barriers in that case. Which means that you also need an smp_mb() after the wait_event() in the writer, now that I think on it. > > > + this_cpu_dec(*brw->read_ctr); > > > + > > > + if (unlikely(atomic_read(&brw->write_ctr))) > > > + wake_up_all(&brw->write_waitq); > > > +} > > > > Of course, it would be good to avoid smp_mb on the fast path. Here is > > one way to avoid it: > > > > void brw_end_read(struct brw_mutex *brw) > > { > > if (unlikely(atomic_read(&brw->write_ctr))) { > > smp_mb(); > > this_cpu_dec(*brw->read_ctr); > > wake_up_all(&brw->write_waitq); > > Hmm... still can't understand. 
> > It seems that this mb() is needed to ensure that brw_end_read() can't > miss write_ctr != 0. > > But we do not care unless the writer already does wait_event(). And > before it does wait_event() it calls synchronize_sched() after it sets > write_ctr != 0. Doesn't this mean that after that any preempt-disabled > section must see write_ctr != 0 ? > > This code actually checks write_ctr after preempt_disable + enable, > but I think this doesn't matter? > > Paul, most probably I misunderstood you. Could you spell please? Let me try outlining the sequence of events that I am worried about... 1. Task A invokes brw_start_read(). There is no writer, so it takes the fastpath. 2. Task B invokes brw_start_write(), atomically increments &brw->write_ctr, and executes synchronize_sched(). 3. Task A invokes brw_end_read() and does this_cpu_dec(). 4. Task B invokes wait_event(), which invokes brw_read_ctr() and sees the result as zero. Therefore, Task B does not sleep, does not acquire locks, and does not execute any memory barriers. As a result, ordering is not guaranteed between Task A's read-side critical section and Task B's upcoming write-side critical section. So I believe that you need smp_mb() in both brw_end_read() and brw_start_write(). Sigh... It is quite possible that you also need an smp_mb() in brw_start_read(), but let's start with just the scenario above. So, does the above scenario show a problem, or am I confused? > > > +void brw_start_write(struct brw_mutex *brw) > > > +{ > > > + atomic_inc(&brw->write_ctr); > > > + synchronize_sched(); > > > + /* > > > + * Thereafter brw_*_read() must see write_ctr != 0, > > > + * and we should see the result of __this_cpu_inc(). > > > + */ > > > + wait_event(brw->write_waitq, brw_read_ctr(brw) == 0); > > > > This looks like it allows multiple writers to proceed concurrently. > > They both increment, do a synchronize_sched(), do the wait_event(), > > and then are both awakened by the last reader. > > Yes. 
From the changelog:
>
> 	Unlike rw_semaphore it allows multiple writers too,
> 	just "read" and "write" are mutually exclusive.

OK, color me blind! ;-)

> > Was that the intent? (The implementation of brw_end_write() makes
> > it look like it is in fact the intent.)
>
> Please look at 2/2.
>
> Multiple uprobe_register() or uprobe_unregister() can run at the
> same time to install/remove the system-wide breakpoint, and
> brw_start_write() is used to block dup_mmap() to avoid the race.
> But they do not block each other.

Ah, makes sense, thank you!

	Thanx, Paul
Re: [PATCH 1/2] brw_mutex: big read-write mutex
Paul, thanks for looking! On 10/15, Paul E. McKenney wrote: > > > +void brw_start_read(struct brw_mutex *brw) > > +{ > > + for (;;) { > > + bool done = false; > > + > > + preempt_disable(); > > + if (likely(!atomic_read(&brw->write_ctr))) { > > + __this_cpu_inc(*brw->read_ctr); > > + done = true; > > + } > > brw_start_read() is not recursive -- attempting to call it recursively > can result in deadlock if a writer has shown up in the meantime. Yes, yes, it is not recursive. Like rw_semaphore. > Which is often OK, but not sure what you intended. I forgot to document this in the changelog. > > +void brw_end_read(struct brw_mutex *brw) > > +{ > > I believe that you need smp_mb() here. I don't understand why... > The wake_up_all()'s memory barriers > do not suffice because some other reader might have awakened the writer > between this_cpu_dec() and wake_up_all(). But __wake_up(q) takes q->lock? And the same lock is taken by prepare_to_wait(), so how can the writer miss the result of _dec? > > + this_cpu_dec(*brw->read_ctr); > > + > > + if (unlikely(atomic_read(&brw->write_ctr))) > > + wake_up_all(&brw->write_waitq); > > +} > > Of course, it would be good to avoid smp_mb on the fast path. Here is > one way to avoid it: > > void brw_end_read(struct brw_mutex *brw) > { > if (unlikely(atomic_read(&brw->write_ctr))) { > smp_mb(); > this_cpu_dec(*brw->read_ctr); > wake_up_all(&brw->write_waitq); Hmm... still can't understand. It seems that this mb() is needed to ensure that brw_end_read() can't miss write_ctr != 0. But we do not care unless the writer already does wait_event(). And before it does wait_event() it calls synchronize_sched() after it sets write_ctr != 0. Doesn't this mean that after that any preempt-disabled section must see write_ctr != 0 ? This code actually checks write_ctr after preempt_disable + enable, but I think this doesn't matter? Paul, most probably I misunderstood you. Could you spell please? 
> > +void brw_start_write(struct brw_mutex *brw)
> > +{
> > +	atomic_inc(&brw->write_ctr);
> > +	synchronize_sched();
> > +	/*
> > +	 * Thereafter brw_*_read() must see write_ctr != 0,
> > +	 * and we should see the result of __this_cpu_inc().
> > +	 */
> > +	wait_event(brw->write_waitq, brw_read_ctr(brw) == 0);
>
> This looks like it allows multiple writers to proceed concurrently.
> They both increment, do a synchronize_sched(), do the wait_event(),
> and then are both awakened by the last reader.

Yes. From the changelog:

	Unlike rw_semaphore it allows multiple writers too,
	just "read" and "write" are mutually exclusive.

> Was that the intent? (The implementation of brw_end_write() makes
> it look like it is in fact the intent.)

Please look at 2/2.

Multiple uprobe_register() or uprobe_unregister() can run at the same time to install/remove the system-wide breakpoint, and brw_start_write() is used to block dup_mmap() to avoid the race. But they do not block each other.

Thanks!

Oleg.
Re: [PATCH 1/2] brw_mutex: big read-write mutex
On Mon, Oct 15, 2012 at 09:10:18PM +0200, Oleg Nesterov wrote:
> This patch adds the new sleeping lock, brw_mutex. Unlike rw_semaphore
> it allows multiple writers too, just "read" and "write" are mutually
> exclusive.
>
> brw_start_read() and brw_end_read() are extremely cheap, they only do
> this_cpu_inc(read_ctr) + atomic_read() if there are no waiting writers.
>
> OTOH it is write-biased, any brw_start_write() blocks the new readers.
> But "write" is slow, it does synchronize_sched() to serialize with
> preempt_disable() in brw_start_read(), and wait_event(write_waitq) can
> have a lot of extra wakeups before percpu-counter-sum becomes zero.

A few questions and comments below, as always.

                                                        Thanx, Paul

> Signed-off-by: Oleg Nesterov
> ---
>  include/linux/brw_mutex.h |   22 +++
>  lib/Makefile              |    2 +-
>  lib/brw_mutex.c           |   67 +
>  3 files changed, 90 insertions(+), 1 deletions(-)
>  create mode 100644 include/linux/brw_mutex.h
>  create mode 100644 lib/brw_mutex.c
>
> diff --git a/include/linux/brw_mutex.h b/include/linux/brw_mutex.h
> new file mode 100644
> index 0000000..16b8d5f
> --- /dev/null
> +++ b/include/linux/brw_mutex.h
> @@ -0,0 +1,22 @@
> +#ifndef _LINUX_BRW_MUTEX_H
> +#define _LINUX_BRW_MUTEX_H
> +
> +#include <linux/percpu.h>
> +#include <linux/wait.h>
> +
> +struct brw_mutex {
> +        long __percpu           *read_ctr;
> +        atomic_t                write_ctr;
> +        wait_queue_head_t       read_waitq;
> +        wait_queue_head_t       write_waitq;
> +};
> +
> +extern int brw_mutex_init(struct brw_mutex *brw);
> +
> +extern void brw_start_read(struct brw_mutex *brw);
> +extern void brw_end_read(struct brw_mutex *brw);
> +
> +extern void brw_start_write(struct brw_mutex *brw);
> +extern void brw_end_write(struct brw_mutex *brw);
> +
> +#endif
> diff --git a/lib/Makefile b/lib/Makefile
> index 3128e35..18f2876 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
>          idr.o int_sqrt.o extable.o \
>          sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
>          proportions.o flex_proportions.o
>          prio_heap.o ratelimit.o show_mem.o \
> -        is_single_threaded.o plist.o decompress.o
> +        is_single_threaded.o plist.o decompress.o brw_mutex.o
>
>  lib-$(CONFIG_MMU) += ioremap.o
>  lib-$(CONFIG_SMP) += cpumask.o
> diff --git a/lib/brw_mutex.c b/lib/brw_mutex.c
> new file mode 100644
> index 0000000..41984a6
> --- /dev/null
> +++ b/lib/brw_mutex.c
> @@ -0,0 +1,67 @@
> +#include <linux/brw_mutex.h>
> +#include <linux/rcupdate.h>
> +#include <linux/sched.h>
> +
> +int brw_mutex_init(struct brw_mutex *brw)
> +{
> +        atomic_set(&brw->write_ctr, 0);
> +        init_waitqueue_head(&brw->read_waitq);
> +        init_waitqueue_head(&brw->write_waitq);
> +        brw->read_ctr = alloc_percpu(long);
> +        return brw->read_ctr ? 0 : -ENOMEM;
> +}
> +
> +void brw_start_read(struct brw_mutex *brw)
> +{
> +        for (;;) {
> +                bool done = false;
> +
> +                preempt_disable();
> +                if (likely(!atomic_read(&brw->write_ctr))) {
> +                        __this_cpu_inc(*brw->read_ctr);
> +                        done = true;
> +                }

brw_start_read() is not recursive -- attempting to call it recursively
can result in deadlock if a writer has shown up in the meantime.
Which is often OK, but not sure what you intended.

> +                preempt_enable();
> +
> +                if (likely(done))
> +                        break;
> +
> +                __wait_event(brw->read_waitq, !atomic_read(&brw->write_ctr));
> +        }
> +}
> +
> +void brw_end_read(struct brw_mutex *brw)
> +{

I believe that you need smp_mb() here.  The wake_up_all()'s memory
barriers do not suffice because some other reader might have awakened
the writer between this_cpu_dec() and wake_up_all().

IIRC, this smp_mb() is also needed if the timing is such that the
writer does not actually block.

> +        this_cpu_dec(*brw->read_ctr);
> +
> +        if (unlikely(atomic_read(&brw->write_ctr)))
> +                wake_up_all(&brw->write_waitq);
> +}

Of course, it would be good to avoid smp_mb on the fast path.
Here is one way to avoid it:

        void brw_end_read(struct brw_mutex *brw)
        {
                if (unlikely(atomic_read(&brw->write_ctr))) {
                        smp_mb();
                        this_cpu_dec(*brw->read_ctr);
                        wake_up_all(&brw->write_waitq);
                } else {
                        this_cpu_dec(*brw->read_ctr);
                }
        }

> +static inline long brw_read_ctr(struct brw_mutex *brw)
> +{
> +        long sum = 0;
> +        int cpu;
> +
> +        for_each_possible_cpu(cpu)
> +                sum += per_cpu(*brw->read_ctr, cpu);
> +
> +        return sum;
> +}
> +
> +void brw_start_write(struct brw_mutex *brw)
> +{
> +        atomic_inc(&brw->write_ctr);
> +        synchronize_sched();
> +        /*
> +         * Thereafter brw_*_read() must see write_ctr != 0,
> +         * and we should see the result of __this_cpu_inc()
[PATCH 1/2] brw_mutex: big read-write mutex
This patch adds the new sleeping lock, brw_mutex. Unlike rw_semaphore
it allows multiple writers too, just "read" and "write" are mutually
exclusive.

brw_start_read() and brw_end_read() are extremely cheap, they only do
this_cpu_inc(read_ctr) + atomic_read() if there are no waiting writers.

OTOH it is write-biased, any brw_start_write() blocks the new readers.
But "write" is slow, it does synchronize_sched() to serialize with
preempt_disable() in brw_start_read(), and wait_event(write_waitq) can
have a lot of extra wakeups before percpu-counter-sum becomes zero.

Signed-off-by: Oleg Nesterov
---
 include/linux/brw_mutex.h |   22 +++
 lib/Makefile              |    2 +-
 lib/brw_mutex.c           |   67 +
 3 files changed, 90 insertions(+), 1 deletions(-)
 create mode 100644 include/linux/brw_mutex.h
 create mode 100644 lib/brw_mutex.c

diff --git a/include/linux/brw_mutex.h b/include/linux/brw_mutex.h
new file mode 100644
index 0000000..16b8d5f
--- /dev/null
+++ b/include/linux/brw_mutex.h
@@ -0,0 +1,22 @@
+#ifndef _LINUX_BRW_MUTEX_H
+#define _LINUX_BRW_MUTEX_H
+
+#include <linux/percpu.h>
+#include <linux/wait.h>
+
+struct brw_mutex {
+        long __percpu           *read_ctr;
+        atomic_t                write_ctr;
+        wait_queue_head_t       read_waitq;
+        wait_queue_head_t       write_waitq;
+};
+
+extern int brw_mutex_init(struct brw_mutex *brw);
+
+extern void brw_start_read(struct brw_mutex *brw);
+extern void brw_end_read(struct brw_mutex *brw);
+
+extern void brw_start_write(struct brw_mutex *brw);
+extern void brw_end_write(struct brw_mutex *brw);
+
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index 3128e35..18f2876 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
         idr.o int_sqrt.o extable.o \
         sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
         proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
-        is_single_threaded.o plist.o decompress.o
+        is_single_threaded.o plist.o decompress.o brw_mutex.o

 lib-$(CONFIG_MMU) += ioremap.o
 lib-$(CONFIG_SMP) += cpumask.o
diff --git a/lib/brw_mutex.c
b/lib/brw_mutex.c
new file mode 100644
index 0000000..41984a6
--- /dev/null
+++ b/lib/brw_mutex.c
@@ -0,0 +1,67 @@
+#include <linux/brw_mutex.h>
+#include <linux/rcupdate.h>
+#include <linux/sched.h>
+
+int brw_mutex_init(struct brw_mutex *brw)
+{
+        atomic_set(&brw->write_ctr, 0);
+        init_waitqueue_head(&brw->read_waitq);
+        init_waitqueue_head(&brw->write_waitq);
+        brw->read_ctr = alloc_percpu(long);
+        return brw->read_ctr ? 0 : -ENOMEM;
+}
+
+void brw_start_read(struct brw_mutex *brw)
+{
+        for (;;) {
+                bool done = false;
+
+                preempt_disable();
+                if (likely(!atomic_read(&brw->write_ctr))) {
+                        __this_cpu_inc(*brw->read_ctr);
+                        done = true;
+                }
+                preempt_enable();
+
+                if (likely(done))
+                        break;
+
+                __wait_event(brw->read_waitq, !atomic_read(&brw->write_ctr));
+        }
+}
+
+void brw_end_read(struct brw_mutex *brw)
+{
+        this_cpu_dec(*brw->read_ctr);
+
+        if (unlikely(atomic_read(&brw->write_ctr)))
+                wake_up_all(&brw->write_waitq);
+}
+
+static inline long brw_read_ctr(struct brw_mutex *brw)
+{
+        long sum = 0;
+        int cpu;
+
+        for_each_possible_cpu(cpu)
+                sum += per_cpu(*brw->read_ctr, cpu);
+
+        return sum;
+}
+
+void brw_start_write(struct brw_mutex *brw)
+{
+        atomic_inc(&brw->write_ctr);
+        synchronize_sched();
+        /*
+         * Thereafter brw_*_read() must see write_ctr != 0,
+         * and we should see the result of __this_cpu_inc().
+         */
+        wait_event(brw->write_waitq, brw_read_ctr(brw) == 0);
+}
+
+void brw_end_write(struct brw_mutex *brw)
+{
+        if (atomic_dec_and_test(&brw->write_ctr))
+                wake_up_all(&brw->read_waitq);
+}
--
1.5.5.1