Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
On Sun, Nov 11, 2012 at 04:45:09PM +0100, Oleg Nesterov wrote:
> On 11/09, Paul E. McKenney wrote:
> >
> > On Fri, Nov 09, 2012 at 07:10:48PM +0100, Oleg Nesterov wrote:
> > >
> > > 	static bool xxx(brw)
> > > 	{
> > > 		down_write(&brw->rw_sem);
> >
> > down_write_trylock()
> >
> > As you noted in your later email.  Presumably you return false if
> > the attempt to acquire it fails.
>
> Yes, yes, thanks.
>
> > > But first we should do other changes, I think. IMHO we should not do
> > > synchronize_sched() under mutex_lock() and this will add (a bit) more
> > > complications. We will see.
> >
> > Indeed, that does put considerable delay on the writers.  There is always
> > synchronize_sched_expedited(), I suppose.
>
> I am not sure about synchronize_sched_expedited() (at least unconditionally),
> but: only the 1st down_write() needs synchronize_, and up_write() does not
> need to sleep in synchronize_ at all.
>
> To simplify, let's ignore the fact that the writers need to serialize with
> each other. IOW, the pseudo-code below is obviously deadly wrong and racy,
> just to illustrate the idea.
>
> 1. We remove brw->writer_mutex and add "atomic_t writers_ctr".
>
>    update_fast_ctr() uses atomic_read(&brw->writers_ctr) == 0 instead
>    of !mutex_is_locked().
>
> 2. down_write() does
>
> 	if (atomic_add_return(1, &brw->writers_ctr) == 1) {
> 		// first writer
> 		synchronize_sched();
> 		...
> 	} else {
> 		... XXX: wait for percpu_up_write() from the first writer ...
> 	}
>
> 3. up_write() does
>
> 	if (atomic_dec_unless_one(&brw->writers_ctr)) {
> 		... wake up XXX writers above ...
> 		return;
> 	} else {
> 		// the last writer
> 		call_rcu_sched(func => { atomic_dec(&brw->writers_ctr) });
> 	}

Agreed, an asynchronous callback can be used to switch the readers
back onto the fastpath.  Of course, as you say, getting it all working
will require some care.  ;-)

> Once again, this all is racy, but hopefully the idea is clear:
>
> 	- down_write(brw) sleeps in synchronize_sched() only if brw
> 	  has already switched back to fast-path mode
>
> 	- up_write() never sleeps in synchronize_sched(), it uses
> 	  call_rcu_sched() or wakes up the next writer.
>
> Of course I am not sure this all is worth the trouble, this should be
> discussed. (And, cough, I'd like to add the multi-writers mode which I'm
> afraid nobody will like.) But I am not going to even try to do this until
> the current patch is applied, I need it to fix the bug in uprobes and I
> think the current code is "good enough". These changes can't help to speed
> up the readers, and the writers are slow/rare anyway.

Probably best to wait for multi-writers until there is a measurable need,
to be sure!  ;-)

							Thanx, Paul
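To make the up_write() half of this concrete, here is a minimal sketch in kernel C. It assumes two new fields in struct percpu_rw_semaphore (atomic_t writers_ctr and a struct rcu_head rcu) that are not in the posted patch, elides the writer-vs-writer handoff marked XXX above, and inherits the raciness the pseudo-code itself flags; it only illustrates how call_rcu_sched() lets the last writer re-enable the reader fast path without sleeping:

	#include <linux/atomic.h>
	#include <linux/rcupdate.h>
	#include <linux/rwsem.h>
	#include <linux/wait.h>

	/* Runs after a sched grace period: from here on the readers
	 * may see writers_ctr == 0 and use the fast path again. */
	static void enable_fast_path_rcu(struct rcu_head *head)
	{
		struct percpu_rw_semaphore *brw =
			container_of(head, struct percpu_rw_semaphore, rcu);

		atomic_dec(&brw->writers_ctr);
	}

	static void sketch_percpu_up_write(struct percpu_rw_semaphore *brw)
	{
		up_write(&brw->rw_sem);

		if (atomic_read(&brw->writers_ctr) > 1) {
			/* More writers are queued: hand off to the next
			 * one, no grace period needed in between. */
			atomic_dec(&brw->writers_ctr);
			wake_up_all(&brw->write_waitq);
		} else {
			/* Last writer: queue the switch back to fast-path
			 * mode asynchronously instead of sleeping here. */
			call_rcu_sched(&brw->rcu, enable_fast_path_rcu);
		}
	}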
Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
On 11/09, Paul E. McKenney wrote:
>
> On Fri, Nov 09, 2012 at 07:10:48PM +0100, Oleg Nesterov wrote:
> >
> > 	static bool xxx(brw)
> > 	{
> > 		down_write(&brw->rw_sem);
>
> down_write_trylock()
>
> As you noted in your later email.  Presumably you return false if
> the attempt to acquire it fails.

Yes, yes, thanks.

> > But first we should do other changes, I think. IMHO we should not do
> > synchronize_sched() under mutex_lock() and this will add (a bit) more
> > complications. We will see.
>
> Indeed, that does put considerable delay on the writers.  There is always
> synchronize_sched_expedited(), I suppose.

I am not sure about synchronize_sched_expedited() (at least unconditionally),
but: only the 1st down_write() needs synchronize_, and up_write() does not
need to sleep in synchronize_ at all.

To simplify, let's ignore the fact that the writers need to serialize with
each other. IOW, the pseudo-code below is obviously deadly wrong and racy,
just to illustrate the idea.

1. We remove brw->writer_mutex and add "atomic_t writers_ctr".

   update_fast_ctr() uses atomic_read(&brw->writers_ctr) == 0 instead
   of !mutex_is_locked().

2. down_write() does

	if (atomic_add_return(1, &brw->writers_ctr) == 1) {
		// first writer
		synchronize_sched();
		...
	} else {
		... XXX: wait for percpu_up_write() from the first writer ...
	}

3. up_write() does

	if (atomic_dec_unless_one(&brw->writers_ctr)) {
		... wake up XXX writers above ...
		return;
	} else {
		// the last writer
		call_rcu_sched(func => { atomic_dec(&brw->writers_ctr) });
	}

Once again, this all is racy, but hopefully the idea is clear:

	- down_write(brw) sleeps in synchronize_sched() only if brw
	  has already switched back to fast-path mode

	- up_write() never sleeps in synchronize_sched(), it uses
	  call_rcu_sched() or wakes up the next writer.

Of course I am not sure this all is worth the trouble, this should be
discussed. (And, cough, I'd like to add the multi-writers mode which I'm
afraid nobody will like.) But I am not going to even try to do this until
the current patch is applied, I need it to fix the bug in uprobes and I
think the current code is "good enough". These changes can't help to speed
up the readers, and the writers are slow/rare anyway.

Thanks!

Oleg.
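The atomic_dec_unless_one() used above is not an existing kernel primitive; a plausible implementation, inferred from what the pseudo-code needs (decrement and return true, unless the count is exactly 1), could be built on atomic_cmpxchg() as in this sketch:

	#include <linux/atomic.h>

	/* Hypothetical helper: decrement *v and return true, unless *v
	 * is exactly 1 (the caller is the last writer), in which case
	 * leave it untouched and return false.  The caller is assumed
	 * to hold the write side, so *v >= 1 here. */
	static bool atomic_dec_unless_one(atomic_t *v)
	{
		int old = atomic_read(v);

		while (old != 1) {
			int prev = atomic_cmpxchg(v, old, old - 1);

			if (prev == old)
				return true;	/* decremented */
			old = prev;		/* lost a race, retry */
		}
		return false;			/* last writer */
	}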
Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
On Fri, Nov 09, 2012 at 07:10:48PM +0100, Oleg Nesterov wrote:
> On 11/09, Paul E. McKenney wrote:
> >
> > On Fri, Nov 09, 2012 at 04:46:56PM +0100, Oleg Nesterov wrote:
> > > Contrary, I am going to try to add some complications later, so that
> > > it can have more users. In particular, I think it can replace
> > > get_online_cpus/cpu_hotplug_begin, just we need
> > > percpu_down_write_but_dont_deadlock_with_recursive_readers().
> >
> > I must confess that I am a bit concerned about possible scalability
> > bottlenecks in the current get_online_cpus(), so +1 from me on this one.
>
> OK, thanks...
>
> And btw percpu_down_write_but_dont_deadlock_with_recursive_readers() is
> trivial, just it needs down_write(rw_sem) "inside" wait_event(), not
> before. But I'm afraid I will never manage to write the comments ;)
>
> 	static bool xxx(brw)
> 	{
> 		down_write(&brw->rw_sem);

down_write_trylock()

As you noted in your later email.  Presumably you return false if
the attempt to acquire it fails.

> 		if (!atomic_read(&brw->slow_read_ctr))
> 			return true;
>
> 		up_write(&brw->rw_sem);
> 		return false;
> 	}
>
> 	static void __percpu_down_write(struct percpu_rw_semaphore *brw,
> 					bool recursive_readers)
> 	{
> 		mutex_lock(&brw->writer_mutex);
>
> 		synchronize_sched();
>
> 		atomic_add(clear_fast_ctr(brw), &brw->slow_read_ctr);
>
> 		if (recursive_readers) {
> 			wait_event(brw->write_waitq, xxx(brw));

I see what you mean about acquiring brw->rw_sem inside of wait_event().
Cute trick!

The "recursive_readers" is a global initialization-time thing, right?

> 		} else {
> 			down_write(&brw->rw_sem);
>
> 			wait_event(brw->write_waitq,
> 				   !atomic_read(&brw->slow_read_ctr));
> 		}
> 	}

Looks like it should work, and would perform and scale nicely even
if we end up having to greatly increase the number of calls to
get_online_cpus().

> Of course, cpu.c still needs .active_writer to allow get_online_cpus()
> under cpu_hotplug_begin(), but this is simple.

Yep, same check as now.

> But first we should do other changes, I think. IMHO we should not do
> synchronize_sched() under mutex_lock() and this will add (a bit) more
> complications. We will see.

Indeed, that does put considerable delay on the writers.  There is always
synchronize_sched_expedited(), I suppose.

							Thanx, Paul
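Folding in the two fixes from this exchange, down_write_trylock() per Oleg's follow-up and the false-on-failure return Paul infers, the helper might read as in this sketch; xxx() is the thread's own placeholder name:

	static bool xxx(struct percpu_rw_semaphore *brw)
	{
		/* A recursive reader may hold rw_sem for reading, so we
		 * must not block here; just report failure and let
		 * wait_event() sleep until a reader wakes us. */
		if (!down_write_trylock(&brw->rw_sem))
			return false;

		if (!atomic_read(&brw->slow_read_ctr))
			return true;	/* keep rw_sem, writer may proceed */

		up_write(&brw->rw_sem);
		return false;		/* slow readers still active */
	}

	/* used as: wait_event(brw->write_waitq, xxx(brw)); */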
Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
On 11/09, Oleg Nesterov wrote:
>
> 	static bool xxx(brw)
> 	{
> 		down_write(&brw->rw_sem);
> 		if (!atomic_read(&brw->slow_read_ctr))
> 			return true;

I meant, try_to_down_write(). Otherwise this can obviously deadlock.

Oleg.
Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
On 11/09, Paul E. McKenney wrote:
>
> On Fri, Nov 09, 2012 at 04:46:56PM +0100, Oleg Nesterov wrote:
> > Contrary, I am going to try to add some complications later, so that
> > it can have more users. In particular, I think it can replace
> > get_online_cpus/cpu_hotplug_begin, just we need
> > percpu_down_write_but_dont_deadlock_with_recursive_readers().
>
> I must confess that I am a bit concerned about possible scalability
> bottlenecks in the current get_online_cpus(), so +1 from me on this one.

OK, thanks...

And btw percpu_down_write_but_dont_deadlock_with_recursive_readers() is
trivial, just it needs down_write(rw_sem) "inside" wait_event(), not
before. But I'm afraid I will never manage to write the comments ;)

	static bool xxx(brw)
	{
		down_write(&brw->rw_sem);
		if (!atomic_read(&brw->slow_read_ctr))
			return true;

		up_write(&brw->rw_sem);
		return false;
	}

	static void __percpu_down_write(struct percpu_rw_semaphore *brw,
					bool recursive_readers)
	{
		mutex_lock(&brw->writer_mutex);

		synchronize_sched();

		atomic_add(clear_fast_ctr(brw), &brw->slow_read_ctr);

		if (recursive_readers) {
			wait_event(brw->write_waitq, xxx(brw));
		} else {
			down_write(&brw->rw_sem);

			wait_event(brw->write_waitq,
				   !atomic_read(&brw->slow_read_ctr));
		}
	}

Of course, cpu.c still needs .active_writer to allow get_online_cpus()
under cpu_hotplug_begin(), but this is simple.

But first we should do other changes, I think. IMHO we should not do
synchronize_sched() under mutex_lock() and this will add (a bit) more
complications. We will see.

Oleg.
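For context, the .active_writer check Oleg mentions is what kernel/cpu.c already uses to let the hotplug writer recurse; a percpu-rwsem-backed get_online_cpus() would presumably keep it, roughly as in this sketch (the cpu_hotplug.rwsem field is an assumption, not existing code):

	void get_online_cpus(void)
	{
		might_sleep();
		/* allow recursive use by the hotplug writer itself */
		if (cpu_hotplug.active_writer == current)
			return;
		percpu_down_read(&cpu_hotplug.rwsem);
	}

	void put_online_cpus(void)
	{
		if (cpu_hotplug.active_writer == current)
			return;
		percpu_up_read(&cpu_hotplug.rwsem);
	}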
Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
On Fri, Nov 09, 2012 at 04:46:56PM +0100, Oleg Nesterov wrote:
> On 11/08, Andrew Morton wrote:
> >
> > On Thu, 8 Nov 2012 14:48:49 +0100
> > Oleg Nesterov wrote:
> >
> > >  include/linux/percpu-rwsem.h |   83 +
> > >  lib/Makefile                 |    2 +-
> > >  lib/percpu-rwsem.c           |  123 ++
> >
> > The patch also uninlines everything.
> >
> > And it didn't export the resulting symbols to modules, so it isn't an
> > equivalent. We can export things later if needed I guess.
>
> Yes, currently it is only used by block_dev.c
>
> > It adds percpu-rwsem.o to lib-y, so the CONFIG_BLOCK=n kernel will
> > avoid including the code altogether, methinks?
>
> I am going to add another user (uprobes), this was my motivation for
> this patch. And perhaps it will have more users.
>
> But I agree, CONFIG_PERCPU_RWSEM makes sense at least now, I'll send
> the patch.
>
> > > +#include <linux/percpu-rwsem.h>
> > > +#include <linux/rcupdate.h>
> > > +#include <linux/sched.h>
> >
> > This list is nowhere near sufficient to support this file's
> > requirements.  atomic.h, percpu.h, rwsem.h, wait.h, errno.h and plenty
> > more.  IOW, if it compiles, it was sheer luck.
>
> OK, thanks, I'll send
> percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily.fix
>
> > > +/*
> > > + * A writer takes ->writer_mutex to exclude other writers and to force the
> > > + * readers to switch to the slow mode, note the mutex_is_locked() check in
> > > + * update_fast_ctr().
> > > + *
> > > + * After that the readers can only inc/dec the slow ->slow_read_ctr counter,
> > > + * ->fast_read_ctr is stable. Once the writer moves its sum into the slow
> > > + * counter it represents the number of active readers.
> > > + *
> > > + * Finally the writer takes ->rw_sem for writing and blocks the new readers,
> > > + * then waits until the slow counter becomes zero.
> > > + */
> >
> > Some overview of how fast/slow_read_ctr are supposed to work would be
> > useful. This comment seems to assume that the reader already knew that.
>
> I hate to say this, but I'll try to update this comment too ;)
>
> > > +void percpu_down_write(struct percpu_rw_semaphore *brw)
> > > +{
> > > +	/* also blocks update_fast_ctr() which checks mutex_is_locked() */
> > > +	mutex_lock(&brw->writer_mutex);
> > > +
> > > +	/*
> > > +	 * 1. Ensures mutex_is_locked() is visible to any down_read/up_read
> > > +	 *    so that update_fast_ctr() can't succeed.
> > > +	 *
> > > +	 * 2. Ensures we see the result of every previous this_cpu_add() in
> > > +	 *    update_fast_ctr().
> > > +	 *
> > > +	 * 3. Ensures that if any reader has exited its critical section via
> > > +	 *    fast-path, it executes a full memory barrier before we return.
> > > +	 */
> > > +	synchronize_sched();
> >
> > Here's where I get horridly confused.  Your patch completely deRCUifies
> > this code, yes?  Yet here we're using an RCU primitive.  And we seem to
> > be using it not as an RCU primitive but as a handy thing which happens
> > to have desirable side-effects.  But the implementation of
> > synchronize_sched() differs considerably according to which rcu
> > flavor-of-the-minute you're using.
>
> It is documented that synchronize_sched() should play well with
> preempt_disable/enable. From the comment:
>
> 	Note that preempt_disable(),
> 	local_irq_disable(), and so on may be used in place of
> 	rcu_read_lock_sched().
>
> But I guess this needs more discussion, I see other emails in this
> thread...
>
> > And part 3 talks about the reader's critical section.  The only
> > critical sections I can see on the reader side are already covered by
> > mutex_lock() and preempt_disable().
>
> Yes, but we need to ensure that if we take the lock for writing, we
> should see all memory modifications done under down_read/up_read().
>
> IOW. Suppose that the reader does
>
> 	percpu_down_read();
> 	STORE;
> 	percpu_up_read(); // no barriers in the fast path
>
> The writer should see the result of that STORE under percpu_down_write().
>
> Part 3 tries to say that at this point we should already see the result,
> so we should not worry about acquire/release semantics.
>
> > If this code isn't as brain damaged as it
> > initially appears then please,
>
> I hope ;)
>
> > go easy on us simpletons in the next
> > version?
>
> Well, I'll try to update the comments... but the code is simple, I do
> not think I can simplify it more. The nontrivial part is the barriers,
> but this is always nontrivial.
>
> Contrary, I am going to try to add some complications later, so that
> it can have more users. In particular, I think it can replace
> get_online_cpus/cpu_hotplug_begin, just we need
> percpu_down_write_but_dont_deadlock_with_recursive_readers().

I must confess that I am a bit concerned about possible scalability
bottlenecks in the current get_online_cpus(), so +1 from me on this one.

							Thanx, Paul
Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
On Fri, Nov 09, 2012 at 05:35:38PM +0100, Oleg Nesterov wrote:
> On 11/08, Paul E. McKenney wrote:
> >
> > On Thu, Nov 08, 2012 at 04:41:36PM -0800, Paul E. McKenney wrote:
> > > On Thu, Nov 08, 2012 at 06:41:10PM -0500, Mikulas Patocka wrote:
> > > >
> > > > On Thu, 8 Nov 2012, Paul E. McKenney wrote:
> > > >
> > > > > On Thu, Nov 08, 2012 at 12:07:00PM -0800, Andrew Morton wrote:
> > > > > > On Thu, 8 Nov 2012 14:48:49 +0100
> > > > > > Oleg Nesterov wrote:
> > > > > >
> > > > >
> > > > > The algorithm would work given rcu_read_lock()/rcu_read_unlock() and
> > > > > synchronize_rcu() in place of preempt_disable()/preempt_enable() and
> > > > > synchronize_sched().  The real-time guys would prefer the change
> > > > > to rcu_read_lock()/rcu_read_unlock() and synchronize_rcu(), now that
> > > > > you mention it.
> > > > >
> > > > > Oleg, Mikulas, any reason not to move to rcu_read_lock()/rcu_read_unlock()
> > > > > and synchronize_rcu()?
> > > >
> > > > preempt_disable/preempt_enable is faster than
> > > > rcu_read_lock/rcu_read_unlock for preemptive kernels.
>
> Yes, I chose preempt_disable() because it is the fastest/simplest
> primitive and the critical section is really tiny.
>
> But:
>
> > > Significantly faster in this case?  Can you measure the difference
> > > from a user-mode test?
>
> I do not think rcu_read_lock() or rcu_read_lock_sched() can actually
> make a measurable difference.
>
> > Actually, the fact that __this_cpu_add() will malfunction on some
> > architectures if preemption is not disabled seems a more compelling
> > reason to keep preempt_enable() than any performance improvement. ;-)
>
> Yes, but this_cpu_add() should work.

Indeed!  But this_cpu_add() just does the preempt_enable() under the
covers, so not much difference from a latency viewpoint.

> > > Careful.  The real-time guys might take the same every-little-bit approach
> > > to latency that you seem to be taking for CPU cycles.  ;-)
>
> Understand...
>
> So I simply do not know. Please tell me if you think it would be
> better to use rcu_read_lock/synchronize_rcu or rcu_read_lock_sched,
> and I'll send the patch.

I doubt if it makes a measurable difference for either throughput or
latency.  One could argue that rcu_read_lock() would be better for
readability, but making sure that the preempt_disable() is clearly
commented as starting an RCU-sched read-side critical section would be
just as good.  So I am OK with the current preempt_disable() approach.

							Thanx, Paul
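Concretely, the fast path with the comment Paul asks for might look like this; the body is update_fast_ctr() from the patch quoted later in the thread, with only the comment added:

	static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int val)
	{
		bool success = false;

		/*
		 * Acts as rcu_read_lock_sched(): this region is what the
		 * writer's synchronize_sched() waits for.  Disabling
		 * preemption also makes __this_cpu_add() safe, since it
		 * is not preemption-safe on all architectures.
		 */
		preempt_disable();
		if (likely(!mutex_is_locked(&brw->writer_mutex))) {
			__this_cpu_add(*brw->fast_read_ctr, val);
			success = true;
		}
		preempt_enable();

		return success;
	}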
Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
On 11/08, Paul E. McKenney wrote:
>
> On Thu, Nov 08, 2012 at 04:41:36PM -0800, Paul E. McKenney wrote:
> > On Thu, Nov 08, 2012 at 06:41:10PM -0500, Mikulas Patocka wrote:
> > >
> > > On Thu, 8 Nov 2012, Paul E. McKenney wrote:
> > >
> > > > On Thu, Nov 08, 2012 at 12:07:00PM -0800, Andrew Morton wrote:
> > > > > On Thu, 8 Nov 2012 14:48:49 +0100
> > > > > Oleg Nesterov wrote:
> > > > >
> > > >
> > > > The algorithm would work given rcu_read_lock()/rcu_read_unlock() and
> > > > synchronize_rcu() in place of preempt_disable()/preempt_enable() and
> > > > synchronize_sched().  The real-time guys would prefer the change
> > > > to rcu_read_lock()/rcu_read_unlock() and synchronize_rcu(), now that
> > > > you mention it.
> > > >
> > > > Oleg, Mikulas, any reason not to move to rcu_read_lock()/rcu_read_unlock()
> > > > and synchronize_rcu()?
> > >
> > > preempt_disable/preempt_enable is faster than
> > > rcu_read_lock/rcu_read_unlock for preemptive kernels.

Yes, I chose preempt_disable() because it is the fastest/simplest
primitive and the critical section is really tiny.

But:

> > Significantly faster in this case?  Can you measure the difference
> > from a user-mode test?

I do not think rcu_read_lock() or rcu_read_lock_sched() can actually
make a measurable difference.

> Actually, the fact that __this_cpu_add() will malfunction on some
> architectures if preemption is not disabled seems a more compelling
> reason to keep preempt_enable() than any performance improvement. ;-)

Yes, but this_cpu_add() should work.

> > Careful.  The real-time guys might take the same every-little-bit approach
> > to latency that you seem to be taking for CPU cycles.  ;-)

Understand...

So I simply do not know. Please tell me if you think it would be
better to use rcu_read_lock/synchronize_rcu or rcu_read_lock_sched,
and I'll send the patch.

Oleg.
Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
On 11/08, Andrew Morton wrote:
>
> On Thu, 8 Nov 2012 14:48:49 +0100
> Oleg Nesterov wrote:
>
> >  include/linux/percpu-rwsem.h |   83 +
> >  lib/Makefile                 |    2 +-
> >  lib/percpu-rwsem.c           |  123 ++
>
> The patch also uninlines everything.
>
> And it didn't export the resulting symbols to modules, so it isn't an
> equivalent. We can export things later if needed I guess.

Yes, currently it is only used by block_dev.c

> It adds percpu-rwsem.o to lib-y, so the CONFIG_BLOCK=n kernel will
> avoid including the code altogether, methinks?

I am going to add another user (uprobes), this was my motivation for
this patch. And perhaps it will have more users.

But I agree, CONFIG_PERCPU_RWSEM makes sense at least now, I'll send
the patch.

> > +#include <linux/percpu-rwsem.h>
> > +#include <linux/rcupdate.h>
> > +#include <linux/sched.h>
>
> This list is nowhere near sufficient to support this file's
> requirements.  atomic.h, percpu.h, rwsem.h, wait.h, errno.h and plenty
> more.  IOW, if it compiles, it was sheer luck.

OK, thanks, I'll send
percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily.fix

> > +/*
> > + * A writer takes ->writer_mutex to exclude other writers and to force the
> > + * readers to switch to the slow mode, note the mutex_is_locked() check in
> > + * update_fast_ctr().
> > + *
> > + * After that the readers can only inc/dec the slow ->slow_read_ctr counter,
> > + * ->fast_read_ctr is stable. Once the writer moves its sum into the slow
> > + * counter it represents the number of active readers.
> > + *
> > + * Finally the writer takes ->rw_sem for writing and blocks the new readers,
> > + * then waits until the slow counter becomes zero.
> > + */
>
> Some overview of how fast/slow_read_ctr are supposed to work would be
> useful. This comment seems to assume that the reader already knew that.

I hate to say this, but I'll try to update this comment too ;)

> > +void percpu_down_write(struct percpu_rw_semaphore *brw)
> > +{
> > +	/* also blocks update_fast_ctr() which checks mutex_is_locked() */
> > +	mutex_lock(&brw->writer_mutex);
> > +
> > +	/*
> > +	 * 1. Ensures mutex_is_locked() is visible to any down_read/up_read
> > +	 *    so that update_fast_ctr() can't succeed.
> > +	 *
> > +	 * 2. Ensures we see the result of every previous this_cpu_add() in
> > +	 *    update_fast_ctr().
> > +	 *
> > +	 * 3. Ensures that if any reader has exited its critical section via
> > +	 *    fast-path, it executes a full memory barrier before we return.
> > +	 */
> > +	synchronize_sched();
>
> Here's where I get horridly confused.  Your patch completely deRCUifies
> this code, yes?  Yet here we're using an RCU primitive.  And we seem to
> be using it not as an RCU primitive but as a handy thing which happens
> to have desirable side-effects.  But the implementation of
> synchronize_sched() differs considerably according to which rcu
> flavor-of-the-minute you're using.

It is documented that synchronize_sched() should play well with
preempt_disable/enable. From the comment:

	Note that preempt_disable(),
	local_irq_disable(), and so on may be used in place of
	rcu_read_lock_sched().

But I guess this needs more discussion, I see other emails in this
thread...

> And part 3 talks about the reader's critical section.  The only
> critical sections I can see on the reader side are already covered by
> mutex_lock() and preempt_disable().

Yes, but we need to ensure that if we take the lock for writing, we
should see all memory modifications done under down_read/up_read().

IOW. Suppose that the reader does

	percpu_down_read();
	STORE;
	percpu_up_read(); // no barriers in the fast path

The writer should see the result of that STORE under percpu_down_write().

Part 3 tries to say that at this point we should already see the result,
so we should not worry about acquire/release semantics.

> If this code isn't as brain damaged as it
> initially appears then please,

I hope ;)

> go easy on us simpletons in the next
> version?

Well, I'll try to update the comments... but the code is simple, I do
not think I can simplify it more. The nontrivial part is the barriers,
but this is always nontrivial.

Contrary, I am going to try to add some complications later, so that
it can have more users. In particular, I think it can replace
get_online_cpus/cpu_hotplug_begin, just we need
percpu_down_write_but_dont_deadlock_with_recursive_readers().

Oleg.
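A hedged illustration of the guarantee being claimed, with made-up reader()/writer() functions and a shared_data variable standing in for real users:

	static int shared_data;

	static void reader(struct percpu_rw_semaphore *brw)
	{
		percpu_down_read(brw);	/* fast path: preempt_disable() + per-cpu inc */
		shared_data = 42;	/* the STORE in question */
		percpu_up_read(brw);	/* fast path: no memory barrier here */
	}

	static void writer(struct percpu_rw_semaphore *brw)
	{
		percpu_down_write(brw);
		/*
		 * synchronize_sched() inside percpu_down_write() waited for
		 * every preempt-disabled region, i.e. for every fast-path
		 * read-side section.  So if reader() has completed, its
		 * STORE is visible here ("part 3" above), even though the
		 * reader's fast path issued no acquire/release barriers.
		 */
		WARN_ON(shared_data != 42);	/* holds if reader() completed first */
		percpu_up_write(brw);
	}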
Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
On Thu, 8 Nov 2012, Andrew Morton wrote:

> On Thu, 8 Nov 2012 14:48:49 +0100
> Oleg Nesterov wrote:
>
> > Currently the writer does msleep() plus synchronize_sched() 3 times
> > to acquire/release the semaphore, and during this time the readers
> > are blocked completely. Even if the "write" section was not actually
> > started or if it was already finished.
> >
> > With this patch down_write/up_write does synchronize_sched() twice
> > and down_read/up_read are still possible during this time, just they
> > use the slow path.
> >
> > percpu_down_write() first forces the readers to use rw_semaphore and
> > increment the "slow" counter to take the lock for reading, then it
> > takes that rw_semaphore for writing and blocks the readers.
> >
> > Also. With this patch the code relies on the documented behaviour of
> > synchronize_sched(), it doesn't try to pair synchronize_sched() with
> > barrier.
> >
> > ...
> >
> >  include/linux/percpu-rwsem.h |   83 +
> >  lib/Makefile                 |    2 +-
> >  lib/percpu-rwsem.c           |  123 ++
>
> The patch also uninlines everything.
>
> And it didn't export the resulting symbols to modules, so it isn't an
> equivalent. We can export things later if needed I guess.
>
> It adds percpu-rwsem.o to lib-y, so the CONFIG_BLOCK=n kernel will
> avoid including the code altogether, methinks?

If you want to use percpu-rwsem only for block devices then you can
drop Oleg's patch entirely. Oleg's optimizations are useless for the
block device use case (the contention between readers and writers is
very rare, and it doesn't matter if readers are blocked in case of
contention).

I suppose that Oleg made the optimizations because he wants to use
percpu-rwsem for something else - if not, you can drop the patch and
revert to the previous version, which is simpler.

Mikulas
Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
On Thu, Nov 08, 2012 at 04:41:36PM -0800, Paul E. McKenney wrote:
> On Thu, Nov 08, 2012 at 06:41:10PM -0500, Mikulas Patocka wrote:
> >
> > On Thu, 8 Nov 2012, Paul E. McKenney wrote:
> >
> > > On Thu, Nov 08, 2012 at 12:07:00PM -0800, Andrew Morton wrote:
> > > > On Thu, 8 Nov 2012 14:48:49 +0100
> > > > Oleg Nesterov wrote:
> > > >
> > > > > Currently the writer does msleep() plus synchronize_sched() 3 times
> > > > > to acquire/release the semaphore, and during this time the readers
> > > > > are blocked completely. Even if the "write" section was not actually
> > > > > started or if it was already finished.
> > > > >
> > > > > With this patch down_write/up_write does synchronize_sched() twice
> > > > > and down_read/up_read are still possible during this time, just they
> > > > > use the slow path.
> > > > >
> > > > > percpu_down_write() first forces the readers to use rw_semaphore and
> > > > > increment the "slow" counter to take the lock for reading, then it
> > > > > takes that rw_semaphore for writing and blocks the readers.
> > > > >
> > > > > Also. With this patch the code relies on the documented behaviour of
> > > > > synchronize_sched(), it doesn't try to pair synchronize_sched() with
> > > > > barrier.
> > > > >
> > > > > ...
> > > > >
> > > > >  include/linux/percpu-rwsem.h |   83 +
> > > > >  lib/Makefile                 |    2 +-
> > > > >  lib/percpu-rwsem.c           |  123 ++
> > > >
> > > > The patch also uninlines everything.
> > > >
> > > > And it didn't export the resulting symbols to modules, so it isn't an
> > > > equivalent. We can export things later if needed I guess.
> > > >
> > > > It adds percpu-rwsem.o to lib-y, so the CONFIG_BLOCK=n kernel will
> > > > avoid including the code altogether, methinks?
> > > >
> > > > > ...
> > > > >
> > > > > --- /dev/null
> > > > > +++ b/lib/percpu-rwsem.c
> > > > > @@ -0,0 +1,123 @@
> > > >
> > > > That was nice and terse ;)
> > > >
> > > > > +#include <linux/percpu-rwsem.h>
> > > > > +#include <linux/rcupdate.h>
> > > > > +#include <linux/sched.h>
> > > >
> > > > This list is nowhere near sufficient to support this file's
> > > > requirements.  atomic.h, percpu.h, rwsem.h, wait.h, errno.h and plenty
> > > > more.  IOW, if it compiles, it was sheer luck.
> > > >
> > > > > +int percpu_init_rwsem(struct percpu_rw_semaphore *brw)
> > > > > +{
> > > > > +	brw->fast_read_ctr = alloc_percpu(int);
> > > > > +	if (unlikely(!brw->fast_read_ctr))
> > > > > +		return -ENOMEM;
> > > > > +
> > > > > +	mutex_init(&brw->writer_mutex);
> > > > > +	init_rwsem(&brw->rw_sem);
> > > > > +	atomic_set(&brw->slow_read_ctr, 0);
> > > > > +	init_waitqueue_head(&brw->write_waitq);
> > > > > +	return 0;
> > > > > +}
> > > > > +
> > > > > +void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
> > > > > +{
> > > > > +	free_percpu(brw->fast_read_ctr);
> > > > > +	brw->fast_read_ctr = NULL; /* catch use after free bugs */
> > > > > +}
> > > > > +
> > > > > +static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int val)
> > > > > +{
> > > > > +	bool success = false;
> > > > > +
> > > > > +	preempt_disable();
> > > > > +	if (likely(!mutex_is_locked(&brw->writer_mutex))) {
> > > > > +		__this_cpu_add(*brw->fast_read_ctr, val);
> > > > > +		success = true;
> > > > > +	}
> > > > > +	preempt_enable();
> > > > > +
> > > > > +	return success;
> > > > > +}
> > > > > +
> > > > > +/*
> > > > > + * Like the normal down_read() this is not recursive, the writer can
> > > > > + * come after the first percpu_down_read() and create the deadlock.
> > > > > + */
> > > > > +void percpu_down_read(struct percpu_rw_semaphore *brw)
> > > > > +{
> > > > > +	if (likely(update_fast_ctr(brw, +1)))
> > > > > +		return;
> > > > > +
> > > > > +	down_read(&brw->rw_sem);
> > > > > +	atomic_inc(&brw->slow_read_ctr);
> > > > > +	up_read(&brw->rw_sem);
> > > > > +}
> > > > > +
> > > > > +void percpu_up_read(struct percpu_rw_semaphore *brw)
> > > > > +{
> > > > > +	if (likely(update_fast_ctr(brw, -1)))
> > > > > +		return;
> > > > > +
> > > > > +	/* false-positive is possible but harmless */
> > > > > +	if (atomic_dec_and_test(&brw->slow_read_ctr))
> > > > > +		wake_up_all(&brw->write_waitq);
> > > > > +}
> > > > > +
> > > > > +static int clear_fast_ctr(struct percpu_rw_semaphore *brw)
> > > > > +{
> > > > > +	unsigned int sum = 0;
> > > > > +	int cpu;
> > > > > +
> > > > > +	for_each_possible_cpu(cpu) {
> > > > > +		sum += per_cpu(*brw->fast_read_ctr, cpu);
> > > > > +		per_cpu(*brw->fast_read_ctr, cpu) = 0;
> > > > > +	}
> > > > > +
> > > > > +	return sum;
> > > > > +}
> > > > > +
> > > > > +/*
> > > > > + * A writer takes ->writer_mutex to exclude other writers and to force the
> > > > > + * readers to switch to the slow mode, note the mutex_is_locked()
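For orientation, a hypothetical caller of the API quoted above would look like this (all my_* names are invented for the example):

	static struct percpu_rw_semaphore my_brw;

	static int __init my_init(void)
	{
		return percpu_init_rwsem(&my_brw);	/* -ENOMEM on failure */
	}

	static void my_reader(void)
	{
		percpu_down_read(&my_brw);
		/* read-side section: cheap fast path, may block, not recursive */
		percpu_up_read(&my_brw);
	}

	static void my_writer(void)
	{
		percpu_down_write(&my_brw);
		/* exclusive section: heavy, the down_write/up_write pair
		 * does synchronize_sched() twice, so writers must be rare */
		percpu_up_write(&my_brw);
	}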
Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
On Thu, Nov 08, 2012 at 06:41:10PM -0500, Mikulas Patocka wrote: > > > On Thu, 8 Nov 2012, Paul E. McKenney wrote: > > > On Thu, Nov 08, 2012 at 12:07:00PM -0800, Andrew Morton wrote: > > > On Thu, 8 Nov 2012 14:48:49 +0100 > > > Oleg Nesterov wrote: > > > > > > > Currently the writer does msleep() plus synchronize_sched() 3 times > > > > to acquire/release the semaphore, and during this time the readers > > > > are blocked completely. Even if the "write" section was not actually > > > > started or if it was already finished. > > > > > > > > With this patch down_write/up_write does synchronize_sched() twice > > > > and down_read/up_read are still possible during this time, just they > > > > use the slow path. > > > > > > > > percpu_down_write() first forces the readers to use rw_semaphore and > > > > increment the "slow" counter to take the lock for reading, then it > > > > takes that rw_semaphore for writing and blocks the readers. > > > > > > > > Also. With this patch the code relies on the documented behaviour of > > > > synchronize_sched(), it doesn't try to pair synchronize_sched() with > > > > barrier. > > > > > > > > ... > > > > > > > > include/linux/percpu-rwsem.h | 83 + > > > > lib/Makefile |2 +- > > > > lib/percpu-rwsem.c | 123 > > > > ++ > > > > > > The patch also uninlines everything. > > > > > > And it didn't export the resulting symbols to modules, so it isn't an > > > equivalent. We can export thing later if needed I guess. > > > > > > It adds percpu-rwsem.o to lib-y, so the CONFIG_BLOCK=n kernel will > > > avoid including the code altogether, methinks? > > > > > > > > > > > ... > > > > > > > > --- /dev/null > > > > +++ b/lib/percpu-rwsem.c > > > > @@ -0,0 +1,123 @@ > > > > > > That was nice and terse ;) > > > > > > > +#include > > > > +#include > > > > +#include > > > > > > This list is nowhere near sufficient to support this file's > > > requirements. atomic.h, percpu.h, rwsem.h, wait.h, errno.h and plenty > > > more. IOW, if it compiles, it was sheer luck. > > > > > > > +int percpu_init_rwsem(struct percpu_rw_semaphore *brw) > > > > +{ > > > > + brw->fast_read_ctr = alloc_percpu(int); > > > > + if (unlikely(!brw->fast_read_ctr)) > > > > + return -ENOMEM; > > > > + > > > > + mutex_init(>writer_mutex); > > > > + init_rwsem(>rw_sem); > > > > + atomic_set(>slow_read_ctr, 0); > > > > + init_waitqueue_head(>write_waitq); > > > > + return 0; > > > > +} > > > > + > > > > +void percpu_free_rwsem(struct percpu_rw_semaphore *brw) > > > > +{ > > > > + free_percpu(brw->fast_read_ctr); > > > > + brw->fast_read_ctr = NULL; /* catch use after free bugs */ > > > > +} > > > > + > > > > +static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned > > > > int val) > > > > +{ > > > > + bool success = false; > > > > + > > > > + preempt_disable(); > > > > + if (likely(!mutex_is_locked(>writer_mutex))) { > > > > + __this_cpu_add(*brw->fast_read_ctr, val); > > > > + success = true; > > > > + } > > > > + preempt_enable(); > > > > + > > > > + return success; > > > > +} > > > > + > > > > +/* > > > > + * Like the normal down_read() this is not recursive, the writer can > > > > + * come after the first percpu_down_read() and create the deadlock. 
> > > > + */ > > > > +void percpu_down_read(struct percpu_rw_semaphore *brw) > > > > +{ > > > > + if (likely(update_fast_ctr(brw, +1))) > > > > + return; > > > > + > > > > + down_read(>rw_sem); > > > > + atomic_inc(>slow_read_ctr); > > > > + up_read(>rw_sem); > > > > +} > > > > + > > > > +void percpu_up_read(struct percpu_rw_semaphore *brw) > > > > +{ > > > > + if (likely(update_fast_ctr(brw, -1))) > > > > + return; > > > > + > > > > + /* false-positive is possible but harmless */ > > > > + if (atomic_dec_and_test(>slow_read_ctr)) > > > > + wake_up_all(>write_waitq); > > > > +} > > > > + > > > > +static int clear_fast_ctr(struct percpu_rw_semaphore *brw) > > > > +{ > > > > + unsigned int sum = 0; > > > > + int cpu; > > > > + > > > > + for_each_possible_cpu(cpu) { > > > > + sum += per_cpu(*brw->fast_read_ctr, cpu); > > > > + per_cpu(*brw->fast_read_ctr, cpu) = 0; > > > > + } > > > > + > > > > + return sum; > > > > +} > > > > + > > > > +/* > > > > + * A writer takes ->writer_mutex to exclude other writers and to force > > > > the > > > > + * readers to switch to the slow mode, note the mutex_is_locked() > > > > check in > > > > + * update_fast_ctr(). > > > > + * > > > > + * After that the readers can only inc/dec the slow ->slow_read_ctr > > > > counter, > > > > + * ->fast_read_ctr is stable. Once the writer moves its sum into the > > > > slow > > > > + * counter it
Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
On Thu, 8 Nov 2012, Paul E. McKenney wrote: > On Thu, Nov 08, 2012 at 12:07:00PM -0800, Andrew Morton wrote: > > On Thu, 8 Nov 2012 14:48:49 +0100 > > Oleg Nesterov wrote: > > > > > Currently the writer does msleep() plus synchronize_sched() 3 times > > > to acquire/release the semaphore, and during this time the readers > > > are blocked completely. Even if the "write" section was not actually > > > started or if it was already finished. > > > > > > With this patch down_write/up_write does synchronize_sched() twice > > > and down_read/up_read are still possible during this time, just they > > > use the slow path. > > > > > > percpu_down_write() first forces the readers to use rw_semaphore and > > > increment the "slow" counter to take the lock for reading, then it > > > takes that rw_semaphore for writing and blocks the readers. > > > > > > Also. With this patch the code relies on the documented behaviour of > > > synchronize_sched(), it doesn't try to pair synchronize_sched() with > > > barrier. > > > > > > ... > > > > > > include/linux/percpu-rwsem.h | 83 + > > > lib/Makefile |2 +- > > > lib/percpu-rwsem.c | 123 > > > ++ > > > > The patch also uninlines everything. > > > > And it didn't export the resulting symbols to modules, so it isn't an > > equivalent. We can export thing later if needed I guess. > > > > It adds percpu-rwsem.o to lib-y, so the CONFIG_BLOCK=n kernel will > > avoid including the code altogether, methinks? > > > > > > > > ... > > > > > > --- /dev/null > > > +++ b/lib/percpu-rwsem.c > > > @@ -0,0 +1,123 @@ > > > > That was nice and terse ;) > > > > > +#include > > > +#include > > > +#include > > > > This list is nowhere near sufficient to support this file's > > requirements. atomic.h, percpu.h, rwsem.h, wait.h, errno.h and plenty > > more. IOW, if it compiles, it was sheer luck. > > > > > +int percpu_init_rwsem(struct percpu_rw_semaphore *brw) > > > +{ > > > + brw->fast_read_ctr = alloc_percpu(int); > > > + if (unlikely(!brw->fast_read_ctr)) > > > + return -ENOMEM; > > > + > > > + mutex_init(>writer_mutex); > > > + init_rwsem(>rw_sem); > > > + atomic_set(>slow_read_ctr, 0); > > > + init_waitqueue_head(>write_waitq); > > > + return 0; > > > +} > > > + > > > +void percpu_free_rwsem(struct percpu_rw_semaphore *brw) > > > +{ > > > + free_percpu(brw->fast_read_ctr); > > > + brw->fast_read_ctr = NULL; /* catch use after free bugs */ > > > +} > > > + > > > +static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned > > > int val) > > > +{ > > > + bool success = false; > > > + > > > + preempt_disable(); > > > + if (likely(!mutex_is_locked(>writer_mutex))) { > > > + __this_cpu_add(*brw->fast_read_ctr, val); > > > + success = true; > > > + } > > > + preempt_enable(); > > > + > > > + return success; > > > +} > > > + > > > +/* > > > + * Like the normal down_read() this is not recursive, the writer can > > > + * come after the first percpu_down_read() and create the deadlock. 
> > > + */ > > > +void percpu_down_read(struct percpu_rw_semaphore *brw) > > > +{ > > > + if (likely(update_fast_ctr(brw, +1))) > > > + return; > > > + > > > + down_read(>rw_sem); > > > + atomic_inc(>slow_read_ctr); > > > + up_read(>rw_sem); > > > +} > > > + > > > +void percpu_up_read(struct percpu_rw_semaphore *brw) > > > +{ > > > + if (likely(update_fast_ctr(brw, -1))) > > > + return; > > > + > > > + /* false-positive is possible but harmless */ > > > + if (atomic_dec_and_test(>slow_read_ctr)) > > > + wake_up_all(>write_waitq); > > > +} > > > + > > > +static int clear_fast_ctr(struct percpu_rw_semaphore *brw) > > > +{ > > > + unsigned int sum = 0; > > > + int cpu; > > > + > > > + for_each_possible_cpu(cpu) { > > > + sum += per_cpu(*brw->fast_read_ctr, cpu); > > > + per_cpu(*brw->fast_read_ctr, cpu) = 0; > > > + } > > > + > > > + return sum; > > > +} > > > + > > > +/* > > > + * A writer takes ->writer_mutex to exclude other writers and to force > > > the > > > + * readers to switch to the slow mode, note the mutex_is_locked() check > > > in > > > + * update_fast_ctr(). > > > + * > > > + * After that the readers can only inc/dec the slow ->slow_read_ctr > > > counter, > > > + * ->fast_read_ctr is stable. Once the writer moves its sum into the slow > > > + * counter it represents the number of active readers. > > > + * > > > + * Finally the writer takes ->rw_sem for writing and blocks the new > > > readers, > > > + * then waits until the slow counter becomes zero. > > > + */ > > > > Some overview of how fast/slow_read_ctr are supposed to work would be > > useful. This comment seems to assume that the reader already knew > > that. > > > > > +void percpu_down_write(struct percpu_rw_semaphore *brw) > > > +{ > > > + /* also blocks update_fast_ctr() which checks mutex_is_locked() */ > > > + mutex_lock(>writer_mutex); > > >
Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
On Thu, Nov 08, 2012 at 12:07:00PM -0800, Andrew Morton wrote: > On Thu, 8 Nov 2012 14:48:49 +0100 > Oleg Nesterov wrote: > > > Currently the writer does msleep() plus synchronize_sched() 3 times > > to acquire/release the semaphore, and during this time the readers > > are blocked completely. Even if the "write" section was not actually > > started or if it was already finished. > > > > With this patch down_write/up_write does synchronize_sched() twice > > and down_read/up_read are still possible during this time, just they > > use the slow path. > > > > percpu_down_write() first forces the readers to use rw_semaphore and > > increment the "slow" counter to take the lock for reading, then it > > takes that rw_semaphore for writing and blocks the readers. > > > > Also. With this patch the code relies on the documented behaviour of > > synchronize_sched(), it doesn't try to pair synchronize_sched() with > > barrier. > > > > ... > > > > include/linux/percpu-rwsem.h | 83 + > > lib/Makefile |2 +- > > lib/percpu-rwsem.c | 123 > > ++ > > The patch also uninlines everything. > > And it didn't export the resulting symbols to modules, so it isn't an > equivalent. We can export thing later if needed I guess. > > It adds percpu-rwsem.o to lib-y, so the CONFIG_BLOCK=n kernel will > avoid including the code altogether, methinks? > > > > > ... > > > > --- /dev/null > > +++ b/lib/percpu-rwsem.c > > @@ -0,0 +1,123 @@ > > That was nice and terse ;) > > > +#include > > +#include > > +#include > > This list is nowhere near sufficient to support this file's > requirements. atomic.h, percpu.h, rwsem.h, wait.h, errno.h and plenty > more. IOW, if it compiles, it was sheer luck. > > > +int percpu_init_rwsem(struct percpu_rw_semaphore *brw) > > +{ > > + brw->fast_read_ctr = alloc_percpu(int); > > + if (unlikely(!brw->fast_read_ctr)) > > + return -ENOMEM; > > + > > + mutex_init(>writer_mutex); > > + init_rwsem(>rw_sem); > > + atomic_set(>slow_read_ctr, 0); > > + init_waitqueue_head(>write_waitq); > > + return 0; > > +} > > + > > +void percpu_free_rwsem(struct percpu_rw_semaphore *brw) > > +{ > > + free_percpu(brw->fast_read_ctr); > > + brw->fast_read_ctr = NULL; /* catch use after free bugs */ > > +} > > + > > +static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int > > val) > > +{ > > + bool success = false; > > + > > + preempt_disable(); > > + if (likely(!mutex_is_locked(>writer_mutex))) { > > + __this_cpu_add(*brw->fast_read_ctr, val); > > + success = true; > > + } > > + preempt_enable(); > > + > > + return success; > > +} > > + > > +/* > > + * Like the normal down_read() this is not recursive, the writer can > > + * come after the first percpu_down_read() and create the deadlock. 
> > + */ > > +void percpu_down_read(struct percpu_rw_semaphore *brw) > > +{ > > + if (likely(update_fast_ctr(brw, +1))) > > + return; > > + > > + down_read(>rw_sem); > > + atomic_inc(>slow_read_ctr); > > + up_read(>rw_sem); > > +} > > + > > +void percpu_up_read(struct percpu_rw_semaphore *brw) > > +{ > > + if (likely(update_fast_ctr(brw, -1))) > > + return; > > + > > + /* false-positive is possible but harmless */ > > + if (atomic_dec_and_test(>slow_read_ctr)) > > + wake_up_all(>write_waitq); > > +} > > + > > +static int clear_fast_ctr(struct percpu_rw_semaphore *brw) > > +{ > > + unsigned int sum = 0; > > + int cpu; > > + > > + for_each_possible_cpu(cpu) { > > + sum += per_cpu(*brw->fast_read_ctr, cpu); > > + per_cpu(*brw->fast_read_ctr, cpu) = 0; > > + } > > + > > + return sum; > > +} > > + > > +/* > > + * A writer takes ->writer_mutex to exclude other writers and to force the > > + * readers to switch to the slow mode, note the mutex_is_locked() check in > > + * update_fast_ctr(). > > + * > > + * After that the readers can only inc/dec the slow ->slow_read_ctr > > counter, > > + * ->fast_read_ctr is stable. Once the writer moves its sum into the slow > > + * counter it represents the number of active readers. > > + * > > + * Finally the writer takes ->rw_sem for writing and blocks the new > > readers, > > + * then waits until the slow counter becomes zero. > > + */ > > Some overview of how fast/slow_read_ctr are supposed to work would be > useful. This comment seems to assume that the reader already knew > that. > > > +void percpu_down_write(struct percpu_rw_semaphore *brw) > > +{ > > + /* also blocks update_fast_ctr() which checks mutex_is_locked() */ > > + mutex_lock(>writer_mutex); > > + > > + /* > > +* 1. Ensures mutex_is_locked() is visible to any down_read/up_read > > +*so that update_fast_ctr() can't succeed. > > +* > > +* 2. Ensures we see the result of every previous this_cpu_add() in > > +*update_fast_ctr(). > > +* > > +
Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
On Thu, 8 Nov 2012 14:48:49 +0100 Oleg Nesterov wrote: > Currently the writer does msleep() plus synchronize_sched() 3 times > to acquire/release the semaphore, and during this time the readers > are blocked completely. Even if the "write" section was not actually > started or if it was already finished. > > With this patch down_write/up_write does synchronize_sched() twice > and down_read/up_read are still possible during this time, just they > use the slow path. > > percpu_down_write() first forces the readers to use rw_semaphore and > increment the "slow" counter to take the lock for reading, then it > takes that rw_semaphore for writing and blocks the readers. > > Also. With this patch the code relies on the documented behaviour of > synchronize_sched(), it doesn't try to pair synchronize_sched() with > barrier. > > ... > > include/linux/percpu-rwsem.h | 83 + > lib/Makefile |2 +- > lib/percpu-rwsem.c | 123 > ++ The patch also uninlines everything. And it didn't export the resulting symbols to modules, so it isn't an equivalent. We can export thing later if needed I guess. It adds percpu-rwsem.o to lib-y, so the CONFIG_BLOCK=n kernel will avoid including the code altogether, methinks? > > ... > > --- /dev/null > +++ b/lib/percpu-rwsem.c > @@ -0,0 +1,123 @@ That was nice and terse ;) > +#include > +#include > +#include This list is nowhere near sufficient to support this file's requirements. atomic.h, percpu.h, rwsem.h, wait.h, errno.h and plenty more. IOW, if it compiles, it was sheer luck. > +int percpu_init_rwsem(struct percpu_rw_semaphore *brw) > +{ > + brw->fast_read_ctr = alloc_percpu(int); > + if (unlikely(!brw->fast_read_ctr)) > + return -ENOMEM; > + > + mutex_init(>writer_mutex); > + init_rwsem(>rw_sem); > + atomic_set(>slow_read_ctr, 0); > + init_waitqueue_head(>write_waitq); > + return 0; > +} > + > +void percpu_free_rwsem(struct percpu_rw_semaphore *brw) > +{ > + free_percpu(brw->fast_read_ctr); > + brw->fast_read_ctr = NULL; /* catch use after free bugs */ > +} > + > +static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int > val) > +{ > + bool success = false; > + > + preempt_disable(); > + if (likely(!mutex_is_locked(>writer_mutex))) { > + __this_cpu_add(*brw->fast_read_ctr, val); > + success = true; > + } > + preempt_enable(); > + > + return success; > +} > + > +/* > + * Like the normal down_read() this is not recursive, the writer can > + * come after the first percpu_down_read() and create the deadlock. > + */ > +void percpu_down_read(struct percpu_rw_semaphore *brw) > +{ > + if (likely(update_fast_ctr(brw, +1))) > + return; > + > + down_read(>rw_sem); > + atomic_inc(>slow_read_ctr); > + up_read(>rw_sem); > +} > + > +void percpu_up_read(struct percpu_rw_semaphore *brw) > +{ > + if (likely(update_fast_ctr(brw, -1))) > + return; > + > + /* false-positive is possible but harmless */ > + if (atomic_dec_and_test(>slow_read_ctr)) > + wake_up_all(>write_waitq); > +} > + > +static int clear_fast_ctr(struct percpu_rw_semaphore *brw) > +{ > + unsigned int sum = 0; > + int cpu; > + > + for_each_possible_cpu(cpu) { > + sum += per_cpu(*brw->fast_read_ctr, cpu); > + per_cpu(*brw->fast_read_ctr, cpu) = 0; > + } > + > + return sum; > +} > + > +/* > + * A writer takes ->writer_mutex to exclude other writers and to force the > + * readers to switch to the slow mode, note the mutex_is_locked() check in > + * update_fast_ctr(). > + * > + * After that the readers can only inc/dec the slow ->slow_read_ctr counter, > + * ->fast_read_ctr is stable. 
Once the writer moves its sum into the slow > + * counter it represents the number of active readers. > + * > + * Finally the writer takes ->rw_sem for writing and blocks the new readers, > + * then waits until the slow counter becomes zero. > + */ Some overview of how fast/slow_read_ctr are supposed to work would be useful. This comment seems to assume that the reader already knew that. > +void percpu_down_write(struct percpu_rw_semaphore *brw) > +{ > + /* also blocks update_fast_ctr() which checks mutex_is_locked() */ > + mutex_lock(>writer_mutex); > + > + /* > + * 1. Ensures mutex_is_locked() is visible to any down_read/up_read > + *so that update_fast_ctr() can't succeed. > + * > + * 2. Ensures we see the result of every previous this_cpu_add() in > + *update_fast_ctr(). > + * > + * 3. Ensures that if any reader has exited its critical section via > + *fast-path, it executes a full memory barrier before we return. > + */ > + synchronize_sched(); Here's where I get horridly confused. Your patch completely deRCUifies this code, yes? Yet here
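For what it's worth, the accounting being asked about above can be shown in a
few lines. The following is a minimal single-threaded model, not the kernel
code: NCPU, fast[], slow and writer_mutex_locked are invented stand-ins for
brw->fast_read_ctr, brw->slow_read_ctr and mutex_is_locked(&brw->writer_mutex),
and the preempt_disable()/synchronize_sched() machinery is deliberately left
out. It only demonstrates the invariant the quoted comment relies on: once the
writer folds every per-CPU delta into the slow counter, that counter equals
the number of active readers.

/* Single-threaded model of the fast/slow reader accounting.
 * All names are illustrative stand-ins, not kernel API.
 */
#include <assert.h>
#include <stdio.h>

#define NCPU 4

static int fast[NCPU];          /* models the __percpu fast_read_ctr */
static int slow;                /* models the atomic_t slow_read_ctr */
static int writer_mutex_locked; /* models mutex_is_locked(&...->writer_mutex) */

/* Reader entry: fast path if no writer is around, slow path otherwise. */
static void down_read_model(int cpu)
{
	if (!writer_mutex_locked)
		fast[cpu]++;    /* __this_cpu_add(*fast_read_ctr, +1) */
	else
		slow++;         /* atomic_inc(&slow_read_ctr) under rw_sem */
}

static void up_read_model(int cpu)
{
	if (!writer_mutex_locked)
		fast[cpu]--;
	else
		slow--;
}

/* Writer side: after taking writer_mutex and waiting out a grace period,
 * fold every per-CPU delta into the single slow counter. */
static int clear_fast_ctr_model(void)
{
	int sum = 0;

	for (int cpu = 0; cpu < NCPU; cpu++) {
		sum += fast[cpu];
		fast[cpu] = 0;
	}
	return sum;
}

int main(void)
{
	down_read_model(0);
	down_read_model(1);     /* two readers entered via the fast path */

	writer_mutex_locked = 1;        /* a writer shows up */
	down_read_model(2);     /* third reader is forced onto the slow path */

	slow += clear_fast_ctr_model(); /* writer moves the fast sums over */
	assert(slow == 3);      /* slow counter == number of active readers */

	up_read_model(0);       /* writer_mutex still held: slow-path exits */
	up_read_model(1);
	up_read_model(2);
	assert(slow == 0);      /* the writer may proceed once this hits zero */

	printf("slow counter tracked all active readers\n");
	return 0;
}

In the real code the writer_mutex_locked test and the counter update are made
effectively atomic by preempt_disable() on the reader side plus the writer's
synchronize_sched(); the model sidesteps that by being single-threaded.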
[PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily
Currently the writer does msleep() plus synchronize_sched() 3 times
to acquire/release the semaphore, and during this time the readers
are blocked completely. Even if the "write" section was not actually
started or if it was already finished.

With this patch down_write/up_write does synchronize_sched() twice
and down_read/up_read are still possible during this time, just they
use the slow path.

percpu_down_write() first forces the readers to use rw_semaphore and
increment the "slow" counter to take the lock for reading, then it
takes that rw_semaphore for writing and blocks the readers.

Also. With this patch the code relies on the documented behaviour of
synchronize_sched(), it doesn't try to pair synchronize_sched() with
barrier.

Signed-off-by: Oleg Nesterov <o...@redhat.com>
Reviewed-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
---
 include/linux/percpu-rwsem.h |   83 +
 lib/Makefile                 |    2 +-
 lib/percpu-rwsem.c           |  123 ++
 3 files changed, 137 insertions(+), 71 deletions(-)
 create mode 100644 lib/percpu-rwsem.c

diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index 250a4ac..592f0d6 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -2,82 +2,25 @@
 #define _LINUX_PERCPU_RWSEM_H
 
 #include <linux/mutex.h>
+#include <linux/rwsem.h>
 #include <linux/percpu.h>
-#include <linux/rcupdate.h>
-#include <linux/delay.h>
+#include <linux/wait.h>
 
 struct percpu_rw_semaphore {
-	unsigned __percpu *counters;
-	bool locked;
-	struct mutex mtx;
+	unsigned int __percpu	*fast_read_ctr;
+	struct mutex		writer_mutex;
+	struct rw_semaphore	rw_sem;
+	atomic_t		slow_read_ctr;
+	wait_queue_head_t	write_waitq;
 };
 
-#define light_mb()	barrier()
-#define heavy_mb()	synchronize_sched()
+extern void percpu_down_read(struct percpu_rw_semaphore *);
+extern void percpu_up_read(struct percpu_rw_semaphore *);
 
-static inline void percpu_down_read(struct percpu_rw_semaphore *p)
-{
-	rcu_read_lock_sched();
-	if (unlikely(p->locked)) {
-		rcu_read_unlock_sched();
-		mutex_lock(&p->mtx);
-		this_cpu_inc(*p->counters);
-		mutex_unlock(&p->mtx);
-		return;
-	}
-	this_cpu_inc(*p->counters);
-	rcu_read_unlock_sched();
-	light_mb(); /* A, between read of p->locked and read of data, paired with D */
-}
+extern void percpu_down_write(struct percpu_rw_semaphore *);
+extern void percpu_up_write(struct percpu_rw_semaphore *);
 
-static inline void percpu_up_read(struct percpu_rw_semaphore *p)
-{
-	light_mb(); /* B, between read of the data and write to p->counter, paired with C */
-	this_cpu_dec(*p->counters);
-}
-
-static inline unsigned __percpu_count(unsigned __percpu *counters)
-{
-	unsigned total = 0;
-	int cpu;
-
-	for_each_possible_cpu(cpu)
-		total += ACCESS_ONCE(*per_cpu_ptr(counters, cpu));
-
-	return total;
-}
-
-static inline void percpu_down_write(struct percpu_rw_semaphore *p)
-{
-	mutex_lock(&p->mtx);
-	p->locked = true;
-	synchronize_sched(); /* make sure that all readers exit the rcu_read_lock_sched region */
-	while (__percpu_count(p->counters))
-		msleep(1);
-	heavy_mb(); /* C, between read of p->counter and write to data, paired with B */
-}
-
-static inline void percpu_up_write(struct percpu_rw_semaphore *p)
-{
-	heavy_mb(); /* D, between write to data and write to p->locked, paired with A */
-	p->locked = false;
-	mutex_unlock(&p->mtx);
-}
-
-static inline int percpu_init_rwsem(struct percpu_rw_semaphore *p)
-{
-	p->counters = alloc_percpu(unsigned);
-	if (unlikely(!p->counters))
-		return -ENOMEM;
-	p->locked = false;
-	mutex_init(&p->mtx);
-	return 0;
-}
-
-static inline void percpu_free_rwsem(struct percpu_rw_semaphore *p)
-{
-	free_percpu(p->counters);
-	p->counters = NULL; /* catch use after free bugs */
-}
+extern int percpu_init_rwsem(struct percpu_rw_semaphore *);
+extern void percpu_free_rwsem(struct percpu_rw_semaphore *);
 
 #endif
diff --git a/lib/Makefile b/lib/Makefile
index 821a162..4dad4a7 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -12,7 +12,7 @@
 lib-y := ctype.o string.o vsprintf.o cmdline.o \
 	 idr.o int_sqrt.o extable.o \
 	 sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
 	 proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
-	 is_single_threaded.o plist.o decompress.o
+	 is_single_threaded.o plist.o decompress.o percpu-rwsem.o
 
 lib-$(CONFIG_MMU) += ioremap.o
 lib-$(CONFIG_SMP) += cpumask.o
diff --git a/lib/percpu-rwsem.c b/lib/percpu-rwsem.c
new file mode 100644
index 0000000..0e3bc0f
--- /dev/null
+++ b/lib/percpu-rwsem.c
@@ -0,0 +1,123 @@
+#include <linux/percpu-rwsem.h>
+#include <linux/rcupdate.h>
+#include <linux/sched.h>
+
+int percpu_init_rwsem(struct percpu_rw_semaphore *brw)
+{
+	brw->fast_read_ctr =
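Assuming only the interface declared in the header above, a caller would use
the semaphore roughly as in the sketch below. dev_list_lock and the
dev_list_*() helpers are invented names for illustration; only the percpu_*()
calls come from the patch.

#include <linux/percpu-rwsem.h>

/* Hypothetical user: a read-mostly list guarded by the new semaphore. */
static struct percpu_rw_semaphore dev_list_lock;

static int dev_list_setup(void)
{
	/* returns 0, or -ENOMEM if the per-CPU counter allocation fails */
	return percpu_init_rwsem(&dev_list_lock);
}

/* Hot path: many concurrent readers, normally just a per-CPU increment. */
static void dev_list_walk(void)
{
	percpu_down_read(&dev_list_lock);
	/* ... read the shared structure; no writer can be active here ... */
	percpu_up_read(&dev_list_lock);
}

/* Cold path: rare, pays the synchronize_sched() grace periods. */
static void dev_list_resize(void)
{
	percpu_down_write(&dev_list_lock);
	/* ... all readers are excluded until percpu_up_write() ... */
	percpu_up_write(&dev_list_lock);
}

The split in costs is the point of the design: percpu_down_read() normally
touches one per-CPU counter with preemption disabled, while the rare
percpu_down_write() absorbs both synchronize_sched() grace periods, which
suits uprobes-style workloads where writers are exceptional.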