Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily

2012-11-12 Thread Paul E. McKenney
On Sun, Nov 11, 2012 at 04:45:09PM +0100, Oleg Nesterov wrote:
> On 11/09, Paul E. McKenney wrote:
> >
> > On Fri, Nov 09, 2012 at 07:10:48PM +0100, Oleg Nesterov wrote:
> > >
> > >   static bool xxx(brw)
> > >   {
> > >   down_write(&brw->rw_sem);
> >
> > down_write_trylock()
> >
> > As you noted in your later email.  Presumably you return false if
> > the attempt to acquire it fails.
> 
> Yes, yes, thanks.
> 
> > > But first we should do other changes, I think. IMHO we should not do
> > > synchronize_sched() under mutex_lock() and this will add (a bit) more
> > > complications. We will see.
> >
> > Indeed, that does put considerable delay on the writers.  There is always
> > synchronize_sched_expedited(), I suppose.
> 
> I am not sure about synchronize_sched_expedited() (at least unconditionally),
> but: only the 1st down_write() needs synchronize_, and up_write() does not
> need to sleep in synchronize_ at all.
> 
> To simplify, let's ignore the fact that the writers need to serialize with
> each other. IOW, the pseudo-code below is obviously deadly wrong and racy,
> just to illustrate the idea.
> 
> 1. We remove brw->writer_mutex and add "atomic_t writers_ctr".
> 
>update_fast_ctr() uses atomic_read(brw->writers_ctr) == 0 instead
>of !mutex_is_locked().
> 
> 2. down_write() does
> 
>   if (atomic_add_return(brw->writers_ctr) == 1) {
>   // first writer
>   synchronize_sched();
>   ...
>   } else {
>   ... XXX: wait for percpu_up_write() from the first writer ...
>   }
> 
> 3. up_write() does
> 
>   if (atomic_dec_unless_one(brw->writers_ctr)) {
>   ... wake up XXX writers above ...
>   return;
>   } else {
>   // the last writer
>   call_rcu_sched( func => { atomic_dec(brw->writers_ctr) } );
>   }

Agreed, an asynchronous callback can be used to switch the readers
back onto the fastpath.  Of course, as you say, getting it all working
will require some care.  ;-)

> Once again, this all is racy, but hopefully the idea is clear:
> 
>   - down_write(brw) sleeps in synchronize_sched() only if brw
> has already switched back to fast-path-mode
> 
>   - up_write() never sleeps in synchronize_sched(), it uses
> call_rcu_sched() or wakes up the next writer.
> 
> Of course I am not sure this is all worth the trouble; this should be discussed.
> (and, cough, I'd like to add the multi-writers mode which I'm afraid nobody
> will like) But I am not going to even try to do this until the current patch
> is applied, I need it to fix the bug in uprobes and I think the current code
> is "good enough". These changes can't speed up the readers, and the
> writers are slow/rare anyway.

Probably best to wait for multi-writers until there is a measurable need,
to be sure!  ;-)

Thanx, Paul



Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily

2012-11-11 Thread Oleg Nesterov
On 11/09, Paul E. McKenney wrote:
>
> On Fri, Nov 09, 2012 at 07:10:48PM +0100, Oleg Nesterov wrote:
> >
> > static bool xxx(brw)
> > {
> > down_write(&brw->rw_sem);
>
>   down_write_trylock()
>
> As you noted in your later email.  Presumably you return false if
> the attempt to acquire it fails.

Yes, yes, thanks.

> > But first we should do other changes, I think. IMHO we should not do
> > synchronize_sched() under mutex_lock() and this will add (a bit) more
> > complications. We will see.
>
> Indeed, that does put considerable delay on the writers.  There is always
> synchronize_sched_expedited(), I suppose.

I am not sure about synchronize_sched_expedited() (at least unconditionally),
but: only the 1st down_write() needs synchronize_, and up_write() does not
need to sleep in synchronize_ at all.

To simplify, let's ignore the fact that the writers need to serialize with
each other. IOW, the pseudo-code below is obviously deadly wrong and racy,
just to illustrate the idea.

1. We remove brw->writer_mutex and add "atomic_t writers_ctr".

   update_fast_ctr() uses atomic_read(brw->writers_ctr) == 0 instead
   of !mutex_is_locked().

2. down_write() does

if (atomic_add_return(brw->writers_ctr) == 1) {
// first writer
synchronize_sched();
...
} else {
... XXX: wait for percpu_up_write() from the first writer ...
}

3. up_write() does

if (atomic_dec_unless_one(brw->writers_ctr)) {
... wake up XXX writers above ...
return;
} else {
// the last writer
call_rcu_sched( func => { atomic_dec(brw->writers_ctr) } );
}

Once again, this all is racy, but hopefully the idea is clear:

- down_write(brw) sleeps in synchronize_sched() only if brw
  has already switched back to fast-path-mode

- up_write() never sleeps in synchronize_sched(), it uses
  call_rcu_sched() or wakes up the next writer.
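For illustration, the call_rcu_sched() step could look roughly like the sketch
below. This is a sketch only: it assumes struct percpu_rw_semaphore grows an
atomic_t writers_ctr and a struct rcu_head (neither is in the posted patch),
the function names are made up, and the writer-writer serialization is still
ignored, as above.

    #include <linux/atomic.h>
    #include <linux/kernel.h>
    #include <linux/percpu-rwsem.h>
    #include <linux/rcupdate.h>

    /* runs one sched grace period after the last up_write(); from now on
     * update_fast_ctr() can succeed again, i.e. the readers are back on
     * the fast path */
    static void writers_ctr_dec_rcu(struct rcu_head *rcu)
    {
            struct percpu_rw_semaphore *brw =
                    container_of(rcu, struct percpu_rw_semaphore, rcu);

            atomic_dec(&brw->writers_ctr);
    }

    /* called by the last writer instead of sleeping in synchronize_sched() */
    static void last_up_write(struct percpu_rw_semaphore *brw)
    {
            call_rcu_sched(&brw->rcu, writers_ctr_dec_rcu);
    }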

Of course I am not sure this is all worth the trouble; this should be discussed.
(and, cough, I'd like to add the multi-writers mode which I'm afraid nobody
will like) But I am not going to even try to do this until the current patch
is applied, I need it to fix the bug in uprobes and I think the current code
is "good enough". These changes can't speed up the readers, and the
writers are slow/rare anyway.

Thanks!

Oleg.



Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily

2012-11-09 Thread Paul E. McKenney
On Fri, Nov 09, 2012 at 07:10:48PM +0100, Oleg Nesterov wrote:
> On 11/09, Paul E. McKenney wrote:
> >
> > On Fri, Nov 09, 2012 at 04:46:56PM +0100, Oleg Nesterov wrote:
> > > Contrary, I am going to try to add some complications later, so that
> > > it can have more users. In particular, I think it can replace
> > > get_online_cpus/cpu_hotplug_begin, just we need
> > > percpu_down_write_but_dont_deadlock_with_recursive_readers().
> >
> > I must confess that I am a bit concerned about possible scalability
> > bottlenecks in the current get_online_cpus(), so +1 from me on this one.
> 
> OK, thanks...
> 
> And btw percpu_down_write_but_dont_deadlock_with_recursive_readers() is
> trivial, just it needs down_write(rw_sem) "inside" wait_event(), not
> before. But I'm afraid I will never manage to write the comments ;)
> 
>   static bool xxx(brw)
>   {
>   down_write(&brw->rw_sem);

down_write_trylock()

As you noted in your later email.  Presumably you return false if
the attempt to acquire it fails.

>   if (!atomic_read(&brw->slow_read_ctr))
>   return true;
> 
>   up_write(&brw->rw_sem);
>   return false;
>   }
> 
>   static void __percpu_down_write(struct percpu_rw_semaphore *brw, bool 
> recursive_readers)
>   {
>   mutex_lock(&brw->writer_mutex);
> 
>   synchronize_sched();
> 
>   atomic_add(clear_fast_ctr(brw), &brw->slow_read_ctr);
> 
>   if (recursive_readers)  {
>   wait_event(brw->write_waitq, xxx(brw));

I see what you mean about acquiring brw->rw_sem inside of wait_event().

Cute trick!

The "recursive_readers" is a global initialization-time thing, right?

>   } else {
>   down_write(&brw->rw_sem);
> 
>   wait_event(brw->write_waitq, 
> !atomic_read(&brw->slow_read_ctr));
>   }
>   }

Looks like it should work, and would perform and scale nicely even
if we end up having to greatly increase the number of calls to
get_online_cpus().

> Of course, cpu.c still needs .active_writer to allow get_online_cpus()
> under cpu_hotplug_begin(), but this is simple.

Yep, same check as now.

> But first we should do other changes, I think. IMHO we should not do
> synchronize_sched() under mutex_lock() and this will add (a bit) more
> complications. We will see.

Indeed, that does put considerable delay on the writers.  There is always
synchronize_sched_expedited(), I suppose.

Thanx, Paul



Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily

2012-11-09 Thread Oleg Nesterov
On 11/09, Oleg Nesterov wrote:
>
>   static bool xxx(brw)
>   {
>   down_write(&brw->rw_sem);
>   if (!atomic_read(&brw->slow_read_ctr))
>   return true;

I meant, try_to_down_write(). Otherwise this can obviously deadlock.
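IOW, with the real API name (down_write_trylock() is what "try_to_down_write()"
refers to), the helper becomes something like the sketch below; the descriptive
name is made up, otherwise it is the xxx() from the previous mail:

    /* wait_event() condition for the recursive-readers case: take rw_sem
     * for writing only if we can keep it, never sleep on it here */
    static bool readers_gone(struct percpu_rw_semaphore *brw)
    {
            if (!down_write_trylock(&brw->rw_sem))
                    return false;           /* a recursive reader may hold it */

            if (!atomic_read(&brw->slow_read_ctr))
                    return true;            /* keep rw_sem, all readers are gone */

            up_write(&brw->rw_sem);
            return false;
    }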

Oleg.



Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily

2012-11-09 Thread Oleg Nesterov
On 11/09, Paul E. McKenney wrote:
>
> On Fri, Nov 09, 2012 at 04:46:56PM +0100, Oleg Nesterov wrote:
> > Contrary, I am going to try to add some complications later, so that
> > it can have more users. In particular, I think it can replace
> > get_online_cpus/cpu_hotplug_begin, just we need
> > percpu_down_write_but_dont_deadlock_with_recursive_readers().
>
> I must confess that I am a bit concerned about possible scalability
> bottlenecks in the current get_online_cpus(), so +1 from me on this one.

OK, thanks...

And btw percpu_down_write_but_dont_deadlock_with_recursive_readers() is
trivial, just it needs down_write(rw_sem) "inside" wait_event(), not
before. But I'm afraid I will never manage to write the comments ;)

static bool xxx(brw)
{
down_write(&brw->rw_sem);
if (!atomic_read(&brw->slow_read_ctr))
return true;

up_write(&brw->rw_sem);
return false;
}

static void __percpu_down_write(struct percpu_rw_semaphore *brw, bool 
recursive_readers)
{
mutex_lock(&brw->writer_mutex);

synchronize_sched();

atomic_add(clear_fast_ctr(brw), &brw->slow_read_ctr);

if (recursive_readers)  {
wait_event(brw->write_waitq, xxx(brw));
} else {
down_write(&brw->rw_sem);

wait_event(brw->write_waitq, 
!atomic_read(&brw->slow_read_ctr));
}
}

Of course, cpu.c still needs .active_writer to allow get_online_cpus()
under cpu_hotplug_begin(), but this is simple.
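For reference, the .active_writer check in the current kernel/cpu.c looks
roughly like this (quoted from memory of the 3.x code, not part of this patch):

    void get_online_cpus(void)
    {
            might_sleep();
            /* the task inside cpu_hotplug_begin()/done() recurses freely */
            if (cpu_hotplug.active_writer == current)
                    return;
            mutex_lock(&cpu_hotplug.lock);
            cpu_hotplug.refcount++;
            mutex_unlock(&cpu_hotplug.lock);
    }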

But first we should do other changes, I think. IMHO we should not do
synchronize_sched() under mutex_lock() and this will add (a bit) more
complications. We will see.

Oleg.



Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily

2012-11-09 Thread Paul E. McKenney
On Fri, Nov 09, 2012 at 04:46:56PM +0100, Oleg Nesterov wrote:
> On 11/08, Andrew Morton wrote:
> >
> > On Thu, 8 Nov 2012 14:48:49 +0100
> > Oleg Nesterov  wrote:
> >
> > >
> > >  include/linux/percpu-rwsem.h |   83 +
> > >  lib/Makefile |2 +-
> > >  lib/percpu-rwsem.c   |  123 
> > > ++
> >
> > The patch also uninlines everything.
> >
> > And it didn't export the resulting symbols to modules, so it isn't an
> > equivalent.  We can export thing later if needed I guess.
> 
> Yes, currently it is only used by block_dev.c
> 
> > It adds percpu-rwsem.o to lib-y, so the CONFIG_BLOCK=n kernel will
> > avoid including the code altogether, methinks?
> 
> I am going to add another user (uprobes), this was my motivation for
> this patch. And perhaps it will have more users.
> 
> But I agree, CONFIG_PERCPU_RWSEM makes sense at least now, I'll send
> the patch.
> 
> > > +#include <linux/percpu-rwsem.h>
> > > +#include <linux/rcupdate.h>
> > > +#include <linux/sched.h>
> >
> > This list is nowhere near sufficient to support this file's
> > requirements.  atomic.h, percpu.h, rwsem.h, wait.h, errno.h and plenty
> > more.  IOW, if it compiles, it was sheer luck.
> 
> OK, thanks, I'll send
> percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily.fix
> 
> > > +/*
> > > + * A writer takes ->writer_mutex to exclude other writers and to force 
> > > the
> > > + * readers to switch to the slow mode, note the mutex_is_locked() check 
> > > in
> > > + * update_fast_ctr().
> > > + *
> > > + * After that the readers can only inc/dec the slow ->slow_read_ctr 
> > > counter,
> > > + * ->fast_read_ctr is stable. Once the writer moves its sum into the slow
> > > + * counter it represents the number of active readers.
> > > + *
> > > + * Finally the writer takes ->rw_sem for writing and blocks the new 
> > > readers,
> > > + * then waits until the slow counter becomes zero.
> > > + */
> >
> > Some overview of how fast/slow_read_ctr are supposed to work would be
> > useful.  This comment seems to assume that the reader already knew
> > that.
> 
> I hate to say this, but I'll try to update this comment too ;)
> 
> > > +void percpu_down_write(struct percpu_rw_semaphore *brw)
> > > +{
> > > + /* also blocks update_fast_ctr() which checks mutex_is_locked() */
> > > + mutex_lock(&brw->writer_mutex);
> > > +
> > > + /*
> > > +  * 1. Ensures mutex_is_locked() is visible to any down_read/up_read
> > > +  *so that update_fast_ctr() can't succeed.
> > > +  *
> > > +  * 2. Ensures we see the result of every previous this_cpu_add() in
> > > +  *update_fast_ctr().
> > > +  *
> > > +  * 3. Ensures that if any reader has exited its critical section via
> > > +  *fast-path, it executes a full memory barrier before we return.
> > > +  */
> > > + synchronize_sched();
> >
> > Here's where I get horridly confused.  Your patch completely deRCUifies
> > this code, yes?  Yet here we're using an RCU primitive.  And we seem to
> > be using it not as an RCU primitive but as a handy thing which happens
> > to have desirable side-effects.  But the implementation of
> > synchronize_sched() differs considerably according to which rcu
> > flavor-of-the-minute you're using.
> 
> It is documented that synchronize_sched() should play well with
> preempt_disable/enable. From the comment:
> 
>   Note that preempt_disable(),
>   local_irq_disable(), and so on may be used in place of
>   rcu_read_lock_sched().
> 
> But I guess this needs more discussion, I see other emails in this
> thread...
> 
> > And part 3 talks about the reader's critical section.  The only
> > critical sections I can see on the reader side are already covered by
> > mutex_lock() and preempt_disable().
> 
> Yes, but we need to ensure that if we take the lock for writing, we
> should see all memory modifications done under down_read/up_read().
> 
> IOW. Suppose that the reader does
> 
>   percpu_down_read();
>   STORE;
>   percpu_up_read();   // no barriers in the fast path
> 
> The writer should see the result of that STORE under percpu_down_write().
> 
> Part 3 tries to say that at this point we should already see the result,
> so we should not worry about acquire/release semantics.
> 
> > If this code isn't as brain damaged as it
> > initially appears then please,
> 
> I hope ;)
> 
> > go easy on us simpletons in the next
> > version?
> 
> Well, I'll try to update the comments... but the code is simple, I do
> not think I can simplify it more. The nontrivial part is the barriers,
> but this is always nontrivial.
> 
> Contrary, I am going to try to add some complications later, so that
> it can have more users. In particular, I think it can replace
> get_online_cpus/cpu_hotplug_begin, just we need
> percpu_down_write_but_dont_deadlock_with_recursive_readers().

I must confess that I am a bit concerned about possible scalability
bottlenecks in the current get_online_cpus(), so +1 from me on this one.

Thanx, Paul

Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily

2012-11-09 Thread Paul E. McKenney
On Fri, Nov 09, 2012 at 05:35:38PM +0100, Oleg Nesterov wrote:
> On 11/08, Paul E. McKenney wrote:
> >
> > On Thu, Nov 08, 2012 at 04:41:36PM -0800, Paul E. McKenney wrote:
> > > On Thu, Nov 08, 2012 at 06:41:10PM -0500, Mikulas Patocka wrote:
> > > >
> > > > On Thu, 8 Nov 2012, Paul E. McKenney wrote:
> > > >
> > > > > On Thu, Nov 08, 2012 at 12:07:00PM -0800, Andrew Morton wrote:
> > > > > > On Thu, 8 Nov 2012 14:48:49 +0100
> > > > > > Oleg Nesterov  wrote:
> > > > > >
> > > > >
> > > > > The algorithm would work given rcu_read_lock()/rcu_read_unlock() and
> > > > > synchronize_rcu() in place of preempt_disable()/preempt_enable() and
> > > > > synchronize_sched().  The real-time guys would prefer the change
> > > > > to rcu_read_lock()/rcu_read_unlock() and synchronize_rcu(), now that
> > > > > you mention it.
> > > > >
> > > > > Oleg, Mikulas, any reason not to move to 
> > > > > rcu_read_lock()/rcu_read_unlock()
> > > > > and synchronize_rcu()?
> > > >
> > > > preempt_disable/preempt_enable is faster than
> > > > rcu_read_lock/rcu_read_unlock for preemptive kernels.
> 
> Yes, I chose preempt_disable() because it is the fastest/simplest
> primitive and the critical section is really tiny.
> 
> But:
> 
> > > Significantly faster in this case?  Can you measure the difference
> > > from a user-mode test?
> 
> I do not think rcu_read_lock() or rcu_read_lock_sched() can actually
> make a measurable difference.
> 
> > Actually, the fact that __this_cpu_add() will malfunction on some
> > architectures if preemption is not disabled seems a more compelling
> > reason to keep preempt_enable() than any performance improvement.  ;-)
> 
> Yes, but this_cpu_add() should work.

Indeed!  But this_cpu_add() just does the preempt_enable() under the
covers, so not much difference from a latency viewpoint.

> > > Careful.  The real-time guys might take the same every-little-bit approach
> > > to latency that you seem to be taking for CPU cycles.  ;-)
> 
> Understand...
> 
> So I simply do not know. Please tell me if you think it would be
> better to use rcu_read_lock/synchronize_rcu or rcu_read_lock_sched,
> and I'll send the patch.

I doubt if it makes a measurable difference for either throughput or
latency.  One could argue that rcu_read_lock() would be better for
readability, but making sure that the preempt_disable() is clearly
commented as starting an RCU-sched read-side critical section would
be just as good.

So I am OK with the current preempt_disable() approach.
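For example, the reader fast path quoted later in the thread would only need
the comments added, something like this sketch:

    static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int val)
    {
            bool success = false;

            preempt_disable();      /* RCU-sched read-side critical section */
            if (likely(!mutex_is_locked(&brw->writer_mutex))) {
                    __this_cpu_add(*brw->fast_read_ctr, val);
                    success = true;
            }
            preempt_enable();       /* end of RCU-sched read-side critical section */

            return success;
    }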

Thanx, Paul



Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily

2012-11-09 Thread Oleg Nesterov
On 11/08, Paul E. McKenney wrote:
>
> On Thu, Nov 08, 2012 at 04:41:36PM -0800, Paul E. McKenney wrote:
> > On Thu, Nov 08, 2012 at 06:41:10PM -0500, Mikulas Patocka wrote:
> > >
> > > On Thu, 8 Nov 2012, Paul E. McKenney wrote:
> > >
> > > > On Thu, Nov 08, 2012 at 12:07:00PM -0800, Andrew Morton wrote:
> > > > > On Thu, 8 Nov 2012 14:48:49 +0100
> > > > > Oleg Nesterov  wrote:
> > > > >
> > > >
> > > > The algorithm would work given rcu_read_lock()/rcu_read_unlock() and
> > > > synchronize_rcu() in place of preempt_disable()/preempt_enable() and
> > > > synchronize_sched().  The real-time guys would prefer the change
> > > > to rcu_read_lock()/rcu_read_unlock() and synchronize_rcu(), now that
> > > > you mention it.
> > > >
> > > > Oleg, Mikulas, any reason not to move to 
> > > > rcu_read_lock()/rcu_read_unlock()
> > > > and synchronize_rcu()?
> > >
> > > preempt_disable/preempt_enable is faster than
> > > rcu_read_lock/rcu_read_unlock for preemptive kernels.

Yes, I chose preempt_disable() because it is the fastest/simplest
primitive and the critical section is really tiny.

But:

> > Significantly faster in this case?  Can you measure the difference
> > from a user-mode test?

I do not think rcu_read_lock() or rcu_read_lock_sched() can actually
make a measurable difference.

> Actually, the fact that __this_cpu_add() will malfunction on some
> architectures if preemption is not disabled seems a more compelling
> reason to keep preempt_enable() than any performance improvement.  ;-)

Yes, but this_cpu_add() should work.

> > Careful.  The real-time guys might take the same every-little-bit approach
> > to latency that you seem to be taking for CPU cycles.  ;-)

Understand...


So I simply do not know. Please tell me if you think it would be
better to use rcu_read_lock/synchronize_rcu or rcu_read_lock_sched,
and I'll send the patch.

Oleg.



Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily

2012-11-09 Thread Oleg Nesterov
On 11/08, Andrew Morton wrote:
>
> On Thu, 8 Nov 2012 14:48:49 +0100
> Oleg Nesterov  wrote:
>
> >
> >  include/linux/percpu-rwsem.h |   83 +
> >  lib/Makefile |2 +-
> >  lib/percpu-rwsem.c   |  123 
> > ++
>
> The patch also uninlines everything.
>
> And it didn't export the resulting symbols to modules, so it isn't an
> equivalent.  We can export thing later if needed I guess.

Yes, currently it is only used by block_dev.c

> It adds percpu-rwsem.o to lib-y, so the CONFIG_BLOCK=n kernel will
> avoid including the code altogether, methinks?

I am going to add another user (uprobes), this was my motivation for
this patch. And perhaps it will have more users.

But I agree, CONFIG_PERCPU_RWSEM makes sense at least now, I'll send
the patch.

> > +#include <linux/percpu-rwsem.h>
> > +#include <linux/rcupdate.h>
> > +#include <linux/sched.h>
>
> This list is nowhere near sufficient to support this file's
> requirements.  atomic.h, percpu.h, rwsem.h, wait.h, errno.h and plenty
> more.  IOW, if it compiles, it was sheer luck.

OK, thanks, I'll send
percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily.fix

> > +/*
> > + * A writer takes ->writer_mutex to exclude other writers and to force the
> > + * readers to switch to the slow mode, note the mutex_is_locked() check in
> > + * update_fast_ctr().
> > + *
> > + * After that the readers can only inc/dec the slow ->slow_read_ctr 
> > counter,
> > + * ->fast_read_ctr is stable. Once the writer moves its sum into the slow
> > + * counter it represents the number of active readers.
> > + *
> > + * Finally the writer takes ->rw_sem for writing and blocks the new 
> > readers,
> > + * then waits until the slow counter becomes zero.
> > + */
>
> Some overview of how fast/slow_read_ctr are supposed to work would be
> useful.  This comment seems to assume that the reader already knew
> that.

I hate to say this, but I'll try to update this comment too ;)
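Something along these lines, a sketch reconstructed purely from the code quoted
in this thread (not the final wording):

    /*
     * ->fast_read_ctr is a per-CPU counter.  While no writer holds
     * ->writer_mutex, percpu_down_read()/percpu_up_read() only do
     * __this_cpu_add(+1/-1) under preempt_disable(); the per-CPU
     * counters are never summed on this path, so readers scale.
     *
     * ->slow_read_ctr is a single atomic_t.  Once a writer takes
     * ->writer_mutex and synchronize_sched() returns, new readers
     * fall back to ->rw_sem plus inc/dec of ->slow_read_ctr.  The
     * writer folds the per-CPU sums into ->slow_read_ctr (see
     * clear_fast_ctr()), so from then on it counts all active
     * readers and the writer waits for it to drop to zero.
     */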

> > +void percpu_down_write(struct percpu_rw_semaphore *brw)
> > +{
> > +   /* also blocks update_fast_ctr() which checks mutex_is_locked() */
> > +   mutex_lock(&brw->writer_mutex);
> > +
> > +   /*
> > +* 1. Ensures mutex_is_locked() is visible to any down_read/up_read
> > +*so that update_fast_ctr() can't succeed.
> > +*
> > +* 2. Ensures we see the result of every previous this_cpu_add() in
> > +*update_fast_ctr().
> > +*
> > +* 3. Ensures that if any reader has exited its critical section via
> > +*fast-path, it executes a full memory barrier before we return.
> > +*/
> > +   synchronize_sched();
>
> Here's where I get horridly confused.  Your patch completely deRCUifies
> this code, yes?  Yet here we're using an RCU primitive.  And we seem to
> be using it not as an RCU primitive but as a handy thing which happens
> to have desirable side-effects.  But the implementation of
> synchronize_sched() differs considerably according to which rcu
> flavor-of-the-minute you're using.

It is documented that synchronize_sched() should play well with
preempt_disable/enable. From the comment:

Note that preempt_disable(),
local_irq_disable(), and so on may be used in place of
rcu_read_lock_sched().

But I guess this needs more discussion, I see other emails in this
thread...

> And part 3 talks about the reader's critical section.  The only
> critical sections I can see on the reader side are already covered by
> mutex_lock() and preempt_disable().

Yes, but we need to ensure that if we take the lock for writing, we
should see all memory modifications done under down_read/up_read().

IOW. Suppose that the reader does

percpu_down_read();
STORE;
percpu_up_read();   // no barriers in the fast path

The writer should see the result of that STORE under percpu_down_write().

Part 3 tries to say that at this point we should already see the result,
so we should not worry about acquire/release semantics.
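As a concrete (made-up) example of that guarantee:

    #include <linux/percpu-rwsem.h>
    #include <linux/printk.h>

    static struct percpu_rw_semaphore sem;  /* assume percpu_init_rwsem(&sem) was done */
    static int shared_data;

    static void reader(void)
    {
            percpu_down_read(&sem);
            shared_data = 1;                /* the STORE above */
            percpu_up_read(&sem);           /* fast path: no explicit barrier */
    }

    static void writer(void)
    {
            percpu_down_write(&sem);
            /* if reader() already left its critical section, the STORE is
             * guaranteed to be visible here, via the synchronize_sched() in
             * percpu_down_write(), although the fast path has no barriers */
            pr_info("shared_data = %d\n", shared_data);
            percpu_up_write(&sem);
    }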

> If this code isn't as brain damaged as it
> initially appears then please,

I hope ;)

> go easy on us simpletons in the next
> version?

Well, I'll try to update the comments... but the code is simple, I do
not think I can simplify it more. The nontrivial part is the barriers,
but this is always nontrivial.

Contrary, I am going to try to add some complications later, so that
it can have more users. In particular, I think it can replace
get_online_cpus/cpu_hotplug_begin, just we need
percpu_down_write_but_dont_deadlock_with_recursive_readers().

Oleg.



Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily

2012-11-09 Thread Mikulas Patocka


On Thu, 8 Nov 2012, Andrew Morton wrote:

> On Thu, 8 Nov 2012 14:48:49 +0100
> Oleg Nesterov  wrote:
> 
> > Currently the writer does msleep() plus synchronize_sched() 3 times
> > to acquire/release the semaphore, and during this time the readers
> > are blocked completely. Even if the "write" section was not actually
> > started or if it was already finished.
> > 
> > With this patch down_write/up_write does synchronize_sched() twice
> > and down_read/up_read are still possible during this time, just they
> > use the slow path.
> > 
> > percpu_down_write() first forces the readers to use rw_semaphore and
> > increment the "slow" counter to take the lock for reading, then it
> > takes that rw_semaphore for writing and blocks the readers.
> > 
> > Also. With this patch the code relies on the documented behaviour of
> > synchronize_sched(), it doesn't try to pair synchronize_sched() with
> > barrier.
> > 
> > ...
> >
> >  include/linux/percpu-rwsem.h |   83 +
> >  lib/Makefile |2 +-
> >  lib/percpu-rwsem.c   |  123 
> > ++
> 
> The patch also uninlines everything.
> 
> And it didn't export the resulting symbols to modules, so it isn't an
> equivalent.  We can export thing later if needed I guess.
> 
> It adds percpu-rwsem.o to lib-y, so the CONFIG_BLOCK=n kernel will
> avoid including the code altogether, methinks?

If you want to use percpu-rwsem only for block devices then you can drop
Oleg's patch entirely. Oleg's optimizations are useless for the block device
use case (the contention between readers and writers is very rare and it
doesn't matter if readers are blocked in case of contention). I suppose
that Oleg made the optimizations because he wants to use percpu-rwsem for
something else - if not, you can drop the patch and revert to the previous
version, which is simpler.

Mikulas


Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily

2012-11-08 Thread Paul E. McKenney
On Thu, Nov 08, 2012 at 04:41:36PM -0800, Paul E. McKenney wrote:
> On Thu, Nov 08, 2012 at 06:41:10PM -0500, Mikulas Patocka wrote:
> > 
> > 
> > On Thu, 8 Nov 2012, Paul E. McKenney wrote:
> > 
> > > On Thu, Nov 08, 2012 at 12:07:00PM -0800, Andrew Morton wrote:
> > > > On Thu, 8 Nov 2012 14:48:49 +0100
> > > > Oleg Nesterov  wrote:
> > > > 
> > > > > Currently the writer does msleep() plus synchronize_sched() 3 times
> > > > > to acquire/release the semaphore, and during this time the readers
> > > > > are blocked completely. Even if the "write" section was not actually
> > > > > started or if it was already finished.
> > > > > 
> > > > > With this patch down_write/up_write does synchronize_sched() twice
> > > > > and down_read/up_read are still possible during this time, just they
> > > > > use the slow path.
> > > > > 
> > > > > percpu_down_write() first forces the readers to use rw_semaphore and
> > > > > increment the "slow" counter to take the lock for reading, then it
> > > > > takes that rw_semaphore for writing and blocks the readers.
> > > > > 
> > > > > Also. With this patch the code relies on the documented behaviour of
> > > > > synchronize_sched(), it doesn't try to pair synchronize_sched() with
> > > > > barrier.
> > > > > 
> > > > > ...
> > > > >
> > > > >  include/linux/percpu-rwsem.h |   83 +
> > > > >  lib/Makefile |2 +-
> > > > >  lib/percpu-rwsem.c   |  123 
> > > > > ++
> > > > 
> > > > The patch also uninlines everything.
> > > > 
> > > > And it didn't export the resulting symbols to modules, so it isn't an
> > > > equivalent.  We can export thing later if needed I guess.
> > > > 
> > > > It adds percpu-rwsem.o to lib-y, so the CONFIG_BLOCK=n kernel will
> > > > avoid including the code altogether, methinks?
> > > > 
> > > > >
> > > > > ...
> > > > >
> > > > > --- /dev/null
> > > > > +++ b/lib/percpu-rwsem.c
> > > > > @@ -0,0 +1,123 @@
> > > > 
> > > > That was nice and terse ;)
> > > > 
> > > > > +#include <linux/percpu-rwsem.h>
> > > > > +#include <linux/rcupdate.h>
> > > > > +#include <linux/sched.h>
> > > > 
> > > > This list is nowhere near sufficient to support this file's
> > > > requirements.  atomic.h, percpu.h, rwsem.h, wait.h, errno.h and plenty
> > > > more.  IOW, if it compiles, it was sheer luck.
> > > > 
> > > > > +int percpu_init_rwsem(struct percpu_rw_semaphore *brw)
> > > > > +{
> > > > > + brw->fast_read_ctr = alloc_percpu(int);
> > > > > + if (unlikely(!brw->fast_read_ctr))
> > > > > + return -ENOMEM;
> > > > > +
> > > > > + mutex_init(&brw->writer_mutex);
> > > > > + init_rwsem(&brw->rw_sem);
> > > > > + atomic_set(&brw->slow_read_ctr, 0);
> > > > > + init_waitqueue_head(&brw->write_waitq);
> > > > > + return 0;
> > > > > +}
> > > > > +
> > > > > +void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
> > > > > +{
> > > > > + free_percpu(brw->fast_read_ctr);
> > > > > + brw->fast_read_ctr = NULL; /* catch use after free bugs */
> > > > > +}
> > > > > +
> > > > > +static bool update_fast_ctr(struct percpu_rw_semaphore *brw, 
> > > > > unsigned int val)
> > > > > +{
> > > > > + bool success = false;
> > > > > +
> > > > > + preempt_disable();
> > > > > + if (likely(!mutex_is_locked(&brw->writer_mutex))) {
> > > > > + __this_cpu_add(*brw->fast_read_ctr, val);
> > > > > + success = true;
> > > > > + }
> > > > > + preempt_enable();
> > > > > +
> > > > > + return success;
> > > > > +}
> > > > > +
> > > > > +/*
> > > > > + * Like the normal down_read() this is not recursive, the writer can
> > > > > + * come after the first percpu_down_read() and create the deadlock.
> > > > > + */
> > > > > +void percpu_down_read(struct percpu_rw_semaphore *brw)
> > > > > +{
> > > > > + if (likely(update_fast_ctr(brw, +1)))
> > > > > + return;
> > > > > +
> > > > > + down_read(&brw->rw_sem);
> > > > > + atomic_inc(&brw->slow_read_ctr);
> > > > > + up_read(&brw->rw_sem);
> > > > > +}
> > > > > +
> > > > > +void percpu_up_read(struct percpu_rw_semaphore *brw)
> > > > > +{
> > > > > + if (likely(update_fast_ctr(brw, -1)))
> > > > > + return;
> > > > > +
> > > > > + /* false-positive is possible but harmless */
> > > > > + if (atomic_dec_and_test(&brw->slow_read_ctr))
> > > > > + wake_up_all(&brw->write_waitq);
> > > > > +}
> > > > > +
> > > > > +static int clear_fast_ctr(struct percpu_rw_semaphore *brw)
> > > > > +{
> > > > > + unsigned int sum = 0;
> > > > > + int cpu;
> > > > > +
> > > > > + for_each_possible_cpu(cpu) {
> > > > > + sum += per_cpu(*brw->fast_read_ctr, cpu);
> > > > > + per_cpu(*brw->fast_read_ctr, cpu) = 0;
> > > > > + }
> > > > > +
> > > > > + return sum;
> > > > > +}
> > > > > +
> > > > > +/*
> > > > > + * A writer takes ->writer_mutex to exclude other writers and to 
> > > > > force the
> > > > > + * readers to switch to the slow mode, note the mutex_is_locked() 
> > > > > 

Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily

2012-11-08 Thread Paul E. McKenney
On Thu, Nov 08, 2012 at 06:41:10PM -0500, Mikulas Patocka wrote:
> 
> 
> On Thu, 8 Nov 2012, Paul E. McKenney wrote:
> 
> > On Thu, Nov 08, 2012 at 12:07:00PM -0800, Andrew Morton wrote:
> > > On Thu, 8 Nov 2012 14:48:49 +0100
> > > Oleg Nesterov  wrote:
> > > 
> > > > Currently the writer does msleep() plus synchronize_sched() 3 times
> > > > to acquire/release the semaphore, and during this time the readers
> > > > are blocked completely. Even if the "write" section was not actually
> > > > started or if it was already finished.
> > > > 
> > > > With this patch down_write/up_write does synchronize_sched() twice
> > > > and down_read/up_read are still possible during this time, just they
> > > > use the slow path.
> > > > 
> > > > percpu_down_write() first forces the readers to use rw_semaphore and
> > > > increment the "slow" counter to take the lock for reading, then it
> > > > takes that rw_semaphore for writing and blocks the readers.
> > > > 
> > > > Also. With this patch the code relies on the documented behaviour of
> > > > synchronize_sched(), it doesn't try to pair synchronize_sched() with
> > > > barrier.
> > > > 
> > > > ...
> > > >
> > > >  include/linux/percpu-rwsem.h |   83 +
> > > >  lib/Makefile |2 +-
> > > >  lib/percpu-rwsem.c   |  123 
> > > > ++
> > > 
> > > The patch also uninlines everything.
> > > 
> > > And it didn't export the resulting symbols to modules, so it isn't an
> > > equivalent.  We can export thing later if needed I guess.
> > > 
> > > It adds percpu-rwsem.o to lib-y, so the CONFIG_BLOCK=n kernel will
> > > avoid including the code altogether, methinks?
> > > 
> > > >
> > > > ...
> > > >
> > > > --- /dev/null
> > > > +++ b/lib/percpu-rwsem.c
> > > > @@ -0,0 +1,123 @@
> > > 
> > > That was nice and terse ;)
> > > 
> > > > +#include <linux/percpu-rwsem.h>
> > > > +#include <linux/rcupdate.h>
> > > > +#include <linux/sched.h>
> > > 
> > > This list is nowhere near sufficient to support this file's
> > > requirements.  atomic.h, percpu.h, rwsem.h, wait.h, errno.h and plenty
> > > more.  IOW, if it compiles, it was sheer luck.
> > > 
> > > > +int percpu_init_rwsem(struct percpu_rw_semaphore *brw)
> > > > +{
> > > > +   brw->fast_read_ctr = alloc_percpu(int);
> > > > +   if (unlikely(!brw->fast_read_ctr))
> > > > +   return -ENOMEM;
> > > > +
> > > > +   mutex_init(&brw->writer_mutex);
> > > > +   init_rwsem(&brw->rw_sem);
> > > > +   atomic_set(&brw->slow_read_ctr, 0);
> > > > +   init_waitqueue_head(&brw->write_waitq);
> > > > +   return 0;
> > > > +}
> > > > +
> > > > +void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
> > > > +{
> > > > +   free_percpu(brw->fast_read_ctr);
> > > > +   brw->fast_read_ctr = NULL; /* catch use after free bugs */
> > > > +}
> > > > +
> > > > +static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned 
> > > > int val)
> > > > +{
> > > > +   bool success = false;
> > > > +
> > > > +   preempt_disable();
> > > > +   if (likely(!mutex_is_locked(&brw->writer_mutex))) {
> > > > +   __this_cpu_add(*brw->fast_read_ctr, val);
> > > > +   success = true;
> > > > +   }
> > > > +   preempt_enable();
> > > > +
> > > > +   return success;
> > > > +}
> > > > +
> > > > +/*
> > > > + * Like the normal down_read() this is not recursive, the writer can
> > > > + * come after the first percpu_down_read() and create the deadlock.
> > > > + */
> > > > +void percpu_down_read(struct percpu_rw_semaphore *brw)
> > > > +{
> > > > +   if (likely(update_fast_ctr(brw, +1)))
> > > > +   return;
> > > > +
> > > > +   down_read(&brw->rw_sem);
> > > > +   atomic_inc(&brw->slow_read_ctr);
> > > > +   up_read(&brw->rw_sem);
> > > > +}
> > > > +
> > > > +void percpu_up_read(struct percpu_rw_semaphore *brw)
> > > > +{
> > > > +   if (likely(update_fast_ctr(brw, -1)))
> > > > +   return;
> > > > +
> > > > +   /* false-positive is possible but harmless */
> > > > +   if (atomic_dec_and_test(&brw->slow_read_ctr))
> > > > +   wake_up_all(&brw->write_waitq);
> > > > +}
> > > > +
> > > > +static int clear_fast_ctr(struct percpu_rw_semaphore *brw)
> > > > +{
> > > > +   unsigned int sum = 0;
> > > > +   int cpu;
> > > > +
> > > > +   for_each_possible_cpu(cpu) {
> > > > +   sum += per_cpu(*brw->fast_read_ctr, cpu);
> > > > +   per_cpu(*brw->fast_read_ctr, cpu) = 0;
> > > > +   }
> > > > +
> > > > +   return sum;
> > > > +}
> > > > +
> > > > +/*
> > > > + * A writer takes ->writer_mutex to exclude other writers and to force 
> > > > the
> > > > + * readers to switch to the slow mode, note the mutex_is_locked() 
> > > > check in
> > > > + * update_fast_ctr().
> > > > + *
> > > > + * After that the readers can only inc/dec the slow ->slow_read_ctr 
> > > > counter,
> > > > + * ->fast_read_ctr is stable. Once the writer moves its sum into the 
> > > > slow
> > > > + * counter it 

Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily

2012-11-08 Thread Mikulas Patocka


On Thu, 8 Nov 2012, Paul E. McKenney wrote:

> On Thu, Nov 08, 2012 at 12:07:00PM -0800, Andrew Morton wrote:
> > On Thu, 8 Nov 2012 14:48:49 +0100
> > Oleg Nesterov  wrote:
> > 
> > > Currently the writer does msleep() plus synchronize_sched() 3 times
> > > to acquire/release the semaphore, and during this time the readers
> > > are blocked completely. Even if the "write" section was not actually
> > > started or if it was already finished.
> > > 
> > > With this patch down_write/up_write does synchronize_sched() twice
> > > and down_read/up_read are still possible during this time, just they
> > > use the slow path.
> > > 
> > > percpu_down_write() first forces the readers to use rw_semaphore and
> > > increment the "slow" counter to take the lock for reading, then it
> > > takes that rw_semaphore for writing and blocks the readers.
> > > 
> > > Also. With this patch the code relies on the documented behaviour of
> > > synchronize_sched(), it doesn't try to pair synchronize_sched() with
> > > barrier.
> > > 
> > > ...
> > >
> > >  include/linux/percpu-rwsem.h |   83 +
> > >  lib/Makefile |2 +-
> > >  lib/percpu-rwsem.c   |  123 
> > > ++
> > 
> > The patch also uninlines everything.
> > 
> > And it didn't export the resulting symbols to modules, so it isn't an
> > equivalent.  We can export thing later if needed I guess.
> > 
> > It adds percpu-rwsem.o to lib-y, so the CONFIG_BLOCK=n kernel will
> > avoid including the code altogether, methinks?
> > 
> > >
> > > ...
> > >
> > > --- /dev/null
> > > +++ b/lib/percpu-rwsem.c
> > > @@ -0,0 +1,123 @@
> > 
> > That was nice and terse ;)
> > 
> > > +#include <linux/percpu-rwsem.h>
> > > +#include <linux/rcupdate.h>
> > > +#include <linux/sched.h>
> > 
> > This list is nowhere near sufficient to support this file's
> > requirements.  atomic.h, percpu.h, rwsem.h, wait.h, errno.h and plenty
> > more.  IOW, if it compiles, it was sheer luck.
> > 
> > > +int percpu_init_rwsem(struct percpu_rw_semaphore *brw)
> > > +{
> > > + brw->fast_read_ctr = alloc_percpu(int);
> > > + if (unlikely(!brw->fast_read_ctr))
> > > + return -ENOMEM;
> > > +
> > > + mutex_init(&brw->writer_mutex);
> > > + init_rwsem(&brw->rw_sem);
> > > + atomic_set(&brw->slow_read_ctr, 0);
> > > + init_waitqueue_head(&brw->write_waitq);
> > > + return 0;
> > > +}
> > > +
> > > +void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
> > > +{
> > > + free_percpu(brw->fast_read_ctr);
> > > + brw->fast_read_ctr = NULL; /* catch use after free bugs */
> > > +}
> > > +
> > > +static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned 
> > > int val)
> > > +{
> > > + bool success = false;
> > > +
> > > + preempt_disable();
> > > + if (likely(!mutex_is_locked(&brw->writer_mutex))) {
> > > + __this_cpu_add(*brw->fast_read_ctr, val);
> > > + success = true;
> > > + }
> > > + preempt_enable();
> > > +
> > > + return success;
> > > +}
> > > +
> > > +/*
> > > + * Like the normal down_read() this is not recursive, the writer can
> > > + * come after the first percpu_down_read() and create the deadlock.
> > > + */
> > > +void percpu_down_read(struct percpu_rw_semaphore *brw)
> > > +{
> > > + if (likely(update_fast_ctr(brw, +1)))
> > > + return;
> > > +
> > > + down_read(&brw->rw_sem);
> > > + atomic_inc(&brw->slow_read_ctr);
> > > + up_read(&brw->rw_sem);
> > > +}
> > > +
> > > +void percpu_up_read(struct percpu_rw_semaphore *brw)
> > > +{
> > > + if (likely(update_fast_ctr(brw, -1)))
> > > + return;
> > > +
> > > + /* false-positive is possible but harmless */
> > > + if (atomic_dec_and_test(&brw->slow_read_ctr))
> > > + wake_up_all(&brw->write_waitq);
> > > +}
> > > +
> > > +static int clear_fast_ctr(struct percpu_rw_semaphore *brw)
> > > +{
> > > + unsigned int sum = 0;
> > > + int cpu;
> > > +
> > > + for_each_possible_cpu(cpu) {
> > > + sum += per_cpu(*brw->fast_read_ctr, cpu);
> > > + per_cpu(*brw->fast_read_ctr, cpu) = 0;
> > > + }
> > > +
> > > + return sum;
> > > +}
> > > +
> > > +/*
> > > + * A writer takes ->writer_mutex to exclude other writers and to force 
> > > the
> > > + * readers to switch to the slow mode, note the mutex_is_locked() check 
> > > in
> > > + * update_fast_ctr().
> > > + *
> > > + * After that the readers can only inc/dec the slow ->slow_read_ctr 
> > > counter,
> > > + * ->fast_read_ctr is stable. Once the writer moves its sum into the slow
> > > + * counter it represents the number of active readers.
> > > + *
> > > + * Finally the writer takes ->rw_sem for writing and blocks the new 
> > > readers,
> > > + * then waits until the slow counter becomes zero.
> > > + */
> > 
> > Some overview of how fast/slow_read_ctr are supposed to work would be
> > useful.  This comment seems to assume that the reader already knew
> > that.
> > 
> > > +void percpu_down_write(struct percpu_rw_semaphore *brw)
> > > +{
> > > + /* also blocks update_fast_ctr() which checks mutex_is_locked() */
> > > + mutex_lock(&brw->writer_mutex);
> > > 

Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily

2012-11-08 Thread Paul E. McKenney
On Thu, Nov 08, 2012 at 12:07:00PM -0800, Andrew Morton wrote:
> On Thu, 8 Nov 2012 14:48:49 +0100
> Oleg Nesterov  wrote:
> 
> > Currently the writer does msleep() plus synchronize_sched() 3 times
> > to acquire/release the semaphore, and during this time the readers
> > are blocked completely. Even if the "write" section was not actually
> > started or if it was already finished.
> > 
> > With this patch down_write/up_write does synchronize_sched() twice
> > and down_read/up_read are still possible during this time, just they
> > use the slow path.
> > 
> > percpu_down_write() first forces the readers to use rw_semaphore and
> > increment the "slow" counter to take the lock for reading, then it
> > takes that rw_semaphore for writing and blocks the readers.
> > 
> > Also. With this patch the code relies on the documented behaviour of
> > synchronize_sched(), it doesn't try to pair synchronize_sched() with
> > barrier.
> > 
> > ...
> >
> >  include/linux/percpu-rwsem.h |   83 +
> >  lib/Makefile |2 +-
> >  lib/percpu-rwsem.c   |  123 
> > ++
> 
> The patch also uninlines everything.
> 
> And it didn't export the resulting symbols to modules, so it isn't an
> equivalent.  We can export thing later if needed I guess.
> 
> It adds percpu-rwsem.o to lib-y, so the CONFIG_BLOCK=n kernel will
> avoid including the code altogether, methinks?
> 
> >
> > ...
> >
> > --- /dev/null
> > +++ b/lib/percpu-rwsem.c
> > @@ -0,0 +1,123 @@
> 
> That was nice and terse ;)
> 
> > +#include <linux/percpu-rwsem.h>
> > +#include <linux/rcupdate.h>
> > +#include <linux/sched.h>
> 
> This list is nowhere near sufficient to support this file's
> requirements.  atomic.h, percpu.h, rwsem.h, wait.h, errno.h and plenty
> more.  IOW, if it compiles, it was sheer luck.
> 
> > +int percpu_init_rwsem(struct percpu_rw_semaphore *brw)
> > +{
> > +   brw->fast_read_ctr = alloc_percpu(int);
> > +   if (unlikely(!brw->fast_read_ctr))
> > +   return -ENOMEM;
> > +
> > +   mutex_init(&brw->writer_mutex);
> > +   init_rwsem(&brw->rw_sem);
> > +   atomic_set(&brw->slow_read_ctr, 0);
> > +   init_waitqueue_head(&brw->write_waitq);
> > +   return 0;
> > +}
> > +
> > +void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
> > +{
> > +   free_percpu(brw->fast_read_ctr);
> > +   brw->fast_read_ctr = NULL; /* catch use after free bugs */
> > +}
> > +
> > +static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int 
> > val)
> > +{
> > +   bool success = false;
> > +
> > +   preempt_disable();
> > +   if (likely(!mutex_is_locked(&brw->writer_mutex))) {
> > +   __this_cpu_add(*brw->fast_read_ctr, val);
> > +   success = true;
> > +   }
> > +   preempt_enable();
> > +
> > +   return success;
> > +}
> > +
> > +/*
> > + * Like the normal down_read() this is not recursive, the writer can
> > + * come after the first percpu_down_read() and create the deadlock.
> > + */
> > +void percpu_down_read(struct percpu_rw_semaphore *brw)
> > +{
> > +   if (likely(update_fast_ctr(brw, +1)))
> > +   return;
> > +
> > +   down_read(&brw->rw_sem);
> > +   atomic_inc(&brw->slow_read_ctr);
> > +   up_read(&brw->rw_sem);
> > +}
> > +
> > +void percpu_up_read(struct percpu_rw_semaphore *brw)
> > +{
> > +   if (likely(update_fast_ctr(brw, -1)))
> > +   return;
> > +
> > +   /* false-positive is possible but harmless */
> > +   if (atomic_dec_and_test(&brw->slow_read_ctr))
> > +   wake_up_all(&brw->write_waitq);
> > +}
> > +
> > +static int clear_fast_ctr(struct percpu_rw_semaphore *brw)
> > +{
> > +   unsigned int sum = 0;
> > +   int cpu;
> > +
> > +   for_each_possible_cpu(cpu) {
> > +   sum += per_cpu(*brw->fast_read_ctr, cpu);
> > +   per_cpu(*brw->fast_read_ctr, cpu) = 0;
> > +   }
> > +
> > +   return sum;
> > +}
> > +
> > +/*
> > + * A writer takes ->writer_mutex to exclude other writers and to force the
> > + * readers to switch to the slow mode, note the mutex_is_locked() check in
> > + * update_fast_ctr().
> > + *
> > + * After that the readers can only inc/dec the slow ->slow_read_ctr 
> > counter,
> > + * ->fast_read_ctr is stable. Once the writer moves its sum into the slow
> > + * counter it represents the number of active readers.
> > + *
> > + * Finally the writer takes ->rw_sem for writing and blocks the new 
> > readers,
> > + * then waits until the slow counter becomes zero.
> > + */
> 
> Some overview of how fast/slow_read_ctr are supposed to work would be
> useful.  This comment seems to assume that the reader already knew
> that.
> 
> > +void percpu_down_write(struct percpu_rw_semaphore *brw)
> > +{
> > +   /* also blocks update_fast_ctr() which checks mutex_is_locked() */
> > +   mutex_lock(&brw->writer_mutex);
> > +
> > +   /*
> > +* 1. Ensures mutex_is_locked() is visible to any down_read/up_read
> > +*so that update_fast_ctr() can't succeed.
> > +*
> > +* 2. Ensures we see the result of every previous this_cpu_add() in
> > +*update_fast_ctr().
> > +*
> > +

Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily

2012-11-08 Thread Andrew Morton
On Thu, 8 Nov 2012 14:48:49 +0100
Oleg Nesterov  wrote:

> Currently the writer does msleep() plus synchronize_sched() 3 times
> to acquire/release the semaphore, and during this time the readers
> are blocked completely. Even if the "write" section was not actually
> started or if it was already finished.
> 
> With this patch down_write/up_write does synchronize_sched() twice
> and down_read/up_read are still possible during this time, just they
> use the slow path.
> 
> percpu_down_write() first forces the readers to use rw_semaphore and
> increment the "slow" counter to take the lock for reading, then it
> takes that rw_semaphore for writing and blocks the readers.
> 
> Also. With this patch the code relies on the documented behaviour of
> synchronize_sched(), it doesn't try to pair synchronize_sched() with
> barrier.
> 
> ...
>
>  include/linux/percpu-rwsem.h |   83 +
>  lib/Makefile |2 +-
>  lib/percpu-rwsem.c   |  123 
> ++

The patch also uninlines everything.

And it didn't export the resulting symbols to modules, so it isn't an
equivalent.  We can export thing later if needed I guess.

It adds percpu-rwsem.o to lib-y, so the CONFIG_BLOCK=n kernel will
avoid including the code altogether, methinks?

>
> ...
>
> --- /dev/null
> +++ b/lib/percpu-rwsem.c
> @@ -0,0 +1,123 @@

That was nice and terse ;)

> +#include <linux/percpu-rwsem.h>
> +#include <linux/rcupdate.h>
> +#include <linux/sched.h>

This list is nowhere near sufficient to support this file's
requirements.  atomic.h, percpu.h, rwsem.h, wait.h, errno.h and plenty
more.  IOW, if it compiles, it was sheer luck.
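
Concretely, the sort of explicit include list being asked for here would look
something like the block below (exactly which headers end up mandatory depends
on the final code, so this is only a sketch):

	#include <linux/atomic.h>
	#include <linux/errno.h>
	#include <linux/mutex.h>
	#include <linux/percpu.h>
	#include <linux/percpu-rwsem.h>
	#include <linux/rcupdate.h>
	#include <linux/rwsem.h>
	#include <linux/sched.h>
	#include <linux/wait.h>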

> +int percpu_init_rwsem(struct percpu_rw_semaphore *brw)
> +{
> + brw->fast_read_ctr = alloc_percpu(int);
> + if (unlikely(!brw->fast_read_ctr))
> + return -ENOMEM;
> +
> + mutex_init(&brw->writer_mutex);
> + init_rwsem(&brw->rw_sem);
> + atomic_set(&brw->slow_read_ctr, 0);
> + init_waitqueue_head(&brw->write_waitq);
> + return 0;
> +}
> +
> +void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
> +{
> + free_percpu(brw->fast_read_ctr);
> + brw->fast_read_ctr = NULL; /* catch use after free bugs */
> +}
> +
> +static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int 
> val)
> +{
> + bool success = false;
> +
> + preempt_disable();
> + if (likely(!mutex_is_locked(&brw->writer_mutex))) {
> + __this_cpu_add(*brw->fast_read_ctr, val);
> + success = true;
> + }
> + preempt_enable();
> +
> + return success;
> +}
> +
> +/*
> + * Like the normal down_read() this is not recursive, the writer can
> + * come after the first percpu_down_read() and create the deadlock.
> + */
> +void percpu_down_read(struct percpu_rw_semaphore *brw)
> +{
> + if (likely(update_fast_ctr(brw, +1)))
> + return;
> +
> + down_read(&brw->rw_sem);
> + atomic_inc(&brw->slow_read_ctr);
> + up_read(&brw->rw_sem);
> +}
> +
> +void percpu_up_read(struct percpu_rw_semaphore *brw)
> +{
> + if (likely(update_fast_ctr(brw, -1)))
> + return;
> +
> + /* false-positive is possible but harmless */
> + if (atomic_dec_and_test(&brw->slow_read_ctr))
> + wake_up_all(&brw->write_waitq);
> +}
> +
> +static int clear_fast_ctr(struct percpu_rw_semaphore *brw)
> +{
> + unsigned int sum = 0;
> + int cpu;
> +
> + for_each_possible_cpu(cpu) {
> + sum += per_cpu(*brw->fast_read_ctr, cpu);
> + per_cpu(*brw->fast_read_ctr, cpu) = 0;
> + }
> +
> + return sum;
> +}
> +
> +/*
> + * A writer takes ->writer_mutex to exclude other writers and to force the
> + * readers to switch to the slow mode, note the mutex_is_locked() check in
> + * update_fast_ctr().
> + *
> + * After that the readers can only inc/dec the slow ->slow_read_ctr counter,
> + * ->fast_read_ctr is stable. Once the writer moves its sum into the slow
> + * counter it represents the number of active readers.
> + *
> + * Finally the writer takes ->rw_sem for writing and blocks the new readers,
> + * then waits until the slow counter becomes zero.
> + */

Some overview of how fast/slow_read_ctr are supposed to work would be
useful.  This comment seems to assume that the reader already knew
that.
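
As a sketch of the overview being asked for (a summary of the scheme as it is
described elsewhere in this thread, not the author's wording):

	/*
	 * Intended counter interplay:
	 *
	 *  - No writer: readers only this_cpu_add() +/-1 to fast_read_ctr
	 *    inside preempt_disable(); no shared cacheline is written.
	 *
	 *  - Writer: takes writer_mutex, then synchronize_sched() guarantees
	 *    that every reader either already saw the locked mutex (and will
	 *    use the slow path) or has finished its fast-path update.
	 *
	 *  - Hand-off: clear_fast_ctr() folds the per-CPU deltas into
	 *    slow_read_ctr, which from then on equals the number of readers
	 *    currently inside their critical sections.
	 *
	 *  - Slow path: readers briefly take rw_sem for reading just to
	 *    inc/dec slow_read_ctr; the writer holds rw_sem for writing and
	 *    sleeps on write_waitq until slow_read_ctr reaches zero.
	 */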

> +void percpu_down_write(struct percpu_rw_semaphore *brw)
> +{
> + /* also blocks update_fast_ctr() which checks mutex_is_locked() */
> + mutex_lock(&brw->writer_mutex);
> +
> + /*
> +  * 1. Ensures mutex_is_locked() is visible to any down_read/up_read
> +  *so that update_fast_ctr() can't succeed.
> +  *
> +  * 2. Ensures we see the result of every previous this_cpu_add() in
> +  *update_fast_ctr().
> +  *
> +  * 3. Ensures that if any reader has exited its critical section via
> +  *fast-path, it executes a full memory barrier before we return.
> +  */
> + synchronize_sched();

Here's where I get horridly confused.  Your patch completely deRCUifies
this code, yes?  Yet here 

[PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily

2012-11-08 Thread Oleg Nesterov
Currently the writer does msleep() plus synchronize_sched() 3 times
to acquire/release the semaphore, and during this time the readers
are blocked completely. Even if the "write" section was not actually
started or if it was already finished.

With this patch down_write/up_write does synchronize_sched() twice
and down_read/up_read are still possible during this time, just they
use the slow path.

percpu_down_write() first forces the readers to use rw_semaphore and
increment the "slow" counter to take the lock for reading, then it
takes that rw_semaphore for writing and blocks the readers.

Also. With this patch the code relies on the documented behaviour of
synchronize_sched(), it doesn't try to pair synchronize_sched() with
barrier.

Signed-off-by: Oleg Nesterov 
Reviewed-by: Paul E. McKenney 
---
 include/linux/percpu-rwsem.h |   83 +
 lib/Makefile |2 +-
 lib/percpu-rwsem.c   |  123 ++
 3 files changed, 137 insertions(+), 71 deletions(-)
 create mode 100644 lib/percpu-rwsem.c

diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index 250a4ac..592f0d6 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -2,82 +2,25 @@
 #define _LINUX_PERCPU_RWSEM_H
 
 #include <linux/mutex.h>
+#include <linux/rwsem.h>
 #include <linux/percpu.h>
-#include <linux/rcupdate.h>
-#include <linux/delay.h>
+#include <linux/wait.h>
 
 struct percpu_rw_semaphore {
-   unsigned __percpu *counters;
-   bool locked;
-   struct mutex mtx;
+   unsigned int __percpu   *fast_read_ctr;
+   struct mutexwriter_mutex;
+   struct rw_semaphore rw_sem;
+   atomic_tslow_read_ctr;
+   wait_queue_head_t   write_waitq;
 };
 
-#define light_mb() barrier()
-#define heavy_mb() synchronize_sched()
+extern void percpu_down_read(struct percpu_rw_semaphore *);
+extern void percpu_up_read(struct percpu_rw_semaphore *);
 
-static inline void percpu_down_read(struct percpu_rw_semaphore *p)
-{
-   rcu_read_lock_sched();
-   if (unlikely(p->locked)) {
-   rcu_read_unlock_sched();
-   mutex_lock(&p->mtx);
-   this_cpu_inc(*p->counters);
-   mutex_unlock(&p->mtx);
-   return;
-   }
-   this_cpu_inc(*p->counters);
-   rcu_read_unlock_sched();
-   light_mb(); /* A, between read of p->locked and read of data, paired 
with D */
-}
+extern void percpu_down_write(struct percpu_rw_semaphore *);
+extern void percpu_up_write(struct percpu_rw_semaphore *);
 
-static inline void percpu_up_read(struct percpu_rw_semaphore *p)
-{
-   light_mb(); /* B, between read of the data and write to p->counter, 
paired with C */
-   this_cpu_dec(*p->counters);
-}
-
-static inline unsigned __percpu_count(unsigned __percpu *counters)
-{
-   unsigned total = 0;
-   int cpu;
-
-   for_each_possible_cpu(cpu)
-   total += ACCESS_ONCE(*per_cpu_ptr(counters, cpu));
-
-   return total;
-}
-
-static inline void percpu_down_write(struct percpu_rw_semaphore *p)
-{
-   mutex_lock(&p->mtx);
-   p->locked = true;
-   synchronize_sched(); /* make sure that all readers exit the 
rcu_read_lock_sched region */
-   while (__percpu_count(p->counters))
-   msleep(1);
-   heavy_mb(); /* C, between read of p->counter and write to data, paired 
with B */
-}
-
-static inline void percpu_up_write(struct percpu_rw_semaphore *p)
-{
-   heavy_mb(); /* D, between write to data and write to p->locked, paired 
with A */
-   p->locked = false;
-   mutex_unlock(&p->mtx);
-}
-
-static inline int percpu_init_rwsem(struct percpu_rw_semaphore *p)
-{
-   p->counters = alloc_percpu(unsigned);
-   if (unlikely(!p->counters))
-   return -ENOMEM;
-   p->locked = false;
-   mutex_init(&p->mtx);
-   return 0;
-}
-
-static inline void percpu_free_rwsem(struct percpu_rw_semaphore *p)
-{
-   free_percpu(p->counters);
-   p->counters = NULL; /* catch use after free bugs */
-}
+extern int percpu_init_rwsem(struct percpu_rw_semaphore *);
+extern void percpu_free_rwsem(struct percpu_rw_semaphore *);
 
 #endif
diff --git a/lib/Makefile b/lib/Makefile
index 821a162..4dad4a7 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
 idr.o int_sqrt.o extable.o \
 sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
 proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
-is_single_threaded.o plist.o decompress.o
+is_single_threaded.o plist.o decompress.o percpu-rwsem.o
 
 lib-$(CONFIG_MMU) += ioremap.o
 lib-$(CONFIG_SMP) += cpumask.o
diff --git a/lib/percpu-rwsem.c b/lib/percpu-rwsem.c
new file mode 100644
index 000..0e3bc0f
--- /dev/null
+++ b/lib/percpu-rwsem.c
@@ -0,0 +1,123 @@
+#include <linux/percpu-rwsem.h>
+#include <linux/rcupdate.h>
+#include <linux/sched.h>
+
+int percpu_init_rwsem(struct percpu_rw_semaphore *brw)
+{
+   brw->fast_read_ctr = 

[PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily

2012-11-08 Thread Oleg Nesterov
Currently the writer does msleep() plus synchronize_sched() 3 times
to acquire/release the semaphore, and during this time the readers
are blocked completely. Even if the write section was not actually
started or if it was already finished.

With this patch down_write/up_write does synchronize_sched() twice
and down_read/up_read are still possible during this time, just they
use the slow path.

percpu_down_write() first forces the readers to use rw_semaphore and
increment the slow counter to take the lock for reading, then it
takes that rw_semaphore for writing and blocks the readers.

Also. With this patch the code relies on the documented behaviour of
synchronize_sched(), it doesn't try to pair synchronize_sched() with
barrier.

Signed-off-by: Oleg Nesterov o...@redhat.com
Reviewed-by: Paul E. McKenney paul...@linux.vnet.ibm.com
---
 include/linux/percpu-rwsem.h |   83 +
 lib/Makefile |2 +-
 lib/percpu-rwsem.c   |  123 ++
 3 files changed, 137 insertions(+), 71 deletions(-)
 create mode 100644 lib/percpu-rwsem.c

diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index 250a4ac..592f0d6 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -2,82 +2,25 @@
 #define _LINUX_PERCPU_RWSEM_H
 
 #include linux/mutex.h
+#include linux/rwsem.h
 #include linux/percpu.h
-#include linux/rcupdate.h
-#include linux/delay.h
+#include linux/wait.h
 
 struct percpu_rw_semaphore {
-   unsigned __percpu *counters;
-   bool locked;
-   struct mutex mtx;
+   unsigned int __percpu   *fast_read_ctr;
+   struct mutexwriter_mutex;
+   struct rw_semaphore rw_sem;
+   atomic_tslow_read_ctr;
+   wait_queue_head_t   write_waitq;
 };
 
-#define light_mb() barrier()
-#define heavy_mb() synchronize_sched()
+extern void percpu_down_read(struct percpu_rw_semaphore *);
+extern void percpu_up_read(struct percpu_rw_semaphore *);
 
-static inline void percpu_down_read(struct percpu_rw_semaphore *p)
-{
-   rcu_read_lock_sched();
-   if (unlikely(p-locked)) {
-   rcu_read_unlock_sched();
-   mutex_lock(p-mtx);
-   this_cpu_inc(*p-counters);
-   mutex_unlock(p-mtx);
-   return;
-   }
-   this_cpu_inc(*p-counters);
-   rcu_read_unlock_sched();
-   light_mb(); /* A, between read of p-locked and read of data, paired 
with D */
-}
+extern void percpu_down_write(struct percpu_rw_semaphore *);
+extern void percpu_up_write(struct percpu_rw_semaphore *);
 
-static inline void percpu_up_read(struct percpu_rw_semaphore *p)
-{
-   light_mb(); /* B, between read of the data and write to p-counter, 
paired with C */
-   this_cpu_dec(*p-counters);
-}
-
-static inline unsigned __percpu_count(unsigned __percpu *counters)
-{
-   unsigned total = 0;
-   int cpu;
-
-   for_each_possible_cpu(cpu)
-   total += ACCESS_ONCE(*per_cpu_ptr(counters, cpu));
-
-   return total;
-}
-
-static inline void percpu_down_write(struct percpu_rw_semaphore *p)
-{
-   mutex_lock(p-mtx);
-   p-locked = true;
-   synchronize_sched(); /* make sure that all readers exit the 
rcu_read_lock_sched region */
-   while (__percpu_count(p-counters))
-   msleep(1);
-   heavy_mb(); /* C, between read of p-counter and write to data, paired 
with B */
-}
-
-static inline void percpu_up_write(struct percpu_rw_semaphore *p)
-{
-   heavy_mb(); /* D, between write to data and write to p-locked, paired 
with A */
-   p-locked = false;
-   mutex_unlock(p-mtx);
-}
-
-static inline int percpu_init_rwsem(struct percpu_rw_semaphore *p)
-{
-   p-counters = alloc_percpu(unsigned);
-   if (unlikely(!p-counters))
-   return -ENOMEM;
-   p-locked = false;
-   mutex_init(p-mtx);
-   return 0;
-}
-
-static inline void percpu_free_rwsem(struct percpu_rw_semaphore *p)
-{
-   free_percpu(p-counters);
-   p-counters = NULL; /* catch use after free bugs */
-}
+extern int percpu_init_rwsem(struct percpu_rw_semaphore *);
+extern void percpu_free_rwsem(struct percpu_rw_semaphore *);
 
 #endif
diff --git a/lib/Makefile b/lib/Makefile
index 821a162..4dad4a7 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
 idr.o int_sqrt.o extable.o \
 sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
 proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
-is_single_threaded.o plist.o decompress.o
+is_single_threaded.o plist.o decompress.o percpu-rwsem.o
 
 lib-$(CONFIG_MMU) += ioremap.o
 lib-$(CONFIG_SMP) += cpumask.o
diff --git a/lib/percpu-rwsem.c b/lib/percpu-rwsem.c
new file mode 100644
index 000..0e3bc0f
--- /dev/null
+++ b/lib/percpu-rwsem.c
@@ -0,0 +1,123 @@
+#include linux/percpu-rwsem.h

Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily

2012-11-08 Thread Andrew Morton
On Thu, 8 Nov 2012 14:48:49 +0100
Oleg Nesterov o...@redhat.com wrote:

 Currently the writer does msleep() plus synchronize_sched() 3 times
 to acquire/release the semaphore, and during this time the readers
 are blocked completely. Even if the write section was not actually
 started or if it was already finished.
 
 With this patch down_write/up_write does synchronize_sched() twice
 and down_read/up_read are still possible during this time, just they
 use the slow path.
 
 percpu_down_write() first forces the readers to use rw_semaphore and
 increment the slow counter to take the lock for reading, then it
 takes that rw_semaphore for writing and blocks the readers.
 
 Also. With this patch the code relies on the documented behaviour of
 synchronize_sched(), it doesn't try to pair synchronize_sched() with
 barrier.
 
 ...

  include/linux/percpu-rwsem.h |   83 +
  lib/Makefile |2 +-
  lib/percpu-rwsem.c   |  123 
 ++

The patch also uninlines everything.

And it didn't export the resulting symbols to modules, so it isn't an
equivalent.  We can export thing later if needed I guess.

It adds percpu-rwsem.o to lib-y, so the CONFIG_BLOCK=n kernel will
avoid including the code altogether, methinks?


 ...

 --- /dev/null
 +++ b/lib/percpu-rwsem.c
 @@ -0,0 +1,123 @@

That was nice and terse ;)

 +#include linux/percpu-rwsem.h
 +#include linux/rcupdate.h
 +#include linux/sched.h

This list is nowhere near sufficient to support this file's
requirements.  atomic.h, percpu.h, rwsem.h, wait.h, errno.h and plenty
more.  IOW, if it compiles, it was sheer luck.

 +int percpu_init_rwsem(struct percpu_rw_semaphore *brw)
 +{
 + brw-fast_read_ctr = alloc_percpu(int);
 + if (unlikely(!brw-fast_read_ctr))
 + return -ENOMEM;
 +
 + mutex_init(brw-writer_mutex);
 + init_rwsem(brw-rw_sem);
 + atomic_set(brw-slow_read_ctr, 0);
 + init_waitqueue_head(brw-write_waitq);
 + return 0;
 +}
 +
 +void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
 +{
 + free_percpu(brw-fast_read_ctr);
 + brw-fast_read_ctr = NULL; /* catch use after free bugs */
 +}
 +
 +static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int 
 val)
 +{
 + bool success = false;
 +
 + preempt_disable();
 + if (likely(!mutex_is_locked(brw-writer_mutex))) {
 + __this_cpu_add(*brw-fast_read_ctr, val);
 + success = true;
 + }
 + preempt_enable();
 +
 + return success;
 +}
 +
 +/*
 + * Like the normal down_read() this is not recursive, the writer can
 + * come after the first percpu_down_read() and create the deadlock.
 + */
 +void percpu_down_read(struct percpu_rw_semaphore *brw)
 +{
 + if (likely(update_fast_ctr(brw, +1)))
 + return;
 +
 + down_read(brw-rw_sem);
 + atomic_inc(brw-slow_read_ctr);
 + up_read(brw-rw_sem);
 +}
 +
 +void percpu_up_read(struct percpu_rw_semaphore *brw)
 +{
 + if (likely(update_fast_ctr(brw, -1)))
 + return;
 +
 + /* false-positive is possible but harmless */
 + if (atomic_dec_and_test(brw-slow_read_ctr))
 + wake_up_all(brw-write_waitq);
 +}
 +
 +static int clear_fast_ctr(struct percpu_rw_semaphore *brw)
 +{
 + unsigned int sum = 0;
 + int cpu;
 +
 + for_each_possible_cpu(cpu) {
 + sum += per_cpu(*brw-fast_read_ctr, cpu);
 + per_cpu(*brw-fast_read_ctr, cpu) = 0;
 + }
 +
 + return sum;
 +}
 +
 +/*
 + * A writer takes -writer_mutex to exclude other writers and to force the
 + * readers to switch to the slow mode, note the mutex_is_locked() check in
 + * update_fast_ctr().
 + *
 + * After that the readers can only inc/dec the slow -slow_read_ctr counter,
 + * -fast_read_ctr is stable. Once the writer moves its sum into the slow
 + * counter it represents the number of active readers.
 + *
 + * Finally the writer takes -rw_sem for writing and blocks the new readers,
 + * then waits until the slow counter becomes zero.
 + */

Some overview of how fast/slow_read_ctr are supposed to work would be
useful.  This comment seems to assume that the reader already knew
that.

 +void percpu_down_write(struct percpu_rw_semaphore *brw)
 +{
 + /* also blocks update_fast_ctr() which checks mutex_is_locked() */
 + mutex_lock(brw-writer_mutex);
 +
 + /*
 +  * 1. Ensures mutex_is_locked() is visible to any down_read/up_read
 +  *so that update_fast_ctr() can't succeed.
 +  *
 +  * 2. Ensures we see the result of every previous this_cpu_add() in
 +  *update_fast_ctr().
 +  *
 +  * 3. Ensures that if any reader has exited its critical section via
 +  *fast-path, it executes a full memory barrier before we return.
 +  */
 + synchronize_sched();

Here's where I get horridly confused.  Your patch completely deRCUifies
this code, yes?  Yet here we're using an RCU primitive.  And we seem to
be 
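
The RCU primitive that causes the confusion here pairs with the preempt_disable()
region in the reader fast path: synchronize_sched() waits for every such region
to finish, which is the "documented behaviour" the changelog refers to. An
annotated sketch of that pairing (the same update_fast_ctr() quoted above, with
comments added):

	static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int val)
	{
		bool success = false;

		preempt_disable();	/* begins a sched-RCU read-side section */
		if (likely(!mutex_is_locked(&brw->writer_mutex))) {
			__this_cpu_add(*brw->fast_read_ctr, val);
			success = true;
		}
		preempt_enable();	/* ends it; the synchronize_sched() in
					 * percpu_down_write() waits for this */

		return success;
	}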

Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily

2012-11-08 Thread Paul E. McKenney
On Thu, Nov 08, 2012 at 12:07:00PM -0800, Andrew Morton wrote:
 On Thu, 8 Nov 2012 14:48:49 +0100
 Oleg Nesterov o...@redhat.com wrote:
 
  Currently the writer does msleep() plus synchronize_sched() 3 times
  to acquire/release the semaphore, and during this time the readers
  are blocked completely. Even if the write section was not actually
  started or if it was already finished.
  
  With this patch down_write/up_write does synchronize_sched() twice
  and down_read/up_read are still possible during this time, just they
  use the slow path.
  
  percpu_down_write() first forces the readers to use rw_semaphore and
  increment the slow counter to take the lock for reading, then it
  takes that rw_semaphore for writing and blocks the readers.
  
  Also. With this patch the code relies on the documented behaviour of
  synchronize_sched(), it doesn't try to pair synchronize_sched() with
  barrier.
  
  ...
 
   include/linux/percpu-rwsem.h |   83 +
   lib/Makefile |2 +-
   lib/percpu-rwsem.c   |  123 
  ++
 
 The patch also uninlines everything.
 
 And it didn't export the resulting symbols to modules, so it isn't an
 equivalent.  We can export thing later if needed I guess.
 
 It adds percpu-rwsem.o to lib-y, so the CONFIG_BLOCK=n kernel will
 avoid including the code altogether, methinks?
 
 
  ...
 
  --- /dev/null
  +++ b/lib/percpu-rwsem.c
  @@ -0,0 +1,123 @@
 
 That was nice and terse ;)
 
  +#include linux/percpu-rwsem.h
  +#include linux/rcupdate.h
  +#include linux/sched.h
 
 This list is nowhere near sufficient to support this file's
 requirements.  atomic.h, percpu.h, rwsem.h, wait.h, errno.h and plenty
 more.  IOW, if it compiles, it was sheer luck.
 
  +int percpu_init_rwsem(struct percpu_rw_semaphore *brw)
  +{
  +   brw-fast_read_ctr = alloc_percpu(int);
  +   if (unlikely(!brw-fast_read_ctr))
  +   return -ENOMEM;
  +
  +   mutex_init(brw-writer_mutex);
  +   init_rwsem(brw-rw_sem);
  +   atomic_set(brw-slow_read_ctr, 0);
  +   init_waitqueue_head(brw-write_waitq);
  +   return 0;
  +}
  +
  +void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
  +{
  +   free_percpu(brw-fast_read_ctr);
  +   brw-fast_read_ctr = NULL; /* catch use after free bugs */
  +}
  +
  +static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int 
  val)
  +{
  +   bool success = false;
  +
  +   preempt_disable();
  +   if (likely(!mutex_is_locked(brw-writer_mutex))) {
  +   __this_cpu_add(*brw-fast_read_ctr, val);
  +   success = true;
  +   }
  +   preempt_enable();
  +
  +   return success;
  +}
  +
  +/*
  + * Like the normal down_read() this is not recursive, the writer can
  + * come after the first percpu_down_read() and create the deadlock.
  + */
  +void percpu_down_read(struct percpu_rw_semaphore *brw)
  +{
  +   if (likely(update_fast_ctr(brw, +1)))
  +   return;
  +
  +   down_read(brw-rw_sem);
  +   atomic_inc(brw-slow_read_ctr);
  +   up_read(brw-rw_sem);
  +}
  +
  +void percpu_up_read(struct percpu_rw_semaphore *brw)
  +{
  +   if (likely(update_fast_ctr(brw, -1)))
  +   return;
  +
  +   /* false-positive is possible but harmless */
  +   if (atomic_dec_and_test(brw-slow_read_ctr))
  +   wake_up_all(brw-write_waitq);
  +}
  +
  +static int clear_fast_ctr(struct percpu_rw_semaphore *brw)
  +{
  +   unsigned int sum = 0;
  +   int cpu;
  +
  +   for_each_possible_cpu(cpu) {
  +   sum += per_cpu(*brw-fast_read_ctr, cpu);
  +   per_cpu(*brw-fast_read_ctr, cpu) = 0;
  +   }
  +
  +   return sum;
  +}
  +
  +/*
  + * A writer takes -writer_mutex to exclude other writers and to force the
  + * readers to switch to the slow mode, note the mutex_is_locked() check in
  + * update_fast_ctr().
  + *
  + * After that the readers can only inc/dec the slow -slow_read_ctr 
  counter,
  + * -fast_read_ctr is stable. Once the writer moves its sum into the slow
  + * counter it represents the number of active readers.
  + *
  + * Finally the writer takes -rw_sem for writing and blocks the new 
  readers,
  + * then waits until the slow counter becomes zero.
  + */
 
 Some overview of how fast/slow_read_ctr are supposed to work would be
 useful.  This comment seems to assume that the reader already knew
 that.
 
  +void percpu_down_write(struct percpu_rw_semaphore *brw)
  +{
  +   /* also blocks update_fast_ctr() which checks mutex_is_locked() */
  +   mutex_lock(brw-writer_mutex);
  +
  +   /*
  +* 1. Ensures mutex_is_locked() is visible to any down_read/up_read
  +*so that update_fast_ctr() can't succeed.
  +*
  +* 2. Ensures we see the result of every previous this_cpu_add() in
  +*update_fast_ctr().
  +*
  +* 3. Ensures that if any reader has exited its critical section via
  +*fast-path, it executes a full memory barrier before we return.
  +*/
  +   synchronize_sched();
 
 Here's where I get 

Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily

2012-11-08 Thread Mikulas Patocka


On Thu, 8 Nov 2012, Paul E. McKenney wrote:

 On Thu, Nov 08, 2012 at 12:07:00PM -0800, Andrew Morton wrote:
  On Thu, 8 Nov 2012 14:48:49 +0100
  Oleg Nesterov o...@redhat.com wrote:
  
   Currently the writer does msleep() plus synchronize_sched() 3 times
   to acquire/release the semaphore, and during this time the readers
   are blocked completely. Even if the write section was not actually
   started or if it was already finished.
   
   With this patch down_write/up_write does synchronize_sched() twice
   and down_read/up_read are still possible during this time, just they
   use the slow path.
   
   percpu_down_write() first forces the readers to use rw_semaphore and
   increment the slow counter to take the lock for reading, then it
   takes that rw_semaphore for writing and blocks the readers.
   
   Also. With this patch the code relies on the documented behaviour of
   synchronize_sched(), it doesn't try to pair synchronize_sched() with
   barrier.
   
   ...
  
include/linux/percpu-rwsem.h |   83 +
lib/Makefile |2 +-
lib/percpu-rwsem.c   |  123 
   ++
  
  The patch also uninlines everything.
  
  And it didn't export the resulting symbols to modules, so it isn't an
  equivalent.  We can export thing later if needed I guess.
  
  It adds percpu-rwsem.o to lib-y, so the CONFIG_BLOCK=n kernel will
  avoid including the code altogether, methinks?
  
  
   ...
  
   --- /dev/null
   +++ b/lib/percpu-rwsem.c
   @@ -0,0 +1,123 @@
  
  That was nice and terse ;)
  
   +#include linux/percpu-rwsem.h
   +#include linux/rcupdate.h
   +#include linux/sched.h
  
  This list is nowhere near sufficient to support this file's
  requirements.  atomic.h, percpu.h, rwsem.h, wait.h, errno.h and plenty
  more.  IOW, if it compiles, it was sheer luck.
  
   +int percpu_init_rwsem(struct percpu_rw_semaphore *brw)
   +{
   + brw-fast_read_ctr = alloc_percpu(int);
   + if (unlikely(!brw-fast_read_ctr))
   + return -ENOMEM;
   +
   + mutex_init(brw-writer_mutex);
   + init_rwsem(brw-rw_sem);
   + atomic_set(brw-slow_read_ctr, 0);
   + init_waitqueue_head(brw-write_waitq);
   + return 0;
   +}
   +
   +void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
   +{
   + free_percpu(brw-fast_read_ctr);
   + brw-fast_read_ctr = NULL; /* catch use after free bugs */
   +}
   +
   +static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned 
   int val)
   +{
   + bool success = false;
   +
   + preempt_disable();
   + if (likely(!mutex_is_locked(brw-writer_mutex))) {
   + __this_cpu_add(*brw-fast_read_ctr, val);
   + success = true;
   + }
   + preempt_enable();
   +
   + return success;
   +}
   +
   +/*
   + * Like the normal down_read() this is not recursive, the writer can
   + * come after the first percpu_down_read() and create the deadlock.
   + */
   +void percpu_down_read(struct percpu_rw_semaphore *brw)
   +{
   + if (likely(update_fast_ctr(brw, +1)))
   + return;
   +
   + down_read(brw-rw_sem);
   + atomic_inc(brw-slow_read_ctr);
   + up_read(brw-rw_sem);
   +}
   +
   +void percpu_up_read(struct percpu_rw_semaphore *brw)
   +{
   + if (likely(update_fast_ctr(brw, -1)))
   + return;
   +
   + /* false-positive is possible but harmless */
   + if (atomic_dec_and_test(brw-slow_read_ctr))
   + wake_up_all(brw-write_waitq);
   +}
   +
   +static int clear_fast_ctr(struct percpu_rw_semaphore *brw)
   +{
   + unsigned int sum = 0;
   + int cpu;
   +
   + for_each_possible_cpu(cpu) {
   + sum += per_cpu(*brw-fast_read_ctr, cpu);
   + per_cpu(*brw-fast_read_ctr, cpu) = 0;
   + }
   +
   + return sum;
   +}
   +
   +/*
   + * A writer takes -writer_mutex to exclude other writers and to force 
   the
   + * readers to switch to the slow mode, note the mutex_is_locked() check 
   in
   + * update_fast_ctr().
   + *
   + * After that the readers can only inc/dec the slow -slow_read_ctr 
   counter,
   + * -fast_read_ctr is stable. Once the writer moves its sum into the slow
   + * counter it represents the number of active readers.
   + *
   + * Finally the writer takes -rw_sem for writing and blocks the new 
   readers,
   + * then waits until the slow counter becomes zero.
   + */
  
  Some overview of how fast/slow_read_ctr are supposed to work would be
  useful.  This comment seems to assume that the reader already knew
  that.
  
   +void percpu_down_write(struct percpu_rw_semaphore *brw)
   +{
   + /* also blocks update_fast_ctr() which checks mutex_is_locked() */
   + mutex_lock(brw-writer_mutex);
   +
   + /*
   +  * 1. Ensures mutex_is_locked() is visible to any down_read/up_read
   +  *so that update_fast_ctr() can't succeed.
   +  *
   +  * 2. Ensures we see the result of every previous this_cpu_add() in
   +  *update_fast_ctr().
   +  *
   +  * 3. Ensures that if any reader has exited its critical section via
   +  *

Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily

2012-11-08 Thread Paul E. McKenney
On Thu, Nov 08, 2012 at 06:41:10PM -0500, Mikulas Patocka wrote:
 
 
 On Thu, 8 Nov 2012, Paul E. McKenney wrote:
 
  On Thu, Nov 08, 2012 at 12:07:00PM -0800, Andrew Morton wrote:
   On Thu, 8 Nov 2012 14:48:49 +0100
   Oleg Nesterov o...@redhat.com wrote:
   
Currently the writer does msleep() plus synchronize_sched() 3 times
to acquire/release the semaphore, and during this time the readers
are blocked completely. Even if the write section was not actually
started or if it was already finished.

With this patch down_write/up_write does synchronize_sched() twice
and down_read/up_read are still possible during this time, just they
use the slow path.

percpu_down_write() first forces the readers to use rw_semaphore and
increment the slow counter to take the lock for reading, then it
takes that rw_semaphore for writing and blocks the readers.

Also. With this patch the code relies on the documented behaviour of
synchronize_sched(), it doesn't try to pair synchronize_sched() with
barrier.

...
   
 include/linux/percpu-rwsem.h |   83 +
 lib/Makefile |2 +-
 lib/percpu-rwsem.c   |  123 
++
   
   The patch also uninlines everything.
   
   And it didn't export the resulting symbols to modules, so it isn't an
   equivalent.  We can export thing later if needed I guess.
   
   It adds percpu-rwsem.o to lib-y, so the CONFIG_BLOCK=n kernel will
   avoid including the code altogether, methinks?
   
   
...
   
--- /dev/null
+++ b/lib/percpu-rwsem.c
@@ -0,0 +1,123 @@
   
   That was nice and terse ;)
   
+#include linux/percpu-rwsem.h
+#include linux/rcupdate.h
+#include linux/sched.h
   
   This list is nowhere near sufficient to support this file's
   requirements.  atomic.h, percpu.h, rwsem.h, wait.h, errno.h and plenty
   more.  IOW, if it compiles, it was sheer luck.
   
+int percpu_init_rwsem(struct percpu_rw_semaphore *brw)
+{
+   brw-fast_read_ctr = alloc_percpu(int);
+   if (unlikely(!brw-fast_read_ctr))
+   return -ENOMEM;
+
+   mutex_init(brw-writer_mutex);
+   init_rwsem(brw-rw_sem);
+   atomic_set(brw-slow_read_ctr, 0);
+   init_waitqueue_head(brw-write_waitq);
+   return 0;
+}
+
+void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
+{
+   free_percpu(brw-fast_read_ctr);
+   brw-fast_read_ctr = NULL; /* catch use after free bugs */
+}
+
+static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned 
int val)
+{
+   bool success = false;
+
+   preempt_disable();
+   if (likely(!mutex_is_locked(brw-writer_mutex))) {
+   __this_cpu_add(*brw-fast_read_ctr, val);
+   success = true;
+   }
+   preempt_enable();
+
+   return success;
+}
+
+/*
+ * Like the normal down_read() this is not recursive, the writer can
+ * come after the first percpu_down_read() and create the deadlock.
+ */
+void percpu_down_read(struct percpu_rw_semaphore *brw)
+{
+   if (likely(update_fast_ctr(brw, +1)))
+   return;
+
+   down_read(brw-rw_sem);
+   atomic_inc(brw-slow_read_ctr);
+   up_read(brw-rw_sem);
+}
+
+void percpu_up_read(struct percpu_rw_semaphore *brw)
+{
+   if (likely(update_fast_ctr(brw, -1)))
+   return;
+
+   /* false-positive is possible but harmless */
+   if (atomic_dec_and_test(brw-slow_read_ctr))
+   wake_up_all(brw-write_waitq);
+}
+
+static int clear_fast_ctr(struct percpu_rw_semaphore *brw)
+{
+   unsigned int sum = 0;
+   int cpu;
+
+   for_each_possible_cpu(cpu) {
+   sum += per_cpu(*brw-fast_read_ctr, cpu);
+   per_cpu(*brw-fast_read_ctr, cpu) = 0;
+   }
+
+   return sum;
+}
+
+/*
+ * A writer takes -writer_mutex to exclude other writers and to force 
the
+ * readers to switch to the slow mode, note the mutex_is_locked() 
check in
+ * update_fast_ctr().
+ *
+ * After that the readers can only inc/dec the slow -slow_read_ctr 
counter,
+ * -fast_read_ctr is stable. Once the writer moves its sum into the 
slow
+ * counter it represents the number of active readers.
+ *
+ * Finally the writer takes -rw_sem for writing and blocks the new 
readers,
+ * then waits until the slow counter becomes zero.
+ */
   
   Some overview of how fast/slow_read_ctr are supposed to work would be
   useful.  This comment seems to assume that the reader already knew
   that.
   
+void percpu_down_write(struct percpu_rw_semaphore *brw)
+{
+   /* also 

Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily

2012-11-08 Thread Paul E. McKenney
On Thu, Nov 08, 2012 at 04:41:36PM -0800, Paul E. McKenney wrote:
 On Thu, Nov 08, 2012 at 06:41:10PM -0500, Mikulas Patocka wrote:
  
  
  On Thu, 8 Nov 2012, Paul E. McKenney wrote:
  
   On Thu, Nov 08, 2012 at 12:07:00PM -0800, Andrew Morton wrote:
On Thu, 8 Nov 2012 14:48:49 +0100
Oleg Nesterov o...@redhat.com wrote:

 Currently the writer does msleep() plus synchronize_sched() 3 times
 to acquire/release the semaphore, and during this time the readers
 are blocked completely. Even if the write section was not actually
 started or if it was already finished.
 
 With this patch down_write/up_write does synchronize_sched() twice
 and down_read/up_read are still possible during this time, just they
 use the slow path.
 
 percpu_down_write() first forces the readers to use rw_semaphore and
 increment the slow counter to take the lock for reading, then it
 takes that rw_semaphore for writing and blocks the readers.
 
 Also. With this patch the code relies on the documented behaviour of
 synchronize_sched(), it doesn't try to pair synchronize_sched() with
 barrier.
 
 ...

  include/linux/percpu-rwsem.h |   83 +
  lib/Makefile |2 +-
  lib/percpu-rwsem.c   |  123 
 ++

The patch also uninlines everything.

And it didn't export the resulting symbols to modules, so it isn't an
equivalent.  We can export thing later if needed I guess.

It adds percpu-rwsem.o to lib-y, so the CONFIG_BLOCK=n kernel will
avoid including the code altogether, methinks?


 ...

 --- /dev/null
 +++ b/lib/percpu-rwsem.c
 @@ -0,0 +1,123 @@

That was nice and terse ;)

 +#include linux/percpu-rwsem.h
 +#include linux/rcupdate.h
 +#include linux/sched.h

This list is nowhere near sufficient to support this file's
requirements.  atomic.h, percpu.h, rwsem.h, wait.h, errno.h and plenty
more.  IOW, if it compiles, it was sheer luck.

 +int percpu_init_rwsem(struct percpu_rw_semaphore *brw)
 +{
 + brw-fast_read_ctr = alloc_percpu(int);
 + if (unlikely(!brw-fast_read_ctr))
 + return -ENOMEM;
 +
 + mutex_init(brw-writer_mutex);
 + init_rwsem(brw-rw_sem);
 + atomic_set(brw-slow_read_ctr, 0);
 + init_waitqueue_head(brw-write_waitq);
 + return 0;
 +}
 +
 +void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
 +{
 + free_percpu(brw-fast_read_ctr);
 + brw-fast_read_ctr = NULL; /* catch use after free bugs */
 +}
 +
 +static bool update_fast_ctr(struct percpu_rw_semaphore *brw, 
 unsigned int val)
 +{
 + bool success = false;
 +
 + preempt_disable();
 + if (likely(!mutex_is_locked(brw-writer_mutex))) {
 + __this_cpu_add(*brw-fast_read_ctr, val);
 + success = true;
 + }
 + preempt_enable();
 +
 + return success;
 +}
 +
 +/*
 + * Like the normal down_read() this is not recursive, the writer can
 + * come after the first percpu_down_read() and create the deadlock.
 + */
 +void percpu_down_read(struct percpu_rw_semaphore *brw)
 +{
 + if (likely(update_fast_ctr(brw, +1)))
 + return;
 +
 + down_read(brw-rw_sem);
 + atomic_inc(brw-slow_read_ctr);
 + up_read(brw-rw_sem);
 +}
 +
 +void percpu_up_read(struct percpu_rw_semaphore *brw)
 +{
 + if (likely(update_fast_ctr(brw, -1)))
 + return;
 +
 + /* false-positive is possible but harmless */
 + if (atomic_dec_and_test(brw-slow_read_ctr))
 + wake_up_all(brw-write_waitq);
 +}
 +
 +static int clear_fast_ctr(struct percpu_rw_semaphore *brw)
 +{
 + unsigned int sum = 0;
 + int cpu;
 +
 + for_each_possible_cpu(cpu) {
 + sum += per_cpu(*brw-fast_read_ctr, cpu);
 + per_cpu(*brw-fast_read_ctr, cpu) = 0;
 + }
 +
 + return sum;
 +}
 +
 +/*
 + * A writer takes -writer_mutex to exclude other writers and to 
 force the
 + * readers to switch to the slow mode, note the mutex_is_locked() 
 check in
 + * update_fast_ctr().
 + *
 + * After that the readers can only inc/dec the slow -slow_read_ctr 
 counter,
 + * -fast_read_ctr is stable. Once the writer moves its sum into the 
 slow
 + * counter it represents the number of active readers.
 + *
 + * Finally the writer takes -rw_sem for writing and blocks the new 
 readers,
 + * then waits until the slow counter becomes zero.
 + */

Some overview of how fast/slow_read_ctr are supposed to work would be
useful.  This comment