Re: [PATCH 2/2] percpu-rw-semaphores: use rcu_read_lock_sched

2012-10-25 Thread Mikulas Patocka


On Thu, 25 Oct 2012, Paul E. McKenney wrote:

> On Thu, Oct 25, 2012 at 10:54:11AM -0400, Mikulas Patocka wrote:
> > 
> > 
> > On Wed, 24 Oct 2012, Paul E. McKenney wrote:
> > 
> > > On Mon, Oct 22, 2012 at 07:39:16PM -0400, Mikulas Patocka wrote:
> > > > Use rcu_read_lock_sched / rcu_read_unlock_sched / synchronize_sched
> > > > instead of rcu_read_lock / rcu_read_unlock / synchronize_rcu.
> > > > 
> > > > This is an optimization. The RCU-protected region is very small, so
> > > > there will be no latency problems if we disable preempt in this region.
> > > > 
> > > > So we use rcu_read_lock_sched / rcu_read_unlock_sched that translates
> > > > to preempt_disable / preempt_enable. It is smaller (and supposedly
> > > > faster) than preemptible rcu_read_lock / rcu_read_unlock.
> > > > 
> > > > Signed-off-by: Mikulas Patocka 
> > > 
> > > OK, as promised/threatened, I finally got a chance to take a closer look.
> > > 
> > > The light_mb() and heavy_mb() definitions aren't doing much for me,
> > > the code would be clearer with them expanded inline.  And while the
> > > approach of pairing barrier() with synchronize_sched() is interesting,
> > > it would be simpler to rely on RCU's properties.  The key point is that
> > > if RCU cannot prove that a given RCU-sched read-side critical section
> > > is seen by all CPUs to have started after a given synchronize_sched(),
> > > then that synchronize_sched() must wait for that RCU-sched read-side
> > > critical section to complete.
> > 
> > Also note that you can define both light_mb() and heavy_mb() to be 
> > smp_mb() and slow down the reader path a bit and speed up the writer path.
> > 
> > On architectures with in-order memory access (and thus smp_mb() equals 
> > barrier()), it doesn't hurt the reader but helps the writer, for example:
> > #ifdef ARCH_HAS_INORDER_MEMORY_ACCESS
> > #define light_mb()  smp_mb()
> > #define heavy_mb()  smp_mb()
> > #else
> > #define light_mb()  barrier()
> > #define heavy_mb()  synchronize_sched()
> > #endif
> 
> Except that there are no systems running Linux with in-order memory
> access.  Even x86 and s390 require a barrier instruction for smp_mb()
> on SMP=y builds.
> 
>   Thanx, Paul

PA-RISC is in-order. But it is used very rarely.

Mikulas


Re: [PATCH 2/2] percpu-rw-semaphores: use rcu_read_lock_sched

2012-10-25 Thread Paul E. McKenney
On Thu, Oct 25, 2012 at 10:54:11AM -0400, Mikulas Patocka wrote:
> 
> 
> On Wed, 24 Oct 2012, Paul E. McKenney wrote:
> 
> > On Mon, Oct 22, 2012 at 07:39:16PM -0400, Mikulas Patocka wrote:
> > > Use rcu_read_lock_sched / rcu_read_unlock_sched / synchronize_sched
> > > instead of rcu_read_lock / rcu_read_unlock / synchronize_rcu.
> > > 
> > > This is an optimization. The RCU-protected region is very small, so
> > > there will be no latency problems if we disable preempt in this region.
> > > 
> > > So we use rcu_read_lock_sched / rcu_read_unlock_sched that translates
> > > to preempt_disable / preempt_enable. It is smaller (and supposedly
> > > faster) than preemptible rcu_read_lock / rcu_read_unlock.
> > > 
> > > Signed-off-by: Mikulas Patocka 
> > 
> > OK, as promised/threatened, I finally got a chance to take a closer look.
> > 
> > The light_mb() and heavy_mb() definitions aren't doing much for me,
> > the code would be clearer with them expanded inline.  And while the
> > approach of pairing barrier() with synchronize_sched() is interesting,
> > it would be simpler to rely on RCU's properties.  The key point is that
> > if RCU cannot prove that a given RCU-sched read-side critical section
> > is seen by all CPUs to have started after a given synchronize_sched(),
> > then that synchronize_sched() must wait for that RCU-sched read-side
> > critical section to complete.
> 
> Also note that you can define both light_mb() and heavy_mb() to be 
> smp_mb() and slow down the reader path a bit and speed up the writer path.
> 
> On architectures with in-order memory access (and thus smp_mb() equals 
> barrier()), it doesn't hurt the reader but helps the writer, for example:
> #ifdef ARCH_HAS_INORDER_MEMORY_ACCESS
> #define light_mb()  smp_mb()
> #define heavy_mb()  smp_mb()
> #else
> #define light_mb()  barrier()
> #define heavy_mb()  synchronize_sched()
> #endif

Except that there are no systems running Linux with in-order memory
access.  Even x86 and s390 require a barrier instruction for smp_mb()
on SMP=y builds.

Thanx, Paul



Re: [PATCH 2/2] percpu-rw-semaphores: use rcu_read_lock_sched

2012-10-25 Thread Mikulas Patocka


On Wed, 24 Oct 2012, Paul E. McKenney wrote:

> On Mon, Oct 22, 2012 at 07:39:16PM -0400, Mikulas Patocka wrote:
> > Use rcu_read_lock_sched / rcu_read_unlock_sched / synchronize_sched
> > instead of rcu_read_lock / rcu_read_unlock / synchronize_rcu.
> > 
> > This is an optimization. The RCU-protected region is very small, so
> > there will be no latency problems if we disable preempt in this region.
> > 
> > So we use rcu_read_lock_sched / rcu_read_unlock_sched that translates
> > to preempt_disable / preempt_enable. It is smaller (and supposedly
> > faster) than preemptible rcu_read_lock / rcu_read_unlock.
> > 
> > Signed-off-by: Mikulas Patocka 
> 
> OK, as promised/threatened, I finally got a chance to take a closer look.
> 
> The light_mb() and heavy_mb() definitions aren't doing much for me,
> the code would be clearer with them expanded inline.  And while the
> approach of pairing barrier() with synchronize_sched() is interesting,
> it would be simpler to rely on RCU's properties.  The key point is that
> if RCU cannot prove that a given RCU-sched read-side critical section
> is seen by all CPUs to have started after a given synchronize_sched(),
> then that synchronize_sched() must wait for that RCU-sched read-side
> critical section to complete.

Also note that you can define both light_mb() and heavy_mb() to be 
smp_mb() and slow down the reader path a bit and speed up the writer path.

On architectures with in-order memory access (and thus smp_mb() equals 
barrier()), it doesn't hurt the reader but helps the writer, for example:
#ifdef ARCH_HAS_INORDER_MEMORY_ACCESS
#define light_mb()  smp_mb()
#define heavy_mb()  smp_mb()
#else
#define light_mb()  barrier()
#define heavy_mb()  synchronize_sched()
#endif
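
For illustration, this is where the two barriers sit in the patch's fast
and slow paths (a sketch assembled from the hunks quoted later in this
thread; the B and D sites are in parts of the file not quoted here):

	/* reader fast path, end of percpu_down_read() */
	this_cpu_inc(*p->counters);
	rcu_read_unlock_sched();
	light_mb();	/* A: barrier() in the posted code, nearly free */

	/* writer, percpu_down_write(), after the counters have drained */
	while (__percpu_count(p->counters))
		msleep(1);
	heavy_mb();	/* C: synchronize_sched() in the posted code */

Defining both as smp_mb() makes point A a real barrier on weakly ordered
CPUs but spares the writer the grace period at point C.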

Mikulas


Re: [PATCH 2/2] percpu-rw-semaphores: use rcu_read_lock_sched

2012-10-24 Thread Paul E. McKenney
On Wed, Oct 24, 2012 at 08:43:11PM +0200, Oleg Nesterov wrote:
> On 10/24, Paul E. McKenney wrote:
> >
> > On Wed, Oct 24, 2012 at 07:18:55PM +0200, Oleg Nesterov wrote:
> > > On 10/24, Paul E. McKenney wrote:
> > > >
> > > > static inline void percpu_up_read(struct percpu_rw_semaphore *p)
> > > > {
> > > > /*
> > > >  * Decrement our count, but protected by RCU-sched so that
> > > >  * the writer can force proper serialization.
> > > >  */
> > > > rcu_read_lock_sched();
> > > > this_cpu_dec(*p->counters);
> > > > rcu_read_unlock_sched();
> > > > }
> > >
> > > Yes, the explicit lock/unlock makes the new assumptions about
> > > synchronize_sched && barriers unnecessary. And iiuc this could
> > > even be written as
> > >
> > >   rcu_read_lock_sched();
> > >   rcu_read_unlock_sched();
> > >
> > >   this_cpu_dec(*p->counters);
> >
> > But this would lose the memory barrier that is inserted by
> > synchronize_sched() after the CPU's last RCU-sched read-side critical
> > section.
> 
> How? Afaics there is no need to synchronize with this_cpu_dec(), its
> result was already seen before the 2nd synchronize_sched() was called
> in percpu_down_write().
> 
> IOW, this memory barrier is only needed to synchronize with memory
> changes inside down_read/up_read.
> 
> To clarify, of course I do not suggest to write it this way. I am just
> trying to check my understanding.

You are quite correct -- once the writer has seen the change in the
counter, it knows that the reader's empty RCU-sched read must have
at least started, and thus can rely on the following memory barrier
to guarantee that it sees the reader's critical section.
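
Spelled out as a timeline (purely illustrative, using the rewritten
percpu_up_read() from above):

	/*
	 *  reader CPU                        writer CPU, percpu_down_write()
	 *  ----------                        -------------------------------
	 *  ...reads protected data...
	 *  rcu_read_lock_sched();
	 *  rcu_read_unlock_sched();
	 *  this_cpu_dec(*p->counters);
	 *                                    while (__percpu_count(p->counters))
	 *                                            msleep(1); /* sees the dec */
	 *                                    synchronize_sched();
	 *
	 * The writer saw the decrement, which follows the empty RCU-sched
	 * section in the reader's program order, so that section cannot be
	 * proven to have started after the grace period.  A full memory
	 * barrier therefore executes on the reader's CPU after the section
	 * and before synchronize_sched() returns, which is what lets the
	 * writer see the reader's whole critical section.
	 */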

But that code really does look strange, I will grant you that!  ;-)

Thanx, Paul



Re: [PATCH 2/2] percpu-rw-semaphores: use rcu_read_lock_sched

2012-10-24 Thread Oleg Nesterov
On 10/24, Paul E. McKenney wrote:
>
> On Wed, Oct 24, 2012 at 07:18:55PM +0200, Oleg Nesterov wrote:
> > On 10/24, Paul E. McKenney wrote:
> > >
> > > static inline void percpu_up_read(struct percpu_rw_semaphore *p)
> > > {
> > >   /*
> > >* Decrement our count, but protected by RCU-sched so that
> > >* the writer can force proper serialization.
> > >*/
> > >   rcu_read_lock_sched();
> > >   this_cpu_dec(*p->counters);
> > >   rcu_read_unlock_sched();
> > > }
> >
> > Yes, the explicit lock/unlock makes the new assumptions about
> > synchronize_sched && barriers unnecessary. And iiuc this could
> > even be written as
> >
> > rcu_read_lock_sched();
> > rcu_read_unlock_sched();
> >
> > this_cpu_dec(*p->counters);
>
> But this would lose the memory barrier that is inserted by
> synchronize_sched() after the CPU's last RCU-sched read-side critical
> section.

How? Afaics there is no need to synchronize with this_cpu_dec(), its
result was already seen before the 2nd synchronize_sched() was called
in percpu_down_write().

IOW, this memory barrier is only needed to synchronize with memory
changes inside down_read/up_read.

To clarify, of course I do not suggest to write it this way. I am just
trying to check my understanding.

Oleg.



Re: [PATCH 2/2] percpu-rw-semaphores: use rcu_read_lock_sched

2012-10-24 Thread Paul E. McKenney
On Wed, Oct 24, 2012 at 07:18:55PM +0200, Oleg Nesterov wrote:
> On 10/24, Paul E. McKenney wrote:
> >
> > static inline void percpu_up_read(struct percpu_rw_semaphore *p)
> > {
> > /*
> >  * Decrement our count, but protected by RCU-sched so that
> >  * the writer can force proper serialization.
> >  */
> > rcu_read_lock_sched();
> > this_cpu_dec(*p->counters);
> > rcu_read_unlock_sched();
> > }
> 
> Yes, the explicit lock/unlock makes the new assumptions about
> synchronize_sched && barriers unnecessary. And iiuc this could
> even be written as
> 
>   rcu_read_lock_sched();
>   rcu_read_unlock_sched();
> 
>   this_cpu_dec(*p->counters);

But this would lose the memory barrier that is inserted by
synchronize_sched() after the CPU's last RCU-sched read-side critical
section.

> > Of course, it would be nice to get rid of the extra synchronize_sched().
> > One way to do this is to use SRCU, which allows blocking operations in
> > its read-side critical sections (though also increasing read-side overhead
> > a bit, and also untested):
> >
> > 
> >
> > struct percpu_rw_semaphore {
> > bool locked;
> > struct mutex mtx; /* Could also be rw_semaphore. */
> > struct srcu_struct s;
> > wait_queue_head_t wq;
> > };
> 
> but in this case I don't understand
> 
> > static inline void percpu_up_write(struct percpu_rw_semaphore *p)
> > {
> > /* Allow others to proceed, but not yet locklessly. */
> > mutex_unlock(&p->mtx);
> >
> > /*
> >  * Ensure that all calls to percpu_down_read() that did not
> >  * start unambiguously after the above mutex_unlock() still
> >  * acquire the lock, forcing their critical sections to be
> >  * serialized with the one terminated by this call to
> >  * percpu_up_write().
> >  */
> > synchronize_sched();
> 
> how this synchronize_sched() can help...

Indeed it cannot!  It should instead be synchronize_srcu(&p->s).  I guess that
I really meant it when I said it was untested.  ;-)
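
Plugged into the quoted function, the fix would read as follows (sketch
only -- the rest of the function is unchanged from the proposal, which is
not quoted in full here):

	static inline void percpu_up_write(struct percpu_rw_semaphore *p)
	{
		/* Allow others to proceed, but not yet locklessly. */
		mutex_unlock(&p->mtx);

		/*
		 * Ensure that all calls to percpu_down_read() that did not
		 * start unambiguously after the above mutex_unlock() still
		 * acquire the lock.  The readers here are SRCU readers, not
		 * RCU-sched readers, so only synchronize_srcu() waits for
		 * them.
		 */
		synchronize_srcu(&p->s);

		/* ... remainder as in the proposal ... */
	}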

Thanx, Paul



Re: [PATCH 2/2] percpu-rw-semaphores: use rcu_read_lock_sched

2012-10-24 Thread Oleg Nesterov
On 10/24, Paul E. McKenney wrote:
>
> static inline void percpu_up_read(struct percpu_rw_semaphore *p)
> {
>   /*
>* Decrement our count, but protected by RCU-sched so that
>* the writer can force proper serialization.
>*/
>   rcu_read_lock_sched();
>   this_cpu_dec(*p->counters);
>   rcu_read_unlock_sched();
> }

Yes, the explicit lock/unlock makes the new assumptions about
synchronize_sched && barriers unnecessary. And iiuc this could
even be written as

rcu_read_lock_sched();
rcu_read_unlock_sched();

this_cpu_dec(*p->counters);


> Of course, it would be nice to get rid of the extra synchronize_sched().
> One way to do this is to use SRCU, which allows blocking operations in
> its read-side critical sections (though also increasing read-side overhead
> a bit, and also untested):
>
> 
>
> struct percpu_rw_semaphore {
>   bool locked;
>   struct mutex mtx; /* Could also be rw_semaphore. */
>   struct srcu_struct s;
>   wait_queue_head_t wq;
> };

but in this case I don't understand

> static inline void percpu_up_write(struct percpu_rw_semaphore *p)
> {
>   /* Allow others to proceed, but not yet locklessly. */
>   mutex_unlock(&p->mtx);
>
>   /*
>* Ensure that all calls to percpu_down_read() that did not
>* start unambiguously after the above mutex_unlock() still
>* acquire the lock, forcing their critical sections to be
>* serialized with the one terminated by this call to
>* percpu_up_write().
>*/
>   synchronize_sched();

how this synchronize_sched() can help...

Oleg.



Re: [PATCH 2/2] percpu-rw-semaphores: use rcu_read_lock_sched

2012-10-24 Thread Paul E. McKenney
On Mon, Oct 22, 2012 at 07:39:16PM -0400, Mikulas Patocka wrote:
> Use rcu_read_lock_sched / rcu_read_unlock_sched / synchronize_sched
> instead of rcu_read_lock / rcu_read_unlock / synchronize_rcu.
> 
> This is an optimization. The RCU-protected region is very small, so
> there will be no latency problems if we disable preempt in this region.
> 
> So we use rcu_read_lock_sched / rcu_read_unlock_sched that translates
> to preempt_disable / preempt_enable. It is smaller (and supposedly
> faster) than preemptible rcu_read_lock / rcu_read_unlock.
> 
> Signed-off-by: Mikulas Patocka 

OK, as promised/threatened, I finally got a chance to take a closer look.

The light_mb() and heavy_mb() definitions aren't doing much for me,
the code would be clearer with them expanded inline.  And while the
approach of pairing barrier() with synchronize_sched() is interesting,
it would be simpler to rely on RCU's properties.  The key point is that
if RCU cannot prove that a given RCU-sched read-side critical section
is seen by all CPUs to have started after a given synchronize_sched(),
then that synchronize_sched() must wait for that RCU-sched read-side
critical section to complete.

This means, as discussed earlier, that there will be a memory barrier
somewhere following the end of that RCU-sched read-side critical section,
and that this memory barrier executes before the completion of the
synchronize_sched().
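
As a two-CPU illustration of that guarantee (not part of the proposal
below):

	/*
	 *  CPU 0 (reader)                    CPU 1 (writer)
	 *  rcu_read_lock_sched();
	 *  ...read-side critical section...  ...update...
	 *  rcu_read_unlock_sched();          synchronize_sched();
	 *
	 * Unless every CPU can see that CPU 0's section began after the
	 * grace period started, synchronize_sched() waits for the section
	 * to finish, and a memory barrier on CPU 0 after the section
	 * executes before synchronize_sched() returns.
	 */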

So I suggest something like the following (untested!) implementation:



struct percpu_rw_semaphore {
unsigned __percpu *counters;
bool locked;
struct mutex mtx;
wait_queue_head_t wq;
};

static inline void percpu_down_read(struct percpu_rw_semaphore *p)
{
rcu_read_lock_sched();
if (unlikely(p->locked)) {
rcu_read_unlock_sched();

/*
 * There might (or might not) be a writer.  Acquire &p->mtx,
 * it is always safe (if a bit slow) to do so.
 */
mutex_lock(&p->mtx);
this_cpu_inc(*p->counters);
mutex_unlock(&p->mtx);
return;
}

/* No writer, proceed locklessly. */
this_cpu_inc(*p->counters);
rcu_read_unlock_sched();
}

static inline void percpu_up_read(struct percpu_rw_semaphore *p)
{
/*
 * Decrement our count, but protected by RCU-sched so that
 * the writer can force proper serialization.
 */
rcu_read_lock_sched();
this_cpu_dec(*p->counters);
rcu_read_unlock_sched();
}

static inline unsigned __percpu_count(unsigned __percpu *counters)
{
unsigned total = 0;
int cpu;

for_each_possible_cpu(cpu)
total += ACCESS_ONCE(*per_cpu_ptr(counters, cpu));

return total;
}

static inline void percpu_down_write(struct percpu_rw_semaphore *p)
{
mutex_lock(&p->mtx);

/* Wait for a previous writer, if necessary. */
wait_event(p->wq, !ACCESS_ONCE(p->locked));

/* Force the readers to acquire the lock when manipulating counts. */
ACCESS_ONCE(p->locked) = true;

/* Wait for all pre-existing readers' checks of ->locked to finish. */
synchronize_sched();
/*
 * At this point, all percpu_down_read() invocations will
 * acquire p->mtx.
 */

/*
 * Wait for all pre-existing readers to complete their
 * percpu_up_read() calls.  Because ->locked is set and
 * because we hold ->mtx, there cannot be any new readers.
 * ->counters will therefore monotonically decrement to zero.
 */
while (__percpu_count(p->counters))
msleep(1);

/*
 * Invoke synchronize_sched() in order to force the last
 * caller of percpu_up_read() to exit its RCU-sched read-side
 * critical section.  On SMP systems, this also forces the CPU
 * that invoked that percpu_up_read() to execute a full memory
 * barrier between the time it exited the RCU-sched read-side
 * critical section and the time that synchronize_sched() returns,
 * so that the critical section begun by this invocation of
 * percpu_down_write() will happen after the critical section
 * ended by percpu_up_read().
 */
synchronize_sched();
}

static inline void percpu_up_write(struct percpu_rw_semaphore *p)
{
/* Allow others to proceed, but not yet locklessly. */
mutex_unlock(&p->mtx);

/*
 * Ensure that all calls to percpu_down_read() that did not
 * start unambiguously after the above mutex_unlock() still
 * acquire the lock, forcing their critical sections to be
 * serialized with the one terminated by this call to
 * percpu_up_write().
 */
synchronize_sched();

/* Now it is safe to allow readers 



[PATCH 2/2] percpu-rw-semaphores: use rcu_read_lock_sched

2012-10-22 Thread Mikulas Patocka
Use rcu_read_lock_sched / rcu_read_unlock_sched / synchronize_sched
instead of rcu_read_lock / rcu_read_unlock / synchronize_rcu.

This is an optimization. The RCU-protected region is very small, so
there will be no latency problems if we disable preempt in this region.

So we use rcu_read_lock_sched / rcu_read_unlock_sched that translates
to preempt_disable / preempt_enable. It is smaller (and supposedly
faster) than preemptible rcu_read_lock / rcu_read_unlock.

Signed-off-by: Mikulas Patocka 

---
 include/linux/percpu-rwsem.h |8 
 1 file changed, 4 insertions(+), 4 deletions(-)
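
(Not part of the patch, for reference: ignoring lockdep and debug hooks,
the sched variants essentially reduce to

	rcu_read_lock_sched()   ->  preempt_disable()
	rcu_read_unlock_sched() ->  preempt_enable()
	synchronize_sched()     ->  wait until every CPU has passed through a
	                            quiescent state, i.e. until all earlier
	                            preempt-disabled regions have finished

which is why the read-side fast path gets cheaper here.)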

Index: linux-3.6.3-fast/include/linux/percpu-rwsem.h
===
--- linux-3.6.3-fast.orig/include/linux/percpu-rwsem.h  2012-10-23 
01:21:49.0 +0200
+++ linux-3.6.3-fast/include/linux/percpu-rwsem.h   2012-10-23 
01:36:23.0 +0200
@@ -17,16 +17,16 @@ struct percpu_rw_semaphore {
 
 static inline void percpu_down_read(struct percpu_rw_semaphore *p)
 {
-   rcu_read_lock();
+   rcu_read_lock_sched();
if (unlikely(p->locked)) {
-   rcu_read_unlock();
+   rcu_read_unlock_sched();
mutex_lock(&p->mtx);
this_cpu_inc(*p->counters);
mutex_unlock(&p->mtx);
return;
}
this_cpu_inc(*p->counters);
-   rcu_read_unlock();
+   rcu_read_unlock_sched();
light_mb(); /* A, between read of p->locked and read of data, paired 
with D */
 }
 
@@ -51,7 +51,7 @@ static inline void percpu_down_write(str
 {
mutex_lock(&p->mtx);
p->locked = true;
-   synchronize_rcu();
+   synchronize_sched(); /* make sure that all readers exit the 
rcu_read_lock_sched region */
while (__percpu_count(p->counters))
msleep(1);
heavy_mb(); /* C, between read of p->counter and write to data, paired 
with B */
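
For context, a caller of this API looks roughly like this (illustrative
only; the data being protected and the function names are made up, and
the semaphore is assumed to have been initialized elsewhere):

	static struct percpu_rw_semaphore my_sem;

	void my_reader(void)
	{
		percpu_down_read(&my_sem);
		/* Read the protected data.  The common case takes no lock:
		 * just an RCU-sched section, a per-cpu increment and a
		 * compiler barrier. */
		percpu_up_read(&my_sem);
	}

	void my_writer(void)
	{
		percpu_down_write(&my_sem);
		/* Modify the protected data exclusively. */
		percpu_up_write(&my_sem);
	}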


