Re: [PATCH] percpu-rwsem: use barrier in unlock path

2012-10-19 Thread Linus Torvalds
On Thu, Oct 18, 2012 at 9:00 AM, Mikulas Patocka wrote:
>
> What is the procedure for making changes that require support from the
> architectures? It is trivial to make a patch that moves this into
> arch-specific includes; the problem is that the patch breaks all the
> architectures - I wrote support for x86, sparc, parisc, alpha (I can test
> those) but not the others.

We'd need to add it to everybody.

It shouldn't need per-architecture testing - since "smp_mb()" is
always safe. So we could just make all architectures default to that,
and then for x86 (and potentially others that have cheaper models for
release-consistency) just do the optimized one.

We *could* also simply do something like

   #ifndef smp_release_before_store
   #define smp_release_before_store() smp_mb()
   #endif

and basically make the rule be that only architectures that have a
cheaper one need to define it at all. That may be the correct
short-term fix, since there initially would be only a single user.

  Linus
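
As an illustration of the override, an architecture with release-consistent
stores could then provide its own cheap definition in its barrier header (a
sketch only; the file and macro placement are assumptions, nothing here was
posted as a patch):

   /* e.g. in arch/x86/include/asm/barrier.h - on x86, stores already
    * have release semantics (modulo the PPro errata discussed below),
    * so a compiler barrier is enough: */
   #define smp_release_before_store() barrier()

All other architectures silently fall back to the safe smp_mb() default.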


Re: [PATCH] percpu-rwsem: use barrier in unlock path

2012-10-18 Thread Mikulas Patocka
This patch looks sensible.

I'd apply either this or my previous patch that adds synchronize_rcu() to 
percpu_up_write.

This patch avoids the memory barrier on non-x86 cpus in percpu_up_read, so 
it is faster than the previous approach.

Mikulas
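
For context on the draft quoted below: the heavyweight side pays for the
ordering on behalf of the lightweight side. A sketch of why the pairing
works:

   /*
    * reader (fast path)           writer (slow path)
    * ------------------           ------------------
    * light_mb() == barrier()      heavy_mb() == synchronize_sched()
    *
    * synchronize_sched() returns only after every CPU has passed
    * through a context switch, which implies a full memory barrier
    * on each CPU. So the reader can get away with a compiler-only
    * barrier: by the time the writer proceeds past heavy_mb(), the
    * reader's accesses are guaranteed to be ordered with respect to
    * the writer's.
    */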


On Thu, 18 Oct 2012, Lai Jiangshan wrote:

> ---
> 
> a very draft example of paired-mb()s is here:
> 
> 
> diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
> index cf80f7e..84a93c0 100644
> --- a/include/linux/percpu-rwsem.h
> +++ b/include/linux/percpu-rwsem.h
> @@ -12,6 +12,14 @@ struct percpu_rw_semaphore {
>  	struct mutex mtx;
>  };
>  
> +#if 1
> +#define light_mb() barrier()
> +#define heavy_mb() synchronize_sched()
> +#else
> +#define light_mb() smp_mb()
> +#define heavy_mb() smp_mb()
> +#endif
> +
>  static inline void percpu_down_read(struct percpu_rw_semaphore *p)
>  {
>  	rcu_read_lock();
> @@ -24,22 +32,12 @@ static inline void percpu_down_read(struct percpu_rw_semaphore *p)
>  	}
>  	this_cpu_inc(*p->counters);
>  	rcu_read_unlock();
> +	light_mb(); /* A, between read of p->locked and read of data, paired with D */
>  }
>  
>  static inline void percpu_up_read(struct percpu_rw_semaphore *p)
>  {
> -	/*
> -	 * On X86, write operation in this_cpu_dec serves as a memory unlock
> -	 * barrier (i.e. memory accesses may be moved before the write, but
> -	 * no memory accesses are moved past the write).
> -	 * On other architectures this may not be the case, so we need smp_mb()
> -	 * there.
> -	 */
> -#if defined(CONFIG_X86) && (!defined(CONFIG_X86_PPRO_FENCE) && !defined(CONFIG_X86_OOSTORE))
> -	barrier();
> -#else
> -	smp_mb();
> -#endif
> +	light_mb(); /* B, between read of the data and write to p->counter, paired with C */
>  	this_cpu_dec(*p->counters);
>  }
>  
> @@ -61,11 +59,12 @@ static inline void percpu_down_write(struct percpu_rw_semaphore *p)
>  	synchronize_rcu();
>  	while (__percpu_count(p->counters))
>  		msleep(1);
> -	smp_rmb(); /* paired with smp_mb() in percpu_sem_up_read() */
> +	heavy_mb(); /* C, between read of p->counter and write to data, paired with B */
>  }
>  
>  static inline void percpu_up_write(struct percpu_rw_semaphore *p)
>  {
> +	heavy_mb(); /* D, between write to data and write to p->locked, paired with A */
>  	p->locked = false;
>  	mutex_unlock(&p->mtx);
>  }
> 


Re: [PATCH] percpu-rwsem: use barrier in unlock path

2012-10-18 Thread Mikulas Patocka


On Tue, 16 Oct 2012, Linus Torvalds wrote:

> [ Architecture people, note the potential new SMP barrier! ]
> 
> On Tue, Oct 16, 2012 at 4:30 PM, Mikulas Patocka wrote:
> > +   /*
> > +    * The lock is considered unlocked when p->locked is set to false.
> > +    * Use a barrier to prevent reordering of operations around p->locked.
> > +    */
> > +#if defined(CONFIG_X86) && (!defined(CONFIG_X86_PPRO_FENCE) && !defined(CONFIG_X86_OOSTORE))
> > +   barrier();
> > +#else
> > +   smp_mb();
> > +#endif
> > p->locked = false;
> 
> Ugh. The #if is too ugly to live.

One instance of this is already present in the code.

I suggest that you drop this patch and use synchronize_rcu() that I 
have just posted.

But another instance of this "#if defined(CONFIG_X86) && ..." is already 
there in percpu_up_read.


What is the procedure for making changes that require support from the 
architectures? It is trivial to make a patch that moves this into 
arch-specific includes; the problem is that the patch breaks all the 
architectures - I wrote support for x86, sparc, parisc, alpha (I can test 
those) but not the others.

Are you going to apply this patch, break the majority of architectures and 
wait until architecture maintainers fix it? Or is there some arch-specific 
git tree, where the patch should be applied and where the maintainers fix 
things?

> This is a classic case of "people who write their own serialization
> primitives invariably get them wrong". And this fix is just horrible,
> and code like this should not be allowed.
> 
> I suspect we could make the above x86-specific optimization be valid
> by introducing a new barrier, called "smp_release_before_store()" or
> something like that, which on x86 just happens to be a no-op (but
> still a compiler barrier). That's because earlier reads will not pass
> a later store, and stores are viewed in program order, so all x86
> stores have "release consistency" modulo the theoretical PPro bugs
> (that I don't think we actually ever saw in practice).
> 
> But it is possible that that is true on other architectures too, so
> your hand-crafted x86-specific #if is not just ugly, it's liable to
> get worse.
> 
> The alternative is to just say that we should use "smp_mb()"
> unconditionally, depending on how critical this code actually ends up
> being.
> 
> Added linux-arch in case architecture-maintainers have comments on
> "smp_release_before_store()" thing. It would be kind of similar to the
> odd "smp_read_barrier_depends()", except it would normally be a full
> memory barrier, except on architectures where a weaker barrier might
> be sufficient.
> 
> I suspect there may be many architectures where a "smp_wmb()" is
> sufficient for this case, for the simple reason that no sane
> microarchitecture would *ever* move earlier reads down past a later
> write,

Alpha can move reads down past writes (if the read is not in cache and the 
write hits the cache, it may make sense to do this reordering).

> so release consistency really only needs the local writes to be
> ordered, not the full memory ordering.
> 
> Arch people?
> 
> The more optimal solution may be to mark the store *itself* to be
> "store with release consistency", which on x86 would be a regular
> store (with the compiler barrier), but on other architectures may be a
> special memory operation. On architectures with
> release/acquire-consistency, there's not a separate barrier before the
> store, the store instruction itself is done with special semantics. So
> maybe the right thing to do is
> 
>#define smp_release_consistency_store(val, ptr) ...
> 
> where on x86, the implementation would be a simple
> 
>do { barrier(); *(ptr)=(val); } while (0)

smp_release_consistency_store doesn't work. In 
include/linux/percpu-rwsem.h it is required that 
"this_cpu_dec(*p->counters);" works as an unlock barrier. So we could do 
one of the following:

"smp_mb(); this_cpu_dec(*p->counters);" - generates a pointless barrier on x86

"#if defined(CONFIG_X86) && (!defined(CONFIG_X86_PPRO_FENCE) && !defined(CONFIG_X86_OOSTORE))
barrier();
#else
smp_mb();
#endif
this_cpu_dec(*p->counters);
" - removes the barrier on the most common architecture (x86), but the 
code looks dirty

"smp_release_before_store(); this_cpu_dec(*p->counters);" - the code is 
clean, but the downside is that it breaks architectures that don't have 
smp_release_before_store().
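
A sketch of the third option combined with the #ifndef fallback Linus
suggested (illustration only; this exact code was not posted as a patch):

   /* Safe default; architectures with release-consistent stores
    * would override this with a cheaper definition. */
   #ifndef smp_release_before_store
   #define smp_release_before_store() smp_mb()
   #endif

   static inline void percpu_up_read(struct percpu_rw_semaphore *p)
   {
           /* order the critical-section accesses before the counter write */
           smp_release_before_store();
           this_cpu_dec(*p->counters);
   }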

> but on other architectures it might be an inline asm with the required
> magic store-with-release instruction.
> 
> How important is this code sequence? Is the "percpu_up_write()"
> function really so critical that we can't have an extra memory
> barrier?

percpu_up_write() is not critical. But percpu_up_read() is critical and it 
should be as fast as possible.

Mikulas

> Or do people perhaps see *other* places where
> release-consistency-stores might be worth doing?
> 
> But in no event do I want to see that butt-ugly #if statement.
> 
> Linus

---
Introduce smp_release_before_store()

smp_release_before_store() with the following store 

Re: [PATCH] percpu-rwsem: use barrier in unlock path

2012-10-18 Thread Mikulas Patocka


On Thu, 18 Oct 2012, Steven Rostedt wrote:

> On Thu, 2012-10-18 at 10:18 +0800, Lai Jiangshan wrote:
> > > 
> > > Looking at the patch, you are correct. The read side doesn't need the
> > > memory barrier as the worst thing that will happen is that it sees the
> > > locked = false, and will just grab the mutex unnecessarily.
> > 
> > -
> > A memory barrier can be added iff these two things are known:
> > 1) which reordering it prevents, i.e. between which two accesses.
> > 2) which corresponding mb() it pairs with.
> > 
> 
> OK, I was just looking at the protection and actions of the locked flag,
> but I see what you are saying with the data itself.
> 
> > You tried to add a mb() in percpu_up_write(); OK, I know it prevents the
> > reordering between the writes to the protected data and the statement
> > "p->locked = false", but I can't find the corresponding mb() that it
> > pairs with.
> > 
> > reader                                      writer
> > 
> > percpu_down_read()                          writes to the data
> >     the cpu caches/prefetches the data      writes to the data
> >     which is chaos                          writes to the data
> >                                             percpu_up_write()
> >                                                 mb()
> >                                                 p->locked = false;
> >     unlikely(p->locked)
> >         the cpu sees p->locked == false,
> >         doesn't discard the cached/prefetched data
> >     this_cpu_inc(*p->counters);
> >     the code of read-access to the data
> >     *and we use the chaos data*
> > 
> > So you need to add a mb() after "unlikely(p->locked)".
> 
> Does it need a full mb() or could it be just a rmb()? I wouldn't think
> the down_read would need to protect against stores, would it? The
> memory barrier should probably go in front of the unlikely() too. The
> write to p->counters is handled by the synchronized sched, and adding a
> rmb() in front of the unlikely check would keep prefetched data from
> passing this barrier.
> 
> This is a perfect example why this primitive should be vetted outside of
> mainline before it gets merged.
> 
> -- Steve

If we do synchronize_rcu() in percpu_up_write, we don't need a barrier in 
percpu_down_read(). So I would do that.

Mikulas


Re: [PATCH] percpu-rwsem: use barrier in unlock path

2012-10-18 Thread Mikulas Patocka


On Wed, 17 Oct 2012, Steven Rostedt wrote:

> On Wed, Oct 17, 2012 at 11:07:21AM -0400, Mikulas Patocka wrote:
> > > 
> > > Even if the previous patch is applied, percpu_down_read() still
> > > needs mb() to pair with it.
> > 
> > percpu_down_read uses rcu_read_lock which should guarantee that memory 
> > accesses don't escape in front of a rcu-protected section.
> 
> You do realize that rcu_read_lock() does nothing more than a barrier(),
> right?
> 
> Paul worked really hard to get rcu_read_locks() to not call HW barriers.
> 
> > 
> > If rcu_read_unlock has only an unlock barrier and not a full barrier, 
> > memory accesses could be moved in front of rcu_read_unlock and reordered 
> > with this_cpu_inc(*p->counters), but it doesn't matter because 
> > percpu_down_write does synchronize_rcu(), so it never sees these accesses 
> > halfway through.
> 
> Looking at the patch, you are correct. The read side doesn't need the
> memory barrier as the worst thing that will happen is that it sees the
> locked = false, and will just grab the mutex unnecessarily.

It wasn't correct.

CPU 1 is holding the write lock.

CPU 2 could get to percpu_down_read past rcu_read_lock and prefetch some 
data that are accessed after percpu_down_read.

CPU 1 goes into percpu_up_write(), calls a barrier, p->locked = false; and 
mutex_unlock(&p->mtx);

CPU 2 now sees p->locked == false, so it goes through percpu_down_read. It 
exits percpu_down_read and uses the invalid prefetched data.

It could be fixed by using synchronize_rcu() in percpu_up_write; I sent a 
patch for that.
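
The same race as a timeline (a sketch of the scenario just described):

   CPU 1 (writer)                    CPU 2 (reader)
   --------------                    --------------
   holds the write lock
                                     percpu_down_read():
                                         rcu_read_lock();
                                         (prefetches protected data)
   percpu_up_write():
       barrier();
       p->locked = false;
       mutex_unlock(&p->mtx);
                                         sees p->locked == false
                                         this_cpu_inc(*p->counters);
                                     ...and uses the stale prefetched data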

Mikulas


Re: [PATCH] percpu-rwsem: use barrier in unlock path

2012-10-18 Thread Mikulas Patocka


On Thu, 18 Oct 2012, Lai Jiangshan wrote:

> On 10/18/2012 04:28 AM, Steven Rostedt wrote:
> > On Wed, Oct 17, 2012 at 11:07:21AM -0400, Mikulas Patocka wrote:
> >>>
> >>> Even if the previous patch is applied, percpu_down_read() still
> >>> needs mb() to pair with it.
> >>
> >> percpu_down_read uses rcu_read_lock which should guarantee that memory 
> >> accesses don't escape in front of a rcu-protected section.
> > 
> > You do realize that rcu_read_lock() does nothing more than a barrier(),
> > right?
> > 
> > Paul worked really hard to get rcu_read_locks() to not call HW barriers.
> > 
> >>
> >> If rcu_read_unlock has only an unlock barrier and not a full barrier, 
> >> memory accesses could be moved in front of rcu_read_unlock and reordered 
> >> with this_cpu_inc(*p->counters), but it doesn't matter because 
> >> percpu_down_write does synchronize_rcu(), so it never sees these accesses 
> >> halfway through.
> > 
> > Looking at the patch, you are correct. The read side doesn't need the
> > memory barrier as the worst thing that will happen is that it sees the
> > locked = false, and will just grab the mutex unnecessarily.
> 
> -
> A memory barrier can be added iff these two things are known:
>   1) which reordering it prevents, i.e. between which two accesses.
>   2) which corresponding mb() it pairs with.
> 
> You tried to add a mb() in percpu_up_write(); OK, I know it prevents the
> reordering between the writes to the protected data and the statement
> "p->locked = false", but I can't find the corresponding mb() that it
> pairs with.

Or alternatively, instead of a barrier, we can do this.

Mikulas

---

percpu-rwsem: use barrier in unlock path

The lock is considered unlocked when p->locked is set to false.
Use a barrier to prevent reordering of operations around p->locked.

Signed-off-by: Mikulas Patocka 

---
 include/linux/percpu-rwsem.h |    6 ++++++
 1 file changed, 6 insertions(+)

Index: linux-3.6.2-fast/include/linux/percpu-rwsem.h
===================================================================
--- linux-3.6.2-fast.orig/include/linux/percpu-rwsem.h	2012-10-17 20:48:40.0 +0200
+++ linux-3.6.2-fast/include/linux/percpu-rwsem.h	2012-10-18 17:19:24.0 +0200
@@ -66,6 +66,12 @@ static inline void percpu_down_write(str
 
 static inline void percpu_up_write(struct percpu_rw_semaphore *p)
 {
+	/*
+	 * Make sure that other processes that are in the rcu section and that
+	 * may have seen partially modified state exit the rcu section and
+	 * try to grab the mutex.
+	 */
+	synchronize_rcu();
 	p->locked = false;
 	mutex_unlock(&p->mtx);
 }


Re: [PATCH] percpu-rwsem: use barrier in unlock path

2012-10-17 Thread Steven Rostedt
On Thu, 2012-10-18 at 10:18 +0800, Lai Jiangshan wrote:
> > 
> > Looking at the patch, you are correct. The read side doesn't need the
> > memory barrier as the worst thing that will happen is that it sees the
> > locked = false, and will just grab the mutex unnecessarily.
> 
> -
> A memory barrier can be added iff these two things are known:
>   1) which reordering it prevents, i.e. between which two accesses.
>   2) which corresponding mb() it pairs with.
> 

OK, I was just looking at the protection and actions of the locked flag,
but I see what you are saying with the data itself.

> You tried to add a mb() in percpu_up_write(); OK, I know it prevents the
> reordering between the writes to the protected data and the statement
> "p->locked = false", but I can't find the corresponding mb() that it
> pairs with.
> 
> reader                                      writer
> 
> percpu_down_read()                          writes to the data
>     the cpu caches/prefetches the data      writes to the data
>     which is chaos                          writes to the data
>                                             percpu_up_write()
>                                                 mb()
>                                                 p->locked = false;
>     unlikely(p->locked)
>         the cpu sees p->locked == false,
>         doesn't discard the cached/prefetched data
>     this_cpu_inc(*p->counters);
>     the code of read-access to the data
>     *and we use the chaos data*
> 
> So you need to add a mb() after "unlikely(p->locked)".

Does it need a full mb() or could it be just a rmb()? I wouldn't think
the down_read would need to protect against stores, would it? The
memory barrier should probably go in front of the unlikely() too. The
write to p->counters is handled by the synchronized sched, and adding a
rmb() in front of the unlikely check would keep prefetched data from
passing this barrier.
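
A sketch of that variant (hypothetical, not a posted patch; the slow
path is elided, and note that Lai argues above for a mb() after the
check rather than in front of it):

   static inline void percpu_down_read(struct percpu_rw_semaphore *p)
   {
           rcu_read_lock();
           smp_rmb();      /* keep prefetched data from passing here */
           if (unlikely(p->locked)) {
                   /* slow path: fall back to the mutex (elided) */
           }
           this_cpu_inc(*p->counters);
           rcu_read_unlock();
   }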

This is a perfect example why this primitive should be vetted outside of
mainline before it gets merged.

-- Steve




Re: [PATCH] percpu-rwsem: use barrier in unlock path

2012-10-17 Thread Lai Jiangshan
On 10/18/2012 04:28 AM, Steven Rostedt wrote:
> On Wed, Oct 17, 2012 at 11:07:21AM -0400, Mikulas Patocka wrote:
>>>
>>> Even if the previous patch is applied, percpu_down_read() still
>>> needs mb() to pair with it.
>>
>> percpu_down_read uses rcu_read_lock which should guarantee that memory 
>> accesses don't escape in front of a rcu-protected section.
> 
> You do realize that rcu_read_lock() does nothing more than a barrier(),
> right?
> 
> Paul worked really hard to get rcu_read_locks() to not call HW barriers.
> 
>>
>> If rcu_read_unlock has only an unlock barrier and not a full barrier, 
>> memory accesses could be moved in front of rcu_read_unlock and reordered 
>> with this_cpu_inc(*p->counters), but it doesn't matter because 
>> percpu_down_write does synchronize_rcu(), so it never sees these accesses 
>> halfway through.
> 
> Looking at the patch, you are correct. The read side doesn't need the
> memory barrier as the worst thing that will happen is that it sees the
> locked = false, and will just grab the mutex unnecessarily.

-
A memory barrier can be added iff these two things are known:
1) which reordering it prevents, i.e. between which two accesses.
2) which corresponding mb() it pairs with.
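
A minimal sketch of such a pairing (illustration only, not from the
patch under discussion):

   /*
    * writer                           reader
    * ------                           ------
    * data = 1;                        if (flag) {
    * smp_wmb();    <-- pairs with -->         smp_rmb();
    * flag = 1;                                use(data);
    *                                  }
    *
    * Each barrier is justified by naming the two accesses it orders
    * and the barrier on the other side that it pairs with.
    */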

You tried to add a mb() in percpu_up_write(); OK, I know it prevents the
reordering between the writes to the protected data and the statement
"p->locked = false", but I can't find the corresponding mb() that it
pairs with.

reader                                      writer

percpu_down_read()                          writes to the data
    the cpu caches/prefetches the data      writes to the data
    which is chaos                          writes to the data
                                            percpu_up_write()
                                                mb()
                                                p->locked = false;
    unlikely(p->locked)
        the cpu sees p->locked == false,
        doesn't discard the cached/prefetched data
    this_cpu_inc(*p->counters);
    the code of read-access to the data
    *and we use the chaos data*

So you need to add a mb() after "unlikely(p->locked)".

-

The RCU you use doesn't protect any data. It protects the code of the fast path:
unlikely(p->locked);
this_cpu_inc(*p->counters);

and synchronize_rcu() ensures that all previous fast paths have fully finished
"this_cpu_inc(*p->counters);".

It doesn't protect other code/data; if you want to protect other code or other
data, please add more synchronization or mb()s.

---

I extremely hate synchronization that protects code instead of data,
but sometimes I also have to do it.

---

a very draft example of paired-mb()s is here:


diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index cf80f7e..84a93c0 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -12,6 +12,14 @@ struct percpu_rw_semaphore {
 	struct mutex mtx;
 };
 
+#if 1
+#define light_mb() barrier()
+#define heavy_mb() synchronize_sched()
+#else
+#define light_mb() smp_mb()
+#define heavy_mb() smp_mb()
+#endif
+
 static inline void percpu_down_read(struct percpu_rw_semaphore *p)
 {
 	rcu_read_lock();
@@ -24,22 +32,12 @@ static inline void percpu_down_read(struct percpu_rw_semaphore *p)
 	}
 	this_cpu_inc(*p->counters);
 	rcu_read_unlock();
+	light_mb(); /* A, between read of p->locked and read of data, paired with D */
 }
 
 static inline void percpu_up_read(struct percpu_rw_semaphore *p)
 {
-	/*
-	 * On X86, write operation in this_cpu_dec serves as a memory unlock
-	 * barrier (i.e. memory accesses may be moved before the write, but
-	 * no memory accesses are moved past the write).
-	 * On other architectures this may not be the case, so we need smp_mb()
-	 * there.
-	 */
-#if defined(CONFIG_X86) && (!defined(CONFIG_X86_PPRO_FENCE) && !defined(CONFIG_X86_OOSTORE))
-	barrier();
-#else
-	smp_mb();
-#endif
+	light_mb(); /* B, between read of the data and write to p->counter, paired with C */
 	this_cpu_dec(*p->counters);
 }
 
@@ -61,11 +59,12 @@ static inline void percpu_down_write(struct percpu_rw_semaphore *p)
 	synchronize_rcu();
 	while (__percpu_count(p->counters))
 		msleep(1);
-	smp_rmb(); /* paired with smp_mb() in percpu_sem_up_read() */
+	heavy_mb(); /* C, between read of p->counter and write to data, paired with B */
 }
 
 static inline void percpu_up_write(struct percpu_rw_semaphore *p)
 {
+	heavy_mb(); /* D, between write to data and write to p->locked, paired with A */
 	p->locked = false;
 	mutex_unlock(&p->mtx);
 }

Re: [PATCH] percpu-rwsem: use barrier in unlock path

2012-10-17 Thread Steven Rostedt
On Wed, Oct 17, 2012 at 11:07:21AM -0400, Mikulas Patocka wrote:
> > 
> > Even if the previous patch is applied, percpu_down_read() still
> > needs mb() to pair with it.
> 
> percpu_down_read uses rcu_read_lock which should guarantee that memory 
> accesses don't escape in front of a rcu-protected section.

> You do realize that rcu_read_lock() does nothing more than a barrier(),
right?

Paul worked really hard to get rcu_read_locks() to not call HW barriers.

> 
> If rcu_read_unlock has only an unlock barrier and not a full barrier, 
> memory accesses could be moved in front of rcu_read_unlock and reordered 
> with this_cpu_inc(*p->counters), but it doesn't matter because 
> percpu_down_write does synchronize_rcu(), so it never sees these accesses 
> halfway through.

Looking at the patch, you are correct. The read side doesn't need the
memory barrier as the worst thing that will happen is that it sees the
locked = false, and will just grab the mutex unnecessarily.

> > 
> > I suggest any new synchronization should stay in -tip for 2 or more cycles
> > before being merged to mainline.
> 
> But the bug that this synchronization is fixing is quite serious (it 
> causes random crashes when block size is being changed, the crash happens 
> regularly at multiple important business sites) so it must be fixed soon 
> and not wait half a year.

I don't think Lai was suggesting to wait on this fix, but instead to
totally rip out the percpu_rwsems and work on them some more, and then
re-introduce them in half a year.

-- Steve



Re: [PATCH] percpu-rwsem: use barrier in unlock path

2012-10-17 Thread Mikulas Patocka
Hi


On Wed, 17 Oct 2012, Lai Jiangshan wrote:

> On 10/17/2012 10:23 AM, Linus Torvalds wrote:
> > [ Architecture people, note the potential new SMP barrier! ]
> > 
> > On Tue, Oct 16, 2012 at 4:30 PM, Mikulas Patocka wrote:
> >> +   /*
> >> +    * The lock is considered unlocked when p->locked is set to false.
> >> +    * Use a barrier to prevent reordering of operations around p->locked.
> >> +    */
> >> +#if defined(CONFIG_X86) && (!defined(CONFIG_X86_PPRO_FENCE) && !defined(CONFIG_X86_OOSTORE))
> >> +   barrier();
> >> +#else
> >> +   smp_mb();
> >> +#endif
> >> p->locked = false;
> > 
> > Ugh. The #if is too ugly to live.
> 
> Even if the previous patch is applied, percpu_down_read() still
> needs mb() to pair with it.

percpu_down_read uses rcu_read_lock which should guarantee that memory 
accesses don't escape in front of a rcu-protected section.

If rcu_read_unlock has only an unlock barrier and not a full barrier, 
memory accesses could be moved in front of rcu_read_unlock and reordered 
with this_cpu_inc(*p->counters), but it doesn't matter because 
percpu_down_write does synchronize_rcu(), so it never sees these accesses 
halfway through.

> > This is a classic case of "people who write their own serialization
> > primitives invariably get them wrong". And this fix is just horrible,
> > and code like this should not be allowed.
> 
> One of the biggest problems of 62ac665ff9fc07497ca524bd20d6a96893d11071 is
> that it was merged without Acked-bys or Reviewed-bys from Paul or Peter or
> someone else who is an expert in synchronization/arch memory models.
> 
> I suggest any new synchronization should stay in -tip for 2 or more cycles
> before being merged to mainline.

But the bug that this synchronization is fixing is quite serious (it 
causes random crashes when block size is being changed, the crash happens 
regularly at multiple important business sites) so it must be fixed soon 
and not wait half a year.

> Thanks,
> Lai

Mikulas


Re: [PATCH] percpu-rwsem: use barrier in unlock path

2012-10-17 Thread Alan Cox
> a later store, and stores are viewed in program order, so all x86
> stores have "release consistency" modulo the theoretical PPro bugs
> (that I don't think we actually ever saw in practice).

We did see them for real. The PPro is on its way out, however, so I hardly
think we care about optimising for that case.

Alan


Re: [PATCH] percpu-rwsem: use barrier in unlock path

2012-10-16 Thread Lai Jiangshan
On 10/17/2012 10:23 AM, Linus Torvalds wrote:
> [ Architecture people, note the potential new SMP barrier! ]
> 
> On Tue, Oct 16, 2012 at 4:30 PM, Mikulas Patocka wrote:
>> +   /*
>> +    * The lock is considered unlocked when p->locked is set to false.
>> +    * Use a barrier to prevent reordering of operations around p->locked.
>> +    */
>> +#if defined(CONFIG_X86) && (!defined(CONFIG_X86_PPRO_FENCE) && !defined(CONFIG_X86_OOSTORE))
>> +   barrier();
>> +#else
>> +   smp_mb();
>> +#endif
>> p->locked = false;
> 
> Ugh. The #if is too ugly to live.

Even if the previous patch is applied, percpu_down_read() still
needs mb() to pair with it.

> 
> This is a classic case of "people who write their own serialization
> primitives invariably get them wrong". And this fix is just horrible,
> and code like this should not be allowed.

One of the biggest problems of 62ac665ff9fc07497ca524bd20d6a96893d11071 is
that it was merged without Acked-bys or Reviewed-bys from Paul or Peter or
someone else who is an expert in synchronization/arch memory models.

I suggest any new synchronization should stay in -tip for 2 or more cycles
before being merged to mainline.

Thanks,
Lai


Re: [PATCH] percpu-rwsem: use barrier in unlock path

2012-10-16 Thread Linus Torvalds
[ Architecture people, note the potential new SMP barrier! ]

On Tue, Oct 16, 2012 at 4:30 PM, Mikulas Patocka wrote:
> +   /*
> +    * The lock is considered unlocked when p->locked is set to false.
> +    * Use a barrier to prevent reordering of operations around p->locked.
> +    */
> +#if defined(CONFIG_X86) && (!defined(CONFIG_X86_PPRO_FENCE) && !defined(CONFIG_X86_OOSTORE))
> +   barrier();
> +#else
> +   smp_mb();
> +#endif
> p->locked = false;

Ugh. The #if is too ugly to live.

This is a classic case of "people who write their own serialization
primitives invariably get them wrong". And this fix is just horrible,
and code like this should not be allowed.

I suspect we could make the above x86-specific optimization be valid
by introducing a new barrier, called "smp_release_before_store()" or
something like that, which on x86 just happens to be a no-op (but
still a compiler barrier). That's because earlier reads will not pass
later stores, and stores are viewed in program order, so all x86
stores have "release consistency" modulo the theoretical PPro bugs
(that I don't think we actually ever saw in practice).
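
Spelled out as a two-CPU sketch (an illustration assuming the usual
x86-TSO rules, not code from the patch):

   CPU0 (writer)                       CPU1 (reader)
   -------------                       -------------
   data = 42;                          r1 = p->locked;  /* sees false */
   barrier();  /* compiler only */     r2 = data;       /* sees 42?   */
   p->locked = false;

Because CPU0's stores become visible in program order, CPU1 cannot see
p->locked == false while the data store is still invisible -- but only
if CPU1's own loads are ordered as well (on x86 they are; weaker
architectures would need a read-side barrier to pair with this).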

But it is possible that that is true on other architectures too, so
your hand-crafted x86-specific #if is not just ugly, it's liable to
get worse.

The alternative is to just say that we should use "smp_mb()"
unconditionally, depending on how critical this code actually ends up
being.

Added linux-arch in case architecture-maintainers have comments on
"smp_release_before_store()" thing. It would be kind of similar to the
odd "smp_read_barrier_depends()", except it would normally be a full
memory barrier, except on architectures where a weaker barrier might
be sufficient.

I suspect there may be many architectures where a "smp_wmb()" is
sufficient for this case, for the simple reason that no sane
microarchitecture would *ever* move earlier reads down past a later
write, so release consistency really only needs the local writes to be
ordered, not the full memory ordering.
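
Concretely, that weaker variant would amount to something like this
(sketch only -- whether smp_wmb() really suffices on any particular
architecture is exactly the question):

   static inline void percpu_up_write(struct percpu_rw_semaphore *p)
   {
           smp_wmb();              /* order protected-data writes ... */
           p->locked = false;      /* ... before the unlock store */
           mutex_unlock(&p->mtx);
   }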

Arch people?

The more optimal solution may be to mark the store *itself* to be
"store with release consistency", which on x86 would be a regular
store (with the compiler barrier), but on other architectures may be a
special memory operation. On architectures with
release/acquire-consistency, there's not a separate barrier before the
store, the store instruction itself is done with special semantics. So
maybe the right thing to do is

   #define smp_release_consistency_store(val, ptr) ...

where on x86, the implementation would be a simple

   do { barrier(); *(ptr)=(val); } while (0)

but on other architectures it might be an inline asm with the required
magic store-with-release instruction.
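
Put together, a minimal sketch might look like this (the fallback
mirrors the "just use smp_mb() unconditionally" alternative above;
none of this is an existing kernel API):

   /* default: a full barrier is always safe */
   #ifndef smp_release_consistency_store
   #define smp_release_consistency_store(val, ptr) \
           do { smp_mb(); *(ptr) = (val); } while (0)
   #endif

   static inline void percpu_up_write(struct percpu_rw_semaphore *p)
   {
           /*
            * All earlier writes to the protected data are visible
            * before any reader can observe p->locked == false.
            */
           smp_release_consistency_store(false, &p->locked);
           mutex_unlock(&p->mtx);
   }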

How important is this code sequence? Is the "percpu_up_write()"
function really so critical that we can't have an extra memory
barrier? Or do people perhaps see *other* places where
release-consistency-stores might be worth doing?

But in no event do I want to see that butt-ugly #if statement.

Linus


[PATCH] percpu-rwsem: use barrier in unlock path

2012-10-16 Thread Mikulas Patocka
Hi

When I read my patch 62ac665ff9fc07497ca524bd20d6a96893d11071 again, it 
turns out that it is missing a barrier in percpu_up_write. Please apply 
this.

Mikulas

---

percpu-rwsem: use barrier in unlock path

Fix missing barrier in patch 62ac665ff9fc07497ca524bd20d6a96893d11071.

The lock is considered unlocked when p->locked is set to false.
Use a barrier to prevent reordering of operations around p->locked.

Signed-off-by: Mikulas Patocka 

---
 include/linux/percpu-rwsem.h |9 +
 1 file changed, 9 insertions(+)

Index: linux-3.6.2-fast/include/linux/percpu-rwsem.h
===
--- linux-3.6.2-fast.orig/include/linux/percpu-rwsem.h	2012-10-17 01:17:55.0 +0200
+++ linux-3.6.2-fast/include/linux/percpu-rwsem.h	2012-10-17 01:21:46.0 +0200
@@ -66,6 +66,15 @@ static inline void percpu_down_write(str
 
 static inline void percpu_up_write(struct percpu_rw_semaphore *p)
 {
+   /*
+* The lock is considered unlocked when p->locked is set to false.
+* Use barrier prevent reordering of operations around p->locked.
+*/
+#if defined(CONFIG_X86) && (!defined(CONFIG_X86_PPRO_FENCE) && !defined(CONFIG_X86_OOSTORE))
+   barrier();
+#else
+   smp_mb();
+#endif
p->locked = false;
mutex_unlock(&p->mtx);
 }
