Re: Adding plain accesses and detecting data races in the LKMM

2019-04-21 Thread Paul E. McKenney
On Sat, Apr 20, 2019 at 11:50:14PM +0900, Akira Yokosawa wrote:
> On Fri, 19 Apr 2019 11:06:41 -0700, Paul E. McKenney wrote:
> > On Sat, Apr 20, 2019 at 12:06:58AM +0900, Akira Yokosawa wrote:
> >> Hi Paul,
> >>
> [...]
> > 
> >>> + (1) The compiler can reorder the load from a to precede the
> >>> + atomic_dec(), (2) Because x86 smp_mb__before_atomic() is only a
> >>> + compiler barrier, the CPU can reorder the preceding store to
> >>> + obj->dead with the later load from a.
> >>> +
> >>> + This could be avoided by using READ_ONCE(), which would prevent the
> >>> + compiler from reordering due to both atomic_dec() and READ_ONCE()
> >>> + being volatile accesses, and is usually preferable for loads from
> >>> + shared variables.  However, weakly ordered CPUs would still be
> >>> + free to reorder the atomic_dec() with the load from a, so a more
> >>> + readable option is to also use smp_mb__after_atomic() as follows:
> >>
> >> The point here is not just "readability", but also the portability of the
> >> code, isn't it?
> > 
> > As Andrea noted, in this particular case, the guarantee that the
> > store to obj->dead precedes the load from x is portable.  Either the
> > smp_mb__before_atomic() or the atomic_dec() must provide the ordering.
> 
> I think I understood this. What I wanted to say was the code for x86 implied
> in the subjunctive sentence:
> 
>   obj->dead = 1;
>   smp_mb__before_atomic();
>   atomic_dec(&obj->ref_count);
>   r1 = READ_ONCE(x);
> 
> , which was not spelled out, is not portable if we expect the ordering of
> atomic_dec() with READ_ONCE().

I now understand that you understood.  ;-)

> > However, you are right that there is some non-portability.  But this
> > non-portability involves the order of the atomic_dec() and the store to x.
> 
> Yes, you've guessed it right.

Don't worry, it won't happen again!

> > So what I did was ...
> > 
> >> Thanks, Akira
> >>
> >>> +
> >>> + WRITE_ONCE(obj->dead, 1);
> >>> + smp_mb__before_atomic();
> >>> + atomic_dec(&obj->ref_count);
> >>> + smp_mb__after_atomic();
> >>> + r1 = READ_ONCE(a);
> >>> +
> >>> + This orders all three accesses against each other, and also makes
> >>> + the intent quite clear.
> > 
> > ... change the above paragraph to read as follows:
> > 
> >  In addition, the example without the smp_mb__after_atomic() does
> >  not necessarily order the atomic_dec() with the load from x.
> >  In contrast, the example with both smp_mb__before_atomic() and
> >  smp_mb__after_atomic() orders all three accesses against each other,
> >  and also makes the intent quite clear.
> > 
> > Does that help?
> 
> This looks a little bit redundant to me. The original one is clear
> enough.
> 
> How about editing the leading sentence above:
> 
> >>> + shared variables.  However, weakly ordered CPUs would still be
> >>> + free to reorder the atomic_dec() with the load from a, so a more
> >>> + readable option is to also use smp_mb__after_atomic() as follows:
> 
> to read as follows?
> 
>  shared variables.  However, weakly ordered CPUs would still be
>  free to reorder the atomic_dec() with the load from x, so a
>  portable and more readable option is to also use
>  smp_mb__after_atomic() as follows:

Adding "portable and", correct?  Makes sense, so I applied this change.

> Obviously, the interesting discussion going on in another thread will
> surely affect this patch.

Quite possibly!  ;-)


Thanx, Paul

> >>>   See Documentation/atomic_{t,bitops}.txt for more information.
> >>>  
> >>> diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
> >>> index 8dcb37835b61..b6866f93abb8 100644
> >>> --- a/tools/memory-model/linux-kernel.cat
> >>> +++ b/tools/memory-model/linux-kernel.cat
> >>> @@ -28,8 +28,8 @@ include "lock.cat"
> >>>  let rmb = [R \ Noreturn] ; fencerel(Rmb) ; [R \ Noreturn]
> >>>  let wmb = [W] ; fencerel(Wmb) ; [W]
> >>>  let mb = ([M] ; fencerel(Mb) ; [M]) |
> >>> - ([M] ; fencerel(Before-atomic) ; [RMW] ; po? ; [M]) |
> >>> - ([M] ; po? ; [RMW] ; fencerel(After-atomic) ; [M]) |
> >>> + ([M] ; fencerel(Before-atomic) ; [RMW]) |
> >>> + ([RMW] ; fencerel(After-atomic) ; [M]) |
> >>>   ([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M]) |
> >>>   ([M] ; po ; [UL] ; (co | po) ; [LKW] ;
> >>>   fencerel(After-unlock-lock) ; [M])
> >>>
> >>
> > 
> 

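A sketch, paraphrasing the points made above rather than quoting the patch, of the two variants being contrasted (the annotations restate this exchange's claims and are not authoritative):

	/*
	 * Variant without smp_mb__after_atomic(): the store to obj->dead is
	 * ordered before the load from x on all architectures, because either
	 * smp_mb__before_atomic() or atomic_dec() must provide that ordering.
	 * However, weakly ordered CPUs may still reorder the atomic_dec()
	 * with the load from x.
	 */
	WRITE_ONCE(obj->dead, 1);
	smp_mb__before_atomic();
	atomic_dec(&obj->ref_count);
	r1 = READ_ONCE(x);

	/*
	 * Variant from the patch: portable, and orders all three accesses
	 * against each other.
	 */
	WRITE_ONCE(obj->dead, 1);
	smp_mb__before_atomic();
	atomic_dec(&obj->ref_count);
	smp_mb__after_atomic();
	r1 = READ_ONCE(x);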


Re: Adding plain accesses and detecting data races in the LKMM

2019-04-20 Thread Akira Yokosawa
On Fri, 19 Apr 2019 11:06:41 -0700, Paul E. McKenney wrote:
> On Sat, Apr 20, 2019 at 12:06:58AM +0900, Akira Yokosawa wrote:
>> Hi Paul,
>>
[...]
> 
>>> + (1) The compiler can reorder the load from a to precede the
>>> + atomic_dec(), (2) Because x86 smp_mb__before_atomic() is only a
>>> + compiler barrier, the CPU can reorder the preceding store to
>>> + obj->dead with the later load from a.
>>> +
>>> + This could be avoided by using READ_ONCE(), which would prevent the
>>> + compiler from reordering due to both atomic_dec() and READ_ONCE()
>>> + being volatile accesses, and is usually preferable for loads from
>>> + shared variables.  However, weakly ordered CPUs would still be
>>> + free to reorder the atomic_dec() with the load from a, so a more
>>> + readable option is to also use smp_mb__after_atomic() as follows:
>>
>> The point here is not just "readability", but also the portability of the
>> code, isn't it?
> 
> As Andrea noted, in this particular case, the guarantee that the
> store to obj->dead precedes the load from x is portable.  Either the
> smp_mb__before_atomic() or the atomic_dec() must provide the ordering.

I think I understood this. What I wanted to say was the code for x86 implied
in the subjunctive sentence:

obj->dead = 1;
smp_mb__before_atomic();
atomic_dec(&obj->ref_count);
r1 = READ_ONCE(x);

, which was not spelled out, is not portable if we expect the ordering of
atomic_dec() with READ_ONCE().

> However, you are right that there is some non-portability.  But this
> non-portability involves the order of the atomic_dec() and the store to x.

Yes, you've guessed it right.

> 
> So what I did was ...
> 
>> Thanks, Akira
>>
>>> +
>>> +   WRITE_ONCE(obj->dead, 1);
>>> +   smp_mb__before_atomic();
>>> +   atomic_dec(&obj->ref_count);
>>> +   smp_mb__after_atomic();
>>> +   r1 = READ_ONCE(a);
>>> +
>>> + This orders all three accesses against each other, and also makes
>>> + the intent quite clear.
> 
> ... change the above paragraph to read as follows:
> 
>  In addition, the example without the smp_mb__after_atomic() does
>  not necessarily order the atomic_dec() with the load from x.
>  In contrast, the example with both smp_mb__before_atomic() and
>  smp_mb__after_atomic() orders all three accesses against each other,
>  and also makes the intent quite clear.
> 
> Does that help?

This looks a little bit redundant to me. The original one is clear
enough.

How about editing the leading sentence above:

>>> + shared variables.  However, weakly ordered CPUs would still be
>>> + free to reorder the atomic_dec() with the load from a, so a more
>>> + readable option is to also use smp_mb__after_atomic() as follows:

to read as follows?

 shared variables.  However, weakly ordered CPUs would still be
 free to reorder the atomic_dec() with the load from x, so a
 portable and more readable option is to also use
 smp_mb__after_atomic() as follows:

Obviously, the interesting discussion going on in another thread will
surely affect this patch.

Thanks, Akira

> 
>   Thanx, Paul
> 
>>>   See Documentation/atomic_{t,bitops}.txt for more information.
>>>  
>>> diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
>>> index 8dcb37835b61..b6866f93abb8 100644
>>> --- a/tools/memory-model/linux-kernel.cat
>>> +++ b/tools/memory-model/linux-kernel.cat
>>> @@ -28,8 +28,8 @@ include "lock.cat"
>>>  let rmb = [R \ Noreturn] ; fencerel(Rmb) ; [R \ Noreturn]
>>>  let wmb = [W] ; fencerel(Wmb) ; [W]
>>>  let mb = ([M] ; fencerel(Mb) ; [M]) |
>>> -   ([M] ; fencerel(Before-atomic) ; [RMW] ; po? ; [M]) |
>>> -   ([M] ; po? ; [RMW] ; fencerel(After-atomic) ; [M]) |
>>> +   ([M] ; fencerel(Before-atomic) ; [RMW]) |
>>> +   ([RMW] ; fencerel(After-atomic) ; [M]) |
>>> ([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M]) |
>>> ([M] ; po ; [UL] ; (co | po) ; [LKW] ;
>>> fencerel(After-unlock-lock) ; [M])
>>>
>>
> 



Re: Adding plain accesses and detecting data races in the LKMM

2019-04-19 Thread Akira Yokosawa
Hi Paul,

Please find inline comments below.

On Fri, 19 Apr 2019 05:47:20 -0700, Paul E. McKenney wrote:
> On Fri, Apr 19, 2019 at 02:53:02AM +0200, Andrea Parri wrote:
>>> Are you saying that on x86, atomic_inc() acts as a full memory barrier 
>>> but not as a compiler barrier, and vice versa for 
>>> smp_mb__after_atomic()?  Or that neither atomic_inc() nor 
>>> smp_mb__after_atomic() implements a full memory barrier?
>>
>> I'd say the former; AFAICT, these boil down to:
>>
>>   
>> https://elixir.bootlin.com/linux/v5.1-rc5/source/arch/x86/include/asm/atomic.h#L95
>>   
>> https://elixir.bootlin.com/linux/v5.1-rc5/source/arch/x86/include/asm/barrier.h#L84
> 
> OK, how about the following?
> 
>   Thanx, Paul
> 
> 
> 
> commit 19d166dadc4e1bba4b248fb46d32ca4f2d10896b
> Author: Paul E. McKenney 
> Date:   Fri Apr 19 05:20:30 2019 -0700
> 
> tools/memory-model: Make smp_mb__{before,after}_atomic() match x86
> 
> Read-modify-write atomic operations that do not return values need not
> provide any ordering guarantees, and this means that both the compiler
> and the CPU are free to reorder accesses across things like atomic_inc()
> and atomic_dec().  The stronger systems such as x86 allow the compiler
> to do the reordering, but prevent the CPU from so doing, and these
> systems implement smp_mb__{before,after}_atomic() as compiler barriers.
> The weaker systems such as Power allow both the compiler and the CPU
> to reorder accesses across things like atomic_inc() and atomic_dec(),
> and implement smp_mb__{before,after}_atomic() as full memory barriers.
> 
> This means that smp_mb__before_atomic() only orders the atomic operation
> itself with accesses preceding the smp_mb__before_atomic(), and does
> not necessarily provide any ordering whatsoever against accesses
> folowing the atomic operation.  Similarly, smp_mb__after_atomic()

s/folowing/following/

> only orders the atomic operation itself with accesses following the
> smp_mb__after_atomic(), and does not necessarily provide any ordering
> whatsoever against accesses preceding the atomic operation.  Full ordering
> therefore requires both an smp_mb__before_atomic() before the atomic
> operation and an smp_mb__after_atomic() after the atomic operation.
> 
> Therefore, linux-kernel.cat's current model of Before-atomic
> and After-atomic is too strong, as it guarantees ordering of
> accesses on the other side of the atomic operation from the
> smp_mb__{before,after}_atomic().  This commit therefore weakens
> the guarantee to match the semantics called out above.
> 
> Reported-by: Andrea Parri 
> Suggested-by: Alan Stern 
> Signed-off-by: Paul E. McKenney 
> 
> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> index 169d938c0b53..e5b97c3e8e39 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -1888,7 +1888,37 @@ There are some more advanced barrier functions:
>   atomic_dec(&obj->ref_count);
>  
>   This makes sure that the death mark on the object is perceived to be set
> - *before* the reference counter is decremented.
> + *before* the reference counter is decremented.  However, please note
> + that smp_mb__before_atomic()'s ordering guarantee does not necessarily
> + extend beyond the atomic operation.  For example:
> +
> + obj->dead = 1;
> + smp_mb__before_atomic();
> + atomic_dec(&obj->ref_count);
> + r1 = a;
> +
> + Here the store to obj->dead is not guaranteed to be ordered with
> + with the load from a.  This reordering can happen on x86 as follows:

s/with//

And I beg you to avoid using the single letter variable "a".
It's confusing.

> + (1) The compiler can reorder the load from a to precede the
> + atomic_dec(), (2) Because x86 smp_mb__before_atomic() is only a
> + compiler barrier, the CPU can reorder the preceding store to
> + obj->dead with the later load from a.
> +
> + This could be avoided by using READ_ONCE(), which would prevent the
> + compiler from reordering due to both atomic_dec() and READ_ONCE()
> + being volatile accesses, and is usually preferable for loads from
> + shared variables.  However, weakly ordered CPUs would still be
> + free to reorder the atomic_dec() with the load from a, so a more
> + readable option is to also use smp_mb__after_atomic() as follows:

The point here is not just "readability", but also the portability of the
code, isn't it?

Thanks, Akira

> +
> + WRITE_ONCE(obj->dead, 1);
> + smp_mb__before_atomic();
> + atomic_dec(&obj->ref_count);
> + smp_mb__after_atomic();
> + r1 = READ_ONCE(a);
> +
> + This orders all three accesses against each other, and also makes

Re: Adding plain accesses and detecting data races in the LKMM

2019-04-19 Thread Andrea Parri
> > + (1) The compiler can reorder the load from a to precede the
> > + atomic_dec(), (2) Because x86 smp_mb__before_atomic() is only a
> > + compiler barrier, the CPU can reorder the preceding store to
> > + obj->dead with the later load from a.
> > +
> > + This could be avoided by using READ_ONCE(), which would prevent the
> > + compiler from reordering due to both atomic_dec() and READ_ONCE()
> > + being volatile accesses, and is usually preferable for loads from
> > + shared variables.  However, weakly ordered CPUs would still be
> > + free to reorder the atomic_dec() with the load from a, so a more
> > + readable option is to also use smp_mb__after_atomic() as follows:
> > +
> > +   WRITE_ONCE(obj->dead, 1);
> > +   smp_mb__before_atomic();
> > +   atomic_dec(&obj->ref_count);
> > +   smp_mb__after_atomic();
> > +   r1 = READ_ONCE(a);
> 
> The point here is not just "readability", but also the portability of the
> code, isn't it?

The implicit assumption was, I guess, that all weakly ordered CPUs which
are free to reorder the atomic_dec() with the READ_ONCE() execute a full
memory barrier in smp_mb__before_atomic() ...  This assumption currently
holds, AFAICT, but yes: it may well become "non-portable"! ... ;-)

  Andrea


Re: Adding plain accesses and detecting data races in the LKMM

2019-04-19 Thread Paul E. McKenney
On Sat, Apr 20, 2019 at 12:06:58AM +0900, Akira Yokosawa wrote:
> Hi Paul,
> 
> Please find inline comments below.
> 
> On Fri, 19 Apr 2019 05:47:20 -0700, Paul E. McKenney wrote:
> > On Fri, Apr 19, 2019 at 02:53:02AM +0200, Andrea Parri wrote:
> >>> Are you saying that on x86, atomic_inc() acts as a full memory barrier 
> >>> but not as a compiler barrier, and vice versa for 
> >>> smp_mb__after_atomic()?  Or that neither atomic_inc() nor 
> >>> smp_mb__after_atomic() implements a full memory barrier?
> >>
> >> I'd say the former; AFAICT, these boil down to:
> >>
> >>   
> >> https://elixir.bootlin.com/linux/v5.1-rc5/source/arch/x86/include/asm/atomic.h#L95
> >>   
> >> https://elixir.bootlin.com/linux/v5.1-rc5/source/arch/x86/include/asm/barrier.h#L84
> > 
> > OK, how about the following?
> > 
> > Thanx, Paul
> > 
> > 
> > 
> > commit 19d166dadc4e1bba4b248fb46d32ca4f2d10896b
> > Author: Paul E. McKenney 
> > Date:   Fri Apr 19 05:20:30 2019 -0700
> > 
> > tools/memory-model: Make smp_mb__{before,after}_atomic() match x86
> > 
> > Read-modify-write atomic operations that do not return values need not
> > provide any ordering guarantees, and this means that both the compiler
> > and the CPU are free to reorder accesses across things like atomic_inc()
> > and atomic_dec().  The stronger systems such as x86 allow the compiler
> > to do the reordering, but prevent the CPU from so doing, and these
> > systems implement smp_mb__{before,after}_atomic() as compiler barriers.
> > The weaker systems such as Power allow both the compiler and the CPU
> > to reorder accesses across things like atomic_inc() and atomic_dec(),
> > and implement smp_mb__{before,after}_atomic() as full memory barriers.
> > 
> > This means that smp_mb__before_atomic() only orders the atomic operation
> > itself with accesses preceding the smp_mb__before_atomic(), and does
> > not necessarily provide any ordering whatsoever against accesses
> > folowing the atomic operation.  Similarly, smp_mb__after_atomic()
> 
> s/folowing/following/

Good eyes, fixed!

> > only orders the atomic operation itself with accesses following the
> > smp_mb__after_atomic(), and does not necessarily provide any ordering
> > whatsoever against accesses preceding the atomic operation.  Full ordering
> > therefore requires both an smp_mb__before_atomic() before the atomic
> > operation and an smp_mb__after_atomic() after the atomic operation.
> > 
> > Therefore, linux-kernel.cat's current model of Before-atomic
> > and After-atomic is too strong, as it guarantees ordering of
> > accesses on the other side of the atomic operation from the
> > smp_mb__{before,after}_atomic().  This commit therefore weakens
> > the guarantee to match the semantics called out above.
> > 
> > Reported-by: Andrea Parri 
> > Suggested-by: Alan Stern 
> > Signed-off-by: Paul E. McKenney 
> > 
> > diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> > index 169d938c0b53..e5b97c3e8e39 100644
> > --- a/Documentation/memory-barriers.txt
> > +++ b/Documentation/memory-barriers.txt
> > @@ -1888,7 +1888,37 @@ There are some more advanced barrier functions:
> > atomic_dec(&obj->ref_count);
> >  
> >   This makes sure that the death mark on the object is perceived to be set
> > - *before* the reference counter is decremented.
> > + *before* the reference counter is decremented.  However, please note
> > + that smp_mb__before_atomic()'s ordering guarantee does not necessarily
> > + extend beyond the atomic operation.  For example:
> > +
> > +   obj->dead = 1;
> > +   smp_mb__before_atomic();
> > +   atomic_dec(&obj->ref_count);
> > +   r1 = a;
> > +
> > + Here the store to obj->dead is not guaranteed to be ordered with
> > + with the load from a.  This reordering can happen on x86 as follows:
> 
> s/with//

Fixed fixed.  ;-)

> And I beg you to avoid using the single letter variable "a".
> It's confusing.

Good point!  I changed it to "x".

> > + (1) The compiler can reorder the load from a to precede the
> > + atomic_dec(), (2) Because x86 smp_mb__before_atomic() is only a
> > + compiler barrier, the CPU can reorder the preceding store to
> > + obj->dead with the later load from a.
> > +
> > + This could be avoided by using READ_ONCE(), which would prevent the
> > + compiler from reordering due to both atomic_dec() and READ_ONCE()
> > + being volatile accesses, and is usually preferable for loads from
> > + shared variables.  However, weakly ordered CPUs would still be
> > + free to reorder the atomic_dec() with the load from a, so a more
> > + readable option is to also use smp_mb__after_atomic() as follows:
> 
> The point here is not just "readability", but also the portability of the code, isn't it?

Re: Adding plain accesses and detecting data races in the LKMM

2019-04-19 Thread Paul E. McKenney
On Fri, Apr 19, 2019 at 02:53:02AM +0200, Andrea Parri wrote:
> > Are you saying that on x86, atomic_inc() acts as a full memory barrier 
> > but not as a compiler barrier, and vice versa for 
> > smp_mb__after_atomic()?  Or that neither atomic_inc() nor 
> > smp_mb__after_atomic() implements a full memory barrier?
> 
> I'd say the former; AFAICT, these boil down to:
> 
>   
> https://elixir.bootlin.com/linux/v5.1-rc5/source/arch/x86/include/asm/atomic.h#L95
>   
> https://elixir.bootlin.com/linux/v5.1-rc5/source/arch/x86/include/asm/barrier.h#L84

OK, how about the following?

Thanx, Paul



commit 19d166dadc4e1bba4b248fb46d32ca4f2d10896b
Author: Paul E. McKenney 
Date:   Fri Apr 19 05:20:30 2019 -0700

tools/memory-model: Make smp_mb__{before,after}_atomic() match x86

Read-modify-write atomic operations that do not return values need not
provide any ordering guarantees, and this means that both the compiler
and the CPU are free to reorder accesses across things like atomic_inc()
and atomic_dec().  The stronger systems such as x86 allow the compiler
to do the reordering, but prevent the CPU from so doing, and these
systems implement smp_mb__{before,after}_atomic() as compiler barriers.
The weaker systems such as Power allow both the compiler and the CPU
to reorder accesses across things like atomic_inc() and atomic_dec(),
and implement smp_mb__{before,after}_atomic() as full memory barriers.

This means that smp_mb__before_atomic() only orders the atomic operation
itself with accesses preceding the smp_mb__before_atomic(), and does
not necessarily provide any ordering whatsoever against accesses
folowing the atomic operation.  Similarly, smp_mb__after_atomic()
only orders the atomic operation itself with accesses following the
smp_mb__after_atomic(), and does not necessarily provide any ordering
whatsoever against accesses preceding the atomic operation.  Full ordering
therefore requires both an smp_mb__before_atomic() before the atomic
operation and an smp_mb__after_atomic() after the atomic operation.

Therefore, linux-kernel.cat's current model of Before-atomic
and After-atomic is too strong, as it guarantees ordering of
accesses on the other side of the atomic operation from the
smp_mb__{before,after}_atomic().  This commit therefore weakens
the guarantee to match the semantics called out above.

Reported-by: Andrea Parri 
Suggested-by: Alan Stern 
Signed-off-by: Paul E. McKenney 

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 169d938c0b53..e5b97c3e8e39 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1888,7 +1888,37 @@ There are some more advanced barrier functions:
atomic_dec(&obj->ref_count);
 
  This makes sure that the death mark on the object is perceived to be set
- *before* the reference counter is decremented.
+ *before* the reference counter is decremented.  However, please note
+ that smp_mb__before_atomic()'s ordering guarantee does not necessarily
+ extend beyond the atomic operation.  For example:
+
+   obj->dead = 1;
+   smp_mb__before_atomic();
+   atomic_dec(&obj->ref_count);
+   r1 = a;
+
+ Here the store to obj->dead is not guaranteed to be ordered with
+ with the load from a.  This reordering can happen on x86 as follows:
+ (1) The compiler can reorder the load from a to precede the
+ atomic_dec(), (2) Because x86 smp_mb__before_atomic() is only a
+ compiler barrier, the CPU can reorder the preceding store to
+ obj->dead with the later load from a.
+
+ This could be avoided by using READ_ONCE(), which would prevent the
+ compiler from reordering due to both atomic_dec() and READ_ONCE()
+ being volatile accesses, and is usually preferable for loads from
+ shared variables.  However, weakly ordered CPUs would still be
+ free to reorder the atomic_dec() with the load from a, so a more
+ readable option is to also use smp_mb__after_atomic() as follows:
+
+   WRITE_ONCE(obj->dead, 1);
+   smp_mb__before_atomic();
+   atomic_dec(&obj->ref_count);
+   smp_mb__after_atomic();
+   r1 = READ_ONCE(a);
+
+ This orders all three accesses against each other, and also makes
+ the intent quite clear.
 
  See Documentation/atomic_{t,bitops}.txt for more information.
 
diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
index 8dcb37835b61..b6866f93abb8 100644
--- a/tools/memory-model/linux-kernel.cat
+++ b/tools/memory-model/linux-kernel.cat
@@ -28,8 +28,8 @@ include "lock.cat"
 let rmb = [R \ Noreturn] ; fencerel(Rmb) ; [R \ Noreturn]
 let wmb = [W] ; fencerel(Wmb) ; [W]
 let mb = ([M] ; fencerel(Mb) ; [M]) |

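One way to exercise the semantics spelled out in the patch above is a store-buffering style litmus test run through herd7 against the model.  The sketch below is illustrative only (its name and exact formatting are mine, not taken from tools/memory-model/litmus-tests); with both fences around the atomic_dec(), the "exists" clause should never be satisfied:

	C SB+mb__before_atomic+mb__after_atomic

	(* Sketch: P0's store to x should be ordered before its load from y. *)

	{}

	P0(int *x, int *y, atomic_t *u)
	{
		int r0;

		WRITE_ONCE(*x, 1);
		smp_mb__before_atomic();
		atomic_dec(u);
		smp_mb__after_atomic();
		r0 = READ_ONCE(*y);
	}

	P1(int *x, int *y)
	{
		int r1;

		WRITE_ONCE(*y, 1);
		smp_mb();
		r1 = READ_ONCE(*x);
	}

	exists (0:r0=0 /\ 1:r1=0)
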
Re: Adding plain accesses and detecting data races in the LKMM

2019-04-19 Thread Alan Stern
On Fri, 19 Apr 2019, Paul E. McKenney wrote:

> On Fri, Apr 19, 2019 at 02:53:02AM +0200, Andrea Parri wrote:
> > > Are you saying that on x86, atomic_inc() acts as a full memory barrier 
> > > but not as a compiler barrier, and vice versa for 
> > > smp_mb__after_atomic()?  Or that neither atomic_inc() nor 
> > > smp_mb__after_atomic() implements a full memory barrier?
> > 
> > I'd say the former; AFAICT, these boil down to:
> > 
> >   
> > https://elixir.bootlin.com/linux/v5.1-rc5/source/arch/x86/include/asm/atomic.h#L95
> >   
> > https://elixir.bootlin.com/linux/v5.1-rc5/source/arch/x86/include/asm/barrier.h#L84
> 
> OK, how about the following?
> 
>   Thanx, Paul
> 
> 
> 
> commit 19d166dadc4e1bba4b248fb46d32ca4f2d10896b
> Author: Paul E. McKenney 
> Date:   Fri Apr 19 05:20:30 2019 -0700
> 
> tools/memory-model: Make smp_mb__{before,after}_atomic() match x86
> 
> Read-modify-write atomic operations that do not return values need not
> provide any ordering guarantees, and this means that both the compiler
> and the CPU are free to reorder accesses across things like atomic_inc()
> and atomic_dec().  The stronger systems such as x86 allow the compiler
> to do the reordering, but prevent the CPU from so doing, and these
> systems implement smp_mb__{before,after}_atomic() as compiler barriers.
> The weaker systems such as Power allow both the compiler and the CPU
> to reorder accesses across things like atomic_inc() and atomic_dec(),
> and implement smp_mb__{before,after}_atomic() as full memory barriers.
> 
> This means that smp_mb__before_atomic() only orders the atomic operation
> itself with accesses preceding the smp_mb__before_atomic(), and does
> not necessarily provide any ordering whatsoever against accesses
> folowing the atomic operation.  Similarly, smp_mb__after_atomic()
> only orders the atomic operation itself with accesses following the
> smp_mb__after_atomic(), and does not necessarily provide any ordering
> whatsoever against accesses preceding the atomic operation.  Full ordering
> therefore requires both an smp_mb__before_atomic() before the atomic
> operation and an smp_mb__after_atomic() after the atomic operation.
> 
> Therefore, linux-kernel.cat's current model of Before-atomic
> and After-atomic is too strong, as it guarantees ordering of
> accesses on the other side of the atomic operation from the
> smp_mb__{before,after}_atomic().  This commit therefore weakens
> the guarantee to match the semantics called out above.
> 
> Reported-by: Andrea Parri 
> Suggested-by: Alan Stern 
> Signed-off-by: Paul E. McKenney 
> 
> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> index 169d938c0b53..e5b97c3e8e39 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -1888,7 +1888,37 @@ There are some more advanced barrier functions:
>   atomic_dec(&obj->ref_count);
>  
>   This makes sure that the death mark on the object is perceived to be set
> - *before* the reference counter is decremented.
> + *before* the reference counter is decremented.  However, please note
> + that smp_mb__before_atomic()'s ordering guarantee does not necessarily
> + extend beyond the atomic operation.  For example:
> +
> + obj->dead = 1;
> + smp_mb__before_atomic();
> + atomic_dec(&obj->ref_count);
> + r1 = a;
> +
> + Here the store to obj->dead is not guaranteed to be ordered with
> + with the load from a.  This reordering can happen on x86 as follows:
> + (1) The compiler can reorder the load from a to precede the
> + atomic_dec(), (2) Because x86 smp_mb__before_atomic() is only a
> + compiler barrier, the CPU can reorder the preceding store to
> + obj->dead with the later load from a.
> +
> + This could be avoided by using READ_ONCE(), which would prevent the
> + compiler from reordering due to both atomic_dec() and READ_ONCE()
> + being volatile accesses, and is usually preferable for loads from
> + shared variables.  However, weakly ordered CPUs would still be
> + free to reorder the atomic_dec() with the load from a, so a more
> + readable option is to also use smp_mb__after_atomic() as follows:
> +
> + WRITE_ONCE(obj->dead, 1);
> + smp_mb__before_atomic();
> + atomic_dec(&obj->ref_count);
> + smp_mb__after_atomic();
> + r1 = READ_ONCE(a);
> +
> + This orders all three accesses against each other, and also makes
> + the intent quite clear.
>  
>   See Documentation/atomic_{t,bitops}.txt for more information.
>  
> diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
> index 8dcb37835b61..b6866f93abb8 100644
> --- a/tools/memory-model/

Re: Adding plain accesses and detecting data races in the LKMM

2019-04-19 Thread Paul E. McKenney
On Fri, Apr 19, 2019 at 10:34:06AM -0400, Alan Stern wrote:
> On Fri, 19 Apr 2019, Paul E. McKenney wrote:
> 
> > On Fri, Apr 19, 2019 at 02:53:02AM +0200, Andrea Parri wrote:
> > > > Are you saying that on x86, atomic_inc() acts as a full memory barrier 
> > > > but not as a compiler barrier, and vice versa for 
> > > > smp_mb__after_atomic()?  Or that neither atomic_inc() nor 
> > > > smp_mb__after_atomic() implements a full memory barrier?
> > > 
> > > I'd say the former; AFAICT, these boil down to:
> > > 
> > >   
> > > https://elixir.bootlin.com/linux/v5.1-rc5/source/arch/x86/include/asm/atomic.h#L95
> > >   
> > > https://elixir.bootlin.com/linux/v5.1-rc5/source/arch/x86/include/asm/barrier.h#L84
> > 
> > OK, how about the following?
> > 
> > Thanx, Paul
> > 
> > 
> > 
> > commit 19d166dadc4e1bba4b248fb46d32ca4f2d10896b
> > Author: Paul E. McKenney 
> > Date:   Fri Apr 19 05:20:30 2019 -0700
> > 
> > tools/memory-model: Make smp_mb__{before,after}_atomic() match x86
> > 
> > Read-modify-write atomic operations that do not return values need not
> > provide any ordering guarantees, and this means that both the compiler
> > and the CPU are free to reorder accesses across things like atomic_inc()
> > and atomic_dec().  The stronger systems such as x86 allow the compiler
> > to do the reordering, but prevent the CPU from so doing, and these
> > systems implement smp_mb__{before,after}_atomic() as compiler barriers.
> > The weaker systems such as Power allow both the compiler and the CPU
> > to reorder accesses across things like atomic_inc() and atomic_dec(),
> > and implement smp_mb__{before,after}_atomic() as full memory barriers.
> > 
> > This means that smp_mb__before_atomic() only orders the atomic operation
> > itself with accesses preceding the smp_mb__before_atomic(), and does
> > not necessarily provide any ordering whatsoever against accesses
> > folowing the atomic operation.  Similarly, smp_mb__after_atomic()
> > only orders the atomic operation itself with accesses following the
> > smp_mb__after_atomic(), and does not necessarily provide any ordering
> > whatsoever against accesses preceding the atomic operation.  Full ordering
> > therefore requires both an smp_mb__before_atomic() before the atomic
> > operation and an smp_mb__after_atomic() after the atomic operation.
> > 
> > Therefore, linux-kernel.cat's current model of Before-atomic
> > and After-atomic is too strong, as it guarantees ordering of
> > accesses on the other side of the atomic operation from the
> > smp_mb__{before,after}_atomic().  This commit therefore weakens
> > the guarantee to match the semantics called out above.
> > 
> > Reported-by: Andrea Parri 
> > Suggested-by: Alan Stern 
> > Signed-off-by: Paul E. McKenney 
> > 
> > diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> > index 169d938c0b53..e5b97c3e8e39 100644
> > --- a/Documentation/memory-barriers.txt
> > +++ b/Documentation/memory-barriers.txt
> > @@ -1888,7 +1888,37 @@ There are some more advanced barrier functions:
> > atomic_dec(&obj->ref_count);
> >  
> >   This makes sure that the death mark on the object is perceived to be set
> > - *before* the reference counter is decremented.
> > + *before* the reference counter is decremented.  However, please note
> > + that smp_mb__before_atomic()'s ordering guarantee does not necessarily
> > + extend beyond the atomic operation.  For example:
> > +
> > +   obj->dead = 1;
> > +   smp_mb__before_atomic();
> > +   atomic_dec(&obj->ref_count);
> > +   r1 = a;
> > +
> > + Here the store to obj->dead is not guaranteed to be ordered with
> > + with the load from a.  This reordering can happen on x86 as follows:
> > + (1) The compiler can reorder the load from a to precede the
> > + atomic_dec(), (2) Because x86 smp_mb__before_atomic() is only a
> > + compiler barrier, the CPU can reorder the preceding store to
> > + obj->dead with the later load from a.
> > +
> > + This could be avoided by using READ_ONCE(), which would prevent the
> > + compiler from reordering due to both atomic_dec() and READ_ONCE()
> > + being volatile accesses, and is usually preferable for loads from
> > + shared variables.  However, weakly ordered CPUs would still be
> > + free to reorder the atomic_dec() with the load from a, so a more
> > + readable option is to also use smp_mb__after_atomic() as follows:
> > +
> > +   WRITE_ONCE(obj->dead, 1);
> > +   smp_mb__before_atomic();
> > +   atomic_dec(&obj->ref_count);
> > +   smp_mb__after_atomic();
> > +   r1 = READ_ONCE(a);
> > +
> > + This orders all three accesses against each other, and also makes
> > + the intent quite clear.

Re: Adding plain accesses and detecting data races in the LKMM

2019-04-18 Thread Andrea Parri
> Are you saying that on x86, atomic_inc() acts as a full memory barrier 
> but not as a compiler barrier, and vice versa for 
> smp_mb__after_atomic()?  Or that neither atomic_inc() nor 
> smp_mb__after_atomic() implements a full memory barrier?

I'd say the former; AFAICT, these boil down to:

  
https://elixir.bootlin.com/linux/v5.1-rc5/source/arch/x86/include/asm/atomic.h#L95
  
https://elixir.bootlin.com/linux/v5.1-rc5/source/arch/x86/include/asm/barrier.h#L84

  Andrea
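
Paraphrasing what those links show (a sketch only: the real definitions use the __-prefixed names and live in the files linked above, and the CONFIG_X86 #ifdef here is purely illustrative):

	#ifdef CONFIG_X86
	/* x86: non-value-returning atomic RMWs already order the CPU, so
	 * only the compiler needs to be constrained. */
	#define smp_mb__before_atomic()	barrier()
	#define smp_mb__after_atomic()	barrier()
	#else
	/* Generic fallback used by weakly ordered architectures such as
	 * Power: a full memory barrier on each side of the atomic. */
	#define smp_mb__before_atomic()	smp_mb()
	#define smp_mb__after_atomic()	smp_mb()
	#endif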


Re: Adding plain accesses and detecting data races in the LKMM

2019-04-18 Thread Alan Stern
On Thu, 18 Apr 2019, Paul E. McKenney wrote:

> > Are you saying that on x86, atomic_inc() acts as a full memory barrier 
> > but not as a compiler barrier, and vice versa for 
> > smp_mb__after_atomic()?  Or that neither atomic_inc() nor 
> > smp_mb__after_atomic() implements a full memory barrier?
> > 
> > Either one seems like a very dangerous situation indeed.
> 
> If I am following the macro-name breadcrumb trails correctly, x86's
> atomic_inc() does have a compiler barrier.  But this is an accident
> of implementation -- from what I can see, it is not required to do so.
> 
> So smp_mb__after_atomic() is only guaranteed to order the atomic_inc()
> before B, not A.  To order A before B in the above example, an
> smp_mb__before_atomic() is also needed.

Are you certain?

> But now that I look, LKMM looks to be stating a stronger guarantee:
> 
>   ([M] ; fencerel(Before-atomic) ; [RMW] ; po? ; [M]) |
>   ([M] ; po? ; [RMW] ; fencerel(After-atomic) ; [M]) |
>   ([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M]) |
>   ([M] ; po ; [UL] ; (co | po) ; [LKW] ;
>   fencerel(After-unlock-lock) ; [M])
> 
> Maybe something like this?
> 
>   ([M] ; fencerel(Before-atomic) ; [RMW] ; fencerel(After-atomic) ; [M]) |
>   ([M] ; fencerel(Before-atomic) ; [RMW]) |
>   ([RMW] ; fencerel(After-atomic) ; [M]) |
>   ([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M]) |
>   ([M] ; po ; [UL] ; (co | po) ; [LKW] ;
>   fencerel(After-unlock-lock) ; [M])

The first line you wrote is redundant; it follows from the second and 
third lines.

Aside from that, while this proposal may be correct and may express
what smp_mb__{before|after}_atomic really are intended to do, it
contradicts Documentation/atomic_t.txt.  That file says:

These barriers provide a full smp_mb().

And of course, a full smp_mb() would order everything before it against 
everything after it.  If we update the model then we should also update 
that file.

In addition, it's noteworthy that smp_mb__after_spinlock and 
smp_mb__after_unlock_lock do not behave in this way.

Alan



Re: Adding plain accesses and detecting data races in the LKMM

2019-04-18 Thread Paul E. McKenney
On Thu, Apr 18, 2019 at 01:44:36PM -0400, Alan Stern wrote:
> On Thu, 18 Apr 2019, Andrea Parri wrote:
> 
> > > Another question is "should the kernel permit smp_mb__{before,after}*()
> > > anywhere other than immediately before or after the primitive being
> > > strengthened?"
> > 
> > Mmh, I do think that keeping these barriers "immediately before or after
> > the primitive being strengthened" is a good practice (readability, and
> > all that), if this is what you're suggesting.
> > 
> > However, a first auditing of the callsites showed that this practice is
> > in fact not always applied, notably... ;-)
> > 
> > kernel/rcu/tree_exp.h:sync_exp_work_done
> > kernel/sched/cpupri.c:cpupri_set
> > 
> > So there appear, at least, to be some exceptions/reasons for not always
> > following it?  Thoughts?
> > 
> > BTW, while auditing these callsites, I've stumbled across the following
> > snippet (from kernel/futex.c):
> > 
> > *futex = newval;
> > sys_futex(WAKE, futex);
> >   futex_wake(futex);
> >   smp_mb(); (B)
> >   if (waiters)
> > ...
> > 
> > where B is actually (c.f., futex_get_mm()):
> > 
> > atomic_inc(...->mm_count);
> > smp_mb__after_atomic();
> > 
> > It seems worth mentioning the fact that, AFAICT, this sequence does not
> > necessarily provide ordering when plain accesses are involved: consider,
> > e.g., the following variant of the snippet:
> > 
> > A:*x = 1;
> > /*
> >  * I've "ignored" the syscall, which should provide
> >  * (at least) a compiler barrier...
> >  */
> > atomic_inc(u);
> > smp_mb__after_atomic();
> > B:r0 = *y;
> > 
> > On x86, AFAICT, the compiler can do this:
> > 
> > atomic_inc(u);
> > A:*x = 1;
> > smp_mb__after_atomic();
> > B:r0 = *y;
> > 
> > (the implementation of atomic_inc() contains no compiler barrier), then
> > the CPU can "reorder" A and B (smp_mb__after_atomic() being #defined to
> > a compiler barrier).
> 
> Are you saying that on x86, atomic_inc() acts as a full memory barrier 
> but not as a compiler barrier, and vice versa for 
> smp_mb__after_atomic()?  Or that neither atomic_inc() nor 
> smp_mb__after_atomic() implements a full memory barrier?
> 
> Either one seems like a very dangerous situation indeed.

If I am following the macro-name breadcrumb trails correctly, x86's
atomic_inc() does have a compiler barrier.  But this is an accident
of implementation -- from what I can see, it is not required to do so.

So smp_mb__after_atomic() is only guaranteed to order the atomic_inc()
before B, not A.  To order A before B in the above example, an
smp_mb__before_atomic() is also needed.
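
For concreteness, a sketch (reusing the labels from Andrea's snippet; illustrative only) of the fully fenced sequence this implies:

	A: *x = 1;
	smp_mb__before_atomic();	/* orders A before the atomic_inc() */
	atomic_inc(u);
	smp_mb__after_atomic();		/* orders the atomic_inc() before B */
	B: r0 = *y;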

But now that I look, LKMM looks to be stating a stronger guarantee:

([M] ; fencerel(Before-atomic) ; [RMW] ; po? ; [M]) |
([M] ; po? ; [RMW] ; fencerel(After-atomic) ; [M]) |
([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M]) |
([M] ; po ; [UL] ; (co | po) ; [LKW] ;
fencerel(After-unlock-lock) ; [M])

Maybe something like this?

([M] ; fencerel(Before-atomic) ; [RMW] ; fencerel(After-atomic) ; [M]) |
([M] ; fencerel(Before-atomic) ; [RMW]) |
([RMW] ; fencerel(After-atomic) ; [M]) |
([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M]) |
([M] ; po ; [UL] ; (co | po) ; [LKW] ;
fencerel(After-unlock-lock) ; [M])

Who is the lead maintainer for this stuff, anyway???  ;-)

Thanx, Paul

> Alan
> 
> > The mips implementation seems also affected by such "reorderings": I am
> > not familiar with this implementation but, AFAICT, it does not enforce
> > ordering from A to B in the following snippet:
> > 
> > A:*x = 1;
> > atomic_inc(u);
> > smp_mb__after_atomic();
> > B:WRITE_ONCE(*y, 1);
> > 
> > when CONFIG_WEAK_ORDERING=y, CONFIG_WEAK_REORDERING_BEYOND_LLSC=n.
> > 
> > Do these observations make sense to you?  Thoughts?
> > 
> >   Andrea
> 



Re: Adding plain accesses and detecting data races in the LKMM

2019-04-18 Thread Alan Stern
On Thu, 18 Apr 2019, Andrea Parri wrote:

> > Another question is "should the kernel permit smp_mb__{before,after}*()
> > anywhere other than immediately before or after the primitive being
> > strengthened?"
> 
> Mmh, I do think that keeping these barriers "immediately before or after
> the primitive being strengthened" is a good practice (readability, and
> all that), if this is what you're suggesting.
> 
> However, a first auditing of the callsites showed that this practice is
> in fact not always applied, notably... ;-)
> 
>   kernel/rcu/tree_exp.h:sync_exp_work_done
>   kernel/sched/cpupri.c:cpupri_set
> 
> So there appear, at least, to be some exceptions/reasons for not always
> following it?  Thoughts?
> 
> BTW, while auditing these callsites, I've stumbled across the following
> snippet (from kernel/futex.c):
> 
>   *futex = newval;
>   sys_futex(WAKE, futex);
>   futex_wake(futex);
>   smp_mb(); (B)
> if (waiters)
>   ...
> 
> where B is actually (c.f., futex_get_mm()):
> 
>   atomic_inc(...->mm_count);
>   smp_mb__after_atomic();
> 
> It seems worth mentioning the fact that, AFAICT, this sequence does not
> necessarily provide ordering when plain accesses are involved: consider,
> e.g., the following variant of the snippet:
> 
>   A:*x = 1;
>   /*
>* I've "ignored" the syscall, which should provide
>* (at least) a compiler barrier...
>*/
>   atomic_inc(u);
>   smp_mb__after_atomic();
>   B:r0 = *y;
> 
> On x86, AFAICT, the compiler can do this:
> 
>   atomic_inc(u);
>   A:*x = 1;
>   smp_mb__after_atomic();
>   B:r0 = *y;
> 
> (the implementation of atomic_inc() contains no compiler barrier), then
> the CPU can "reorder" A and B (smp_mb__after_atomic() being #defined to
> a compiler barrier).

Are you saying that on x86, atomic_inc() acts as a full memory barrier 
but not as a compiler barrier, and vice versa for 
smp_mb__after_atomic()?  Or that neither atomic_inc() nor 
smp_mb__after_atomic() implements a full memory barrier?

Either one seems like a very dangerous situation indeed.

Alan

> The mips implementation seems also affected by such "reorderings": I am
> not familiar with this implementation but, AFAICT, it does not enforce
> ordering from A to B in the following snippet:
> 
>   A:*x = 1;
>   atomic_inc(u);
>   smp_mb__after_atomic();
>   B:WRITE_ONCE(*y, 1);
> 
> when CONFIG_WEAK_ORDERING=y, CONFIG_WEAK_REORDERING_BEYOND_LLSC=n.
> 
> Do these observations make sense to you?  Thoughts?
> 
>   Andrea



Re: Adding plain accesses and detecting data races in the LKMM

2019-04-18 Thread Andrea Parri
> Another question is "should the kernel permit smp_mb__{before,after}*()
> anywhere other than immediately before or after the primitive being
> strengthened?"

Mmh, I do think that keeping these barriers "immediately before or after
the primitive being strengthened" is a good practice (readability, and
all that), if this is what you're suggesting.

However, a first auditing of the callsites showed that this practice is
in fact not always applied, notably... ;-)

kernel/rcu/tree_exp.h:sync_exp_work_done
kernel/sched/cpupri.c:cpupri_set

So there appear, at least, to be some exceptions/reasons for not always
following it?  Thoughts?

BTW, while auditing these callsites, I've stumbled across the following
snippet (from kernel/futex.c):

*futex = newval;
sys_futex(WAKE, futex);
  futex_wake(futex);
  smp_mb(); (B)
  if (waiters)
...

where B is actually (c.f., futex_get_mm()):

atomic_inc(...->mm_count);
smp_mb__after_atomic();

It seems worth mentioning the fact that, AFAICT, this sequence does not
necessarily provide ordering when plain accesses are involved: consider,
e.g., the following variant of the snippet:

A:*x = 1;
/*
 * I've "ignored" the syscall, which should provide
 * (at least) a compiler barrier...
 */
atomic_inc(u);
smp_mb__after_atomic();
B:r0 = *y;

On x86, AFAICT, the compiler can do this:

atomic_inc(u);
A:*x = 1;
smp_mb__after_atomic();
B:r0 = *y;

(the implementation of atomic_inc() contains no compiler barrier), then
the CPU can "reorder" A and B (smp_mb__after_atomic() being #defined to
a compiler barrier).

The mips implementation seems also affected by such "reorderings": I am
not familiar with this implementation but, AFAICT, it does not enforce
ordering from A to B in the following snippet:

A:*x = 1;
atomic_inc(u);
smp_mb__after_atomic();
B:WRITE_ONCE(*y, 1);

when CONFIG_WEAK_ORDERING=y, CONFIG_WEAK_REORDERING_BEYOND_LLSC=n.

Do these observations make sense to you?  Thoughts?

  Andrea


Re: Adding plain accesses and detecting data races in the LKMM

2019-04-15 Thread Paul E. McKenney
On Mon, Apr 15, 2019 at 09:50:03AM -0400, Alan Stern wrote:
> On Mon, 15 Apr 2019, Paul E. McKenney wrote:
> 
> > Another question is "should the kernel permit smp_mb__{before,after}*()
> > anywhere other than immediately before or after the primitive being
> > strengthened?"
> > 
> > Thoughts?
> 
> How would the kernel forbid other constructions?

Coccinelle scripts added to 0day test robot.

Thanx, Paul



Re: Adding plain accesses and detecting data races in the LKMM

2019-04-15 Thread Alan Stern
On Mon, 15 Apr 2019, Paul E. McKenney wrote:

> Another question is "should the kernel permit smp_mb__{before,after}*()
> anywhere other than immediately before or after the primitive being
> strengthened?"
> 
> Thoughts?

How would the kernel forbid other constructions?

Alan



Re: Adding plain accesses and detecting data races in the LKMM

2019-04-15 Thread Paul E. McKenney
On Sat, Apr 13, 2019 at 11:39:38PM +0200, Andrea Parri wrote:
> On Tue, Apr 09, 2019 at 08:01:32AM -0700, Paul E. McKenney wrote:
> > On Tue, Apr 09, 2019 at 03:36:18AM +0200, Andrea Parri wrote:
> > > > > The formula was more along the line of "do not assume either of these
> > > > > cases to hold; use barrier() if you need an unconditional barrier..."
> > > > > AFAICT, all current implementations of smp_mb__{before,after}_atomic()
> > > > > provides a compiler barrier with either barrier() or "memory" clobber.
> > > > 
> > > > Well, we have two reasonable choices: Say that 
> > > > smp_mb__{before,after}_atomic will always provide a compiler barrier, 
> > > > or don't say this.  I see no point in saying that the combination of 
> > > > Before-atomic followed by RMW provides a barrier.
> > > 
> > > ;-/ I'm fine with the first choice. I don't see how the second choice
> > > (this proposal/patch) would be consistent with some documentation and
> > > with the current implementations; for example,
> > > 
> > > 1) Documentation/atomic_t.txt says:
> > > 
> > > Thus:
> > > 
> > >   atomic_fetch_add();
> > > 
> > > is equivalent to:
> > > 
> > >   smp_mb__before_atomic();
> > >   atomic_fetch_add_relaxed();
> > >   smp_mb__after_atomic();
> > > 
> > > [...]
> > > 
> > > 2) Some implementations of the _relaxed() variants do not provide any
> > > compiler barrier currently.
> > 
> > But don't all implementations of smp_mb__before_atomic() and
> > smp_mb__after_atomic() currently supply a compiler barrier?
> 
> Yes, AFAICS, all implementations of smp_mb__{before,after}_atomic() currently
> supply a compiler barrier.
> 
> Nevertheless, there's a difference between:  (1) Specify that these barriers
> supply a compiler barrier,  (2) Specify that (certain) combinations of these
> barriers and RMWs supply a compiler barrier, and  (3) This patch...  ;-)
> 
> FWIW, I'm not aware of current/informal documentation following (the arguably
> simpler but slightly stronger) (1).  But again (amending my last remark): (1)
> and (2) both make sense to me.

Another question is "should the kernel permit smp_mb__{before,after}*()
anywhere other than immediately before or after the primitive being
strengthened?"

Thoughts?

Thanx, Paul



Re: Adding plain accesses and detecting data races in the LKMM

2019-04-13 Thread Andrea Parri
On Tue, Apr 09, 2019 at 08:01:32AM -0700, Paul E. McKenney wrote:
> On Tue, Apr 09, 2019 at 03:36:18AM +0200, Andrea Parri wrote:
> > > > The formula was more along the line of "do not assume either of these
> > > > cases to hold; use barrier() if you need an unconditional barrier..."
> > > > AFAICT, all current implementations of smp_mb__{before,after}_atomic()
> > > > provides a compiler barrier with either barrier() or "memory" clobber.
> > > 
> > > Well, we have two reasonable choices: Say that 
> > > smp_mb__{before,after}_atomic will always provide a compiler barrier, 
> > > or don't say this.  I see no point in saying that the combination of 
> > > Before-atomic followed by RMW provides a barrier.
> > 
> > ;-/ I'm fine with the first choice. I don't see how the second choice
> > (this proposal/patch) would be consistent with some documentation and
> > with the current implementations; for example,
> > 
> > 1) Documentation/atomic_t.txt says:
> > 
> > Thus:
> > 
> >   atomic_fetch_add();
> > 
> > is equivalent to:
> > 
> >   smp_mb__before_atomic();
> >   atomic_fetch_add_relaxed();
> >   smp_mb__after_atomic();
> > 
> > [...]
> > 
> > 2) Some implementations of the _relaxed() variants do not provide any
> > compiler barrier currently.
> 
> But don't all implementations of smp_mb__before_atomic() and
> smp_mb__after_atomic() currently supply a compiler barrier?

Yes, AFAICS, all implementations of smp_mb__{before,after}_atomic() currently
supply a compiler barrier.

Nevertheless, there's a difference between:  (1) Specify that these barriers
supply a compiler barrier,  (2) Specify that (certain) combinations of these
barriers and RMWs supply a compiler barrier, and  (3) This patch...  ;-)

FWIW, I'm not aware of current/informal documentation following (the arguably
simpler but slightly stronger) (1).  But again (amending my last remark): (1)
and (2) both make sense to me.

Thanx,
  Andrea


> 
>   Thanx, Paul
> 


Re: Adding plain accesses and detecting data races in the LKMM

2019-04-09 Thread Paul E. McKenney
On Tue, Apr 09, 2019 at 03:36:18AM +0200, Andrea Parri wrote:
> > > The formula was more along the line of "do not assume either of these
> > > cases to hold; use barrier() if you need an unconditional barrier..."
> > > AFAICT, all current implementations of smp_mb__{before,after}_atomic()
> > > provides a compiler barrier with either barrier() or "memory" clobber.
> > 
> > Well, we have two reasonable choices: Say that 
> > smp_mb__{before,after}_atomic will always provide a compiler barrier, 
> > or don't say this.  I see no point in saying that the combination of 
> > Before-atomic followed by RMW provides a barrier.
> 
> ;-/ I'm fine with the first choice. I don't see how the second choice
> (this proposal/patch) would be consistent with some documentation and
> with the current implementations; for example,
> 
> 1) Documentation/atomic_t.txt says:
> 
> Thus:
> 
>   atomic_fetch_add();
> 
> is equivalent to:
> 
>   smp_mb__before_atomic();
>   atomic_fetch_add_relaxed();
>   smp_mb__after_atomic();
> 
> [...]
> 
> 2) Some implementations of the _relaxed() variants do not provide any
> compiler barrier currently.

But don't all implementations of smp_mb__before_atomic() and
smp_mb__after_atomic() currently supply a compiler barrier?

Thanx, Paul



Re: Adding plain accesses and detecting data races in the LKMM

2019-04-08 Thread Andrea Parri
> > The formula was more along the line of "do not assume either of these
> > cases to hold; use barrier() if you need an unconditional barrier..."
> > AFAICT, all current implementations of smp_mb__{before,after}_atomic()
> > provides a compiler barrier with either barrier() or "memory" clobber.
> 
> Well, we have two reasonable choices: Say that 
> smp_mb__{before,after}_atomic will always provide a compiler barrier, 
> or don't say this.  I see no point in saying that the combination of 
> Before-atomic followed by RMW provides a barrier.

;-/ I'm fine with the first choice. I don't see how the second choice
(this proposal/patch) would be consistent with some documentation and
with the current implementations; for example,

1) Documentation/atomic_t.txt says:

Thus:

  atomic_fetch_add();

is equivalent to:

  smp_mb__before_atomic();
  atomic_fetch_add_relaxed();
  smp_mb__after_atomic();

[...]

2) Some implementations of the _relaxed() variants do not provide any
compiler barrier currently.

  Andrea

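A sketch of the "use barrier() if you need an unconditional barrier" advice quoted above, under the assumption that neither the fence nor a _relaxed atomic is relied on for compiler ordering:

	*x = 1;			/* plain access */
	barrier();		/* unconditional compiler barrier */
	smp_mb__before_atomic();
	atomic_fetch_add_relaxed(1, u);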

Re: Adding plain accesses and detecting data races in the LKMM

2019-04-08 Thread Alan Stern
On Mon, 8 Apr 2019, Andrea Parri wrote:

> > > > > I'd have:
> > > > > 
> > > > >   *x = 1; /* A */
> > > > >   smp_mb__before_atomic();
> > > > >   r0 = xchg_relaxed(x, 2); /* B (read or write part) */
> > > > > 
> > > > >   => (A ->barrier B)
> > > > 
> > > > Perhaps so.  It wasn't clear initially how these should be treated, so
> > > > I just ignored them.  For example, what should happen here if there
> > > > were other statements between the smp_mb__before_atomic and the
> > > > xchg_relaxed?
> > > 
> > > I'd say that the link "A ->barrier B" should still be present and that
> > > no other barrier-links should appear.  More formally, I am suggesting
> > > that Before-atomic induced the following barrier-links:
> > > 
> > >   [M] ; fencerel(Before-atomic) ; [RMW] ; po? ; [M]
> > 
> > Hmmm, maybe.  But exactly where is the barrier introduced?  Are you 
> > saying that on some architectures the barrier is contained in the 
> > smp_mb__before_atomic and on others it is contained in the 
> > xchg_relaxed?
> 
> The formula was more along the line of "do not assume either of these
> > cases to hold; use barrier() if you need an unconditional barrier..."
> AFAICT, all current implementations of smp_mb__{before,after}_atomic()
> provides a compiler barrier with either barrier() or "memory" clobber.

Well, we have two reasonable choices: Say that 
smp_mb__{before,after}_atomic will always provide a compiler barrier, 
or don't say this.  I see no point in saying that the combination of 
Before-atomic followed by RMW provides a barrier.


> > > > > Also, consider the snippets:
> > > > > 
> > > > >   WRITE_ONCE(*x, 1);
> > > > >   *x = 2;
> > > > > 
> > > > > and
> > > > > 
> > > > >   smp_store_release(x, 1);
> > > > >   *x = 2;
> > > > > 
> > > > > The first snippet would be flagged "mixed-accesses" by the patch while
> > > > > the second snippet would not (thanks to the above component); was this
> > > > > intentional?  (Notice that some implementations map the latter to:
> > > > > 
> > > > >   barrier();
> > > > >   WRITE_ONCE(*x, 1);
> > > > >   *x = 2;
> > > > > 
> > > > > )
> > > > 
> > > > That last observation is a good reason for changing the Cat code.
> > > 
> > > You mean in:
> > > 
> > >   po-rel | acq-po
> > > 
> > > ? (together with the fencerel()-ifications of Release and Acquire already
> > > present in the patch).
> > 
> > No, I meant changing the definition of "barrier" to say:
> > 
> > (po ; [Release]) | ([Acquire] ; po)
> > 
> > (for example) instead of the code you quoted above.
> 
> This makes sense to me.

I'll put it into the next version.


> > > > > > To avoid complications, the LKMM will consider only code in which
> > > > > > plain writes are separated by a compiler barrier from any marked
> > > > > > accesses of the same location.  (Question: Can this restriction be
> > > > > > removed?)  As a consequence, if a region contains any marked 
> > > > > > accesses
> > > > > > to a particular location then it cannot contain any plain writes to
> > > > > > that location.
> > > > > 
> > > > > I don't know if it can be removed...  Said this, maybe we should go 
> > > > > back
> > > > > to the (simpler?) question: "What can go wrong without the 
> > > > > restriction?",
> > > > > ;-)  IOW, what are those "complications", can you provide some 
> > > > > examples?
> > > > 
> > > > Here's an example I sent to Will a little while ago.  Suppose we 
> > > > have:
> > > >  
> > > >   r = rcu_dereference(ptr);
> > > >   *r = 1;
> > > >   WRITE_ONCE(x, 2);
> > > > 
> > > > Imagine if the compiler transformed this to:
> > > > 
> > > >   r = rcu_dereference(ptr);
> > > >   WRITE_ONCE(x, 2);
> > > >   if (r != &x)
> > > >   *r = 1;
> > > > 
> > > > This is bad because it destroys the address dependency when ptr
> > > > contains &x.
> > > 
> > > Oh, indeed!  (In fact, we discussed this example before... ;-/)  BTW,
> > > can you also point out an example which does not involve dependencies
> > > (or destruction thereof)?
> > 
> > I believe all the examples did involve dependencies.  However, Paul has
> > provided several use cases showing that one might have good reasons for
> > mixing plain reads with marked accesses -- but he didn't have any use
> > cases for mixing plain writes.
> 
> I see.  BTW, if we converge on adding this new flag (or some variant),
> it might be a good idea to log some of these examples, maybe just the
> snippet above, in a commit message (they could come in handy when we
> are asked "Why am I seeing this flag?" ...)

Okay.

> > There are some open questions remaining.  For instance, given:
> > 
> > r1 = READ_ONCE(*x);
> > r2 = *x;
> > 
> > is the compiler allowed to replace the plain read with the value from 
> > the READ_ONCE?  What if the statements were in the opposite order?  
> > What if the READ_ONCE was changed to WRITE_ONCE?
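
For concreteness, the replacement being asked about would amount to the
compiler emitting only one load, something like:

	r1 = READ_ONCE(*x);
	r2 = r1;		/* plain read of *x replaced by reusing r1 */
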
> > 
> > At the moment, I think it

Re: Adding plain accesses and detecting data races in the LKMM

2019-04-07 Thread Andrea Parri
> > > > I'd have:
> > > > 
> > > > *x = 1; /* A */
> > > > smp_mb__before_atomic();
> > > > r0 = xchg_relaxed(x, 2); /* B (read or write part) */
> > > > 
> > > > => (A ->barrier B)
> > > 
> > > Perhaps so.  It wasn't clear initially how these should be treated, so
> > > I just ignored them.  For example, what should happen here if there
> > > were other statements between the smp_mb__before_atomic and the
> > > xchg_relaxed?
> > 
> > I'd say that the link "A ->barrier B" should still be present and that
> > no other barrier-links should appear.  More formally, I am suggesting
> > that Before-atomic induces the following barrier-links:
> > 
> > [M] ; fencerel(Before-atomic) ; [RMW] ; po? ; [M]
> 
> Hmmm, maybe.  But exactly where is the barrier introduced?  Are you 
> saying that on some architectures the barrier is contained in the 
> smp_mb__before_atomic and on others it is contained in the 
> xchg_relaxed?

The formula was more along the lines of "do not assume either of these
cases to hold; use barrier() if you need an unconditional barrier..."
AFAICT, all current implementations of smp_mb__{before,after}_atomic()
provide a compiler barrier via either barrier() or a "memory" clobber.


> > I was suggesting to consider something like:
> > 
> > let barrier = [...] | mb
> > 
> > but I'm still not sure about those After-unlock-lock and After-spinlock.
> 
> I don't see any point in saying that After-unlock-lock or 
> After-spinlock contains a barrier, because spin_lock and spin_unlock 
> already do.

(They would not "contain a barrier"; they would induce one when paired
with the specified locking primitive/sequence: the distinction matters
because some of the implementations here do not provide a compiler
barrier on their own.  But) I agree with you: these barriers seem
either redundant or unnecessary.


> > > > Also, consider the snippets:
> > > > 
> > > > WRITE_ONCE(*x, 1);
> > > > *x = 2;
> > > > 
> > > > and
> > > > 
> > > > smp_store_release(x, 1);
> > > > *x = 2;
> > > > 
> > > > The first snippet would be flagged "mixed-accesses" by the patch while
> > > > the second snippet would not (thanks to the above component); was this
> > > > intentional?  (Notice that some implementations map the latter to:
> > > > 
> > > > barrier();
> > > > WRITE_ONCE(*x, 1);
> > > > *x = 2;
> > > > 
> > > > )
> > > 
> > > That last observation is a good reason for changing the Cat code.
> > 
> > You mean in:
> > 
> > po-rel | acq-po
> > 
> > ? (together with the fencerel()-ifications of Release and Acquire already
> > present in the patch).
> 
> No, I meant changing the definition of "barrier" to say:
> 
>   (po ; [Release]) | ([Acquire] ; po)
> 
> (for example) instead of the code you quoted above.

This makes sense to me.


> > > > > To avoid complications, the LKMM will consider only code in which
> > > > > plain writes are separated by a compiler barrier from any marked
> > > > > accesses of the same location.  (Question: Can this restriction be
> > > > > removed?)  As a consequence, if a region contains any marked accesses
> > > > > to a particular location then it cannot contain any plain writes to
> > > > > that location.
> > > > 
> > > > I don't know if it can be removed...  That said, maybe we should go back
> > > > to the (simpler?) question: "What can go wrong without the restriction?",
> > > > ;-)  IOW, what are those "complications"?  Can you provide some examples?
> > > 
> > > Here's an example I sent to Will a little while ago.  Suppose we 
> > > have:
> > >  
> > >   r = rcu_dereference(ptr);
> > >   *r = 1;
> > >   WRITE_ONCE(x, 2);
> > > 
> > > Imagine if the compiler transformed this to:
> > > 
> > >   r = rcu_dereference(ptr);
> > >   WRITE_ONCE(x, 2);
> > >   if (r != &x)
> > >   *r = 1;
> > > 
> > > This is bad because it destroys the address dependency when ptr
> > > contains &x.
> > 
> > Oh, indeed!  (In fact, we discussed this example before... ;-/)  BTW,
> > can you also point out an example which does not involve dependencies
> > (or destruction thereof)?
> 
> I believe all the examples did involve dependencies.  However, Paul has
> provided several use cases showing that one might have good reasons for
> mixing plain reads with marked accesses -- but he didn't have any use
> cases for mixing plain writes.

I see.  BTW, if we converge on adding this new flag (or some variant),
it might be a good idea to log some of these examples, maybe just the
snippet above, in a commit message (they could come in handy when we
are asked "Why am I seeing this flag?" ...)


> There are some open questions remaining.  For instance, given:
> 
>   r1 = READ_ONCE(*x);
>   r2 = *x;
> 
> is the compiler allowed to replace the plain read with the value from 
> the READ_ONCE?  What if the statements were in the opposite order?  
> What if the READ_ONCE was changed to WRITE_ONCE?

Re: Adding plain accesses and detecting data races in the LKMM

2019-04-06 Thread Alan Stern
On Sat, 6 Apr 2019, Andrea Parri wrote:

> > > I'd have:
> > > 
> > >   *x = 1; /* A */
> > >   smp_mb__before_atomic();
> > >   r0 = xchg_relaxed(x, 2); /* B (read or write part) */
> > > 
> > >   => (A ->barrier B)
> > 
> > Perhaps so.  It wasn't clear initially how these should be treated, so
> > I just ignored them.  For example, what should happen here if there
> > were other statements between the smp_mb__before_atomic and the
> > xchg_relaxed?
> 
> I'd say that the link "A ->barrier B" should still be present and that
> no other barrier-links should appear.  More formally, I am suggesting
> that Before-atomic induces the following barrier-links:
> 
>   [M] ; fencerel(Before-atomic) ; [RMW] ; po? ; [M]

Hmmm, maybe.  But exactly where is the barrier introduced?  Are you 
saying that on some architectures the barrier is contained in the 
smp_mb__before_atomic and on others it is contained in the 
xchg_relaxed?

> > > smp_mb__after_unlock_lock() seems more tricky, given that it generates
> > > inter-threads links.
> > 
> > The inter-thread part is completely irrelevant as far as compiler 
> > barriers are concerned.
> > 
> > > Intuitively, we would have a "barrier" that is a superset of "mb" (I'm
> > > not saying that this is feasible, but this seems worth considering...)
> > 
> > I don't understand that comment.
> 
> I was suggesting to consider something like:
> 
>   let barrier = [...] | mb
> 
> but I'm still not sure about those After-unlock-lock and After-spinlock.

I don't see any point in saying that After-unlock-lock or 
After-spinlock contains a barrier, because spin_lock and spin_unlock 
already do.
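
For reference, a sketch of the intended usage pattern (s is a
hypothetical lock):

	spin_lock(&s);
	smp_mb__after_unlock_lock();	/* upgrades the prior unlock + this lock to a full barrier */

spin_lock() and spin_unlock() are already compiler barriers in their
own right, so no extra "barrier" edge seems to be needed here for the
plain-access analysis.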


> > > I don't get the motivations for the component:
> > > 
> > >   (po ; [Acquire | Release]) |
> > >   ([Acquire | Release] ; po)
> > > 
> > > in "barrier".  Why did you write it?
> > 
> > I was thinking that in situations like:
> > 
> > A: z = 1;
> > B: smp_store_release(z, 2);
> > C: r = z;
> > 
> > in addition to there being a compiler barrier between A and C, there
> > effectively has to be a barrier between A and B (that is, the object
> > code has to store first 1 and then 2 to z) and between B and C (that
> > is, the object code can't skip doing the load from z and just set r to
> > 2).
> > 
> > The basic principle was the idea that load-acquire and store-release 
> > are special because each of them is both a memory access and a memory 
> > barrier at the same time.  Now, I didn't give this a tremendous amount 
> > of thought and it's quite possible that the Cat code should be changed.
> > For example, maybe there should be a compiler barrier immediately 
> > before a store-release but not immediately after, and vice versa for 
> > load-acquire.
> > 
> > > Also, consider the snippets:
> > > 
> > >   WRITE_ONCE(*x, 1);
> > >   *x = 2;
> > > 
> > > and
> > > 
> > >   smp_store_release(x, 1);
> > >   *x = 2;
> > > 
> > > The first snippet would be flagged "mixed-accesses" by the patch while
> > > the second snippet would not (thanks to the above component); was this
> > > intentional?  (Notice that some implementations map the latter to:
> > > 
> > >   barrier();
> > >   WRITE_ONCE(*x, 1);
> > >   *x = 2;
> > > 
> > > )
> > 
> > That last observation is a good reason for changing the Cat code.
> 
> You mean in:
> 
>   po-rel | acq-po
> 
> ? (together with the fencerel()-ifications of Release and Acquire already
> present in the patch).

No, I meant changing the definition of "barrier" to say:

(po ; [Release]) | ([Acquire] ; po)

(for example) instead of the code you quoted above.
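
If I follow, the practical effect on the earlier pair of snippets would
be (sketch of my reading):

	smp_store_release(x, 1);	/* (po ; [Release]) still orders earlier accesses before this */
	*x = 2;				/* no ([Release] ; po) edge, so this plain write is no longer
					 * separated from the release and would now be flagged, just
					 * like the WRITE_ONCE() variant */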


> > > > To avoid complications, the LKMM will consider only code in which
> > > > plain writes are separated by a compiler barrier from any marked
> > > > accesses of the same location.  (Question: Can this restriction be
> > > > removed?)  As a consequence, if a region contains any marked accesses
> > > > to a particular location then it cannot contain any plain writes to
> > > > that location.
> > > 
> > I don't know if it can be removed...  That said, maybe we should go back
> > to the (simpler?) question: "What can go wrong without the restriction?",
> > ;-)  IOW, what are those "complications"?  Can you provide some examples?
> > 
> > Here's an example I sent to Will a little while ago.  Suppose we 
> > have:
> >  
> >   r = rcu_dereference(ptr);
> >   *r = 1;
> >   WRITE_ONCE(x, 2);
> > 
> > Imagine if the compiler transformed this to:
> > 
> >   r = rcu_dereference(ptr);
> >   WRITE_ONCE(x, 2);
> >   if (r != &x)
> >   *r = 1;
> > 
> > This is bad because it destroys the address dependency when ptr
> > contains &x.
> 
> Oh, indeed!  (In fact, we discussed this example before... ;-/)  BTW,
> can you also point out an example which does not involve dependencies
> (or destruction thereof)?

I believe all the examples did involve dependencies.  However, Paul has
provided several use cases showing that one might have good reasons for
mixing plain reads with marked accesses -- but he didn't have any use
cases for mixing plain writes.

Re: Adding plain accesses and detecting data races in the LKMM

2019-04-05 Thread Andrea Parri
> > I'd have:
> > 
> > *x = 1; /* A */
> > smp_mb__before_atomic();
> > r0 = xchg_relaxed(x, 2); /* B (read or write part) */
> > 
> > => (A ->barrier B)
> 
> Perhaps so.  It wasn't clear initially how these should be treated, so
> I just ignored them.  For example, what should happen here if there
> were other statements between the smp_mb__before_atomic and the
> xchg_relaxed?

I'd say that the link "A ->barrier B" should still be present and that
no other barrier-links should appear.  More formally, I am suggesting
that Before-atomic induces the following barrier-links:

[M] ; fencerel(Before-atomic) ; [RMW] ; po? ; [M]
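
To illustrate with a made-up extension of the example above:

	*x = 1;				/* A */
	smp_mb__before_atomic();
	*y = 2;				/* C: statement between the fence and the RMW */
	r0 = xchg_relaxed(x, 2);	/* B */
	r1 = *z;			/* D */

If I expand the formula correctly, this gives A ->barrier B and, through
the trailing "po? ; [M]", A ->barrier D, but no barrier links involving C.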


> > smp_mb__after_unlock_lock() seems more tricky, given that it generates
> > inter-threads links.
> 
> The inter-thread part is completely irrelevant as far as compiler 
> barriers are concerned.
> 
> > Intuitively, we would have a "barrier" that is a superset of "mb" (I'm
> > not saying that this is feasible, but this seems worth considering...)
> 
> I don't understand that comment.

I was suggesting to consider something like:

let barrier = [...] | mb

but I'm still not sure about those After-unlock-lock and After-spinlock.


> > I don't get the motivations for the component:
> > 
> > (po ; [Acquire | Release]) |
> > ([Acquire | Release] ; po)
> > 
> > in "barrier".  Why did you write it?
> 
> I was thinking that in situations like:
> 
>   A: z = 1;
>   B: smp_store_release(z, 2);
>   C: r = z;
> 
> in addition to there being a compiler barrier between A and C, there
> effectively has to be a barrier between A and B (that is, the object
> code has to store first 1 and then 2 to z) and between B and C (that
> is, the object code can't skip doing the load from z and just set r to
> 2).
> 
> The basic principle was the idea that load-acquire and store-release 
> are special because each of them is both a memory access and a memory 
> barrier at the same time.  Now, I didn't give this a tremendous amount 
> of thought and it's quite possible that the Cat code should be changed.
> For example, maybe there should be a compiler barrier immediately 
> before a store-release but not immediately after, and vice versa for 
> load-acquire.
> 
> > Also, consider the snippets:
> > 
> > WRITE_ONCE(*x, 1);
> > *x = 2;
> > 
> > and
> > 
> > smp_store_release(x, 1);
> > *x = 2;
> > 
> > The first snippet would be flagged "mixed-accesses" by the patch while
> > the second snippet would not (thanks to the above component); was this
> > intentional?  (Notice that some implementations map the latter to:
> > 
> > barrier();
> > WRITE_ONCE(*x, 1);
> > *x = 2;
> > 
> > )
> 
> That last observation is a good reason for changing the Cat code.

You mean in:

po-rel | acq-po

? (together with the fencerel()-ifications of Release and Acquire already
present in the patch).


> > > To avoid complications, the LKMM will consider only code in which
> > > plain writes are separated by a compiler barrier from any marked
> > > accesses of the same location.  (Question: Can this restriction be
> > > removed?)  As a consequence, if a region contains any marked accesses
> > > to a particular location then it cannot contain any plain writes to
> > > that location.
> > 
> > I don't know if it can be removed...  That said, maybe we should go back
> > to the (simpler?) question: "What can go wrong without the restriction?",
> > ;-)  IOW, what are those "complications"?  Can you provide some examples?
> 
> Here's an example I sent to Will a little while ago.  Suppose we 
> have:
>  
>   r = rcu_dereference(ptr);
>   *r = 1;
>   WRITE_ONCE(x, 2);
> 
> Imagine if the compiler transformed this to:
> 
>   r = rcu_dereference(ptr);
>   WRITE_ONCE(x, 2);
>   if (r != &x)
>   *r = 1;
> 
> This is bad because it destroys the address dependency when ptr
> contains &x.

Oh, indeed!  (In fact, we discussed this example before... ;-/)  BTW,
can you also point out an example which does not involve dependencies
(or destruction thereof)?


> > > Given an object-level execution, let A1 and B1 be accesses in region
> > > R1 and let A2 and B2 be accesses in region R2, all of the same
> > > location, not all reads, and not all corresponding to marked accesses.
> > > (We explicitly allow A1 and B1 to be the same access, and the same for
> > > A2 and B2.)  The execution has a data race if the coherence ordering
> > > from A1 to A2 is opposite to the ordering from B1 to B2.
> > > 
> > > Definition: Two accesses of the same location "conflict" if they are
> > > in different threads, they are not both marked, and they are not both
> > > reads.
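
(A concrete instance, with names made up:

	/* Thread 0 */			/* Thread 1 */
	*x = 1;				r1 = READ_ONCE(*x);

the two accesses are in different threads, they are not both marked,
and they are not both reads, so they conflict.)
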
> > > 
> > > The notions "executes before" and "is visible to" have already been
> > > defined for marked accesses in the LKMM or in earlier proposals for
> > > handling plain accesses (the xb -- or more properly, xb* -- and vis
> > > relations).  I trust the ideas are familiar to everyone.

Re: Adding plain accesses and detecting data races in the LKMM

2019-04-02 Thread Alan Stern
On Tue, 2 Apr 2019, Andrea Parri wrote:

> On Tue, Mar 19, 2019 at 03:38:19PM -0400, Alan Stern wrote:
> > LKMM maintainers and other interested parties:
> > 
> > Here is my latest proposal for extending the Linux Kernel Memory Model 
> > to handle plain accesses and data races.  The document below explains 
> > the rationale behind the proposal, and a patch (based on the dev 
> > branch of Paul's linux-rcu tree) is attached.
> 
> Thank you for this!  Unfortunately, I haven't had enough time to do it
> justice yet...  Below are some first thoughts/questions.

Thanks for the feedback.

> > This extension is not a complete treatment of plain accesses, because 
> > it pretty much ignores any ordering that plain accesses can provide.  
> > For example, given:
> > 
> > r0 = rcu_dereference(ptr);
> > r1 = r0->branch;
> > r2 = READ_ONCE(&r1->data);
> > 
> > the memory model will not realize that the READ_ONCE must execute after
> > the rcu_dereference, because it doesn't take into account the address
> > dependency from the intermediate plain read.  Hopefully we will add 
> > such things to the memory model later on.  Concentrating on data races 
> > seems like enough for now.
> > 
> > Some of the ideas below were originally proposed for the LKMM by Andrea
> > Parri.
> > 
> > Alan
> > 
> > ---
> > 
> > A Plan For Modeling Plain Accesses In The Linux Kernel Memory Model
> > 
> > Definition: A "plain" access is one which is not "marked".  Marked
> > accesses include all those annotated with READ_ONCE or WRITE_ONCE,
> > plus acquire and release accesses, Read-Modify-Write (RMW) atomic
> > and bitop accesses, and locking accesses.
> > 
> > Plain accesses are those the compiler is free to mess around with.
> > Marked accesses are implemented in the kernel by means of inline
> > assembly or as volatile accesses, which greatly restricts what the
> > compiler can do.
> > 
> > Definition: A "region" of code is a program-order source segment
> > contained in a single thread that is maximal with respect to not
> > containing any compiler barriers (i.e., it is bounded at each end by
> > either a compiler barrier or the start or end of its thread).
> > 
> > Assumption: All memory barriers are also compiler barriers.  This
> > includes acquire loads and release stores, which are not considered to
> > be compiler barriers in the C++11 Standard, as well as non-relaxed RMW
> > atomic operations.  (Question: should this include situation-specific
> > memory barriers such as smp_mb__after_unlock_lock?  Probably not.)
> 
> I'd have:
> 
>   *x = 1; /* A */
>   smp_mb__before_atomic();
>   r0 = xchg_relaxed(x, 2); /* B (read or write part) */
> 
>   => (A ->barrier B)

Perhaps so.  It wasn't clear initially how these should be treated, so
I just ignored them.  For example, what should happen here if there
were other statements between the smp_mb__before_atomic and the
xchg_relaxed?

> smp_mb__after_unlock_lock() seems more tricky, given that it generates
> inter-threads links.

The inter-thread part is completely irrelevant as far as compiler 
barriers are concerned.

> Intuitively, we would have a "barrier" that is a superset of "mb" (I'm
> not saying that this is feasible, but this seems worth considering...)

I don't understand that comment.

> ---
> I don't get the motivations for the component:
> 
>   (po ; [Acquire | Release]) |
>   ([Acquire | Release] ; po)
> 
> in "barrier".  Why did you write it?

I was thinking that in situations like:

A: z = 1;
B: smp_store_release(z, 2);
C: r = z;

in addition to there being a compiler barrier between A and C, there
effectively has to be a barrier between A and B (that is, the object
code has to store first 1 and then 2 to z) and between B and C (that
is, the object code can't skip doing the load from z and just set r to
2).

The basic principle was the idea that load-acquire and store-release 
are special because each of them is both a memory access and a memory 
barrier at the same time.  Now, I didn't give this a tremendous amount 
of thought and it's quite possible that the Cat code should be changed.
For example, maybe there should be a compiler barrier immediately 
before a store-release but not immediately after, and vice versa for 
load-acquire.
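
To spell out the transformations those edges are meant to forbid (my
sketch, with z taken to be a pointer to int):

	*z = 1;				/* A */
	smp_store_release(z, 2);	/* B */
	r = *z;				/* C */

A ->barrier B forbids eliminating the store of 1 (the object code must
store first 1 and then 2), and B ->barrier C forbids eliding the load
in C and simply setting r to 2.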

> Also, consider the snippets:
> 
>   WRITE_ONCE(*x, 1);
>   *x = 2;
> 
> and
> 
>   smp_store_release(x, 1);
>   *x = 2;
> 
> The first snippet would be flagged "mixed-accesses" by the patch while
> the second snippet would not (thanks to the above component); was this
> intentional?  (Notice that some implementations map the latter to:
> 
>   barrier();
>   WRITE_ONCE(*x, 1);
>   *x = 2;
> 
> )

That last observation is a good reason for changing the Cat code.

> > To avoid complications, the LKMM will consider only code in which
> > plain writes are separated by a compiler barrier from any marked
> > accesses of the same location.

Re: Adding plain accesses and detecting data races in the LKMM

2019-04-02 Thread Andrea Parri
On Tue, Mar 19, 2019 at 03:38:19PM -0400, Alan Stern wrote:
> LKMM maintainers and other interested parties:
> 
> Here is my latest proposal for extending the Linux Kernel Memory Model 
> to handle plain accesses and data races.  The document below explains 
> the rationale behind the proposal, and a patch (based on the dev 
> branch of Paul's linux-rcu tree) is attached.

Thank you for this!  Unfortunately, I haven't had enough time to do it
justice yet...  Below are some first thoughts/questions.


> 
> This extension is not a complete treatment of plain accesses, because 
> it pretty much ignores any ordering that plain accesses can provide.  
> For example, given:
> 
>   r0 = rcu_dereference(ptr);
>   r1 = r0->branch;
>   r2 = READ_ONCE(&r1->data);
> 
> the memory model will not realize that the READ_ONCE must execute after
> the rcu_dereference, because it doesn't take into account the address
> dependency from the intermediate plain read.  Hopefully we will add 
> such things to the memory model later on.  Concentrating on data races 
> seems like enough for now.
> 
> Some of the ideas below were originally proposed for the LKMM by Andrea
> Parri.
> 
> Alan
> 
> ---
> 
> A Plan For Modeling Plain Accesses In The Linux Kernel Memory Model
> 
> Definition: A "plain" access is one which is not "marked".  Marked
> accesses include all those annotated with READ_ONCE or WRITE_ONCE,
> plus acquire and release accesses, Read-Modify-Write (RMW) atomic
> and bitop accesses, and locking accesses.
> 
> Plain accesses are those the compiler is free to mess around with.
> Marked accesses are implemented in the kernel by means of inline
> assembly or as volatile accesses, which greatly restricts what the
> compiler can do.
> 
> Definition: A "region" of code is a program-order source segment
> contained in a single thread that is maximal with respect to not
> containing any compiler barriers (i.e., it is bounded at each end by
> either a compiler barrier or the start or end of its thread).
> 
> Assumption: All memory barriers are also compiler barriers.  This
> includes acquire loads and release stores, which are not considered to
> be compiler barriers in the C++11 Standard, as well as non-relaxed RMW
> atomic operations.  (Question: should this include situation-specific
> memory barriers such as smp_mb__after_unlock_lock?  Probably not.)

I'd have:

*x = 1; /* A */
smp_mb__before_atomic();
r0 = xchg_relaxed(x, 2); /* B (read or write part) */

=> (A ->barrier B)

smp_mb__after_unlock_lock() seems more tricky, given that it generates
inter-threads links.

Intuitively, we would have a "barrier" that is a superset of "mb" (I'm
not saying that this is feasible, but this seems worth considering...)

---
I don't get the motivations for the component:

(po ; [Acquire | Release]) |
([Acquire | Release] ; po)

in "barrier".  Why did you write it?

Also, consider the snippets:

WRITE_ONCE(*x, 1);
*x = 2;

and

smp_store_release(x, 1);
*x = 2;

The first snippet would be flagged "mixed-accesses" by the patch while
the second snippet would not (thanks to the above component); was this
intentional?  (Notice that some implementations map the latter to:

barrier();
WRITE_ONCE(*x, 1);
*x = 2;

)


> 
> To avoid complications, the LKMM will consider only code in which
> plain writes are separated by a compiler barrier from any marked
> accesses of the same location.  (Question: Can this restriction be
> removed?)  As a consequence, if a region contains any marked accesses
> to a particular location then it cannot contain any plain writes to
> that location.

I don't know if it can be removed...  That said, maybe we should go back
to the (simpler?) question: "What can go wrong without the restriction?",
;-)  IOW, what are those "complications"?  Can you provide some examples?


> 
> This analysis is guided by the intuition that one can set up a limited
> correspondence between an execution of the source code and an
> execution of the generated object code.  Such a correspondence is
> based on some limitations on what compilers can do when translating a
> program:
> 
>   Each marked access in the source-level execution must
>   correspond to a unique instruction in the object-level
>   execution.  The corresponding events must access the same
>   location, have the same values, and be of the same type (read
>   or write).
> 
>   If a marked access is address-dependent on a marked read then
>   the corresponding object-level instructions must have the same
>   dependency (i.e., the compiler does not break address
>   dependencies for marked accesses).

We know that this "limitation" does not hold in this generality.  Your
proposal goes even further than that (by adding address dep. from marked
reads to marked acc