Re: rcu_read_lock lost its compiler barrier
On Sat, Jun 08, 2019 at 11:56:04AM -0400, Alan Stern wrote:
> On Sat, 8 Jun 2019, Paul E. McKenney wrote:
>
> > On Thu, Jun 06, 2019 at 10:19:43AM -0400, Alan Stern wrote:
> > > On Thu, 6 Jun 2019, Andrea Parri wrote:
> > >
> > > > This seems a sensible change to me: looking forward to seeing a patch,
> > > > on top of -rcu/dev, for further review and testing!
> > > >
> > > > We could also add (to LKMM) the barrier() for rcu_read_{lock,unlock}()
> > > > discussed in this thread (maybe once the RCU code and the informal doc
> > > > will have settled in such direction).
> > >
> > > Yes.  Also for SRCU.  That point had not escaped me.
> >
> > And it does seem pretty settled.  There are quite a few examples where
> > there are normal accesses at either end of the RCU read-side critical
> > sections, for example, the one in the requirements diffs below.
> >
> > For SRCU, srcu_read_lock() and srcu_read_unlock() have implied compiler
> > barriers since 2006.  ;-)
> >
> > 							Thanx, Paul
> >
> > diff --git a/Documentation/RCU/Design/Requirements/Requirements.html b/Documentation/RCU/Design/Requirements/Requirements.html
> > index 5a9238a2883c..080b39cc1dbb 100644
> > --- a/Documentation/RCU/Design/Requirements/Requirements.html
> > +++ b/Documentation/RCU/Design/Requirements/Requirements.html
> > @@ -2129,6 +2129,8 @@ Some of the relevant points of interest are as follows:
> > Hotplug CPU.
> > Scheduler and RCU.
> > Tracing and RCU.
> > +
>                        ^
> > +Accesses to User Mamory and RCU.
>                       -^
> > Energy Efficiency.
> >
> > Scheduling-Clock Interrupts and RCU.
> > @@ -2521,6 +2523,75 @@ cannot be used.
> > The tracing folks both located the requirement and provided the
> > needed fix, so this surprise requirement was relatively painless.
> >
> > +
>   --^
> > +Accesses to User Mamory and RCU
>                       -^
>
> Are these issues especially notable for female programmers?  :-)

*Red face*  Some days it just doesn't pay to get up in the morning...
Well, those issues certainly seem a bit inconsiderate to non-mammalian
programmers.  :-/

How about the updated version shown below?

							Thanx, Paul

diff --git a/Documentation/RCU/Design/Requirements/Requirements.html b/Documentation/RCU/Design/Requirements/Requirements.html
index 5a9238a2883c..f04c467e55c5 100644
--- a/Documentation/RCU/Design/Requirements/Requirements.html
+++ b/Documentation/RCU/Design/Requirements/Requirements.html
@@ -2129,6 +2129,8 @@ Some of the relevant points of interest are as follows:
 Hotplug CPU.
 Scheduler and RCU.
 Tracing and RCU.
+
+Accesses to User Memory and RCU.
 Energy Efficiency.

 Scheduling-Clock Interrupts and RCU.
@@ -2521,6 +2523,75 @@ cannot be used.
 The tracing folks both located the requirement and provided the
 needed fix, so this surprise requirement was relatively painless.

+
+Accesses to User Memory and RCU
+
+The kernel needs to access user-space memory, for example, to access
+data referenced by system-call parameters.
+The get_user() macro does this job.
+
+However, user-space memory might well be paged out, which means
+that get_user() might well page-fault and thus block while
+waiting for the resulting I/O to complete.
+It would be a very bad thing for the compiler to reorder
+a get_user() invocation into an RCU read-side critical
+section.
+For example, suppose that the source code looked like this:
+
+  1 rcu_read_lock();
+  2 p = rcu_dereference(gp);
+  3 v = p->value;
+  4 rcu_read_unlock();
+  5 get_user(user_v, user_p);
+  6 do_something_with(v, user_v);
+
+The compiler must not be permitted to transform this source code into
+the following:
+
+  1 rcu_read_lock();
+  2 p = rcu_dereference(gp);
+  3 get_user(user_v, user_p); // BUG: POSSIBLE PAGE FAULT!!!
+  4 v = p->value;
+  5 rcu_read_unlock();
+  6 do_something_with(v, user_v);
+
+If the compiler did make this transformation in a
+CONFIG_PREEMPT=n kernel build, and if get_user() did
+page fault, the result would be a quiescent state in the middle
+of an RCU read-side critical section.
+This misplaced quiescent state could result in line 4 being
+a use-after-free access, which could be bad for your kernel's
+actuarial statistics.
+Similar examples can be constructed with the call to get_user()
+preceding the rcu_read_lock().
+
+Unfortunately, get_user() doesn't have any particular
+ordering properties, and in some architectures the underlying asm
+isn't even marked volatile.
+And even if it was marked volatile, the above access to
+p->value is not volatile, so the compiler would not have any
+reason to keep those two accesses in order.
Re: rcu_read_lock lost its compiler barrier
On Sat, 8 Jun 2019, Paul E. McKenney wrote:

> On Thu, Jun 06, 2019 at 10:19:43AM -0400, Alan Stern wrote:
> > On Thu, 6 Jun 2019, Andrea Parri wrote:
> >
> > > This seems a sensible change to me: looking forward to seeing a patch,
> > > on top of -rcu/dev, for further review and testing!
> > >
> > > We could also add (to LKMM) the barrier() for rcu_read_{lock,unlock}()
> > > discussed in this thread (maybe once the RCU code and the informal doc
> > > will have settled in such direction).
> >
> > Yes.  Also for SRCU.  That point had not escaped me.
>
> And it does seem pretty settled.  There are quite a few examples where
> there are normal accesses at either end of the RCU read-side critical
> sections, for example, the one in the requirements diffs below.
>
> For SRCU, srcu_read_lock() and srcu_read_unlock() have implied compiler
> barriers since 2006.  ;-)
>
> 							Thanx, Paul
>
> diff --git a/Documentation/RCU/Design/Requirements/Requirements.html b/Documentation/RCU/Design/Requirements/Requirements.html
> index 5a9238a2883c..080b39cc1dbb 100644
> --- a/Documentation/RCU/Design/Requirements/Requirements.html
> +++ b/Documentation/RCU/Design/Requirements/Requirements.html
> @@ -2129,6 +2129,8 @@ Some of the relevant points of interest are as follows:
> Hotplug CPU.
> Scheduler and RCU.
> Tracing and RCU.
> +
                       ^
> +Accesses to User Mamory and RCU.
                      -^
> Energy Efficiency.
>
> Scheduling-Clock Interrupts and RCU.
> @@ -2521,6 +2523,75 @@ cannot be used.
> The tracing folks both located the requirement and provided the
> needed fix, so this surprise requirement was relatively painless.
>
> +
  --^
> +Accesses to User Mamory and RCU
                      -^

Are these issues especially notable for female programmers?  :-)

Alan
Re: rcu_read_lock lost its compiler barrier
On Thu, Jun 06, 2019 at 10:19:43AM -0400, Alan Stern wrote:
> On Thu, 6 Jun 2019, Andrea Parri wrote:
>
> > This seems a sensible change to me: looking forward to seeing a patch,
> > on top of -rcu/dev, for further review and testing!
> >
> > We could also add (to LKMM) the barrier() for rcu_read_{lock,unlock}()
> > discussed in this thread (maybe once the RCU code and the informal doc
> > will have settled in such direction).
>
> Yes.  Also for SRCU.  That point had not escaped me.

And it does seem pretty settled.  There are quite a few examples where
there are normal accesses at either end of the RCU read-side critical
sections, for example, the one in the requirements diffs below.

For SRCU, srcu_read_lock() and srcu_read_unlock() have implied compiler
barriers since 2006.  ;-)

							Thanx, Paul

diff --git a/Documentation/RCU/Design/Requirements/Requirements.html b/Documentation/RCU/Design/Requirements/Requirements.html
index 5a9238a2883c..080b39cc1dbb 100644
--- a/Documentation/RCU/Design/Requirements/Requirements.html
+++ b/Documentation/RCU/Design/Requirements/Requirements.html
@@ -2129,6 +2129,8 @@ Some of the relevant points of interest are as follows:
 Hotplug CPU.
 Scheduler and RCU.
 Tracing and RCU.
+
+Accesses to User Mamory and RCU.
 Energy Efficiency.

 Scheduling-Clock Interrupts and RCU.
@@ -2521,6 +2523,75 @@ cannot be used.
 The tracing folks both located the requirement and provided the
 needed fix, so this surprise requirement was relatively painless.

+
+Accesses to User Mamory and RCU
+
+The kernel needs to access user-space memory, for example, to access
+data referenced by system-call parameters.
+The get_user() macro does this job.
+
+However, user-space memory might well be paged out, which means
+that get_user() might well page-fault and thus block while
+waiting for the resulting I/O to complete.
+It would be a very bad thing for the compiler to reorder
+a get_user() invocation into an RCU read-side critical
+section.
+For example, suppose that the source code looked like this:
+
+  1 rcu_read_lock();
+  2 p = rcu_dereference(gp);
+  3 v = p->value;
+  4 rcu_read_unlock();
+  5 get_user(user_v, user_p);
+  6 do_something_with(v, user_v);
+
+The compiler must not be permitted to transform this source code into
+the following:
+
+  1 rcu_read_lock();
+  2 p = rcu_dereference(gp);
+  3 get_user(user_v, user_p); // BUG: POSSIBLE PAGE FAULT!!!
+  4 v = p->value;
+  5 rcu_read_unlock();
+  6 do_something_with(v, user_v);
+
+If the compiler did make this transformation in a
+CONFIG_PREEMPT=n kernel build, and if get_user() did
+page fault, the result would be a quiescent state in the middle
+of an RCU read-side critical section.
+This misplaced quiescent state could result in line 4 being
+a use-after-free access, which could be bad for your kernel's
+actuarial statistics.
+Similar examples can be constructed with the call to get_user()
+preceding the rcu_read_lock().
+
+Unfortunately, get_user() doesn't have any particular
+ordering properties, and in some architectures the underlying asm
+isn't even marked volatile.
+And even if it was marked volatile, the above access to
+p->value is not volatile, so the compiler would not have any
+reason to keep those two accesses in order.
+
+Therefore, the Linux-kernel definitions of rcu_read_lock()
+and rcu_read_unlock() must act as compiler barriers,
+at least for outermost instances of rcu_read_lock() and
+rcu_read_unlock() within a nested set of RCU read-side critical
+sections.
+
 Energy Efficiency
Re: rcu_read_lock lost its compiler barrier
On Thu, 6 Jun 2019, Andrea Parri wrote:

> This seems a sensible change to me: looking forward to seeing a patch,
> on top of -rcu/dev, for further review and testing!
>
> We could also add (to LKMM) the barrier() for rcu_read_{lock,unlock}()
> discussed in this thread (maybe once the RCU code and the informal doc
> will have settled in such direction).

Yes.  Also for SRCU.  That point had not escaped me.

Alan
Re: rcu_read_lock lost its compiler barrier
On Thu, Jun 06, 2019 at 09:38:24PM +0800, Herbert Xu wrote:
> On Thu, Jun 06, 2019 at 03:58:17AM -0700, Paul E. McKenney wrote:
> >
> > I cannot immediately think of a way that the compiler could get this
> > wrong even in theory, but similar code sequences can be messed up.
> > The reason for this is that in theory, the compiler could use the
> > stored-to location as temporary storage, like this:
> >
> > 	a = whatever;  // Compiler uses "a" as a temporary
> > 	do_something();
> > 	whatever = a;
> > 	a = 1;         // Intended store
>
> Well if the compiler is going to do this then surely it would
> continue to do this even if you used WRITE_ONCE.  Remember a is
> not volatile, only the access of a through WRITE_ONCE is volatile.

I disagree.  Given a volatile store, the compiler cannot assume that
the stored-to location is normal memory at that point in time, and
therefore cannot assume that it is safe to invent a store to that
location (as shown above).

Thus far, the C++ standards committee seems on-board with this, though
time will tell.

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1382r1.pdf

							Thanx, Paul
Re: rcu_read_lock lost its compiler barrier
On Thu, Jun 06, 2019 at 03:58:17AM -0700, Paul E. McKenney wrote:
>
> I cannot immediately think of a way that the compiler could get this
> wrong even in theory, but similar code sequences can be messed up.
> The reason for this is that in theory, the compiler could use the
> stored-to location as temporary storage, like this:
>
> 	a = whatever;  // Compiler uses "a" as a temporary
> 	do_something();
> 	whatever = a;
> 	a = 1;         // Intended store

Well if the compiler is going to do this then surely it would
continue to do this even if you used WRITE_ONCE.  Remember a is
not volatile, only the access of a through WRITE_ONCE is volatile.

Cheers,
-- 
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: rcu_read_lock lost its compiler barrier
On Thu, Jun 06, 2019 at 05:28:55PM +0800, Herbert Xu wrote:
> On Thu, Jun 06, 2019 at 02:06:19AM -0700, Paul E. McKenney wrote:
> >
> > Or is your point instead that given the initial value of "a" being
> > zero and the value stored to "a" being one, there is no way that
> > any possible load and store tearing (your slicing and dicing) could
> > possibly mess up the test of the value loaded from "a"?
>
> Exactly.  If you can dream up of a scenario where the compiler can
> get this wrong I'm all ears.

I believe that this is safe in practice, as long as you exercise
constant vigilance.  (OK, OK, I might be overdramatizing...)

I cannot immediately think of a way that the compiler could get this
wrong even in theory, but similar code sequences can be messed up.
The reason for this is that in theory, the compiler could use the
stored-to location as temporary storage, like this:

	a = whatever;  // Compiler uses "a" as a temporary
	do_something();
	whatever = a;
	a = 1;         // Intended store

The compiler is allowed to do this (again, in theory and according to
a strict interpretation of the standard) because you haven't told it
that anything else is paying attention to variable "a".  As a result,
the compiler is within its rights to use "a" as temporary storage
immediately prior to any plain C-language store to "a".

In practice, I don't know of any compilers that actually do this, nor
have I heard anyone suggesting that they might soon actually do this.
And even if they could, your example would still work because your
example doesn't care about anything other than zero and non-zero, so
wouldn't get confused by the compiler storing a temporary value of 42
or whatever.

> > > But I do concede that in the general RCU case you must have the
> > > READ_ONCE/WRITE_ONCE calls for rcu_dereference/rcu_assign_pointer.
> >
> > OK, good that we are in agreement on this part, at least!  ;-)
>
> Well only because we're allowing crazy compilers that can turn
> a simple word-aligned word assignment (a = b) into two stores.

In my experience, the insanity of compilers increases with time, but
yes.

							Thanx, Paul
Re: rcu_read_lock lost its compiler barrier
On Thu, Jun 06, 2019 at 10:38:56AM +0200, Andrea Parri wrote:
> On Mon, Jun 03, 2019 at 10:46:40AM +0800, Herbert Xu wrote:
>
> > The case we were discussing is from net/ipv4/inet_fragment.c from
> > the net-next tree:
>
> BTW, thank you for keeping me and other people who intervened in that
> discussion in Cc:...

FWIW I didn't drop you from the Cc list.  The email discussion was
taken off-list by someone else and I simply kept that Cc list when I
brought it back onto lkml.

On a second look I did end up dropping Eric but I think he's had
enough of this discussion :)

-- 
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: rcu_read_lock lost its compiler barrier
On Thu, Jun 06, 2019 at 02:06:19AM -0700, Paul E. McKenney wrote:
>
> Or is your point instead that given the initial value of "a" being
> zero and the value stored to "a" being one, there is no way that
> any possible load and store tearing (your slicing and dicing) could
> possibly mess up the test of the value loaded from "a"?

Exactly.  If you can dream up of a scenario where the compiler can
get this wrong I'm all ears.

> > But I do concede that in the general RCU case you must have the
> > READ_ONCE/WRITE_ONCE calls for rcu_dereference/rcu_assign_pointer.
>
> OK, good that we are in agreement on this part, at least!  ;-)

Well only because we're allowing crazy compilers that can turn
a simple word-aligned word assignment (a = b) into two stores.

Cheers,
-- 
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: rcu_read_lock lost its compiler barrier
On Thu, Jun 06, 2019 at 02:14:38PM +0800, Herbert Xu wrote:
> On Wed, Jun 05, 2019 at 11:05:11PM -0700, Paul E. McKenney wrote:
> >
> > In case you were wondering, the reason that I was giving you such
> > a hard time was that from what I could see, you were pushing for no
> > {READ,WRITE}_ONCE() at all.  ;-)
>
> Hmm, that's exactly what it should be in net/ipv4/inet_fragment.c.
> We don't need the READ_ONCE/WRITE_ONCE (or volatile marking) at
> all.  Even if the compiler dices and slices the reads/writes of
> "a" into a thousand pieces, it should still work if the RCU
> primitives are worth their salt.

OK, so I take it that there is additional synchronization in there
somewhere that is not captured in your simplified example code?

Or is your point instead that given the initial value of "a" being
zero and the value stored to "a" being one, there is no way that
any possible load and store tearing (your slicing and dicing) could
possibly mess up the test of the value loaded from "a"?

> But I do concede that in the general RCU case you must have the
> READ_ONCE/WRITE_ONCE calls for rcu_dereference/rcu_assign_pointer.

OK, good that we are in agreement on this part, at least!  ;-)

							Thanx, Paul
Re: rcu_read_lock lost its compiler barrier
On Mon, Jun 03, 2019 at 10:46:40AM +0800, Herbert Xu wrote:

> The case we were discussing is from net/ipv4/inet_fragment.c from
> the net-next tree:

BTW, thank you for keeping me and other people who intervened in that
discussion in Cc:...

  Andrea

> void fqdir_exit(struct fqdir *fqdir)
> {
> 	...
> 	fqdir->dead = true;
>
> 	/* call_rcu is supposed to provide memory barrier semantics,
> 	 * separating the setting of fqdir->dead with the destruction
> 	 * work.  This implicit barrier is paired with inet_frag_kill().
> 	 */
> 	INIT_RCU_WORK(&fqdir->destroy_rwork, fqdir_rwork_fn);
> 	queue_rcu_work(system_wq, &fqdir->destroy_rwork);
> }
>
> and
>
> void inet_frag_kill(struct inet_frag_queue *fq)
> {
> 	...
> 	rcu_read_lock();
> 	/* The RCU read lock provides a memory barrier
> 	 * guaranteeing that if fqdir->dead is false then
> 	 * the hash table destruction will not start until
> 	 * after we unlock.  Paired with inet_frags_exit_net().
> 	 */
> 	if (!fqdir->dead) {
> 		rhashtable_remove_fast(&fqdir->rhashtable, &fq->node,
> 				       fqdir->f->rhash_params);
> 		...
> 	}
> 	...
> 	rcu_read_unlock();
> 	...
> }
>
> I simplified this to
>
> Initial values:
>
> a = 0
> b = 0
>
> CPU1			CPU2
>
> a = 1			rcu_read_lock
> synchronize_rcu		if (a == 0)
> b = 2				b = 1
> 			rcu_read_unlock
>
> On exit we want this to be true:
>
> b == 2
Re: rcu_read_lock lost its compiler barrier
> This example really does point out a weakness in the LKMM's handling of
> data races.  Herbert's litmus test is a great starting point:
>
> C xu
>
> {}
>
> P0(int *a, int *b)
> {
> 	WRITE_ONCE(*a, 1);
> 	synchronize_rcu();
> 	*b = 2;
> }
>
> P1(int *a, int *b)
> {
> 	rcu_read_lock();
> 	if (READ_ONCE(*a) == 0)
> 		*b = 1;
> 	rcu_read_unlock();
> }
>
> exists (~b=2)
>
> Currently the LKMM says the test is allowed and there is a data race,
> but this answer clearly is wrong since it would violate the RCU
> guarantee.
>
> The problem is that the LKMM currently requires all ordering/visibility
> of plain accesses to be mediated by marked accesses.  But in this case,
> the visibility is mediated by RCU.  Technically, we need to add a
> relation like
>
> 	([M] ; po ; rcu-fence ; po ; [M])
>
> into the definitions of ww-vis, wr-vis, and rw-xbstar.  Doing so
> changes the litmus test's result to "not allowed" and no data race.
> However, I'm not certain that this single change is the entire fix;
> more thought is needed.

This seems a sensible change to me: looking forward to seeing a patch,
on top of -rcu/dev, for further review and testing!

We could also add (to LKMM) the barrier() for rcu_read_{lock,unlock}()
discussed in this thread (maybe once the RCU code and the informal doc
will have settled in such direction).

It seems worth stressing the fact that _neither_ of these changes will
prevent the test below from being racy, considered the two accesses to
"a" happen concurrently / without synchronization.

  Thanks,
  Andrea

C xu-2

{}

P0(int *a, int *b)
{
	*a = 1;
	synchronize_rcu();
	WRITE_ONCE(*b, 2);
}

P1(int *a, int *b)
{
	rcu_read_lock();
	if (*a == 0)
		WRITE_ONCE(*b, 1);
	rcu_read_unlock();
}

exists (~b=2)
Re: rcu_read_lock lost its compiler barrier
On Wed, Jun 05, 2019 at 11:05:11PM -0700, Paul E. McKenney wrote:
>
> In case you were wondering, the reason that I was giving you such
> a hard time was that from what I could see, you were pushing for no
> {READ,WRITE}_ONCE() at all.  ;-)

Hmm, that's exactly what it should be in net/ipv4/inet_fragment.c.
We don't need the READ_ONCE/WRITE_ONCE (or volatile marking) at
all.  Even if the compiler dices and slices the reads/writes of
"a" into a thousand pieces, it should still work if the RCU
primitives are worth their salt.

But I do concede that in the general RCU case you must have the
READ_ONCE/WRITE_ONCE calls for rcu_dereference/rcu_assign_pointer.

Cheers,
-- 
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: rcu_read_lock lost its compiler barrier
On Thu, Jun 06, 2019 at 12:51:09PM +0800, Herbert Xu wrote:
> On Tue, Jun 04, 2019 at 10:44:18AM -0400, Alan Stern wrote:
> >
> > Currently the LKMM says the test is allowed and there is a data race,
> > but this answer clearly is wrong since it would violate the RCU
> > guarantee.
>
> Thank you!  This is what I tried to say all along in this thread
> but you expressed it in a much better way :)

In case you were wondering, the reason that I was giving you such
a hard time was that from what I could see, you were pushing for no
{READ,WRITE}_ONCE() at all.  ;-)

							Thanx, Paul
Re: rcu_read_lock lost its compiler barrier
On Tue, Jun 04, 2019 at 10:44:18AM -0400, Alan Stern wrote:
>
> Currently the LKMM says the test is allowed and there is a data race,
> but this answer clearly is wrong since it would violate the RCU
> guarantee.

Thank you!  This is what I tried to say all along in this thread
but you expressed it in a much better way :)

-- 
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: rcu_read_lock lost its compiler barrier
On Tue, Jun 04, 2019 at 08:30:39PM -0700, Paul E. McKenney wrote:
>
> Understood.  Does the patch I sent out a few hours ago cover it?  Or is
> something else needed?

It looks good to me.

> Other than updates to the RCU requirements documentation, which is
> forthcoming.

Thanks Paul.

-- 
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: rcu_read_lock lost its compiler barrier
On Wed, Jun 05, 2019 at 10:21:17AM +0800, Herbert Xu wrote:
> On Tue, Jun 04, 2019 at 02:14:49PM -0700, Paul E. McKenney wrote:
> >
> > Yeah, I know, even with the "volatile" keyword, it is not entirely clear
> > how much reordering the compiler is allowed to do.  I was relying on
> > https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html, which says:
>
> The volatile keyword doesn't give any guarantees of this kind.
> The key to ensuring ordering between unrelated variable/register
> reads/writes is the memory clobber:
>
> 6.47.2.6 Clobbers and Scratch Registers
>
> ...
>
> "memory"  The "memory" clobber tells the compiler that the assembly
> 	code performs memory reads or writes to items other than those
> 	listed in the input and output operands (for example, accessing
> 	the memory pointed to by one of the input parameters).  To ensure
> 	memory contains correct values, GCC may need to flush specific
> 	register values to memory before executing the asm.  Further,
> 	the compiler does not assume that any values read from memory
> 	before an asm remain unchanged after that asm; it reloads them as
> 	needed.  Using the "memory" clobber effectively forms a read/write
> 	memory barrier for the compiler.
>
> 	Note that this clobber does not prevent the processor from
> 	doing speculative reads past the asm statement.  To prevent that,
> 	you need processor-specific fence instructions.
>
> IOW you need a barrier().

Understood.  Does the patch I sent out a few hours ago cover it?  Or is
something else needed?

Other than updates to the RCU requirements documentation, which is
forthcoming.

							Thanx, Paul
Re: rcu_read_lock lost its compiler barrier
On Tue, Jun 04, 2019 at 02:14:49PM -0700, Paul E. McKenney wrote:
>
> Yeah, I know, even with the "volatile" keyword, it is not entirely clear
> how much reordering the compiler is allowed to do.  I was relying on
> https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html, which says:

The volatile keyword doesn't give any guarantees of this kind.
The key to ensuring ordering between unrelated variable/register
reads/writes is the memory clobber:

6.47.2.6 Clobbers and Scratch Registers

...

"memory"  The "memory" clobber tells the compiler that the assembly
	code performs memory reads or writes to items other than those
	listed in the input and output operands (for example, accessing
	the memory pointed to by one of the input parameters).  To ensure
	memory contains correct values, GCC may need to flush specific
	register values to memory before executing the asm.  Further,
	the compiler does not assume that any values read from memory
	before an asm remain unchanged after that asm; it reloads them as
	needed.  Using the "memory" clobber effectively forms a read/write
	memory barrier for the compiler.

	Note that this clobber does not prevent the processor from
	doing speculative reads past the asm statement.  To prevent that,
	you need processor-specific fence instructions.

IOW you need a barrier().

Thanks,
-- 
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: rcu_read_lock lost its compiler barrier
On Mon, Jun 03, 2019 at 01:24:32PM -0700, Linus Torvalds wrote:
> On Mon, Jun 3, 2019 at 12:53 PM Paul E. McKenney wrote:
> >
> > I agree that !PREEMPT rcu_read_lock() would not affect compiler code
> > generation, but given that get_user() is a volatile asm, isn't the
> > compiler already forbidden from reordering it with the volatile-casted
> > WRITE_ONCE() access, even if there was nothing at all between them?
> > Or are asms an exception to the rule that volatile executions cannot
> > be reordered?
>
> Paul, you MAKE NO SENSE.
>
> What is wrong with you?

Mostly that I didn't check all architectures' definitions of
get_user().  Had I done so, I would have seen that not all of the
corresponding asms have the "volatile" keyword.  And of course, without
that keyword, there is absolutely nothing preventing the compiler from
reordering the asm with pretty much anything.  The only things that
would be absolutely guaranteed to prevent reordering would be things
like memory clobbers (barrier()) or accesses that overlap the asm's
input/output list.

Yeah, I know, even with the "volatile" keyword, it is not entirely
clear how much reordering the compiler is allowed to do.  I was relying
on https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html, which says:

	Qualifiers

	volatile

	The typical use of extended asm statements is to manipulate
	input values to produce output values.  However, your asm
	statements may also produce side effects.  If so, you may need
	to use the volatile qualifier to disable certain optimizations.
	See Volatile.

But the linked-to "Volatile" section later in that same web page mostly
talks about the compiler's ability to hoist asms out of loops.

> I just showed you an example of where rcu_read_lock() needs to be a
> compiler barrier, and then you make incoherent noises about
> WRITE_ONCE() that do not even exist in that example.
I thought we were discussing this example, but it doesn't matter
because I was missing your point about get_user() and page faults:

	get_user(val, ptr)
	rcu_read_lock();
	WRITE_ONCE(state, 1);

But regardless, given that some architectures omit volatile from their
asms implementing get_user(), even an optimistic interpretation of that
part of the GCC documentation would still permit reordering the above.
And again, I was missing your point about get_user() causing page
faults and thus context switches.

> Forget about your READ_ONCE/WRITE_ONCE theories.  Herbert already
> showed code that doesn't have those accessors, so reality doesn't
> match your fevered imagination.

I get the feeling that you believe that I want LKMM to be some sort
of final judge and absolute arbiter of what code is legal and not
from a memory-ordering perspective.  This is absolutely -not- the
case.  The current state of the art, despite the recent impressive
progress, simply cannot reasonably do this.  So all I can claim is
that LKMM dispenses advice, hopefully good advice.  (It is early days
for LKMM's handling of plain accesses, so some work might be required
to deliver on the "good advice" promise, but we have to start
somewhere.  Plus it is progressing nicely.)

The places where long-standing RCU patterns require rcu_dereference()
and rcu_assign_pointer() do require some attention to avoid compiler
optimizations, and {READ,WRITE}_ONCE() is one way of addressing this.
But not the only way, nor always the best way.  For example, some
fastpaths might need the optimizations that {READ,WRITE}_ONCE()
suppresses.  Therefore, Linux kernel hackers have a number of other
ways of paying attention.  For example, accesses might be constrained
via barrier() and friends.  For another example, some developers might
check assembly output (hopefully scripted somehow).

Again, the Linux-kernel memory model dispenses advice, not absolutes.
Furthermore, the way it dispenses advice is currently a bit limited.
It can currently say that it is nervous about lack of
{READ,WRITE}_ONCE(), as in "Flag data-race", but it would be difficult
to make it recommend the other options in an intelligent way.  So we
should interpret "Flag data-race" as LKMM saying "I am nervous about
your unmarked accesses" rather than "You absolutely must call
{READ,WRITE}_ONCE() more often!!!"  Again, advice, not absolutes.

So the idea is that you add and remove {READ,WRITE}_ONCE() to/from the
-litmus- -tests- to determine which accesses LKMM is nervous about.
But that doesn't necessarily mean that {READ,WRITE}_ONCE() goes into
the corresponding places in the Linux kernel.

Does that help, or am I still confused?

> And sometimes it's not even possible, since you can't do a bitfield
> access, for example, with READ_ONCE().

Ah, good point.  So the Linux kernel uses bitfields to communicate
between mainline and interrupt handlers.  New one on me.  :-/

> > We can of course put them back in,
>
> Stop the craziness.  It's not "we can".
Re: rcu_read_lock lost its compiler barrier
On Tue, Jun 4, 2019 at 10:00 AM Alan Stern wrote:
>
> Which suggests asking whether these higher expectations should be
> reflected in the Linux Kernel Memory Model.  So far we have largely
> avoided doing that sort of thing, although there are a few exceptions.

I think they might be hard to describe - which in turn may be why the
standard leaves it open and only describes the simple cases.

Exactly that "we expect an assignment to be done as a single write" is
probably a good example.  Yes, we do have that expectation for the
simple cases.  But it's not an absolute rule, obviously, because it's
clearly violated by bitfield writes, and similarly it's obviously
violated for data that is bigger than the word size (ie a "u64"
assignment is obviously not a single write when you're on a 32-bit
target).

And while those cases are static and could be described separately,
the "oh, and we have _other_ accesses to the same variable nearby, and
we expect that the compiler might take those into account unless we
explicitly use WRITE_ONCE()" things make for much more hard to
describe issues.

Doing writes with speculative values is clearly bogus and garbage
(although compiler writers _have_ tried to do that too: "we then
overwrite it with the right value later, so it's ok"), and thankfully
even user space people complain about that due to threading and
signals.

But eliding writes entirely by combining them with a later one is
clearly ok - and eliding them when there was an explicit earlier read
and value comparison (like in my example) sounds reasonable to me too.
Yet silently adding the elision that wasn't visible in the source due
to other accesses would be bad.

How do you say "sane and reasonable compiler" in a spec?

You usually don't - or you make the rules so odd and complex than
nobody really understands them any more, and make it all moot ;)

		Linus
Re: rcu_read_lock lost its compiler barrier
On Tue, 4 Jun 2019, Linus Torvalds wrote: > So I don't technically disagree with anything you say, That's good to know! > I just wanted > to point out that as far as the kernel is concerned, we do have higher > quality expectations from the compiler than just "technically valid > according to the C standard". Which suggests asking whether these higher expectations should be reflected in the Linux Kernel Memory Model. So far we have largely avoided doing that sort of thing, although there are a few exceptions. (For example, we assume the compiler does not destroy address dependencies from volatile reads -- but we also warn that this assumption may fail if the programmer does not follow some rules described in one of Paul's documentation files.) Alan
Re: rcu_read_lock lost its compiler barrier
On Tue, Jun 4, 2019 at 7:44 AM Alan Stern wrote: > > Even if you don't think the compiler will ever do this, the C standard > gives compilers the right to invent read accesses if a plain (i.e., > non-atomic and non-volatile) write is present. Note that for the kernel, it's not like we go strictly by what the standard says anyway. The reason for the "read for write" thing is that obviously you sometimes have broken architectures (*cough*alpha*cough*) that need to do some common writes as read-mask-modify-write accesses, and for bitfields you obviously *always* have that issue. In general, a sane compiler (as opposed to "we just read the standard") had better not add random reads, because people might have some mappings be write-only when the hardware supports it. And we do generally expect sanity from our compilers, and will add compiler flags to disable bad behavior if required - even if said behavior would be "technically allowed by the standard". > (Incidentally, regardless of whether the compiler will ever do this, I > have seen examples in the kernel where people did exactly this > manually, in order to avoid dirtying a cache line unnecessarily.) I really hope and expect that this is not something that the compiler ever does. If a compiler turns "x = y" into "if (x != y) x = y" (like we do manually at times, as you say), that compiler is garbage that we cannot use for the kernel. It would break things like "smp_wmb()" ordering guarantees, I'm pretty sure. And as always, if we're doing actively stupid things, and the compiler then turns our stupid code into something we don't expect, the corollary is that then it's on _us_. IOW, if we do if (x != 1) { ... } x = 1; and the compiler goes "oh, we already checked that 'x == 1'" and moves that "unconditional" 'x = 1' into the conditional section like if (x != 1) { .. x = 1; } then that's not something we can then complain about. So our expectation is that the compiler is _sane_, not that it's some "C as assembler".
Adding spurious reads is not ok. But if we already had reads in the code and the compiler combines them with other ops, that's on us. End result: in general, we do expect that the compiler turns a regular assignment into a single plain write when that's what the code does, and does not add extra logic over and beyond that. In fact, the alpha port was always subtly buggy exactly because of the "byte write turns into a read-and-masked-write", even if I don't think anybody ever noticed (we did fix cases where people _did_ notice, though, and we might still have some cases where we use 'int' for booleans because of alpha issues). So I don't technically disagree with anything you say, I just wanted to point out that as far as the kernel is concerned, we do have higher quality expectations from the compiler than just "technically valid according to the C standard". Linus
Re: rcu_read_lock lost its compiler barrier
On Mon, 3 Jun 2019, Paul E. McKenney wrote: > On Mon, Jun 03, 2019 at 02:42:00PM +0800, Boqun Feng wrote: > > On Mon, Jun 03, 2019 at 01:26:26PM +0800, Herbert Xu wrote: > > > On Sun, Jun 02, 2019 at 08:47:07PM -0700, Paul E. McKenney wrote: > > > > > > > > 1. These guarantees are of full memory barriers, -not- compiler > > > > barriers. > > > > > > What I'm saying is that wherever they are, they must come with > > > compiler barriers. I'm not aware of any synchronisation mechanism > > > in the kernel that gives a memory barrier without a compiler barrier. > > > > > > > 2. These rules don't say exactly where these full memory barriers > > > > go. SRCU is at one extreme, placing those full barriers in > > > > srcu_read_lock() and srcu_read_unlock(), and !PREEMPT Tree RCU > > > > at the other, placing these barriers entirely within the > > > > callback > > > > queueing/invocation, grace-period computation, and the > > > > scheduler. > > > > Preemptible Tree RCU is in the middle, with rcu_read_unlock() > > > > sometimes including a full memory barrier, but other times with > > > > the full memory barrier being confined as it is with !PREEMPT > > > > Tree RCU. > > > > > > The rules do say that the (full) memory barrier must precede any > > > RCU read-side that occur after the synchronize_rcu and after the > > > end of any RCU read-side that occur before the synchronize_rcu. > > > > > > All I'm arguing is that wherever that full mb is, as long as it > > > also carries with it a barrier() (which it must do if it's done > > > using an existing kernel mb/locking primitive), then we're fine. > > > > > > > Interleaving and inserting full memory barriers as per the rules above: > > > > > > > > CPU1: WRITE_ONCE(a, 1) > > > > CPU1: synchronize_rcu > > > > /* Could put a full memory barrier here, but it wouldn't help. > > > > */ > > > > > > CPU1: smp_mb(); > > > CPU2: smp_mb(); > > > > > > Let's put them in because I think they are critical. 
smp_mb() also > > > carries with it a barrier(). > > > > > > > CPU2: rcu_read_lock(); > > > > CPU1: b = 2; > > > > CPU2: if (READ_ONCE(a) == 0) > > > > CPU2: if (b != 1) /* Weakly ordered CPU moved this up! > > > > */ > > > > CPU2: b = 1; > > > > CPU2: rcu_read_unlock > > > > > > > > In fact, CPU2's load from b might be moved up to race with CPU1's store, > > > > which (I believe) is why the model complains in this case. > > > > > > Let's put aside my doubt over how we're even allowing a compiler > > > to turn > > > > > > b = 1 > > > > > > into > > > > > > if (b != 1) > > > b = 1 Even if you don't think the compiler will ever do this, the C standard gives compilers the right to invent read accesses if a plain (i.e., non-atomic and non-volatile) write is present. The Linux Kernel Memory Model has to assume that compilers will sometimes do this, even if it doesn't take the exact form of checking a variable's value before writing to it. (Incidentally, regardless of whether the compiler will ever do this, I have seen examples in the kernel where people did exactly this manually, in order to avoid dirtying a cache line unnecessarily.) > > > Since you seem to be assuming that (a == 0) is true in this case > > > > I think Paul's example is assuming (a == 0) is false, and maybe > > Yes, otherwise, P0()'s write to "b" cannot have happened. > > > speculative writes (by compilers) needs to be added into consideration? On the other hand, the C standard does not allow compilers to add speculative writes. The LKMM assumes they will never occur. > I would instead call it the compiler eliminating needless writes > by inventing reads -- if the variable already has the correct value, > no write happens. So no compiler speculation. > > However, it is difficult to create a solid defensible example.
Yes, > from LKMM's viewpoint, the weakly reordered invented read from "b" > can be concurrent with P0()'s write to "b", but in that case the value > loaded would have to manage to be equal to 1 for anything bad to happen. > This does feel wrong to me, but again, it is difficult to create a solid > defensible example. > > > Please consider the following case (I add a few smp_mb()s), the case may > > be a little bit crazy, you have been warned ;-) > > > > CPU1: WRITE_ONCE(a, 1) > > CPU1: synchronize_rcu called > > > > CPU1: smp_mb(); /* let's assume there is one here */ > > > > CPU2: rcu_read_lock(); > > CPU2: smp_mb(); /* let's assume there is one here */ > > > > /* "if (b != 1) b = 1" reordered */ > > CPU2: r0 = b; /* if (b != 1) reordered here, r0 == 0 */ > > CPU2: if (r0 != 1) /* true */ > > CPU2: b = 1; /* b == 1 now, this is a speculative write > > by compiler > > */ > >
Re: rcu_read_lock lost its compiler barrier
On Mon, Jun 3, 2019 at 12:53 PM Paul E. McKenney wrote: > > I agree that !PREEMPT rcu_read_lock() would not affect compiler code > generation, but given that get_user() is a volatile asm, isn't the > compiler already forbidden from reordering it with the volatile-casted > WRITE_ONCE() access, even if there was nothing at all between them? > Or are asms an exception to the rule that volatile executions cannot > be reordered? Paul, you MAKE NO SENSE. What is wrong with you? I just showed you an example of where rcu_read_lock() needs to be a compiler barrier, and then you make incoherent noises about WRITE_ONCE() that do not even exist in that example. Forget about your READ_ONCE/WRITE_ONCE theories. Herbert already showed code that doesn't have those accessors, so reality doesn't match your fevered imagination. And sometimes it's not even possible, since you can't do a bitfield access, for example, with READ_ONCE(). > We can of course put them back in, Stop the craziness. It's not "we can". It is a "we will". So I will add that barrier, and you need to stop arguing against it based on specious theoretical arguments that do not match reality. And we will not ever remove that barrier again. Herbert already pointed to me having to do this once before in commit 386afc91144b ("spinlocks and preemption points need to be at least compiler barriers"), and rcu_read_lock() clearly has at a minimum that same preemption point issue. Linus
Re: rcu_read_lock lost its compiler barrier
On Mon, Jun 03, 2019 at 02:42:00PM +0800, Boqun Feng wrote: > On Mon, Jun 03, 2019 at 01:26:26PM +0800, Herbert Xu wrote: > > On Sun, Jun 02, 2019 at 08:47:07PM -0700, Paul E. McKenney wrote: > > > > > > 1.These guarantees are of full memory barriers, -not- compiler > > > barriers. > > > > What I'm saying is that wherever they are, they must come with > > compiler barriers. I'm not aware of any synchronisation mechanism > > in the kernel that gives a memory barrier without a compiler barrier. > > > > > 2.These rules don't say exactly where these full memory barriers > > > go. SRCU is at one extreme, placing those full barriers in > > > srcu_read_lock() and srcu_read_unlock(), and !PREEMPT Tree RCU > > > at the other, placing these barriers entirely within the callback > > > queueing/invocation, grace-period computation, and the scheduler. > > > Preemptible Tree RCU is in the middle, with rcu_read_unlock() > > > sometimes including a full memory barrier, but other times with > > > the full memory barrier being confined as it is with !PREEMPT > > > Tree RCU. > > > > The rules do say that the (full) memory barrier must precede any > > RCU read-side that occur after the synchronize_rcu and after the > > end of any RCU read-side that occur before the synchronize_rcu. > > > > All I'm arguing is that wherever that full mb is, as long as it > > also carries with it a barrier() (which it must do if it's done > > using an existing kernel mb/locking primitive), then we're fine. > > > > > Interleaving and inserting full memory barriers as per the rules above: > > > > > > CPU1: WRITE_ONCE(a, 1) > > > CPU1: synchronize_rcu > > > /* Could put a full memory barrier here, but it wouldn't help. */ > > > > CPU1: smp_mb(); > > CPU2: smp_mb(); > > > > Let's put them in because I think they are critical. smp_mb() also > > carries with it a barrier(). 
> > > > > CPU2: rcu_read_lock(); > > > CPU1: b = 2; > > > CPU2: if (READ_ONCE(a) == 0) > > > CPU2: if (b != 1) /* Weakly ordered CPU moved this up! */ > > > CPU2: b = 1; > > > CPU2: rcu_read_unlock > > > > > > In fact, CPU2's load from b might be moved up to race with CPU1's store, > > > which (I believe) is why the model complains in this case. > > > > Let's put aside my doubt over how we're even allowing a compiler > > to turn > > > > b = 1 > > > > into > > > > if (b != 1) > > b = 1 > > > > Since you seem to be assuming that (a == 0) is true in this case > > I think Paul's example is assuming (a == 0) is false, and maybe Yes, otherwise, P0()'s write to "b" cannot have happened. > speculative writes (by compilers) needs to be added into consideration? I would instead call it the compiler eliminating needless writes by inventing reads -- if the variable already has the correct value, no write happens. So no compiler speculation. However, it is difficult to create a solid defensible example. Yes, from LKMM's viewpoint, the weakly reordered invented read from "b" can be concurrent with P0()'s write to "b", but in that case the value loaded would have to manage to be equal to 1 for anything bad to happen. This does feel wrong to me, but again, it is difficult to create a solid defensible example. > Please consider the following case (I add a few smp_mb()s), the case may > be a little bit crazy, you have been warned ;-) > > CPU1: WRITE_ONCE(a, 1) > CPU1: synchronize_rcu called > > CPU1: smp_mb(); /* let's assume there is one here */ > > CPU2: rcu_read_lock(); > CPU2: smp_mb(); /* let's assume there is one here */ > > /* "if (b != 1) b = 1" reordered */ > CPU2: r0 = b; /* if (b != 1) reordered here, r0 == 0 */ > CPU2: if (r0 != 1) /* true */ > CPU2: b = 1; /* b == 1 now, this is a speculative write by compiler */ > > CPU1: b = 2; /* b == 2 */ > > CPU2: if (READ_ONCE(a) == 0) /* false */ > CPU2: ...
> CPU2: else /* undo the speculative write */ > CPU2: b = r0; /* b == 0 */ > > CPU2: smp_mb(); > CPU2: rcu_read_unlock(); > > I know this is too crazy for us to think a compiler would do this, but this > might be the reason why the model complains about this. > > Paul, did I get this right? Or did you mean something else? Mostly there, except that I am not yet desperate enough to appeal to compilers speculating stores. ;-) Thanx, Paul > Regards, > Boqun > > > > > (as the assignment b = 1 is carried out), then because of the > > presence of the full memory barrier, the RCU read-side section > > must have started prior to the synchronize_rcu. This means that > > synchronize_rcu is not allowed to return until at least the end > > of the grace period, or at least until the end of rcu_read_unlock. > > > > So
Re: rcu_read_lock lost its compiler barrier
On Mon, Jun 03, 2019 at 09:07:29AM -0700, Linus Torvalds wrote: > On Mon, Jun 3, 2019 at 8:55 AM Linus Torvalds > wrote: > > > > I don't believe that it would necessarily help to turn a > > rcu_read_lock() into a compiler barrier, because for the non-preempt > > case rcu_read_lock() doesn't need to actually _do_ anything, and > > anything that matters for the RCU read lock will already be a compiler > > barrier for other reasons (ie a function call that can schedule). > > Actually, thinking a bit more about this, and trying to come up with > special cases, I'm not at all convinced. > > Even if we don't have preemption enabled, it turns out that we *do* > have things that can cause scheduling without being compiler barriers. > > In particular, user accesses are not necessarily full compiler > barriers. One common pattern (x86) is > > asm volatile("call __get_user_%P4" > > which explicitly has a "asm volatile" so that it doesn't re-order wrt > other asms (and thus other user accesses), but it does *not* have a > "memory" clobber, because the user access doesn't actually change > kernel memory. Not even if it's a "put_user()". > > So we've made those fairly relaxed on purpose. And they might be > relaxed enough that they'd allow re-ordering wrt something that does a > rcu read lock, unless the rcu read lock has some compiler barrier in > it. > > IOW, imagine completely made up code like > > get_user(val, ptr) > rcu_read_lock(); > WRITE_ONCE(state, 1); > > and unless the rcu lock has a barrier in it, I actually think that > write to 'state' could migrate to *before* the get_user(). > > I'm not convinced we have anything that remotely looks like the above, > but I'm actually starting to think that yes, all RCU barriers had > better be compiler barriers. > > Because this is very much an example of something where you don't > necessarily need a memory barrier, but there's a code generation > barrier needed because of local ordering requirements.
The possible > faulting behavior of "get_user()" must not migrate into the RCU > critical region. > > Paul? I agree that !PREEMPT rcu_read_lock() would not affect compiler code generation, but given that get_user() is a volatile asm, isn't the compiler already forbidden from reordering it with the volatile-casted WRITE_ONCE() access, even if there was nothing at all between them? Or are asms an exception to the rule that volatile executions cannot be reordered? > So I think the rule really should be: every single form of locking > that has any semantic meaning at all, absolutely needs to be at least > a compiler barrier. > > (That "any semantic meaning" weaselwording is because I suspect that > we have locking that truly and intentionally becomes no-ops because > it's based on things that aren't relevant in some configurations. But > generally compiler barriers are really pretty damn cheap, even from a > code generation standpoint, and can help make the resulting code more > legible, so I think we should not try to aggressively remove them > without _very_ good reasons) We can of course put them back in, but this won't help in the typical rcu_assign_pointer(), rcu_dereference(), and synchronize_rcu() situation (nor do I see how it helps in Herbert's example). And in other RCU use cases, the accesses analogous to the rcu_assign_pointer() and rcu_dereference() (in Herbert's example, the accesses to variable "a") really need to be READ_ONCE()/WRITE_ONCE() or stronger, correct? Thanx, Paul
Re: rcu_read_lock lost its compiler barrier
On Mon, Jun 3, 2019 at 8:55 AM Linus Torvalds wrote: > > I don't believe that it would necessarily help to turn a > rcu_read_lock() into a compiler barrier, because for the non-preempt > case rcu_read_lock() doesn't need to actually _do_ anything, and > anything that matters for the RCU read lock will already be a compiler > barrier for other reasons (ie a function call that can schedule). Actually, thinking a bit more about this, and trying to come up with special cases, I'm not at all convinced. Even if we don't have preemption enabled, it turns out that we *do* have things that can cause scheduling without being compiler barriers. In particular, user accesses are not necessarily full compiler barriers. One common pattern (x86) is asm volatile("call __get_user_%P4" which explicitly has a "asm volatile" so that it doesn't re-order wrt other asms (and thus other user accesses), but it does *not* have a "memory" clobber, because the user access doesn't actually change kernel memory. Not even if it's a "put_user()". So we've made those fairly relaxed on purpose. And they might be relaxed enough that they'd allow re-ordering wrt something that does a rcu read lock, unless the rcu read lock has some compiler barrier in it. IOW, imagine completely made up code like get_user(val, ptr) rcu_read_lock(); WRITE_ONCE(state, 1); and unless the rcu lock has a barrier in it, I actually think that write to 'state' could migrate to *before* the get_user(). I'm not convinced we have anything that remotely looks like the above, but I'm actually starting to think that yes, all RCU barriers had better be compiler barriers. Because this is very much an example of something where you don't necessarily need a memory barrier, but there's a code generation barrier needed because of local ordering requirements. The possible faulting behavior of "get_user()" must not migrate into the RCU critical region. Paul?
So I think the rule really should be: every single form of locking that has any semantic meaning at all, absolutely needs to be at least a compiler barrier. (That "any semantic meaning" weaselwording is because I suspect that we have locking that truly and intentionally becomes no-ops because it's based on things that aren't relevant in some configurations. But generally compiler barriers are really pretty damn cheap, even from a code generation standpoint, and can help make the resulting code more legible, so I think we should not try to aggressively remove them without _very_ good reasons) Linus
Re: rcu_read_lock lost its compiler barrier
On Sun, Jun 2, 2019 at 8:03 PM Herbert Xu wrote: > > In any case, I am now even more certain that compiler barriers are > not needed in the code in question. The reasoning is quite simple. > If you need those compiler barriers then you surely need real memory > barriers. So the above statement is not necessarily correct. Compiler barriers very much can have real effects even in the absence of "real" memory barriers. But those effects are obviously not about multiple CPU's - they are about code generation and can be about ordering on _one_ CPU. Those effects definitely happen, though. So a compiler barrier without a memory barrier may make a difference if you (a) compile for UP (when a compiler barrier basically _is_ a memory barrier) (b) have deadlock or ordering avoidance with only the local CPU taking interrupts. (c) need to re-load a value in a loop, but ordering isn't a concern and possibly other situations. In the above, (a) may be pointless and trivial, but (b) and (c) are issues even on SMP. Some things only matter for the local CPU - a code sequence that happens on another CPU can just block, but an interrupt that comes in on the same CPU may deadlock and depend on particular access ordering. And (c) is for things like cpu_relax() in loops that read stuff (although honestly, READ_ONCE() is generally a much better pattern). But it sounds like in this case at least, Herbert's and Paul's disagreements aren't really so much about the memory barriers and locking as about the fact that in general the only thing that RCU protects against is single accesses, and thus any RCU region technically should use something that guarantees that the compiler won't do stupid things. We do require a _minimum_ of compiler sanity, but the compiler turning a non-marked read into two reads has definitely happened, and can be very subtle.
Even if on a C level it looks like a single access, and correctness doesn't care about _what_ the value we read is, it might be turned by the compiler into two separate accesses that get two different values, and then the end result may be insane and unreliable. So on the whole, I do think that it's usually a good idea to show _which_ access is protected by RCU. Perhaps with a READ_ONCE/WRITE_ONCE pattern, although it can be other things. I don't believe that it would necessarily help to turn a rcu_read_lock() into a compiler barrier, because for the non-preempt case rcu_read_lock() doesn't need to actually _do_ anything, and anything that matters for the RCU read lock will already be a compiler barrier for other reasons (ie a function call that can schedule). Anyway, I suspect the code is correct in practice, but I do have some sympathy for the "RCU protected accesses should be marked". Linus
Re: rcu_read_lock lost its compiler barrier
On Mon, Jun 3, 2019 at 8:26 AM David Laight wrote: > > From: Paul E. McKenney > > > We do > > occasionally use READ_ONCE() to prevent load-fusing optimizations that > > would otherwise cause the compiler to turn while-loops into if-statements > > guarding infinite loops. > > In that case the variable ought to be volatile... No. We do not use volatile on variables. The C people got the semantics wrong, probably because 'volatile' was historically for IO, not for access atomicity without locking. It's not the memory location that is volatile, it is really the _access_ that is volatile. The same memory location might be completely stable in other access situations (ie when done under a lock). In other words, we should *never* use volatile in the kernel. It's fundamentally mis-designed for modern use. (Of course, we can then use volatile in a cast in code, which drives some compiler people crazy, but that's because said compiler people don't care about reality, they care about some paperwork). Linus
RE: rcu_read_lock lost its compiler barrier
From: Paul E. McKenney > Sent: 03 June 2019 09:42 ... > On kissing the kernel goodbye, a reasonable strategy might be to > identify the transformations that are actually occurring (like the > stores of certain constants called out above) and fix those. > We do > occasionally use READ_ONCE() to prevent load-fusing optimizations that > would otherwise cause the compiler to turn while-loops into if-statements > guarding infinite loops. In that case the variable ought to be volatile... > There is also the possibility of having the > compiler guys give us more command-line arguments. I wonder how much the code size (of anything) would increase if the compiler: 1) Never read a value into a local more than once. 2) Never write a value that wasn't requested by the code. 3) Never used multiple memory accesses for 'machine word' (and smaller) items. (1) makes all reads READ_ONCE() except that the actual read can be delayed until further down the code. If I have a naive #define bswap32() I'd expect: v = bswap32(foo->bar) to possibly read foo->bar multiple times, but not: int foo_bar = foo->bar; v = bswap32(foo_bar); (2) makes all writes WRITE_ONCE() except that if there are multiple writes to the same location, only the last need be done. In particular it stops speculative writes and the use of locations that are going to be written to as temporaries. It also stops foo->bar = ~0; being implemented as a clear then decrement. (3) I'd never imagined the compiler would write the two halves of a word separately! If the compiler behaved like that (as one might expect it would) then READ_ONCE() would be a READ_NOW() for when the sequencing mattered. I was also recently surprised by the code I got from this loop: for (i = 0; i < limit; i++) sum64 += array32[i]; (as in the IP checksum sum without add carry support). The compiler unrolled it to use horrid sequences of sse3/avx instructions.
This might be a gain for large enough buffers and 'hot cache' but for small buffers and a likely cold cache it is horrid. I guess such optimisations are valid, but I wonder how often they are an actual win for real programs. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
Re: rcu_read_lock lost its compiler barrier
On Mon, Jun 03, 2019 at 01:26:26PM +0800, Herbert Xu wrote: > On Sun, Jun 02, 2019 at 08:47:07PM -0700, Paul E. McKenney wrote: > > > > 1. These guarantees are of full memory barriers, -not- compiler > > barriers. > > What I'm saying is that wherever they are, they must come with > compiler barriers. I'm not aware of any synchronisation mechanism > in the kernel that gives a memory barrier without a compiler barrier. Yes, if a given synchronization mechanism requires that memory references need to be ordered, both the compiler and the CPU must maintain that ordering. > > 2. These rules don't say exactly where these full memory barriers > > go. SRCU is at one extreme, placing those full barriers in > > srcu_read_lock() and srcu_read_unlock(), and !PREEMPT Tree RCU > > at the other, placing these barriers entirely within the callback > > queueing/invocation, grace-period computation, and the scheduler. > > Preemptible Tree RCU is in the middle, with rcu_read_unlock() > > sometimes including a full memory barrier, but other times with > > the full memory barrier being confined as it is with !PREEMPT > > Tree RCU. > > The rules do say that the (full) memory barrier must precede any > RCU read-side that occur after the synchronize_rcu and after the > end of any RCU read-side that occur before the synchronize_rcu. > > All I'm arguing is that wherever that full mb is, as long as it > also carries with it a barrier() (which it must do if it's done > using an existing kernel mb/locking primitive), then we're fine. Fair enough, and smp_mb() does provide what is needed. > > Interleaving and inserting full memory barriers as per the rules above: > > > > CPU1: WRITE_ONCE(a, 1) > > CPU1: synchronize_rcu > > /* Could put a full memory barrier here, but it wouldn't help. */ > > CPU1: smp_mb(); > CPU2: smp_mb(); What is CPU2's smp_mb() ordering? In other words, what comment would you put on each of the above smp_mb() calls? > Let's put them in because I think they are critical. 
smp_mb() also > carries with it a barrier(). Again, agreed, smp_mb() implies barrier(). > > CPU2: rcu_read_lock(); > > CPU1: b = 2; > > CPU2: if (READ_ONCE(a) == 0) > > CPU2: if (b != 1) /* Weakly ordered CPU moved this up! */ > > CPU2: b = 1; > > CPU2: rcu_read_unlock > > > > In fact, CPU2's load from b might be moved up to race with CPU1's store, > > which (I believe) is why the model complains in this case. > > Let's put aside my doubt over how we're even allowing a compiler > to turn > > b = 1 > > into > > if (b != 1) > b = 1 > > Since you seem to be assuming that (a == 0) is true in this case > (as the assignment b = 1 is carried out), then because of the > presence of the full memory barrier, the RCU read-side section > must have started prior to the synchronize_rcu. This means that > synchronize_rcu is not allowed to return until at least the end > of the grace period, or at least until the end of rcu_read_unlock. > > So it actually should be: > > CPU1: WRITE_ONCE(a, 1) > CPU1: synchronize_rcu called > /* Could put a full memory barrier here, but it wouldn't help. */ > > CPU1: smp_mb(); > CPU2: smp_mb(); > > CPU2: grace period starts > ...time passes... > CPU2: rcu_read_lock(); > CPU2: if (READ_ONCE(a) == 0) > CPU2: if (b != 1) /* Weakly ordered CPU moved this up! */ > CPU2: b = 1; > CPU2: rcu_read_unlock > ...time passes... > CPU2: grace period ends > > /* This full memory barrier is also guaranteed by RCU. */ > CPU2: smp_mb(); But in this case, given that there are no more statements for CPU2, what is this smp_mb() ordering? Thanx, Paul > CPU1 synchronize_rcu returns > CPU1: b = 2; > > Cheers, > -- > Email: Herbert Xu > Home Page: http://gondor.apana.org.au/~herbert/ > PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt >
Re: rcu_read_lock lost its compiler barrier
On Mon, Jun 03, 2019 at 11:03:24AM +0800, Herbert Xu wrote: > On Sun, Jun 02, 2019 at 05:06:17PM -0700, Paul E. McKenney wrote: > > > > Please note that preemptible Tree RCU has lacked the compiler barrier on > > all but the outermost rcu_read_unlock() for years before Boqun's patch. > > Actually this is not true. Boqun's patch (commit bb73c52bad36) does > not add a barrier() to __rcu_read_lock. In fact I dug into the git > history and this compiler barrier() has existed in preemptible tree > RCU since the very start in 2009: I said rcu_read_unlock() and you said __rcu_read_lock(). > : commit f41d911f8c49a5d65c86504c19e8204bb605c4fd > : Author: Paul E. McKenney > : Date: Sat Aug 22 13:56:52 2009 -0700 > : > : rcu: Merge preemptable-RCU functionality into hierarchical RCU > : > : +/* > : + * Tree-preemptable RCU implementation for rcu_read_lock(). > : + * Just increment ->rcu_read_lock_nesting, shared state will be updated > : + * if we block. > : + */ > : +void __rcu_read_lock(void) > : +{ > : + ACCESS_ONCE(current->rcu_read_lock_nesting)++; > : + barrier(); /* needed if we ever invoke rcu_read_lock in rcutree.c > */ > : +} > : +EXPORT_SYMBOL_GPL(__rcu_read_lock); Thank you for finding this! This particular version does have an unconditional barrier() in __rcu_read_unlock(), for whatever that is worth: +void __rcu_read_unlock(void) +{ + struct task_struct *t = current; + + barrier(); /* needed if we ever invoke rcu_read_unlock in rcutree.c */ + if (--ACCESS_ONCE(t->rcu_read_lock_nesting) == 0 && + unlikely(ACCESS_ONCE(t->rcu_read_unlock_special))) + rcu_read_unlock_special(t); +} I would not have seen the point of a compiler barrier in the non-outermost __rcu_read_unlock(), since the completion of an inner __rcu_read_unlock() does not permit the grace period to complete. > However, you are correct that in the non-preempt tree RCU case, > the compiler barrier in __rcu_read_lock was not always present. 
> In fact it was added by: > > : commit 386afc91144b36b42117b0092893f15bc8798a80 > : Author: Linus Torvalds > : Date: Tue Apr 9 10:48:33 2013 -0700 > : > : spinlocks and preemption points need to be at least compiler barriers > > I suspect this is what prompted you to remove it in 2015. If I remember correctly, it was pointed out to me that in !PREEMPT kernels, the compiler barrier in the preempt_disable() invoked in rcu_read_lock() (and similar on the rcu_read_unlock() side) wasn't helping anything. > > I do not believe that reverting that patch will help you at all. > > > > But who knows? So please point me at the full code body that was being > > debated earlier on this thread. It will no doubt take me quite a while to > > dig through it, given my being on the road for the next couple of weeks, > > but so it goes. > > Please refer to my response to Linus for the code in question. > > In any case, I am now even more certain that compiler barriers are > not needed in the code in question. The reasoning is quite simple. > If you need those compiler barriers then you surely need real memory > barriers. OK, we are in agreement on that point, then! > Vice versa, if real memory barriers are already present thanks to > RCU, then you don't need those compiler barriers. For users of RCU, this seems reasonable. On the other hand, the compiler barriers in PREEMPT Tree RCU's outermost __rcu_read_lock() and __rcu_read_unlock() invocations really are needed by RCU internals. This is because RCU makes use of interrupt handlers that access per-task and per-CPU variables, and these need to be able to sense the edges of the nested set of RCU read-side critical sections. It is OK for these interrupt handlers to think that the critical section is larger than it really is, but fatal for them to think that the critical sections are smaller than they really are.
> In fact this calls into question the use of READ_ONCE/WRITE_ONCE in > RCU primitives such as rcu_dereference and rcu_assign_pointer. No, these are -not- called into question, or if they are, the question gets quickly answered in a way that supports current Linux-kernel code. As mentioned in earlier emails, the traditional uses of RCU that involve rcu_dereference(), rcu_assign_pointer(), and synchronize_rcu() all work just fine. In fact, from what I can see, the issue stems from having developed intuitions from working with the traditional rcu_dereference(), rcu_assign_pointer(), and synchronize_rcu() linked-structure use cases, and then attempting to apply these intuitions to use cases that have neither rcu_dereference() nor rcu_assign_pointer(). Don't get me wrong, it is only natural to try to extend your intuitions to something that admittedly looks pretty similar to the traditional use cases. But this is one of those cases where "pretty similar" might not be good enough. > IIRC > when RCU was first added to the
Re: rcu_read_lock lost its compiler barrier
On Mon, Jun 03, 2019 at 12:23:39AM -0700, Paul E. McKenney wrote: > On Mon, Jun 03, 2019 at 12:01:14PM +0800, Herbert Xu wrote: > > On Sun, Jun 02, 2019 at 08:47:07PM -0700, Paul E. McKenney wrote: > > > > > > CPU2: if (b != 1) > > > CPU2: b = 1; > > > > Stop right there. The kernel is full of code that assumes that > > assignment to an int/long is atomic. If your compiler breaks this > > assumption that we can kiss the kernel good-bye. > > Here you go: > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55981 > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56028 > > TL;DR: On x86, if you are doing a plain store of a 32-bit constant > that has bits set only in the lower few bits of each 16-bit half of > that constant, the compiler is plenty happy to use a pair of 16-bit > store-immediate instructions to carry out that store. This is also > known as "store tearing". > > The two bugs were filed (and after some back and forth, fixed) because > someone forgot to exclude C11 atomics and volatile accesses from this > store tearing. I should hasten to add that I have not seen load tearing, nor have I seen store tearing when storing a value unknown to the compiler. However, plain C-language loads and stores can be invented, fused, and a few other "interesting" optimizations can be applied. On kissing the kernel goodbye, a reasonable strategy might be to identify the transformations that are actually occurring (like the stores of certain constants called out above) and fix those. We do occasionally use READ_ONCE() to prevent load-fusing optimizations that would otherwise cause the compiler to turn while-loops into if-statements guarding infinite loops. There is also the possibility of having the compiler guys give us more command-line arguments. Thanx, Paul
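[Editor's note: the store-tearing and load-fusing hazards described above are exactly what the kernel's READ_ONCE()/WRITE_ONCE() volatile casts rule out. Below is a rough user-space sketch of the scalar case only — the real macros in include/linux/compiler.h are more elaborate and handle multiple access sizes — included here purely as illustration.]

```c
#include <assert.h>

/*
 * User-space sketch of the scalar-case READ_ONCE()/WRITE_ONCE() idiom.
 * The volatile cast obliges the compiler to emit exactly one
 * full-width access: no pair of 16-bit store-immediates (store
 * tearing), and no merging of repeated loads across a loop (load
 * fusing).
 */
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))
#define READ_ONCE(x)       (*(volatile __typeof__(x) *)&(x))

static int shared;

/*
 * A 32-bit constant with bits set only in the low bits of each
 * 16-bit half -- the pattern that GCC bugs 55981/56028 showed being
 * torn into two 16-bit store-immediate instructions when stored
 * plainly on x86.  The volatile access forbids that transformation.
 */
static void publish_flag(void)
{
	WRITE_ONCE(shared, 0x00010002);
}

static int read_flag(void)
{
	return READ_ONCE(shared);
}
```

In single-threaded code the observable result is identical either way; the difference is only in which machine code the compiler is permitted to emit.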
Re: rcu_read_lock lost its compiler barrier
On Mon, Jun 03, 2019 at 12:01:14PM +0800, Herbert Xu wrote: > On Sun, Jun 02, 2019 at 08:47:07PM -0700, Paul E. McKenney wrote: > > > > CPU2: if (b != 1) > > CPU2: b = 1; > > Stop right there. The kernel is full of code that assumes that > assignment to an int/long is atomic. If your compiler breaks this > assumption that we can kiss the kernel good-bye. Here you go: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55981 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56028 TL;DR: On x86, if you are doing a plain store of a 32-bit constant that has bits set only in the lower few bits of each 16-bit half of that constant, the compiler is plenty happy to use a pair of 16-bit store-immediate instructions to carry out that store. This is also known as "store tearing". The two bugs were filed (and after some back and forth, fixed) because someone forgot to exclude C11 atomics and volatile accesses from this store tearing. Thanx, Paul
Re: rcu_read_lock lost its compiler barrier
On Mon, Jun 03, 2019 at 01:26:26PM +0800, Herbert Xu wrote: > On Sun, Jun 02, 2019 at 08:47:07PM -0700, Paul E. McKenney wrote: > > > > 1. These guarantees are of full memory barriers, -not- compiler > > barriers. > > What I'm saying is that wherever they are, they must come with > compiler barriers. I'm not aware of any synchronisation mechanism > in the kernel that gives a memory barrier without a compiler barrier. > > > 2. These rules don't say exactly where these full memory barriers > > go. SRCU is at one extreme, placing those full barriers in > > srcu_read_lock() and srcu_read_unlock(), and !PREEMPT Tree RCU > > at the other, placing these barriers entirely within the callback > > queueing/invocation, grace-period computation, and the scheduler. > > Preemptible Tree RCU is in the middle, with rcu_read_unlock() > > sometimes including a full memory barrier, but other times with > > the full memory barrier being confined as it is with !PREEMPT > > Tree RCU. > > The rules do say that the (full) memory barrier must precede any > RCU read-side that occur after the synchronize_rcu and after the > end of any RCU read-side that occur before the synchronize_rcu. > > All I'm arguing is that wherever that full mb is, as long as it > also carries with it a barrier() (which it must do if it's done > using an existing kernel mb/locking primitive), then we're fine. > > > Interleaving and inserting full memory barriers as per the rules above: > > > > CPU1: WRITE_ONCE(a, 1) > > CPU1: synchronize_rcu > > /* Could put a full memory barrier here, but it wouldn't help. */ > > CPU1: smp_mb(); > CPU2: smp_mb(); > > Let's put them in because I think they are critical. smp_mb() also > carries with it a barrier(). > > > CPU2: rcu_read_lock(); > > CPU1: b = 2; > > CPU2: if (READ_ONCE(a) == 0) > > CPU2: if (b != 1) /* Weakly ordered CPU moved this up! 
*/ > > CPU2: b = 1; > > CPU2: rcu_read_unlock > > > > In fact, CPU2's load from b might be moved up to race with CPU1's store, > > which (I believe) is why the model complains in this case. > > Let's put aside my doubt over how we're even allowing a compiler > to turn > > b = 1 > > into > > if (b != 1) > b = 1 > > Since you seem to be assuming that (a == 0) is true in this case I think Paul's example assumes (a == 0) is false, and maybe speculative writes (by compilers) need to be added into consideration? Please consider the following case (I added a few smp_mb()s); the case may be a little bit crazy, you have been warned ;-)

CPU1: WRITE_ONCE(a, 1)
CPU1: synchronize_rcu called
CPU1: smp_mb(); /* let's assume there is one here */

CPU2: rcu_read_lock();
CPU2: smp_mb(); /* let's assume there is one here */

/* "if (b != 1) b = 1" reordered */
CPU2: r0 = b; /* if (b != 1) reordered here, r0 == 0 */
CPU2: if (r0 != 1) /* true */
CPU2:     b = 1; /* b == 1 now, this is a speculative write by the compiler */

CPU1: b = 2; /* b == 2 */

CPU2: if (READ_ONCE(a) == 0) /* false */
CPU2:     ...
CPU2: else /* undo the speculative write */
CPU2:     b = r0; /* b == 0 */
CPU2: smp_mb();
CPU2: rcu_read_unlock();

I know it is too crazy for us to think a compiler would work like this, but this might be the reason why the model complains about this. Paul, did I get this right? Or did you mean something else? Regards, Boqun > (as the assignment b = 1 is carried out), then because of the > presence of the full memory barrier, the RCU read-side section > must have started prior to the synchronize_rcu. This means that > synchronize_rcu is not allowed to return until at least the end > of the grace period, or at least until the end of rcu_read_unlock. > > So it actually should be: > > CPU1: WRITE_ONCE(a, 1) > CPU1: synchronize_rcu called > /* Could put a full memory barrier here, but it wouldn't help. */ > > CPU1: smp_mb(); > CPU2: smp_mb(); > > CPU2: grace period starts > ...time passes...
> CPU2: rcu_read_lock(); > CPU2: if (READ_ONCE(a) == 0) > CPU2: if (b != 1) /* Weakly ordered CPU moved this up! */ > CPU2: b = 1; > CPU2: rcu_read_unlock > ...time passes... > CPU2: grace period ends > > /* This full memory barrier is also guaranteed by RCU. */ > CPU2: smp_mb(); > > CPU1 synchronize_rcu returns > CPU1: b = 2; > > Cheers, > -- > Email: Herbert Xu > Home Page: http://gondor.apana.org.au/~herbert/ > PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: rcu_read_lock lost its compiler barrier
On Sun, Jun 02, 2019 at 08:47:07PM -0700, Paul E. McKenney wrote: > > 1. These guarantees are of full memory barriers, -not- compiler > barriers. What I'm saying is that wherever they are, they must come with compiler barriers. I'm not aware of any synchronisation mechanism in the kernel that gives a memory barrier without a compiler barrier. > 2. These rules don't say exactly where these full memory barriers > go. SRCU is at one extreme, placing those full barriers in > srcu_read_lock() and srcu_read_unlock(), and !PREEMPT Tree RCU > at the other, placing these barriers entirely within the callback > queueing/invocation, grace-period computation, and the scheduler. > Preemptible Tree RCU is in the middle, with rcu_read_unlock() > sometimes including a full memory barrier, but other times with > the full memory barrier being confined as it is with !PREEMPT > Tree RCU. The rules do say that the (full) memory barrier must precede any RCU read-side critical section that occurs after the synchronize_rcu, and must follow the end of any RCU read-side critical section that occurs before the synchronize_rcu. All I'm arguing is that wherever that full mb is, as long as it also carries with it a barrier() (which it must do if it's done using an existing kernel mb/locking primitive), then we're fine. > Interleaving and inserting full memory barriers as per the rules above: > > CPU1: WRITE_ONCE(a, 1) > CPU1: synchronize_rcu > /* Could put a full memory barrier here, but it wouldn't help. */ CPU1: smp_mb(); CPU2: smp_mb(); Let's put them in because I think they are critical. smp_mb() also carries with it a barrier(). > CPU2: rcu_read_lock(); > CPU1: b = 2; > CPU2: if (READ_ONCE(a) == 0) > CPU2: if (b != 1) /* Weakly ordered CPU moved this up! */ > CPU2: b = 1; > CPU2: rcu_read_unlock > > In fact, CPU2's load from b might be moved up to race with CPU1's store, > which (I believe) is why the model complains in this case.
Let's put aside my doubt over how we're even allowing a compiler to turn

	b = 1

into

	if (b != 1)
		b = 1

Since you seem to be assuming that (a == 0) is true in this case (as the assignment b = 1 is carried out), then because of the presence of the full memory barrier, the RCU read-side section must have started prior to the synchronize_rcu. This means that synchronize_rcu is not allowed to return until at least the end of the grace period, or at least until the end of rcu_read_unlock. So it actually should be:

CPU1: WRITE_ONCE(a, 1)
CPU1: synchronize_rcu called
/* Could put a full memory barrier here, but it wouldn't help. */

CPU1: smp_mb();
CPU2: smp_mb();

CPU2: grace period starts
...time passes...
CPU2: rcu_read_lock();
CPU2: if (READ_ONCE(a) == 0)
CPU2: if (b != 1) /* Weakly ordered CPU moved this up! */
CPU2: b = 1;
CPU2: rcu_read_unlock
...time passes...
CPU2: grace period ends

/* This full memory barrier is also guaranteed by RCU. */
CPU2: smp_mb();

CPU1: synchronize_rcu returns
CPU1: b = 2;

Cheers, -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: rcu_read_lock lost its compiler barrier
On Mon, Jun 03, 2019 at 12:01:14PM +0800, Herbert Xu wrote: > On Sun, Jun 02, 2019 at 08:47:07PM -0700, Paul E. McKenney wrote: > > > > CPU2: if (b != 1) > > CPU2: b = 1; > > Stop right there. The kernel is full of code that assumes that > assignment to an int/long is atomic. If your compiler breaks this > assumption that we can kiss the kernel good-bye. The slippery slope apparently started here: : commit ea435467500612636f8f4fb639ff6e76b2496e4b : Author: Matthew Wilcox : Date: Tue Jan 6 14:40:39 2009 -0800 : : atomic_t: unify all arch definitions : : diff --git a/arch/x86/include/asm/atomic_32.h b/arch/x86/include/asm/atomic_32.h : index ad5b9f6ecddf..85b46fba4229 100644 : --- a/arch/x86/include/asm/atomic_32.h : +++ b/arch/x86/include/asm/atomic_32.h : ... : @@ -10,15 +11,6 @@ : * resource counting etc.. : */ : : -/* : - * Make sure gcc doesn't try to be clever and move things around : - * on us. We need to use _exactly_ the address the user gave us, : - * not some alias that contains the same information. : - */ : -typedef struct { : - int counter; : -} atomic_t; : : diff --git a/include/linux/types.h b/include/linux/types.h : index 121f349cb7ec..3b864f2d9560 100644 : --- a/include/linux/types.h : +++ b/include/linux/types.h : @@ -195,6 +195,16 @@ typedef u32 phys_addr_t; : : typedef phys_addr_t resource_size_t; : : +typedef struct { : + volatile int counter; : +} atomic_t; : + Before evolving into the READ_ONCE/WRITE_ONCE that we have now. Linus, are we now really supporting a compiler where an assignment (or a read) from an int/long/pointer can be non-atomic without the volatile marker? Because if that's the case then we have a lot of code to audit. Cheers, -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: rcu_read_lock lost its compiler barrier
On Sun, Jun 02, 2019 at 08:47:07PM -0700, Paul E. McKenney wrote: > > CPU2: if (b != 1) > CPU2: b = 1; Stop right there. The kernel is full of code that assumes that assignment to an int/long is atomic. If your compiler breaks this assumption then we can kiss the kernel good-bye. Cheers, -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: rcu_read_lock lost its compiler barrier
On Mon, Jun 03, 2019 at 10:46:40AM +0800, Herbert Xu wrote: > On Sun, Jun 02, 2019 at 01:54:12PM -0700, Linus Torvalds wrote: > > On Sat, Jun 1, 2019 at 10:56 PM Herbert Xu > > wrote: > > > > > > You can't then go and decide to remove the compiler barrier! To do > > > that you'd need to audit every single use of rcu_read_lock in the > > > kernel to ensure that they're not depending on the compiler barrier. > > > > What's the possible case where it would matter when there is no preemption? > > The case we were discussing is from net/ipv4/inet_fragment.c from > the net-next tree: > > void fqdir_exit(struct fqdir *fqdir) > { > ... > fqdir->dead = true; > > /* call_rcu is supposed to provide memory barrier semantics, >* separating the setting of fqdir->dead with the destruction >* work. This implicit barrier is paired with inet_frag_kill(). >*/ > > INIT_RCU_WORK(&fqdir->destroy_rwork, fqdir_rwork_fn); > queue_rcu_work(system_wq, &fqdir->destroy_rwork); > } > > and > > void inet_frag_kill(struct inet_frag_queue *fq) > { > ... > rcu_read_lock(); > /* The RCU read lock provides a memory barrier >* guaranteeing that if fqdir->dead is false then >* the hash table destruction will not start until >* after we unlock. Paired with inet_frags_exit_net(). >*/ > if (!fqdir->dead) { > rhashtable_remove_fast(&fqdir->rhashtable, &fq->node, > fqdir->f->rhash_params); > ... > } > ... > rcu_read_unlock(); > ... > } > > I simplified this to > > Initial values: > > a = 0 > b = 0 > > CPU1 CPU2 > > a = 1 rcu_read_lock > synchronize_rcu if (a == 0) > b = 2 b = 1 > rcu_read_unlock > > On exit we want this to be true: > b == 2 > > Now what Paul was telling me is that unless every memory operation > is done with READ_ONCE/WRITE_ONCE then his memory model shows that > the exit constraint won't hold. But please note that the plain-variable portion of the memory model is very new and likely still has a bug or two. In fact, see below. 
> IOW, we need > > CPU1 CPU2 > > WRITE_ONCE(a, 1) rcu_read_lock > synchronize_rcu if (READ_ONCE(a) == 0) > WRITE_ONCE(b, 2) WRITE_ONCE(b, 1) > rcu_read_unlock > > Now I think this is bullshit because if we really needed these compiler > barriers then we surely would need real memory barriers to go with > them. On the one hand, you have no code before your rcu_read_lock() and also no code after your rcu_read_unlock(). So in this particular example, adding compiler barriers to these guys won't help you. On the other hand, on CPU 1's write to "b", I agree with you and disagree with the model, though perhaps my partners in LKMM crime will show me the error of my ways on this point. On CPU 2's write to "b", I can see the memory model's point, but getting there requires some gymnastics on the part of both the compiler and the CPU. The WRITE_ONCE() and READ_ONCE() for "a" are the normal requirement for variables that are concurrently loaded and stored. Please note that garden-variety uses of RCU have similar requirements, namely the rcu_assign_pointer() on the one side and the rcu_dereference() on the other. Your use case allows rcu_assign_pointer() to be weakened to WRITE_ONCE() and rcu_dereference() to be weakened to READ_ONCE() (not that this last is all that much of a weakening these days). > In fact, the sole purpose of the RCU mechanism is to provide those > memory barriers. Quoting from > Documentation/RCU/Design/Requirements/Requirements.html: > > Each CPU that has an RCU read-side critical section that > begins before synchronize_rcu() starts is > guaranteed to execute a full memory barrier between the time > that the RCU read-side critical section ends and the time that > synchronize_rcu() returns. > Without this guarantee, a pre-existing RCU read-side critical section > might hold a reference to the newly removed struct foo > after the kfree() on line 14 of > remove_gp_synchronous().
> Each CPU that has an RCU read-side critical section that ends > after synchronize_rcu() returns is guaranteed > to execute a full memory barrier between the time that > synchronize_rcu() begins and the time that the RCU > read-side critical section begins. > Without this guarantee, a later RCU read-side critical section > running after the kfree() on line
Re: rcu_read_lock lost its compiler barrier
On Sun, Jun 02, 2019 at 05:06:17PM -0700, Paul E. McKenney wrote: > > Please note that preemptible Tree RCU has lacked the compiler barrier on > all but the outermost rcu_read_unlock() for years before Boqun's patch. Actually this is not true. Boqun's patch (commit bb73c52bad36) does not add a barrier() to __rcu_read_lock. In fact I dug into the git history and this compiler barrier() has existed in preemptible tree RCU since the very start in 2009: : commit f41d911f8c49a5d65c86504c19e8204bb605c4fd : Author: Paul E. McKenney : Date: Sat Aug 22 13:56:52 2009 -0700 : : rcu: Merge preemptable-RCU functionality into hierarchical RCU : : +/* : + * Tree-preemptable RCU implementation for rcu_read_lock(). : + * Just increment ->rcu_read_lock_nesting, shared state will be updated : + * if we block. : + */ : +void __rcu_read_lock(void) : +{ : + ACCESS_ONCE(current->rcu_read_lock_nesting)++; : + barrier(); /* needed if we ever invoke rcu_read_lock in rcutree.c */ : +} : +EXPORT_SYMBOL_GPL(__rcu_read_lock); However, you are correct that in the non-preempt tree RCU case, the compiler barrier in __rcu_read_lock was not always present. In fact it was added by: : commit 386afc91144b36b42117b0092893f15bc8798a80 : Author: Linus Torvalds : Date: Tue Apr 9 10:48:33 2013 -0700 : : spinlocks and preemption points need to be at least compiler barriers I suspect this is what prompted you to remove it in 2015. > I do not believe that reverting that patch will help you at all. > > But who knows? So please point me at the full code body that was being > debated earlier on this thread. It will no doubt take me quite a while to > dig through it, given my being on the road for the next couple of weeks, > but so it goes. Please refer to my response to Linus for the code in question. In any case, I am now even more certain that compiler barriers are not needed in the code in question. The reasoning is quite simple. If you need those compiler barriers then you surely need real memory barriers. 
Vice versa, if real memory barriers are already present thanks to RCU, then you don't need those compiler barriers. In fact this calls into question the use of READ_ONCE/WRITE_ONCE in RCU primitives such as rcu_dereference and rcu_assign_pointer. IIRC when RCU was first added to the Linux kernel we did not have compiler barriers in rcu_dereference and rcu_assign_pointer. They were added later on. As compiler barriers per se are useless, these are surely meant to be coupled with the memory barriers provided by RCU grace periods and synchronize_rcu. But then those real memory barriers would have compiler barriers too. So why do we need the compiler barriers in rcu_dereference and rcu_assign_pointer? Cheers, -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
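[Editor's note: one way to see what the marked accesses in rcu_assign_pointer() and rcu_dereference() buy is to model them in C11 terms: publication is a release store, and subscription is a dependency-ordered load. The sketch below is an illustrative user-space analogue, not the kernel implementation, which builds these from smp_store_release() and a marked load plus an address dependency.]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct foo {
	int value;
};

static _Atomic(struct foo *) gp;

/*
 * Analogue of rcu_assign_pointer(): the release store orders the
 * initialization of *p before the publication of p, so a reader that
 * sees the new pointer is guaranteed to see the initialized fields.
 */
static void publish(struct foo *p)
{
	atomic_store_explicit(&gp, p, memory_order_release);
}

/*
 * Analogue of rcu_dereference(): the kernel relies on a marked load
 * plus the reader's address dependency.  memory_order_consume is the
 * C11 name for that, but compilers promote it to acquire, which is
 * what this sketch uses.
 */
static struct foo *subscribe(void)
{
	return atomic_load_explicit(&gp, memory_order_acquire);
}
```

The point relevant to the thread: even in this weakened model, the loads and stores of the shared pointer itself are marked (atomic/volatile), which is precisely the role READ_ONCE()/WRITE_ONCE() play in the kernel primitives.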
Re: rcu_read_lock lost its compiler barrier
On Sun, Jun 02, 2019 at 01:54:12PM -0700, Linus Torvalds wrote: > On Sat, Jun 1, 2019 at 10:56 PM Herbert Xu > wrote: > > > > You can't then go and decide to remove the compiler barrier! To do > > that you'd need to audit every single use of rcu_read_lock in the > > kernel to ensure that they're not depending on the compiler barrier. > > What's the possible case where it would matter when there is no preemption? The case we were discussing is from net/ipv4/inet_fragment.c from the net-next tree: void fqdir_exit(struct fqdir *fqdir) { ... fqdir->dead = true; /* call_rcu is supposed to provide memory barrier semantics, * separating the setting of fqdir->dead with the destruction * work. This implicit barrier is paired with inet_frag_kill(). */ INIT_RCU_WORK(&fqdir->destroy_rwork, fqdir_rwork_fn); queue_rcu_work(system_wq, &fqdir->destroy_rwork); } and void inet_frag_kill(struct inet_frag_queue *fq) { ... rcu_read_lock(); /* The RCU read lock provides a memory barrier * guaranteeing that if fqdir->dead is false then * the hash table destruction will not start until * after we unlock. Paired with inet_frags_exit_net(). */ if (!fqdir->dead) { rhashtable_remove_fast(&fqdir->rhashtable, &fq->node, fqdir->f->rhash_params); ... } ... rcu_read_unlock(); ... } I simplified this to Initial values: a = 0 b = 0 CPU1CPU2 a = 1 rcu_read_lock synchronize_rcu if (a == 0) b = 2 b = 1 rcu_read_unlock On exit we want this to be true: b == 2 Now what Paul was telling me is that unless every memory operation is done with READ_ONCE/WRITE_ONCE then his memory model shows that the exit constraint won't hold. IOW, we need CPU1CPU2 WRITE_ONCE(a, 1)rcu_read_lock synchronize_rcu if (READ_ONCE(a) == 0) WRITE_ONCE(b, 2)WRITE_ONCE(b, 1) rcu_read_unlock Now I think this bullshit because if we really needed these compiler barriers then we surely would need real memory barriers to go with them. In fact, the sole purpose of the RCU mechanism is to provide those memory barriers. 
Quoting from Documentation/RCU/Design/Requirements/Requirements.html: Each CPU that has an RCU read-side critical section that begins before synchronize_rcu() starts is guaranteed to execute a full memory barrier between the time that the RCU read-side critical section ends and the time that synchronize_rcu() returns. Without this guarantee, a pre-existing RCU read-side critical section might hold a reference to the newly removed struct foo after the kfree() on line 14 of remove_gp_synchronous(). Each CPU that has an RCU read-side critical section that ends after synchronize_rcu() returns is guaranteed to execute a full memory barrier between the time that synchronize_rcu() begins and the time that the RCU read-side critical section begins. Without this guarantee, a later RCU read-side critical section running after the kfree() on line 14 of remove_gp_synchronous() might later run do_something_gp() and find the newly deleted struct foo. My review of the RCU code shows that these memory barriers are indeed present (at least when we're not in tiny mode where all this discussion would be moot anyway). For example, in call_rcu we eventually get down to rcu_segcblist_enqueue which has an smp_mb. On the reader side (correct me if I'm wrong Paul) the memory barrier is implicitly coming from the scheduler. My point is that within our kernel whenever we have a CPU memory barrier we always have a compiler barrier too. Therefore my code example above does not need any extra compiler barriers such as the ones provided by READ_ONCE/WRITE_ONCE. I think Paul was perhaps thinking that I'm expecting rcu_read_lock/rcu_read_unlock themselves to provide the memory or compiler barriers. That would indeed be wrong but this is not what I need. All I need is the RCU semantics as documented for there to be memory and compiler barriers around the whole grace period.
Cheers, -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
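[Editor's note: Herbert's premise — that every kernel memory-barrier primitive also acts as a compiler barrier — follows from how such primitives are typically defined: the asm "memory" clobber (or the equivalent compiler builtin) is itself a compiler barrier. A user-space sketch under that assumption, using GCC-style builtins rather than the actual per-architecture kernel definitions:]

```c
#include <assert.h>

/*
 * barrier(): pure compiler barrier -- emits no instruction, but the
 * "memory" clobber forces the compiler to flush to memory, and later
 * reload, any values it was caching in registers across this point.
 */
#define barrier() __asm__ __volatile__("" ::: "memory")

/*
 * smp_mb() sketch: __sync_synchronize() emits a full hardware memory
 * barrier and, as a side effect, is also a full compiler barrier --
 * which is why any smp_mb() "carries with it a barrier()".
 */
#define smp_mb() __sync_synchronize()

static int x, y;

/* The barrier orders the store to x before the store to y, both at
 * the compiler level and at the CPU level. */
static int ordered_stores(void)
{
	x = 1;
	smp_mb();
	y = 1;
	return x + y;
}
```

The converse does not hold: a bare barrier() constrains only the compiler, which is exactly the distinction Paul draws between compiler barriers and the full memory barriers guaranteed around a grace period.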
Re: rcu_read_lock lost its compiler barrier
On Sun, Jun 02, 2019 at 01:56:07PM +0800, Herbert Xu wrote: > Digging up an old email because I was not aware of this previously > but Paul pointed me to it during another discussion. > > On Mon, Sep 21, 2015 at 01:43:27PM -0700, Paul E. McKenney wrote: > > On Mon, Sep 21, 2015 at 09:30:49PM +0200, Frederic Weisbecker wrote: > > > > > > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h > > > > index d63bb77..6c3cece 100644 > > > > --- a/include/linux/rcupdate.h > > > > +++ b/include/linux/rcupdate.h > > > > @@ -297,12 +297,14 @@ void synchronize_rcu(void); > > > > > > > > static inline void __rcu_read_lock(void) > > > > { > > > > - preempt_disable(); > > > > + if (IS_ENABLED(CONFIG_PREEMPT_COUNT)) > > > > + preempt_disable(); > > > > > > preempt_disable() is a no-op when !CONFIG_PREEMPT_COUNT, right? > > > Or rather it's a barrier(), which is anyway implied by rcu_read_lock(). > > > > > > So perhaps we can get rid of the IS_ENABLED() check? > > > > Actually, barrier() is not intended to be implied by rcu_read_lock(). > > In a non-preemptible RCU implementation, it doesn't help anything > > to have the compiler flush its temporaries upon rcu_read_lock() > > and rcu_read_unlock(). > > This is seriously broken. RCU has been around for years and is > used throughout the kernel while the compiler barrier existed. Please note that preemptible Tree RCU has lacked the compiler barrier on all but the outermost rcu_read_unlock() for years before Boqun's patch. So exactly where in the code that we are currently discussing are you relying on compiler barriers in either rcu_read_lock() or rcu_read_unlock()? The grace-period guarantee allows the compiler ordering to be either in the readers (SMP&&PREEMPT), in the grace-period mechanism (SMP&&!PREEMPT), or both (SRCU). > You can't then go and decide to remove the compiler barrier! 
To do > that you'd need to audit every single use of rcu_read_lock in the > kernel to ensure that they're not depending on the compiler barrier. > > This is also contrary to the definition of almost every other > *_lock primitive in the kernel where the compiler barrier is > included. > > So please revert this patch. I do not believe that reverting that patch will help you at all. But who knows? So please point me at the full code body that was being debated earlier on this thread. It will no doubt take me quite a while to dig through it, given my being on the road for the next couple of weeks, but so it goes. Thanx, Paul
Re: rcu_read_lock lost its compiler barrier
On Sat, Jun 1, 2019 at 10:56 PM Herbert Xu wrote: > > You can't then go and decide to remove the compiler barrier! To do > that you'd need to audit every single use of rcu_read_lock in the > kernel to ensure that they're not depending on the compiler barrier. What's the possible case where it would matter when there is no preemption? Linus
rcu_read_lock lost its compiler barrier
Digging up an old email because I was not aware of this previously but Paul pointed me to it during another discussion. On Mon, Sep 21, 2015 at 01:43:27PM -0700, Paul E. McKenney wrote: > On Mon, Sep 21, 2015 at 09:30:49PM +0200, Frederic Weisbecker wrote: > > > > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h > > > index d63bb77..6c3cece 100644 > > > --- a/include/linux/rcupdate.h > > > +++ b/include/linux/rcupdate.h > > > @@ -297,12 +297,14 @@ void synchronize_rcu(void); > > > > > > static inline void __rcu_read_lock(void) > > > { > > > - preempt_disable(); > > > + if (IS_ENABLED(CONFIG_PREEMPT_COUNT)) > > > + preempt_disable(); > > > > preempt_disable() is a no-op when !CONFIG_PREEMPT_COUNT, right? > > Or rather it's a barrier(), which is anyway implied by rcu_read_lock(). > > > > So perhaps we can get rid of the IS_ENABLED() check? > > Actually, barrier() is not intended to be implied by rcu_read_lock(). > In a non-preemptible RCU implementation, it doesn't help anything > to have the compiler flush its temporaries upon rcu_read_lock() > and rcu_read_unlock(). This is seriously broken. RCU has been around for years and is used throughout the kernel while the compiler barrier existed. You can't then go and decide to remove the compiler barrier! To do that you'd need to audit every single use of rcu_read_lock in the kernel to ensure that they're not depending on the compiler barrier. This is also contrary to the definition of almost every other *_lock primitive in the kernel where the compiler barrier is included. So please revert this patch. Thanks, -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
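[Editor's note: for reference, the shape of the mapping being debated in the quoted patch: with CONFIG_PREEMPT_COUNT, preempt_disable() bumps a per-task counter and is a compiler barrier; without it, preempt_disable() historically degenerated to a bare barrier(), and that residual barrier() is what the patch stops implying in rcu_read_lock(). A rough user-space sketch under those assumptions — the names and the config knob here are illustrative, not the kernel's actual code:]

```c
#include <assert.h>

#define barrier() __asm__ __volatile__("" ::: "memory")

/* Illustrative stand-in for CONFIG_PREEMPT_COUNT. */
#define PREEMPT_COUNT 1

static int preempt_count_var;

static void preempt_disable(void)
{
#if PREEMPT_COUNT
	preempt_count_var++;	/* track nesting depth */
#endif
	barrier();		/* historically at least a compiler barrier */
}

static void preempt_enable(void)
{
	barrier();
#if PREEMPT_COUNT
	preempt_count_var--;
#endif
}

/*
 * Non-preemptible rcu_read_lock()/rcu_read_unlock() as they stood
 * before the quoted patch: nothing but preempt_disable() and
 * preempt_enable(), hence (then) an implied barrier().
 */
static void rcu_read_lock_sketch(void)   { preempt_disable(); }
static void rcu_read_unlock_sketch(void) { preempt_enable(); }

static int preempt_depth(void) { return preempt_count_var; }
```

The dispute in the thread is precisely whether callers were entitled to rely on that implied barrier(), or only on the memory-ordering guarantees of the grace-period machinery.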