Re: rcu_read_lock lost its compiler barrier
On Sat, Jun 08, 2019 at 11:56:04AM -0400, Alan Stern wrote:
> On Sat, 8 Jun 2019, Paul E. McKenney wrote:
>
> > On Thu, Jun 06, 2019 at 10:19:43AM -0400, Alan Stern wrote:
> > > On Thu, 6 Jun 2019, Andrea Parri wrote:
> > >
> > > > This seems a sensible change to me: looking forward to seeing a patch,
> > > > on top of -rcu/dev, for further review and testing!
> > > >
> > > > We could also add (to LKMM) the barrier() for rcu_read_{lock,unlock}()
> > > > discussed in this thread (maybe once the RCU code and the informal doc
> > > > will have settled in such direction).
> > >
> > > Yes.  Also for SRCU.  That point had not escaped me.
> >
> > And it does seem pretty settled.  There are quite a few examples where
> > there are normal accesses at either end of the RCU read-side critical
> > sections, for example, the one in the requirements diffs below.
> >
> > For SRCU, srcu_read_lock() and srcu_read_unlock() have implied compiler
> > barriers since 2006.  ;-)
> >
> > 							Thanx, Paul
> >
> > diff --git a/Documentation/RCU/Design/Requirements/Requirements.html b/Documentation/RCU/Design/Requirements/Requirements.html
> > index 5a9238a2883c..080b39cc1dbb 100644
> > --- a/Documentation/RCU/Design/Requirements/Requirements.html
> > +++ b/Documentation/RCU/Design/Requirements/Requirements.html
> > @@ -2129,6 +2129,8 @@ Some of the relevant points of interest are as follows:
> > Hotplug CPU.
> > Scheduler and RCU.
> > Tracing and RCU.
> > +
>                        ^
> > +Accesses to User Mamory and RCU.
>                       -^
> > Energy Efficiency.
> >
> > Scheduling-Clock Interrupts and RCU.
> > @@ -2521,6 +2523,75 @@ cannot be used.
> > The tracing folks both located the requirement and provided the
> > needed fix, so this surprise requirement was relatively painless.
> >
> > +
>   --^
> > +Accesses to User Mamory and RCU
>                       -^
>
> Are these issues especially notable for female programmers?  :-)

*Red face*  Some days it just doesn't pay to get up in the morning...
Well, those issues certainly seem a bit inconsiderate to non-mammalian
programmers.  :-/

How about the updated version shown below?

							Thanx, Paul

diff --git a/Documentation/RCU/Design/Requirements/Requirements.html b/Documentation/RCU/Design/Requirements/Requirements.html
index 5a9238a2883c..f04c467e55c5 100644
--- a/Documentation/RCU/Design/Requirements/Requirements.html
+++ b/Documentation/RCU/Design/Requirements/Requirements.html
@@ -2129,6 +2129,8 @@ Some of the relevant points of interest are as follows:
 Hotplug CPU.
 Scheduler and RCU.
 Tracing and RCU.
+
+Accesses to User Memory and RCU.
 Energy Efficiency.

 Scheduling-Clock Interrupts and RCU.
@@ -2521,6 +2523,75 @@ cannot be used.
 The tracing folks both located the requirement and provided the
 needed fix, so this surprise requirement was relatively painless.

+
+Accesses to User Memory and RCU
+
+The kernel needs to access user-space memory, for example, to access
+data referenced by system-call parameters.
+The get_user() macro does this job.
+
+However, user-space memory might well be paged out, which means
+that get_user() might well page-fault and thus block while
+waiting for the resulting I/O to complete.
+It would be a very bad thing for the compiler to reorder
+a get_user() invocation into an RCU read-side critical
+section.
+For example, suppose that the source code looked like this:
+
+  1 rcu_read_lock();
+  2 p = rcu_dereference(gp);
+  3 v = p->value;
+  4 rcu_read_unlock();
+  5 get_user(user_v, user_p);
+  6 do_something_with(v, user_v);
+
+The compiler must not be permitted to transform this source code into
+the following:
+
+  1 rcu_read_lock();
+  2 p = rcu_dereference(gp);
+  3 get_user(user_v, user_p); // BUG: POSSIBLE PAGE FAULT!!!
+  4 v = p->value;
+  5 rcu_read_unlock();
+  6 do_something_with(v, user_v);
+
+If the compiler did make this transformation in a
+CONFIG_PREEMPT=n kernel build, and if get_user() did
+page fault, the result would be a quiescent state in the middle
+of an RCU read-side critical section.
+This misplaced quiescent state could result in line 4 being
+a use-after-free access, which could be bad for your kernel's
+actuarial statistics.
+Similar examples can be constructed with the call to get_user()
+preceding the rcu_read_lock().
+
+Unfortunately, get_user() doesn't have any particular
+ordering properties, and in some architectures the underlying asm
+isn't even marked volatile.
+And even if it was marked volatile, the above access to
+p->value is not volatile, so the compiler would not have any
+reason to keep those two accesses in order.
Re: rcu_read_lock lost its compiler barrier
On Sat, 8 Jun 2019, Paul E. McKenney wrote:

> On Thu, Jun 06, 2019 at 10:19:43AM -0400, Alan Stern wrote:
> > On Thu, 6 Jun 2019, Andrea Parri wrote:
> >
> > > This seems a sensible change to me: looking forward to seeing a patch,
> > > on top of -rcu/dev, for further review and testing!
> > >
> > > We could also add (to LKMM) the barrier() for rcu_read_{lock,unlock}()
> > > discussed in this thread (maybe once the RCU code and the informal doc
> > > will have settled in such direction).
> >
> > Yes.  Also for SRCU.  That point had not escaped me.
>
> And it does seem pretty settled.  There are quite a few examples where
> there are normal accesses at either end of the RCU read-side critical
> sections, for example, the one in the requirements diffs below.
>
> For SRCU, srcu_read_lock() and srcu_read_unlock() have implied compiler
> barriers since 2006.  ;-)
>
> 							Thanx, Paul
>
> diff --git a/Documentation/RCU/Design/Requirements/Requirements.html b/Documentation/RCU/Design/Requirements/Requirements.html
> index 5a9238a2883c..080b39cc1dbb 100644
> --- a/Documentation/RCU/Design/Requirements/Requirements.html
> +++ b/Documentation/RCU/Design/Requirements/Requirements.html
> @@ -2129,6 +2129,8 @@ Some of the relevant points of interest are as follows:
> Hotplug CPU.
> Scheduler and RCU.
> Tracing and RCU.
> +
                       ^
> +Accesses to User Mamory and RCU.
                      -^
> Energy Efficiency.
>
> Scheduling-Clock Interrupts and RCU.
> @@ -2521,6 +2523,75 @@ cannot be used.
> The tracing folks both located the requirement and provided the
> needed fix, so this surprise requirement was relatively painless.
>
> +
  --^
> +Accesses to User Mamory and RCU
                      -^

Are these issues especially notable for female programmers?  :-)

Alan
Re: rcu_read_lock lost its compiler barrier
On Thu, Jun 06, 2019 at 10:19:43AM -0400, Alan Stern wrote:
> On Thu, 6 Jun 2019, Andrea Parri wrote:
>
> > This seems a sensible change to me: looking forward to seeing a patch,
> > on top of -rcu/dev, for further review and testing!
> >
> > We could also add (to LKMM) the barrier() for rcu_read_{lock,unlock}()
> > discussed in this thread (maybe once the RCU code and the informal doc
> > will have settled in such direction).
>
> Yes.  Also for SRCU.  That point had not escaped me.

And it does seem pretty settled.  There are quite a few examples where
there are normal accesses at either end of the RCU read-side critical
sections, for example, the one in the requirements diffs below.

For SRCU, srcu_read_lock() and srcu_read_unlock() have implied compiler
barriers since 2006.  ;-)

							Thanx, Paul

diff --git a/Documentation/RCU/Design/Requirements/Requirements.html b/Documentation/RCU/Design/Requirements/Requirements.html
index 5a9238a2883c..080b39cc1dbb 100644
--- a/Documentation/RCU/Design/Requirements/Requirements.html
+++ b/Documentation/RCU/Design/Requirements/Requirements.html
@@ -2129,6 +2129,8 @@ Some of the relevant points of interest are as follows:
 Hotplug CPU.
 Scheduler and RCU.
 Tracing and RCU.
+
+Accesses to User Mamory and RCU.
 Energy Efficiency.

 Scheduling-Clock Interrupts and RCU.
@@ -2521,6 +2523,75 @@ cannot be used.
 The tracing folks both located the requirement and provided the
 needed fix, so this surprise requirement was relatively painless.

+
+Accesses to User Mamory and RCU
+
+The kernel needs to access user-space memory, for example, to access
+data referenced by system-call parameters.
+The get_user() macro does this job.
+
+However, user-space memory might well be paged out, which means
+that get_user() might well page-fault and thus block while
+waiting for the resulting I/O to complete.
+It would be a very bad thing for the compiler to reorder
+a get_user() invocation into an RCU read-side critical
+section.
+For example, suppose that the source code looked like this:
+
+  1 rcu_read_lock();
+  2 p = rcu_dereference(gp);
+  3 v = p->value;
+  4 rcu_read_unlock();
+  5 get_user(user_v, user_p);
+  6 do_something_with(v, user_v);
+
+The compiler must not be permitted to transform this source code into
+the following:
+
+  1 rcu_read_lock();
+  2 p = rcu_dereference(gp);
+  3 get_user(user_v, user_p); // BUG: POSSIBLE PAGE FAULT!!!
+  4 v = p->value;
+  5 rcu_read_unlock();
+  6 do_something_with(v, user_v);
+
+If the compiler did make this transformation in a
+CONFIG_PREEMPT=n kernel build, and if get_user() did
+page fault, the result would be a quiescent state in the middle
+of an RCU read-side critical section.
+This misplaced quiescent state could result in line 4 being
+a use-after-free access, which could be bad for your kernel's
+actuarial statistics.
+Similar examples can be constructed with the call to get_user()
+preceding the rcu_read_lock().
+
+Unfortunately, get_user() doesn't have any particular
+ordering properties, and in some architectures the underlying asm
+isn't even marked volatile.
+And even if it was marked volatile, the above access to
+p->value is not volatile, so the compiler would not have any
+reason to keep those two accesses in order.
+
+Therefore, the Linux-kernel definitions of rcu_read_lock()
+and rcu_read_unlock() must act as compiler barriers,
+at least for outermost instances of rcu_read_lock() and
+rcu_read_unlock() within a nested set of RCU read-side critical
+sections.
+
 Energy Efficiency
Re: rcu_read_lock lost its compiler barrier
On Thu, 6 Jun 2019, Andrea Parri wrote:

> This seems a sensible change to me: looking forward to seeing a patch,
> on top of -rcu/dev, for further review and testing!
>
> We could also add (to LKMM) the barrier() for rcu_read_{lock,unlock}()
> discussed in this thread (maybe once the RCU code and the informal doc
> will have settled in such direction).

Yes.  Also for SRCU.  That point had not escaped me.

Alan
Re: rcu_read_lock lost its compiler barrier
On Thu, Jun 06, 2019 at 09:38:24PM +0800, Herbert Xu wrote:
> On Thu, Jun 06, 2019 at 03:58:17AM -0700, Paul E. McKenney wrote:
> >
> > I cannot immediately think of a way that the compiler could get this
> > wrong even in theory, but similar code sequences can be messed up.
> > The reason for this is that in theory, the compiler could use the
> > stored-to location as temporary storage, like this:
> >
> > 	a = whatever;  // Compiler uses "a" as a temporary
> > 	do_something();
> > 	whatever = a;
> > 	a = 1;         // Intended store
>
> Well if the compiler is going to do this then surely it would
> continue to do this even if you used WRITE_ONCE.  Remember a is
> not volatile, only the access of a through WRITE_ONCE is volatile.

I disagree.  Given a volatile store, the compiler cannot assume that
the stored-to location is normal memory at that point in time, and
therefore cannot assume that it is safe to invent a store to that
location (as shown above).

Thus far, the C++ standards committee seems on-board with this, though
time will tell.

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1382r1.pdf

							Thanx, Paul
Re: rcu_read_lock lost its compiler barrier
On Thu, Jun 06, 2019 at 03:58:17AM -0700, Paul E. McKenney wrote:
>
> I cannot immediately think of a way that the compiler could get this
> wrong even in theory, but similar code sequences can be messed up.
> The reason for this is that in theory, the compiler could use the
> stored-to location as temporary storage, like this:
>
> 	a = whatever;  // Compiler uses "a" as a temporary
> 	do_something();
> 	whatever = a;
> 	a = 1;         // Intended store

Well if the compiler is going to do this then surely it would
continue to do this even if you used WRITE_ONCE.  Remember a is
not volatile, only the access of a through WRITE_ONCE is volatile.

Cheers,
-- 
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: rcu_read_lock lost its compiler barrier
On Thu, Jun 06, 2019 at 05:28:55PM +0800, Herbert Xu wrote:
> On Thu, Jun 06, 2019 at 02:06:19AM -0700, Paul E. McKenney wrote:
> >
> > Or is your point instead that given the initial value of "a" being
> > zero and the value stored to "a" being one, there is no way that
> > any possible load and store tearing (your slicing and dicing) could
> > possibly mess up the test of the value loaded from "a"?
>
> Exactly.  If you can dream up of a scenario where the compiler can
> get this wrong I'm all ears.

I believe that this is safe in practice, as long as you exercise
constant vigilance.  (OK, OK, I might be overdramatizing...)

I cannot immediately think of a way that the compiler could get this
wrong even in theory, but similar code sequences can be messed up.
The reason for this is that in theory, the compiler could use the
stored-to location as temporary storage, like this:

	a = whatever;  // Compiler uses "a" as a temporary
	do_something();
	whatever = a;
	a = 1;         // Intended store

The compiler is allowed to do this (again, in theory and according to
a strict interpretation of the standard) because you haven't told it
that anything else is paying attention to variable "a".  As a result,
the compiler is within its rights to use "a" as temporary storage
immediately prior to any plain C-language store to "a".

In practice, I don't know of any compilers that actually do this, nor
have I heard anyone suggesting that they might soon actually do this.
And even if they could, your example would still work because your
example doesn't care about anything other than zero and non-zero, so
wouldn't get confused by the compiler storing a temporary value of 42
or whatever.

> > > But I do concede that in the general RCU case you must have the
> > > READ_ONCE/WRITE_ONCE calls for rcu_dereference/rcu_assign_pointer.
> >
> > OK, good that we are in agreement on this part, at least!  ;-)
>
> Well only because we're allowing crazy compilers that can turn
> a simple word-aligned word assignment (a = b) into two stores.

In my experience, the insanity of compilers increases with time, but
yes.

							Thanx, Paul
Re: rcu_read_lock lost its compiler barrier
On Thu, Jun 06, 2019 at 10:38:56AM +0200, Andrea Parri wrote:
> On Mon, Jun 03, 2019 at 10:46:40AM +0800, Herbert Xu wrote:
>
> > The case we were discussing is from net/ipv4/inet_fragment.c from
> > the net-next tree:
>
> BTW, thank you for keeping me and other people who intervened in that
> discussion in Cc:...

FWIW I didn't drop you from the Cc list.  The email discussion was
taken off-list by someone else and I simply kept that Cc list when I
brought it back onto lkml.

On a second look I did end up dropping Eric but I think he's had
enough of this discussion :)

-- 
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: rcu_read_lock lost its compiler barrier
On Thu, Jun 06, 2019 at 02:06:19AM -0700, Paul E. McKenney wrote:
>
> Or is your point instead that given the initial value of "a" being
> zero and the value stored to "a" being one, there is no way that
> any possible load and store tearing (your slicing and dicing) could
> possibly mess up the test of the value loaded from "a"?

Exactly.  If you can dream up of a scenario where the compiler can
get this wrong I'm all ears.

> > But I do concede that in the general RCU case you must have the
> > READ_ONCE/WRITE_ONCE calls for rcu_dereference/rcu_assign_pointer.
>
> OK, good that we are in agreement on this part, at least!  ;-)

Well only because we're allowing crazy compilers that can turn
a simple word-aligned word assignment (a = b) into two stores.

Cheers,
-- 
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: rcu_read_lock lost its compiler barrier
On Thu, Jun 06, 2019 at 02:14:38PM +0800, Herbert Xu wrote:
> On Wed, Jun 05, 2019 at 11:05:11PM -0700, Paul E. McKenney wrote:
> >
> > In case you were wondering, the reason that I was giving you such
> > a hard time was that from what I could see, you were pushing for no
> > {READ,WRITE}_ONCE() at all.  ;-)
>
> Hmm, that's exactly what it should be in net/ipv4/inet_fragment.c.
> We don't need the READ_ONCE/WRITE_ONCE (or volatile marking) at
> all.  Even if the compiler dices and slices the reads/writes of
> "a" into a thousand pieces, it should still work if the RCU
> primitives are worth their salt.

OK, so I take it that there is additional synchronization in there
somewhere that is not captured in your simplified example code?

Or is your point instead that given the initial value of "a" being
zero and the value stored to "a" being one, there is no way that
any possible load and store tearing (your slicing and dicing) could
possibly mess up the test of the value loaded from "a"?

> But I do concede that in the general RCU case you must have the
> READ_ONCE/WRITE_ONCE calls for rcu_dereference/rcu_assign_pointer.

OK, good that we are in agreement on this part, at least!  ;-)

							Thanx, Paul
Re: rcu_read_lock lost its compiler barrier
On Mon, Jun 03, 2019 at 10:46:40AM +0800, Herbert Xu wrote:

> The case we were discussing is from net/ipv4/inet_fragment.c from
> the net-next tree:

BTW, thank you for keeping me and other people who intervened in that
discussion in Cc:...

  Andrea

> void fqdir_exit(struct fqdir *fqdir)
> {
> 	...
> 	fqdir->dead = true;
>
> 	/* call_rcu is supposed to provide memory barrier semantics,
> 	 * separating the setting of fqdir->dead with the destruction
> 	 * work.  This implicit barrier is paired with inet_frag_kill().
> 	 */
> 	INIT_RCU_WORK(&fqdir->destroy_rwork, fqdir_rwork_fn);
> 	queue_rcu_work(system_wq, &fqdir->destroy_rwork);
> }
>
> and
>
> void inet_frag_kill(struct inet_frag_queue *fq)
> {
> 	...
> 	rcu_read_lock();
> 	/* The RCU read lock provides a memory barrier
> 	 * guaranteeing that if fqdir->dead is false then
> 	 * the hash table destruction will not start until
> 	 * after we unlock.  Paired with inet_frags_exit_net().
> 	 */
> 	if (!fqdir->dead) {
> 		rhashtable_remove_fast(&fqdir->rhashtable, &fq->node,
> 				       fqdir->f->rhash_params);
> 		...
> 	}
> 	...
> 	rcu_read_unlock();
> 	...
> }
>
> I simplified this to
>
> Initial values:
>
> a = 0
> b = 0
>
> CPU1			CPU2
>
> a = 1			rcu_read_lock
> synchronize_rcu		if (a == 0)
> b = 2				b = 1
> 			rcu_read_unlock
>
> On exit we want this to be true:
>
> b == 2
Re: rcu_read_lock lost its compiler barrier
> This example really does point out a weakness in the LKMM's handling of
> data races.  Herbert's litmus test is a great starting point:
>
> C xu
>
> {}
>
> P0(int *a, int *b)
> {
> 	WRITE_ONCE(*a, 1);
> 	synchronize_rcu();
> 	*b = 2;
> }
>
> P1(int *a, int *b)
> {
> 	rcu_read_lock();
> 	if (READ_ONCE(*a) == 0)
> 		*b = 1;
> 	rcu_read_unlock();
> }
>
> exists (~b=2)
>
> Currently the LKMM says the test is allowed and there is a data race,
> but this answer clearly is wrong since it would violate the RCU
> guarantee.
>
> The problem is that the LKMM currently requires all ordering/visibility
> of plain accesses to be mediated by marked accesses.  But in this case,
> the visibility is mediated by RCU.  Technically, we need to add a
> relation like
>
> 	([M] ; po ; rcu-fence ; po ; [M])
>
> into the definitions of ww-vis, wr-vis, and rw-xbstar.  Doing so
> changes the litmus test's result to "not allowed" and no data race.
> However, I'm not certain that this single change is the entire fix;
> more thought is needed.

This seems a sensible change to me: looking forward to seeing a patch,
on top of -rcu/dev, for further review and testing!

We could also add (to LKMM) the barrier() for rcu_read_{lock,unlock}()
discussed in this thread (maybe once the RCU code and the informal doc
will have settled in such direction).

It seems worth stressing the fact that _neither_ of these changes will
prevent the test below from being racy, considered the two accesses to
"a" happen concurrently / without synchronization.

  Thanks,
  Andrea

C xu-2

{}

P0(int *a, int *b)
{
	*a = 1;
	synchronize_rcu();
	WRITE_ONCE(*b, 2);
}

P1(int *a, int *b)
{
	rcu_read_lock();
	if (*a == 0)
		WRITE_ONCE(*b, 1);
	rcu_read_unlock();
}

exists (~b=2)
Re: rcu_read_lock lost its compiler barrier
On Wed, Jun 05, 2019 at 11:05:11PM -0700, Paul E. McKenney wrote:
>
> In case you were wondering, the reason that I was giving you such
> a hard time was that from what I could see, you were pushing for no
> {READ,WRITE}_ONCE() at all.  ;-)

Hmm, that's exactly what it should be in net/ipv4/inet_fragment.c.
We don't need the READ_ONCE/WRITE_ONCE (or volatile marking) at
all.  Even if the compiler dices and slices the reads/writes of
"a" into a thousand pieces, it should still work if the RCU
primitives are worth their salt.

But I do concede that in the general RCU case you must have the
READ_ONCE/WRITE_ONCE calls for rcu_dereference/rcu_assign_pointer.

Cheers,
-- 
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: rcu_read_lock lost its compiler barrier
On Thu, Jun 06, 2019 at 12:51:09PM +0800, Herbert Xu wrote:
> On Tue, Jun 04, 2019 at 10:44:18AM -0400, Alan Stern wrote:
> >
> > Currently the LKMM says the test is allowed and there is a data race,
> > but this answer clearly is wrong since it would violate the RCU
> > guarantee.
>
> Thank you!  This is what I tried to say all along in this thread
> but you expressed it in a much better way :)

In case you were wondering, the reason that I was giving you such
a hard time was that from what I could see, you were pushing for no
{READ,WRITE}_ONCE() at all.  ;-)

							Thanx, Paul
Re: rcu_read_lock lost its compiler barrier
On Tue, Jun 04, 2019 at 10:44:18AM -0400, Alan Stern wrote:
>
> Currently the LKMM says the test is allowed and there is a data race,
> but this answer clearly is wrong since it would violate the RCU
> guarantee.

Thank you!  This is what I tried to say all along in this thread
but you expressed it in a much better way :)

-- 
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: rcu_read_lock lost its compiler barrier
On Tue, Jun 04, 2019 at 08:30:39PM -0700, Paul E. McKenney wrote:
>
> Understood.  Does the patch I sent out a few hours ago cover it?  Or is
> something else needed?

It looks good to me.

> Other than updates to the RCU requirements documentation, which is
> forthcoming.

Thanks Paul.

-- 
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: rcu_read_lock lost its compiler barrier
On Wed, Jun 05, 2019 at 10:21:17AM +0800, Herbert Xu wrote:
> On Tue, Jun 04, 2019 at 02:14:49PM -0700, Paul E. McKenney wrote:
> >
> > Yeah, I know, even with the "volatile" keyword, it is not entirely clear
> > how much reordering the compiler is allowed to do.  I was relying on
> > https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html, which says:
>
> The volatile keyword doesn't give any guarantees of this kind.
> The key to ensuring ordering between unrelated variable/register
> reads/writes is the memory clobber:
>
> 6.47.2.6 Clobbers and Scratch Registers
>
> ...
>
> "memory"  The "memory" clobber tells the compiler that the assembly
> 	code performs memory reads or writes to items other than those
> 	listed in the input and output operands (for example, accessing
> 	the memory pointed to by one of the input parameters).  To ensure
> 	memory contains correct values, GCC may need to flush specific
> 	register values to memory before executing the asm.  Further,
> 	the compiler does not assume that any values read from memory
> 	before an asm remain unchanged after that asm; it reloads them as
> 	needed.  Using the "memory" clobber effectively forms a read/write
> 	memory barrier for the compiler.
>
> 	Note that this clobber does not prevent the processor from
> 	doing speculative reads past the asm statement.  To prevent that,
> 	you need processor-specific fence instructions.
>
> IOW you need a barrier().

Understood.  Does the patch I sent out a few hours ago cover it?  Or is
something else needed?

Other than updates to the RCU requirements documentation, which is
forthcoming.

							Thanx, Paul
Re: rcu_read_lock lost its compiler barrier
On Tue, Jun 04, 2019 at 02:14:49PM -0700, Paul E. McKenney wrote:
>
> Yeah, I know, even with the "volatile" keyword, it is not entirely clear
> how much reordering the compiler is allowed to do.  I was relying on
> https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html, which says:

The volatile keyword doesn't give any guarantees of this kind.
The key to ensuring ordering between unrelated variable/register
reads/writes is the memory clobber:

6.47.2.6 Clobbers and Scratch Registers

...

"memory"  The "memory" clobber tells the compiler that the assembly
	code performs memory reads or writes to items other than those
	listed in the input and output operands (for example, accessing
	the memory pointed to by one of the input parameters).  To ensure
	memory contains correct values, GCC may need to flush specific
	register values to memory before executing the asm.  Further,
	the compiler does not assume that any values read from memory
	before an asm remain unchanged after that asm; it reloads them as
	needed.  Using the "memory" clobber effectively forms a read/write
	memory barrier for the compiler.

	Note that this clobber does not prevent the processor from
	doing speculative reads past the asm statement.  To prevent that,
	you need processor-specific fence instructions.

IOW you need a barrier().

Thanks,
-- 
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: rcu_read_lock lost its compiler barrier
On Mon, Jun 03, 2019 at 01:24:32PM -0700, Linus Torvalds wrote:
> On Mon, Jun 3, 2019 at 12:53 PM Paul E. McKenney wrote:
> >
> > I agree that !PREEMPT rcu_read_lock() would not affect compiler code
> > generation, but given that get_user() is a volatile asm, isn't the
> > compiler already forbidden from reordering it with the volatile-casted
> > WRITE_ONCE() access, even if there was nothing at all between them?
> > Or are asms an exception to the rule that volatile executions cannot
> > be reordered?
>
> Paul, you MAKE NO SENSE.
>
> What is wrong with you?

Mostly that I didn't check all architectures' definitions of
get_user().  Had I done so, I would have seen that not all of the
corresponding asms have the "volatile" keyword.  And of course, without
that keyword, there is absolutely nothing preventing the compiler from
reordering the asm with pretty much anything.  The only things that
would be absolutely guaranteed to prevent reordering would be things
like memory clobbers (barrier()) or accesses that overlap the asm's
input/output list.

Yeah, I know, even with the "volatile" keyword, it is not entirely
clear how much reordering the compiler is allowed to do.  I was relying
on https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html, which says:

	Qualifiers

	volatile

	The typical use of extended asm statements is to manipulate
	input values to produce output values.  However, your asm
	statements may also produce side effects.  If so, you may need
	to use the volatile qualifier to disable certain optimizations.
	See Volatile.

But the linked-to "Volatile" section later in that same web page mostly
talks about the compiler's ability to hoist asms out of loops.

> I just showed you an example of where rcu_read_lock() needs to be a
> compiler barrier, and then you make incoherent noises about
> WRITE_ONCE() that do not even exist in that example.
I thought we were discussing this example, but it doesn't matter
because I was missing your point about get_user() and page faults:

	get_user(val, ptr)
	rcu_read_lock();
	WRITE_ONCE(state, 1);

But regardless, given that some architectures omit volatile from their
asms implementing get_user(), even an optimistic interpretation of that
part of the GCC documentation would still permit reordering the above.
And again, I was missing your point about get_user() causing page
faults and thus context switches.

> Forget about your READ_ONCE/WRITE_ONCE theories.  Herbert already
> showed code that doesn't have those accessors, so reality doesn't
> match your fevered imagination.

I get the feeling that you believe that I want LKMM to be some sort
of final judge and absolute arbiter of what code is legal and not
from a memory-ordering perspective.  This is absolutely -not- the
case.  The current state of the art, despite the recent impressive
progress, simply cannot reasonably do this.  So all I can claim is
that LKMM dispenses advice, hopefully good advice.  (It is early days
for LKMM's handling of plain accesses, so some work might be required
to deliver on the "good advice" promise, but we have to start
somewhere.  Plus it is progressing nicely.)

The places where long-standing RCU patterns require rcu_dereference()
and rcu_assign_pointer() do require some attention to avoid compiler
optimizations, and {READ,WRITE}_ONCE() is one way of addressing this.
But not the only way, nor always the best way.  For example, some
fastpaths might need the optimizations that {READ,WRITE}_ONCE()
suppresses.  Therefore, Linux kernel hackers have a number of other
ways of paying attention.  For example, accesses might be constrained
via barrier() and friends.  For another example, some developers might
check assembly output (hopefully scripted somehow).

Again, the Linux-kernel memory model dispenses advice, not absolutes.
Furthermore, the way it dispenses advice is currently a bit limited.
It can currently say that it is nervous about lack of
{READ,WRITE}_ONCE(), as in "Flag data-race", but it would be difficult
to make it recommend the other options in an intelligent way.  So we
should interpret "Flag data-race" as LKMM saying "I am nervous about
your unmarked accesses" rather than "You absolutely must call
{READ,WRITE}_ONCE() more often!!!"  Again, advice, not absolutes.

So the idea is that you add and remove {READ,WRITE}_ONCE() to/from the
-litmus- -tests- to determine which accesses LKMM is nervous about.
But that doesn't necessarily mean that {READ,WRITE}_ONCE() goes into
the corresponding places in the Linux kernel.

Does that help, or am I still confused?

> And sometimes it's not even possible, since you can't do a bitfield
> access, for example, with READ_ONCE().

Ah, good point.  So the Linux kernel uses bitfields to communicate
between mainline and interrupt handlers.  New one on me.  :-/

> > We can of course put them back in,
>
> Stop the craziness.  It's not "we can".
Re: rcu_read_lock lost its compiler barrier
On Tue, Jun 4, 2019 at 10:00 AM Alan Stern wrote:
>
> Which suggests asking whether these higher expectations should be
> reflected in the Linux Kernel Memory Model.  So far we have largely
> avoided doing that sort of thing, although there are a few exceptions.

I think they might be hard to describe - which in turn may be why the
standard leaves it open and only describes the simple cases.

Exactly that "we expect an assignment to be done as a single write" is
probably a good example.  Yes, we do have that expectation for the
simple cases.  But it's not an absolute rule, obviously, because it's
clearly violated by bitfield writes, and similarly it's obviously
violated for data that is bigger than the word size (ie a "u64"
assignment is obviously not a single write when you're on a 32-bit
target).

And while those cases are static and could be described separately,
the "oh, and we have _other_ accesses to the same variable nearby, and
we expect that the compiler might take those into account unless we
explicitly use WRITE_ONCE()" things make for much more hard to
describe issues.

Doing writes with speculative values is clearly bogus and garbage
(although compiler writers _have_ tried to do that too: "we then
overwrite it with the right value later, so it's ok"), and thankfully
even user space people complain about that due to threading and
signals.

But eliding writes entirely by combining them with a later one is
clearly ok - and eliding them when there was an explicit earlier read
and value comparison (like in my example) sounds reasonable to me too.
Yet silently adding the elision that wasn't visible in the source due
to other accesses would be bad.

How do you say "sane and reasonable compiler" in a spec?

You usually don't - or you make the rules so odd and complex than
nobody really understands them any more, and make it all moot ;)

		Linus
Re: rcu_read_lock lost its compiler barrier
On Tue, 4 Jun 2019, Linus Torvalds wrote: > So I don't technically disagree with anything you say, That's good to know! > I just wanted > to point out that as far as the kernel is concerned, we do have higher > quality expectations from the compiler than just "technically valid > according to the C standard". Which suggests asking whether these higher expectations should be reflected in the Linux Kernel Memory Model. So far we have largely avoided doing that sort of thing, although there are a few exceptions. (For example, we assume the compiler does not destroy address dependencies from volatile reads -- but we also warn that this assumption may fail if the programmer does not follow some rules described in one of Paul's documentation files.) Alan
Re: rcu_read_lock lost its compiler barrier
On Tue, Jun 4, 2019 at 7:44 AM Alan Stern wrote: > > Even if you don't think the compiler will ever do this, the C standard > gives compilers the right to invent read accesses if a plain (i.e., > non-atomic and non-volatile) write is present. Note that for the kernel, it's not like we go strictly by what the standard says anyway. The reason for the "read for write" thing is that obviously you sometimes have broken architectures (*cough*alpha*cough*) that need to do some common writes as read-mask-modify-write accesses, and for bitfields you obviously *always* have that issue. In general, a sane compiler (as opposed to "we just read the standard") had better not add random reads, because people might have some mappings be write-only when the hardware supports it. And we do generally expect sanity from our compilers, and will add compiler flags to disable bad behavior if required - even if said behavior would be "technically allowed by the standard". > (Incidentally, regardless of whether the compiler will ever do this, I > have seen examples in the kernel where people did exactly this > manually, in order to avoid dirtying a cache line unnecessarily.) I really hope and expect that this is not something that the compiler ever does. If a compiler turns "x = y" into "if (x != y) x = y" (like we do manually at times, as you say), that compiler is garbage that we cannot use for the kernel. It would break things like "smp_wmb()" ordering guarantees, I'm pretty sure. And as always, if we're doing actively stupid things, and the compiler then turns our stupid code into something we don't expect, the corollary is that then it's on _us_. IOW, if we do if (x != 1) { ... } x = 1; and the compiler goes "oh, we already checked that 'x == 1'" and moves that "unconditional" 'x = 1' into the conditional section like if (x != 1) { .. x = 1; } then that's not something we can then complain about. So our expectation is that the compiler is _sane_, not that it's some "C as assembler".
Adding spurious reads is not ok. But if we already had reads in the code and the compiler combines them with other ops, that's on us. End result: in general, we do expect that the compiler turns a regular assignment into a single plain write when that's what the code does, and does not add extra logic over and beyond that. In fact, the alpha port was always subtly buggy exactly because of the "byte write turns into a read-and-masked-write", even if I don't think anybody ever noticed (we did fix cases where people _did_ notice, though, and we might still have some cases where we use 'int' for booleans because of alpha issues). So I don't technically disagree with anything you say, I just wanted to point out that as far as the kernel is concerned, we do have higher quality expectations from the compiler than just "technically valid according to the C standard". Linus
Re: rcu_read_lock lost its compiler barrier
On Mon, 3 Jun 2019, Paul E. McKenney wrote: > On Mon, Jun 03, 2019 at 02:42:00PM +0800, Boqun Feng wrote: > > On Mon, Jun 03, 2019 at 01:26:26PM +0800, Herbert Xu wrote: > > > On Sun, Jun 02, 2019 at 08:47:07PM -0700, Paul E. McKenney wrote: > > > > > > > > 1. These guarantees are of full memory barriers, -not- compiler > > > > barriers. > > > > > > What I'm saying is that wherever they are, they must come with > > > compiler barriers. I'm not aware of any synchronisation mechanism > > > in the kernel that gives a memory barrier without a compiler barrier. > > > > > > > 2. These rules don't say exactly where these full memory barriers > > > > go. SRCU is at one extreme, placing those full barriers in > > > > srcu_read_lock() and srcu_read_unlock(), and !PREEMPT Tree RCU > > > > at the other, placing these barriers entirely within the > > > > callback > > > > queueing/invocation, grace-period computation, and the > > > > scheduler. > > > > Preemptible Tree RCU is in the middle, with rcu_read_unlock() > > > > sometimes including a full memory barrier, but other times with > > > > the full memory barrier being confined as it is with !PREEMPT > > > > Tree RCU. > > > > > > The rules do say that the (full) memory barrier must precede any > > > RCU read-side that occur after the synchronize_rcu and after the > > > end of any RCU read-side that occur before the synchronize_rcu. > > > > > > All I'm arguing is that wherever that full mb is, as long as it > > > also carries with it a barrier() (which it must do if it's done > > > using an existing kernel mb/locking primitive), then we're fine. > > > > > > > Interleaving and inserting full memory barriers as per the rules above: > > > > > > > > CPU1: WRITE_ONCE(a, 1) > > > > CPU1: synchronize_rcu > > > > /* Could put a full memory barrier here, but it wouldn't help. > > > > */ > > > > > > CPU1: smp_mb(); > > > CPU2: smp_mb(); > > > > > > Let's put them in because I think they are critical. 
smp_mb() also > > > carries with it a barrier(). > > > > > > > CPU2: rcu_read_lock(); > > > > CPU1: b = 2; > > > > CPU2: if (READ_ONCE(a) == 0) > > > > CPU2: if (b != 1) /* Weakly ordered CPU moved this up! > > > > */ > > > > CPU2: b = 1; > > > > CPU2: rcu_read_unlock > > > > > > > > In fact, CPU2's load from b might be moved up to race with CPU1's store, > > > > which (I believe) is why the model complains in this case. > > > > > > Let's put aside my doubt over how we're even allowing a compiler > > > to turn > > > > > > b = 1 > > > > > > into > > > > > > if (b != 1) > > > b = 1 Even if you don't think the compiler will ever do this, the C standard gives compilers the right to invent read accesses if a plain (i.e., non-atomic and non-volatile) write is present. The Linux Kernel Memory Model has to assume that compilers will sometimes do this, even if it doesn't take the exact form of checking a variable's value before writing to it. (Incidentally, regardless of whether the compiler will ever do this, I have seen examples in the kernel where people did exactly this manually, in order to avoid dirtying a cache line unnecessarily.) > > > Since you seem to be assuming that (a == 0) is true in this case > > > > I think Paul's example is assuming (a == 0) is false, and maybe > > Yes, otherwise, P0()'s write to "b" cannot have happened. > > > speculative writes (by compilers) needs to be added into consideration? On the other hand, the C standard does not allow compilers to add speculative writes. The LKMM assumes they will never occur. > I would instead call it the compiler eliminating needless writes > by inventing reads -- if the variable already has the correct value, > no write happens. So no compiler speculation. > > However, it is difficult to create a solid defensible example.
Yes, > from LKMM's viewpoint, the weakly reordered invented read from "b" > can be concurrent with P0()'s write to "b", but in that case the value > loaded would have to manage to be equal to 1 for anything bad to happen. > This does feel wrong to me, but again, it is difficult to create a solid > defensible example. > > > Please consider the following case (I add a few smp_mb()s), the case may > > be a little bit crazy, you have been warned ;-) > > > > CPU1: WRITE_ONCE(a, 1) > > CPU1: synchronize_rcu called > > > > CPU1: smp_mb(); /* let's assume there is one here */ > > > > CPU2: rcu_read_lock(); > > CPU2: smp_mb(); /* let's assume there is one here */ > > > > /* "if (b != 1) b = 1" reordered */ > > CPU2: r0 = b; /* if (b != 1) reordered here, r0 == 0 */ > > CPU2: if (r0 != 1) /* true */ > > CPU2: b = 1; /* b == 1 now, this is a speculative write > > by compiler > > */ > >
Re: rcu_read_lock lost its compiler barrier
On Mon, Jun 3, 2019 at 12:53 PM Paul E. McKenney wrote: > > I agree that !PREEMPT rcu_read_lock() would not affect compiler code > generation, but given that get_user() is a volatile asm, isn't the > compiler already forbidden from reordering it with the volatile-casted > WRITE_ONCE() access, even if there was nothing at all between them? > Or are asms an exception to the rule that volatile executions cannot > be reordered? Paul, you MAKE NO SENSE. What is wrong with you? I just showed you an example of where rcu_read_lock() needs to be a compiler barrier, and then you make incoherent noises about WRITE_ONCE() that do not even exist in that example. Forget about your READ_ONCE/WRITE_ONCE theories. Herbert already showed code that doesn't have those accessors, so reality doesn't match your fevered imagination. And sometimes it's not even possible, since you can't do a bitfield access, for example, with READ_ONCE(). > We can of course put them back in, Stop the craziness. It's not "we can". It is a "we will". So I will add that barrier, and you need to stop arguing against it based on specious theoretical arguments that do not match reality. And we will not ever remove that barrier again. Herbert already pointed to me having to do this once before in commit 386afc91144b ("spinlocks and preemption points need to be at least compiler barriers"), and rcu_read_lock() clearly has at a minimum that same preemption point issue. Linus
Re: rcu_read_lock lost its compiler barrier
On Mon, Jun 03, 2019 at 02:42:00PM +0800, Boqun Feng wrote: > On Mon, Jun 03, 2019 at 01:26:26PM +0800, Herbert Xu wrote: > > On Sun, Jun 02, 2019 at 08:47:07PM -0700, Paul E. McKenney wrote: > > > > > > 1.These guarantees are of full memory barriers, -not- compiler > > > barriers. > > > > What I'm saying is that wherever they are, they must come with > > compiler barriers. I'm not aware of any synchronisation mechanism > > in the kernel that gives a memory barrier without a compiler barrier. > > > > > 2.These rules don't say exactly where these full memory barriers > > > go. SRCU is at one extreme, placing those full barriers in > > > srcu_read_lock() and srcu_read_unlock(), and !PREEMPT Tree RCU > > > at the other, placing these barriers entirely within the callback > > > queueing/invocation, grace-period computation, and the scheduler. > > > Preemptible Tree RCU is in the middle, with rcu_read_unlock() > > > sometimes including a full memory barrier, but other times with > > > the full memory barrier being confined as it is with !PREEMPT > > > Tree RCU. > > > > The rules do say that the (full) memory barrier must precede any > > RCU read-side that occur after the synchronize_rcu and after the > > end of any RCU read-side that occur before the synchronize_rcu. > > > > All I'm arguing is that wherever that full mb is, as long as it > > also carries with it a barrier() (which it must do if it's done > > using an existing kernel mb/locking primitive), then we're fine. > > > > > Interleaving and inserting full memory barriers as per the rules above: > > > > > > CPU1: WRITE_ONCE(a, 1) > > > CPU1: synchronize_rcu > > > /* Could put a full memory barrier here, but it wouldn't help. */ > > > > CPU1: smp_mb(); > > CPU2: smp_mb(); > > > > Let's put them in because I think they are critical. smp_mb() also > > carries with it a barrier(). 
> > > > > CPU2: rcu_read_lock(); > > > CPU1: b = 2; > > > CPU2: if (READ_ONCE(a) == 0) > > > CPU2: if (b != 1) /* Weakly ordered CPU moved this up! */ > > > CPU2: b = 1; > > > CPU2: rcu_read_unlock > > > > > > In fact, CPU2's load from b might be moved up to race with CPU1's store, > > > which (I believe) is why the model complains in this case. > > > > Let's put aside my doubt over how we're even allowing a compiler > > to turn > > > > b = 1 > > > > into > > > > if (b != 1) > > b = 1 > > > > Since you seem to be assuming that (a == 0) is true in this case > > I think Paul's example is assuming (a == 0) is false, and maybe Yes, otherwise, P0()'s write to "b" cannot have happened. > speculative writes (by compilers) needs to be added into consideration? I would instead call it the compiler eliminating needless writes by inventing reads -- if the variable already has the correct value, no write happens. So no compiler speculation. However, it is difficult to create a solid defensible example. Yes, from LKMM's viewpoint, the weakly reordered invented read from "b" can be concurrent with P0()'s write to "b", but in that case the value loaded would have to manage to be equal to 1 for anything bad to happen. This does feel wrong to me, but again, it is difficult to create a solid defensible example. > Please consider the following case (I add a few smp_mb()s), the case may > be a little bit crazy, you have been warned ;-) > > CPU1: WRITE_ONCE(a, 1) > CPU1: synchronize_rcu called > > CPU1: smp_mb(); /* let's assume there is one here */ > > CPU2: rcu_read_lock(); > CPU2: smp_mb(); /* let's assume there is one here */ > > /* "if (b != 1) b = 1" reordered */ > CPU2: r0 = b; /* if (b != 1) reordered here, r0 == 0 */ > CPU2: if (r0 != 1) /* true */ > CPU2: b = 1; /* b == 1 now, this is a speculative write by compiler */ > > CPU1: b = 2; /* b == 2 */ > > CPU2: if (READ_ONCE(a) == 0) /* false */ > CPU2: ...
> CPU2: else /* undo the speculative write */ > CPU2: b = r0; /* b == 0 */ > > CPU2: smp_mb(); > CPU2: rcu_read_unlock(); > > I know this is too crazy for us to think a compiler would do this, but this > might be the reason why the model complains about this. > > Paul, did I get this right? Or did you mean something else? Mostly there, except that I am not yet desperate enough to appeal to compilers speculating stores. ;-) Thanx, Paul > Regards, > Boqun > > > > > (as the assignment b = 1 is carried out), then because of the > > presence of the full memory barrier, the RCU read-side section > > must have started prior to the synchronize_rcu. This means that > > synchronize_rcu is not allowed to return until at least the end > > of the grace period, or at least until the end of rcu_read_unlock. > > > > So
Re: rcu_read_lock lost its compiler barrier
On Mon, Jun 03, 2019 at 09:07:29AM -0700, Linus Torvalds wrote: > On Mon, Jun 3, 2019 at 8:55 AM Linus Torvalds > wrote: > > > > I don't believe that it would necessarily help to turn a > > rcu_read_lock() into a compiler barrier, because for the non-preempt > > case rcu_read_lock() doesn't need to actually _do_ anything, and > > anything that matters for the RCU read lock will already be a compiler > > barrier for other reasons (ie a function call that can schedule). > > Actually, thinking a bit more about this, and trying to come up with > special cases, I'm not at all convinced. > > Even if we don't have preemption enabled, it turns out that we *do* > have things that can cause scheduling without being compiler barriers. > > In particular, user accesses are not necessarily full compiler > barriers. One common pattern (x86) is > > asm volatile("call __get_user_%P4" > > which explicitly has a "asm volatile" so that it doesn't re-order wrt > other asms (and thus other user accesses), but it does *not* have a > "memory" clobber, because the user access doesn't actually change > kernel memory. Not even if it's a "put_user()". > > So we've made those fairly relaxed on purpose. And they might be > relaxed enough that they'd allow re-ordering wrt something that does a > rcu read lock, unless the rcu read lock has some compiler barrier in > it. > > IOW, imagine completely made up code like > > get_user(val, ptr) > rcu_read_lock(); > WRITE_ONCE(state, 1); > > and unless the rcu lock has a barrier in it, I actually think that > write to 'state' could migrate to *before* the get_user(). > > I'm not convinced we have anything that remotely looks like the above, > but I'm actually starting to think that yes, all RCU barriers had > better be compiler barriers. > > Because this is very much an example of something where you don't > necessarily need a memory barrier, but there's a code generation > barrier needed because of local ordering requirements.
The possible > faulting behavior of "get_user()" must not migrate into the RCU > critical region. > > Paul? I agree that !PREEMPT rcu_read_lock() would not affect compiler code generation, but given that get_user() is a volatile asm, isn't the compiler already forbidden from reordering it with the volatile-casted WRITE_ONCE() access, even if there was nothing at all between them? Or are asms an exception to the rule that volatile executions cannot be reordered? > So I think the rule really should be: every single form of locking > that has any semantic meaning at all, absolutely needs to be at least > a compiler barrier. > > (That "any semantic meaning" weaselwording is because I suspect that > we have locking that truly and intentionally becomes no-ops because > it's based on things that aren't relevant in some configurations. But > generally compiler barriers are really pretty damn cheap, even from a > code generation standpoint, and can help make the resulting code more > legible, so I think we should not try to aggressively remove them > without _very_ good reasons) We can of course put them back in, but this won't help in the typical rcu_assign_pointer(), rcu_dereference(), and synchronize_rcu() situation (nor do I see how it helps in Herbert's example). And in other RCU use cases, the accesses analogous to the rcu_assign_pointer() and rcu_dereference() (in Herbert's example, the accesses to variable "a") really need to be READ_ONCE()/WRITE_ONCE() or stronger, correct? Thanx, Paul
Re: rcu_read_lock lost its compiler barrier
On Mon, Jun 3, 2019 at 8:55 AM Linus Torvalds wrote: > > I don't believe that it would necessarily help to turn a > rcu_read_lock() into a compiler barrier, because for the non-preempt > case rcu_read_lock() doesn't need to actually _do_ anything, and > anything that matters for the RCU read lock will already be a compiler > barrier for other reasons (ie a function call that can schedule). Actually, thinking a bit more about this, and trying to come up with special cases, I'm not at all convinced. Even if we don't have preemption enabled, it turns out that we *do* have things that can cause scheduling without being compiler barriers. In particular, user accesses are not necessarily full compiler barriers. One common pattern (x86) is asm volatile("call __get_user_%P4" which explicitly has a "asm volatile" so that it doesn't re-order wrt other asms (and thus other user accesses), but it does *not* have a "memory" clobber, because the user access doesn't actually change kernel memory. Not even if it's a "put_user()". So we've made those fairly relaxed on purpose. And they might be relaxed enough that they'd allow re-ordering wrt something that does a rcu read lock, unless the rcu read lock has some compiler barrier in it. IOW, imagine completely made up code like get_user(val, ptr) rcu_read_lock(); WRITE_ONCE(state, 1); and unless the rcu lock has a barrier in it, I actually think that write to 'state' could migrate to *before* the get_user(). I'm not convinced we have anything that remotely looks like the above, but I'm actually starting to think that yes, all RCU barriers had better be compiler barriers. Because this is very much an example of something where you don't necessarily need a memory barrier, but there's a code generation barrier needed because of local ordering requirements. The possible faulting behavior of "get_user()" must not migrate into the RCU critical region. Paul?
So I think the rule really should be: every single form of locking that has any semantic meaning at all, absolutely needs to be at least a compiler barrier. (That "any semantic meaning" weaselwording is because I suspect that we have locking that truly and intentionally becomes no-ops because it's based on things that aren't relevant in some configurations. But generally compiler barriers are really pretty damn cheap, even from a code generation standpoint, and can help make the resulting code more legible, so I think we should not try to aggressively remove them without _very_ good reasons) Linus
Re: rcu_read_lock lost its compiler barrier
On Sun, Jun 2, 2019 at 8:03 PM Herbert Xu wrote: > > In any case, I am now even more certain that compiler barriers are > not needed in the code in question. The reasoning is quite simple. > If you need those compiler barriers then you surely need real memory > barriers. So the above statement is not necessarily correct. Compiler barriers very much can have real effects even in the absence of "real" memory barriers. But those effects are obviously not about multiple CPU's - they are about code generation and can be about ordering on _one_ CPU. Those effects definitely happen, though. So a compiler barrier without a memory barrier may make a difference if you (a) compile for UP (when a compiler barrier basically _is_ a memory barrier) (b) have deadlock or ordering avoidance with only the local CPU taking interrupts. (c) need to re-load a value in a loop, but ordering isn't a concern and possibly other situations. In the above, (a) may be pointless and trivial, but (b) and (c) are issues even on SMP. Some things only matter for the local CPU - a code sequence that happens on another CPU can just block, but an interrupt that comes in on the same CPU may deadlock and depend on particular access ordering. And (c) is for things like cpu_relax() in loops that read stuff (although honestly, READ_ONCE() is generally a much better pattern). But it sounds like in this case at least, Herbert's and Paul's disagreements aren't really so much about the memory barriers and locking as about the fact that in general the only thing that RCU protects against is single accesses, and thus any RCU region technically should use something that guarantees that the compiler won't do stupid things. We do require a _minimum_ of compiler sanity, but the compiler turning a non-marked read into two reads has definitely happened, and can be very subtle.
Even if on a C level it looks like a single access, and correctness doesn't care about _what_ the value we read is, it might be turned by the compiler into two separate accesses that get two different values, and then the end result may be insane and unreliable. So on the whole, I do think that it's usually a good idea to show _which_ access is protected by RCU. Perhaps with a READ_ONCE/WRITE_ONCE pattern, although it can be other things. I don't believe that it would necessarily help to turn a rcu_read_lock() into a compiler barrier, because for the non-preempt case rcu_read_lock() doesn't need to actually _do_ anything, and anything that matters for the RCU read lock will already be a compiler barrier for other reasons (ie a function call that can schedule). Anyway, I suspect the code is correct in practice, but I do have some sympathy for the "RCU protected accesses should be marked". Linus
Re: rcu_read_lock lost its compiler barrier
On Mon, Jun 3, 2019 at 8:26 AM David Laight wrote: > > From: Paul E. McKenney > > > We do > > occasionally use READ_ONCE() to prevent load-fusing optimizations that > > would otherwise cause the compiler to turn while-loops into if-statements > > guarding infinite loops. > > In that case the variable ought to be volatile... No. We do not use volatile on variables. The C people got the semantics wrong, probably because 'volatile' was historically for IO, not for access atomicity without locking. It's not the memory location that is volatile, it is really the _access_ that is volatile. The same memory location might be completely stable in other access situations (ie when done under a lock). In other words, we should *never* use volatile in the kernel. It's fundamentally mis-designed for modern use. (Of course, we can then use volatile in a cast in code, which drives some compiler people crazy, but that's because said compiler people don't care about reality, they care about some paperwork). Linus
RE: rcu_read_lock lost its compiler barrier
From: Paul E. McKenney > Sent: 03 June 2019 09:42 ... > On kissing the kernel goodbye, a reasonable strategy might be to > identify the transformations that are actually occurring (like the > stores of certain constants called out above) and fix those. > We do > occasionally use READ_ONCE() to prevent load-fusing optimizations that > would otherwise cause the compiler to turn while-loops into if-statements > guarding infinite loops. In that case the variable ought to be volatile... > There is also the possibility of having the > compiler guys give us more command-line arguments. I wonder how much the code size (of anything) would increase if the compiler: 1) Never read a value into a local more than once. 2) Never write a value that wasn't requested by the code. 3) Never used multiple memory accesses for 'machine word' (and smaller) items. (1) makes all reads READ_ONCE() except that the actual read can be delayed until further down the code. If I have a naive #define bswap32() I'd expect: v = bswap32(foo->bar) to possibly read foo->bar multiple times, but not: int foo_bar = foo->bar; v = bswap32(foo_bar); (2) makes all writes WRITE_ONCE() except that if there are multiple writes to the same location, only the last need be done. In particular it stops speculative writes and the use of locations that are going to be written to as temporaries. It also stops foo->bar = ~0; being implemented as a clear then decrement. (3) I'd never imagined the compiler would write the two halves of a word separately! If the compiler behaved like that (as one might expect it would) then READ_ONCE() would be a READ_NOW() for when the sequencing mattered. I was also recently surprised by the code I got from this loop: for (i = 0; i < limit; i++) sum64 += array32[i]; (as in the IP checksum sum without add carry support). The compiler unrolled it to use horrid sequences of sse3/avx instructions.
This might be a gain for large enough buffers and 'hot cache' but for small buffers and a likely cold cache it is horrid. I guess such optimisations are valid, but I wonder how often they are an actual win for real programs. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
Re: rcu_read_lock lost its compiler barrier
On Mon, Jun 03, 2019 at 01:26:26PM +0800, Herbert Xu wrote: > On Sun, Jun 02, 2019 at 08:47:07PM -0700, Paul E. McKenney wrote: > > > > 1. These guarantees are of full memory barriers, -not- compiler > > barriers. > > What I'm saying is that wherever they are, they must come with > compiler barriers. I'm not aware of any synchronisation mechanism > in the kernel that gives a memory barrier without a compiler barrier. Yes, if a given synchronization mechanism requires that memory references need to be ordered, both the compiler and the CPU must maintain that ordering. > > 2. These rules don't say exactly where these full memory barriers > > go. SRCU is at one extreme, placing those full barriers in > > srcu_read_lock() and srcu_read_unlock(), and !PREEMPT Tree RCU > > at the other, placing these barriers entirely within the callback > > queueing/invocation, grace-period computation, and the scheduler. > > Preemptible Tree RCU is in the middle, with rcu_read_unlock() > > sometimes including a full memory barrier, but other times with > > the full memory barrier being confined as it is with !PREEMPT > > Tree RCU. > > The rules do say that the (full) memory barrier must precede any > RCU read-side that occur after the synchronize_rcu and after the > end of any RCU read-side that occur before the synchronize_rcu. > > All I'm arguing is that wherever that full mb is, as long as it > also carries with it a barrier() (which it must do if it's done > using an existing kernel mb/locking primitive), then we're fine. Fair enough, and smp_mb() does provide what is needed. > > Interleaving and inserting full memory barriers as per the rules above: > > > > CPU1: WRITE_ONCE(a, 1) > > CPU1: synchronize_rcu > > /* Could put a full memory barrier here, but it wouldn't help. */ > > CPU1: smp_mb(); > CPU2: smp_mb(); What is CPU2's smp_mb() ordering? In other words, what comment would you put on each of the above smp_mb() calls? > Let's put them in because I think they are critical. 
smp_mb() also > carries with it a barrier(). Again, agreed, smp_mb() implies barrier(). > > CPU2: rcu_read_lock(); > > CPU1: b = 2; > > CPU2: if (READ_ONCE(a) == 0) > > CPU2: if (b != 1) /* Weakly ordered CPU moved this up! */ > > CPU2: b = 1; > > CPU2: rcu_read_unlock > > > > In fact, CPU2's load from b might be moved up to race with CPU1's store, > > which (I believe) is why the model complains in this case. > > Let's put aside my doubt over how we're even allowing a compiler > to turn > > b = 1 > > into > > if (b != 1) > b = 1 > > Since you seem to be assuming that (a == 0) is true in this case > (as the assignment b = 1 is carried out), then because of the > presence of the full memory barrier, the RCU read-side section > must have started prior to the synchronize_rcu. This means that > synchronize_rcu is not allowed to return until at least the end > of the grace period, or at least until the end of rcu_read_unlock. > > So it actually should be: > > CPU1: WRITE_ONCE(a, 1) > CPU1: synchronize_rcu called > /* Could put a full memory barrier here, but it wouldn't help. */ > > CPU1: smp_mb(); > CPU2: smp_mb(); > > CPU2: grace period starts > ...time passes... > CPU2: rcu_read_lock(); > CPU2: if (READ_ONCE(a) == 0) > CPU2: if (b != 1) /* Weakly ordered CPU moved this up! */ > CPU2: b = 1; > CPU2: rcu_read_unlock > ...time passes... > CPU2: grace period ends > > /* This full memory barrier is also guaranteed by RCU. */ > CPU2: smp_mb(); But in this case, given that there are no more statements for CPU2, what is this smp_mb() ordering? Thanx, Paul > CPU1 synchronize_rcu returns > CPU1: b = 2; > > Cheers, > -- > Email: Herbert Xu > Home Page: http://gondor.apana.org.au/~herbert/ > PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt >
Re: rcu_read_lock lost its compiler barrier
On Mon, Jun 03, 2019 at 11:03:24AM +0800, Herbert Xu wrote: > On Sun, Jun 02, 2019 at 05:06:17PM -0700, Paul E. McKenney wrote: > > > > Please note that preemptible Tree RCU has lacked the compiler barrier on > > all but the outermost rcu_read_unlock() for years before Boqun's patch. > > Actually this is not true. Boqun's patch (commit bb73c52bad36) does > not add a barrier() to __rcu_read_lock. In fact I dug into the git > history and this compiler barrier() has existed in preemptible tree > RCU since the very start in 2009: I said rcu_read_unlock() and you said __rcu_read_lock(). > : commit f41d911f8c49a5d65c86504c19e8204bb605c4fd > : Author: Paul E. McKenney > : Date: Sat Aug 22 13:56:52 2009 -0700 > : > : rcu: Merge preemptable-RCU functionality into hierarchical RCU > : > : +/* > : + * Tree-preemptable RCU implementation for rcu_read_lock(). > : + * Just increment ->rcu_read_lock_nesting, shared state will be updated > : + * if we block. > : + */ > : +void __rcu_read_lock(void) > : +{ > : + ACCESS_ONCE(current->rcu_read_lock_nesting)++; > : + barrier(); /* needed if we ever invoke rcu_read_lock in rcutree.c > */ > : +} > : +EXPORT_SYMBOL_GPL(__rcu_read_lock); Thank you for finding this! This particular version does have an unconditional barrier() in __rcu_read_unlock(), for whatever that is worth: +void __rcu_read_unlock(void) +{ + struct task_struct *t = current; + + barrier(); /* needed if we ever invoke rcu_read_unlock in rcutree.c */ + if (--ACCESS_ONCE(t->rcu_read_lock_nesting) == 0 && + unlikely(ACCESS_ONCE(t->rcu_read_unlock_special))) + rcu_read_unlock_special(t); +} I would not have seen the point of a compiler barrier in the non-outermost __rcu_read_unlock(), since the completion of an inner __rcu_read_unlock() does not permit the grace period to complete. > However, you are correct that in the non-preempt tree RCU case, > the compiler barrier in __rcu_read_lock was not always present. 
> In fact it was added by: > > : commit 386afc91144b36b42117b0092893f15bc8798a80 > : Author: Linus Torvalds > : Date: Tue Apr 9 10:48:33 2013 -0700 > : > : spinlocks and preemption points need to be at least compiler barriers > > I suspect this is what prompted you to remove it in 2015. If I remember correctly, it was pointed out to me that in !PREEMPT kernels, the compiler barrier in the preempt_disable() invoked in rcu_read_lock() (and similar on the rcu_read_unlock() side) wasn't helping anything. > > I do not believe that reverting that patch will help you at all. > > > > But who knows? So please point me at the full code body that was being > > debated earlier on this thread. It will no doubt take me quite a while to > > dig through it, given my being on the road for the next couple of weeks, > > but so it goes. > > Please refer to my response to Linus for the code in question. > > In any case, I am now even more certain that compiler barriers are > not needed in the code in question. The reasoning is quite simple. > If you need those compiler barriers then you surely need real memory > barriers. OK, we are in agreement on that point, then! > Vice versa, if real memory barriers are already present thanks to > RCU, then you don't need those compiler barriers. For users of RCU, this seems reasonable. On the other hand, the compiler barriers in PREEMPT Tree RCU's outermost __rcu_read_lock() and __rcu_read_unlock() invocations really are needed by RCU internals. This is because RCU makes use of interrupt handlers that access per-task and per-CPU variables, and these need to be able to sense the edges of the nested set of RCU read-side critical sections. It is OK for these interrupt handlers to think that the critical section is larger than it really is, but fatal for them to think that the critical sections are smaller than they really are.
> In fact this calls into question the use of READ_ONCE/WRITE_ONCE in > RCU primitives such as rcu_dereference and rcu_assign_pointer. No, these are -not- called into question, or if they are, the question gets quickly answered in a way that supports current Linux-kernel code. As mentioned in earlier emails, the traditional uses of RCU that involve rcu_dereference(), rcu_assign_pointer(), and synchronize_rcu() all work just fine. In fact, from what I can see, the issue stems from having developed intuitions from working with the traditional rcu_dereference(), rcu_assign_pointer(), and synchronize_rcu() linked-structure use cases, and then attempting to apply these intuitions to use cases that have neither rcu_dereference() nor rcu_assign_pointer(). Don't get me wrong, it is only natural to try to extend your intuitions to something that admittedly looks pretty similar to the traditional use cases. But this is one of those cases where "pretty similar" might not be good enough. > IIRC > when RCU was first added to the
Re: rcu_read_lock lost its compiler barrier
On Mon, Jun 03, 2019 at 12:23:39AM -0700, Paul E. McKenney wrote: > On Mon, Jun 03, 2019 at 12:01:14PM +0800, Herbert Xu wrote: > > On Sun, Jun 02, 2019 at 08:47:07PM -0700, Paul E. McKenney wrote: > > > > > > CPU2: if (b != 1) > > > CPU2: b = 1; > > > > Stop right there. The kernel is full of code that assumes that > > assignment to an int/long is atomic. If your compiler breaks this > > assumption that we can kiss the kernel good-bye. > > Here you go: > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55981 > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56028 > > TL;DR: On x86, if you are doing a plain store of a 32-bit constant > that has bits set only in the lower few bits of each 16-bit half of > that constant, the compiler is plenty happy to use a pair of 16-bit > store-immediate instructions to carry out that store. This is also > known as "store tearing". > > The two bugs were filed (and after some back and forth, fixed) because > someone forgot to exclude C11 atomics and volatile accesses from this > store tearing. I should hasten to add that I have not seen load tearing, nor have I seen store tearing when storing a value unknown to the compiler. However, plain C-language loads and stores can be invented, fused, and a few other "interesting" optimizations can be applied. On kissing the kernel goodbye, a reasonable strategy might be to identify the transformations that are actually occurring (like the stores of certain constants called out above) and fix those. We do occasionally use READ_ONCE() to prevent load-fusing optimizations that would otherwise cause the compiler to turn while-loops into if-statements guarding infinite loops. There is also the possibility of having the compiler guys give us more command-line arguments. Thanx, Paul
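[Editor's note: the store-tearing and load-fusing hazards described above are exactly what the kernel's READ_ONCE()/WRITE_ONCE() volatile casts rule out. Below is a rough user-space sketch of the scalar case only — the real macros in include/linux/compiler.h are more elaborate and handle multiple access sizes — included here purely as illustration.]

```c
#include <assert.h>

/*
 * User-space sketch of the scalar-case READ_ONCE()/WRITE_ONCE() idiom.
 * The volatile cast obliges the compiler to emit exactly one
 * full-width access: no pair of 16-bit store-immediates (store
 * tearing), and no merging of repeated loads across a loop (load
 * fusing).
 */
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))
#define READ_ONCE(x)       (*(volatile __typeof__(x) *)&(x))

static int shared;

/*
 * A 32-bit constant with bits set only in the low bits of each
 * 16-bit half -- the pattern that GCC bugs 55981/56028 showed being
 * torn into two 16-bit store-immediate instructions when stored
 * plainly on x86.  The volatile access forbids that transformation.
 */
static void publish_flag(void)
{
	WRITE_ONCE(shared, 0x00010002);
}

static int read_flag(void)
{
	return READ_ONCE(shared);
}
```

In single-threaded code the observable result is identical either way; the difference is only in which machine code the compiler is permitted to emit.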
Re: rcu_read_lock lost its compiler barrier
On Mon, Jun 03, 2019 at 12:01:14PM +0800, Herbert Xu wrote: > On Sun, Jun 02, 2019 at 08:47:07PM -0700, Paul E. McKenney wrote: > > > > CPU2: if (b != 1) > > CPU2: b = 1; > > Stop right there. The kernel is full of code that assumes that > assignment to an int/long is atomic. If your compiler breaks this > assumption that we can kiss the kernel good-bye. Here you go: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55981 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56028 TL;DR: On x86, if you are doing a plain store of a 32-bit constant that has bits set only in the lower few bits of each 16-bit half of that constant, the compiler is plenty happy to use a pair of 16-bit store-immediate instructions to carry out that store. This is also known as "store tearing". The two bugs were filed (and after some back and forth, fixed) because someone forgot to exclude C11 atomics and volatile accesses from this store tearing. Thanx, Paul
Re: rcu_read_lock lost its compiler barrier
On Mon, Jun 03, 2019 at 01:26:26PM +0800, Herbert Xu wrote: > On Sun, Jun 02, 2019 at 08:47:07PM -0700, Paul E. McKenney wrote: > > > > 1. These guarantees are of full memory barriers, -not- compiler > > barriers. > > What I'm saying is that wherever they are, they must come with > compiler barriers. I'm not aware of any synchronisation mechanism > in the kernel that gives a memory barrier without a compiler barrier. > > > 2. These rules don't say exactly where these full memory barriers > > go. SRCU is at one extreme, placing those full barriers in > > srcu_read_lock() and srcu_read_unlock(), and !PREEMPT Tree RCU > > at the other, placing these barriers entirely within the callback > > queueing/invocation, grace-period computation, and the scheduler. > > Preemptible Tree RCU is in the middle, with rcu_read_unlock() > > sometimes including a full memory barrier, but other times with > > the full memory barrier being confined as it is with !PREEMPT > > Tree RCU. > > The rules do say that the (full) memory barrier must precede any > RCU read-side that occur after the synchronize_rcu and after the > end of any RCU read-side that occur before the synchronize_rcu. > > All I'm arguing is that wherever that full mb is, as long as it > also carries with it a barrier() (which it must do if it's done > using an existing kernel mb/locking primitive), then we're fine. > > > Interleaving and inserting full memory barriers as per the rules above: > > > > CPU1: WRITE_ONCE(a, 1) > > CPU1: synchronize_rcu > > /* Could put a full memory barrier here, but it wouldn't help. */ > > CPU1: smp_mb(); > CPU2: smp_mb(); > > Let's put them in because I think they are critical. smp_mb() also > carries with it a barrier(). > > > CPU2: rcu_read_lock(); > > CPU1: b = 2; > > CPU2: if (READ_ONCE(a) == 0) > > CPU2: if (b != 1) /* Weakly ordered CPU moved this up! 
*/ > > CPU2: b = 1; > > CPU2: rcu_read_unlock > > > > In fact, CPU2's load from b might be moved up to race with CPU1's store, > > which (I believe) is why the model complains in this case. > > Let's put aside my doubt over how we're even allowing a compiler > to turn > > b = 1 > > into > > if (b != 1) > b = 1 > > Since you seem to be assuming that (a == 0) is true in this case I think Paul's example assumes (a == 0) is false, and maybe speculative writes (by compilers) need to be added into consideration? Please consider the following case (I added a few smp_mb()s); the case may be a little bit crazy, you have been warned ;-)

CPU1: WRITE_ONCE(a, 1)
CPU1: synchronize_rcu called
CPU1: smp_mb(); /* let's assume there is one here */

CPU2: rcu_read_lock();
CPU2: smp_mb(); /* let's assume there is one here */

/* "if (b != 1) b = 1" reordered */
CPU2: r0 = b; /* if (b != 1) reordered here, r0 == 0 */
CPU2: if (r0 != 1) /* true */
CPU2:     b = 1; /* b == 1 now, this is a speculative write by the compiler */

CPU1: b = 2; /* b == 2 */

CPU2: if (READ_ONCE(a) == 0) /* false */
CPU2:     ...
CPU2: else /* undo the speculative write */
CPU2:     b = r0; /* b == 0 */
CPU2: smp_mb();
CPU2: rcu_read_unlock();

I know it is too crazy for us to think a compiler would work like this, but this might be the reason why the model complains about this. Paul, did I get this right? Or did you mean something else? Regards, Boqun > (as the assignment b = 1 is carried out), then because of the > presence of the full memory barrier, the RCU read-side section > must have started prior to the synchronize_rcu. This means that > synchronize_rcu is not allowed to return until at least the end > of the grace period, or at least until the end of rcu_read_unlock. > > So it actually should be: > > CPU1: WRITE_ONCE(a, 1) > CPU1: synchronize_rcu called > /* Could put a full memory barrier here, but it wouldn't help. */ > > CPU1: smp_mb(); > CPU2: smp_mb(); > > CPU2: grace period starts > ...time passes...
> CPU2: rcu_read_lock(); > CPU2: if (READ_ONCE(a) == 0) > CPU2: if (b != 1) /* Weakly ordered CPU moved this up! */ > CPU2: b = 1; > CPU2: rcu_read_unlock > ...time passes... > CPU2: grace period ends > > /* This full memory barrier is also guaranteed by RCU. */ > CPU2: smp_mb(); > > CPU1 synchronize_rcu returns > CPU1: b = 2; > > Cheers, > -- > Email: Herbert Xu > Home Page: http://gondor.apana.org.au/~herbert/ > PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: rcu_read_lock lost its compiler barrier
On Sun, Jun 02, 2019 at 08:47:07PM -0700, Paul E. McKenney wrote: > > 1. These guarantees are of full memory barriers, -not- compiler > barriers. What I'm saying is that wherever they are, they must come with compiler barriers. I'm not aware of any synchronisation mechanism in the kernel that gives a memory barrier without a compiler barrier. > 2. These rules don't say exactly where these full memory barriers > go. SRCU is at one extreme, placing those full barriers in > srcu_read_lock() and srcu_read_unlock(), and !PREEMPT Tree RCU > at the other, placing these barriers entirely within the callback > queueing/invocation, grace-period computation, and the scheduler. > Preemptible Tree RCU is in the middle, with rcu_read_unlock() > sometimes including a full memory barrier, but other times with > the full memory barrier being confined as it is with !PREEMPT > Tree RCU. The rules do say that the (full) memory barrier must precede any RCU read-side critical section that occurs after the synchronize_rcu, and must follow the end of any RCU read-side critical section that occurs before the synchronize_rcu. All I'm arguing is that wherever that full mb is, as long as it also carries with it a barrier() (which it must do if it's done using an existing kernel mb/locking primitive), then we're fine. > Interleaving and inserting full memory barriers as per the rules above: > > CPU1: WRITE_ONCE(a, 1) > CPU1: synchronize_rcu > /* Could put a full memory barrier here, but it wouldn't help. */ CPU1: smp_mb(); CPU2: smp_mb(); Let's put them in because I think they are critical. smp_mb() also carries with it a barrier(). > CPU2: rcu_read_lock(); > CPU1: b = 2; > CPU2: if (READ_ONCE(a) == 0) > CPU2: if (b != 1) /* Weakly ordered CPU moved this up! */ > CPU2: b = 1; > CPU2: rcu_read_unlock > > In fact, CPU2's load from b might be moved up to race with CPU1's store, > which (I believe) is why the model complains in this case.
Let's put aside my doubt over how we're even allowing a compiler to turn

	b = 1

into

	if (b != 1)
		b = 1

Since you seem to be assuming that (a == 0) is true in this case (as the assignment b = 1 is carried out), then because of the presence of the full memory barrier, the RCU read-side section must have started prior to the synchronize_rcu. This means that synchronize_rcu is not allowed to return until at least the end of the grace period, or at least until the end of rcu_read_unlock. So it actually should be:

CPU1: WRITE_ONCE(a, 1)
CPU1: synchronize_rcu called
/* Could put a full memory barrier here, but it wouldn't help. */

CPU1: smp_mb();
CPU2: smp_mb();

CPU2: grace period starts
...time passes...
CPU2: rcu_read_lock();
CPU2: if (READ_ONCE(a) == 0)
CPU2: if (b != 1) /* Weakly ordered CPU moved this up! */
CPU2: b = 1;
CPU2: rcu_read_unlock
...time passes...
CPU2: grace period ends

/* This full memory barrier is also guaranteed by RCU. */
CPU2: smp_mb();

CPU1: synchronize_rcu returns
CPU1: b = 2;

Cheers, -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: rcu_read_lock lost its compiler barrier
On Mon, Jun 03, 2019 at 12:01:14PM +0800, Herbert Xu wrote: > On Sun, Jun 02, 2019 at 08:47:07PM -0700, Paul E. McKenney wrote: > > > > CPU2: if (b != 1) > > CPU2: b = 1; > > Stop right there. The kernel is full of code that assumes that > assignment to an int/long is atomic. If your compiler breaks this > assumption that we can kiss the kernel good-bye. The slippery slope apparently started here: : commit ea435467500612636f8f4fb639ff6e76b2496e4b : Author: Matthew Wilcox : Date: Tue Jan 6 14:40:39 2009 -0800 : : atomic_t: unify all arch definitions : : diff --git a/arch/x86/include/asm/atomic_32.h b/arch/x86/include/asm/atomic_32.h : index ad5b9f6ecddf..85b46fba4229 100644 : --- a/arch/x86/include/asm/atomic_32.h : +++ b/arch/x86/include/asm/atomic_32.h : ... : @@ -10,15 +11,6 @@ : * resource counting etc.. : */ : : -/* : - * Make sure gcc doesn't try to be clever and move things around : - * on us. We need to use _exactly_ the address the user gave us, : - * not some alias that contains the same information. : - */ : -typedef struct { : - int counter; : -} atomic_t; : : diff --git a/include/linux/types.h b/include/linux/types.h : index 121f349cb7ec..3b864f2d9560 100644 : --- a/include/linux/types.h : +++ b/include/linux/types.h : @@ -195,6 +195,16 @@ typedef u32 phys_addr_t; : : typedef phys_addr_t resource_size_t; : : +typedef struct { : + volatile int counter; : +} atomic_t; : + Before evolving into the READ_ONCE/WRITE_ONCE that we have now. Linus, are we now really supporting a compiler where an assignment (or a read) from an int/long/pointer can be non-atomic without the volatile marker? Because if that's the case then we have a lot of code to audit. Cheers, -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: rcu_read_lock lost its compiler barrier
On Sun, Jun 02, 2019 at 08:47:07PM -0700, Paul E. McKenney wrote: > > CPU2: if (b != 1) > CPU2: b = 1; Stop right there. The kernel is full of code that assumes that assignment to an int/long is atomic. If your compiler breaks this assumption then we can kiss the kernel good-bye. Cheers, -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: rcu_read_lock lost its compiler barrier
On Mon, Jun 03, 2019 at 10:46:40AM +0800, Herbert Xu wrote: > On Sun, Jun 02, 2019 at 01:54:12PM -0700, Linus Torvalds wrote: > > On Sat, Jun 1, 2019 at 10:56 PM Herbert Xu > > wrote: > > > > > > You can't then go and decide to remove the compiler barrier! To do > > > that you'd need to audit every single use of rcu_read_lock in the > > > kernel to ensure that they're not depending on the compiler barrier. > > > > What's the possible case where it would matter when there is no preemption? > > The case we were discussing is from net/ipv4/inet_fragment.c from > the net-next tree: > > void fqdir_exit(struct fqdir *fqdir) > { > ... > fqdir->dead = true; > > /* call_rcu is supposed to provide memory barrier semantics, >* separating the setting of fqdir->dead with the destruction >* work. This implicit barrier is paired with inet_frag_kill(). >*/ > > INIT_RCU_WORK(&fqdir->destroy_rwork, fqdir_rwork_fn); > queue_rcu_work(system_wq, &fqdir->destroy_rwork); > } > > and > > void inet_frag_kill(struct inet_frag_queue *fq) > { > ... > rcu_read_lock(); > /* The RCU read lock provides a memory barrier >* guaranteeing that if fqdir->dead is false then >* the hash table destruction will not start until >* after we unlock. Paired with inet_frags_exit_net(). >*/ > if (!fqdir->dead) { > rhashtable_remove_fast(&fqdir->rhashtable, &fq->node, > fqdir->f->rhash_params); > ... > } > ... > rcu_read_unlock(); > ... > } > > I simplified this to > > Initial values: > > a = 0 > b = 0 > > CPU1 CPU2 > > a = 1 rcu_read_lock > synchronize_rcu if (a == 0) > b = 2 b = 1 > rcu_read_unlock > > On exit we want this to be true: > b == 2 > > Now what Paul was telling me is that unless every memory operation > is done with READ_ONCE/WRITE_ONCE then his memory model shows that > the exit constraint won't hold. But please note that the plain-variable portion of the memory model is very new and likely still has a bug or two. In fact, see below. 
> IOW, we need > > CPU1 CPU2 > > WRITE_ONCE(a, 1) rcu_read_lock > synchronize_rcu if (READ_ONCE(a) == 0) > WRITE_ONCE(b, 2) WRITE_ONCE(b, 1) > rcu_read_unlock > > Now I think this is bullshit because if we really needed these compiler > barriers then we surely would need real memory barriers to go with > them. On the one hand, you have no code before your rcu_read_lock() and also no code after your rcu_read_unlock(). So in this particular example, adding compiler barriers to these guys won't help you. On the other hand, on CPU 1's write to "b", I agree with you and disagree with the model, though perhaps my partners in LKMM crime will show me the error of my ways on this point. On CPU 2's write to "b", I can see the memory model's point, but getting there requires some gymnastics on the part of both the compiler and the CPU. The WRITE_ONCE() and READ_ONCE() for "a" are the normal requirement for variables that are concurrently loaded and stored. Please note that garden-variety uses of RCU have similar requirements, namely the rcu_assign_pointer() on the one side and the rcu_dereference() on the other. Your use case allows rcu_assign_pointer() to be weakened to WRITE_ONCE() and rcu_dereference() to be weakened to READ_ONCE() (not that this last is all that much of a weakening these days). > In fact, the sole purpose of the RCU mechanism is to provide those > memory barriers. Quoting from > Documentation/RCU/Design/Requirements/Requirements.html: > > Each CPU that has an RCU read-side critical section that > begins before synchronize_rcu() starts is > guaranteed to execute a full memory barrier between the time > that the RCU read-side critical section ends and the time that > synchronize_rcu() returns. > Without this guarantee, a pre-existing RCU read-side critical section > might hold a reference to the newly removed struct foo > after the kfree() on line 14 of > remove_gp_synchronous().
> Each CPU that has an RCU read-side critical section that ends > after synchronize_rcu() returns is guaranteed > to execute a full memory barrier between the time that > synchronize_rcu() begins and the time that the RCU > read-side critical section begins. > Without this guarantee, a later RCU read-side critical section > running after the kfree() on line
Re: rcu_read_lock lost its compiler barrier
On Sun, Jun 02, 2019 at 05:06:17PM -0700, Paul E. McKenney wrote: > > Please note that preemptible Tree RCU has lacked the compiler barrier on > all but the outermost rcu_read_unlock() for years before Boqun's patch. Actually this is not true. Boqun's patch (commit bb73c52bad36) does not add a barrier() to __rcu_read_lock. In fact I dug into the git history and this compiler barrier() has existed in preemptible tree RCU since the very start in 2009: : commit f41d911f8c49a5d65c86504c19e8204bb605c4fd : Author: Paul E. McKenney : Date: Sat Aug 22 13:56:52 2009 -0700 : : rcu: Merge preemptable-RCU functionality into hierarchical RCU : : +/* : + * Tree-preemptable RCU implementation for rcu_read_lock(). : + * Just increment ->rcu_read_lock_nesting, shared state will be updated : + * if we block. : + */ : +void __rcu_read_lock(void) : +{ : + ACCESS_ONCE(current->rcu_read_lock_nesting)++; : + barrier(); /* needed if we ever invoke rcu_read_lock in rcutree.c */ : +} : +EXPORT_SYMBOL_GPL(__rcu_read_lock); However, you are correct that in the non-preempt tree RCU case, the compiler barrier in __rcu_read_lock was not always present. In fact it was added by: : commit 386afc91144b36b42117b0092893f15bc8798a80 : Author: Linus Torvalds : Date: Tue Apr 9 10:48:33 2013 -0700 : : spinlocks and preemption points need to be at least compiler barriers I suspect this is what prompted you to remove it in 2015. > I do not believe that reverting that patch will help you at all. > > But who knows? So please point me at the full code body that was being > debated earlier on this thread. It will no doubt take me quite a while to > dig through it, given my being on the road for the next couple of weeks, > but so it goes. Please refer to my response to Linus for the code in question. In any case, I am now even more certain that compiler barriers are not needed in the code in question. The reasoning is quite simple. If you need those compiler barriers then you surely need real memory barriers. 
Vice versa, if real memory barriers are already present thanks to RCU, then you don't need those compiler barriers. In fact this calls into question the use of READ_ONCE/WRITE_ONCE in RCU primitives such as rcu_dereference and rcu_assign_pointer. IIRC when RCU was first added to the Linux kernel we did not have compiler barriers in rcu_dereference and rcu_assign_pointer. They were added later on. As compiler barriers per se are useless, these are surely meant to be coupled with the memory barriers provided by RCU grace periods and synchronize_rcu. But then those real memory barriers would have compiler barriers too. So why do we need the compiler barriers in rcu_dereference and rcu_assign_pointer? Cheers, -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
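[Editor's note: one way to see what the marked accesses in rcu_assign_pointer() and rcu_dereference() buy is to model them in C11 terms: publication is a release store, and subscription is a dependency-ordered load. The sketch below is an illustrative user-space analogue, not the kernel implementation, which builds these from smp_store_release() and a marked load plus an address dependency.]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct foo {
	int value;
};

static _Atomic(struct foo *) gp;

/*
 * Analogue of rcu_assign_pointer(): the release store orders the
 * initialization of *p before the publication of p, so a reader that
 * sees the new pointer is guaranteed to see the initialized fields.
 */
static void publish(struct foo *p)
{
	atomic_store_explicit(&gp, p, memory_order_release);
}

/*
 * Analogue of rcu_dereference(): the kernel relies on a marked load
 * plus the reader's address dependency.  memory_order_consume is the
 * C11 name for that, but compilers promote it to acquire, which is
 * what this sketch uses.
 */
static struct foo *subscribe(void)
{
	return atomic_load_explicit(&gp, memory_order_acquire);
}
```

The point relevant to the thread: even in this weakened model, the loads and stores of the shared pointer itself are marked (atomic/volatile), which is precisely the role READ_ONCE()/WRITE_ONCE() play in the kernel primitives.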
Re: rcu_read_lock lost its compiler barrier
On Sun, Jun 02, 2019 at 01:54:12PM -0700, Linus Torvalds wrote: > On Sat, Jun 1, 2019 at 10:56 PM Herbert Xu > wrote: > > > > You can't then go and decide to remove the compiler barrier! To do > > that you'd need to audit every single use of rcu_read_lock in the > > kernel to ensure that they're not depending on the compiler barrier. > > What's the possible case where it would matter when there is no preemption? The case we were discussing is from net/ipv4/inet_fragment.c from the net-next tree: void fqdir_exit(struct fqdir *fqdir) { ... fqdir->dead = true; /* call_rcu is supposed to provide memory barrier semantics, * separating the setting of fqdir->dead with the destruction * work. This implicit barrier is paired with inet_frag_kill(). */ INIT_RCU_WORK(&fqdir->destroy_rwork, fqdir_rwork_fn); queue_rcu_work(system_wq, &fqdir->destroy_rwork); } and void inet_frag_kill(struct inet_frag_queue *fq) { ... rcu_read_lock(); /* The RCU read lock provides a memory barrier * guaranteeing that if fqdir->dead is false then * the hash table destruction will not start until * after we unlock. Paired with inet_frags_exit_net(). */ if (!fqdir->dead) { rhashtable_remove_fast(&fqdir->rhashtable, &fq->node, fqdir->f->rhash_params); ... } ... rcu_read_unlock(); ... } I simplified this to Initial values: a = 0 b = 0 CPU1CPU2 a = 1 rcu_read_lock synchronize_rcu if (a == 0) b = 2 b = 1 rcu_read_unlock On exit we want this to be true: b == 2 Now what Paul was telling me is that unless every memory operation is done with READ_ONCE/WRITE_ONCE then his memory model shows that the exit constraint won't hold. IOW, we need CPU1CPU2 WRITE_ONCE(a, 1)rcu_read_lock synchronize_rcu if (READ_ONCE(a) == 0) WRITE_ONCE(b, 2)WRITE_ONCE(b, 1) rcu_read_unlock Now I think this bullshit because if we really needed these compiler barriers then we surely would need real memory barriers to go with them. In fact, the sole purpose of the RCU mechanism is to provide those memory barriers. 
Quoting from Documentation/RCU/Design/Requirements/Requirements.html: Each CPU that has an RCU read-side critical section that begins before synchronize_rcu() starts is guaranteed to execute a full memory barrier between the time that the RCU read-side critical section ends and the time that synchronize_rcu() returns. Without this guarantee, a pre-existing RCU read-side critical section might hold a reference to the newly removed struct foo after the kfree() on line 14 of remove_gp_synchronous(). Each CPU that has an RCU read-side critical section that ends after synchronize_rcu() returns is guaranteed to execute a full memory barrier between the time that synchronize_rcu() begins and the time that the RCU read-side critical section begins. Without this guarantee, a later RCU read-side critical section running after the kfree() on line 14 of remove_gp_synchronous() might later run do_something_gp() and find the newly deleted struct foo. My review of the RCU code shows that these memory barriers are indeed present (at least when we're not in tiny mode where all this discussion would be moot anyway). For example, in call_rcu we eventually get down to rcu_segcblist_enqueue which has an smp_mb. On the reader side (correct me if I'm wrong Paul) the memory barrier is implicitly coming from the scheduler. My point is that within our kernel whenever we have a CPU memory barrier we always have a compiler barrier too. Therefore my code example above does not need any extra compiler barriers such as the ones provided by READ_ONCE/WRITE_ONCE. I think Paul was perhaps thinking that I'm expecting rcu_read_lock/rcu_read_unlock themselves to provide the memory or compiler barriers. That would indeed be wrong but this is not what I need. All I need is the RCU semantics as documented for there to be memory and compiler barriers around the whole grace period.
Cheers, -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
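[Editor's note: Herbert's premise — that every kernel memory-barrier primitive also acts as a compiler barrier — follows from how such primitives are typically defined: the asm "memory" clobber (or the equivalent compiler builtin) is itself a compiler barrier. A user-space sketch under that assumption, using GCC-style builtins rather than the actual per-architecture kernel definitions:]

```c
#include <assert.h>

/*
 * barrier(): pure compiler barrier -- emits no instruction, but the
 * "memory" clobber forces the compiler to flush to memory, and later
 * reload, any values it was caching in registers across this point.
 */
#define barrier() __asm__ __volatile__("" ::: "memory")

/*
 * smp_mb() sketch: __sync_synchronize() emits a full hardware memory
 * barrier and, as a side effect, is also a full compiler barrier --
 * which is why any smp_mb() "carries with it a barrier()".
 */
#define smp_mb() __sync_synchronize()

static int x, y;

/* The barrier orders the store to x before the store to y, both at
 * the compiler level and at the CPU level. */
static int ordered_stores(void)
{
	x = 1;
	smp_mb();
	y = 1;
	return x + y;
}
```

The converse does not hold: a bare barrier() constrains only the compiler, which is exactly the distinction Paul draws between compiler barriers and the full memory barriers guaranteed around a grace period.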
Re: rcu_read_lock lost its compiler barrier
On Sun, Jun 02, 2019 at 01:56:07PM +0800, Herbert Xu wrote: > Digging up an old email because I was not aware of this previously > but Paul pointed me to it during another discussion. > > On Mon, Sep 21, 2015 at 01:43:27PM -0700, Paul E. McKenney wrote: > > On Mon, Sep 21, 2015 at 09:30:49PM +0200, Frederic Weisbecker wrote: > > > > > > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h > > > > index d63bb77..6c3cece 100644 > > > > --- a/include/linux/rcupdate.h > > > > +++ b/include/linux/rcupdate.h > > > > @@ -297,12 +297,14 @@ void synchronize_rcu(void); > > > > > > > > static inline void __rcu_read_lock(void) > > > > { > > > > - preempt_disable(); > > > > + if (IS_ENABLED(CONFIG_PREEMPT_COUNT)) > > > > + preempt_disable(); > > > > > > preempt_disable() is a no-op when !CONFIG_PREEMPT_COUNT, right? > > > Or rather it's a barrier(), which is anyway implied by rcu_read_lock(). > > > > > > So perhaps we can get rid of the IS_ENABLED() check? > > > > Actually, barrier() is not intended to be implied by rcu_read_lock(). > > In a non-preemptible RCU implementation, it doesn't help anything > > to have the compiler flush its temporaries upon rcu_read_lock() > > and rcu_read_unlock(). > > This is seriously broken. RCU has been around for years and is > used throughout the kernel while the compiler barrier existed. Please note that preemptible Tree RCU has lacked the compiler barrier on all but the outermost rcu_read_unlock() for years before Boqun's patch. So exactly where in the code that we are currently discussing are you relying on compiler barriers in either rcu_read_lock() or rcu_read_unlock()? The grace-period guarantee allows the compiler ordering to be either in the readers (SMP&&PREEMPT), in the grace-period mechanism (SMP&&!PREEMPT), or both (SRCU). > You can't then go and decide to remove the compiler barrier! 
To do > that you'd need to audit every single use of rcu_read_lock in the > kernel to ensure that they're not depending on the compiler barrier. > > This is also contrary to the definition of almost every other > *_lock primitive in the kernel where the compiler barrier is > included. > > So please revert this patch. I do not believe that reverting that patch will help you at all. But who knows? So please point me at the full code body that was being debated earlier on this thread. It will no doubt take me quite a while to dig through it, given my being on the road for the next couple of weeks, but so it goes. Thanx, Paul
Re: rcu_read_lock lost its compiler barrier
On Sat, Jun 1, 2019 at 10:56 PM Herbert Xu wrote: > > You can't then go and decide to remove the compiler barrier! To do > that you'd need to audit every single use of rcu_read_lock in the > kernel to ensure that they're not depending on the compiler barrier. What's the possible case where it would matter when there is no preemption? Linus
rcu_read_lock lost its compiler barrier
Digging up an old email because I was not aware of this previously but Paul pointed me to it during another discussion. On Mon, Sep 21, 2015 at 01:43:27PM -0700, Paul E. McKenney wrote: > On Mon, Sep 21, 2015 at 09:30:49PM +0200, Frederic Weisbecker wrote: > > > > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h > > > index d63bb77..6c3cece 100644 > > > --- a/include/linux/rcupdate.h > > > +++ b/include/linux/rcupdate.h > > > @@ -297,12 +297,14 @@ void synchronize_rcu(void); > > > > > > static inline void __rcu_read_lock(void) > > > { > > > - preempt_disable(); > > > + if (IS_ENABLED(CONFIG_PREEMPT_COUNT)) > > > + preempt_disable(); > > > > preempt_disable() is a no-op when !CONFIG_PREEMPT_COUNT, right? > > Or rather it's a barrier(), which is anyway implied by rcu_read_lock(). > > > > So perhaps we can get rid of the IS_ENABLED() check? > > Actually, barrier() is not intended to be implied by rcu_read_lock(). > In a non-preemptible RCU implementation, it doesn't help anything > to have the compiler flush its temporaries upon rcu_read_lock() > and rcu_read_unlock(). This is seriously broken. RCU has been around for years and is used throughout the kernel while the compiler barrier existed. You can't then go and decide to remove the compiler barrier! To do that you'd need to audit every single use of rcu_read_lock in the kernel to ensure that they're not depending on the compiler barrier. This is also contrary to the definition of almost every other *_lock primitive in the kernel where the compiler barrier is included. So please revert this patch. Thanks, -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
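[Editor's note: for reference, the shape of the mapping being debated in the quoted patch: with CONFIG_PREEMPT_COUNT, preempt_disable() bumps a per-task counter and is a compiler barrier; without it, preempt_disable() historically degenerated to a bare barrier(), and that residual barrier() is what the patch stops implying in rcu_read_lock(). A rough user-space sketch under those assumptions — the names and the config knob here are illustrative, not the kernel's actual code:]

```c
#include <assert.h>

#define barrier() __asm__ __volatile__("" ::: "memory")

/* Illustrative stand-in for CONFIG_PREEMPT_COUNT. */
#define PREEMPT_COUNT 1

static int preempt_count_var;

static void preempt_disable(void)
{
#if PREEMPT_COUNT
	preempt_count_var++;	/* track nesting depth */
#endif
	barrier();		/* historically at least a compiler barrier */
}

static void preempt_enable(void)
{
	barrier();
#if PREEMPT_COUNT
	preempt_count_var--;
#endif
}

/*
 * Non-preemptible rcu_read_lock()/rcu_read_unlock() as they stood
 * before the quoted patch: nothing but preempt_disable() and
 * preempt_enable(), hence (then) an implied barrier().
 */
static void rcu_read_lock_sketch(void)   { preempt_disable(); }
static void rcu_read_unlock_sketch(void) { preempt_enable(); }

static int preempt_depth(void) { return preempt_count_var; }
```

The dispute in the thread is precisely whether callers were entitled to rely on that implied barrier(), or only on the memory-ordering guarantees of the grace-period machinery.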