Re: [c++std-parallel-2008] Re: Compilers and RCU readers: Once more unto the breach!

2015-09-23 Thread Paul E. McKenney
On Wed, Sep 23, 2015 at 03:30:34PM -0700, Hans Boehm wrote:
> I'd really like this to converge.  It's great that it tries to focus on a
> couple of alternatives.  But I'm not sure they're the right ones.
> 
> I kind of like the 7.9 approach of restricting dependency chains to those
> cases in which no sane existing optimization breaks the dependency anyway.
> The hope would then be that we can restrict future optimizations to not do
> so either without too much cost.  But I'm again very concerned as to
> whether this can be specified acceptably.
> 
> The paper discusses the problem with pointer comparisons that effectively
> allow the compiler to remove a dependency by inferring the identity of a
> pointer.  I don't see how to describe the set of all such cases in a way
> that's comprehensible and acceptable in a standard.
> 
> Consider:
> 
> p = x;
> y = p -> a;
> if (x == well_known_pointer) {
>  foo(well_known_pointer -> a);
>  z = true;
> } else {
>  y = 2;
>  z = false;
> }
> 
> Clearly, if z is true, the load of y should have depended on that of x.  But
> this should be transformed to:
> 
> p = x;
> if (x == well_known_pointer) {
>  foo(y = well_known_pointer -> a);
>  z = true;
> } else {
>  y = 2;
>  z = false;
> }
> 
> which no longer has that property.  And I suspect we can construct examples
> in which the conditional is in a subsequent inlined function call, and
> entirely invisible to the programmer.  And there are probably issues with
> compilers using more complicated chains of reasoning to infer that pointers
> must be equal, e.g. because anything else would result in undefined
> behavior.  I'm once again being doubtful that we can pull off anything
> along these lines.

Indeed, the first paragraph of "Equality comparisons" on page 32 says
"This dependency-chain breakage can back-propagate to earlier uses of
the pointer".  The "Narrowing magnitude comparisons" paragraph on that
same page makes this same point for sequences of inequality comparisons
that allow the compiler to deduce the exact value of the pointer.
The comparisons currently in the Linux kernel are against compile-time
initialized constant-value structures, in which case breaking the
dependency chain is OK.
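For illustration (a hypothetical sketch, not code from the kernel), comparison against a compile-time-initialized constant structure is harmless precisely because the only value the compiler can substitute points at immutable data:

```cpp
struct node { int a; const node *next; };

// Compile-time-initialized sentinel whose fields never change.
const node empty = { -1, nullptr };

// If the compiler proves p == &empty and substitutes &empty directly,
// the dependency chain is broken on that path -- but the sentinel's
// contents are constant, so no stale data can be observed through it.
int read_a(const node *p)
{
    if (p == &empty)
        return empty.a;   // dependency broken; data immutable, so OK
    return p->a;          // dependency preserved
}
```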

I agree that the compiler could legally introduce a pointer comparison,
as noted in the "Equality comparisons" section, for example, when using
feedback-directed profiling.  Do you see other reasons why the compiler
would do this?

> 7.10 looks like a good direction, but it seems to me that we need to say
> more about handling expression temporaries and function results.
> Presumably at least ordinary expression temporaries implicitly carry
> dependencies?  Thus if both x and y are marked with _Carries_dependency,
> then
> 
> y = ((x ^ x) & 0) * 0
> 
> carries a dependency from x to y?

Yes, but a high-quality implementation would be capable of issuing a
warning in this case.  After all, the developer presumably used
memory_order_consume to gain better performance, but in this case
that added performance is reduced or perhaps even eliminated completely
by the additional code required to maintain the dependency chain.

The reason for carrying the dependency in this case is instead to make
it easier on the people producing tooling.

> I'm leaning again towards something similar to what we attempted to specify
> in the current standard, but restricted to temporaries, maybe function
> locals, and explicitly annotated locations.  This would be some combination
> of 7.10 and maybe 7.5.  It has the down side that it will likely require a
> bunch of explicit attributes (or some kind of C equivalent, though I'm not
> sure storage class works well for function arguments and results).  But it
> should avoid the kill-dependency proliferation required by the current
> standard.  And we could start to actually clean up the remaining problems
> with carries_dependency.

The nice thing about having some sort of marking is that it helps in
generation of good diagnostics.  We also need dependency chains to pass
into and out of functions, and this need will become more acute as
link-time optimization, which can remove the function-call overhead
from calls to external functions, becomes more mainstream.  We need the
compiler to know for sure whether or not a given formal parameter or
return value carries a dependency, so some sort of marking is required.
In addition, where structures (as opposed to pointers/references to
structures) are passed into and out of functions, we will very likely
need some way to mark the fields of those structures that are expected
to carry dependencies.  (The Linux kernel tends to avoid this, which
I like, but other projects are likely to make other choices, regardless
of my opinions.)
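C++11's existing [[carries_dependency]] attribute gives a flavor of such marking for parameters and return values (a sketch with made-up names; whether and how C would spell this is exactly the open question here):

```cpp
#include <atomic>

struct node { int a; std::atomic<node*> next; };

// The attribute documents (and lets the compiler check) that the
// dependency from a consume load is intended to flow through this
// function's parameter and out through its return value.
[[carries_dependency]] node *advance(node *p [[carries_dependency]])
{
    return p->next.load(std::memory_order_consume);
}
```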

That marking could be the [[carries_dependency]] attribute (at least
assuming that C adds attributes), a type modifier (which might or might
not go over well in Core), a variable modifier (which I don't 

Re: Compilers and RCU readers: Once more unto the breach!

2015-09-22 Thread Paul E. McKenney
On Mon, Jul 13, 2015 at 05:44:59PM -0700, Paul E. McKenney wrote:
> On Tue, May 19, 2015 at 05:55:10PM -0700, Paul E. McKenney wrote:
> > Hello!
> > 
> > Following up on last year's discussion (https://lwn.net/Articles/586838/,
> > https://lwn.net/Articles/588300/), I believe that we have a solution.  If
> > I am wrong, I am sure you all will let me know, and in great detail.  ;-)
> > 
> > The key simplification is to "just say no" to RCU-protected array indexes:
> > https://lkml.org/lkml/2015/5/12/827, as was suggested by several people.
> > This simplification means that rcu_dereference (AKA memory_order_consume)
> > need only return pointers.  This in turn avoids things like (x-x),
> > (x*0), and (x%1) because if "x" is a pointer, these expressions either
> > return non-pointers or are compilation errors.  With a very few exceptions,
> > dependency chains can lead -to- non-pointers, but cannot pass -through-
> > them.
> > 
> > The result is that dependencies are carried only by operations for
> > which the compiler cannot easily optimize away those dependencies;
> > these operations include simple assignment, integer offset (including
> > indexing), dereferencing, casts, passing as a function argument, return
> > values from functions and so on.  A complete list with commentary starts
> > on page 28 of:
> > 
> > http://www.rdrop.com/users/paulmck/RCU/consume.2015.05.18a.pdf
> 
> And an update is available here:
> 
>   http://www.rdrop.com/users/paulmck/RCU/consume.2015.07.13a.pdf
> 
> Among other things, this update addresses the points about equality
> comparisons introduced by the compiler, and outlines some of the
> issues head-/tail-marked alternatives face with respect to abstraction.
> The updated "Restricted Dependency Chains" section starts on page 28.
> 
> Thoughts?

This is updated based on many offline discussions.  Section 7.9 has
been updated accordingly, but still starts on page 28.  This approach is
again intended for large existing RCU code bases such as the Linux kernel.

A new Section 7.10 starting on page 35 describes a possible longer-term
solution for new-to-RCU code bases.  This approach adds a storage class
that indicates that the marked object carries dependencies.  This allows
better diagnostics (and arguably better code self-documentation) as well
as providing something that formal-verification tools can handle while
avoiding the need for compilers to trace dependency chains.
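To make the restricted-chain rules concrete (a sketch with hypothetical names; the authoritative list of dependency-carrying operations is in the paper): assignment, integer offset, dereference, and function argument/return all keep the chain intact, while a pointer-minus-pointer expression yields a non-pointer that the chain may lead to but cannot pass through:

```cpp
#include <atomic>

struct node { int val; };

std::atomic<node*> head{nullptr};

int reader(int i)
{
    node *p = head.load(std::memory_order_consume); // chain starts here
    node *q = p;          // simple assignment: carries the dependency
    node *r = q + i;      // integer offset (indexing): still carried
    return r->val;        // dereference: ordered after the consume load
    // By contrast, p - p would be a ptrdiff_t: the chain may lead *to*
    // that non-pointer value, but cannot pass *through* it, and p * 0
    // or p % 1 are simply compilation errors for pointers.
}
```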

http://www.rdrop.com/users/paulmck/submission/consume.2015.09.22a.pdf

Thoughts?

Thanx, Paul



Re: Compilers and RCU readers: Once more unto the breach!

2015-07-13 Thread Paul E. McKenney
On Tue, May 19, 2015 at 05:55:10PM -0700, Paul E. McKenney wrote:
 Hello!
 
 Following up on last year's discussion (https://lwn.net/Articles/586838/,
 https://lwn.net/Articles/588300/), I believe that we have a solution.  If
 I am wrong, I am sure you all will let me know, and in great detail.  ;-)
 
 The key simplification is to just say no to RCU-protected array indexes:
 https://lkml.org/lkml/2015/5/12/827, as was suggested by several people.
 This simplification means that rcu_dereference (AKA memory_order_consume)
 need only return pointers.  This in turn avoids things like (x-x),
 (x*0), and (x%1) because if x is a pointer, these expressions either
 return non-pointers or are compilation errors.  With a very few exceptions,
 dependency chains can lead -to- non-pointers, but cannot pass -through-
 them.
 
 The result is that dependencies are carried only by operations for
 which the compiler cannot easily optimize away those dependencies;
 these operations include simple assignment, integer offset (including
 indexing), dereferencing, casts, passing as a function argument, return
 values from functions and so on.  A complete list with commentary starts
 on page 28 of:
 
   http://www.rdrop.com/users/paulmck/RCU/consume.2015.05.18a.pdf

And an update is available here:

http://www.rdrop.com/users/paulmck/RCU/consume.2015.07.13a.pdf

Among other things, this update addresses the points about equality
comparisons introduced by the compiler, and outlines some of the
issues head-/tail-marked alternatives face with respect to abstraction.
The updated Restricted Dependency Chains section starts on page 28.

Thoughts?

Thanx, Paul



Re: [c++std-parallel-1651] Re: Compilers and RCU readers: Once more unto the breach!

2015-05-26 Thread Paul E. McKenney
On Tue, May 26, 2015 at 07:08:35PM +0200, Torvald Riegel wrote:
 On Tue, 2015-05-19 at 17:55 -0700, Paul E. McKenney wrote: 
  http://www.rdrop.com/users/paulmck/RCU/consume.2015.05.18a.pdf
 
 I have been discussing Section 7.9 with Paul during last week.
 
 While I think that 7.9 helps narrow down the problem somewhat, I'm still
 concerned that it effectively requires compilers to either track
 dependencies or conservatively prevent optimizations like value
 speculation and specialization based on that.  Neither is a good thing
 for a compiler.

I do believe that we can find some middle ground.

 7.9 adds requirements that dependency chains stop if the program itself
 informs the compiler about the value of something in the dependency
 chain (e.g., as shown in Figure 33).  Also, if a correct program that
 does not have undefined behavior must use a particular value, this is
 also considered as informing the compiler about that value.  For
 example:
   int arr[2];
   int* x = foo.load(mo_consume);
   if (x > arr)   // implies same object/array, so x is in arr[]
     int r1 = *x; // compiler knows x == arr + 1
 The program, assuming no undefined behavior, first tells the compiler
 that x should be within arr, and then the comparison tells the compiler
 that x!=arr, so x==arr+1 must hold because there are just two elements
 in the array.

The updated version of Section 7.9 says that if undefined behavior
allows the compiler to deduce the exact pointer value, as in the
case you show above, the dependency chain is broken.

 Having these requirements is generally good, but we don't yet know how
 to specify this properly.  For example, I suppose we'd need to also say
 that the compiler cannot assume to know anything about a value returned
 from an mo_consume load; otherwise, nothing prevents a compiler from
 using knowledge about the stores that the mo_consume load can read from
 (see Section 7.2).

I expect that the Linux kernel's rcu_dereference() primitives would
map to volatile memory_order_consume loads for this reason.
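A sketch of that mapping for one pointer type (hypothetical helper name, not the kernel's actual implementation):

```cpp
#include <atomic>

struct node { int a; };

// Sketch: rcu_dereference() expressed as a memory_order_consume load
// from a volatile atomic.  The volatile qualification keeps the
// compiler from assuming it knows which store the load will read from,
// addressing the concern about Section 7.2 above.
node *rcu_dereference_node(volatile std::atomic<node*> *pp)
{
    return pp->load(std::memory_order_consume);
}
```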

 Also, a compiler is still required to not do value-speculation or
 optimizations based on that.  For example, this program:
 
 void op(type *p)
 {
   foo /= p->a;
   bar = p->b;
 }
 void bar()
 {
   pointer = ppp.load(mo_consume);
   op(pointer);
 }
 
 ... must not be transformed into this program, even if the compiler
 knows that global_var->a == 1:
 
 void op(type *p) { /* unchanged */ }
 void bar()
 {
   pointer = ppp.load(mo_consume);
   if (pointer != global_var) {
     op(pointer);
   } else { // specialization for global_var
     // compiler knows global_var->a == 1;
     // compiler uses global_var directly, inlines, optimizes:
     bar = global_var->b;
   }
 }
 
 The compiler could optimize out the division if pointer==global_var but
 it must not access field b directly through global_var.  This would be
 pretty awkward; the compiler may work based on an optimized expression
 in the specialization (i.e., create code that assumes global_var instead
 of pointer) but it would still have to carry around and use the
 non-optimized expression.

Exactly how valuable is this sort of optimization in real code?  And
how likely is the compiler to actually be able to apply it?

(I nevertheless will take another pass through the Linux kernel looking
for global variables being added to RCU-protected linked structures.
Putting a global into an RCU-protected structure seems more likely than
an RCU-protected pointer into a two-element array.)

 This wouldn't be as bad if it were easily constrained to code sequences
 that really need the dependencies.  However, 7.9 does not effectively
 contain dependencies to only the code that really needs them, IMO.
 Unless the compiler proves otherwise, it has to assume that a load from
 a pointer carries a dependency.  Proving that is often very hard because
 it requires points-to analysis; 7.9 restricts this to intra-thread
 analysis but that is still nontrivial.
 Michael Matz had a similar concern (in terms of what it results in).

Again, I will be looking through the Linux kernel for vulnerabilities to
this sort of transformation.  However, I am having a very hard time seeing
how the compiler is going to know that much about the vast majority of
the Linux-kernel use cases.  The linked structures are allocated on the
heap, not in arrays or in globals.

 Given that mo_consume is useful but a very specialized feature, I
 wouldn't be optimistic that 7.9 would actually be supported by many
 compilers.  The trade-off between having to track dependencies or having
 to disallow optimizations is a bad one to make.  The simple way out for
 a compiler would be to just emit mo_acquire instead of mo_consume and be
 done with it all -- and this might be the most practical decision overall,
 or the default general-purpose implementation.  At least I haven't heard
 any compiler implementer say that they think it's obviously worth
 implementing.

I believe that we need to 

Re: [c++std-parallel-1641] Re: Compilers and RCU readers: Once more unto the breach!

2015-05-26 Thread Torvald Riegel
On Thu, 2015-05-21 at 13:42 -0700, Linus Torvalds wrote:
 On Thu, May 21, 2015 at 1:02 PM, Paul E. McKenney
 paul...@linux.vnet.ibm.com wrote:
 
  The compiler can (and does) speculate non-atomic non-volatile writes
  in some cases, but I do not believe that it is permitted to speculate
  either volatile or atomic writes.
 
 I do *not* believe that a compiler is ever allowed to speculate *any*
 writes - volatile or not - unless the compiler can prove that the end
 result is either single-threaded, or the write in question is
 guaranteed to only be visible in that thread (ie local stack variable
 etc).

It must not speculate volatile accesses.  It could speculate
non-volatiles even if those are atomic and observable by other threads,
but that would require further work/checks on all potential observers of
those (i.e., to still satisfy as-if).  Thus, compilers are unlikely to do
such speculation, I'd say.

The as-if rule (i.e., equality of observable behavior (i.e., volatiles, ...)
to the abstract machine) makes all this clear.
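The classic illustration of why write speculation needs those checks (a sketch):

```cpp
// Potentially shared variable: another thread may be observing x.
int x = 0;

void maybe_set(bool cond)
{
    if (cond)
        x = 1;
    // INVALID speculation for shared x:
    //     x = 1;
    //     if (!cond) x = 0;
    // A concurrent observer could see the transient x == 1 even when
    // cond is false, so the transformation violates as-if unless the
    // compiler proves no other thread can observe x.
}
```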

 Also, I do think that the whole consume read should be explained
 better to compiler writers. Right now the language (including very
 much in the restricted dependency model) is described in very
 abstract terms. Yet those abstract terms are actually very subtle and
 complex, and very opaque to a compiler writer.

I believe the issues around the existing specification of mo_consume
were pointed out by compiler folks.  It's a complex problem, and I'm
all for more explanations, but I did get the impression that the
compiler writers in ISO C++ Study Group 1 do have a good understanding
of the problem.

 I personally think the whole abstract machine model of the C
 language is a mistake. It would be much better to talk about things in
 terms of actual code generation and actual issues. Make all the
 problems much more concrete, with actual examples of how memory
 ordering matters on different architectures.

As someone working for a toolchain team, I don't see how the
abstract-machine-based specification is a problem at all, nor have I
seen compiler writers struggling with it.  It does give precise rules
for code generation.

The level of abstraction is a good thing for most programs because for
those, the primary concern is that the observable behavior and end
result is computed -- it's secondary and QoI how that happens.  In
contrast, if you specify on the level of code generation, you'd have to
foresee how code generation might look in the future, including predict
future optimizations and all that.  That doesn't look future-proof to
me.

I do realize that this may be less than ideal for cases when one would
want to use a C compiler more like a convenient assembler.  But that
case isn't the 99%, I believe.



Re: Compilers and RCU readers: Once more unto the breach!

2015-05-22 Thread Ingo Molnar

* Linus Torvalds torva...@linux-foundation.org wrote:

  (a) the official rules are completely pointless, and make sense 
 only because the standard is written for some random abstract 
 machine that doesn't actually exist.

Presuming the intent of the abstract machine specification is to avoid 
being seen as biased towards any specific machine (politics), maybe 
write this as:

   (a) the official rules are written for a somewhat weird and 
   complex union of all known and theoretically possible CPU 
   architectures that exist or which might exist in the future, 
   which machine does not actually exist in practice, but which 
   allows a single abstract set of rules to apply to all machines. 
   These rules are complex, but if applied to a specific machine 
   they become considerably simpler. Here's a few examples: ...

?

(Assuming it's a goal of this standard to be human parseable to more 
than a few dozen people on the planet.)

Thanks,

Ingo


Re: Compilers and RCU readers: Once more unto the breach!

2015-05-22 Thread Richard Kenner
 (Assuming it's a goal of this standard to be human parseable to more 
 than a few dozen people on the planet.)

Unfortunately, that's rarely a goal of most standards. ;-)


Re: Compilers and RCU readers: Once more unto the breach!

2015-05-22 Thread Paul E. McKenney
On Fri, May 22, 2015 at 08:43:44AM +0200, Ingo Molnar wrote:
 
 * Linus Torvalds torva...@linux-foundation.org wrote:
 
   (a) the official rules are completely pointless, and make sense 
  only because the standard is written for some random abstract 
  machine that doesn't actually exist.
 
 Presuming the intent of the abstract machine specification is to avoid 
 being seen as biased towards any specific machine (politics), maybe 
 write this as:
 
(a) the official rules are written for a somewhat weird and 
complex union of all known and theoretically possible CPU 
architectures that exist or which might exist in the future, 
which machine does not actually exist in practice, but which 
allows a single abstract set of rules to apply to all machines. 
These rules are complex, but if applied to a specific machine 
they become considerably simpler. Here's a few examples: ...
 
 ?
 
 (Assuming it's a goal of this standard to be human parseable to more 
 than a few dozen people on the planet.)

Should something based on Section 7.9 go in, then I would need to add
a more developer-friendly explanation in Documentation/RCU, no two
ways about it!  ;-)

Thanx, Paul



Re: Compilers and RCU readers: Once more unto the breach!

2015-05-22 Thread Paul E. McKenney
On Fri, May 22, 2015 at 06:43:32AM -0400, Richard Kenner wrote:
  (Assuming it's a goal of this standard to be human parseable to more 
  than a few dozen people on the planet.)
 
 Unfortunately, that's rarely a goal of most standards. ;-)

My experience does match Richard's, sad to say.  There has been some
vigorous discussion in another thread involving undefined behavior and
value narrowing, which has resulted in some useful changes to Section 7.9.
The updated document is here:

http://www2.rdrop.com/users/paulmck/RCU/consume.2015.05.22a.pdf

Thanx, Paul



Re: Compilers and RCU readers: Once more unto the breach!

2015-05-22 Thread Paul E. McKenney
On Fri, May 22, 2015 at 06:30:29PM +0100, Will Deacon wrote:
 Hi Paul,
 
 On Thu, May 21, 2015 at 09:02:12PM +0100, Paul E. McKenney wrote:
  On Thu, May 21, 2015 at 08:24:22PM +0100, Will Deacon wrote:
   On Wed, May 20, 2015 at 07:16:06PM +0100, Paul E. McKenney wrote:
On to #5:

r1 = atomic_load_explicit(x, memory_order_consume);
if (r1 == 42)
  atomic_store_explicit(y, r1, memory_order_relaxed);

r2 = atomic_load_explicit(y, memory_order_consume);
if (r2 == 42)
  atomic_store_explicit(x, 42, memory_order_relaxed);

The first thread's accesses are dependency ordered.  The second thread's
ordering is in a corner case that memory-barriers.txt does not cover.
You are supposed to start control dependencies with READ_ONCE_CTRL(), not
a memory_order_consume load (AKA rcu_dereference and friends).  However,
Alpha would have a full barrier as part of the memory_order_consume load,
and the rest of the processors would (one way or another) respect the
control dependency.  And the compiler would have some fun trying to
break it.
   
   But this is interesting because the first thread is ordered whilst the
   second is not, so doesn't that effectively forbid the compiler from
   constant-folding values if it can't prove that there is no dependency
   chain?
  
  You lost me on this one.  Are you suggesting that the compiler
  speculate the second thread's atomic store?  That would be very
  bad regardless of dependency chains.
  
  So what constant-folding optimization are you thinking of here?
  If the above example is not amenable to such an optimization, could
  you please give me an example where constant folding would apply
  in a way that is sensitive to dependency chains?
 
 Unless I'm missing something, I can't see what would prevent a compiler
 from looking at the code in thread1 and transforming it into the code in
 thread2 (i.e. constant folding r1 with 42 given that the taken branch
 must mean that r1 == 42). However, such an optimisation breaks the
 dependency chain, which means that a compiler needs to walk backwards
 to see if there is a dependency chain extending to r1.

Indeed!  Which is one reason that (1) integers are not allowed in
dependency chains with a very few extremely constrained exceptions and
(2) sequences of comparisons and/or undefined-behavior considerations
that allow the compiler to exactly determine the pointer value break
the dependency chain.
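A sketch of point (2), with made-up names: once an equality comparison pins down the exact pointer value, the compiler may substitute the known address, breaking the chain on that path:

```cpp
struct node { int a; };

node known{7};   // address and contents visible to the compiler

int read_after_compare(node *p)   // p assumed to come from a consume load
{
    if (p == &known)
        // Equality pins down p's exact value: the compiler may load
        // through &known directly, so the dependency chain from the
        // consume load that produced p is broken on this path.
        return p->a;
    return p->a;   // dependency still carried here
}
```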

So the current Linux memory model would allow (r1 == 42 && r2 == 42),
but I don't know of any hardware/compiler combination that would
allow it.  And no, I am -not- going to update memory-barriers.txt for
this litmus test, its theoretical interest notwithstanding!  ;-)
 
 Of course, I'm not asking for that at all! I'm just trying to see how
 your proposal holds up with the example.

Whew!  ;-)

Thanx, Paul



Re: Compilers and RCU readers: Once more unto the breach!

2015-05-22 Thread Will Deacon
Hi Paul,

On Thu, May 21, 2015 at 09:02:12PM +0100, Paul E. McKenney wrote:
 On Thu, May 21, 2015 at 08:24:22PM +0100, Will Deacon wrote:
  On Wed, May 20, 2015 at 07:16:06PM +0100, Paul E. McKenney wrote:
   On to #5:
   
 r1 = atomic_load_explicit(x, memory_order_consume);
 if (r1 == 42)
   atomic_store_explicit(y, r1, memory_order_relaxed);
 
 r2 = atomic_load_explicit(y, memory_order_consume);
 if (r2 == 42)
   atomic_store_explicit(x, 42, memory_order_relaxed);
   
   The first thread's accesses are dependency ordered.  The second thread's
   ordering is in a corner case that memory-barriers.txt does not cover.
   You are supposed to start control dependencies with READ_ONCE_CTRL(), not
   a memory_order_consume load (AKA rcu_dereference and friends).  However,
   Alpha would have a full barrier as part of the memory_order_consume load,
   and the rest of the processors would (one way or another) respect the
   control dependency.  And the compiler would have some fun trying to
   break it.
  
  But this is interesting because the first thread is ordered whilst the
  second is not, so doesn't that effectively forbid the compiler from
  constant-folding values if it can't prove that there is no dependency
  chain?
 
 You lost me on this one.  Are you suggesting that the compiler
 speculate the second thread's atomic store?  That would be very
 bad regardless of dependency chains.
 
 So what constant-folding optimization are you thinking of here?
 If the above example is not amenable to such an optimization, could
 you please give me an example where constant folding would apply
 in a way that is sensitive to dependency chains?

Unless I'm missing something, I can't see what would prevent a compiler
from looking at the code in thread1 and transforming it into the code in
thread2 (i.e. constant folding r1 with 42 given that the taken branch
must mean that r1 == 42). However, such an optimisation breaks the
dependency chain, which means that a compiler needs to walk backwards
to see if there is a dependency chain extending to r1.

   So the current Linux memory model would allow (r1 == 42 && r2 == 42),
   but I don't know of any hardware/compiler combination that would
   allow it.  And no, I am -not- going to update memory-barriers.txt for
   this litmus test, its theoretical interest notwithstanding!  ;-)

Of course, I'm not asking for that at all! I'm just trying to see how
your proposal holds up with the example.

Will


Re: [c++std-parallel-1632] Re: Compilers and RCU readers: Once more unto the breach!

2015-05-21 Thread Paul E. McKenney
On Thu, May 21, 2015 at 06:17:43PM +0200, Michael Matz wrote:
 Hi,
 
 On Thu, 21 May 2015, Paul E. McKenney wrote:
 
  The point is -exactly- to codify the current state of affairs.
 
 Ah, I see, so it's not yet about creating a more useful (for compilers, 
 that is) model.

There are several approaches being considered for that as well, but we
do need to codify current usage.

 char * fancy_assign (char *in) { return in; }
 ...
 char *x, *y;
 
 x = atomic_load_explicit(p, memory_order_consume);
 y = fancy_assign (x);
 atomic_store_explicit(q, y, memory_order_relaxed);
   
   So, is there, or is there not a dependency carried from x to y in your 
   proposed model (and which rule in your document states so)?  Clearly, 
   without any other language the compiler would have to assume that there is 
   (because the equivalent 'y = x' assignment would carry the dependency).  
  
  The dependency is not carried, though this is due to the current set
  of rules not covering atomic loads and stores, which I need to fix.
 
 Okay, so with the current regime(s), the dependency carries ...

Yes, that is the intent.

  o   Rule 14 says that if a value is part of a dependency chain and
  is used as the actual parameter of a function call, then the
  dependency chain extends to the corresponding formal parameter,
  namely in of fancy_assign().
  
  o   Rule 15 says that if a value is part of a dependency chain and
  is returned from a function, then the dependency chain extends
  to the returned value in the calling function.
  
  o   And you are right.  I need to make the first and second rules
  cover the relaxed atomic operations, or at least atomic loads and
  stores.  Not that this is an issue for existing Linux-kernel code.
  
  But given such a change, the new version of rule 2 would
  extend the dependency chain to cover the atomic_store_explicit().
 
 ... (if this detail would be fixed).  Okay, that's quite awful ...
 
   If it has to assume this, then the whole model is not going to work 
   very well, as usual with models that assume a certain less-optimal 
   fact (carries-dep is less optimal for code generation purposes that 
   not-carries-dep) unless very specific circumstances say it can be 
   ignored.
  
  Although that is a good general rule of thumb, I do not believe that it 
  applies to this situation, with the exception that I do indeed assume 
  that no one is insane enough to do value-speculation optimizations for 
  non-NULL values on loads from pointers.
  
  So what am I missing here?
 
 ... because you are then missing that if carries-dep can flow through 
 function calls from arguments to return values by default, the compiler 
 has to assume this in fact always happens when it can't see the function 
 body, or can't analyze it.  In effect that's making the whole "carries-dep 
 stops at these and those uses" a useless exercise, because a malicious 
 user (malicious in the sense of abusing the model to show that it's 
 hindering optimizations), i.e. me, can hide all such carries-dep-stopping 
 effects inside a function, et voila, the dependency carries through.  So 
 for a slightly more simple example:
 
   extern void *foo (void *);  // body not available
   x = load
   y = foo (x);
   store (y)
 
 the compiler has to assume that there's a dep-chain from x to y; always.  

Yes, the compiler does have to make this assumption.  And the intent
behind the rules is to ensure that this assumption does not get in the
way of reasonable optimizations.  So although I am sure that you are as
busy as the rest of us, I really do need you to go through the rules in
detail before you get too much more excited about this.
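A sketch of rules 14 and 15 as described above (hypothetical helper names):

```cpp
struct node { int a; };

// Rule 14 (sketch): the dependency extends from the actual argument
// to the formal parameter "in".
node *identity(node *in)
{
    // Rule 15 (sketch): the dependency extends through the return
    // value back into the caller.
    return in;
}

int read_through_call(node *p)   // p assumed to come from a consume load
{
    node *q = identity(p);  // chain flows p -> in -> return value -> q
    return q->a;            // dereference remains dependency-ordered
}
```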

 What's worse, it also has to assume a carries-dep for this:
 
   extern void foo (void *in, void **out1, void **out2);
   x = load
   foo (x, o1, o2);
   store (o1);
   store (o2);
 
 Now the compiler has to assume that the body of 'foo' is just mean enough 
 to make the dep-chain carry from in to *out1 or *out2 (i.e. it has to 
 assume that for both).  This extends to _all_ memory accessible from foo's 
 body, i.e. generally all global and all local address-taken variables, so 
 as soon as you have a function call into which a dep-chain value flows 
 you're creating a dep-chain extension from that value to each and every 
 global piece of memory, because the compiler cannot assume that the 
 black box called foo is not mean.  This could conceivably be stopped by 
 making normal stores not carry the dependency; then only the return 
 value might be infected; but I don't see that in your rules, as a normal 
 store is just an assigment in your model and hence rules 1 and 2 apply 
 (that is, carries-dep flows through all assignments, incl. loads and 
 stores).
 
 Basically whenever you can construct black boxes for the compiler, you 
 have to limit their effects on such transitive relations like carries-dep 
 by default, 

Re: Compilers and RCU readers: Once more unto the breach!

2015-05-21 Thread Paul E. McKenney
On Thu, May 21, 2015 at 08:24:22PM +0100, Will Deacon wrote:
 On Wed, May 20, 2015 at 07:16:06PM +0100, Paul E. McKenney wrote:
  On Wed, May 20, 2015 at 04:46:17PM +0100, Will Deacon wrote:
   On Wed, May 20, 2015 at 01:15:22PM +0100, Paul E. McKenney wrote:
Indeed, something like this does -not- carry a dependency from the
memory_order_consume load to q:

char *p, *q;

p = atomic_load_explicit(gp, memory_order_consume);
q = gq + (intptr_t)p - (intptr_t)p;

If this was compiled with -O0, ARM and Power might well carry a
dependency, but given any optimization, the assembly language would have
no hint of any such dependency.  So I am not seeing any particular 
danger.
   
   The above is a welcome relaxation over C11, since ARM doesn't even give
   you ordering based off false data dependencies. My concern is more to do
   with how this can be specified precisely without prohibiting honest compiler
   and hardware optimisations.
  
  That last is the challenge.  I believe that I am pretty close, but I am
  sure that additional adjustment will be required.  Especially given that
  we also need the memory model to be amenable to formal analysis.
 
 Well, there's still the whole thin-air problem which unfortunately doesn't
 go away with your proposal... (I was hoping that differentiating between
 true and false dependencies would solve that, but your set of rules isn't
 broad enough and I don't blame you at all for that!).
 
   Out of interest, how do you tackle examples (4) and (5) of (assuming the
   reads are promoted to consume loads)?:
   
 http://www.cl.cam.ac.uk/~pes20/cpp/notes42.html
   
   my understanding is that you permit both outcomes (I appreciate you're
   not directly tackling out-of-thin-air, but treatment of dependencies
   is heavily related).
 
 Thanks for taking the time to walk these two examples through.

;-) ;-) ;-)

  Let's see...  #4 is as follows, given promotion to memory_order_consume
  and (I am guessing) memory_order_relaxed:
  
  r1 = atomic_load_explicit(x, memory_order_consume);
  if (r1 == 42)
atomic_store_explicit(y, r1, memory_order_relaxed);
  --
  r2 = atomic_load_explicit(y, memory_order_consume);
  if (r2 == 42)
atomic_store_explicit(x, 42, memory_order_relaxed);
  else
atomic_store_explicit(x, 42, memory_order_relaxed);
  
  The second thread does not have a proper control dependency, even with
  the memory_order_consume load because both branches assign the same
  value to x.  This means that the compiler is within its rights to
  optimize this into the following:
  
  r1 = atomic_load_explicit(x, memory_order_consume);
  if (r1 == 42)
atomic_store_explicit(y, r1, memory_order_relaxed);
  --
  r2 = atomic_load_explicit(y, memory_order_consume);
  atomic_store_explicit(x, 42, memory_order_relaxed);
  
  There is no dependency between the second thread's pair of statements,
  so both the compiler and the CPU are within their rights to optimize
  further as follows:
  
  r1 = atomic_load_explicit(x, memory_order_consume);
  if (r1 == 42)
atomic_store_explicit(y, r1, memory_order_relaxed);
  --
  atomic_store_explicit(x, 42, memory_order_relaxed);
  r2 = atomic_load_explicit(y, memory_order_consume);
  
  If the compiler makes this final optimization, even mythical SC hardware
  is within its rights to end up with (r1 == 42 && r2 == 42).  Which is
  fine, as far as I am concerned.  Or at least something that can be
  lived with.
 
 Agreed.
 
  On to #5:
  
  r1 = atomic_load_explicit(x, memory_order_consume);
  if (r1 == 42)
atomic_store_explicit(y, r1, memory_order_relaxed);
  
  r2 = atomic_load_explicit(y, memory_order_consume);
  if (r2 == 42)
atomic_store_explicit(x, 42, memory_order_relaxed);
  
  The first thread's accesses are dependency ordered.  The second thread's
  ordering is in a corner case that memory-barriers.txt does not cover.
  You are supposed to start control dependencies with READ_ONCE_CTRL(), not
  a memory_order_consume load (AKA rcu_dereference and friends).  However,
  Alpha would have a full barrier as part of the memory_order_consume load,
  and the rest of the processors would (one way or another) respect the
  control dependency.  And the compiler would have some fun trying to
  break it.
 
 But this is interesting because the first thread is ordered whilst the
 second is not, so doesn't that effectively forbid the compiler from
 constant-folding values if it can't prove that there is no dependency
 chain?

You lost me on this one.  Are you suggesting that the compiler
speculate the second thread's atomic store?  That would be 

Re: Compilers and RCU readers: Once more unto the breach!

2015-05-21 Thread Linus Torvalds
On Thu, May 21, 2015 at 1:02 PM, Paul E. McKenney
paul...@linux.vnet.ibm.com wrote:

 The compiler can (and does) speculate non-atomic non-volatile writes
 in some cases, but I do not believe that it is permitted to speculate
 either volatile or atomic writes.

I do *not* believe that a compiler is ever allowed to speculate *any*
writes - volatile or not - unless the compiler can prove that the end
result is either single-threaded, or the write in question is
guaranteed to only be visible in that thread (ie local stack variable
etc).

Quite frankly, I'd be much happier if the C standard just said so outright.

Also, I do think that the whole consume read should be explained
better to compiler writers. Right now the language (including very
much in the restricted dependency model) is described in very
abstract terms. Yet those abstract terms are actually very subtle and
complex, and very opaque to a compiler writer.

If I was a compiler writer, I'd absolutely detest that definition.
It's very far removed from my problem space as a compiler writer, and
nothing in the language *explains* the odd and subtle abstract rules.
It smells ad-hoc to me.

Now, I actually understand the point of those odd and abstract rules,
but to a compiler writer that doesn't understand the background, the
whole section reads as "this is really painful for me to track all
those dependencies and what kills them".

So I would very much suggest that there would be language that
*explains* this. Basically, tell the compiler writer:

 (a) the official rules are completely pointless, and make sense
only because the standard is written for some random abstract
machine that doesn't actually exist.

 (b) in *real life*, the whole and only point of the rules is to make
sure that the compiler doesn't turn a data dependency into a control
dependency, which on ARM and POWERPC does not honor causal memory
ordering.

 (c) on x86, since *all* memory accesses are causal, all the magical
dependency rules are just pointless anyway, and what it really means
is that you cannot re-order accesses with value speculation.

 (d) the *actual* relevant rule for a compiler writer is very simple:
the compiler must not do value speculation on a consume load, and
the abstract machine rules are written so that any other sane
optimization is legal.

 (e) if the compiler writer really thinks they want to do value
speculation, they have to turn the consume load into an acquire
load. And you have to do that anyway on architectures like alpha that
aren't causal even for data dependencies.

I personally think the whole abstract machine model of the C
language is a mistake. It would be much better to talk about things in
terms of actual code generation and actual issues. Make all the
problems much more concrete, with actual examples of how memory
ordering matters on different architectures.

99% of all the problems with the whole consume memory ordering comes
not from anything relevant to a compiler writer. All of it comes from
trying to define the issue in the wrong terms.

 Linus


Re: [c++std-parallel-1632] Re: Compilers and RCU readers: Once more unto the breach!

2015-05-21 Thread Paul E. McKenney
On Thu, May 21, 2015 at 04:22:38PM +0200, Michael Matz wrote:
 Hi,
 
 On Wed, 20 May 2015, Paul E. McKenney wrote:
 
I'm not sure... you'd require the compiler to perform static analysis of
loops to determine the state of the machine when they exit (if they 
exit!)
in order to show whether or not a dependency is carried to subsequent
operations. If it can't prove otherwise, it would have to assume that a
dependency *is* carried, and it's not clear to me how it would use this
information to restrict any subsequent dependency removing 
optimisations.
   
   It'd just convert consume to acquire.
  
  It should not need to, actually.
 
 [with GCC hat, and having only lightly read your document]

Understood.

 Then you need to provide language or at least informal reasons why the 
 compiler is allowed to not do that.  Without that a compiler would have to 
 be conservative, if it can't _prove_ that a dependency chain is stopped, 
 then it has to assume it hasn't.
 
 For instance I can't really make out easily what your document says about 
 the following simple situation (well, actually I have difficulty 
 differentiating between what you're proposing as the good new model of this all, 
 and what you're merely describing as the different current states of affairs):

The point is -exactly- to codify the current state of affairs.  I expect
a follow-on effort to specify some sort of marking regimen, as noted in
the last paragraph of 7.9 and as discussed with Torvald Riegel.  However,
given that there are not yet any implementations or practical experience
with such markings, I suspect that some time will be required to hammer
out a good marking scheme.

   char * fancy_assign (char *in) { return in; }
   ...
   char *x, *y;
   
   x = atomic_load_explicit(p, memory_order_consume);
   y = fancy_assign (x);
   atomic_store_explicit(q, y, memory_order_relaxed);
 
 So, is there, or is there not a dependency carried from x to y in your 
 proposed model (and which rule in your document states so)?  Clearly, 
 without any other language the compiler would have to assume that there is 
 (because the equivalent 'y = x' assignment would carry the dependency).  

The dependency is not carried, though this is due to the current set
of rules not covering atomic loads and stores, which I need to fix.
Here is the sequence of events:

o   A memory_order_consume load heads a dependency chain.

o   Rule 2 says that if a value is part of a dependency chain and
is used as the right-hand side of an assignment operator,
the expression extends the chain to cover the assignment.
And I switched to numbered bullet items here:
http://www2.rdrop.com/users/paulmck/RCU/consume.2015.05.21a.pdf

o   Rule 14 says that if a value is part of a dependency chain and
is used as the actual parameter of a function call, then the
dependency chain extends to the corresponding formal parameter,
namely in of fancy_assign().

o   Rule 15 says that if a value is part of a dependency chain and
is returned from a function, then the dependency chain extends
to the returned value in the calling function.

o   And you are right.  I need to make the first and second rules
cover the relaxed atomic operations, or at least atomic loads and
stores.  Not that this is an issue for existing Linux-kernel code.

But given such a change, the new version of rule 2 would
extend the dependency chain to cover the atomic_store_explicit().
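
Pulled together, the rule walk-through above can be sketched in C11 (an illustrative sketch only, not from the thread; demo() is a made-up name, and the final relaxed store is covered only under the proposed fix to rules 1 and 2):

```c
#include <stdatomic.h>

/* Rules 14 and 15 sketch: the dependency chain headed by the
 * memory_order_consume load enters fancy_assign() through the formal
 * parameter "in" and extends back out through the return value. */
static char *fancy_assign(char *in) { return in; }

static _Atomic(char *) p, q;

char *demo(void) {
    char *x = atomic_load_explicit(&p, memory_order_consume); /* chain head */
    char *y = fancy_assign(x);   /* rules 14 and 15: chain reaches y */
    /* Under the proposed fix to rules 1 and 2, the chain would also
     * cover this relaxed atomic store: */
    atomic_store_explicit(&q, y, memory_order_relaxed);
    return y;
}
```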

 If it has to assume this, then the whole model is not going to work very 
 well, as usual with models that assume a certain less-optimal fact 
 (carries-dep is less optimal for code generation purposes than 
 not-carries-dep) unless very specific circumstances say it can be 
 ignored.

Although that is a good general rule of thumb, I do not believe that it
applies to this situation, with the exception that I do indeed assume
that no one is insane enough to do value-speculation optimizations for
non-NULL values on loads from pointers.
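
For concreteness, the kind of value speculation being assumed away can be sketched as follows (a hypothetical compiler transformation, not code from the thread): guessing the loaded pointer and verifying the guess afterwards replaces the data dependency from the load with a mere control dependency.

```c
struct node { int a; };

/* Sketch of pointer value speculation: if the guess matches, the
 * dereference of ->a no longer data-depends on the load of *pp. */
int read_a(struct node *const *pp, struct node *guess) {
    struct node *p = *pp;
    if (p == guess)
        return guess->a;  /* dependency broken: only control-dependent */
    return p->a;          /* data dependency preserved on this path */
}
```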

So what am I missing here?  Do you have a specific example where the
compiler would need to suppress a production-quality optimization?

Thanx, Paul



Re: [c++std-parallel-1632] Re: Compilers and RCU readers: Once more unto the breach!

2015-05-21 Thread Michael Matz
Hi,

On Wed, 20 May 2015, Paul E. McKenney wrote:

   I'm not sure... you'd require the compiler to perform static analysis of
   loops to determine the state of the machine when they exit (if they exit!)
   in order to show whether or not a dependency is carried to subsequent
   operations. If it can't prove otherwise, it would have to assume that a
   dependency *is* carried, and it's not clear to me how it would use this
   information to restrict any subsequent dependency removing optimisations.
  
  It'd just convert consume to acquire.
 
 It should not need to, actually.

[with GCC hat, and having only lightly read your document]

Then you need to provide language or at least informal reasons why the 
compiler is allowed to not do that.  Without that a compiler would have to 
be conservative, if it can't _prove_ that a dependency chain is stopped, 
then it has to assume it hasn't.

For instance I can't really make out easily what your document says about 
the following simple situation (well, actually I have difficulty 
differentiating between what you're proposing as the good new model of this all, 
and what you're merely describing as the different current states of affairs):

  char * fancy_assign (char *in) { return in; }
  ...
  char *x, *y;
  
  x = atomic_load_explicit(p, memory_order_consume);
  y = fancy_assign (x);
  atomic_store_explicit(q, y, memory_order_relaxed);

So, is there, or is there not a dependency carried from x to y in your 
proposed model (and which rule in your document states so)?  Clearly, 
without any other language the compiler would have to assume that there is 
(because the equivalent 'y = x' assignment would carry the dependency).  

If it has to assume this, then the whole model is not going to work very 
well, as usual with models that assume a certain less-optimal fact 
(carries-dep is less optimal for code generation purposes than 
not-carries-dep) unless very specific circumstances say it can be 
ignored.


Ciao,
Michael.


Re: Compilers and RCU readers: Once more unto the breach!

2015-05-21 Thread Paul E. McKenney
On Thu, May 21, 2015 at 01:42:11PM -0700, Linus Torvalds wrote:
 On Thu, May 21, 2015 at 1:02 PM, Paul E. McKenney
 paul...@linux.vnet.ibm.com wrote:
 
  The compiler can (and does) speculate non-atomic non-volatile writes
  in some cases, but I do not believe that it is permitted to speculate
  either volatile or atomic writes.
 
 I do *not* believe that a compiler is ever allowed to speculate *any*
 writes - volatile or not - unless the compiler can prove that the end
 result is either single-threaded, or the write in question is
 guaranteed to only be visible in that thread (ie local stack variable
 etc).
 
 Quite frankly, I'd be much happier if the C standard just said so outright.

So would I.  ;-)

The usual example is the following, where x is non-volatile and
non-atomic:

if (a)
x = 42;
else
x = 7;

The current rules allow this to be transformed as follows:

x = 7;
if (a)
x = 42;

So the variable x takes on the value 7 momentarily when it otherwise
would not have.
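
Side by side, the two forms look like this (a sketch; the function names are made up).  The point is that once x is shared, a concurrent reader could observe the transient 7 even when a is true, a value the source program never stores on that path:

```c
int x;  /* shared, non-atomic, non-volatile */

/* Source form: exactly one store is performed. */
void writer(int a) {
    if (a)
        x = 42;
    else
        x = 7;
}

/* Transformed form: x transiently holds 7 whenever a is nonzero,
 * which another thread could observe. */
void writer_speculated(int a) {
    x = 7;
    if (a)
        x = 42;
}
```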

At least C11 now prohibits drive-by speculative writes, where the
compiler writes a larger size and then fixes things up afterwards.

 Also, I do think that the whole consume read should be explained
 better to compiler writers. Right now the language (including very
 much in the restricted dependency model) is described in very
 abstract terms. Yet those abstract terms are actually very subtle and
 complex, and very opaque to a compiler writer.
 
 If I was a compiler writer, I'd absolutely detest that definition.
 It's very far removed from my problem space as a compiler writer, and
 nothing in the language *explains* the odd and subtle abstract rules.
 It smells ad-hoc to me.
 
 Now, I actually understand the point of those odd and abstract rules,
 but to a compiler writer that doesn't understand the background, the
 whole section reads as "this is really painful for me to track all
 those dependencies and what kills them".
 
 So I would very much suggest that there would be language that
 *explains* this. Basically, tell the compiler writer:
 
  (a) the official rules are completely pointless, and make sense
 only because the standard is written for some random abstract
 machine that doesn't actually exist.
 
  (b) in *real life*, the whole and only point of the rules is to make
 sure that the compiler doesn't turn a data dependency into a control
 dependency, which on ARM and POWERPC does not honor causal memory
 ordering.
 
  (c) on x86, since *all* memory accesses are causal, all the magical
 dependency rules are just pointless anyway, and what it really means
 is that you cannot re-order accesses with value speculation.
 
  (d) the *actual* relevant rule for a compiler writer is very simple:
 the compiler must not do value speculation on a consume load, and
 the abstract machine rules are written so that any other sane
 optimization is legal.
 
  (e) if the compiler writer really thinks they want to do value
 speculation, they have to turn the consume load into an acquire
 load. And you have to do that anyway on architectures like alpha that
 aren't causal even for data dependencies.

I am certainly more than willing to add this sort of wording to
the document.

 I personally think the whole abstract machine model of the C
 language is a mistake. It would be much better to talk about things in
 terms of actual code generation and actual issues. Make all the
 problems much more concrete, with actual examples of how memory
 ordering matters on different architectures.
 
 99% of all the problems with the whole consume memory ordering comes
 not from anything relevant to a compiler writer. All of it comes from
 trying to define the issue in the wrong terms.

I certainly cannot deny that there seems to be significant confusion
in the discussion thus far.

Thanx, Paul



Re: Compilers and RCU readers: Once more unto the breach!

2015-05-21 Thread Will Deacon
On Wed, May 20, 2015 at 07:16:06PM +0100, Paul E. McKenney wrote:
 On Wed, May 20, 2015 at 04:46:17PM +0100, Will Deacon wrote:
  On Wed, May 20, 2015 at 01:15:22PM +0100, Paul E. McKenney wrote:
   Indeed, something like this does -not- carry a dependency from the
   memory_order_consume load to q:
   
  char *p, *q;
   
 p = atomic_load_explicit(gp, memory_order_consume);
 q = gq + (intptr_t)p - (intptr_t)p;
   
   If this was compiled with -O0, ARM and Power might well carry a
   dependency, but given any optimization, the assembly language would have
   no hint of any such dependency.  So I am not seeing any particular danger.
  
  The above is a welcome relaxation over C11, since ARM doesn't even give
  you ordering based off false data dependencies. My concern is more to do
  with how this can be specified precisely without prohibiting honest compiler
  and hardware optimisations.
 
 That last is the challenge.  I believe that I am pretty close, but I am
 sure that additional adjustment will be required.  Especially given that
 we also need the memory model to be amenable to formal analysis.

Well, there's still the whole thin-air problem which unfortunately doesn't
go away with your proposal... (I was hoping that differentiating between
true and false dependencies would solve that, but your set of rules isn't
broad enough and I don't blame you at all for that!).

  Out of interest, how do you tackle examples (4) and (5) of (assuming the
  reads are promoted to consume loads)?:
  
http://www.cl.cam.ac.uk/~pes20/cpp/notes42.html
  
  my understanding is that you permit both outcomes (I appreciate you're
  not directly tackling out-of-thin-air, but treatment of dependencies
  is heavily related).

Thanks for taking the time to walk these two examples through.

 Let's see...  #4 is as follows, given promotion to memory_order_consume
 and (I am guessing) memory_order_relaxed:
 
   r1 = atomic_load_explicit(x, memory_order_consume);
   if (r1 == 42)
 atomic_store_explicit(y, r1, memory_order_relaxed);
   --
   r2 = atomic_load_explicit(y, memory_order_consume);
   if (r2 == 42)
 atomic_store_explicit(x, 42, memory_order_relaxed);
   else
 atomic_store_explicit(x, 42, memory_order_relaxed);
 
 The second thread does not have a proper control dependency, even with
 the memory_order_consume load because both branches assign the same
 value to x.  This means that the compiler is within its rights to
 optimize this into the following:
 
   r1 = atomic_load_explicit(x, memory_order_consume);
   if (r1 == 42)
 atomic_store_explicit(y, r1, memory_order_relaxed);
   --
   r2 = atomic_load_explicit(y, memory_order_consume);
   atomic_store_explicit(x, 42, memory_order_relaxed);
 
 There is no dependency between the second thread's pair of statements,
 so both the compiler and the CPU are within their rights to optimize
 further as follows:
 
   r1 = atomic_load_explicit(x, memory_order_consume);
   if (r1 == 42)
 atomic_store_explicit(y, r1, memory_order_relaxed);
   --
   atomic_store_explicit(x, 42, memory_order_relaxed);
   r2 = atomic_load_explicit(y, memory_order_consume);
 
 If the compiler makes this final optimization, even mythical SC hardware
  is within its rights to end up with (r1 == 42 && r2 == 42).  Which is
 fine, as far as I am concerned.  Or at least something that can be
 lived with.

Agreed.

 On to #5:
 
   r1 = atomic_load_explicit(x, memory_order_consume);
   if (r1 == 42)
 atomic_store_explicit(y, r1, memory_order_relaxed);
   
   r2 = atomic_load_explicit(y, memory_order_consume);
   if (r2 == 42)
 atomic_store_explicit(x, 42, memory_order_relaxed);
 
 The first thread's accesses are dependency ordered.  The second thread's
 ordering is in a corner case that memory-barriers.txt does not cover.
 You are supposed to start control dependencies with READ_ONCE_CTRL(), not
 a memory_order_consume load (AKA rcu_dereference and friends).  However,
 Alpha would have a full barrier as part of the memory_order_consume load,
 and the rest of the processors would (one way or another) respect the
 control dependency.  And the compiler would have some fun trying to
 break it.

But this is interesting because the first thread is ordered whilst the
second is not, so doesn't that effectively forbid the compiler from
constant-folding values if it can't prove that there is no dependency
chain?

 So the current Linux memory model would allow (r1 == 42 && r2 == 42),
 but I don't know of any hardware/compiler combination that would
 allow it.  And no, I am -not- going to update memory-barriers.txt for
 this litmus test, its theoretical interest notwithstanding!  

Re: [c++std-parallel-1632] Re: Compilers and RCU readers: Once more unto the breach!

2015-05-21 Thread Michael Matz
Hi,

On Thu, 21 May 2015, Paul E. McKenney wrote:

 The point is -exactly- to codify the current state of affairs.

Ah, I see, so it's not yet about creating a more useful (for compilers, 
that is) model.

char * fancy_assign (char *in) { return in; }
...
char *x, *y;

x = atomic_load_explicit(p, memory_order_consume);
y = fancy_assign (x);
atomic_store_explicit(q, y, memory_order_relaxed);
  
  So, is there, or is there not a dependency carried from x to y in your 
  proposed model (and which rule in your document states so)?  Clearly, 
  without any other language the compiler would have to assume that there is 
  (because the equivalent 'y = x' assignment would carry the dependency).  
 
 The dependency is not carried, though this is due to the current set
 of rules not covering atomic loads and stores, which I need to fix.

Okay, so with the current regime(s), the dependency carries ...

 o Rule 14 says that if a value is part of a dependency chain and
   is used as the actual parameter of a function call, then the
   dependency chain extends to the corresponding formal parameter,
   namely in of fancy_assign().
 
 o Rule 15 says that if a value is part of a dependency chain and
   is returned from a function, then the dependency chain extends
   to the returned value in the calling function.
 
 o And you are right.  I need to make the first and second rules
   cover the relaxed atomic operations, or at least atomic loads and
   stores.  Not that this is an issue for existing Linux-kernel code.
 
   But given such a change, the new version of rule 2 would
   extend the dependency chain to cover the atomic_store_explicit().

... (if this detail were fixed).  Okay, that's quite awful ...

  If it has to assume this, then the whole model is not going to work 
  very well, as usual with models that assume a certain less-optimal 
  fact (carries-dep is less optimal for code generation purposes than 
  not-carries-dep) unless very specific circumstances say it can be 
  ignored.
 
 Although that is a good general rule of thumb, I do not believe that it 
 applies to this situation, with the exception that I do indeed assume 
 that no one is insane enough to do value-speculation optimizations for 
 non-NULL values on loads from pointers.
 
 So what am I missing here?

... because you are then missing that if carries-dep can flow through 
function calls from arguments to return values by default, the compiler 
has to assume this in fact always happens when it can't see the function 
body, or can't analyze it.  In effect that's making the whole "carries-dep 
stops at these and those uses" business a useless exercise, because a malicious 
user (malicious in the sense of abusing the model to show that it's 
hindering optimizations), i.e. me, can hide all such carries-dep-stopping 
effects inside a function, et voila, the dependency carries through.  So 
for a slightly simpler example:

  extern void *foo (void *);  // body not available
  x = load
  y = foo (x);
  store (y)

the compiler has to assume that there's a dep-chain from x to y; always.  
What's worse, it also has to assume a carries-dep for this:

  extern void foo (void *in, void **out1, void **out2);
  x = load
  foo (x, o1, o2);
  store (o1);
  store (o2);

Now the compiler has to assume that the body of 'foo' is just mean enough 
to make the dep-chain carry from in to *out1 or *out2 (i.e. it has to 
assume that for both).  This extends to _all_ memory accessible from foo's 
body, i.e. generally all global and all local address-taken variables, so 
as soon as you have a function call into which a dep-chain value flows 
you're creating a dep-chain extension from that value to each and every 
global piece of memory, because the compiler cannot assume that the 
black box called foo is not mean.  This could conceivably be stopped by 
making normal stores not to carry the dependency; then only the return 
value might be infected; but I don't see that in your rules, as a normal 
store is just an assignment in your model and hence rules 1 and 2 apply 
(that is, carries-dep flows through all assignments, incl. loads and 
stores).

Basically whenever you can construct black boxes for the compiler, you 
have to limit their effects on such transitive relations like carries-dep 
by default, at the border of such black boxes; otherwise that transitive 
relation quickly becomes a most-things-to-everything relation (i.e. 
most things carry a dependency to everything), and as such is then totally 
useless, because such a universally filled relation (like an empty one) 
doesn't bear any interesting information, at which point it is 
questionable why the compiler should jump through hoops to analyse the few 
cases that would be allowed to stop the carries-dep flow, when it more 
often than not has to give up anyway and generate slow code.
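
A maximally "mean" body for the second example above can be sketched directly (an illustrative sketch; global_sink stands in for "all memory reachable from foo"): once a dep-chain value flows into an opaque call, the caller's compiler must assume the chain reaches the out-parameters, the return value, and every global the callee can touch.

```c
/* Sketch of the black-box laundering described above. */
void *global_sink;   /* stands for arbitrary global memory foo can reach */

void foo(void *in, void **out1, void **out2) {
    *out1 = in;          /* dep-chain carried to *out1 */
    *out2 = in;          /* ... and to *out2 */
    global_sink = in;    /* ... and to global memory */
}
```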

 Do you have a specific example where the compiler would need to 

Re: [c++std-parallel-1614] Re: Compilers and RCU readers: Once more unto the breach!

2015-05-20 Thread Jens Maurer
On 05/20/2015 04:34 AM, Paul E. McKenney wrote:
 On Tue, May 19, 2015 at 06:57:02PM -0700, Linus Torvalds wrote:

  - the you can add/subtract integral values still opens you up to
 language lawyers claiming (char *)ptr - (intptr_t)ptr preserving the
 dependency, which it clearly doesn't. But language-lawyering it does,
 since all those operations (cast to pointer, cast to integer,
 subtracting an integer) claim to be dependency-preserving operations.

[...]

 There are some stranger examples, such as (char *)ptr - ((intptr_t)ptr)/7,
 but in that case, if the resulting pointer happens by chance to reference 
 valid memory, I believe a dependency would still be carried.
[...]

From a language lawyer standpoint, pointer arithmetic is only valid
within an array.  These examples seem to go beyond the bounds of the
array and therefore have undefined behavior.

C++ standard section 5.7 paragraph 4
If both the pointer operand and the result point to elements of the
same array object, or one past the last element of the array object,
the evaluation shall not produce an overflow; otherwise, the behavior
is undefined.

C99 and C11
identical phrasing in 6.5.6 paragraph 8

Jens


Re: Compilers and RCU readers: Once more unto the breach!

2015-05-20 Thread Will Deacon
Hi Paul,

On Wed, May 20, 2015 at 03:41:48AM +0100, Paul E. McKenney wrote:
 On Tue, May 19, 2015 at 07:10:12PM -0700, Linus Torvalds wrote:
  On Tue, May 19, 2015 at 6:57 PM, Linus Torvalds
  torva...@linux-foundation.org wrote:
  So I think you're better off just saying that operations designed to
  drop significant bits break the dependency chain, and give things like
  "& 1" and "(char *)ptr-(uintptr_t)ptr" as examples of such.
  
  Making that just an extension of your existing "& 0" language would
  seem to be natural.
 
 Works for me!  I added the following bullet to the list of things
 that break dependencies:
 
   If a pointer is part of a dependency chain, and if the values
   added to or subtracted from that pointer cancel the pointer
   value so as to allow the compiler to precisely determine the
   resulting value, then the resulting value will not be part of
   any dependency chain.  For example, if p is part of a dependency
   chain, then ((char *)p-(uintptr_t)p)+65536 will not be.
 
 Seem reasonable?
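
The cancellation in that bullet can be checked concretely (an illustrative sketch, not from the thread; integer arithmetic is used to keep the sketch clear of out-of-bounds pointer math):

```c
#include <stdint.h>

/* If the values added to and subtracted from the chained pointer p
 * cancel, the compiler can fold the result to a constant, so the
 * result carries no dependency. */
uintptr_t canceled(char *p) {
    return ((uintptr_t)p - (uintptr_t)p) + 65536;  /* folds to 65536 */
}
```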

Whilst I understand what you're saying (the ARM architecture makes these
sorts of distinctions when calling out dependency-based ordering), it
feels like we're dangerously close to defining the difference between a
true and a false dependency. If we want to do this in the context of the
C language specification, you run into issues because you need to evaluate
the program in order to determine data values in order to determine the
nature of the dependency.

You tackle this above by saying "to allow the compiler to precisely
determine the resulting value", but I can't see how that can be cleanly
fitted into something like the C language specification. Even if it can,
then we'd need to reword the ?: treatment that you currently have:

  If a pointer is part of a dependency chain, and that pointer appears
   in the entry of a ?: expression selected by the condition, then the
   chain extends to the result.

which I think requires the state of the condition to be known statically
if we only want to extend the chain from the selected expression. In the
general case, wouldn't a compiler have to assume that the chain is
extended from both?

Additionally, what about the following code?

  char *x = y ? z : z;

Does that extend a dependency chain from z to x? If so, I can imagine a
CPU breaking that in practice.

  Humans will understand, and compiler writers won't care. They will
  either depend on hardware semantics anyway (and argue that your
  language is tight enough that they don't need to do anything special)
  or they will turn the consume into an acquire (on platforms that have
  too weak hardware).
 
 Agreed.  Plus Core Working Group will hammer out the exact wording,
 should this approach meet their approval.

For the avoidance of doubt, I'm completely behind any attempts to tackle
this problem, but I anticipate an uphill struggle getting this text into
the C standard. Is your intention to change the carries-a-dependency
relation to encompass this change?

Cheers,

Will


Re: [c++std-parallel-1614] Re: Compilers and RCU readers: Once more unto the breach!

2015-05-20 Thread Paul E. McKenney
On Wed, May 20, 2015 at 11:03:00AM +0200, Richard Biener wrote:
 On Wed, May 20, 2015 at 9:34 AM, Jens Maurer jens.mau...@gmx.net wrote:
  On 05/20/2015 04:34 AM, Paul E. McKenney wrote:
  On Tue, May 19, 2015 at 06:57:02PM -0700, Linus Torvalds wrote:
 
   - the you can add/subtract integral values still opens you up to
  language lawyers claiming (char *)ptr - (intptr_t)ptr preserving the
  dependency, which it clearly doesn't. But language-lawyering it does,
  since all those operations (cast to pointer, cast to integer,
  subtracting an integer) claim to be dependency-preserving operations.
 
  [...]
 
  There are some stranger examples, such as (char *)ptr - ((intptr_t)ptr)/7,
  but in that case, if the resulting pointer happens by chance to reference
  valid memory, I believe a dependency would still be carried.
  [...]
 
  From a language lawyer standpoint, pointer arithmetic is only valid
  within an array.  These examples seem to go beyond the bounds of the
  array and therefore have undefined behavior.
 
  C++ standard section 5.7 paragraph 4
  If both the pointer operand and the result point to elements of the
  same array object, or one past the last element of the array object,
  the evaluation shall not produce an overflow; otherwise, the behavior
  is undefined.
 
  C99 and C11
  identical phrasing in 6.5.6 paragraph 8
 
 Of course you can try to circumvent that by doing
 (char*)((intptr_t)ptr - (intptr_t)ptr + (intptr_t)ptr)
 (see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65752 for extra fun).
 
 Which (IMHO) gets you into the standard language that only makes conversion of
 the exact same integer back to a pointer well-defined(?)

I am feeling good about leaving the restriction and calling out
the two paragraphs in a footnote, then.  ;-)

Thanx, Paul



Re: [c++std-parallel-1616] Re: Compilers and RCU readers: Once more unto the breach!

2015-05-20 Thread Paul E. McKenney
On Wed, May 20, 2015 at 09:34:10AM +0200, Jens Maurer wrote:
 On 05/20/2015 04:34 AM, Paul E. McKenney wrote:
  On Tue, May 19, 2015 at 06:57:02PM -0700, Linus Torvalds wrote:
 
   - the "you can add/subtract integral values" still opens you up to
  language lawyers claiming (char *)ptr - (intptr_t)ptr preserving the
  dependency, which it clearly doesn't. But language-lawyering it does,
  since all those operations (cast to pointer, cast to integer,
  subtracting an integer) claim to be dependency-preserving operations.
 
 [...]
 
  There are some stranger examples, such as (char *)ptr - ((intptr_t)ptr)/7,
  but in that case, if the resulting pointer happens by chance to reference 
  valid memory, I believe a dependency would still be carried.
 [...]
 
 From a language lawyer standpoint, pointer arithmetic is only valid
 within an array.  These examples seem to go beyond the bounds of the
 array and therefore have undefined behavior.
 
 C++ standard section 5.7 paragraph 4
 If both the pointer operand and the result point to elements of the
 same array object, or one past the last element of the array object,
 the evaluation shall not produce an overflow; otherwise, the behavior
 is undefined.
 
 C99 and C11
 identical phrasing in 6.5.6 paragraph 8

Even better!  I added a footnote calling out these two paragraphs.

Thanx, Paul



Re: [c++std-parallel-1614] Re: Compilers and RCU readers: Once more unto the breach!

2015-05-20 Thread Richard Biener
On Wed, May 20, 2015 at 9:34 AM, Jens Maurer jens.mau...@gmx.net wrote:
 On 05/20/2015 04:34 AM, Paul E. McKenney wrote:
 On Tue, May 19, 2015 at 06:57:02PM -0700, Linus Torvalds wrote:

  - the "you can add/subtract integral values" still opens you up to
 language lawyers claiming (char *)ptr - (intptr_t)ptr preserving the
 dependency, which it clearly doesn't. But language-lawyering it does,
 since all those operations (cast to pointer, cast to integer,
 subtracting an integer) claim to be dependency-preserving operations.

 [...]

 There are some stranger examples, such as (char *)ptr - ((intptr_t)ptr)/7,
 but in that case, if the resulting pointer happens by chance to reference
 valid memory, I believe a dependency would still be carried.
 [...]

 From a language lawyer standpoint, pointer arithmetic is only valid
 within an array.  These examples seem to go beyond the bounds of the
 array and therefore have undefined behavior.

 C++ standard section 5.7 paragraph 4
 If both the pointer operand and the result point to elements of the
 same array object, or one past the last element of the array object,
 the evaluation shall not produce an overflow; otherwise, the behavior
 is undefined.

 C99 and C11
 identical phrasing in 6.5.6 paragraph 8

Of course you can try to circumvent that by doing
(char*)((intptr_t)ptr - (intptr_t)ptr + (intptr_t)ptr)
(see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65752 for extra fun).

Which (IMHO) gets you into the standard language that only makes conversion of
the exact same integer back to a pointer well-defined(?)

Richard.

 Jens


Re: Compilers and RCU readers: Once more unto the breach!

2015-05-20 Thread David Howells
Paul E. McKenney paul...@linux.vnet.ibm.com wrote:

  Additionally, what about the following code?
  
char *x = y ? z : z;
  
  Does that extend a dependency chain from z to x? If so, I can imagine a
  CPU breaking that in practice.
 
 I am not seeing this.  I would expect the compiler to optimize to
 something like this:
 
   char *x = z;

Why?  What if y has a potential side-effect (say it makes a function call)?

David


Re: Compilers and RCU readers: Once more unto the breach!

2015-05-20 Thread Paul E. McKenney
On Wed, May 20, 2015 at 12:47:45PM +0100, Will Deacon wrote:
 Hi Paul,
 
 On Wed, May 20, 2015 at 03:41:48AM +0100, Paul E. McKenney wrote:
  On Tue, May 19, 2015 at 07:10:12PM -0700, Linus Torvalds wrote:
   On Tue, May 19, 2015 at 6:57 PM, Linus Torvalds
   torva...@linux-foundation.org wrote:
   So I think you're better off just saying that operations designed to
   drop significant bits break the dependency chain, and give things like
 & 1 and (char *)ptr-(uintptr_t)ptr as examples of such.
   
    Making that just an extension of your existing & 0 language would
   seem to be natural.
  
  Works for me!  I added the following bullet to the list of things
  that break dependencies:
  
  If a pointer is part of a dependency chain, and if the values
  added to or subtracted from that pointer cancel the pointer
  value so as to allow the compiler to precisely determine the
  resulting value, then the resulting value will not be part of
  any dependency chain.  For example, if p is part of a dependency
  chain, then ((char *)p-(uintptr_t)p)+65536 will not be.
  
  Seem reasonable?
 
 Whilst I understand what you're saying (the ARM architecture makes these
 sorts of distinctions when calling out dependency-based ordering), it
 feels like we're dangerously close to defining the difference between a
 true and a false dependency. If we want to do this in the context of the
 C language specification, you run into issues because you need to evaluate
 the program in order to determine data values in order to determine the
 nature of the dependency.

Indeed, something like this does -not- carry a dependency from the
memory_order_consume load to q:

char *p, *q;

p = atomic_load_explicit(gp, memory_order_consume);
q = gq + (intptr_t)p - (intptr_t)p;

If this was compiled with -O0, ARM and Power might well carry a
dependency, but given any optimization, the assembly language would have
no hint of any such dependency.  So I am not seeing any particular danger.

 You tackle this above by saying to allow the compiler to precisely
 determine the resulting value, but I can't see how that can be cleanly
 fitted into something like the C language specification.

I am sure that there will be significant rework from where this document
is to language appropriate from the standard.  Which is why I am glad
that Jens is taking an interest in this, as he is particularly good at
producing standards language.

  Even if it can,
 then we'd need to reword the ?: treatment that you currently have:
 
   If a pointer is part of a dependency chain, and that pointer appears
in the entry of a ?: expression selected by the condition, then the
chain extends to the result.
 
 which I think requires the state of the condition to be known statically
 if we only want to extend the chain from the selected expression. In the
 general case, wouldn't a compiler have to assume that the chain is
 extended from both?

In practice, yes, if the compiler cannot determine which expression is
selected, it must arrange for the dependency to be carried from either,
depending on the run-time value of the condition.  But you would have
to work pretty hard to create code that did not carry the dependencies
as required, no?

 Additionally, what about the following code?
 
   char *x = y ? z : z;
 
 Does that extend a dependency chain from z to x? If so, I can imagine a
 CPU breaking that in practice.

I am not seeing this.  I would expect the compiler to optimize to
something like this:

char *x = z;

How does this avoid carrying the dependency?  Or are you saying that
ARM loses the dependency via a store to memory and a later reload?
That would be a bit surprising...

   Humans will understand, and compiler writers won't care. They will
   either depend on hardware semantics anyway (and argue that your
   language is tight enough that they don't need to do anything special)
   or they will turn the consume into an acquire (on platforms that have
   too weak hardware).
  
  Agreed.  Plus Core Working Group will hammer out the exact wording,
  should this approach meet their approval.
 
 For the avoidance of doubt, I'm completely behind any attempts to tackle
 this problem, but I anticipate an uphill struggle getting this text into
 the C standard. Is your intention to change the carries-a-dependency
 relation to encompass this change?

I completely agree that this won't be easy, but this is the task at hand.
And yes, the intent is to change carries-a-dependency, given that the
current wording isn't helping anything.  ;-)

Thanx, Paul



Re: Compilers and RCU readers: Once more unto the breach!

2015-05-20 Thread Paul E. McKenney
On Wed, May 20, 2015 at 02:18:37PM +0100, David Howells wrote:
 Paul E. McKenney paul...@linux.vnet.ibm.com wrote:
 
   Additionally, what about the following code?
   
 char *x = y ? z : z;
   
   Does that extend a dependency chain from z to x? If so, I can imagine a
   CPU breaking that in practice.
  
  I am not seeing this.  I would expect the compiler to optimize to
  something like this:
  
  char *x = z;
 
 Why?  What if y has a potential side-effect (say it makes a function call)?

I was thinking of y as a simple variable, but if it is something more
complex, then the compiler could do this, right?

char *x;

y;
x = z;

Thanx, Paul



Re: Compilers and RCU readers: Once more unto the breach!

2015-05-20 Thread Ramana Radhakrishnan



On 20/05/15 14:37, David Howells wrote:

Paul E. McKenney paul...@linux.vnet.ibm.com wrote:


I was thinking of y as a simple variable, but if it is something more
complex, then the compiler could do this, right?

char *x;

y;
x = z;


Yeah.  I presume it has to maintain the ordering, though.


The scheduler, for example, is free to reorder if it can prove there is no 
dependence (or indeed side-effects for y) between insns produced for y 
and `x = z'.


regards
Ramana



David



Re: Compilers and RCU readers: Once more unto the breach!

2015-05-20 Thread Ramana Radhakrishnan



On 20/05/15 15:03, Paul E. McKenney wrote:

On Wed, May 20, 2015 at 02:44:30PM +0100, Ramana Radhakrishnan wrote:



On 20/05/15 14:37, David Howells wrote:

Paul E. McKenney paul...@linux.vnet.ibm.com wrote:


I was thinking of y as a simple variable, but if it is something more
complex, then the compiler could do this, right?

char *x;

y;
x = z;


Yeah.  I presume it has to maintain the ordering, though.


 The scheduler, for example, is free to reorder if it can prove there is
no dependence (or indeed side-effects for y) between insns produced
for y and `x = z'.


So for example, if y is independent of z, the compiler can do the
following:

char *x;

x = z;
y;

But the dependency ordering is still maintained from z to x, so this
is not a problem.



Well, any reads of x (assuming x was initialized elsewhere) would 
need to happen before x was assigned z.


I understood the original maintain the ordering as between the 
statements `x = z' and `y'.





Or am I missing something subtle here?


No, it sounds like we are on the same page here.

regards
Ramana



Thanx, Paul



Re: Compilers and RCU readers: Once more unto the breach!

2015-05-20 Thread David Howells
Paul E. McKenney paul...@linux.vnet.ibm.com wrote:

 I was thinking of y as a simple variable, but if it is something more
 complex, then the compiler could do this, right?
 
   char *x;
 
   y;
   x = z;

Yeah.  I presume it has to maintain the ordering, though.

David


Re: Compilers and RCU readers: Once more unto the breach!

2015-05-20 Thread Paul E. McKenney
On Wed, May 20, 2015 at 02:44:30PM +0100, Ramana Radhakrishnan wrote:
 
 
 On 20/05/15 14:37, David Howells wrote:
 Paul E. McKenney paul...@linux.vnet.ibm.com wrote:
 
 I was thinking of y as a simple variable, but if it is something more
 complex, then the compiler could do this, right?
 
 char *x;
 
 y;
 x = z;
 
 Yeah.  I presume it has to maintain the ordering, though.
 
  The scheduler, for example, is free to reorder if it can prove there is
 no dependence (or indeed side-effects for y) between insns produced
 for y and `x = z'.

So for example, if y is independent of z, the compiler can do the
following:

char *x;

x = z;
y;

But the dependency ordering is still maintained from z to x, so this
is not a problem.

Or am I missing something subtle here?

Thanx, Paul



Re: [c++std-parallel-1624] Re: Compilers and RCU readers: Once more unto the breach!

2015-05-20 Thread Paul E. McKenney
On Wed, May 20, 2015 at 02:37:05PM +0100, David Howells wrote:
 Paul E. McKenney paul...@linux.vnet.ibm.com wrote:
 
  I was thinking of y as a simple variable, but if it is something more
  complex, then the compiler could do this, right?
  
  char *x;
  
  y;
  x = z;
 
 Yeah.  I presume it has to maintain the ordering, though.

Agreed.  Unless of course y writes to x or some such.

Given that there is already code in the Linux kernel relying on
dependencies being carried through stores to local variables,
this transformation should not be a problem.
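The kernel pattern alluded to here can be sketched with C11 atomics rather than rcu_dereference(); the names (gp, the_foo, publisher, reader) are illustrative, not kernel code. The consume-loaded pointer is copied through a local variable before being dereferenced, and the dependency is intended to carry through that copy:

```c
#include <stdatomic.h>

struct foo { int a; };

static _Atomic(struct foo *) gp;           /* illustrative published pointer */
static struct foo the_foo = { .a = 42 };

static void publisher(void)
{
    atomic_store_explicit(&gp, &the_foo, memory_order_release);
}

static int reader(void)
{
    struct foo *p = atomic_load_explicit(&gp, memory_order_consume);
    struct foo *q = p;   /* store through a local variable: the
                          * dependency is intended to carry to q->a */
    return q->a;
}
```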

Or am I missing something?

Thanx, Paul



Re: Compilers and RCU readers: Once more unto the breach!

2015-05-20 Thread David Howells
Paul E. McKenney paul...@linux.vnet.ibm.com wrote:

 Ah, I was assuming between x and z.  David, what was your intent?  ;-)

Clarification.

David


Re: Compilers and RCU readers: Once more unto the breach!

2015-05-20 Thread Will Deacon
On Wed, May 20, 2015 at 01:15:22PM +0100, Paul E. McKenney wrote:
 On Wed, May 20, 2015 at 12:47:45PM +0100, Will Deacon wrote:
  On Wed, May 20, 2015 at 03:41:48AM +0100, Paul E. McKenney wrote:
 If a pointer is part of a dependency chain, and if the values
 added to or subtracted from that pointer cancel the pointer
 value so as to allow the compiler to precisely determine the
 resulting value, then the resulting value will not be part of
 any dependency chain.  For example, if p is part of a dependency
 chain, then ((char *)p-(uintptr_t)p)+65536 will not be.
   
   Seem reasonable?
  
  Whilst I understand what you're saying (the ARM architecture makes these
  sorts of distinctions when calling out dependency-based ordering), it
  feels like we're dangerously close to defining the difference between a
  true and a false dependency. If we want to do this in the context of the
  C language specification, you run into issues because you need to evaluate
  the program in order to determine data values in order to determine the
  nature of the dependency.
 
 Indeed, something like this does -not- carry a dependency from the
 memory_order_consume load to q:
 
   char *p, *q;
 
   p = atomic_load_explicit(gp, memory_order_consume);
   q = gq + (intptr_t)p - (intptr_t)p;
 
 If this was compiled with -O0, ARM and Power might well carry a
 dependency, but given any optimization, the assembly language would have
 no hint of any such dependency.  So I am not seeing any particular danger.

The above is a welcome relaxation over C11, since ARM doesn't even give
you ordering based off false data dependencies. My concern is more to do
with how this can be specified precisely without prohibiting honest compiler
and hardware optimisations.

Out of interest, how do you tackle examples (4) and (5) (assuming the
reads are promoted to consume loads) of:

  http://www.cl.cam.ac.uk/~pes20/cpp/notes42.html

my understanding is that you permit both outcomes (I appreciate you're
not directly tackling out-of-thin-air, but treatment of dependencies
is heavily related).

  You tackle this above by saying to allow the compiler to precisely
  determine the resulting value, but I can't see how that can be cleanly
  fitted into something like the C language specification.
 
 I am sure that there will be significant rework from where this document
 is to language appropriate from the standard.  Which is why I am glad
 that Jens is taking an interest in this, as he is particularly good at
 producing standards language.

Ok. I'm curious to see how that comes along.

   Even if it can,
  then we'd need to reword the ?: treatment that you currently have:
  
If a pointer is part of a dependency chain, and that pointer appears
 in the entry of a ?: expression selected by the condition, then the
 chain extends to the result.
  
  which I think requires the state of the condition to be known statically
  if we only want to extend the chain from the selected expression. In the
  general case, wouldn't a compiler have to assume that the chain is
  extended from both?
 
 In practice, yes, if the compiler cannot determine which expression is
 selected, it must arrange for the dependency to be carried from either,
 depending on the run-time value of the condition.  But you would have
 to work pretty hard to create code that did not carry the dependencies
as required, no?

I'm not sure... you'd require the compiler to perform static analysis of
loops to determine the state of the machine when they exit (if they exit!)
in order to show whether or not a dependency is carried to subsequent
operations. If it can't prove otherwise, it would have to assume that a
dependency *is* carried, and it's not clear to me how it would use this
information to restrict any subsequent dependency removing optimisations.

I guess that's one for the GCC folks.

  Additionally, what about the following code?
  
char *x = y ? z : z;
  
  Does that extend a dependency chain from z to x? If so, I can imagine a
  CPU breaking that in practice.
 
 I am not seeing this.  I would expect the compiler to optimize to
 something like this:
 
   char *x = z;
 
 How does this avoid carrying the dependency?  Or are you saying that
 ARM loses the dependency via a store to memory and a later reload?
 That would be a bit surprising...

I was thinking that the compiler would have to preserve the conditional
structure so that the dependency chain could be tracked correctly, but
if it can just assume that the dependency is carried regardless of y then
I agree that it doesn't matter for this code. All the CPU could do is
remove the conditional hazard.

Humans will understand, and compiler writers won't care. They will
either depend on hardware semantics anyway (and argue that your
language is tight enough that they don't need to do anything special)
or they will turn the consume into an acquire (on platforms that have
too weak hardware).

Re: Compilers and RCU readers: Once more unto the breach!

2015-05-20 Thread Andrew Haley
On 05/20/2015 04:46 PM, Will Deacon wrote:
 I'm not sure... you'd require the compiler to perform static analysis of
 loops to determine the state of the machine when they exit (if they exit!)
 in order to show whether or not a dependency is carried to subsequent
 operations. If it can't prove otherwise, it would have to assume that a
 dependency *is* carried, and it's not clear to me how it would use this
 information to restrict any subsequent dependency removing optimisations.

It'd just convert consume to acquire.

Andrew.



Re: Compilers and RCU readers: Once more unto the breach!

2015-05-20 Thread Paul E. McKenney
On Wed, May 20, 2015 at 04:46:17PM +0100, Will Deacon wrote:
 On Wed, May 20, 2015 at 01:15:22PM +0100, Paul E. McKenney wrote:
  On Wed, May 20, 2015 at 12:47:45PM +0100, Will Deacon wrote:
   On Wed, May 20, 2015 at 03:41:48AM +0100, Paul E. McKenney wrote:
If a pointer is part of a dependency chain, and if the values
added to or subtracted from that pointer cancel the pointer
value so as to allow the compiler to precisely determine the
resulting value, then the resulting value will not be part of
any dependency chain.  For example, if p is part of a dependency
chain, then ((char *)p-(uintptr_t)p)+65536 will not be.

Seem reasonable?
   
   Whilst I understand what you're saying (the ARM architecture makes these
   sorts of distinctions when calling out dependency-based ordering), it
   feels like we're dangerously close to defining the difference between a
   true and a false dependency. If we want to do this in the context of the
   C language specification, you run into issues because you need to evaluate
   the program in order to determine data values in order to determine the
   nature of the dependency.
  
  Indeed, something like this does -not- carry a dependency from the
  memory_order_consume load to q:
  
  char *p, *q;
  
  p = atomic_load_explicit(gp, memory_order_consume);
  q = gq + (intptr_t)p - (intptr_t)p;
  
  If this was compiled with -O0, ARM and Power might well carry a
  dependency, but given any optimization, the assembly language would have
  no hint of any such dependency.  So I am not seeing any particular danger.
 
 The above is a welcome relaxation over C11, since ARM doesn't even give
 you ordering based off false data dependencies. My concern is more to do
 with how this can be specified precisely without prohibiting honest compiler
 and hardware optimisations.

That last is the challenge.  I believe that I am pretty close, but I am
sure that additional adjustment will be required.  Especially given that
we also need the memory model to be amenable to formal analysis.

 Out of interest, how do you tackle examples (4) and (5) (assuming the
 reads are promoted to consume loads) of:
 
   http://www.cl.cam.ac.uk/~pes20/cpp/notes42.html
 
 my understanding is that you permit both outcomes (I appreciate you're
 not directly tackling out-of-thin-air, but treatment of dependencies
 is heavily related).

Let's see...  #4 is as follows, given promotion to memory_order_consume
and (I am guessing) memory_order_relaxed:

r1 = atomic_load_explicit(x, memory_order_consume);
if (r1 == 42)
  atomic_store_explicit(y, r1, memory_order_relaxed);
--
r2 = atomic_load_explicit(y, memory_order_consume);
if (r2 == 42)
  atomic_store_explicit(x, 42, memory_order_relaxed);
else
  atomic_store_explicit(x, 42, memory_order_relaxed);

The second thread does not have a proper control dependency, even with
the memory_order_consume load because both branches assign the same
value to x.  This means that the compiler is within its rights to
optimize this into the following:

r1 = atomic_load_explicit(x, memory_order_consume);
if (r1 == 42)
  atomic_store_explicit(y, r1, memory_order_relaxed);
--
r2 = atomic_load_explicit(y, memory_order_consume);
atomic_store_explicit(x, 42, memory_order_relaxed);

There is no dependency between the second thread's pair of statements,
so both the compiler and the CPU are within their rights to optimize
further as follows:

r1 = atomic_load_explicit(x, memory_order_consume);
if (r1 == 42)
  atomic_store_explicit(y, r1, memory_order_relaxed);
--
atomic_store_explicit(x, 42, memory_order_relaxed);
r2 = atomic_load_explicit(y, memory_order_consume);

If the compiler makes this final optimization, even mythical SC hardware
is within its rights to end up with (r1 == 42 && r2 == 42).  Which is
fine, as far as I am concerned.  Or at least something that can be
lived with.

On to #5:

r1 = atomic_load_explicit(x, memory_order_consume);
if (r1 == 42)
  atomic_store_explicit(y, r1, memory_order_relaxed);

r2 = atomic_load_explicit(y, memory_order_consume);
if (r2 == 42)
  atomic_store_explicit(x, 42, memory_order_relaxed);

The first thread's accesses are dependency ordered.  The second thread's
ordering is in a corner case that memory-barriers.txt does not cover.
You are supposed to start control dependencies with READ_ONCE_CTRL(), not
a memory_order_consume load (AKA rcu_dereference and friends).  However,
Alpha would have a full barrier as part of the 

Re: [c++std-parallel-1632] Re: Compilers and RCU readers: Once more unto the breach!

2015-05-20 Thread Paul E. McKenney
On Wed, May 20, 2015 at 04:54:51PM +0100, Andrew Haley wrote:
 On 05/20/2015 04:46 PM, Will Deacon wrote:
  I'm not sure... you'd require the compiler to perform static analysis of
  loops to determine the state of the machine when they exit (if they exit!)
  in order to show whether or not a dependency is carried to subsequent
  operations. If it can't prove otherwise, it would have to assume that a
  dependency *is* carried, and it's not clear to me how it would use this
  information to restrict any subsequent dependency removing optimisations.
 
 It'd just convert consume to acquire.

It should not need to, actually.

Thanx, Paul



Re: Compilers and RCU readers: Once more unto the breach!

2015-05-20 Thread Paul E. McKenney
On Wed, May 20, 2015 at 03:15:48PM +0100, Ramana Radhakrishnan wrote:
 
 
 On 20/05/15 15:03, Paul E. McKenney wrote:
 On Wed, May 20, 2015 at 02:44:30PM +0100, Ramana Radhakrishnan wrote:
 
 
 On 20/05/15 14:37, David Howells wrote:
 Paul E. McKenney paul...@linux.vnet.ibm.com wrote:
 
 I was thinking of y as a simple variable, but if it is something more
 complex, then the compiler could do this, right?
 
   char *x;
 
   y;
   x = z;
 
 Yeah.  I presume it has to maintain the ordering, though.
 
  The scheduler, for example, is free to reorder if it can prove there is
 no dependence (or indeed side-effects for y) between insns produced
 for y and `x = z'.
 
 So for example, if y is independent of z, the compiler can do the
 following:
 
  char *x;
 
  x = z;
  y;
 
 But the dependency ordering is still maintained from z to x, so this
 is not a problem.
 
 
 Well, any reads of x (assuming x was initialized elsewhere) would
 need to happen before x was assigned z.

Agreed, there needs to be a memory_order_consume load up there somewhere.
(AKA rcu_dereference().)

 I understood the original maintain the ordering as between the
 statements `x = z' and `y'.

Ah, I was assuming between x and z.  David, what was your intent?  ;-)

 Or am I missing something subtle here?
 
 No, it sounds like we are on the same page here.

Whew!  ;-)

Thanx, Paul



Re: Compilers and RCU readers: Once more unto the breach!

2015-05-19 Thread Linus Torvalds
On Tue, May 19, 2015 at 5:55 PM, Paul E. McKenney
paul...@linux.vnet.ibm.com wrote:

 http://www.rdrop.com/users/paulmck/RCU/consume.2015.05.18a.pdf

From a very quick read-through, the restricted dependency chain in 7.9
seems to be reasonable, and essentially covers what hardware
gives us anyway, making compiler writers happy.

I would clarify the language somewhat:

 - it says that the result of a cast of a pointer is a dependency. You
need to make an exception for casting to bool, methinks (because
that's effectively just a test-against-NULL, which you later describe
as terminating the dependency).

   Maybe get rid of the "any type", and just limit it to casts to
types of size intptr_t, i.e. ones that don't drop significant bits. All
the other rules talk about [u]intptr_t anyway.

 - you clarify that the trivial & 0 and | ~0 kill the dependency
chain, but if you really want to be a stickler, you might want to
extend it to a few more cases. Things like & 1 (to extract a tag
from the low bit of a tagged pointer) should likely also drop the
dependency, since a compiler would commonly end up using the end
result as a conditional even if the code was written to then use
casting etc to look like a dereference.

 - the "you can add/subtract integral values" still opens you up to
language lawyers claiming (char *)ptr - (intptr_t)ptr preserving the
dependency, which it clearly doesn't. But language-lawyering it does,
since all those operations (cast to pointer, cast to integer,
subtracting an integer) claim to be dependency-preserving operations.

So I think you want to limit the logical operators to things that
don't mask off too many bits, and you should probably limit the
add/subtract operations some way (maybe specify that the integer value
you add/subtract cannot be related to the pointer). But I think
limiting it to mostly pointer ops (and a _few_ integer operations to
do offsets and remove tag bits) is otherwise a good approach.

   Linus


Re: Compilers and RCU readers: Once more unto the breach!

2015-05-19 Thread Linus Torvalds
On Tue, May 19, 2015 at 6:57 PM, Linus Torvalds
torva...@linux-foundation.org wrote:

  - the "you can add/subtract integral values" still opens you up to
 language lawyers claiming (char *)ptr - (intptr_t)ptr preserving the
 dependency, which it clearly doesn't. But language-lawyering it does,
 since all those operations (cast to pointer, cast to integer,
 subtracting an integer) claim to be dependency-preserving operations.

 So I think you want to limit the logical operators to things that
 don't mask off too many bits, and you should probably limit the
 add/subtract operations some way (maybe specify that the integer value
 you add/subtract cannot be related to the pointer).

Actually, not related doesn't work. For some buddy allocator thing,
you very much might want some of the bits to be related.

So I think you're better off just saying that operations designed to
drop significant bits break the dependency chain, and give things like
 & 1 and (char *)ptr-(uintptr_t)ptr as examples of such.

Making that just an extension of your existing & 0 language would
seem to be natural.

Humans will understand, and compiler writers won't care. They will
either depend on hardware semantics anyway (and argue that your
language is tight enough that they don't need to do anything special)
or they will turn the consume into an acquire (on platforms that have
too weak hardware).

 Linus


Re: Compilers and RCU readers: Once more unto the breach!

2015-05-19 Thread Paul E. McKenney
On Tue, May 19, 2015 at 06:57:02PM -0700, Linus Torvalds wrote:
 On Tue, May 19, 2015 at 5:55 PM, Paul E. McKenney
 paul...@linux.vnet.ibm.com wrote:
 
  http://www.rdrop.com/users/paulmck/RCU/consume.2015.05.18a.pdf
 
 From a very quick read-through, the restricted dependency chain in 7.9
 seems to be reasonable, and essentially covers what hardware
 gives us anyway, making compiler writers happy.
 
 I would clarify the language somewhat:
 
  - it says that the result of a cast of a pointer is a dependency. You
 need to make an exception for casting to bool, methinks (because
 that's effectively just a test-against-NULL, which you later describe
 as terminating the dependency).
 
 Maybe get rid of the "any type", and just limit it to casts to
 types of size intptr_t, i.e. ones that don't drop significant bits. All
 the other rules talk about [u]intptr_t anyway.

Excellent point!  I now say:

If a pointer is part of a dependency chain, then casting it
(either explicitly or implicitly) to any pointer-sized type
extends the chain to the result.

If this approach works out, the people in the Core Working Group will
come up with alternative language-lawyer-proof wording, but this informal
version will hopefully do for the moment.
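The two cast cases under this wording can be sketched as follows (illustrative helpers, not standard text): a round trip through uintptr_t is pointer-sized and would extend the chain, whereas a cast to bool is effectively a test against NULL and would terminate it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Pointer-sized round trip: under the proposed wording, the result
 * would remain part of the dependency chain. */
static int *roundtrip(int *p)
{
    return (int *)(uintptr_t)p;
}

/* Cast to bool: effectively a test against NULL, so the chain
 * would terminate here. */
static bool nonnull(int *p)
{
    return (bool)p;
}
```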

  - you clarify that the trivial & 0 and | ~0 kill the dependency
 chain, but if you really want to be a stickler, you might want to
 extend it to a few more cases. Things like & 1 (to extract a tag
 from the low bit of a tagged pointer) should likely also drop the
 dependency, since a compiler would commonly end up using the end
 result as a conditional even if the code was written to then use
 casting etc to look like a dereference.

Ah, how about the following?

If a value of type intptr_t or uintptr_t is part of a dependency
chain, and if that value is one of the operands to an & or |
infix operator whose result has too few or too many bits set,
then the resulting value will not be part of any dependency
chain.  For example, on a 64-bit system, if p is part of a
dependency chain, then (p & 0x7) provides just the tag bits,
and normally cannot even be legally dereferenced.  Similarly,
(p | ~0) normally cannot be legally dereferenced.

  - the "you can add/subtract integral values" still opens you up to
 language lawyers claiming (char *)ptr - (intptr_t)ptr preserving the
 dependency, which it clearly doesn't. But language-lawyering it does,
 since all those operations (cast to pointer, cast to integer,
 subtracting an integer) claim to be dependency-preserving operations.

My thought was that the result of (char *)ptr - (intptr_t)ptr is a
NULL pointer in most environments, and dereferencing a NULL pointer
is undefined behavior.  So it becomes irrelevant whether or not the
NULL pointer carries a dependency.

There are some stranger examples, such as (char *)ptr - ((intptr_t)ptr)/7,
but in that case, if the resulting pointer happens by chance to reference 
valid memory, I believe a dependency would still be carried.  Of course,
if you are producing code like that, I am guessing that dependencies are
the very least of your concerns.

However, I will give this some more thought.

 So I think you want to limit the logical operators to things that
 don't mask off too many bits, and you should probably limit the
 add/subtract operations some way (maybe specify that the integer value
 you add/subtract cannot be related to the pointer). But I think
 limiting it to mostly pointer ops (and a _few_ integer operations to
 do offsets and remove tag bits) is otherwise a good approach.

Glad you mostly like it!  ;-)

Thanx, Paul



Re: Compilers and RCU readers: Once more unto the breach!

2015-05-19 Thread Paul E. McKenney
On Tue, May 19, 2015 at 07:10:12PM -0700, Linus Torvalds wrote:
 On Tue, May 19, 2015 at 6:57 PM, Linus Torvalds
 torva...@linux-foundation.org wrote:
 
   - the "you can add/subtract integral values" still opens you up to
  language lawyers claiming (char *)ptr - (intptr_t)ptr preserving the
  dependency, which it clearly doesn't. But language-lawyering it does,
  since all those operations (cast to pointer, cast to integer,
  subtracting an integer) claim to be dependency-preserving operations.
 
  So I think you want to limit the logical operators to things that
  don't mask off too many bits, and you should probably limit the
  add/subtract operations some way (maybe specify that the integer value
  you add/subtract cannot be related to the pointer).
 
 Actually, "not related" doesn't work. For some buddy allocator thing,
 you very much might want some of the bits to be related.

Good point, you could do the buddy-allocator computations with add
and subtract instead of exclusive OR.

 So I think you're better off just saying that operations designed to
 drop significant bits break the dependency chain, and give things like
  & 1 and (char *)ptr-(uintptr_t)ptr as examples of such.
 
  Making that just an extension of your existing & 0 language would
 seem to be natural.

Works for me!  I added the following bullet to the list of things
that break dependencies:

If a pointer is part of a dependency chain, and if the values
added to or subtracted from that pointer cancel the pointer
value so as to allow the compiler to precisely determine the
resulting value, then the resulting value will not be part of
any dependency chain.  For example, if p is part of a dependency
chain, then ((char *)p-(uintptr_t)p)+65536 will not be.

Seem reasonable?

 Humans will understand, and compiler writers won't care. They will
 either depend on hardware semantics anyway (and argue that your
 language is tight enough that they don't need to do anything special)
 or they will turn the consume into an acquire (on platforms that have
 too weak hardware).

Agreed.  Plus the Core Working Group will hammer out the exact wording,
should this approach meet their approval.

Thanx, Paul