On Fri, 2016-02-26 at 11:49 +0100, Richard Biener wrote:
> On Thu, Feb 25, 2016 at 6:33 PM, Torvald Riegel <trie...@redhat.com> wrote:
> > On Wed, 2016-02-24 at 13:14 +0100, Richard Biener wrote:
> >> On Tue, Feb 23, 2016 at 8:38 PM, Torvald Riegel <trie...@redhat.com> wrote:
> >> > I'd like to know, based on the GCC experience, how important we consider
> >> > optimizations that may turn data dependencies of pointers into control
> >> > dependencies.  I'm thinking about all optimizations or transformations
> >> > that guess that a pointer might have a specific value, and then create
> >> > (specialized) code that assumes this value that is only executed if the
> >> > pointer actually has this value.  For example:
> >> >
> >> > int d[2] = {23, compute_something()};
> >> >
> >> > int compute(int v) {
> >> >   if (likely(v == 23)) return 23;
> >> >   else <lots of stuff>;
> >> > }
> >> >
> >> > int bar() {
> >> >   int *p = ptr.load(memory_order_consume);
> >> >   size_t reveal_that_p_is_in_d = p - d;
> >> >   return compute(*p);
> >> > }
> >> >
> >> > Could be transformed to (after inlining compute(), and specializing for
> >> > the likely path):
> >> >
> >> > int bar() {
> >> >   int *p = ptr.load(memory_order_consume);
> >> >   if (p == d) return 23;
> >> >   else <lots of stuff(*p)>;
> >> > }
> >>
> >> Note that if a user writes
> >>
> >>   if (p == d)
> >>    {
> >>      ... do lots of stuff via p ...
> >>    }
> >>
> >> GCC might rewrite accesses to p as accesses to d and thus expose
> >> those opportunities.  Is that a transform that isn't valid then or is
> >> the code written by the user (establishing the equivalency) to blame?
> >
> > In the context of this memory_order_consume proposal, this transform
> > would be valid because the program has already "revealed" what value p
> > has after the branch has been taken.
> >
> >> There's a PR where this kind of equivalencies lead to unexpected (wrong?)
> >> points-to results for example.
> >>
> >> > Other potential examples that come to mind are de-virtualization, or
> >> > feedback-directed optimizations that have observed at runtime that a
> >> > certain pointer is likely to be always equal to some other pointer (eg.,
> >> > if p is almost always d[0], and specializing for that).
> >>
> >> Those are the cases that are quite important in practice.
> >
> > Could you quantify this somehow, even if it's a very rough estimate?
> > I'm asking because if it's significant and widely used, then this would
> > require users or compiler implementors to make a difficult trade-off
> > (ie, do you want mo_consume performance or performance through those
> > other optimizations?).
> 
> Probably resolved by your followup on how the transform is safe anyway.

Are you saying this because de-virtualization would be the only
optimization that matters for performance and turns data into control
dependencies (but can be considered safe)?

What about feedback-directed optimizations, which you said would be
among the important cases too?  Will these only look for equivalences in
non-mutable data too (ie, like vtables)?

What about the example I gave above?  Is it unrealistic for compilers to
ever do something like this, or is it just unlikely to gain much
performance, or is it just that GCC does not do this today?
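
For concreteness, here's the transformed bar() from the example above as
compilable C++ (just a sketch; compute_something() and lots_of_stuff()
are hypothetical stand-ins):

#include <atomic>

int compute_something() { return 42; }      // stand-in
int lots_of_stuff(int v) { return v + 1; }  // stand-in

int d[2] = {23, compute_something()};
std::atomic<int*> ptr;  // elsewhere: ptr.store(d, std::memory_order_release);

int bar() {
  int *p = ptr.load(std::memory_order_consume);
  if (p == d)
    return 23;               // speculated fast path: no access through p,
                             // so the HW sees only a control dependency
  return lots_of_stuff(*p);  // slow path: data dependency on p preserved
}

If the fast path were to load d[1] directly instead of computing the
address from p, that load would be ordered against the consume load only
by the branch, and Power / ARM do not enforce control dependencies for
subsequent loads without an additional barrier.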

> >> > Also, it would be interesting to me to know how often we may turn data
> >> > dependencies into control dependencies in cases where this doesn't
> >> > affect performance significantly.
> >>
> >> I suppose we try to avoid that, but can we ever know for sure?
> >> Speculative devirtualization does this, for example (with the intent
> >> that it _does_ matter, of course).
> >>
> >> I suppose establishing non-dependence isn't an issue, like with the
> >> vectorizer adding runtime dependence checks and applying versioning
> >> to get a vectorized and a not vectorized loop (in case there are 
> >> dependences)?
> >
> > I'm not sure I understand you correctly.  Do you have a brief example,
> > perhaps?  For mo_consume and its data dependencies, if there might be a
> > dependence, the compiler would have to preserve it; but I guess that
> > both a vectorized loop and one that accesses each element separately
> > would preserve dependences because it's doing those accesses, and they
> > depend on the input data.
> > OTOH, perhaps HW vector instructions don't get the ordering guarantees
> > from data dependences -- Paul, do you know of any such cases?
> 
> A brief example would be for
> 
>  void foo (int *a, int *b, int n)
> {
>   for (int i = 0; i < n; ++i)
>    a[i] = b[i];
> }
> 
> which we can vectorize like
> 
>   if (a + n < b || b + n < a)
>    {
>       vectorized loop
>    }
>   else
>     {
>        not vectorized loop
>     }
> 
> note how we're not establishing equivalences between pointers but
> non-dependence vs. possible dependence.

Okay, non-dependence of this kind (ie, disjoint-access parallelism) isn't
a problem, as long as both the vectorized and non-vectorized loops
perform the memory accesses using HW instructions that communicate the
dependence to the HW (ie, use a / b as the base) and for which the HW
establishes ordering due to the dependences (hence my question about
vector ops to Paul).

Note that if the non-vectorized loop did something like this:

if (a + n > b && a < b)
  {
    int i, k = b - a;
    for (i = 0; i < k; i++)   /* copy the non-overlapping head */
      a[i] = b[i];
    for (; i < n; i++)        /* the compiler knows a + k == b here, and
                                 uses b as the only base pointer */
      b[i - k] = b[i];
  }

we'd have a problem because the second loop doesn't communicate the
dependence on a to the HW.

Maybe that's another class of transforms we'd have to consider?  That
is, transforms that figure out that some derived pointers (eg, a + k and
b above) are equal and then carry on using just one of those base
dependencies in the generated code?
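
In this example, keeping a as the base for the stores would preserve
the dependence, eg (a sketch):

    for (; i < n; i++)
      a[i] = b[i];   /* store address still computed from a */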

> >> > The background for this question is Paul McKenney's recently updated
> >> > proposal for a different memory_order_consume specification:
> >> > http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0190r0.pdf
> >> >
> >> > In a nutshell, this requires a compiler to either prove that a pointer
> >> > value is not carrying a dependency (simplified, its value somehow
> >> > originates from a memory_order_consume load), or it has to
> >> > conservatively assume that it does; if it does, the compiler must not
> >> > turn data dependencies into control dependencies in generated code.
> >> > (The data dependencies, in contrast to control dependencies, enforce
> >> > memory ordering on archs such as Power and ARM; these orderings then
> >> > allow for not having to use an acquire HW barrier in the generated
> >> > code.)
> >> >
> >> > Given that such a proof will likely be hard for a compiler (dependency
> >> > chains propagate through assignments to variables on the heap and stack,
> >> > chains are not marked in the code, and points-to analysis can be hard),
> >> > a compiler faces a trade-off between either:
> >> > (1) trying to support this memory_order_consume specification and likely
> >> > disallowing all transformations that change data dependencies into
> >> > control dependencies, or
> >> > (2) not supporting the proposal and simply emitting memory_order_acquire
> >> > code, but getting no new constraints on transformations in return (ie, what
> >> > we do for memory_order_consume today).
> >> >
> >> > A compiler could let users make this choice, but this will be hard for
> >> > users too, and the compiler would still have to pick a default.
> >> >
> >> > Therefore, it would be good to know how important such transformations
> >> > or optimizations are in practice.  If they are, this either means somewhat
> >> > slower code everywhere (or at least having to change a lot in today's
> >> > compilers), or not supporting the new memory_order_consume (at least not
> >> > in the default setting of the compiler).
> >>
> >> IMHO all the memory order semantics are too complicated already so what's
> >> the reason to complicate them even more...?
> >
> > The hardware memory models of archs like Power and ARM guarantee ordering
> > for accesses that have a data dependency, thus allowing things like
> > traversing a concurrently modified list to not have to use acquire
> > memory barriers (ie, a programmer could use memory_order_consume instead
> > of memory_order_acquire).  The Linux kernel uses this heavily in RCU,
> > for example.  It doesn't matter for archs like x86 where acquire /
> > release barriers are implicit.
> 
> That's memory commit ordering on a single CPU core, right?  How would it
> ever be valid to change commit order when there is a data dependence?

No, this is across cores.  One can do this to publish an initialized
object, for example:
  obj->data = init();
  ptr.store(obj, memory_order_release);
and consume it, being sure that one will see initialized data:
  p = ptr.load(memory_order_acquire);
  my_data = p->data;
This will emit an acquire HW fence (or similar) on archs such as Power.
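
As compilable C++, the publish / acquire pattern is roughly this (a
sketch, with made-up names):

#include <atomic>

struct Obj { int data; };
int init() { return 42; }  // stand-in
std::atomic<Obj*> ptr{nullptr};

void publisher(Obj *obj) {
  obj->data = init();                         // initialize first ...
  ptr.store(obj, std::memory_order_release);  // ... then publish
}

int consumer_acquire() {
  Obj *p;
  while (!(p = ptr.load(std::memory_order_acquire)))
    ;               // spin until published; the acquire load orders the
                    // load of data after it (eg, via lwsync on Power)
  return p->data;   // guaranteed to see the initialized value
}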

What mo_consume wants to enable is programmers to write this:
  p = ptr.load(memory_order_consume);
  my_data = p->data;
This would be compiled to a plain atomic load of ptr (no acquire fence,
so like memory_order_relaxed), followed by a plain load from p plus the
offset of data.  Because the address of that second load is computed
from p, the HW turns the data dependence into an ordering guarantee: the
load from data cannot "happen before" the load from ptr, so even in this
case we'd always observe an initialized object.
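
The consume side of the same sketch would be:

int consumer_consume() {
  Obj *p;
  while (!(p = ptr.load(std::memory_order_consume)))
    ;               // plain load, no acquire fence
  return p->data;   // the address is computed from p, and Power / ARM
                    // order this load after the load of ptr because of
                    // that data dependence
}

(Today, GCC would simply emit acquire code for this, as noted above.)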
