On Fri, Feb 26, 2016 at 8:10 PM, Torvald Riegel <trie...@redhat.com> wrote:
> On Fri, 2016-02-26 at 11:49 +0100, Richard Biener wrote:
>> On Thu, Feb 25, 2016 at 6:33 PM, Torvald Riegel <trie...@redhat.com> wrote:
>> > On Wed, 2016-02-24 at 13:14 +0100, Richard Biener wrote:
>> >> On Tue, Feb 23, 2016 at 8:38 PM, Torvald Riegel <trie...@redhat.com>
>> >> wrote:
>> >> > I'd like to know, based on the GCC experience, how important we
>> >> > consider optimizations that may turn data dependencies of pointers
>> >> > into control dependencies.  I'm thinking about all optimizations or
>> >> > transformations that guess that a pointer might have a specific
>> >> > value, and then create (specialized) code that assumes this value
>> >> > and is only executed if the pointer actually has this value.  For
>> >> > example:
>> >> >
>> >> > int d[2] = {23, compute_something()};
>> >> >
>> >> > int compute(int v) {
>> >> >   if (likely(v == 23)) return 23;
>> >> >   else <lots of stuff>;
>> >> > }
>> >> >
>> >> > int bar() {
>> >> >   int *p = ptr.load(memory_order_consume);
>> >> >   size_t reveal_that_p_is_in_d = p - d;
>> >> >   return compute(*p);
>> >> > }
>> >> >
>> >> > Could be transformed to (after inlining compute(), and specializing
>> >> > for the likely path):
>> >> >
>> >> > int bar() {
>> >> >   int *p = ptr.load(memory_order_consume);
>> >> >   if (p == d) return 23;
>> >> >   else <lots of stuff(*p)>;
>> >> > }
>> >>
>> >> Note that if a user writes
>> >>
>> >>   if (p == d)
>> >>     {
>> >>       ... do lots of stuff via p ...
>> >>     }
>> >>
>> >> GCC might rewrite accesses to p as accesses to d and thus expose
>> >> those opportunities.  Is that a transform that isn't valid then, or
>> >> is the code written by the user (establishing the equivalency) to
>> >> blame?
>> >
>> > In the context of this memory_order_consume proposal, this transform
>> > would be valid because the program has already "revealed" what value
>> > p has after the branch has been taken.
>> >
>> >> There's a PR where this kind of equivalency leads to unexpected
>> >> (wrong?) points-to results, for example.
>> >>
>> >> > Other potential examples that come to mind are de-virtualization,
>> >> > or feedback-directed optimizations that have observed at runtime
>> >> > that a certain pointer is likely to be always equal to some other
>> >> > pointer (eg., if p is almost always d[0], and specializing for
>> >> > that).
>> >>
>> >> Those are the cases that are quite important in practice.
>> >
>> > Could you quantify this somehow, even if it's a very rough estimate?
>> > I'm asking because if it's significant and widely used, then this
>> > would require users or compiler implementors to make a difficult
>> > trade-off (ie, do you want mo_consume performance or performance
>> > through those other optimizations?).
>>
>> Probably resolved by your followup on how the transform is safe anyway.
>
> Are you saying this because de-virtualization would be the only
> optimization that matters for performance and turns data into control
> dependencies (but can be considered safe)?
>
> What about feedback-directed optimizations, for which you said it would
> be one of the important cases too?  Will these only look for
> equivalences in non-mutable data too (ie, like vtables)?
We do have other value-profiling transforms, for example
division/modulo by a constant:

  if (x == 256)
    z = y / 256;
  else
    z = y / x;

> What about the example I gave above?  Is it unrealistic for compilers
> to ever do something like this, or is it just unlikely to gain much
> performance, or is it just that GCC does not do this today?

GCC does not do this today, with the exception of value-profiling.  In
other cases GCC does not establish equivalences but only relations
(a < b, etc.); those are not a problem as far as I can see because they
do not allow rewriting expressions using a to use b.

>> >> > Also, it would be interesting to me to know how often we may turn
>> >> > data dependencies into control dependencies in cases where this
>> >> > doesn't affect performance significantly.
>> >>
>> >> I suppose we try to avoid that, but can we ever know for sure?  Like
>> >> speculative devirtualization does this (with the intent that it
>> >> _does_ matter, of course).
>> >>
>> >> I suppose establishing non-dependence isn't an issue, like with the
>> >> vectorizer adding runtime dependence checks and applying versioning
>> >> to get a vectorized and a not vectorized loop (in case there are
>> >> dependences)?
>> >
>> > I'm not sure I understand you correctly.  Do you have a brief
>> > example, perhaps?  For mo_consume and its data dependencies, if there
>> > might be a dependence, the compiler would have to preserve it; but I
>> > guess that both a vectorized loop and one that accesses each element
>> > separately would preserve dependences because it's doing those
>> > accesses, and they depend on the input data.
>> > OTOH, perhaps HW vector instructions don't get the ordering
>> > guarantees from data dependences -- Paul, do you know of any such
>> > cases?
>>
>> A brief example would be, for
>>
>> void foo (int *a, int *b, int n)
>> {
>>   for (int i = 0; i < n; ++i)
>>     a[i] = b[i];
>> }
>>
>> which we can vectorize like
>>
>>   if (a + n < b || b + n < a)
>>     {
>>       vectorized loop
>>     }
>>   else
>>     {
>>       not vectorized loop
>>     }
>>
>> note how we're not establishing equivalences between pointers but
>> non-dependence vs. possible dependence.
>
> Okay, non-dependence of this kind (/ disjoint-access parallelism) isn't
> a problem, as long as both the vectorized and non-vectorized loops
> perform the memory accesses using HW instructions that communicate the
> dependence to the HW (ie, use a / b as base) and for which the HW
> establishes ordering due to the dependences (ie, my question about
> vector ops to Paul).
>
> Note that if the non-vectorized loop would do something like this:
>
>   if (a + n > b && a < b)
>     {
>       int n2 = n - (b - a);
>       for (; a < b; a++, b++)  /* vectorized copy */
>       for (i = n2; i < n; i++) b[i] = b[i + n2];
>     }
>
> we'd have a problem because the second part doesn't communicate the
> dependence on a to the HW.

Huh?  IVOPTs does create such IVs already.  But the actual access is of
course still "original", so I fail to see how the HW does not see the
dependence.

> Maybe that's another class of transforms we'd have to consider?  That
> is, transforms that figure out that some derived pointers (eg, a+n2 /
> b) are equal and then carry on using just one of those base
> dependencies in the generated code?

GCC does this today (though it causes some issues in GCC itself, so we
have tried to avoid it -- but certainly not for HW correctness reasons,
which I fail to see).
>> >> > The background for this question is Paul McKenney's recently
>> >> > updated proposal for a different memory_order_consume
>> >> > specification:
>> >> > http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0190r0.pdf
>> >> >
>> >> > In a nutshell, this requires a compiler to either prove that a
>> >> > pointer value is not carrying a dependency (simplified, its value
>> >> > somehow originates from a memory_order_consume load), or it has to
>> >> > conservatively assume that it does; if it does, the compiler must
>> >> > not turn data dependencies into control dependencies in generated
>> >> > code.  (The data dependencies, in contrast to control dependencies,
>> >> > enforce memory ordering on archs such as Power and ARM; these
>> >> > orderings then allow for not having to use an acquire HW barrier in
>> >> > the generated code.)
>> >> >
>> >> > Given that such a proof will likely be hard for a compiler
>> >> > (dependency chains propagate through assignments to variables on
>> >> > the heap and stack, chains are not marked in the code, and
>> >> > points-to analysis can be hard), a compiler faces a trade-off
>> >> > between either:
>> >> > (1) trying to support this memory_order_consume specification and
>> >> > likely disallowing all transformations that change data
>> >> > dependencies into control dependencies, or
>> >> > (2) not supporting the proposal by simply emitting
>> >> > memory_order_acquire code, but getting no new constraints on
>> >> > transformations in return (ie, what we do for memory_order_consume
>> >> > today).
>> >> >
>> >> > A compiler could let users make this choice, but this will be hard
>> >> > for users too, and the compiler would still have to pick a default.
>> >> >
>> >> > Therefore, it would be good to know how important such
>> >> > transformations or optimizations are in practice.
>> >> > If they are, it either means somewhat slower code everywhere (or
>> >> > at least having to change much in today's compilers), or not
>> >> > supporting the new memory_order_consume (at least not in the
>> >> > default setting of the compiler).
>> >>
>> >> IMHO all the memory order semantics are too complicated already, so
>> >> what's the reason to complicate them even more...?
>> >
>> > The hardware memory models of archs like Power and ARM guarantee
>> > ordering for accesses that have a data dependency, thus allowing
>> > things like traversing a concurrently modified list to not have to
>> > use acquire memory barriers (ie, a programmer could use
>> > memory_order_consume instead of memory_order_acquire).  The Linux
>> > kernel uses this heavily in RCU, for example.  It doesn't matter for
>> > archs like x86 where acquire / release barriers are implicit.
>>
>> That's memory commit ordering on a single CPU core, right?  How would
>> it ever be valid to change commit order when there is a data
>> dependence?
>
> No, this is across cores.  One can do this to publish an initialized
> object, for example:
>
>   obj.data = init();
>   ptr.store(obj, memory_order_release);
>
> and consume it, being sure that one will see initialized data:
>
>   p = ptr.load(memory_order_acquire);
>   my_data = p->data;
>
> This will emit an acquire HW fence (or similar) on archs such as Power.
>
> What mo_consume wants to enable is for programmers to write this:
>
>   p = ptr.load(memory_order_consume);
>   my_data = p->data;
>
> This would be compiled to a plain atomic load of ptr (no acquire fence,
> so like memory_order_relaxed), followed by a plain memory access to ptr
> plus the offset of data.  The HW instruction of the latter would
> transform dependences such as the one on ptr into an ordering guarantee
> that makes sure that the load from data does not "happen before" the
> load from ptr, so that even in this case we'd always observe an
> initialized object.
But there is a data dependence between the two instructions, so I fail
to see how the HW can ever execute the load from p->data before the
load of p itself (from ptr).  I'm quite sure no HW does value
speculation on 'p' to be able to load p->data before actually knowing
the value of 'p' ;)

Richard.