On Fri, Feb 26, 2016 at 11:49:29AM +0100, Richard Biener wrote: > On Thu, Feb 25, 2016 at 6:33 PM, Torvald Riegel <trie...@redhat.com> wrote: > > On Wed, 2016-02-24 at 13:14 +0100, Richard Biener wrote: > >> On Tue, Feb 23, 2016 at 8:38 PM, Torvald Riegel <trie...@redhat.com> wrote: > >> > I'd like to know, based on the GCC experience, how important we consider > >> > optimizations that may turn data dependencies of pointers into control > >> > dependencies. I'm thinking about all optimizations or transformations > >> > that guess that a pointer might have a specific value, and then create > >> > (specialized) code that assumes this value that is only executed if the > >> > pointer actually has this value. For example: > >> > > >> > int d[2] = {23, compute_something()}; > >> > > >> > int compute(int v) { > >> > if (likely(v == 23)) return 23; > >> > else <lots of stuff>; > >> > } > >> > > >> > int bar() { > >> > int *p = ptr.load(memory_order_consume); > >> > size_t reveal_that_p_is_in_d = p - d[0]; > >> > return compute(*p); > >> > } > >> > > >> > Could be transformed to (after inlining compute(), and specializing for > >> > the likely path): > >> > > >> > int bar() { > >> > int *p = ptr.load(memory_order_consume); > >> > if (p == d) return 23; > >> > else <lots of stuff(*p)>; > >> > } > >> > >> Note that if a user writes > >> > >> if (p == d) > >> { > >> ... do lots of stuff via p ... > >> } > >> > >> GCC might rewrite accesses to p as accesses to d and thus expose > >> those opportunities. Is that a transform that isn't valid then or is > >> the code written by the user (establishing the equivalency) to blame? > > > > In the context of this memory_order_consume proposal, this transform > > would be valid because the program has already "reveiled" what value p > > has after the branch has been taken. > > > >> There's a PR where this kind of equivalencies lead to unexpected (wrong?) > >> points-to results for example. > >> > >> > Other potential examples that come to mind are de-virtualization, or > >> > feedback-directed optimizations that has observed at runtime that a > >> > certain pointer is likely to be always equal to some other pointer (eg., > >> > if p is almost always d[0], and specializing for that). > >> > >> That's the cases that are quite important in practice. > > > > Could you quantify this somehow, even if it's a very rough estimate? > > I'm asking because it's significant and widely used, then this would > > require users or compiler implementors to make a difficult trade-off > > (ie, do you want mo_consume performance or performance through those > > other optimizations?). > > Probably resoved by your followup on how the transform is safe anyway. > > >> > Also, it would be interesting to me to know how often we may turn data > >> > dependencies into control dependencies in cases where this doesn't > >> > affect performance significantly. > >> > >> I suppose we try to avoid that but can we ever know for sure? Like > >> speculative devirtualization does this (with the intent that it _does_ > >> matter, > >> of course). > >> > >> I suppose establishing non-dependence isn't an issue, like with the > >> vectorizer adding runtime dependence checks and applying versioning > >> to get a vectorized and a not vectorized loop (in case there are > >> dependences)? > > > > I'm not sure I understand you correctly. Do you have a brief example, > > perhaps? For mo_consume and its data dependencies, if there might be a > > dependence, the compiler would have to preserve it; but I guess that > > both a vectorized loop an one that accessses each element separately > > would preserve dependences because it's doing those accesses, and they > > depend on the input data. > > OTOH, peraps HW vector instructions don't get the ordering guarantees > > from data dependences -- Paul, do you know of any such cases? > > A brief example would be for > > void foo (int *a, int *b, int n) > { > for (int i = 0; i < n; ++i) > a[i] = b[i]; > } > > which we can vectorize like > > if (a + n < b || b + n < a) > { > vectorized loop > } > else > { > not vectorized loop > } > > note how we're not establishing equivalences between pointers but > non-dependence vs. possible dependence.
Thank you for the clarification, I will check. If I am not too confused, such a loop might want to used x86 non-temporal stores, which require special handling even with acquire and release, but that is presumably already handled. Thanx, Paul > >> > The background for this question is Paul McKenney's recently updated > >> > proposal for a different memory_order_consume specification: > >> > http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0190r0.pdf > >> > > >> > In a nutshell, this requires a compiler to either prove that a pointer > >> > value is not carrying a dependency (simplified, its value somehow > >> > originates from a memory_order_consume load), or it has to > >> > conservatively assume that it does; if it does, the compiler must not > >> > turn data dependencies into control dependencies in generated code. > >> > (The data dependencies, in contrast to control dependencies, enforce > >> > memory ordering on archs such as Power and ARM; these orderings than > >> > allow for not having to use an acquire HW barrier in the generated > >> > code.) > >> > > >> > Given that such a proof will likely be hard for a compiler (dependency > >> > chains propagate through assignments to variables on the heap and stack, > >> > chains are not marked in the code, and points-to analysis can be hard), > >> > a compiler faces a trade-off between either: > >> > (1) trying to support this memory_order_consume specification and likely > >> > disallowing all transformations that change data dependencies into > >> > control dependencies, or > >> > (2) not support the proposal by simply emitting memory_order_acquire > >> > code, but get no new constraints on transformations in return (ie, what > >> > we do for memory_order_consume today). > >> > > >> > A compiler could let users make this choice, but this will be hard for > >> > users too, and the compiler would still have to pick a default. > >> > > >> > Therefore, it would be good to know how important such transformations > >> > or optimizations are in practice. If they are, it either means somewhat > >> > slower code everywhere (or at least having to change much in todays > >> > compilers), or not supporting the new memory_order_consume (at least not > >> > in the default setting of the compiler). > >> > >> IMHO all the memory order semantics are too complicated already so what's > >> the reason to complicate it even more...? > > > > The hardware memory models of archs like Power an ARM guarantee ordering > > for accesses that have a data dependency, thus allowing things like > > traversing a concurrently modified list to not have to use acquire > > memory barriers (ie, a programmer could use memory_order_consume instead > > of memory_order_acquire). The Linux kernel uses this heavily in RCU, > > for example. It doesn't matter for archs like x86 where acquire / > > release barriers are implicit. > > That's memory commit ordering on a single CPU core, right? How would it > be ever valid to change commit order when there is a data dependence? > > > Paul is looking for a solution that would make both the compiler and the > > kernel happy; being able to standardize this in C/C++ would be another > > good outcome from his perspective, I believe. For GCC, the former might > > also have benefits because we'd have clearer requirements for something > > the Linux kernel is using heavily. > > The Linux kernel guys seem to be opposed to marking the dependency > > chains a program relies on (except for the initial memory_order_consume > > load). I pointed out that this proposal would disallow the > > transformations I have asked about in this email, and Paul wanted to > > know how important they are. So I asked here ... :) > > > > If the optimizations that would get disallowed in practice are important > > for performance, my inclination would be that we might need a different > > proposal for C++ standardization. For the kernel use case only, it > > could nonetheless still be interesting for us to further investigate it; > > if the kernel doesn't need these optimizations, we'd at least have a > > better understanding of what the kernel expects and what we deliver. > > > > >