Re: fix PR46029: reimplement if conversion of loads and stores
On Tue, Jul 7, 2015 at 11:23 PM, Abe abe_skol...@yahoo.com wrote:

>> (if-conversion could directly generate masked load/stores of course and not use a scratch-pad at all in that case).
>
> IMO that's a great idea, but I don't know how to do it. Hints would be welcome. In particular, how does one generate masked load/stores at the GIMPLE level?

It already does that, see predicate_mem_writes. You should definitely preserve that path (I think it does that only when versioning the loop for if-conversion; non-vectorized loops will then not be if-converted).

>> But are we correctly handling these cases in the current if conversion code?
>
> I'm uncertain to what that is intended to refer, but I believe Sebastian would agree that the new if converter is safer than the old one in terms of correctness at the time of running the code being compiled.
>
>> Abe's changes would seem like a step forward from a correctness standpoint
>
> Not to argue, but as a point of humility: Sebastian did by far the most work on this patch. I just modernized it and helped move it along. Even _that_ was done with Sebastian's help.
>
>> even if they take us a step backwards from a performance standpoint.
>
> For now, we have a few performance regressions, and so far we have found that it's non-trivial to remove all of those regressions.

On what hardware did you test?

> We may be better off pushing the current patch to trunk and having the performance regressions currently introduced by the new if converter be fixed by later patches. Pushing to trunk gives us excellent visibility amongst GCC hackers, so the code will get more eyeballs than if it lingers in an uncommitted patch or in a branch. I, for one, would love some help in fixing these performance regressions. ;-) If fixing the performance regressions winds up taking too long, perhaps the current imperfect patch could be undone on trunk just before a release is tagged, and then we'll push it in again when trunk goes back to being allowed to be unstable?
>
> According to my analysis of the data near the end of the page at https://gcc.gnu.org/develop.html, we have until roughly April of 2016 to work on not-yet-perfect patches in trunk.
>
>> So the question is whether we get more non-vectorized if-converted code out of this (and thus whether we want to use --param allow-store-data-races to get the old code back which is nicer to less capable CPUs and probably faster than using scatter/gather or masked loads/stores).
>> I do think conditionalizing some of this on the allow-store-data-races makes sense.
>
> I think having both the old if-converter and the new one live on in GCC is nontrivial, but not impossible. I also don't think it's the best long-term goal, but only a short-term workaround. In the long run, IMO there should be only one if converter, albeit perhaps with tweaking flags [e.g. -fallow-unsafe-if-conversion].
>
>> I also wonder if we should really care about load data races (not sure your patch does).
>
> According to a recent long discussion I had with Sebastian, our current patch does not have the flaw I was concerned it might have in terms of loads, because:
>
> [1] the scratchpad is only being used to if-convert assignments to thread-local scalars, never to globals/statics, and because...
>
> [2] the gimplifier is supposed to detect that the address of this scalar has been taken; when such is detected in the code being compiled, it causes the scalar to no longer look like a scalar in GIMPLE, so that we are also safe from stale-data problems that could come from corner-case code that takes the address of a thread-local variable and gives that address to another thread [which then proceeds to overwrite the value in the supposedly-thread-local scalar that belongs to a different thread from the one doing the writing].
>
> Regards, Abe
Re: fix PR46029: reimplement if conversion of loads and stores
Abe wrote:
> I'm uncertain to what that is intended to refer, but I believe Sebastian would agree that the new if converter is safer than the old one in terms of correctness at the time of running the code being compiled.
> even if they take us a step backwards from a performance standpoint. For now, we have a few performance regressions, and so far we have found that it's non-trivial to remove all of those regressions. We may be better off pushing the current patch to trunk and having the performance regressions currently introduced by the new if converter be fixed by later patches. Pushing to trunk gives us excellent visibility amongst GCC hackers, so the code will get more eyeballs than if it lingers in an uncommitted patch or in a branch. I, for one, would love some help in fixing these performance regressions. ;-) If fixing the performance regressions winds up taking too long, perhaps the current imperfect patch could be undone on trunk just before a release is tagged, and then we'll push it in again when trunk goes back to being allowed to be unstable?

TBH my two cents would be that a performance-regressed but correct compiler is far better to release than a performance-improved one that generates unsafe code (e.g. with extra faults in the straightforward single-threaded case!)...

--Alan
Re: fix PR46029: reimplement if conversion of loads and stores
[Abe wrote:]
> I believe Sebastian would agree that the new if converter is safer than the old one in terms of correctness at the time of running the code being compiled. [...] For now, we have a few performance regressions [snip]

[Alan wrote:]
> TBH my two cents would be that a performance-regressed, but correct, compiler, is far better to release, than a performance-improved one that generates unsafe code (e.g. with extra faults in the straightforward single-threaded case!)...

I strongly agree that -- by default -- correctness trumps performance. The only time it is allowable to reverse that relationship, IMO, is when the user* of the compiler has explicitly specified flags [e.g. -ffast-math, -Ofast] that tell the compiler that the user* currently cares more about performance than about {correctness-according-to-spec and/or safety in all conditions, including null pointers}. ['*': or Makefiles, build scripts, etc.]

FYI: TTBOMK, the old if converter was not unsafe with default flags or with only big knobs like -O3; I'm unsure what it did under -ffast-math and -Ofast, if anything of interest.

The main advantage of the new if converter over the old one is that the new one is safe in certain situations wherein the old one is unsafe; e.g. the old one may cause the vectorized code to segfault where the non-if-converted code would have run just fine all the way to program completion with the same inputs. This additional safety allows the new converter to be used under more conditions, which in turn allows it to be enabled by default. We intend for all the safe if-conversions to be done by default whenever the vectorizer is on. If there are any unsafe conversions left, which I'm not sure there are, then we will enable them only when the user* specifies something like -fif-conversion=allow-unsafe.

The "allows it to be enabled by default" property should help the code that GCC generates under -O3 w/o any additional flags to be faster than it currently is, for the relevant targets, *without sacrificing even _one_ _bit_ of correctness*.

Regards, Abe
Re: fix PR46029: reimplement if conversion of loads and stores
>> (if-conversion could directly generate masked load/stores of course and not use a scratch-pad at all in that case).

[Abe wrote:]
> IMO that's a great idea, but I don't know how to do it. Hints would be welcome. In particular, how does one generate masked load/stores at the GIMPLE level?

[Richard Biener wrote:]
> It already does that, see predicate_mem_writes. You should definitely preserve that path

Thanks. Yes, we have not intentionally disabled that.

> On what hardware did you test?

AMD64 arch., Intel implementation. Nothing fancy AFAIK in the flags to make it super-specific, e.g. -march=nocona or -march=native. Occasionally using AVX-2 flags as specified by some test cases.

Regards, Abe
RE: fix PR46029: reimplement if conversion of loads and stores
> (if-conversion could directly generate masked load/stores of course and not use a scratch-pad at all in that case).

IMO that's a great idea, but I don't know how to do it. Hints would be welcome. In particular, how does one generate masked load/stores at the GIMPLE level?

> But are we correctly handling these cases in the current if conversion code?

I'm uncertain to what that is intended to refer, but I believe Sebastian would agree that the new if converter is safer than the old one in terms of correctness at the time of running the code being compiled.

> Abe's changes would seem like a step forward from a correctness standpoint

Not to argue, but as a point of humility: Sebastian did by far the most work on this patch. I just modernized it and helped move it along. Even _that_ was done with Sebastian's help.

> even if they take us a step backwards from a performance standpoint.

For now, we have a few performance regressions, and so far we have found that it's non-trivial to remove all of those regressions. We may be better off pushing the current patch to trunk and having the performance regressions currently introduced by the new if converter be fixed by later patches. Pushing to trunk gives us excellent visibility amongst GCC hackers, so the code will get more eyeballs than if it lingers in an uncommitted patch or in a branch. I, for one, would love some help in fixing these performance regressions. ;-) If fixing the performance regressions winds up taking too long, perhaps the current imperfect patch could be undone on trunk just before a release is tagged, and then we'll push it in again when trunk goes back to being allowed to be unstable?

According to my analysis of the data near the end of the page at https://gcc.gnu.org/develop.html, we have until roughly April of 2016 to work on not-yet-perfect patches in trunk.

> So the question is whether we get more non-vectorized if-converted code out of this (and thus whether we want to use --param allow-store-data-races to get the old code back which is nicer to less capable CPUs and probably faster than using scatter/gather or masked loads/stores).
> I do think conditionalizing some of this on the allow-store-data-races makes sense.

I think having both the old if-converter and the new one live on in GCC is nontrivial, but not impossible. I also don't think it's the best long-term goal, but only a short-term workaround. In the long run, IMO there should be only one if converter, albeit perhaps with tweaking flags [e.g. -fallow-unsafe-if-conversion].

> I also wonder if we should really care about load data races (not sure your patch does).

According to a recent long discussion I had with Sebastian, our current patch does not have the flaw I was concerned it might have in terms of loads, because:

[1] the scratchpad is only being used to if-convert assignments to thread-local scalars, never to globals/statics, and because...

[2] the gimplifier is supposed to detect that the address of this scalar has been taken; when such is detected in the code being compiled, it causes the scalar to no longer look like a scalar in GIMPLE, so that we are also safe from stale-data problems that could come from corner-case code that takes the address of a thread-local variable and gives that address to another thread [which then proceeds to overwrite the value in the supposedly-thread-local scalar that belongs to a different thread from the one doing the writing].

Regards, Abe
Re: fix PR46029: reimplement if conversion of loads and stores
On 06/25/2015 03:43 AM, Richard Biener wrote:
> Yes, and if you just look at scalar code then with the rewrite we can enable store if-conversion unconditionally. _But_ when the new scheme triggers, vectorization cannot succeed on the result, as we get
>
>   if (cond)
>     *p = val;
>
> if-converted to
>
>   tem = cond ? p : scratch;
>   *tem = val;
>
> and
>
>   if (cond)
>     val = *p;
>
> if-converted to
>
>   scratch = val;
>   tem = cond ? p : scratch;
>   val = *tem;
>
> and thus the stores and loads appear as scatter/gather ones to the vectorizer (if-conversion could directly generate masked load/stores of course and not use a scratch-pad at all in that case).

Right. But are we correctly handling these cases in the current if conversion code? If we're currently mis-handling them, then Abe's changes would seem like a step forward from a correctness standpoint, even if they take us a step backwards from a performance standpoint.

> So the question is whether we get more non-vectorized if-converted code out of this (and thus whether we want to use --param allow-store-data-races to get the old code back which is nicer to less capable CPUs and probably faster than using scatter/gather or masked loads/stores).

I do think conditionalizing some of this on the allow-store-data-races makes sense.

> Note that I think scatter support is still not implemented in the vectorizer. I also wonder if we should really care about load data races (not sure your patch does).

I thought the consensus was that load data races weren't important from a correctness standpoint. Torvald would be the man to know the answer on that one (trie...@redhat.com).

> In general I like the idea.

Similarly. Just trying to find the time to actually look at it is proving hard :(

jeff
Re: fix PR46029: reimplement if conversion of loads and stores
Abe Skolnik wrote:
>> In tree-if-conv.c:[…] if it doesn't trap, but has_non_addressable_refs, can't we use ifcvt_can_use_mask_load_store there too? if an access could trap, but is addressable, can't we use the scratchpad technique to get round the trapping problem?
>
> That's how we deal with loads. We use the scratchpad in case the condition is false. That way it doesn't matter if the original code was hypothetically dereferencing a null pointer in the condition-false section, for example: we aren't unconditionally executing both sides exactly as written. On the false side, we load from the scratchpad and then ignore the result of that load.

Ok, so I'm looking at tree-if-conv.c:

@@ -876,32 +726,40 @@ if_convertible_gimple_assign_stmt_p (gimple stmt,
       return false;
     }

   /* tree-into-ssa.c uses GF_PLF_1, so avoid it, because
      in between if_convertible_loop_p and combine_blocks
      we can perform loop versioning.  */
   gimple_set_plf (stmt, GF_PLF_2, false);

   if (flag_tree_loop_if_convert_stores)
     {
-      if (ifcvt_could_trap_p (stmt, refs))
+      if (ifcvt_could_trap_p (stmt))
	{
	  if (ifcvt_can_use_mask_load_store (stmt))
	    {
	      gimple_set_plf (stmt, GF_PLF_2, true);
	      *any_mask_load_store = true;
	      return true;
	    }
	  if (dump_file && (dump_flags & TDF_DETAILS))
-	    fprintf (dump_file, "tree could trap...\n");
+	    fprintf (dump_file, "tree could trap\n");
	  return false;
	}
+
+      if (has_non_addressable_refs (stmt))
+	{
+	  if (dump_file && (dump_flags & TDF_DETAILS))
+	    fprintf (dump_file, "has non-addressable memory references\n");
+	  return false;
+	}
+
       return true;
     }

I feel I'm being stupid here, but...if ifcvt_could_trap_p && !ifcvt_can_use_mask_load_store, we return false here. Specifically, we return false if ifcvt_could_trap_p && !ifcvt_can_use_mask_load_store && !has_non_addressable_refs. Shouldn't we return true in that case, in order to use the scratchpad technique rather than bail out?

Thanks, Alan
Re: fix PR46029: reimplement if conversion of loads and stores
On June 26, 2015 2:21:22 PM GMT+02:00, Alan Lawrence alan.lawre...@arm.com wrote:
> Sebastian Pop wrote:
>> On Thu, Jun 25, 2015 at 4:43 AM, Richard Biener richard.guent...@gmail.com wrote:
>>> when the new scheme triggers vectorization cannot succeed on the result as we get if (cond) *p = val; if-converted to tem = cond ? p : scratch; *tem = val;
>>
>> That's correct.
>>
>>> and if (cond) val = *p; if-converted to scratch = val; tem = cond ? p : scratch; val = *tem;
>>
>> The patch does this slightly differently: tem = cond ? p : scratch; val = cond ? *tem : val; I think I like your version better as it has only one cond_expr.
>
> Another slight concern...by reusing scratchpads, are we at risk of creating lots of false dependencies here, that restrict instruction scheduling / other opts later?

Another possibility is to not share them and make sure to insert clobbers for them when they become dead so later stack slot sharing can merge them again.

Richard.

>>> [...] and thus the stores and loads appear as scatter/gather ones to the vectorizer (if-conversion could directly generate masked load/stores of course and not use a scratch-pad at all in that case).
>>
>> Thank you Richard for much better expressing what I was thinking here too :)
>>
>> Abe also suggested to continue optimizing the other way in cases where we know to write or load from the same location on all branches: if (c) A[i] = ... else A[i] = ...
>
> The store here really should be commoned, yes. (Related but different?: BZ 56625.)
>
> I might add, I find this code much easier to follow than the old (removed) code about data references, memrefs_read_unconditionally, and write_memrefs_written_at_least_once...
>
> Thanks, Alan
Re: fix PR46029: reimplement if conversion of loads and stores
On 06/26/2015 08:57 AM, Richard Biener wrote:
> Another possibility is to not share them and make sure to insert clobbers for them when they become dead so later stack slot sharing can merge them again.

We've certainly had cases in the past where re-using a stack slot of an object that has gone out of scope for some other object in a different scope with a different type was not seen as conflicting by the aliasing machinery, resulting in incorrect code motions.

Creating distinct scratchpads and letting the existing machinery share them (which has been fixed to deal with the issues noted above, I believe) would make more sense than trying to reimplement the sharing correctness stuff within if-conversion.

jeff
Re: fix PR46029: reimplement if conversion of loads and stores
Sebastian Pop wrote:
> On Thu, Jun 25, 2015 at 4:43 AM, Richard Biener richard.guent...@gmail.com wrote:
>> when the new scheme triggers vectorization cannot succeed on the result as we get if (cond) *p = val; if-converted to tem = cond ? p : scratch; *tem = val;
>
> That's correct.
>
>> and if (cond) val = *p; if-converted to scratch = val; tem = cond ? p : scratch; val = *tem;
>
> The patch does this slightly differently: tem = cond ? p : scratch; val = cond ? *tem : val; I think I like your version better as it has only one cond_expr.

Another slight concern...by reusing scratchpads, are we at risk of creating lots of false dependencies here, that restrict instruction scheduling / other opts later?

>> [...] and thus the stores and loads appear as scatter/gather ones to the vectorizer (if-conversion could directly generate masked load/stores of course and not use a scratch-pad at all in that case).
>
> Thank you Richard for much better expressing what I was thinking here too :)
>
> Abe also suggested to continue optimizing the other way in cases where we know to write or load from the same location on all branches: if (c) A[i] = ... else A[i] = ...

The store here really should be commoned, yes. (Related but different?: BZ 56625.)

I might add, I find this code much easier to follow than the old (removed) code about data references, memrefs_read_unconditionally, and write_memrefs_written_at_least_once...

Thanks, Alan
Re: fix PR46029: reimplement if conversion of loads and stores
On Thu, Jun 25, 2015 at 4:43 AM, Richard Biener richard.guent...@gmail.com wrote:
> when the new scheme triggers vectorization cannot succeed on the result as we get if (cond) *p = val; if-converted to tem = cond ? p : scratch; *tem = val;

That's correct.

> and if (cond) val = *p; if-converted to scratch = val; tem = cond ? p : scratch; val = *tem;

The patch does this slightly differently: tem = cond ? p : scratch; val = cond ? *tem : val; I think I like your version better as it has only one cond_expr.

> and thus the stores and loads appear as scatter/gather ones to the vectorizer (if-conversion could directly generate masked load/stores of course and not use a scratch-pad at all in that case). So the question is whether we get more non-vectorized if-converted code out of this

This is the case.

> (and thus whether we want to use --param allow-store-data-races to get the old code back which is nicer to less capable CPUs and probably faster than using scatter/gather or masked loads/stores).

A flag to allow load and store data-races is an interesting suggestion. Abe also suggested to continue optimizing the other way in cases where we know to write or load from the same location on all branches: if (c) A[i] = ... else A[i] = ...

> scatter support is still not implemented in the vectorizer.

Correct.

> I also wonder if we should really care about load data races (not sure your patch does).

The patch does both loads and stores.

> I didn't look at the patch in detail yet - please address Alan's comments and re-post an updated patch. In general I like the idea.

Thanks for your review,
Sebastian

> Thanks, Richard.

[Alan wrote:]
> Re. creation of scratchpads: (1) Should the '64' byte size be the result of scanning the function, for the largest data size to which we store? (ideally, conditionally store!)

I suspect most functions have conditional stores, but far fewer have conditional stores that we'd like to if-convert. ISTM that if we can lazily allocate the scratchpad, that'd be best. If this were an RTL pass, then I'd say query the backend for the widest mode store insn and use that to size the scratchpad. We may have something similar we can do in gimple without resorting to querying insn backend capabilities. Perhaps walking the in-scope addressable variables or somesuch.

> (2) Allocating only once per function: if we had one scratchpad per loop, it could/would live inside the test of gimple_build_call_internal (IFN_LOOP_VECTORIZED, Otherwise, if we if-convert one or more loops in the function, but then fail to vectorize them, we'll leave the scratchpad around for later phases to clean up. Is that OK?

If the scratchpad is local to a function, then I'd expect we'd clean it up just like any other unused local. Once it's a global, then all bets are off.

Anyway, I probably should just look at the patch before making more comments.

jeff
Re: fix PR46029: reimplement if conversion of loads and stores
On Wed, Jun 24, 2015 at 7:10 PM, Jeff Law l...@redhat.com wrote:
> On 06/22/2015 10:27 AM, Alan Lawrence wrote:
>> My main thought concerns the direction we are travelling here. A major reason why we do if-conversion is to enable vectorization. Is this targeted at gathering/scattering loads? Following vectorization, different elements of the vector being loaded/stored may have to go to/from the scratchpad or to/from main memory. Or, are we aiming at the case where the predicate or address are invariant? That seems unlikely - loop unswitching would be better for the predicate; loading from an address, we'd just peel and hoist; storing, this'd result in the address holding the last value written, at exit from the loop, a curious idiom. Where the predicate/address is invariant across the vector? (!) Or, are we aiming at non-vectorized code?
>
> I think we're aiming at correctness issues, particularly WRT not allowing the optimizers to introduce new data races for C11/C++11.

Yes, and if you just look at scalar code then with the rewrite we can enable store if-conversion unconditionally. _But_ when the new scheme triggers, vectorization cannot succeed on the result, as we get

  if (cond)
    *p = val;

if-converted to

  tem = cond ? p : scratch;
  *tem = val;

and

  if (cond)
    val = *p;

if-converted to

  scratch = val;
  tem = cond ? p : scratch;
  val = *tem;

and thus the stores and loads appear as scatter/gather ones to the vectorizer (if-conversion could directly generate masked load/stores of course and not use a scratch-pad at all in that case).

So the question is whether we get more non-vectorized if-converted code out of this (and thus whether we want to use --param allow-store-data-races to get the old code back which is nicer to less capable CPUs and probably faster than using scatter/gather or masked loads/stores). Note that I think scatter support is still not implemented in the vectorizer. I also wonder if we should really care about load data races (not sure your patch does).

I didn't look at the patch in detail yet - please address Alan's comments and re-post an updated patch. In general I like the idea.

Thanks,
Richard.

[Alan wrote:]
> Re. creation of scratchpads: (1) Should the '64' byte size be the result of scanning the function, for the largest data size to which we store? (ideally, conditionally store!)

I suspect most functions have conditional stores, but far fewer have conditional stores that we'd like to if-convert. ISTM that if we can lazily allocate the scratchpad, that'd be best. If this were an RTL pass, then I'd say query the backend for the widest mode store insn and use that to size the scratchpad. We may have something similar we can do in gimple without resorting to querying insn backend capabilities. Perhaps walking the in-scope addressable variables or somesuch.

> (2) Allocating only once per function: if we had one scratchpad per loop, it could/would live inside the test of gimple_build_call_internal (IFN_LOOP_VECTORIZED, Otherwise, if we if-convert one or more loops in the function, but then fail to vectorize them, we'll leave the scratchpad around for later phases to clean up. Is that OK?

If the scratchpad is local to a function, then I'd expect we'd clean it up just like any other unused local. Once it's a global, then all bets are off.

Anyway, I probably should just look at the patch before making more comments.

jeff
Re: fix PR46029: reimplement if conversion of loads and stores
On 12/06/15 21:50, Abe Skolnik wrote:
> Hi everybody!
>
> In the current implementation of if conversion, loads and stores are if-converted in a thread-unsafe way:
>
> * loads were always executed, even when they should have not been. Some source code could be rendered invalid due to null pointers that were OK in the original program because they were never dereferenced.
>
> * writes were if-converted via load/maybe-modify/store, which renders some code multithreading-unsafe.
>
> This patch reimplements if-conversion of loads and stores in a safe way using a scratchpad allocated by the compiler on the stack:
>
> * loads are done through an indirection, reading either the correct data from the correct source [if the condition is true] or reading from the scratchpad and later ignoring this read result [if the condition is false].
>
> * writes are also done through an indirection, writing either to the correct destination [if the condition is true] or to the scratchpad [if the condition is false].
>
> Vectorization of if-cvt-stores-vect-ifcvt-18.c is disabled because the old if-conversion resulted in unsafe code that could fail under multithreading even though the as-written code _was_ thread-safe.
>
> Passed regression testing and bootstrap on amd64-linux. Is this OK to commit to trunk?

I can't approve or reject, but this certainly looks like an improvement compared to where we are, as we get rid of the data races.

The only gotcha I can think of with this approach is that it introduces false dependencies that would cause unnecessary write-after-write hazards with the writes to the scratchpad when you unroll the loop - but that's not necessarily worse than where we are today.

Some fun stats from a previous Friday afternoon poke at this, without doing any benchmarking as such: in a bootstrap with BOOT_CFLAGS=-O2 -ftree-loop-if-convert-stores and one without it, I see about 12.20% more csel's on an AArch64 bootstrap (goes from 7898 to 8862) vs plain old -O2.
And I did see the one case in libquantum get sorted with this, though the performance results were funny, let's say (+5% in one case, -1.5% on another core); I haven't analysed it deeply yet, but it does look interesting.

regards
Ramana

> Regards, Abe
>
> 2015-06-12  Sebastian Pop  s@samsung.com
>             Abe Skolnik  a.skol...@samsung.com
>
> 	PR tree-optimization/46029
> 	* tree-data-ref.c (struct data_ref_loc_d): Moved...
> 	(get_references_in_stmt): Exported.
> 	* tree-data-ref.h (struct data_ref_loc_d): ... here.
> 	(get_references_in_stmt): Declared.
> 	* doc/invoke.texi (-ftree-loop-if-convert-stores): Update
> 	description.
> 	* tree-if-conv.c (struct ifc_dr): Removed.
> 	(IFC_DR): Removed.
> 	(DR_WRITTEN_AT_LEAST_ONCE): Removed.
> 	(DR_RW_UNCONDITIONALLY): Removed.
> 	(memrefs_read_or_written_unconditionally): Removed.
> 	(write_memrefs_written_at_least_once): Removed.
> 	(ifcvt_could_trap_p): Does not take refs parameter anymore.
> 	(ifcvt_memrefs_wont_trap): Removed.
> 	(has_non_addressable_refs): New.
> 	(if_convertible_gimple_assign_stmt_p): Call
> 	has_non_addressable_refs.  Removed use of refs.
> 	(if_convertible_stmt_p): Removed use of refs.
> 	(if_convertible_gimple_assign_stmt_p): Same.
> 	(if_convertible_loop_p_1): Removed use of refs.  Remove
> 	initialization of dr->aux, DR_WRITTEN_AT_LEAST_ONCE, and
> 	DR_RW_UNCONDITIONALLY.
> 	(insert_address_of): New.
> 	(create_scratchpad): New.
> 	(create_indirect_cond_expr): New.
> 	(predicate_mem_writes): Call create_indirect_cond_expr.  Take an
> 	extra parameter for scratch_pad.
> 	(combine_blocks): Same.
> 	(tree_if_conversion): Same.
>
> testsuite/
> 	* g++.dg/tree-ssa/ifc-pr46029.C: New.
> 	* gcc.dg/tree-ssa/ifc-5.c: Make it exactly like the FFmpeg kernel.
> 	* gcc.dg/tree-ssa/ifc-8.c: New.
> 	* gcc.dg/tree-ssa/ifc-9.c: New.
> 	* gcc.dg/tree-ssa/ifc-10.c: New.
> 	* gcc.dg/tree-ssa/ifc-11.c: New.
> 	* gcc.dg/tree-ssa/ifc-12.c: New.
> 	* gcc.dg/vect/if-cvt-stores-vect-ifcvt-18.c: Disabled.
> 	* gcc.dg/vect/if-cvt-stores-vect-ifcvt-19.c: New.
---
 gcc/ChangeLog                                 | 28 ++
 gcc/doc/invoke.texi                           | 18 +-
 gcc/testsuite/g++.dg/tree-ssa/ifc-pr46029.C   | 76 ++++
 gcc/testsuite/gcc.dg/tree-ssa/ifc-10.c        | 17 +
 gcc/testsuite/gcc.dg/tree-ssa/ifc-11.c        | 16 +
 gcc/testsuite/gcc.dg/tree-ssa/ifc-12.c        | 13 +
 gcc/testsuite/gcc.dg/tree-ssa/ifc-5.c         | 19 +-
 gcc/testsuite/gcc.dg/tree-ssa/ifc-8.c         | 29 ++
 gcc/testsuite/gcc.dg/tree-ssa/ifc-9.c         | 17 +
 .../gcc.dg/vect/if-cvt-stores-vect-ifcvt-18.c | 10 +-
 .../gcc.dg/vect/if-cvt-stores-vect-ifcvt-19.c | 46 +++
 gcc/tree-data-ref.c
Re: fix PR46029: reimplement if conversion of loads and stores
On 06/22/2015 10:27 AM, Alan Lawrence wrote:
> My main thought concerns the direction we are travelling here. A major reason why we do if-conversion is to enable vectorization. Is this targeted at gathering/scattering loads? Following vectorization, different elements of the vector being loaded/stored may have to go to/from the scratchpad or to/from main memory. Or, are we aiming at the case where the predicate or address are invariant? That seems unlikely - loop unswitching would be better for the predicate; loading from an address, we'd just peel and hoist; storing, this'd result in the address holding the last value written, at exit from the loop, a curious idiom. Where the predicate/address is invariant across the vector? (!) Or, are we aiming at non-vectorized code?

I think we're aiming at correctness issues, particularly WRT not allowing the optimizers to introduce new data races for C11/C++11.

> Re. creation of scratchpads: (1) Should the '64' byte size be the result of scanning the function, for the largest data size to which we store? (ideally, conditionally store!)

I suspect most functions have conditional stores, but far fewer have conditional stores that we'd like to if-convert. ISTM that if we can lazily allocate the scratchpad, that'd be best. If this were an RTL pass, then I'd say query the backend for the widest mode store insn and use that to size the scratchpad. We may have something similar we can do in gimple without resorting to querying insn backend capabilities. Perhaps walking the in-scope addressable variables or somesuch.

> (2) Allocating only once per function: if we had one scratchpad per loop, it could/would live inside the test of gimple_build_call_internal (IFN_LOOP_VECTORIZED, Otherwise, if we if-convert one or more loops in the function, but then fail to vectorize them, we'll leave the scratchpad around for later phases to clean up. Is that OK?

If the scratchpad is local to a function, then I'd expect we'd clean it up just like any other unused local. Once it's a global, then all bets are off.

Anyway, I probably should just look at the patch before making more comments.

jeff
Re: fix PR46029: reimplement if conversion of loads and stores
On 06/24/2015 10:50 AM, Ramana Radhakrishnan wrote:
> On 12/06/15 21:50, Abe Skolnik wrote:
>> Hi everybody!
>>
>> In the current implementation of if conversion, loads and stores are if-converted in a thread-unsafe way:
>>
>> * loads were always executed, even when they should not have been. Some source code could be rendered invalid due to null pointers that were OK in the original program because they were never dereferenced.
>>
>> * writes were if-converted via load/maybe-modify/store, which renders some code multithreading-unsafe.
>>
>> This patch reimplements if-conversion of loads and stores in a safe way using a scratchpad allocated by the compiler on the stack:
>>
>> * loads are done through an indirection, reading either the correct data from the correct source [if the condition is true] or reading from the scratchpad and later ignoring this read result [if the condition is false].
>>
>> * writes are also done through an indirection, writing either to the correct destination [if the condition is true] or to the scratchpad [if the condition is false].
>>
>> Vectorization of if-cvt-stores-vect-ifcvt-18.c is disabled because the old if-conversion resulted in unsafe code that could fail under multithreading even though the as-written code _was_ thread-safe.
>>
>> Passed regression testing and bootstrap on amd64-linux. Is this OK to commit to trunk?
>
> I can't approve or reject, but this certainly looks like an improvement compared to where we are, as we get rid of the data races.

Right. I was going to assume the primary purpose is to address correctness issues, not to increase the amount of if-conversion for optimization purposes. I have a couple of high-level concerns around the scratchpad usage (aliasing, write-write hazards), but until I dig into the patch I don't know if they're real issues or not.

Jeff
Re: fix PR46029: reimplement if conversion of loads and stores
Abe Skolnik wrote:
> Hi everybody!
>
> In the current implementation of if conversion, loads and stores are if-converted in a thread-unsafe way:
>
> * loads were always executed, even when they should not have been. Some source code could be rendered invalid due to null pointers that were OK in the original program because they were never dereferenced.
>
> * writes were if-converted via load/maybe-modify/store, which renders some code multithreading-unsafe.
>
> This patch reimplements if-conversion of loads and stores in a safe way using a scratchpad allocated by the compiler on the stack:
>
> * loads are done through an indirection, reading either the correct data from the correct source [if the condition is true] or reading from the scratchpad and later ignoring this read result [if the condition is false].
>
> * writes are also done through an indirection, writing either to the correct destination [if the condition is true] or to the scratchpad [if the condition is false].
>
> Vectorization of if-cvt-stores-vect-ifcvt-18.c is disabled because the old if-conversion resulted in unsafe code that could fail under multithreading even though the as-written code _was_ thread-safe.
>
> Passed regression testing and bootstrap on amd64-linux. Is this OK to commit to trunk?
>
> Regards, Abe

Thanks for getting back to this! My main thought concerns the direction we are travelling here. A major reason why we do if-conversion is to enable vectorization. Is this targeted at gathering/scattering loads? Following vectorization, different elements of the vector being loaded/stored may have to go to/from the scratchpad or to/from main memory. Or, are we aiming at the case where the predicate or address is invariant? That seems unlikely: loop unswitching would be better for the predicate; for a load, we'd just peel and hoist; for a store, this'd result in the address holding the last value written at exit from the loop, a curious idiom. Where the predicate/address is invariant across the vector? (!)
Or, are we aiming at non-vectorized code?

Beyond that question... Does the description for -ftree-loop-if-convert-stores in doc/invoke.texi describe what the flag now does? (It doesn't mention loads; the code doesn't look like we use scratchpads at all without -ftree-loop-if-convert-stores, or am I missing something?)

In tree-if-conv.c:

@@ -883,7 +733,7 @@ if_convertible_gimple_assign_stmt_p (gimple stmt,
   if (flag_tree_loop_if_convert_stores)
     {
-      if (ifcvt_could_trap_p (stmt, refs))
+      if (ifcvt_could_trap_p (stmt))
	{
	  if (ifcvt_can_use_mask_load_store (stmt))
	    {

and

+
+  if (has_non_addressable_refs (stmt))
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "has non-addressable memory references\n");
+      return false;
+    }
+

If it doesn't trap but has_non_addressable_refs, can't we use ifcvt_can_use_mask_load_store there too? And/or (I think I may be misunderstanding here), if an access could trap but is addressable, can't we use the scratchpad technique to get round the trapping problem? (Look at it another way: this patch makes strictly more things return true from ifcvt_could_trap_p, which always exits immediately from if_convertible_gimple_assign_stmt_p...?)

Re. creation of scratchpads:

(1) Should the '64' byte size be the result of scanning the function for the largest data size to which we store? (Ideally, conditionally store!)

(2) Allocating only once per function: if we had one scratchpad per loop, it could/would live inside the test of gimple_build_call_internal (IFN_LOOP_VECTORIZED, Otherwise, if we if-convert one or more loops in the function, but then fail to vectorize them, we'll leave the scratchpad around for later phases to clean up. Is that OK?

Also some style nits:

@@ -1342,7 +1190,7 @@ if_convertible_loop_p_1 (struct loop *loop,
       /* Check the if-convertibility of statements in predicated BBs.
	 */
      if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
	for (itr = gsi_start_bb (bb); !gsi_end_p (itr); gsi_next (&itr))
-	  if (!if_convertible_stmt_p (gsi_stmt (itr), *refs,
+	  if (!if_convertible_stmt_p (gsi_stmt (itr),
				      any_mask_load_store))
	    return false;
    }

Bet that fits on one line now.

+/* Returns a memory reference to the pointer defined by the
+   conditional expression: pointer = cond ? A[i] : scratch_pad;
+   and inserts this code at GSI.  */
+
+static tree
+create_indirect_cond_expr (tree ai, tree cond, tree *scratch_pad,
+			   gimple_stmt_iterator *gsi, bool swap)

In the comment, should A[i] just be AI, as I see nothing in create_indirect_cond_expr that requires ai to be an array dereference?

@@ -2063,12 +1998,14 @@ mask_exists (int size, vec<int> vec)

   |   end_bb_1
   |
   | bb_2
+  |   cond =