Ping^1. Thanks, bin
On Mon, Jul 10, 2017 at 9:23 AM, Bin.Cheng <amker.ch...@gmail.com> wrote: > On Tue, Jul 4, 2017 at 1:29 PM, Richard Biener > <richard.guent...@gmail.com> wrote: >> On Tue, Jul 4, 2017 at 2:06 PM, Bin.Cheng <amker.ch...@gmail.com> wrote: >>> On Tue, Jul 4, 2017 at 12:19 PM, Richard Biener >>> <richard.guent...@gmail.com> wrote: >>>> On Mon, Jul 3, 2017 at 4:17 PM, Bin.Cheng <amker.ch...@gmail.com> wrote: >>>>> On Mon, Jul 3, 2017 at 10:38 AM, Richard Biener >>>>> <richard.guent...@gmail.com> wrote: >>>>>> On Tue, Jun 27, 2017 at 12:49 PM, Bin Cheng <bin.ch...@arm.com> wrote: >>>>>>> Hi, >>>>>>> For the moment, tree-predcom.c only supports >>>>>>> invariant/load-loads/store-loads chains. >>>>>>> This patch generalizes dead store elimination (or store motion) across >>>>>>> loop iterations in >>>>>>> predictive commoning pass by supporting store-store chain. As comment >>>>>>> in the patch: >>>>>>> >>>>>>> Apart from predictive commoning on Load-Load and Store-Load chains, >>>>>>> we >>>>>>> also support Store-Store chains -- stores killed by other store can >>>>>>> be >>>>>>> eliminated. Given below example: >>>>>>> >>>>>>> for (i = 0; i < n; i++) >>>>>>> { >>>>>>> a[i] = 1; >>>>>>> a[i+2] = 2; >>>>>>> } >>>>>>> >>>>>>> It can be replaced with: >>>>>>> >>>>>>> t0 = a[0]; >>>>>>> t1 = a[1]; >>>>>>> for (i = 0; i < n; i++) >>>>>>> { >>>>>>> a[i] = 1; >>>>>>> t2 = 2; >>>>>>> t0 = t1; >>>>>>> t1 = t2; >>>>>>> } >>>>>>> a[n] = t0; >>>>>>> a[n+1] = t1; >>>>>>> >>>>>>> If the loop runs more than 1 iterations, it can be further >>>>>>> simplified into: >>>>>>> >>>>>>> for (i = 0; i < n; i++) >>>>>>> { >>>>>>> a[i] = 1; >>>>>>> } >>>>>>> a[n] = 2; >>>>>>> a[n+1] = 2; >>>>>>> >>>>>>> The interesting part is this can be viewed either as general store >>>>>>> motion >>>>>>> or general dead store elimination in either intra/inter-iterations >>>>>>> way. >>>>>>> >>>>>>> There are number of interesting facts about this enhancement: >>>>>>> a) This patch supports dead store elimination for both across-iteration >>>>>>> case and single-iteration >>>>>>> case. For the latter, it is dead store elimination. >>>>>>> b) There are advantages supporting dead store elimination in predcom, >>>>>>> for example, it has >>>>>>> complete information about memory address. On the contrary, DSE >>>>>>> pass can only handle >>>>>>> memory references with exact the same memory address expression. >>>>>>> c) It's cheap to support store-stores chain in predcom based on >>>>>>> existing code. >>>>>>> d) As commented, the enhancement can be viewed as either generalized >>>>>>> dead store elimination >>>>>>> or generalized store motion. I prefer DSE here. >>>>>>> >>>>>>> Bootstrap(O2/O3) in patch series on x86_64 and AArch64. Is it OK? >>>>>> >>>>>> Looks mostly ok. I have a few questions though. >>>>>> >>>>>> + /* Don't do store elimination if loop has multiple exit edges. */ >>>>>> + bool eliminate_store_p = single_exit (loop) != NULL; >>>>>> >>>>>> handling this would be an enhancement? IIRC LIM store-motion handles >>>>>> this >>>>>> just fine by emitting code on all exits. >>>>> It is an enhancement with a little bit more complication. We would >>>>> need to setup/record finalizer memory references for different exit >>>>> edges. I added TODO description for this (and following one). Is it >>>>> okay to pick up this in the future? >>>> >>>> Yes. >>>> >>>>>> >>>>>> @@ -1773,6 +2003,9 @@ determine_unroll_factor (vec<chain_p> chains) >>>>>> { >>>>>> if (chain->type == CT_INVARIANT) >>>>>> continue; >>>>>> + /* Don't unroll when eliminating stores. */ >>>>>> + else if (chain->type == CT_STORE_STORE) >>>>>> + return 1; >>>>>> >>>>>> this is a hard exit value so we do not handle the case where another >>>>>> chain >>>>>> in the loop would want to unroll? (enhancement?) I'd have expected to do >>>>>> the same as for CT_INVARIANT here. >>>>> I didn't check what change is needed in case of unrolling. I am not >>>>> very sure if we should prefer unroll for *load chains or prefer not >>>>> unroll for store-store chains, because unroll in general increases >>>>> loop-carried register pressure for store-store chains rather than >>>>> decreases register pressure for *load chains. >>>>> I was also thinking if it's possible to restrict unrolling somehow in >>>>> order to enable predcom at O2. BTW, this is not common, it only >>>>> happens once in spec2k6 with factor forced to 1. So okay if as it is >>>>> now? >>>> >>>> I think it is ok for now with a TODO added. Please change the comment >>>> to "we can't handle unrolling when eliminating stores" (it's not clear if >>>> we >>>> can -- did you try? maybe add a testcase that would ICE) >>>> >>>>> >>>>>> >>>>>> + tree init = ref_at_iteration (dr, (int) 0 - i, &stmts); >>>>>> + if (!chain->all_always_accessed && tree_could_trap_p (init)) >>>>>> + { >>>>>> + gimple_seq_discard (stmts); >>>>>> + return false; >>>>>> >>>>>> so this is the only place that remotely cares for not always performed >>>>>> stores. >>>>>> But as-is the patch doesn't seem to avoid speculating stores and thus >>>>>> violates the C++ memory model, aka, introduces store-data-races? The LIM >>>>>> store-motion code was fixed to avoid this by keeping track of whether a >>>>>> BB >>>>>> has executed to guard the stores done in the compensation code on the >>>>>> loop >>>>>> exit. >>>>>> >>>>>> That said, to "fix" this all && tree_could_trap_p cases would need to be >>>>>> removed >>>>>> (or similarly flag vars be introduced). Speculating loads that do not >>>>>> trap is ok >>>>>> (might only introduce false uninit use reports by tools like valgrind). >>>>> Hmm, not sure IIUC. Patch updated, is it correct (though conservative)? >>>> >>>> +static bool >>>> +prepare_initializers_chain_store_elim (struct loop *, chain_p chain) >>>> +{ >>>> ... >>>> + tree init = ref_at_iteration (dr, (int) 0 - i, &stmts); >>>> + if (!chain->all_always_accessed && tree_could_trap_p (init)) >>>> + { >>>> >>>> remove && tree_could_trap_p and hoist the other check. >>>> >>>> Same in prepare_finalizers_chain. I think you should check >>>> >>>> if (! chain->all_always_accessed >>>> && ! PARAM_VALUE (PARAM_ALLOW_STORE_DATA_RACES)) >>>> return false; >>>> >>>> at the beginning of both functions and retain >>>> >>>> if (! chain->all_always_accessed && tree_could_trap_p ()) >>>> >>>> in the loops. >>>> >>>> Ok with that change and a testcase that would exercise failure/ICE of >>>> store chains w/ unrolling. >>> Hmm, now I remember maybe these >>> all_always_accessed/trap/data_store_race checks can be simplified. In >>> function suitable_component_p, we call just_once_each_iteration_p for >>> all references. So we shouldn't not end up with >>> chain->all_always_accessed == false cases, right? why means we don't >>> really need to check at all? >> >> Not sure. We have this check in the existing code already. Please come >> up with a testcase that we might not DSE-pcom because of store speculation. > So just_once_each_iteration_p function call skips all if-condition > cases, while chain->all_always_accessed skips multi-exit cases. As > for store-elimination, given we already requires single-exit loops, we > can assume chain->all_always_accessed is true in > prepare_initializers_chain_store_elim and prepare_finalizers_chain. > To make code clearer, this updated patch explicitly checks > chain->all_always_accessed at the beginning of both functions. The > fact is, we can't eliminate conditional stores (at least with current > code) because both the condition and niters need to be compilation > time constants to propagate the value for storing. It's complicated > and may not worth the effort if they are constants. > I also added two new tests, one for conditional store elimination, the > other for unrolling/store-elimination. > Bootstrap and test on x86_64 and AArch64. Is it OK? > > Thanks, > bin > 2017-07-04 Bin Cheng <bin.ch...@arm.com> > > * tree-predcom.c: Revise general description of pass. > (enum chain_type): New enum type for store elimination. > (struct chain): New field supporting store elimination. > (dump_chain): Dump store-stores chain. > (release_chain): Release resources. > (split_data_refs_to_components): Compute and create component > contains only stores for elimination. > (get_chain_last_ref_at): New function. > (make_invariant_chain): Initialization. > (make_rooted_chain): Specify chain type in parameter. > (add_looparound_copies): Skip for store-stores chain. > (determine_roots_comp): Compute type of chain and pass it to > make_rooted_chain. > (initialize_root_vars_store_elim_2): New function. > (finalize_eliminated_stores): New function. > (remove_stmt): Handle store for elimination. > (execute_pred_commoning_chain): Execute predictive commoning on > store-store chains. > (determine_unroll_factor): Skip unroll for store-stores chain. > (prepare_initializers_chain_store_elim): New function. > (prepare_initializers_chain): Hanlde store-store chain. > (prepare_finalizers_chain, prepare_finalizers): New function. > (tree_predictive_commoning_loop): Return integer value indicating > if loop is unrolled or lcssa form is corrupted. > (tree_predictive_commoning): Rewrite for lcssa form if necessary. > > gcc/testsuite/ChangeLog > 2017-07-04 Bin Cheng <bin.ch...@arm.com> > > * gcc.dg/tree-ssa/predcom-dse-1.c: New test. > * gcc.dg/tree-ssa/predcom-dse-2.c: New test. > * gcc.dg/tree-ssa/predcom-dse-3.c: New test. > * gcc.dg/tree-ssa/predcom-dse-4.c: New test. > * gcc.dg/tree-ssa/predcom-dse-5.c: New test. > * gcc.dg/tree-ssa/predcom-dse-6.c: New test. > * gcc.dg/tree-ssa/predcom-dse-7.c: New test. > * gcc.dg/tree-ssa/predcom-dse-8.c: New test. > * gcc.dg/tree-ssa/predcom-dse-9.c: New test. > * gcc.dg/tree-ssa/predcom-dse-10.c: New test. > * gcc.dg/tree-ssa/predcom-dse-11.c: New test.