On Thu, 2013-12-05 at 12:02 +0000, Yufeng Zhang wrote:
> On 12/04/13 13:08, Bill Schmidt wrote:
> > On Wed, 2013-12-04 at 11:26 +0100, Richard Biener wrote:
> >> On Tue, Dec 3, 2013 at 11:04 PM, Bill Schmidt
> >> <wschm...@linux.vnet.ibm.com> wrote:
> >>> On Tue, 2013-12-03 at 21:35 +0100, Richard Biener wrote:
> >>>> Yufeng Zhang <yufeng.zh...@arm.com> wrote:
> >>>>> On 12/03/13 14:20, Richard Biener wrote:
> >>>>>> On Tue, Dec 3, 2013 at 1:50 PM, Yufeng Zhang <yufeng.zh...@arm.com> wrote:
> >>>>>>> On 12/03/13 06:48, Jeff Law wrote:
> >>>>>>>> On 12/02/13 08:47, Yufeng Zhang wrote:
> >>>>>>>>> Ping~
> >>>>>>>>>
> >>>>>>>>> http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03360.html
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> Yufeng
> >>>>>>>>>
> >>>>>>>>> On 11/26/13 15:02, Yufeng Zhang wrote:
> >>>>>>>>>> On 11/26/13 12:45, Richard Biener wrote:
> >>>>>>>>>>> On Thu, Nov 14, 2013 at 12:25 AM, Yufeng Zhang <yufeng.zh...@arm.com> wrote:
> >>>>>>>>>>>> On 11/13/13 20:54, Bill Schmidt wrote:
> >>>>>>>>>>>>> The second version of your original patch is ok with me with
> >>>>>>>>>>>>> the following changes.  Sorry for the little side adventure
> >>>>>>>>>>>>> into the next-interp logic; in the end that's going to hurt
> >>>>>>>>>>>>> more than it helps in this case.  Thanks for having a look
> >>>>>>>>>>>>> at it, anyway.  Thanks also for cleaning up this version to
> >>>>>>>>>>>>> be less intrusive to common interfaces; I appreciate it.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks a lot for the review.  I've attached an updated patch
> >>>>>>>>>>>> with the suggested changes incorporated.
> >>>>>>>>>>>>
> >>>>>>>>>>>> For the next-interp adventure, I was quite happy to do the
> >>>>>>>>>>>> experiment; it's a good chance to gain insight into the pass.
> >>>>>>>>>>>> Many thanks for your prompt replies and patience in guiding
> >>>>>>>>>>>> me!
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Everything else looks OK to me.  Please ask Richard for
> >>>>>>>>>>>>> final approval, as I'm not a maintainer.
> >>>>>>>>
> >>>>>>>> First a note, I need to check on voting for Bill as the slsr
> >>>>>>>> maintainer from the steering committee.  Voting was in progress
> >>>>>>>> just before the close of stage1 development so I haven't tallied
> >>>>>>>> the results :-)
> >>>>>>>
> >>>>>>> Looking forward to some good news! :)
> >>>>>>>
> >>>>>>>>>> Yes, you are right that non-trivial 'base' trees are rarely
> >>>>>>>>>> shared.  The cache is introduced mainly because
> >>>>>>>>>> get_alternative_base () may be called twice on the same 'base'
> >>>>>>>>>> tree: once in find_basis_for_candidate () for the look-up and
> >>>>>>>>>> again in alloc_cand_and_find_basis () for
> >>>>>>>>>> record_potential_basis ().  I'm happy to leave out the cache
> >>>>>>>>>> if you think the benefit is trivial.
> >>>>>>>>
> >>>>>>>> Without some sense of how expensive the lookups are vs. how
> >>>>>>>> often the cache hits, it's awfully hard to know if the cache is
> >>>>>>>> worth it.
> >>>>>>>>
> >>>>>>>> I'd say take it out unless you have some sense it's really
> >>>>>>>> saving time.  It's a pretty minor implementation detail either
> >>>>>>>> way.
> >>>>>>>
> >>>>>>> I think the affine tree routines are generally expensive; it is
> >>>>>>> worth having a cache to avoid calling them too many times.  I ran
> >>>>>>> the slsr-*.c tests under gcc.dg/tree-ssa/ and found that the
> >>>>>>> cache hit rates range from 55.6% to 90%, with 73.5% as the
> >>>>>>> average.  The samples may not represent the real-world scenario
> >>>>>>> well, but they do show that the 'base' tree can be shared to some
> >>>>>>> extent.
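[Editorial note: the double-lookup pattern described above (a look-up in find_basis_for_candidate () and a second one in alloc_cand_and_find_basis ()) is why a cache can pay off even when few base trees are shared. The following is a minimal, self-contained C sketch of the idea only; it is not GCC's pointer_map and expensive_expand is a made-up stand-in for the tree-affine expansion.]

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the expensive expansion; the real code calls
   tree_to_aff_combination_expand / aff_combination_to_tree.  */
static int expand_count;
static long
expensive_expand (const void *base)
{
  expand_count++;
  return (long) ((size_t) base % 97);  /* arbitrary placeholder result */
}

/* A tiny linear-probed pointer map, standing in for alt_base_map.  */
#define CACHE_SIZE 64
static struct { const void *key; long val; int used; } cache[CACHE_SIZE];
static int cache_hits;

static long
get_alternative_base_model (const void *base)
{
  size_t i, slot = ((size_t) base >> 3) % CACHE_SIZE;
  for (i = 0; i < CACHE_SIZE; i++)
    {
      size_t s = (slot + i) % CACHE_SIZE;
      if (cache[s].used && cache[s].key == base)
	{
	  cache_hits++;		/* second and later lookups land here */
	  return cache[s].val;
	}
      if (!cache[s].used)
	{
	  cache[s].used = 1;
	  cache[s].key = base;
	  cache[s].val = expensive_expand (base);
	  return cache[s].val;
	}
    }
  return expensive_expand (base);  /* cache full; recompute */
}
```

With this shape, the second lookup of the same base is a hash probe instead of a re-expansion, which is the effect the hit-rate numbers above are measuring.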
> >>>>>>> So I'd like to have the cache in the patch.
> >>>>>>>
> >>>>>>>>>>> +/* { dg-do compile } */
> >>>>>>>>>>> +/* { dg-options "-O2 -fdump-tree-slsr" } */
> >>>>>>>>>>> +
> >>>>>>>>>>> +typedef int arr_2[50][50];
> >>>>>>>>>>> +
> >>>>>>>>>>> +void foo (arr_2 a2, int v1)
> >>>>>>>>>>> +{
> >>>>>>>>>>> +  int i, j;
> >>>>>>>>>>> +
> >>>>>>>>>>> +  i = v1 + 5;
> >>>>>>>>>>> +  j = i;
> >>>>>>>>>>> +  a2 [i-10] [j] = 2;
> >>>>>>>>>>> +  a2 [i] [j++] = i;
> >>>>>>>>>>> +  a2 [i+20] [j++] = i;
> >>>>>>>>>>> +  a2 [i-3] [i-1] += 1;
> >>>>>>>>>>> +  return;
> >>>>>>>>>>> +}
> >>>>>>>>>>> +
> >>>>>>>>>>> +/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
> >>>>>>>>>>> +/* { dg-final { cleanup-tree-dump "slsr" } } */
> >>>>>>>>>>>
> >>>>>>>>>>> scanning for 5 MEMs looks non-sensical.  What transform do
> >>>>>>>>>>> you expect?  I see other slsr testcases do similar
> >>>>>>>>>>> non-sensical checking, which is bad, too.
> >>>>>>>>>>
> >>>>>>>>>> As slsr optimizes CAND_REF candidates by simply lowering them
> >>>>>>>>>> to MEM_REF from e.g. ARRAY_REF, I think scanning for the
> >>>>>>>>>> number of MEM_REFs is an effective check.  Alternatively, I
> >>>>>>>>>> can add a follow-up patch to add some dumping facility in
> >>>>>>>>>> replace_ref () to print out the replacing actions when
> >>>>>>>>>> -fdump-tree-slsr-details is on.
> >>>>>>>>
> >>>>>>>> I think adding some details to the dump and scanning for them
> >>>>>>>> would be better.  That's the only change that is required for
> >>>>>>>> this to move forward.
> >>>>>>>
> >>>>>>> I've updated the patch to dump more details when
> >>>>>>> -fdump-tree-slsr-details is on.  The tests have also been updated
> >>>>>>> to scan for these new dumps instead of MEMs.
> >>>>>>>
> >>>>>>>> I suggest doing it quickly.  We're well past stage1 close at
> >>>>>>>> this point.
> >>>>>>>
> >>>>>>> The bootstrapping on x86_64 is still running.  OK to commit if it
> >>>>>>> succeeds?
> >>>>>>
> >>>>>> I still don't like it.  It's using the wrong and too expensive
> >>>>>> tools to do stuff.  What kind of bases are we ultimately
> >>>>>> interested in?  Browsing the code it looks like we're having
> >>>>>>
> >>>>>>   /* Base expression for the chain of candidates:  often, but not
> >>>>>>      always, an SSA name.  */
> >>>>>>   tree base_expr;
> >>>>>>
> >>>>>> which isn't really too informative, but I suppose they are all
> >>>>>> kind-of-gimple_val()s?  That said, I wonder if you can simply use
> >>>>>> get_addr_base_and_unit_offset in place of get_alternative_base (),
> >>>>>> ignoring the returned offset.
> >>>>>
> >>>>> 'base_expr' is essentially the base address of a
> >>>>> handled_component_p, e.g. ARRAY_REF, COMPONENT_REF, etc.  In most
> >>>>> cases, it is the address of the object returned by
> >>>>> get_inner_reference ().
> >>>>>
> >>>>> Given a test case like the following:
> >>>>>
> >>>>> typedef int arr_2[20][20];
> >>>>>
> >>>>> void foo (arr_2 a2, int i, int j)
> >>>>> {
> >>>>>   a2[i+10][j] = 1;
> >>>>>   a2[i+10][j+1] = 1;
> >>>>>   a2[i+20][j] = 1;
> >>>>> }
> >>>>>
> >>>>> The IR before SLSR is (on x86_64):
> >>>>>
> >>>>>   _2 = (long unsigned int) i_1(D);
> >>>>>   _3 = _2 * 80;
> >>>>>   _4 = _3 + 800;
> >>>>>   _6 = a2_5(D) + _4;
> >>>>>   *_6[j_8(D)] = 1;
> >>>>>   _10 = j_8(D) + 1;
> >>>>>   *_6[_10] = 1;
> >>>>>   _12 = _3 + 1600;
> >>>>>   _13 = a2_5(D) + _12;
> >>>>>   *_13[j_8(D)] = 1;
> >>>>>
> >>>>> The base_expr for the 1st and 2nd memory references is the same,
> >>>>> i.e. _6, while the base_expr for a2[i+20][j] is _13.
> >>>>>
> >>>>> _13 is essentially (_6 + 800), so all three memory references
> >>>>> essentially share the same base address.  As their strides are
> >>>>> also the same (MULT_EXPR (j, 4)), the three references can all be
> >>>>> lowered to MEM_REFs.
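[Editorial note: to make the address arithmetic above concrete, here is a hand-written source-level version of the lowering the pass performs. This is a sketch assuming 4-byte int and the 80-byte row stride from the x86_64 dump; foo_lowered is a made-up name, not anything the pass emits.]

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int arr_2[20][20];

/* The original references from the example.  */
static void
foo (arr_2 a2, int i, int j)
{
  a2[i+10][j]   = 1;
  a2[i+10][j+1] = 1;
  a2[i+20][j]   = 1;
}

/* The same stores expressed the way the pass sees them after the
   patch: one shared base (a2 + i*80 bytes, i.e. the expanded _6/_13
   with the immediate offset removed) plus constant byte offsets and
   the common stride j*4.  */
static void
foo_lowered (arr_2 a2, int i, int j)
{
  char *base = (char *) a2 + (size_t) i * 80;	    /* a2_5(D) + i_1(D)*80 */
  *(int *) (base +  800 + (size_t) j * 4) = 1;	    /* a2[i+10][j]   */
  *(int *) (base +  800 + (size_t) (j + 1) * 4) = 1; /* a2[i+10][j+1] */
  *(int *) (base + 1600 + (size_t) j * 4) = 1;	    /* a2[i+20][j]   */
}
```

The byte offsets check out against the gimple: a2[i+10][j] is i*80 + 800 + j*4 and a2[i+20][j] is i*80 + 1600 + j*4, matching _4 and _12 above.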
> >>>>> What this patch does is use the tree affine tools to help
> >>>>> recognize the underlying base address expression; as this requires
> >>>>> looking into the definitions of SSA_NAMEs,
> >>>>> get_addr_base_and_unit_offset () won't help here.
> >>>>>
> >>>>> Bill has helped me explore other ways of achieving this in SLSR,
> >>>>> but so far we think this is the best way to proceed.  The use of
> >>>>> tree affine routines has been restricted to CAND_REFs only, and
> >>>>> there is the aforementioned cache facility to help reduce the
> >>>>> overhead.
> >>>>>
> >>>>> Thanks,
> >>>>> Yufeng
> >>>>>
> >>>>> P.S. some more details on what the patch does:
> >>>>>
> >>>>> The CAND_REFs for the three memory references are:
> >>>>>
> >>>>>   6  [2] *_6[j_8(D)] = 1;
> >>>>>      REF  : _6 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
> >>>>>      basis: 0  dependent: 8  sibling: 0
> >>>>>      next-interp: 0  dead-savings: 0
> >>>>>
> >>>>>   8  [2] *_6[_10] = 1;
> >>>>>      REF  : _6 + ((sizetype) j_8(D) * 4) + 4 : int[20] *
> >>>>>      basis: 6  dependent: 11  sibling: 0
> >>>>>      next-interp: 0  dead-savings: 0
> >>>>>
> >>>>>   11 [2] *_13[j_8(D)] = 1;
> >>>>>      REF  : _13 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
> >>>>>      basis: 8  dependent: 0  sibling: 0
> >>>>>      next-interp: 0  dead-savings: 0
> >>>>>
> >>>>> Before the patch, the strength reduction candidate chains for the
> >>>>> three CAND_REFs are:
> >>>>>
> >>>>>   _6 -> 6 -> 8
> >>>>>   _13 -> 11
> >>>>>
> >>>>> i.e. SLSR recognizes that the first two references share the same
> >>>>> basis, while the last one is on its own.
> >>>>>
> >>>>> With the patch, an extra candidate chain can be recognized:
> >>>>>
> >>>>>   a2_5(D) + (sizetype) i_1(D) * 80 -> 6 -> 11 -> 8
> >>>>>
> >>>>> i.e. all three references are found to have the same basis
> >>>>> (a2_5(D) + (sizetype) i_1(D) * 80), which is essentially the
> >>>>> expanded _6 or _13, with the immediate offset removed.
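[Editorial note: the "expanded base with the immediate offset removed" step can be modelled in a few lines. The struct below is a toy affine form (base + coeff*index + offset), not GCC's aff_tree; the point is only that zeroing the offset field is what makes _6 and _13 compare equal as alternative bases.]

```c
#include <assert.h>

/* Toy model of an affine combination: base + coeff * index + offset.
   The real pass uses aff_tree from tree-affine.c.  */
struct affine { const void *base; long coeff; long offset; };

/* Two expressions name the same alternative base when they agree
   after the constant offset is dropped -- the aff.offset = 0 step
   in get_alternative_base ().  */
static int
same_alternative_base (struct affine a, struct affine b)
{
  a.offset = 0;
  b.offset = 0;
  return a.base == b.base && a.coeff == b.coeff;
}
```

In the example, _6 expands to (a2, 80, 800) and _13 to (a2, 80, 1600); after the offsets are dropped they are identical, which is why candidates 6, 8, and 11 can all hang off the one chain.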
> >>>>> The pass is now able to lower all three references, instead of
> >>>>> only the first two, to MEM_REFs.
> >>>>
> >>>> Ok, so slsr handles arbitrarily complex bases and figures out
> >>>> common components?  If so, then why not just use
> >>>> get_inner_reference?  After all slsr does not use tree-affine as
> >>>> the representation for bases (which it could?)
> >>>
> >>> I think that's overstating SLSR's current capabilities a bit. :)  We
> >>> do use get_inner_reference to come up with the base expression for
> >>> reference candidates (based on some of your suggestions a couple of
> >>> years back).  However, in the case of multiple levels of array
> >>> references, we miss opportunities because get_inner_reference stops
> >>> at an SSA name that could be further expanded by following its
> >>> definition back to a more fundamental base expression.
> >>
> >> Using tree-affine.c to_affine_comb / affine_comb_to_tree has exactly
> >> the same problem.
> >>
> >>> Part of the issue here is that reference candidates are the basis
> >>> for a more specific optimization than the mult and add candidates.
> >>> The latter have a more general framework for building up a record of
> >>> simple affine expressions that can be strength-reduced.  Ultimately
> >>> we ought to be able to do something similar for reference
> >>> candidates, building up simple affine expressions from base
> >>> expressions, so that everything is done in a forward order and the
> >>> tree-affine interfaces aren't needed.  But that will take some more
> >>> fundamental design changes, and since this provides some good
> >>> improvements for important cases, I feel it's reasonable to get this
> >>> into the release.
> >>
> >> But I fail to see what is special about doing the dance to affine and
> >> then back to trees just to drop the constant offset, which would be
> >> done by get_inner_reference as well, and more cheaply, if you just
> >> ignore bitpos.
> >
> > I'm not sure what you're suggesting that he use get_inner_reference
> > on at this point.  At the point where the affine machinery is
> > invoked, the memory reference was already expanded with
> > get_inner_reference, and there was no basis involving the SSA name
> > produced as the base.  The affine machinery is invoked on that SSA
> > name to see if it is hiding another base.  There's no additional
> > memory reference to use get_inner_reference on, just potentially some
> > pointer arithmetic.
> >
> > That said, if we have real compile-time issues, we should hold off on
> > this patch for this release.
> >
> > Yufeng, please time some reasonably large benchmarks (some version of
> > SPECint or similar) and report back here before the patch goes in.
>
> I've got some build time data for SPEC2Kint.
>
> On x86_64 the -O3 builds take about 4m7.5s with or without the patch
> (consistent over 3 samples).  The difference in the -O3 build time on
> arm cortex-a15 is also within 2 seconds.
>
> The bootstrapping time on x86_64 is 134m48.040s without the patch and
> 134m46.889s with the patch; this data is preliminary as I only sampled
> once, but the difference in the bootstrapping time on arm cortex-a15
> is also within 5 seconds.
>
> I can further time SPEC2006int if necessary.
>
> I've also prepared a patch to further reduce the number of calls to
> tree-affine expansion, by checking whether or not the passed-in BASE
> in get_alternative_base () is simply an SSA_NAME of a declared
> variable.  Please see the inlined patch below.
>
> Thanks,
> Yufeng
>
>
> diff --git a/gcc/gimple-ssa-strength-reduction.c
> b/gcc/gimple-ssa-strength-reduction.c
> index 26502c3..2984f06 100644
> --- a/gcc/gimple-ssa-strength-reduction.c
> +++ b/gcc/gimple-ssa-strength-reduction.c
> @@ -437,13 +437,22 @@ get_alternative_base (tree base)
>
>    if (result == NULL)
>      {
> -      tree expr;
> -      aff_tree aff;
> +      tree expr = NULL;
> +      gimple def = NULL;
>
> -      tree_to_aff_combination_expand (base, TREE_TYPE (base),
> -                                      &aff, &name_expansions);
> -      aff.offset = tree_to_double_int (integer_zero_node);
> -      expr = aff_combination_to_tree (&aff);
> +      if (TREE_CODE (base) == SSA_NAME)
> +        def = SSA_NAME_DEF_STMT (base);
> +
> +      /* Avoid calling expensive tree-affine expansion if BASE
> +         is just an SSA_NAME of, e.g. a PARM_DECL.  */
> +      if (!def || (is_gimple_assign (def) && gimple_assign_lhs (def) == base))
Well, that just isn't right.  !def indicates you have a parameter, so
why call tree_to_aff_combination_expand in that case?

Just forget this addition and check for flag_expensive_optimizations as
Richard suggested in another branch of this thread.  The previous
version of the patch is ok with this change, and with a comment added
that we should eliminate this backtracking with better forward analysis
in a future release.

Thanks,
Bill

> +        {
> +          aff_tree aff;
> +          tree_to_aff_combination_expand (base, TREE_TYPE (base),
> +                                          &aff, &name_expansions);
> +          aff.offset = tree_to_double_int (integer_zero_node);
> +          expr = aff_combination_to_tree (&aff);
> +        }
>
>    result = (tree *) pointer_map_insert (alt_base_map, base);
>    gcc_assert (!*result);
>
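[Editorial note: the shape Bill asks for — drop the def-stmt special case and gate the whole expansion on flag_expensive_optimizations — can be sketched as follows. This is an illustrative model only, not the GCC code: flag_expensive_optimizations is a real GCC flag, but the helper and its return convention here are stand-ins.]

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for GCC's flag; in the compiler it is on at -O2 and above.  */
static int flag_expensive_optimizations;

static int expansions;  /* counts calls to the "expensive" expansion */

/* Toy model of get_alternative_base with the suggested gating: only
   perform the tree-affine expansion under flag_expensive_optimizations;
   otherwise report no alternative base (NULL stands in for NULL_TREE).  */
static const void *
get_alternative_base_model (const void *base)
{
  if (!flag_expensive_optimizations)
    return NULL;
  expansions++;   /* tree_to_aff_combination_expand etc. would run here */
  return base;    /* placeholder for the rebuilt, offset-free tree */
}
```

The gate keeps the backtracking out of cheap compilations entirely, instead of trying to guess from the SSA name's defining statement whether the expansion would be useful.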