On 12/04/13 13:08, Bill Schmidt wrote:
On Wed, 2013-12-04 at 11:26 +0100, Richard Biener wrote:
On Tue, Dec 3, 2013 at 11:04 PM, Bill Schmidt
<wschm...@linux.vnet.ibm.com> wrote:
On Tue, 2013-12-03 at 21:35 +0100, Richard Biener wrote:
Yufeng Zhang<yufeng.zh...@arm.com> wrote:
On 12/03/13 14:20, Richard Biener wrote:
On Tue, Dec 3, 2013 at 1:50 PM, Yufeng Zhang<yufeng.zh...@arm.com>
wrote:
On 12/03/13 06:48, Jeff Law wrote:
On 12/02/13 08:47, Yufeng Zhang wrote:
Ping~
http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03360.html
Thanks,
Yufeng
On 11/26/13 15:02, Yufeng Zhang wrote:
On 11/26/13 12:45, Richard Biener wrote:
On Thu, Nov 14, 2013 at 12:25 AM, Yufeng
Zhang<yufeng.zh...@arm.com> wrote:
On 11/13/13 20:54, Bill Schmidt wrote:
The second version of your original patch is ok with me with the
following changes. Sorry for the little side adventure into the
next-interp logic; in the end that's going to hurt more than it helps
in this case. Thanks for having a look at it, anyway. Thanks also for
cleaning up this version to be less intrusive to common interfaces; I
appreciate it.
Thanks a lot for the review. I've attached an updated patch with the
suggested changes incorporated.

As for the next-interp adventure, I was quite happy to do the
experiment; it was a good chance to gain insight into the pass. Many
thanks for your prompt replies and patience in guiding me!
Everything else looks OK to me. Please ask Richard for final approval,
as I'm not a maintainer.
First a note: I need to check with the steering committee on the vote
for Bill as the slsr maintainer. Voting was in progress just before the
close of stage1 development, so I haven't tallied the results :-)
Looking forward to some good news! :)
Yes, you are right that non-trivial 'base' trees are rarely shared.
The cache is introduced mainly because get_alternative_base () may be
called twice on the same 'base' tree: once in
find_basis_for_candidate () for the look-up and a second time in
alloc_cand_and_find_basis () for record_potential_basis (). I'm happy
to leave out the cache if you think the benefit is trivial.
Without some sense of how expensive the lookups are vs. how often the
cache hits, it's awfully hard to know if the cache is worth it.
I'd say take it out unless you have some sense it's really saving time.
It's a pretty minor implementation detail either way.
I think the affine tree routines are generally expensive; it is worth
having a cache to avoid calling them too many times. I ran the slsr-*.c
tests under gcc.dg/tree-ssa/ and found that the cache hit rates range
from 55.6% to 90%, with an average of 73.5%. The samples may not
represent real-world scenarios well, but they do show that the 'base'
tree can be shared to some extent. So I'd like to keep the cache in the
patch.
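For illustration, the memoization pattern the cache follows is roughly
this, sketched with the pointer_map interface that the inlined patch at
the end of this mail also uses; expand_base_via_affine () is a
hypothetical placeholder for the tree-affine round trip, not a real
function:

static struct pointer_map_t *alt_base_map;

static tree
get_alternative_base (tree base)
{
  tree *result = (tree *) pointer_map_contains (alt_base_map, base);

  if (result)
    return *result;   /* Cache hit: the affine expansion is skipped.  */

  /* Cache miss: do the expensive expansion once and memoize it under
     BASE, so the second call from alloc_cand_and_find_basis () is
     effectively free.  */
  tree expr = expand_base_via_affine (base);   /* hypothetical helper */

  result = (tree *) pointer_map_insert (alt_base_map, base);
  *result = expr;
  return expr;
}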
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-slsr" } */
+
+typedef int arr_2[50][50];
+
+void foo (arr_2 a2, int v1)
+{
+  int i, j;
+
+  i = v1 + 5;
+  j = i;
+  a2 [i-10] [j] = 2;
+  a2 [i] [j++] = i;
+  a2 [i+20] [j++] = i;
+  a2 [i-3] [i-1] += 1;
+  return;
+}
+
+/* { dg-final { scan-tree-dump-times "MEM" 5 "slsr" } } */
+/* { dg-final { cleanup-tree-dump "slsr" } } */
Scanning for 5 MEMs looks nonsensical. What transform do you expect?
I see other slsr testcases do similarly nonsensical checking, which is
bad, too.
As slsr optimizes CAND_REF candidates by simply lowering them from
e.g. ARRAY_REF to MEM_REF, I think scanning for the number of MEM_REFs
is an effective check. Alternatively, I can add a follow-up patch to
add some dumping facility in replace_ref () to print out the
replacement actions when -fdump-tree-slsr-details is on.
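Something along these lines in replace_ref (), around the in-place
rewrite of the candidate statement C->cand_stmt (a sketch only; the
exact strings are assumptions, not committed text):

if (dump_file && (dump_flags & TDF_DETAILS))
  {
    fputs ("Replacing reference: ", dump_file);
    print_gimple_stmt (dump_file, c->cand_stmt, 0, 0);
  }

/* ... rewrite the reference inside c->cand_stmt in place ... */

if (dump_file && (dump_flags & TDF_DETAILS))
  {
    fputs ("With: ", dump_file);
    print_gimple_stmt (dump_file, c->cand_stmt, 0, 0);
  }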
I think adding some details to the dump and scanning for them would be
better. That's the only change that is required for this to move
forward.
I've updated the patch to dump more details when
-fdump-tree-slsr-details is on. The tests have also been updated to
scan for these new dumps instead of MEMs.
I suggest doing it quickly. We're well past stage1 close at this point.
The bootstrapping on x86_64 is still running. OK to commit if it
succeeds?
I still don't like it. It's using the wrong, and too expensive, tools
to do stuff. What kind of bases are we ultimately interested in?
Browsing the code, it looks like we have

  /* Base expression for the chain of candidates: often, but not
     always, an SSA name.  */
  tree base_expr;

which isn't really too informative, but I suppose they are all
kind-of-gimple_val()s? That said, I wonder if you can simply
use get_addr_base_and_unit_offset in place of get_alternative_base (),
ignoring the returned offset.
'base_expr' is essentially the base address of a handled_component_p,
e.g. an ARRAY_REF, COMPONENT_REF, etc. In most cases, it is the address
of the object returned by get_inner_reference ().
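For reference, base_expr is typically derived along these lines (a
sketch of the get_inner_reference interface of this era; the
surrounding code is illustrative, not the exact slsr code):

HOST_WIDE_INT bitsize, bitpos;
enum machine_mode mode;
int unsignedp, volatilep;
tree offset, base;

/* Peel the handled components off REF down to the innermost object.  */
base = get_inner_reference (ref, &bitsize, &bitpos, &offset,
                            &mode, &unsignedp, &volatilep, false);
/* The address of BASE then becomes base_expr for the candidate.  */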
Given a test case like the following:
typedef int arr_2[20][20];

void foo (arr_2 a2, int i, int j)
{
  a2[i+10][j] = 1;
  a2[i+10][j+1] = 1;
  a2[i+20][j] = 1;
}
The IR before SLSR is (on x86_64):
_2 = (long unsigned int) i_1(D);
_3 = _2 * 80;
_4 = _3 + 800;
_6 = a2_5(D) + _4;
*_6[j_8(D)] = 1;
_10 = j_8(D) + 1;
*_6[_10] = 1;
_12 = _3 + 1600;
_13 = a2_5(D) + _12;
*_13[j_8(D)] = 1;
The base_expr for the 1st and 2nd memory references is the same, i.e.
_6, while the base_expr for a2[i+20][j] is _13.

_13 is essentially (_6 + 800), so all three memory references
essentially share the same base address. As their strides are also the
same (MULT_EXPR (j, 4)), the three references can all be lowered to
MEM_REFs. What this patch does is use the tree-affine tools to help
recognize the underlying base address expression; as that requires
looking into the definitions of SSA_NAMEs,
get_addr_base_and_unit_offset () won't help here.
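Concretely, the recognition boils down to the following round trip
through the affine representation; these are the same calls that appear
in the inlined diff at the end of this mail, where base, expr and
name_expansions are defined:

aff_tree aff;

/* Expand BASE, following SSA_NAME definitions, into an affine
   combination, then zero the constant part and rebuild a tree.  */
tree_to_aff_combination_expand (base, TREE_TYPE (base),
                                &aff, &name_expansions);
aff.offset = tree_to_double_int (integer_zero_node);
expr = aff_combination_to_tree (&aff);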
Bill has helped me explore other ways of achieving this in SLSR, but so
far we think this is the best way to proceed. The use of the
tree-affine routines has been restricted to CAND_REFs only, and there
is the aforementioned cache facility to help reduce the overhead.
Thanks,
Yufeng
P.S. Some more details on what the patch does:

The CAND_REFs for the three memory references are:
  6  [2] *_6[j_8(D)] = 1;
     REF : _6 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
     basis: 0  dependent: 8  sibling: 0
     next-interp: 0  dead-savings: 0

  8  [2] *_6[_10] = 1;
     REF : _6 + ((sizetype) j_8(D) * 4) + 4 : int[20] *
     basis: 6  dependent: 11  sibling: 0
     next-interp: 0  dead-savings: 0

 11  [2] *_13[j_8(D)] = 1;
     REF : _13 + ((sizetype) j_8(D) * 4) + 0 : int[20] *
     basis: 8  dependent: 0  sibling: 0
     next-interp: 0  dead-savings: 0
Before the patch, the strength reduction candidate chains for the three
CAND_REFs are:

  _6 -> 6 -> 8
  _13 -> 11

i.e. SLSR recognizes that the first two references share the same
basis, while the last one is on its own.
With the patch, an extra candidate chain can be recognized:

  a2_5(D) + (sizetype) i_1(D) * 80 -> 6 -> 11 -> 8

i.e. all three references are found to have the same basis
(a2_5(D) + (sizetype) i_1(D) * 80), which is essentially the expanded
_6 or _13 with the immediate offset removed. The pass is now able to
lower all three references, instead of only the first two, to MEM_REFs.
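To make the shared basis explicit (all numbers derivable from the IR
and dumps above), write B = a2_5(D) + (sizetype) i_1(D) * 80 and
S = (sizetype) j_8(D) * 4; each row of the array is 80 bytes, so:

  a2[i+10][j]    ->  *(B + S +  800)   candidate 6
  a2[i+10][j+1]  ->  *(B + S +  804)   candidate 8   (basis 6, delta 4)
  a2[i+20][j]    ->  *(B + S + 1600)   candidate 11  (basis 8, delta 796)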
Ok, so slsr handles arbitrarily complex bases and figures out common
components? If so, then why not just use get_inner_reference? After
all, slsr does not use tree-affine as the representation for bases
(though it could?)
I think that's overstating SLSR's current capabilities a bit. :) We do
use get_inner_reference to come up with the base expression for
reference candidates (based on some of your suggestions a couple of
years back). However, in the case of multiple levels of array
references, we miss opportunities because get_inner_reference stops at
an SSA name that could be further expanded by following its definition
back to a more fundamental base expression.

Using the tree-affine.c tree_to_aff_combination /
aff_combination_to_tree pair has exactly the same problem.
Part of the issue here is that reference candidates are a basis for a
more specific optimization than the mult and add candidates. The latter
have a more general framework for building up a record of simple affine
expressions that can be strength-reduced. Ultimately we ought to be
able to do something similar for reference candidates, building up
simple affine expressions from base expressions, so that everything is
done in a forward order and the tree-affine interfaces aren't needed.
But that will take some more fundamental design changes, and since this
provides some good improvements for important cases, I feel it's
reasonable to get this into the release.
But I fail to see what is special about doing the dance to affine and
back to trees just to drop the constant offset, which
get_inner_reference would do as well, and more cheaply, if you just
ignore the bitpos.
I'm not sure what you're suggesting he use get_inner_reference on at
this point. At the point where the affine machinery is invoked, the
memory reference has already been expanded with get_inner_reference,
and there was no basis involving the SSA name produced as the base. The
affine machinery is invoked on that SSA name to see if it is hiding
another base. There's no additional memory reference to use
get_inner_reference on, just potentially some pointer arithmetic.

That said, if we have real compile-time issues, we should hold off on
this patch for this release.
Yufeng, please time some reasonably large benchmarks (some version of
SPECint or similar) and report back here before the patch goes in.
I've got some build-time data for SPEC2Kint.

On x86_64 the -O3 builds take about 4m7.5s with or without the patch
(consistent over 3 samples). The difference in -O3 build time on ARM
Cortex-A15 is also within 2 seconds.

The bootstrapping time on x86_64 is 134m48.040s without the patch and
134m46.889s with the patch; this data is preliminary as I only sampled
once, but the difference in bootstrapping time on ARM Cortex-A15 is
also within 5 seconds.

I can further time SPEC2006int if necessary.
I've also prepared a patch to further reduce the number of calls to the
tree-affine expansion, by checking whether or not the BASE passed to
get_alternative_base () is simply an SSA_NAME of a declared variable.
Please see the inlined patch below.
Thanks,
Yufeng
diff --git a/gcc/gimple-ssa-strength-reduction.c b/gcc/gimple-ssa-strength-reduction.c
index 26502c3..2984f06 100644
--- a/gcc/gimple-ssa-strength-reduction.c
+++ b/gcc/gimple-ssa-strength-reduction.c
@@ -437,13 +437,22 @@ get_alternative_base (tree base)
   if (result == NULL)
     {
-      tree expr;
-      aff_tree aff;
+      tree expr = NULL;
+      gimple def = NULL;
-      tree_to_aff_combination_expand (base, TREE_TYPE (base),
-                                      &aff, &name_expansions);
-      aff.offset = tree_to_double_int (integer_zero_node);
-      expr = aff_combination_to_tree (&aff);
+      if (TREE_CODE (base) == SSA_NAME)
+        def = SSA_NAME_DEF_STMT (base);
+
+      /* Avoid calling expensive tree-affine expansion if BASE
+         is just an SSA_NAME of, e.g. a PARM_DECL.  */
+      if (!def || (is_gimple_assign (def) && gimple_assign_lhs (def) == base))
+        {
+          aff_tree aff;
+
+          tree_to_aff_combination_expand (base, TREE_TYPE (base),
+                                          &aff, &name_expansions);
+          aff.offset = tree_to_double_int (integer_zero_node);
+          expr = aff_combination_to_tree (&aff);
+        }
       result = (tree *) pointer_map_insert (alt_base_map, base);
       gcc_assert (!*result);