On Fri, 14 Jun 2024, Tamar Christina wrote:
> Hi All,
>
> IVOPTS normally uses affine trees to perform comparisons between different
> IVs,
> but these seem to have been missing in two key spots and instead normal tree
> equivalencies used.
>
> In some cases where we have a structural equivalence but not a signedness
> equivalencies we end up generating both a signed and unsigned IV for the same
> candidate.
>
> This happens quite a lot with fortran but can also happen in C because this
> came
> code is unable to figure out when one expression is a multiple of another.
>
> As an example in the attached testcase we get:
>
> Initial set of candidates:
> cost: 24 (complexity 3)
> reg_cost: 9
> cand_cost: 15
> cand_group_cost: 0 (complexity 3)
> candidates: 1, 6, 8
> group:0 --> iv_cand:6, cost=(0,1)
> group:1 --> iv_cand:1, cost=(0,0)
> group:2 --> iv_cand:8, cost=(0,1)
> group:3 --> iv_cand:8, cost=(0,1)
> invariant variables: 6
> invariant expressions: 1, 2
>
> <Invariant Expressions>:
> inv_expr 1: stride.3_27 * 4
> inv_expr 2: (unsigned long) stride.3_27 * 4
>
> These end up being used in the same group:
>
> Group 1:
> cand cost compl. inv.expr. inv.vars
> 1 0 0 NIL; 6
> 2 0 0 NIL; 6
> 3 0 0 NIL; 6
>
> which ends up with IV opts picking the signed and unsigned IVs:
>
> Improved to:
> cost: 24 (complexity 3)
> reg_cost: 9
> cand_cost: 15
> cand_group_cost: 0 (complexity 3)
> candidates: 1, 6, 8
> group:0 --> iv_cand:6, cost=(0,1)
> group:1 --> iv_cand:1, cost=(0,0)
> group:2 --> iv_cand:8, cost=(0,1)
> group:3 --> iv_cand:8, cost=(0,1)
> invariant variables: 6
> invariant expressions: 1, 2
>
> and so generates the same IV as both signed and unsigned:
>
> ;; basic block 21, loop depth 3, count 214748368 (estimated locally, freq
> 58.2545), maybe hot
> ;; prev block 28, next block 31, flags: (NEW, REACHABLE, VISITED)
> ;; pred: 28 [always] count:23622320 (estimated locally, freq
> 6.4080) (FALLTHRU,EXECUTABLE)
> ;; 25 [always] count:191126046 (estimated locally, freq
> 51.8465) (FALLTHRU,DFS_BACK,EXECUTABLE)
> # .MEM_66 = PHI <.MEM_34(28), .MEM_22(25)>
> # ivtmp.22_41 = PHI <0(28), ivtmp.22_82(25)>
> # ivtmp.26_51 = PHI <ivtmp.26_55(28), ivtmp.26_72(25)>
> # ivtmp.28_90 = PHI <ivtmp.28_99(28), ivtmp.28_98(25)>
>
> ...
>
> ;; basic block 24, loop depth 3, count 214748366 (estimated locally, freq
> 58.2545), maybe hot
> ;; prev block 22, next block 25, flags: (NEW, REACHABLE, VISITED)'
> ;; pred: 22 [always] count:95443719 (estimated locally, freq
> 25.8909) (FALLTHRU)
>
> ;; 21 [33.3%
> (guessed)] count:71582790 (estimated locally, freq 19.4182)
> (TRUE_VALUE,EXECUTABLE)
>
> ;; 31 [33.3% (guessed)]
> count:47721860 (estimated locally, freq 12.9455) (TRUE_VALUE,EXECUTABLE)
>
> #
> .MEM_22 = PHI <.MEM_44(22), .MEM_31(21), .MEM_79(31)>
>
>
ivtmp.22_82 = ivtmp.22_41 + 1;
ivtmp.26_72 = ivtmp.26_51 + _80;
ivtmp.28_98 =
ivtmp.28_90 + _39;
>
> These two IVs are always used as unsigned, so IV ops generates:
>
> _73 = stride.3_27 * 4;
> _80 = (unsigned long) _73;
> _54 = (unsigned long) stride.3_27;
> _39 = _54 * 4;
>
> Which means that in e.g. exchange2 we generate a lot of duplicate code.
>
> This is because candidate 6 and 8 are structurally equivalent but have
> different
> signs.
>
> This patch changes it so that if you have two IVs that are affine equivalent
> to
> just pick one over the other. IV already has code for this, so the patch just
> uses affine trees instead of tree for the check.
>
> With it we get:
>
> <Invariant Expressions>:
> inv_expr 1: stride.3_27 * 4
>
> <Group-candidate Costs>:
> Group 0:
> cand cost compl. inv.expr. inv.vars
> 5 0 2 NIL; NIL;
> 6 0 3 NIL; NIL;
>
> Group 1:
> cand cost compl. inv.expr. inv.vars
> 1 0 0 NIL; 6
> 2 0 0 NIL; 6
> 3 0 0 NIL; 6
> 4 0 0 NIL; 6
>
> Initial set of candidates:
> cost: 16 (complexity 3)
> reg_cost: 6
> cand_cost: 10
> cand_group_cost: 0 (complexity 3)
> candidates: 1, 6
> group:0 --> iv_cand:6, cost=(0,3)
> group:1 --> iv_cand:1, cost=(0,0)
> invariant variables: 6
> invariant expressions: 1
>
> The two patches together results in a 10% performance increase in exchange2 in
> SPECCPU 2017 and a 4% reduction in binary size and a 5% improvement in compile
> time. There's also a 5% performance improvement in fotonik3d and similar
> reduction in binary size.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> PR tree-optimization/114932
> * tree-affine.cc (aff_combination_constant_multiple_p): Take zero
> offsets into account.
> * tree-ssa-loop-ivopts.cc (affine_compare_eq): New.
> (record_group_use): Use it.
> (constant_multiple_of): Also check equality under
> aff_combination_constant_multiple_p.
>
> gcc/testsuite/ChangeLog:
>
> PR tree-optimization/114932
> * gfortran.dg/addressing-modes_2.f90: New test.
>
> ---
> diff --git a/gcc/testsuite/gfortran.dg/addressing-modes_2.f90
> b/gcc/testsuite/gfortran.dg/addressing-modes_2.f90
> new file mode 100644
> index
> 0000000000000000000000000000000000000000..8eee4be3dc4d69fecfacd4c2e24a4973c8539fae
> --- /dev/null
> +++ b/gcc/testsuite/gfortran.dg/addressing-modes_2.f90
> @@ -0,0 +1,20 @@
> +! { dg-do compile { target aarch64-*-* } }
> +! { dg-additional-options "-w -Ofast -fdump-tree-ivopts-all" }
> +
> +module a
> +integer, parameter :: b=3, c=b
> +contains
> +subroutine d(block)
> +integer f, col , block(:, :, :), e
> +do f = 1, c
> + do col = 1, c
> + block(:f, :, e()) = do
> + end do
> + end do
> + end
> + end
> +
> +! { dg-final { scan-tree-dump-not {Selected IV set for loop .+ niters, 3
> IVs:} ivopts } }
> +! { dg-final { scan-tree-dump-times {Selected IV set for loop .+ niters, 2
> IVs:} 1 ivopts } }
> +! { dg-final { scan-tree-dump-times {Selected IV set for loop .+ niters, 1
> IVs:} 1 ivopts } }
> +
> diff --git a/gcc/tree-affine.cc b/gcc/tree-affine.cc
> index
> d6309c4390362b680f0aa97a41fac3281ade66fd..b141bf23c1bbea1001b1bb286346890ddeab4096
> 100644
> --- a/gcc/tree-affine.cc
> +++ b/gcc/tree-affine.cc
> @@ -941,6 +941,13 @@ aff_combination_constant_multiple_p (aff_tree *val,
> aff_tree *div,
> &mult_set, mult))
> return false;
>
> + /* Everything is a multiple of 0, which means we shoudn't enforce that
> + mult_set is checked, since that forced the only valid multiple of
> + val and div to be 0 whereas 1 is also possible. */
> + if (known_eq (val->offset, 0)
> + && known_eq (div->offset, 0))
> + mult_set = false;
> +
In fact all numbers are possible? Shouldn't this be better handled
in wide_int_constant_multiple_p by special-casing
known_eq (div, 0) in the known_eq (val, 0) condition by simply
returning 'true' without checking or setting *mult_set?
The docs of wide_int_constant_multiple_p is odd:
/* If VAL != CST * DIV for any constant CST, returns false.
should that be 'If VAL == CST * DIV for no constant CST, returns false.'?
Or s/any/all/?
> for (i = 0; i < div->n; i++)
> {
> class aff_comb_elt *elt
> diff --git a/gcc/tree-ssa-loop-ivopts.cc b/gcc/tree-ssa-loop-ivopts.cc
> index
> 7a277aaf18a9e0a32b8ac0d23332b7cd9945ef98..b11bd62a86092ba972a648764cd2facd9ddb4914
> 100644
> --- a/gcc/tree-ssa-loop-ivopts.cc
> +++ b/gcc/tree-ssa-loop-ivopts.cc
> @@ -757,6 +757,19 @@ single_dom_exit (class loop *loop)
> return exit;
> }
>
> +/* Compares the given affine tree LEFT with the tree expression RIGHT and
> return
> + whether they are the same under affine equality. */
> +
> +static bool
> +affine_compare_eq (aff_tree &left, tree right)
> +{
> + aff_tree aff_right;
> + tree_to_aff_combination (right, TREE_TYPE (right), &aff_right);
> + aff_combination_scale (&aff_right, -1);
> + aff_combination_add (&aff_right, &left);
> + return aff_combination_zero_p (&aff_right);
> +}
> +
>
> /* Given a nested expression in ARG consisting of PLUS or MULT try to see if
> one
> of the arguments of each expression is a constant and that the type of the
> @@ -1673,6 +1686,9 @@ record_group_use (struct ivopts_data *data, tree *use_p,
> tree addr_base = NULL;
> struct iv_group *group = NULL;
> poly_uint64 addr_offset = 0;
> + aff_tree iv_step, iv_addr_base;
> +
> + tree_to_aff_combination (iv->step, TREE_TYPE (iv->step), &iv_step);
>
> /* Record non address type use in a new group. */
> if (address_p (type))
> @@ -1683,6 +1699,7 @@ record_group_use (struct ivopts_data *data, tree *use_p,
> tree addr_toffset;
> split_constant_offset (iv->base, &addr_base, &addr_toffset);
> addr_offset = int_cst_value (addr_toffset);
> + tree_to_aff_combination (addr_base, TREE_TYPE (addr_base),
> &iv_addr_base);
> for (i = 0; i < data->vgroups.length (); i++)
> {
> struct iv_use *use;
> @@ -1694,8 +1711,8 @@ record_group_use (struct ivopts_data *data, tree *use_p,
>
> /* Check if it has the same stripped base and step. */
> if (operand_equal_p (iv->base_object, use->iv->base_object, 0)
> - && operand_equal_p (iv->step, use->iv->step, 0)
> - && operand_equal_p (addr_base, use->addr_base, 0))
> + && affine_compare_eq (iv_step, use->iv->step)
> + && affine_compare_eq (iv_addr_base, use->addr_base))
There's only this use of addr_base so I think the opportunity is to
turn iv_use->addr_base into aff_tree (even though that's a quite big
representation).
For the testcase, what are the two IVs we are comparing? I wonder
why you need the affine compare for iv->step?
> break;
> }
> if (i == data->vgroups.length ())
> @@ -2231,6 +2248,14 @@ constant_multiple_of (tree top, tree bot, widest_int
> *mul)
> return true;
> }
>
> + aff_tree aff_top, aff_bot;
> + tree_to_aff_combination (top, TREE_TYPE (top), &aff_top);
> + tree_to_aff_combination (bot, TREE_TYPE (bot), &aff_bot);
> + poly_widest_int poly_mul;
> + if (aff_combination_constant_multiple_p (&aff_top, &aff_bot, &poly_mul)
> + && poly_mul.is_constant (mul))
> + return true;
> +
So why does stripping nops not work here?
> code = TREE_CODE (top);
> switch (code)
> {
>
>
>
>
>
--
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)