[RFC] Using main loop's updated IV as base_address for epilogue vectorization

2021-04-30 Thread Andre Vieira (lists) via Gcc-patches

Hi,

The aim of this RFC is to explore a way of cleaning up the codegen 
around data_references.  To be specific, I'd like to reuse the 
main-loop's updated data_reference as the base_address for the 
epilogue's corresponding data_reference, rather than use the niters.  We 
have found this leads to better codegen in the vectorized epilogue loops.


The approach in this RFC creates a map of iv_updates, each of which always 
contains an updated pointer that is captured in vectorizable_{load,store}; 
an iv_update may also contain a skip_edge in case we decide the 
vectorization can be skipped in 'vect_do_peeling'. During the epilogue 
update this map of iv_updates is checked for an entry for each 
data_reference; if one exists it is used, and if not we fall back to the 
old behavior of using the niters to advance the data_reference.
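
For reference, a minimal sketch of the shape the iv_update record takes 
(reconstructed from the 'new_base_addr' and 'skip_edge' uses in the patch 
below; only those two fields are certain):

/* Sketch of the iv_update record, inferred from the hunk below rather
   than copied from it.  One entry per data_reference, recorded by
   vectorizable_{load,store}.  */
struct iv_update
{
  /* The DR's pointer value after the main vectorized loop; used as the
     base_address of the epilogue's corresponding data_reference.  */
  tree new_base_addr;
  /* Set when 'vect_do_peeling' decides the main loop can be skipped;
     the epilogue then needs a PHI joining NEW_BASE_ADDR with the
     original base address.  */
  edge skip_edge;
};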


The motivation for this work is to improve codegen for the option 
`--param vect-partial-vector-usage=1` for SVE. We found that one of the 
main problems for the codegen here was coming from unnecessary 
conversions caused by the way we update the data_references in the epilogue.


This patch passes regression tests on aarch64-linux-gnu, but the codegen 
is still not optimal in some cases, specifically those where we have a 
scalar epilogue: that epilogue does not use the data_references and 
relies on the gimple scalar code, thus constructing a memory access from 
the niters again.  This is a limitation for which I haven't quite worked 
out a solution yet, and it causes some minor regressions due to 
unfortunate spills.


Let me know what you think and if you have ideas of how we can better 
achieve this.


Kind regards,
Andre Vieira

diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index c1d6e02194b251f7c940784c291d58c754f07454..ebb71948abe4ca27d495a2707254beb27e385a0d 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -1928,6 +1928,15 @@ vect_gen_prolog_loop_niters (loop_vec_info loop_vinfo,
   return iters_name;
 }
 
+static bool
+maybe_not_zero (tree t)
+{
+  if (!t)
+    return false;
+  if (TREE_CODE (t) != INTEGER_CST)
+    return true;
+  return !tree_int_cst_equal (t, build_zero_cst (TREE_TYPE (t)));
+}
 
 /* Function vect_update_init_of_dr
 
@@ -1954,6 +1963,76 @@ vect_update_init_of_dr (dr_vec_info *dr_info, tree niters, tree_code code)
   dr_info->offset = offset;
 }
 
+static void
+vect_update_base_of_dr (struct data_reference *dr,
+                        loop_vec_info epilogue_vinfo, iv_update *iv_update)
+{
+  tree new_base_addr = iv_update->new_base_addr;
+  edge skip_e = iv_update->skip_edge;
+  if (skip_e)
+    {
+      /* If we have SKIP_E we need to use the phi-node that joins the IV
+         coming from the main loop and the initial IV.  */
+      gimple_seq stmts;
+      tree base_addr = DR_BASE_ADDRESS (dr);
+      tree type = TREE_TYPE (base_addr);
+      gphi *new_phi;
+
+      edge e = EDGE_PRED (skip_e->dest, 0);
+      e = e != skip_e ? e : EDGE_PRED (skip_e->dest, 1);
+
+      base_addr = force_gimple_operand (base_addr, &stmts, true, NULL_TREE);
+      gimple_stmt_iterator gsi = gsi_last_bb (skip_e->src);
+      if (is_gimple_assign (gsi_stmt (gsi))
+          || is_gimple_call (gsi_stmt (gsi)))
+        gsi_insert_seq_after (&gsi, stmts, GSI_NEW_STMT);
+      else
+        gsi_insert_seq_before (&gsi, stmts, GSI_NEW_STMT);
+
+      /* Make sure NEW_BASE_ADDR and the initial base address use the same
+         type.  Not sure why I chose to use DR_BASE_ADDR's type here, probably
+         makes more sense to use the NEW_BASE_ADDR's type.  */
+      stmts = NULL;
+      new_base_addr = fold_convert (type, new_base_addr);
+      new_base_addr = force_gimple_operand (new_base_addr, &stmts, true,
+                                            NULL_TREE);
+      gsi = gsi_last_bb (e->src);
+      if (is_gimple_assign (gsi_stmt (gsi))
+          || is_gimple_call (gsi_stmt (gsi)))
+        gsi_insert_seq_after (&gsi, stmts, GSI_NEW_STMT);
+      else
+        gsi_insert_seq_before (&gsi, stmts, GSI_NEW_STMT);
+
+      new_phi = create_phi_node (make_ssa_name (type), skip_e->dest);
+      add_phi_arg (new_phi, new_base_addr, e, UNKNOWN_LOCATION);
+      add_phi_arg (new_phi, base_addr, skip_e, UNKNOWN_LOCATION);
+
+      new_base_addr = gimple_phi_result (new_phi);
+    }
+  else
+    {
+      gimple_seq stmts;
+      class loop *loop = LOOP_VINFO_LOOP (epilogue_vinfo);
+      tree type = TREE_TYPE (DR_BASE_ADDRESS (dr));
+      new_base_addr = fold_convert (type, new_base_addr);
+      new_base_addr = force_gimple_operand (new_base_addr, &stmts, true,
+                                            NULL_TREE);
+      gimple_stmt_iterator gsi
+        = gsi_last_bb (loop_preheader_edge (loop)->src);
+      if (!gsi_stmt (gsi)
+          || is_gimple_assign (gsi_stmt (gsi))
+          || is_gimple_call (gsi_stmt (gsi)))
+        gsi_insert_seq_after (&gsi, stmts, GSI_NEW_STMT);
+      else
+        gsi_insert_seq_befo

Re: [RFC] Using main loop's updated IV as base_address for epilogue vectorization

2021-05-17 Thread Andre Vieira (lists) via Gcc-patches

Hi,

So this is my second attempt at finding a way to improve how we generate 
the vector IVs and teach the vectorizer to share them between the main 
loop and epilogues. On IRC we discussed my idea to use the loop's 
control_iv, but that was a terrible idea and I quickly threw it in the 
bin. The main problem, which for some reason I failed to see, is that the 
control_iv increases by 's' while the datarefs increase by 's' * 
NELEMENTS, where 's' is usually 1 and NELEMENTS is the number of elements 
we handle per iteration.  That means the epilogue loops would have to 
start from the last loop's IV * the last loop's NELEMENTS, and that would 
just cause a mess.
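
To make the mismatch concrete, a small worked example (the numbers here 
are mine, purely for illustration):

/* Illustration: s = 1, NELEMENTS = 4, element size sz, n = 10.
   The main vector loop runs twice:
     control_iv:  0 -> 1 -> 2                          (step s)
     dataref IV:  base -> base + 4*sz -> base + 8*sz   (step s * NELEMENTS * sz)
   An epilogue starting from the control_iv would have to recompute
   base + control_iv * NELEMENTS * sz itself, and with stacked
   epilogues each level needs its predecessor's NELEMENTS.  */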


Instead I started to think about creating IVs for the datarefs, and what 
I thought worked best was to create these in scalar form before peeling. 
That way the peeling machinery takes care of duplicating them for the 
vector and scalar epilogues, and it also takes care of adding phi-nodes 
for the skip_vector paths.
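
To illustrate what creating the IVs in scalar before peeling means, here 
is a sketch (mine, not the patch's actual output; T stands for the 
element type):

/* The scalar loop before the transformation ...  */
for (i = 0; i < n; i++)
  a[i] = b[i] + 1;

/* ... is rewritten so that each dataref carries its own pointer IV;
   peeling then duplicates these IVs (and adds the skip_vector PHIs)
   along with the rest of the loop body.  */
T *a_iv = a, *b_iv = b;
for (i = 0; i < n; i++)
  {
    *a_iv = *b_iv + 1;
    a_iv += 1;
    b_iv += 1;
  }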

These new IVs have two functions:
1) 'vect_create_data_ref_ptr' can use them to:
 a) if it's the main loop: replace the 'initial' value of the main 
loop's IV and the initial values in the skip_vector phi-nodes;
 b) update the skip_vector phi-nodes' argument for the non-skip path 
with the updated vector ptr.


2) They are used for the scalar epilogue, ensuring it shares the same 
datareference ptr.


There are still a variety of 'hacky' elements here and a lot of testing 
to be done, but I hope to be able to clean them up. One of the main 
issues I had was that I had to skip a couple of checks for the added 
phi-nodes and update statements, as these do not have a stmt_vec_info 
representation.  Though I'm not sure adding this representation at their 
creation would have been much cleaner... It is something I could play 
around with, but I thought this was a good moment to ask you for input. 
For instance, maybe we could do this transformation before analysis?


Also be aware that because I create an IV for each dataref, this leads to 
regressions with SVE codegen, for instance: NEON is able to use the 
post-index addressing mode to increment each dr IV at access time, but 
SVE can't do this.  Here I don't know whether we could try to be smart 
and create shared IVs: rather than basing them on the actual vector ptr, 
use a shared sizetype IV that can be shared among dr IVs with the same 
step. Or maybe this is something for IVOPTs?


Let me know what ya think!

Kind regards,
Andre
diff --git a/gcc/tree-data-ref.h b/gcc/tree-data-ref.h
index 8001cc54f518d9d9d1a0fcfe5790d22dae109fb2..939c0a7fefd4355dd75d7646ac2ae63ce23a0e14 100644
--- a/gcc/tree-data-ref.h
+++ b/gcc/tree-data-ref.h
@@ -174,6 +174,8 @@ struct data_reference
 
   /* Alias information for the data reference.  */
   struct dr_alias alias;
+
+  hash_map *iv_bases;
 };
 
 #define DR_STMT(DR)        (DR)->stmt
diff --git a/gcc/tree-data-ref.c b/gcc/tree-data-ref.c
index 124a7bea6a94161556a6622fa7b113b3cef98bcf..f638bb3e0aa007e0bf7ad8f75fb767d3484b02ce 100644
--- a/gcc/tree-data-ref.c
+++ b/gcc/tree-data-ref.c
@@ -1475,6 +1475,7 @@ void
 free_data_ref (data_reference_p dr)
 {
   DR_ACCESS_FNS (dr).release ();
+  delete dr->iv_bases;
   free (dr);
 }
 
@@ -1506,6 +1507,7 @@ create_data_ref (edge nest, loop_p loop, tree memref, gimple *stmt,
   DR_REF (dr) = memref;
   DR_IS_READ (dr) = is_read;
   DR_IS_CONDITIONAL_IN_STMT (dr) = is_conditional_in_stmt;
+  dr->iv_bases = new hash_map ();
 
   dr_analyze_innermost (&DR_INNERMOST (dr), memref,
                        nest != NULL ? loop : NULL, stmt);
diff --git a/gcc/tree-ssa-loop-manip.h b/gcc/tree-ssa-loop-manip.h
index 86fc118b6befb06233e5e86a01454fd7075075e1..93e14d09763da5034ba97d09b07c94c20fe25a28 100644
--- a/gcc/tree-ssa-loop-manip.h
+++ b/gcc/tree-ssa-loop-manip.h
@@ -24,6 +24,8 @@ typedef void (*transform_callback)(class loop *, void *);
 
 extern void create_iv (tree, tree, tree, class loop *, gimple_stmt_iterator *,
                       bool, tree *, tree *);
+extern void create_or_update_iv (tree, tree, tree, class loop *, gimple_stmt_iterator *,
+                                bool, tree *, tree *, gphi *, bool);
 extern void rewrite_into_loop_closed_ssa_1 (bitmap, unsigned, int,
                                            class loop *);
 extern void rewrite_into_loop_closed_ssa (bitmap, unsigned);
diff --git a/gcc/tree-ssa-loop-manip.c b/gcc/tree-ssa-loop-manip.c
index 28ae1316fa0eb6939a45d15e893b7386622ba60c..1709e175c382ef5d74c2f628a61c9fffe26f726d 100644
--- a/gcc/tree-ssa-loop-manip.c
+++ b/gcc/tree-ssa-loop-manip.c
@@ -57,9 +57,10 @@ static bitmap_obstack loop_renamer_obstack;
    VAR_AFTER (unless they are NULL).  */
 
 void
-create_iv (tree base, tree step, tree var, class loop *loop,
-           gimple_stmt_iterator *incr_pos, bool after,
-           tree *var_before, tree *var_after)
+create_or_update_iv (tree base, tree step, tree var, class loop *loop,
+

Re: [RFC] Using main loop's updated IV as base_address for epilogue vectorization

2021-05-20 Thread Richard Biener
On Mon, 17 May 2021, Andre Vieira (lists) wrote:

> Hi,
> 
> So this is my second attempt at finding a way to improve how we generate the
> vector IV's and teach the vectorizer to share them between main loop and
> epilogues. On IRC we discussed my idea to use the loop's control_iv, but that
> was a terrible idea and I quickly threw it in the bin. The main problem, that
> for some reason I failed to see, was that the control_iv increases by 's' and
> the datarefs by 's' * NELEMENTS where 's' is usually 1 and NELEMENTs the
> amount of elements we handle per iteration. That means the epilogue loops
> would have to start from the last loop's IV * the last loop's NELEMENT's and
> that would just cause a mess.
> 
> Instead I started to think about creating IV's for the datarefs and what I
> thought worked best was to create these in scalar before peeling. That way the
> peeling mechanisms takes care of the duplication of these for the vector and
> scalar epilogues and it also takes care of adding phi-nodes for the
> skip_vector paths.

How does this work for if-converted loops where we use the 
non-if-converted scalar loop for (scalar) peeling but the
if-converted scalar loop for vectorized epilogues?  I suppose
you're only adjusting the if-converted copy.

> These new IV's have two functions:
> 1) 'vect_create_data_ref_ptr' can use them to:
>  a) if it's the main loop: replace the values of the 'initial' value of the
> main loop's IV and the initial values in the skip_vector phi-nodes
>  b) Update the the skip_vector phi-nodes argument for the non-skip path with
> the updated vector ptr.

b) means the prologue IV will not be dead there, so we actually need
to compute it?  I suppose IVOPTs could be taught to replace an
IV with its final value (based on some other IV) when it's unused?
Or does it already magically do good?

> 2) They are used for the scalar epilogue ensuring they share the same
> datareference ptr.
> 
> There are still a variety of 'hacky' elements here and a lot of testing to be
> done, but I hope to be able to clean them away. One of the main issues I had
> was that I had to skip a couple of checks and things for the added phi-nodes
> and update statements as these do not have stmt_vec_info representation. 
> Though I'm not sure adding this representation at their creation was much
> cleaner... It is something I could play around with but I thought this was a
> good moment to ask you for input. For instance, maybe we could do this
> transformation before analysis?
> 
> Also be aware that because I create a IV for each dataref this leads to
> regressions with SVE codegen for instance. NEON is able to use the post-index
> addressing mode to increase each dr IV at access time, but SVE can't do this. 
> For this I don't know if maybe we could try to be smart and create shared
> IV's. So rather than make them based on the actual vector ptr, use a shared
> sizetype IV that can be shared among dr IV's with the same step. Or maybe this
> is something for IVOPTs?

Certainly IVOPTs could decide to use the newly created IVs in the
scalar loops for the DRs therein as well.  But since IVOPTs only
considers a single loop at a time it will probably not pay too
much attention and is only influenced by the out-of-loop uses of
the final values of the IVs.

My gut feeling tells me that whatever we do we'll have to look
into improving IVOPTs to consider multiple loops.

> Let me know what ya think!

Now as to the patch itself.  We shouldn't amend struct data_reference,
try to use dr_vec_info instead.

--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -4941,11 +4941,14 @@ vect_create_data_ref_ptr (vec_info *vinfo, stmt_vec_info stmt_info,
 }
 
   standard_iv_increment_position (loop, &incr_gsi, &insert_after);
+  gphi *update_phi
+    = as_a (SSA_NAME_DEF_STMT (*dr->iv_bases->get (loop)));


and avoid this by having some explicit representation of IVs.

iv_bases is a map that exists for each DR and maps the loop to the
IV def it seems.  I'd have added the map to loop_vinfo, mapping
the (vect-)DR to the IV [def].

That probably makes the convenience of transforming the scalar
loop before peeling go away, but I wonder whether that's a good
idea anyway.

@@ -9484,6 +9480,51 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
   tree advance;
   drs_init_vec orig_drs_init;
 
+  if (LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) == NULL)
+    {
+      struct data_reference *dr;
+      unsigned int i;
+      gimple_seq stmts;
+      gimple_stmt_iterator gsi
+        = gsi_for_stmt (get_loop_exit_condition (loop));
+
+      FOR_EACH_VEC_ELT (LOOP_VINFO_DATAREFS (loop_vinfo), i, dr) {
+        tree &iv_base = dr->iv_bases->get_or_insert (loop);

there's a comment missing - does this try to replace the
IVs used by the scalar DRs in the non-if-converted loop by
the new artificial IVs?

I notice that even for

void foo (int * __restrict x, int *y, int n1, int n2)
{
  int i;
  f

Re: [RFC] Using main loop's updated IV as base_address for epilogue vectorization

2021-05-04 Thread Richard Biener
On Fri, 30 Apr 2021, Andre Vieira (lists) wrote:

> Hi,
> 
> The aim of this RFC is to explore a way of cleaning up the codegen around
> data_references.  To be specific, I'd like to reuse the main-loop's updated
> data_reference as the base_address for the epilogue's corresponding
> data_reference, rather than use the niters.  We have found this leads to
> better codegen in the vectorized epilogue loops.
> 
> The approach in this RFC creates a map if iv_updates which always contain an
> updated pointer that is caputed in vectorizable_{load,store}, an iv_update may
> also contain a skip_edge in case we decide the vectorization can be skipped in
> 'vect_do_peeling'. During the epilogue update this map of iv_updates is then
> checked to see if it contains an entry for a data_reference and it is used
> accordingly and if not it reverts back to the old behavior of using the niters
> to advance the data_reference.
> 
> The motivation for this work is to improve codegen for the option `--param
> vect-partial-vector-usage=1` for SVE. We found that one of the main problems
> for the codegen here was coming from unnecessary conversions caused by the way
> we update the data_references in the epilogue.
> 
> This patch passes regression tests in aarch64-linux-gnu, but the codegen is
> still not optimal in some cases. Specifically those where we have a scalar
> epilogue, as this does not use the data_reference's and will rely on the
> gimple scalar code, thus constructing again a memory access using the niters. 
> This is a limitation for which I haven't quite worked out a solution yet and
> does cause some minor regressions due to unfortunate spills.
> 
> Let me know what you think and if you have ideas of how we can better achieve
> this.

Hmm, so the patch adds a kludge to improve the kludge we have in place ;)

I think it might be interesting to create a C testcase mimicking the
update problem without involving the vectorizer.  That way we can
see how the various components involved behave (FRE + ivopts most
specifically).

That said, a cleaner approach to dealing with this would be to
explicitly track the IVs we generate for vectorized DRs, eventually
factoring that out from vectorizable_{store,load} so we can simply
carry over the actual pointer IV final value to the epilogue as the
initial value.  For each DR group we'd create a single IV (we can
even do better in case we have a load + store of the "same" group).

We already kind-of track things via the ivexpr_map, but I'm not sure
if this lazily populated map can be reliably re-used to "re-populate"
the epilogue one (walk the map, create epilogue IVs with the appropriate
initial value & adjusted update).

Richard.

> Kind regards,
> Andre Vieira
> 
> 
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)


Re: [RFC] Using main loop's updated IV as base_address for epilogue vectorization

2021-05-05 Thread Andre Vieira (lists) via Gcc-patches

Hi Richi,

So I'm trying to look at what IVOPTs does right now and how it might be 
able to help us. Looking at these two code examples:

#include 
#if 0
int foo(short * a, short * b, unsigned int n)
{
    int sum = 0;
    for (unsigned int i = 0; i < n; ++i)
    sum += a[i] + b[i];

    return sum;
}


#else

int bar (short * a, short *b, unsigned int n)
{
    int sum = 0;
    unsigned int i = 0;
    for (; i < (n / 16); i += 1)
    {
    // Accesses indices [0, 16, ..., ((n/16) - 1) * 16].
    // Example n = 127:
    // accesses [0, 16, 32, 48, 64, 80, 96, 112]
    sum += a[i*16] + b[i*16];
    }
    for (size_t j = (size_t) ((n / 16) * 16); j < n; ++j)
    {
    // Iterates over [(n/16) * 16, (n/16) * 16 + 1, ..., n - 1].
    // Example n = 127:
    // j starts at (127/16) * 16 = 7 * 16 = 112,
    // so iterates over [112, 113, 114, 115, ..., 127]
    sum += a[j] + b[j];
    }
    return sum;
}
#endif

I compiled the bottom one (bar, i.e. with '#if 0') for 
'aarch64-linux-gnu' with the options '-O3 -march=armv8-a 
-fno-tree-vectorize -fdump-tree-ivopts-all -fno-unroll-loops'. See the 
godbolt link here: https://godbolt.org/z/MEf6j6ebM


I tried to see what IVOPTs would make of this, and it is able to analyze 
the IVs, but it doesn't realize (I'm not even sure it tries) that one 
IV's end (loop 1) could be used as the base for the other (loop 2). I 
don't know if this is where you'd want such optimizations to be made; on 
one hand I think it would be great, as it would also help with 
non-vectorized loops, as you alluded to.
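
For reference, the reuse I am after corresponds to hand-writing bar along 
these lines (a sketch of mine, not compiler output):

int bar_reuse (short * a, short *b, unsigned int n)
{
    int sum = 0;
    short *pa = a, *pb = b;
    for (unsigned int i = 0; i < (n / 16); i += 1)
    {
        sum += *pa + *pb;
        pa += 16;   /* loop 1's pointer IVs end ...  */
        pb += 16;
    }
    for (size_t j = (size_t) ((n / 16) * 16); j < n; ++j)
    {
        sum += *pa + *pb;   /* ... exactly where loop 2 starts.  */
        pa++;
        pb++;
    }
    return sum;
}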


However, if you compile the top test case (foo, with '#if 1') and let 
the tree-vectorizer have a go, you will see different behaviours for 
different vectorization approaches. With '-O3 -march=armv8-a', using NEON 
and epilogue vectorization, it seems IVOPTs only picks up on one loop. 
If you use '-O3 -march=armv8-a+sve --param vect-partial-vector-usage=1' 
it will detect two loops. This may well be because epilogue vectorization 
in fact 'un-loops' the epilogue, since it knows it will only have to do 
one iteration of the vectorized epilogue. vect-partial-vector-usage=1 
could have done the same, but because we are dealing with polymorphic 
vector modes it fails to; I have a hack that avoids this for 
vect-partial-vector-usage, but I think we can probably do better and try 
to reason about boundaries in poly_ints rather than integers (TBC).


Anyway, I digress. Back to the main question of this patch: how do you 
suggest I go about this? Is there a way to make IVOPTs aware of the 
'iterate-once' IVs in the epilogue(s) (both vector and scalar!) and then 
teach it to merge IVs if one ends where the other begins?


On 04/05/2021 10:56, Richard Biener wrote:

On Fri, 30 Apr 2021, Andre Vieira (lists) wrote:


Hi,

The aim of this RFC is to explore a way of cleaning up the codegen around
data_references.  To be specific, I'd like to reuse the main-loop's updated
data_reference as the base_address for the epilogue's corresponding
data_reference, rather than use the niters.  We have found this leads to
better codegen in the vectorized epilogue loops.

The approach in this RFC creates a map if iv_updates which always contain an
updated pointer that is caputed in vectorizable_{load,store}, an iv_update may
also contain a skip_edge in case we decide the vectorization can be skipped in
'vect_do_peeling'. During the epilogue update this map of iv_updates is then
checked to see if it contains an entry for a data_reference and it is used
accordingly and if not it reverts back to the old behavior of using the niters
to advance the data_reference.

The motivation for this work is to improve codegen for the option `--param
vect-partial-vector-usage=1` for SVE. We found that one of the main problems
for the codegen here was coming from unnecessary conversions caused by the way
we update the data_references in the epilogue.

This patch passes regression tests in aarch64-linux-gnu, but the codegen is
still not optimal in some cases. Specifically those where we have a scalar
epilogue, as this does not use the data_reference's and will rely on the
gimple scalar code, thus constructing again a memory access using the niters.
This is a limitation for which I haven't quite worked out a solution yet and
does cause some minor regressions due to unfortunate spills.

Let me know what you think and if you have ideas of how we can better achieve
this.

Hmm, so the patch adds a kludge to improve the kludge we have in place ;)

I think it might be interesting to create a C testcase mimicing the
update problem without involving the vectorizer.  That way we can
see how the various components involved behave (FRE + ivopts most
specifically).

That said, a cleaner approach to dealing with this would be to
explicitely track the IVs we generate for vectorized DRs, eventually
factoring that out from vectorizable_{store,load} so we can simply
carry over the actual pointer IV final value to the epilogue as
initial va

Re: [RFC] Using main loop's updated IV as base_address for epilogue vectorization

2021-05-05 Thread Richard Biener
On Wed, 5 May 2021, Andre Vieira (lists) wrote:

> Hi Richi,
> 
> So I'm trying to look at what IVOPTs does right now and how it might be able
> to help us. Looking at these two code examples:
> #include 
> #if 0
> int foo(short * a, short * b, unsigned int n)
> {
>     int sum = 0;
>     for (unsigned int i = 0; i < n; ++i)
>     sum += a[i] + b[i];
> 
>     return sum;
> }
> 
> 
> #else
> 
> int bar (short * a, short *b, unsigned int n)
> {
>     int sum = 0;
>     unsigned int i = 0;
>     for (; i < (n / 16); i += 1)
>     {
>     // Iterates [0, 16, .., (n/16 * 16) * 16]
>     // Example n = 127,
>     // iterates [0, 16, 32, 48, 64, 80, 96, 112]
>     sum += a[i*16] + b[i*16];
>     }
>     for (size_t j =  (size_t) ((n / 16) * 16); j < n; ++j)
>     {
>     // Iterates [(n/16 * 16) * 16 , (((n/16 * 16) + 1) * 16)... ,n*16]
>     // Example n = 127,
>     // j starts at (127/16) * 16 = 7 * 16 = 112,
>     // So iterates over [112, 113, 114, 115, ..., 127]
>     sum += a[j] + b[j];
>     }
>     return sum;
> }
> #endif
> 
> Compiled the bottom one (#if 0) with 'aarch64-linux-gnu' with the following
> options '-O3 -march=armv8-a -fno-tree-vectorize -fdump-tree-ivopts-all
> -fno-unroll-loops'. See godbolt link here: https://godbolt.org/z/MEf6j6ebM
> 
> I tried to see what IVOPTs would make of this and it is able to analyze the
> IVs but it doesn't realize (not even sure it tries) that one IV's end (loop 1)
> could be used as the base for the other (loop 2). I don't know if this is
> where you'd want such optimizations to be made, on one side I think it would
> be great as it would also help with non-vectorized loops as you allured to.

Hmm, OK.  So there's the first loop that has a looparound jump and thus
we do not always enter the 2nd loop with the first loop's final value of 
the IV.  But yes, IVOPTs does not try to allocate IVs across multiple 
loops.  And for a followup transform to catch this it would need to 
compute the final value of the IV and then match this up with the 
initial value computation.  I suppose FRE could be taught to do this, at
least for very simple cases.

> However, if you compile the top test case (#if 1) and let the tree-vectorizer
> have a go you will see different behaviours for different vectorization
> approaches, so for:
> '-O3 -march=armv8-a', using NEON and epilogue vectorization it seems IVOPTs
> only picks up on one loop.

Yep, the "others" are likely fully peeled because they have just a single
iteration.  Again some kind-of final-value replacement "CSE" could
eventually help - but then we have jump-arounds here as well thus
we'd need final-value replacement "PRE".

> If you use '-O3 -march=armv8-a+sve --param vect-partial-vector-usage=1' it
> will detect two loops. This may well be because in fact epilogue vectorization
> 'un-loops' it because it knows it will only have to do one iteration of the
> vectorized epilogue. vect-partial-vector-usage=1 could have done the same, but
> because we are dealing with polymorphic vector modes it fails to, I have a
> hack that works for vect-partial-vector-usage to avoid it, but I think we can
> probably do better and try to reason about boundaries in poly_int's rather
> than integers (TBC).
> 
> Anyway I diverge. Back to the main question of this patch. How do you suggest
> I go about this? Is there a way to make IVOPTS aware of the 'iterate-once' IVs
> in the epilogue(s) (both vector and scalar!) and then teach it to merge IV's
> if one ends where the other begins?

I don't think we will make that work easily.  So indeed attacking this
in the vectorizer sounds most promising.  I'll note there's also
the issue of epilogue vectorization and reductions where we seem
to not re-use partially reduced reduction vectors but instead
reduce to a scalar in each step.  That's a related issue - we're
not able to carry forward a (reduction) IV we generated for the
main vector loop to the epilogue loops.  Like for

double foo (double *a, int n)
{
  double sum = 0.;
  for (int i = 0; i < n; ++i)
    sum += a[i];
  return sum;
}

with AVX512 we get three reductions to scalars instead of
a partial reduction from zmm to ymm before the first vectorized
epilogue, followed by a reduction from ymm to xmm before the second
(the jump-arounds for the epilogues need to jump to the further
reduction pieces, obviously).
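
In intrinsics terms, the shape I'd like would be roughly the following
(a sketch, not the vectorizer's actual output; the epilogue partial sums
are stand-ins for what the two vectorized epilogues would accumulate):

#include <immintrin.h>

/* Sketch: fold the accumulator down one vector width before each
   epilogue instead of reducing to a scalar after every loop.  */
static double
reduce_through_epilogues (__m512d main_acc, __m256d ymm_epi_sum,
                          __m128d xmm_epi_sum)
{
  /* zmm -> ymm before the first vectorized epilogue.  */
  __m256d acc256 = _mm256_add_pd (_mm512_castpd512_pd256 (main_acc),
                                  _mm512_extractf64x4_pd (main_acc, 1));
  acc256 = _mm256_add_pd (acc256, ymm_epi_sum);
  /* ymm -> xmm before the second vectorized epilogue.  */
  __m128d acc128 = _mm_add_pd (_mm256_castpd256_pd128 (acc256),
                               _mm256_extractf128_pd (acc256, 1));
  acc128 = _mm_add_pd (acc128, xmm_epi_sum);
  /* Single final reduction to a scalar.  */
  return _mm_cvtsd_f64 (_mm_add_pd (acc128,
                                    _mm_unpackhi_pd (acc128, acc128)));
}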

So I think we want to record the IVs we generate (the reduction IVs
are already nicely associated with the stmt-infos); one might
consider referring to them from the dr_vec_info, for example.

It's just going to be "interesting" to wire everything up
correctly with all the jump-arounds we have ...

> On 04/05/2021 10:56, Richard Biener wrote:
> > On Fri, 30 Apr 2021, Andre Vieira (lists) wrote:
> >
> >> Hi,
> >>
> >> The aim of this RFC is to explore a way of cleaning up the codegen around
> >> data_references.  To be specific, I'd like to reuse the main-loop's updated
> >> data_reference as the base_address for

Re: [RFC] Using main loop's updated IV as base_address for epilogue vectorization

2021-05-05 Thread Andre Vieira (lists) via Gcc-patches



On 05/05/2021 13:34, Richard Biener wrote:

On Wed, 5 May 2021, Andre Vieira (lists) wrote:


I tried to see what IVOPTs would make of this and it is able to analyze the
IVs but it doesn't realize (not even sure it tries) that one IV's end (loop 1)
could be used as the base for the other (loop 2). I don't know if this is
where you'd want such optimizations to be made, on one side I think it would
be great as it would also help with non-vectorized loops as you allured to.

Hmm, OK.  So there's the first loop that has a looparound jump and thus
we do not always enter the 2nd loop with the first loop final value of the
IV.  But yes, IVOPTs does not try to allocate IVs across multiple loops.
And for a followup transform to catch this it would need to compute
the final value of the IV and then match this up with the initial
value computation.  I suppose FRE could be teached to do this, at
least for very simple cases.
I will admit I am not at all familiar with how FRE works; I know it 
exists, as having it run often breaks my vector patches :P But that's 
about all I know.
I will have a look and see if it makes sense from my perspective to 
address it there, because ...



Anyway I diverge. Back to the main question of this patch. How do you suggest
I go about this? Is there a way to make IVOPTS aware of the 'iterate-once' IVs
in the epilogue(s) (both vector and scalar!) and then teach it to merge IV's
if one ends where the other begins?

I don't think we will make that work easily.  So indeed attacking this
in the vectorizer sounds most promising.


The problem I found with my approach is that it only tackles the 
vectorized epilogues, and that leads to regressions. I don't have the 
example at hand, but what I saw happening was that increased register 
pressure led to a spill in the hot path. I believe this was caused by the 
epilogue loop using the updated pointers as the base for its DRs (in this 
case there were three DRs: two loads and one store) while the scalar 
epilogue still used the original base + niters, since this 
data_reference approach only changes the vectorized epilogues.




  I'll note there's also
the issue of epilogue vectorization and reductions where we seem
to not re-use partially reduced reduction vectors but instead
reduce to a scalar in each step.  That's a related issue - we're
not able to carry forward a (reduction) IV we generated for the
main vector loop to the epilogue loops.  Like for

double foo (double *a, int n)
{
   double sum = 0.;
   for (int i = 0; i < n; ++i)
 sum += a[i];
   return sum;
}

with AVX512 we get three reductions to scalars instead of
a partial reduction from zmm to ymm before the first vectorized
epilogue followed by a reduction from ymm to xmm before the second
(the jump around for the epilogues need to jump to the further
reduction piece obviously).

So I think we want to record IVs we generate (the reduction IVs
are already nicely associated with the stmt-infos), one might
consider to refer to them from the dr_vec_info for example.

It's just going to be "interesting" to wire everything up
correctly with all the jump-arounds we have ...
I have a downstream hack for the reductions, but it only worked for 
partial-vector-usage, as there you have the guarantee it's the same 
vector mode, so you don't need to faff around with half and full 
vectors. Obviously what you are suggesting has much wider applications, 
and not surprisingly I think Richard Sandiford also pointed out to me 
that these are somewhat related and we might be able to reuse the 
IV-creation to manage it all. But I feel like I am currently light years 
away from that.


I had started to look at removing the data_reference updating we have 
now and dealing with this in the 'create_iv' calls from 
'vect_create_data_ref_ptr' inside 'vectorizable_{load,store}', but then I 
thought it would be good to discuss it with you first. This will require 
keeping track of the 'end-value' of the IV, which for loops where we can 
skip the previous loop means we will need to construct a phi-node 
joining the updated pointer and the initial base. But I'm not entirely 
sure where to keep track of all this. Also, I don't know if I can 
replace the base address of the data_reference right there at the 
'create_iv' call; can a data_reference be used multiple times in the 
same loop?


I'll go do a bit more nosing around this idea and the ivmap you 
mentioned before. Let me know if you have any ideas on how this all 
should look, even if it's 'in an ideal world'.


Andre



On 04/05/2021 10:56, Richard Biener wrote:

On Fri, 30 Apr 2021, Andre Vieira (lists) wrote:


Hi,

The aim of this RFC is to explore a way of cleaning up the codegen around
data_references.  To be specific, I'd like to reuse the main-loop's updated
data_reference as the base_address for the epilogue's corresponding
data_reference, rather than use the niters.  We have found this leads to
better codege

Re: [RFC] Using main loop's updated IV as base_address for epilogue vectorization

2021-05-07 Thread Richard Biener
On Wed, 5 May 2021, Andre Vieira (lists) wrote:

> 
> On 05/05/2021 13:34, Richard Biener wrote:
> > On Wed, 5 May 2021, Andre Vieira (lists) wrote:
> >
> >> I tried to see what IVOPTs would make of this and it is able to analyze the
> >> IVs but it doesn't realize (not even sure it tries) that one IV's end (loop
> >> 1)
> >> could be used as the base for the other (loop 2). I don't know if this is
> >> where you'd want such optimizations to be made, on one side I think it
> >> would
> >> be great as it would also help with non-vectorized loops as you allured to.
> > Hmm, OK.  So there's the first loop that has a looparound jump and thus
> > we do not always enter the 2nd loop with the first loop final value of the
> > IV.  But yes, IVOPTs does not try to allocate IVs across multiple loops.
> > And for a followup transform to catch this it would need to compute
> > the final value of the IV and then match this up with the initial
> > value computation.  I suppose FRE could be teached to do this, at
> > least for very simple cases.
> I will admit I am not at all familiar with how FRE works, I know it exists as
> the occlusion of running it often breaks my vector patches :P But that's about
> all I know.
> I will have a look and see if it makes sense from my perspective to address it
> there, because ...
> >
> >> Anyway I diverge. Back to the main question of this patch. How do you
> >> suggest
> >> I go about this? Is there a way to make IVOPTS aware of the 'iterate-once'
> >> IVs
> >> in the epilogue(s) (both vector and scalar!) and then teach it to merge
> >> IV's
> >> if one ends where the other begins?
> > I don't think we will make that work easily.  So indeed attacking this
> > in the vectorizer sounds most promising.
> 
> The problem with this that I found with my approach is that it only tackles
> the vectorized epilogues and that leads to regressions, I don't have the
> example at hand, but what I saw was happening was that increased register
> pressure lead to a spill in the hot path. I believe this was caused by the
> epilogue loop using the update pointers as the base for their DR's, in this
> case there were three DR's (2 loads one store), but the scalar epilogue still
> using the original base + niters, since this data_reference approach only
> changes the vectorized epilogues.

Yeah, this issue obviously extends to the scalar pro- and epilogue loops...

So ideally we'd produce IL (mainly for the IV setup code I guess)
that will be handled well by the following passes but then IVOPTs
is not multi-loop aware ...

That said, in the end we should be able to code-generate the scalar
loops as well (my plan is to add that, at least for the vector
loop, to be able to support partly vectorized loops with unvectorizable
stmts simply replicated as scalar ops).  In that case we can use
the same IVs again.

> >   I'll note there's also
> > the issue of epilogue vectorization and reductions where we seem
> > to not re-use partially reduced reduction vectors but instead
> > reduce to a scalar in each step.  That's a related issue - we're
> > not able to carry forward a (reduction) IV we generated for the
> > main vector loop to the epilogue loops.  Like for
> >
> > double foo (double *a, int n)
> > {
> >double sum = 0.;
> >for (int i = 0; i < n; ++i)
> >  sum += a[i];
> >return sum;
> > }
> >
> > with AVX512 we get three reductions to scalars instead of
> > a partial reduction from zmm to ymm before the first vectorized
> > epilogue followed by a reduction from ymm to xmm before the second
> > (the jump around for the epilogues need to jump to the further
> > reduction piece obviously).
> >
> > So I think we want to record IVs we generate (the reduction IVs
> > are already nicely associated with the stmt-infos), one might
> > consider to refer to them from the dr_vec_info for example.
> >
> > It's just going to be "interesting" to wire everything up
> > correctly with all the jump-arounds we have ...
> I have a downstream hack for the reductions, but it only worked for
> partial-vector-usage as there you have the guarantee it's the same
> vector-mode, so you don't need to pfaff around with half and full vectors.
> Obviously what you are suggesting has much wider applications and not
> surprisingly I think Richard Sandiford also pointed out to me that these are
> somewhat related and we might be able to reuse the IV-creation to manage it
> all. But I feel like I am currently light years away from that.
> 
> I had started to look at removing the data_reference updating we have now and
> dealing with this in the 'create_iv' calls from 'vect_create_data_ref_ptr'
> inside 'vectorizable_{load,store}' but then I thought it would be good to
> discuss it with you first. This will require keeping track of the 'end-value'
> of the IV, which for loops where we can skip the previous loop means we will
> need to construct a phi-node containing the updated pointer and the initial
> base. But I'm not entirel

Re: [RFC] Using main loop's updated IV as base_address for epilogue vectorization

2021-06-14 Thread Andre Vieira (lists) via Gcc-patches

Hi,


On 20/05/2021 11:22, Richard Biener wrote:

On Mon, 17 May 2021, Andre Vieira (lists) wrote:


Hi,

So this is my second attempt at finding a way to improve how we generate the
vector IV's and teach the vectorizer to share them between main loop and
epilogues. On IRC we discussed my idea to use the loop's control_iv, but that
was a terrible idea and I quickly threw it in the bin. The main problem, that
for some reason I failed to see, was that the control_iv increases by 's' and
the datarefs by 's' * NELEMENTS where 's' is usually 1 and NELEMENTs the
amount of elements we handle per iteration. That means the epilogue loops
would have to start from the last loop's IV * the last loop's NELEMENT's and
that would just cause a mess.

Instead I started to think about creating IV's for the datarefs and what I
thought worked best was to create these in scalar before peeling. That way the
peeling mechanisms takes care of the duplication of these for the vector and
scalar epilogues and it also takes care of adding phi-nodes for the
skip_vector paths.

How does this work for if-converted loops where we use the
non-if-converted scalar loop for (scalar) peeling but the
if-converted scalar loop for vectorized epilogues?  I suppose
you're only adjusting the if-converted copy.

True, hadn't thought about this :(



These new IV's have two functions:
1) 'vect_create_data_ref_ptr' can use them to:
  a) if it's the main loop: replace the values of the 'initial' value of the
main loop's IV and the initial values in the skip_vector phi-nodes
  b) Update the the skip_vector phi-nodes argument for the non-skip path with
the updated vector ptr.

b) means the prologue IV will not be dead there so we actually need
to compute it?  I suppose IVOPTs could be teached to replace an
IV with its final value (based on some other IV) when it's unused?
Or does it already magically do good?

It does not and ...



2) They are used for the scalar epilogue ensuring they share the same
datareference ptr.

There are still a variety of 'hacky' elements here and a lot of testing to be
done, but I hope to be able to clean them away. One of the main issues I had
was that I had to skip a couple of checks and things for the added phi-nodes
and update statements as these do not have stmt_vec_info representation.
Though I'm not sure adding this representation at their creation was much
cleaner... It is something I could play around with but I thought this was a
good moment to ask you for input. For instance, maybe we could do this
transformation before analysis?

Also be aware that because I create a IV for each dataref this leads to
regressions with SVE codegen for instance. NEON is able to use the post-index
addressing mode to increase each dr IV at access time, but SVE can't do this.
For this I don't know if maybe we could try to be smart and create shared
IV's. So rather than make them based on the actual vector ptr, use a shared
sizetype IV that can be shared among dr IV's with the same step. Or maybe this
is something for IVOPTs?

Certainly IVOPTs could decide to use the newly created IVs in the
scalar loops for the DRs therein as well.  But since IVOPTs only
considers a single loop at a time it will probably not pay too
much attention and is only influenced by the out-of-loop uses of
the final values of the IVs.

My gut feeling tells me that whatever we do we'll have to look
into improving IVOPTs to consider multiple loops.


So I redid the IV-sharing and it's looking a lot simpler and neater; 
however, it only shares IVs between vectorized loops and not scalar pro- 
or epilogues. I am not certain IVOPTs will be able to deal with these, 
as it has no knowledge of the number of iterations of each different 
loop. Take for instance a prologue peeling-for-alignment loop and a 
first main vectorization loop: to be able to reuse the IVs from the 
prologue in the main vectorization loop, it would need to know that the 
initial start address + PEELING_NITERS == the base address for the main 
vectorization loop.
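
Spelled out, the fact that would have to be proven is (sketch):

/* Relation across the two loops (sz = element size):
     prologue IV:   p = start,                       step sz,
                    runs PEELING_NITERS iterations
     main loop IV:  q = start + PEELING_NITERS * sz, step VF * sz
   i.e. q's initial value is exactly p's final value -- but
   PEELING_NITERS is only known symbolically, and IVOPTs analyzes a
   single loop at a time.  */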


I'll start testing this approach for correctness if there are no major 
concerns. Though I suspect we will only want to turn this into a patch 
once we have the IVOPTs work done to a point where it at least doesn't 
regress codegen because of shared IVs, and eventually we can look at how 
to solve the sharing between vectorized and scalar loops.


A small nitpick on my own RFC: I will probably move the 'skip_e' to 
outside of the map, as we only need one per loop_vinfo and not one per 
DR. Initially I didn't even have this skip_e in, but was instead 
creating a dummy PHI node and then replacing it with the real thing 
later; that made the code simpler, especially when inserting the 'init' 
stmt_list.


Kind regards,
Andre
diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
index b317df532a9a92a619de9572378437d09c632ab0..e7d0f1e657b1a0c9bec75799817242e0bc1d8282 100644
--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-ve

Re: [RFC] Using main loop's updated IV as base_address for epilogue vectorization

2021-06-14 Thread Richard Biener
On Mon, 14 Jun 2021, Andre Vieira (lists) wrote:

> Hi,
> 
> 
> On 20/05/2021 11:22, Richard Biener wrote:
> > On Mon, 17 May 2021, Andre Vieira (lists) wrote:
> >
> >> Hi,
> >>
> >> So this is my second attempt at finding a way to improve how we generate
> >> the
> >> vector IV's and teach the vectorizer to share them between main loop and
> >> epilogues. On IRC we discussed my idea to use the loop's control_iv, but
> >> that
> >> was a terrible idea and I quickly threw it in the bin. The main problem,
> >> that
> >> for some reason I failed to see, was that the control_iv increases by 's'
> >> and
> >> the datarefs by 's' * NELEMENTS where 's' is usually 1 and NELEMENTs the
> >> amount of elements we handle per iteration. That means the epilogue loops
> >> would have to start from the last loop's IV * the last loop's NELEMENT's
> >> and
> >> that would just cause a mess.
> >>
> >> Instead I started to think about creating IV's for the datarefs and what I
> >> thought worked best was to create these in scalar before peeling. That way
> >> the
> >> peeling mechanisms takes care of the duplication of these for the vector
> >> and
> >> scalar epilogues and it also takes care of adding phi-nodes for the
> >> skip_vector paths.
> > How does this work for if-converted loops where we use the
> > non-if-converted scalar loop for (scalar) peeling but the
> > if-converted scalar loop for vectorized epilogues?  I suppose
> > you're only adjusting the if-converted copy.
> True hadn't thought about this :(
> >
> >> These new IV's have two functions:
> >> 1) 'vect_create_data_ref_ptr' can use them to:
> >>  a) if it's the main loop: replace the values of the 'initial' value of the
> >> main loop's IV and the initial values in the skip_vector phi-nodes
> >>  b) Update the the skip_vector phi-nodes argument for the non-skip path
> >> with
> >> the updated vector ptr.
> > b) means the prologue IV will not be dead there so we actually need
> > to compute it?  I suppose IVOPTs could be teached to replace an
> > IV with its final value (based on some other IV) when it's unused?
> > Or does it already magically do good?
> It does not and ...
> >
> >> 2) They are used for the scalar epilogue ensuring they share the same
> >> datareference ptr.
> >>
> >> There are still a variety of 'hacky' elements here and a lot of testing to
> >> be
> >> done, but I hope to be able to clean them away. One of the main issues I
> >> had
> >> was that I had to skip a couple of checks and things for the added
> >> phi-nodes
> >> and update statements as these do not have stmt_vec_info representation.
> >> Though I'm not sure adding this representation at their creation was much
> >> cleaner... It is something I could play around with but I thought this was
> >> a
> >> good moment to ask you for input. For instance, maybe we could do this
> >> transformation before analysis?
> >>
> >> Also be aware that because I create a IV for each dataref this leads to
> >> regressions with SVE codegen for instance. NEON is able to use the
> >> post-index
> >> addressing mode to increase each dr IV at access time, but SVE can't do
> >> this.
> >> For this I don't know if maybe we could try to be smart and create shared
> >> IV's. So rather than make them based on the actual vector ptr, use a shared
> >> sizetype IV that can be shared among dr IV's with the same step. Or maybe
> >> this
> >> is something for IVOPTs?
> > Certainly IVOPTs could decide to use the newly created IVs in the
> > scalar loops for the DRs therein as well.  But since IVOPTs only
> > considers a single loop at a time it will probably not pay too
> > much attention and is only influenced by the out-of-loop uses of
> > the final values of the IVs.
> >
> > My gut feeling tells me that whatever we do we'll have to look
> > into improving IVOPTs to consider multiple loops.
> 
> So I redid the IV-sharing and it's looking a lot simpler and neater, however
> it only shares IVs between vectorized loops and not scalar pro- or epilogues.
> I am not certain IVOPTs will be able to deal with these, as it has no
> knowledge of the number of iterations of each different loop. So take for
> instance a prologue peeling for alignment loop and a first main vectorization
> loop. To be able to reuse the IV's from the prologue in the main vectorization
> loop it would need to know that the initial start adress + PEELING_NITERS ==
> base address for main vectorization loop.

Yes, indeed.  I think what we may want to do is to make this CSE/PRE
"easy" when IVOPTs manages to choose compatible IVs.  Since we generally
have bypass jumps around the various loops this boils down to PRE
since the redundancies only appear on the loop-to-loop fallthru edges
which means we won't see those opportunities on GIMPLE (unless we
realize the PRE opportunities in the actual vectorizer generated code).
But eventually RTL PRE does all the magic.

Iff we'd manage to present IVOPTs with out-of-loop IV uses that are
actual initial valu

Re: [RFC] Using main loop's updated IV as base_address for epilogue vectorization

2021-06-14 Thread Richard Biener
On Mon, 14 Jun 2021, Richard Biener wrote:

> On Mon, 14 Jun 2021, Andre Vieira (lists) wrote:
> 
> > Hi,
> > 
> > 
> > On 20/05/2021 11:22, Richard Biener wrote:
> > > On Mon, 17 May 2021, Andre Vieira (lists) wrote:
> > >
> > >> Hi,
> > >>
> > >> So this is my second attempt at finding a way to improve how we generate
> > >> the
> > >> vector IV's and teach the vectorizer to share them between main loop and
> > >> epilogues. On IRC we discussed my idea to use the loop's control_iv, but
> > >> that
> > >> was a terrible idea and I quickly threw it in the bin. The main problem,
> > >> that
> > >> for some reason I failed to see, was that the control_iv increases by 's'
> > >> and
> > >> the datarefs by 's' * NELEMENTS where 's' is usually 1 and NELEMENTs the
> > >> amount of elements we handle per iteration. That means the epilogue loops
> > >> would have to start from the last loop's IV * the last loop's NELEMENT's
> > >> and
> > >> that would just cause a mess.
> > >>
> > >> Instead I started to think about creating IV's for the datarefs and what 
> > >> I
> > >> thought worked best was to create these in scalar before peeling. That 
> > >> way
> > >> the
> > >> peeling mechanisms takes care of the duplication of these for the vector
> > >> and
> > >> scalar epilogues and it also takes care of adding phi-nodes for the
> > >> skip_vector paths.
> > > How does this work for if-converted loops where we use the
> > > non-if-converted scalar loop for (scalar) peeling but the
> > > if-converted scalar loop for vectorized epilogues?  I suppose
> > > you're only adjusting the if-converted copy.
> > True hadn't thought about this :(
> > >
> > >> These new IV's have two functions:
> > >> 1) 'vect_create_data_ref_ptr' can use them to:
> > >>  a) if it's the main loop: replace the values of the 'initial' value of 
> > >> the
> > >> main loop's IV and the initial values in the skip_vector phi-nodes
> > >>  b) Update the the skip_vector phi-nodes argument for the non-skip path
> > >> with
> > >> the updated vector ptr.
> > > b) means the prologue IV will not be dead there so we actually need
> > > to compute it?  I suppose IVOPTs could be teached to replace an
> > > IV with its final value (based on some other IV) when it's unused?
> > > Or does it already magically do good?
> > It does not and ...
> > >
> > >> 2) They are used for the scalar epilogue ensuring they share the same
> > >> datareference ptr.
> > >>
> > >> There are still a variety of 'hacky' elements here and a lot of testing 
> > >> to
> > >> be
> > >> done, but I hope to be able to clean them away. One of the main issues I
> > >> had
> > >> was that I had to skip a couple of checks and things for the added
> > >> phi-nodes
> > >> and update statements as these do not have stmt_vec_info representation.
> > >> Though I'm not sure adding this representation at their creation was much
> > >> cleaner... It is something I could play around with but I thought this 
> > >> was
> > >> a
> > >> good moment to ask you for input. For instance, maybe we could do this
> > >> transformation before analysis?
> > >>
> > >> Also be aware that because I create a IV for each dataref this leads to
> > >> regressions with SVE codegen for instance. NEON is able to use the
> > >> post-index
> > >> addressing mode to increase each dr IV at access time, but SVE can't do
> > >> this.
> > >> For this I don't know if maybe we could try to be smart and create shared
> > >> IV's. So rather than make them based on the actual vector ptr, use a 
> > >> shared
> > >> sizetype IV that can be shared among dr IV's with the same step. Or maybe
> > >> this
> > >> is something for IVOPTs?
> > > Certainly IVOPTs could decide to use the newly created IVs in the
> > > scalar loops for the DRs therein as well.  But since IVOPTs only
> > > considers a single loop at a time it will probably not pay too
> > > much attention and is only influenced by the out-of-loop uses of
> > > the final values of the IVs.
> > >
> > > My gut feeling tells me that whatever we do we'll have to look
> > > into improving IVOPTs to consider multiple loops.
> > 
> > So I redid the IV-sharing and it's looking a lot simpler and neater, however
> > it only shares IVs between vectorized loops and not scalar pro- or 
> > epilogues.
> > I am not certain IVOPTs will be able to deal with these, as it has no
> > knowledge of the number of iterations of each different loop. So take for
> > instance a prologue peeling for alignment loop and a first main 
> > vectorization
> > loop. To be able to reuse the IV's from the prologue in the main 
> > vectorization
> > loop it would need to know that the initial start adress + PEELING_NITERS ==
> > base address for main vectorization loop.
> 
> Yes, indeed.  I think what we may want to do is to make this CSE/PRE
> "easy" when IVOPTs manages to choose compatible IVs.  Since we generally
> have bypass jumps around the various loops this boils down to PRE
> since the redundancies only appear

Re: [RFC] Using main loop's updated IV as base_address for epilogue vectorization

2021-06-16 Thread Andre Vieira (lists) via Gcc-patches



On 14/06/2021 11:57, Richard Biener wrote:

On Mon, 14 Jun 2021, Richard Biener wrote:


Indeed. For example a simple
int a[1024], b[1024], c[1024];

void foo(int n)
{
  for (int i = 0; i < n; ++i)
    a[i+1] += c[i+i] ? b[i+1] : 0;
}

should usually see peeling for alignment (though on x86 you need
exotic -march= since cost models generally have equal aligned and
unaligned access costs).  For example with -mavx2 -mtune=atom
we'll see an alignment peeling prologue, a AVX2 vector loop,
a SSE2 vectorized epilogue and a scalar epilogue.  It also
shows the original scalar loop being used in the scalar prologue
and epilogue.

We're not even trying to make the counting IV easily used
across loops (we're not counting scalar iterations in the
vector loops).

Specifically we see

 [local count: 94607391]:
niters_vector_mult_vf.10_62 = bnd.9_61 << 3;
_67 = niters_vector_mult_vf.10_62 + 7;
_64 = (int) niters_vector_mult_vf.10_62;
tmp.11_63 = i_43 + _64;
if (niters.8_45 == niters_vector_mult_vf.10_62)
   goto ; [12.50%]
else
   goto ; [87.50%]

after the main vect loop, recomputing the original IV (i) rather
than using the inserted canonical IV.  And then the vectorized
epilogue header check doing

 [local count: 93293400]:
# i_59 = PHI 
# _66 = PHI <_67(33), 0(18)>
_96 = (unsigned int) n_10(D);
niters.26_95 = _96 - _66;
_108 = (unsigned int) n_10(D);
_109 = _108 - _66;
_110 = _109 + 4294967295;
if (_110 <= 3)
   goto ; [10.00%]
else
   goto ; [90.00%]

re-computing everything from scratch again (also notice how
the main vect loop guard jumps around the alignment prologue
as well and lands here - and the vectorized epilogue using
unaligned accesses - good!).

That is, I'd expect _much_ easier jobs if we'd manage to
track the number of performed scalar iterations (or the
number of scalar iterations remaining) using the canonical
IV we add to all loops across all of the involved loops.

Richard.



So I am now looking at using an IV that counts scalar iterations rather 
than vector iterations and reusing it through all loops (prologue, main 
loop, vect_epilogue and scalar epilogue). The first part is easy, since 
that's what we already do for partial vectors or non-constant VFs. The 
latter requires some plumbing and removing a lot of the code in there 
that creates new IVs going from [0, niters - previous iterations]. I 
don't yet have a clear-cut view of how to do this; I first thought of 
keeping track of the 'control' IV in the loop_vinfo, but the prologue 
and scalar epilogues won't have one. 'loop' keeps a control_ivs struct, 
but that is used for overflow detection and only keeps track of what 
looks like a constant 'base' and 'step'. Not quite sure how all that 
works, but intuitively it doesn't seem like the right thing to reuse.


I'll go hack around and keep you posted on progress.

Regards,
Andre



Re: [RFC] Using main loop's updated IV as base_address for epilogue vectorization

2021-06-16 Thread Richard Biener
On Wed, 16 Jun 2021, Andre Vieira (lists) wrote:

> 
> On 14/06/2021 11:57, Richard Biener wrote:
> > On Mon, 14 Jun 2021, Richard Biener wrote:
> >
> >> Indeed. For example a simple
> >> int a[1024], b[1024], c[1024];
> >>
> >> void foo(int n)
> >> {
> >>for (int i = 0; i < n; ++i)
> >>  a[i+1] += c[i+i] ? b[i+1] : 0;
> >> }
> >>
> >> should usually see peeling for alignment (though on x86 you need
> >> exotic -march= since cost models generally have equal aligned and
> >> unaligned access costs).  For example with -mavx2 -mtune=atom
> >> we'll see an alignment peeling prologue, a AVX2 vector loop,
> >> a SSE2 vectorized epilogue and a scalar epilogue.  It also
> >> shows the original scalar loop being used in the scalar prologue
> >> and epilogue.
> >>
> >> We're not even trying to make the counting IV easily used
> >> across loops (we're not counting scalar iterations in the
> >> vector loops).
> > Specifically we see
> >
> >  [local count: 94607391]:
> > niters_vector_mult_vf.10_62 = bnd.9_61 << 3;
> > _67 = niters_vector_mult_vf.10_62 + 7;
> > _64 = (int) niters_vector_mult_vf.10_62;
> > tmp.11_63 = i_43 + _64;
> > if (niters.8_45 == niters_vector_mult_vf.10_62)
> >goto ; [12.50%]
> > else
> >goto ; [87.50%]
> >
> > after the maini vect loop, recomputing the original IV (i) rather
> > than using the inserted canonical IV.  And then the vectorized
> > epilogue header check doing
> >
> >  [local count: 93293400]:
> > # i_59 = PHI 
> > # _66 = PHI <_67(33), 0(18)>
> > _96 = (unsigned int) n_10(D);
> > niters.26_95 = _96 - _66;
> > _108 = (unsigned int) n_10(D);
> > _109 = _108 - _66;
> > _110 = _109 + 4294967295;
> > if (_110 <= 3)
> >goto ; [10.00%]
> > else
> >goto ; [90.00%]
> >
> > re-computing everything from scratch again (also notice how
> > the main vect loop guard jumps around the alignment prologue
> > as well and lands here - and the vectorized epilogue using
> > unaligned accesses - good!).
> >
> > That is, I'd expect _much_ easier jobs if we'd manage to
> > track the number of performed scalar iterations (or the
> > number of scalar iterations remaining) using the canonical
> > IV we add to all loops across all of the involved loops.
> >
> > Richard.
> 
> 
> So I am now looking at using an IV that counts scalar iterations rather than
> vector iterations and reusing that through all loops, (prologue, main loop,
> vect_epilogue and scalar epilogue). The first is easy, since that's what we
> already do for partial vectors or non-constant VFs. The latter requires some
> plumbing and removing a lot of the code in there that creates new IV's going
> from [0, niters - previous iterations]. I don't yet have a clear cut view of
> how to do this, I first thought of keeping track of the 'control' IV in the
> loop_vinfo, but the prologue and scalar epilogues won't have one. 'loop' keeps
> a control_ivs struct, but that is used for overflow detection and only keeps
> track of what looks like a constant 'base' and 'step'. Not quite sure how all
> that works, but intuitively doesn't seem like the right thing to reuse.

Maybe it's enough to maintain this [remaining] scalar iterations counter
between loops, thus after the vector loop do

  remain_scalar_iter -= vector_iters * vf;

etc., this should make it possible to do some first order cleanups,
avoiding some repeated computations.  It does involve placing
additional PHIs for this remain_scalar_iter var of course (I'd be
hesitant to rely on the SSA renamer for this due to its expense).

I think that for all later jump-around tests tracking remaining
scalar iters is more convenient than tracking performed scalar iters.
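
For illustration, maintaining the counter across the loops would amount
to something like this (a sketch; remain_scalar_iter, vf and the loop
bodies are placeholders, and vf is assumed >= 2):

void
sketch (unsigned niters, unsigned vf)
{
  unsigned remain_scalar_iter = niters;

  /* Main vector loop: each vector iteration consumes VF scalar ones.  */
  while (remain_scalar_iter >= vf)
    {
      /* ... vector body ... */
      remain_scalar_iter -= vf;
    }

  /* Vectorized epilogue with a smaller VF; its jump-around test is a
     plain comparison against the shared counter.  */
  unsigned vf2 = vf / 2;
  while (remain_scalar_iter >= vf2)
    {
      /* ... epilogue vector body ... */
      remain_scalar_iter -= vf2;
    }

  /* Scalar epilogue consumes whatever remains.  */
  while (remain_scalar_iter > 0)
    {
      /* ... scalar body ... */
      remain_scalar_iter--;
    }
}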

> I'll go hack around and keep you posted on progress.

Thanks - it's an iffy area ...
Richard.