[RFC] Using main loop's updated IV as base_address for epilogue vectorization
Hi,

The aim of this RFC is to explore a way of cleaning up the codegen around data_references. To be specific, I'd like to reuse the main loop's updated data_reference as the base_address for the epilogue's corresponding data_reference, rather than use the niters. We have found this leads to better codegen in the vectorized epilogue loops.

The approach in this RFC creates a map of iv_updates, each of which always contains an updated pointer that is captured in vectorizable_{load,store}; an iv_update may also contain a skip_edge in case we decide the vectorization can be skipped in 'vect_do_peeling'. During the epilogue update this map of iv_updates is checked for an entry for a data_reference; if one exists it is used, and if not we revert to the old behaviour of using the niters to advance the data_reference.

The motivation for this work is to improve codegen for the option `--param vect-partial-vector-usage=1` for SVE. We found that one of the main problems for the codegen here came from unnecessary conversions caused by the way we update the data_references in the epilogue.

This patch passes regression tests on aarch64-linux-gnu, but the codegen is still not optimal in some cases, specifically those where we have a scalar epilogue, as that does not use the data_references and instead relies on the gimple scalar code, thus constructing a memory access using the niters again. This is a limitation for which I haven't quite worked out a solution yet, and it does cause some minor regressions due to unfortunate spills.

Let me know what you think and whether you have ideas for how we can better achieve this.
Kind regards,
Andre Vieira

diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index c1d6e02194b251f7c940784c291d58c754f07454..ebb71948abe4ca27d495a2707254beb27e385a0d 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -1928,6 +1928,15 @@ vect_gen_prolog_loop_niters (loop_vec_info loop_vinfo,
   return iters_name;
 }
 
+static bool
+maybe_not_zero (tree t)
+{
+  if (!t)
+    return false;
+  if (TREE_CODE (t) != INTEGER_CST)
+    return true;
+  return !tree_int_cst_equal (t, build_zero_cst (TREE_TYPE (t)));
+}
 
 /* Function vect_update_init_of_dr
 
@@ -1954,6 +1963,76 @@ vect_update_init_of_dr (dr_vec_info *dr_info, tree niters, tree_code code)
   dr_info->offset = offset;
 }
 
+static void
+vect_update_base_of_dr (struct data_reference * dr,
+			loop_vec_info epilogue_vinfo, iv_update *iv_update)
+{
+  tree new_base_addr = iv_update->new_base_addr;
+  edge skip_e = iv_update->skip_edge;
+  if (skip_e)
+    {
+      /* If we have SKIP_E we need to use the phi-node that joins the IV coming
+	 from the main loop and the initial IV.  */
+      gimple_seq stmts;
+      tree base_addr = DR_BASE_ADDRESS (dr);
+      tree type = TREE_TYPE (base_addr);
+      gphi *new_phi;
+
+      edge e = EDGE_PRED (skip_e->dest, 0);
+      e = e != skip_e ? e : EDGE_PRED (skip_e->dest, 1);
+
+      base_addr = force_gimple_operand (base_addr, &stmts, true,
+					NULL_TREE);
+      gimple_stmt_iterator gsi = gsi_last_bb (skip_e->src);
+      if (is_gimple_assign (gsi_stmt (gsi))
+	  || is_gimple_call (gsi_stmt (gsi)))
+	gsi_insert_seq_after (&gsi, stmts, GSI_NEW_STMT);
+      else
+	gsi_insert_seq_before (&gsi, stmts, GSI_NEW_STMT);
+
+      /* Make sure NEW_BASE_ADDR and the initial base address use the same
+	 type.  Not sure why I chose to use DR_BASE_ADDR's type here, probably
+	 makes more sense to use the NEW_BASE_ADDR's type.  */
+      stmts = NULL;
+      new_base_addr = fold_convert (type, new_base_addr);
+      new_base_addr = force_gimple_operand (new_base_addr, &stmts, true, NULL_TREE);
+      gsi = gsi_last_bb (e->src);
+      if (is_gimple_assign (gsi_stmt (gsi))
+	  || is_gimple_call (gsi_stmt (gsi)))
+	gsi_insert_seq_after (&gsi, stmts, GSI_NEW_STMT);
+      else
+	gsi_insert_seq_before (&gsi, stmts, GSI_NEW_STMT);
+
+      new_phi = create_phi_node (make_ssa_name (type), skip_e->dest);
+      add_phi_arg (new_phi, new_base_addr, e, UNKNOWN_LOCATION);
+      add_phi_arg (new_phi, base_addr, skip_e, UNKNOWN_LOCATION);
+
+      new_base_addr = gimple_phi_result (new_phi);
+    }
+  else
+    {
+      gimple_seq stmts;
+      class loop *loop = LOOP_VINFO_LOOP (epilogue_vinfo);
+      tree type = TREE_TYPE (DR_BASE_ADDRESS (dr));
+      new_base_addr = fold_convert (type, new_base_addr);
+      new_base_addr = force_gimple_operand (new_base_addr, &stmts, true,
+					    NULL_TREE);
+      gimple_stmt_iterator gsi
+	= gsi_last_bb (loop_preheader_edge (loop)->src);
+      if (!gsi_stmt (gsi)
+	  || is_gimple_assign (gsi_stmt (gsi))
+	  || is_gimple_call (gsi_stmt (gsi)))
+	gsi_insert_seq_after (&gsi, stmts, GSI_NEW_STMT);
+      else
+	gsi_insert_seq_befo
Re: [RFC] Using main loop's updated IV as base_address for epilogue vectorization
Hi,

So this is my second attempt at finding a way to improve how we generate the vector IVs and teach the vectorizer to share them between the main loop and epilogues. On IRC we discussed my idea to use the loop's control_iv, but that was a terrible idea and I quickly threw it in the bin. The main problem, which for some reason I failed to see, was that the control_iv increases by 's' while the datarefs increase by 's' * NELEMENTS, where 's' is usually 1 and NELEMENTS is the number of elements we handle per iteration. That means the epilogue loops would have to start from the last loop's IV * the last loop's NELEMENTS, and that would just cause a mess.

Instead I started to think about creating IVs for the datarefs, and what I thought worked best was to create these in scalar form before peeling. That way the peeling mechanism takes care of duplicating them for the vector and scalar epilogues, and it also takes care of adding phi-nodes for the skip_vector paths.

These new IVs have two functions:
1) 'vect_create_data_ref_ptr' can use them to:
 a) if it's the main loop: replace the 'initial' value of the main loop's IV and the initial values in the skip_vector phi-nodes;
 b) update the skip_vector phi-nodes' argument for the non-skip path with the updated vector ptr.
2) They are used for the scalar epilogue, ensuring the loops share the same data reference ptr.

There are still a variety of 'hacky' elements here and a lot of testing to be done, but I hope to be able to clean them away. One of the main issues I had was that I had to skip a couple of checks for the added phi-nodes and update statements, as these do not have a stmt_vec_info representation. Though I'm not sure adding this representation at their creation would have been much cleaner... It is something I could play around with, but I thought this was a good moment to ask you for input. For instance, maybe we could do this transformation before analysis?
Also be aware that because I create an IV for each dataref this leads to regressions with SVE codegen, for instance. NEON is able to use the post-index addressing mode to increment each dr IV at access time, but SVE can't do this. For this I don't know if maybe we could try to be smart and create shared IVs: rather than basing them on the actual vector ptr, use a shared sizetype IV that can be shared among dr IVs with the same step. Or maybe this is something for IVOPTs?

Let me know what ya think!

Kind regards,
Andre

diff --git a/gcc/tree-data-ref.h b/gcc/tree-data-ref.h
index 8001cc54f518d9d9d1a0fcfe5790d22dae109fb2..939c0a7fefd4355dd75d7646ac2ae63ce23a0e14 100644
--- a/gcc/tree-data-ref.h
+++ b/gcc/tree-data-ref.h
@@ -174,6 +174,8 @@ struct data_reference
 
   /* Alias information for the data reference.  */
   struct dr_alias alias;
+
+  hash_map *iv_bases;
 };
 
 #define DR_STMT(DR)(DR)->stmt

diff --git a/gcc/tree-data-ref.c b/gcc/tree-data-ref.c
index 124a7bea6a94161556a6622fa7b113b3cef98bcf..f638bb3e0aa007e0bf7ad8f75fb767d3484b02ce 100644
--- a/gcc/tree-data-ref.c
+++ b/gcc/tree-data-ref.c
@@ -1475,6 +1475,7 @@ void
 free_data_ref (data_reference_p dr)
 {
   DR_ACCESS_FNS (dr).release ();
+  delete dr->iv_bases;
   free (dr);
 }
 
@@ -1506,6 +1507,7 @@ create_data_ref (edge nest, loop_p loop, tree memref, gimple *stmt,
   DR_REF (dr) = memref;
   DR_IS_READ (dr) = is_read;
   DR_IS_CONDITIONAL_IN_STMT (dr) = is_conditional_in_stmt;
+  dr->iv_bases = new hash_map ();
   dr_analyze_innermost (&DR_INNERMOST (dr), memref,
			nest != NULL ? loop : NULL, stmt);

diff --git a/gcc/tree-ssa-loop-manip.h b/gcc/tree-ssa-loop-manip.h
index 86fc118b6befb06233e5e86a01454fd7075075e1..93e14d09763da5034ba97d09b07c94c20fe25a28 100644
--- a/gcc/tree-ssa-loop-manip.h
+++ b/gcc/tree-ssa-loop-manip.h
@@ -24,6 +24,8 @@ typedef void (*transform_callback)(class loop *, void *);
 
 extern void create_iv (tree, tree, tree, class loop *, gimple_stmt_iterator *,
		       bool, tree *, tree *);
+extern void create_or_update_iv (tree, tree, tree, class loop *, gimple_stmt_iterator *,
+				 bool, tree *, tree *, gphi *, bool);
 extern void rewrite_into_loop_closed_ssa_1 (bitmap, unsigned, int,
					    class loop *);
 extern void rewrite_into_loop_closed_ssa (bitmap, unsigned);

diff --git a/gcc/tree-ssa-loop-manip.c b/gcc/tree-ssa-loop-manip.c
index 28ae1316fa0eb6939a45d15e893b7386622ba60c..1709e175c382ef5d74c2f628a61c9fffe26f726d 100644
--- a/gcc/tree-ssa-loop-manip.c
+++ b/gcc/tree-ssa-loop-manip.c
@@ -57,9 +57,10 @@ static bitmap_obstack loop_renamer_obstack;
    VAR_AFTER (unless they are NULL).  */
 
 void
-create_iv (tree base, tree step, tree var, class loop *loop,
-	   gimple_stmt_iterator *incr_pos, bool after,
-	   tree *var_before, tree *var_after)
+create_or_update_iv (tree base, tree step, tree var, class loop *loop,
+
Re: [RFC] Using main loop's updated IV as base_address for epilogue vectorization
On Mon, 17 May 2021, Andre Vieira (lists) wrote:

> Hi,
>
> So this is my second attempt at finding a way to improve how we generate the
> vector IV's and teach the vectorizer to share them between main loop and
> epilogues. On IRC we discussed my idea to use the loop's control_iv, but that
> was a terrible idea and I quickly threw it in the bin. The main problem, that
> for some reason I failed to see, was that the control_iv increases by 's' and
> the datarefs by 's' * NELEMENTS where 's' is usually 1 and NELEMENTs the
> amount of elements we handle per iteration. That means the epilogue loops
> would have to start from the last loop's IV * the last loop's NELEMENT's and
> that would just cause a mess.
>
> Instead I started to think about creating IV's for the datarefs and what I
> thought worked best was to create these in scalar before peeling. That way the
> peeling mechanisms takes care of the duplication of these for the vector and
> scalar epilogues and it also takes care of adding phi-nodes for the
> skip_vector paths.

How does this work for if-converted loops where we use the
non-if-converted scalar loop for (scalar) peeling but the if-converted
scalar loop for vectorized epilogues? I suppose you're only adjusting
the if-converted copy.

> These new IV's have two functions:
> 1) 'vect_create_data_ref_ptr' can use them to:
> a) if it's the main loop: replace the values of the 'initial' value of the
> main loop's IV and the initial values in the skip_vector phi-nodes
> b) Update the the skip_vector phi-nodes argument for the non-skip path with
> the updated vector ptr.

b) means the prologue IV will not be dead there so we actually need
to compute it? I suppose IVOPTs could be taught to replace an IV
with its final value (based on some other IV) when it's unused?
Or does it already magically do good?

> 2) They are used for the scalar epilogue ensuring they share the same
> datareference ptr.
>
> There are still a variety of 'hacky' elements here and a lot of testing to be
> done, but I hope to be able to clean them away. One of the main issues I had
> was that I had to skip a couple of checks and things for the added phi-nodes
> and update statements as these do not have stmt_vec_info representation.
> Though I'm not sure adding this representation at their creation was much
> cleaner... It is something I could play around with but I thought this was a
> good moment to ask you for input. For instance, maybe we could do this
> transformation before analysis?
>
> Also be aware that because I create a IV for each dataref this leads to
> regressions with SVE codegen for instance. NEON is able to use the post-index
> addressing mode to increase each dr IV at access time, but SVE can't do this.
> For this I don't know if maybe we could try to be smart and create shared
> IV's. So rather than make them based on the actual vector ptr, use a shared
> sizetype IV that can be shared among dr IV's with the same step. Or maybe this
> is something for IVOPTs?

Certainly IVOPTs could decide to use the newly created IVs in the
scalar loops for the DRs therein as well. But since IVOPTs only
considers a single loop at a time it will probably not pay too much
attention and is only influenced by the out-of-loop uses of the final
values of the IVs.

My gut feeling tells me that whatever we do we'll have to look into
improving IVOPTs to consider multiple loops.

> Let me know what ya think!

Now as to the patch itself. We shouldn't amend struct data_reference,
try to use dr_vec_info instead.

--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -4941,11 +4941,14 @@ vect_create_data_ref_ptr (vec_info *vinfo, stmt_vec_info stmt_info,
     }
 
   standard_iv_increment_position (loop, &incr_gsi, &insert_after);
+  gphi *update_phi
+    = as_a (SSA_NAME_DEF_STMT (*dr->iv_bases->get (loop)));

and avoid this by having some explicit representation of IVs.
iv_bases is a map that exists for each DR and maps the loop to the IV
def it seems. I'd have added the map to loop_vinfo, mapping the
(vect-)DR to the IV [def]. That probably makes the convenience of
transforming the scalar loop before peeling go away, but I wonder
whether that's a good idea anyway.

@@ -9484,6 +9480,51 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
   tree advance;
   drs_init_vec orig_drs_init;
 
+  if (LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo) == NULL)
+    {
+      struct data_reference *dr;
+      unsigned int i;
+      gimple_seq stmts;
+      gimple_stmt_iterator gsi = gsi_for_stmt (get_loop_exit_condition (loop));
+
+      FOR_EACH_VEC_ELT (LOOP_VINFO_DATAREFS (loop_vinfo), i, dr) {
+	tree &iv_base = dr->iv_bases->get_or_insert (loop);

there's a comment missing - does this try to replace the IVs used
by the scalar DRs in the non-if-converted loop by the new artificial
IVs?

I notice that even for

void foo (int * __restrict x, int *y, int n1, int n2)
{
  int i;
  f
Re: [RFC] Using main loop's updated IV as base_address for epilogue vectorization
On Fri, 30 Apr 2021, Andre Vieira (lists) wrote:

> Hi,
>
> The aim of this RFC is to explore a way of cleaning up the codegen around
> data_references. To be specific, I'd like to reuse the main-loop's updated
> data_reference as the base_address for the epilogue's corresponding
> data_reference, rather than use the niters. We have found this leads to
> better codegen in the vectorized epilogue loops.
>
> The approach in this RFC creates a map if iv_updates which always contain an
> updated pointer that is caputed in vectorizable_{load,store}, an iv_update may
> also contain a skip_edge in case we decide the vectorization can be skipped in
> 'vect_do_peeling'. During the epilogue update this map of iv_updates is then
> checked to see if it contains an entry for a data_reference and it is used
> accordingly and if not it reverts back to the old behavior of using the niters
> to advance the data_reference.
>
> The motivation for this work is to improve codegen for the option `--param
> vect-partial-vector-usage=1` for SVE. We found that one of the main problems
> for the codegen here was coming from unnecessary conversions caused by the way
> we update the data_references in the epilogue.
>
> This patch passes regression tests in aarch64-linux-gnu, but the codegen is
> still not optimal in some cases. Specifically those where we have a scalar
> epilogue, as this does not use the data_reference's and will rely on the
> gimple scalar code, thus constructing again a memory access using the niters.
> This is a limitation for which I haven't quite worked out a solution yet and
> does cause some minor regressions due to unfortunate spills.
>
> Let me know what you think and if you have ideas of how we can better achieve
> this.

Hmm, so the patch adds a kludge to improve the kludge we have in place ;)

I think it might be interesting to create a C testcase mimicking the
update problem without involving the vectorizer.
That way we can see how the various components involved behave
(FRE + ivopts most specifically).

That said, a cleaner approach to dealing with this would be to
explicitly track the IVs we generate for vectorized DRs, eventually
factoring that out from vectorizable_{store,load} so we can simply
carry over the actual pointer IV final value to the epilogue as the
initial value. For each DR group we'd create a single IV (we can even
do better in case we have load + store of the "same" group).

We already kind-of track things via the ivexpr_map, but I'm not sure
if this lazily populated map can be reliably re-used to "re-populate"
the epilogue one (walk the map, create epilogue IVs with the
appropriate initial value & adjusted update).

Richard.

> Kind regards,
> Andre Vieira

--
Richard Biener
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)
Re: [RFC] Using main loop's updated IV as base_address for epilogue vectorization
Hi Richi,

So I'm trying to look at what IVOPTs does right now and how it might be able to help us. Looking at these two code examples:

#include 
#if 0
int foo(short * a, short * b, unsigned int n)
{
  int sum = 0;
  for (unsigned int i = 0; i < n; ++i)
    sum += a[i] + b[i];

  return sum;
}

#else

int bar (short * a, short *b, unsigned int n)
{
  int sum = 0;
  unsigned int i = 0;
  for (; i < (n / 16); i += 1)
    {
      // Iterates [0, 16, .., (n/16 * 16) * 16]
      // Example n = 127,
      // iterates [0, 16, 32, 48, 64, 80, 96, 112]
      sum += a[i*16] + b[i*16];
    }
  for (size_t j = (size_t) ((n / 16) * 16); j < n; ++j)
    {
      // Iterates [(n/16 * 16) * 16 , (((n/16 * 16) + 1) * 16)... ,n*16]
      // Example n = 127,
      // j starts at (127/16) * 16 = 7 * 16 = 112,
      // So iterates over [112, 113, 114, 115, ..., 127]
      sum += a[j] + b[j];
    }
  return sum;
}
#endif

Compiled the bottom one (#if 0) with 'aarch64-linux-gnu' with the following options '-O3 -march=armv8-a -fno-tree-vectorize -fdump-tree-ivopts-all -fno-unroll-loops'. See godbolt link here: https://godbolt.org/z/MEf6j6ebM

I tried to see what IVOPTs would make of this and it is able to analyze the IVs, but it doesn't realize (not even sure it tries) that one IV's end (loop 1) could be used as the base for the other (loop 2). I don't know if this is where you'd want such optimizations to be made; on one side I think it would be great, as it would also help with non-vectorized loops as you alluded to.

However, if you compile the top test case (#if 1) and let the tree-vectorizer have a go, you will see different behaviours for different vectorization approaches. With '-O3 -march=armv8-a', using NEON and epilogue vectorization, it seems IVOPTs only picks up on one loop. If you use '-O3 -march=armv8-a+sve --param vect-partial-vector-usage=1' it will detect two loops. This may well be because epilogue vectorization in fact 'un-loops' it, since it knows it will only have to do one iteration of the vectorized epilogue.
vect-partial-vector-usage=1 could have done the same, but because we are dealing with polymorphic vector modes it fails to; I have a hack that works for vect-partial-vector-usage to avoid it, but I think we can probably do better and try to reason about boundaries in poly_int's rather than integers (TBC).

Anyway, I diverge. Back to the main question of this patch: how do you suggest I go about this? Is there a way to make IVOPTs aware of the 'iterate-once' IVs in the epilogue(s) (both vector and scalar!) and then teach it to merge IVs if one ends where the other begins?

On 04/05/2021 10:56, Richard Biener wrote:

On Fri, 30 Apr 2021, Andre Vieira (lists) wrote:

Hi, The aim of this RFC is to explore a way of cleaning up the codegen around data_references. To be specific, I'd like to reuse the main-loop's updated data_reference as the base_address for the epilogue's corresponding data_reference, rather than use the niters. We have found this leads to better codegen in the vectorized epilogue loops.

The approach in this RFC creates a map if iv_updates which always contain an updated pointer that is caputed in vectorizable_{load,store}, an iv_update may also contain a skip_edge in case we decide the vectorization can be skipped in 'vect_do_peeling'. During the epilogue update this map of iv_updates is then checked to see if it contains an entry for a data_reference and it is used accordingly and if not it reverts back to the old behavior of using the niters to advance the data_reference.

The motivation for this work is to improve codegen for the option `--param vect-partial-vector-usage=1` for SVE. We found that one of the main problems for the codegen here was coming from unnecessary conversions caused by the way we update the data_references in the epilogue.

This patch passes regression tests in aarch64-linux-gnu, but the codegen is still not optimal in some cases.
Specifically those where we have a scalar epilogue, as this does not use the data_reference's and will rely on the gimple scalar code, thus constructing again a memory access using the niters. This is a limitation for which I haven't quite worked out a solution yet and does cause some minor regressions due to unfortunate spills. Let me know what you think and if you have ideas of how we can better achieve this. Hmm, so the patch adds a kludge to improve the kludge we have in place ;) I think it might be interesting to create a C testcase mimicing the update problem without involving the vectorizer. That way we can see how the various components involved behave (FRE + ivopts most specifically). That said, a cleaner approach to dealing with this would be to explicitely track the IVs we generate for vectorized DRs, eventually factoring that out from vectorizable_{store,load} so we can simply carry over the actual pointer IV final value to the epilogue as initial va
Re: [RFC] Using main loop's updated IV as base_address for epilogue vectorization
On Wed, 5 May 2021, Andre Vieira (lists) wrote:

> Hi Richi,
>
> So I'm trying to look at what IVOPTs does right now and how it might be able
> to help us. Looking at these two code examples:
> #include
> #if 0
> int foo(short * a, short * b, unsigned int n)
> {
>   int sum = 0;
>   for (unsigned int i = 0; i < n; ++i)
>     sum += a[i] + b[i];
>
>   return sum;
> }
>
>
> #else
>
> int bar (short * a, short *b, unsigned int n)
> {
>   int sum = 0;
>   unsigned int i = 0;
>   for (; i < (n / 16); i += 1)
>     {
>       // Iterates [0, 16, .., (n/16 * 16) * 16]
>       // Example n = 127,
>       // iterates [0, 16, 32, 48, 64, 80, 96, 112]
>       sum += a[i*16] + b[i*16];
>     }
>   for (size_t j = (size_t) ((n / 16) * 16); j < n; ++j)
>     {
>       // Iterates [(n/16 * 16) * 16 , (((n/16 * 16) + 1) * 16)... ,n*16]
>       // Example n = 127,
>       // j starts at (127/16) * 16 = 7 * 16 = 112,
>       // So iterates over [112, 113, 114, 115, ..., 127]
>       sum += a[j] + b[j];
>     }
>   return sum;
> }
> #endif
>
> Compiled the bottom one (#if 0) with 'aarch64-linux-gnu' with the following
> options '-O3 -march=armv8-a -fno-tree-vectorize -fdump-tree-ivopts-all
> -fno-unroll-loops'. See godbolt link here: https://godbolt.org/z/MEf6j6ebM
>
> I tried to see what IVOPTs would make of this and it is able to analyze the
> IVs but it doesn't realize (not even sure it tries) that one IV's end (loop 1)
> could be used as the base for the other (loop 2). I don't know if this is
> where you'd want such optimizations to be made, on one side I think it would
> be great as it would also help with non-vectorized loops as you allured to.

Hmm, OK. So there's the first loop that has a looparound jump and thus
we do not always enter the 2nd loop with the first loop final value of the
IV. But yes, IVOPTs does not try to allocate IVs across multiple loops.
And for a followup transform to catch this it would need to compute
the final value of the IV and then match this up with the initial
value computation.
I suppose FRE could be taught to do this, at least for very simple cases.

> However, if you compile the top test case (#if 1) and let the tree-vectorizer
> have a go you will see different behaviours for different vectorization
> approaches, so for:
> '-O3 -march=armv8-a', using NEON and epilogue vectorization it seems IVOPTs
> only picks up on one loop.

Yep, the "others" are likely fully peeled because they have just a single
iteration. Again some kind-of final-value replacement "CSE" could
eventually help - but then we have jump-arounds here as well thus we'd
need final-value replacement "PRE".

> If you use '-O3 -march=armv8-a+sve --param vect-partial-vector-usage=1' it
> will detect two loops. This may well be because in fact epilogue vectorization
> 'un-loops' it because it knows it will only have to do one iteration of the
> vectorized epilogue. vect-partial-vector-usage=1 could have done the same, but
> because we are dealing with polymorphic vector modes it fails to, I have a
> hack that works for vect-partial-vector-usage to avoid it, but I think we can
> probably do better and try to reason about boundaries in poly_int's rather
> than integers (TBC).
>
> Anyway I diverge. Back to the main question of this patch. How do you suggest
> I go about this? Is there a way to make IVOPTS aware of the 'iterate-once' IVs
> in the epilogue(s) (both vector and scalar!) and then teach it to merge IV's
> if one ends where the other begins?

I don't think we will make that work easily. So indeed attacking this
in the vectorizer sounds most promising.

I'll note there's also the issue of epilogue vectorization and reductions
where we seem to not re-use partially reduced reduction vectors but
instead reduce to a scalar in each step. That's a related issue - we're
not able to carry forward a (reduction) IV we generated for the main
vector loop to the epilogue loops.
Like for

double foo (double *a, int n)
{
  double sum = 0.;
  for (int i = 0; i < n; ++i)
    sum += a[i];
  return sum;
}

with AVX512 we get three reductions to scalars instead of a partial
reduction from zmm to ymm before the first vectorized epilogue followed
by a reduction from ymm to xmm before the second (the jump around for
the epilogues need to jump to the further reduction piece obviously).

So I think we want to record IVs we generate (the reduction IVs are
already nicely associated with the stmt-infos), one might consider to
refer to them from the dr_vec_info for example. It's just going to be
"interesting" to wire everything up correctly with all the jump-arounds
we have ...

> On 04/05/2021 10:56, Richard Biener wrote:
> > On Fri, 30 Apr 2021, Andre Vieira (lists) wrote:
> >
> >> Hi,
> >>
> >> The aim of this RFC is to explore a way of cleaning up the codegen around
> >> data_references. To be specific, I'd like to reuse the main-loop's updated
> >> data_reference as the base_address for
Re: [RFC] Using main loop's updated IV as base_address for epilogue vectorization
On 05/05/2021 13:34, Richard Biener wrote:

On Wed, 5 May 2021, Andre Vieira (lists) wrote:

I tried to see what IVOPTs would make of this and it is able to analyze the IVs but it doesn't realize (not even sure it tries) that one IV's end (loop 1) could be used as the base for the other (loop 2). I don't know if this is where you'd want such optimizations to be made, on one side I think it would be great as it would also help with non-vectorized loops as you allured to.

Hmm, OK. So there's the first loop that has a looparound jump and thus we do not always enter the 2nd loop with the first loop final value of the IV. But yes, IVOPTs does not try to allocate IVs across multiple loops. And for a followup transform to catch this it would need to compute the final value of the IV and then match this up with the initial value computation. I suppose FRE could be teached to do this, at least for very simple cases.

I will admit I am not at all familiar with how FRE works, I know it exists as the occlusion of running it often breaks my vector patches :P But that's about all I know. I will have a look and see if it makes sense from my perspective to address it there, because ...

Anyway I diverge. Back to the main question of this patch. How do you suggest I go about this? Is there a way to make IVOPTS aware of the 'iterate-once' IVs in the epilogue(s) (both vector and scalar!) and then teach it to merge IV's if one ends where the other begins?

I don't think we will make that work easily. So indeed attacking this in the vectorizer sounds most promising.

The problem I found with my approach is that it only tackles the vectorized epilogues, and that leads to regressions. I don't have the example at hand, but what I saw happening was that increased register pressure led to a spill in the hot path.
I believe this was caused by the epilogue loop using the updated pointers as the base for its DRs (in this case there were three DRs: 2 loads, one store), while the scalar epilogue still used the original base + niters, since this data_reference approach only changes the vectorized epilogues.

I'll note there's also the issue of epilogue vectorization and reductions where we seem to not re-use partially reduced reduction vectors but instead reduce to a scalar in each step. That's a related issue - we're not able to carry forward a (reduction) IV we generated for the main vector loop to the epilogue loops. Like for

double foo (double *a, int n)
{
  double sum = 0.;
  for (int i = 0; i < n; ++i)
    sum += a[i];
  return sum;
}

with AVX512 we get three reductions to scalars instead of a partial reduction from zmm to ymm before the first vectorized epilogue followed by a reduction from ymm to xmm before the second (the jump around for the epilogues need to jump to the further reduction piece obviously). So I think we want to record IVs we generate (the reduction IVs are already nicely associated with the stmt-infos), one might consider to refer to them from the dr_vec_info for example. It's just going to be "interesting" to wire everything up correctly with all the jump-arounds we have ...

I have a downstream hack for the reductions, but it only worked for partial-vector-usage, as there you have the guarantee it's the same vector mode, so you don't need to pfaff around with half and full vectors. Obviously what you are suggesting has much wider applications, and not surprisingly Richard Sandiford also pointed out to me that these are somewhat related and we might be able to reuse the IV-creation to manage it all. But I feel like I am currently light years away from that.
I had started to look at removing the data_reference updating we have now and dealing with this in the 'create_iv' calls from 'vect_create_data_ref_ptr' inside 'vectorizable_{load,store}', but then I thought it would be good to discuss it with you first. This will require keeping track of the 'end-value' of the IV, which for loops where we can skip the previous loop means we will need to construct a phi-node containing the updated pointer and the initial base. But I'm not entirely sure where to keep track of all this. Also I don't know if I can replace the base address of the data_reference right there at the 'create_iv' call; can a data_reference be used multiple times in the same loop?

I'll go do a bit more nosing around this idea and the ivmap you mentioned before. Let me know if you have any ideas on how this should all look, even if it's 'in an ideal world'.

Andre

On 04/05/2021 10:56, Richard Biener wrote:

On Fri, 30 Apr 2021, Andre Vieira (lists) wrote:

Hi, The aim of this RFC is to explore a way of cleaning up the codegen around data_references. To be specific, I'd like to reuse the main-loop's updated data_reference as the base_address for the epilogue's corresponding data_reference, rather than use the niters. We have found this leads to better codege
Re: [RFC] Using main loop's updated IV as base_address for epilogue vectorization
On Wed, 5 May 2021, Andre Vieira (lists) wrote:
>
> On 05/05/2021 13:34, Richard Biener wrote:
> > On Wed, 5 May 2021, Andre Vieira (lists) wrote:
> >
> >> I tried to see what IVOPTs would make of this and it is able to analyze
> >> the IVs but it doesn't realize (not even sure it tries) that one IV's
> >> end (loop 1) could be used as the base for the other (loop 2). I don't
> >> know if this is where you'd want such optimizations to be made; on one
> >> side I think it would be great as it would also help with
> >> non-vectorized loops, as you alluded to.
> > Hmm, OK. So there's the first loop that has a looparound jump and thus
> > we do not always enter the 2nd loop with the first loop's final value of
> > the IV. But yes, IVOPTs does not try to allocate IVs across multiple
> > loops. And for a followup transform to catch this it would need to
> > compute the final value of the IV and then match this up with the
> > initial value computation. I suppose FRE could be taught to do this, at
> > least for very simple cases.
> I will admit I am not at all familiar with how FRE works; I know it exists
> as running it often breaks my vector patches :P But that's about all I
> know. I will have a look and see if it makes sense from my perspective to
> address it there, because ...
> >
> >> Anyway, I diverge. Back to the main question of this patch. How do you
> >> suggest I go about this? Is there a way to make IVOPTs aware of the
> >> 'iterate-once' IVs in the epilogue(s) (both vector and scalar!) and then
> >> teach it to merge IVs if one ends where the other begins?
> > I don't think we will make that work easily. So indeed attacking this
> > in the vectorizer sounds most promising.
>
> The problem with this that I found with my approach is that it only
> tackles the vectorized epilogues and that leads to regressions. I don't
> have the example at hand, but what I saw happening was that increased
> register pressure led to a spill in the hot path. I believe this was
> caused by the epilogue loop using the updated pointers as the base for
> their DRs; in this case there were three DRs (two loads, one store), but
> the scalar epilogue was still using the original base + niters, since
> this data_reference approach only changes the vectorized epilogues.

Yeah, this issue obviously extends to the scalar pro- and epilogue loops...
So ideally we'd produce IL (mainly for the IV setup code, I guess) that
will be handled well by the following passes, but then IVOPTs is not
multi-loop aware ... That said, in the end we should be able to
code-generate the scalar loops as well (my plan is to add that, at least
for the vector loop, to be able to support partly vectorized loops with
unvectorizable stmts simply replicated as scalar ops). In that case we can
use the same IVs again.

> > I'll note there's also
> > the issue of epilogue vectorization and reductions where we seem
> > to not re-use partially reduced reduction vectors but instead
> > reduce to a scalar in each step. That's a related issue - we're
> > not able to carry forward a (reduction) IV we generated for the
> > main vector loop to the epilogue loops. Like for
> >
> > double foo (double *a, int n)
> > {
> >   double sum = 0.;
> >   for (int i = 0; i < n; ++i)
> >     sum += a[i];
> >   return sum;
> > }
> >
> > with AVX512 we get three reductions to scalars instead of
> > a partial reduction from zmm to ymm before the first vectorized
> > epilogue followed by a reduction from ymm to xmm before the second
> > (the jump-around for the epilogues needs to jump to the further
> > reduction piece, obviously).
> >
> > So I think we want to record IVs we generate (the reduction IVs
> > are already nicely associated with the stmt-infos); one might
> > consider referring to them from the dr_vec_info, for example.
> >
> > It's just going to be "interesting" to wire everything up
> > correctly with all the jump-arounds we have ...
> I have a downstream hack for the reductions, but it only worked for
> partial-vector-usage as there you have the guarantee it's the same
> vector mode, so you don't need to pfaff around with half and full
> vectors. Obviously what you are suggesting has much wider applications,
> and not surprisingly I think Richard Sandiford also pointed out to me
> that these are somewhat related and we might be able to reuse the
> IV-creation to manage it all. But I feel like I am currently light years
> away from that.
>
> I had started to look at removing the data_reference updating we have now
> and dealing with this in the 'create_iv' calls from
> 'vect_create_data_ref_ptr' inside 'vectorizable_{load,store}', but then I
> thought it would be good to discuss it with you first. This will require
> keeping track of the 'end-value' of the IV, which for loops where we can
> skip the previous loop means we will need to construct a phi-node
> containing the updated pointer and the initial base. But I'm not entirel
Re: [RFC] Using main loop's updated IV as base_address for epilogue vectorization
Hi,

On 20/05/2021 11:22, Richard Biener wrote:
On Mon, 17 May 2021, Andre Vieira (lists) wrote:

Hi, So this is my second attempt at finding a way to improve how we generate the vector IV's and teach the vectorizer to share them between main loop and epilogues. On IRC we discussed my idea to use the loop's control_iv, but that was a terrible idea and I quickly threw it in the bin. The main problem, that for some reason I failed to see, was that the control_iv increases by 's' and the datarefs by 's' * NELEMENTS, where 's' is usually 1 and NELEMENTS the number of elements we handle per iteration. That means the epilogue loops would have to start from the last loop's IV * the last loop's NELEMENTS and that would just cause a mess.

Instead I started to think about creating IV's for the datarefs, and what I thought worked best was to create these in scalar before peeling. That way the peeling mechanism takes care of the duplication of these for the vector and scalar epilogues, and it also takes care of adding phi-nodes for the skip_vector paths.

How does this work for if-converted loops where we use the non-if-converted scalar loop for (scalar) peeling but the if-converted scalar loop for vectorized epilogues? I suppose you're only adjusting the if-converted copy.

True, hadn't thought about this :(

These new IV's have two functions:
1) 'vect_create_data_ref_ptr' can use them to:
a) if it's the main loop: replace the values of the 'initial' value of the main loop's IV and the initial values in the skip_vector phi-nodes
b) Update the skip_vector phi-nodes' argument for the non-skip path with the updated vector ptr.

b) means the prologue IV will not be dead there so we actually need to compute it? I suppose IVOPTs could be taught to replace an IV with its final value (based on some other IV) when it's unused? Or does it already magically do good?

It does not and ...

2) They are used for the scalar epilogue, ensuring they share the same datareference ptr.
There are still a variety of 'hacky' elements here and a lot of testing to be done, but I hope to be able to clean them away. One of the main issues I had was that I had to skip a couple of checks and things for the added phi-nodes and update statements, as these do not have stmt_vec_info representation. Though I'm not sure adding this representation at their creation was much cleaner... It is something I could play around with, but I thought this was a good moment to ask you for input. For instance, maybe we could do this transformation before analysis?

Also be aware that because I create an IV for each dataref this leads to regressions with SVE codegen, for instance. NEON is able to use the post-index addressing mode to increase each dr IV at access time, but SVE can't do this. For this I don't know if maybe we could try to be smart and create shared IV's. So rather than make them based on the actual vector ptr, use a shared sizetype IV that can be shared among dr IV's with the same step. Or maybe this is something for IVOPTs?

Certainly IVOPTs could decide to use the newly created IVs in the scalar loops for the DRs therein as well. But since IVOPTs only considers a single loop at a time it will probably not pay too much attention and is only influenced by the out-of-loop uses of the final values of the IVs.

My gut feeling tells me that whatever we do we'll have to look into improving IVOPTs to consider multiple loops.

So I redid the IV-sharing and it's looking a lot simpler and neater; however, it only shares IVs between vectorized loops and not scalar pro- or epilogues. I am not certain IVOPTs will be able to deal with these, as it has no knowledge of the number of iterations of each different loop. So take for instance a prologue peeling for alignment loop and a first main vectorization loop.
To be able to reuse the IV's from the prologue in the main vectorization loop it would need to know that the initial start address + PEELING_NITERS == base address for the main vectorization loop.

I'll start testing this approach for correctness if there are no major concerns. Though I suspect we will only want to turn this into a patch once we have the IVOPTs work done to a point where it at least doesn't regress codegen because of shared IVs, and eventually we can look at how to solve the sharing between vectorized and scalar loops.

A small nitpick on my own RFC: I will probably move the 'skip_e' to outside of the map, as we only need one per loop_vinfo and not one per DR. Initially I didn't even have this skip_e in, but was using the creation of a dummy PHI node and then replacing it with the real thing later. Having the skip_e made the code simpler though, especially when inserting the 'init' stmt_list.

Kind regards,
Andre

diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
index b317df532a9a92a619de9572378437d09c632ab0..e7d0f1e657b1a0c9bec75799817242e0bc1d8282 100644
--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-ve
Re: [RFC] Using main loop's updated IV as base_address for epilogue vectorization
On Mon, 14 Jun 2021, Andre Vieira (lists) wrote:
> Hi,
>
> On 20/05/2021 11:22, Richard Biener wrote:
> > On Mon, 17 May 2021, Andre Vieira (lists) wrote:
> >
> >> Hi,
> >>
> >> So this is my second attempt at finding a way to improve how we generate
> >> the vector IV's and teach the vectorizer to share them between main loop
> >> and epilogues. On IRC we discussed my idea to use the loop's control_iv,
> >> but that was a terrible idea and I quickly threw it in the bin. The main
> >> problem, that for some reason I failed to see, was that the control_iv
> >> increases by 's' and the datarefs by 's' * NELEMENTS where 's' is
> >> usually 1 and NELEMENTS the number of elements we handle per iteration.
> >> That means the epilogue loops would have to start from the last loop's
> >> IV * the last loop's NELEMENTS and that would just cause a mess.
> >>
> >> Instead I started to think about creating IV's for the datarefs and what
> >> I thought worked best was to create these in scalar before peeling. That
> >> way the peeling mechanism takes care of the duplication of these for the
> >> vector and scalar epilogues and it also takes care of adding phi-nodes
> >> for the skip_vector paths.
> > How does this work for if-converted loops where we use the
> > non-if-converted scalar loop for (scalar) peeling but the
> > if-converted scalar loop for vectorized epilogues? I suppose
> > you're only adjusting the if-converted copy.
> True, hadn't thought about this :(
> >
> >> These new IV's have two functions:
> >> 1) 'vect_create_data_ref_ptr' can use them to:
> >> a) if it's the main loop: replace the values of the 'initial' value of
> >> the main loop's IV and the initial values in the skip_vector phi-nodes
> >> b) Update the skip_vector phi-nodes' argument for the non-skip path
> >> with the updated vector ptr.
> > b) means the prologue IV will not be dead there so we actually need
> > to compute it?
> > I suppose IVOPTs could be taught to replace an
> > IV with its final value (based on some other IV) when it's unused?
> > Or does it already magically do good?
> It does not and ...
> >
> >> 2) They are used for the scalar epilogue ensuring they share the same
> >> datareference ptr.
> >>
> >> There are still a variety of 'hacky' elements here and a lot of testing
> >> to be done, but I hope to be able to clean them away. One of the main
> >> issues I had was that I had to skip a couple of checks and things for
> >> the added phi-nodes and update statements as these do not have
> >> stmt_vec_info representation. Though I'm not sure adding this
> >> representation at their creation was much cleaner... It is something I
> >> could play around with but I thought this was a good moment to ask you
> >> for input. For instance, maybe we could do this transformation before
> >> analysis?
> >>
> >> Also be aware that because I create an IV for each dataref this leads to
> >> regressions with SVE codegen for instance. NEON is able to use the
> >> post-index addressing mode to increase each dr IV at access time, but
> >> SVE can't do this. For this I don't know if maybe we could try to be
> >> smart and create shared IV's. So rather than make them based on the
> >> actual vector ptr, use a shared sizetype IV that can be shared among dr
> >> IV's with the same step. Or maybe this is something for IVOPTs?
> > Certainly IVOPTs could decide to use the newly created IVs in the
> > scalar loops for the DRs therein as well. But since IVOPTs only
> > considers a single loop at a time it will probably not pay too
> > much attention and is only influenced by the out-of-loop uses of
> > the final values of the IVs.
> >
> > My gut feeling tells me that whatever we do we'll have to look
> > into improving IVOPTs to consider multiple loops.
>
> So I redid the IV-sharing and it's looking a lot simpler and neater;
> however, it only shares IVs between vectorized loops and not scalar pro-
> or epilogues. I am not certain IVOPTs will be able to deal with these, as
> it has no knowledge of the number of iterations of each different loop.
> So take for instance a prologue peeling for alignment loop and a first
> main vectorization loop. To be able to reuse the IV's from the prologue
> in the main vectorization loop it would need to know that the initial
> start address + PEELING_NITERS == base address for main vectorization
> loop.

Yes, indeed. I think what we may want to do is to make this CSE/PRE "easy"
when IVOPTs manages to choose compatible IVs. Since we generally have
bypass jumps around the various loops this boils down to PRE, since the
redundancies only appear on the loop-to-loop fallthru edges, which means we
won't see those opportunities on GIMPLE (unless we realize the PRE
opportunities in the actual vectorizer generated code). But eventually RTL
PRE does all the magic.

Iff we'd manage to present IVOPTs with out-of-loop IV uses that are actual
initial valu
Re: [RFC] Using main loop's updated IV as base_address for epilogue vectorization
On Mon, 14 Jun 2021, Richard Biener wrote:
> On Mon, 14 Jun 2021, Andre Vieira (lists) wrote:
>
> > Hi,
> >
> > On 20/05/2021 11:22, Richard Biener wrote:
> > > On Mon, 17 May 2021, Andre Vieira (lists) wrote:
> > >
> > >> Hi,
> > >>
> > >> So this is my second attempt at finding a way to improve how we
> > >> generate the vector IV's and teach the vectorizer to share them
> > >> between main loop and epilogues. On IRC we discussed my idea to use
> > >> the loop's control_iv, but that was a terrible idea and I quickly
> > >> threw it in the bin. The main problem, that for some reason I failed
> > >> to see, was that the control_iv increases by 's' and the datarefs by
> > >> 's' * NELEMENTS where 's' is usually 1 and NELEMENTS the number of
> > >> elements we handle per iteration. That means the epilogue loops would
> > >> have to start from the last loop's IV * the last loop's NELEMENTS and
> > >> that would just cause a mess.
> > >>
> > >> Instead I started to think about creating IV's for the datarefs and
> > >> what I thought worked best was to create these in scalar before
> > >> peeling. That way the peeling mechanism takes care of the duplication
> > >> of these for the vector and scalar epilogues and it also takes care
> > >> of adding phi-nodes for the skip_vector paths.
> > > How does this work for if-converted loops where we use the
> > > non-if-converted scalar loop for (scalar) peeling but the
> > > if-converted scalar loop for vectorized epilogues? I suppose
> > > you're only adjusting the if-converted copy.
> > True, hadn't thought about this :(
> > >
> > >> These new IV's have two functions:
> > >> 1) 'vect_create_data_ref_ptr' can use them to:
> > >> a) if it's the main loop: replace the values of the 'initial' value
> > >> of the main loop's IV and the initial values in the skip_vector
> > >> phi-nodes
> > >> b) Update the skip_vector phi-nodes' argument for the non-skip path
> > >> with the updated vector ptr.
> > > b) means the prologue IV will not be dead there so we actually need
> > > to compute it? I suppose IVOPTs could be taught to replace an
> > > IV with its final value (based on some other IV) when it's unused?
> > > Or does it already magically do good?
> > It does not and ...
> > >
> > >> 2) They are used for the scalar epilogue ensuring they share the same
> > >> datareference ptr.
> > >>
> > >> There are still a variety of 'hacky' elements here and a lot of
> > >> testing to be done, but I hope to be able to clean them away. One of
> > >> the main issues I had was that I had to skip a couple of checks and
> > >> things for the added phi-nodes and update statements as these do not
> > >> have stmt_vec_info representation. Though I'm not sure adding this
> > >> representation at their creation was much cleaner... It is something
> > >> I could play around with but I thought this was a good moment to ask
> > >> you for input. For instance, maybe we could do this transformation
> > >> before analysis?
> > >>
> > >> Also be aware that because I create an IV for each dataref this leads
> > >> to regressions with SVE codegen for instance. NEON is able to use the
> > >> post-index addressing mode to increase each dr IV at access time, but
> > >> SVE can't do this. For this I don't know if maybe we could try to be
> > >> smart and create shared IV's. So rather than make them based on the
> > >> actual vector ptr, use a shared sizetype IV that can be shared among
> > >> dr IV's with the same step.
> > >> Or maybe this is something for IVOPTs?
> > > Certainly IVOPTs could decide to use the newly created IVs in the
> > > scalar loops for the DRs therein as well. But since IVOPTs only
> > > considers a single loop at a time it will probably not pay too
> > > much attention and is only influenced by the out-of-loop uses of
> > > the final values of the IVs.
> > >
> > > My gut feeling tells me that whatever we do we'll have to look
> > > into improving IVOPTs to consider multiple loops.
> >
> > So I redid the IV-sharing and it's looking a lot simpler and neater;
> > however, it only shares IVs between vectorized loops and not scalar
> > pro- or epilogues. I am not certain IVOPTs will be able to deal with
> > these, as it has no knowledge of the number of iterations of each
> > different loop. So take for instance a prologue peeling for alignment
> > loop and a first main vectorization loop. To be able to reuse the IV's
> > from the prologue in the main vectorization loop it would need to know
> > that the initial start address + PEELING_NITERS == base address for
> > main vectorization loop.
>
> Yes, indeed. I think what we may want to do is to make this CSE/PRE
> "easy" when IVOPTs manages to choose compatible IVs. Since we generally
> have bypass jumps around the various loops this boils down to PRE
> since the redundancies only appear
Re: [RFC] Using main loop's updated IV as base_address for epilogue vectorization
On 14/06/2021 11:57, Richard Biener wrote:
On Mon, 14 Jun 2021, Richard Biener wrote:

Indeed. For example a simple

int a[1024], b[1024], c[1024];

void foo(int n)
{
  for (int i = 0; i < n; ++i)
    a[i+1] += c[i+i] ? b[i+1] : 0;
}

should usually see peeling for alignment (though on x86 you need exotic -march= since cost models generally have equal aligned and unaligned access costs). For example with -mavx2 -mtune=atom we'll see an alignment peeling prologue, an AVX2 vector loop, an SSE2 vectorized epilogue and a scalar epilogue. It also shows the original scalar loop being used in the scalar prologue and epilogue.

We're not even trying to make the counting IV easily used across loops (we're not counting scalar iterations in the vector loops). Specifically we see

  [local count: 94607391]:
  niters_vector_mult_vf.10_62 = bnd.9_61 << 3;
  _67 = niters_vector_mult_vf.10_62 + 7;
  _64 = (int) niters_vector_mult_vf.10_62;
  tmp.11_63 = i_43 + _64;
  if (niters.8_45 == niters_vector_mult_vf.10_62)
    goto ; [12.50%]
  else
    goto ; [87.50%]

after the main vect loop, recomputing the original IV (i) rather than using the inserted canonical IV. And then the vectorized epilogue header check doing

  [local count: 93293400]:
  # i_59 = PHI
  # _66 = PHI <_67(33), 0(18)>
  _96 = (unsigned int) n_10(D);
  niters.26_95 = _96 - _66;
  _108 = (unsigned int) n_10(D);
  _109 = _108 - _66;
  _110 = _109 + 4294967295;
  if (_110 <= 3)
    goto ; [10.00%]
  else
    goto ; [90.00%]

re-computing everything from scratch again (also notice how the main vect loop guard jumps around the alignment prologue as well and lands here - and the vectorized epilogue using unaligned accesses - good!).

That is, I'd expect _much_ easier jobs if we'd manage to track the number of performed scalar iterations (or the number of scalar iterations remaining) using the canonical IV we add to all loops across all of the involved loops.

Richard.
So I am now looking at using an IV that counts scalar iterations rather than vector iterations and reusing that through all loops (prologue, main loop, vect epilogue and scalar epilogue). The first part is easy, since that's what we already do for partial vectors or non-constant VFs. The latter requires some plumbing and removing a lot of the code in there that creates new IV's going from [0, niters - previous iterations].

I don't yet have a clear-cut view of how to do this. I first thought of keeping track of the 'control' IV in the loop_vinfo, but the prologue and scalar epilogues won't have one. 'loop' keeps a control_ivs struct, but that is used for overflow detection and only keeps track of what looks like a constant 'base' and 'step'. Not quite sure how all that works, but intuitively it doesn't seem like the right thing to reuse.

I'll go hack around and keep you posted on progress.

Regards,
Andre
Re: [RFC] Using main loop's updated IV as base_address for epilogue vectorization
On Wed, 16 Jun 2021, Andre Vieira (lists) wrote:
>
> On 14/06/2021 11:57, Richard Biener wrote:
> > On Mon, 14 Jun 2021, Richard Biener wrote:
> >
> >> Indeed. For example a simple
> >> int a[1024], b[1024], c[1024];
> >>
> >> void foo(int n)
> >> {
> >>   for (int i = 0; i < n; ++i)
> >>     a[i+1] += c[i+i] ? b[i+1] : 0;
> >> }
> >>
> >> should usually see peeling for alignment (though on x86 you need
> >> exotic -march= since cost models generally have equal aligned and
> >> unaligned access costs). For example with -mavx2 -mtune=atom
> >> we'll see an alignment peeling prologue, an AVX2 vector loop,
> >> an SSE2 vectorized epilogue and a scalar epilogue. It also
> >> shows the original scalar loop being used in the scalar prologue
> >> and epilogue.
> >>
> >> We're not even trying to make the counting IV easily used
> >> across loops (we're not counting scalar iterations in the
> >> vector loops).
> > Specifically we see
> >
> >   [local count: 94607391]:
> >   niters_vector_mult_vf.10_62 = bnd.9_61 << 3;
> >   _67 = niters_vector_mult_vf.10_62 + 7;
> >   _64 = (int) niters_vector_mult_vf.10_62;
> >   tmp.11_63 = i_43 + _64;
> >   if (niters.8_45 == niters_vector_mult_vf.10_62)
> >     goto ; [12.50%]
> >   else
> >     goto ; [87.50%]
> >
> > after the main vect loop, recomputing the original IV (i) rather
> > than using the inserted canonical IV. And then the vectorized
> > epilogue header check doing
> >
> >   [local count: 93293400]:
> >   # i_59 = PHI
> >   # _66 = PHI <_67(33), 0(18)>
> >   _96 = (unsigned int) n_10(D);
> >   niters.26_95 = _96 - _66;
> >   _108 = (unsigned int) n_10(D);
> >   _109 = _108 - _66;
> >   _110 = _109 + 4294967295;
> >   if (_110 <= 3)
> >     goto ; [10.00%]
> >   else
> >     goto ; [90.00%]
> >
> > re-computing everything from scratch again (also notice how
> > the main vect loop guard jumps around the alignment prologue
> > as well and lands here - and the vectorized epilogue using
> > unaligned accesses - good!).
> >
> > That is, I'd expect _much_ easier jobs if we'd manage to
> > track the number of performed scalar iterations (or the
> > number of scalar iterations remaining) using the canonical
> > IV we add to all loops across all of the involved loops.
> >
> > Richard.
>
> So I am now looking at using an IV that counts scalar iterations rather
> than vector iterations and reusing that through all loops (prologue, main
> loop, vect epilogue and scalar epilogue). The first is easy, since that's
> what we already do for partial vectors or non-constant VFs. The latter
> requires some plumbing and removing a lot of the code in there that
> creates new IV's going from [0, niters - previous iterations]. I don't
> yet have a clear-cut view of how to do this; I first thought of keeping
> track of the 'control' IV in the loop_vinfo, but the prologue and scalar
> epilogues won't have one. 'loop' keeps a control_ivs struct, but that is
> used for overflow detection and only keeps track of what looks like a
> constant 'base' and 'step'. Not quite sure how all that works, but
> intuitively it doesn't seem like the right thing to reuse.

Maybe it's enough to maintain this [remaining] scalar iterations counter
between loops, thus after the vector loop do

  remain_scalar_iter -= vector_iters * vf;

etc. This should make it possible to do some first-order cleanups,
avoiding some repeated computations. It does involve placing additional
PHIs for this remain_scalar_iter var of course (I'd be hesitant to rely on
the SSA renamer for this due to its expense). I think that for all later
jump-around tests tracking remaining scalar iters is more convenient than
tracking performed scalar iters.

> I'll go hack around and keep you posted on progress.

Thanks - it's an iffy area ...

Richard.