Re: [RFC][gcc][vect] PR 91246: Prototype for vectorization of loops with breaks

Richard Biener via Gcc-patches Thu, 12 Mar 2020 02:51:26 -0700

On Thu, Mar 12, 2020 at 10:15 AM Andre Vieira (lists)
<andre.simoesdiasvie...@arm.com> wrote:
>
> Hi all,
>
> This is a prototype I have put together to look at the feasibility and
> profitability of vectorizing loops with breaks as suggested in PR 91246.
> I am posting this here for comments as I am curious what your opinions
> are on my approach.  I don't expect much attention to this in stage 4,
> but I wanted to get it out now before I forget about it.
>
> The idea is that during ifcvt it checks whether it can safely transform
> the break away, version the loop and in the "to vectorize" version
> replace it with a hopefully a reduceable comparison and MIN_EXPR.
> Currently I have limited the cases it can transform to loops that meet
> all of the following conditions:
> - is a inner-loop with one break point,
> - the break condition is an EQ_EXPR or NE_EXPR of integers,
> - the entire loop is vectorize, i.e. it has no scalar epilogue,
> - the loop has no writes to memory and no variable escapes the loop,
> - all memory accesses are valid (i.e. within bounds) even if the loop
> doesn't break early.
>
> To check the last condition I copied and modified the checks used to
> warn for out of bounds accesses.  I modified these checks such that they
> are more conservative, we want to reject the transformation if we can't
> guarantee the accesses are within bounds for any possible range. I have
> noticed these range checks aren't very good yet, I am hoping that as
> project ranger evolves this will get easier. However, I am also worried
> that if they start to take control flow into consideration and
> "understand" the break, they will use it to reason that with the break
> in place the range of loop induction variables are limited. This
> obviously makes it unusable for us to check whether removing the break
> will cause out of bounds accesses.  Ideally the new ranger would allow
> us to "ignore" certain expressions in the range calculation. All this
> was with targets in mind that do not support masking, for targets with
> the ability to mask vector operations we could do without such analysis,
> as we may be able to prevent memory accesses taking place after the
> break condition has been met.


Not looking at the patch but just throwing in thoughts.

I think at some point we need to stop relying on a separate if-conversion
pass since that requires a GIMPLE representation for the scalar
if-converted loop which is really difficult if you start do apply predication
to more than just loads and stores.

That is, I'd like to see us experimenting with doing the if-conversion
"on the fly" during vectorization.  That probably means doing what
if-conversion does but represent the result in some non-IL data
structure - since we're going to do all-SLP it'll be the SLP graph,
maybe simply attach a vector of predicates to each node.
The benefit is that we'll eventually be able to if-convert for
non-loop vectorization as well.

Possibly easiest is of course to try do away with if-conversion as it
is now and only have the existing feature-set moved to the vectorizer.
I'd probably really start with copying the if-conversion analysis code
and simply skip its "code-generation", then handle multiple BBs
in all of the vectorizer code and see how analysis and code generation
would have to be adjusted.  Given the general direction I'd focus
on full-SLP testcases but in reality I cannot promise I'll finish
the only-SLP transition in time for GCC 11 (I hope to have a good
chunk ready for the Cauldron though).

> In trying to improve performance of this transformation I made a small
> change in the vectorizer, such that if we are dealing with such loops
> with breaks the ifcvt pass stores the variable holding the break
> condition in the loop struct under 'early_break_cond'. The vectorizer
> can then use this variable to check whether we have hit the break
> condition every iteration, preventing us to continue to go through the
> loop if we know we can stop early. This will benefit cases where we
> would break early, on the other hand if we break late, then we now
> introduced extra checks within the loop... However there is no way for
> us to know, one small improvement would be to use the known iteration
> count to decide whether or not to add these checks, one could argue that
> if we know the number of iterations to be small there is little point in
> adding these.  Thoughts welcome...
>
> Another issue I encountered was that the early loop unroller often gets
> in the way of things. For this reason I tried to teach it not to unroll
> inner loops which we can remove breaks from. Unfortunately the early
> loop unroller is performed before some other simplification
> transformations and loop canonicalisation, I have seen cases where these
> transformations were crucial in making our memory access analysis accept
> loops. I suggest disabling the early loop unroller to check whether this
> pass is able to handle certain loops, i.e. pass
> '-fdisable-tree-cunrolli', you can even go as far as use
> -fdisable-tree-cunrolli=<function_name> for more accurate
> benchmarking/analysis.
>
> I made some changes to the testcase on the PR ticket such that the
> compiler has enough information to determine all memory accesses are
> within bounds, see testcase below. This is one of those cases that
> requires '-fdisable-tree-cunrolli'.
>
> #include <stdbool.h>
> #define SIZE 8
> int s, sum;
>
> int *g_board;
> int *g_moves;
>
> static void f(int *moves, int *board, int cnt, int lastmove, int ko)
> {
>      for (int i = 0; i < cnt; i++) {
>          if (moves[i] == ko) continue;
>
>          bool found = false;
>          for (int j = 0; j < SIZE; j++) {
>              if ((lastmove + board[j]) == moves[j]) {
>                  found = true;
>                  break;
>              }
>          }
>
>          if (!found) {
>              s *= 20;
>          }
>
>          if (s >= 40) {
>              sum += s;
>          }
>      }
> }
>
> void foo(int cnt, int lastmove, int ko)
> {
>    int board[SIZE];
>    int moves[SIZE];
>    for (int i = 0; i < SIZE; ++i)
>      {
>        board[i] = g_board[i];
>        moves[i] = g_moves[i];
>      }
>
>    f (moves, board, cnt, lastmove, ko);
> }
>
>
> I have not found this transformation to have a big impact on performance
> so far.
>
> Kind Regards,
> Andre

Re: [RFC][gcc][vect] PR 91246: Prototype for vectorization of loops with breaks

Reply via email to