https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113583

--- Comment #9 from Robin Dapp <rdapp at gcc dot gnu.org> ---
(In reply to rguent...@suse.de from comment #6)

> t.c:47:21: missed:   the size of the group of accesses is not a power of 2 or not equal to 3
> t.c:47:21: missed:   not falling back to elementwise accesses
> t.c:58:15: missed:   not vectorized: relevant stmt not supported: _4 = *_3;
> t.c:47:21: missed:  bad operation or unsupported loop bound.
> 
> where we don't consider using gather because we have a known constant
> stride (20).  Since the stores are really scatters we don't attempt
> to SLP either.
> 
> Disabling the above heuristic we get this vectorized as well, avoiding
> gather/scatter by manually implementing them and using a quite high
> VF of 8 (with -mprefer-vector-width=256 you get VF 4 and likely
> faster code in the end).

I suppose you're referring to this?

  /* FIXME: At the moment the cost model seems to underestimate the
     cost of using elementwise accesses.  This check preserves the
     traditional behavior until that can be fixed.  */
  stmt_vec_info first_stmt_info = DR_GROUP_FIRST_ELEMENT (stmt_info);
  if (!first_stmt_info)
    first_stmt_info = stmt_info;
  if (*memory_access_type == VMAT_ELEMENTWISE
      && !STMT_VINFO_STRIDED_P (first_stmt_info)
      && !(stmt_info == DR_GROUP_FIRST_ELEMENT (stmt_info)
           && !DR_GROUP_NEXT_ELEMENT (stmt_info)
           && !pow2p_hwi (DR_GROUP_SIZE (stmt_info))))
    {
      if (dump_enabled_p ())
        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                         "not falling back to elementwise accesses\n");
      return false;
    }
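
To make the shape of that condition easier to see, here is a minimal
standalone sketch of it.  The struct and function names are hypothetical
stand-ins for the vectorizer state the macros read, not GCC API, and is_pow2
only approximates pow2p_hwi.  Read this way, the check gives up on elementwise
accesses unless the access is strided or the statement is the only member of
its group and the group size is not a power of two.

```c
#include <stdbool.h>

/* Hypothetical model of the values the GCC check looks at.  */
struct access_model {
  bool elementwise;     /* *memory_access_type == VMAT_ELEMENTWISE */
  bool strided;         /* STMT_VINFO_STRIDED_P (first_stmt_info) */
  bool is_group_leader; /* stmt_info == DR_GROUP_FIRST_ELEMENT (stmt_info) */
  bool has_next;        /* DR_GROUP_NEXT_ELEMENT (stmt_info) is non-null */
  unsigned group_size;  /* DR_GROUP_SIZE (stmt_info) */
};

/* Simplified power-of-two test (assumption: stands in for pow2p_hwi).  */
static bool is_pow2 (unsigned x)
{
  return x != 0 && (x & (x - 1)) == 0;
}

/* Returns true where the original code would dump
   "not falling back to elementwise accesses" and return false.  */
static bool rejects_elementwise (const struct access_model *a)
{
  return a->elementwise
         && !a->strided
         && !(a->is_group_leader
              && !a->has_next
              && !is_pow2 (a->group_size));
}
```

In this model a group with several members is rejected regardless of its
size, while a single-element group of size 20 is allowed to fall back to
elementwise accesses.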


I did some more tests on my laptop.  As said above, the whole loop in lbm is
larger and contains two ifs.  The first one prevents both clang and GCC from
vectorizing the loop; the second one

                if( TEST_FLAG_SWEEP( srcGrid, ACCEL )) {
                        ux = 0.005;
                        uy = 0.002;
                        uz = 0.000;
                }

seems to be if-converted by clang, or at least it doesn't inhibit vectorization.
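
For illustration, if-conversion of such a block amounts to replacing the
branch with unconditional selects, roughly as sketched below.  ACCEL_FLAG and
both function names are hypothetical; lbm's actual TEST_FLAG_SWEEP macro and
flag values are not reproduced here, and this is not what either compiler
literally emits.

```c
/* Hypothetical flag bit, standing in for lbm's ACCEL flag.  */
#define ACCEL_FLAG 0x2u

/* Branchy form, as written in the source.  */
static void set_velocity_branch (unsigned flags,
                                 double *ux, double *uy, double *uz)
{
  if (flags & ACCEL_FLAG)
    {
      *ux = 0.005;
      *uy = 0.002;
      *uz = 0.000;
    }
}

/* If-converted form: every statement executes unconditionally and the
   condition only selects which value is kept.  */
static void set_velocity_select (unsigned flags,
                                 double *ux, double *uy, double *uz)
{
  int accel = (flags & ACCEL_FLAG) != 0;
  *ux = accel ? 0.005 : *ux;
  *uy = accel ? 0.002 : *uy;
  *uz = accel ? 0.000 : *uz;
}
```

Both forms compute the same values; the second has no control flow inside the
loop body, which is the property a vectorizer needs in order to turn the
selects into blends or masked operations.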

Now, if I comment out the first, larger if, clang does vectorize the loop.  With
the return false commented out in the above GCC snippet, GCC also vectorizes,
but only when both ifs are commented out.

Results with both ifs commented out, built with -march=native (resulting in
avx2), best of 3 runs as lbm is notoriously fickle:

gcc trunk vanilla: 156.04s
gcc trunk with elementwise: 132.10s
clang 17: 143.06s
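
The relative differences can be worked out from these timings in two common
ways, sketched below with hypothetical helper names: as a reduction in run
time relative to the vanilla GCC baseline, or as a speedup factor.

```c
/* Percentage of run time saved relative to the baseline.  */
static double time_reduction_pct (double baseline, double t)
{
  return 100.0 * (baseline - t) / baseline;
}

/* Percentage speedup, i.e. how much more work per unit of time.  */
static double speedup_pct (double baseline, double t)
{
  return 100.0 * (baseline / t - 1.0);
}
```

With 156.04s as the baseline, the elementwise GCC build needs about 15.3% less
time (an 18.1% speedup) and clang 17 about 8.3% less time (a 9.1% speedup).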

Of course, even the FIXME comment already says that costing is difficult, and
the change will surely cause regressions elsewhere.  However, the roughly 15%
reduction in run time with vectorization (or the roughly 8% reduction with
clang) IMHO shows that it's worth looking into this further.  On top of that,
clang for riscv does not seem to care about the first if either and still
vectorizes.  I haven't looked closer at what happens there, though.
