On Mon, Aug 7, 2023 at 2:05 AM Prathamesh Kulkarni via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> On Thu, 3 Aug 2023 at 17:48, Richard Biener <rguent...@suse.de> wrote:
> >
> > On Thu, 3 Aug 2023, Richard Biener wrote:
> >
> > > On Thu, 3 Aug 2023, Richard Biener wrote:
> > >
> > > > On Thu, 3 Aug 2023, Prathamesh Kulkarni wrote:
> > > >
> > > > > On Wed, 2 Aug 2023 at 14:17, Richard Biener via Gcc-patches
> > > > > <gcc-patches@gcc.gnu.org> wrote:
> > > > > >
> > > > > > On Mon, 31 Jul 2023, Jeff Law wrote:
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On 7/28/23 01:05, Richard Biener via Gcc-patches wrote:
> > > > > > > > The following delays sinking of loads within the same innermost
> > > > > > > > loop when it was unconditional before.  That's a not uncommon
> > > > > > > > issue preventing vectorization when masked loads are not 
> > > > > > > > available.
> > > > > > > >
> > > > > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> > > > > > > >
> > > > > > > > I have a followup patch improving sinking that without this 
> > > > > > > > would
> > > > > > > > cause more of the problematic sinking - now that we have a 
> > > > > > > > second
> > > > > > > > sink pass after loop opts this looks like a reasonable approach?
> > > > > > > >
> > > > > > > > OK?
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Richard.
> > > > > > > >
> > > > > > > >  PR tree-optimization/92335
> > > > > > > >  * tree-ssa-sink.cc (select_best_block): Before loop
> > > > > > > >  optimizations avoid sinking unconditional loads/stores
> > > > > > > >  in innermost loops to conditional executed places.
> > > > > > > >
> > > > > > > >  * gcc.dg/tree-ssa/ssa-sink-10.c: Disable vectorizing.
> > > > > > > >  * gcc.dg/tree-ssa/predcom-9.c: Clone from ssa-sink-10.c,
> > > > > > > >  expect predictive commoning to happen instead of sinking.
> > > > > > > >  * gcc.dg/vect/pr65947-3.c: Adjust.
> > > > > > > I think it's reasonable -- there's probably going to be cases 
> > > > > > > where it's not
> > > > > > > great, but more often than not I think it's going to be a 
> > > > > > > reasonable
> > > > > > > heuristic.
> > > > > > >
> > > > > > > If there is undesirable fallout, better to find it over the 
> > > > > > > coming months than
> > > > > > > next spring.  So I'd suggest we go forward now to give more time 
> > > > > > > to find any
> > > > > > > pathological cases (if they exist).
> > > > > >
> > > > > > Agreed, I've pushed this now.
> > > > > Hi Richard,
> > > > > After this patch (committed in 
> > > > > 399c8dd44ff44f4b496223c7cc980651c4d6f6a0),
> > > > > pr65947-7.c "failed" for aarch64-linux-gnu:
> > > > > FAIL: gcc.dg/vect/pr65947-7.c scan-tree-dump-not vect "LOOP 
> > > > > VECTORIZED"
> > > > > FAIL: gcc.dg/vect/pr65947-7.c -flto -ffat-lto-objects
> > > > > scan-tree-dump-not vect "LOOP VECTORIZED"
> > > > >
> > > > > /* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" { target {
> > > > > ! vect_fold_extract_last } } } } */
> > > > >
> > > > > With your commit, condition_reduction in pr65947-7.c gets vectorized
> > > > > regardless of vect_fold_extract_last,
> > > > > which gates the above test (which is an improvement, because the
> > > > > function didn't get vectorized before the commit).
> > > > >
> > > > > The attached patch thus removes the gating on vect_fold_extract_last,
> > > > > and the test passes again.
> > > > > OK to commit ?
> > > >
> > > > OK.
> > >
> > > Or wait - the loop doesn't vectorize on x86_64, so I guess one
> > > critical target condition is missing.  Can you figure out which?
> >
> > I see
> >
> > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21:
> > note:   vect_is_simple_use: operand last_19 = PHI <last_8(7), 108(15)>,
> > type of def: reduction
> > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21:
> > note:   vect_is_simple_use: vectype vector(4) int
> > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21:
> > missed:   multiple types in double reduction or condition reduction or
> > fold-left reduction.
> > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:13:1:
> > missed:   not vectorized: relevant phi not supported: last_19 = PHI
> > <last_8(7), 108(15)>
> > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21:
> > missed:  bad operation or unsupported loop bound.
> Hi Richard,
> Looking at the aarch64 vect dump, it seems the loop in
> condition_reduction gets vectorized with V4HI mode
> while fails for other modes in vectorizable_condition:
>
>   if ((double_reduc || reduction_type != TREE_CODE_REDUCTION)
>       && ncopies > 1)
>     {
>       if (dump_enabled_p ())
>         dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>                          "multiple types in double reduction or condition "
>                          "reduction or fold-left reduction.\n");
>       return false;
>     }
>
> From the dump:
> foo.c:9:21: note:   === vect_analyze_loop_operations ===
> foo.c:9:21: note:   examining phi: last_19 = PHI <last_8(7), 108(15)>
> foo.c:9:21: note:   vect_is_simple_use: operand (int) aval_13, type of
> def: internal
> foo.c:9:21: note:   vect_is_simple_use: vectype vector(4) int
> foo.c:9:21: note:   vect_is_simple_use: operand last_19 = PHI
> <last_8(7), 108(15)>, type of def: reduction
> foo.c:9:21: note:   vect_is_simple_use: vectype vector(4) int
>
> For V8HI, VF = 8, and vectype_in = vector(4) int.
> Thus ncopies = VF / length(vectype_in) = 2, which is greater than 1,
> and thus fails:
> foo.c:9:21: missed:   multiple types in double reduction or condition
> reduction or fold-left reduction.
> foo.c:4:1: missed:   not vectorized: relevant phi not supported:
> last_19 = PHI <last_8(7), 108(15)>
> While for V4HI, VF = 4 and thus ncopies = 1, so it succeeds.
>
> For x86_64, it seems the vectorizer doesn't seem to try V4HI mode.
> If I "force" the vectorizer to use V4HI mode, we get the following dump:
> foo.c:9:21: note:   === vect_analyze_loop_operations ===
> foo.c:9:21: note:   examining phi: last_19 = PHI <last_8(7), 108(15)>
> foo.c:9:21: note:   vect_is_simple_use: operand (int) aval_13, type of
> def: internal
> foo.c:9:21: note:   vect_is_simple_use: vectype vector(2) int
> foo.c:9:21: note:   vect_is_simple_use: operand last_19 = PHI
> <last_8(7), 108(15)>, type of def: reduction
> foo.c:9:21: note:   vect_is_simple_use: vectype vector(2) int
> foo.c:9:21: missed:   multiple types in double reduction or condition
> reduction or fold-left reduction.
>
> Not sure tho if this is the only reason for the test to fail to
> vectorize on the target.
> Will investigate in more details next week.

The odd thing is that you say

  for (int i = 0; i < N; i++)
    {
      aval = a[i];
      if (b[i] < min_v)
        last = aval;
    }

fails to vectorize but

  for (int i = 0; i < N; i++)
    {
      if (b[i] < min_v)
        last = a[i];
    }

succeeds?  The IL difference should be irrelevant for the reduction
vectorization:

  <bb 3> [local count: 1049367889]:
  # last_19 = PHI <last_8(7), 108(15)>
  # i_21 = PHI <i_17(7), 0(15)>
  # ivtmp_18 = PHI <ivtmp_10(7), 43(15)>
  _1 = (long unsigned int) i_21;
  _2 = _1 * 2;
  _3 = a_12(D) + _2;
  aval_13 = *_3;
  _5 = _1 * 4;
  _6 = b_14(D) + _5;
  _7 = *_6;
  last_16 = (int) aval_13;
  _9 = _7 < min_v_15(D);
  last_8 = _9 ? last_16 : last_19;
  i_17 = i_21 + 1;
  ivtmp_10 = ivtmp_18 - 1;
  if (ivtmp_10 != 0)
    goto <bb 7>; [97.68%]

vs

  <bb 3> [local count: 1049367889]:
  # last_19 = PHI <last_9(7), 108(15)>
  # i_21 = PHI <i_17(7), 0(15)>
  # ivtmp_11 = PHI <ivtmp_10(7), 43(15)>
  _1 = (long unsigned int) i_21;
  _2 = _1 * 4;
  _3 = b_13(D) + _2;
  _4 = *_3;
  _5 = _4 < min_v_14(D);
  _6 = _1 * 2;
  _38 = _37 + _6;
  _7 = (short int *) _38;
  _8 = .MASK_LOAD (_7, 16B, _5);
  last_16 = (int) _8;
  last_9 = _5 ? last_16 : last_19;
  i_17 = i_21 + 1;
  ivtmp_10 = ivtmp_11 - 1;
  if (ivtmp_10 != 0)
    goto <bb 7>; [97.68%]

maybe since the "mask" is used twice with the .MASK_LOAD
we are not actually looking at the def (the comparison) and it's
the comparison which would introduce the "multiple types"?

That is, I wonder why not sinking the load, avoiding a conditional
load, makes a difference to vectorizing the condition/extract last reduction.

It doesn't seem to make a difference for x86.  That said, the "fix" is
probably sticking the correct target on the dump-check, it seems
that vect_fold_extract_last is no longer correct here.

Richard.

> Thanks,
> Prathamesh
> >
> > Richard.

Reply via email to