https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115340

            Bug ID: 115340
           Summary: Loop/SLP vectorization possible inefficiency
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rdapp at gcc dot gnu.org
                CC: rguenth at gcc dot gnu.org
  Target Milestone: ---
            Target: riscv

The following loop is heavily adjusted from x264 satd's last loop in order to
showcase a particular problem.  I'm not sure it is still representative of x264,
but anyway.  I used riscv as the target, but avx512 and aarch64 look similar.

typedef unsigned int uint32_t;

void
foo (uint32_t **restrict tmp, uint32_t *restrict out)
{
    uint32_t sum = 0;
    for( int i = 0; i < 4; i++ )
    {
        out[i + 0] = tmp[0][i] + 1;
        out[i + 4] = tmp[1][i] + 1;
        out[i + 8] = tmp[2][i] + 1;
        out[i + 12] = tmp[3][i] + 1;
    }
}

We loop-vectorize it as 4x unrolled:
  vect__5.6_70 = MEM <vector(4) unsigned int> [(uint32_t *)vectp.4_72];
  vect__7.7_68 = vect__5.6_70 + { 1, 1, 1, 1 };
  MEM <vector(4) unsigned int> [(uint32_t *)vectp_out.8_67] = vect__7.7_68;
  ...

On riscv:

        vsetivli        zero,4,e32,mf2,ta,ma
        vle32.v v1,0(a4)
        vadd.vi v1,v1,1
        vse32.v v1,0(a1)
        addi    a4,a1,16
        vle32.v v1,0(a2)
        vadd.vi v1,v1,1
        vse32.v v1,0(a4)
        addi    a4,a1,32
        vle32.v v1,0(a3)
        vadd.vi v1,v1,1
        vse32.v v1,0(a4)
        addi    a1,a1,48
        vle32.v v1,0(a5)
        vadd.vi v1,v1,1
        vse32.v v1,0(a1)

That's not ideal when a 64-byte vector can hold all values simultaneously
(and permuting four 16-byte vectors into one is acceptable).
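
In source form the shape I'm after looks roughly like the sketch below.  This is
only an illustration using the generic vector extensions plus
__builtin_shufflevector (GCC 12 and later); foo_wide and the typedefs are made
up for this example, and the aligned (4) attribute is just there so the casts
don't claim more alignment than the original uint32_t pointers guarantee.

typedef unsigned int uint32_t;
typedef uint32_t v4si  __attribute__ ((vector_size (16), aligned (4)));
typedef uint32_t v8si  __attribute__ ((vector_size (32), aligned (4)));
typedef uint32_t v16si __attribute__ ((vector_size (64), aligned (4)));

void
foo_wide (uint32_t **restrict tmp, uint32_t *restrict out)
{
  /* Four small vector loads...  */
  v4si a = *(const v4si *) tmp[0];
  v4si b = *(const v4si *) tmp[1];
  v4si c = *(const v4si *) tmp[2];
  v4si d = *(const v4si *) tmp[3];
  /* ...three permutes concatenating them into one 64-byte vector...  */
  v8si ab = __builtin_shufflevector (a, b, 0, 1, 2, 3, 4, 5, 6, 7);
  v8si cd = __builtin_shufflevector (c, d, 0, 1, 2, 3, 4, 5, 6, 7);
  v16si v = __builtin_shufflevector (ab, cd, 0, 1, 2, 3, 4, 5, 6, 7,
                                     8, 9, 10, 11, 12, 13, 14, 15);
  /* ...one add and one store.  */
  *(v16si *) out = v + 1;
}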

Of course the exact sequence will depend on costs, but on riscv I would expect
code along these lines (similar to what clang produces):

        vsetivli        zero, 4, e32, mf2, ta, ma
        vle32.v v8, (a2)
        vle32.v v9, (a4)
        vle32.v v10, (a0)
        vle32.v v11, (a3)
        vsetivli        zero, 8, e32, mf2, ta, ma
        vslideup.vi     v9, v10, 4
        vslideup.vi     v8, v11, 4
        vsetivli        zero, 16, e32, m1, ta, ma
        vslideup.vi     v8, v9, 8
        vadd.vi v8, v8, 1
        vse32.v v8, (a1)

Four loads, three permutes, one add, and one store.

When disabling loop vectorization, SLP gets us halfway there but still loads
each value separately:

  ...
  _17 = MEM[(uint32_tD.2756 *)_15 + 12B clique 1 base 0];
  _24 = MEM[(uint32_tD.2756 *)_22 + 12B clique 1 base 0];
  _69 = {_66, _81, _114, _5, _61, _86, _119, _10, _54, _93, _126, _17, _47, _100, _133, _24};
  vect__64.5_68 = _69 + { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 };
  MEM <vector(16) unsigned intD.4> [(uint32_tD.2756 *)out_33(D) clique 1 base 1] = vect__64.5_68;

I suppose that's the way it is supposed to work right now, but I'd rather have
the grouped loads done as vector loads that are permuted into place (instead of
scalar loads where each element gets "permuted" into the result individually).

Now, with SLPification in progress, I suppose that goal is a bit more realistic
than before.  Richi, is there something that can be done for now to get us
closer?  I'm currently looking into your rguenth/load-perm branch; is that a
reasonable way to start?
