On Sat, May 25, 2024 at 3:08 PM Hanke Zhang via Gcc <gcc@gcc.gnu.org> wrote:
>
> Hi,
> I'm trying to study the automatic vectorization optimization in GCC,
> but I found one case where the SLP vectorizer failed to vectorize.
>
> Here is the sample code (a simplified version of a function
> from the 625/525.x264 source code in SPEC CPU 2017):
>
> void pixel_sub_wxh(int16_t *diff, uint8_t *pix1, uint8_t *pix2) {
>   for (int y = 0; y < 4; y++) {
>     for (int x = 0; x < 4; x++)
>       diff[x + y * 4] = pix1[x] - pix2[x];
>     pix1 += 16;
>     pix2 += 32;

The issue is these increments: with only four uint8_t elements accessed
per outer iteration, we still want to fill up a vector's worth of them.

In the end we succeed with v4hi / v8qi, but we also peel for gaps even
though we handle the half-load case fine.

>   }
> }
>
> When I compiled with `-O3 -mavx2/-msse4.2`, SLP vectorizer failed to
> vectorize it, and I got the following message when adding
> `-fopt-info-vec-all`. (The inner loop will be unrolled)
>
> <source>:6:21: optimized: loop vectorized using 8 byte vectors
> <source>:6:21: optimized:  loop versioned for vectorization because of
> possible aliasing
> <source>:5:6: note: vectorized 1 loops in function.

^^^

So you do see the vectorization as outlined above.

> <source>:5:6: note: ***** Analysis failed with vector mode V8SI
> <source>:5:6: note: ***** The result for vector mode V32QI would be the same
> <source>:5:6: note: ***** Re-trying analysis with vector mode V16QI
> <source>:5:6: note: ***** Analysis failed with vector mode V16QI
> <source>:5:6: note: ***** Re-trying analysis with vector mode V8QI
> <source>:5:6: note: ***** Analysis failed with vector mode V8QI
> <source>:5:6: note: ***** Re-trying analysis with vector mode V4QI
> <source>:5:6: note: ***** Analysis failed with vector mode V4QI
>
> If I manually use the type declarations provided by `immintrin.h` to
> rewrite the code, it looks as follows (which is what I hope the SLP
> vectorizer would be able to do):
>
> void pixel_sub_wxh_vec(int16_t *diff, uint8_t *pix1, uint8_t *pix2) {
>   for (int y = 0; y < 4; y++) {
>     __v4hi pix1_v = {pix1[0], pix1[1], pix1[2], pix1[3]};
>     __v4hi pix2_v = {pix2[0], pix2[1], pix2[2], pix2[3]};
>     __v4hi diff_v = pix1_v - pix2_v;
>     *(long long *)(diff + y * 4) = (long long)diff_v;

We kind of do it this way, just with

        __v8qi pix1_v = {pix1[0], pix1[1], pix1[2], pix1[3], 0, 0, 0, 0};
...

and then unpack the low half of the __v8qi to v4hi.

Unfortunately, the last two outer iterations are scalar because of the
gap issue.  There are some PRs about this and I did start to work on
improving it, but I'm not sure this exact case is covered, so can you open
a new bug report?

>     pix1 += 16;
>     pix2 += 32;
>   }
> }
>
> What I want to know is why SLP vectorizer can't vectorize the code
> here, and what changes do I need to make to SLP vectorizer or the
> source code if I want it to do so?
>
> Thanks
> Hanke Zhang
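
For illustration, here is a minimal sketch of the strategy described in the
reply: load four bytes into the low half of an 8-byte vector, widen to 16-bit
lanes, subtract, and store only the low four results. It uses GCC's generic
vector extensions rather than `immintrin.h` types. The names `v8qu`, `v8hi`,
`load_low4`, `pixel_sub_wxh_sketch` and `sketch_matches_reference` are
hypothetical, and `__builtin_convertvector` is assumed to be available (it is
in recent GCC releases). This is hand-written code that mirrors the idea, not
the code the vectorizer actually emits.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical type names built on GCC's vector_size attribute. */
typedef uint8_t v8qu __attribute__((vector_size(8)));
typedef int16_t v8hi __attribute__((vector_size(16)));

/* Scalar reference: the loop from the original question. */
static void pixel_sub_wxh_ref(int16_t *diff, const uint8_t *pix1,
                              const uint8_t *pix2) {
  for (int y = 0; y < 4; y++) {
    for (int x = 0; x < 4; x++)
      diff[x + y * 4] = pix1[x] - pix2[x];
    pix1 += 16;
    pix2 += 32;
  }
}

/* Half load: copy exactly four bytes into the low lanes of an 8-byte
   vector and leave the upper four lanes zero.  Nothing past p[3] is read. */
static v8qu load_low4(const uint8_t *p) {
  v8qu v = {0};
  memcpy(&v, p, 4);
  return v;
}

static void pixel_sub_wxh_sketch(int16_t *diff, const uint8_t *pix1,
                                 const uint8_t *pix2) {
  for (int y = 0; y < 4; y++) {
    /* Zero-extend each uint8_t lane to an int16_t lane. */
    v8hi a = __builtin_convertvector(load_low4(pix1), v8hi);
    v8hi b = __builtin_convertvector(load_low4(pix2), v8hi);
    v8hi d = a - b;
    /* Only the low four lanes carry results; store just those. */
    memcpy(diff + y * 4, &d, 4 * sizeof(int16_t));
    pix1 += 16;
    pix2 += 32;
  }
}

/* Self-check on deterministic input; returns 1 when both versions agree. */
static int sketch_matches_reference(void) {
  uint8_t pix1[4 * 16], pix2[4 * 32];
  int16_t want[16], got[16];
  for (int i = 0; i < (int)sizeof pix1; i++) pix1[i] = (uint8_t)(i * 7 + 3);
  for (int i = 0; i < (int)sizeof pix2; i++) pix2[i] = (uint8_t)(i * 13 + 250);
  pixel_sub_wxh_ref(want, pix1, pix2);
  pixel_sub_wxh_sketch(got, pix1, pix2);
  int same = memcmp(want, got, sizeof want) == 0;
  assert(same);
  return same;
}
```

The sketch copies exactly four bytes per row, so it never touches memory
beyond the accessed elements. A full 8-byte load at `pix1` would also read
the four bytes that follow, which, as far as the reply indicates, is the
kind of access past the used elements that leads the vectorizer to peel for
gaps.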
