https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114109
            Bug ID: 114109
           Summary: x264 satd vectorization vs LLVM
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: enhancement
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rdapp at gcc dot gnu.org
                CC: juzhe.zhong at rivai dot ai, law at gcc dot gnu.org
  Target Milestone: ---
            Target: x86_64-*-* riscv*-*-*

Looking at the following code from x264 (SPEC 2017):

typedef unsigned char uint8_t;
typedef unsigned short uint16_t;
typedef unsigned int uint32_t;

static inline uint32_t
abs2 (uint32_t a)
{
  uint32_t s = ((a >> 15) & 0x10001) * 0xffff;
  return (a + s) ^ s;
}

int
x264_pixel_satd_8x4 (uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2)
{
  uint32_t tmp[4][4];
  uint32_t a0, a1, a2, a3;
  int sum = 0;

  for (int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2)
    {
      a0 = (pix1[0] - pix2[0]) + ((pix1[4] - pix2[4]) << 16);
      a1 = (pix1[1] - pix2[1]) + ((pix1[5] - pix2[5]) << 16);
      a2 = (pix1[2] - pix2[2]) + ((pix1[6] - pix2[6]) << 16);
      a3 = (pix1[3] - pix2[3]) + ((pix1[7] - pix2[7]) << 16);
      {
        int t0 = a0 + a1;
        int t1 = a0 - a1;
        int t2 = a2 + a3;
        int t3 = a2 - a3;
        tmp[i][0] = t0 + t2;
        tmp[i][1] = t1 + t3;
        tmp[i][2] = t0 - t2;
        tmp[i][3] = t1 - t3;
      };
    }

  for (int i = 0; i < 4; i++)
    {
      {
        int t0 = tmp[0][i] + tmp[1][i];
        int t1 = tmp[0][i] - tmp[1][i];
        int t2 = tmp[2][i] + tmp[3][i];
        int t3 = tmp[2][i] - tmp[3][i];
        a0 = t0 + t2;
        a2 = t0 - t2;
        a1 = t1 + t3;
        a3 = t1 - t3;
      };
      sum += abs2 (a0) + abs2 (a1) + abs2 (a2) + abs2 (a3);
    }

  return (((uint16_t) sum) + ((uint32_t) sum >> 16)) >> 1;
}

I first checked on riscv, but x86 and aarch64 are pretty similar (see
https://godbolt.org/z/vzf5ha44r which compares at -O3 -mavx512f).

Vectorizing the first loop seems to be a costing issue: by default we don't
vectorize it, and the code becomes much larger when vector costing is
disabled, so the costing decision itself seems correct.  Clang's version is
significantly shorter; it looks like it directly vec_sets/vec_inits the
individual elements.  On riscv this could be handled rather elegantly with
strided loads, which we don't emit right now (a sketch of the idea follows at
the end).  As there are only 4 active vector elements and the loop is likely
load bound, it might be debatable whether LLVM's version is really better.

The second loop we do vectorize (4 elements at a time), but we end up with
e.g. four XORs for the four inlined abs2 calls, while clang chooses a larger
vectorization factor and does all the XORs in one (see the second sketch
below).

On my laptop (no avx512) I don't see a huge difference (113s GCC vs 108s
LLVM), but I guess the general case is still interesting?
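
For illustration, a minimal sketch of the strided-load idea for the first
loop using the RVV intrinsics.  The helper name load_column_strided is made
up and the u8mf4 type is just one possible choice; this is meant to show the
idea of replacing four scalar loads plus element inserts with one strided
load, not what the vectorizer would literally have to emit.  Compiled with
something like -march=rv64gcv.

#include <stdint.h>
#include <stddef.h>
#include <riscv_vector.h>

/* Hypothetical helper, illustration only: load the same column of four
   consecutive rows (pix[0], pix[i_pix], pix[2*i_pix], pix[3*i_pix]) with a
   single strided load.  The byte stride passed to vlse8 is the row stride
   i_pix, i.e. i_pix1/i_pix2 from the test case above.  */
static inline vuint8mf4_t
load_column_strided (const uint8_t *pix, ptrdiff_t i_pix)
{
  return __riscv_vlse8_v_u8mf4 (pix, i_pix, 4);
}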
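
For the second loop, a hand-written sketch (not clang's actual codegen) of
abs2 applied to 16 lanes at once with AVX-512F intrinsics.  The helper name
abs2_v16 is made up; the point is only that with a larger vectorization
factor the shift/and/mul/add/xor sequence is issued once per vector instead
of once per inlined scalar call.

#include <immintrin.h>

/* Illustration only: abs2 over 16 32-bit lanes, mirroring the scalar abs2
   above.  Requires -mavx512f.  */
static inline __m512i
abs2_v16 (__m512i a)
{
  __m512i s = _mm512_mullo_epi32 (_mm512_and_si512 (_mm512_srli_epi32 (a, 15),
                                                    _mm512_set1_epi32 (0x10001)),
                                  _mm512_set1_epi32 (0xffff));
  return _mm512_xor_si512 (_mm512_add_epi32 (a, s), s);
}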
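
For reference, a minimal standalone harness (made up here, not taken from
x264) to look at just this kernel outside of SPEC.  Kernel and harness are
assumed to live in separate translation units so nothing gets constant
folded; stride and iteration count are arbitrary.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Prototype of the kernel above, assumed to be compiled separately.  */
int x264_pixel_satd_8x4 (uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2);

#define STRIDE 32
#define ITERS  (1 << 24)

int
main (void)
{
  /* The kernel reads 4 rows of 8 bytes each at the given row stride.  */
  static uint8_t pix1[4 * STRIDE], pix2[4 * STRIDE];
  for (unsigned i = 0; i < sizeof pix1; i++)
    {
      pix1[i] = rand () & 0xff;
      pix2[i] = rand () & 0xff;
    }

  long long sum = 0;
  for (int i = 0; i < ITERS; i++)
    sum += x264_pixel_satd_8x4 (pix1, STRIDE, pix2, STRIDE);

  printf ("%lld\n", sum);
  return 0;
}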