https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114109
            Bug ID: 114109
           Summary: x264 satd vectorization vs LLVM
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: enhancement
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rdapp at gcc dot gnu.org
                CC: juzhe.zhong at rivai dot ai, law at gcc dot gnu.org
  Target Milestone: ---
            Target: x86_64-*-* riscv*-*-*

Looking at the following code from x264 (SPEC 2017):

typedef unsigned char uint8_t;
typedef unsigned short uint16_t;
typedef unsigned int uint32_t;

static inline uint32_t
abs2 (uint32_t a)
{
  uint32_t s = ((a >> 15) & 0x10001) * 0xffff;
  return (a + s) ^ s;
}

int
x264_pixel_satd_8x4 (uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2)
{
  uint32_t tmp[4][4];
  uint32_t a0, a1, a2, a3;
  int sum = 0;

  for (int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2)
    {
      a0 = (pix1[0] - pix2[0]) + ((pix1[4] - pix2[4]) << 16);
      a1 = (pix1[1] - pix2[1]) + ((pix1[5] - pix2[5]) << 16);
      a2 = (pix1[2] - pix2[2]) + ((pix1[6] - pix2[6]) << 16);
      a3 = (pix1[3] - pix2[3]) + ((pix1[7] - pix2[7]) << 16);
      {
        int t0 = a0 + a1;
        int t1 = a0 - a1;
        int t2 = a2 + a3;
        int t3 = a2 - a3;
        tmp[i][0] = t0 + t2;
        tmp[i][1] = t1 + t3;
        tmp[i][2] = t0 - t2;
        tmp[i][3] = t1 - t3;
      };
    }

  for (int i = 0; i < 4; i++)
    {
      {
        int t0 = tmp[0][i] + tmp[1][i];
        int t1 = tmp[0][i] - tmp[1][i];
        int t2 = tmp[2][i] + tmp[3][i];
        int t3 = tmp[2][i] - tmp[3][i];
        a0 = t0 + t2;
        a2 = t0 - t2;
        a1 = t1 + t3;
        a3 = t1 - t3;
      };
      sum += abs2 (a0) + abs2 (a1) + abs2 (a2) + abs2 (a3);
    }

  return (((uint16_t) sum) + ((uint32_t) sum >> 16)) >> 1;
}

I first checked on riscv, but x86 and aarch64 are pretty similar (see
https://godbolt.org/z/vzf5ha44r which compares at -O3 -mavx512f).

Vectorizing the first loop seems to be a costing issue: by default we don't
vectorize it, and the code becomes much larger when vector costing is
disabled, so the costing decision itself seems correct.  Clang's version is
significantly shorter; it looks like it directly vec_sets/vec_inits the
individual elements.  On riscv this could be handled rather elegantly with
strided loads, which we don't emit right now (a sketch of the idea follows at
the end).  As there are only 4 active vector elements and the loop is likely
load bound, it might be debatable whether LLVM's version is really better.

The second loop we do vectorize (4 elements at a time), but we end up with
e.g. four XORs for the four inlined abs2 calls, while clang chooses a larger
vectorization factor and does all the XORs in one (see the second sketch
below).

On my laptop (no avx512) I don't see a huge difference (113s GCC vs 108s
LLVM), but I guess the general case is still interesting?
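
For illustration, a minimal sketch of the strided-load idea for the first
loop using the RVV intrinsics.  The helper name load_column_strided is made
up and the u8mf4 type is just one possible choice; this is meant to show the
idea of replacing four scalar loads plus element inserts with one strided
load, not what the vectorizer would literally have to emit.  Compiled with
something like -march=rv64gcv.

#include <stdint.h>
#include <stddef.h>
#include <riscv_vector.h>

/* Hypothetical helper, illustration only: load the same column of four
   consecutive rows (pix[0], pix[i_pix], pix[2*i_pix], pix[3*i_pix]) with a
   single strided load.  The byte stride passed to vlse8 is the row stride
   i_pix, i.e. i_pix1/i_pix2 from the test case above.  */
static inline vuint8mf4_t
load_column_strided (const uint8_t *pix, ptrdiff_t i_pix)
{
  return __riscv_vlse8_v_u8mf4 (pix, i_pix, 4);
}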
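
For the second loop, a hand-written sketch (not clang's actual codegen) of
abs2 applied to 16 lanes at once with AVX-512F intrinsics.  The helper name
abs2_v16 is made up; the point is only that with a larger vectorization
factor the shift/and/mul/add/xor sequence is issued once per vector instead
of once per inlined scalar call.

#include <immintrin.h>

/* Illustration only: abs2 over 16 32-bit lanes, mirroring the scalar abs2
   above.  Requires -mavx512f.  */
static inline __m512i
abs2_v16 (__m512i a)
{
  __m512i s = _mm512_mullo_epi32 (_mm512_and_si512 (_mm512_srli_epi32 (a, 15),
                                                    _mm512_set1_epi32 (0x10001)),
                                  _mm512_set1_epi32 (0xffff));
  return _mm512_xor_si512 (_mm512_add_epi32 (a, s), s);
}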
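
For reference, a minimal standalone harness (made up here, not taken from
x264) to look at just this kernel outside of SPEC.  Kernel and harness are
assumed to live in separate translation units so nothing gets constant
folded; stride and iteration count are arbitrary.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Prototype of the kernel above, assumed to be compiled separately.  */
int x264_pixel_satd_8x4 (uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2);

#define STRIDE 32
#define ITERS  (1 << 24)

int
main (void)
{
  /* The kernel reads 4 rows of 8 bytes each at the given row stride.  */
  static uint8_t pix1[4 * STRIDE], pix2[4 * STRIDE];
  for (unsigned i = 0; i < sizeof pix1; i++)
    {
      pix1[i] = rand () & 0xff;
      pix2[i] = rand () & 0xff;
    }

  long long sum = 0;
  for (int i = 0; i < ITERS; i++)
    sum += x264_pixel_satd_8x4 (pix1, STRIDE, pix2, STRIDE);

  printf ("%lld\n", sum);
  return 0;
}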