https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117769
Bug ID: 117769
Summary: RISC-V: Worse codegen in x264_pixel_satd_8x4
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: juzhe.zhong at rivai dot ai
Target Milestone: ---
Notice we generate worse codegen than ARM SVE in hottest function of 625.x264:
#include <stdint.h>
#include <math.h>
#define HADAMARD4(d0, d1, d2, d3, s0, s1, s2, s3) {\
int t0 = s0 + s1;\
int t1 = s0 - s1;\
int t2 = s2 + s3;\
int t3 = s2 - s3;\
d0 = t0 + t2;\
d2 = t0 - t2;\
d1 = t1 + t3;\
d3 = t1 - t3;\
}
uint32_t abs2( uint32_t a )
{
uint32_t s = ((a>>15)&0x10001)*0xffff;
return (a+s)^s;
}
int x264_pixel_satd_8x4( uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2 )
{
uint32_t tmp[4][4];
uint32_t a0, a1, a2, a3;
int sum = 0;
for( int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2 )
{
a0 = (pix1[0] - pix2[0]) + ((pix1[4] - pix2[4]) << 16);
a1 = (pix1[1] - pix2[1]) + ((pix1[5] - pix2[5]) << 16);
a2 = (pix1[2] - pix2[2]) + ((pix1[6] - pix2[6]) << 16);
a3 = (pix1[3] - pix2[3]) + ((pix1[7] - pix2[7]) << 16);
HADAMARD4( tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3], a0,a1,a2,a3 );
}
for( int i = 0; i < 4; i++ )
{
HADAMARD4( a0, a1, a2, a3, tmp[0][i], tmp[1][i], tmp[2][i], tmp[3][i]
);
sum += abs2(a0) + abs2(a1) + abs2(a2) + abs2(a3);
}
return (((uint16_t)sum) + ((uint32_t)sum>>16)) >> 1;
}
https://godbolt.org/z/4KsfMaTjo
I notice in this commit:
https://github.com/gcc-mirror/gcc/commit/cf261dd52272bdca767560131f3c2b4e1edae9ab?open_in_browser=true
We can have similiar codegen as ARM SVE. And we have tested in QEMU 20%
performance drop in case of dynamic instruction count in the trunk GCC compare
with the commit described above. We need to bisect to see which commit hurt the
performance of RISC-V.