https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85698
--- Comment #6 from Pat Haugen <pthaugen at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #4)
> I can see what the patch does to this testcase on x86_64 - it enables BB
> vectorization of the first two loops after unrolling. I don't see anything
> suspicious here on x86_64 and 525.x264_r works fine for me.
>
> Can you clarify whether test, ref or train inputs fail for you? I tried
> AVX256, AVX128 and plain old SSE so far without any issue but ref takes some
> time...
>
> Can you check whether the following reduced file produces the same assembly
> for add4x4_idct as in the complete benchmark? If so it should be possible to
> generate a runtime testcase from it. Please attach preprocessed source if
> that doesn't work out.
>
> So far I do suspect we are hitting a latent target issue?
>
> #include <stdint.h>
> static uint8_t x264_clip_uint8( int x )
> {
>   return x&(~255) ? (-x)>>31 : x;
> }
> void add4x4_idct( uint8_t *p_dst, int16_t dct[16])
> {
>   int16_t d[16];
>   int16_t tmp[16];
>   for( int i = 0; i < 4; i++ )
>     {
>       int s02 = dct[0*4+i] + dct[2*4+i];
>       int d02 = dct[0*4+i] - dct[2*4+i];
>       int s13 = dct[1*4+i] + (dct[3*4+i]>>1);
>       int d13 = (dct[1*4+i]>>1) - dct[3*4+i];
>       tmp[i*4+0] = s02 + s13;
>       tmp[i*4+1] = d02 + d13;
>       tmp[i*4+2] = d02 - d13;
>       tmp[i*4+3] = s02 - s13;
>     }
>   for( int i = 0; i < 4; i++ )
>     {
>       int s02 = tmp[0*4+i] + tmp[2*4+i];
>       int d02 = tmp[0*4+i] - tmp[2*4+i];
>       int s13 = tmp[1*4+i] + (tmp[3*4+i]>>1);
>       int d13 = (tmp[1*4+i]>>1) - tmp[3*4+i];
>       d[0*4+i] = ( s02 + s13 + 32 ) >> 6;
>       d[1*4+i] = ( d02 + d13 + 32 ) >> 6;
>       d[2*4+i] = ( d02 - d13 + 32 ) >> 6;
>       d[3*4+i] = ( s02 - s13 + 32 ) >> 6;
>     }
>   for( int y = 0; y < 4; y++ )
>     {
>       for( int x = 0; x < 4; x++ )
>         p_dst[x] = x264_clip_uint8( p_dst[x] + d[y*4+x] );
>       p_dst += 32;
>     }
> }

Yes, that produces similar code, and adding the following to it produces an
executable test that fails at -O3.
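#include <stdlib.h>  /* for abort() */

int main(void)
{
  uint8_t dst[128];
  int16_t dct[16];
  int i;

  /* Fill the coefficients and the destination with a simple pattern.  */
  for (i = 0; i < 16; i++)
    dct[i] = i*10 + i;
  for (i = 0; i < 128; i++)
    dst[i] = i;

  add4x4_idct(dst, dct);

  /* Check the first two rows of the 4x4 block (row stride is 32).  */
  if (dst[0] != 14 || dst[1] != 0 || dst[2] != 4 || dst[3] != 2
      || dst[32] != 28 || dst[33] != 35 || dst[34] != 33 || dst[35] != 35)
    abort();

  return 0;
}

Continuing to debug further...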
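For cross-checking the eight expected values in the test, here is a minimal sketch of a scalar reference that is structured differently from the reduced file. It is not part of the benchmark or of the patch under discussion: the names ref_clip_uint8, ref_butterfly, ref_add4x4_idct and ref_matches_expected, and the explicit stride parameter, are hypothetical. It assumes that >> on negative ints is an arithmetic shift (as GCC implements it) and is meant to be built at -O0 so that it does not go through the vectorizer.

```c
#include <stdint.h>

/* Hypothetical helper: clamp to 0..255 using plain comparisons.  */
static uint8_t ref_clip_uint8(int x)
{
    if (x < 0)
        return 0;
    if (x > 255)
        return 255;
    return (uint8_t)x;
}

/* One 4-point butterfly, as used by both passes of the reduced testcase.
   bias and shift are 0/0 for the first pass and 32/6 for the second.  */
static void ref_butterfly(const int in[4], int out[4], int bias, int shift)
{
    int s02 = in[0] + in[2];
    int d02 = in[0] - in[2];
    int s13 = in[1] + (in[3] >> 1);
    int d13 = (in[1] >> 1) - in[3];
    out[0] = (s02 + s13 + bias) >> shift;
    out[1] = (d02 + d13 + bias) >> shift;
    out[2] = (d02 - d13 + bias) >> shift;
    out[3] = (s02 - s13 + bias) >> shift;
}

/* Scalar reference; stride is an assumed parameter (32 in the testcase).  */
static void ref_add4x4_idct(uint8_t *dst, int stride, const int16_t dct[16])
{
    int16_t tmp[16], d[16];
    int in[4], out[4];

    /* First pass: read column i of dct, write row i of tmp.  */
    for (int i = 0; i < 4; i++) {
        for (int k = 0; k < 4; k++)
            in[k] = dct[k * 4 + i];
        ref_butterfly(in, out, 0, 0);
        for (int k = 0; k < 4; k++)
            tmp[i * 4 + k] = (int16_t)out[k];
    }

    /* Second pass: read column i of tmp, write column i of d.  */
    for (int i = 0; i < 4; i++) {
        for (int k = 0; k < 4; k++)
            in[k] = tmp[k * 4 + i];
        ref_butterfly(in, out, 32, 6);
        for (int k = 0; k < 4; k++)
            d[k * 4 + i] = (int16_t)out[k];
    }

    /* Add the residual to the destination block and clamp.  */
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            dst[y * stride + x] =
                ref_clip_uint8(dst[y * stride + x] + d[y * 4 + x]);
}

/* Returns 1 if the reference reproduces the values checked in the test.  */
int ref_matches_expected(void)
{
    static const uint8_t row0[4] = {14, 0, 4, 2};
    static const uint8_t row1[4] = {28, 35, 33, 35};
    uint8_t dst[128];
    int16_t dct[16];

    for (int i = 0; i < 16; i++)
        dct[i] = (int16_t)(i * 10 + i);
    for (int i = 0; i < 128; i++)
        dst[i] = (uint8_t)i;

    ref_add4x4_idct(dst, 32, dct);

    for (int x = 0; x < 4; x++)
        if (dst[x] != row0[x] || dst[32 + x] != row1[x])
            return 0;
    return 1;
}
```

Calling ref_matches_expected() from a small main and comparing its result with the -O3 run of the test gives an independent check on the constants, separate from the code path being miscompiled.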