https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85698
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
I can see what the patch does to this testcase on x86_64 - it enables BB
vectorization of the first two loops after unrolling.  I don't see anything
suspicious here on x86_64, and 525.x264_r works fine for me.  Can you clarify
whether the test, ref or train inputs fail for you?  I have tried AVX256,
AVX128 and plain old SSE so far without any issue, but ref takes some time...

Can you check whether the following reduced file produces the same assembly
for add4x4_idct as in the complete benchmark?  If so, it should be possible
to generate a runtime testcase from it.  Please attach preprocessed source
if that doesn't work out.  So far I do suspect we are hitting a latent
target issue.

#include <stdint.h>

static uint8_t x264_clip_uint8( int x )
{
    return x&(~255) ? (-x)>>31 : x;
}

void add4x4_idct( uint8_t *p_dst, int16_t dct[16] )
{
    int16_t d[16];
    int16_t tmp[16];

    for( int i = 0; i < 4; i++ )
    {
        int s02 = dct[0*4+i] + dct[2*4+i];
        int d02 = dct[0*4+i] - dct[2*4+i];
        int s13 = dct[1*4+i] + (dct[3*4+i]>>1);
        int d13 = (dct[1*4+i]>>1) - dct[3*4+i];

        tmp[i*4+0] = s02 + s13;
        tmp[i*4+1] = d02 + d13;
        tmp[i*4+2] = d02 - d13;
        tmp[i*4+3] = s02 - s13;
    }

    for( int i = 0; i < 4; i++ )
    {
        int s02 = tmp[0*4+i] + tmp[2*4+i];
        int d02 = tmp[0*4+i] - tmp[2*4+i];
        int s13 = tmp[1*4+i] + (tmp[3*4+i]>>1);
        int d13 = (tmp[1*4+i]>>1) - tmp[3*4+i];

        d[0*4+i] = ( s02 + s13 + 32 ) >> 6;
        d[1*4+i] = ( d02 + d13 + 32 ) >> 6;
        d[2*4+i] = ( d02 - d13 + 32 ) >> 6;
        d[3*4+i] = ( s02 - s13 + 32 ) >> 6;
    }

    for( int y = 0; y < 4; y++ )
    {
        for( int x = 0; x < 4; x++ )
            p_dst[x] = x264_clip_uint8( p_dst[x] + d[y*4+x] );
        p_dst += 32;
    }
}
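
Something like the following would do as a runtime harness (a rough sketch
only - the input pattern and the printing are arbitrary, and it assumes the
reduced file above is compiled and linked next to it; the idea is to diff the
output between a known-good build and the suspect one):

#include <stdint.h>
#include <stdio.h>

/* Provided by the reduced file above. */
void add4x4_idct( uint8_t *p_dst, int16_t dct[16] );

int main( void )
{
    /* 4x4 block embedded in a 4x32 destination, matching the stride of 32
       used by add4x4_idct. */
    static uint8_t dst[4*32];
    int16_t dct[16];

    /* Arbitrary but deterministic input pattern. */
    for( int i = 0; i < 16; i++ )
        dct[i] = (int16_t)(((i * 97 - 500) & 0x3ff) - 512);
    for( int i = 0; i < 4*32; i++ )
        dst[i] = (uint8_t)(i * 13);

    add4x4_idct( dst, dct );

    /* Print the 4x4 result so outputs of different builds can be diffed. */
    for( int y = 0; y < 4; y++ )
    {
        for( int x = 0; x < 4; x++ )
            printf( "%3d ", dst[y*32 + x] );
        printf( "\n" );
    }
    return 0;
}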