https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85698

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
I can see what the patch does to this testcase on x86_64 - it enables BB
vectorization of the first two loops after unrolling.  I don't see anything
suspicious here on x86_64, and 525.x264_r works fine for me.

Can you clarify whether the test, ref or train inputs fail for you?  I tried
AVX256, AVX128 and plain old SSE so far without any issue, but ref takes some
time...

Can you check whether the following reduced file produces the same assembly
for add4x4_idct as in the complete benchmark?  If so, it should be possible to
generate a runtime testcase from it.  Please attach preprocessed source if
that doesn't work out.

So far I suspect we are hitting a latent target issue.

#include <stdint.h>
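/* Branchless clip of x to [0,255]: if any bit above the low 8 is set,
   (-x)>>31 is 0 for negative x and all-ones (255 after truncation to
   uint8_t) for x > 255.  */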
static uint8_t x264_clip_uint8( int x )
{
  return x&(~255) ? (-x)>>31 : x;
}
void add4x4_idct( uint8_t *p_dst, int16_t dct[16])
{
  int16_t d[16];
  int16_t tmp[16];
  for( int i = 0; i < 4; i++ )
    {
      int s02 =  dct[0*4+i]     +  dct[2*4+i];
      int d02 =  dct[0*4+i]     -  dct[2*4+i];
      int s13 =  dct[1*4+i]     + (dct[3*4+i]>>1);
      int d13 = (dct[1*4+i]>>1) -  dct[3*4+i];
      tmp[i*4+0] = s02 + s13;
      tmp[i*4+1] = d02 + d13;
      tmp[i*4+2] = d02 - d13;
      tmp[i*4+3] = s02 - s13;
    }
  for( int i = 0; i < 4; i++ )
    {
      int s02 =  tmp[0*4+i]     +  tmp[2*4+i];
      int d02 =  tmp[0*4+i]     -  tmp[2*4+i];
      int s13 =  tmp[1*4+i]     + (tmp[3*4+i]>>1);
      int d13 = (tmp[1*4+i]>>1) -  tmp[3*4+i];
      d[0*4+i] = ( s02 + s13 + 32 ) >> 6;
      d[1*4+i] = ( d02 + d13 + 32 ) >> 6;
      d[2*4+i] = ( d02 - d13 + 32 ) >> 6;
      d[3*4+i] = ( s02 - s13 + 32 ) >> 6;
    }
  for( int y = 0; y < 4; y++ )
    {
      for( int x = 0; x < 4; x++ )
        p_dst[x] = x264_clip_uint8( p_dst[x] + d[y*4+x] );
      p_dst += 32;
    }
}
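
To turn this into a runtime testcase, a minimal driver along the following
lines could be appended to the reduced file; the input pattern and the
checksum are arbitrary choices of mine, not taken from the benchmark - the
idea is just to compare the printed value between an -O0 build and the
suspect optimized build.

#include <stdio.h>

int main(void)
{
  uint8_t dst[4*32];
  int16_t dct[16];
  /* Deterministic but non-trivial inputs covering both signs.  */
  for( int i = 0; i < (int)sizeof dst; i++ )
    dst[i] = (uint8_t)(i * 7);
  for( int i = 0; i < 16; i++ )
    dct[i] = (int16_t)(i * 131 - 1024);
  add4x4_idct( dst, dct );
  /* Rolling checksum of the whole destination buffer; a mismatch
     between builds would indicate a miscompile of add4x4_idct.  */
  unsigned sum = 0;
  for( int i = 0; i < (int)sizeof dst; i++ )
    sum = sum * 131 + dst[i];
  printf( "checksum: %u\n", sum );
  return 0;
}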
