[Bug tree-optimization/85698] [8/9 Regression] CPU2017 525.x264_r fails starting with r257581

2018-05-21 Thread pthaugen at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85698

--- Comment #13 from Pat Haugen  ---
Author: pthaugen
Date: Mon May 21 16:41:09 2018
New Revision: 260477

URL: https://gcc.gnu.org/viewcvs?rev=260477&root=gcc&view=rev
Log:
PR target/85698
* gcc.target/powerpc/vec-setup-be-long.c: Remove XFAIL.


Modified:
branches/gcc-8-branch/gcc/testsuite/ChangeLog
branches/gcc-8-branch/gcc/testsuite/gcc.target/powerpc/vec-setup-be-long.c

[Bug tree-optimization/85698] [8/9 Regression] CPU2017 525.x264_r fails starting with r257581

2018-05-21 Thread pthaugen at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85698

--- Comment #12 from Pat Haugen  ---
Author: pthaugen
Date: Mon May 21 16:34:44 2018
New Revision: 260476

URL: https://gcc.gnu.org/viewcvs?rev=260476&root=gcc&view=rev
Log:
PR target/85698
* config/rs6000/rs6000.c (rs6000_output_move_128bit): Check dest
operand.

* gcc.target/powerpc/pr85698.c: New test.


Added:
branches/gcc-7-branch/gcc/testsuite/gcc.target/powerpc/pr85698.c
Modified:
branches/gcc-7-branch/gcc/ChangeLog
branches/gcc-7-branch/gcc/config/rs6000/rs6000.c
branches/gcc-7-branch/gcc/testsuite/ChangeLog

[Bug tree-optimization/85698] [8/9 Regression] CPU2017 525.x264_r fails starting with r257581

2018-05-21 Thread pthaugen at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85698

--- Comment #11 from Pat Haugen  ---
Author: pthaugen
Date: Mon May 21 16:23:20 2018
New Revision: 260475

URL: https://gcc.gnu.org/viewcvs?rev=260475&root=gcc&view=rev
Log:
PR target/85698
* config/rs6000/rs6000.c (rs6000_output_move_128bit): Check dest
operand.

* gcc.target/powerpc/pr85698.c: New test.


Added:
branches/gcc-8-branch/gcc/testsuite/gcc.target/powerpc/pr85698.c
Modified:
branches/gcc-8-branch/gcc/ChangeLog
branches/gcc-8-branch/gcc/config/rs6000/rs6000.c
branches/gcc-8-branch/gcc/testsuite/ChangeLog

[Bug tree-optimization/85698] [8/9 Regression] CPU2017 525.x264_r fails starting with r257581

2018-05-17 Thread pthaugen at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85698

Pat Haugen  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #10 from Pat Haugen  ---
Fixed.

[Bug tree-optimization/85698] [8/9 Regression] CPU2017 525.x264_r fails starting with r257581

2018-05-17 Thread pthaugen at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85698

--- Comment #9 from Pat Haugen  ---
Author: pthaugen
Date: Thu May 17 16:19:16 2018
New Revision: 260329

URL: https://gcc.gnu.org/viewcvs?rev=260329&root=gcc&view=rev
Log:
PR target/85698
* config/rs6000/rs6000.c (rs6000_output_move_128bit): Check dest
operand.

* gcc.target/powerpc/pr85698.c: New test.


Added:
trunk/gcc/testsuite/gcc.target/powerpc/pr85698.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/config/rs6000/rs6000.c
trunk/gcc/testsuite/ChangeLog

[Bug tree-optimization/85698] [8/9 Regression] CPU2017 525.x264_r fails starting with r257581

2018-05-14 Thread segher at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85698

Segher Boessenkool  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2018-05-15
     Ever confirmed|0                           |1

[Bug tree-optimization/85698] [8/9 Regression] CPU2017 525.x264_r fails starting with r257581

2018-05-14 Thread segher at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85698

Segher Boessenkool  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |segher at gcc dot gnu.org

--- Comment #8 from Segher Boessenkool  ---
Created attachment 44133
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44133&action=edit
patch

This has existed for five years (r199918).  Wow.

[Bug tree-optimization/85698] [8/9 Regression] CPU2017 525.x264_r fails starting with r257581

2018-05-14 Thread pthaugen at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85698

--- Comment #7 from Pat Haugen  ---
So the problem is that we're generating a stxvw4x insn to write to memory,
which corrupts the contents due to both endian behavior and element size (since
we're dealing with halfword/uint16_t elements).

Value in vector reg = 0x0000ffff0002fffcffff0002fff5000e

stvx/good
(gdb) x/8hx $r1+$r8
0x7fffe490: 0x000e  0xfff5  0x0002  0xffff  0xfffc  0x0002  0xffff  0x0000


stxvw4x/bad
(gdb) x/8hx $r7+$r8
0x7fffe470: 0xffff  0x0000  0xfffc  0x0002  0x0002  0xffff  0x000e  0xfff5
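
The two layouts can be reproduced with a small model of the stores. This is only a sketch under simplifying assumptions: the register is modelled as 16 bytes with the most significant byte first, stvx on little-endian is treated as a plain byte-reversed quadword store, and stxvw4x is treated as writing the four words in register order with each word in little-endian byte order. The function names and example_reg are made up for illustration; the register bytes are worked out from the reduced test (d[0..7] = 14, -11, 2, -1, -4, 2, -1, 0), not copied from the debugger.

```c
#include <stdint.h>

/* Register image, most significant byte first.  Halfword elements
   d[7]..d[0] = 0, -1, 2, -4, -1, 2, -11, 14 (assumed, derived by hand). */
static const uint8_t example_reg[16] = {
  0x00, 0x00, 0xff, 0xff, 0x00, 0x02, 0xff, 0xfc,
  0xff, 0xff, 0x00, 0x02, 0xff, 0xf5, 0x00, 0x0e
};

/* Model of a little-endian quadword store (the "good" stvx case):
   the least significant register byte lands at the lowest address. */
static void store_quadword_le(uint8_t mem[16], const uint8_t reg[16])
{
  for (int i = 0; i < 16; i++)
    mem[i] = reg[15 - i];
}

/* Model of a word-element store on little-endian (the "bad" stxvw4x case):
   words keep register order, only the bytes inside each word are reversed. */
static void store_words_be_order(uint8_t mem[16], const uint8_t reg[16])
{
  for (int w = 0; w < 4; w++)
    for (int b = 0; b < 4; b++)
      mem[4 * w + b] = reg[4 * w + 3 - b];
}

/* Read halfword number idx from memory the way "x/8hx" shows it on LE. */
static uint16_t load_halfword_le(const uint8_t mem[16], int idx)
{
  return (uint16_t) (mem[2 * idx] | (mem[2 * idx + 1] << 8));
}
```

Storing example_reg with store_quadword_le() and reading halfwords 0..7 back gives the element order the scalar code expects, while store_words_be_order() leaves the same halfwords in swapped positions, which is why the int16_t array reads back corrupted.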

[Bug tree-optimization/85698] [8/9 Regression] CPU2017 525.x264_r fails starting with r257581

2018-05-14 Thread pthaugen at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85698

--- Comment #6 from Pat Haugen  ---
(In reply to Richard Biener from comment #4)
> I can see what the patch does to this testcase on x86_64 - it enables BB
> vectorization of the first two loops after unrolling.  I don't see anything
> suspicious here on x86_64 and 525.x264_r works fine for me.
> 
> Can you clarify whether test, ref or train inputs fail for you?  I tried
> AVX256, AVX128 and plain old SSE so far without any issue but ref takes some
> time...
> 
> Can you check whether the following reduced file produces the same assembly
> for add4x4_idct as in the complete benchmark?  If so it should be possible to
> generate a runtime testcase from it.  Please attach preprocessed source if
> that doesn't work out.
> 
> So far I do suspect we are hitting a latent target issue?
> 
> #include <stdint.h>
> static uint8_t x264_clip_uint8( int x )
> {
>   return x&(~255) ? (-x)>>31 : x;
> }
> void add4x4_idct( uint8_t *p_dst, int16_t dct[16])
> {
>   int16_t d[16];
>   int16_t tmp[16];
>   for( int i = 0; i < 4; i++ )
>     {
>       int s02 =  dct[0*4+i] +  dct[2*4+i];
>       int d02 =  dct[0*4+i] -  dct[2*4+i];
>       int s13 =  dct[1*4+i] + (dct[3*4+i]>>1);
>       int d13 = (dct[1*4+i]>>1) -  dct[3*4+i];
>       tmp[i*4+0] = s02 + s13;
>       tmp[i*4+1] = d02 + d13;
>       tmp[i*4+2] = d02 - d13;
>       tmp[i*4+3] = s02 - s13;
>     }
>   for( int i = 0; i < 4; i++ )
>     {
>       int s02 =  tmp[0*4+i] +  tmp[2*4+i];
>       int d02 =  tmp[0*4+i] -  tmp[2*4+i];
>       int s13 =  tmp[1*4+i] + (tmp[3*4+i]>>1);
>       int d13 = (tmp[1*4+i]>>1) -  tmp[3*4+i];
>       d[0*4+i] = ( s02 + s13 + 32 ) >> 6;
>       d[1*4+i] = ( d02 + d13 + 32 ) >> 6;
>       d[2*4+i] = ( d02 - d13 + 32 ) >> 6;
>       d[3*4+i] = ( s02 - s13 + 32 ) >> 6;
>     }
>   for( int y = 0; y < 4; y++ )
>     {
>       for( int x = 0; x < 4; x++ )
>         p_dst[x] = x264_clip_uint8( p_dst[x] + d[y*4+x] );
>       p_dst += 32;
>     }
> }

Yes, that produces similar code, and adding the following to it produces an
executable test that fails at -O3.

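#include <stdlib.h>   /* abort */

int main(void)
{
  uint8_t dst[128];
  int16_t dct[16];
  int i;

  /* Deterministic inputs: dct[i] = 11*i, dst[i] = i.  */
  for (i = 0; i < 16; i++)
    dct[i] = i*10 + i;
  for (i = 0; i < 128; i++)
    dst[i] = i;

  add4x4_idct(dst, dct);

  /* Check the first two rows of the 4x4 block (row stride is 32).  */
  if (dst[0] != 14 || dst[1] != 0 || dst[2] != 4 || dst[3] != 2
      || dst[32] != 28 || dst[33] != 35 || dst[34] != 33 || dst[35] != 35)
    abort();

  return 0;
}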

Continuing to debug further...
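
For reference, the expected constants in the check above follow from the scalar arithmetic of the last loop: each output pixel is the old pixel plus a residual, clamped to 0..255. A rough sketch of that step is below. The names clip_uint8, add_residual and residual_rows are made up for this illustration, and the residual values are worked out by hand from dct[i] = 11*i, so treat them as an assumption rather than debugger output. Like the original clamp, it relies on >> of a negative int being an arithmetic shift, which is implementation-defined in ISO C but holds for GCC.

```c
#include <stdint.h>

/* Same clamp as x264_clip_uint8 in the reduced test: values outside
   0..255 become 0 (negative) or 255 (too large). */
static uint8_t clip_uint8(int x)
{
  return x & (~255) ? (-x) >> 31 : x;
}

/* Hypothetical helper: one output pixel = clamp(old pixel + residual). */
static uint8_t add_residual(uint8_t pixel, int16_t residual)
{
  return clip_uint8(pixel + residual);
}

/* Residuals d[0..7] for the first two rows, derived by hand (assumption). */
static const int16_t residual_rows[8] = { 14, -11, 2, -1, -4, 2, -1, 0 };
```

For example, dst[1] starts at 1 and its residual is -11, so the sum -10 clamps to 0, and dst[32] starts at 32 with residual -4, giving 28.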

[Bug tree-optimization/85698] [8/9 Regression] CPU2017 525.x264_r fails starting with r257581

2018-05-11 Thread pthaugen at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85698

--- Comment #5 from Pat Haugen  ---
(In reply to Richard Biener from comment #4)
> 
> Can you clarify whether test, ref or train inputs fail for you?  I tried
> AVX256, AVX128 and plain old SSE so far without any issue but ref takes some
> time...
> 

I see the error for ref and test inputs. The train input appears to pass, but
the FDO-optimized build then also fails with the ref input.

I will keep looking at the other stuff you requested.

[Bug tree-optimization/85698] [8/9 Regression] CPU2017 525.x264_r fails starting with r257581

2018-05-11 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85698

--- Comment #4 from Richard Biener  ---
I can see what the patch does to this testcase on x86_64 - it enables BB
vectorization of the first two loops after unrolling.  I don't see anything
suspicious here on x86_64 and 525.x264_r works fine for me.

Can you clarify whether test, ref or train inputs fail for you?  I tried
AVX256, AVX128 and plain old SSE so far without any issue but ref takes some
time...

Can you check whether the following reduced file produces the same assembly
for add4x4_idct as in the complete benchmark?  If so it should be possible to
generate a runtime testcase from it.  Please attach preprocessed source if
that doesn't work out.

So far I do suspect we are hitting a latent target issue?

#include <stdint.h>
static uint8_t x264_clip_uint8( int x )
{
  return x&(~255) ? (-x)>>31 : x;
}
void add4x4_idct( uint8_t *p_dst, int16_t dct[16])
{
  int16_t d[16];
  int16_t tmp[16];
  for( int i = 0; i < 4; i++ )
    {
      int s02 =  dct[0*4+i] +  dct[2*4+i];
      int d02 =  dct[0*4+i] -  dct[2*4+i];
      int s13 =  dct[1*4+i] + (dct[3*4+i]>>1);
      int d13 = (dct[1*4+i]>>1) -  dct[3*4+i];
      tmp[i*4+0] = s02 + s13;
      tmp[i*4+1] = d02 + d13;
      tmp[i*4+2] = d02 - d13;
      tmp[i*4+3] = s02 - s13;
    }
  for( int i = 0; i < 4; i++ )
    {
      int s02 =  tmp[0*4+i] +  tmp[2*4+i];
      int d02 =  tmp[0*4+i] -  tmp[2*4+i];
      int s13 =  tmp[1*4+i] + (tmp[3*4+i]>>1);
      int d13 = (tmp[1*4+i]>>1) -  tmp[3*4+i];
      d[0*4+i] = ( s02 + s13 + 32 ) >> 6;
      d[1*4+i] = ( d02 + d13 + 32 ) >> 6;
      d[2*4+i] = ( d02 - d13 + 32 ) >> 6;
      d[3*4+i] = ( s02 - s13 + 32 ) >> 6;
    }
  for( int y = 0; y < 4; y++ )
    {
      for( int x = 0; x < 4; x++ )
        p_dst[x] = x264_clip_uint8( p_dst[x] + d[y*4+x] );
      p_dst += 32;
    }
}

[Bug tree-optimization/85698] [8/9 Regression] CPU2017 525.x264_r fails starting with r257581

2018-05-09 Thread pthaugen at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85698

--- Comment #3 from Pat Haugen  ---
(In reply to Richard Biener from comment #2)
> 
> Can you help me with isolating this to a single function inside that file?
> Maybe try sticking __attribute__((optimize("no-tree-vectorize"))) on some
> functions.  Oh, there's also the vect_loop debug counter
> (-fdbg-cnt=vect_loop:N).

add4x4_idct() looks like the function; adding the attribute (or
"no-tree-slp-vectorize") to it resulted in a successful run.


> Otherwise I'll have to find a power8 machine where I can set up CPU 2017
> myself (unlikely this week due to public holidays).

Note that it also fails with -mcpu=power7, so a power8 machine is not needed.
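
For anyone repeating this kind of isolation, the attribute goes on the definition of the suspect function, roughly as in the sketch below. NO_SLP and butterfly4 are made-up names for illustration (butterfly4 just mirrors one butterfly step from add4x4_idct); the optimize attribute is GCC-specific, so the macro expands to nothing elsewhere.

```c
#include <stdint.h>

/* Hypothetical macro: disable SLP vectorization for one function only.
   Hidden from non-GCC compilers, which do not know this attribute. */
#if defined(__GNUC__) && !defined(__clang__)
#define NO_SLP __attribute__((optimize("no-tree-slp-vectorize")))
#else
#define NO_SLP
#endif

/* Hypothetical suspect function: one 4-point butterfly step. */
NO_SLP void butterfly4(int16_t out[4], const int16_t in[4])
{
  int s02 = in[0] + in[2];
  int d02 = in[0] - in[2];
  int s13 = in[1] + (in[3] >> 1);
  int d13 = (in[1] >> 1) - in[3];
  out[0] = s02 + s13;
  out[1] = d02 + d13;
  out[2] = d02 - d13;
  out[3] = s02 - s13;
}
```

If the failure disappears with the attribute on one function and returns without it, that function is the one whose vectorized code is worth inspecting.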

[Bug tree-optimization/85698] [8/9 Regression] CPU2017 525.x264_r fails starting with r257581

2018-05-09 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85698

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |wrong-code
           Priority|P3                          |P2

--- Comment #2 from Richard Biener  ---
Hmpf.  Sounds like the issue requires "careful" preparation of stmt operand
order (aka SSA name numbering).  We've had issues in this area in the past.

Can you help me with isolating this to a single function inside that file?
Maybe try sticking __attribute__((optimize("no-tree-vectorize"))) on some
functions.  Oh, there's also the vect_loop debug counter
(-fdbg-cnt=vect_loop:N).

Possibly we simply trigger a latent issue elsewhere now that we recognize
something for SLP vectorization.

Otherwise I'll have to find a power8 machine where I can set up CPU 2017
myself (unlikely this week due to public holidays).

[Bug tree-optimization/85698] [8/9 Regression] CPU2017 525.x264_r fails starting with r257581

2018-05-08 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85698

Andrew Pinski  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |8.2
            Summary|CPU2017 525.x264_r fails    |[8/9 Regression] CPU2017
                   |starting with r257581       |525.x264_r fails starting
                   |                            |with r257581