[Bug tree-optimization/85698] [8/9 Regression] CPU2017 525.x264_r fails starting with r257581
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85698

--- Comment #13 from Pat Haugen ---
Author: pthaugen
Date: Mon May 21 16:41:09 2018
New Revision: 260477

URL: https://gcc.gnu.org/viewcvs?rev=260477&root=gcc&view=rev
Log:
        PR target/85698
        * gcc.target/powerpc/vec-setup-be-long.c: Remove XFAIL.

Modified:
    branches/gcc-8-branch/gcc/testsuite/ChangeLog
    branches/gcc-8-branch/gcc/testsuite/gcc.target/powerpc/vec-setup-be-long.c
[Bug tree-optimization/85698] [8/9 Regression] CPU2017 525.x264_r fails starting with r257581
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85698

--- Comment #12 from Pat Haugen ---
Author: pthaugen
Date: Mon May 21 16:34:44 2018
New Revision: 260476

URL: https://gcc.gnu.org/viewcvs?rev=260476&root=gcc&view=rev
Log:
        PR target/85698
        * config/rs6000/rs6000.c (rs6000_output_move_128bit): Check dest
        operand.

        * gcc.target/powerpc/pr85698.c: New test.

Added:
    branches/gcc-7-branch/gcc/testsuite/gcc.target/powerpc/pr85698.c
Modified:
    branches/gcc-7-branch/gcc/ChangeLog
    branches/gcc-7-branch/gcc/config/rs6000/rs6000.c
    branches/gcc-7-branch/gcc/testsuite/ChangeLog
[Bug tree-optimization/85698] [8/9 Regression] CPU2017 525.x264_r fails starting with r257581
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85698

--- Comment #11 from Pat Haugen ---
Author: pthaugen
Date: Mon May 21 16:23:20 2018
New Revision: 260475

URL: https://gcc.gnu.org/viewcvs?rev=260475&root=gcc&view=rev
Log:
        PR target/85698
        * config/rs6000/rs6000.c (rs6000_output_move_128bit): Check dest
        operand.

        * gcc.target/powerpc/pr85698.c: New test.

Added:
    branches/gcc-8-branch/gcc/testsuite/gcc.target/powerpc/pr85698.c
Modified:
    branches/gcc-8-branch/gcc/ChangeLog
    branches/gcc-8-branch/gcc/config/rs6000/rs6000.c
    branches/gcc-8-branch/gcc/testsuite/ChangeLog
[Bug tree-optimization/85698] [8/9 Regression] CPU2017 525.x264_r fails starting with r257581
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85698

Pat Haugen changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #10 from Pat Haugen ---
Fixed.
[Bug tree-optimization/85698] [8/9 Regression] CPU2017 525.x264_r fails starting with r257581
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85698

--- Comment #9 from Pat Haugen ---
Author: pthaugen
Date: Thu May 17 16:19:16 2018
New Revision: 260329

URL: https://gcc.gnu.org/viewcvs?rev=260329&root=gcc&view=rev
Log:
        PR target/85698
        * config/rs6000/rs6000.c (rs6000_output_move_128bit): Check dest
        operand.

        * gcc.target/powerpc/pr85698.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/powerpc/pr85698.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/rs6000/rs6000.c
    trunk/gcc/testsuite/ChangeLog
[Bug tree-optimization/85698] [8/9 Regression] CPU2017 525.x264_r fails starting with r257581
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85698

Segher Boessenkool changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2018-05-15
     Ever confirmed|0                           |1
[Bug tree-optimization/85698] [8/9 Regression] CPU2017 525.x264_r fails starting with r257581
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85698

Segher Boessenkool changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |segher at gcc dot gnu.org

--- Comment #8 from Segher Boessenkool ---
Created attachment 44133
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44133&action=edit
patch

This has existed for five years (r199918). Wow.
[Bug tree-optimization/85698] [8/9 Regression] CPU2017 525.x264_r fails starting with r257581
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85698

--- Comment #7 from Pat Haugen ---
So the problem is that we're generating a stxvw4x insn to write to memory,
which corrupts the contents due to both endian behavior and element size
(since we're dealing with halfword/uint16_t elements).

Value in vector reg = 0x0002fffc0002fff5000e

stvx/good
(gdb) x/8hx $r1+$r8
0x7fffe490:  0x000e  0xfff5  0x0002  0x  0xfffc  0x0002  0x  0x

stxvw4x/bad
(gdb) x/8hx $r7+$r8
0x7fffe470:  0x  0x  0xfffc  0x0002  0x0002  0x  0x000e  0xfff5
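In the two dumps above, the visible halfword pairs stay intact but the four 32-bit words appear in the opposite order. The following is a minimal sketch of that symptom only, using made-up values 1..8 rather than the register contents from the report. The function names store_in_order and store_words_reversed are hypothetical, and the code is a plain C model of the observed memory layouts, not of the actual stvx or stxvw4x instruction semantics.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of the "good" layout: all eight halfwords land in memory in
   their original order. */
void store_in_order(uint16_t dst[8], const uint16_t src[8])
{
    memcpy(dst, src, 8 * sizeof(uint16_t));
}

/* Model of the "bad" layout (an assumption drawn from the dumps, not
   the real instruction definition): the 16-byte block is treated as
   four 32-bit words and the words are written in reverse order, while
   the two halfwords inside each word keep their relative order. */
void store_words_reversed(uint16_t dst[8], const uint16_t src[8])
{
    assert(dst != src);  /* the model does not support in-place use */
    for (int w = 0; w < 4; w++)
        memcpy(&dst[2 * w], &src[2 * (3 - w)], 2 * sizeof(uint16_t));
}
```

With src = {1, 2, 3, 4, 5, 6, 7, 8}, the in-order store leaves the array unchanged, while the word-reversed store yields {7, 8, 5, 6, 3, 4, 1, 2}. Code that later reads the buffer as an array of uint16_t therefore sees the right values in the wrong positions, which matches the kind of corruption described in this comment.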
[Bug tree-optimization/85698] [8/9 Regression] CPU2017 525.x264_r fails starting with r257581
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85698

--- Comment #6 from Pat Haugen ---
(In reply to Richard Biener from comment #4)
> I can see what the patch does to this testcase on x86_64 - it enables BB
> vectorization of the first two loops after unrolling. I don't see anything
> suspicious here on x86_64 and 525.x264_r works fine for me.
>
> Can you clarify whether test, ref or train inputs fail for you? I tried
> AVX256, AVX128 and plain old SSE so far without any issue but ref takes some
> time...
>
> Can you check whether the following reduced file produces the same assembly
> for add4x4_idct as in the complete benchmark? If so it should be possible to
> generate a runtime testcase from it. Please attach preprocessed source if
> that doesn't work out.
>
> So far I do suspect we are hitting a latent target issue?
>
> #include <stdint.h>
> static uint8_t x264_clip_uint8( int x )
> {
>   return x&(~255) ? (-x)>>31 : x;
> }
> void add4x4_idct( uint8_t *p_dst, int16_t dct[16])
> {
>   int16_t d[16];
>   int16_t tmp[16];
>   for( int i = 0; i < 4; i++ )
>     {
>       int s02 = dct[0*4+i] + dct[2*4+i];
>       int d02 = dct[0*4+i] - dct[2*4+i];
>       int s13 = dct[1*4+i] + (dct[3*4+i]>>1);
>       int d13 = (dct[1*4+i]>>1) - dct[3*4+i];
>       tmp[i*4+0] = s02 + s13;
>       tmp[i*4+1] = d02 + d13;
>       tmp[i*4+2] = d02 - d13;
>       tmp[i*4+3] = s02 - s13;
>     }
>   for( int i = 0; i < 4; i++ )
>     {
>       int s02 = tmp[0*4+i] + tmp[2*4+i];
>       int d02 = tmp[0*4+i] - tmp[2*4+i];
>       int s13 = tmp[1*4+i] + (tmp[3*4+i]>>1);
>       int d13 = (tmp[1*4+i]>>1) - tmp[3*4+i];
>       d[0*4+i] = ( s02 + s13 + 32 ) >> 6;
>       d[1*4+i] = ( d02 + d13 + 32 ) >> 6;
>       d[2*4+i] = ( d02 - d13 + 32 ) >> 6;
>       d[3*4+i] = ( s02 - s13 + 32 ) >> 6;
>     }
>   for( int y = 0; y < 4; y++ )
>     {
>       for( int x = 0; x < 4; x++ )
>         p_dst[x] = x264_clip_uint8( p_dst[x] + d[y*4+x] );
>       p_dst += 32;
>     }
> }

Yes, that produces similar code, and adding the following to it produces an
executable test that fails at -O3.
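
#include <stdlib.h>  /* for abort() */

int main(void)
{
  uint8_t dst[128];
  int16_t dct[16];
  int i;

  /* Fill the coefficient block and destination with known values. */
  for (i = 0; i < 16; i++)
    dct[i] = i*10 + i;
  for (i = 0; i < 128; i++)
    dst[i] = i;

  add4x4_idct(dst, dct);

  /* Check the first two output rows (rows are 32 bytes apart). */
  if (dst[0] != 14 || dst[1] != 0 || dst[2] != 4 || dst[3] != 2
      || dst[32] != 28 || dst[33] != 35 || dst[34] != 33 || dst[35] != 35)
    abort();
  return 0;
}

Continuing to debug further...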
[Bug tree-optimization/85698] [8/9 Regression] CPU2017 525.x264_r fails starting with r257581
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85698

--- Comment #5 from Pat Haugen ---
(In reply to Richard Biener from comment #4)
>
> Can you clarify whether test, ref or train inputs fail for you? I tried
> AVX256, AVX128 and plain old SSE so far without any issue but ref takes some
> time...
>

I see the error for ref and test inputs. The train input appears to pass, but
then the FDO optimized version also fails with the ref input.

I will keep looking at the other stuff you requested.
[Bug tree-optimization/85698] [8/9 Regression] CPU2017 525.x264_r fails starting with r257581
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85698

--- Comment #4 from Richard Biener ---
I can see what the patch does to this testcase on x86_64 - it enables BB
vectorization of the first two loops after unrolling. I don't see anything
suspicious here on x86_64 and 525.x264_r works fine for me.

Can you clarify whether test, ref or train inputs fail for you? I tried
AVX256, AVX128 and plain old SSE so far without any issue but ref takes some
time...

Can you check whether the following reduced file produces the same assembly
for add4x4_idct as in the complete benchmark? If so it should be possible to
generate a runtime testcase from it. Please attach preprocessed source if
that doesn't work out.

So far I do suspect we are hitting a latent target issue?

#include <stdint.h>
static uint8_t x264_clip_uint8( int x )
{
  return x&(~255) ? (-x)>>31 : x;
}
void add4x4_idct( uint8_t *p_dst, int16_t dct[16])
{
  int16_t d[16];
  int16_t tmp[16];
  for( int i = 0; i < 4; i++ )
    {
      int s02 = dct[0*4+i] + dct[2*4+i];
      int d02 = dct[0*4+i] - dct[2*4+i];
      int s13 = dct[1*4+i] + (dct[3*4+i]>>1);
      int d13 = (dct[1*4+i]>>1) - dct[3*4+i];
      tmp[i*4+0] = s02 + s13;
      tmp[i*4+1] = d02 + d13;
      tmp[i*4+2] = d02 - d13;
      tmp[i*4+3] = s02 - s13;
    }
  for( int i = 0; i < 4; i++ )
    {
      int s02 = tmp[0*4+i] + tmp[2*4+i];
      int d02 = tmp[0*4+i] - tmp[2*4+i];
      int s13 = tmp[1*4+i] + (tmp[3*4+i]>>1);
      int d13 = (tmp[1*4+i]>>1) - tmp[3*4+i];
      d[0*4+i] = ( s02 + s13 + 32 ) >> 6;
      d[1*4+i] = ( d02 + d13 + 32 ) >> 6;
      d[2*4+i] = ( d02 - d13 + 32 ) >> 6;
      d[3*4+i] = ( s02 - s13 + 32 ) >> 6;
    }
  for( int y = 0; y < 4; y++ )
    {
      for( int x = 0; x < 4; x++ )
        p_dst[x] = x264_clip_uint8( p_dst[x] + d[y*4+x] );
      p_dst += 32;
    }
}
[Bug tree-optimization/85698] [8/9 Regression] CPU2017 525.x264_r fails starting with r257581
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85698

--- Comment #3 from Pat Haugen ---
(In reply to Richard Biener from comment #2)
>
> Can you help me with isolating this to a single function inside that file?
> Maybe try sticking __attribute__((optimize("no-tree-vectorize"))) on some
> functions. Oh, there's also the vect_loop debug counter
> (-fdbg-cnt=vect_loop:N).

add4x4_idct() looks like the function, adding the attribute (or
"no-tree-slp-vectorize") to it resulted in a successful run.

> Otherwise I'll have to find a power8 machine where I can set up CPU 2017
> myself (unlikely this week due to public holidays).

Note that it also fails with -mcpu=power7, so a power8 machine is not needed.
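As a rough illustration of the isolation technique discussed above, the sketch below places GCC's optimize attribute on a single function so that SLP vectorization is turned off for that function only, while the rest of the file is still built with the command-line flags. The function add4 and the NO_SLP macro are hypothetical names chosen for this example and are not taken from the benchmark. The compiler check is an assumption added so that compilers without this attribute still build the code.

```c
#include <assert.h>
#include <stdint.h>

/* The attribute string is given without the leading "-f", so this
   corresponds to -fno-tree-slp-vectorize for this one function.
   Compilers other than GCC get an empty macro instead. */
#if defined(__GNUC__) && !defined(__clang__)
#define NO_SLP __attribute__((optimize("no-tree-slp-vectorize")))
#else
#define NO_SLP
#endif

/* Hypothetical stand-in for a suspect function: four independent
   halfword additions of the kind an SLP vectorizer may combine into a
   single vector operation. */
NO_SLP void add4(int16_t out[4], const int16_t a[4], const int16_t b[4])
{
    assert(out && a && b);
    out[0] = (int16_t)(a[0] + b[0]);
    out[1] = (int16_t)(a[1] + b[1]);
    out[2] = (int16_t)(a[2] + b[2]);
    out[3] = (int16_t)(a[3] + b[3]);
}
```

Moving the marker from one function to the next and rerunning the failing workload narrows the miscompile down to a single function, which is how add4x4_idct() was identified here. The attribute only changes how that function is compiled, so a passing run with it in place points at the vectorized code for that function, not at the source logic.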
[Bug tree-optimization/85698] [8/9 Regression] CPU2017 525.x264_r fails starting with r257581
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85698

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |wrong-code
           Priority|P3                          |P2

--- Comment #2 from Richard Biener ---
Hmpf. Sounds like the issue requires "careful" preparation of stmt operand
order (aka SSA name numbering). We've had issues in this area in the past.

Can you help me with isolating this to a single function inside that file?
Maybe try sticking __attribute__((optimize("no-tree-vectorize"))) on some
functions. Oh, there's also the vect_loop debug counter
(-fdbg-cnt=vect_loop:N).

Possibly we simply trigger a latent issue elsewhere when we now recognize
something for SLP vectorization.

Otherwise I'll have to find a power8 machine where I can set up CPU 2017
myself (unlikely this week due to public holidays).
[Bug tree-optimization/85698] [8/9 Regression] CPU2017 525.x264_r fails starting with r257581
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85698

Andrew Pinski changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |8.2
            Summary|CPU2017 525.x264_r fails    |[8/9 Regression] CPU2017
                   |starting with r257581       |525.x264_r fails starting
                   |                            |with r257581