[Bug middle-end/110148] [14 Regression] TSVC s242 regression between g:c0df96b3cda5738afbba3a65bb054183c5cd5530 and g:e4c986fde56a6248f8fbe6cf0704e1da34b055d8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110148

Martin Jambor changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jamborm at gcc dot gnu.org
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |FIXED

--- Comment #8 from Martin Jambor ---
(In reply to cuilili from comment #7)
> (In reply to Martin Jambor from comment #6)
> > I believe this has been fixed?
>
> Yes.

Closing the bug then.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110148

--- Comment #7 from cuilili ---
(In reply to Martin Jambor from comment #6)
> I believe this has been fixed?

Yes.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110148

--- Comment #6 from Martin Jambor ---
I believe this has been fixed?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110148

--- Comment #5 from CVS Commits ---
The master branch has been updated by Lili Cui:

https://gcc.gnu.org/g:4633e38cd22c5e51fac984124c7627be912d0999

commit r14-2185-g4633e38cd22c5e51fac984124c7627be912d0999
Author: Lili Cui
Date:   Thu Jun 29 06:51:56 2023 +

    Avoid adding loop-carried ops to long chains

    Avoid adding loop-carried ops to long chains, otherwise the whole chain will
    have dependencies across the loop iteration. Just keep loop-carried ops in a
    separate chain.
       E.g.
       x_1 = phi(x_0, x_2)
       y_1 = phi(y_0, y_2)

       a + b + c + d + e + x1 + y1

       SSA1 = a + b;
       SSA2 = c + d;
       SSA3 = SSA1 + e;
       SSA4 = SSA3 + SSA2;
       SSA5 = x1 + y1;
       SSA6 = SSA4 + SSA5;

    With the patch applied, these test cases improved by 32%~100%.

    S242:
    for (int i = 1; i < LEN_1D; ++i) {
        a[i] = a[i - 1] + s1 + s2 + b[i] + c[i] + d[i];}

    Case 1:
    for (int i = 1; i < LEN_1D; ++i) {
        a[i] = a[i - 1] + s1 + s2 + b[i] + c[i] + d[i] + e[i];}

    Case 2:
    for (int i = 1; i < LEN_1D; ++i) {
        a[i] = a[i - 1] + b[i - 1] + s1 + s2 + b[i] + c[i] + d[i] + e[i];}

    The value is the execution time
    A: original version
    B: with FMA patch g:e5405f065bace0685cb3b8878d1dfc7a6e7ef409(base on A)
    C: with current patch(base on B)

              A       B       C       B/A           C/A
    s242      2.859   5.152   2.859   1.802028681   1
    case 1    5.489   5.488   3.511   0.999818      0.64
    case 2    7.216   7.499   4.885   1.039218      0.68

    gcc/ChangeLog:

            PR tree-optimization/110148
            * tree-ssa-reassoc.cc (rewrite_expr_tree_parallel): Handle loop-carried
            ops in this function.
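The grouping in the commit message above can be written out by hand to see which additions end up on the cross-iteration path. The following is only a rough sketch in C: the function names are hypothetical and are not part of GCC, the operands are assumed to be doubles, and the "one chain" variant is a deliberately unfavourable ordering chosen for contrast, not the exact expression GCC produced before the patch. Note that regrouping floating-point additions can change rounding, which is why compilers only do it under options such as -ffast-math.

```c
#include <assert.h>

/* Illustration only: these names are hypothetical and do not exist in GCC.
   x1 and y1 stand for the loop-carried values (the phi results in the
   commit message); a..e do not depend on the previous iteration. */

/* One serial chain that starts from the loop-carried values, so all six
   additions sit on the cross-iteration dependency path. */
static double sum_one_chain(double a, double b, double c, double d,
                            double e, double x1, double y1)
{
    return (((((x1 + y1) + a) + b) + c) + d) + e;
}

/* Grouping from the commit message: a..e are summed first (SSA1..SSA4),
   the loop-carried values get their own chain (SSA5), and only the final
   add (SSA6) joins the two. */
static double sum_split_chain(double a, double b, double c, double d,
                              double e, double x1, double y1)
{
    double ssa1 = a + b;
    double ssa2 = c + d;
    double ssa3 = ssa1 + e;
    double ssa4 = ssa3 + ssa2;   /* independent of x1 and y1 */
    double ssa5 = x1 + y1;       /* loop-carried ops kept together */
    double ssa6 = ssa4 + ssa5;
    return ssa6;
}

/* Small usage check: with exactly representable inputs both groupings
   give the same sum, so only the dependency shape differs. */
static void check_groupings_agree(void)
{
    assert(sum_one_chain(1, 2, 3, 4, 5, 6, 7) ==
           sum_split_chain(1, 2, 3, 4, 5, 6, 7));
}
```

In the split form, the work on a..e for the next iteration can in principle overlap with the current one, since only SSA5 and SSA6 wait on the carried values.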
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110148

--- Comment #4 from Jan Hubicka ---
zen3 FMA requires all inputs to be ready before it starts executing, whereas a separate multiply+add can start the multiplication earlier. Not sure if that explains the difference.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110148

--- Comment #3 from cuilili ---
I reproduced S1244 regression on znver3.

Src code:

for (int i = 0; i < LEN_1D-1; i++)
{
    a[i] = b[i] + c[i] * c[i] + b[i] * b[i] + c[i];
    d[i] = a[i] + a[i+1];
}

Base version:                           Base + commit version:

Assembler                               Assembler
Loop1:                                  Loop1:
vmovsd 0x60c400(%rax),%xmm2             vmovsd 0x60ba00(%rax),%xmm2
vmovsd 0x60ba00(%rax),%xmm1             vmovsd 0x60c400(%rax),%xmm1
add    $0x8,%rax                        add    $0x8,%rax
vaddsd %xmm1,%xmm2,%xmm0                vmovsd %xmm2,%xmm2,%xmm0
vmulsd %xmm2,%xmm2,%xmm2                vfmadd132sd %xmm2,%xmm1,%xmm0
vfmadd132sd %xmm1,%xmm2,%xmm1           vfmadd132sd %xmm1,%xmm2,%xmm1
vaddsd %xmm1,%xmm0,%xmm0                vaddsd %xmm1,%xmm0,%xmm0
vmovsd %xmm0,0x60cdf8(%rax)             vmovsd %xmm0,0x60cdf8(%rax)
vaddsd 0x60ce00(%rax),%xmm0,%xmm0       vaddsd 0x60ce00(%rax),%xmm0,%xmm0
vmovsd %xmm0,0x60aff8(%rax)             vmovsd %xmm0,0x60aff8(%rax)
cmp    $0x9f8,%rax                      cmp    $0x9f8,%rax
jne    Loop1:                           jne    Loop1

For the Base version, mult and FMA have dependencies, which increases the latency of the critical dependency chain. I didn't find out why znver3 has regression. Same binary running on ICX has 11% gain (with #define iterations 1).
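The two instruction sequences in comment #3 correspond to two different groupings of the same S1244 expression. The sketch below spells those groupings out in C so the dependency shapes are easier to compare. It is an approximation read off the assembly: the function names are hypothetical, which of b[i] and c[i] lands in which register is an assumption (the expression is symmetric in the two), and whether a compiler actually emits FMA instructions for the marked lines depends on the target and on floating-point contraction settings.

```c
#include <assert.h>
#include <stddef.h>

/* Illustration only: hypothetical names, not taken from TSVC or GCC. */

/* Shape of the "Base version": a separate multiply produces sq, and the
   multiply-add that follows has to wait for it. */
static double expr_mul_then_fma(double b, double c)
{
    double sum = b + c;        /* independent add */
    double sq  = c * c;        /* separate multiply */
    double t   = b * b + sq;   /* FMA candidate, depends on sq */
    return sum + t;
}

/* Shape of the "Base + commit version": two multiply-adds that each
   depend only on the loaded values, joined by one final add. */
static double expr_two_fma(double b, double c)
{
    double t0 = c * c + b;     /* FMA candidate */
    double t1 = b * b + c;     /* FMA candidate, independent of t0 */
    return t0 + t1;
}

/* S1244 loop from comment #3, parameterised on the grouping used. */
static void s1244(double *a, double *d, const double *b, const double *c,
                  size_t len, double (*expr)(double, double))
{
    assert(len >= 1);          /* len - 1 below must not wrap around */
    for (size_t i = 0; i < len - 1; i++) {
        a[i] = expr(b[i], c[i]);
        d[i] = a[i] + a[i + 1];
    }
}
```

Calling `s1244` once with each grouping on the same inputs gives identical arrays when the values are exactly representable; any timing difference between the two shapes then comes from instruction scheduling on the particular core, which is the open question raised in comments #3 and #4.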
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110148

cuilili changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |lili.cui at intel dot com

--- Comment #2 from cuilili ---
The commit changed the break-dependency-chain function in order to generate more FMA. S242 has a chain that needs to be broken. The chain is in a small loop and involves the loop reduction variable a[i-1].

Src code:

for (int i = 1; i < LEN_1D; ++i)
{
    a[i] = a[i - 1] + s1 + s2 + b[i] + c[i] + d[i];
}

--
Base version:

SSA tree
ssa1 = (s1+s2) + b[i];
ssa2 = c[i] + d[i];
ssa3 = ssa1 + ssa2;
ssa4 = ssa3 + a[i-1]

a[i-1] uses xmm1; 2 instructions using xmm1 have dependencies across iterations.

Assembler
Loop1:
vmovsd 0x60c400(%rax),%xmm0
vaddsd 0x60b000(%rax),%xmm3,%xmm2
add    $0x8,%rax
vaddsd 0x60b9f8(%rax),%xmm0,%xmm0
vaddsd %xmm2,%xmm0,%xmm0
vaddsd %xmm0,%xmm1,%xmm1             ---> 1
vmovsd %xmm1,0x60cdf8(%rax)          ---> 2
cmp    $0xa00,%rdx
jne    Loop1

--
Base + commit g:e5405f065bace0685cb3b8878d1dfc7a6e7ef409 version:

a[i-1] uses xmm0; 4 instructions using xmm0 have dependencies across iterations.

SSA tree
ssa1 = (s1+s2) + b[i];
ssa2 = c[i] + d[i];
ssa3 = ssa1 + a[i-1]
ssa4 = ssa2 + ssa3;

Assembler
Loop1:
vaddsdq 0x60b000(%rax), %xmm0, %xmm0     ---> 1
vmovsdq 0x60c400(%rax), %xmm1
add     $0x8, %rax
vaddsdq 0x60b9f8(%rax), %xmm1, %xmm1
vaddsd  %xmm2, %xmm0, %xmm0              ---> 2
vaddsd  %xmm1, %xmm0, %xmm0              ---> 3
vmovsdq %xmm0, 0x60cdf8(%rax)            ---> 4
cmp     $0xa00,%rdx
jne     Loop1
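To make the difference between the two SSA trees in comment #2 concrete, here is a small sketch in C that writes each tree out as an explicit loop body. The function names are hypothetical, LEN_1D is replaced by a `len` parameter, and the code mirrors the SSA-level trees only; the generated assembly shown above can place more instructions on the carried register than the tree suggests.

```c
#include <assert.h>
#include <stddef.h>

/* Illustration only: hypothetical names, not taken from TSVC or GCC. */

/* "Base version" tree: everything that does not depend on a[i-1] is
   summed first, so one add per iteration sits on the carried path. */
static void s242_short_carried_chain(double *a, const double *b,
                                     const double *c, const double *d,
                                     size_t len, double s1, double s2)
{
    assert(a && b && c && d);
    for (size_t i = 1; i < len; ++i) {
        double ssa1 = (s1 + s2) + b[i];
        double ssa2 = c[i] + d[i];
        double ssa3 = ssa1 + ssa2;       /* no dependence on a[i-1] yet */
        a[i] = ssa3 + a[i - 1];          /* only add on the carried path */
    }
}

/* "Base + commit" tree: a[i-1] is added in early, so two adds per
   iteration have to wait for the previous iteration's result. */
static void s242_long_carried_chain(double *a, const double *b,
                                    const double *c, const double *d,
                                    size_t len, double s1, double s2)
{
    assert(a && b && c && d);
    for (size_t i = 1; i < len; ++i) {
        double ssa1 = (s1 + s2) + b[i];
        double ssa2 = c[i] + d[i];
        double ssa3 = ssa1 + a[i - 1];   /* carried value enters early */
        a[i] = ssa2 + ssa3;              /* second add on the carried path */
    }
}
```

Both loops compute the same values for exactly representable inputs. The short-chain version lets the loads and the ssa1..ssa3 additions for later iterations run ahead of the carried add, which is the property the later fix in this bug aims to restore.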
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110148

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|                            |53947
             Target|                            |x86_64-*-*
            Version|13.1.0                      |14.0
            Summary|TSVC s242 regression        |[14 Regression] TSVC s242
                   |between                     |regression between
                   |g:c0df96b3cda5738afbba3a65b |g:c0df96b3cda5738afbba3a65b
                   |b054183c5cd5530 and         |b054183c5cd5530 and
                   |g:e4c986fde56a6248f8fbe6cf0 |g:e4c986fde56a6248f8fbe6cf0
                   |704e1da34b055d8             |704e1da34b055d8
   Target Milestone|---                         |14.0
           Keywords|                            |needs-bisection

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations