[Bug middle-end/110148] [14 Regression] TSVC s242 regression between g:c0df96b3cda5738afbba3a65bb054183c5cd5530 and g:e4c986fde56a6248f8fbe6cf0704e1da34b055d8

2023-09-26 Thread jamborm at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110148

Martin Jambor  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jamborm at gcc dot gnu.org
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |FIXED

--- Comment #8 from Martin Jambor  ---
(In reply to cuilili from comment #7)
> (In reply to Martin Jambor from comment #6)
> > I believe this has been fixed?
> 
> Yes.

Closing the bug then.

[Bug middle-end/110148] [14 Regression] TSVC s242 regression between g:c0df96b3cda5738afbba3a65bb054183c5cd5530 and g:e4c986fde56a6248f8fbe6cf0704e1da34b055d8

2023-09-25 Thread lili.cui at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110148

--- Comment #7 from cuilili  ---
(In reply to Martin Jambor from comment #6)
> I believe this has been fixed?

Yes.

[Bug middle-end/110148] [14 Regression] TSVC s242 regression between g:c0df96b3cda5738afbba3a65bb054183c5cd5530 and g:e4c986fde56a6248f8fbe6cf0704e1da34b055d8

2023-09-23 Thread jamborm at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110148

--- Comment #6 from Martin Jambor  ---
I believe this has been fixed?

[Bug middle-end/110148] [14 Regression] TSVC s242 regression between g:c0df96b3cda5738afbba3a65bb054183c5cd5530 and g:e4c986fde56a6248f8fbe6cf0704e1da34b055d8

2023-06-29 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110148

--- Comment #5 from CVS Commits  ---
The master branch has been updated by Lili Cui :

https://gcc.gnu.org/g:4633e38cd22c5e51fac984124c7627be912d0999

commit r14-2185-g4633e38cd22c5e51fac984124c7627be912d0999
Author: Lili Cui 
Date:   Thu Jun 29 06:51:56 2023 +

Avoid adding loop-carried ops to long chains

Avoid adding loop-carried ops to long chains, otherwise the whole chain will
have dependencies across the loop iteration. Just keep loop-carried ops in a
separate chain.
   E.g.
   x_1 = phi(x_0, x_2)
   y_1 = phi(y_0, y_2)

   a + b + c + d + e + x_1 + y_1

   SSA1 = a + b;
   SSA2 = c + d;
   SSA3 = SSA1 + e;
   SSA4 = SSA3 + SSA2;
   SSA5 = x_1 + y_1;
   SSA6 = SSA4 + SSA5;

With the patch applied, these test cases improved by 32%~100%.

S242:
for (int i = 1; i < LEN_1D; ++i) {
    a[i] = a[i - 1] + s1 + s2 + b[i] + c[i] + d[i];
}

Case 1:
for (int i = 1; i < LEN_1D; ++i) {
    a[i] = a[i - 1] + s1 + s2 + b[i] + c[i] + d[i] + e[i];
}

Case 2:
for (int i = 1; i < LEN_1D; ++i) {
    a[i] = a[i - 1] + b[i - 1] + s1 + s2 + b[i] + c[i] + d[i] + e[i];
}

The values are execution times.
A: original version
B: with FMA patch g:e5405f065bace0685cb3b8878d1dfc7a6e7ef409 (based on A)
C: with current patch (based on B)

            A       B       C       B/A           C/A
s242        2.859   5.152   2.859   1.802028681   1
case 1      5.489   5.488   3.511   0.999818      0.64
case 2      7.216   7.499   4.885   1.039218      0.68

gcc/ChangeLog:

PR tree-optimization/110148
* tree-ssa-reassoc.cc (rewrite_expr_tree_parallel): Handle loop-carried
ops in this function.
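
The grouping in the commit message's example can be written out in plain C. This is only a sketch of the resulting expression shape, not the code in tree-ssa-reassoc.cc; the function names `sum_with_separate_carried_chain` and `run_iterations` are hypothetical, as is the way `x` and `y` are fed back between iterations.

```c
#include <assert.h>

/* Hypothetical helper: evaluates a + b + c + d + e + x + y with the
   grouping shown in the commit message.  x and y stand for the
   loop-carried values x_1 and y_1.  */
static double sum_with_separate_carried_chain(double a, double b, double c,
                                              double d, double e,
                                              double x, double y)
{
    double ssa1 = a + b;
    double ssa2 = c + d;
    double ssa3 = ssa1 + e;
    double ssa4 = ssa3 + ssa2;   /* depends only on values from this iteration */
    double ssa5 = x + y;         /* loop-carried ops kept in their own chain */
    double ssa6 = ssa4 + ssa5;   /* the two chains meet in a single add */
    return ssa6;
}

/* Hypothetical driver: the result becomes the next x and the old x becomes
   the next y, so both are carried across iterations like the two PHIs.  */
static double run_iterations(unsigned iterations, double a, double b,
                             double c, double d, double e)
{
    assert(iterations > 0);
    double x = 0.0, y = 0.0;
    for (unsigned i = 0; i < iterations; ++i) {
        double next = sum_with_separate_carried_chain(a, b, c, d, e, x, y);
        y = x;
        x = next;
    }
    return x;
}
```

With this shape, SSA1 to SSA4 do not depend on the previous iteration, so only SSA5 and SSA6 sit on the cross-iteration path.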

[Bug middle-end/110148] [14 Regression] TSVC s242 regression between g:c0df96b3cda5738afbba3a65bb054183c5cd5530 and g:e4c986fde56a6248f8fbe6cf0704e1da34b055d8

2023-06-25 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110148

--- Comment #4 from Jan Hubicka  ---
On zen3, an FMA requires all inputs to be ready before it starts executing,
whereas a separate multiply+add can start the multiplication earlier. Not sure
if that explains the difference.
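
The scheduling effect described in comment #4 can be illustrated with a toy latency model. This is a sketch under stated assumptions: the `struct latencies` type, the function names, and any cycle counts passed in are hypothetical illustration values, not measurements of zen3 or any other core.

```c
#include <assert.h>

/* Hypothetical per-operation latencies in cycles (illustrative only).  */
struct latencies {
    int mul;
    int add;
    int fma;
};

static int max_int(int a, int b)
{
    return a > b ? a : b;
}

/* Cycle at which m1 * m2 + addend is available when computed as one fused
   operation that cannot start until all three inputs are ready.  */
static int fma_done(struct latencies l, int m1_ready, int m2_ready,
                    int addend_ready)
{
    assert(l.fma > 0);
    int start = max_int(max_int(m1_ready, m2_ready), addend_ready);
    return start + l.fma;
}

/* Same value computed as a separate multiply followed by an add: the
   multiply can start as soon as its two factors are ready, and only the
   add has to wait for the addend.  */
static int mul_add_done(struct latencies l, int m1_ready, int m2_ready,
                        int addend_ready)
{
    assert(l.mul > 0 && l.add > 0);
    int mul_done = max_int(m1_ready, m2_ready) + l.mul;
    return max_int(mul_done, addend_ready) + l.add;
}
```

In this model the fused form finishes first when all inputs arrive together, while the separate form can finish first when the addend arrives late, for example when it is the loop-carried value.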

[Bug middle-end/110148] [14 Regression] TSVC s242 regression between g:c0df96b3cda5738afbba3a65bb054183c5cd5530 and g:e4c986fde56a6248f8fbe6cf0704e1da34b055d8

2023-06-24 Thread lili.cui at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110148

--- Comment #3 from cuilili  ---
I reproduced the S1244 regression on znver3.

Src code:

for (int i = 0; i < LEN_1D-1; i++)
  {
a[i] = b[i] + c[i] * c[i] + b[i] * b[i] + c[i];
d[i] = a[i] + a[i+1];
  }

Base version:                       Base + commit version:

Assembler                           Assembler
Loop1:                              Loop1:
vmovsd 0x60c400(%rax),%xmm2         vmovsd 0x60ba00(%rax),%xmm2
vmovsd 0x60ba00(%rax),%xmm1         vmovsd 0x60c400(%rax),%xmm1
add $0x8,%rax                       add $0x8,%rax

vaddsd %xmm1,%xmm2,%xmm0            vmovsd %xmm2,%xmm2,%xmm0
vmulsd %xmm2,%xmm2,%xmm2            vfmadd132sd %xmm2,%xmm1,%xmm0
vfmadd132sd %xmm1,%xmm2,%xmm1       vfmadd132sd %xmm1,%xmm2,%xmm1

vaddsd %xmm1,%xmm0,%xmm0            vaddsd %xmm1,%xmm0,%xmm0
vmovsd %xmm0,0x60cdf8(%rax)         vmovsd %xmm0,0x60cdf8(%rax)
vaddsd 0x60ce00(%rax),%xmm0,%xmm0   vaddsd 0x60ce00(%rax),%xmm0,%xmm0
vmovsd %xmm0,0x60aff8(%rax)         vmovsd %xmm0,0x60aff8(%rax)
cmp $0x9f8,%rax                     cmp $0x9f8,%rax
jne Loop1                           jne Loop1


In the Base version, the multiply and the FMA depend on each other, which
increases the latency of the critical dependency chain. I have not found out
why znver3 regresses. The same binary running on ICX shows an 11% gain (with
#define iterations 1).
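
The two instruction sequences in comment #3 correspond to two ways of grouping the S1244 expression. The sketch below writes both out in C with plain multiplies and adds; the function names are hypothetical, and since the comment does not say which register holds b[i] and which holds c[i], the assignment here is an assumption (the expression is symmetric in the two).

```c
#include <assert.h>

/* Base shape: one multiply feeds the multiply-add, whose result feeds the
   final add, giving three dependent steps.  */
static double s1244_base_shape(double b, double c)
{
    double t0 = b + c;        /* vaddsd */
    double t1 = c * c;        /* vmulsd */
    double t2 = b * b + t1;   /* vfmadd132sd, waits for t1 */
    return t0 + t2;           /* vaddsd */
}

/* Base + commit shape: two independent multiply-adds, then one add,
   giving two dependent steps.  */
static double s1244_commit_shape(double b, double c)
{
    double t0 = c * c + b;    /* vfmadd132sd */
    double t1 = b * b + c;    /* vfmadd132sd, independent of t0 */
    return t0 + t1;           /* vaddsd */
}

/* Sanity check for the sketch: the shapes agree exactly only when no
   intermediate value is rounded, e.g. for small integer inputs.  */
static void check_shapes_agree(double b, double c)
{
    assert(s1244_base_shape(b, c) == s1244_commit_shape(b, c));
}
```

For general inputs the two groupings may round differently, which is why a compiler only switches between them when floating-point reassociation is allowed.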

[Bug middle-end/110148] [14 Regression] TSVC s242 regression between g:c0df96b3cda5738afbba3a65bb054183c5cd5530 and g:e4c986fde56a6248f8fbe6cf0704e1da34b055d8

2023-06-09 Thread lili.cui at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110148

cuilili  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |lili.cui at intel dot com

--- Comment #2 from cuilili  ---

The commit changed the function that breaks dependency chains, in order to
generate more FMAs. S242 has a chain that needs to be broken. The chain is in a
small loop and is related to the loop reduction variable a[i-1].


Src code:

for (int i = 1; i < LEN_1D; ++i) 
   {
 a[i] = a[i - 1] + s1 + s2 + b[i] + c[i] + d[i];
   }

--
Base version:

SSA tree
ssa1 = (s1+s2) + b[i];
ssa2 = c[i] + d[i];
ssa3 = ssa1+ssa2;
ssa4 = ssa3 + a[i-1]

a[i-1] is kept in xmm1; 2 instructions using xmm1 have dependencies across
iterations.

Assembler
Loop1:
vmovsd 0x60c400(%rax),%xmm0
vaddsd 0x60b000(%rax),%xmm3,%xmm2
add $0x8,%rax
vaddsd 0x60b9f8(%rax),%xmm0,%xmm0
vaddsd %xmm2,%xmm0,%xmm0
vaddsd %xmm0,%xmm1,%xmm1     ---> 1
vmovsd %xmm1,0x60cdf8(%rax)  ---> 2
cmp $0xa00,%rdx
jne Loop1

--
Base + commit g:e5405f065bace0685cb3b8878d1dfc7a6e7ef409 version:

a[i-1] is kept in xmm0; 4 instructions using xmm0 have dependencies across
iterations.

SSA tree
ssa1 = (s1+s2) + b[i];
ssa2 = c[i] + d[i];
ssa3 = ssa1 + a[i-1];
ssa4 = ssa2 + ssa3;

Assembler
Loop1:
vaddsdq  0x60b000(%rax), %xmm0, %xmm0  ---> 1
vmovsdq  0x60c400(%rax), %xmm1
add $0x8, %rax
vaddsdq  0x60b9f8(%rax), %xmm1, %xmm1
vaddsd %xmm2, %xmm0, %xmm0             ---> 2
vaddsd %xmm1, %xmm0, %xmm0             ---> 3
vmovsdq  %xmm0, 0x60cdf8(%rax)         ---> 4
cmp $0xa00,%rdx
jne Loop1
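
The two SSA trees in comment #2 can be written as two C loops that differ only in where a[i-1] enters the sum. This is a sketch for experimenting, not TSVC code: the function names and parameter lists are hypothetical, and `double` is assumed because the assembly uses scalar double instructions. Without -ffast-math or -fassociative-math GCC does not reassociate floating-point additions, so the grouping written here should be preserved.

```c
#include <assert.h>
#include <stddef.h>

/* Base shape: everything independent of the previous iteration is summed
   first, and a[i-1] is added last.  */
static void s242_carried_last(double *a, const double *b, const double *c,
                              const double *d, double s1, double s2,
                              size_t len)
{
    assert(a && b && c && d);
    double s = s1 + s2;                /* loop invariant */
    for (size_t i = 1; i < len; ++i) {
        double ssa1 = s + b[i];
        double ssa2 = c[i] + d[i];
        double ssa3 = ssa1 + ssa2;
        a[i] = ssa3 + a[i - 1];        /* only this add waits for a[i-1] */
    }
}

/* Regressed shape: a[i-1] enters the sum earlier, so later adds also wait
   for the previous iteration.  */
static void s242_carried_early(double *a, const double *b, const double *c,
                               const double *d, double s1, double s2,
                               size_t len)
{
    assert(a && b && c && d);
    double s = s1 + s2;                /* loop invariant */
    for (size_t i = 1; i < len; ++i) {
        double ssa1 = s + b[i];
        double ssa2 = c[i] + d[i];
        double ssa3 = ssa1 + a[i - 1]; /* carried value enters here */
        a[i] = ssa2 + ssa3;            /* and this add depends on it too */
    }
}
```

Both loops compute the same mathematical sum; they can differ in rounding and, as the timings in this thread suggest, in how long each iteration has to wait for the previous one.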

[Bug middle-end/110148] [14 Regression] TSVC s242 regression between g:c0df96b3cda5738afbba3a65bb054183c5cd5530 and g:e4c986fde56a6248f8fbe6cf0704e1da34b055d8

2023-06-09 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110148

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|                            |53947
             Target|                            |x86_64-*-*
            Version|13.1.0                      |14.0
            Summary|TSVC s242 regression        |[14 Regression] TSVC s242
                   |between                     |regression between
                   |g:c0df96b3cda5738afbba3a65b |g:c0df96b3cda5738afbba3a65b
                   |b054183c5cd5530 and         |b054183c5cd5530 and
                   |g:e4c986fde56a6248f8fbe6cf0 |g:e4c986fde56a6248f8fbe6cf0
                   |704e1da34b055d8             |704e1da34b055d8
   Target Milestone|---                         |14.0
           Keywords|                            |needs-bisection


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations