
[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-03-12 Thread hjl dot tools at gmail dot com



--- Comment #30 from hjl dot tools at gmail dot com  2009-03-12 20:21 ---
Fixed.


-- 

hjl dot tools at gmail dot com changed:

   What|Removed |Added

 Status|REOPENED|RESOLVED
 Resolution||FIXED


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824

[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-03-12 Thread hjl at gcc dot gnu dot org



--- Comment #29 from hjl at gcc dot gnu dot org  2009-03-12 16:08 ---
Subject: Bug 38824

Author: hjl
Date: Thu Mar 12 16:08:02 2009
New Revision: 144817

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=144817
Log:
2009-03-12  H.J. Lu  

PR target/38824
* config/i386/i386.md: Compare REGNO on the new peephole2
patterns.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/config/i386/i386.md


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824

[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-03-12 Thread hjl dot tools at gmail dot com



--- Comment #28 from hjl dot tools at gmail dot com  2009-03-12 16:00 ---
(In reply to comment #25)
> patch committed (the changelog was in gcc-patches :-).
> 

This patch caused:

http://gcc.gnu.org/ml/gcc/2009-03/msg00340.html


-- 

hjl dot tools at gmail dot com changed:

   What|Removed |Added

 Status|RESOLVED|REOPENED
 Resolution|FIXED   |


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824

[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-02-16 Thread bonzini at gnu dot org



--- Comment #27 from bonzini at gnu dot org  2009-02-16 09:14 ---
Added bugs corresponding to the patch fallout in case distros want to backport
it (it gave quite a nice boost and probably fixed PR21676 too)


-- 

bonzini at gnu dot org changed:

   What|Removed |Added

 BugsThisDependsOn||39152, 39196
 OtherBugsDependingOnThis||21676

Bug 38824 depends on bug 39152, which changed state.

Bug 39152 Summary: [4.4 regression] Revision 144098 breaks 416.gamess in SPEC CPU 2006
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39152

   What|Old Value   |New Value

 Status|REOPENED|RESOLVED
 Resolution||FIXED

Bug 38824 depends on bug 39196, which changed state.

Bug 39196 Summary: [4.4 Regression] ICE in copyprop_hardreg_forward_1, at regrename.c:1603 during libjava compile
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39196

   What|Old Value   |New Value

 Status|UNCONFIRMED |RESOLVED
 Resolution||FIXED

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824

[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-02-12 Thread hjl at gcc dot gnu dot org



--- Comment #26 from hjl at gcc dot gnu dot org  2009-02-12 15:45 ---
Subject: Bug 38824

Author: hjl
Date: Thu Feb 12 15:45:20 2009
New Revision: 144129

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=144129
Log:
Mention PR target/38824 in ChangeLog entries.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/testsuite/ChangeLog


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824

[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-02-11 Thread bonzini at gnu dot org



--- Comment #25 from bonzini at gnu dot org  2009-02-11 08:57 ---
patch committed (the changelog was in gcc-patches :-).


-- 

bonzini at gnu dot org changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution||FIXED


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824

[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-02-11 Thread ubizjak at gmail dot com



--- Comment #24 from ubizjak at gmail dot com  2009-02-11 08:14 ---
(In reply to comment #23)

> Even though you don't observe the reporter's slowdown from 4.2/4.3 to
> unpatched 4.4, I guess this makes a good case for the patch.  Ok for trunk?

OK with a ChangeLog ;)

BTW: Please watch benchmarks testers [1] for a couple of days...

[1] http://gcc.gnu.org/benchmarks/


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824

[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-02-11 Thread bonzini at gnu dot org



--- Comment #23 from bonzini at gnu dot org  2009-02-11 08:01 ---
Subject: Re:  [4.4 Regression] performance regression of sse code from 4.2/4.3


> [xg...@shgcc-9 38824]$ time ./gcc-42.out
> real    0m1.991s
> 
> [xg...@shgcc-9 38824]$ time ./gcc-44.out
> real    0m1.880s
> 
> [xg...@shgcc-9 38824]$ time ./gcc-44p.out
> real    0m1.690s

Even though you don't observe the reporter's slowdown from 4.2/4.3 to
unpatched 4.4, I guess this makes a good case for the patch.  Ok for trunk?

Paolo
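
For reference, the "real" times quoted above work out to roughly 10% less run time for patched 4.4 against unpatched 4.4 (1.880s to 1.690s) and roughly 15% less against 4.2 (1.991s to 1.690s). The helper below is only a sketch of that arithmetic; the function name is made up for illustration and is not part of the test case.

```cpp
#include <cassert>

// Fraction of run time saved when going from `before` seconds to `after`
// seconds. Hypothetical helper, used only to restate the numbers above.
double time_saved(double before, double after)
{
    assert(before > 0.0);
    return (before - after) / before;
}
```

With the values from comment #22, time_saved(1.880, 1.690) is a little over 0.10 and time_saved(1.991, 1.690) is a little over 0.15.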


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824

[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-02-10 Thread xuepeng dot guo at intel dot com



--- Comment #22 from xuepeng dot guo at intel dot com  2009-02-11 07:37 ---
(In reply to comment #18)
> Xuepeng, can you test with the loop as produced by my posted patch, that is:
> .L11:
>         movaps  (%rsi,%rax), %xmm0
>         addps   %xmm1, %xmm0
>         movaps  %xmm0, (%rdi,%rax)
>         addq    $16, %rax
>         cmpq    %rdx, %rax
>         jne     .L11
> I don't have access to new enough chips.

Your patch improved the performance. My machine is an "Intel(R) Core(TM)2 Quad
CPU Q6700 @ 2.66GHz". The results are:

[xg...@shgcc-9 38824]$ time ./gcc-42.out

real    0m1.991s
user    0m1.990s
sys     0m0.000s
[xg...@shgcc-9 38824]$ time ./gcc-42.out

real    0m1.991s
user    0m1.991s
sys     0m0.001s
[xg...@shgcc-9 38824]$ time ./gcc-42.out

real    0m1.991s
user    0m1.989s
sys     0m0.002s
[xg...@shgcc-9 38824]$ time ./gcc-44.out

real    0m1.880s
user    0m1.879s
sys     0m0.001s
[xg...@shgcc-9 38824]$ time ./gcc-44.out

real    0m1.878s
user    0m1.878s
sys     0m0.000s
[xg...@shgcc-9 38824]$ time ./gcc-44.out

real    0m1.870s
user    0m1.869s
sys     0m0.002s
[xg...@shgcc-9 38824]$ time ./gcc-44p.out

real    0m1.690s
user    0m1.690s
sys     0m0.000s
[xg...@shgcc-9 38824]$ time ./gcc-44p.out

real    0m1.690s
user    0m1.689s
sys     0m0.002s
[xg...@shgcc-9 38824]$ time ./gcc-44p.out

real    0m1.690s
user    0m1.690s
sys     0m0.000s

The only difference is:

--- 44.s        2009-02-11 15:34:57.0 +0800
+++ 44p.s       2009-02-11 15:34:49.0 +0800
@@ -102,8 +102,8 @@ _Z7bench_1PfS_fj:
        .p2align 4,,10
        .p2align 3
 .L11:
-       movaps  %xmm0, %xmm1
-       addps   (%rsi,%rax), %xmm1
+       movaps  (%rsi,%rax), %xmm1
+       addps   %xmm0, %xmm1
        movaps  %xmm1, (%rdi,%rax)
        addq    $16, %rax
        cmpq    %rdx, %rax
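
For readers without the attachment: the mangled name _Z7bench_1PfS_fj demangles to bench_1(float*, float*, float, unsigned int), and the loop in the diff adds one broadcast scalar to each group of four floats. A minimal sketch of a kernel of that shape follows. It is not the attached test case; the parameter names, the alignment requirement and the use of SSE intrinsics are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>  // SSE: _mm_set1_ps, _mm_load_ps, _mm_add_ps, _mm_store_ps

// Hypothetical reconstruction of a kernel with the signature
// bench_1(float*, float*, float, unsigned int).
// Assumes both pointers are 16-byte aligned and n is a non-zero multiple of 4.
void bench_1(float* out, float* in, float f, unsigned int n)
{
    assert(n != 0 && n % 4 == 0);
    assert(reinterpret_cast<std::uintptr_t>(out) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(in) % 16 == 0);

    const __m128 scalar = _mm_set1_ps(f);       // broadcast f to all four lanes
    for (unsigned int i = 0; i != n; i += 4) {
        const __m128 v = _mm_load_ps(in + i);   // aligned load of four floats
        _mm_store_ps(out + i, _mm_add_ps(v, scalar));  // add, then aligned store
    }
}
```

Whether a given compiler emits the 44.s or the 44p.s form of the loop for source like this depends on the GCC version and flags; the sketch does not force either one.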


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824

[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-02-10 Thread bonzini at gnu dot org



--- Comment #21 from bonzini at gnu dot org  2009-02-10 16:39 ---
So my patch should be a uniform win.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824

[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-02-10 Thread dwarak dot rajagopal at amd dot com



--- Comment #20 from dwarak dot rajagopal at amd dot com  2009-02-10 16:28 ---
Paolo,
(a)   movaps  (%rax, %rsi), %xmm0
  addps  %xmm0, %xmm1

(b)   movaps  %xmm0, %xmm1
  addps  (%rax, %rsi), %xmm1

Yes, case (a) is slightly better than case (b). It shouldn't matter much though
on amdfam10 (Shanghai) processors.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824

[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-02-09 Thread bonzini at gnu dot org



--- Comment #19 from bonzini at gnu dot org  2009-02-09 13:37 ---
Also, Dwarak, here the change is not from

addps  (%rax, %rsi), %xmm1

to

movaps  (%rax, %rsi), %xmm0
addps  %xmm0, %xmm1

but rather from

movaps  %xmm0, %xmm1
addps  (%rax, %rsi), %xmm1

to the second snippet above.  Does this pessimize on AMD too?  I don't think
so, it should be 1 uop less, but I'd rather have confirmation.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824

[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-02-09 Thread bonzini at gnu dot org



--- Comment #18 from bonzini at gnu dot org  2009-02-09 13:35 ---
Xuepeng, can you test with the loop as produced by my posted patch, that is:

.L11:
        movaps  (%rsi,%rax), %xmm0
        addps   %xmm1, %xmm0
        movaps  %xmm0, (%rdi,%rax)
        addq    $16, %rax
        cmpq    %rdx, %rax
        jne     .L11

I don't have access to new enough chips.


-- 

bonzini at gnu dot org changed:

   What|Removed |Added

 CC||bonzini at gnu dot org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824

[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-02-09 Thread xuepeng dot guo at intel dot com



--- Comment #17 from xuepeng dot guo at intel dot com  2009-02-09 09:16 ---
Below is the loop from the test case in its original form (compiled by GCC 4.4):

_Z7bench_1PfS_fj:
.LFB2309:
        shrl    $2, %edx
        shufps  $0, %xmm0, %xmm0
        subl    $1, %edx
        xorl    %eax, %eax
        addq    $1, %rdx
        salq    $4, %rdx
        .p2align 4,,10
        .p2align 3
.L11:
        movaps  %xmm0, %xmm1
        addps   (%rsi,%rax), %xmm1
        movaps  %xmm1, (%rdi,%rax)
        addq    $16, %rax
        cmpq    %rdx, %rax
        jne     .L11
        rep
        ret

The time is:

[xg...@shgcc-10 38824]$ g++ 44.s -o orig.out
[xg...@shgcc-10 38824]$ time ./orig.out

real    0m1.878s
user    0m1.877s
sys     0m0.000s
[xg...@shgcc-10 38824]$ time ./orig.out

real    0m1.879s
user    0m1.879s
sys     0m0.001s
[xg...@shgcc-10 38824]$ time ./orig.out

real    0m1.873s
user    0m1.872s
sys     0m0.001s

After adding two nops:

.L11:
        movaps  %xmm0, %xmm1
        nop
        nop
        addps   (%rsi,%rax), %xmm1
        movaps  %xmm1, (%rdi,%rax)
        addq    $16, %rax
        cmpq    %rdx, %rax
        jne     .L11
        rep
        ret

The time is:
[xg...@shgcc-10 38824]$ g++ 44.s -o 2nop.out
[xg...@shgcc-10 38824]$ time ./2nop.out

real    0m1.762s
user    0m1.762s
sys     0m0.000s
[xg...@shgcc-10 38824]$ time ./2nop.out

real    0m1.762s
user    0m1.762s
sys     0m0.000s
[xg...@shgcc-10 38824]$ time ./2nop.out

real    0m1.762s
user    0m1.761s
sys     0m0.000s

I suspect that the code layout may be hurting the performance.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824

[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-02-08 Thread hubicka at gcc dot gnu dot org



--- Comment #16 from hubicka at gcc dot gnu dot org  2009-02-08 12:40 ---
Since the splitting peep2 doesn't seem to be a win in general (it wins only when
copy propagation takes place afterwards) and we don't seem to understand what
really makes the testcase faster, I am unassigning myself until we get a better
idea of what is going on here.

Honza


-- 

hubicka at gcc dot gnu dot org changed:

   What|Removed |Added

 AssignedTo|hubicka at gcc dot gnu dot org|unassigned at gcc dot gnu dot org
 Status|ASSIGNED|NEW


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824

[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-02-08 Thread hubicka at gcc dot gnu dot org



--- Comment #15 from hubicka at gcc dot gnu dot org  2009-02-08 12:36 ---
I tested the patch on SPECfp and core and there is not much difference.  I
guess without somehow tweaking regalloc there is not much to do about this
problem. Xuepeng, if the testcase is core2-variant sensitive, perhaps it is not
related to uops count at all? It seems to me that the bottleneck should lie
elsewhere anyway, as the testcase should be memory bound after all...

Honza


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824

[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-02-07 Thread rob1weld at aol dot com



--- Comment #14 from rob1weld at aol dot com  2009-02-07 16:18 ---
(In reply to comment #8)
> Created an attachment (id=17173)
> --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17173&action=view)
> An extracted test case for this bug.
> 
> Hi tim, I extracted this test case from your website. But I can't exactly
> ...
FWIW.

Platform i386-pc-solaris2.11 on an AMD Athlon X2 4200+:

# /usr/bin/g++ -v
Reading specs from /usr/sfw/lib/gcc/i386-pc-solaris2.11/3.4.3/specs
Configured with: /builds2/sfwnv-gate/usr/src/cmd/gcc/gcc-3.4.3/configure
--prefix=/usr/sfw --with-as=/usr/sfw/bin/gas --with-gnu-as
--with-ld=/usr/ccs/bin/ld --without-gnu-ld --enable-languages=c,c++,f77,objc
--enable-shared
Thread model: posix
gcc version 3.4.3 (csl-sol210-3_4-20050802)

# /opt/csw/gcc3/bin/g++ -v
Reading specs from /opt/csw/gcc3/lib/gcc/i386-pc-solaris2.8/3.4.5/specs
Configured with: ../sources/gcc-3.4.5/configure --prefix=/opt/csw/gcc3
--with-local-prefix=/opt/csw --with-gnu-as --with-as=/opt/csw/bin/gas
--without-gnu-ld --with-ld=/usr/ccs/bin/ld --enable-threads=posix
--enable-shared --enable-multilib --enable-nls --with-included-gettext
--with-libiconv-prefix=/opt/csw --with-x --enable-java-awt=xlib
--enable-languages=all
Thread model: posix
gcc version 3.4.5

# /opt/csw/gcc4/bin/g++ -v
Reading specs from /opt/csw/gcc4/lib/gcc/i386-pc-solaris2.8/4.0.2/specs
Target: i386-pc-solaris2.8
Configured with: ../sources/gcc-4.0.2/configure --prefix=/opt/csw/gcc4
--with-local-prefix=/opt/csw --with-gnu-as --with-as=/opt/csw/bin/gas
--without-gnu-ld --with-ld=/usr/ccs/bin/ld --enable-threads=posix
--enable-shared --enable-multilib --enable-nls --with-included-gettext
--with-libiconv-prefix=/opt/csw --with-x --enable-java-awt=xlib
--with-system-zlib --enable-languages=c,c++,f95,java,objc,ada
Thread model: posix
gcc version 4.0.2

# g++ -v
Using built-in specs.
Target: i386-pc-solaris2.11
Configured with: ../gcc_trunk/configure
--enable-languages=ada,c,c++,fortran,java,objc,obj-c++ --enable-shared
--disable-static --enable-multilib --enable-decimal-float
--with-long-double-128 --with-included-gettext --enable-stage1-checking
--enable-checking=release --with-tune=k8 --with-cpu=k8 --with-arch=k8
--with-gnu-as --with-as=/usr/local/bin/as --without-gnu-ld
--with-ld=/usr/ccs/bin/ld
Thread model: posix
gcc version 4.4.0 20090206 (experimental) [trunk revision 143992] (GCC) 


-

# time ./3.4.3.out
real    0m5.554s
user    0m4.144s
sys     0m0.146s

# time ./3.4.5.out
real    0m5.669s
user    0m4.089s
sys     0m0.141s

# time ./4.0.2.out
real    0m5.266s
user    0m4.023s
sys     0m0.132s

# time ./4.4.0.out
real    0m5.060s
user    0m3.799s
sys     0m0.124s

-

It seems gcc 3.4.3 (csl-sol210-3_4-20050802) is faster than gcc 3.4.5 and 
the current Trunk is ~10% faster (with all the years of progress)

Rob


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824

[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-02-06 Thread dwarak dot rajagopal at amd dot com



--- Comment #13 from dwarak dot rajagopal at amd dot com  2009-02-06 22:35 ---

> The patch makes GCC generate a movaps load followed by addps.  On Core 2 it
> speeds up the testcase from 7s to 6.2s so I guess it works as expected.
> 
> The same however does not reproduce on the AMD box and I am not sure if it is
> just coincidence here or if Core really prefers to split read-execute SSE
> operations (it is not recommended by the manual).

fyi, AMD (amdfam10) prefers load-execute rather than having separate load and
execute instructions. 


-- 

dwarak dot rajagopal at amd dot com changed:

   What|Removed |Added

 CC||dwarak dot rajagopal at amd dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824

[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-02-06 Thread bonzini at gnu dot org



--- Comment #12 from bonzini at gnu dot org  2009-02-06 09:16 ---
There's another peephole2, namely from

[(set (match_operand 0 "register_operand")
  (match_operand 1 "register_operand"))
 (set (match_operand 0 "register_operand")
  (match_operator 3 "arith_or_logical_operator"
  [(match_dup 0)
   (match_operand 2 "memory_operand" "")]))]

to

[(set (match_dup 0) (match_dup 2))
 (set (match_dup 0) (match_op_dup 3 [(match_dup 0) (match_dup 1)]))]

for operands[0] != operands[1] and commutative operator 3 (i.e.
plus,mult,and,ior,xor,smin,smax,umin,umax).  Testing a patch.
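
To make the rewrite concrete, here is a toy version of it over a made-up two-address instruction list. None of this is GCC code: the Insn struct, the operator names as strings and the function names are assumptions for illustration, and the sketch ignores the case where the memory address itself uses register 0, which a real pattern would have to reject.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-in for an RTL insn; not GCC's data structures.
struct Insn {
    enum Kind { Move, Load, RegOp, LoadOp } kind;
    int dst;          // destination register number
    int src;          // source register number (Move, RegOp), else -1
    std::string op;   // operator name (RegOp, LoadOp), else empty
    std::string mem;  // memory operand text (Load, LoadOp), else empty
};

// The commutative operators listed in the comment above.
bool is_commutative(const std::string& op)
{
    static const char* const names[] = {"plus", "mult", "and", "ior", "xor",
                                        "smin", "smax", "umin", "umax"};
    for (const char* name : names)
        if (op == name) return true;
    return false;
}

// Rewrite   r0 = r1 ; r0 = r0 op [mem]
// into      r0 = [mem] ; r0 = r0 op r1
// when r0 != r1 and op is commutative. Returns true if the pair at
// positions i, i + 1 was rewritten. Precondition: i + 1 < code.size().
bool split_commutative(std::vector<Insn>& code, std::size_t i)
{
    assert(i + 1 < code.size());
    Insn& move = code[i];
    Insn& arith = code[i + 1];
    if (move.kind != Insn::Move || arith.kind != Insn::LoadOp) return false;
    if (move.dst != arith.dst) return false;    // both insns must set r0
    if (move.dst == move.src) return false;     // compare register numbers
    if (!is_commutative(arith.op)) return false;

    const int r1 = move.src;
    const std::string mem = arith.mem;
    move = Insn{Insn::Load, move.dst, -1, "", mem};
    arith = Insn{Insn::RegOp, arith.dst, r1, arith.op, ""};
    return true;
}
```

The operator has to be commutative because the rewrite swaps which value sits in register 0 before the operation: r0 op [mem] becomes [mem] op r1, which is only the same result when the operand order does not matter.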


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824

[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-01-25 Thread rguenth at gcc dot gnu dot org



--- Comment #11 from rguenth at gcc dot gnu dot org  2009-01-25 17:56 ---
We seem to have a lot of similar "sse performance regression" P2 bugs, can
someone make sure that there are no duplicates here?


-- 

rguenth at gcc dot gnu dot org changed:

   What|Removed |Added

   Priority|P3  |P2


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824

[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-01-24 Thread tim at klingt dot org



--- Comment #10 from tim at klingt dot org  2009-01-24 13:14 ---
btw, i tried the proposed patch ssef, with no big performance difference:

t...@thinkpad:~/sandbox$ time ./a.out
real    0m2.494s
user    0m2.473s
sys     0m0.002s
t...@thinkpad:~/sandbox$ time ./a.out

real    0m2.479s
user    0m2.475s
sys     0m0.002s
t...@thinkpad:~/sandbox$ time ./a.out

real    0m2.501s
user    0m2.476s
sys     0m0.003s


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824

[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-01-24 Thread tim at klingt dot org



--- Comment #9 from tim at klingt dot org  2009-01-24 09:56 ---
> Hi tim, I extracted this test case from your website. But I can't exactly
> reproduce this bug on my machine with a Core 2 Quad microprocessor. Can you
> first help me check whether my test case is valid? Here is what I got on my
> machine for your reference:

the benchmark test case looks fine.

the times on my machine:
gcc-4.2:
t...@thinkpad:~/sandbox$ time ./a.out

real    0m1.852s
user    0m1.829s
sys     0m0.010s
t...@thinkpad:~/sandbox$ time ./a.out

real    0m1.826s
user    0m1.817s
sys     0m0.002s
t...@thinkpad:~/sandbox$ time ./a.out

real    0m1.833s
user    0m1.826s
sys     0m0.001s

gcc-4.3:
time ./a.out

real    0m2.062s
user    0m2.047s
sys     0m0.002s
t...@thinkpad:~/sandbox$ time ./a.out

real    0m2.061s
user    0m2.043s
sys     0m0.006s
t...@thinkpad:~/sandbox$ time ./a.out

real    0m2.101s
user    0m2.053s
sys     0m0.036s

gcc-4.4 (20090111):
t...@thinkpad:~/sandbox$ time ./a.out

real    0m2.536s
user    0m2.481s
sys     0m0.017s
t...@thinkpad:~/sandbox$ time ./a.out

real    0m2.497s
user    0m2.467s
sys     0m0.003s
t...@thinkpad:~/sandbox$ time ./a.out

real    0m2.539s
user    0m2.484s
sys     0m0.036s

best, tim
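
The numbers in this thread come from running `time` on the whole binary three times by hand. If someone wants to time only the kernel, a small in-process harness along these lines could be used instead. This is a sketch only; best_of is a made-up name and nothing like it is in the attached test case.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <limits>

// Run `work` `runs` times and return the fastest wall-clock time in seconds.
// Taking the minimum is one common way to reduce noise from other processes.
double best_of(int runs, const std::function<void()>& work)
{
    assert(runs > 0);
    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < runs; ++r) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start;
        if (elapsed.count() < best) best = elapsed.count();
    }
    return best;
}
```

A call such as best_of(3, [&] { /* run the benchmark loop */ }) would mirror the three manual runs shown above, with process start-up excluded from the measurement.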


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824

[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-01-23 Thread xuepeng dot guo at intel dot com



--- Comment #8 from xuepeng dot guo at intel dot com  2009-01-24 05:12 ---
Created an attachment (id=17173)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17173&action=view)
An extracted test case for this bug.

Hi tim, I extracted this test case from your website. But I can't exactly
reproduce this bug on my machine with a Core 2 Quad microprocessor. Can you
first help me check whether my test case is valid? Here is what I got on my
machine for your reference:

[xg...@shgcc-10 38824]$ /home/xguo2/app/trunk/bin/g++ -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../src/configure --enable-checking=assert --disable-bootstrap
--enable-languages=c,c++,fortran
Thread model: posix
gcc version 4.4.0 20090121 (experimental) [trunk revision 143537] (GCC)
[xg...@shgcc-10 38824]$ /home/xguo2/app/trunk/bin/g++ -O3 -msse -mfpmath=sse simd_unroll_benchmarks.cpp -o 44.out
[xg...@shgcc-10 38824]$ time ./44.out

real    0m1.877s
user    0m1.876s
sys     0m0.001s
[xg...@shgcc-10 38824]$ time ./44.out

real    0m1.877s
user    0m1.877s
sys     0m0.000s
[xg...@shgcc-10 38824]$ time ./44.out

real    0m1.881s
user    0m1.882s
sys     0m0.000s
[xg...@shgcc-10 38824]$ /home/xguo2/app/usr/gcc-4.2/bin/g++ -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: /net/gnu-13/export/gnu/src/gcc-4.2/gcc/configure
--enable-clocale=gnu --with-system-zlib --with-demangler-in-ld --enable-shared
--enable-threads=posix --enable-haifa --enable-checking=assert
--prefix=/usr/gcc-4.2 --with-local-prefix=/usr/local
Thread model: posix
gcc version 4.2.0
[xg...@shgcc-10 38824]$ /home/xguo2/app/usr/gcc-4.2/bin/g++ -O3 -msse -mfpmath=sse simd_unroll_benchmarks.cpp -o 42.out
[xg...@shgcc-10 38824]$ time ./42.out

real    0m1.991s
user    0m1.991s
sys     0m0.000s
[xg...@shgcc-10 38824]$ time ./42.out

real    0m1.991s
user    0m1.989s
sys     0m0.001s
[xg...@shgcc-10 38824]$ time ./42.out

real    0m1.991s
user    0m1.990s
sys     0m0.000s
[xg...@shgcc-10 38824]$ g++ -v
Using built-in specs.
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man
--infodir=/usr/share/info --enable-shared --enable-threads=posix
--enable-checking=release --with-system-zlib --enable-__cxa_atexit
--disable-libunwind-exceptions --enable-libgcj-multifile
--enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk
--disable-dssi --enable-plugin
--with-java-home=/usr/lib/jvm/java-1.4.2-gcj-1.4.2.0/jre --with-cpu=generic
--host=x86_64-redhat-linux
Thread model: posix
gcc version 4.1.2 20071124 (Red Hat 4.1.2-42)
[xg...@shgcc-10 38824]$ g++ -O3 -msse -mfpmath=sse simd_unroll_benchmarks.cpp -o 41.out
[xg...@shgcc-10 38824]$ time ./41.out

real    0m1.465s
user    0m1.464s
sys     0m0.002s
[xg...@shgcc-10 38824]$ time ./41.out

real    0m1.465s
user    0m1.465s
sys     0m0.000s
[xg...@shgcc-10 38824]$ time ./41.out

real    0m1.465s
user    0m1.464s
sys     0m0.002s


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824

[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-01-23 Thread rguenth at gcc dot gnu dot org



-- 

rguenth at gcc dot gnu dot org changed:

   What|Removed |Added

   Keywords||missed-optimization
 Summary|[4.4 regression] performance regression of sse code from 4.2/4.3|[4.4 Regression] performance regression of sse code from 4.2/4.3
   Target Milestone|--- |4.4.0


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824

[Bug target/38824] [4.4 regression] performance regression of sse code from 4.2/4.3

2009-01-14 Thread hubicka at ucw dot cz



--- Comment #7 from hubicka at ucw dot cz  2009-01-15 01:49 ---
Subject: Re:  [4.4 regression] performance regression of sse code from 4.2/4.3

I guess the main difference here is that a load + addps pair generates 2
uops, while a mov + load-execute addps generates 3, since the move has to go
through the queue.  I will try to change the testcase to fit in cache to see
if the AMD machine reproduces it too.

Honza
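
Spelled out as a tally, the guess above looks like this. It is a toy model only: the enum, the function names and the per-instruction counts simply encode the numbers from this comment and are not taken from any processor manual.

```cpp
#include <initializer_list>

// Instruction kinds appearing in the two loop variants (illustrative only).
enum class Insn { RegMove, Load, RegOp, LoadOp };

// Assumed cost: a load-execute instruction counts as a load uop plus an
// execute uop; every other kind counts as a single uop.
constexpr int uops(Insn i) { return i == Insn::LoadOp ? 2 : 1; }

// Sum the assumed uop counts over an instruction sequence.
constexpr int total_uops(std::initializer_list<Insn> seq)
{
    int total = 0;
    for (Insn i : seq) total += uops(i);
    return total;
}

// movaps (mem), %xmm0 ; addps %xmm1, %xmm0
static_assert(total_uops({Insn::Load, Insn::RegOp}) == 2, "split form");
// movaps %xmm0, %xmm1 ; addps (mem), %xmm1
static_assert(total_uops({Insn::RegMove, Insn::LoadOp}) == 3, "move + load-execute");
```

The model says nothing about whether uop count is the actual bottleneck for this testcase.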


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824

[Bug target/38824] [4.4 regression] performance regression of sse code from 4.2/4.3

2009-01-14 Thread hjl dot tools at gmail dot com



--- Comment #6 from hjl dot tools at gmail dot com  2009-01-15 01:25 ---
(In reply to comment #5)
>
> H.J. perhaps, you can have some advice here?  Or at least can we do some
> benchmarking?
> 

Joey and Xuepeng are looking into it.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824

[Bug target/38824] [4.4 regression] performance regression of sse code from 4.2/4.3

2009-01-14 Thread hubicka at gcc dot gnu dot org



--- Comment #5 from hubicka at gcc dot gnu dot org  2009-01-15 00:30 ---
Created an attachment (id=17106)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17106&action=view)
Proposed patch

The patch makes GCC generate a movaps load followed by addps.  On Core 2 it
speeds up the testcase from 7s to 6.2s so I guess it works as expected.

The same however does not reproduce on the AMD box and I am not sure if it is
just coincidence here or if Core really prefers to split read-execute SSE
operations (it is not recommended by the manual).

H.J., perhaps you have some advice here?  Or at least can we do some
benchmarking?


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824

[Bug target/38824] [4.4 regression] performance regression of sse code from 4.2/4.3

2009-01-14 Thread hubicka at gcc dot gnu dot org



--- Comment #4 from hubicka at gcc dot gnu dot org  2009-01-14 20:31 ---
Actually, perhaps in a simple case like this even a peep2 will work, since
copyprop will fix it up later.  I am trying to add the peephole.


-- 

hubicka at gcc dot gnu dot org changed:

   What|Removed |Added

 AssignedTo|unassigned at gcc dot gnu dot org|hubicka at gcc dot gnu dot org
 Status|UNCONFIRMED |ASSIGNED
 Ever Confirmed|0   |1
 Keywords|missed-optimization |
 Last reconfirmed date|-00-00 00:00:00 |2009-01-14 20:31:52
 Summary|[4.4 Regression] performance regression of sse code from 4.2/4.3|[4.4 regression] performance regression of sse code from 4.2/4.3
 Target Milestone|4.4.0   |---


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824

[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-01-14 Thread rguenth at gcc dot gnu dot org



-- 

rguenth at gcc dot gnu dot org changed:

   What|Removed |Added

   Keywords||missed-optimization
 Summary|[4.4 regression] performance regression of sse code from 4.2/4.3|[4.4 Regression] performance regression of sse code from 4.2/4.3
   Target Milestone|--- |4.4.0


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824

[Bug target/38824] [4.4 regression] performance regression of sse code from 4.2/4.3

2009-01-14 Thread hubicka at gcc dot gnu dot org



--- Comment #3 from hubicka at gcc dot gnu dot org  2009-01-14 20:20 ---
It might be the IRA change.  Chips generally prefer separate load and execute
instructions, as in the old loop, over load+execute, since they are easier to
retire.

Splitting the instruction post-reload probably won't do much good, since there
is an extra move already. If just splitting the instruction would help, we can
macroize:
(define_peephole2
  [(match_scratch:SI 2 "r")
   (parallel [(set (match_operand:SI 0 "register_operand" "")
   (match_operator:SI 3 "arith_or_logical_operator"
 [(match_dup 0)
  (match_operand:SI 1 "memory_operand" "")]))
  (clobber (reg:CC FLAGS_REG))])]
  "optimize_insn_for_speed_p () && ! TARGET_READ_MODIFY"
  [(set (match_dup 2) (match_dup 1))
   (parallel [(set (match_dup 0)
   (match_op_dup 3 [(match_dup 0) (match_dup 2)]))
  (clobber (reg:CC FLAGS_REG))])]
  "") 

peephole for vector modes too.
Vladimir, perhaps IRA can be tweaked here somehow?


-- 

hubicka at gcc dot gnu dot org changed:

   What|Removed |Added

 CC||vmakarov at redhat dot com, hubicka at gcc dot gnu dot org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824

[Bug target/38824] [4.4 regression] performance regression of sse code from 4.2/4.3

2009-01-13 Thread tim at klingt dot org



--- Comment #2 from tim at klingt dot org  2009-01-13 16:22 ---
(In reply to comment #1)
> I don't see how these changes could cause more branch misses.  If you do the
> same .p2align for the 4.4 code does the regression vanish?  I would suspect
> that the loop-stream detector catches one but not the other form for some
> reason.  Maybe the Intel folks can properly analyze this - HJ?

after doing some more tests, i wouldn't think too much about the branch misses.
they seem to be quite dependent on the binary, even on linked libraries. i am
more concerned about the inner loop ...


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824

[Bug target/38824] [4.4 regression] performance regression of sse code from 4.2/4.3

2009-01-13 Thread rguenth at gcc dot gnu dot org



--- Comment #1 from rguenth at gcc dot gnu dot org  2009-01-13 15:07 ---
I don't see how these changes could cause more branch misses.  If you do the
same .p2align for the 4.4 code does the regression vanish?  I would suspect
that the loop-stream detector catches one but not the other form for some
reason.  Maybe the Intel folks can properly analyze this - HJ?


-- 

rguenth at gcc dot gnu dot org changed:

   What|Removed |Added

 CC||hjl at gcc dot gnu dot org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824
