[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
--- Comment #30 from hjl dot tools at gmail dot com 2009-03-12 20:21 ---
Fixed.

--
hjl dot tools at gmail dot com changed:

           What    |Removed                     |Added
             Status|REOPENED                    |RESOLVED
         Resolution|                            |FIXED

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824
[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
--- Comment #29 from hjl at gcc dot gnu dot org 2009-03-12 16:08 ---
Subject: Bug 38824

Author: hjl
Date: Thu Mar 12 16:08:02 2009
New Revision: 144817

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=144817
Log:
2009-03-12  H.J. Lu

        PR target/38824
        * config/i386/i386.md: Compare REGNO on the new peephole2 patterns.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.md

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824
[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
--- Comment #28 from hjl dot tools at gmail dot com 2009-03-12 16:00 ---
(In reply to comment #25)
> patch committed (the changelog was in gcc-patches :-).
>

This patch caused:

http://gcc.gnu.org/ml/gcc/2009-03/msg00340.html

--
hjl dot tools at gmail dot com changed:

           What    |Removed                     |Added
             Status|RESOLVED                    |REOPENED
         Resolution|FIXED                       |

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824
[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
--- Comment #27 from bonzini at gnu dot org 2009-02-16 09:14 ---
Added bugs corresponding to the patch fallout in case distros want to backport
it (it gave quite a nice boost and probably fixed PR21676 too)

--
bonzini at gnu dot org changed:

                    What    |Removed             |Added
           BugsThisDependsOn|                    |39152, 39196
    OtherBugsDependingOnThis|                    |21676

Bug 38824 depends on bug 39152, which changed state.

Bug 39152 Summary: [4.4 regression] Revision 144098 breaks 416.gamess in SPEC CPU 2006
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39152

           What    |Old Value                   |New Value
             Status|REOPENED                    |RESOLVED
         Resolution|                            |FIXED

Bug 38824 depends on bug 39196, which changed state.

Bug 39196 Summary: [4.4 Regression] ICE in copyprop_hardreg_forward_1, at regrename.c:1603 during libjava compile
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39196

           What    |Old Value                   |New Value
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|                            |FIXED

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824
[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
--- Comment #26 from hjl at gcc dot gnu dot org 2009-02-12 15:45 ---
Subject: Bug 38824

Author: hjl
Date: Thu Feb 12 15:45:20 2009
New Revision: 144129

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=144129
Log:
Mention PR target/38824 in ChangeLog entries.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824
[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
--- Comment #25 from bonzini at gnu dot org 2009-02-11 08:57 ---
patch committed (the changelog was in gcc-patches :-).

--
bonzini at gnu dot org changed:

           What    |Removed                     |Added
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824
[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
--- Comment #24 from ubizjak at gmail dot com 2009-02-11 08:14 ---
(In reply to comment #23)
> Even though you don't observe the reporter's slowdown from 4.2/4.3 to
> unpatched 4.4, I guess this makes a good case for the patch. Ok for trunk?

OK with a ChangeLog ;)

BTW: Please watch benchmarks testers [1] for a couple of days...

[1] http://gcc.gnu.org/benchmarks/

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824
[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
--- Comment #23 from bonzini at gnu dot org 2009-02-11 08:01 ---
Subject: Re: [4.4 Regression] performance regression of sse code from 4.2/4.3

> [xg...@shgcc-9 38824]$ time ./gcc-42.out
> real    0m1.991s
>
> [xg...@shgcc-9 38824]$ time ./gcc-44.out
> real    0m1.880s
>
> [xg...@shgcc-9 38824]$ time ./gcc-44p.out
> real    0m1.690s

Even though you don't observe the reporter's slowdown from 4.2/4.3 to
unpatched 4.4, I guess this makes a good case for the patch. Ok for trunk?

Paolo

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824
[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
--- Comment #22 from xuepeng dot guo at intel dot com 2009-02-11 07:37 ---
(In reply to comment #18)
> Xuepeng, can you test with the loop as produced by my posted patch, that is:
> .L11:
>         movaps  (%rsi,%rax), %xmm0
>         addps   %xmm1, %xmm0
>         movaps  %xmm0, (%rdi,%rax)
>         addq    $16, %rax
>         cmpq    %rdx, %rax
>         jne     .L11
> I don't have access to new enough chips.

Your patch improved the performance. My machine is "Intel(R) Core(TM)2 Quad CPU
Q6700 @ 2.66GHz". The results are:

[xg...@shgcc-9 38824]$ time ./gcc-42.out
real    0m1.991s
user    0m1.990s
sys     0m0.000s
[xg...@shgcc-9 38824]$ time ./gcc-42.out
real    0m1.991s
user    0m1.991s
sys     0m0.001s
[xg...@shgcc-9 38824]$ time ./gcc-42.out
real    0m1.991s
user    0m1.989s
sys     0m0.002s

[xg...@shgcc-9 38824]$ time ./gcc-44.out
real    0m1.880s
user    0m1.879s
sys     0m0.001s
[xg...@shgcc-9 38824]$ time ./gcc-44.out
real    0m1.878s
user    0m1.878s
sys     0m0.000s
[xg...@shgcc-9 38824]$ time ./gcc-44.out
real    0m1.870s
user    0m1.869s
sys     0m0.002s

[xg...@shgcc-9 38824]$ time ./gcc-44p.out
real    0m1.690s
user    0m1.690s
sys     0m0.000s
[xg...@shgcc-9 38824]$ time ./gcc-44p.out
real    0m1.690s
user    0m1.689s
sys     0m0.002s
[xg...@shgcc-9 38824]$ time ./gcc-44p.out
real    0m1.690s
user    0m1.690s
sys     0m0.000s

The only difference is:

--- 44.s  2009-02-11 15:34:57.0 +0800
+++ 44p.s 2009-02-11 15:34:49.0 +0800
@@ -102,8 +102,8 @@ _Z7bench_1PfS_fj:
        .p2align 4,,10
        .p2align 3
 .L11:
-       movaps  %xmm0, %xmm1
-       addps   (%rsi,%rax), %xmm1
+       movaps  (%rsi,%rax), %xmm1
+       addps   %xmm0, %xmm1
        movaps  %xmm1, (%rdi,%rax)
        addq    $16, %rax
        cmpq    %rdx, %rax

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824
[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
--- Comment #21 from bonzini at gnu dot org 2009-02-10 16:39 ---
So my patch should be a uniform win.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824
[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
--- Comment #20 from dwarak dot rajagopal at amd dot com 2009-02-10 16:28 ---
Paolo,

(a) movaps (%rax, %rsi), %xmm0
    addps  %xmm0, %xmm1

(b) movaps %xmm0, %xmm1
    addps  (%rax, %rsi), %xmm1

Yes, case (a) is slightly better than case (b). It shouldn't matter much
on amdfam10 (Shanghai) processors, though.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824
[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
--- Comment #19 from bonzini at gnu dot org 2009-02-09 13:37 ---
Also, Dwarak, here the change is not from

    addps  (%rax, %rsi), %xmm1

to

    movaps (%rax, %rsi), %xmm0
    addps  %xmm0, %xmm1

but rather from

    movaps %xmm0, %xmm1
    addps  (%rax, %rsi), %xmm1

to the second snippet above. Does this pessimize on AMD too? I don't think so,
it should be 1 uop less, but I'd rather have confirmation.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824
[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
--- Comment #18 from bonzini at gnu dot org 2009-02-09 13:35 ---
Xuepeng, can you test with the loop as produced by my posted patch, that is:

.L11:
        movaps  (%rsi,%rax), %xmm0
        addps   %xmm1, %xmm0
        movaps  %xmm0, (%rdi,%rax)
        addq    $16, %rax
        cmpq    %rdx, %rax
        jne     .L11

I don't have access to new enough chips.

--
bonzini at gnu dot org changed:

           What    |Removed                     |Added
                 CC|                            |bonzini at gnu dot org

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824
[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
--- Comment #17 from xuepeng dot guo at intel dot com 2009-02-09 09:16 ---
Below is the loop from the test case in its original form (compiled by GCC 4.4):

_Z7bench_1PfS_fj:
.LFB2309:
        shrl    $2, %edx
        shufps  $0, %xmm0, %xmm0
        subl    $1, %edx
        xorl    %eax, %eax
        addq    $1, %rdx
        salq    $4, %rdx
        .p2align 4,,10
        .p2align 3
.L11:
        movaps  %xmm0, %xmm1
        addps   (%rsi,%rax), %xmm1
        movaps  %xmm1, (%rdi,%rax)
        addq    $16, %rax
        cmpq    %rdx, %rax
        jne     .L11
        rep
        ret

The time is:

[xg...@shgcc-10 38824]$ g++ 44.s -o orig.out
[xg...@shgcc-10 38824]$ time ./orig.out
real    0m1.878s
user    0m1.877s
sys     0m0.000s
[xg...@shgcc-10 38824]$ time ./orig.out
real    0m1.879s
user    0m1.879s
sys     0m0.001s
[xg...@shgcc-10 38824]$ time ./orig.out
real    0m1.873s
user    0m1.872s
sys     0m0.001s

After adding two nops:

.L11:
        movaps  %xmm0, %xmm1
        nop
        nop
        addps   (%rsi,%rax), %xmm1
        movaps  %xmm1, (%rdi,%rax)
        addq    $16, %rax
        cmpq    %rdx, %rax
        jne     .L11
        rep
        ret

The time is:

[xg...@shgcc-10 38824]$ g++ 44.s -o 2nop.out
[xg...@shgcc-10 38824]$ time ./2nop.out
real    0m1.762s
user    0m1.762s
sys     0m0.000s
[xg...@shgcc-10 38824]$ time ./2nop.out
real    0m1.762s
user    0m1.762s
sys     0m0.000s
[xg...@shgcc-10 38824]$ time ./2nop.out
real    0m1.762s
user    0m1.761s
sys     0m0.000s

I suspect that the code layout may be hurting the performance.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824
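
For readers without the attachment at hand, here is a minimal sketch of what the loop in comment #17 appears to compute. It is reconstructed only from the assembly and from the mangled symbol `_Z7bench_1PfS_fj`, which demangles to `bench_1(float*, float*, float, unsigned int)`; the parameter names, the preconditions, and the use of SSE intrinsics are assumptions, and the real simd_unroll_benchmarks.cpp may well be written differently.

```cpp
#include <cassert>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#define HAVE_SSE_SKETCH 1
#endif

// Hypothetical reconstruction, not the attached test case.
// Adds the scalar f to in[0..n) and stores the result in out[0..n).
// Assumed preconditions: n is a nonzero multiple of 4 and both pointers
// are 16-byte aligned (the assembly uses aligned movaps accesses).
void bench_1(float* out, float* in, float f, unsigned n)
{
    assert(n != 0 && n % 4 == 0);
#ifdef HAVE_SSE_SKETCH
    const __m128 vf = _mm_set1_ps(f);       // shufps $0: broadcast f to 4 lanes
    unsigned i = 0;
    do {
        __m128 v = _mm_load_ps(in + i);     // movaps (%rsi,%rax), %xmmN
        v = _mm_add_ps(v, vf);              // addps
        _mm_store_ps(out + i, v);           // movaps %xmmN, (%rdi,%rax)
        i += 4;                             // addq $16, %rax (4 floats)
    } while (i != n);                       // cmpq / jne .L11
#else
    // Portable fallback with the same result on targets without SSE.
    for (unsigned i = 0; i != n; ++i)
        out[i] = in[i] + f;
#endif
}
```

Whether the compiler emits the load-then-add form or the move-then-add-from-memory form for this source is exactly what differs between the compilers discussed in this thread, so the generated assembly should be inspected rather than assumed.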
[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
--- Comment #16 from hubicka at gcc dot gnu dot org 2009-02-08 12:40 ---
Since the splitting peep2 doesn't seem to be a win in general (it wins only
when copy propagation takes place afterwards) and we don't seem to understand
what really makes the testcase faster, I am unassigning myself until we get a
better idea of what is going on here.

Honza

--
hubicka at gcc dot gnu dot org changed:

           What    |Removed                        |Added
         AssignedTo|hubicka at gcc dot gnu dot org |unassigned at gcc dot gnu dot org
             Status|ASSIGNED                       |NEW

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824
[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
--- Comment #15 from hubicka at gcc dot gnu dot org 2009-02-08 12:36 ---
I tested the patch on SPECfp and core and there is not much difference. I
guess without somehow tweaking regalloc there is not much to do about this
problem.

Xuepeng, if the testcase is core2-variant sensitive, perhaps it is not related
to uops count at all? It seems to me that the bottleneck should lie elsewhere
anyway, as the testcase should be memory bound after all...

Honza

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824
[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
--- Comment #14 from rob1weld at aol dot com 2009-02-07 16:18 ---
(In reply to comment #8)
> Created an attachment (id=17173)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17173&action=view) [edit]
> An extracted test case for this bug.
>
> Hi tim, I extracted this test case from your website. But I can't exactly
> ...

FWIW.

Platform i386-pc-solaris2.11 on an AMD Athlon X2 4200+:

# /usr/bin/g++ -v
Reading specs from /usr/sfw/lib/gcc/i386-pc-solaris2.11/3.4.3/specs
Configured with: /builds2/sfwnv-gate/usr/src/cmd/gcc/gcc-3.4.3/configure --prefix=/usr/sfw --with-as=/usr/sfw/bin/gas --with-gnu-as --with-ld=/usr/ccs/bin/ld --without-gnu-ld --enable-languages=c,c++,f77,objc --enable-shared
Thread model: posix
gcc version 3.4.3 (csl-sol210-3_4-20050802)

# /opt/csw/gcc3/bin/g++ -v
Reading specs from /opt/csw/gcc3/lib/gcc/i386-pc-solaris2.8/3.4.5/specs
Configured with: ../sources/gcc-3.4.5/configure --prefix=/opt/csw/gcc3 --with-local-prefix=/opt/csw --with-gnu-as --with-as=/opt/csw/bin/gas --without-gnu-ld --with-ld=/usr/ccs/bin/ld --enable-threads=posix --enable-shared --enable-multilib --enable-nls --with-included-gettext --with-libiconv-prefix=/opt/csw --with-x --enable-java-awt=xlib --enable-languages=all
Thread model: posix
gcc version 3.4.5

# /opt/csw/gcc4/bin/g++ -v
Reading specs from /opt/csw/gcc4/lib/gcc/i386-pc-solaris2.8/4.0.2/specs
Target: i386-pc-solaris2.8
Configured with: ../sources/gcc-4.0.2/configure --prefix=/opt/csw/gcc4 --with-local-prefix=/opt/csw --with-gnu-as --with-as=/opt/csw/bin/gas --without-gnu-ld --with-ld=/usr/ccs/bin/ld --enable-threads=posix --enable-shared --enable-multilib --enable-nls --with-included-gettext --with-libiconv-prefix=/opt/csw --with-x --enable-java-awt=xlib --with-system-zlib --enable-languages=c,c++,f95,java,objc,ada
Thread model: posix
gcc version 4.0.2

# g++ -v
Using built-in specs.
Target: i386-pc-solaris2.11
Configured with: ../gcc_trunk/configure --enable-languages=ada,c,c++,fortran,java,objc,obj-c++ --enable-shared --disable-static --enable-multilib --enable-decimal-float --with-long-double-128 --with-included-gettext --enable-stage1-checking --enable-checking=release --with-tune=k8 --with-cpu=k8 --with-arch=k8 --with-gnu-as --with-as=/usr/local/bin/as --without-gnu-ld --with-ld=/usr/ccs/bin/ld
Thread model: posix
gcc version 4.4.0 20090206 (experimental) [trunk revision 143992] (GCC)

-

# time ./3.4.3.out
real    0m5.554s
user    0m4.144s
sys     0m0.146s

# time ./3.4.5.out
real    0m5.669s
user    0m4.089s
sys     0m0.141s

# time ./4.0.2.out
real    0m5.266s
user    0m4.023s
sys     0m0.132s

# time ./4.4.0.out
real    0m5.060s
user    0m3.799s
sys     0m0.124s

-

It seems gcc 3.4.3 (csl-sol210-3_4-20050802) is faster than gcc 3.4.5 and
the current Trunk is ~10% faster (with all the years of progress)

Rob

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824
[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
--- Comment #13 from dwarak dot rajagopal at amd dot com 2009-02-06 22:35 ---
> The patch makes GCC generate a movaps load followed by addps. On Core 2 it
> speeds up the testcase from 7s to 6.2s so I guess it works as expected.
>
> The same however does not reproduce on an AMD box and I am not sure if it is
> just coincidence here or if Core really prefers to split read-execute SSE
> operations (it is not recommended by the manual).

FYI, AMD (amdfam10) prefers load-execute rather than having separate load and
execute instructions.

--
dwarak dot rajagopal at amd dot com changed:

           What    |Removed                     |Added
                 CC|                            |dwarak dot rajagopal at amd dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824
[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
--- Comment #12 from bonzini at gnu dot org 2009-02-06 09:16 ---
There's another peephole2, namely from

  [(set (match_operand 0 "register_operand")
        (match_operand 1 "register_operand"))
   (set (match_operand 0 "register_operand")
        (match_operator 3 "arith_or_logical_operator"
          [(match_dup 0)
           (match_operand 2 "memory_operand" "")]))]

to

  [(set (match_dup 0)
        (match_dup 2))
   (set (match_dup 0)
        (match_op_dup 3 [(match_dup 0) (match_dup 1)]))]

for operands[0] != operands[1] and commutative operator 3 (i.e.
plus,mult,and,ior,xor,smin,smax,umin,umax). Testing a patch.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824
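
The rewrite proposed in comment #12 can be illustrated with a toy model. The sketch below is not GCC code and does not use RTL: the `Insn` struct, the `Kind` enum, `is_commutative` and `split_load` are all hypothetical names invented for illustration, and the list of commutative mnemonics is an assumption. It only shows the shape of the transformation, including the condition that the two registers differ, which is checked by register number in the spirit of the later "Compare REGNO" fix.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy IR: just enough to model the two-instruction rewrite.
enum class Kind { MoveReg, LoadMem, OpRegMem, OpRegReg };

struct Insn {
    Kind kind;
    std::string op;   // mnemonic, e.g. "movaps" or "addps"
    int dst;          // destination register number
    int src_reg;      // source register number, -1 if unused
    std::string mem;  // memory operand text, empty if unused
};

// Assumed subset of commutative operations, for illustration only.
bool is_commutative(const std::string& op)
{
    return op == "addps" || op == "mulps" || op == "andps" ||
           op == "orps"  || op == "xorps";
}

// Rewrites   mov %r1, %r0 ; op mem, %r0
// into       mov mem, %r0 ; op %r1, %r0
// when r0 != r1 (compared by register number) and op is commutative.
// The toy model assumes mem does not mention r0. Returns true on rewrite.
bool split_load(std::vector<Insn>& code, std::size_t i)
{
    if (i + 1 >= code.size()) return false;
    Insn& a = code[i];
    Insn& b = code[i + 1];
    if (a.kind != Kind::MoveReg || b.kind != Kind::OpRegMem) return false;
    if (a.dst != b.dst) return false;          // both must target r0
    if (a.dst == a.src_reg) return false;      // operands[0] != operands[1]
    if (!is_commutative(b.op)) return false;   // operand swap must be legal
    const int r1 = a.src_reg;
    a = Insn{Kind::LoadMem, a.op, a.dst, -1, b.mem};  // load mem into r0
    b = Insn{Kind::OpRegReg, b.op, b.dst, r1, ""};    // r0 = r0 op r1
    return true;
}
```

Applied to the pair `movaps %xmm0, %xmm1` / `addps (%rsi,%rax), %xmm1`, the model yields a load of `(%rsi,%rax)` into `%xmm1` followed by `addps %xmm0, %xmm1`, which matches the diff reported in comment #22.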
[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
--- Comment #11 from rguenth at gcc dot gnu dot org 2009-01-25 17:56 ---
We seem to have a lot of similar "sse performance regression" P2 bugs, can
someone make sure that there are no duplicates here?

--
rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
           Priority|P3                          |P2

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824
[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
--- Comment #10 from tim at klingt dot org 2009-01-24 13:14 ---
btw, i tried the proposed patch ssef, with no big performance difference:

t...@thinkpad:~/sandbox$ time ./a.out
real    0m2.494s
user    0m2.473s
sys     0m0.002s
t...@thinkpad:~/sandbox$ time ./a.out
real    0m2.479s
user    0m2.475s
sys     0m0.002s
t...@thinkpad:~/sandbox$ time ./a.out
real    0m2.501s
user    0m2.476s
sys     0m0.003s

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824
[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
--- Comment #9 from tim at klingt dot org 2009-01-24 09:56 ---
> Hi tim, I extracted this test case from your website. But I can't exactly
> reproduce this bug on my machine with a Core 2 Quad microprocessor. Can you
> help me to check whether my test case is valid first? Here I post what I got
> on my machine for your reference:

the benchmark test case looks fine. the times on my machine:

gcc-4.2:
t...@thinkpad:~/sandbox$ time ./a.out
real    0m1.852s
user    0m1.829s
sys     0m0.010s
t...@thinkpad:~/sandbox$ time ./a.out
real    0m1.826s
user    0m1.817s
sys     0m0.002s
t...@thinkpad:~/sandbox$ time ./a.out
real    0m1.833s
user    0m1.826s
sys     0m0.001s

gcc-4.3:
time ./a.out
real    0m2.062s
user    0m2.047s
sys     0m0.002s
t...@thinkpad:~/sandbox$ time ./a.out
real    0m2.061s
user    0m2.043s
sys     0m0.006s
t...@thinkpad:~/sandbox$ time ./a.out
real    0m2.101s
user    0m2.053s
sys     0m0.036s

gcc-4.4 (20090111):
t...@thinkpad:~/sandbox$ time ./a.out
real    0m2.536s
user    0m2.481s
sys     0m0.017s
t...@thinkpad:~/sandbox$ time ./a.out
real    0m2.497s
user    0m2.467s
sys     0m0.003s
t...@thinkpad:~/sandbox$ time ./a.out
real    0m2.539s
user    0m2.484s
sys     0m0.036s

best, tim

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824
[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
--- Comment #8 from xuepeng dot guo at intel dot com 2009-01-24 05:12 ---
Created an attachment (id=17173)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17173&action=view)
An extracted test case for this bug.

Hi tim, I extracted this test case from your website. But I can't exactly
reproduce this bug on my machine with a Core 2 Quad microprocessor. Can you
help me to check whether my test case is valid first? Here I post what I got
on my machine for your reference:

[xg...@shgcc-10 38824]$ /home/xguo2/app/trunk/bin/g++ -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../src/configure --enable-checking=assert --disable-bootstrap --enable-languages=c,c++,fortran
Thread model: posix
gcc version 4.4.0 20090121 (experimental) [trunk revision 143537] (GCC)
[xg...@shgcc-10 38824]$ /home/xguo2/app/trunk/bin/g++ -O3 -msse -mfpmath=sse simd_unroll_benchmarks.cpp -o 44.out
[xg...@shgcc-10 38824]$ time ./44.out
real    0m1.877s
user    0m1.876s
sys     0m0.001s
[xg...@shgcc-10 38824]$ time ./44.out
real    0m1.877s
user    0m1.877s
sys     0m0.000s
[xg...@shgcc-10 38824]$ time ./44.out
real    0m1.881s
user    0m1.882s
sys     0m0.000s

[xg...@shgcc-10 38824]$ /home/xguo2/app/usr/gcc-4.2/bin/g++ -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: /net/gnu-13/export/gnu/src/gcc-4.2/gcc/configure --enable-clocale=gnu --with-system-zlib --with-demangler-in-ld --enable-shared --enable-threads=posix --enable-haifa --enable-checking=assert --prefix=/usr/gcc-4.2 --with-local-prefix=/usr/local
Thread model: posix
gcc version 4.2.0
[xg...@shgcc-10 38824]$ /home/xguo2/app/usr/gcc-4.2/bin/g++ -O3 -msse -mfpmath=sse simd_unroll_benchmarks.cpp -o 42.out
[xg...@shgcc-10 38824]$ time ./42.out
real    0m1.991s
user    0m1.991s
sys     0m0.000s
[xg...@shgcc-10 38824]$ time ./42.out
real    0m1.991s
user    0m1.989s
sys     0m0.001s
[xg...@shgcc-10 38824]$ time ./42.out
real    0m1.991s
user    0m1.990s
sys     0m0.000s

[xg...@shgcc-10 38824]$ g++ -v
Using built-in specs.
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-libgcj-multifile --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --enable-plugin --with-java-home=/usr/lib/jvm/java-1.4.2-gcj-1.4.2.0/jre --with-cpu=generic --host=x86_64-redhat-linux
Thread model: posix
gcc version 4.1.2 20071124 (Red Hat 4.1.2-42)
[xg...@shgcc-10 38824]$ g++ -O3 -msse -mfpmath=sse simd_unroll_benchmarks.cpp -o 41.out
[xg...@shgcc-10 38824]$ time ./41.out
real    0m1.465s
user    0m1.464s
sys     0m0.002s
[xg...@shgcc-10 38824]$ time ./41.out
real    0m1.465s
user    0m1.465s
sys     0m0.000s
[xg...@shgcc-10 38824]$ time ./41.out
real    0m1.465s
user    0m1.464s
sys     0m0.002s

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824
[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
--
rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
           Keywords|                            |missed-optimization
            Summary|[4.4 regression] performance regression of sse code from 4.2/4.3
                   |                            |[4.4 Regression] performance regression of sse code from 4.2/4.3
   Target Milestone|---                         |4.4.0

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824
[Bug target/38824] [4.4 regression] performance regression of sse code from 4.2/4.3
--- Comment #7 from hubicka at ucw dot cz 2009-01-15 01:49 ---
Subject: Re: [4.4 regression] performance regression of sse code from 4.2/4.3

I guess the main difference here is that the load + addps pair generates 2
uops, while mov + loading addps generates 3, since the move has to go through
the queue. I will try to change the testcase to fit in cache to see if the AMD
machine reproduces it too.

Honza

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824
[Bug target/38824] [4.4 regression] performance regression of sse code from 4.2/4.3
--- Comment #6 from hjl dot tools at gmail dot com 2009-01-15 01:25 ---
(In reply to comment #5)
>
> H.J., perhaps you have some advice here? Or at least can we do some
> benchmarking?
>

Joey and Xuepeng are looking into it.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824
[Bug target/38824] [4.4 regression] performance regression of sse code from 4.2/4.3
--- Comment #5 from hubicka at gcc dot gnu dot org 2009-01-15 00:30 ---
Created an attachment (id=17106)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17106&action=view)
Proposed patch

The patch makes GCC generate a movaps load followed by addps. On Core 2 it
speeds up the testcase from 7s to 6.2s so I guess it works as expected.

The same however does not reproduce on an AMD box and I am not sure if it is
just coincidence here or if Core really prefers to split read-execute SSE
operations (it is not recommended by the manual).

H.J., perhaps you have some advice here? Or at least can we do some
benchmarking?

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824
[Bug target/38824] [4.4 regression] performance regression of sse code from 4.2/4.3
--- Comment #4 from hubicka at gcc dot gnu dot org 2009-01-14 20:31 ---
Actually, perhaps in a simple case like this even peep2 will work, since
copyprop will fix it later. I am trying to add the peep.

--
hubicka at gcc dot gnu dot org changed:

           What    |Removed                           |Added
         AssignedTo|unassigned at gcc dot gnu dot org |hubicka at gcc dot gnu dot org
             Status|UNCONFIRMED                       |ASSIGNED
     Ever Confirmed|0                                 |1
           Keywords|missed-optimization               |
Last reconfirmed date|0000-00-00 00:00:00             |2009-01-14 20:31:52
            Summary|[4.4 Regression] performance regression of sse code from 4.2/4.3
                   |                                  |[4.4 regression] performance regression of sse code from 4.2/4.3
   Target Milestone|4.4.0                             |---

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824
[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3
--
rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
           Keywords|                            |missed-optimization
            Summary|[4.4 regression] performance regression of sse code from 4.2/4.3
                   |                            |[4.4 Regression] performance regression of sse code from 4.2/4.3
   Target Milestone|---                         |4.4.0

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824
[Bug target/38824] [4.4 regression] performance regression of sse code from 4.2/4.3
--- Comment #3 from hubicka at gcc dot gnu dot org 2009-01-14 20:20 ---
It might be the IRA change. Chips generally prefer separate load and execute
instructions, as in the old loop, over load+execute, since they are easier to
retire.

Splitting the instruction post reload probably won't do much good, since there
is an extra move already. If just splitting the instruction would help, we can
macroize:

(define_peephole2
  [(match_scratch:SI 2 "r")
   (parallel [(set (match_operand:SI 0 "register_operand" "")
                   (match_operator:SI 3 "arith_or_logical_operator"
                     [(match_dup 0)
                      (match_operand:SI 1 "memory_operand" "")]))
              (clobber (reg:CC FLAGS_REG))])]
  "optimize_insn_for_speed_p () && ! TARGET_READ_MODIFY"
  [(set (match_dup 2) (match_dup 1))
   (parallel [(set (match_dup 0)
                   (match_op_dup 3 [(match_dup 0) (match_dup 2)]))
              (clobber (reg:CC FLAGS_REG))])]
  "")

peephole for vector modes too.

Vladimir, perhaps IRA can be tweaked here somehow?

--
hubicka at gcc dot gnu dot org changed:

           What    |Removed                     |Added
                 CC|                            |vmakarov at redhat dot com, hubicka at gcc dot gnu dot org

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824
[Bug target/38824] [4.4 regression] performance regression of sse code from 4.2/4.3
--- Comment #2 from tim at klingt dot org 2009-01-13 16:22 ---
(In reply to comment #1)
> I don't see how these changes could cause more branch misses. If you do the
> same .p2align for the 4.4 code does the regression vanish? I would suspect
> that the loop-stream detector catches one but not the other form for some
> reason. Maybe the Intel folks can properly analyze this - HJ?

after doing some more tests, i wouldn't think too much about the branch misses.
they seem to be quite dependent on the binary, even on linked libraries. i am
more concerned about the inner loop ...

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824
[Bug target/38824] [4.4 regression] performance regression of sse code from 4.2/4.3
--- Comment #1 from rguenth at gcc dot gnu dot org 2009-01-13 15:07 ---
I don't see how these changes could cause more branch misses. If you do the
same .p2align for the 4.4 code does the regression vanish? I would suspect
that the loop-stream detector catches one but not the other form for some
reason. Maybe the Intel folks can properly analyze this - HJ?

--
rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
                 CC|                            |hjl at gcc dot gnu dot org

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824