[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-03-12 Thread hjl dot tools at gmail dot com
--- Comment #30 from hjl dot tools at gmail dot com 2009-03-12 20:21 --- Fixed. -- hjl dot tools at gmail dot com changed: What|Removed |Added Status|REOPENE

[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-03-12 Thread hjl at gcc dot gnu dot org
--- Comment #29 from hjl at gcc dot gnu dot org 2009-03-12 16:08 --- Subject: Bug 38824 Author: hjl Date: Thu Mar 12 16:08:02 2009 New Revision: 144817 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=144817 Log: 2009-03-12 H.J. Lu PR target/38824 * config/i386

[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-03-12 Thread hjl dot tools at gmail dot com
--- Comment #28 from hjl dot tools at gmail dot com 2009-03-12 16:00 --- (In reply to comment #25) > patch committed (the changelog was in gcc-patches :-). > This patch caused: http://gcc.gnu.org/ml/gcc/2009-03/msg00340.html -- hjl dot tools at gmail dot com changed:

[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-02-16 Thread bonzini at gnu dot org
--- Comment #27 from bonzini at gnu dot org 2009-02-16 09:14 --- Added bugs corresponding to the patch fallout in case distros want to backport it (it gave quite a nice boost and probably fixed PR21676 too) -- bonzini at gnu dot org changed: What|Removed

[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-02-12 Thread hjl at gcc dot gnu dot org
--- Comment #26 from hjl at gcc dot gnu dot org 2009-02-12 15:45 --- Subject: Bug 38824 Author: hjl Date: Thu Feb 12 15:45:20 2009 New Revision: 144129 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=144129 Log: Mention PR target/38824 in ChangeLog entries. Modified: trunk/g

[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-02-11 Thread bonzini at gnu dot org
--- Comment #25 from bonzini at gnu dot org 2009-02-11 08:57 --- patch committed (the changelog was in gcc-patches :-). -- bonzini at gnu dot org changed: What|Removed |Added -

[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-02-11 Thread ubizjak at gmail dot com
--- Comment #24 from ubizjak at gmail dot com 2009-02-11 08:14 --- (In reply to comment #23) > Even though you don't observe the reporter's slowdown from 4.2/4.3 to > unpatched 4.4, I guess this makes a good case for the patch. Ok for trunk? OK with a ChangeLog ;) BTW: Please watch b

[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-02-11 Thread bonzini at gnu dot org
--- Comment #23 from bonzini at gnu dot org 2009-02-11 08:01 --- Subject: Re: [4.4 Regression] performance regression of sse code from 4.2/4.3 > [xg...@shgcc-9 38824]$ time ./gcc-42.out > real 0m1.991s > > [xg...@shgcc-9 38824]$ time ./gcc-44.out > real 0m1.880s > > [xg...@sh

[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-02-10 Thread xuepeng dot guo at intel dot com
--- Comment #22 from xuepeng dot guo at intel dot com 2009-02-11 07:37 --- (In reply to comment #18) > Xuepeng, can you test with the loop as produced by my posted patch, that is: > .L11: > movaps (%rsi,%rax), %xmm0 > addps %xmm1, %xmm0 > movaps %xmm0, (%rdi,

[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-02-10 Thread bonzini at gnu dot org
--- Comment #21 from bonzini at gnu dot org 2009-02-10 16:39 --- So my patch should be a uniform win. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38824

[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-02-10 Thread dwarak dot rajagopal at amd dot com
--- Comment #20 from dwarak dot rajagopal at amd dot com 2009-02-10 16:28 --- Paolo, (a) movaps (%rax, %rsi), %xmm0 addps %xmm0, %xmm1 (b) movaps %xmm0, %xmm1 addps (%rax, %rsi), %xmm1 Yes, case (a) is slightly better than case (b). It shouldn't matter much though

[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-02-09 Thread bonzini at gnu dot org
--- Comment #19 from bonzini at gnu dot org 2009-02-09 13:37 --- Also, Dwarak, here the change is not from addps (%rax, %rsi), %xmm1 to movaps (%rax, %rsi), %xmm0 addps %xmm0, %xmm1 but rather from movaps %xmm0, %xmm1 addps (%rax, %rsi), %xmm1 to the second s

[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-02-09 Thread bonzini at gnu dot org
--- Comment #18 from bonzini at gnu dot org 2009-02-09 13:35 --- Xuepeng, can you test with the loop as produced by my posted patch, that is: .L11: movaps (%rsi,%rax), %xmm0 addps %xmm1, %xmm0 movaps %xmm0, (%rdi,%rax) addq $16, %rax cmpq

[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-02-09 Thread xuepeng dot guo at intel dot com
--- Comment #17 from xuepeng dot guo at intel dot com 2009-02-09 09:16 --- Below is a loop in the case in its original form (compiled by GCC 4.4): _Z7bench_1PfS_fj: .LFB2309: shrl $2, %edx shufps $0, %xmm0, %xmm0 subl $1, %edx xorl %eax, %eax
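
For readers following the assembly in these comments, the loop under discussion can be sketched in C with SSE intrinsics. This is only a plausible reconstruction, not the reporter's actual test case: the signature is inferred from the mangled name `_Z7bench_1PfS_fj` (which demangles to `bench_1(float*, float*, float, unsigned int)`), while the parameter names, the assumption that `n` is a multiple of 4, and the alignment check are illustrative additions.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h> /* SSE intrinsics; assumes an x86 target */

/*
 * Hypothetical reconstruction of the benchmark kernel: add the scalar f
 * to every element of src and store the result in dst.
 * Assumptions (not from the bug report): n is a multiple of 4, and both
 * pointers are 16-byte aligned, as the aligned movaps accesses require.
 */
void bench_1(float *dst, float *src, float f, unsigned int n)
{
    assert(((uintptr_t)dst & 15) == 0 && ((uintptr_t)src & 15) == 0);

    /* Broadcast f to all four lanes (the shufps $0 in the listing). */
    __m128 vf = _mm_set1_ps(f);

    /* n / 4 iterations, matching the shrl $2 on the element count. */
    for (unsigned int i = 0; i < n / 4; ++i) {
        __m128 v = _mm_load_ps(src + 4 * i);          /* aligned load  */
        _mm_store_ps(dst + 4 * i, _mm_add_ps(v, vf)); /* add and store */
    }
}
```

Whether the compiler emits a separate `movaps` load followed by a register-register `addps`, or a register copy followed by an `addps` with a memory operand, is the code-generation difference the thread is debating; the C source is the same in both cases.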

[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-02-08 Thread hubicka at gcc dot gnu dot org
--- Comment #16 from hubicka at gcc dot gnu dot org 2009-02-08 12:40 --- Since the splitting peep2 doesn't seem to be a win in general (it wins only when copy propagation takes place afterwards) and we don't seem to understand what really makes the testcase faster, I am unassigning myself un

[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-02-08 Thread hubicka at gcc dot gnu dot org
--- Comment #15 from hubicka at gcc dot gnu dot org 2009-02-08 12:36 --- I tested the patch on SPECfp and core and there is not much difference. I guess without somehow tweaking regalloc there is not much to do about this problem. Xuepeng, if the testcase is core2-variant sensitive, pe

[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-02-07 Thread rob1weld at aol dot com
--- Comment #14 from rob1weld at aol dot com 2009-02-07 16:18 --- (In reply to comment #8) > Created an attachment (id=17173) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17173&action=view) > An extracted test case for this bug. > > Hi tim, I extracted this test case from

[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-02-06 Thread dwarak dot rajagopal at amd dot com
--- Comment #13 from dwarak dot rajagopal at amd dot com 2009-02-06 22:35 --- > The patch makes GCC generate a movaps load followed by addps. On Core 2 it > speeds up the testcase from 7s to 6.2s so I guess it works as expected. > > The same however does not reproduce on AMD box and

[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-02-06 Thread bonzini at gnu dot org
--- Comment #12 from bonzini at gnu dot org 2009-02-06 09:16 --- There's another peephole2, namely from [(set (match_operand 0 "register_operand") (match_operand 1 "register_operand")) (set (match_operand 0 "register_operand") (match_operator 3 "arith_or_logical_operator"

[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-01-25 Thread rguenth at gcc dot gnu dot org
--- Comment #11 from rguenth at gcc dot gnu dot org 2009-01-25 17:56 --- We seem to have a lot of similar "sse performance regression" P2 bugs, can someone make sure that there are no duplicates here? -- rguenth at gcc dot gnu dot org changed: What|Removed

[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-01-24 Thread tim at klingt dot org
--- Comment #10 from tim at klingt dot org 2009-01-24 13:14 --- btw, i tried the proposed patch ssef, with no big performance difference: t...@thinkpad:~/sandbox$ time ./a.out real 0m2.494s user 0m2.473s sys 0m0.002s t...@thinkpad:~/sandbox$ time ./a.out real 0m2.479s us

[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-01-24 Thread tim at klingt dot org
--- Comment #9 from tim at klingt dot org 2009-01-24 09:56 --- > Hi tim, I extracted this test case from your website. But I can't exactly > reproduce this bug on my machine with a Core 2 Quad microprocessor. Can you > help me check whether my test case is valid first? Here I post

[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-01-23 Thread xuepeng dot guo at intel dot com
--- Comment #8 from xuepeng dot guo at intel dot com 2009-01-24 05:12 --- Created an attachment (id=17173) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17173&action=view) An extracted test case for this bug. Hi tim, I extracted this test case from your website. But I can't exact

[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-01-23 Thread rguenth at gcc dot gnu dot org
-- rguenth at gcc dot gnu dot org changed: What|Removed |Added Keywords||missed-optimization Summary|[4.4 regression] performance|[

[Bug target/38824] [4.4 regression] performance regression of sse code from 4.2/4.3

2009-01-14 Thread hubicka at ucw dot cz
--- Comment #7 from hubicka at ucw dot cz 2009-01-15 01:49 --- Subject: Re: [4.4 regression] performance regression of sse code from 4.2/4.3 I guess the main difference here is that the load + addps pair generates 2 uops, while mov + loading addps generates 3 since the move has to go through

[Bug target/38824] [4.4 regression] performance regression of sse code from 4.2/4.3

2009-01-14 Thread hjl dot tools at gmail dot com
--- Comment #6 from hjl dot tools at gmail dot com 2009-01-15 01:25 --- (In reply to comment #5) > > H.J. perhaps, you can have some advice here? Or at least can we do some > benchmarking? > Joey and Xuepeng are looking into it. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=3882

[Bug target/38824] [4.4 regression] performance regression of sse code from 4.2/4.3

2009-01-14 Thread hubicka at gcc dot gnu dot org
--- Comment #5 from hubicka at gcc dot gnu dot org 2009-01-15 00:30 --- Created an attachment (id=17106) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17106&action=view) Proposed patch The patch makes GCC generate a movaps load followed by addps. On Core 2 it speeds up the testc

[Bug target/38824] [4.4 regression] performance regression of sse code from 4.2/4.3

2009-01-14 Thread hubicka at gcc dot gnu dot org
--- Comment #4 from hubicka at gcc dot gnu dot org 2009-01-14 20:31 --- Actually, perhaps in a simple case like this even peep2 will work, since copyprop will fix it later. I am trying to add the peep -- hubicka at gcc dot gnu dot org changed: What|Removed

[Bug target/38824] [4.4 Regression] performance regression of sse code from 4.2/4.3

2009-01-14 Thread rguenth at gcc dot gnu dot org
-- rguenth at gcc dot gnu dot org changed: What|Removed |Added Keywords||missed-optimization Summary|[4.4 regression] performance|[

[Bug target/38824] [4.4 regression] performance regression of sse code from 4.2/4.3

2009-01-14 Thread hubicka at gcc dot gnu dot org
--- Comment #3 from hubicka at gcc dot gnu dot org 2009-01-14 20:20 --- It might be the IRA change. Chips generally prefer separate load and execute instructions, as in the old loop, over the load+execute since they are easier to retire. Splitting the instruction post reload probably won't d

[Bug target/38824] [4.4 regression] performance regression of sse code from 4.2/4.3

2009-01-13 Thread tim at klingt dot org
--- Comment #2 from tim at klingt dot org 2009-01-13 16:22 --- (In reply to comment #1) > I don't see how these changes could cause more branch misses. If you do the > same .palign for the 4.4 code does the regression vanish? I would suspect > that the loop-stream detector catches one b

[Bug target/38824] [4.4 regression] performance regression of sse code from 4.2/4.3

2009-01-13 Thread rguenth at gcc dot gnu dot org
--- Comment #1 from rguenth at gcc dot gnu dot org 2009-01-13 15:07 --- I don't see how these changes could cause more branch misses. If you do the same .palign for the 4.4 code does the regression vanish? I would suspect that the loop-stream detector catches one but not the other form