[Bug target/38134] [4.4 Regression] speed regression with inline-asm sse code
--- Comment #10 from bonzini at gnu dot org 2009-02-03 09:47 ---
Can you try the patch of PR38824?

--
bonzini at gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |WAITING

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38134
[Bug target/38134] [4.4 Regression] speed regression with inline-asm sse code
--- Comment #12 from bonzini at gnu dot org 2009-02-03 11:17 ---
What if we forbid memory operands altogether and *synthesize* them with a
peephole2?

Anyway, it seems safe to me to declare this a dup of PR38824?

--
bonzini at gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|WAITING                     |NEW
     Ever Confirmed|0                           |1
   Last reconfirmed|2009-02-03 10:36:46         |2009-02-03 11:17:38
               date|                            |

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38134
[Bug target/38134] [4.4 Regression] speed regression with inline-asm sse code
--- Comment #11 from ubizjak at gmail dot com 2009-02-03 10:36 ---
(In reply to comment #10)
> Can you try the patch of PR38824?

I have tried with a similar peephole2 recognizer. The problem is that there is
no spare x register to allocate as a temporary, so peephole2 is ineffective in
this particular case.

--
ubizjak at gmail dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|-00-00 00:00:00             |2009-02-03 10:36:46
               date|                            |

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38134
[Bug target/38134] [4.4 Regression] speed regression with inline-asm sse code
--- Comment #13 from ubizjak at gmail dot com 2009-02-03 11:34 ---
(In reply to comment #12)
> What if we forbid altogether memory operands and we *synthesize* them with a
> peephole2?
>
> Anyway, it seems safe to me to declare this a dup of PR38824?

I think that we will hit PR 19398 then...

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38134
[Bug target/38134] [4.4 Regression] speed regression with inline-asm sse code
--
jakub at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P2

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38134
[Bug target/38134] [4.4 Regression] speed regression with inline-asm sse code
--- Comment #6 from ubizjak at gmail dot com 2008-11-17 18:11 ---
I think that

        addps   .LC10(%rip), %xmm0
        mulps   %xmm1, %xmm0
        addps   .LC11(%rip), %xmm0
        mulps   %xmm1, %xmm0
        addps   .LC12(%rip), %xmm0
        mulps   %xmm1, %xmm0
        addps   .LC13(%rip), %xmm0
        mulps   %xmm1, %xmm0
        addps   .LC14(%rip), %xmm0
        mulps   %xmm1, %xmm0

is the bottleneck. Perhaps we should split implicit memory operands out of the
insn with some generic peephole (if a register is available) and schedule the
loads appropriately.

OTOH, the loop optimizer should detect invariant loads and move them out of the
loop.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38134
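For readers following along, the addps/mulps chain above has the shape of a
Horner-style polynomial evaluation whose coefficients never change across loop
iterations. The C below is only a rough scalar sketch of that idea, not the
benchmark's actual source: the function name poly_horner, the constant NCOEF
and the coefficient values in the usage note are hypothetical. It copies the
coefficients into locals before the loop, which is the source-level counterpart
of the loop optimizer hoisting the invariant loads.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical coefficient count, chosen for illustration only. */
enum { NCOEF = 6 };

/*
 * Evaluate coef[0]*x^5 + coef[1]*x^4 + ... + coef[5] for every element of
 * in[], writing the results to out[]. The coefficients are loop invariant,
 * so they are read once into locals before the loop instead of being
 * fetched from memory inside every multiply/add step.
 */
void poly_horner(const float *in, float *out, size_t n,
                 const float coef[NCOEF])
{
    assert(in != NULL && out != NULL && coef != NULL);

    /* Invariant loads, done once outside the loop. */
    const float c0 = coef[0], c1 = coef[1], c2 = coef[2];
    const float c3 = coef[3], c4 = coef[4], c5 = coef[5];

    for (size_t i = 0; i < n; ++i) {
        const float x = in[i];
        float acc = c0;
        /* Horner chain: one multiply and one add per coefficient. */
        acc = acc * x + c1;
        acc = acc * x + c2;
        acc = acc * x + c3;
        acc = acc * x + c4;
        acc = acc * x + c5;
        out[i] = acc;
    }
}
```

Calling poly_horner with all six coefficients set to 1.0f and inputs 0, 1 and 2
evaluates x^5 + x^4 + x^3 + x^2 + x + 1 at each input. Whether the compiler
actually keeps c0..c5 in registers still depends on register pressure, which is
the point raised in this report.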
[Bug target/38134] [4.4 Regression] speed regression with inline-asm sse code
--- Comment #7 from tim at klingt dot org 2008-11-17 18:19 ---
Created an attachment (id=16710)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=16710&action=view)
compressed preprocessed source, gcc-4.4

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38134
[Bug target/38134] [4.4 Regression] speed regression with inline-asm sse code
--- Comment #8 from tim at klingt dot org 2008-11-17 18:30 ---
Created an attachment (id=16711)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=16711&action=view)
16684: compressed preprocessed source, gcc-4.3

--
tim at klingt dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #16684|0                           |1
        is obsolete|                            |

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38134
[Bug target/38134] [4.4 Regression] speed regression with inline-asm sse code
--- Comment #9 from tim at klingt dot org 2008-11-17 18:49 ---
I have updated the test program and attached preprocessed sources for gcc 4.3
and 4.4.

The loop prefix contains:

4.4 (9 invariant loads, one store of a generated constant to the stack):
        pxor    %xmm5, %xmm5
        xorl    %eax, %eax
        movdqa  %xmm5, %xmm0
        xorl    %edx, %edx
        pcmpeqd %xmm5, %xmm0
        movaps  .LC2(%rip), %xmm14
        psrld   $31, %xmm0
        movdqa  .LC3(%rip), %xmm13
        pslld   $31, %xmm0
        movaps  .LC4(%rip), %xmm12
        movaps  .LC5(%rip), %xmm11
        movaps  .LC6(%rip), %xmm10
        movaps  .LC7(%rip), %xmm9
        movaps  .LC8(%rip), %xmm8
        movaps  .LC9(%rip), %xmm7
        movaps  .LC16(%rip), %xmm6
        movdqa  %xmm0, -24(%rsp)

4.3 (8 invariant loads, store one generated constant in register):
        pxor    %xmm6, %xmm6
        xorl    %edx, %edx
        movdqa  %xmm6, %xmm0
        xorl    %eax, %eax
        pcmpeqd %xmm6, %xmm0
        movaps  .LC9(%rip), %xmm15
        psrld   $31, %xmm0
        movaps  .LC10(%rip), %xmm14
        pslld   $31, %xmm0
        movaps  .LC11(%rip), %xmm13
        movaps  .LC12(%rip), %xmm12
        movaps  .LC13(%rip), %xmm11
        movdqa  .LC14(%rip), %xmm10
        movaps  .LC15(%rip), %xmm9
        movaps  .LC16(%rip), %xmm8
        movdqa  %xmm0, %xmm7

The loop body:

4.3 (7 loads from memory, 2 loads are used in the next instruction, others are
used later):
.L48:
        movaps  in(%rax), %xmm2
        movaps  .LC2(%rip), %xmm0
        movdqa  %xmm2, %xmm5
        movdqa  .LC3(%rip), %xmm4
        pand    %xmm7, %xmm5
        movaps  .LC4(%rip), %xmm1
        addl    $4, %edx
#APP
# 324 benchmarks/../source/dsp/../../libs/libsimdmath/lib/sincosf4.h 1
        xorps %xmm5, %xmm2
# 0 2
#NO_APP
        mulps   %xmm2, %xmm0
        movaps  %xmm2, %xmm3
#APP
# 327 benchmarks/../source/dsp/../../libs/libsimdmath/lib/sincosf4.h 1
        cvttps2dq %xmm0, %xmm0
# 0 2
#NO_APP
        pand    %xmm0, %xmm4
        paddd   %xmm0, %xmm4
#APP
# 330 benchmarks/../source/dsp/../../libs/libsimdmath/lib/sincosf4.h 1
        cvtdq2ps %xmm4, %xmm0
# 0 2
#NO_APP
        pand    %xmm10, %xmm4
        mulps   %xmm0, %xmm1
        psrld   $1, %xmm4
        subps   %xmm1, %xmm3
        movaps  .LC5(%rip), %xmm1
        mulps   %xmm0, %xmm1
        mulps   .LC6(%rip), %xmm0
        subps   %xmm1, %xmm3
        subps   %xmm0, %xmm3
        movaps  .LC7(%rip), %xmm0
        movaps  %xmm3, %xmm1
        cmpltps %xmm2, %xmm0
        mulps   %xmm3, %xmm1
        movaps  %xmm0, %xmm2
        movaps  .LC8(%rip), %xmm0
        mulps   %xmm1, %xmm0
        addps   %xmm15, %xmm0
        mulps   %xmm1, %xmm0
        addps   %xmm14, %xmm0
        mulps   %xmm1, %xmm0
        addps   %xmm13, %xmm0
        mulps   %xmm1, %xmm0
        addps   %xmm12, %xmm0
        mulps   %xmm1, %xmm0
        addps   %xmm11, %xmm0
        mulps   %xmm1, %xmm0
        mulps   %xmm3, %xmm0
        addps   %xmm3, %xmm0
#APP
# 341 benchmarks/../source/dsp/../../libs/libsimdmath/lib/sincosf4.h 1
        andps %xmm2, %xmm0
# 0 2
# 342 benchmarks/../source/dsp/../../libs/libsimdmath/lib/sincosf4.h 1
        andnps %xmm3, %xmm2
# 0 2
#NO_APP
        movaps  %xmm8, %xmm3
#APP
# 343 benchmarks/../source/dsp/../../libs/libsimdmath/lib/sincosf4.h 1
        orps %xmm2, %xmm0
# 0 2
#NO_APP
        movdqa  %xmm6, %xmm2
        movaps  %xmm0, %xmm1
        psubd   %xmm4, %xmm2
        addps   %xmm9, %xmm1
        divps   %xmm1, %xmm3
        movaps  %xmm3, %xmm1
#APP
# 145 benchmarks/../source/dsp/../../libs/libsimdmath/lib/simdconst.h 1
        andps %xmm2, %xmm1
# 0 2
# 146 benchmarks/../source/dsp/../../libs/libsimdmath/lib/simdconst.h 1
        andnps %xmm0, %xmm2
# 0 2
# 147 benchmarks/../source/dsp/../../libs/libsimdmath/lib/simdconst.h 1
        orps %xmm2, %xmm1
# 0 2
# 348 benchmarks/../source/dsp/../../libs/libsimdmath/lib/sincosf4.h 1
        xorps %xmm5, %xmm1
# 0 2
#NO_APP
        movaps  %xmm1, out(%rax)
        addq    $16, %rax
        cmpl    %edi, %edx
        jne     .L48

4.4 (6 loads from memory, 5 loads are used as memory argument to opcodes):
.L54:
        movaps  in(%rax), %xmm2
        movdqa  -24(%rsp), %xmm3
        addl    $4, %edx
        pand    %xmm2, %xmm3
#APP
# 324 benchmarks/../source/dsp/../../libs/libsimdmath/lib/sincosf4.h 1
        xorps %xmm3, %xmm2
# 0 2
#NO_APP
        movaps  %xmm2, %xmm4
        movaps  %xmm2, %xmm15
        mulps   %xmm14, %xmm4
#APP
# 327 benchmarks/../source/dsp/../../libs/libsimdmath/lib/sincosf4.h 1
        cvttps2dq %xmm4, %xmm4
# 0 2
#NO_APP
        movdqa  %xmm4, %xmm0
        pand    %xmm13, %xmm0
        paddd   %xmm0, %xmm4
#APP
# 330 benchmarks/../source/dsp/../../libs/libsimdmath/lib/sincosf4.h 1
        cvtdq2ps %xmm4, %xmm0
# 0 2
#NO_APP
        pand    .LC14(%rip), %xmm4
        movaps  %xmm0, %xmm1
        psrld   $1, %xmm4
        mulps   %xmm12, %xmm1
        subps   %xmm1, %xmm15
[Bug target/38134] [4.4 Regression] speed regression with inline-asm sse code
--
rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 GCC target triplet|                            |x86_64-*-*-*
           Keywords|                            |missed-optimization
            Summary|gcc-4.4 speed regression    |[4.4 Regression] speed
                   |with inline-asm sse code    |regression with inline-asm
                   |                            |sse code
   Target Milestone|---                         |4.4.0
            Version|unknown                     |4.4.0

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38134
[Bug target/38134] [4.4 Regression] speed regression with inline-asm sse code
--- Comment #4 from hjl dot tools at gmail dot com 2008-11-16 00:06 ---
(In reply to comment #3)
> i tried to run the benchmark with -fno-ira, which turned out to be about 20%
> slower than without the flag.

Can you try

  -O3 -march=core2 -mtune=generic

and

  -O3 -march=core2 -mtune=generic -fno-ira

?

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38134
[Bug target/38134] [4.4 Regression] speed regression with inline-asm sse code
--- Comment #5 from hjl dot tools at gmail dot com 2008-11-16 00:08 ---
(In reply to comment #3)
> anyway, i found, that the preprocessed source generated by gcc-4.3 cannot be
> compiled with gcc-4.4 ... the specific file can be found here
> http://tim.klingt.org/git?p=nova-server.git;a=blob;f=benchmarks/simd_tan_benchmarks.cpp;h=c575996de0dc916a8e654af7a36350be9c22327e;hb=844d3cf991cbbbe74b34277696dda0b940769b28

Please upload both preprocessed sources generated by gcc 4.3 and gcc 4.4.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38134