[Bug target/38134] [4.4 Regression] speed regression with inline-asm sse code

2009-02-03 Thread bonzini at gnu dot org


--- Comment #10 from bonzini at gnu dot org  2009-02-03 09:47 ---
Can you try the patch of PR38824?


-- 

bonzini at gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |WAITING


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38134



[Bug target/38134] [4.4 Regression] speed regression with inline-asm sse code

2009-02-03 Thread bonzini at gnu dot org


--- Comment #12 from bonzini at gnu dot org  2009-02-03 11:17 ---
What if we forbid altogether memory operands and we *synthesize* them with a
peephole2?  Anyway, it seems safe to me to declare this a dup of PR38824?


-- 

bonzini at gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|WAITING                     |NEW
     Ever Confirmed|0                           |1
   Last reconfirmed|2009-02-03 10:36:46         |2009-02-03 11:17:38
               date|                            |





[Bug target/38134] [4.4 Regression] speed regression with inline-asm sse code

2009-02-03 Thread ubizjak at gmail dot com


--- Comment #11 from ubizjak at gmail dot com  2009-02-03 10:36 ---
(In reply to comment #10)
> Can you try the patch of PR38824?

I have tried a similar peephole2 recognizer. The problem is that there is
no spare x register to allocate as a temporary, so peephole2 is ineffective
in this particular case.


-- 

ubizjak at gmail dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|0000-00-00 00:00:00         |2009-02-03 10:36:46
               date|                            |





[Bug target/38134] [4.4 Regression] speed regression with inline-asm sse code

2009-02-03 Thread ubizjak at gmail dot com


--- Comment #13 from ubizjak at gmail dot com  2009-02-03 11:34 ---
(In reply to comment #12)
> What if we forbid altogether memory operands and we *synthesize* them with a
> peephole2?  Anyway, it seems safe to me to declare this a dup of PR38824?

I think that we will hit PR 19398 then...





[Bug target/38134] [4.4 Regression] speed regression with inline-asm sse code

2008-11-17 Thread jakub at gcc dot gnu dot org


-- 

jakub at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P2





[Bug target/38134] [4.4 Regression] speed regression with inline-asm sse code

2008-11-17 Thread ubizjak at gmail dot com


--- Comment #6 from ubizjak at gmail dot com  2008-11-17 18:11 ---
I think that

addps   .LC10(%rip), %xmm0
mulps   %xmm1, %xmm0
addps   .LC11(%rip), %xmm0
mulps   %xmm1, %xmm0
addps   .LC12(%rip), %xmm0
mulps   %xmm1, %xmm0
addps   .LC13(%rip), %xmm0
mulps   %xmm1, %xmm0
addps   .LC14(%rip), %xmm0
mulps   %xmm1, %xmm0

is the bottleneck. Perhaps we should split implicit memory operands out of the
insn with some generic peephole (if a register is available) and schedule the
loads appropriately.

OTOH, loop optimizer should detect invariant loads and move them out of the
loop.
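
To make the last point concrete, here is a minimal sketch in C of the same
add/multiply chain with the constants loaded once before the loop. It uses the
GCC/Clang vector extension rather than the libsimdmath source; the names
poly_eval and coeff and the coefficient values are hypothetical and do not
correspond to .LC10 ... .LC14. Whether the constants actually stay in registers
is up to the register allocator, which is what this bug is about.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Four packed floats via the GCC/Clang vector extension. */
typedef float v4sf __attribute__((vector_size(16)));

/* Hypothetical coefficients; NOT the values behind .LC10 ... .LC14. */
static const float coeff[3] = { 1.0f, 2.0f, 3.0f };

/* Evaluates ((c0*x + c1)*x + c2)*x for n floats, four at a time. */
void poly_eval(const float *in, float *out, size_t n)
{
    assert(n % 4 == 0); /* this sketch handles whole vectors only */

    /* Loop-invariant constants are loaded once, before the loop. */
    const v4sf c0 = { coeff[0], coeff[0], coeff[0], coeff[0] };
    const v4sf c1 = { coeff[1], coeff[1], coeff[1], coeff[1] };
    const v4sf c2 = { coeff[2], coeff[2], coeff[2], coeff[2] };

    for (size_t i = 0; i < n; i += 4) {
        v4sf x, acc;
        memcpy(&x, in + i, sizeof x);      /* load four inputs */
        acc = c0 * x;                      /* mulps */
        acc = (acc + c1) * x;              /* addps, mulps */
        acc = (acc + c2) * x;              /* addps, mulps */
        memcpy(out + i, &acc, sizeof acc); /* store four results */
    }
}
```

With register operands each addps depends only on the previous mulps; with
memory operands, as in the chain quoted above, every addps also carries a load.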





[Bug target/38134] [4.4 Regression] speed regression with inline-asm sse code

2008-11-17 Thread tim at klingt dot org


--- Comment #7 from tim at klingt dot org  2008-11-17 18:19 ---
Created an attachment (id=16710)
 -- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=16710&action=view)
compressed preprocessed source, gcc-4.4





[Bug target/38134] [4.4 Regression] speed regression with inline-asm sse code

2008-11-17 Thread tim at klingt dot org


--- Comment #8 from tim at klingt dot org  2008-11-17 18:30 ---
Created an attachment (id=16711)
 -- (http://gcc.gnu.org/bugzilla/attachment.cgi?id=16711&action=view)
16684: compressed preprocessed source, gcc-4.3


-- 

tim at klingt dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #16684|0                           |1
        is obsolete|                            |





[Bug target/38134] [4.4 Regression] speed regression with inline-asm sse code

2008-11-17 Thread tim at klingt dot org


--- Comment #9 from tim at klingt dot org  2008-11-17 18:49 ---
I have updated the test program and attached the preprocessed sources from
gcc 4.3 and 4.4.

The loop prefix (the code before the loop body) contains:
4.4 (9 invariant loads, one store of a generated constant to the stack):
pxor    %xmm5, %xmm5
xorl    %eax, %eax
movdqa  %xmm5, %xmm0
xorl    %edx, %edx
pcmpeqd %xmm5, %xmm0
movaps  .LC2(%rip), %xmm14
psrld   $31, %xmm0
movdqa  .LC3(%rip), %xmm13
pslld   $31, %xmm0
movaps  .LC4(%rip), %xmm12
movaps  .LC5(%rip), %xmm11
movaps  .LC6(%rip), %xmm10
movaps  .LC7(%rip), %xmm9
movaps  .LC8(%rip), %xmm8
movaps  .LC9(%rip), %xmm7
movaps  .LC16(%rip), %xmm6
movdqa  %xmm0, -24(%rsp)

4.3 (8 invariant loads, the one generated constant is kept in a register):
pxor    %xmm6, %xmm6
xorl    %edx, %edx
movdqa  %xmm6, %xmm0
xorl    %eax, %eax
pcmpeqd %xmm6, %xmm0
movaps  .LC9(%rip), %xmm15
psrld   $31, %xmm0
movaps  .LC10(%rip), %xmm14
pslld   $31, %xmm0
movaps  .LC11(%rip), %xmm13
movaps  .LC12(%rip), %xmm12
movaps  .LC13(%rip), %xmm11
movdqa  .LC14(%rip), %xmm10
movaps  .LC15(%rip), %xmm9
movaps  .LC16(%rip), %xmm8
movdqa  %xmm0, %xmm7
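
For readers not used to the idiom: the "generated constant" in both prefixes is
built by pxor / pcmpeqd / psrld $31 / pslld $31, which leaves 0x80000000 in
every 32-bit lane. A one-lane sketch in C follows. The function names are
hypothetical, and the reading that the loop body uses this value as a sign mask
(pand to extract the sign, xorps to clear and later restore it) is inferred
from the listings, not taken from sincosf4.h.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* One 32-bit lane of pcmpeqd: all ones if equal, else zero. */
static uint32_t cmpeq32(uint32_t a, uint32_t b)
{
    return a == b ? 0xFFFFFFFFu : 0u;
}

/* Mirrors pxor / pcmpeqd / psrld $31 / pslld $31 on one lane. */
uint32_t make_sign_mask(void)
{
    uint32_t zero = 0;                /* pxor:      0x00000000 */
    uint32_t m = cmpeq32(zero, zero); /* pcmpeqd:   0xFFFFFFFF */
    m >>= 31;                         /* psrld $31: 0x00000001 */
    m <<= 31;                         /* pslld $31: 0x80000000 */
    return m;
}

/* Returns |x| and stores the sign bit of x in *sign (pand, then xorps). */
float strip_sign(float x, uint32_t *sign)
{
    uint32_t bits;
    assert(sizeof(float) == sizeof(uint32_t));
    memcpy(&bits, &x, sizeof bits);
    *sign = bits & make_sign_mask(); /* pand  */
    bits ^= *sign;                   /* xorps */
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Flips the sign of y if the saved sign bit is set (the final xorps). */
float apply_sign(float y, uint32_t sign)
{
    uint32_t bits;
    memcpy(&bits, &y, sizeof bits);
    bits ^= sign;
    memcpy(&y, &bits, sizeof y);
    return y;
}
```

The difference between the two compilers is only where this value lives: 4.3
keeps it in %xmm7, while 4.4 stores it to -24(%rsp) and reloads it in every
iteration.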




body:
4.3 (7 loads from memory, 2 loads are used in the next instruction, others are
used later):
.L48:
movaps  in(%rax), %xmm2
movaps  .LC2(%rip), %xmm0
movdqa  %xmm2, %xmm5
movdqa  .LC3(%rip), %xmm4
pand    %xmm7, %xmm5
movaps  .LC4(%rip), %xmm1
addl    $4, %edx
#APP
# 324 benchmarks/../source/dsp/../../libs/libsimdmath/lib/sincosf4.h 1
xorps %xmm5, %xmm2
# 0  2
#NO_APP
mulps   %xmm2, %xmm0
movaps  %xmm2, %xmm3
#APP
# 327 benchmarks/../source/dsp/../../libs/libsimdmath/lib/sincosf4.h 1
cvttps2dq %xmm0, %xmm0
# 0  2
#NO_APP
pand    %xmm0, %xmm4
paddd   %xmm0, %xmm4
#APP
# 330 benchmarks/../source/dsp/../../libs/libsimdmath/lib/sincosf4.h 1
cvtdq2ps  %xmm4, %xmm0
# 0  2
#NO_APP
pand    %xmm10, %xmm4
mulps   %xmm0, %xmm1
psrld   $1, %xmm4
subps   %xmm1, %xmm3
movaps  .LC5(%rip), %xmm1
mulps   %xmm0, %xmm1
mulps   .LC6(%rip), %xmm0
subps   %xmm1, %xmm3
subps   %xmm0, %xmm3
movaps  .LC7(%rip), %xmm0
movaps  %xmm3, %xmm1
cmpltps %xmm2, %xmm0
mulps   %xmm3, %xmm1
movaps  %xmm0, %xmm2
movaps  .LC8(%rip), %xmm0
mulps   %xmm1, %xmm0
addps   %xmm15, %xmm0
mulps   %xmm1, %xmm0
addps   %xmm14, %xmm0
mulps   %xmm1, %xmm0
addps   %xmm13, %xmm0
mulps   %xmm1, %xmm0
addps   %xmm12, %xmm0
mulps   %xmm1, %xmm0
addps   %xmm11, %xmm0
mulps   %xmm1, %xmm0
mulps   %xmm3, %xmm0
addps   %xmm3, %xmm0
#APP
# 341 benchmarks/../source/dsp/../../libs/libsimdmath/lib/sincosf4.h 1
andps %xmm2, %xmm0
# 0  2
# 342 benchmarks/../source/dsp/../../libs/libsimdmath/lib/sincosf4.h 1
andnps %xmm3, %xmm2
# 0  2
#NO_APP
movaps  %xmm8, %xmm3
#APP
# 343 benchmarks/../source/dsp/../../libs/libsimdmath/lib/sincosf4.h 1
orps  %xmm2, %xmm0
# 0  2
#NO_APP
movdqa  %xmm6, %xmm2
movaps  %xmm0, %xmm1
psubd   %xmm4, %xmm2
addps   %xmm9, %xmm1
divps   %xmm1, %xmm3
movaps  %xmm3, %xmm1
#APP
# 145 benchmarks/../source/dsp/../../libs/libsimdmath/lib/simdconst.h 1
andps %xmm2, %xmm1
# 0  2
# 146 benchmarks/../source/dsp/../../libs/libsimdmath/lib/simdconst.h 1
andnps %xmm0, %xmm2
# 0  2
# 147 benchmarks/../source/dsp/../../libs/libsimdmath/lib/simdconst.h 1
orps  %xmm2, %xmm1
# 0  2
# 348 benchmarks/../source/dsp/../../libs/libsimdmath/lib/sincosf4.h 1
xorps %xmm5, %xmm1
# 0  2
#NO_APP
movaps  %xmm1, out(%rax)
addq    $16, %rax
cmpl    %edi, %edx
jne     .L48
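
The andps / andnps / orps triples in the inline-asm blocks above are the usual
SSE bitwise select: take one value where the mask bits are set and the other
elsewhere. A one-lane sketch in C, assuming the mask is all ones or all zeros
per lane (as a cmpltps result would be); select_bits and select_float are
hypothetical names, not the libsimdmath functions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* One lane of andps / andnps / orps: a where mask bits are set, b elsewhere. */
uint32_t select_bits(uint32_t mask, uint32_t a, uint32_t b)
{
    uint32_t from_a = a & mask;  /* andps  */
    uint32_t from_b = ~mask & b; /* andnps */
    return from_a | from_b;      /* orps   */
}

/* The same select applied to the bit patterns of two floats. */
float select_float(uint32_t mask, float a, float b)
{
    uint32_t ua, ub, ur;
    float r;
    assert(sizeof(float) == sizeof(uint32_t));
    memcpy(&ua, &a, sizeof ua);
    memcpy(&ub, &b, sizeof ub);
    ur = select_bits(mask, ua, ub);
    memcpy(&r, &ur, sizeof r);
    return r;
}
```

Each select needs the mask and both inputs live at the same time, which adds to
the pressure on the 16 xmm registers in this loop.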


4.4 (6 loads from memory, 5 of them used directly as memory operands of instructions):
.L54:
movaps  in(%rax), %xmm2
movdqa  -24(%rsp), %xmm3
addl    $4, %edx
pand    %xmm2, %xmm3
#APP
# 324 benchmarks/../source/dsp/../../libs/libsimdmath/lib/sincosf4.h 1
xorps %xmm3, %xmm2
# 0  2
#NO_APP
movaps  %xmm2, %xmm4
movaps  %xmm2, %xmm15
mulps   %xmm14, %xmm4
#APP
# 327 benchmarks/../source/dsp/../../libs/libsimdmath/lib/sincosf4.h 1
cvttps2dq %xmm4, %xmm4
# 0  2
#NO_APP
movdqa  %xmm4, %xmm0
pand    %xmm13, %xmm0
paddd   %xmm0, %xmm4
#APP
# 330 benchmarks/../source/dsp/../../libs/libsimdmath/lib/sincosf4.h 1
cvtdq2ps  %xmm4, %xmm0
# 0  2
#NO_APP
pand    .LC14(%rip), %xmm4
movaps  %xmm0, %xmm1
psrld   $1, %xmm4
mulps   %xmm12, %xmm1
subps   %xmm1, %xmm15
  

[Bug target/38134] [4.4 Regression] speed regression with inline-asm sse code

2008-11-15 Thread rguenth at gcc dot gnu dot org


-- 

rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 GCC target triplet|                            |x86_64-*-*-*
           Keywords|                            |missed-optimization
            Summary|gcc-4.4 speed regression    |[4.4 Regression] speed
                   |with inline-asm sse code    |regression with inline-asm
                   |                            |sse code
   Target Milestone|---                         |4.4.0
            Version|unknown                     |4.4.0





[Bug target/38134] [4.4 Regression] speed regression with inline-asm sse code

2008-11-15 Thread hjl dot tools at gmail dot com


--- Comment #4 from hjl dot tools at gmail dot com  2008-11-16 00:06 ---
(In reply to comment #3)
> i tried to run the benchmark with -fno-ira, which turned out to be about 20%
> slower than without the flag.
>

Can you try -O3 -march=core2 -mtune=generic and -O3 -march=core2
-mtune=generic -fno-ira ?





[Bug target/38134] [4.4 Regression] speed regression with inline-asm sse code

2008-11-15 Thread hjl dot tools at gmail dot com


--- Comment #5 from hjl dot tools at gmail dot com  2008-11-16 00:08 ---
(In reply to comment #3)
> anyway, i found, that the preprocessed source generated by gcc-4.3 cannot be
> compiled with gcc-4.4 ... the specific file can be found here
> http://tim.klingt.org/git?p=nova-server.git;a=blob;f=benchmarks/simd_tan_benchmarks.cpp;h=c575996de0dc916a8e654af7a36350be9c22327e;hb=844d3cf991cbbbe74b34277696dda0b940769b28
>

Please upload both preprocessed sources generated by gcc 4.3 and gcc 4.4.

