[Bug target/55953] hand loop faster then builtin memset

2013-11-09 Thread pinskia at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55953

Andrew Pinski  changed:

           What    |Removed     |Added
----------------------------------------------
             Status|UNCONFIRMED |RESOLVED
         Resolution|---         |DUPLICATE

--- Comment #6 from Andrew Pinski  ---
Dup.

*** This bug has been marked as a duplicate of bug 57890 ***


[Bug target/55953] hand loop faster then builtin memset

2013-01-12 Thread glisse at gcc dot gnu.org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55953



--- Comment #5 from Marc Glisse 2013-01-12 11:16:01 UTC ---

See this patch:

http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00336.html

(the thread continues in earlier and later months)


[Bug target/55953] hand loop faster then builtin memset

2013-01-11 Thread hjl.tools at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55953



H.J. Lu  changed:



           What    |Removed     |Added
----------------------------------------------------------
                 CC|            |hjl.tools at gmail dot com



--- Comment #4 from H.J. Lu 2013-01-12 02:10:33 UTC ---

Can you try memset in glibc instead of builtin memset?
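
One way to try this suggestion is sketched below. It assumes that calling memset through a volatile function pointer is enough to stop GCC from expanding the call inline, so the C library's implementation (glibc, on a glibc system) is what runs. The names libc_memset and fill_via_libc are hypothetical and do not come from the original test case.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: the pointer is volatile, so the compiler has to
 * load it at run time and cannot replace the call with an inline
 * expansion of the builtin. */
static void *(*volatile libc_memset)(void *, int, size_t) = memset;

/* Fill n bytes at dst with '0' (0x30) using the library memset. */
void fill_via_libc(char *dst, size_t n)
{
    assert(dst != NULL);
    libc_memset(dst, '0', n);
}
```

Compiling the memset variant with GCC's -fno-builtin-memset option is another way to get a plain library call; which of the two is more convenient depends on the benchmark setup.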


[Bug target/55953] hand loop faster then builtin memset

2013-01-11 Thread dushistov at mail dot ru


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55953



--- Comment #3 from Evgeniy Dushistov 2013-01-12 00:13:09 UTC ---

Cross-compiling for ARM, with almost the same version of g++:

arm-angstrom-linux-gnueabi-g++ (Linaro GCC 4.7-2012.10) 4.7.3 20121001:



variant one (for):

movw    r3, #2280       ; 0x8e8
movt    r3, #1
vmov.i8 q8, #48         ; 0x30
mov     r2, #48         ; 0x30
vst1.64 {d16-d17}, [r3 :64]
vstr    d16, [r3, #16]
vstr    d17, [r3, #24]
vstr    d16, [r3, #32]
vstr    d17, [r3, #40]  ; 0x28
vstr    d16, [r3, #48]  ; 0x30
vstr    d17, [r3, #56]  ; 0x38
vstr    d16, [r3, #64]  ; 0x40
vstr    d17, [r3, #72]  ; 0x48
vstr    d16, [r3, #80]  ; 0x50
vstr    d17, [r3, #88]  ; 0x58
strb    r2, [r3, #96]   ; 0x60
strb    r2, [r3, #97]   ; 0x61
strb    r2, [r3, #98]   ; 0x62
strb    r2, [r3, #99]   ; 0x63
bx      lr



variant two (memset):

movw    r0, #2272       ; 0x8e0
mov     r1, #48         ; 0x30
movt    r0, #1
mov     r2, #100        ; 0x64
b       0x8494



The time difference is about 5%; the first variant wins.

Command-line options:

-march=armv7-a -mtune=cortex-a8 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=hard -Ofast


[Bug target/55953] hand loop faster then builtin memset

2013-01-11 Thread dushistov at mail dot ru


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55953



--- Comment #2 from Evgeniy Dushistov 2013-01-12 00:05:15 UTC ---

Actually, this is not only a 64-bit issue. For example, on the same CPU (i7) in
32-bit mode:



variant one:

push   %ebp
vmovdqa 0x80488e0,%ymm0
mov    %esp,%ebp
pop    %ebp
movb   $0x30,0x804a0a0
vmovdqa %ymm0,0x804a040
vmovdqa %ymm0,0x804a060
vmovdqa %ymm0,0x804a080
movb   $0x30,0x804a0a1
movb   $0x30,0x804a0a2
movb   $0x30,0x804a0a3
vzeroupper
ret



variant two:

mov    $0x804a040,%edx
push   %edi
mov    $0x30303030,%eax
mov    %edx,%edi
mov    $0x19,%ecx
rep stos %eax,%es:(%edi)
pop    %edi
ret



Variant one wins.
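
To make the constants in the variant-two listing easier to read, here is a rough C model of what the rep stos sequence does. It assumes the fill byte 0x30 is replicated into the 32-bit pattern 0x30303030 and that the count 0x19 (25) is the number of 4-byte stores, for 25 * 4 = 100 bytes in total. The function names are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Replicate one byte into all four bytes of a 32-bit word,
 * e.g. 0x30 becomes 0x30303030 (the value loaded into %eax). */
uint32_t splat_byte(uint8_t b)
{
    return (uint32_t)b * 0x01010101u;
}

/* Model of "rep stos %eax,%es:(%edi)": store `pattern` `count` times,
 * advancing the destination by 4 bytes after each store. */
void store_dwords(unsigned char *dst, uint32_t pattern, size_t count)
{
    assert(dst != NULL);
    for (size_t i = 0; i < count; ++i)
        memcpy(dst + 4 * i, &pattern, sizeof pattern);
}
```

Calling store_dwords(buf, splat_byte(0x30), 25) on a 100-byte buffer writes the same bytes as the listing, one integer store at a time, whereas variant one covers 96 of the 100 bytes with three 32-byte AVX stores.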


[Bug target/55953] hand loop faster then builtin memset

2013-01-11 Thread pinskia at gcc dot gnu.org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55953



Andrew Pinski  changed:



           What    |Removed     |Added
----------------------------------------------
             Target|            |x86_64-*-*
          Component|c           |target



--- Comment #1 from Andrew Pinski 2013-01-11 23:35:53 UTC ---

This is a target issue.  The first function uses the AVX/SSE registers while
the second only uses the integer registers.  So the target decides not to
vectorize memset but only use the integer registers.
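
The original test case is not quoted in this thread. The following is a minimal sketch of what the two functions might look like, reconstructed from the disassembly in comments #2 and #3 under the assumption that both fill a 100-byte global buffer with the byte 0x30 ('0'). The names buf, fill_loop, fill_memset and check_filled are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumption: 100 bytes, matching "mov r2, #100" and the 25 dword stores. */
enum { BUF_SIZE = 100 };
static char buf[BUF_SIZE];

/* Variant one: hand-written loop. In the listings, GCC vectorizes and
 * fully unrolls this into wide vector stores plus four byte stores. */
void fill_loop(void)
{
    for (size_t i = 0; i < BUF_SIZE; ++i)
        buf[i] = '0';
}

/* Variant two: memset. In the listings this becomes rep stos on 32-bit
 * x86 and a tail call to memset on ARM. */
void fill_memset(void)
{
    memset(buf, '0', BUF_SIZE);
}

/* Sanity check that every byte of the buffer was written. */
void check_filled(void)
{
    for (size_t i = 0; i < BUF_SIZE; ++i)
        assert(buf[i] == '0');
}
```

Both functions produce the same buffer contents; the reported difference is only in the instructions the target chooses to emit for each.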