[Bug target/55953] hand loop faster then builtin memset
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55953

Andrew Pinski changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |DUPLICATE

--- Comment #6 from Andrew Pinski ---
Dup.

*** This bug has been marked as a duplicate of bug 57890 ***
[Bug target/55953] hand loop faster then builtin memset
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55953

--- Comment #5 from Marc Glisse 2013-01-12 11:16:01 UTC ---
See this patch:
http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00336.html
(the thread continues in earlier and later months)
[Bug target/55953] hand loop faster then builtin memset
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55953

H.J. Lu changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hjl.tools at gmail dot com

--- Comment #4 from H.J. Lu 2013-01-12 02:10:33 UTC ---
Can you try memset in glibc instead of builtin memset?
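
One possible way to run the experiment suggested in comment #4 is sketched below. Calling memset through a volatile function pointer should keep the compiler from expanding the call inline, so the C library's implementation is what gets measured. The names `libc_memset` and `fill_libc` are hypothetical and do not come from the bug report.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: a volatile function pointer that the compiler
 * must load at run time, so the call below cannot be replaced by the
 * inline expansion of the builtin. */
static void *(*volatile libc_memset)(void *, int, size_t) = memset;

/* Fill n bytes at dst with '0' (0x30) using the library memset. */
void fill_libc(char *dst, size_t n)
{
    assert(dst != NULL);
    libc_memset(dst, '0', n);
}
```

Compiling the memset variant with `-fno-builtin-memset` is another way to get a real library call without changing the source.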
[Bug target/55953] hand loop faster then builtin memset
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55953

--- Comment #3 from Evgeniy Dushistov 2013-01-12 00:13:09 UTC ---
Cross-compiling for ARM with almost the same g++ version,
arm-angstrom-linux-gnueabi-g++ (Linaro GCC 4.7-2012.10) 4.7.3 20121001:

variant one (for):
    movw    r3, #2280       ; 0x8e8
    movt    r3, #1
    vmov.i8 q8, #48         ; 0x30
    mov     r2, #48         ; 0x30
    vst1.64 {d16-d17}, [r3 :64]
    vstr    d16, [r3, #16]
    vstr    d17, [r3, #24]
    vstr    d16, [r3, #32]
    vstr    d17, [r3, #40]  ; 0x28
    vstr    d16, [r3, #48]  ; 0x30
    vstr    d17, [r3, #56]  ; 0x38
    vstr    d16, [r3, #64]  ; 0x40
    vstr    d17, [r3, #72]  ; 0x48
    vstr    d16, [r3, #80]  ; 0x50
    vstr    d17, [r3, #88]  ; 0x58
    strb    r2, [r3, #96]   ; 0x60
    strb    r2, [r3, #97]   ; 0x61
    strb    r2, [r3, #98]   ; 0x62
    strb    r2, [r3, #99]   ; 0x63
    bx      lr

variant two (memset):
    movw    r0, #2272       ; 0x8e0
    mov     r1, #48         ; 0x30
    movt    r0, #1
    mov     r2, #100        ; 0x64
    b       0x8494

The time difference is about 5%, and the first variant wins.
Command-line options:
-march=armv7-a -mtune=cortex-a8 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=hard -Ofast
[Bug target/55953] hand loop faster then builtin memset
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55953

--- Comment #2 from Evgeniy Dushistov 2013-01-12 00:05:15 UTC ---
Actually, this is not only a 64-bit issue. For example, on the same CPU (i7)
in 32-bit mode:

variant one:
    push    %ebp
    vmovdqa 0x80488e0,%ymm0
    mov     %esp,%ebp
    pop     %ebp
    movb    $0x30,0x804a0a0
    vmovdqa %ymm0,0x804a040
    vmovdqa %ymm0,0x804a060
    vmovdqa %ymm0,0x804a080
    movb    $0x30,0x804a0a1
    movb    $0x30,0x804a0a2
    movb    $0x30,0x804a0a3
    vzeroupper
    ret

variant two:
    mov     $0x804a040,%edx
    push    %edi
    mov     $0x30303030,%eax
    mov     %edx,%edi
    mov     $0x19,%ecx
    rep stos %eax,%es:(%edi)
    pop     %edi
    ret

Variant one wins.
[Bug target/55953] hand loop faster then builtin memset
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55953

Andrew Pinski changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|                            |x86_64-*-*
          Component|c                           |target

--- Comment #1 from Andrew Pinski 2013-01-11 23:35:53 UTC ---
This is a target issue. The first function uses the AVX/SSE registers while
the second only uses the integer registers. So the target decides not to
vectorize memset but only use the integer registers.
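
The test case itself is not quoted in this part of the thread. Judging from the disassembly in comments #2 and #3 (a 100-byte fill with the value 0x30), the two functions being compared presumably look something like the C sketch below. The names `buf`, `fill_loop`, `fill_memset` and `check_buf` are hypothetical, and the buffer size is inferred from the disassembly rather than taken from the reporter's source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BUF_SIZE 100            /* assumed: matches `mov r2, #100` and 0x19 dwords */

static char buf[BUF_SIZE];

/* Variant one: hand-written loop. At high optimization levels the
 * vectorizer can turn this into wide vector stores plus a byte tail. */
void fill_loop(void)
{
    for (size_t i = 0; i < BUF_SIZE; ++i)
        buf[i] = '0';           /* 0x30 */
}

/* Variant two: builtin memset. Its expansion is chosen by the target
 * back end, which here uses `rep stos` or a call instead of vector stores. */
void fill_memset(void)
{
    memset(buf, '0', BUF_SIZE);
}

/* Sanity check: every byte of buf must hold '0'. */
void check_buf(void)
{
    for (size_t i = 0; i < BUF_SIZE; ++i)
        assert(buf[i] == '0');
}
```

Both variants leave the buffer in the same state, so any timing difference comes only from the code the compiler emits for each.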