[Bug target/22158] char global_var[] = "larger than 32 bytes"; uses silly amounts of alignment even with -Os
--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa dot ua 2005-06-24 06:34 ---
> One use of this macro is to increase alignment of medium-size
> data to make it all fit in fewer cache lines.

1) This potentially makes a single string fit into fewer cachelines, but it noticeably increases the total size of all strings!

2) If the cacheline is larger than 32 bytes, this optimization can even make things worse.

Unaligned string fits into a 64 byte (say, Athlon64) cacheline:
[..some_string.]
^0 ^32 ^64

Same string spills over to a second cacheline after alignment:
[...some_st][ring...]
^0 ^32 ^64

> Another is to
> cause character arrays to be word-aligned so that `strcpy' calls
> that copy constants to character arrays can be done inline.

I do not fully understand. Is this about non-static local char arrays initialized by a string?

void f() { char s[] = "Long str"; }

How does alignment affect this code? x86 CPUs can do unaligned loads/stores just fine, so the 'inlinability' of the implicit strcpy does not depend on alignment. Also, such local arrays are not very typical, so why optimize for this case?

-- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=22158
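The cache-line argument in point 2 can be checked with simple arithmetic. The following is a minimal sketch in C; the 40-byte length and the starting offset 20 are made-up numbers for illustration and are not taken from the bug report, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Number of cache lines touched by an object of `len` bytes (len > 0)
   placed at byte offset `off`, for a cache line of `line` bytes. */
static size_t lines_touched(size_t off, size_t len, size_t line)
{
    size_t first = off / line;
    size_t last = (off + len - 1) / line;
    return last - first + 1;
}

/* Round `off` up to the next multiple of `align` (a power of two). */
static size_t align_up(size_t off, size_t align)
{
    return (off + align - 1) & ~(align - 1);
}

/* Hypothetical case: a 40-byte string that would otherwise start at offset 20. */
static void cacheline_example(void)
{
    /* Unaligned: bytes 20..59 sit inside one 64-byte line. */
    assert(lines_touched(20, 40, 64) == 1);
    /* Aligned to 32: the string moves to 32..71 and crosses into a second line. */
    assert(align_up(20, 32) == 32);
    assert(lines_touched(32, 40, 64) == 2);
}
```

Calling cacheline_example() from a test program exercises both placements. The sketch only counts lines for a single object; it says nothing about how the padding shifts the objects that follow.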
[Bug target/22158] char global_var[] = "larger than 32 bytes"; uses silly amounts of alignment even with -Os
--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa dot ua 2005-06-23 13:03 --- Oh, I did look at http://gcc.gnu.org/ml/gcc-patches/2000-06/msg00860.html, I see 128 and 256 bit alignment added, but I don't immediately see where it is applied to byte arrays (strings) - patch is not so small, where should I look? -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=22158
[Bug target/22158] char global_var[] = "larger than 32 bytes"; uses silly amounts of alignment even with -Os
--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa dot ua 2005-06-23 12:56 ---
In the majority of cases, char msg[] = "A message" is used for text strings. These are _bytes_; they need no alignment whatsoever, let alone 32-byte alignment.

I'm perfectly fine if other people want this alignment, but I don't, which is why I use -Os. I want to suppress this behavior for -Os. Whether it is a bug or not is really a matter of how one defines 'bug'...

BTW, what is that other mysterious piece of code that aligns something else to 32 bytes?

-- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=22158
[Bug tree-optimization/22158] char global_var[] = "larger than 32 bytes"; uses silly amounts of alignment even with -Os
--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa dot ua 2005-06-23 07:07 --- Created an attachment (id=9132) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=9132&action=view) Same for ix86_local_alignment() -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=22158
[Bug tree-optimization/22158] char global_var[] = "larger than 32 bytes"; uses silly amounts of alignment even with -Os
--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa dot ua 2005-06-23 06:59 ---
Created an attachment (id=9131) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=9131&action=view)
While we are at it, speed up ix86_data_alignment

All if()s below are true only if align<128, so we can skip all of them.

And what is this?

    if (AGGREGATE_TYPE_P (type)
        && TYPE_SIZE (type)
        && TREE_CODE (TYPE_SIZE (type)) == INTEGER_CST
        && (TREE_INT_CST_LOW (TYPE_SIZE (type)) >= 256
            || TREE_INT_CST_HIGH (TYPE_SIZE (type)))
        && align < 256)
      return 256;

I do not remember anything which requires such wasteful alignment. Maybe a comment would be in order there. (Or removal ;)

-- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=22158
[Bug tree-optimization/22158] char global_var[] = "larger than 32 bytes"; uses silly amounts of alignment even with -Os
--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa dot ua 2005-06-23 06:04 --- Created an attachment (id=9130) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=9130&action=view) Same patch with slightly different formatting Also run tested -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=22158
[Bug tree-optimization/22158] char global_var[] = "larger than 32 bytes"; uses silly amounts of alignment even with -Os
--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa dot ua 2005-06-23 06:03 --- Created an attachment (id=9129) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=9129&action=view) Do not align at all if -Os Sorry only have 3.4.1 sources available locally... Patch is run tested. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=22158
[Bug tree-optimization/22158] New: char global_var[] = "larger than 32 bytes"; uses silly amounts of alignment even with -Os
static char *s0 = "";
static char s1[] = "";
static char *s2 = "";
static char s3[] = "";
void f(char*);
void g() { f(s0); f(s1); f(s2); f(s3); }

s1 and s2 are aligned on 32 bytes even with -Os, while s2 and s4 are not.

See http://gcc.gnu.org/ml/gcc/2002-01/msg01068.html, http://gcc.gnu.org/ml/gcc/2002-01/msg01068/i386.c.PATCH

--
Summary: char global_var[] = "larger than 32 bytes"; uses silly amounts of alignment even with -Os
Product: gcc
Version: 4.0.0
Status: UNCONFIRMED
Severity: normal
Priority: P2
Component: tree-optimization
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: vda at port dot imtp dot ilyichevsk dot odessa dot ua
CC: gcc-bugs at gcc dot gnu dot org
GCC build triplet: i386-pc-linux-gnu
GCC host triplet: i386-pc-linux-gnu
GCC target triplet: i386-pc-linux-gnu

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=22158
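One way to observe the alignment a given compiler actually picks is to look at the object addresses at run time. This is a minimal sketch, not the reporter's testcase: the variable names, the placeholder string contents, and the addr_mod helper are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Two file-scope objects mirroring the two kinds of declaration in the
   report: a char array and a pointer to a string literal.
   The contents are placeholders, chosen only to exceed 32 bytes. */
char array_var[] = "a placeholder string that is larger than 32 bytes";
const char *pointer_var = "a placeholder string that is larger than 32 bytes";

/* Address of p modulo `align`; 0 means p is aligned to `align` bytes. */
unsigned addr_mod(const void *p, unsigned align)
{
    return (unsigned)((uintptr_t)p % align);
}

/* Sanity check of the helper on synthetic addresses. */
void addr_mod_example(void)
{
    assert(addr_mod((const void *)(uintptr_t)64, 32) == 0);
    assert(addr_mod((const void *)(uintptr_t)65, 32) == 1);
}
```

Printing addr_mod(array_var, 32) and addr_mod(pointer_var, 32) from builds made with -O2 and with -Os shows what a particular toolchain does. The result depends on the compiler version and target, and an object can land on a 32-byte boundary by chance, so the .align directives in the -S output remain the more reliable check.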
[Bug inline-asm/22045] can't find a register in class 'GENERAL_REGS'
--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa dot ua 2005-06-14 07:06 ---
If I understand this correctly, older GCCs were able to figure out that when there are 5 registers available, "=&g" (__d3) can only be matched with memory (an on-stack local var), whereas with 6 regs it can use a register. But newer GCC cannot, and we need to explicitly say "=m". Isn't that a regression?

-- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=22045
[Bug rtl-optimization/21329] optimize i386 block copy
--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa dot ua 2005-05-02 09:10 ---
BTW, see above comment: gcc -O2 allocated 24 bytes on stack and never used them. ?!

Now, unoptimized compilation comparison:

--- t.s Mon May 2 11:41:20 2005
+++ t-new.s Mon May 2 11:39:40 2005
@@ -32,8 +32,8 @@
     movl $t21, %edi
     movl $w21, %esi
     cld
-    movl $9, %ecx
-    rep
+    movsl
+    movsl
     movsb
     popl %esi
     popl %edi
@@ -50,9 +50,9 @@
     movl $t22, %edi
     movl $w22, %esi
     cld
-    movl $10, %ecx
-    rep
-    movsb
+    movsl
+    movsl
+    movsw
     popl %esi
     popl %edi
     leave
@@ -68,8 +68,9 @@
     movl $t23, %edi
     movl $w23, %esi
     cld
-    movl $11, %ecx
-    rep
+    movsl
+    movsl
+    movsw
     movsb
     popl %esi
     popl %edi
@@ -86,9 +87,8 @@
     movl $t30, %edi
     movl $w30, %esi
     cld
-    movl $3, %eax
-    movl %eax, %ecx
-    rep
+    movsl
+    movsl
     movsl
     popl %esi
     popl %edi
@@ -105,9 +105,9 @@
     movl $t40, %edi
     movl $w40, %esi
     cld
-    movl $4, %eax
-    movl %eax, %ecx
-    rep
+    movsl
+    movsl
+    movsl
     movsl
     popl %esi
     popl %edi
@@ -168,34 +168,34 @@
     movl $t21, %edi
     movl $w21, %esi
     cld
-    movl $9, %ecx
-    rep
+    movsl
+    movsl
     movsb
     movl $t22, %edi
     movl $w22, %esi
     cld
-    movl $10, %ecx
-    rep
-    movsb
+    movsl
+    movsl
+    movsw
     movl $t23, %edi
     movl $w23, %esi
     cld
-    movl $11, %ecx
-    rep
+    movsl
+    movsl
+    movsw
     movsb
     movl $t30, %edi
     movl $w30, %esi
     cld
-    movl $3, %eax
-    movl %eax, %ecx
-    rep
+    movsl
+    movsl
     movsl
     movl $t40, %edi
     movl $w40, %esi
     cld
-    movl $4, %eax
-    movl %eax, %ecx
-    rep
+    movsl
+    movsl
+    movsl
     movsl
     movl $t50, %edi
     movl $w50, %esi

-- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21329
[Bug rtl-optimization/21329] optimize i386 block copy
--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa dot ua 2005-05-02 09:04 ---
Comparison between old and new code (-O2):

--- tO2.s Mon May 2 11:49:24 2005
+++ tO2-new.s Mon May 2 11:50:03 2005
@@ -35,8 +35,7 @@
     movl $t21, %edi
     movl $w21, %esi
     cld
-    movl $2, %ecx
-    rep
+    movsl
     movsl
     movsb
     popl %esi
@@ -55,8 +54,7 @@
     movl $t22, %edi
     movl $w22, %esi
     cld
-    movl $2, %ecx
-    rep
+    movsl
     movsl
     movsw
     popl %esi
@@ -75,8 +73,7 @@
     movl $t23, %edi
     movl $w23, %esi
     cld
-    movl $2, %ecx
-    rep
+    movsl
     movsl
     movsw
     movsb
@@ -96,8 +93,8 @@
     movl $t30, %edi
     movl $w30, %esi
     cld
-    movl $3, %ecx
-    rep
+    movsl
+    movsl
     movsl
     popl %esi
     popl %edi
@@ -115,8 +112,9 @@
     movl $t40, %edi
     movl $w40, %esi
     cld
-    movl $4, %ecx
-    rep
+    movsl
+    movsl
+    movsl
     movsl
     popl %esi
     popl %edi
@@ -169,7 +167,6 @@
     movl %esp, %ebp
     pushl %edi
     pushl %esi
-    subl $24, %esp
     movl w10, %eax
     movl %eax, t10
     movl w20, %eax
@@ -179,36 +176,34 @@
     movl $t21, %edi
     movl $w21, %esi
     cld
-    movl $2, %ecx
-    rep
+    movsl
     movsl
     movsb
     movl $t22, %edi
     movl $w22, %esi
-    movb $2, %cl
-    rep
+    movsl
     movsl
     movsw
     movl $t23, %edi
     movl $w23, %esi
-    movb $2, %cl
-    rep
+    movsl
     movsl
     movsw
     movsb
     movl $t30, %edi
     movl $w30, %esi
-    movb $3, %cl
-    rep
+    movsl
+    movsl
     movsl
     movl $t40, %edi
     movl $w40, %esi
-    movb $4, %cl
-    rep
+    movsl
+    movsl
+    movsl
     movsl
     movl $t50, %edi
     movl $w50, %esi
-    movb $5, %cl
+    movl $5, %ecx
     rep
     movsl
     movl $t60, %edi
@@ -216,7 +211,6 @@
     movb $6, %cl
     rep
     movsl
-    addl $24, %esp
     popl %esi
     popl %edi
     leave

-- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21329
[Bug rtl-optimization/21329] optimize i386 block copy
--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa dot ua 2005-05-02 09:02 --- Created an attachment (id=8791) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=8791&action=view) patch against 4.0.0 -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21329
[Bug rtl-optimization/21329] optimize i386 block copy
--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa dot ua 2005-05-02 09:00 --- Created an attachment (id=8790) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=8790&action=view) testcase -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21329
[Bug rtl-optimization/21329] New: optimize i386 block copy
gcc generates suboptimal i386 block copy code, like this:

    movl $9, %ecx
    rep
    movsb

or this:

    movl $2, %ecx
    rep
    movsl
    movsw

Such short copies can be done with a few movsl's instead. Patch is attached. Note that I am not familiar with gcc internals at all, so take it with reasonable suspicion.

--
Summary: optimize i386 block copy
Product: gcc
Version: 4.0.0
Status: UNCONFIRMED
Severity: normal
Priority: P2
Component: rtl-optimization
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: vda at port dot imtp dot ilyichevsk dot odessa dot ua
CC: gcc-bugs at gcc dot gnu dot org
GCC build triplet: i386-pc-linux-gnu
GCC host triplet: i386-pc-linux-gnu
GCC target triplet: i386-pc-linux-gnu

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21329
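The unrolled sequences proposed here follow a simple pattern: as many 4-byte moves as fit, then at most one 2-byte and one 1-byte move for the tail. The sketch below expresses only that counting in C. It is not the attached patch, which works inside GCC's i386 backend; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Counts of string-move instructions for an n-byte copy. */
struct copy_plan {
    size_t movsl; /* 4-byte moves */
    size_t movsw; /* 2-byte moves, 0 or 1 */
    size_t movsb; /* 1-byte moves, 0 or 1 */
};

/* Split n bytes into dword moves plus a word and/or byte tail. */
static struct copy_plan plan_copy(size_t n)
{
    struct copy_plan p;
    p.movsl = n / 4;
    p.movsw = (n & 2) ? 1 : 0;
    p.movsb = (n & 1) ? 1 : 0;
    return p;
}

/* The 9-byte case from the report: movsl, movsl, movsb. */
static void plan_copy_example(void)
{
    struct copy_plan p = plan_copy(9);
    assert(p.movsl == 2 && p.movsw == 0 && p.movsb == 1);
}
```

Unrolling only pays off for short copies. Beyond some length a rep-prefixed move is the smaller encoding, so a real implementation needs a cutoff, which this sketch leaves out.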
[Bug target/21147] Optimized code is much slower than non-optimized
--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa dot ua 2005-04-27 05:38 ---
Marking as invalid. I found out that this happens on Celeron but doesn't happen on Athlon. Must be an instruction scheduling artifact.

The same binaries were used:

# gcc -O2 -o twofish_O2 twofish.c
# gcc -O3 -o twofish_O3 twofish.c
# gcc -Os -o twofish_Os twofish.c
# gcc -o twofish twofish.c

On Celeron:

# ./twofish
Iterations/sec: 63584
# ./twofish_O2
Iterations/sec: 41836
# ./twofish_O3
Iterations/sec: 42604
# ./twofish_Os
Iterations/sec: 45956
# gcc -v
gcc version 3.4.3
# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 11
model name : Intel(R) Celeron(TM) CPU
stepping: 1
cpu MHz : 1196.222
cache size : 256 KB
physical id : 0
siblings: 1
fdiv_bug: no
hlt_bug : no
f00f_bug: no
coma_bug: no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse
bogomips: 2359.29

On Athlon:

# ./twofish
Iterations/sec: 65648
# ./twofish_O2
Iterations/sec: 64484
# ./twofish_O3
Iterations/sec: 71596
# ./twofish_Os
Iterations/sec: 63560
# cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 6
model : 8
model name : AMD Athlon(tm) XP 2400+
stepping: 1
cpu MHz : 2009.954
cache size : 256 KB
fdiv_bug: no
hlt_bug : no
f00f_bug: no
coma_bug: no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse pni syscall mmxext 3dnowext 3dnow
bogomips: 3964.92

--
What       |Removed     |Added
Status     |UNCONFIRMED |RESOLVED
Resolution |            |INVALID

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21147
[Bug rtl-optimization/21202] Extra register moves generated
--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa dot ua 2005-04-25 07:34 --- As you can see by inspecting .s file, I replaced gcc 3.4.3 with gcc 4.0.0 between compiles. Both of them produce extra moves. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21202
[Bug rtl-optimization/21202] New: Extra register moves generated
See below: two register->register moves which are not needed.

# cat byteorder.c
typedef unsigned long long u64;
typedef unsigned u32;
static inline u64 swab64(u64 val)
{
    union { struct { u32 a,b; } s; u64 u; } v;
    v.u = val;
    asm("bswapl %0 ; bswapl %1" : "=r" (v.s.b), "=r" (v.s.a) : "0" (v.s.a), "1" (v.s.b));
    return v.u;
}
extern u64 w;
void f() { w = swab64(w); }

# gcc -O3 byteorder.c -S
# cat byteorder.s
    .file "byteorder.c"
    .text
    .p2align 2,,3
.globl f
    .type f, @function
f:
    pushl %ebp
    movl %esp, %ebp
    pushl %esi
    pushl %ebx
    movl w, %esi
    movl w+4, %edx
    movl %esi, %ebx
    movl %edx, %esi
#APP
    bswapl %ebx ; bswapl %esi
#NO_APP
    movl %ebx, w+4
    popl %ebx
    movl %esi, w
    popl %esi
    leave
    ret
    .size f, .-f
    .section .note.GNU-stack,"",@progbits
    .ident "GCC: (GNU) 3.4.3"

# gcc -O3 byteorder.c -S; cat byteorder.s; gcc -v
    .file "byteorder.c"
    .text
    .p2align 2,,3
.globl f
    .type f, @function
f:
    pushl %ebp
    movl %esp, %ebp
    pushl %esi
    pushl %ebx
    movl w, %eax
    movl w+4, %edx
    movl %eax, %ebx
    movl %edx, %esi
#APP
    bswapl %ebx ; bswapl %esi
#NO_APP
    movl %esi, w
    movl %ebx, w+4
    popl %ebx
    popl %esi
    leave
    ret
    .size f, .-f
    .ident "GCC: (GNU) 4.0.0"
    .section .note.GNU-stack,"",@progbits
Using built-in specs.
Target: i386-pc-linux-gnu
Configured with: ../gcc-4.0.0.src/configure --prefix=/usr/app/gcc-4.0.0 --exec-prefix=/usr/app/gcc-4.0.0 --bindir=/usr/bin --sbindir=/usr/sbin --libexecdir=/usr/app/gcc-4.0.0/libexec --datadir=/usr/app/gcc-4.0.0/share --sysconfdir=/etc --sharedstatedir=/usr/app/gcc-4.0.0/var/com --localstatedir=/usr/app/gcc-4.0.0/var --libdir=/usr/lib --includedir=/usr/include --infodir=/usr/info --mandir=/usr/man --with-slibdir=/usr/app/gcc-4.0.0/lib --with-local-prefix=/usr/local --with-gxx-include-dir=/usr/app/gcc-4.0.0/include/g++-v3 --enable-languages=c,c++ --with-system-zlib --disable-nls --enable-threads=posix i386-pc-linux-gnu
Thread model: posix
gcc version 4.0.0

--
Summary: Extra register moves generated
Product: gcc
Version: 4.0.0
Status: UNCONFIRMED
Severity: normal
Priority: P2
Component: rtl-optimization
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: vda at port dot imtp dot ilyichevsk dot odessa dot ua
CC: gcc-bugs at gcc dot gnu dot org
GCC build triplet: i386-pc-linux-gnu
GCC host triplet: i386-pc-linux-gnu
GCC target triplet: i386-pc-linux-gnu

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21202
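For checking the result of swab64() independently of the inline asm, a plain C version built from shifts and masks can serve as a reference. This is a sketch for comparison only; the function names are hypothetical and it makes no claim about what code gcc generates for it.

```c
#include <assert.h>
#include <stdint.h>

/* Byte-swap a 32-bit value using shifts and masks only. */
static uint32_t swab32_portable(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000ff00u) |
           ((x << 8) & 0x00ff0000u) | (x << 24);
}

/* Byte-swap each 32-bit half and exchange the halves, which is what the
   two bswapl instructions plus the crossed operands do in the testcase. */
static uint64_t swab64_portable(uint64_t v)
{
    uint32_t lo = (uint32_t)v;
    uint32_t hi = (uint32_t)(v >> 32);
    return ((uint64_t)swab32_portable(lo) << 32) | swab32_portable(hi);
}

/* Swapping twice must give back the original value. */
static void swab64_example(void)
{
    uint64_t v = 0x0102030405060708ULL;
    assert(swab64_portable(v) == 0x0807060504030201ULL);
    assert(swab64_portable(swab64_portable(v)) == v);
}
```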
[Bug rtl-optimization/21150] Suboptimal byte extraction from 64bits
--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa dot ua 2005-04-24 13:26 ---
I don't think that the bug description is correct. I believe a similar observation will be valid for byte extraction from u32 and u16, and for u16-from-u32, etc.

Update for latest gcc. This is what 4.0.0 produces from the testcase:

# gcc -O2 -fomit-frame-pointer -S helper.c
# cat helper.s
[I removed non-essential stuff]

a:
    movl v+8, %eax
    shrl $8, %eax
    xorb v, %al
    xorb v+18, %al
    xorb v+27, %al
    xorb v+36, %al
    movl v+40, %edx
    movl v+44, %ecx
    movl %ecx, %edx
    xorl %ecx, %ecx
    shrl $8, %edx
    xorl %edx, %eax
    xorb v+54, %al
    xorb v+63, %al
    movzbl %al, %eax
    ret
b:
    movl v+8, %eax
    movl v+12, %edx
    shrdl $8, %edx, %eax
    shrl $8, %edx
    xorb v, %al
    movl v+16, %edx
    movl v+20, %ecx
    shrdl $16, %ecx, %edx
    shrl $16, %ecx
    xorl %edx, %eax
    movl v+24, %edx
    movl v+28, %ecx
    shrdl $24, %ecx, %edx
    shrl $24, %ecx
    xorl %edx, %eax
    xorb v+36, %al
    movl v+40, %edx
    movl v+44, %ecx
    movl %ecx, %edx
    xorl %ecx, %ecx
    shrl $8, %edx
    xorl %edx, %eax
    xorb v+54, %al
    xorb v+63, %al
    movzbl %al, %eax
    ret
c:
    movb v+9, %al
    xorb v, %al
    xorb v+18, %al
    xorb v+27, %al
    xorb v+36, %al
    xorb v+45, %al
    xorb v+54, %al
    xorb v+63, %al
    movzbl %al, %eax
    ret
d:
    movl v+8, %eax
    movl v+12, %edx
    shrdl $8, %edx, %eax
    shrl $8, %edx
    xorb v, %al
    movl v+16, %edx
    movl v+20, %ecx
    shrdl $16, %ecx, %edx
    shrl $16, %ecx
    xorl %edx, %eax
    movl v+24, %edx
    movl v+28, %ecx
    shrdl $24, %ecx, %edx
    shrl $24, %ecx
    xorl %edx, %eax
    xorb v+36, %al
    movl v+40, %edx
    movl v+44, %ecx
    movl %ecx, %edx
    xorl %ecx, %ecx
    shrl $8, %edx
    xorl %edx, %eax
    xorb v+54, %al
    xorb v+63, %al
    movzbl %al, %eax
    ret

As you can see, the results for a, b and d are far from optimal, while c is almost perfect. Note that people typically use d, i.e. this:

#define D7(v) (((v) >> 56))
#define D6(v) (((v) >> 48) & 0xff)
#define D5(v) (((v) >> 40) & 0xff)
#define D4(v) (((v) >> 32) & 0xff)
#define D3(v) (((v) >> 24) & 0xff)
#define D2(v) (((v) >> 16) & 0xff)
#define D1(v) (((v) >> 8) & 0xff)
#define D0(v) ((v) & 0xff)

-- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21150
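To make the comparison concrete, here is a small C sketch of the two ways to get at byte n of a 64-bit value: the shift-and-mask form of the D0..D7 macros, and a direct byte access. The function names are hypothetical, and the sources of the testcase variants a to d are in the attachment, not shown here, so this does not claim to reproduce any of them. The single-byte loads in listing c are simply what a direct byte access looks like once compiled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Shift-and-mask form, generalizing the D0..D7 macros; n is 0..7. */
static unsigned byte_by_shift(uint64_t v, unsigned n)
{
    return (unsigned)((v >> (8 * n)) & 0xff);
}

/* Direct byte access; n is 0..7. This agrees with byte_by_shift only on
   little-endian targets such as i386. On big-endian targets it returns
   byte 7-n instead. */
static unsigned byte_by_load(uint64_t v, unsigned n)
{
    unsigned char bytes[sizeof v];
    memcpy(bytes, &v, sizeof v);
    return bytes[n];
}

/* D5 of 0x0102030405060708 is 0x03. */
static void byte_extract_example(void)
{
    assert(byte_by_shift(0x0102030405060708ULL, 5) == 0x03);
}
```

On i386 both functions return the same value for every n, so a compiler is free to turn the shift form into a single byte load from memory, which is the optimization this report asks for.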
[Bug rtl-optimization/21182] gcc can use registers but uses stack instead
--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa dot ua 2005-04-24 13:05 ---
With 4.0.0: gcc -O2 gives the same result as gcc -O3, which is better than gcc 3.4.3 -O2 but worse than 3.4.3 -O3. For example:

    movl %edx, -20(%ebp)
    orl %ecx, %edi
    movl %ebx, %esi
    xorl %ecx, %esi
    andl %eax, %ebx
    xorl %edi, %ebx
    movl %eax, %ecx
    notl %ecx
    xorl %ebx, %ecx
    orl %edi, %eax
    xorl %eax, %esi
    rorl $19, %esi
    rorl $29, -20(%ebp)
    xorl %esi, %ebx
    xorl -20(%ebp), %ecx
    xorl -20(%ebp), %ebx
    rorl $31, %ebx
    leal 0(,%esi,8), %edx

1) Why was %edx stored in -20(%ebp)? There is no %edx usage in the following insns. The %edx value could stay in the register, and we could continue to work on it there.

2) rorl $31, %ebx == roll $1, %ebx, but the 1-bit roll insn is smaller.

-- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21182
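The equivalence in point 2 holds for any 32-bit value: rotating right by 31 is the same as rotating left by 1. A minimal C sketch of the identity follows, with hypothetical helper names. The size difference comes from the x86 encoding, where rotate-by-1 has its own opcode form without an immediate byte.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate right by n bits; n is reduced modulo 32 so that no shift
   count ever reaches 32. */
static uint32_t rotr32(uint32_t x, unsigned n)
{
    n &= 31;
    return (x >> n) | (x << ((32 - n) & 31));
}

/* Rotate left by n bits, with the same reduction. */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
}

/* rorl $31 and roll $1 produce the same result. */
static void rotate_example(void)
{
    assert(rotr32(0x80000001u, 31) == rotl32(0x80000001u, 1));
    assert(rotl32(0x80000001u, 1) == 0x00000003u);
}
```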
[Bug rtl-optimization/21182] gcc can use registers but uses stack instead
--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa dot ua 2005-04-23 22:54 ---
This is a comparison of the -O2 and -O3 code. The -O3 code has all modified variables in registers and thus is smaller and most likely faster.

-O2:

serpent_encrypt:
    pushl %ebp
    movl %esp, %ebp
    pushl %edi
    pushl %esi
    pushl %ebx
    subl $256, %esp
    movl 8(%ebp), %edx
    movl 16(%ebp), %eax
    movl 12(%eax), %ebx
    movl 12(%edx), %ecx
    xorl %ebx, %ecx
    movl (%edx), %edi
    movl %ecx, -20(%ebp)
    xorl (%eax), %edi
    movl 8(%edx), %ecx
    movl 4(%edx), %ebx
    movl -20(%ebp), %esi
    xorl 8(%eax), %ecx
    orl %edi, -20(%ebp)
    xorl 4(%eax), %ebx
    xorl %ebx, -20(%ebp)
    xorl %esi, %edi
    xorl %ecx, %esi
    andl %edi, %ebx
    xorl %edi, %ecx
    notl %esi
    xorl -20(%ebp), %edi
    movl %edx, -16(%ebp)

-O3:

serpent_encrypt:
    pushl %ebp
    movl %esp, %ebp
    pushl %edi
    pushl %esi
    pushl %ebx
    pushl %edx
    movl 8(%ebp), %edi
    movl 16(%ebp), %ecx
    movl 12(%edi), %eax
    xorl 12(%ecx), %eax
    movl 8(%edi), %esi
    movl 4(%edi), %edx
    movl (%edi), %ebx
    xorl 8(%ecx), %esi
    xorl 4(%ecx), %edx
    xorl (%ecx), %ebx
    movl %eax, %ecx
    orl %ebx, %ecx
    xorl %eax, %ebx
    xorl %esi, %eax
    xorl %edx, %ecx
    notl %eax
    andl %ebx, %edx
    xorl %eax, %edx
    xorl %ebx, %esi
    xorl %ecx, %ebx
    orl %ebx, %eax
    xorl %esi, %ebx
    andl %edx, %esi
    xorl %esi, %eax
    notl %edx

-- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21182
[Bug rtl-optimization/21182] gcc can use registers but uses stack instead
--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa dot ua 2005-04-23 22:49 ---
Aha! I found out that gcc will use registers with -O3, but not with -O2.

# gcc -O3 serpent.c -S -o serpent-O3.s
# gcc -O2 serpent.c -S -o serpent-O2.s
# ls -l
-rw-r--r-- 1 root root 27975 Apr 24 01:47 serpent-O2.s
-rw-r--r-- 1 root root 21566 Apr 24 01:47 serpent-O3.s
# wc -l serpent-O2.s serpent-O3.s
1558 serpent-O2.s
1265 serpent-O3.s
2823 total

I don't have 4.0.0 here yet...

-- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21182
[Bug rtl-optimization/21182] gcc can use registers but uses stack instead
--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa dot ua 2005-04-23 22:32 --- Created an attachment (id=8719) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=8719&action=view) testcase. change #if 0 into #if 1 and compare resulting asm -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21182
[Bug rtl-optimization/21182] New: gcc can use registers but uses stack instead
In this long but relatively simple function, gcc could store all frequently used local variables in registers, but it fails to do so. gcc can be forced to do this optimization with asm("reg") modifiers. The resulting code is ~1k smaller.

# gcc -v
Reading specs from /.share/usr/app/gcc-3.4.3/bin/../lib/gcc/i386-pc-linux-gnu/3.4.3/specs
Configured with: ../gcc-3.4.3/configure --prefix=/usr/app/gcc-3.4.3 --exec-prefix=/usr/app/gcc-3.4.3 --bindir=/usr/bin --sbindir=/usr/sbin --libexecdir=/usr/app/gcc-3.4.3/libexec --datadir=/usr/app/gcc-3.4.3/share --sysconfdir=/etc --sharedstatedir=/usr/app/gcc-3.4.3/var/com --localstatedir=/usr/app/gcc-3.4.3/var --libdir=/usr/lib --includedir=/usr/include --infodir=/usr/info --mandir=/usr/man --with-slibdir=/usr/app/gcc-3.4.3/lib --with-local-prefix=/usr/local --with-gxx-include-dir=/usr/app/gcc-3.4.3/include/g++-v3 --enable-languages=c,c++ --with-system-zlib --disable-nls --enable-threads=posix i386-pc-linux-gnu
Thread model: posix
gcc version 3.4.3

--
Summary: gcc can use registers but uses stack instead
Product: gcc
Version: 3.4.3
Status: UNCONFIRMED
Severity: normal
Priority: P2
Component: rtl-optimization
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: vda at port dot imtp dot ilyichevsk dot odessa dot ua
CC: gcc-bugs at gcc dot gnu dot org
GCC build triplet: i386-pc-linux-gnu
GCC host triplet: i386-pc-linux-gnu
GCC target triplet: i386-pc-linux-gnu

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21182
[Bug target/21147] Optimized code is much slower than non-optimized
--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa dot ua 2005-04-21 13:36 ---
The testcase measures how many twofish_setkey() calls can be executed per second. By inserting an extra 'return 0;' in the body of that function and rerunning the testcase, we can find out where it spends most of its execution time. The testcase already has such a return (and a large comment) right after the for() loop which runs much faster in the non-optimized compile. Move "return 0" above the loop and things return to normal (-O2 is faster than non-optimized).

-- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21147
[Bug rtl-optimization/21150] Suboptimal byte extraction from larger integers
--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa dot ua 2005-04-21 13:12 --- Created an attachment (id=8701) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=8701&action=view) generate assembly with -S and compare results -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21150
[Bug rtl-optimization/21150] New: Suboptimal byte extraction from larger integers
Bytes are typically extracted from e.g. u64's by something like #define D5(v) (((v) >> 40) & 0xff) Testcase shows that gcc does not optimize this "good enough". -- Summary: Suboptimal byte extraction from larger integers Product: gcc Version: 3.4.3 Status: UNCONFIRMED Severity: normal Priority: P2 Component: rtl-optimization AssignedTo: unassigned at gcc dot gnu dot org ReportedBy: vda at port dot imtp dot ilyichevsk dot odessa dot ua CC: gcc-bugs at gcc dot gnu dot org GCC build triplet: i386-pc-linux-gnu GCC host triplet: i386-pc-linux-gnu GCC target triplet: i386-pc-linux-gnu http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21150
[Bug rtl-optimization/21147] Optimized code is much slower than non-optimized
--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa dot ua 2005-04-21 13:05 --- Created an attachment (id=8700) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=8700&action=view) move "return 0;" around to find out where does that happens -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21147
[Bug rtl-optimization/21147] New: Optimized code is much slower than non-optimized
See testcase.

# gcc twofish.c;./a.out
Iterations/sec: 63252
# gcc -Os twofish.c;./a.out
Iterations/sec: 45544
# gcc -O2 twofish.c;./a.out
Iterations/sec: 40192
# gcc -v
Reading specs from /.share/usr/app/gcc-3.4.3/bin/../lib/gcc/i386-pc-linux-gnu/3.4.3/specs
Configured with: ../gcc-3.4.3/configure --prefix=/usr/app/gcc-3.4.3 --exec-prefix=/usr/app/gcc-3.4.3 --bindir=/usr/bin --sbindir=/usr/sbin --libexecdir=/usr/app/gcc-3.4.3/libexec --datadir=/usr/app/gcc-3.4.3/share --sysconfdir=/etc --sharedstatedir=/usr/app/gcc-3.4.3/var/com --localstatedir=/usr/app/gcc-3.4.3/var --libdir=/usr/lib --includedir=/usr/include --infodir=/usr/info --mandir=/usr/man --with-slibdir=/usr/app/gcc-3.4.3/lib --with-local-prefix=/usr/local --with-gxx-include-dir=/usr/app/gcc-3.4.3/include/g++-v3 --enable-languages=c,c++ --with-system-zlib --disable-nls --enable-threads=posix i386-pc-linux-gnu
Thread model: posix
gcc version 3.4.3

--
Summary: Optimized code is much slower than non-optimized
Product: gcc
Version: 3.4.3
Status: UNCONFIRMED
Severity: normal
Priority: P2
Component: rtl-optimization
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: vda at port dot imtp dot ilyichevsk dot odessa dot ua
CC: gcc-bugs at gcc dot gnu dot org
GCC build triplet: i386-pc-linux-gnu
GCC host triplet: i386-pc-linux-gnu
GCC target triplet: i386-pc-linux-gnu

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21147
[Bug rtl-optimization/21141] [3.4 Regression] excessive stack usage
--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa dot ua 2005-04-21 11:29 --- Whoops no, locals are 256 bytes only. (/me is looking for some coffee) -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21141
[Bug rtl-optimization/21141] [3.4 Regression] excessive stack usage
--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa dot ua 2005-04-21 11:27 ---
> Though on 4.0.0/4.1.0, we get better:
> subl $260, %esp

It's way too good. Declared locals should take 512 bytes, plus any temporaries for spills. Please find fixed testcase. My fault.

-- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21141
[Bug rtl-optimization/21141] excessive stack usage
--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa dot ua 2005-04-21 06:08 --- Created an attachment (id=8695) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=8695&action=view) testcase Use gcc -O2 -S t.c -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21141
[Bug tree-optimization/21141] New: excessive stack usage
# gcc -v
Reading specs from /.share/usr/app/gcc-3.4.3/bin/../lib/gcc/i386-pc-linux-gnu/3.4.3/specs
Configured with: ../gcc-3.4.3/configure --prefix=/usr/app/gcc-3.4.3 --exec-prefix=/usr/app/gcc-3.4.3 --bindir=/usr/bin --sbindir=/usr/sbin --libexecdir=/usr/app/gcc-3.4.3/libexec --datadir=/usr/app/gcc-3.4.3/share --sysconfdir=/etc --sharedstatedir=/usr/app/gcc-3.4.3/var/com --localstatedir=/usr/app/gcc-3.4.3/var --libdir=/usr/lib --includedir=/usr/include --infodir=/usr/info --mandir=/usr/man --with-slibdir=/usr/app/gcc-3.4.3/lib --with-local-prefix=/usr/local --with-gxx-include-dir=/usr/app/gcc-3.4.3/include/g++-v3 --enable-languages=c,c++ --with-system-zlib --disable-nls --enable-threads=posix i386-pc-linux-gnu
Thread model: posix
gcc version 3.4.3

Does not happen with -Os
Does not happen with 3.4.1
I have a testcase

--
Summary: excessive stack usage
Product: gcc
Version: 3.4.3
Status: UNCONFIRMED
Severity: normal
Priority: P2
Component: tree-optimization
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: vda at port dot imtp dot ilyichevsk dot odessa dot ua
CC: gcc-bugs at gcc dot gnu dot org
GCC build triplet: i386-pc-linux-gnu
GCC host triplet: i386-pc-linux-gnu
GCC target triplet: i386-pc-linux-gnu

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21141