[Bug target/22158] char global_var[] = "larger than 32 bytes"; uses silly amounts of alignment even with -Os

2005-06-23 Thread vda at port dot imtp dot ilyichevsk dot odessa dot ua

--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa 
dot ua  2005-06-24 06:34 ---
>One use of this macro is to increase alignment of medium-size
>data to make it all fit in fewer cache lines.

1) This may make a single string fit into fewer cachelines,
but it noticeably increases the total size of all strings!
2) If the cacheline is larger than 32 bytes, this optimization can even
make things worse:

Unaligned string fits into 64 byte (say, Athlon64) cacheline:
[..some_string.]
^0  ^32 ^64

Same string spills over to second cacheline after alignment:
[...some_st][ring...]
^0  ^32 ^64
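
The two diagrams above can be checked with a bit of arithmetic. This is only a sketch: the helper names `lines_spanned` and `align_up` and the example offsets are made up for illustration and do not come from GCC.

```c
#include <assert.h>
#include <stddef.h>

/* Number of cache lines touched by an object of `size` bytes that
   starts at byte `offset`, with cache lines of `line` bytes.
   Hypothetical helper, for illustration only. */
static size_t lines_spanned(size_t offset, size_t size, size_t line)
{
    assert(line > 0 && size > 0);
    size_t first = offset / line;
    size_t last = (offset + size - 1) / line;
    return last - first + 1;
}

/* Round `offset` up to a multiple of `align` (must be a power of two). */
static size_t align_up(size_t offset, size_t align)
{
    assert(align > 0 && (align & (align - 1)) == 0);
    return (offset + align - 1) & ~(align - 1);
}
```

For example, a 40-byte string placed at offset 20 occupies bytes 20..59 and sits in one 64-byte cacheline. Aligning it to 32 bytes moves it to offset 32, so it occupies bytes 32..71 and touches two cachelines.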

>Another is to 
>cause character arrays to be word-aligned so that `strcpy' calls
>that copy constants to character arrays can be done inline.

I do not fully understand. Is this about non-static local
char arrays initialized from a string literal?

void f() {
    char s[] = "Long str";
}

How does alignment affect this code? x86 CPUs can do unaligned loads/stores
just fine, so the 'inlinability' of the implicit strcpy does not depend on
alignment. Also, such local arrays are not very typical, so why optimize for
this case?


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=22158


[Bug target/22158] char global_var[] = "larger than 32 bytes"; uses silly amounts of alignment even with -Os

2005-06-23 Thread vda at port dot imtp dot ilyichevsk dot odessa dot ua

--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa 
dot ua  2005-06-23 13:03 ---
Oh, I did look at http://gcc.gnu.org/ml/gcc-patches/2000-06/msg00860.html,
I see 128 and 256 bit alignment added, but I don't immediately see where it is
applied to byte arrays (strings) - patch is not so small, where should I look?


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=22158


[Bug target/22158] char global_var[] = "larger than 32 bytes"; uses silly amounts of alignment even with -Os

2005-06-23 Thread vda at port dot imtp dot ilyichevsk dot odessa dot ua

--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa 
dot ua  2005-06-23 12:56 ---
In the majority of cases char msg[] = "A message" is used for text strings.
These are _bytes_, they need no alignment whatsoever, let alone 32-byte alignment.

I'm perfectly fine if other people want this, but I don't, which is why I use -Os.
I want to suppress this behavior for -Os.

Whether it is a bug or not is really a matter of how you define 'bug'...

BTW, what is that other mysterious piece of code aligning something else to 32
bytes?

-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=22158


[Bug tree-optimization/22158] char global_var[] = "larger than 32 bytes"; uses silly amounts of alignment even with -Os

2005-06-23 Thread vda at port dot imtp dot ilyichevsk dot odessa dot ua

--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa 
dot ua  2005-06-23 07:07 ---
Created an attachment (id=9132)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=9132&action=view)
Same for ix86_local_alignment()


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=22158


[Bug tree-optimization/22158] char global_var[] = "larger than 32 bytes"; uses silly amounts of alignment even with -Os

2005-06-22 Thread vda at port dot imtp dot ilyichevsk dot odessa dot ua

--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa 
dot ua  2005-06-23 06:59 ---
Created an attachment (id=9131)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=9131&action=view)
While we are at it, speed up ix86_data_alignment

All if()s below are true only if align<128, so we can skip all of them.

And what is this?

  if (AGGREGATE_TYPE_P (type)
      && TYPE_SIZE (type)
      && TREE_CODE (TYPE_SIZE (type)) == INTEGER_CST
      && (TREE_INT_CST_LOW (TYPE_SIZE (type)) >= 256
          || TREE_INT_CST_HIGH (TYPE_SIZE (type))) && align < 256)
    return 256;

I do not remember anything which requires such wasteful alignment.
Maybe a comment would be in order there. (Or removal ;)
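
To make the effect of that check concrete, here is a simplified model of the decision. It is NOT the real ix86_data_alignment(): the function name, the plain integer parameters and the `optimize_size` switch (standing in for the -Os behaviour asked for in this bug) are assumptions made for illustration. All sizes are in bits, so 256 means 32 bytes.

```c
#include <assert.h>

/* Simplified, hypothetical model of the alignment decision quoted above.
   Sizes and alignments are in bits. Not GCC's actual code. */
static unsigned model_data_alignment(int is_aggregate,
                                     unsigned long size_bits,
                                     unsigned align_bits,
                                     int optimize_size)
{
    assert(align_bits > 0);

    /* Behaviour requested in this bug: leave alignment alone for -Os. */
    if (optimize_size)
        return align_bits;

    /* The check quoted above: aggregates of 256 bits (32 bytes) or more
       get bumped to 256-bit alignment. */
    if (is_aggregate && size_bits >= 256 && align_bits < 256)
        return 256;

    return align_bits;
}
```

In this model a 40-byte char array with natural 8-bit alignment comes back as 256 when not optimizing for size and as 8 when it is.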

-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=22158


[Bug tree-optimization/22158] char global_var[] = "larger than 32 bytes"; uses silly amounts of alignment even with -Os

2005-06-22 Thread vda at port dot imtp dot ilyichevsk dot odessa dot ua

--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa 
dot ua  2005-06-23 06:04 ---
Created an attachment (id=9130)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=9130&action=view)
Same patch with slightly different formatting

Also run tested

-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=22158


[Bug tree-optimization/22158] char global_var[] = "larger than 32 bytes"; uses silly amounts of alignment even with -Os

2005-06-22 Thread vda at port dot imtp dot ilyichevsk dot odessa dot ua

--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa 
dot ua  2005-06-23 06:03 ---
Created an attachment (id=9129)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=9129&action=view)
Do not align at all if -Os

Sorry, I only have 3.4.1 sources available locally...

Patch is run tested.

-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=22158


[Bug tree-optimization/22158] New: char global_var[] = "larger than 32 bytes"; uses silly amounts of alignment even with -Os

2005-06-22 Thread vda at port dot imtp dot ilyichevsk dot odessa dot ua
static char *s0 = "";
static char s1[] = "";
static char *s2 = "";
static char s3[] = "";

void f(char*);
void g() {
    f(s0);
    f(s1);
    f(s2);
    f(s3);
}

s1 and s3 (the arrays) are aligned to 32 bytes even with -Os, while the
strings that s0 and s2 point to are not.
See http://gcc.gnu.org/ml/gcc/2002-01/msg01068.html,
http://gcc.gnu.org/ml/gcc/2002-01/msg01068/i386.c.PATCH
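
One way to observe this without reading the generated assembly is to look at the addresses at run time. The helper below is a sketch only: the name `observed_alignment` is made up, and what it reports for a given object depends on the compiler version, flags and linker.

```c
#include <assert.h>
#include <stdint.h>

/* Largest power of two, capped at `cap`, that divides the address `p`.
   `cap` must itself be a power of two. Hypothetical helper. */
static uintptr_t observed_alignment(const void *p, uintptr_t cap)
{
    uintptr_t a = (uintptr_t)p;
    uintptr_t align = 1;

    assert(cap > 0 && (cap & (cap - 1)) == 0);
    /* Each zero low bit doubles the alignment the address satisfies. */
    while (align < cap && (a & align) == 0)
        align <<= 1;
    return align;
}
```

Calling it as observed_alignment(s1, 64) and observed_alignment(s0, 64) in the testcase above would show whether the array and the pointed-to string ended up on a 32-byte boundary in a particular build.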

-- 
   Summary: char global_var[] = "larger than 32 bytes"; uses silly
amounts of alignment even with -Os
   Product: gcc
   Version: 4.0.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P2
 Component: tree-optimization
AssignedTo: unassigned at gcc dot gnu dot org
    ReportedBy: vda at port dot imtp dot ilyichevsk dot odessa dot ua
CC: gcc-bugs at gcc dot gnu dot org
 GCC build triplet: i386-pc-linux-gnu
  GCC host triplet: i386-pc-linux-gnu
GCC target triplet: i386-pc-linux-gnu


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=22158


[Bug inline-asm/22045] can't find a register in class 'GENERAL_REGS'

2005-06-14 Thread vda at port dot imtp dot ilyichevsk dot odessa dot ua

--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa 
dot ua  2005-06-14 07:06 ---
If I understand this correctly, older GCCs were able to
figure out that when there are 5 registers available,
"=&g" (__d3) can only be matched with memory (an on-stack local var),
whereas with 6 regs it can use a register.

But newer GCC cannot, and we need to explicitly say "=m".

Isn't it a regression?

-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=22045


[Bug rtl-optimization/21329] optimize i386 block copy

2005-05-02 Thread vda at port dot imtp dot ilyichevsk dot odessa dot ua

--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa 
dot ua  2005-05-02 09:10 ---
BTW, see the comment above: gcc -O2 allocated 24 bytes on the stack
and never used them. ?!

Now, unoptimized compilation comparison:

--- t.s Mon May  2 11:41:20 2005
+++ t-new.s Mon May  2 11:39:40 2005
@@ -32,8 +32,8 @@
movl$t21, %edi
movl$w21, %esi
cld
-   movl$9, %ecx
-   rep
+   movsl
+   movsl
movsb
popl%esi
popl%edi
@@ -50,9 +50,9 @@
movl$t22, %edi
movl$w22, %esi
cld
-   movl$10, %ecx
-   rep
-   movsb
+   movsl
+   movsl
+   movsw
popl%esi
popl%edi
leave
@@ -68,8 +68,9 @@
movl$t23, %edi
movl$w23, %esi
cld
-   movl$11, %ecx
-   rep
+   movsl
+   movsl
+   movsw
movsb
popl%esi
popl%edi
@@ -86,9 +87,8 @@
movl$t30, %edi
movl$w30, %esi
cld
-   movl$3, %eax
-   movl%eax, %ecx
-   rep
+   movsl
+   movsl
movsl
popl%esi
popl%edi
@@ -105,9 +105,9 @@
movl$t40, %edi
movl$w40, %esi
cld
-   movl$4, %eax
-   movl%eax, %ecx
-   rep
+   movsl
+   movsl
+   movsl
movsl
popl%esi
popl%edi
@@ -168,34 +168,34 @@
movl$t21, %edi
movl$w21, %esi
cld
-   movl$9, %ecx
-   rep
+   movsl
+   movsl
movsb
movl$t22, %edi
movl$w22, %esi
cld
-   movl$10, %ecx
-   rep
-   movsb
+   movsl
+   movsl
+   movsw
movl$t23, %edi
movl$w23, %esi
cld
-   movl$11, %ecx
-   rep
+   movsl
+   movsl
+   movsw
movsb
movl$t30, %edi
movl$w30, %esi
cld
-   movl$3, %eax
-   movl%eax, %ecx
-   rep
+   movsl
+   movsl
movsl
movl$t40, %edi
movl$w40, %esi
cld
-   movl$4, %eax
-   movl%eax, %ecx
-   rep
+   movsl
+   movsl
+   movsl
movsl
movl$t50, %edi
movl$w50, %esi


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21329


[Bug rtl-optimization/21329] optimize i386 block copy

2005-05-02 Thread vda at port dot imtp dot ilyichevsk dot odessa dot ua

--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa 
dot ua  2005-05-02 09:04 ---
Comparison between old and new code (-O2):

--- tO2.s   Mon May  2 11:49:24 2005
+++ tO2-new.s   Mon May  2 11:50:03 2005
@@ -35,8 +35,7 @@
movl$t21, %edi
movl$w21, %esi
cld
-   movl$2, %ecx
-   rep
+   movsl
movsl
movsb
popl%esi
@@ -55,8 +54,7 @@
movl$t22, %edi
movl$w22, %esi
cld
-   movl$2, %ecx
-   rep
+   movsl
movsl
movsw
popl%esi
@@ -75,8 +73,7 @@
movl$t23, %edi
movl$w23, %esi
cld
-   movl$2, %ecx
-   rep
+   movsl
movsl
movsw
movsb
@@ -96,8 +93,8 @@
movl$t30, %edi
movl$w30, %esi
cld
-   movl$3, %ecx
-   rep
+   movsl
+   movsl
movsl
popl%esi
popl%edi
@@ -115,8 +112,9 @@
movl$t40, %edi
movl$w40, %esi
cld
-   movl$4, %ecx
-   rep
+   movsl
+   movsl
+   movsl
movsl
popl%esi
popl%edi
@@ -169,7 +167,6 @@
movl%esp, %ebp
pushl   %edi
pushl   %esi
-   subl$24, %esp
movlw10, %eax
movl%eax, t10
movlw20, %eax
@@ -179,36 +176,34 @@
movl$t21, %edi
movl$w21, %esi
cld
-   movl$2, %ecx
-   rep
+   movsl
movsl
movsb
movl$t22, %edi
movl$w22, %esi
-   movb$2, %cl
-   rep
+   movsl
movsl
movsw
movl$t23, %edi
movl$w23, %esi
-   movb$2, %cl
-   rep
+   movsl
movsl
movsw
movsb
movl$t30, %edi
movl$w30, %esi
-   movb$3, %cl
-   rep
+   movsl
+   movsl
movsl
movl$t40, %edi
movl$w40, %esi
-   movb$4, %cl
-   rep
+   movsl
+   movsl
+   movsl
movsl
movl$t50, %edi
movl$w50, %esi
-   movb$5, %cl
+   movl$5, %ecx
rep
movsl
movl$t60, %edi
@@ -216,7 +211,6 @@
movb$6, %cl
rep
movsl
-   addl$24, %esp
popl%esi
popl%edi
leave


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21329


[Bug rtl-optimization/21329] optimize i386 block copy

2005-05-02 Thread vda at port dot imtp dot ilyichevsk dot odessa dot ua

--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa 
dot ua  2005-05-02 09:02 ---
Created an attachment (id=8791)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=8791&action=view)
patch against 4.0.0


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21329


[Bug rtl-optimization/21329] optimize i386 block copy

2005-05-02 Thread vda at port dot imtp dot ilyichevsk dot odessa dot ua

--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa 
dot ua  2005-05-02 09:00 ---
Created an attachment (id=8790)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=8790&action=view)
testcase


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21329


[Bug rtl-optimization/21329] New: optimize i386 block copy

2005-05-02 Thread vda at port dot imtp dot ilyichevsk dot odessa dot ua
gcc generates suboptimal i386 block copy code, like this:
movl    $9, %ecx
rep
movsb
or this:
movl    $2, %ecx
rep
movsl
movsw

Such short copies can be done with a few movsl's instead
(plus a trailing movsw and/or movsb for the remaining bytes).
Patch is attached. Note that I am not familiar with gcc
internals at all, so take it with reasonable suspicion.
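
The decomposition meant here can be sketched in C as follows. This only illustrates the idea, it is not the patch itself; `struct copy_plan` and `plan_short_copy` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* How many of each string-move instruction a short copy needs. */
struct copy_plan {
    size_t movsl;   /* 4-byte moves */
    size_t movsw;   /* 2-byte moves (0 or 1) */
    size_t movsb;   /* 1-byte moves (0 or 1) */
};

/* Split a byte count into dword moves plus at most one word and one
   byte move. Hypothetical helper, for illustration only. */
static struct copy_plan plan_short_copy(size_t nbytes)
{
    struct copy_plan p;

    p.movsl = nbytes / 4;
    p.movsw = (nbytes >> 1) & 1;
    p.movsb = nbytes & 1;

    /* The pieces must add up to the original length. */
    assert(4 * p.movsl + 2 * p.movsw + p.movsb == nbytes);
    return p;
}
```

For 9 bytes this gives two movsl and one movsb; for 11 bytes, two movsl, one movsw and one movsb.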

-- 
   Summary: optimize i386 block copy
   Product: gcc
   Version: 4.0.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P2
 Component: rtl-optimization
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: vda at port dot imtp dot ilyichevsk dot odessa dot ua
CC: gcc-bugs at gcc dot gnu dot org
 GCC build triplet: i386-pc-linux-gnu
  GCC host triplet: i386-pc-linux-gnu
GCC target triplet: i386-pc-linux-gnu


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21329


[Bug target/21147] Optimized code is much slower than non-optimized

2005-04-26 Thread vda at port dot imtp dot ilyichevsk dot odessa dot ua

--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa 
dot ua  2005-04-27 05:38 ---
Marking as invalid. I found out that this happens on Celeron
but doesn't happen on Athlon. Must be instruction scheduling
artifact.

Same binaries were used:

# gcc -O2 -o twofish_O2 twofish.c
# gcc -O3 -o twofish_O3 twofish.c
# gcc -Os -o twofish_Os twofish.c
# gcc -o twofish twofish.c

On Celeron:

# ./twofish
Iterations/sec: 63584
# ./twofish_O2
Iterations/sec: 41836
# ./twofish_O3
Iterations/sec: 42604
# ./twofish_Os
Iterations/sec: 45956
# gcc -v
gcc version 3.4.3
# cat /proc/cpuinfo
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model   : 11
model name  : Intel(R) Celeron(TM) CPU
stepping: 1
cpu MHz : 1196.222
cache size  : 256 KB
physical id : 0
siblings: 1
fdiv_bug: no
hlt_bug : no
f00f_bug: no
coma_bug: no
fpu : yes
fpu_exception   : yes
cpuid level : 2
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 mmx fxsr sse
bogomips: 2359.29

On Athlon:

# ./twofish
Iterations/sec: 65648
# ./twofish_O2
Iterations/sec: 64484
# ./twofish_O3
Iterations/sec: 71596
# ./twofish_Os
Iterations/sec: 63560
# cat /proc/cpuinfo
processor   : 0
vendor_id   : AuthenticAMD
cpu family  : 6
model   : 8
model name  : AMD Athlon(tm) XP 2400+
stepping: 1
cpu MHz : 2009.954
cache size  : 256 KB
fdiv_bug: no
hlt_bug : no
f00f_bug: no
coma_bug: no
fpu : yes
fpu_exception   : yes
cpuid level : 1
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36
mmx fxsr sse pni syscall mmxext 3dnowext 3dnow
bogomips: 3964.92


-- 
   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution||INVALID


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21147


[Bug rtl-optimization/21202] Extra register moves generated

2005-04-25 Thread vda at port dot imtp dot ilyichevsk dot odessa dot ua

--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa 
dot ua  2005-04-25 07:34 ---
As you can see by inspecting .s file,
I replaced gcc 3.4.3 with gcc 4.0.0 between compiles.
Both of them produce extra moves.

-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21202


[Bug rtl-optimization/21202] New: Extra register moves generated

2005-04-25 Thread vda at port dot imtp dot ilyichevsk dot odessa dot ua
See below: two register->register moves which
are not needed.

# cat byteorder.c
typedef unsigned long long u64;
typedef unsigned u32;

static inline u64 swab64(u64 val) {
    union {
        struct { u32 a,b; } s;
        u64 u;
    } v;
    v.u = val;
    asm("bswapl %0 ; bswapl %1"
        : "=r" (v.s.b), "=r" (v.s.a)
        : "0" (v.s.a), "1" (v.s.b));
    return v.u;
}

extern u64 w;
void f() {
    w = swab64(w);
}
# gcc -O3 byteorder.c -S
# cat byteorder.s
.file   "byteorder.c"
.text
.p2align 2,,3
.globl f
.type   f, @function
f:
pushl   %ebp
movl    %esp, %ebp
pushl   %esi
pushl   %ebx
movl    w, %esi
movl    w+4, %edx
movl    %esi, %ebx
movl    %edx, %esi
#APP
bswapl %ebx ; bswapl %esi
#NO_APP
movl    %ebx, w+4
popl    %ebx
movl    %esi, w
popl    %esi
leave
ret
.size   f, .-f
.section        .note.GNU-stack,"",@progbits
.ident  "GCC: (GNU) 3.4.3"
# gcc -O3 byteorder.c -S; cat byteorder.s; gcc -v
.file   "byteorder.c"
.text
.p2align 2,,3
.globl f
.type   f, @function
f:
pushl   %ebp
movl    %esp, %ebp
pushl   %esi
pushl   %ebx
movl    w, %eax
movl    w+4, %edx
movl    %eax, %ebx
movl    %edx, %esi
#APP
bswapl %ebx ; bswapl %esi
#NO_APP
movl    %esi, w
movl    %ebx, w+4
popl    %ebx
popl    %esi
leave
ret
.size   f, .-f
.ident  "GCC: (GNU) 4.0.0"
.section        .note.GNU-stack,"",@progbits
Using built-in specs.
Target: i386-pc-linux-gnu
Configured with: ../gcc-4.0.0.src/configure --prefix=/usr/app/gcc-4.0.0
--exec-prefix=/usr/app/gcc-4.0.0 --bindir=/usr/bin --sbindir=/usr/sbin
--libexecdir=/usr/app/gcc-4.0.0/libexec --datadir=/usr/app/gcc-4.0.0/share
--sysconfdir=/etc --sharedstatedir=/usr/app/gcc-4.0.0/var/com
--localstatedir=/usr/app/gcc-4.0.0/var --libdir=/usr/lib
--includedir=/usr/include --infodir=/usr/info --mandir=/usr/man
--with-slibdir=/usr/app/gcc-4.0.0/lib --with-local-prefix=/usr/local
--with-gxx-include-dir=/usr/app/gcc-4.0.0/include/g++-v3
--enable-languages=c,c++ --with-system-zlib --disable-nls --enable-threads=posix
i386-pc-linux-gnu
Thread model: posix
gcc version 4.0.0
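
For reference, what the inline asm in swab64() computes can be written without asm. The sketch below keeps the same union layout as the testcase; the names `bswap32_portable` and `swab64_portable` are made up, and this is not proposed as a replacement for the testcase, which is about the extra moves around the asm.

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the four bytes of a 32-bit value (what bswapl does). */
static uint32_t bswap32_portable(uint32_t x)
{
    return (x >> 24)
         | ((x >> 8) & 0x0000ff00u)
         | ((x << 8) & 0x00ff0000u)
         | (x << 24);
}

/* Same data flow as the asm version: each half is byte-swapped and
   stored into the other half. Hypothetical helper. */
static uint64_t swab64_portable(uint64_t val)
{
    union {
        struct { uint32_t a, b; } s;
        uint64_t u;
    } v;
    uint32_t a, b;

    assert(sizeof v.s == sizeof v.u);   /* no padding between a and b */
    v.u = val;
    a = v.s.a;
    b = v.s.b;
    v.s.b = bswap32_portable(a);
    v.s.a = bswap32_portable(b);
    return v.u;
}
```

Ideally the compiler would load each half of w into a register, bswap it in place and store it to the other half, with no register->register moves in between.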

-- 
   Summary: Extra register moves generated
   Product: gcc
   Version: 4.0.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P2
     Component: rtl-optimization
    AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: vda at port dot imtp dot ilyichevsk dot odessa dot ua
CC: gcc-bugs at gcc dot gnu dot org
 GCC build triplet: i386-pc-linux-gnu
  GCC host triplet: i386-pc-linux-gnu
GCC target triplet: i386-pc-linux-gnu


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21202


[Bug rtl-optimization/21150] Suboptimal byte extraction from 64bits

2005-04-24 Thread vda at port dot imtp dot ilyichevsk dot odessa dot ua

--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa 
dot ua  2005-04-24 13:26 ---
I don't think the bug description is correct.
I believe a similar observation will hold for byte extraction
from u32 and u16, for u16-from-u32, etc.

Update for the latest gcc.
This is what 4.0.0 produces from the testcase:

# gcc -O2 -fomit-frame-pointer -S helper.c
# cat helper.s
 [I removed non-essential stuff]
a:
movl    v+8, %eax
shrl    $8, %eax
xorb    v, %al
xorb    v+18, %al
xorb    v+27, %al
xorb    v+36, %al
movl    v+40, %edx
movl    v+44, %ecx
movl    %ecx, %edx
xorl    %ecx, %ecx
shrl    $8, %edx
xorl    %edx, %eax
xorb    v+54, %al
xorb    v+63, %al
movzbl  %al, %eax
ret
b:
movl    v+8, %eax
movl    v+12, %edx
shrdl   $8, %edx, %eax
shrl    $8, %edx
xorb    v, %al
movl    v+16, %edx
movl    v+20, %ecx
shrdl   $16, %ecx, %edx
shrl    $16, %ecx
xorl    %edx, %eax
movl    v+24, %edx
movl    v+28, %ecx
shrdl   $24, %ecx, %edx
shrl    $24, %ecx
xorl    %edx, %eax
xorb    v+36, %al
movl    v+40, %edx
movl    v+44, %ecx
movl    %ecx, %edx
xorl    %ecx, %ecx
shrl    $8, %edx
xorl    %edx, %eax
xorb    v+54, %al
xorb    v+63, %al
movzbl  %al, %eax
ret
c:
movb    v+9, %al
xorb    v, %al
xorb    v+18, %al
xorb    v+27, %al
xorb    v+36, %al
xorb    v+45, %al
xorb    v+54, %al
xorb    v+63, %al
movzbl  %al, %eax
ret
d:
movl    v+8, %eax
movl    v+12, %edx
shrdl   $8, %edx, %eax
shrl    $8, %edx
xorb    v, %al
movl    v+16, %edx
movl    v+20, %ecx
shrdl   $16, %ecx, %edx
shrl    $16, %ecx
xorl    %edx, %eax
movl    v+24, %edx
movl    v+28, %ecx
shrdl   $24, %ecx, %edx
shrl    $24, %ecx
xorl    %edx, %eax
xorb    v+36, %al
movl    v+40, %edx
movl    v+44, %ecx
movl    %ecx, %edx
xorl    %ecx, %ecx
shrl    $8, %edx
xorl    %edx, %eax
xorb    v+54, %al
xorb    v+63, %al
movzbl  %al, %eax
ret

As you can see, a,b and d results are far from optimal,
while c is almost perfect.

Note that people typically use d, i.e. this:
#define D7(v) (((v) >> 56))
#define D6(v) (((v) >> 48) & 0xff)
#define D5(v) (((v) >> 40) & 0xff)
#define D4(v) (((v) >> 32) & 0xff)
#define D3(v) (((v) >> 24) & 0xff)
#define D2(v) (((v) >> 16) & 0xff)
#define D1(v) (((v) >>  8) & 0xff)
#define D0(v) ((v) & 0xff)
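
To spell out the two styles being compared: the macros above shift and mask, while the asm for c loads single bytes straight from memory. The sketch below shows both forms side by side. The function names are made up, the byte-load form assumes a little-endian target such as i386, and I am only guessing that a byte-wise access like this is what gives the c-style code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Shift-and-mask form, generalising the D0..D7 macros:
   byte 0 is the least significant byte. */
static unsigned byte_by_shift(uint64_t v, unsigned i)
{
    assert(i < 8);
    return (unsigned)((v >> (8 * i)) & 0xff);
}

/* Direct byte access through the object representation.
   Matches byte_by_shift() on little-endian targets only. */
static unsigned byte_by_load(uint64_t v, unsigned i)
{
    unsigned char bytes[sizeof v];

    assert(i < 8);
    memcpy(bytes, &v, sizeof v);
    return bytes[i];
}
```

With v = 0x0102030405060708, byte_by_shift(v, 5) is 0x03, the same value D5(v) yields.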


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21150


[Bug rtl-optimization/21182] gcc can use registers but uses stack instead

2005-04-24 Thread vda at port dot imtp dot ilyichevsk dot odessa dot ua

--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa 
dot ua  2005-04-24 13:05 ---
With 4.0.0: gcc -O2 gives the same result as gcc -O3,
which is better than gcc 3.4.3 -O2 but worse than 3.4.3 -O3.
For example:

movl    %edx, -20(%ebp)
orl     %ecx, %edi
movl    %ebx, %esi
xorl    %ecx, %esi
andl    %eax, %ebx
xorl    %edi, %ebx
movl    %eax, %ecx
notl    %ecx
xorl    %ebx, %ecx
orl     %edi, %eax
xorl    %eax, %esi
rorl    $19, %esi
rorl    $29, -20(%ebp)
xorl    %esi, %ebx
xorl    -20(%ebp), %ecx
xorl    -20(%ebp), %ebx
rorl    $31, %ebx
leal    0(,%esi,8), %edx

1) Why was %edx stored in -20(%ebp)? %edx is not used in the
following insns; its value could have stayed in the register and
we could have continued to work on it there.
2) rorl $31, %ebx == roll $1, %ebx, but the 1-bit roll insn is
smaller.
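
The equivalence in point 2 is easy to check in C. The rotate helpers below are the usual shift-or idiom; the names `rotr32` and `rotl32` are just for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit value right by n bits, 0 <= n < 32. */
static uint32_t rotr32(uint32_t x, unsigned n)
{
    assert(n < 32);
    return n ? (x >> n) | (x << (32 - n)) : x;
}

/* Rotate a 32-bit value left by n bits, 0 <= n < 32. */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    assert(n < 32);
    return n ? (x << n) | (x >> (32 - n)) : x;
}
```

For any x, rotr32(x, 31) equals rotl32(x, 1), since rotating right by n is the same as rotating left by 32 - n.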


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21182


[Bug rtl-optimization/21182] gcc can use registers but uses stack instead

2005-04-23 Thread vda at port dot imtp dot ilyichevsk dot odessa dot ua

--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa 
dot ua  2005-04-23 22:54 ---
This is a comparison of the -O2 and -O3 code.
The -O3 code has all modified variables in registers
and thus is smaller and most likely faster.

-O2:

serpent_encrypt:
pushl   %ebp
movl    %esp, %ebp
pushl   %edi
pushl   %esi
pushl   %ebx
subl    $256, %esp
movl    8(%ebp), %edx
movl    16(%ebp), %eax
movl    12(%eax), %ebx
movl    12(%edx), %ecx
xorl    %ebx, %ecx
movl    (%edx), %edi
movl    %ecx, -20(%ebp)
xorl    (%eax), %edi
movl    8(%edx), %ecx
movl    4(%edx), %ebx
movl    -20(%ebp), %esi
xorl    8(%eax), %ecx
orl     %edi, -20(%ebp)
xorl    4(%eax), %ebx
xorl    %ebx, -20(%ebp)
xorl    %esi, %edi
xorl    %ecx, %esi
andl    %edi, %ebx
xorl    %edi, %ecx
notl    %esi
xorl    -20(%ebp), %edi
movl    %edx, -16(%ebp)

-O3:

serpent_encrypt:
pushl   %ebp
movl    %esp, %ebp
pushl   %edi
pushl   %esi
pushl   %ebx
pushl   %edx
movl    8(%ebp), %edi
movl    16(%ebp), %ecx
movl    12(%edi), %eax
xorl    12(%ecx), %eax
movl    8(%edi), %esi
movl    4(%edi), %edx
movl    (%edi), %ebx
xorl    8(%ecx), %esi
xorl    4(%ecx), %edx
xorl    (%ecx), %ebx
movl    %eax, %ecx
orl     %ebx, %ecx
xorl    %eax, %ebx
xorl    %esi, %eax
xorl    %edx, %ecx
notl    %eax
andl    %ebx, %edx
xorl    %eax, %edx
xorl    %ebx, %esi
xorl    %ecx, %ebx
orl     %ebx, %eax
xorl    %esi, %ebx
andl    %edx, %esi
xorl    %esi, %eax
notl    %edx



-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21182


[Bug rtl-optimization/21182] gcc can use registers but uses stack instead

2005-04-23 Thread vda at port dot imtp dot ilyichevsk dot odessa dot ua

--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa 
dot ua  2005-04-23 22:49 ---
Aha!
I found out that gcc will use registers with -O3, but not with -O2.

# gcc -O3 serpent.c -S -o serpent-O3.s
# gcc -O2 serpent.c -S -o serpent-O2.s
# ls -l
-rw-r--r--  1 root root 27975 Apr 24 01:47 serpent-O2.s
-rw-r--r--  1 root root 21566 Apr 24 01:47 serpent-O3.s
# wc -l serpent-O2.s serpent-O3.s
 1558 serpent-O2.s
 1265 serpent-O3.s
 2823 total

I don't have 4.0.0 here yet...

-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21182


[Bug rtl-optimization/21182] gcc can use registers but uses stack instead

2005-04-23 Thread vda at port dot imtp dot ilyichevsk dot odessa dot ua

--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa 
dot ua  2005-04-23 22:32 ---
Created an attachment (id=8719)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=8719&action=view)
testcase. change #if 0 into #if 1 and compare resulting asm


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21182


[Bug rtl-optimization/21182] New: gcc can use registers but uses stack instead

2005-04-23 Thread vda at port dot imtp dot ilyichevsk dot odessa dot ua
In this long but relatively simple function, gcc
could keep all frequently used local variables in registers,
but it fails to do so.

gcc can be forced to do this optimization with asm("reg") modifiers.
The resulting code is ~1k smaller.

# gcc -v
Reading specs from
/.share/usr/app/gcc-3.4.3/bin/../lib/gcc/i386-pc-linux-gnu/3.4.3/specs
Configured with: ../gcc-3.4.3/configure --prefix=/usr/app/gcc-3.4.3
--exec-prefix=/usr/app/gcc-3.4.3 --bindir=/usr/bin --sbindir=/usr/sbin
--libexecdir=/usr/app/gcc-3.4.3/libexec --datadir=/usr/app/gcc-3.4.3/share
--sysconfdir=/etc --sharedstatedir=/usr/app/gcc-3.4.3/var/com
--localstatedir=/usr/app/gcc-3.4.3/var --libdir=/usr/lib
--includedir=/usr/include --infodir=/usr/info --mandir=/usr/man
--with-slibdir=/usr/app/gcc-3.4.3/lib --with-local-prefix=/usr/local
--with-gxx-include-dir=/usr/app/gcc-3.4.3/include/g++-v3
--enable-languages=c,c++ --with-system-zlib --disable-nls --enable-threads=posix
i386-pc-linux-gnu
Thread model: posix
gcc version 3.4.3

-- 
   Summary: gcc can use registers but uses stack instead
   Product: gcc
   Version: 3.4.3
Status: UNCONFIRMED
  Severity: normal
  Priority: P2
 Component: rtl-optimization
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: vda at port dot imtp dot ilyichevsk dot odessa dot ua
CC: gcc-bugs at gcc dot gnu dot org
 GCC build triplet: i386-pc-linux-gnu
  GCC host triplet: i386-pc-linux-gnu
GCC target triplet: i386-pc-linux-gnu


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21182


[Bug target/21147] Optimized code is much slower than non-optimized

2005-04-21 Thread vda at port dot imtp dot ilyichevsk dot odessa dot ua

--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa 
dot ua  2005-04-21 13:36 ---
The testcase measures how many twofish_setkey() calls can be executed per second.
By inserting an extra 'return 0;' into the body of that function and rerunning
the testcase, we can find out where it spends most of its execution time.

The testcase already has such a return (and a large comment) right after the
for() loop which runs much faster in the non-optimized compile.

Move "return 0" above the loop and things return to normal
(-O2 is faster than non-optimized).

-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21147


[Bug rtl-optimization/21150] Suboptimal byte extraction from larger integers

2005-04-21 Thread vda at port dot imtp dot ilyichevsk dot odessa dot ua

--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa 
dot ua  2005-04-21 13:12 ---
Created an attachment (id=8701)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=8701&action=view)
generate assembly with -S and compare results


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21150


[Bug rtl-optimization/21150] New: Suboptimal byte extraction from larger integers

2005-04-21 Thread vda at port dot imtp dot ilyichevsk dot odessa dot ua
Bytes are typically extracted from e.g. u64's by something like

#define D5(v) (((v) >> 40) & 0xff)

The testcase shows that gcc does not optimize this well enough.

-- 
   Summary: Suboptimal byte extraction from larger integers
   Product: gcc
   Version: 3.4.3
Status: UNCONFIRMED
  Severity: normal
  Priority: P2
 Component: rtl-optimization
AssignedTo: unassigned at gcc dot gnu dot org
    ReportedBy: vda at port dot imtp dot ilyichevsk dot odessa dot ua
CC: gcc-bugs at gcc dot gnu dot org
 GCC build triplet: i386-pc-linux-gnu
  GCC host triplet: i386-pc-linux-gnu
GCC target triplet: i386-pc-linux-gnu


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21150


[Bug rtl-optimization/21147] Optimized code is much slower than non-optimized

2005-04-21 Thread vda at port dot imtp dot ilyichevsk dot odessa dot ua

--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa 
dot ua  2005-04-21 13:05 ---
Created an attachment (id=8700)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=8700&action=view)
move "return 0;" around to find out where does that happens


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21147


[Bug rtl-optimization/21147] New: Optimized code is much slower than non-optimized

2005-04-21 Thread vda at port dot imtp dot ilyichevsk dot odessa dot ua
See testcase.

# gcc twofish.c;./a.out
Iterations/sec: 63252
# gcc -Os twofish.c;./a.out
Iterations/sec: 45544
# gcc -O2 twofish.c;./a.out
Iterations/sec: 40192

# gcc -v
Reading specs from
/.share/usr/app/gcc-3.4.3/bin/../lib/gcc/i386-pc-linux-gnu/3.4.3/specs
Configured with: ../gcc-3.4.3/configure --prefix=/usr/app/gcc-3.4.3
--exec-prefix=/usr/app/gcc-3.4.3 --bindir=/usr/bin --sbindir=/usr/sbin
--libexecdir=/usr/app/gcc-3.4.3/libexec --datadir=/usr/app/gcc-3.4.3/share
--sysconfdir=/etc --sharedstatedir=/usr/app/gcc-3.4.3/var/com
--localstatedir=/usr/app/gcc-3.4.3/var --libdir=/usr/lib
--includedir=/usr/include --infodir=/usr/info --mandir=/usr/man
--with-slibdir=/usr/app/gcc-3.4.3/lib --with-local-prefix=/usr/local
--with-gxx-include-dir=/usr/app/gcc-3.4.3/include/g++-v3
--enable-languages=c,c++ --with-system-zlib --disable-nls --enable-threads=posix
i386-pc-linux-gnu
Thread model: posix
gcc version 3.4.3

-- 
   Summary: Optimized code is much slower than non-optimized
   Product: gcc
   Version: 3.4.3
Status: UNCONFIRMED
  Severity: normal
  Priority: P2
 Component: rtl-optimization
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: vda at port dot imtp dot ilyichevsk dot odessa dot ua
CC: gcc-bugs at gcc dot gnu dot org
 GCC build triplet: i386-pc-linux-gnu
  GCC host triplet: i386-pc-linux-gnu
GCC target triplet: i386-pc-linux-gnu


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21147


[Bug rtl-optimization/21141] [3.4 Regression] excessive stack usage

2005-04-21 Thread vda at port dot imtp dot ilyichevsk dot odessa dot ua

--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa 
dot ua  2005-04-21 11:29 ---
Whoops no, locals are 256 bytes only.
(/me is looking for some coffee)

-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21141


[Bug rtl-optimization/21141] [3.4 Regression] excessive stack usage

2005-04-21 Thread vda at port dot imtp dot ilyichevsk dot odessa dot ua

--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa 
dot ua  2005-04-21 11:27 ---
>Though on 4.0.0/4.1.0, we get better:
>subl    $260, %esp

It's way too good. Declared locals should take 512 bytes, plus
any temporaries for spills.

Please find fixed testcase. My fault.

-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21141


[Bug rtl-optimization/21141] excessive stack usage

2005-04-20 Thread vda at port dot imtp dot ilyichevsk dot odessa dot ua

--- Additional Comments From vda at port dot imtp dot ilyichevsk dot odessa 
dot ua  2005-04-21 06:08 ---
Created an attachment (id=8695)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=8695&action=view)
testcase

Use gcc -O2 -S t.c

-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21141


[Bug tree-optimization/21141] New: excessive stack usage

2005-04-20 Thread vda at port dot imtp dot ilyichevsk dot odessa dot ua
# gcc -v
Reading specs from
/.share/usr/app/gcc-3.4.3/bin/../lib/gcc/i386-pc-linux-gnu/3.4.3/specs
Configured with: ../gcc-3.4.3/configure --prefix=/usr/app/gcc-3.4.3
--exec-prefix=/usr/app/gcc-3.4.3 --bindir=/usr/bin --sbindir=/usr/sbin
--libexecdir=/usr/app/gcc-3.4.3/libexec --datadir=/usr/app/gcc-3.4.3/share
--sysconfdir=/etc --sharedstatedir=/usr/app/gcc-3.4.3/var/com
--localstatedir=/usr/app/gcc-3.4.3/var --libdir=/usr/lib
--includedir=/usr/include --infodir=/usr/info --mandir=/usr/man
--with-slibdir=/usr/app/gcc-3.4.3/lib --with-local-prefix=/usr/local
--with-gxx-include-dir=/usr/app/gcc-3.4.3/include/g++-v3
--enable-languages=c,c++ --with-system-zlib --disable-nls --enable-threads=posix
i386-pc-linux-gnu
Thread model: posix
gcc version 3.4.3

Does not happen with -Os
Does not happen with 3.4.1

I have a testcase

-- 
   Summary: excessive stack usage
   Product: gcc
   Version: 3.4.3
Status: UNCONFIRMED
  Severity: normal
  Priority: P2
 Component: tree-optimization
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: vda at port dot imtp dot ilyichevsk dot odessa dot ua
CC: gcc-bugs at gcc dot gnu dot org
 GCC build triplet: i386-pc-linux-gnu
  GCC host triplet: i386-pc-linux-gnu
GCC target triplet: i386-pc-linux-gnu


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21141