[Bug tree-optimization/17549] [4.0 Regression] 10% increase in codesize with C code compared to GCC 3.3

2005-02-10 Thread steven at gcc dot gnu dot org

--- Additional Comments From steven at gcc dot gnu dot org  2005-02-10 
22:59 ---
. 

-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|                            |FIXED


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17549


[Bug tree-optimization/17549] [4.0 Regression] 10% increase in codesize with C code compared to GCC 3.3

2005-02-10 Thread cvs-commit at gcc dot gnu dot org

--- Additional Comments From cvs-commit at gcc dot gnu dot org  2005-02-10 
22:57 ---
Subject: Bug 17549

CVSROOT:/cvs/gcc
Module name:gcc
Changes by: [EMAIL PROTECTED]   2005-02-10 22:57:31

Modified files:
gcc: ChangeLog tree-outof-ssa.c 

Log message:
PR tree-optimization/17549
* tree-outof-ssa.c (find_replaceable_in_bb): Do not allow
TER to replace a DEF with its expression if the DEF and the
rhs of the expression we replace into have the same root
variable.

Patches:
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/ChangeLog.diff?cvsroot=gcc&r1=2.7438&r2=2.7439
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/tree-outof-ssa.c.diff?cvsroot=gcc&r1=2.43&r2=2.44



-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17549


[Bug tree-optimization/17549] [4.0 Regression] 10% increase in codesize with C code compared to GCC 3.3

2005-02-10 Thread pinskia at gcc dot gnu dot org

--- Additional Comments From pinskia at gcc dot gnu dot org  2005-02-10 
21:11 ---
(In reply to comment #37)
> So for ppc this bug is still not fixed even with my patch.  Interesting data 
> point is the ppc32 size with -Os -fno-ivopts: 
>    2820       0       0    2820     b04 no-ivopts.o
>  
> So perhaps the pending IVopts patches will also help for this problem. 

I bet the problem here for PPC is the same as 18219 where we generate more than 
one IV.

-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17549


[Bug tree-optimization/17549] [4.0 Regression] 10% increase in codesize with C code compared to GCC 3.3

2005-02-10 Thread steven at gcc dot gnu dot org

--- Additional Comments From steven at gcc dot gnu dot org  2005-02-10 
10:06 ---
'size' for susan_edged_mod_1 .o files 
 
33 = pre 3.3.3-suse (hammer branch)
40 = CVS head 20050209 
patched = CVS head 20050209 with the 'TER hack' patch applied. 
 
i686: 
   text    data     bss     dec     hex filename
   2133       0       0    2133     855 33.o
   3003       0       0    3003     bbb 40.o
   2237       0       0    2237     8bd patched.o
 
amd64: 
   text    data     bss     dec     hex filename
   2710       0       0    2710     a96 33.o
   3414       0       0    3414     d56 40.o
   2421       0       0    2421     975 patched.o
 
ppc32: 
   text    data     bss     dec     hex filename
   2780       0       0    2780     adc 33.o
   3348       0       0    3348     d14 40.o
   3140       0       0    3140     c44 patched.o
 
So for ppc this bug is still not fixed even with my patch.  Interesting data 
point is the ppc32 size with -Os -fno-ivopts: 
   2820       0       0    2820     b04 no-ivopts.o
 
So perhaps the pending IVopts patches will also help for this problem. 
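
To put the 'size' numbers above in relative terms, here is a minimal sketch in C. The helper name size_change_percent is made up for this illustration and is not part of any tool mentioned in this report.

```c
#include <assert.h>

/* Relative change of new_size against base_size, in percent.
   A positive result means the object file grew.  */
static double
size_change_percent (unsigned base_size, unsigned new_size)
{
  assert (base_size != 0);
  return 100.0 * ((double) new_size - (double) base_size)
         / (double) base_size;
}
```

Fed with the text sizes from the tables, size_change_percent (2133, 3003) comes out at about 41% for i686 CVS head against 3.3, and about 5% once the patch is applied.  On ppc32 the patched object is still roughly 13% larger than the 3.3 one.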
 

-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17549


[Bug tree-optimization/17549] [4.0 Regression] 10% increase in codesize with C code compared to GCC 3.3

2005-02-10 Thread steven at gcc dot gnu dot org

--- Additional Comments From steven at gcc dot gnu dot org  2005-02-10 
09:08 ---
The slowdown is probably some unfortunate icache effect - could be anything 
from alignment to the slightly larger instructions due to using r8 instead of 
rcx.  I guess we should not care too much about such random effects that we 
cannot do anything about anyway.  I'm going to see if it doesn't hurt on i686, 
and submit the patch if things look good. 
 
 

-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17549


[Bug tree-optimization/17549] [4.0 Regression] 10% increase in codesize with C code compared to GCC 3.3

2005-02-09 Thread steven at gcc dot gnu dot org

--- Additional Comments From steven at gcc dot gnu dot org  2005-02-09 
23:35 ---
The entire diff of .optimized dumps and .s output for twolf on AMD64 is really 
small, in fact the asm output is different for only one file: 
 
 config1.c.t65.optimized   |  120 ++
 configure.c.t65.optimized |   78 +++--
 outpins.c.t65.optimized   |    6 +-
 outpins.s                 |   36 ++---
 qsorte.c.t65.optimized    |    3 -
 qsortg.c.t65.optimized    |    3 -
 qsortgdx.c.t65.optimized  |    3 -
 qsortx.c.t65.optimized    |    3 -
 readcell.c.t65.optimized  |    3 -
 readseg.c.t65.optimized   |    6 +-
 ucgxp.c.t65.optimized     |    3 -
 uloop.c.t65.optimized     |    6 +-
 12 files changed, 174 insertions(+), 96 deletions(-) 
 
The file with the assembler difference is outpins.c.  The relevant diff is 
below.  There is nothing in the diff that explains the ~4% slowdown I see in 
my SPEC benchmarks (3 runs, so the slowdown is consistent).  The same 
instructions are there, just ordered differently and using different 
registers.  So I'm not sure how to proceed... 
 
diff -u base/outpins.c.t65.optimized hacked/outpins.c.t65.optimized 
--- base/outpins.c.t65.optimized     2005-02-10 00:19:20.950581229 +0100 
+++ patched/outpins.c.t65.optimized  2005-02-10 00:16:19.436444879 +0100 
@@ -99,8 +99,9 @@ 
   pairArray.39 = pairArray; 
   carray.40 = carray; 
   D.3698 = *((struct cellbox * *) ((long unsigned int) *(*((int * *) D.3712 + 
pairArray.39 - 8B) + 4B) * 8) + carray.40); 
+  end.81 = D.3698->cxcenter + (int) D.3698->tileptr->left; 
   temp.59 = *(carray.40 + (struct cellbox * *) ((long unsigned int) 
*(*(pairArray.39 + (int * *) D.3712) + 4B) * 8)); 
-  end = MAX_EXPR <D.3698->cxcenter + (int) D.3698->tileptr->left, 
temp.59->cxcenter + (int) temp.59->tileptr->left>; 
+  end = MAX_EXPR <end.81, temp.59->cxcenter + (int) temp.59->tileptr->left>; 
 
 :; 
   return end; 
@@ -228,9 +229,10 @@ 
   D.3668 = *((int * *) D.3664 + pairArray.36 - 8B); 
   carray.37 = carray; 
   D.3646 = *((struct cellbox * *) ((long unsigned int) *(D.3668 + (int *) 
((long unsigned int) *D.3668 * 4)) * 8) + carray.37); 
+  end.121 = D.3646->cxcenter + (int) D.3646->tileptr->right; 
   D.3676 = *(pairArray.36 + (int * *) D.3664); 
   temp.99 = *(carray.37 + (struct cellbox * *) ((long unsigned int) *(D.3676 
+ (int *) ((long unsigned int) *D.3676 * 4)) * 8)); 
-  end = MIN_EXPR <D.3646->cxcenter + (int) D.3646->tileptr->right, 
temp.99->cxcenter + (int) temp.99->tileptr->right>; 
+  end = MIN_EXPR <end.121, temp.99->cxcenter + (int) 
temp.99->tileptr->right>; 
 
 :; 
   return end; 
diff -u base/outpins.s hacked/outpins.s 
--- base/outpins.s      2005-02-10 00:19:21.064543028 +0100
+++ patched/outpins.s   2005-02-10 00:16:19.551406289 +0100
@@ -18,18 +18,18 @@
    movq    -8(%rdx,%rcx), %rax
    movslq  4(%rax),%rax
    movq    (%rsi,%rax,8), %rdi
+   movq    40(%rdi), %rax
+   movswl  (%rax),%r8d
    movq    (%rcx,%rdx), %rax
+   addl    12(%rdi), %r8d
    movslq  4(%rax),%rax
    movq    (%rsi,%rax,8), %rdx
-   movq    40(%rdi), %rax
-   movswl  (%rax),%ecx
    movq    40(%rdx), %rax
-   addl    12(%rdi), %ecx
    movswl  (%rax),%eax
    addl    12(%rdx), %eax
-   cmpl    %eax, %ecx
-   cmovl   %eax, %ecx
-   movl    %ecx, %eax
+   cmpl    %eax, %r8d
+   cmovl   %eax, %r8d
+   movl    %r8d, %eax
    ret
    .p2align 4,,7
 .L11:
@@ -40,9 +40,9 @@
    movq    carray(%rip), %rax
    movq    (%rax,%rdx,8), %rdx
    movq    40(%rdx), %rax
-   movswl  (%rax),%ecx
-   addl    12(%rdx), %ecx
-   movl    %ecx, %eax
+   movswl  (%rax),%r8d
+   addl    12(%rdx), %r8d
+   movl    %r8d, %eax
    ret
    .p2align 4,,7
 .L12:
@@ -72,18 +72,18 @@
    movslq  (%rcx),%rax
    movslq  (%rcx,%rax,4),%rax
    movq    (%rdi,%rax,8), %rcx
+   movq    40(%rcx), %rax
+   movswl  2(%rax),%r8d
    movslq  (%rdx),%rax
+   addl    12(%rcx), %r8d
    movslq  (%rdx,%rax,4),%rax
    movq    (%rdi,%rax,8), %rdx
-   movq    40(%rcx), %rax
-   movswl  2(%rax),%esi
    movq    40(%rdx), %rax
-   addl    12(%rcx), %esi
    movswl  2(%rax),%eax
    addl    12(%rdx), %eax
-   cmpl    %eax, %esi
-   cmovg   %eax, %esi
-   movl    %esi, %eax
+   cmpl    %eax, %r8d
+   cmovg   %eax, %r8d
+   movl    %r8d, %eax
    ret
    .p2align 4,,7
 .L22:
@@ -95,9 +95,9 @@
    movq    carray(%rip), %rax
    movq    (%rax,%rdx,8), %rdx
    movq    40(%rdx), %rax
-   movswl  2(%rax),%esi
-   addl    12(%rdx), %esi
-   movl    %esi, %eax
+   movswl  2(%rax),%r8d
+   addl    12(%rdx), %r8d
+   movl    %r8d, %eax
    ret
    .p2align 4,,7
 .L23:
 
 

-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17549

[Bug tree-optimization/17549] [4.0 Regression] 10% increase in codesize with C code compared to GCC 3.3

2005-02-09 Thread steven at gcc dot gnu dot org

--- Additional Comments From steven at gcc dot gnu dot org  2005-02-09 
22:00 ---
My TER hack does fix most of the problems, but it also causes a significant 
regression in the SPEC twolf benchmark.  All other benchmarks are roughly the 
same.  I'll try to figure out what is causing the regression. 

-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17549


[Bug tree-optimization/17549] [4.0 Regression] 10% increase in codesize with C code compared to GCC 3.3

2005-02-08 Thread amacleod at redhat dot com

--- Additional Comments From amacleod at redhat dot com  2005-02-08 14:26 
---
(In reply to comment #28)
> Using var_to_partition does not help.  The reason is that the SSA names with 
> the same root var are not in the same partition, e.g. 
>  
> _7  -->  
> x_3  --> x 
> x_4 not coalesced with x -->  New temp:  'x.0' 
> x_5 not coalesced with x.0 -->  New temp:  'x.1' 
<...> 
>  
> Partition 0 (a - 1 ) 
> Partition 1 (b - 2 ) 
> Partition 2 (x - 3 ) 
> Partition 3 (x.0 - 4 ) 
> Partition 4 (x.1 - 5 ) 
> Partition 5 ( - 7 ) 
>  
> So if you replace the root var comparison in my hack with a check to make 
> sure 
> def and def2 are not in the same partition, that whole check will always be 
> false and you still get crap code. 
>  


Of course.  Doh.  Accumulation will result in live range splitting, so they will
all be different variables.  Stick with checking the root variable; it's probably
our simplest measure, I guess. :-)



-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17549


[Bug tree-optimization/17549] [4.0 Regression] 10% increase in codesize with C code compared to GCC 3.3

2005-02-08 Thread amacleod at redhat dot com

--- Additional Comments From amacleod at redhat dot com  2005-02-08 14:02 
---
(In reply to comment #30)
> Subject: Re:  [4.0 Regression] 10% increase in codesize with C code compared
> to GCC 3.3
> 
> On Mon, Feb 07, 2005 at 11:13:27PM -, steven at gcc dot gnu dot org wrote:
> >   x = a + b; 
> >   x = x * a; 
> >   x = x * b; 
> ...
> > After Coalescing: 
> ...
> > Partition 2 (x_3 - 3 ) 
> > Partition 3 (x_4 - 4 ) 
> > Partition 4 (x_5 - 5 ) 
> 
> That is curious.  Certainly not the way I'd have expected things to work.
> Why are we not coalescing here?  Do we think that x_4 as an input to the
> same insn that creates x_5 means that the two conflict?  Unless someone
> can convince me otherwise, I'd call this a bug.
> 

No, we don't think that, but they are disjoint live ranges, so we want to keep
them separate.  We do live range splitting on the way out of SSA.  There is no
conflict, just a reason NOT to coalesce them.  If there were a copy between them,
then we would consider coalescing them.
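
The "coalesce only across copies" rule can be pictured with a small union-find partition map.  This is a toy sketch in C, not the out-of-ssa code: the names partition_map, pm_init, pm_find and pm_coalesce_copy are invented for illustration, and the interference check a real coalescer needs is left out.

```c
#include <assert.h>

#define MAX_NAMES 16

/* Toy partition map: every SSA version starts in its own partition.  */
struct partition_map
{
  int parent[MAX_NAMES];
  int n;
};

static void
pm_init (struct partition_map *pm, int n)
{
  assert (n >= 0 && n <= MAX_NAMES);
  pm->n = n;
  for (int i = 0; i < n; i++)
    pm->parent[i] = i;
}

/* Return the representative of V's partition (with path halving).  */
static int
pm_find (struct partition_map *pm, int v)
{
  assert (v >= 0 && v < pm->n);
  while (pm->parent[v] != v)
    {
      pm->parent[v] = pm->parent[pm->parent[v]];
      v = pm->parent[v];
    }
  return v;
}

/* A copy "dst = src" is the only thing that makes two names
   candidates for sharing a partition in this sketch.  */
static void
pm_coalesce_copy (struct partition_map *pm, int dst, int src)
{
  int d = pm_find (pm, dst);
  int s = pm_find (pm, src);
  pm->parent[d] = s;
}
```

With x_3, x_4 and x_5 as names 0, 1 and 2 and no copies recorded, pm_find returns three different representatives, which corresponds to the three separate partitions in the dump.  Only after a call such as pm_coalesce_copy (&pm, 1, 0) do two of them share one.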




-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17549


[Bug tree-optimization/17549] [4.0 Regression] 10% increase in codesize with C code compared to GCC 3.3

2005-02-07 Thread steven at gcc dot gnu dot org

--- Additional Comments From steven at gcc dot gnu dot org  2005-02-08 
00:15 ---
Might as well make it mine while I'm looking at it. 
 

-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
         AssignedTo|unassigned at gcc dot gnu   |steven at gcc dot gnu dot
                   |dot org                     |org
             Status|NEW                         |ASSIGNED
   Last reconfirmed|2005-02-07 22:13:59         |2005-02-08 00:15:16
               date|                            |


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17549


[Bug tree-optimization/17549] [4.0 Regression] 10% increase in codesize with C code compared to GCC 3.3

2005-02-07 Thread rth at gcc dot gnu dot org

--- Additional Comments From rth at gcc dot gnu dot org  2005-02-07 23:36 
---
Subject: Re:  [4.0 Regression] 10% increase in codesize with C code compared to 
GCC 3.3

On Mon, Feb 07, 2005 at 11:13:27PM -, steven at gcc dot gnu dot org wrote:
>   x = a + b; 
>   x = x * a; 
>   x = x * b; 
...
> After Coalescing: 
...
> Partition 2 (x_3 - 3 ) 
> Partition 3 (x_4 - 4 ) 
> Partition 4 (x_5 - 5 ) 

That is curious.  Certainly not the way I'd have expected things to work.
Why are we not coalescing here?  Do we think that x_4 as an input to the
same insn that creates x_5 means that the two conflict?  Unless someone
can convince me otherwise, I'd call this a bug.


r~


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17549


[Bug tree-optimization/17549] [4.0 Regression] 10% increase in codesize with C code compared to GCC 3.3

2005-02-07 Thread steven at gcc dot gnu dot org

--- Additional Comments From steven at gcc dot gnu dot org  2005-02-07 
23:16 ---
Note the following: 
x_4 not coalesced with x -->  New temp:  'x.0' 
x_5 not coalesced with x.0 -->  New temp:  'x.1' 
Not very useful, because x_4 and x_5 have no uses left.  So you start with 
this: 
 
foo (xD.1447, aD.1448, bD.1449) 
{ 
  intD.0 D.1452; 
 
  # BLOCK 0 
  # PRED: ENTRY [100.0%]  (fallthru,exec) 
  xD.1447_3 = aD.1448_1 + bD.1449_2; 
  xD.1447_4 = aD.1448_1 * xD.1447_3; 
  xD.1447_5 = bD.1449_2 * xD.1447_4; 
  return xD.1447_5; 
  # SUCC: EXIT [100.0%] 
 
} 
 
and you end with this: 
 
foo (xD.1447, aD.1448, bD.1449) 
{ 
  intD.0 x.1D.1456; 
  intD.0 x.0D.1455; 
  intD.0 D.1452; 
 
  # BLOCK 0 
  # PRED: ENTRY [100.0%]  (fallthru,exec) 
  return bD.1449 * aD.1448 * (aD.1448 + bD.1449); 
  # SUCC: EXIT [100.0%] 
 
} 
 
Note the redundant temporaries. 
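
For reference, the two dumps above compute the same value.  A minimal C sketch of both forms follows; foo_after_ter and check_foo_equivalent are names made up here, and the code only mirrors the shape of the dumps rather than anything the compiler emits.

```c
#include <assert.h>

/* The source function: x is overwritten three times, and its
   incoming value is never read.  */
static int
foo (int x, int a, int b)
{
  x = a + b;
  x = x * a;
  x = x * b;
  return x;
}

/* The shape of the final dump: TER has folded all three definitions
   into the return statement, so no temporary is needed at all.  */
static int
foo_after_ter (int a, int b)
{
  return b * a * (a + b);
}

/* Both forms agree for inputs that do not overflow.  */
static void
check_foo_equivalent (int a, int b)
{
  assert (foo (0, a, b) == foo_after_ter (a, b));
}
```

Since the folded form needs no named temporary, the x.0 and x.1 declarations left in the second dump are dead weight.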
 

-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17549


[Bug tree-optimization/17549] [4.0 Regression] 10% increase in codesize with C code compared to GCC 3.3

2005-02-07 Thread steven at gcc dot gnu dot org

--- Additional Comments From steven at gcc dot gnu dot org  2005-02-07 
23:13 ---
Using var_to_partition does not help.  The reason is that the SSA names with 
the same root var are not in the same partition, e.g. 
 
int 
foo (int x, int a, int b) 
{ 
  x = a + b; 
  x = x * a; 
  x = x * b; 
  return x; 
} 
 
--> 
 
Sorted Coalesce list: 
 
Partition map 
 
Partition 0 (x_3 - 3 ) 
Partition 1 (x_4 - 4 ) 
Partition 2 (x_5 - 5 ) 
 
After Coalescing: 
 
Partition map 
 
Partition 0 (a_1 - 1 ) 
Partition 1 (b_2 - 2 ) 
Partition 2 (x_3 - 3 ) 
Partition 3 (x_4 - 4 ) 
Partition 4 (x_5 - 5 ) 
Partition 5 (_7 - 7 ) 
 
 
Replacing Expressions 
x_3 replace with --> a_1 + b_2 
x_4 replace with --> a_1 * x_3 
x_5 replace with --> b_2 * x_4 
 
_7  -->  
x_3  --> x 
x_4 not coalesced with x -->  New temp:  'x.0' 
x_5 not coalesced with x.0 -->  New temp:  'x.1' 
b_2  --> b 
a_1  --> a 
After Root variable replacement: 
 
Partition map 
 
Partition 0 (a - 1 ) 
Partition 1 (b - 2 ) 
Partition 2 (x - 3 ) 
Partition 3 (x.0 - 4 ) 
Partition 4 (x.1 - 5 ) 
Partition 5 ( - 7 ) 
 
So if you replace the root var comparison in my hack with a check to make sure 
def and def2 are not in the same partition, that whole check will always be 
false and you still get crap code. 
 

-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17549


[Bug tree-optimization/17549] [4.0 Regression] 10% increase in codesize with C code compared to GCC 3.3

2005-02-03 Thread amacleod at redhat dot com

--- Additional Comments From amacleod at redhat dot com  2005-02-03 16:05 
---
(In reply to comment #26)
> > Are we looking to do this at -O2 as well? I guess thats a key question.
> > at just -Os, it might very well be sufficient.
> 
> As stevenb noted today in IRC, the code reduction substantially comes from 
> less 
> spilling code being emitted, which also means that the generated code *will* 
> run faster. I guess it's better to have it at -O2 too (until the register 
> allocator gets fixed, but hey...)


You missed my point.  Sure, in the cases where we end up spilling less, it's
likely to run faster.  But we are going to miss combine opportunities in other
cases where we wouldn't have spilled in the first place.  If we are going to do
this at >=-O2, we're probably better off actually looking at how many live
ranges we are introducing to solve the issue.  If the big expression doesn't
introduce too many live ranges, we might as well allow TER to continue and give
the expanders a better view.  That's why TER exists in the first place.

In any case, it's all heuristic, so there will always be cases it doesn't work
right for.  If we measure SPEC or something else and the simple rootvar approach
works fine, then go with it.  If there are issues, then we can try something a
bit different.

-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17549


[Bug tree-optimization/17549] [4.0 Regression] 10% increase in codesize with C code compared to GCC 3.3

2005-02-03 Thread giovannibajo at libero dot it

--- Additional Comments From giovannibajo at libero dot it  2005-02-03 
15:15 ---
> Are we looking to do this at -O2 as well? I guess thats a key question.
> at just -Os, it might very well be sufficient.

As stevenb noted today in IRC, the code reduction substantially comes from less 
spilling code being emitted, which also means that the generated code *will* 
run faster. I guess it's better to have it at -O2 too (until the register 
allocator gets fixed, but hey...)

-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17549


[Bug tree-optimization/17549] [4.0 Regression] 10% increase in codesize with C code compared to GCC 3.3

2005-02-03 Thread amacleod at redhat dot com

--- Additional Comments From amacleod at redhat dot com  2005-02-03 14:37 
---
(In reply to comment #23)
> We have incoming into out-of-ssa,
> 
>x.1 = exp1
>x.2 = x.1 + exp2
>x.3 = x.2 + exp3
> 
> We're currently allowing TER to produce
> 
>x.3 = exp1 + exp2 + exp3
> 
> What if we were to disable TER substitution when the base variable on the lhs
> matches the base variable on the rhs?  So in this case we'd notice x.1 and x.2
> have the same base variable and not merge.  And (more importantly) so forth so
> that the definitions of x.20 and x.19 aren't merged either.


Of course, the real issue isn't just the root variable, is it?  We could have TER
introduce an expression with 1000 different base variables in the expression,
and the problem is that we've got 1000 variables live at once instead of something
a little more sane by accumulating along the way.

In the above example, it depends on what is in exp1, exp2 and exp3 as to whether
you want to avoid the substitution.  If all 3 exp's end up equating to y.6 + y.6,
you probably DO want to let it happen so you end up with

x.3 = y.6 + y.6 + y.6 + y.6 + y.6 + y.6

> This can probably still fall down, especially when a lot of our variables get
> replaced by "ivtmp.x" and "pretmp.y", but at least we'll have some cut off 
> that
> handles accumulation naturally.  Perhaps there's some looser notion of "base
> variable" we can use, like "in the same partition", or something.


The partitions have been decided by the time TER runs, so we know what SSA_NAME
is going to be a different variable and what isn't.  All you need to do is
compare var_to_partition(ssa_name) to see if they are in the same partition 
number.


I suspect that anything easy we can do will have performance slowdowns if we do
it at >=-O2, but you never know :-).  I suspect the better solution is to limit
the number of live ranges it can introduce to something sane.  I don't think that
is likely to be too hard to code; the question is how quickly it can run :-)  At
every expression replacement point, we'd have to make a quick run through the
operands and count unique partitions. 
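
A quick run through the operands could look roughly like the C sketch below.  It assumes the partition number of each operand has already been looked up and stored in a plain int array; count_unique_partitions and within_live_range_budget are hypothetical names, not functions in tree-outof-ssa.c.

```c
#include <assert.h>
#include <stddef.h>

/* Count how many distinct partition numbers appear among the N
   operands in PARTS.  Quadratic, which is fine for the handful of
   operands a single expression has.  */
static size_t
count_unique_partitions (const int *parts, size_t n)
{
  size_t unique = 0;

  assert (parts != NULL || n == 0);
  for (size_t i = 0; i < n; i++)
    {
      size_t j;
      for (j = 0; j < i; j++)
        if (parts[j] == parts[i])
          break;
      if (j == i)               /* First time this partition is seen.  */
        unique++;
    }
  return unique;
}

/* Allow the replacement only if it keeps the number of distinct
   partitions, and so of live ranges, at or under LIMIT.  */
static int
within_live_range_budget (const int *parts, size_t n, size_t limit)
{
  return count_unique_partitions (parts, n) <= limit;
}
```

Under this measure the y.6 + y.6 + ... case counts as a single partition however long the expression gets, so it would still be allowed.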

Are we looking to do this at -O2 as well?  I guess that's a key question.  At just
-Os, it might very well be sufficient.



-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17549


[Bug tree-optimization/17549] [4.0 Regression] 10% increase in codesize with C code compared to GCC 3.3

2005-02-02 Thread rth at gcc dot gnu dot org

--- Additional Comments From rth at gcc dot gnu dot org  2005-02-03 06:37 
---
We have incoming into out-of-ssa,

   x.1 = exp1
   x.2 = x.1 + exp2
   x.3 = x.2 + exp3

We're currently allowing TER to produce

   x.3 = exp1 + exp2 + exp3

What if we were to disable TER substitution when the base variable on the lhs
matches the base variable on the rhs?  So in this case we'd notice x.1 and x.2
have the same base variable and not merge.  And (more importantly) so forth so
that the definitions of x.20 and x.19 aren't merged either.

This can probably still fall down, especially when a lot of our variables get
replaced by "ivtmp.x" and "pretmp.y", but at least we'll have some cut off that
handles accumulation naturally.  Perhaps there's some looser notion of "base
variable" we can use, like "in the same partition", or something.
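
The base-variable test proposed above can be sketched on a toy model in C.  The struct ssa_name below is a stand-in invented for this illustration (GCC's real SSA_NAME is a tree node), and ter_may_substitute is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy stand-in for an SSA name: a base variable plus a version,
   so x.1 is { "x", 1 }.  */
struct ssa_name
{
  const char *root;
  int version;
};

static bool
same_root (const struct ssa_name *a, const struct ssa_name *b)
{
  assert (a && b && a->root && b->root);
  return strcmp (a->root, b->root) == 0;
}

/* May the expression that defines DEF be substituted into the
   statement that defines LHS?  Refuse when both share a base
   variable, so that accumulation chains stay as separate steps.  */
static bool
ter_may_substitute (const struct ssa_name *def, const struct ssa_name *lhs)
{
  return !same_root (def, lhs);
}
```

For x.2 = x.1 + exp2 this refuses to forward the definition of x.1, while a temporary with a different base variable can still be forwarded into the same statement.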

-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17549


[Bug tree-optimization/17549] [4.0 Regression] 10% increase in codesize with C code compared to GCC 3.3

2005-02-02 Thread amacleod at redhat dot com

--- Additional Comments From amacleod at redhat dot com  2005-02-03 03:27 
---
(In reply to comment #20)
> (In reply to comment #19)
> > The balance of the problem appears to come from TER.  We're building very 
> > large
> > constructs, such as
> 
> Hmm, shouldn't the register allocator fix this anyways (yes I know we have a
> semi-stupid one but still).

Someday it will, but since it doesn't now, we will have to fix it in TER.  TER is a
temporary measure and will cease to exist someday too.  All we need is smarter
expanders.  It's all on a list somewhere :-)


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17549


[Bug tree-optimization/17549] [4.0 Regression] 10% increase in codesize with C code compared to GCC 3.3

2005-02-02 Thread amacleod at redhat dot com

--- Additional Comments From amacleod at redhat dot com  2005-02-03 03:23 
---
(In reply to comment #19)
> The balance of the problem appears to come from TER.  We're building very 
> large
> constructs, such as
...
> I'm not sure how best to attack this, Andrew.  I think it's clear that turning
> off TER entirely isn't the correct option, but I'm not sure how to throttle it
> to keep things from running completely off the rails.  Thoughts?

That's a good question.  It's not like TER is simple to throttle.  I'll have a look
at it tomorrow.  We might be able to do something about counting the number of
temporaries we substitute into a stmt, and cease and desist if we are going to
increase the stmt size to more than some number, like 10 or 8 or 12 or 20 or
NUMREGS/4 or some such thing.  We might also only want to do it for MODIFY_EXPRs.
I'd have to accumulate it per expression, but it shouldn't be too bad.  I'll see
if anything else occurs to me overnight or tomorrow.



-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17549


[Bug tree-optimization/17549] [4.0 Regression] 10% increase in codesize with C code compared to GCC 3.3

2005-02-02 Thread pinskia at gcc dot gnu dot org

--- Additional Comments From pinskia at gcc dot gnu dot org  2005-02-03 
02:38 ---
(In reply to comment #19)
> The balance of the problem appears to come from TER.  We're building very 
> large
> constructs, such as

Hmm, shouldn't the register allocator fix this anyways (yes I know we have a 
semi-stupid one but still).

-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17549


[Bug tree-optimization/17549] [4.0 Regression] 10% increase in codesize with C code compared to GCC 3.3

2005-02-02 Thread rth at gcc dot gnu dot org

--- Additional Comments From rth at gcc dot gnu dot org  2005-02-03 02:23 
---
The balance of the problem appears to come from TER.  We're building very large
constructs, such as

  [z.c : 76] x.182 = -temp.218 + temp.220 - temp.688 - temp.222 + temp.224 +
temp.692 - temp.226 * 3 - temp.227 * 2 - temp.228 + temp.230 + temp.231 * 2 +
temp.232 * 3 - (int) *(cp - (uchar *) (unsigned int) *(temp.677 - 6B)) * 3 -
(int) *(cp - (uchar *) (unsigned int) *(temp.677 + -5B)) * 2 - (int) *(cp -
(uchar *) (unsigned int) *(temp.677 + -4B)) + (int) *(cp - (uchar *) (unsigned
int) *(temp.677 + -2B)) + (int) *(cp - (uchar *) (unsigned int) *(temp.677 +
-1B)) * 2 + (int) *(cp - (uchar *) (unsigned int) *temp.677) * 3 - temp.239 * 3
- temp.240 * 2 - temp.241 + temp.243 + temp.244 * 2 + temp.245 * 3 - temp.699 -
temp.247 + temp.249 + temp.703 - temp.251 + temp.253;

Instead of simply accumulating into X at each step, we're extending the
lifetime of all of the temporaries until the end.  Ouch.
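
The difference in lifetimes can be made concrete with a small C sketch that computes the peak number of overlapping live ranges.  The struct live_range and peak_live are invented for this illustration; statement indices are inclusive and nothing here models the real register allocator.

```c
#include <assert.h>

/* A value is live from the stmt that defines it through its last use.  */
struct live_range
{
  int def;
  int last_use;
};

/* Peak number of simultaneously live values.  The maximum overlap of
   a set of intervals is always reached at some interval's start, so
   it is enough to look at each definition point.  */
static int
peak_live (const struct live_range *r, int n)
{
  int peak = 0;

  assert (r != 0 || n == 0);
  for (int i = 0; i < n; i++)
    {
      int live = 0;
      for (int j = 0; j < n; j++)
        if (r[j].def <= r[i].def && r[i].def <= r[j].last_use)
          live++;
      if (live > peak)
        peak = live;
    }
  return peak;
}
```

Four temporaries defined at stmts 0..3 and all consumed by one big expression at stmt 4 give a peak of 4.  Accumulating instead (t0 = ...; x = t0; t1 = ...; x += t1; and so on) keeps the peak at 2: the accumulator plus the current temporary.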

Anyway, if I use -Os -fno-tree-ter, the stack frame size drops to 40 bytes
from 256.  And object size is much nicer too:

   text    data     bss     dec     hex filename
   2136       0       0    2136     858 a.out-34
   2006       0       0    2006     7d6 a.out-noter
   2809       0       0    2809     af9 a.out-ter

I'm not sure how best to attack this, Andrew.  I think it's clear that turning
off TER entirely isn't the correct option, but I'm not sure how to throttle it
to keep things from running completely off the rails.  Thoughts?

-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amacleod at redhat dot com
          Component|middle-end                  |tree-optimization
   Last reconfirmed|2004-09-18 17:05:39         |2005-02-03 02:23:45
               date|                            |


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17549