[Bug target/43884] [4.4/4.5/4.6 Regression] Performance degradation for simple fibonacci numbers calculation due to extra stack alignment

2010-04-26 Thread rguenth at gcc dot gnu dot org


--- Comment #8 from rguenth at gcc dot gnu dot org  2010-04-26 10:36 ---
(In reply to comment #7)
 
  The slowdown also happens on x86-64. The stack alignment code checks for
  leaf functions, but I am not sure whether it detects tail recursion.
  Is such information available to ix86_finalize_stack_realign_flags?
 Tail recursion is recognized at the GIMPLE level, so the RTL code should not
 be bothered by it at all.

There is a recursive self-call left (but that's the only call, so it's still
a leaf function).

 Honza
 


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43884




2010-04-26 Thread jakub at gcc dot gnu dot org


--- Comment #9 from jakub at gcc dot gnu dot org  2010-04-26 12:40 ---
In the leaf_function_p sense it is non-leaf.  For the stack alignment it of
course would be possible to change the stack alignment requirements of the
function if it calls itself, doesn't call other functions (nor tail call them)
and it is changed not to assume the standard alignment in the whole function.






2010-04-26 Thread hjl dot tools at gmail dot com


--- Comment #10 from hjl dot tools at gmail dot com  2010-04-26 13:44 ---
(In reply to comment #9)
 In the leaf_function_p sense it is non-leaf.  For the stack alignment it of
 course would be possible to change the stack alignment requirements of the
 function if it calls itself, doesn't call other functions (nor tail call them)
 and it is changed not to assume the standard alignment in the whole function.
 

That is true. For a tail call, we only need to align the outgoing stack to
the minimum of the maximum local stack alignment and the incoming stack
alignment.






2010-04-26 Thread jakub at gcc dot gnu dot org


--- Comment #11 from jakub at gcc dot gnu dot org  2010-04-26 13:57 ---
A tail call needs to consider the incoming alignment requirements of the
target function (which is often in another CU).  In this case it is not a tail
call but non-tail recursion (tail recursion would be handled by wrapping the
function's body into a loop).






2010-04-26 Thread hubicka at ucw dot cz


--- Comment #12 from hubicka at ucw dot cz  2010-04-26 14:27 ---

 That is true. For a tail call, we only need to align the outgoing stack to
 the minimum of the maximum local stack alignment and the incoming stack
 alignment.

Well, the tail call gets the same stack alignment as the function itself,
so I guess when expanding a tail call, we need to bump up the incoming
stack alignment to the one needed by the call.

We should special-case self-recursion and do nothing in the case of tail
calls and in the case of normal calls.  For normal self-recursive calls we
need to remember that the function is self-recursive and, when finalizing,
make sure that the outgoing stack alignment is at least as good as the
incoming one.  This cannot be decided at expansion time, since we do not yet
know what alignment the function has.
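
One way to read the proposal above is as a two-phase decision. The following is only a minimal sketch of that idea, not GCC code: the struct `fn_align_state` and the functions `note_call` and `finalize_alignment` are hypothetical names invented for illustration, and alignments are plain byte counts.

```c
#include <stdbool.h>

/* Hypothetical per-function bookkeeping; not a GCC data structure. */
struct fn_align_state {
    bool self_recursive;      /* set when a non-tail self call is seen */
    unsigned incoming_align;  /* bytes; only known at finalization */
    unsigned outgoing_align;  /* bytes; requirement for outgoing calls */
};

/* Expansion time: a self call only records a flag, because the function's
   own alignment is not known yet. Any other call raises the outgoing
   requirement to what the callee needs. */
void note_call(struct fn_align_state *s, bool is_self_call,
               unsigned callee_align)
{
    if (is_self_call) {
        s->self_recursive = true;
        return;
    }
    if (callee_align > s->outgoing_align)
        s->outgoing_align = callee_align;
}

/* Finalization: for a self-recursive function, make the outgoing alignment
   at least as good as the incoming one. */
void finalize_alignment(struct fn_align_state *s)
{
    if (s->self_recursive && s->outgoing_align < s->incoming_align)
        s->outgoing_align = s->incoming_align;
}
```

The point of the split is that the self call contributes nothing at expansion time; the comparison against the incoming alignment happens only once that value is known.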

The old preferred alignment code had this logic; I guess this somehow got
broken during the merge of the stack alignment branch?

Honza






2010-04-26 Thread hjl dot tools at gmail dot com


--- Comment #13 from hjl dot tools at gmail dot com  2010-04-26 14:47 ---
(In reply to comment #12)
 
  That is true. For a tail call, we only need to align the outgoing stack to
  the minimum of the maximum local stack alignment and the incoming stack
  alignment.
 
 Well, the tail call gets the same stack alignment as the function itself,
 so I guess when expanding a tail call, we need to bump up the incoming
 stack alignment to the one needed by the call.
 
 We should special-case self-recursion and do nothing in the case of tail
 calls and in the case of normal calls.  For normal self-recursive calls we
 need to remember that the function is self-recursive and, when finalizing,
 make sure that the outgoing stack alignment is at least as good as the
 incoming one.

The outgoing stack alignment should be the minimum of incoming and
local.  If the incoming stack is 16-byte aligned and the local variables
only need 4-byte alignment, stack realignment is the same whether the
incoming stack is 4-byte, 8-byte, or 16-byte aligned.
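
The rule stated above can be written down directly. This is a small illustrative sketch only; `outgoing_stack_align` is a hypothetical helper, not a function in GCC, and alignments are given in bytes.

```c
/* Hypothetical helper: the outgoing stack alignment as the minimum of the
   incoming alignment and the maximum alignment needed by locals. */
unsigned outgoing_stack_align(unsigned incoming_align,
                              unsigned max_local_align)
{
    return incoming_align < max_local_align ? incoming_align
                                            : max_local_align;
}
```

With locals that need only 4-byte alignment, the result is 4 for an incoming alignment of 4, 8, or 16 bytes, which is the case described in the comment.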

 This cannot be decided at expansion time, since we do not yet know what
 alignment the function has.
 
 The old preferred alignment code had this logic; I guess this somehow got
 broken during the merge of the stack alignment branch?
 

I will investigate.






2010-04-25 Thread rguenth at gcc dot gnu dot org


--- Comment #3 from rguenth at gcc dot gnu dot org  2010-04-25 20:03 ---
Well, the innermost loop with current trunk is

.L3:
        leal    -1(%ebx), %eax
        subl    $2, %ebx
        movl    %eax, (%esp)
        call    fib
        addl    %eax, %esi
        cmpl    $2, %ebx
        jg      .L3

which is pretty much optimal.  The intel compiler doesn't detect the
tail-recursion (huh) but has multiple entry-points into the function
and uses register passing conventions for the recursions.
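
For readers without the test case at hand, here is a hedged reconstruction in C. The reporter's source is not shown in this thread, so both functions are assumptions inferred from the assembly listings: `fib` is the plain doubly recursive form, and `fib_loop` (a hypothetical name) mirrors the shape of the compiled loop, where the second recursive call has been turned into iteration.

```c
/* Assumed shape of the benchmark function; not taken from the bug report. */
int fib(int n)
{
    if (n <= 2)
        return 1;
    return fib(n - 1) + fib(n - 2);
}

/* The same computation with the second recursive call replaced by a loop,
   following the structure of the .L3 loop: accumulate fib(n - 1), step n
   down by 2, and add the base-case value 1 at the end. */
int fib_loop(int n)
{
    int acc = 0;

    if (n <= 2)
        return 1;
    do {
        acc += fib_loop(n - 1);
        n -= 2;
    } while (n > 2);
    return acc + 1;
}
```

Only one real call remains per loop iteration, which is why the per-call prologue and epilogue cost dominates the timings quoted here.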

With -fwhole-program GCC does the same (or with static fib), and we
then end up with a program faster than what ICC produces (16s).
A 4.3 compiled version is indeed a bit faster (as fast as 4.4 on i?86, 15.4s).
A 4.1 compiled version is even faster (14.1s), the 3.4 baseline is 21.5s.

That's on i?86-linux, all -O2.

4.1 assembly, fib is not inlined:

fib:
        pushl   %esi
        pushl   %ebx
        movl    %eax, %ebx
        cmpl    $2, %ebx
        movl    $1, %eax
        jle     .L5
        xorl    %esi, %esi
        .p2align 4,,7
.L6:
        leal    -1(%ebx), %eax
        subl    $2, %ebx
        call    fib
        addl    %eax, %esi
        cmpl    $2, %ebx
        jg      .L6
        leal    1(%esi), %eax
.L5:
        popl    %ebx
        popl    %esi
        ret

trunk assembler:

fib:
        pushl   %esi
        pushl   %ebx
        movl    %eax, %ebx
        subl    $4, %esp
        cmpl    $2, %ebx
        movl    $1, %eax
        jle     .L2
        xorl    %esi, %esi
        .p2align 4,,7
        .p2align 3
.L3:
        leal    -1(%ebx), %eax
        subl    $2, %ebx
        call    fib
        addl    %eax, %esi
        cmpl    $2, %ebx
        jg      .L3
        leal    1(%esi), %eax
.L2:
        addl    $4, %esp
        popl    %ebx
        popl    %esi
        ret

where the only difference is different loop alignment and keeping the
stack 16-byte aligned.  Indeed we get the same speed as 4.1 when
building with -mpreferred-stack-boundary=2.  Why do we bother to
keep the stack aligned for leaf functions?
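
The extra `subl $4, %esp` in the trunk listing follows from simple arithmetic. As a rough sketch, assuming the stack is 16-byte aligned just before the `call` that enters `fib`: the return address takes 4 bytes and the two pushes take 8, so 12 bytes are in use and 4 more are needed to reach the next 16-byte boundary. The helper `frame_padding` below is a hypothetical name used only to show that calculation.

```c
/* Hypothetical helper: bytes of padding needed so that a frame that already
   uses `used` bytes ends on an `align`-byte boundary. `align` must be a
   nonzero power of two. */
unsigned frame_padding(unsigned used, unsigned align)
{
    return (align - used % align) % align;
}
```

With `used = 12` and `align = 16` this gives 4, matching the `subl $4, %esp`. With `-mpreferred-stack-boundary=2` the boundary is 2^2 = 4 bytes, the padding is 0, and the adjustment disappears as in the 4.1 listing.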


-- 

rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hjl at gcc dot gnu dot org,
                   |                            |hubicka at gcc dot gnu dot
                   |                            |org
          Component|c++                         |target
 GCC target triplet|                            |i?86-*-*
           Keywords|                            |missed-optimization
      Known to work|                            |4.1.3
            Summary|[4.4/4.5 Regression]        |[4.4/4.5/4.6 Regression]
                   |Performance degradation for |Performance degradation for
                   |simple fibonacci numbers    |simple fibonacci numbers
                   |calculation                 |calculation due to extra
                   |                            |stack alignment
   Target Milestone|---                         |4.4.4






2010-04-25 Thread rguenth at gcc dot gnu dot org


--- Comment #4 from rguenth at gcc dot gnu dot org  2010-04-25 20:06 ---
Btw, with the optimal options -O2 -fwhole-program -fomit-frame-pointer
-mpreferred-stack-boundary=2 GCC 4.3 and 4.4 are slower than 4.1 and 4.5
(14.3s vs. 13.8s).  The extra stack alignment drops us to 16.4s(!).






2010-04-25 Thread hjl dot tools at gmail dot com


--- Comment #5 from hjl dot tools at gmail dot com  2010-04-25 22:01 ---
(In reply to comment #4)
 Btw, with the optimal options -O2 -fwhole-program -fomit-frame-pointer
 -mpreferred-stack-boundary=2 GCC 4.3 and 4.4 are slower than 4.1 and 4.5
 (14.3s vs. 13.8s).  The extra stack alignment drops us to 16.4s(!).


The slowdown also happens on x86-64. The stack alignment code checks for
leaf functions, but I am not sure whether it detects tail recursion.
Is such information available to ix86_finalize_stack_realign_flags?


-- 

hjl dot tools at gmail dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|hjl at gcc dot gnu dot org  |hjl dot tools at gmail dot
                   |                            |com






2010-04-25 Thread hubicka at ucw dot cz


--- Comment #6 from hubicka at ucw dot cz  2010-04-25 23:42 ---

 where the only difference is different loop alignment and keeping the
 stack 16-byte aligned.  Indeed we get the same speed as 4.1 when
 building with -mpreferred-stack-boundary=2.  Why do we bother to
 keep the stack aligned for leaf functions?
We should not.  This is probably fallout from the stack alignment patches? I
will check it out later.

Honza






2010-04-25 Thread hubicka at ucw dot cz


--- Comment #7 from hubicka at ucw dot cz  2010-04-25 23:43 ---

 The slowdown also happens on x86-64. The stack alignment code checks for
 leaf functions, but I am not sure whether it detects tail recursion.
 Is such information available to ix86_finalize_stack_realign_flags?
Tail recursion is recognized at the GIMPLE level, so the RTL code should not
be bothered by it at all.

Honza

