[Bug target/43884] [4.4/4.5/4.6 Regression] Performance degradation for simple fibonacci numbers calculation due to extra stack alignment
--- Comment #8 from rguenth at gcc dot gnu dot org 2010-04-26 10:36 ---
(In reply to comment #7)
> Subject: Re: [4.4/4.5/4.6 Regression] Performance degradation for simple fibonacci numbers calculation due to extra stack alignment
>
> > The slowdown also happens on x86-64. Stack alignment checks leaf function.
> > But I am not sure if it detects tail-recursion. Is such information available
> > to ix86_finalize_stack_realign_flags?
>
> Tail recursion is recognized at gimple level, so rtl code should not be at
> all bothered here.
>
> Honza

There is a recursive self-call left (but that's the only call, so it's still a leaf function).

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43884
[Bug target/43884] [4.4/4.5/4.6 Regression] Performance degradation for simple fibonacci numbers calculation due to extra stack alignment
--- Comment #9 from jakub at gcc dot gnu dot org 2010-04-26 12:40 --- In the leaf_function_p sense it is non-leaf. For the stack alignment it of course would be possible to change the stack alignment requirements of the function if it calls itself, doesn't call other functions (nor tail call them) and it is changed not to assume the standard alignment in the whole function. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43884
[Bug target/43884] [4.4/4.5/4.6 Regression] Performance degradation for simple fibonacci numbers calculation due to extra stack alignment
--- Comment #10 from hjl dot tools at gmail dot com 2010-04-26 13:44 ---
(In reply to comment #9)
> In the leaf_function_p sense it is non-leaf. For the stack alignment it of
> course would be possible to change the stack alignment requirements of the
> function if it calls itself, doesn't call other functions (nor tail call them)
> and it is changed not to assume the standard alignment in the whole function.

That is true. For tail call, we only need to align outgoing stack to minimum of maximum local stack alignment and incoming stack alignment.

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43884
[Bug target/43884] [4.4/4.5/4.6 Regression] Performance degradation for simple fibonacci numbers calculation due to extra stack alignment
--- Comment #11 from jakub at gcc dot gnu dot org 2010-04-26 13:57 --- Tail call needs to consider incoming alignment requirements of the target function (which is often in other CU). In this case it is not a tail call, but non-tail recursion (tail-recursion would be handled by wrapping the function's body into a loop). -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43884
[Bug target/43884] [4.4/4.5/4.6 Regression] Performance degradation for simple fibonacci numbers calculation due to extra stack alignment
--- Comment #12 from hubicka at ucw dot cz 2010-04-26 14:27 ---
Subject: Re: [4.4/4.5/4.6 Regression] Performance degradation for simple fibonacci numbers calculation due to extra stack alignment

> That is true. For tail call, we only need to align outgoing stack to minimum
> of maximum local stack alignment and incoming stack alignment.

Well, the tail call gets the same stack alignment as the function itself, so I guess when expanding a tail call, we need to bump up the incoming stack alignment to one needed by the call.

We should special case the self recursion and do nothing in case of tail calls and in case of normal calls. In normal self recursive calls we need to remember the fact that function is self recursive and when finalizing be sure that outgoing stack alignment is at least as good as incoming.

This cannot be decided at expansion time since we do not know yet what alignment function has.

Old preferred alignment code had this logic, I guess somehow this got broken during the merge of stack alignment branch?

Honza

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43884
[Bug target/43884] [4.4/4.5/4.6 Regression] Performance degradation for simple fibonacci numbers calculation due to extra stack alignment
--- Comment #13 from hjl dot tools at gmail dot com 2010-04-26 14:47 ---
(In reply to comment #12)
> Subject: Re: [4.4/4.5/4.6 Regression] Performance degradation for simple fibonacci numbers calculation due to extra stack alignment
>
> > That is true. For tail call, we only need to align outgoing stack to minimum
> > of maximum local stack alignment and incoming stack alignment.
>
> Well, the tail call gets the same stack alignment as the function itself, so I
> guess when expanding a tail call, we need to bump up the incoming stack
> alignment to one needed by the call.
>
> We should special case the self recursion and do nothing in case of tail calls
> and in case of normal calls. In normal self recursive calls we need to
> remember the fact that function is self recursive and when finalizing be sure
> that outgoing stack alignment is at least as good as incoming.

The outgoing stack alignment should be the minimum of incoming and local. If incoming stack is 16-byte aligned and local variable only needs 4-byte alignment, there is no difference in stack realignment when incoming stack is 4-byte, 8-byte and 16-byte aligned.

> This cannot be decided at expansion time since we do not know yet what
> alignment function has.
>
> Old preferred alignment code had this logic, I guess somehow this got broken
> during the merge of stack alignment branch?

I will investigate.

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43884
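As a rough illustration of the rule stated in comment #13, the following C sketch models alignments as byte counts. It is not GCC's logic in ix86_finalize_stack_realign_flags; the helper names `needs_realign` and `outgoing_align_self_call` are hypothetical and introduced only for this example, and the "minimum" rule is taken as stated in the comment rather than as settled fact.

```c
#include <assert.h>
#include <stdbool.h>

/* Smaller of two alignments, both given in bytes. */
static unsigned min_align(unsigned a, unsigned b)
{
    return a < b ? a : b;
}

/* Hypothetical model: realignment is only needed when the locals ask
   for more alignment than the incoming stack already guarantees. */
bool needs_realign(unsigned incoming_align, unsigned local_align)
{
    return local_align > incoming_align;
}

/* Hypothetical model of the rule as worded in comment #13: the
   outgoing alignment for a self-recursive call is the minimum of the
   incoming alignment and the maximum local alignment. */
unsigned outgoing_align_self_call(unsigned incoming_align,
                                  unsigned local_align)
{
    return min_align(incoming_align, local_align);
}
```

Under this model, a function whose locals need only 4-byte alignment is treated the same whether the incoming stack is 4-, 8- or 16-byte aligned, which is the case described in the comment.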
[Bug target/43884] [4.4/4.5/4.6 Regression] Performance degradation for simple fibonacci numbers calculation due to extra stack alignment
--- Comment #3 from rguenth at gcc dot gnu dot org 2010-04-25 20:03 ---
Well, the innermost loop with current trunk is

.L3:
        leal    -1(%ebx), %eax
        subl    $2, %ebx
        movl    %eax, (%esp)
        call    fib
        addl    %eax, %esi
        cmpl    $2, %ebx
        jg      .L3

which is pretty much optimal. The Intel compiler doesn't detect the tail-recursion (huh) but has multiple entry-points into the function and uses register passing conventions for the recursions. With -fwhole-program GCC does the same (or with static fib), and we then end up with a program faster than what ICC produces (16s).

A 4.3 compiled version is indeed a bit faster (as fast as 4.4 on i?86, 15.4s). A 4.1 compiled version is even faster (14.1s), the 3.4 baseline is 21.5s. That's on i?86-linux, all -O2.

4.1 assembly, fib is not inlined:

fib:
        pushl   %esi
        pushl   %ebx
        movl    %eax, %ebx
        cmpl    $2, %ebx
        movl    $1, %eax
        jle     .L5
        xorl    %esi, %esi
        .p2align 4,,7
.L6:
        leal    -1(%ebx), %eax
        subl    $2, %ebx
        call    fib
        addl    %eax, %esi
        cmpl    $2, %ebx
        jg      .L6
        leal    1(%esi), %eax
.L5:
        popl    %ebx
        popl    %esi
        ret

trunk assembler:

fib:
        pushl   %esi
        pushl   %ebx
        movl    %eax, %ebx
        subl    $4, %esp
        cmpl    $2, %ebx
        movl    $1, %eax
        jle     .L2
        xorl    %esi, %esi
        .p2align 4,,7
        .p2align 3
.L3:
        leal    -1(%ebx), %eax
        subl    $2, %ebx
        call    fib
        addl    %eax, %esi
        cmpl    $2, %ebx
        jg      .L3
        leal    1(%esi), %eax
.L2:
        addl    $4, %esp
        popl    %ebx
        popl    %esi
        ret

where the only difference is different loop alignment and keeping the stack 16-byte aligned. Indeed we get the same speed as 4.1 when building with -mpreferred-stack-boundary=2.

Why do we bother to keep the stack aligned for leaf functions?

-- 

rguenth at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hjl at gcc dot gnu dot org,
                   |                            |hubicka at gcc dot gnu dot
                   |                            |org
          Component|c++                         |target
 GCC target triplet|                            |i?86-*-*
           Keywords|                            |missed-optimization
      Known to work|                            |4.1.3
            Summary|[4.4/4.5 Regression]        |[4.4/4.5/4.6 Regression]
                   |Performance degradation for |Performance degradation for
                   |simple fibonacci numbers    |simple fibonacci numbers
                   |calculation                 |calculation due to extra
                   |                            |stack alignment
   Target Milestone|---                         |4.4.4

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43884
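For readers following the listings in comment #3, here is a C sketch of what the assembly appears to compute. The test case source is not quoted in this part of the thread, so `fib_rec` is an assumed form of it (the usual doubly recursive definition), and `fib_loop` is a hand-written mirror of the 4.1/trunk listings, with `n` standing in for %ebx and `acc` for %esi. Both names are hypothetical.

```c
#include <assert.h>

/* Assumed shape of the original test case: two recursive calls. */
int fib_rec(int n)
{
    if (n <= 2)
        return 1;
    return fib_rec(n - 1) + fib_rec(n - 2);
}

/* Hand-written mirror of the listings: the fib(n - 2) call has been
   turned into a loop, while the fib(n - 1) call remains a real
   self-call. */
int fib_loop(int n)
{
    int acc = 0;                  /* xorl %esi, %esi */

    if (n <= 2)                   /* cmpl $2, %ebx; jle */
        return 1;
    do {
        acc += fib_loop(n - 1);   /* leal -1(%ebx), %eax; call fib; addl */
        n -= 2;                   /* subl $2, %ebx */
    } while (n > 2);              /* cmpl $2, %ebx; jg */
    return acc + 1;               /* leal 1(%esi), %eax */
}
```

The remaining `fib_loop(n - 1)` call is the self-call discussed later in the thread: it is not in tail position, so it stays a genuine call.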
[Bug target/43884] [4.4/4.5/4.6 Regression] Performance degradation for simple fibonacci numbers calculation due to extra stack alignment
--- Comment #4 from rguenth at gcc dot gnu dot org 2010-04-25 20:06 --- Btw, with the optimal options -O2 -fwhole-program -fomit-frame-pointer -mpreferred-stack-boundary=2 GCC 4.3 and 4.4 are slower than 4.1 and 4.5 (14.3s vs. 13.8s). The extra stack alignment drops us to 16.4s(!). -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43884
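The `subl $4, %esp` in the trunk listing of comment #3 can be accounted for with a little arithmetic, sketched below in C. This is only a worked example, assuming the caller's stack is aligned to the preferred boundary just before the call instruction; `padding_for` is a hypothetical helper, not anything in GCC. Note that -mpreferred-stack-boundary=N means a boundary of 2^N bytes, so 4 gives 16 bytes and 2 gives 4 bytes.

```c
#include <assert.h>

/* Bytes of padding needed so that bytes_pushed plus the padding is a
   multiple of boundary. boundary must be a power of two. */
unsigned padding_for(unsigned bytes_pushed, unsigned boundary)
{
    assert(boundary != 0 && (boundary & (boundary - 1)) == 0);
    return (boundary - bytes_pushed % boundary) % boundary;
}

/* Bytes on the stack at the self-call in the i386 listing:
   return address (4) + pushl %esi (4) + pushl %ebx (4). */
unsigned fib_bytes_pushed(void)
{
    return 4 + 4 + 4;
}
```

With a 16-byte boundary, `padding_for(fib_bytes_pushed(), 16)` is 4, matching the extra `subl $4, %esp`; with a 4-byte boundary (-mpreferred-stack-boundary=2) it is 0, matching the 4.1 listing.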
[Bug target/43884] [4.4/4.5/4.6 Regression] Performance degradation for simple fibonacci numbers calculation due to extra stack alignment
--- Comment #5 from hjl dot tools at gmail dot com 2010-04-25 22:01 ---
(In reply to comment #4)
> Btw, with the optimal options -O2 -fwhole-program -fomit-frame-pointer
> -mpreferred-stack-boundary=2 GCC 4.3 and 4.4 are slower than 4.1 and 4.5
> (14.3s vs. 13.8s). The extra stack alignment drops us to 16.4s(!).

The slowdown also happens on x86-64. Stack alignment checks leaf function. But I am not sure if it detects tail-recursion. Is such information available to ix86_finalize_stack_realign_flags?

-- 

hjl dot tools at gmail dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|hjl at gcc dot gnu dot org  |hjl dot tools at gmail dot
                   |                            |com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43884
[Bug target/43884] [4.4/4.5/4.6 Regression] Performance degradation for simple fibonacci numbers calculation due to extra stack alignment
--- Comment #6 from hubicka at ucw dot cz 2010-04-25 23:42 ---
Subject: Re: [4.4/4.5/4.6 Regression] Performance degradation for simple fibonacci numbers calculation due to extra stack alignment

> where the only difference is different loop alignment and keeping the
> stack 16-byte aligned. Indeed we get the same speed as 4.1 when
> building with -mpreferred-stack-boundary=2.
>
> Why do we bother to keep the stack aligned for leaf functions?

We should not. Probably fallout of stack alignment patches? I will check out later.

Honza

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43884
[Bug target/43884] [4.4/4.5/4.6 Regression] Performance degradation for simple fibonacci numbers calculation due to extra stack alignment
--- Comment #7 from hubicka at ucw dot cz 2010-04-25 23:43 ---
Subject: Re: [4.4/4.5/4.6 Regression] Performance degradation for simple fibonacci numbers calculation due to extra stack alignment

> The slowdown also happens on x86-64. Stack alignment checks leaf function. But
> I am not sure if it detects tail-recursion. Is such information available to
> ix86_finalize_stack_realign_flags?

Tail recursion is recognized at gimple level, so rtl code should not be at all bothered here.

Honza

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43884