Re: BUG: GCC-4.4.x changes the function frame on some functions
On Thu, Nov 19, 2009 at 08:01:57PM +0100, Thomas Gleixner wrote: Just compiled with -mincoming-stack-boundary=4 and the problem goes away as gcc now thinks that the incoming stack is already 16 byte aligned. But that might break code which actually uses SSE Please don't do this, lying to the compiler is just going to result in wrong-code sooner or later, with the above switch gcc will assume the incoming stack is 16-byte aligned (which is not true in the ix86 kernel) and could very well e.g. optimize away code that looks at alignment of stack variables etc. Jakub
Re: BUG: GCC-4.4.x changes the function frame on some functions
On Mon, 23 Nov 2009, Jakub Jelinek wrote: On Thu, Nov 19, 2009 at 08:01:57PM +0100, Thomas Gleixner wrote: Just compiled with -mincoming-stack-boundary=4 and the problem goes away as gcc now thinks that the incoming stack is already 16 byte aligned. But that might break code which actually uses SSE Please don't do this, lying to the compiler is just going to result in wrong-code sooner or later, with the above switch gcc will assume the incoming stack is 16-byte aligned (which is not true in the ix86 kernel) and could very well e.g. optimize away code that looks at alignment of stack variables etc. Right. I gave up the idea pretty fast. But in the current situation we are forced to lie to the compiler in some way. Forcing -mtune=generic when the function graph tracer is enabled seems to be a halfways sane work around. Thanks, tglx
[PATCH] gcc mcount-nofp was Re: BUG: GCC-4.4.x changes the function frame on some functions
Steven Rostedt rost...@goodmis.org writes: And frame pointers do add a little overhead as well. Too bad the mcount ABI wasn't something like this: function: callmcount [...] This way, the function address for mcount would have been (%esp) and the parent address would be 4(%esp). Mcount would work without frame pointers and this whole mess would also become moot. I did a patch to do this in x86 gcc some time ago. The motivation was indeed the frame pointer overhead on Atom with tracing. Unfortunately it also requires glibc changes (I did those too). For compatibility and catching mistakes the new function was called __mcount_nofp. I haven't tried it with current gcc and last time I missed the gcc feature merge window with this. But perhaps you find it useful. Of course it would need more kernel changes to probe for the new option and handle it. Here's the old patch. I haven't retested it with a current gcc version, but I think it still applies at least. If there's interest I can polish it up and submit formally. -Andi Index: gcc/doc/tm.texi === --- gcc/doc/tm.texi (revision 149140) +++ gcc/doc/tm.texi (working copy) @@ -1884,6 +1884,12 @@ of words in each data entry. @end defmac +...@defmac TARGET_FUNCTION_PROFILE +Define if the target has a custom function_profiler function. +The target should not set this macro, it is implicitely set from +the PROFILE_BEFORE_PROLOGUE macro. +...@end defmac + @node Registers @section Register Usage @cindex register usage Index: gcc/doc/invoke.texi === --- gcc/doc/invoke.texi (revision 149140) +++ gcc/doc/invoke.texi (working copy) @@ -593,7 +593,7 @@ -momit-leaf-frame-pointer -mno-red-zone -mno-tls-direct-seg-refs @gol -mcmod...@var{code-model} -ma...@var{name} @gol -m32 -m64 -mlarge-data-thresho...@var{num} @gol --mfused-madd -mno-fused-madd -msse2avx} +-mfused-madd -mno-fused-madd -msse2avx -mmcount-nofp} @emph{IA-64 Options} @gccoptlist{-mbig-endian -mlittle-endian -mgnu-as -mgnu-ld -mno-pic @gol @@ -11749,6 +11749,11 @@ @opindex msse2avx Specify that the assembler should encode SSE instructions with VEX prefix. The option @option{-mavx} turns this on by default. + +...@item -mmcount-nofp +Don't force the frame counter with @option{-pg} function profiling. +Instead call a new @code{__mcount_nofp} function before a stack +frame is set up. @end table These @samp{-m} switches are supported in addition to the above Index: gcc/target.h === --- gcc/target.h(revision 149140) +++ gcc/target.h(working copy) @@ -1132,6 +1132,9 @@ */ bool arm_eabi_unwinder; + /* True when the function profiler code is outputted before the prologue. */ + bool profile_before_prologue; + /* Leave the boolean fields at the end. */ }; Index: gcc/final.c === --- gcc/final.c (revision 149140) +++ gcc/final.c (working copy) @@ -1520,10 +1520,8 @@ /* The Sun386i and perhaps other machines don't work right if the profiling code comes after the prologue. */ -#ifdef PROFILE_BEFORE_PROLOGUE - if (crtl-profile) + if (targetm.profile_before_prologue crtl-profile) profile_function (file); -#endif /* PROFILE_BEFORE_PROLOGUE */ #if defined (DWARF2_UNWIND_INFO) defined (HAVE_prologue) if (dwarf2out_do_frame ()) @@ -1565,10 +1563,8 @@ static void profile_after_prologue (FILE *file ATTRIBUTE_UNUSED) { -#ifndef PROFILE_BEFORE_PROLOGUE - if (crtl-profile) + if (!targetm.profile_before_prologue crtl-profile) profile_function (file); -#endif /* not PROFILE_BEFORE_PROLOGUE */ } static void Index: gcc/gcc.c === --- gcc/gcc.c (revision 149140) +++ gcc/gcc.c (working copy) @@ -797,6 +797,12 @@ # define SYSROOT_HEADERS_SUFFIX_SPEC #endif +/* Target can override this to allow -pg/-fomit-frame-pointer together */ +#ifndef TARGET_PG_OPTION_SPEC +#define TARGET_PG_OPTION_SPEC \ +%{pg:%{fomit-frame-pointer:%e-pg and -fomit-frame-pointer are incompatible}} +#endif + static const char *asm_debug; static const char *cpp_spec = CPP_SPEC; static const char *cc1_spec = CC1_SPEC; @@ -866,8 +872,8 @@ /* NB: This is shared amongst all front-ends, except for Ada. */ static const char *cc1_options = -%{pg:%{fomit-frame-pointer:%e-pg and -fomit-frame-pointer are incompatible}}\ - %1 %{!Q:-quiet} -dumpbase %B %{d*} %{m*} %{a*}\ + TARGET_PG_OPTION_SPEC + %1 %{!Q:-quiet} -dumpbase %B %{d*} %{m*} %{a*}\ %{fcompare-debug-second:%:compare-debug-auxbase-opt(%b)} \ %{!fcompare-debug-second:%{c|S:%{o*:-auxbase-strip %*}%{!o*:-auxbase %b}}}%{!c:%{!S:-auxbase %b}} \ %{g*} %{O*} %{W*pedantic*} %{w} %{std*ansitrigraphs}\ Index: gcc/target-def.h
Re: BUG: GCC-4.4.x changes the function frame on some functions
Thomas Gleixner wrote: While testing various kernel configs we found out that the problem comes and goes. Finally I started to compare the gcc command line options and after some fiddling it turned out that the following minimal deltas change the code generator behaviour: Bad: -march=pentium-mmx-Wa,-mtune=generic32 Good: -march=i686-mtune=generic -Wa,-mtune=generic32 Good: -march=pentium-mmx -mtune-generic -Wa,-mtune=generic32 I'm not supposed to understand the logic behind that, right ? I don't either. I'm seeing: timer_stats_update_stats: timer_stats_update_stats: pushl %edi leal8(%esp), %edi andl$-16, %esp pushl -4(%edi) pushl %ebppushl %ebp movl%esp, %ebp movl %esp, %ebp pushl %edi | andl $-16, %esp pushl %esi | subl $112, %esp pushl %ebx | movl %ebx, 100(%esp) subl$108, %esp| movl %esi, 104(%esp) movl %edi, 108(%esp) callmcount call mcount where the only difference is -mtune=generic. I'm investigating. Andrew.
Re: BUG: GCC-4.4.x changes the function frame on some functions
Andrew Haley wrote: Thomas Gleixner wrote: While testing various kernel configs we found out that the problem comes and goes. Finally I started to compare the gcc command line options and after some fiddling it turned out that the following minimal deltas change the code generator behaviour: Bad: -march=pentium-mmx-Wa,-mtune=generic32 Good: -march=i686-mtune=generic -Wa,-mtune=generic32 Good: -march=pentium-mmx -mtune-generic -Wa,-mtune=generic32 I'm not supposed to understand the logic behind that, right ? I don't either. I'm seeing: timer_stats_update_stats: timer_stats_update_stats: pushl %edi leal8(%esp), %edi andl$-16, %esp pushl -4(%edi) pushl %ebppushl %ebp movl%esp, %ebp movl %esp, %ebp pushl %edi | andl $-16, %esp pushl %esi | subl $112, %esp pushl %ebx | movl %ebx, 100(%esp) subl$108, %esp| movl %esi, 104(%esp) movl %edi, 108(%esp) callmcount call mcount where the only difference is -mtune=generic. I'm investigating. Forget that, I see from the gcc-bugs list that hj has tracked it down to the use of DRAP, and for some reason the mtune options affect that. He's the best person to fix this. Andrew.
Re: [PATCH] gcc mcount-nofp was Re: BUG: GCC-4.4.x changes the function frame on some functions
On Fri, 2009-11-20 at 10:57 +0100, Andi Kleen wrote: Steven Rostedt rost...@goodmis.org writes: And frame pointers do add a little overhead as well. Too bad the mcount ABI wasn't something like this: function: callmcount [...] This way, the function address for mcount would have been (%esp) and the parent address would be 4(%esp). Mcount would work without frame pointers and this whole mess would also become moot. I did a patch to do this in x86 gcc some time ago. The motivation was indeed the frame pointer overhead on Atom with tracing. Yes, I remember you talking about this but I don't remember how far it went. Unfortunately it also requires glibc changes (I did those too). For compatibility and catching mistakes the new function was called __mcount_nofp. Actually, could you change the name? I really hate the mcount name, it is legacy and with a new feature, it should be abandoned. Something like __fentry__ would be nicer. I haven't tried it with current gcc and last time I missed the gcc feature merge window with this. But perhaps you find it useful. Of course it would need more kernel changes to probe for the new option and handle it. Here's the old patch. I haven't retested it with a current gcc version, but I think it still applies at least. If there's interest I can polish it up and submit formally. I would definitely be interested, and I would also be willing to test it. Thanks! -- Steve
Re: BUG: GCC-4.4.x changes the function frame on some functions
On Thu, 19 Nov 2009, Thomas Gleixner wrote: Can the GCC folks please shed some light on this: standard function start: push %ebp mov%esp, %ebp call mcount modified function start on a handful of functions only seen with gcc 4.4.x on x86 32 bit: push %edi lea0x8(%esp),%edi and$0xfff0,%esp pushl -0x4(%edi) push %ebp mov%esp,%ebp ... call mcount This modification leads to a hard to solve problem in the kernel function graph tracer which assumes that the stack looks like: return address saved ebp With the modified function start sequence this is not longer true and the manipulation of the return address on the stack fails silently. Neither gcc 4.3 nor gcc 3.4 are generating such function frames, so it looks like a gcc 4.4.x feature. There is no real obvious reason why the edi magic needs to be done _before_ push %ebp mov%esp,%ebp Thanks, tglx
Re: BUG: GCC-4.4.x changes the function frame on some functions
Thomas Gleixner wrote: On Thu, 19 Nov 2009, Thomas Gleixner wrote: Can the GCC folks please shed some light on this: standard function start: push %ebp mov%esp, %ebp call mcount modified function start on a handful of functions only seen with gcc 4.4.x on x86 32 bit: push %edi lea0x8(%esp),%edi and$0xfff0,%esp pushl -0x4(%edi) push %ebp mov%esp,%ebp ... call mcount This modification leads to a hard to solve problem in the kernel function graph tracer which assumes that the stack looks like: return address saved ebp With the modified function start sequence this is not longer true and the manipulation of the return address on the stack fails silently. Neither gcc 4.3 nor gcc 3.4 are generating such function frames, so it looks like a gcc 4.4.x feature. There is no real obvious reason why the edi magic needs to be done _before_ push %ebp mov%esp,%ebp Sure there is: unless you do the adjustment first %ebp won't be 16-aligned. We're aligning the stack properly, as per the ABI requirements. Can't you just fix the tracer? Andrew.
Re: BUG: GCC-4.4.x changes the function frame on some functions
On 11/19/2009 07:37 AM, Thomas Gleixner wrote: modified function start on a handful of functions only seen with gcc 4.4.x on x86 32 bit: push %edi lea0x8(%esp),%edi and$0xfff0,%esp pushl -0x4(%edi) push %ebp mov%esp,%ebp ... call mcount The real questions is why we're aligning the stack in the kernel. It is probably not what we want -- we don't use SSE for anything but a handful of special cases in the kernel, and we don't want the overhead. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf.
Re: BUG: GCC-4.4.x changes the function frame on some functions
On Thu, Nov 19, 2009 at 4:45 PM, H. Peter Anvin h...@zytor.com wrote: On 11/19/2009 07:37 AM, Thomas Gleixner wrote: modified function start on a handful of functions only seen with gcc 4.4.x on x86 32 bit: push %edi lea 0x8(%esp),%edi and $0xfff0,%esp pushl -0x4(%edi) push %ebp mov %esp,%ebp ... call mcount The real questions is why we're aligning the stack in the kernel. It is probably not what we want -- we don't use SSE for anything but a handful of special cases in the kernel, and we don't want the overhead. It's likely because you have long long vars on the stack which is faster when they are aligned. -mno-stackrealign may do what you want (or may not, I have not checked). I assume you already use -mpreferred-stack-boundary=2. Richard.
Re: BUG: GCC-4.4.x changes the function frame on some functions
On Thu, Nov 19, 2009 at 4:49 PM, Richard Guenther richard.guent...@gmail.com wrote: On Thu, Nov 19, 2009 at 4:45 PM, H. Peter Anvin h...@zytor.com wrote: On 11/19/2009 07:37 AM, Thomas Gleixner wrote: modified function start on a handful of functions only seen with gcc 4.4.x on x86 32 bit: push %edi lea 0x8(%esp),%edi and $0xfff0,%esp pushl -0x4(%edi) push %ebp mov %esp,%ebp ... call mcount The real questions is why we're aligning the stack in the kernel. It is probably not what we want -- we don't use SSE for anything but a handful of special cases in the kernel, and we don't want the overhead. It's likely because you have long long vars on the stack which is faster when they are aligned. -mno-stackrealign may do what you want (or may not, I have not checked). I assume you already use -mpreferred-stack-boundary=2. Just checking it seems you must be using -mincoming-stack-boundary=2 instead but keep the preferred stack boundary at 4. Richard.
Re: BUG: GCC-4.4.x changes the function frame on some functions
On 11/19/2009 07:44 AM, Andrew Haley wrote: We're aligning the stack properly, as per the ABI requirements. Can't you just fix the tracer? Per the ABI requirements? We're talking 32 bits, here. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf.
Re: BUG: GCC-4.4.x changes the function frame on some functions
On Thu, Nov 19, 2009 at 4:54 PM, H. Peter Anvin h...@zytor.com wrote: On 11/19/2009 07:44 AM, Andrew Haley wrote: We're aligning the stack properly, as per the ABI requirements. Can't you just fix the tracer? Per the ABI requirements? We're talking 32 bits, here. Hm, even with void bar (int *); void foo (void) { int x; bar (x); } gcc -S -O2 -m32 -mincoming-stack-boundary=2 t.c we re-align the stack. That looks indeed bogus. HJ, you invented all this code, what's the reason for the above? Richard.
Re: BUG: GCC-4.4.x changes the function frame on some functions
On Thu, 2009-11-19 at 15:44 +, Andrew Haley wrote: Thomas Gleixner wrote: We're aligning the stack properly, as per the ABI requirements. Can't you just fix the tracer? And how do we do that? The hooks that are in place have no idea of what happened before they were called? -- Steve
Re: BUG: GCC-4.4.x changes the function frame on some functions
On Thu, 19 Nov 2009, Andrew Haley wrote: Thomas Gleixner wrote: There is no real obvious reason why the edi magic needs to be done _before_ push %ebp mov%esp,%ebp Sure there is: unless you do the adjustment first %ebp won't be 16-aligned. And why is this not done in 99% of the functions in the kernel, just in this one and some random others ? We're aligning the stack properly, as per the ABI requirements. Can't you just fix the tracer? Where is that ABI requirement that push %ebp needs to happen on an aligned stack ? And why is this something GCC did not care about until GCC4.4 ? Thanks, tglx
Re: BUG: GCC-4.4.x changes the function frame on some functions
On 11/19/2009 08:02 AM, Steven Rostedt wrote: On Thu, 2009-11-19 at 15:44 +, Andrew Haley wrote: Thomas Gleixner wrote: We're aligning the stack properly, as per the ABI requirements. Can't you just fix the tracer? And how do we do that? The hooks that are in place have no idea of what happened before they were called? Furthermore, it is nonsense -- ABI stack alignment on *32 bits* is 4 bytes, not 16. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf.
Re: BUG: GCC-4.4.x changes the function frame on some functions
On Thu, 2009-11-19 at 15:44 +, Andrew Haley wrote: We're aligning the stack properly, as per the ABI requirements. Can't you just fix the tracer? Unfortunately, this is the only fix we have: diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig index b416512..cd39064 100644 --- a/kernel/trace/Kconfig +++ b/kernel/trace/Kconfig @@ -143,7 +143,6 @@ config FUNCTION_GRAPH_TRACER bool Kernel Function Graph Tracer depends on HAVE_FUNCTION_GRAPH_TRACER depends on FUNCTION_TRACER - depends on !X86_32 || !CC_OPTIMIZE_FOR_SIZE default y help Enable the kernel to trace a function at both its return diff --git a/kernel/trace/trace_functions_graph.c b/kernel/trace/trace_functions_graph.c index 45e6c01..50c2251 100644 --- a/kernel/trace/trace_functions_graph.c +++ b/kernel/trace/trace_functions_graph.c @@ -1070,6 +1070,11 @@ static __init int init_graph_trace(void) { max_bytes_for_cpu = snprintf(NULL, 0, %d, nr_cpu_ids - 1); +#if defined(CONFIG_X86_32 __GNUC__ = 4 __GNUC_MINOR__ = 4) + pr_info(WARNING: GCC 4.4.X breaks the function graph tracer on i686.\n +The function graph tracer will be disabled.\n); + return -1; +#endif return register_tracer(graph_trace); } -- Steve
Re: BUG: GCC-4.4.x changes the function frame on some functions
Thomas Gleixner wrote: On Thu, 19 Nov 2009, Andrew Haley wrote: Thomas Gleixner wrote: There is no real obvious reason why the edi magic needs to be done _before_ push %ebp mov%esp,%ebp Sure there is: unless you do the adjustment first %ebp won't be 16-aligned. And why is this not done in 99% of the functions in the kernel, just in this one and some random others ? If I could see the function I might be able to tell you. It's either a performance enhancement, something to do with SSE, or it's a bug. Andrew.
Re: BUG: GCC-4.4.x changes the function frame on some functions
On Thu, Nov 19, 2009 at 11:02:32AM -0500, Steven Rostedt wrote: On Thu, 2009-11-19 at 15:44 +, Andrew Haley wrote: Thomas Gleixner wrote: We're aligning the stack properly, as per the ABI requirements. Can't you just fix the tracer? And how do we do that? The hooks that are in place have no idea of what happened before they were called? -- Steve Yep, this is really something we can't fix from the tracer
Re: BUG: GCC-4.4.x changes the function frame on some functions
On Thu, 19 Nov 2009, Andrew Haley wrote: Thomas Gleixner wrote: On Thu, 19 Nov 2009, Andrew Haley wrote: Thomas Gleixner wrote: There is no real obvious reason why the edi magic needs to be done _before_ push %ebp mov%esp,%ebp Sure there is: unless you do the adjustment first %ebp won't be 16-aligned. And why is this not done in 99% of the functions in the kernel, just in this one and some random others ? If I could see the function I might be able to tell you. It's either a performance enhancement, something to do with SSE, or it's a bug. kernel/time/timer_stats.c timer_stats_update_stats() Here is the disassembly: 8107ad50 timer_stats_update_stats: 8107ad50: 57 push %edi 8107ad51: 8d 7c 24 08 lea0x8(%esp),%edi 8107ad55: 83 e4 f0and$0xfff0,%esp 8107ad58: ff 77 fcpushl -0x4(%edi) 8107ad5b: 55 push %ebp 8107ad5c: 89 e5 mov%esp,%ebp 8107ad5e: 57 push %edi 8107ad5f: 56 push %esi 8107ad60: 53 push %ebx 8107ad61: 83 ec 6csub$0x6c,%esp 8107ad64: e8 47 92 f8 ff call 81003fb0 mcount 8107ad69: 8b 77 04mov0x4(%edi),%esi 8107ad6c: 89 75 a4mov%esi,-0x5c(%ebp) 8107ad6f: 65 8b 35 14 00 00 00mov%gs:0x14,%esi 8107ad76: 89 75 e4mov%esi,-0x1c(%ebp) 8107ad79: 31 f6 xor%esi,%esi 8107ad7b: 8b 35 60 5a cd 81 mov0x81cd5a60,%esi 8107ad81: 8b 1f mov(%edi),%ebx 8107ad83: 85 f6 test %esi,%esi 8107ad85: 8b 7f 08mov0x8(%edi),%edi 8107ad88: 75 18 jne8107ada2 timer_stats_update_stats+0x52 8107ad8a: 8b 45 e4mov-0x1c(%ebp),%eax 8107ad8d: 65 33 05 14 00 00 00xor%gs:0x14,%eax 8107ad94: 75 53 jne8107ade9 timer_stats_update_stats+0x99 8107ad96: 83 c4 6cadd$0x6c,%esp 8107ad99: 5b pop%ebx 8107ad9a: 5e pop%esi 8107ad9b: 5f pop%edi 8107ad9c: 5d pop%ebp 8107ad9d: 8d 67 f8lea-0x8(%edi),%esp 8107ada0: 5f pop%edi 8107ada1: c3 ret 8107ada2: be 00 7a d6 81 mov$0x81d67a00,%esi 8107ada7: 89 45 acmov%eax,-0x54(%ebp) 8107adaa: 89 75 a0mov%esi,-0x60(%ebp) 8107adad: 89 5d b4mov%ebx,-0x4c(%ebp) 8107adb0: 64 8b 35 78 6a d6 81mov%fs:0x81d66a78,%esi 8107adb7: 8b 34 b5 20 50 cd 81mov-0x7e32afe0(,%esi,4),%esi 8107adbe: 89 4d b0mov%ecx,-0x50(%ebp) 8107adc1: 01 75 a0add%esi,-0x60(%ebp) 8107adc4: 89 55 b8mov%edx,-0x48(%ebp) 8107adc7: 8b 45 a0mov-0x60(%ebp),%eax 8107adca: 89 7d c0mov%edi,-0x40(%ebp) 8107adcd: e8 de f7 76 00 call 817ea5b0 _spin_lock_irqsave 8107add2: 83 3d 60 5a cd 81 00cmpl $0x0,0x81cd5a60 8107add9: 89 c3 mov%eax,%ebx 8107addb: 75 11 jne8107adee timer_stats_update_stats+0x9e 8107addd: 89 da mov%ebx,%edx 8107addf: 8b 45 a0mov-0x60(%ebp),%eax 8107ade2: e8 79 fc 76 00 call 817eaa60 _spin_unlock_irqrestore 8107ade7: eb a1 jmp8107ad8a timer_stats_update_stats+0x3a 8107ade9: e8 52 e4 fc ff call 81049240 __stack_chk_fail 8107adee: 8d 45 a8lea-0x58(%ebp),%eax 8107adf1: 8b 55 a4mov-0x5c(%ebp),%edx 8107adf4: e8 f7 fd ff ff call 8107abf0 tstat_lookup 8107adf9: 85 c0 test %eax,%eax 8107adfb: 74 05 je 8107ae02 timer_stats_update_stats+0xb2 8107adfd: ff 40 14incl 0x14(%eax) 8107ae00: eb db jmp8107addd timer_stats_update_stats+0x8d 8107ae02: f0 ff 05 00 67 fd 81lock incl 0x81fd6700 8107ae09: eb d2 jmp8107addd timer_stats_update_stats+0x8d 8107ae0b: 90 nop 8107ae0c: 90 nop 8107ae0d: 90 nop 8107ae0e: 90 nop 8107ae0f: 90 nop There is a dozen more of those. Thanks, tglx
Re: BUG: GCC-4.4.x changes the function frame on some functions
Richard Guenther richard.guent...@gmail.com writes: It's likely because you have long long vars on the stack which is faster when they are aligned. It's not faster for 32bit. -Andi -- a...@linux.intel.com -- Speaking for myself only.
Re: BUG: GCC-4.4.x changes the function frame on some functions
On Thu, 19 Nov 2009, Thomas Gleixner wrote: standard function start: push %ebp mov%esp, %ebp call mcount modified function start on a handful of functions only seen with gcc 4.4.x on x86 32 bit: push %edi lea0x8(%esp),%edi and$0xfff0,%esp pushl -0x4(%edi) push %ebp mov%esp,%ebp ... call mcount That's some crazy sh*t anyway, since we don't _want_ the stack to be 16-byte aligned in the kernel. We do KBUILD_CFLAGS += $(call cc-option,-mpreferred-stack-boundary=2) why is that not working? So this looks like a gcc bug, plain and simple. This modification leads to a hard to solve problem in the kernel function graph tracer which assumes that the stack looks like: return address saved ebp Umm. But it still does, doesn't it? That pushl -0x4(%edi) push %ebp should do it - the -0x4(%edi) thing seems to be trying to reload the return address. No? Maybe I misread the code - but regardless, it does look like a gcc code generation bug if only because we really don't want a 16-byte aligned stack anyway, and have asked for it to not be done. So I agree that gcc shouldn't do that crazy prologue (and certainly _not_ before calling mcount anyway), but I'm not sure I agree with that detail of your analysis or explanation. Linus
Re: BUG: GCC-4.4.x changes the function frame on some functions
On Thu, 19 Nov 2009, Linus Torvalds wrote: Umm. But it still does, doesn't it? That pushl -0x4(%edi) push %ebp should do it - the -0x4(%edi) thing seems to be trying to reload the return address. No? Maybe I misread the code - but regardless, it does look like a gcc code generation bug if only because we really don't want a 16-byte aligned stack anyway, and have asked for it to not be done. So I agree that gcc shouldn't do that crazy prologue (and certainly _not_ before calling mcount anyway), but I'm not sure I agree with that detail of your analysis or explanation. Yes, it does store the return address before the pushed ebp, but this is a copy of the real stack entry which is before the pushed edi. The function graph tracer needs to redirect the return into the tracer and it therefor saves the real return address and modifies the stack so the return ends up in the tracer code which then goes back to the real return address. But in this prologue/aligment case we modify the copy and not the real return address on the stack, so we return without calling into the tracer which is causing the headache because the state of the tracer becomes confused. Thanks, tglx
Re: BUG: GCC-4.4.x changes the function frame on some functions
On Thu, 2009-11-19 at 09:39 -0800, Linus Torvalds wrote: This modification leads to a hard to solve problem in the kernel function graph tracer which assumes that the stack looks like: return address saved ebp Umm. But it still does, doesn't it? That pushl -0x4(%edi) push %ebp should do it - the -0x4(%edi) thing seems to be trying to reload the return address. No? Yes that is what it is doing. The problem we have is that it is putting into the frame pointer a copy of the return address, and not the actual pointer. Which is fine for the function tracer, but breaks the function graph tracer (which is a much more powerful tracer). Technically, this is all that mcount must have. And yes, we are making an assumption that the return address in the frame pointer is the one that will be used to leave the function. But the reason for making this copy just seems to be all messed up. I don't know if the ABI says anything about the return address in the frame pointer must be the actual return address. But it would be nice if the gcc folks would let us guarantee that it is. -- Steve
Re: BUG: GCC-4.4.x changes the function frame on some functions
On Thu, Nov 19, 2009 at 6:59 PM, Steven Rostedt rost...@goodmis.org wrote: On Thu, 2009-11-19 at 09:39 -0800, Linus Torvalds wrote: This modification leads to a hard to solve problem in the kernel function graph tracer which assumes that the stack looks like: return address saved ebp Umm. But it still does, doesn't it? That pushl -0x4(%edi) push %ebp should do it - the -0x4(%edi) thing seems to be trying to reload the return address. No? Yes that is what it is doing. The problem we have is that it is putting into the frame pointer a copy of the return address, and not the actual pointer. Which is fine for the function tracer, but breaks the function graph tracer (which is a much more powerful tracer). Technically, this is all that mcount must have. And yes, we are making an assumption that the return address in the frame pointer is the one that will be used to leave the function. But the reason for making this copy just seems to be all messed up. I don't know if the ABI says anything about the return address in the frame pointer must be the actual return address. But it would be nice if the gcc folks would let us guarantee that it is. Note that I only can reproduce the issue with -mincoming-stack-boundary=2, not with -mpreferred-stack-boundary=2. And you didn't provide us with a testcase either ... so please open a bugzilla and attach preprocessed source of a file that shows the problem, note the function it happens in and provide the command-line options you used for building. Otherwise it's going to be all speculation on our side. Thanks, Richard.
Re: BUG: GCC-4.4.x changes the function frame on some functions
Thomas Gleixner wrote: On Thu, 19 Nov 2009, Thomas Gleixner wrote: Can the GCC folks please shed some light on this: standard function start: push %ebp mov%esp, %ebp call mcount modified function start on a handful of functions only seen with gcc 4.4.x on x86 32 bit: push %edi lea0x8(%esp),%edi and$0xfff0,%esp pushl -0x4(%edi) push %ebp mov%esp,%ebp ... call mcount This modification leads to a hard to solve problem in the kernel function graph tracer which assumes that the stack looks like: return address saved ebp With the modified function start sequence this is not longer true and the manipulation of the return address on the stack fails silently. Neither gcc 4.3 nor gcc 3.4 are generating such function frames, so it looks like a gcc 4.4.x feature. There is no real obvious reason why the edi magic needs to be done _before_ push %ebp mov%esp,%ebp OK, I found it. There is a struct defined as struct entry { ... } __attribute__((__aligned__((1 (4); and then in timer_stats_update_stats you have a local variable of type struct entry: void timer_stats_update_stats() { spinlock_t *lock; struct entry *entry, input; So, gcc has to 16-align the stack pointer to satisfy the alignment for struct entry. Andrew.
Re: BUG: GCC-4.4.x changes the function frame on some functions
Richard Guenther wrote: And you didn't provide us with a testcase either ... so please open a bugzilla and attach preprocessed source of a file that shows the problem, note the function it happens in and provide the command-line options you used for building. I've got all that off-list. I found the cause, and replied in another email. It's not a bug. Andrew.
Re: BUG: GCC-4.4.x changes the function frame on some functions
On Thu, 2009-11-19 at 18:20 +, Andrew Haley wrote: OK, I found it. There is a struct defined as struct entry { ... } __attribute__((__aligned__((1 (4); and then in timer_stats_update_stats you have a local variable of type struct entry: void timer_stats_update_stats() { spinlock_t *lock; struct entry *entry, input; So, gcc has to 16-align the stack pointer to satisfy the alignment for struct entry. It has to align the entire stack? Why not just the variable within the stack? -- Steve
Re: BUG: GCC-4.4.x changes the function frame on some functions
On Thu, 19 Nov 2009, Richard Guenther wrote: Note that I only can reproduce the issue with -mincoming-stack-boundary=2, not with -mpreferred-stack-boundary=2. And you didn't provide us with a testcase either ... so please open a bugzilla and attach preprocessed source of a file that shows the problem, note the function it happens in and provide the command-line options you used for building. http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42109 Thanks, tglx
Re: BUG: GCC-4.4.x changes the function frame on some functions
On Thu, Nov 19, 2009 at 10:33 AM, Steven Rostedt rost...@goodmis.org wrote: It has to align the entire stack? Why not just the variable within the stack? I had proposed a patch which just aligns the variable but that patch was never really commented on and HJL's patches to realign the whole stack went in afterwards. Thanks, Andrew Pinski
Re: BUG: GCC-4.4.x changes the function frame on some functions
Steven Rostedt wrote: On Thu, 2009-11-19 at 18:20 +, Andrew Haley wrote: OK, I found it. There is a struct defined as struct entry { ... } __attribute__((__aligned__((1 (4); and then in timer_stats_update_stats you have a local variable of type struct entry: void timer_stats_update_stats() { spinlock_t *lock; struct entry *entry, input; So, gcc has to 16-align the stack pointer to satisfy the alignment for struct entry. It has to align the entire stack? Why not just the variable within the stack? How?. gcc has to know, at compile time, the offset from sp of each variable. So, it of course makes sure that offset is 16-aligned, but it also has to 16-align the stack pointer. Andrew.
Re: BUG: GCC-4.4.x changes the function frame on some functions
On 11/19/2009 10:33 AM, Steven Rostedt wrote: It has to align the entire stack? Why not just the variable within the stack? Because if the stack pointer isn't aligned, it won't know where it can stuff the variable. It has to pad *somewhere*, and since you may have more than one such variable, the most efficient way -- and by far least complex -- is for the compiler to align the stack when it sets up the stack frame. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf.
Re: BUG: GCC-4.4.x changes the function frame on some functions
On Thu, 19 Nov 2009, Andrew Haley wrote: OK, I found it. There is a struct defined as struct entry { ... } __attribute__((__aligned__((1 (4); and then in timer_stats_update_stats you have a local variable of type struct entry: void timer_stats_update_stats() { spinlock_t *lock; struct entry *entry, input; So, gcc has to 16-align the stack pointer to satisfy the alignment for struct entry. This does not explain why GCC 4.4.x actually puts push %ebp mov %esp, %ebp first and why GCC 4.4.x decides to create an extra copy of the return address instead of just keeping the mcount stack magic right at the function entry. Thanks, tglx
Re: BUG: GCC-4.4.x changes the function frame on some functions
On Thu, 19 Nov 2009, Richard Guenther wrote: Note that I only can reproduce the issue with -mincoming-stack-boundary=2, not with -mpreferred-stack-boundary=2. Since you can reproduce it with -mincoming-stack-boundary=2, I woul suggest just fixing mcount handling that way regardless of anything else. The current code generated by gcc is just insane - even for the case where you _want_ 16-byte stack alignment. Instead crazy code like push %edi lea0x8(%esp),%edi and$0xfff0,%esp pushl -0x4(%edi) push %ebp mov%esp,%ebp ... call mcount the sane thing to do would be to just do it as push %ebp mov%esp,%ebp call mcount and$0xfff0,%esp since - no sane 'mcount' implementation can ever care about 16-byte stack alignment anyway, so aliging the stack before mcount is crazy. - mcount is special anyway, and is the only thing that cares about that whole ebp/return address thing is mcount, and _all_ your games with %edi are about that mcount thing. IOW, once you as a compiler person understand that the 'mcount' call is special, you should have realized that all the work you did for it was totally pointless and stupid to begin with. You must already have that special mcount logic (the whole code to save a register early and push the fake mcount stack frame), so instead of _that_ special logic, change it to a different mcount special logic that associates the 'mcount' call with theframe pointer pushing. That will not only make the Linux kernel tracer happy, it will make all your _other_ users happier too, since you can generate smaller and more efficient code. Admittedly, anybody who compiles with -pg probably doesn't care deeply about smaller and more efficient code, since the mcount call overhead tends to make the thing moot anyway, but it really looks like a win-win situation to just fix the mcount call sequence regardless. And you didn't provide us with a testcase either ... so please open a bugzilla and attach preprocessed source of a file that shows the problem, note the function it happens in and provide the command-line options you used for building. Otherwise it's going to be all speculation on our side. See above - all you need to do is to just fix mcount calling. Now, there is a separate bug that shows that you seem to over-align the stack when not asked for, and yes, since we noticed that I hope that Thomas and friends will fix that, but I think your mcount logic could (and should) be fixed as an independent sillyness. Linus
Re: BUG: GCC-4.4.x changes the function frame on some functions
On Thu, 19 Nov 2009, Andrew Haley wrote: I've got all that off-list. I found the cause, and replied in another email. It's not a bug. Oh Gods, are we back to gcc people saying sure, we do stupid things, but it's allowed, so we don't consider it a bug because it doesn't matter that real people care about real life, we only care about some paper, and real life doesn't matter, if it's 'undefined' we can make our idiotic choices regardless of what people need, and regardless of whether it actually generates better code or not. Linus
Re: BUG: GCC-4.4.x changes the function frame on some functions
On Thu, 19 Nov 2009, Linus Torvalds wrote: Oh Gods, are we back to gcc people saying sure, we do stupid things, but it's allowed, so we don't consider it a bug because it doesn't matter that real people care about real life, we only care about some paper, and real life doesn't matter, if it's 'undefined' we can make our idiotic choices regardless of what people need, and regardless of whether it actually generates better code or not. Put another way: the stack alignment itself may not be a bug, but gcc generating God-awful code for the mcount handling that results in problems in real life sure as hell is *stupid* enough to be called a bug. I bet other people than just the kernel use the mcount hook for subtler things than just doing profiles. And even if they don't, the quoted code generation is just crazy _crap_. Linus
Re: BUG: GCC-4.4.x changes the function frame on some functions
* Linus Torvalds torva...@linux-foundation.org wrote: Admittedly, anybody who compiles with -pg probably doesn't care deeply about smaller and more efficient code, since the mcount call overhead tends to make the thing moot anyway, but it really looks like a win-win situation to just fix the mcount call sequence regardless. Just a sidenote: due to dyn-ftrace, which patches out all mcounts during bootup to be NOPs (and opt-in patches them in again if someone runs the function tracer), the cost is not as large as one would have it with say -pg based user-space profiling. It's not completely zero-cost as the pure NOPs balloon the i$ footprint a bit and GCC generates different code too in some cases. But it's certainly good enough that it's generally pretty hard to prove overhead via micro or macro benchmarks that the patched out mcounts call sites are there. Ingo
Re: BUG: GCC-4.4.x changes the function frame on some functions
On Thu, 19 Nov 2009, Linus Torvalds wrote: I bet other people than just the kernel use the mcount hook for subtler things than just doing profiles. And even if they don't, the quoted code generation is just crazy _crap_. For the kernel, if the only case is that timer_stat.c thing that Thomas pointed at, I guess we can at least work around it with something like the appended. The kernel code is certainly ugly too, no question about that. It's just that we'd like to be able to depend on mcount code generation not being insane even in the presense of ugly code.. The alternative would be to have some warning when this happens, so that we can at least see it. mcount won't work reliably Linus --- kernel/time/timer_stats.c |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/time/timer_stats.c b/kernel/time/timer_stats.c index ee5681f..488c7b8 100644 --- a/kernel/time/timer_stats.c +++ b/kernel/time/timer_stats.c @@ -76,7 +76,7 @@ struct entry { */ charcomm[TASK_COMM_LEN + 1]; -} cacheline_aligned_in_smp; +}; /* * Spinlock protecting the tables - not taken during lookup: @@ -114,7 +114,7 @@ static ktime_t time_start, time_stop; #define MAX_ENTRIES(1UL MAX_ENTRIES_BITS) static unsigned long nr_entries; -static struct entry entries[MAX_ENTRIES]; +static struct entry entries[MAX_ENTRIES] cacheline_aligned_in_smp; static atomic_t overflow_count;
Re: BUG: GCC-4.4.x changes the function frame on some functions
On Thu, 19 Nov 2009, Linus Torvalds wrote: I bet other people than just the kernel use the mcount hook for subtler things than just doing profiles. And even if they don't, the quoted code generation is just crazy _crap_. For the kernel, if the only case is that timer_stat.c thing that Thomas pointed at, I guess we can at least work around it with something like the appended. The kernel code is certainly ugly too, no question about that. It's just that we'd like to be able to depend on mcount code generation not being insane even in the presense of ugly code.. The alternative would be to have some warning when this happens, so that we can at least see it. mcount won't work reliably There are at least 20 other random functions which have the same problem. Have not looked at the details yet. Just compiled with -mincoming-stack-boundary=4 and the problem goes away as gcc now thinks that the incoming stack is already 16 byte aligned. But that might break code which actually uses SSE Thanks, tglx
Re: BUG: GCC-4.4.x changes the function frame on some functions
On Thu, 2009-11-19 at 19:47 +0100, Ingo Molnar wrote: * Linus Torvalds torva...@linux-foundation.org wrote: Admittedly, anybody who compiles with -pg probably doesn't care deeply about smaller and more efficient code, since the mcount call overhead tends to make the thing moot anyway, but it really looks like a win-win situation to just fix the mcount call sequence regardless. Just a sidenote: due to dyn-ftrace, which patches out all mcounts during bootup to be NOPs (and opt-in patches them in again if someone runs the function tracer), the cost is not as large as one would have it with say -pg based user-space profiling. It's not completely zero-cost as the pure NOPs balloon the i$ footprint a bit and GCC generates different code too in some cases. But it's certainly good enough that it's generally pretty hard to prove overhead via micro or macro benchmarks that the patched out mcounts call sites are there. And frame pointers do add a little overhead as well. Too bad the mcount ABI wasn't something like this: function: callmcount [...] This way, the function address for mcount would have been (%esp) and the parent address would be 4(%esp). Mcount would work without frame pointers and this whole mess would also become moot. -- Steve
Re: BUG: GCC-4.4.x changes the function frame on some functions
Linus Torvalds wrote: On Thu, 19 Nov 2009, Linus Torvalds wrote: I bet other people than just the kernel use the mcount hook for subtler things than just doing profiles. And even if they don't, the quoted code generation is just crazy _crap_. For the kernel, if the only case is that timer_stat.c thing that Thomas pointed at, I guess we can at least work around it with something like the appended. The kernel code is certainly ugly too, no question about that. It's just that we'd like to be able to depend on mcount code generation not being insane even in the presense of ugly code.. The alternative would be to have some warning when this happens, so that we can at least see it. mcount won't work reliably For the MIPS port of GCC and Linux I recently added the -mmcount-ra-address switch. It causes the location of the return address (on the stack) to be passed to mcount in a scratch register. Perhaps something similar could be done for x86. It would make this patching of the return location more reliable at the expense of more code at the mcount invocation site. For the MIPS case the code size doesn't increase, as it is done in the delay slot of the call instruction, which would otherwise be a nop. David Daney
Re: BUG: GCC-4.4.x changes the function frame on some functions
On Thu, Nov 19, 2009 at 02:28:06PM -0500, Steven Rostedt wrote: On Thu, 2009-11-19 at 11:10 -0800, David Daney wrote: Linus Torvalds wrote: For the MIPS port of GCC and Linux I recently added the -mmcount-ra-address switch. It causes the location of the return address (on the stack) to be passed to mcount in a scratch register. Hehe, scratch register on i686 ;-) i686 has no extra regs. It just has: %eax, %ebx, %ecx, %edx - as the general purpose regs %esp - stack %ebp - frame pointer %edi, %esi - counter regs That's just 8 regs, and half of those are special. Perhaps something similar could be done for x86. It would make this patching of the return location more reliable at the expense of more code at the mcount invocation site. I rather not put any more code in the call site. For the MIPS case the code size doesn't increase, as it is done in the delay slot of the call instruction, which would otherwise be a nop. I showed in a previous post what the best would be for x86. That is just calling mcount at the very beginning of the function. The return address is automatically pushed onto the stack. Perhaps we could create another profiler? Instead of calling mcount, call a new function: __fentry__ or something. Have it activated with another switch. This could make the performance of the function tracer even better without all these exceptions. function: call __fentry__ [...] -- Steve I would really like this. So that we can forget about other possible further suprises due to sophisticated function prologues beeing before the mcount call. And I guess that would fix it in every archs. That said, Linus had a good point about the fact there might other uses of mcount even more tricky than what does the function graph tracer, outside the kernel, and those may depend on the strict ABI assumption that 4(ebp) is always the _real_ return address, and that through all the previous stack call. This is even a concern that extrapolates the single mcount case. So I wonder that actually the real problem is the lack of something that could provide this guarantee. We may need a -real-ra-before-fp (yeah I suck in naming).
Re: BUG: GCC-4.4.x changes the function frame on some functions
* Steven Rostedt rost...@goodmis.org wrote: On Thu, 2009-11-19 at 19:47 +0100, Ingo Molnar wrote: * Linus Torvalds torva...@linux-foundation.org wrote: Admittedly, anybody who compiles with -pg probably doesn't care deeply about smaller and more efficient code, since the mcount call overhead tends to make the thing moot anyway, but it really looks like a win-win situation to just fix the mcount call sequence regardless. Just a sidenote: due to dyn-ftrace, which patches out all mcounts during bootup to be NOPs (and opt-in patches them in again if someone runs the function tracer), the cost is not as large as one would have it with say -pg based user-space profiling. It's not completely zero-cost as the pure NOPs balloon the i$ footprint a bit and GCC generates different code too in some cases. But it's certainly good enough that it's generally pretty hard to prove overhead via micro or macro benchmarks that the patched out mcounts call sites are there. And frame pointers do add a little overhead as well. Too bad the mcount ABI wasn't something like this: function: callmcount [...] This way, the function address for mcount would have been (%esp) and the parent address would be 4(%esp). Mcount would work without frame pointers and this whole mess would also become moot. In that case we could also fix up static callsites to this address as well (to jump +5 bytes into the function) and avoid the NOP as well in most cases. (That would in essence merge any slow-path function epilogue with the mcount cal instruction in terms of I$ footprint - i.e. it would be an even lower overhead feature.) If only the kernel had its own compiler. Ingo
Re: BUG: GCC-4.4.x changes the function frame on some functions
On 11/19/2009 11:28 AM, Steven Rostedt wrote: Hehe, scratch register on i686 ;-) i686 has no extra regs. It just has: %eax, %ebx, %ecx, %edx - as the general purpose regs %esp - stack %ebp - frame pointer %edi, %esi - counter regs That's just 8 regs, and half of those are special. For a modern ABI it is better described as: %eax, %edx, %ecx- argument/return/scratch registers %ebx, %esi, %edi- saved registers %esp- stack pointer %ebp- frame pointer (saved) Perhaps we could create another profiler? Instead of calling mcount, call a new function: __fentry__ or something. Have it activated with another switch. This could make the performance of the function tracer even better without all these exceptions. function: call __fentry__ [...] Calling the profiler immediately at the entry point is clearly the more sane option. It means the ABI is well-defined, stable, and independent of what the actual function contents are. It means that ABI isn't the normal C ABI (the __fentry__ function would have to preserve all registers), but that's fine... -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf.
Re: BUG: GCC-4.4.x changes the function frame on some functions
2009/11/19 Frederic Weisbecker fweis...@gmail.com: I would really like this. So that we can forget about other possible further suprises due to sophisticated function prologues beeing before the mcount call. And I guess that would fix it in every archs. My 5 cent for this, too. That said, Linus had a good point about the fact there might other uses of mcount even more tricky than what does the function graph tracer, outside the kernel, and those may depend on the strict ABI assumption that 4(ebp) is always the _real_ return address, and that through all the previous stack call. This is even a concern that extrapolates the single mcount case. So I wonder that actually the real problem is the lack of something that could provide this guarantee. We may need a -real-ra-before-fp (yeah I suck in naming). There are, especially in windows world. We noticed that for example the Sun's JDK (which is compiled by VC) can be used in gcc compiled code only by -fno-omit-frame-pointer, as otherwise it fails badly reasoned by wrong ebp accesses. Kai -- | (\_/) This is Bunny. Copy and paste | (='.'=) Bunny into your signature to help | ()_() him gain world domination
Re: BUG: GCC-4.4.x changes the function frame on some functions
On Thu, Nov 19, 2009 at 08:54:56PM +0100, Kai Tietz wrote: 2009/11/19 Frederic Weisbecker fweis...@gmail.com: I would really like this. So that we can forget about other possible further suprises due to sophisticated function prologues beeing before the mcount call. And I guess that would fix it in every archs. My 5 cent for this, too. That said, Linus had a good point about the fact there might other uses of mcount even more tricky than what does the function graph tracer, outside the kernel, and those may depend on the strict ABI assumption that 4(ebp) is always the _real_ return address, and that through all the previous stack call. This is even a concern that extrapolates the single mcount case. So I wonder that actually the real problem is the lack of something that could provide this guarantee. We may need a -real-ra-before-fp (yeah I suck in naming). There are, especially in windows world. We noticed that for example the Sun's JDK (which is compiled by VC) can be used in gcc compiled code only by -fno-omit-frame-pointer, as otherwise it fails badly reasoned by wrong ebp accesses. Yeah but what we need is not only to ensure ebp is used as the frame pointer but also that ebp + 4 is really the address that will be used to return to the caller, and not a copy of the return value.
Re: BUG: GCC-4.4.x changes the function frame on some functions
On Thu, 2009-11-19 at 20:46 +0100, Frederic Weisbecker wrote: On Thu, Nov 19, 2009 at 02:28:06PM -0500, Steven Rostedt wrote: function: call __fentry__ [...] -- Steve I would really like this. So that we can forget about other possible further suprises due to sophisticated function prologues beeing before the mcount call. And I guess that would fix it in every archs. Well, other archs use a register to store the return address. But it would also be easy to do (pseudo arch assembly): function: mov lr, (%sp) add 8, %sp blr __fentry__ sub 8, %sp mov (%sp), lr That way the lr would have the current function, and the parent would still be at 8(%sp) That said, Linus had a good point about the fact there might other uses of mcount even more tricky than what does the function graph tracer, outside the kernel, and those may depend on the strict ABI assumption that 4(ebp) is always the _real_ return address, and that through all the previous stack call. This is even a concern that extrapolates the single mcount case. As I am proposing a new call. This means that mcount stay as is for legacy reasons. Yes I know there exists the -finstrument-functions but that adds way too much bloat to the code. One single call to the profiler is all I want. So I wonder that actually the real problem is the lack of something that could provide this guarantee. We may need a -real-ra-before-fp (yeah I suck in naming). Don't worry, so do the C compiler folks, I mean, come on mcount? -- Steve
Re: BUG: GCC-4.4.x changes the function frame on some functions
On Thu, 19 Nov 2009, H. Peter Anvin wrote: Calling the profiler immediately at the entry point is clearly the more sane option. It means the ABI is well-defined, stable, and independent of what the actual function contents are. It means that ABI isn't the normal C ABI (the __fentry__ function would have to preserve all registers), but that's fine... As far as I know, that's true of _mcount already: it's not a normal ABI and is rather a highly architecture-specific special case to begin with. At least ARM has some (several?) special mcount calling conventions, afaik. (And then ARM people use __attribute__((naked)) and insert the code by inline asm, or something. That seems to be standard in the embedded world, where they often do even stranger things than we do in the kernel. At least our low-level system call and interrupt handlers are written as assembly language - the embedded world seems to commonly write them as C functions with magic attributes and inline asm). Linus
Re: BUG: GCC-4.4.x changes the function frame on some functions
On Thu, 2009-11-19 at 11:50 -0800, H. Peter Anvin wrote: Perhaps we could create another profiler? Instead of calling mcount, call a new function: __fentry__ or something. Have it activated with another switch. This could make the performance of the function tracer even better without all these exceptions. function: call __fentry__ [...] Calling the profiler immediately at the entry point is clearly the more sane option. It means the ABI is well-defined, stable, and independent of what the actual function contents are. It means that ABI isn't the normal C ABI (the __fentry__ function would have to preserve all registers), but that's fine... mcount already has that requirement (saving all/most regs). Anyway, you are right, we don't care. The tracer should carry the blunt of the load, not the individual callers. -- Steve
Re: BUG: GCC-4.4.x changes the function frame on some functions
On Thu, 2009-11-19 at 15:05 -0500, Steven Rostedt wrote: Well, other archs use a register to store the return address. But it would also be easy to do (pseudo arch assembly): function: mov lr, (%sp) add 8, %sp blr __fentry__ Should be bl __fentry__ for branch and link. sub 8, %sp mov (%sp), lr That way the lr would have the current function, and the parent would still be at 8(%sp) Actually, if we add a new profiler and can make our own specification, I would say that the add and sub lines be the responsibility of __fentry__. Then we would have: function: mov lr, (%sp) bl __fentry__ mov (%sp), lr If sp points to the current content, then replace (%sp) above with -8(%sp). Then the implementation of a nop __fentry__ would simply be: __fentry__: blr For anything more elaborate, __fentry__ would be responsible for all adjustments. -- Steve
Re: BUG: GCC-4.4.x changes the function frame on some functions
On Thu, Nov 19, 2009 at 03:05:41PM -0500, Steven Rostedt wrote: On Thu, 2009-11-19 at 20:46 +0100, Frederic Weisbecker wrote: On Thu, Nov 19, 2009 at 02:28:06PM -0500, Steven Rostedt wrote: function: call __fentry__ [...] -- Steve I would really like this. So that we can forget about other possible further suprises due to sophisticated function prologues beeing before the mcount call. And I guess that would fix it in every archs. Well, other archs use a register to store the return address. But it would also be easy to do (pseudo arch assembly): function: mov lr, (%sp) add 8, %sp blr __fentry__ sub 8, %sp mov (%sp), lr That way the lr would have the current function, and the parent would still be at 8(%sp) Yeah right, we need at least such very tiny prologue for archs that store return addresses in a reg. That said, Linus had a good point about the fact there might other uses of mcount even more tricky than what does the function graph tracer, outside the kernel, and those may depend on the strict ABI assumption that 4(ebp) is always the _real_ return address, and that through all the previous stack call. This is even a concern that extrapolates the single mcount case. As I am proposing a new call. This means that mcount stay as is for legacy reasons. Yes I know there exists the -finstrument-functions but that adds way too much bloat to the code. One single call to the profiler is all I want. Sure, the purpose is not to change the existing -mcount thing. What I meant is that we could have -mcount and -real-ra-before-fp at the same time to guarantee fp + 4 is really what we want while using -mcount. The __fentry__ idea is more neat, but the guarantee of a real pointer to the return address is still something that lacks. So I wonder that actually the real problem is the lack of something that could provide this guarantee. We may need a -real-ra-before-fp (yeah I suck in naming). Don't worry, so do the C compiler folks, I mean, come on mcount? I guess it has been first created for the single purpose of counting specific functions but then it has been used for wider, unpredicted uses :)
Re: BUG: GCC-4.4.x changes the function frame on some functions
On Thu, Nov 19, 2009 at 03:17:16PM -0500, Steven Rostedt wrote: On Thu, 2009-11-19 at 15:05 -0500, Steven Rostedt wrote: Well, other archs use a register to store the return address. But it would also be easy to do (pseudo arch assembly): function: mov lr, (%sp) add 8, %sp blr __fentry__ Should be bl __fentry__ for branch and link. sub 8, %sp mov (%sp), lr That way the lr would have the current function, and the parent would still be at 8(%sp) Actually, if we add a new profiler and can make our own specification, I would say that the add and sub lines be the responsibility of __fentry__. Then we would have: function: mov lr, (%sp) bl __fentry__ mov (%sp), lr If sp points to the current content, then replace (%sp) above with -8(%sp). Then the implementation of a nop __fentry__ would simply be: __fentry__: blr Good point! For anything more elaborate, __fentry__ would be responsible for all adjustments. Yep. The more we control it from __fentry__, the less we fall down into unexpected surprises.
Re: BUG: GCC-4.4.x changes the function frame on some functions
On Thu, 19 Nov 2009, Linus Torvalds wrote: On Thu, 19 Nov 2009, Richard Guenther wrote: Note that I only can reproduce the issue with -mincoming-stack-boundary=2, not with -mpreferred-stack-boundary=2. Since you can reproduce it with -mincoming-stack-boundary=2, I woul suggest just fixing mcount handling that way regardless of anything else. The current code generated by gcc is just insane - even for the case where you _want_ 16-byte stack alignment. Instead crazy code like push %edi lea0x8(%esp),%edi and$0xfff0,%esp pushl -0x4(%edi) push %ebp mov%esp,%ebp ... call mcount the sane thing to do would be to just do it as push %ebp mov%esp,%ebp call mcount and$0xfff0,%esp which is what the 64bit compile does except that the mcount call happens a bit later which is fine. 8107cd34 timer_stats_update_stats: 8107cd34: 55 push %rbp 8107cd35: 48 89 e5mov%rsp,%rbp 8107cd38: 48 83 e4 c0 and$0xffc0,%rsp Thanks, tglx
Re: BUG: GCC-4.4.x changes the function frame on some functions
On Thu, 19 Nov 2009, Frederic Weisbecker wrote: That way the lr would have the current function, and the parent would still be at 8(%sp) Yeah right, we need at least such very tiny prologue for archs that store return addresses in a reg. Well, it will be architecture-dependent. For example, alpha can store the return value in _any_ register if I recall correctly, so you can do the call to __fentry__ by just picking another register than the default one as the return address. And powerpc has two special registers: link and ctr, but iirc you can only load 'link' with a branch instruction. Which means that you could do something like mflr 0 bl __fentry__ in the caller (I forget if R0 is actually a call-trashed register or not), and then __fentry__ could do something like mflr 12 # save _new_ link mtlr 0 # restore original link mtctr 12# move __fentry__ link to ctr .. do whatever .. bctr# return to __fentry__ caller to return with 'link' restored (but ctr and r0/r12 trashed - I don't recall the ppc calling conventions any more, but I think that is ok). Saving to stack seems unnecessary and pointless. Linus
Re: BUG: GCC-4.4.x changes the function frame on some functions
On Thu, 2009-11-19 at 12:36 -0800, Linus Torvalds wrote: On Thu, 19 Nov 2009, Frederic Weisbecker wrote: That way the lr would have the current function, and the parent would still be at 8(%sp) Yeah right, we need at least such very tiny prologue for archs that store return addresses in a reg. Well, it will be architecture-dependent. Totally agree, as mcount today is architecture dependent. For example, alpha can store the return value in _any_ register if I recall correctly, so you can do the call to __fentry__ by just picking another register than the default one as the return address. And powerpc has two special registers: link and ctr, but iirc you can only load 'link' with a branch instruction. Which means that you could do something like mflr 0 bl __fentry__ in the caller (I forget if R0 is actually a call-trashed register or not), and then __fentry__ could do something like mflr 12 # save _new_ link mtlr 0 # restore original link mtctr 12# move __fentry__ link to ctr .. do whatever .. bctr# return to __fentry__ caller to return with 'link' restored (but ctr and r0/r12 trashed - I don't recall the ppc calling conventions any more, but I think that is ok). Saving to stack seems unnecessary and pointless. I was just using an example. But as you pointed out, each arch can find its best way to handle it. Having the profiler called at the beginning of the function is what I feel is the best. We also get access to the function's parameters :-) -- Steve
Re: BUG: GCC-4.4.x changes the function frame on some functions
On i386, if we call __fentry__ immediately on entry the return address will be in 4(%esp), so I fail to see how you could not reliably have the return address. Other arches would have different constraints, of course. Frederic Weisbecker fweis...@gmail.com wrote: On Thu, Nov 19, 2009 at 03:05:41PM -0500, Steven Rostedt wrote: On Thu, 2009-11-19 at 20:46 +0100, Frederic Weisbecker wrote: On Thu, Nov 19, 2009 at 02:28:06PM -0500, Steven Rostedt wrote: function: call __fentry__ [...] -- Steve I would really like this. So that we can forget about other possible further suprises due to sophisticated function prologues beeing before the mcount call. And I guess that would fix it in every archs. Well, other archs use a register to store the return address. But it would also be easy to do (pseudo arch assembly): function: mov lr, (%sp) add 8, %sp blr __fentry__ sub 8, %sp mov (%sp), lr That way the lr would have the current function, and the parent would still be at 8(%sp) Yeah right, we need at least such very tiny prologue for archs that store return addresses in a reg. That said, Linus had a good point about the fact there might other uses of mcount even more tricky than what does the function graph tracer, outside the kernel, and those may depend on the strict ABI assumption that 4(ebp) is always the _real_ return address, and that through all the previous stack call. This is even a concern that extrapolates the single mcount case. As I am proposing a new call. This means that mcount stay as is for legacy reasons. Yes I know there exists the -finstrument-functions but that adds way too much bloat to the code. One single call to the profiler is all I want. Sure, the purpose is not to change the existing -mcount thing. What I meant is that we could have -mcount and -real-ra-before-fp at the same time to guarantee fp + 4 is really what we want while using -mcount. The __fentry__ idea is more neat, but the guarantee of a real pointer to the return address is still something that lacks. So I wonder that actually the real problem is the lack of something that could provide this guarantee. We may need a -real-ra-before-fp (yeah I suck in naming). Don't worry, so do the C compiler folks, I mean, come on mcount? I guess it has been first created for the single purpose of counting specific functions but then it has been used for wider, unpredicted uses :) -- Sent from my mobile phone. Please excuse any lack of formatting.
Re: BUG: GCC-4.4.x changes the function frame on some functions
On 11/19/09 12:50, H. Peter Anvin wrote: Calling the profiler immediately at the entry point is clearly the more sane option. It means the ABI is well-defined, stable, and independent of what the actual function contents are. It means that ABI isn't the normal C ABI (the __fentry__ function would have to preserve all registers), but that's fine... Note there are targets (even some old x86 variants) that required the profiling calls to occur after the prologue. Unfortunately, nobody documented *why* that was the case. Sigh. Jeff
Re: BUG: GCC-4.4.x changes the function frame on some functions
On 11/19/09 13:06, Linus Torvalds wrote: On Thu, 19 Nov 2009, H. Peter Anvin wrote: Calling the profiler immediately at the entry point is clearly the more sane option. It means the ABI is well-defined, stable, and independent of what the actual function contents are. It means that ABI isn't the normal C ABI (the __fentry__ function would have to preserve all registers), but that's fine... As far as I know, that's true of _mcount already: it's not a normal ABI and is rather a highly architecture-specific special case to begin with. At least ARM has some (several?) special mcount calling conventions, afaik. Correct. _mcount's ABI typically has been defined by the implementation of the vendor's C library mcount. GCC has options to emit the profiling code prior to or after the prologue controllable through the usual variety of target macros hooks. I can't imagine anyone would object to a clean, tested patch to change how x86-linux's profiling code was implemented. jeff
Re: BUG: GCC-4.4.x changes the function frame on some functions
Hence a new unconstrained option... Jeff Law l...@redhat.com wrote: On 11/19/09 12:50, H. Peter Anvin wrote: Calling the profiler immediately at the entry point is clearly the more sane option. It means the ABI is well-defined, stable, and independent of what the actual function contents are. It means that ABI isn't the normal C ABI (the __fentry__ function would have to preserve all registers), but that's fine... Note there are targets (even some old x86 variants) that required the profiling calls to occur after the prologue. Unfortunately, nobody documented *why* that was the case. Sigh. Jeff -- Sent from my mobile phone. Please excuse any lack of formatting.
Re: BUG: GCC-4.4.x changes the function frame on some functions
On 11/19/09 14:14, H. Peter Anvin wrote: Hence a new unconstrained option... Not arguing against it, just noting there are targets where after the prologue mcount is mandated. There's certainly hooks in GCC to do it both ways and if there's no clear need to use after-prologue on x86-linux, then before-prologue seems reasonable to me. It's also the case that aligning stacks on the x86 and the poor code generated when used with profiling is an interaction I doubt anyone has looked at until now. The result is definitely ugly and inefficient -- and there's something to be said for cleaning that up and at least marginally reducing the overhead of profiling. Having said all that, I don't expect to personally be looking at the problem, given the list of other codegen issues that need to be looked at (reload in particular), profiling/stack interactions would be around 87 millionth on my list. jeff
Re: BUG: GCC-4.4.x changes the function frame on some functions
On Thu, 2009-11-19 at 14:25 -0700, Jeff Law wrote: Having said all that, I don't expect to personally be looking at the problem, given the list of other codegen issues that need to be looked at (reload in particular), profiling/stack interactions would be around 87 millionth on my list. Is there someone else that can look at it? Or at the very least, could you point us to where that code is, and one of us tracing folks could take a crack at switching hats to be a compiler writer (with the obvious prerequisite of drinking a lot of beer first, or is there a better drug to cope with the pain of writing gcc?). -- Steve
Re: BUG: GCC-4.4.x changes the function frame on some functions
On 11/19/09 15:43, Steven Rostedt wrote: On Thu, 2009-11-19 at 14:25 -0700, Jeff Law wrote: Having said all that, I don't expect to personally be looking at the problem, given the list of other codegen issues that need to be looked at (reload in particular), profiling/stack interactions would be around 87 millionth on my list. Is there someone else that can look at it? Unsure at the moment... Like everyone else, GCC developers are busy and this probably isn't going to be a high priority item for anyone. Or at the very least, could you point us to where that code is, and one of us tracing folks could take a crack at switching hats to be a compiler writer (with the obvious prerequisite of drinking a lot of beer first, or is there a better drug to cope with the pain of writing gcc?). It _might_ be as easy as defining PROFILE_BEFORE_PROLOGUE in gcc-someversiongcc/config/i386/linux.h rebuilding GCC. Based on comments elsewhere, the sun386i support may have used PROFILE_BEFORE_PROLOGUE in the past and thus the x86 backend may not need further adjustment. That is obviously the ideal case. If that appears to work for your needs, I'll volunteer to test it more thoroughly and assuming those tests look good shepherd it into the source tree. Jeff
Re: BUG: GCC-4.4.x changes the function frame on some functions
On Thu, 19 Nov 2009, Jeff Law wrote: On 11/19/09 15:43, Steven Rostedt wrote: On Thu, 2009-11-19 at 14:25 -0700, Jeff Law wrote: Having said all that, I don't expect to personally be looking at the problem, given the list of other codegen issues that need to be looked at (reload in particular), profiling/stack interactions would be around 87 millionth on my list. Is there someone else that can look at it? Unsure at the moment... Like everyone else, GCC developers are busy and this probably isn't going to be a high priority item for anyone. Or at the very least, could you point us to where that code is, and one of us tracing folks could take a crack at switching hats to be a compiler writer (with the obvious prerequisite of drinking a lot of beer first, or is there a better drug to cope with the pain of writing gcc?). It _might_ be as easy as defining PROFILE_BEFORE_PROLOGUE in gcc-someversiongcc/config/i386/linux.h rebuilding GCC. Based on comments elsewhere, the sun386i support may have used PROFILE_BEFORE_PROLOGUE in the past and thus the x86 backend may not need further adjustment. That is obviously the ideal case. If that appears to work for your needs, I'll volunteer to test it more thoroughly and assuming those tests look good shepherd it into the source tree. We definitely want to see that ASAP. While testing various kernel configs we found out that the problem comes and goes. Finally I started to compare the gcc command line options and after some fiddling it turned out that the following minimal deltas change the code generator behaviour: Bad: -march=pentium-mmx-Wa,-mtune=generic32 Good: -march=i686-mtune=generic -Wa,-mtune=generic32 Good: -march=pentium-mmx -mtune-generic -Wa,-mtune=generic32 I'm not supposed to understand the logic behind that, right ? Thanks, tglx
Re: BUG: GCC-4.4.x changes the function frame on some functions
On Fri, 20 Nov 2009, Thomas Gleixner wrote: While testing various kernel configs we found out that the problem comes and goes. Finally I started to compare the gcc command line options and after some fiddling it turned out that the following minimal deltas change the code generator behaviour: Bad: -march=pentium-mmx-Wa,-mtune=generic32 Good: -march=i686-mtune=generic -Wa,-mtune=generic32 Good: -march=pentium-mmx -mtune-generic -Wa,-mtune=generic32 I'm not supposed to understand the logic behind that, right ? Are you sure it's just the compiler flags? There's another configuration portion: the size of the alignment itself. That's dependent on L1_CACHE_SHIFT, which in turn is taken from the kernel config CONFIG_X86_L1_CACHE_SHIFT. Maybe that value matters too - for example maybe gcc will not try to align the stack if it's big? [ Btw, looking at that, why are X86_L1_CACHE_BYTES and X86_L1_CACHE_SHIFT totally unrelated numbers? Very confusing. ] The compiler flags we use are tied to some of the same choices that choose the cache shift, so the correlation you found while debugging this would still hold. Linus
Re: BUG: GCC-4.4.x changes the function frame on some functions
On Thu, 19 Nov 2009, Linus Torvalds wrote: On Fri, 20 Nov 2009, Thomas Gleixner wrote: While testing various kernel configs we found out that the problem comes and goes. Finally I started to compare the gcc command line options and after some fiddling it turned out that the following minimal deltas change the code generator behaviour: Bad: -march=pentium-mmx-Wa,-mtune=generic32 Good: -march=i686-mtune=generic -Wa,-mtune=generic32 Good: -march=pentium-mmx -mtune-generic -Wa,-mtune=generic32 I'm not supposed to understand the logic behind that, right ? Are you sure it's just the compiler flags? I first captured the command line with V=1 and created a script of it. Then I changed the -march -mtune options in that script and compiled just that single file manually w/o changing .config or invoking the kernel make magic. The good ones produce: 650: 55 push %ebp 651: 89 e5 mov%esp,%ebp 653: 83 e4 f0and$0xfff0,%esp The bad one: 05f0 timer_stats_update_stats: 5f0: 57 push %edi 5f1: 8d 7c 24 08 lea0x8(%esp),%edi 5f5: 83 e4 f0and$0xfff0,%esp 5f8: ff 77 fcpushl -0x4(%edi) 5fb: 55 push %ebp 5fc: 89 e5 mov%esp,%ebp There's another configuration portion: the size of the alignment itself. That's dependent on L1_CACHE_SHIFT, which in turn is taken from the kernel config CONFIG_X86_L1_CACHE_SHIFT. Maybe that value matters too - for example maybe gcc will not try to align the stack if it's big? That does not change any of the compiler options, but yes it could have some effect via the various include magics, but all I have seen so far is linkage.h which should not affect the compiler. And the manual compile did not change any of this. [ Btw, looking at that, why are X86_L1_CACHE_BYTES and X86_L1_CACHE_SHIFT totally unrelated numbers? Very confusing. ] Agreed. The compiler flags we use are tied to some of the same choices that choose the cache shift, so the correlation you found while debugging this would still hold. Digging further tomorrow when my brain is more awake. Thanks, tglx
Re: BUG: GCC-4.4.x changes the function frame on some functions
On 11/19/2009 04:59 PM, Linus Torvalds wrote: [ Btw, looking at that, why are X86_L1_CACHE_BYTES and X86_L1_CACHE_SHIFT totally unrelated numbers? Very confusing. ] Yes, there is another thread to clean up that particular mess; it is already in -tip: http://git.kernel.org/tip/350f8f5631922c7848ec4b530c111cb8c2ff7caa -hpa