Re: [RFC PATCH 6/9] powerpc/ftrace: Update and move function profile instructions out-of-line
On Thu, Dec 21, 2023 at 10:46:08AM +, Christophe Leroy wrote: > > > Le 08/12/2023 à 17:30, Naveen N Rao a écrit : > > Function profile sequence on powerpc includes two instructions at the > > beginning of each function: > > > > mflrr0 > > bl ftrace_caller > > > > The call to ftrace_caller() gets nop'ed out during kernel boot and is > > patched in when ftrace is enabled. > > > > There are two issues with this: > > 1. The 'mflr r0' instruction at the beginning of each function remains > > even though ftrace is not being used. > > 2. When ftrace is activated, we return from ftrace_caller() with a bctr > > instruction to preserve r0 and LR, resulting in the link stack > > becoming unbalanced. > > > > To address (1), we have tried to nop'out the 'mflr r0' instruction when > > nop'ing out the call to ftrace_caller() and restoring it when enabling > > ftrace. But, that required additional synchronization slowing down > > ftrace activation. It also left an additional nop instruction at the > > beginning of each function and that wasn't desirable on 32-bit powerpc. > > > > Instead of that, move the function profile sequence out-of-line leaving > > a single nop at function entry. On ftrace activation, the nop is changed > > to an unconditional branch to the out-of-line sequence that in turn > > calls ftrace_caller(). This removes the need for complex synchronization > > during ftrace activation and simplifies the code. More importantly, this > > improves performance of the kernel when ftrace is not in use. > > > > To address (2), change the ftrace trampoline to return with a 'blr' > > instruction with the original return address in r0 intact. Then, an > > additional 'mtlr r0' instruction in the function profile sequence can > > move the correct return address back to LR. > > > > With the above two changes, the function profile sequence now looks like > > the following: > > > > [func:# GEP -- 64-bit powerpc, optional > > addis r2,r12,imm1 > > addir2,r2,imm2] > >tramp: > > mflrr0 > > bl ftrace_caller > > mtlrr0 > > b func > > nop > > [nop] # 64-bit powerpc only > >func:# LEP > > nop > > > > On 32-bit powerpc, the ftrace mcount trampoline is now completely > > outside the function. This is also the case on 64-bit powerpc for > > functions that do not need a GEP. However, for functions that need a > > GEP, the additional instructions are inserted between the GEP and the > > LEP. Since we can only have a fixed number of instructions between GEP > > and LEP, we choose to emit 6 instructions. Four of those instructions > > are used for the function profile sequence and two instruction slots are > > reserved for implementing support for DYNAMIC_FTRACE_WITH_CALL_OPS. On > > 32-bit powerpc, we emit one additional nop for this purpose resulting in > > a total of 5 nops before function entry. > > > > To enable ftrace, the nop at function entry is changed to an > > unconditional branch to 'tramp'. The call to ftrace_caller() may be > > updated to ftrace_regs_caller() depending on the registered ftrace ops. > > On 64-bit powerpc, we additionally change the instruction at 'tramp' to > > 'mflr r0' from an unconditional branch back to func+4. This is so that > > functions entered through the GEP can skip the function profile sequence > > unless ftrace is enabled. > > > > With the context_switch microbenchmark on a P9 machine, there is a > > performance improvement of ~6% with this patch applied, going from 650k > > context switches to 690k context switches without ftrace enabled. With > > ftrace enabled, the performance was similar at 86k context switches. > > Wondering how significant that context_switch micorbenchmark is. > > I ran it on both mpc885 and mpc8321 and I'm a bit puzzled by some of the > results: > # ./context_switch --no-fp > Using threads with yield on cpus 0/0 touching FP:no altivec:no vector:no > vdso:no > > On 885, I get the following results before and after your patch. > > CONFIG_FTRACE not selected : 44,9k > CONFIG_FTRACE selected, before : 32,8k > CONFIG_FTRACE selected, after : 33,6k > > All this is with CONFIG_INIT_STACK_ALL_ZERO which is the default. But > when I select CONFIG_INIT_STACK_NONE, the CONFIG_FTRACE not selected > result is only 34,4. > > On 8321: > > CONFIG_FTRACE not selected : 100,3k > CONFIG_FTRACE selected, before : 72,5k > CONFIG_FTRACE selected, after : 116k > > So the results look odd to me. That's indeed odd, though it looks to be showing good improvement. Are those numbers with/without the function tracer enabled? Do you see more reasonable numbers if you use a different FUNCTION_ALIGNMENT? > > > > > The downside of this approach is the increase in vmlinux size, > > especially on 32-bit powerpc. We now emit 3 additional instructions for > > each function (excluding the one or two instructions for supporting > > DYNAMIC_FTRACE_
Re: [RFC PATCH 6/9] powerpc/ftrace: Update and move function profile instructions out-of-line
Le 21/12/2023 à 15:25, Steven Rostedt a écrit : > On Thu, 21 Dec 2023 10:46:08 + > Christophe Leroy wrote: > >>> To enable ftrace, the nop at function entry is changed to an >>> unconditional branch to 'tramp'. The call to ftrace_caller() may be >>> updated to ftrace_regs_caller() depending on the registered ftrace ops. >>> On 64-bit powerpc, we additionally change the instruction at 'tramp' to >>> 'mflr r0' from an unconditional branch back to func+4. This is so that >>> functions entered through the GEP can skip the function profile sequence >>> unless ftrace is enabled. >>> >>> With the context_switch microbenchmark on a P9 machine, there is a >>> performance improvement of ~6% with this patch applied, going from 650k >>> context switches to 690k context switches without ftrace enabled. With >>> ftrace enabled, the performance was similar at 86k context switches. >> >> Wondering how significant that context_switch micorbenchmark is. >> >> I ran it on both mpc885 and mpc8321 and I'm a bit puzzled by some of the >> results: >> # ./context_switch --no-fp >> Using threads with yield on cpus 0/0 touching FP:no altivec:no vector:no >> vdso:no >> >> On 885, I get the following results before and after your patch. >> >> CONFIG_FTRACE not selected : 44,9k >> CONFIG_FTRACE selected, before : 32,8k >> CONFIG_FTRACE selected, after : 33,6k >> >> All this is with CONFIG_INIT_STACK_ALL_ZERO which is the default. But >> when I select CONFIG_INIT_STACK_NONE, the CONFIG_FTRACE not selected >> result is only 34,4. >> >> On 8321: >> >> CONFIG_FTRACE not selected : 100,3k >> CONFIG_FTRACE selected, before : 72,5k >> CONFIG_FTRACE selected, after : 116k >> >> So the results look odd to me. > > > BTW, CONFIG_FTRACE just enables the tracing system (I would like to change > that to CONFIG_TRACING, but not sure if I can without breaking .configs all > over the place). > > The nops for ftrace is enabled with CONFIG_FUNCTION_TRACER. Yes I selected both CONFIG_FTRACE and CONFIG_FUNCTION_TRACER Christophe
Re: [RFC PATCH 6/9] powerpc/ftrace: Update and move function profile instructions out-of-line
On Thu, 21 Dec 2023 10:46:08 + Christophe Leroy wrote: > > To enable ftrace, the nop at function entry is changed to an > > unconditional branch to 'tramp'. The call to ftrace_caller() may be > > updated to ftrace_regs_caller() depending on the registered ftrace ops. > > On 64-bit powerpc, we additionally change the instruction at 'tramp' to > > 'mflr r0' from an unconditional branch back to func+4. This is so that > > functions entered through the GEP can skip the function profile sequence > > unless ftrace is enabled. > > > > With the context_switch microbenchmark on a P9 machine, there is a > > performance improvement of ~6% with this patch applied, going from 650k > > context switches to 690k context switches without ftrace enabled. With > > ftrace enabled, the performance was similar at 86k context switches. > > Wondering how significant that context_switch micorbenchmark is. > > I ran it on both mpc885 and mpc8321 and I'm a bit puzzled by some of the > results: > # ./context_switch --no-fp > Using threads with yield on cpus 0/0 touching FP:no altivec:no vector:no > vdso:no > > On 885, I get the following results before and after your patch. > > CONFIG_FTRACE not selected : 44,9k > CONFIG_FTRACE selected, before : 32,8k > CONFIG_FTRACE selected, after : 33,6k > > All this is with CONFIG_INIT_STACK_ALL_ZERO which is the default. But > when I select CONFIG_INIT_STACK_NONE, the CONFIG_FTRACE not selected > result is only 34,4. > > On 8321: > > CONFIG_FTRACE not selected : 100,3k > CONFIG_FTRACE selected, before : 72,5k > CONFIG_FTRACE selected, after : 116k > > So the results look odd to me. BTW, CONFIG_FTRACE just enables the tracing system (I would like to change that to CONFIG_TRACING, but not sure if I can without breaking .configs all over the place). The nops for ftrace is enabled with CONFIG_FUNCTION_TRACER. -- Steve
Re: [RFC PATCH 6/9] powerpc/ftrace: Update and move function profile instructions out-of-line
Le 08/12/2023 à 17:30, Naveen N Rao a écrit : > Function profile sequence on powerpc includes two instructions at the > beginning of each function: > > mflrr0 > bl ftrace_caller > > The call to ftrace_caller() gets nop'ed out during kernel boot and is > patched in when ftrace is enabled. > > There are two issues with this: > 1. The 'mflr r0' instruction at the beginning of each function remains > even though ftrace is not being used. > 2. When ftrace is activated, we return from ftrace_caller() with a bctr > instruction to preserve r0 and LR, resulting in the link stack > becoming unbalanced. > > To address (1), we have tried to nop'out the 'mflr r0' instruction when > nop'ing out the call to ftrace_caller() and restoring it when enabling > ftrace. But, that required additional synchronization slowing down > ftrace activation. It also left an additional nop instruction at the > beginning of each function and that wasn't desirable on 32-bit powerpc. > > Instead of that, move the function profile sequence out-of-line leaving > a single nop at function entry. On ftrace activation, the nop is changed > to an unconditional branch to the out-of-line sequence that in turn > calls ftrace_caller(). This removes the need for complex synchronization > during ftrace activation and simplifies the code. More importantly, this > improves performance of the kernel when ftrace is not in use. > > To address (2), change the ftrace trampoline to return with a 'blr' > instruction with the original return address in r0 intact. Then, an > additional 'mtlr r0' instruction in the function profile sequence can > move the correct return address back to LR. > > With the above two changes, the function profile sequence now looks like > the following: > > [func: # GEP -- 64-bit powerpc, optional > addis r2,r12,imm1 > addir2,r2,imm2] >tramp: > mflrr0 > bl ftrace_caller > mtlrr0 > b func > nop > [nop] # 64-bit powerpc only >func: # LEP > nop > > On 32-bit powerpc, the ftrace mcount trampoline is now completely > outside the function. This is also the case on 64-bit powerpc for > functions that do not need a GEP. However, for functions that need a > GEP, the additional instructions are inserted between the GEP and the > LEP. Since we can only have a fixed number of instructions between GEP > and LEP, we choose to emit 6 instructions. Four of those instructions > are used for the function profile sequence and two instruction slots are > reserved for implementing support for DYNAMIC_FTRACE_WITH_CALL_OPS. On > 32-bit powerpc, we emit one additional nop for this purpose resulting in > a total of 5 nops before function entry. > > To enable ftrace, the nop at function entry is changed to an > unconditional branch to 'tramp'. The call to ftrace_caller() may be > updated to ftrace_regs_caller() depending on the registered ftrace ops. > On 64-bit powerpc, we additionally change the instruction at 'tramp' to > 'mflr r0' from an unconditional branch back to func+4. This is so that > functions entered through the GEP can skip the function profile sequence > unless ftrace is enabled. > > With the context_switch microbenchmark on a P9 machine, there is a > performance improvement of ~6% with this patch applied, going from 650k > context switches to 690k context switches without ftrace enabled. With > ftrace enabled, the performance was similar at 86k context switches. Wondering how significant that context_switch micorbenchmark is. I ran it on both mpc885 and mpc8321 and I'm a bit puzzled by some of the results: # ./context_switch --no-fp Using threads with yield on cpus 0/0 touching FP:no altivec:no vector:no vdso:no On 885, I get the following results before and after your patch. CONFIG_FTRACE not selected : 44,9k CONFIG_FTRACE selected, before : 32,8k CONFIG_FTRACE selected, after : 33,6k All this is with CONFIG_INIT_STACK_ALL_ZERO which is the default. But when I select CONFIG_INIT_STACK_NONE, the CONFIG_FTRACE not selected result is only 34,4. On 8321: CONFIG_FTRACE not selected : 100,3k CONFIG_FTRACE selected, before : 72,5k CONFIG_FTRACE selected, after : 116k So the results look odd to me. > > The downside of this approach is the increase in vmlinux size, > especially on 32-bit powerpc. We now emit 3 additional instructions for > each function (excluding the one or two instructions for supporting > DYNAMIC_FTRACE_WITH_CALL_OPS). On 64-bit powerpc with the current > implementation of -fpatchable-function-entry though, this is not > avoidable since we are forced to emit 6 instructions between the GEP and > the LEP even if we are to only support DYNAMIC_FTRACE_WITH_CALL_OPS. The increase is almost 5% on the few 32 bits defconfig I have tested. That's a lot. On 32 bits powerpc, only the e500 has a link stack that could end up being unbalanced. Could we keep the bctr and