Re: [RFC PATCH 6/9] powerpc/ftrace: Update and move function profile instructions out-of-line

2023-12-22 Thread Naveen N Rao
On Thu, Dec 21, 2023 at 10:46:08AM +, Christophe Leroy wrote:
> 
> 
> Le 08/12/2023 à 17:30, Naveen N Rao a écrit :
> > Function profile sequence on powerpc includes two instructions at the
> > beginning of each function:
> > 
> > mflrr0
> > bl  ftrace_caller
> > 
> > The call to ftrace_caller() gets nop'ed out during kernel boot and is
> > patched in when ftrace is enabled.
> > 
> > There are two issues with this:
> > 1. The 'mflr r0' instruction at the beginning of each function remains
> > even though ftrace is not being used.
> > 2. When ftrace is activated, we return from ftrace_caller() with a bctr
> > instruction to preserve r0 and LR, resulting in the link stack
> > becoming unbalanced.
> > 
> > To address (1), we have tried to nop'out the 'mflr r0' instruction when
> > nop'ing out the call to ftrace_caller() and restoring it when enabling
> > ftrace. But, that required additional synchronization slowing down
> > ftrace activation. It also left an additional nop instruction at the
> > beginning of each function and that wasn't desirable on 32-bit powerpc.
> > 
> > Instead of that, move the function profile sequence out-of-line leaving
> > a single nop at function entry. On ftrace activation, the nop is changed
> > to an unconditional branch to the out-of-line sequence that in turn
> > calls ftrace_caller(). This removes the need for complex synchronization
> > during ftrace activation and simplifies the code. More importantly, this
> > improves performance of the kernel when ftrace is not in use.
> > 
> > To address (2), change the ftrace trampoline to return with a 'blr'
> > instruction with the original return address in r0 intact. Then, an
> > additional 'mtlr r0' instruction in the function profile sequence can
> > move the correct return address back to LR.
> > 
> > With the above two changes, the function profile sequence now looks like
> > the following:
> > 
> >   [func:# GEP -- 64-bit powerpc, optional
> > addis   r2,r12,imm1
> > addir2,r2,imm2]
> >tramp:
> > mflrr0
> > bl  ftrace_caller
> > mtlrr0
> > b   func
> > nop
> > [nop]   # 64-bit powerpc only
> >func:# LEP
> > nop
> > 
> > On 32-bit powerpc, the ftrace mcount trampoline is now completely
> > outside the function. This is also the case on 64-bit powerpc for
> > functions that do not need a GEP. However, for functions that need a
> > GEP, the additional instructions are inserted between the GEP and the
> > LEP. Since we can only have a fixed number of instructions between GEP
> > and LEP, we choose to emit 6 instructions. Four of those instructions
> > are used for the function profile sequence and two instruction slots are
> > reserved for implementing support for DYNAMIC_FTRACE_WITH_CALL_OPS. On
> > 32-bit powerpc, we emit one additional nop for this purpose resulting in
> > a total of 5 nops before function entry.
> > 
> > To enable ftrace, the nop at function entry is changed to an
> > unconditional branch to 'tramp'. The call to ftrace_caller() may be
> > updated to ftrace_regs_caller() depending on the registered ftrace ops.
> > On 64-bit powerpc, we additionally change the instruction at 'tramp' to
> > 'mflr r0' from an unconditional branch back to func+4. This is so that
> > functions entered through the GEP can skip the function profile sequence
> > unless ftrace is enabled.
> > 
> > With the context_switch microbenchmark on a P9 machine, there is a
> > performance improvement of ~6% with this patch applied, going from 650k
> > context switches to 690k context switches without ftrace enabled. With
> > ftrace enabled, the performance was similar at 86k context switches.
> 
> Wondering how significant that context_switch micorbenchmark is.
> 
> I ran it on both mpc885 and mpc8321 and I'm a bit puzzled by some of the 
> results:
> # ./context_switch --no-fp
> Using threads with yield on cpus 0/0 touching FP:no altivec:no vector:no 
> vdso:no
> 
> On 885, I get the following results before and after your patch.
> 
> CONFIG_FTRACE not selected : 44,9k
> CONFIG_FTRACE selected, before : 32,8k
> CONFIG_FTRACE selected, after : 33,6k
> 
> All this is with CONFIG_INIT_STACK_ALL_ZERO which is the default. But 
> when I select CONFIG_INIT_STACK_NONE, the CONFIG_FTRACE not selected 
> result is only 34,4.
> 
> On 8321:
> 
> CONFIG_FTRACE not selected : 100,3k
> CONFIG_FTRACE selected, before : 72,5k
> CONFIG_FTRACE selected, after : 116k
> 
> So the results look odd to me.

That's indeed odd, though it looks to be showing good improvement. Are 
those numbers with/without the function tracer enabled?

Do you see more reasonable numbers if you use a different 
FUNCTION_ALIGNMENT?

> 
> > 
> > The downside of this approach is the increase in vmlinux size,
> > especially on 32-bit powerpc. We now emit 3 additional instructions for
> > each function (excluding the one or two instructions for supporting
> > DYNAMIC_FTRACE_

Re: [RFC PATCH 6/9] powerpc/ftrace: Update and move function profile instructions out-of-line

2023-12-21 Thread Christophe Leroy


Le 21/12/2023 à 15:25, Steven Rostedt a écrit :
> On Thu, 21 Dec 2023 10:46:08 +
> Christophe Leroy  wrote:
> 
>>> To enable ftrace, the nop at function entry is changed to an
>>> unconditional branch to 'tramp'. The call to ftrace_caller() may be
>>> updated to ftrace_regs_caller() depending on the registered ftrace ops.
>>> On 64-bit powerpc, we additionally change the instruction at 'tramp' to
>>> 'mflr r0' from an unconditional branch back to func+4. This is so that
>>> functions entered through the GEP can skip the function profile sequence
>>> unless ftrace is enabled.
>>>
>>> With the context_switch microbenchmark on a P9 machine, there is a
>>> performance improvement of ~6% with this patch applied, going from 650k
>>> context switches to 690k context switches without ftrace enabled. With
>>> ftrace enabled, the performance was similar at 86k context switches.
>>
>> Wondering how significant that context_switch micorbenchmark is.
>>
>> I ran it on both mpc885 and mpc8321 and I'm a bit puzzled by some of the
>> results:
>> # ./context_switch --no-fp
>> Using threads with yield on cpus 0/0 touching FP:no altivec:no vector:no
>> vdso:no
>>
>> On 885, I get the following results before and after your patch.
>>
>> CONFIG_FTRACE not selected : 44,9k
>> CONFIG_FTRACE selected, before : 32,8k
>> CONFIG_FTRACE selected, after : 33,6k
>>
>> All this is with CONFIG_INIT_STACK_ALL_ZERO which is the default. But
>> when I select CONFIG_INIT_STACK_NONE, the CONFIG_FTRACE not selected
>> result is only 34,4.
>>
>> On 8321:
>>
>> CONFIG_FTRACE not selected : 100,3k
>> CONFIG_FTRACE selected, before : 72,5k
>> CONFIG_FTRACE selected, after : 116k
>>
>> So the results look odd to me.
> 
> 
> BTW, CONFIG_FTRACE just enables the tracing system (I would like to change
> that to CONFIG_TRACING, but not sure if I can without breaking .configs all
> over the place).
> 
> The nops for ftrace is enabled with CONFIG_FUNCTION_TRACER.

Yes I selected both CONFIG_FTRACE and CONFIG_FUNCTION_TRACER

Christophe


Re: [RFC PATCH 6/9] powerpc/ftrace: Update and move function profile instructions out-of-line

2023-12-21 Thread Steven Rostedt
On Thu, 21 Dec 2023 10:46:08 +
Christophe Leroy  wrote:

> > To enable ftrace, the nop at function entry is changed to an
> > unconditional branch to 'tramp'. The call to ftrace_caller() may be
> > updated to ftrace_regs_caller() depending on the registered ftrace ops.
> > On 64-bit powerpc, we additionally change the instruction at 'tramp' to
> > 'mflr r0' from an unconditional branch back to func+4. This is so that
> > functions entered through the GEP can skip the function profile sequence
> > unless ftrace is enabled.
> > 
> > With the context_switch microbenchmark on a P9 machine, there is a
> > performance improvement of ~6% with this patch applied, going from 650k
> > context switches to 690k context switches without ftrace enabled. With
> > ftrace enabled, the performance was similar at 86k context switches.  
> 
> Wondering how significant that context_switch micorbenchmark is.
> 
> I ran it on both mpc885 and mpc8321 and I'm a bit puzzled by some of the 
> results:
> # ./context_switch --no-fp
> Using threads with yield on cpus 0/0 touching FP:no altivec:no vector:no 
> vdso:no
> 
> On 885, I get the following results before and after your patch.
> 
> CONFIG_FTRACE not selected : 44,9k
> CONFIG_FTRACE selected, before : 32,8k
> CONFIG_FTRACE selected, after : 33,6k
> 
> All this is with CONFIG_INIT_STACK_ALL_ZERO which is the default. But 
> when I select CONFIG_INIT_STACK_NONE, the CONFIG_FTRACE not selected 
> result is only 34,4.
> 
> On 8321:
> 
> CONFIG_FTRACE not selected : 100,3k
> CONFIG_FTRACE selected, before : 72,5k
> CONFIG_FTRACE selected, after : 116k
> 
> So the results look odd to me.


BTW, CONFIG_FTRACE just enables the tracing system (I would like to change
that to CONFIG_TRACING, but not sure if I can without breaking .configs all
over the place).

The nops for ftrace is enabled with CONFIG_FUNCTION_TRACER.

-- Steve


Re: [RFC PATCH 6/9] powerpc/ftrace: Update and move function profile instructions out-of-line

2023-12-21 Thread Christophe Leroy


Le 08/12/2023 à 17:30, Naveen N Rao a écrit :
> Function profile sequence on powerpc includes two instructions at the
> beginning of each function:
> 
>   mflrr0
>   bl  ftrace_caller
> 
> The call to ftrace_caller() gets nop'ed out during kernel boot and is
> patched in when ftrace is enabled.
> 
> There are two issues with this:
> 1. The 'mflr r0' instruction at the beginning of each function remains
> even though ftrace is not being used.
> 2. When ftrace is activated, we return from ftrace_caller() with a bctr
> instruction to preserve r0 and LR, resulting in the link stack
> becoming unbalanced.
> 
> To address (1), we have tried to nop'out the 'mflr r0' instruction when
> nop'ing out the call to ftrace_caller() and restoring it when enabling
> ftrace. But, that required additional synchronization slowing down
> ftrace activation. It also left an additional nop instruction at the
> beginning of each function and that wasn't desirable on 32-bit powerpc.
> 
> Instead of that, move the function profile sequence out-of-line leaving
> a single nop at function entry. On ftrace activation, the nop is changed
> to an unconditional branch to the out-of-line sequence that in turn
> calls ftrace_caller(). This removes the need for complex synchronization
> during ftrace activation and simplifies the code. More importantly, this
> improves performance of the kernel when ftrace is not in use.
> 
> To address (2), change the ftrace trampoline to return with a 'blr'
> instruction with the original return address in r0 intact. Then, an
> additional 'mtlr r0' instruction in the function profile sequence can
> move the correct return address back to LR.
> 
> With the above two changes, the function profile sequence now looks like
> the following:
> 
>   [func:  # GEP -- 64-bit powerpc, optional
>   addis   r2,r12,imm1
>   addir2,r2,imm2]
>tramp:
>   mflrr0
>   bl  ftrace_caller
>   mtlrr0
>   b   func
>   nop
>   [nop]   # 64-bit powerpc only
>func:  # LEP
>   nop
> 
> On 32-bit powerpc, the ftrace mcount trampoline is now completely
> outside the function. This is also the case on 64-bit powerpc for
> functions that do not need a GEP. However, for functions that need a
> GEP, the additional instructions are inserted between the GEP and the
> LEP. Since we can only have a fixed number of instructions between GEP
> and LEP, we choose to emit 6 instructions. Four of those instructions
> are used for the function profile sequence and two instruction slots are
> reserved for implementing support for DYNAMIC_FTRACE_WITH_CALL_OPS. On
> 32-bit powerpc, we emit one additional nop for this purpose resulting in
> a total of 5 nops before function entry.
> 
> To enable ftrace, the nop at function entry is changed to an
> unconditional branch to 'tramp'. The call to ftrace_caller() may be
> updated to ftrace_regs_caller() depending on the registered ftrace ops.
> On 64-bit powerpc, we additionally change the instruction at 'tramp' to
> 'mflr r0' from an unconditional branch back to func+4. This is so that
> functions entered through the GEP can skip the function profile sequence
> unless ftrace is enabled.
> 
> With the context_switch microbenchmark on a P9 machine, there is a
> performance improvement of ~6% with this patch applied, going from 650k
> context switches to 690k context switches without ftrace enabled. With
> ftrace enabled, the performance was similar at 86k context switches.

Wondering how significant that context_switch micorbenchmark is.

I ran it on both mpc885 and mpc8321 and I'm a bit puzzled by some of the 
results:
# ./context_switch --no-fp
Using threads with yield on cpus 0/0 touching FP:no altivec:no vector:no 
vdso:no

On 885, I get the following results before and after your patch.

CONFIG_FTRACE not selected : 44,9k
CONFIG_FTRACE selected, before : 32,8k
CONFIG_FTRACE selected, after : 33,6k

All this is with CONFIG_INIT_STACK_ALL_ZERO which is the default. But 
when I select CONFIG_INIT_STACK_NONE, the CONFIG_FTRACE not selected 
result is only 34,4.

On 8321:

CONFIG_FTRACE not selected : 100,3k
CONFIG_FTRACE selected, before : 72,5k
CONFIG_FTRACE selected, after : 116k

So the results look odd to me.

> 
> The downside of this approach is the increase in vmlinux size,
> especially on 32-bit powerpc. We now emit 3 additional instructions for
> each function (excluding the one or two instructions for supporting
> DYNAMIC_FTRACE_WITH_CALL_OPS). On 64-bit powerpc with the current
> implementation of -fpatchable-function-entry though, this is not
> avoidable since we are forced to emit 6 instructions between the GEP and
> the LEP even if we are to only support DYNAMIC_FTRACE_WITH_CALL_OPS.

The increase is almost 5% on the few 32 bits defconfig I have tested. 
That's a lot.

On 32 bits powerpc, only the e500 has a link stack that could end up 
being unbalanced. Could we keep the bctr and