Nicholas Piggin <npig...@gmail.com> writes:
> On Thu, 17 May 2018 12:04:13 +0200 (CEST)
> Christophe Leroy <christophe.le...@c-s.fr> wrote:
>
>> commit 87a156fb18fe1 ("Align hot loops of some string functions")
>> degraded the performance of string functions by adding useless
>> nops
>>
>> A simple benchmark on an 8xx calling 100000x a memchr() that
>> matches the first byte runs in 41668 TB ticks before this patch
>> and in 35986 TB ticks after this patch. So this gives an
>> improvement of approx 10%
>>
>> Another benchmark doing the same with a memchr() matching the 128th
>> byte runs in 1011365 TB ticks before this patch and 1005682 TB ticks
>> after this patch, so regardless on the number of loops, removing
>> those useless nops improves the test by 5683 TB ticks.
>>
>> Fixes: 87a156fb18fe1 ("Align hot loops of some string functions")
>> Signed-off-by: Christophe Leroy <christophe.le...@c-s.fr>
>> ---
>>  Was sent already as part of a serie optimising string functions.
>>  Resending on itself as it is independent of the other changes in the
>>  serie
>>
>>  arch/powerpc/lib/string.S | 6 ++++++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/arch/powerpc/lib/string.S b/arch/powerpc/lib/string.S
>> index a787776822d8..a026d8fa8a99 100644
>> --- a/arch/powerpc/lib/string.S
>> +++ b/arch/powerpc/lib/string.S
>> @@ -23,7 +23,9 @@ _GLOBAL(strncpy)
>>  	mtctr	r5
>>  	addi	r6,r3,-1
>>  	addi	r4,r4,-1
>> +#ifdef CONFIG_PPC64
>>  	.balign 16
>> +#endif
>>  1:	lbzu	r0,1(r4)
>>  	cmpwi	0,r0,0
>>  	stbu	r0,1(r6)
>
> The ifdefs are a bit ugly, but you can't argue with the numbers. These
> alignments should be IFETCH_ALIGN_BYTES, which is intended to optimise
> the ifetch performance when you have such a loop (although there is
> always a tradeoff for a single iteration).
>
> Would it make sense to define that for 32-bit as well, and you could use
> it here instead of the ifdefs? Small CPUs could just use 0.

Can we do it with a macro in the header, eg. like:

#ifdef CONFIG_PPC64
#define IFETCH_BALIGN	.balign IFETCH_ALIGN_BYTES
#endif

...

	addi	r4,r4,-1
	IFETCH_BALIGN
1:	lbzu	r0,1(r4)


cheers
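
For the macro approach to assemble on 32-bit as well, IFETCH_BALIGN would presumably also need an empty definition when CONFIG_PPC64 is not set; the snippet above leaves that part out. A minimal sketch of the idea follows, written as plain C so the preprocessor result can be inspected. The value 16 for IFETCH_ALIGN_BYTES is a stand-in taken from the literal in the quoted patch, and STR()/ifetch_balign_text() are hypothetical helpers that exist only for this illustration, not anything in the kernel.

```c
/*
 * Sketch only. IFETCH_ALIGN_BYTES gets a stand-in value here (assumption);
 * in the kernel it would come from an existing header. CONFIG_PPC64 is left
 * undefined, so this models a 32-bit configuration.
 */
#define IFETCH_ALIGN_BYTES 16

#ifdef CONFIG_PPC64
#define IFETCH_BALIGN	.balign IFETCH_ALIGN_BYTES
#else
#define IFETCH_BALIGN	/* expands to nothing on 32-bit */
#endif

/* Two-level stringification so the macro is expanded before it is quoted. */
#define STR_(...)	#__VA_ARGS__
#define STR(...)	STR_(__VA_ARGS__)

/* Hypothetical helper: the text the assembler would see for IFETCH_BALIGN. */
static const char *ifetch_balign_text(void)
{
	return STR(IFETCH_BALIGN);
}
```

Built as is, the helper returns an empty string, so no alignment directive (and no padding nops) would be emitted. Built with -DCONFIG_PPC64 it returns ".balign 16", which matches what the patch keeps for 64-bit.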