https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80479

--- Comment #14 from jreiser at bitwagon dot com ---
Here's how to retain the increased speed (and save around 300 bytes per call)
while also keeping valgrind happy.

Make a closed subroutine __gcc_strcmp_ppc64le whose calling sequence is:
    la r3,arg1  // address of first string
    la r4,arg2  // address of second string
    ldbrx r5,0,r3  // first 8 bytes of arg1, big endian
    ldbrx r6,0,r4  // first 8 bytes of arg2, big endian
    bl __gcc_strcmp_ppc64le
Put this subroutine in archive library libgcc_s.a only, and not in shared
library libgcc_s.so.  Then the linkage for 'bl' is direct, avoiding the PLT
(Procedure Linkage Table), and the time for 'bl' is hidden by cache latency for
'ldbrx'.  The return 'blr' often is free, but may cost 1 cycle if it
immediately follows a conditional branch that tests for termination.
Valgrind (memcheck) can be happy because it can intercept and re-direct the
entire routine by name, thus avoiding having to analyze 'cmpb'.
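
For readers wondering why the calling sequence loads with 'ldbrx' rather than a
plain 'ld': on a little-endian host, a byte-reversed load puts the byte at the
lowest address into the most significant position, so one unsigned 64-bit
compare orders two 8-byte blocks the same way memcmp would.  Below is a minimal
portable C sketch of that idea.  The names load_be64 and compare8 are
hypothetical (not from GCC or glibc), and the sketch deliberately leaves out
the NUL-terminator detection that a real strcmp still needs.

```c
#include <stdint.h>
#include <string.h>

/* Portable stand-in for what 'ldbrx' yields on a little-endian host:
   the byte at the lowest address ends up most significant. */
static uint64_t load_be64(const unsigned char *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}

/* Compare two 8-byte blocks; returns <0, 0, or >0 with the same sign
   as memcmp(a, b, 8).  Both pointers must reference at least 8
   readable bytes. */
static int compare8(const unsigned char *a, const unsigned char *b)
{
    uint64_t x = load_be64(a);
    uint64_t y = load_be64(b);
    return (x > y) - (x < y);
}
```

Because the first differing byte dominates the integer comparison, the sign of
the result agrees with a byte-by-byte comparison of the same 8 bytes.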

The simplest implementation of __gcc_strcmp_ppc64le is just "b strcmp", because
arg1 and arg2 have not been incremented before the call.  An implementation
that does consume the preloaded r5 and r6 needs the two "addi r,r,8"; these
probably can fit into unused superscalar ALU slots early in the subroutine, or
the code can just remember that the addresses always are 8 behind.
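
As a rough C model of the "b strcmp" variant: under the 64-bit PowerPC ELF
calling convention the first four integer arguments arrive in r3..r6, so a C
function with the signature below would see the same register contents as the
calling sequence above.  The name gcc_strcmp_model is hypothetical and stands
in for the proposed __gcc_strcmp_ppc64le; this is a sketch of the interface
only, not a tuned implementation.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical C model of the proposed entry point.
   s1, s2: string addresses (r3, r4), not yet advanced by the caller.
   w1, w2: first 8 bytes of each string as preloaded by the caller
           (r5, r6); this simplest variant ignores them. */
int gcc_strcmp_model(const char *s1, const char *s2,
                     uint64_t w1, uint64_t w2)
{
    (void)w1;
    (void)w2;
    /* Equivalent of "b strcmp": hand both untouched pointers on. */
    return strcmp(s1, s2);
}
```

A more specialized variant would start from w1 and w2 instead of reloading the
first 8 bytes, which is where the address adjustment by 8 comes in.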

There can be multiple named entry points, each specialized differently, such as
for known alignment of operands, etc.

Notes: ldbrx and lwbrx are functional for non-aligned addresses.  The UPX
(de-)compressor for executables uses those opcodes, and they work correctly on
64-bit PPC970FX (PowerMac8,2) and 32-bit 7447A (PowerMac10,1), both running
Debian 8 (jessie).  The hardware documentation warns that the implementation
may be significantly slower than a regular load.  Trapping to operating system
emulation (always, or only for unaligned address, or ...) is an option.  From
the viewpoint of chip design, there can be a great temptation to implement only
the 32-bit lwbrx in hardware and always trap for the 64-bit ldbrx.  The 32-bit
lwbrx has
notable use cases for the network functions htonl and ntohl, while the 64-bit
ldbrx has lacked such high-profile clients.
