On Sat, 27 Jun 2020 01:27:14 +0200 Christian Weisgerber <na...@mips.inka.de> wrote:
> That function simply copies as many (double)words plus a tail of
> bytes as the length argument specifies.  Neither source nor destination
> are checked for alignment, so this will happily run a loop of
> unaligned accesses, which doesn't sound very optimal.

I made a benchmark and concluded that unaligned word copies are slower
than aligned word copies, but faster than byte copies.  In most cases,
memmove.S is faster than memmove.c, but if aligned word copies between
unaligned buffers are possible, then memmove.c is faster.

The benchmark was on a 32-bit macppc G3 with

  cpu0 at mainbus0: 750 (Revision 0x202): 400 MHz: 512KB backside cache

The benchmark has 4 implementations of memmove,

  stbu => byte copy with lbzu,stbu loop
  stbx => byte copy with lbzx,stbx,addi loop
  C    => aligned word copy or byte copy (libc/string/memmove.c)
  asm  => unaligned word copy (libc/arch/powerpc/string/memmove.S)

It shows time measured by mftb (move from timebase).

1st bench: move 10000 bytes up by 4 bytes, then down by 4 bytes, in
aligned buffer (offset 0).  asm wins:

  $ ./bench 10000 4 0
  stbu    stbx    C       asm
  2639    2814    792     633
  2502    2814    784     628
  2501    2814    783     627
  2501    2814    784     626

2nd bench: unaligned buffer (offset 1), but (src & 3) == (dst & 3), so
C does aligned word copies, while asm does misaligned.  C wins:

  $ ./bench 10000 4 1
  stbu    stbx    C       asm
  2638    3006    795     961
  2502    2814    786     938
  2501    2814    786     939
  2501    2813    785     939

3rd bench: move up then down by 5 bytes, src & 3 != dst & 3, can't
align word copies.  C does byte copies.  asm wins:

  $ ./bench 10000 5 0
  stbu    stbx    C       asm
  2675    2815    2514    809
  2501    2813    2504    782
  2502    2815    2504    782
  2501    2814    2503    782

I think that memmove.S is probably better than memmove.c on G3.
I haven't run the bench on POWER9.
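
To make the (src & 3) == (dst & 3) condition from the 2nd and 3rd bench concrete, here is a minimal sketch of the strategy the "C" column stands for.  It is not the code from libc/string/memmove.c; the names same_word_offset and sketch_copy_forward are made up for illustration, it assumes 4-byte words, and it only copies forward (safe when dst <= src or the buffers don't overlap), so it is not a full memmove.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: nonzero when src and dst sit at the same
 * offset inside a 4-byte word, i.e. (src & 3) == (dst & 3).
 */
static int
same_word_offset(const void *dst, const void *src)
{
	return (((uintptr_t)src ^ (uintptr_t)dst) & 3) == 0;
}

/*
 * Illustrative forward-only copy.  If both pointers can be aligned
 * together, copy a few head bytes, then whole aligned words; in every
 * other case fall back to a plain byte loop.
 */
static void *
sketch_copy_forward(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	if (len >= 4 && same_word_offset(d, s)) {
		/* Head: at most 3 bytes until s (and so d) is aligned. */
		while (((uintptr_t)s & 3) != 0) {
			*d++ = *s++;
			len--;
		}
		/* Body: aligned words, moved through a local temporary. */
		while (len >= 4) {
			uint32_t w;

			memcpy(&w, s, sizeof(w));
			memcpy(d, &w, sizeof(w));
			s += 4;
			d += 4;
			len -= 4;
		}
	}
	/* Tail, or the whole buffer when the offsets differ. */
	while (len-- > 0)
		*d++ = *s++;
	return dst;
}
```

With a shift of 4 the two offsets always match and the word loop runs, whatever the buffer offset is (1st and 2nd bench).  With a shift of 5 they never match, so only the byte loop runs, which fits the C column landing next to stbu in the 3rd bench.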