Re: powerpc64: 64-bit-ize memmove.S

2020-06-27 Thread George Koehler
On Sat, 27 Jun 2020 01:27:14 +0200
Christian Weisgerber  wrote:

> That function simply copies as many (double)words plus a tail of
> bytes as the length argument specifies.  Neither source nor destination
> are checked for alignment, so this will happily run a loop of
> unaligned accesses, which doesn't sound very optimal.

I made a benchmark and concluded that unaligned word copies are slower
than aligned word copies, but faster than byte copies.  In most cases,
memmove.S is faster than memmove.c, but when the buffers are unaligned
by the same amount (so memmove.c can still do aligned word copies),
memmove.c is faster.
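
The condition is easiest to see in code; here is a simplified sketch of
the test (not the actual memmove.c logic, just the idea):

/*
 * Simplified sketch: word copies are only possible when src and dst
 * have the same misalignment relative to the word size.
 */
#include <stddef.h>
#include <stdint.h>

#define WORDMASK (sizeof(long) - 1)

static int
can_word_copy(const void *dst, const void *src)
{
	/*
	 * ((dst ^ src) & 3) == 0 is the same as (dst & 3) == (src & 3)
	 * on a 32-bit machine.
	 */
	return (((uintptr_t)dst ^ (uintptr_t)src) & WORDMASK) == 0;
}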

The benchmark was on a 32-bit macppc G3 with
cpu0 at mainbus0: 750 (Revision 0x202): 400 MHz: 512KB backside cache

The benchmark has 4 implementations of memmove,
  stbu  =>  byte copy with lbzu,stbu loop
  stbx  =>  byte copy with lbzx,stbx,addi loop
  C =>  aligned word copy or byte copy (libc/string/memmove.c)
  asm   =>  unaligned word copy (libc/arch/powerpc/string/memmove.S)

It shows time measured by mftb (move from timebase).
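
The bench program itself isn't shown here; roughly, it can look like
the sketch below, where bench(), NROUNDS, the buffer size, and the
(len, shift, offset) argument meanings are guesses rather than the
real code:

/*
 * Rough sketch of a timebase benchmark; names and parameters are
 * assumptions, not the actual bench program.
 */
#include <stdio.h>
#include <string.h>

#define NROUNDS 100

static unsigned long
rdtb(void)
{
	unsigned long tb;

	/* Read the (lower bits of the) timebase register. */
	__asm__ volatile ("mftb %0" : "=r" (tb));
	return tb;
}

static unsigned char buf[8192 + 64];

static unsigned long
bench(void *(*copy)(void *, const void *, size_t),
    size_t len, size_t shift, size_t offset)
{
	unsigned char *p = buf + offset;
	unsigned long t0, t1;
	int i;

	t0 = rdtb();
	for (i = 0; i < NROUNDS; i++) {
		copy(p + shift, p, len);	/* move up ... */
		copy(p, p + shift, len);	/* ... then back down */
	}
	t1 = rdtb();
	return t1 - t0;
}

int
main(void)
{
	printf("memmove: %lu ticks\n", bench(memmove, 4096, 4, 0));
	return 0;
}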

1st bench: move 1 bytes up by 4 bytes, then down by 4 bytes, in
aligned buffer (offset 0).  asm wins:

$ ./bench 1 4 0
stbu    stbx    C       asm
2639    2814    792     633
2502    2814    784     628
2501    2814    783     627
2501    2814    784     626

2nd bench: unaligned buffer (offset 1), but (src & 3) == (dst & 3), so
C does aligned word copies, while asm does misaligned.  C wins:

$ ./bench 1 4 1
stbu    stbx    C       asm
2638    3006    795     961
2502    2814    786     938
2501    2814    786     939
2501    2813    785     939

3rd bench: move up then down by 5 bytes, so (src & 3) != (dst & 3) and
word copies can't be aligned.  C does byte copies.  asm wins:

$ ./bench 1 5 0 
stbu    stbx    C       asm
2675    2815    2514    809
2501    2813    2504    782
2502    2815    2504    782
2501    2814    2503    782

I think that memmove.S is probably better than memmove.c on G3.
I haven't run the bench on POWER9.



Re: powerpc64: 64-bit-ize memmove.S

2020-06-26 Thread Theo de Raadt
George Koehler  wrote:

> On Sat, 27 Jun 2020 01:27:14 +0200
> Christian Weisgerber  wrote:
> 
> > I'm also intrigued by this aside in the PowerPC ISA documentation:
> > | Moreover, Load with Update instructions may take longer to execute
> > | in some implementations than the corresponding pair of a non-update
> > | Load instruction and an Add instruction.
> > What does clang generate?
> 
> clang likes load/store with update instructions.  For example, the
> powerpc64 kernel has /sys/lib/libkern/memcpy.c, which copies bytes:
> 
>   while (n-- > 0)
>           *t++ = *f++;

Keep in mind in userland we use memcpy.c, which does backwards detection.
It's probably slower.

The original idea was to use the C version with backwards detection until
it stopped finding bugs, but... it keeps finding bugs
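
In rough terms, the backwards detection is a check, before the plain
forward copy, that the two buffers don't overlap; a simplified sketch
of the idea (not the actual libc memcpy.c, which reports the offending
caller differently):

/*
 * Simplified sketch of overlap ("backwards") detection; the response
 * to a detected overlap is an assumption, not what libc really does.
 */
#include <stdlib.h>
#include <string.h>

void *
checked_memcpy(void *dst, const void *src, size_t len)
{
	char *d = dst;
	const char *s = src;

	if ((d > s && d < s + len) || (s > d && s < d + len))
		abort();	/* overlapping memcpy() is a caller bug */

	while (len-- > 0)	/* plain forward byte copy */
		*d++ = *s++;
	return dst;
}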



Re: powerpc64: 64-bit-ize memmove.S

2020-06-26 Thread George Koehler
On Sat, 27 Jun 2020 01:27:14 +0200
Christian Weisgerber  wrote:

> I'm also intrigued by this aside in the PowerPC ISA documentation:
> | Moreover, Load with Update instructions may take longer to execute
> | in some implementations than the corresponding pair of a non-update
> | Load instruction and an Add instruction.
> What does clang generate?

clang likes load/store with update instructions.  For example, the
powerpc64 kernel has /sys/lib/libkern/memcpy.c, which copies bytes:

	while (n-- > 0)
		*t++ = *f++;

clang uses lbzu and stbu:

memcpy:         cmpldi r5,0x0
memcpy+0x4:     beqlr
memcpy+0x8:     addi r4,r4,-1
memcpy+0xc:     addi r6,r3,-1
memcpy+0x10:    mtspr ctr,r5
memcpy+0x14:    lbzu r5,1(r4)
memcpy+0x18:    stbu r5,1(r6)
memcpy+0x1c:    bdnz 0x26cd0d4 (memcpy+0x14)
memcpy+0x20:    blr
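
For comparison, the non-update pair that ISA note refers to would be
something like "lbz %r5, 1(%r4); addi %r4, %r4, 1" in place of
"lbzu %r5, 1(%r4)".  Whether clang can be nudged toward indexed or
non-update forms, e.g. with an index-based loop like the sketch below,
is a guess that would need checking against real output:

/*
 * Index-based variant of the byte copy; whether clang emits lbzx/stbx,
 * non-update lbz/stb plus add, or still lbzu/stbu here is an assumption
 * that would need verifying with objdump.
 */
#include <stddef.h>

void *
memcpy_idx(void *t0, const void *f0, size_t n)
{
	unsigned char *t = t0;
	const unsigned char *f = f0;
	size_t i;

	for (i = 0; i < n; i++)
		t[i] = f[i];
	return t0;
}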

> I think we should consider dropping this "optimized" memmove.S on
> both powerpc and powerpc64.

I might want to benchmark memmove.S against memmove.c to check if
those unaligned accesses are too slow.  First I would have to write
a benchmark.



Re: powerpc64: 64-bit-ize memmove.S

2020-06-26 Thread Christian Weisgerber
Christian Weisgerber:

> Well, I suggested it, so here's my attempt to switch powerpc64's
> libc memmove.S over to 64 bits:

Actually, on second thought:

That function simply copies as many (double)words plus a tail of
bytes as the length argument specifies.  Neither source nor destination
are checked for alignment, so this will happily run a loop of
unaligned accesses, which doesn't sound very optimal.

I'm also intrigued by this aside in the PowerPC ISA documentation:
| Moreover, Load with Update instructions may take longer to execute
| in some implementations than the corresponding pair of a non-update
| Load instruction and an Add instruction.
What does clang generate?

I think we should consider dropping this "optimized" memmove.S on
both powerpc and powerpc64.

-- 
Christian "naddy" Weisgerber  na...@mips.inka.de



powerpc64: 64-bit-ize memmove.S

2020-06-26 Thread Christian Weisgerber
Well, I suggested it, so here's my attempt to switch powerpc64's
libc memmove.S over to 64 bits:
* Treat length parameter as 64 bits (size_t) instead of 32 (u_int).
* Set up the main loop to copy 64-bit doublewords instead of 32-bit
  words.

This definitely needs double and triple checking.

Things I wonder about, but didn't touch:
* Why is the memcpy entry point commented out?
* END... STRONG? WEAK? BUILTIN?

Index: memmove.S
===
RCS file: /cvs/src/lib/libc/arch/powerpc64/string/memmove.S,v
retrieving revision 1.1
diff -u -p -r1.1 memmove.S
--- memmove.S   25 Jun 2020 02:34:22 -  1.1
+++ memmove.S   26 Jun 2020 22:22:51 -
@@ -39,7 +39,7 @@
  * ==
  */
 
-#include "SYS.h"
+#include "DEFS.h"
 
 .text
 
@@ -64,45 +64,45 @@ ENTRY(memmove)
                                        /* start of dest                */

 fwd:
-       addi    %r4, %r4, -4            /* Back up src and dst pointers */
-       addi    %r8, %r8, -4            /* due to auto-update of 'load' */
+       addi    %r4, %r4, -8            /* Back up src and dst pointers */
+       addi    %r8, %r8, -8            /* due to auto-update of 'load' */

-       srwi.   %r9,%r5,2               /* How many words in total cnt  */
-       beq-    last1                   /* Handle byte by byte if < 4   */
+       srdi.   %r9,%r5,3               /* Doublewords in total count   */
+       beq-    last1                   /* Handle byte by byte if < 8   */
                                        /* bytes total                  */
-       mtctr   %r9                     /* Count of words for loop      */
-       lwzu    %r7, 4(%r4)             /* Preload first word           */
+       mtctr   %r9                     /* Count of dwords for loop     */
+       ldu     %r7, 8(%r4)             /* Preload first doubleword     */

        b       g1

 g0:                                    /* Main loop                    */

-       lwzu    %r7, 4(%r4)             /* Load a new word              */
-       stwu    %r6, 4(%r8)             /* Store previous word          */
+       ldu     %r7, 8(%r4)             /* Load a new doubleword        */
+       stdu    %r6, 8(%r8)             /* Store previous doubleword    */

 g1:

        bdz-    last                    /* Dec cnt, and branch if just  */
-                                       /* one word to store            */
-       lwzu    %r6, 4(%r4)             /* Load another word            */
-       stwu    %r7, 4(%r8)             /* Store previous word          */
+                                       /* one doubleword to store      */
+       ldu     %r6, 8(%r4)             /* Load another doubleword      */
+       stdu    %r7, 8(%r8)             /* Store previous doubleword    */
        bdnz+   g0                      /* Dec cnt, and loop again if   */
-                                       /* more words                   */
-       mr      %r7, %r6                /* If word count -> 0, then...  */
+                                       /* more doublewords             */
+       mr      %r7, %r6                /* If dword count -> 0, then... */

 last:

-       stwu    %r7, 4(%r8)             /* ... store last word          */
+       stdu    %r7, 8(%r8)             /* ... store last doubleword    */

 last1:                                 /* Byte-by-byte copy            */

-       clrlwi. %r5,%r5,30              /* If count -> 0, then ...      */
+       clrldi. %r5,%r5,61              /* If count -> 0, then ...      */
        beqlr                           /* we're done                   */

        mtctr   %r5                     /* else load count for loop     */

-       lbzu    %r6, 4(%r4)             /* 1st byte: update addr by 4   */
-       stbu    %r6, 4(%r8)             /* since we pre-adjusted by 4   */
+       lbzu    %r6, 8(%r4)             /* 1st byte: update addr by 8   */
+       stbu    %r6, 8(%r8)             /* since we pre-adjusted by 8   */
        bdzlr-                          /* in anticipation of main loop */

 last2:
@@ -120,40 +120,40 @@ reverse:
 
        add     %r4, %r4, %r5           /* Work from end to beginning   */
        add     %r8, %r8, %r5           /* so add count to string ptrs  */
-       srwi.   %r9,%r5,2               /* Words in total count         */
-       beq-    rlast1                  /* Handle byte by byte if < 4   */
+       srdi.   %r9,%r5,3               /* Doublewords in total count   */
+       beq-    rlast1                  /* Handle byte by byte if < 8   */
                                        /* bytes total                  */

-       mtctr   %r9                     /* Count of words for loop      */
+       mtctr   %r9                     /* Count of dwords for loop     */

-       lwzu    %r7, -4(%r4)            /* Preload first word           */
+       ldu