Re: [PATCH v3] powerpc: Implement csum_ipv6_magic in assembly

2018-05-24 Thread Segher Boessenkool
On Thu, May 24, 2018 at 10:18:44AM +, Christophe Leroy wrote:
> On 05/24/2018 06:20 AM, Christophe LEROY wrote:
> >Le 23/05/2018 à 20:34, Segher Boessenkool a écrit :
> >>On Tue, May 22, 2018 at 08:57:01AM +0200, Christophe Leroy wrote:
> >>>The generic csum_ipv6_magic() generates a pretty bad result
> >>
> >>
> >>
> >>Please try with a more recent compiler, what you used is pretty ancient.
> >>It's not like recent compilers do great on this either, but it's not
> >>*that* bad anymore ;-)
> 
> Here is what I get with GCC 8.1
> It doesn't look much better, does it ?

There are no more mfocrf, which is a big speedup.  Other than that it is
pretty lousy still, I totally agree.  This improvement happened quite a
while ago, it's fixed in GCC 6.


Segher


Re: [PATCH v3] powerpc: Implement csum_ipv6_magic in assembly

2018-05-24 Thread Segher Boessenkool
On Thu, May 24, 2018 at 08:20:16AM +0200, Christophe LEROY wrote:
> Le 23/05/2018 à 20:34, Segher Boessenkool a écrit :
> >On Tue, May 22, 2018 at 08:57:01AM +0200, Christophe Leroy wrote:
> >>+_GLOBAL(csum_ipv6_magic)
> >>+   lwz r8, 0(r3)
> >>+   lwz r9, 4(r3)
> >>+   lwz r10, 8(r3)
> >>+   lwz r11, 12(r3)
> >>+   addcr0, r5, r6
> >>+   adder0, r0, r7
> >>+   adder0, r0, r8
> >>+   adder0, r0, r9
> >>+   adder0, r0, r10
> >>+   adder0, r0, r11
> >>+   lwz r8, 0(r4)
> >>+   lwz r9, 4(r4)
> >>+   lwz r10, 8(r4)
> >>+   lwz r11, 12(r4)
> >>+   adder0, r0, r8
> >>+   adder0, r0, r9
> >>+   adder0, r0, r10
> >>+   adder0, r0, r11
> >>+   addze   r0, r0
> >>+   rotlwi  r3, r0, 16
> >>+   add r3, r0, r3
> >>+   not r3, r3
> >>+   rlwinm  r3, r3, 16, 16, 31
> >>+   blr
> >>+EXPORT_SYMBOL(csum_ipv6_magic)
> >
> >Clustering the loads and carry insns together is pretty much the worst you
> >can do on most 32-bit CPUs.
> 
> Oh, really ? __csum_partial is written that way too.

I thought I told you about this before?  Maybe not.

> Right, now I tried interleaving the lwz and adde. I get no improvment at 
> all on a 885, but I get a 15% improvment on a 8321.

It won't likely help on single-issue cores (like the one 885 has), yes.


Segher


Re: [PATCH v3] powerpc: Implement csum_ipv6_magic in assembly

2018-05-24 Thread Christophe Leroy



On 05/24/2018 06:20 AM, Christophe LEROY wrote:



Le 23/05/2018 à 20:34, Segher Boessenkool a écrit :

On Tue, May 22, 2018 at 08:57:01AM +0200, Christophe Leroy wrote:

The generic csum_ipv6_magic() generates a pretty bad result




Please try with a more recent compiler, what you used is pretty ancient.
It's not like recent compilers do great on this either, but it's not
*that* bad anymore ;-)



Here is what I get with GCC 8.1
It doesn't look much better, does it ?


net/ipv6/ip6_checksum.o: file format elf32-powerpc


Disassembly of section .text:

 :
   0:   94 21 ff f0 stwur1,-16(r1)
   4:   80 04 00 00 lwz r0,0(r4)
   8:   81 64 00 04 lwz r11,4(r4)
   c:   81 04 00 08 lwz r8,8(r4)
  10:   93 e1 00 0c stw r31,12(r1)
  14:   81 43 00 00 lwz r10,0(r3)
  18:   83 e3 00 04 lwz r31,4(r3)
  1c:   81 23 00 08 lwz r9,8(r3)
  20:   81 83 00 0c lwz r12,12(r3)
  24:   7c ea 3a 14 add r7,r10,r7
  28:   7d 4a 38 10 subfc   r10,r10,r7
  2c:   7c ff 3a 14 add r7,r31,r7
  30:   81 44 00 0c lwz r10,12(r4)
  34:   7c 63 19 10 subfe   r3,r3,r3
  38:   7c 63 38 50 subfr3,r3,r7
  3c:   7f ff 18 10 subfc   r31,r31,r3
  40:   7c e9 1a 14 add r7,r9,r3
  44:   83 e1 00 0c lwz r31,12(r1)
  48:   7c 63 19 10 subfe   r3,r3,r3
  4c:   38 21 00 10 addir1,r1,16
  50:   7c 63 38 50 subfr3,r3,r7
  54:   7d 29 18 10 subfc   r9,r9,r3
  58:   7d 2c 1a 14 add r9,r12,r3
  5c:   7c 63 19 10 subfe   r3,r3,r3
  60:   7c 63 48 50 subfr3,r3,r9
  64:   7d 8c 18 10 subfc   r12,r12,r3
  68:   7d 20 1a 14 add r9,r0,r3
  6c:   7c 63 19 10 subfe   r3,r3,r3
  70:   7c 63 48 50 subfr3,r3,r9
  74:   7c 00 18 10 subfc   r0,r0,r3
  78:   7d 2b 1a 14 add r9,r11,r3
  7c:   7c 63 19 10 subfe   r3,r3,r3
  80:   7c 63 48 50 subfr3,r3,r9
  84:   7d 6b 18 10 subfc   r11,r11,r3
  88:   7d 28 1a 14 add r9,r8,r3
  8c:   7c 63 19 10 subfe   r3,r3,r3
  90:   7c 63 48 50 subfr3,r3,r9
  94:   7d 08 18 10 subfc   r8,r8,r3
  98:   7d 2a 1a 14 add r9,r10,r3
  9c:   7c 63 19 10 subfe   r3,r3,r3
  a0:   7c 63 48 50 subfr3,r3,r9
  a4:   7d 4a 18 10 subfc   r10,r10,r3
  a8:   7d 23 2a 14 add r9,r3,r5
  ac:   7c 63 19 10 subfe   r3,r3,r3
  b0:   7c 63 48 50 subfr3,r3,r9
  b4:   7c a5 18 10 subfc   r5,r5,r3
  b8:   7c 63 32 14 add r3,r3,r6
  bc:   7d 29 49 10 subfe   r9,r9,r9
  c0:   7d 29 18 50 subfr9,r9,r3
  c4:   7c c6 48 10 subfc   r6,r6,r9
  c8:   7c 63 19 10 subfe   r3,r3,r3
  cc:   7c 63 48 50 subfr3,r3,r9
  d0:   54 69 80 3e rotlwi  r9,r3,16
  d4:   7c 63 4a 14 add r3,r3,r9
  d8:   7c 63 18 f8 not r3,r3
  dc:   54 63 84 3e rlwinm  r3,r3,16,16,31
  e0:   4e 80 00 20 blr

net/ipv6/ip6_checksum.o: file format elf64-powerpc


Disassembly of section .text:

 <.csum_ipv6_magic>:
   0:   fb e1 ff f8 std r31,-8(r1)
   4:   81 43 00 00 lwz r10,0(r3)
   8:   81 83 00 04 lwz r12,4(r3)
   c:   81 23 00 08 lwz r9,8(r3)
  10:   80 03 00 0c lwz r0,12(r3)
  14:   7c e7 52 14 add r7,r7,r10
  18:   80 64 00 08 lwz r3,8(r4)
  1c:   81 04 00 00 lwz r8,0(r4)
  20:   78 ff 00 20 clrldi  r31,r7,32
  24:   7c ec 3a 14 add r7,r12,r7
  28:   81 64 00 04 lwz r11,4(r4)
  2c:   7f ea f8 50 subfr31,r10,r31
  30:   81 44 00 0c lwz r10,12(r4)
  34:   7b ff 0f e0 rldicl  r31,r31,1,63
  38:   7c ff 3a 14 add r7,r31,r7
  3c:   eb e1 ff f8 ld  r31,-8(r1)
  40:   78 e4 00 20 clrldi  r4,r7,32
  44:   7c e9 3a 14 add r7,r9,r7
  48:   7d 8c 20 50 subfr12,r12,r4
  4c:   79 8c 0f e0 rldicl  r12,r12,1,63
  50:   7d 8c 3a 14 add r12,r12,r7
  54:   79 87 00 20 clrldi  r7,r12,32
  58:   7d 80 62 14 add r12,r0,r12
  5c:   7d 29 38 50 subfr9,r9,r7
  60:   79 29 0f e0 rldicl  r9,r9,1,63
  64:   7d 29 62 14 add r9,r9,r12
  68:   79 27 00 20 clrldi  r7,r9,32
  6c:   7d 28 4a 14 add r9,r8,r9
  70:   7c 00 38 50 subfr0,r0,r7
  74:   78 00 0f e0 rldicl  r0,r0,1,63
  78:   7c 00 4a 14 add r0,r0,r9
  7c:   78 09 00 20 clrldi  r9,r0,32
  80:   7c 0b 02 14 add r0,r11,r0
  84:   7d 08 48 50 subfr8,r8,r9
  88:   79 08 0f e0 rldicl  r8,r8,1,63
  8c:   7d 08 02 14 add r8,r8,r0
  90:   79 09 00 20 clrldi  r9,r8,32
  94:   7d 03 42 14 add r8,r3,r8
  98:   7d 2b 48 50 subfr9,r11,r9
  9c:   79 29 0f e0 rldicl  r9,r9,1,63
  a0:   7d 29 42 14 add r9,r9,r8
  a4:   79 28 00 20 clrldi  r8,r9,32
  a8:   7d 2a 4a 14 add r9,r10,r9
  ac:   7d 03 40 50 subfr8,r3,r8
  b0:   79 08 0f e0 rldicl  r8,r8,1,63
  b4:   7d 08 4a 14 add r8,r8,r9
  b8:   

Re: [PATCH v3] powerpc: Implement csum_ipv6_magic in assembly

2018-05-23 Thread Christophe LEROY



Le 23/05/2018 à 20:34, Segher Boessenkool a écrit :

On Tue, May 22, 2018 at 08:57:01AM +0200, Christophe Leroy wrote:

The generic csum_ipv6_magic() generates a pretty bad result




Please try with a more recent compiler, what you used is pretty ancient.
It's not like recent compilers do great on this either, but it's not
*that* bad anymore ;-)


--- a/arch/powerpc/lib/checksum_32.S
+++ b/arch/powerpc/lib/checksum_32.S
@@ -293,3 +293,36 @@ dst_error:
EX_TABLE(51b, dst_error);
  
  EXPORT_SYMBOL(csum_partial_copy_generic)

+
+/*
+ * static inline __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
+ *   const struct in6_addr *daddr,
+ *   __u32 len, __u8 proto, __wsum sum)
+ */
+
+_GLOBAL(csum_ipv6_magic)
+   lwz r8, 0(r3)
+   lwz r9, 4(r3)
+   lwz r10, 8(r3)
+   lwz r11, 12(r3)
+   addcr0, r5, r6
+   adder0, r0, r7
+   adder0, r0, r8
+   adder0, r0, r9
+   adder0, r0, r10
+   adder0, r0, r11
+   lwz r8, 0(r4)
+   lwz r9, 4(r4)
+   lwz r10, 8(r4)
+   lwz r11, 12(r4)
+   adder0, r0, r8
+   adder0, r0, r9
+   adder0, r0, r10
+   adder0, r0, r11
+   addze   r0, r0
+   rotlwi  r3, r0, 16
+   add r3, r0, r3
+   not r3, r3
+   rlwinm  r3, r3, 16, 16, 31
+   blr
+EXPORT_SYMBOL(csum_ipv6_magic)


Clustering the loads and carry insns together is pretty much the worst you
can do on most 32-bit CPUs.


Oh, really ? __csum_partial is written that way too.

Right, now I tried interleaving the lwz and adde. I get no improvment at 
all on a 885, but I get a 15% improvment on a 8321.


Christophe




Segher



Re: [PATCH v3] powerpc: Implement csum_ipv6_magic in assembly

2018-05-23 Thread Segher Boessenkool
On Tue, May 22, 2018 at 08:57:01AM +0200, Christophe Leroy wrote:
> The generic csum_ipv6_magic() generates a pretty bad result



Please try with a more recent compiler, what you used is pretty ancient.
It's not like recent compilers do great on this either, but it's not
*that* bad anymore ;-)

> --- a/arch/powerpc/lib/checksum_32.S
> +++ b/arch/powerpc/lib/checksum_32.S
> @@ -293,3 +293,36 @@ dst_error:
>   EX_TABLE(51b, dst_error);
>  
>  EXPORT_SYMBOL(csum_partial_copy_generic)
> +
> +/*
> + * static inline __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
> + * const struct in6_addr *daddr,
> + * __u32 len, __u8 proto, __wsum sum)
> + */
> +
> +_GLOBAL(csum_ipv6_magic)
> + lwz r8, 0(r3)
> + lwz r9, 4(r3)
> + lwz r10, 8(r3)
> + lwz r11, 12(r3)
> + addcr0, r5, r6
> + adder0, r0, r7
> + adder0, r0, r8
> + adder0, r0, r9
> + adder0, r0, r10
> + adder0, r0, r11
> + lwz r8, 0(r4)
> + lwz r9, 4(r4)
> + lwz r10, 8(r4)
> + lwz r11, 12(r4)
> + adder0, r0, r8
> + adder0, r0, r9
> + adder0, r0, r10
> + adder0, r0, r11
> + addze   r0, r0
> + rotlwi  r3, r0, 16
> + add r3, r0, r3
> + not r3, r3
> + rlwinm  r3, r3, 16, 16, 31
> + blr
> +EXPORT_SYMBOL(csum_ipv6_magic)

Clustering the loads and carry insns together is pretty much the worst you
can do on most 32-bit CPUs.


Segher