RE: [RFC] csum experts, csum_replace2() is too expensive

2014-03-24 Thread Eric Dumazet
On Mon, 2014-03-24 at 06:17 -0700, Eric Dumazet wrote:
> On Mon, 2014-03-24 at 10:30 +0000, David Laight wrote:

> > ip_fast_csum() either needs an explicit "m" constraint for the actual
> > buffer (and target) bytes, or the stronger "memory" constraint.
> > The 'volatile' is then not needed.

I am testing the following:

diff --git a/arch/x86/include/asm/checksum_64.h b/arch/x86/include/asm/checksum_64.h
index e6fd8a026c7b..89d7fa8837b5 100644
--- a/arch/x86/include/asm/checksum_64.h
+++ b/arch/x86/include/asm/checksum_64.h
@@ -45,6 +45,9 @@ static inline __sum16 csum_fold(__wsum sum)
 static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
 {
unsigned int sum;
+   struct full_ip_header {
+   unsigned int w[ihl];
+   };
 
asm("  movl (%1), %0\n"
"  subl $4, %2\n"
@@ -67,8 +70,9 @@ static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
   are modified, we must also specify them as outputs, or gcc
   will assume they contain their original values. */
: "=r" (sum), "=r" (iph), "=r" (ihl)
-   : "1" (iph), "2" (ihl)
-   : "memory");
+   : "1" (iph), "2" (ihl),
+ "m" (*(struct full_ip_header *) iph)
+   );
return (__force __sum16)sum;
 }
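
For reference, an equivalent untested sketch using the pointer-to-array
cast idiom from the GCC Extended Asm documentation, instead of a local
struct with a variable-length member (the function name is illustrative):

	/* Illustrative only, not the posted patch: an "m" operand over an
	 * array of unspecified size tells gcc the asm may read any bytes
	 * reachable through iph, so pending stores get flushed first. */
	static inline unsigned int load_first_word(const void *iph)
	{
		unsigned int sum;

		asm("movl (%[ip]), %[s]"
		    : [s] "=r" (sum)
		    : [ip] "r" (iph),
		      "m" (*(const char (*)[]) iph));
		return sum;
	}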
 




RE: [RFC] csum experts, csum_replace2() is too expensive

2014-03-24 Thread Eric Dumazet
On Mon, 2014-03-24 at 10:30 +0000, David Laight wrote:
> From: Eric Dumazet
> > On Fri, 2014-03-21 at 14:52 -0400, David Miller wrote:
> > > From: Eric Dumazet <eric.duma...@gmail.com>
> > > Date: Fri, 21 Mar 2014 05:50:50 -0700
> > >
> > > > It looks like a barrier() would be more appropriate.
> > >
> > > barrier() == __asm__ __volatile__("" : : : "memory")
> > 
> > Indeed, but now you mention it, ip_fast_csum() does not use the volatile
> > keyword on x86_64, and has no "m" constraint either.
> 
> Adding 'volatile' isn't sufficient to force gcc to write data
> into the area being checksummed.

You missed the point. It's not about forcing gcc to write data, because
it does.

The point is: gcc doesn't recompute the checksum a second time.
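
A minimal, hypothetical (non-kernel) illustration of the effect: gcc
treats a non-volatile asm with outputs as a pure function of its inputs,
so two identical instances may be merged and the body emitted only once:

	static inline unsigned int asm_inc(unsigned int x)
	{
		unsigned int r;

		asm("leal 1(%1), %0" : "=r" (r) : "r" (x));	/* r = x + 1 */
		return r;
	}

	unsigned int twice(unsigned int x)
	{
		unsigned int a = asm_inc(x);	/* asm body emitted here...  */
		unsigned int b = asm_inc(x);	/* ...gcc may just reuse 'a' */

		return a + b;
	}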

> ip_fast_csum() either needs an explicit "m" constraint for the actual
> buffer (and target) bytes, or the stronger "memory" constraint.
> The 'volatile' is then not needed.

How about you take a look at the actual code?

The "memory" constraint is already there. And no, it's not enough, otherwise
I wouldn't have sent this mail.

I actually compiled the code and double checked.

0000000000007010 <foobar>:
    7010:   e8 00 00 00 00          callq  7015 <foobar+0x5>
            7011: R_X86_64_PC32     __fentry__-0x4
    7015:   55                      push   %rbp
    7016:   31 c0                   xor    %eax,%eax
    7018:   b9 05 00 00 00          mov    $0x5,%ecx
    701d:   48 89 e5                mov    %rsp,%rbp
    7020:   48 83 ec 20             sub    $0x20,%rsp
    7024:   48 89 5d e8             mov    %rbx,-0x18(%rbp)
    7028:   4c 89 6d f8             mov    %r13,-0x8(%rbp)
    702c:   48 89 fb                mov    %rdi,%rbx
    702f:   4c 89 65 f0             mov    %r12,-0x10(%rbp)
    7033:   41 89 d5                mov    %edx,%r13d
    7036:   66 89 47 0a             mov    %ax,0xa(%rdi)
    703a:   66 89 77 02             mov    %si,0x2(%rdi)
    703e:   48 89 f8                mov    %rdi,%rax
    7041:   48 89 fe                mov    %rdi,%rsi
    7044:   44 8b 20                mov    (%rax),%r12d
    7047:   83 e9 04                sub    $0x4,%ecx
    704a:   76 2e                   jbe    707a <foobar+0x6a>
    704c:   44 03 60 04             add    0x4(%rax),%r12d
    7050:   44 13 60 08             adc    0x8(%rax),%r12d
    7054:   44 13 60 0c             adc    0xc(%rax),%r12d
    7058:   44 13 60 10             adc    0x10(%rax),%r12d
    705c:   48 8d 40 04             lea    0x4(%rax),%rax
    7060:   ff c9                   dec    %ecx
    7062:   75 f4                   jne    7058 <foobar+0x48>
    7064:   41 83 d4 00             adc    $0x0,%r12d
    7068:   44 89 e1                mov    %r12d,%ecx
    706b:   41 c1 ec 10             shr    $0x10,%r12d
    706f:   66 41 01 cc             add    %cx,%r12w
    7073:   41 83 d4 00             adc    $0x0,%r12d
    7077:   41 f7 d4                not    %r12d
    707a:   31 c0                   xor    %eax,%eax
    707c:   66 44 89 67 0a          mov    %r12w,0xa(%rdi)
    7081:   48 c7 c7 00 00 00 00    mov    $0x0,%rdi
            7084: R_X86_64_32S      .rodata.str1.1+0xabd
    7088:   e8 00 00 00 00          callq  708d <foobar+0x7d>
            7089: R_X86_64_PC32     printk-0x4
    708d:   66 44 89 6b 02          mov    %r13w,0x2(%rbx)
    7092:   66 44 89 63 0a          mov    %r12w,0xa(%rbx)
    7097:   4c 8b 6d f8             mov    -0x8(%rbp),%r13
    709b:   48 8b 5d e8             mov    -0x18(%rbp),%rbx
    709f:   4c 8b 65 f0             mov    -0x10(%rbp),%r12
    70a3:   c9                      leaveq
    70a4:   c3                      retq




RE: [RFC] csum experts, csum_replace2() is too expensive

2014-03-24 Thread David Laight
From: Eric Dumazet
> On Fri, 2014-03-21 at 14:52 -0400, David Miller wrote:
> > From: Eric Dumazet <eric.duma...@gmail.com>
> > Date: Fri, 21 Mar 2014 05:50:50 -0700
> >
> > > It looks like a barrier() would be more appropriate.
> >
> > barrier() == __asm__ __volatile__(:::"memory")
> 
> Indeed, but now you mention it, ip_fast_csum() does not use the volatile
> keyword on x86_64, and has no "m" constraint either.

Adding 'volatile' isn't sufficient to force gcc to write data
into the area being checksummed.
ip_fast_csum() either needs an explicit "m" constraint for the actual
buffer (and target) bytes, or the stronger "memory" constraint.
The 'volatile' is then not needed.

David




Re: [RFC] csum experts, csum_replace2() is too expensive

2014-03-23 Thread Eric Dumazet
On Fri, 2014-03-21 at 14:52 -0400, David Miller wrote:
From: Eric Dumazet <eric.duma...@gmail.com>
> Date: Fri, 21 Mar 2014 05:50:50 -0700
> 
> > It looks like a barrier() would be more appropriate.
> 
> barrier() == __asm__ __volatile__("" : : : "memory")

Indeed, but now you mention it, ip_fast_csum() does not use the volatile
keyword on x86_64, and has no "m" constraint either.

This means that for the following hypothetical networking code:

void foobar(struct iphdr *iph, __be16 newlen, __be16 newid)
{
	iph->tot_len = newlen;
	iph->check = 0;
	iph->check = ip_fast_csum((u8 *)iph, 5);

	pr_err("%p\n", iph);

	iph->id = newid;
	iph->check = 0;
	iph->check = ip_fast_csum((u8 *)iph, 5);
}


ip_fast_csum() is done _once_ only.

The following patch seems needed. That's another call for x86 code
factorization...

diff --git a/arch/x86/include/asm/checksum_64.h b/arch/x86/include/asm/checksum_64.h
index e6fd8a026c7b..c67778544880 100644
--- a/arch/x86/include/asm/checksum_64.h
+++ b/arch/x86/include/asm/checksum_64.h
@@ -46,7 +46,7 @@ static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
 {
unsigned int sum;
 
-   asm("  movl (%1), %0\n"
+   asm volatile("  movl (%1), %0\n"
"  subl $4, %2\n"
"  jbe 2f\n"
"  addl 4(%1), %0\n"






Re: [RFC] csum experts, csum_replace2() is too expensive

2014-03-21 Thread David Miller
From: Eric Dumazet <eric.duma...@gmail.com>
Date: Fri, 21 Mar 2014 05:50:50 -0700

> It looks like a barrier() would be more appropriate.

barrier() == __asm__ __volatile__("" : : : "memory")


RE: [RFC] csum experts, csum_replace2() is too expensive

2014-03-21 Thread David Laight
From: Eric Dumazet
> On Thu, 2014-03-20 at 18:56 -0700, Andi Kleen wrote:
> > Eric Dumazet <eric.duma...@gmail.com> writes:
> > >
> > > I saw csum_partial() consuming 1% of cpu cycles in a GRO workload, that
> > > is insane...
> >
> >
> > Couldn't it just be the cache miss?
> 
> Or the fact that we mix 16 bit stores and 32bit loads?
> 
> BTW, any idea why ip_fast_csum() on x86 contains a "memory" constraint?

The correct constraint would be one that told gcc that it
accesses the 20 bytes from the source pointer.

Without it gcc won't necessarily write out the values before
the asm instructions execute.

David
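
Such a constraint could look like the following untested sketch (the
struct and function names are illustrative, not from any posted patch):

	struct ipv4_hdr_words { unsigned int w[5]; };	/* 20 bytes, ihl == 5 */

	static inline unsigned int csum5(const void *iph)
	{
		unsigned int sum;

		asm("movl (%[ip]), %[s]\n\t"
		    "addl 4(%[ip]), %[s]\n\t"
		    "adcl 8(%[ip]), %[s]\n\t"
		    "adcl 12(%[ip]), %[s]\n\t"
		    "adcl 16(%[ip]), %[s]\n\t"
		    "adcl $0, %[s]"
		    : [s] "=r" (sum)
		    : [ip] "r" (iph),
		      /* tells gcc the asm reads exactly these 20 bytes */
		      "m" (*(const struct ipv4_hdr_words *) iph));
		return sum;
	}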



Re: [RFC] csum experts, csum_replace2() is too expensive

2014-03-21 Thread Eric Dumazet
On Fri, 2014-03-21 at 06:47 -0700, Eric Dumazet wrote:

> Another idea would be to move the ip_fast_csum() call at the end of
> inet_gro_complete()
> 
> I'll try this:
> 
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index 8c54870db792..0ca8f350a532 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -1434,8 +1434,8 @@ static int inet_gro_complete(struct sk_buff *skb, int nhoff)
> int proto = iph->protocol;
> int err = -ENOSYS;
>  
> -   csum_replace2(&iph->check, iph->tot_len, newlen);
> iph->tot_len = newlen;
> +   iph->check = 0;
>  
> rcu_read_lock();
> ops = rcu_dereference(inet_offloads[proto]);
> @@ -1447,6 +1447,7 @@ static int inet_gro_complete(struct sk_buff *skb, int nhoff)
>  * inet_gro_receive().
>  */
> err = ops->callbacks.gro_complete(skb, nhoff + sizeof(*iph));
> +   iph->check = ip_fast_csum((u8 *)iph, 5);
>  
>  out_unlock:
> rcu_read_unlock();
> 

This doesn't help... same stall.

Looks like the best way is to use some 16 bit loads in ip_fast_csum()
for the ihl=5 case.
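
(The rule of thumb, hedged: on most x86 cores a store can be forwarded
to a later load only when the load's bytes are entirely contained in the
pending store. A 32-bit load that partially overlaps a just-buffered
16-bit store gets no forwarding and must wait for the store to drain.
A minimal illustration, with field offsets as in struct iphdr:)

	struct hdr {				/* iphdr-like layout */
		unsigned char	vihl, tos;
		unsigned short	tot_len;	/* offset 2  */
		unsigned short	id, frag_off;
		unsigned char	ttl, protocol;
		unsigned short	check;		/* offset 10 */
		unsigned int	saddr, daddr;
	};

	unsigned int stalls(struct hdr *h, unsigned short newlen)
	{
		unsigned int w;

		h->tot_len = newlen;		/* 16-bit store at offset 2  */
		__builtin_memcpy(&w, h, 4);	/* 32-bit load of bytes 0-3:
						 * overlaps the pending store
						 * only partially, so it
						 * cannot be forwarded */
		return w;
	}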





Re: [RFC] csum experts, csum_replace2() is too expensive

2014-03-21 Thread Eric Dumazet
On Fri, 2014-03-21 at 06:32 -0700, Eric Dumazet wrote:
> On Fri, 2014-03-21 at 05:50 -0700, Eric Dumazet wrote:
> 
> > Or the fact that we mix 16 bit stores and 32bit loads?
> > 
> > iph->tot_len = newlen;
> > iph->check = 0;
> > iph->check = ip_fast_csum(iph, 5);
> 
> Yep definitely. Using 16 bit loads in ip_fast_csum() totally removes the
> stall. I no longer see inet_gro_complete() in perf top...
> 
> +   if (__builtin_constant_p(ihl) && ihl == 5) {
> +   asm("  movw (%[iph]), %w[sum]\n"/* ihl/version/tos */
> +   "  addw 2(%[iph]), %w[sum]\n"   /* tot_len  */
> +   "  adcw 8(%[iph]), %w[sum]\n"   /* ttl/protocol */
> +   "  adcw 10(%[iph]), %w[sum]\n"  /* check*/
> +   "  adcl 4(%[iph]), %[sum]\n"/* id/frag_off  */
> +   "  adcl 12(%[iph]), %[sum]\n"   /* saddr*/
> +   "  adcl 16(%[iph]), %[sum]\n"   /* daddr*/
> +   "  adcl $0, %[sum]\n"
> +   : [sum] "=r" (sum)
> +   : [iph] "r" (iph)
> +   );
> +   return csum_fold(sum);
> 

Another idea would be to move the ip_fast_csum() call at the end of
inet_gro_complete()

I'll try this:

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 8c54870db792..0ca8f350a532 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1434,8 +1434,8 @@ static int inet_gro_complete(struct sk_buff *skb, int nhoff)
int proto = iph->protocol;
int err = -ENOSYS;
 
-   csum_replace2(&iph->check, iph->tot_len, newlen);
iph->tot_len = newlen;
+   iph->check = 0;
 
rcu_read_lock();
ops = rcu_dereference(inet_offloads[proto]);
@@ -1447,6 +1447,7 @@ static int inet_gro_complete(struct sk_buff *skb, int nhoff)
 * inet_gro_receive().
 */
err = ops->callbacks.gro_complete(skb, nhoff + sizeof(*iph));
+   iph->check = ip_fast_csum((u8 *)iph, 5);
 
 out_unlock:
rcu_read_unlock();




Re: [RFC] csum experts, csum_replace2() is too expensive

2014-03-21 Thread Eric Dumazet
On Fri, 2014-03-21 at 05:50 -0700, Eric Dumazet wrote:

> Or the fact that we mix 16 bit stores and 32bit loads?
> 
> iph->tot_len = newlen;
> iph->check = 0;
> iph->check = ip_fast_csum(iph, 5);

Yep definitely. Using 16 bit loads in ip_fast_csum() totally removes the
stall. I no longer see inet_gro_complete() in perf top...

+   if (__builtin_constant_p(ihl) && ihl == 5) {
+   asm("  movw (%[iph]), %w[sum]\n"/* ihl/version/tos */
+   "  addw 2(%[iph]), %w[sum]\n"   /* tot_len  */
+   "  adcw 8(%[iph]), %w[sum]\n"   /* ttl/protocol */
+   "  adcw 10(%[iph]), %w[sum]\n"  /* check*/
+   "  adcl 4(%[iph]), %[sum]\n"/* id/frag_off  */
+   "  adcl 12(%[iph]), %[sum]\n"   /* saddr*/
+   "  adcl 16(%[iph]), %[sum]\n"   /* daddr*/
+   "  adcl $0, %[sum]\n"
+   : [sum] "=r" (sum)
+   : [iph] "r" (iph)
+   );
+   return csum_fold(sum);




Re: [RFC] csum experts, csum_replace2() is too expensive

2014-03-21 Thread Andi Kleen
On Fri, Mar 21, 2014 at 05:50:50AM -0700, Eric Dumazet wrote:
> On Thu, 2014-03-20 at 18:56 -0700, Andi Kleen wrote:
> > Eric Dumazet <eric.duma...@gmail.com> writes:
> > >
> > > I saw csum_partial() consuming 1% of cpu cycles in a GRO workload, that
> > > is insane...
> > 
> > 
> > Couldn't it just be the cache miss?
> 
> Or the fact that we mix 16 bit stores and 32bit loads?

It should cause a small stall from not doing load-store
forwarding, but 1% of a serious workload would be surprising.

Are you sure it's not some skid effect?

-Andi


Re: [RFC] csum experts, csum_replace2() is too expensive

2014-03-21 Thread Eric Dumazet
On Thu, 2014-03-20 at 18:56 -0700, Andi Kleen wrote:
> Eric Dumazet <eric.duma...@gmail.com> writes:
> >
> > I saw csum_partial() consuming 1% of cpu cycles in a GRO workload, that
> > is insane...
> 
> 
> Couldn't it just be the cache miss?

Or the fact that we mix 16 bit stores and 32bit loads?

iph->tot_len = newlen;
iph->check = 0;
iph->check = ip_fast_csum(iph, 5);

-> the following perf annotation:

         :  ffffffff81538e70 <inet_gro_complete>:
    0.49 :  ffffffff81538e70:   callq  ffffffff81578c80 <__entry_text_start>
    0.49 :  ffffffff81538e75:   push   %rbp
    1.46 :  ffffffff81538e76:   movzwl 0x60(%rdi),%edx
    0.00 :  ffffffff81538e7a:   movslq %esi,%rax
    0.00 :  ffffffff81538e7d:   add    0xc8(%rdi),%rax
    1.46 :  ffffffff81538e84:   mov    %rsp,%rbp
    0.00 :  ffffffff81538e87:   sub    %esi,%edx
    0.00 :  ffffffff81538e89:   rol    $0x8,%dx
    0.00 :  ffffffff81538e8d:   movzbl 0x9(%rax),%ecx
    0.98 :  ffffffff81538e91:   mov    %dx,0x2(%rax)    iph->tot_len = newlen;
    0.49 :  ffffffff81538e95:   xor    %edx,%edx
    0.00 :  ffffffff81538e97:   mov    %dx,0xa(%rax)    iph->check = 0;
    0.00 :  ffffffff81538e9b:   mov    (%rax),%edx      // 32bit load -> stall
   86.34 :  ffffffff81538e9d:   add    0x4(%rax),%edx
    0.49 :  ffffffff81538ea0:   adc    0x8(%rax),%edx
    0.49 :  ffffffff81538ea3:   adc    0xc(%rax),%edx
    0.98 :  ffffffff81538ea6:   adc    0x10(%rax),%edx
    0.49 :  ffffffff81538ea9:   adc    $0x0,%edx
    0.00 :  ffffffff81538eac:   mov    %edx,%r8d
    0.49 :  ffffffff81538eaf:   xor    %dx,%dx
    0.00 :  ffffffff81538eb2:   shl    $0x10,%r8d
    0.98 :  ffffffff81538eb6:   add    %r8d,%edx
    0.98 :  ffffffff81538eb9:   adc    $0xffff,%edx
    0.98 :  ffffffff81538ebf:   not    %edx
    0.00 :  ffffffff81538ec1:   shr    $0x10,%edx
    0.49 :  ffffffff81538ec4:   mov    %dx,0xa(%rax)
    0.00 :  ffffffff81538ec8:   movslq %ecx,%rax
    0.00 :  ffffffff81538ecb:   mov    -0x7e734f40(,%rax,8),%rax
    0.00 :  ffffffff81538ed3:   test   %rax,%rax
    0.00 :  ffffffff81538ed6:   je     ffffffff81538ef0 <inet_gro_complete+0x80>


BTW, any idea why ip_fast_csum() on x86 contains a "memory" constraint?
It looks like a barrier() would be more appropriate.

The following patch seems to help, but the stall seems to come from the
fact that we write iph->check (16 bits) before doing the checksum using
32-bit loads.

Note that we could share the same ip_fast_csum() implementation between
x86 32/64 bits...

diff --git a/arch/x86/include/asm/checksum_64.h b/arch/x86/include/asm/checksum_64.h
index e6fd8a026c7b..a81cc3a5facc 100644
--- a/arch/x86/include/asm/checksum_64.h
+++ b/arch/x86/include/asm/checksum_64.h
@@ -46,6 +46,24 @@ static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
 {
unsigned int sum;
 
+   /*
+    * Callers might clear iph->check before calling us, make sure
+    * compiler won't mess things...
+    */
+   barrier();
+
+   if (__builtin_constant_p(ihl) && ihl == 5) {
+   asm("  movl (%[iph]), %[sum]\n"
+   "  addl 4(%[iph]), %[sum]\n"
+   "  adcl 8(%[iph]), %[sum]\n"
+   "  adcl 12(%[iph]), %[sum]\n"
+   "  adcl 16(%[iph]), %[sum]\n"
+   "  adcl $0, %[sum]\n"
+   : [sum] "=r" (sum)
+   : [iph] "r" (iph)
+   );
+   return csum_fold(sum);
+   }
asm("  movl (%1), %0\n"
"  subl $4, %2\n"
"  jbe 2f\n"
@@ -68,7 +86,7 @@ static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
   will assume they contain their original values. */
: "=r" (sum), "=r" (iph), "=r" (ihl)
: "1" (iph), "2" (ihl)
-   : "memory");
+   );
return (__force __sum16)sum;
 }
 
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 8c54870db792..90aa562dfbf5 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1434,8 +1434,9 @@ static int inet_gro_complete(struct sk_buff *skb, int nhoff)
int proto = iph->protocol;
int err = -ENOSYS;
 
-   csum_replace2(&iph->check, iph->tot_len, newlen);
iph->tot_len = newlen;
+   iph->check = 0;
+   iph->check = ip_fast_csum((u8 *)iph, 5);
 
rcu_read_lock();
ops = rcu_dereference(inet_offloads[proto]);



