RE: [RFC] csum experts, csum_replace2() is too expensive
On Mon, 2014-03-24 at 06:17 -0700, Eric Dumazet wrote:
> On Mon, 2014-03-24 at 10:30 +0000, David Laight wrote:
> > ip_fast_csum() either needs an explicit "m" constraint for the actual
> > buffer (and target) bytes, or the stronger "memory" constraint.
> > The 'volatile' is then not needed.

I am testing the following :

diff --git a/arch/x86/include/asm/checksum_64.h b/arch/x86/include/asm/checksum_64.h
index e6fd8a026c7b..89d7fa8837b5 100644
--- a/arch/x86/include/asm/checksum_64.h
+++ b/arch/x86/include/asm/checksum_64.h
@@ -45,6 +45,9 @@ static inline __sum16 csum_fold(__wsum sum)
 static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
 {
 	unsigned int sum;
+	struct full_ip_header {
+		unsigned int w[ihl];
+	};
 
 	asm("	movl (%1), %0\n"
 	    "	subl $4, %2\n"
@@ -67,8 +70,9 @@ static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
 	   are modified, we must also specify them as outputs, or gcc
 	   will assume they contain their original values. */
 	: "=r" (sum), "=r" (iph), "=r" (ihl)
-	: "1" (iph), "2" (ihl)
-	: "memory");
+	: "1" (iph), "2" (ihl),
+	  "m" (*(struct full_ip_header *) iph)
+	);
 
 	return (__force __sum16)sum;
 }

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
RE: [RFC] csum experts, csum_replace2() is too expensive
On Mon, 2014-03-24 at 10:30 +0000, David Laight wrote:
> From: Eric Dumazet
> > On Fri, 2014-03-21 at 14:52 -0400, David Miller wrote:
> > > From: Eric Dumazet
> > > Date: Fri, 21 Mar 2014 05:50:50 -0700
> > >
> > > > It looks like a barrier() would be more appropriate.
> > >
> > > barrier() == __asm__ __volatile__(:::"memory")
> >
> > Indeed, but now you mention it, ip_fast_csum() does not use the volatile
> > keyword on x86_64, and has no "m" constraint either.
>
> Adding 'volatile' isn't sufficient to force gcc to write data
> into the area being checksummed.

You missed the point. It's not about forcing gcc to write data, because it
does.

Point is : gcc doesn't recompute the checksum a second time.

> ip_fast_csum() either needs an explicit "m" constraint for the actual
> buffer (and target) bytes, or the stronger "memory" constraint.
> The 'volatile' is then not needed.

How about you take a look at the actual code ?

The "memory" constraint is already there. And no, it's not enough,
otherwise I wouldn't have sent this mail.

I actually compiled the code and double checked.
7010 :
    7010:	e8 00 00 00 00       	callq  7015
			7011: R_X86_64_PC32	__fentry__-0x4
    7015:	55                   	push   %rbp
    7016:	31 c0                	xor    %eax,%eax
    7018:	b9 05 00 00 00       	mov    $0x5,%ecx
    701d:	48 89 e5             	mov    %rsp,%rbp
    7020:	48 83 ec 20          	sub    $0x20,%rsp
    7024:	48 89 5d e8          	mov    %rbx,-0x18(%rbp)
    7028:	4c 89 6d f8          	mov    %r13,-0x8(%rbp)
    702c:	48 89 fb             	mov    %rdi,%rbx
    702f:	4c 89 65 f0          	mov    %r12,-0x10(%rbp)
    7033:	41 89 d5             	mov    %edx,%r13d
    7036:	66 89 47 0a          	mov    %ax,0xa(%rdi)
    703a:	66 89 77 02          	mov    %si,0x2(%rdi)
    703e:	48 89 f8             	mov    %rdi,%rax
    7041:	48 89 fe             	mov    %rdi,%rsi
    7044:	44 8b 20             	mov    (%rax),%r12d
    7047:	83 e9 04             	sub    $0x4,%ecx
    704a:	76 2e                	jbe    707a
    704c:	44 03 60 04          	add    0x4(%rax),%r12d
    7050:	44 13 60 08          	adc    0x8(%rax),%r12d
    7054:	44 13 60 0c          	adc    0xc(%rax),%r12d
    7058:	44 13 60 10          	adc    0x10(%rax),%r12d
    705c:	48 8d 40 04          	lea    0x4(%rax),%rax
    7060:	ff c9                	dec    %ecx
    7062:	75 f4                	jne    7058
    7064:	41 83 d4 00          	adc    $0x0,%r12d
    7068:	44 89 e1             	mov    %r12d,%ecx
    706b:	41 c1 ec 10          	shr    $0x10,%r12d
    706f:	66 41 01 cc          	add    %cx,%r12w
    7073:	41 83 d4 00          	adc    $0x0,%r12d
    7077:	41 f7 d4             	not    %r12d
    707a:	31 c0                	xor    %eax,%eax
    707c:	66 44 89 67 0a       	mov    %r12w,0xa(%rdi)
    7081:	48 c7 c7 00 00 00 00 	mov    $0x0,%rdi
			7084: R_X86_64_32S	.rodata.str1.1+0xabd
    7088:	e8 00 00 00 00       	callq  708d
			7089: R_X86_64_PC32	printk-0x4
    708d:	66 44 89 6b 02       	mov    %r13w,0x2(%rbx)
    7092:	66 44 89 63 0a       	mov    %r12w,0xa(%rbx)
    7097:	4c 8b 6d f8          	mov    -0x8(%rbp),%r13
    709b:	48 8b 5d e8          	mov    -0x18(%rbp),%rbx
    709f:	4c 8b 65 f0          	mov    -0x10(%rbp),%r12
    70a3:	c9                   	leaveq
    70a4:	c3                   	retq
RE: [RFC] csum experts, csum_replace2() is too expensive
From: Eric Dumazet
> On Fri, 2014-03-21 at 14:52 -0400, David Miller wrote:
> > From: Eric Dumazet
> > Date: Fri, 21 Mar 2014 05:50:50 -0700
> >
> > > It looks like a barrier() would be more appropriate.
> >
> > barrier() == __asm__ __volatile__(:::"memory")
>
> Indeed, but now you mention it, ip_fast_csum() does not use the volatile
> keyword on x86_64, and has no "m" constraint either.

Adding 'volatile' isn't sufficient to force gcc to write data
into the area being checksummed.

ip_fast_csum() either needs an explicit "m" constraint for the actual
buffer (and target) bytes, or the stronger "memory" constraint.
The 'volatile' is then not needed.

	David
Re: [RFC] csum experts, csum_replace2() is too expensive
On Fri, 2014-03-21 at 14:52 -0400, David Miller wrote:
> From: Eric Dumazet
> Date: Fri, 21 Mar 2014 05:50:50 -0700
>
> > It looks like a barrier() would be more appropriate.
>
> barrier() == __asm__ __volatile__(:::"memory")

Indeed, but now you mention it, ip_fast_csum() does not use the volatile
keyword on x86_64, and has no "m" constraint either.

This means that for the following hypothetical networking code :

void foobar(struct iphdr *iph, __be16 newlen, __be16 newid)
{
	iph->tot_len = newlen;
	iph->check = 0;
	iph->check = ip_fast_csum((u8 *)iph, 5);
	pr_err("%p\n", iph);
	iph->id = newid;
	iph->check = 0;
	iph->check = ip_fast_csum((u8 *)iph, 5);
}

ip_fast_csum() is done _once_ only.

Following patch seems needed. That's yet another call for x86 code
factorization ...

diff --git a/arch/x86/include/asm/checksum_64.h b/arch/x86/include/asm/checksum_64.h
index e6fd8a026c7b..c67778544880 100644
--- a/arch/x86/include/asm/checksum_64.h
+++ b/arch/x86/include/asm/checksum_64.h
@@ -46,7 +46,7 @@ static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
 {
 	unsigned int sum;
 
-	asm("	movl (%1), %0\n"
+	asm volatile("	movl (%1), %0\n"
 	    "	subl $4, %2\n"
 	    "	jbe 2f\n"
 	    "	addl 4(%1), %0\n"
Re: [RFC] csum experts, csum_replace2() is too expensive
From: Eric Dumazet
Date: Fri, 21 Mar 2014 05:50:50 -0700

> It looks like a barrier() would be more appropriate.

barrier() == __asm__ __volatile__(:::"memory")
RE: [RFC] csum experts, csum_replace2() is too expensive
From: Eric Dumazet
> On Thu, 2014-03-20 at 18:56 -0700, Andi Kleen wrote:
> > Eric Dumazet writes:
> > >
> > > I saw csum_partial() consuming 1% of cpu cycles in a GRO workload, that
> > > is insane...
> >
> > Couldn't it just be the cache miss?
>
> Or the fact that we mix 16-bit stores and 32-bit loads?
>
> BTW, any idea why ip_fast_csum() on x86 contains a "memory" constraint ?

The correct constraint would be one that told gcc that it accesses the
20 bytes from the source pointer. Without it gcc won't necessarily write
out the values before the asm instructions execute.

	David
Re: [RFC] csum experts, csum_replace2() is too expensive
On Fri, 2014-03-21 at 06:47 -0700, Eric Dumazet wrote:
> Another idea would be to move the ip_fast_csum() call at the end of
> inet_gro_complete()
>
> I'll try this :
>
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index 8c54870db792..0ca8f350a532 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -1434,8 +1434,8 @@ static int inet_gro_complete(struct sk_buff *skb, int nhoff)
> 	int proto = iph->protocol;
> 	int err = -ENOSYS;
>
> -	csum_replace2(&iph->check, iph->tot_len, newlen);
> 	iph->tot_len = newlen;
> +	iph->check = 0;
>
> 	rcu_read_lock();
> 	ops = rcu_dereference(inet_offloads[proto]);
> @@ -1447,6 +1447,7 @@ static int inet_gro_complete(struct sk_buff *skb, int nhoff)
> 	 * inet_gro_receive().
> 	 */
> 	err = ops->callbacks.gro_complete(skb, nhoff + sizeof(*iph));
> +	iph->check = ip_fast_csum((u8 *)iph, 5);
>
> out_unlock:
> 	rcu_read_unlock();

This doesn't help... same stall.

Looks like the best way is to use some 16-bit loads in ip_fast_csum()
for the ihl=5 case.
Re: [RFC] csum experts, csum_replace2() is too expensive
On Fri, 2014-03-21 at 06:32 -0700, Eric Dumazet wrote:
> On Fri, 2014-03-21 at 05:50 -0700, Eric Dumazet wrote:
>
> > Or the fact that we mix 16-bit stores and 32-bit loads?
> >
> > 	iph->tot_len = newlen;
> > 	iph->check = 0;
> > 	iph->check = ip_fast_csum(iph, 5);
>
> Yep definitely. Using 16-bit loads in ip_fast_csum() totally removes the
> stall. I no longer see inet_gro_complete() in perf top...
>
> +	if (__builtin_constant_p(ihl) && ihl == 5) {
> +		asm("	movw (%[iph]), %w[sum]\n"	/* ihl/version/tos */
> +		    "	addw 2(%[iph]), %w[sum]\n"	/* tot_len */
> +		    "	adcw 8(%[iph]), %w[sum]\n"	/* ttl/protocol */
> +		    "	adcw 10(%[iph]), %w[sum]\n"	/* check */
> +		    "	adcl 4(%[iph]), %[sum]\n"	/* id/frag_off */
> +		    "	adcl 12(%[iph]), %[sum]\n"	/* saddr */
> +		    "	adcl 16(%[iph]), %[sum]\n"	/* daddr */
> +		    "	adcl $0, %[sum]\n"
> +		    : [sum] "=r" (sum)
> +		    : [iph] "r" (iph)
> +		    );
> +		return csum_fold(sum);

Another idea would be to move the ip_fast_csum() call at the end of
inet_gro_complete()

I'll try this :

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 8c54870db792..0ca8f350a532 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1434,8 +1434,8 @@ static int inet_gro_complete(struct sk_buff *skb, int nhoff)
 	int proto = iph->protocol;
 	int err = -ENOSYS;
 
-	csum_replace2(&iph->check, iph->tot_len, newlen);
 	iph->tot_len = newlen;
+	iph->check = 0;
 
 	rcu_read_lock();
 	ops = rcu_dereference(inet_offloads[proto]);
@@ -1447,6 +1447,7 @@ static int inet_gro_complete(struct sk_buff *skb, int nhoff)
 	 * inet_gro_receive().
 	 */
 	err = ops->callbacks.gro_complete(skb, nhoff + sizeof(*iph));
+	iph->check = ip_fast_csum((u8 *)iph, 5);
 
 out_unlock:
 	rcu_read_unlock();
Re: [RFC] csum experts, csum_replace2() is too expensive
On Fri, 2014-03-21 at 05:50 -0700, Eric Dumazet wrote:

> Or the fact that we mix 16-bit stores and 32-bit loads?
>
> 	iph->tot_len = newlen;
> 	iph->check = 0;
> 	iph->check = ip_fast_csum(iph, 5);

Yep definitely. Using 16-bit loads in ip_fast_csum() totally removes the
stall. I no longer see inet_gro_complete() in perf top...

+	if (__builtin_constant_p(ihl) && ihl == 5) {
+		asm("	movw (%[iph]), %w[sum]\n"	/* ihl/version/tos */
+		    "	addw 2(%[iph]), %w[sum]\n"	/* tot_len */
+		    "	adcw 8(%[iph]), %w[sum]\n"	/* ttl/protocol */
+		    "	adcw 10(%[iph]), %w[sum]\n"	/* check */
+		    "	adcl 4(%[iph]), %[sum]\n"	/* id/frag_off */
+		    "	adcl 12(%[iph]), %[sum]\n"	/* saddr */
+		    "	adcl 16(%[iph]), %[sum]\n"	/* daddr */
+		    "	adcl $0, %[sum]\n"
+		    : [sum] "=r" (sum)
+		    : [iph] "r" (iph)
+		    );
+		return csum_fold(sum);
Re: [RFC] csum experts, csum_replace2() is too expensive
On Fri, Mar 21, 2014 at 05:50:50AM -0700, Eric Dumazet wrote:
> On Thu, 2014-03-20 at 18:56 -0700, Andi Kleen wrote:
> > Eric Dumazet writes:
> > >
> > > I saw csum_partial() consuming 1% of cpu cycles in a GRO workload, that
> > > is insane...
> >
> > Couldn't it just be the cache miss?
>
> Or the fact that we mix 16-bit stores and 32-bit loads?

It should cause a small stall from not doing load-store forwarding, but
1% of a serious workload would be surprising.

Are you sure it's not some skid effect?

-Andi
Re: [RFC] csum experts, csum_replace2() is too expensive
On Thu, 2014-03-20 at 18:56 -0700, Andi Kleen wrote:
> Eric Dumazet writes:
> >
> > I saw csum_partial() consuming 1% of cpu cycles in a GRO workload, that
> > is insane...
>
> Couldn't it just be the cache miss?

Or the fact that we mix 16-bit stores and 32-bit loads?

	iph->tot_len = newlen;
	iph->check = 0;
	iph->check = ip_fast_csum(iph, 5);

-> following perf annotation :

         :	81538e70 :
    0.49 :	81538e70:	callq  81578c80 <__entry_text_start>
    0.49 :	81538e75:	push   %rbp
    1.46 :	81538e76:	movzwl 0x60(%rdi),%edx
    0.00 :	81538e7a:	movslq %esi,%rax
    0.00 :	81538e7d:	add    0xc8(%rdi),%rax
    1.46 :	81538e84:	mov    %rsp,%rbp
    0.00 :	81538e87:	sub    %esi,%edx
    0.00 :	81538e89:	rol    $0x8,%dx
    0.00 :	81538e8d:	movzbl 0x9(%rax),%ecx
    0.98 :	81538e91:	mov    %dx,0x2(%rax)	// iph->tot_len = newlen;
    0.49 :	81538e95:	xor    %edx,%edx
    0.00 :	81538e97:	mov    %dx,0xa(%rax)	// iph->check = 0;
    0.00 :	81538e9b:	mov    (%rax),%edx	// 32bit load -> stall
   86.34 :	81538e9d:	add    0x4(%rax),%edx
    0.49 :	81538ea0:	adc    0x8(%rax),%edx
    0.49 :	81538ea3:	adc    0xc(%rax),%edx
    0.98 :	81538ea6:	adc    0x10(%rax),%edx
    0.49 :	81538ea9:	adc    $0x0,%edx
    0.00 :	81538eac:	mov    %edx,%r8d
    0.49 :	81538eaf:	xor    %dx,%dx
    0.00 :	81538eb2:	shl    $0x10,%r8d
    0.98 :	81538eb6:	add    %r8d,%edx
    0.98 :	81538eb9:	adc    $0xffff,%edx
    0.98 :	81538ebf:	not    %edx
    0.00 :	81538ec1:	shr    $0x10,%edx
    0.49 :	81538ec4:	mov    %dx,0xa(%rax)
    0.00 :	81538ec8:	movslq %ecx,%rax
    0.00 :	81538ecb:	mov    -0x7e734f40(,%rax,8),%rax
    0.00 :	81538ed3:	test   %rax,%rax
    0.00 :	81538ed6:	je     81538ef0

BTW, any idea why ip_fast_csum() on x86 contains a "memory" constraint ?
It looks like a barrier() would be more appropriate.

The following patch seems to help, but the stall seems to be the fact
that we write on iph->check (16 bits) before doing the checksum using
32-bit loads.

Note that we could share the same ip_fast_csum() implementation between
x86 32/64 bits...
diff --git a/arch/x86/include/asm/checksum_64.h b/arch/x86/include/asm/checksum_64.h
index e6fd8a026c7b..a81cc3a5facc 100644
--- a/arch/x86/include/asm/checksum_64.h
+++ b/arch/x86/include/asm/checksum_64.h
@@ -46,6 +46,24 @@ static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
 {
 	unsigned int sum;
 
+	/*
+	 * Callers might clear iph->check before calling us, make sure
+	 * compiler wont mess things...
+	 */
+	barrier();
+
+	if (__builtin_constant_p(ihl) && ihl == 5) {
+		asm("	movl (%[iph]), %[sum]\n"
+		    "	addl 4(%[iph]), %[sum]\n"
+		    "	adcl 8(%[iph]), %[sum]\n"
+		    "	adcl 12(%[iph]), %[sum]\n"
+		    "	adcl 16(%[iph]), %[sum]\n"
+		    "	adcl $0, %[sum]\n"
+		    : [sum] "=r" (sum)
+		    : [iph] "r" (iph)
+		    );
+		return csum_fold(sum);
+	}
 	asm("	movl (%1), %0\n"
 	    "	subl $4, %2\n"
 	    "	jbe 2f\n"
@@ -68,7 +86,7 @@ static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
 	   will assume they contain their original values. */
 	: "=r" (sum), "=r" (iph), "=r" (ihl)
 	: "1" (iph), "2" (ihl)
-	: "memory");
+	);
 
 	return (__force __sum16)sum;
 }
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 8c54870db792..90aa562dfbf5 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1434,8 +1434,9 @@ static int inet_gro_complete(struct sk_buff *skb, int nhoff)
 	int proto = iph->protocol;
 	int err = -ENOSYS;
 
-	csum_replace2(&iph->check, iph->tot_len, newlen);
 	iph->tot_len = newlen;
+	iph->check = 0;
+	iph->check = ip_fast_csum((u8 *)iph, 5);
 
 	rcu_read_lock();
 	ops = rcu_dereference(inet_offloads[proto]);