Re: networking todo, was Re: Linux-2.4.0-test9-pre2
On Wed, Sep 20, 2000 at 03:24:28PM +0200, Andi Kleen wrote: > That would just break the whole idea behind softnet. When you're juggling I agree ;(. Andrea - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: networking todo, was Re: Linux-2.4.0-test9-pre2
On Wed, Sep 20, 2000 at 03:29:39PM +0200, Andrea Arcangeli wrote: > On Wed, Sep 20, 2000 at 04:38:24AM +0200, Andi Kleen wrote: > > We must be talking about different things. It of course detects it on > > ACK input, but only for data it did send itself. Every TCP detects > > reordering automatically on the input with the sequence number check, > > but all we still do is to send an ACK immediately and go into > > quickack mode. > > Agreed. I couldn't see how to "detect" the reordering in the receiver without > additional informations from the dev.c layer (I don't think it's possible to > detect it from the TCP layer without adding something like described by > Andi in the below section that looks not very nice because it would > hurt links with real packet loss). Assuming you have outgoing traffic on the NIC as well it isn't that bad, because there will be likely some queueing delay for the ACK in which you could in theory still cancel it of you know better (it would hurt the abstraction between device drivers and dev.c a lot though) > The solution consists in a sequence number protected by the spinlock in the NIC > device driver (as DaveM suggested) and to increase this sequence number for > each packet received and to set the sequence number value of the moment in each > skb. Then we'd need a secondary (shared :( ) sequence numbers in the > net_rx_action layer that will tell us the sequence number of the last packet > that entered the higher level layer (IP in our case). With this per packet > sequence number information we'll know if we're generating reordering by > pushing the next packet in the softnet queue from net_rx_action to the IP > layer. In that case we should only set up a wake-up-me-later logic and giveup. That would just break the whole idea behind softnet. When you're juggling the cache line for the sequence counter you could as well juggle the cache line with a shared queue head, e.g. we would be at 2.2 level again. > The other almost zero cost way to fix the reordering is of course to set the > irq affinity of each nic only to one CPU :)) (this applies to the less smart > protocol as well) That's the real solution long term, right. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: networking todo, was Re: Linux-2.4.0-test9-pre2
On Wed, Sep 20, 2000 at 04:38:24AM +0200, Andi Kleen wrote: > We must be talking about different things. It of course detects it on > ACK input, but only for data it did send itself. Every TCP detects > reordering automatically on the input with the sequence number check, > but all we still do is to send an ACK immediately and go into > quickack mode. Agreed. I couldn't see how to "detect" the reordering in the receiver without additional informations from the dev.c layer (I don't think it's possible to detect it from the TCP layer without adding something like described by Andi in the below section that looks not very nice because it would hurt links with real packet loss). > Or did I miss something ? > > The only way to do true reordering handling on the receiver I can think of > would be to use something like the soft-timers to do very very fast delayed > acks even for rcv_mss sized packets and hope that you collect packets from > all CPUs in the delay, but overall it could still cost you a lot by > disturbing the ACK clock > > [I talked a lot with Andrea about this during OLS, and we couldn't > figure out a good way] I couldn't figure out a good way to fix that internally to TCP, but I have a quite simple solution at the dev.c layer (that would address also the problem with the not smart protocols that needs to retransmit everything once an hole in the sequence space is found). The solution consists in a sequence number protected by the spinlock in the NIC device driver (as DaveM suggested) and to increase this sequence number for each packet received and to set the sequence number value of the moment in each skb. Then we'd need a secondary (shared :( ) sequence numbers in the net_rx_action layer that will tell us the sequence number of the last packet that entered the higher level layer (IP in our case). With this per packet sequence number information we'll know if we're generating reordering by pushing the next packet in the softnet queue from net_rx_action to the IP layer. In that case we should only set up a wake-up-me-later logic and giveup. No idea how much this additional logic hurts. Certainly it could be enabled per-device/per-protocol basis. The other almost zero cost way to fix the reordering is of course to set the irq affinity of each nic only to one CPU :)) (this applies to the less smart protocol as well) Andrea - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: networking todo, was Re: Linux-2.4.0-test9-pre2
On Tue, Sep 19, 2000 at 08:56:57PM -0700, David S. Miller wrote: >Just use __cacheline_aligned instead, like I did with the >ip_local_data in the patch you just rejected. There is still the >problem that generic SMP x86 kernels use a 32byte cacheline. Not a >problem currently because the only x86 SMP is pII/pIII which has >32byte, but with Williamette/Athlon SMP it'll be a problem because >these have 64byte and 128byte cache lines. With __cacheline_aligned >it'll be easy though to pick up such changes. > > Ok, feel free to send me a patch which does only this. I may doctor > it up a bit, be warned :-) Here is the complete patch I'm using, pick what you want from it. This time I documented my intentions properly, see the source code comments. I removed all power-of-two paddings because the shift/add sequences seem to be always short enough. I also documented a bigger potential win on machines with immediate memory access and sane interrupt model (and an even bigger -- Dimitris Michailidis's PDA patches) which both need more extensive modifications >On the other hand, the inetpeer code is only really exercised on >machines that talk to lots and lots of destinations (=real >servers), and 2.4 testing on such machines still has to begin. > >Given that 2.4 testing has not really begun yet I would guess that >it is safer to remove it now @) > > True, but the following is what I'm really thinking about. Suppose a > few weeks from now Alexey greets both of our mailboxes with a fabulous > solution to the tw recycle masqeurading problem, wouldn't it be quite > a pain in the ass to put back in and retest all the inetpeer code? > > Let's sit on this until Alexey gives a forecast about the > possibilities of a solution in the near future ok? Ok. I just don't want the code uselessly wasting cycles and cache lines in 2.4.0 if possible. -Andi Index: include/net/snmp.h === RCS file: /cvs/linux/include/net/snmp.h,v retrieving revision 1.16 diff -u -u -r1.16 snmp.h --- include/net/snmp.h 2000/08/09 11:59:03 1.16 +++ include/net/snmp.h 2000/09/20 11:19:25 @@ -14,17 +14,34 @@ * as published by the Free Software Foundation; either version * 2 of the License, or (at your option) any later version. * + * $Id$ + * */ #ifndef _SNMP_H #define _SNMP_H + +#include /* * We use all unsigned longs. Linux will soon be so reliable that even these * will rapidly get too small 8-). Seriously consider the IpInReceives count * on the 20Gb/s + networks people expect in a few years time! */ - + +/* + * The rule for padding: + * Best is power of two because then the right structure can be found by a simple + * shift. The structure should be always cache line aligned. + * gcc needs n=alignto(cachelinesize, popcnt(sizeof(bla_mib))) shift/add instructions + * to emulate multiply in case it is not power-of-two. Currently n is always <=3 for + * all sizes so simple cache line alignment is enough. + * + * The best solution would be a global CPU local area , especially on 64 and 128byte + * cacheline machine it makes a *lot* of sense -AK + */ + + struct ip_mib { unsigned long IpInReceives; @@ -44,8 +61,8 @@ unsigned long IpFragOKs; unsigned long IpFragFails; unsigned long IpFragCreates; - unsigned long __pad[32-19]; -}; + unsigned long __pad[0]; +} cacheline_aligned; struct ipv6_mib { @@ -71,8 +88,8 @@ unsigned long Ip6FragCreates; unsigned long Ip6InMcastPkts; unsigned long Ip6OutMcastPkts; - unsigned long __pad[32-22]; -}; + unsigned long __pad[0]; +} cacheline_aligned; struct icmp_mib { @@ -102,8 +119,8 @@ unsigned long IcmpOutTimestampReps; unsigned long IcmpOutAddrMasks; unsigned long IcmpOutAddrMaskReps; - unsigned long __pad[32-26]; -}; + unsigned long __pad[0]; +} cacheline_aligned; struct icmpv6_mib { @@ -140,8 +157,8 @@ unsigned long Icmp6OutRedirects; unsigned long Icmp6OutGroupMembResponses; unsigned long Icmp6OutGroupMembReductions; - unsigned long __pad[32-28]; -}; + unsigned long __pad[0]; +} cacheline_aligned; struct tcp_mib { @@ -159,8 +176,8 @@ unsigned long TcpRetransSegs; unsigned long TcpInErrs; unsigned long TcpOutRsts; - unsigned long __pad[16-14]; -}; + unsigned long __pad[0]; +} cacheline_aligned; struct udp_mib { @@ -168,8 +185,8 @@ unsigned long UdpNoPorts; unsigned long UdpInErrors; unsigned long UdpOutDatagrams; - unsigned long __pad[0]; -}; + unsigned long __pad[0]; +} cacheline_aligned; struct linux_mib { @@ -237,9 +254,15 @@ unsigned long TCPAbortOnLinger;
Re: networking todo, was Re: Linux-2.4.0-test9-pre2
Date:Wed, 20 Sep 2000 04:38:24 +0200 From: "Andi Kleen" <[EMAIL PROTECTED]> We must be talking about different things. It of course detects it on ACK input, but only for data it did send itself. Every TCP detects reordering automatically on the input with the sequence number check, but all we still do is to send an ACK immediately and go into quickack mode. Or did I miss something ? ... [I talked a lot with Andrea about this during OLS, and we couldn't figure out a good way] Nevermind, I've apparently got my directions reversed. :-) I'll think about this a bit. Just use __cacheline_aligned instead, like I did with the ip_local_data in the patch you just rejected. There is still the problem that generic SMP x86 kernels use a 32byte cacheline. Not a problem currently because the only x86 SMP is pII/pIII which has 32byte, but with Williamette/Athlon SMP it'll be a problem because these have 64byte and 128byte cache lines. With __cacheline_aligned it'll be easy though to pick up such changes. Ok, feel free to send me a patch which does only this. I may doctor it up a bit, be warned :-) On the other hand, the inetpeer code is only really exercised on machines that talk to lots and lots of destinations (=real servers), and 2.4 testing on such machines still has to begin. Given that 2.4 testing has not really begun yet I would guess that it is safer to remove it now @) True, but the following is what I'm really thinking about. Suppose a few weeks from now Alexey greets both of our mailboxes with a fabulous solution to the tw recycle masqeurading problem, wouldn't it be quite a pain in the ass to put back in and retest all the inetpeer code? Let's sit on this until Alexey gives a forecast about the possibilities of a solution in the near future ok? Later, David S. Miller [EMAIL PROTECTED] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: networking todo, was Re: Linux-2.4.0-test9-pre2
On Tue, Sep 19, 2000 at 06:54:30PM -0700, David S. Miller wrote: >Date: Wed, 20 Sep 2000 03:51:37 +0200 >From: "Andi Kleen" <[EMAIL PROTECTED]> > >>Receiver side SMP reordering is still there, but I'm not sure if it is >>fixable (but it'll surely hit people that cannot use Linux senders, I >>just see the reports) >> >> Reordering is a non-issue for ipv4/ipv6, all of the massive TCP input >> rewrite was strictly about dealing with this. No SMP reordering >> should cause any bogus fast retransmits etc. for example. > >You mean TCP output rewrite ? The fix was strictly sender side only >(I first thought one of the ack changes alexey did was an attempt at a >hackish receiver side solution, but he told me that was false) > >> What is the problem? Are you referering to the LAPB/X25 stuff? > >When you have a non linux 2.4 sender you lose. > > Alexey's changes detect reordering on the input side, regardless of > whether it is speaking to a Linux senders or not, to avoid false > retransmits. We must be talking about different things. It of course detects it on ACK input, but only for data it did send itself. Every TCP detects reordering automatically on the input with the sequence number check, but all we still do is to send an ACK immediately and go into quickack mode. Or did I miss something ? The only way to do true reordering handling on the receiver I can think of would be to use something like the soft-timers to do very very fast delayed acks even for rcv_mss sized packets and hope that you collect packets from all CPUs in the delay, but overall it could still cost you a lot by disturbing the ACK clock [I talked a lot with Andrea about this during OLS, and we couldn't figure out a good way] > Please show me (and even more importantly Alexey) an example of where > receive reordering detection is dependant upon Linux TCP sender > behavior, his code it is as generic as I can imagine it to be. In > fact, his code got lots of the testing on a web server serving almost > exclusively clients running windows :-) Web clients probably do not send enough data to make reordering a problem because the request fits into 1-2 packets and the 3way handshake is not reordering sensitive. (I haven't looked at SpecWeb that closely, but has it really any client sent data that is >packet size?) > Ok, linux_mib is obviously not exact but in that case I would argue > that the extra size needed (to get to the next a power of 2) would > outweight whatever instruction performance gain we'd get. > > As for the udp etc. case, how do we pick a "number" to make these > arrays as you say they should be? Just use __cacheline_aligned instead, like I did with the ip_local_data in the patch you just rejected. There is still the problem that generic SMP x86 kernels use a 32byte cacheline. Not a problem currently because the only x86 SMP is pII/pIII which has 32byte, but with Williamette/Athlon SMP it'll be a problem because these have 64byte and 128byte cache lines. With __cacheline_aligned it'll be easy though to pick up such changes. > I would only make these changes for the snmp mibs which are "smaller" > than this "number" we pick, the larger ones won't see much false > sharing at all. > > I think this is really a small and trite issue actually. I don't think so. There was so much pain to avoid cache line bouncing for fast paths, it would be a shame to add it again for silly statistics keeping. > Removing a lot of code also means undoing all the testing done so far > with that code present :-))) True. On the other hand, the inetpeer code is only really exercised on machines that talk to lots and lots of destinations (=real servers), and 2.4 testing on such machines still has to begin. Given that 2.4 testing has not really begun yet I would guess that it is safer to remove it now @) -Andi (who is off to bed now) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: networking todo, was Re: Linux-2.4.0-test9-pre2
Date: Wed, 20 Sep 2000 03:51:37 +0200 From: "Andi Kleen" <[EMAIL PROTECTED]> >Receiver side SMP reordering is still there, but I'm not sure if it is >fixable (but it'll surely hit people that cannot use Linux senders, I >just see the reports) > > Reordering is a non-issue for ipv4/ipv6, all of the massive TCP input > rewrite was strictly about dealing with this. No SMP reordering > should cause any bogus fast retransmits etc. for example. You mean TCP output rewrite ? The fix was strictly sender side only (I first thought one of the ack changes alexey did was an attempt at a hackish receiver side solution, but he told me that was false) > What is the problem? Are you referering to the LAPB/X25 stuff? When you have a non linux 2.4 sender you lose. Alexey's changes detect reordering on the input side, regardless of whether it is speaking to a Linux senders or not, to avoid false retransmits. Please show me (and even more importantly Alexey) an example of where receive reordering detection is dependant upon Linux TCP sender behavior, his code it is as generic as I can imagine it to be. In fact, his code got lots of the testing on a web server serving almost exclusively clients running windows :-) When I count correctly sizeof(udp_mib) is 16 bytes currently (lots of false sharing on 32/64byte cacheline sized architectures) and linux_mib is 288 bytes currently (does not make any sense at all) Ok, linux_mib is obviously not exact but in that case I would argue that the extra size needed (to get to the next a power of 2) would outweight whatever instruction performance gain we'd get. As for the udp etc. case, how do we pick a "number" to make these arrays as you say they should be? I would only make these changes for the snmp mibs which are "smaller" than this "number" we pick, the larger ones won't see much false sharing at all. I think this is really a small and trite issue actually. >UDP recvmsg error handling for csum errors is bogus (fix pending) > > Ok, send me a patch so I can see what the problem is. Appended. Applied. The "unknown gain" is just removing a lot of complexity => speed and less bugs. Removing a lot of code also means undoing all the testing done so far with that code present :-))) The only advantage I know left is saving pmtus for a bit longer, but I doubt it is worth the complexity. Neither of us are experts in this area, Alexey is. Do you have any specific plan for salvaging tw recycling ? Just keeping such an intrusive piece of code for benchmarks around looks wasteful to me. Nope, but this also requires Alexey's input. Later, David S. Miller [EMAIL PROTECTED] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: networking todo, was Re: Linux-2.4.0-test9-pre2
On Tue, Sep 19, 2000 at 06:13:38PM -0700, David S. Miller wrote: >Date: Wed, 20 Sep 2000 03:14:10 +0200 >From: "Andi Kleen" <[EMAIL PROTECTED]> > >The ipid handling is still fishy, it will break when you talk to >more destinations than the inetpeer cache can take (I fixed it in >my local tree with the appended patch) > > I don't like this change, please work with Alexey on this one. What do you dislike ? Stateless ipid looks much nicer to me than stateful one (and the "security" lost by not refetching the secret is minimal, especially given the state of the random device on machines without diskio and keyboard input) >Receiver side SMP reordering is still there, but I'm not sure if it is >fixable (but it'll surely hit people that cannot use Linux senders, I >just see the reports) > > Reordering is a non-issue for ipv4/ipv6, all of the massive TCP input > rewrite was strictly about dealing with this. No SMP reordering > should cause any bogus fast retransmits etc. for example. You mean TCP output rewrite ? The fix was strictly sender side only (I first thought one of the ack changes alexey did was an attempt at a hackish receiver side solution, but he told me that was false) > What is the problem? Are you referering to the LAPB/X25 stuff? When you have a non linux 2.4 sender you lose. >The TCP connect running out of ports problem is still there (I >fixed it locally, but the changes are probably far too extensive >for 2.4.x now) > > I think people can change their ip_local_port_range and I really > consider this a non-issue, at least, it's not a 2.4.x show stopper. I agree. > >The TCP ISN computation is not SMP safe. > > "Big deal". So how much effort will you go to to get how much more > protection and how much of it do we really need? How less secure are > our TCP ISN's because of this? Not a show-stopper. > > I think you're reading too much tcp-impl :-) No, it has nothing to do with tcp-impl ;), I discovered it while playing with the ipids much earlier. With some bad luck could could get very similar ISNs, which is probably bad (an easy fix is to do the secret hash with a stack copy instead of the static) (but it is not a show stopper I agree) > >include/linux/snmp.h is probably still not properly aligned for all >cache line sizes. > > Read, and reread what Alexey tries to tell you over and over about > this. It is not meant to be cache line aligned, it is meant to be a > power of two and nothing more. It has to be at least cache line aligned, otherwise the arrays wouldn't make any sense at all, the power of two would just be a (imho misguided but anyways) optimization. When I count correctly sizeof(udp_mib) is 16 bytes currently (lots of false sharing on 32/64byte cacheline sized architectures) and linux_mib is 288 bytes currently (does not make any sense at all) >UDP recvmsg error handling for csum errors is bogus (fix pending) > > Ok, send me a patch so I can see what the problem is. Appended. > >I also would like to remove the inetpeer cache code before 2.4.0, now >that nobody managed to salvage tw recycling and ipid can do without it. > > I would prefer not to, it seems to be too potentially destabilizing > for questionable and unknown gain (ie. we don't know if tw recycling > can be salvaged, so lets not take any changes). The "unknown gain" is just removing a lot of complexity => speed and less bugs. The only advantage I know left is saving pmtus for a bit longer, but I doubt it is worth the complexity. Do you have any specific plan for salvaging tw recycling ? Just keeping such an intrusive piece of code for benchmarks around looks wasteful to me. -Andi Index: net/ipv4/udp.c === RCS file: /cvs/linux/net/ipv4/udp.c,v retrieving revision 1.85 diff -u -u -d -r1.85 udp.c --- net/ipv4/udp.c 2000/08/09 11:59:04 1.85 +++ net/ipv4/udp.c 2000/09/17 11:55:29 @@ -493,8 +493,6 @@ if (usin->sin_family != AF_INET) { if (usin->sin_family != AF_UNSPEC) return -EINVAL; - if (net_ratelimit()) - printk("Remind Kuznetsov, he has to repair %s eventually\n", current->comm); } ufh.daddr = usin->sin_addr.s_addr; @@ -678,6 +676,8 @@ if (flags & MSG_ERRQUEUE) return ip_recv_error(sk, msg, len); + + retry: /* * From here the generic datagram does a lot of the work. Come * the finished NET3, it will do _ALL_ the work! @@ -733,26 +733,21 @@ csum_copy_err: UDP_INC_STATS_BH(UdpInErrors); - /* Clear queue. */ - if (flags&MSG_PEEK) { - int clear = 0; + if (flags&(MSG_PEEK|MSG_DONTWAIT)) { + struct sk_buff *skb2; + spin_lock_irq(&sk->receive
Re: networking todo, was Re: Linux-2.4.0-test9-pre2
Date: Wed, 20 Sep 2000 03:14:10 +0200 From: "Andi Kleen" <[EMAIL PROTECTED]> The ipid handling is still fishy, it will break when you talk to more destinations than the inetpeer cache can take (I fixed it in my local tree with the appended patch) I don't like this change, please work with Alexey on this one. Do not assume we are removing the inetpeer cache also, see below. Receiver side SMP reordering is still there, but I'm not sure if it is fixable (but it'll surely hit people that cannot use Linux senders, I just see the reports) Reordering is a non-issue for ipv4/ipv6, all of the massive TCP input rewrite was strictly about dealing with this. No SMP reordering should cause any bogus fast retransmits etc. for example. What is the problem? Are you referering to the LAPB/X25 stuff? The TCP connect running out of ports problem is still there (I fixed it locally, but the changes are probably far too extensive for 2.4.x now) I think people can change their ip_local_port_range and I really consider this a non-issue, at least, it's not a 2.4.x show stopper. The TCP ISN computation is not SMP safe. "Big deal". So how much effort will you go to to get how much more protection and how much of it do we really need? How less secure are our TCP ISN's because of this? Not a show-stopper. I think you're reading too much tcp-impl :-) include/linux/snmp.h is probably still not properly aligned for all cache line sizes. Read, and reread what Alexey tries to tell you over and over about this. It is not meant to be cache line aligned, it is meant to be a power of two and nothing more. Not a bug. UDP recvmsg error handling for csum errors is bogus (fix pending) Ok, send me a patch so I can see what the problem is. I also would like to remove the inetpeer cache code before 2.4.0, now that nobody managed to salvage tw recycling and ipid can do without it. I would prefer not to, it seems to be too potentially destabilizing for questionable and unknown gain (ie. we don't know if tw recycling can be salvaged, so lets not take any changes). Later, David S. Miller [EMAIL PROTECTED] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/