Re: Netchannels: first stage has been completed. Further ideas.

From: Rusty Russell [EMAIL PROTECTED]
Date: Fri, 28 Jul 2006 15:54:04 +1000

> (1) I am imagining some Grand Unified Flow Cache (Olsson trie?) that holds (some subset of?) flows. A successful lookup immediately after a packet comes off the NIC gives the destiny for the packet: what route, (optionally) what socket, what filtering, what connection tracking (& what NAT), etc. I don't know if this should be a general array of fn & data ptrs, or specialized fields for each one, or a mix. Maybe there's a "too hard, do slow path" bit, or maybe hard cases just never get put in the cache. Perhaps we need a separate one for locally-generated packets, a-la ip_route_output(). Anyway, we trade slightly more expensive flow setup for faster packet processing within flows.

So, specifically, one of the methods you are thinking about might be implemented by adding:

    void (*input)(struct sk_buff *, void *);
    void *input_data;

to struct flow_cache_entry or whatever replaces it? This way we don't need some kind of type information in the flow cache entry, since the input handler knows the type.

> One way to do this is to add a have_interest callback into the hook_ops, which takes each about-to-be-inserted GUFC entry and adds any destinies this hook cares about. In the case of packet filtering this would do a traversal and append a fn/data ptr to the entry for each rule which could affect it.

Can you give a concrete example of how the GUFC might make use of this? Just some small abstract code snippets will do.

> The other way is to have the hooks register what they are interested in into a general data structure which GUFC entry creation then looks up itself. This general data structure will need to support wildcards though.

My gut reaction is that imposing a global data structure on all object classes is not prudent. When we take a GUFC miss, it seems better that we call into the subsystems to resolve things. Each can implement whatever slow-path lookup algorithm is most appropriate for its data.

> We also need efficient ways of reflecting rule changes into the GUFC. We can be pretty slack with conntrack timeouts, but we either need to flush or handle callbacks from the GUFC on timed-out entries. Packet filtering changes need to be synchronous, definitely.

This, I will remind, is similar to the problem of doing RCU locking of the TCP hash tables.

> (3) Smart NICs that do some flowid work themselves can accelerate lookup implicitly (same flow goes to same CPU/thread) or explicitly (each CPU/thread maintains only the part of the GUFC which it needs, or even the NIC returns a flow cookie which is a pointer to a GUFC entry or subtree?). AFAICT this will magnify the payoff from the GUFC.

I want to warn you about HW issues that I mentioned to Alexey the other week. If we are not careful, we can run into the same issues TOE cards run into, performance-wise. Namely, it is important to be careful about how the GUFC table entries get updated in the card. If you add them synchronously, your connection rates will deteriorate dramatically.

I had the idea of a lazy scheme. When we create a GUFC entry, we tack it onto a DMA'able linked list the card uses. We do not notify the card, we just append the update to the list. Then, if the card misses its on-chip GUFC table on an incoming packet, it checks the DMA update list by reading it in from memory. It updates its GUFC table with whatever entries are found on this list, then it retries to classify the packet.
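Something like this, perhaps - a rough sketch of the lazy update list (every name below is invented for illustration; nothing here is from an existing driver):

    /* One host-to-card GUFC update.  The card only walks this list when
     * its on-chip table misses, so entry creation costs the host nothing
     * but a couple of stores. */
    struct gufc_hw_update {
            uint32_t saddr, daddr;      /* flow identity */
            uint16_t sport, dport;
            uint8_t  proto;
            uint8_t  op;                /* add only: deletes cannot be lazy */
            uint64_t next_dma;          /* DMA address of next update, 0 = end */
    };

    struct gufc_dev {
            uint64_t *next_dma_of_tail; /* host pointer into the last entry */
    };

    /* Host side: link a completed entry onto the tail.  No doorbell write,
     * no register access, no waiting for the card. */
    static void gufc_lazy_add(struct gufc_dev *dev, struct gufc_hw_update *u,
                              uint64_t u_dma)
    {
            u->next_dma = 0;
            wmb();                           /* entry contents before the link */
            *dev->next_dma_of_tail = u_dma;  /* card sees it on its next miss */
            dev->next_dma_of_tail = &u->next_dma;
    }

The point being that insertion is fire-and-forget for the host, which is exactly what keeps connection rates up.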
This seems like a possibly good solution, until we try to address GUFC entry deletion, which unfortunately cannot be handled in a lazy fashion. It must be synchronous. This is because if, for example, we just killed off a TCP socket, we must make sure we don't hit the GUFC entry for the TCP identity of that socket any longer.

Just something to think about when considering how to translate these ideas into hardware.
Re: Netchannels: first stage has been completed. Further ideas.

From: Stephen Hemminger [EMAIL PROTECTED]
Date: Thu, 27 Jul 2006 11:54:19 -0700

> I think we sell our existing stack short.

I agree.

> There are lots of opportunities left to look more closely at actual real performance bottlenecks and improve incrementally. But it requires tools, time, faster net hardware, and some creative insight. I guess it just isn't as cool.

We are in fact suggesting some ideas that address the current stack issues along the way. Witness the discussion we had about the tcp_ack() costs wrt. pruning the retransmit queue and tagging packets for SACK; I'm working on a new data structure and layout to cure all that stuff.

But I think we can do better. Jamal said to me in one email: "If even only half of Van's numbers are real, this is really exciting."

Rusty and Alexey are looking at the problem from another direction: go back to the unified flow cache, implement all the hair to do that, and then we can look at netchannels, because they will be so much more straightforward at that point.
Re: Netchannels: first stage has been completed. Further ideas.

From: Rusty Russell [EMAIL PROTECTED]
Date: Thu, 27 Jul 2006 15:46:12 +1000

> Yes, my first thought back in January was how netfilter would interact with this in a sane way. One answer is "don't": once someone registers on any hook we go into slow path. Another is to run the hooks in socket context, which is better, but precludes having the consumer in userspace, which still appeals to me 8)

Small steps, small steps. I have not ruled out userspace TCP just yet, but we are not prepared to go there right now anyway. It is just the same kind of jump to go to kernel-level netchannels as it is to go from kernel-level netchannels to userspace netchannel-based TCP.

> What would the tuple look like? Off the top of my head: SRCIP/DSTIP/PROTO/SPT/DPT/IN/OUT (where IN and OUT are boolean values indicating whether the src/dest is local). Of course, it means rewriting all the userspace tools, documentation, and creating a complete new infrastructure for connection tracking and NAT, but if that's what's required, then so be it.

I think we are finally able to talk seriously about revamping netfilter on this level, because we finally have a good incentive to do so and some kind of model exists to work against.

Robert's trie might be able to handle your tuple very well, fwiw, perhaps even with prefixing.

But something occurs to me. A socket has its ID when it is created and goes to established state. This means we have this tuple, and thus we can prelookup the netfilter rule and attach this cached lookup state to the socket. Your tuple in this case is defined to be: SRCIP/DSTIP/TCP/SPT/DPT/0/1

I do not know how practical this is, it is just a suggestion.

Would there be prefixing in these tuples? That's where the trouble starts. If you add prefixing, the troubles and limitations of today's lookups reappear. If you disallow prefixing, tables get very large but lookup becomes simpler and practical.
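For concreteness, the tuple being discussed could be packed into a fixed-width key along these lines (a sketch only; the struct name and layout are made up here):

    /* The SRCIP/DSTIP/PROTO/SPT/DPT/IN/OUT key. */
    struct nf_flow_tuple {
            uint32_t src_ip;
            uint32_t dst_ip;
            uint16_t src_port;
            uint16_t dst_port;
            uint8_t  proto;       /* IPPROTO_TCP, IPPROTO_UDP, ... */
            uint8_t  in  : 1;     /* source is local */
            uint8_t  out : 1;     /* destination is local */
    };

The prelookup case above for an established TCP socket is then the constant key SRCIP/DSTIP/TCP/SPT/DPT/0/1, with no wildcards, which is exactly what makes a prefix-free lookup simple and practical.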
Re: Netchannels: first stage has been completed. Further ideas.

Hello!

On Thu, Jul 27, 2006 at 03:46:12PM +1000, Rusty Russell wrote:
> Of course, it means rewriting all the userspace tools, documentation, and creating a complete new infrastructure for connection tracking and NAT, but if that's what's required, then so be it.

That's what I love to hear. Not a joke. :-) Could I only suggest not to relate this to netchannels? :-) In the past we used to call this thing the (grand-unified) flow cache.

> I don't think they are equivalent. In channels,

I understand this. Actually, it was what I said in the next paragraph, which you even cited. I really do not like to repeat myself, it is nothing but idle talk, but if the questions are questioned...

First, it was stated that the suggested implementation performs better, and even much better. I am asking: why do we see such improvement? I am absolutely not satisfied with the statement "It is better. Period." From all that I see, this particular implementation does not implement the optimizations suggested by VJ; it implements only the things which are not supposed to affect performance, or to affect it negatively. Idle talk? I am sure that if that improvement happened not due to a severe protocol violation, we can easily fix the existing stack.

> userspace), no dequeue lock is required.

And that was a part of the second question. I do not see how single-threaded TCP is possible. In the receiver path it has to ack within quite strict time bounds, to delack etc.; in the sender path it has to slow start; and I am not even speaking of slow path things: retransmit, probing window, lingering without process context etc. It looks like VJ implies the protocol must be changed. We can't, we mustn't.

After we deidealize this idealization and recognize that some slow path should exist, and that some part of this slow path has to be executed with higher priority than the fast one, where do we arrive? Is it not exactly what we have right now? Clean fast path, separate slow path. Not good enough? Where? Let's find and fix this.

Alexey
Re: Netchannels: first stage has been completed. Further ideas.

Hello, Alexey.

On Thu, Jul 27, 2006 at 08:33:35PM +0400, Alexey Kuznetsov ([EMAIL PROTECTED]) wrote:
> First, it was stated that the suggested implementation performs better, and even much better. I am asking: why do we see such improvement? I am absolutely not satisfied with the statement "It is better. Period." From all that I see, this particular implementation does not implement the optimizations suggested by VJ; it implements only the things which are not supposed to affect performance, or to affect it negatively.

Just for clarification: I showed that even using the _existing_ stack (using sk_backlog_rcv), performance in process context can exceed two-level processing. And after creating my own TCP implementation (which does not include the two-level-related overhead, among other things), the performance difference was even higher. I can agree that in the second case part of the gain is possibly obtained from the new TCP implementation and not 100% from process context, but in the first case the existing socket code was used.

> > userspace), no dequeue lock is required.
>
> And that was a part of the second question. I do not see how single-threaded TCP is possible. In the receiver path it has to ack within quite strict time bounds, to delack etc.; in the sender path it has to slow start; and I am not even speaking of slow path things: retransmit, probing window, lingering without process context etc. It looks like VJ implies the protocol must be changed. We can't, we mustn't. After we deidealize this idealization and recognize that some slow path should exist, and that some part of this slow path has to be executed with higher priority than the fast one, where do we arrive? Is it not exactly what we have right now? Clean fast path, separate slow path. Not good enough? Where? Let's find and fix this.

The slow path does exist, and retransmits and friends are there too in the new stack. And my initial netchannel implementation used the _existing_ socket code from process context. Again, there is no need to create two levels between fast and slow, or softirq and process, and it was proven and shown that it can perform faster.

Why don't you want to see that the existing model is just path enlargement: there can also be delays between hard and soft irqs, so acks will be delayed and so on... But the stack works without problems even if some kernel thread takes 100% cpu (with preemption) and there are very big delays for ack generation, although userspace cannot get that data. With netchannels it is essentially the same (heh, I have said that a lot of times already).

> Alexey

-- 
Evgeniy Polyakov
Re: Netchannels: first stage has been completed. Further ideas.

On Wed, 26 Jul 2006 23:00:28 -0700 (PDT) David Miller [EMAIL PROTECTED] wrote:

> From: Rusty Russell [EMAIL PROTECTED]
> Date: Thu, 27 Jul 2006 15:46:12 +1000
>
> > Yes, my first thought back in January was how netfilter would interact with this in a sane way. One answer is "don't": once someone registers on any hook we go into slow path. Another is to run the hooks in socket context, which is better, but precludes having the consumer in userspace, which still appeals to me 8)
>
> Small steps, small steps. I have not ruled out userspace TCP just yet, but we are not prepared to go there right now anyway. It is just the same kind of jump to go to kernel-level netchannels as it is to go from kernel-level netchannels to userspace netchannel-based TCP.

I think we sell our existing stack short. There are lots of opportunities left to look more closely at actual real performance bottlenecks and improve incrementally. But it requires tools, time, faster net hardware, and some creative insight. I guess it just isn't as cool.
Re: Netchannels: first stage has been completed. Further ideas.

Hello!

> kernel thread takes 100% cpu (with preemption

Preemption, you tell... :-) I begged you to spend one minute of your time and press ^Z. Did you?

Alexey
Re: Netchannels: first stage has been completed. Further ideas.

On Thu, 2006-07-27 at 20:33 +0400, Alexey Kuznetsov wrote:
> Hello!
>
> On Thu, Jul 27, 2006 at 03:46:12PM +1000, Rusty Russell wrote:
> > Of course, it means rewriting all the userspace tools, documentation, and creating a complete new infrastructure for connection tracking and NAT, but if that's what's required, then so be it.
>
> That's what I love to hear. Not a joke. :-) Could I only suggest not to relate this to netchannels? :-) In the past we used to call this thing the (grand-unified) flow cache.

Yes. Thank you for all your explanation, it was very helpful. I agree, the grand unified lookup idea returns 8). The netchannels proposal vs. netfilter forced me back into thinking about it again, but it is unrelated. Any netfilter bypass acceleration will want similar ideas.

I apologize for misreading your discussion of Evgeniy's implementation as the general channel problem. My mistake.

> > userspace), no dequeue lock is required.
>
> And that was a part of the second question. I do not see how single-threaded TCP is possible. In the receiver path it has to ack within quite strict time bounds, to delack etc.; in the sender path it has to slow start; and I am not even speaking of slow path things: retransmit, probing window, lingering without process context etc. It looks like VJ implies the protocol must be changed. We can't, we mustn't.

All good points. I can see two kinds of problems here: performance problems due to wakeup (eg. ack processing for a 5MB write), and correctness problems due to no kernel enforcement.

We need measurements for the performance issues, so I'll ignore them for the moment. For correctness: in true end-to-end style, where the kernel is just a router for userspace, we do not worry about such problems 8). In real life the kernel must enforce linger and sending-tuple correctness, but I don't know how much else we must regulate. Too much, and you are right: we have a slow/fast path split just like now.

> After we deidealize this idealization and recognize that some slow path should exist, and that some part of this slow path has to be executed with higher priority than the fast one, where do we arrive? Is it not exactly what we have right now? Clean fast path, separate slow path. Not good enough? Where? Let's find and fix this.

I am still not sure how significant the slow path is: if 99% can be in userspace, it could work very well for RDMA. I would like to have seen VJ's implementation so we could compare and steal bits.

Thanks,
Rusty.
-- 
Help! Save Australia from the worst of the DMCA: http://linux.org.au/law
Re: Netchannels: first stage has been completed. Further ideas.

On Fri, Jul 28, 2006 at 12:56:51AM +0400, Alexey Kuznetsov ([EMAIL PROTECTED]) wrote:
> Hello!
>
> > kernel thread takes 100% cpu (with preemption
>
> Preemption, you tell... :-) I begged you to spend one minute of your time and press ^Z. Did you?

What would you expect from a non-preemptible kernel? Hard lockup, no acks, no soft irqs. So this case still does not differ from process-context processing. And after several minutes I pressed the hardware reset button...

> Alexey

-- 
Evgeniy Polyakov
Re: Netchannels: first stage has been completed. Further ideas.

From: Evgeniy Polyakov [EMAIL PROTECTED]
Date: Fri, 28 Jul 2006 09:17:25 +0400

> What would you expect from a non-preemptible kernel? Hard lockup, no acks, no soft irqs.

Why does pressing Ctrl-Z on the user process stop kernel soft irq processing?
Re: Netchannels: first stage has been completed. Further ideas.

On Thu, Jul 27, 2006 at 10:34:00PM -0700, David Miller ([EMAIL PROTECTED]) wrote:
> From: Evgeniy Polyakov [EMAIL PROTECTED]
> Date: Fri, 28 Jul 2006 09:17:25 +0400
>
> > What would you expect from a non-preemptible kernel? Hard lockup, no acks, no soft irqs.
>
> Why does pressing Ctrl-Z on the user process stop kernel soft irq processing?

I do not know why Alexey decided that Ctrl-Z was ever pressed. I'm talking about the case when keventd ate 100% of the CPU, but the stack worked with (very) long delays. Obviously userspace was unresponsive and no data arrived there.

It is an analogy: postponed softirq work does not destroy connections, and neither does process-context protocol processing with delays. The user does not get its data, so there is no need to send an ack. And if it is impossible to get that data at all, the user should not care that the sending side does not see acks. When the user is capable of getting that data, it starts to acknowledge.

-- 
Evgeniy Polyakov
Re: Netchannels: first stage has been completed. Further ideas.

On Wed, 2006-07-26 at 23:00 -0700, David Miller wrote:
> From: Rusty Russell [EMAIL PROTECTED]
> Date: Thu, 27 Jul 2006 15:46:12 +1000
>
> > Yes, my first thought back in January was how netfilter would interact with this in a sane way. One answer is "don't": once someone registers on any hook we go into slow path. Another is to run the hooks in socket context, which is better, but precludes having the consumer in userspace, which still appeals to me 8)
>
> Small steps, small steps. I have not ruled out userspace TCP just yet, but we are not prepared to go there right now anyway. It is just the same kind of jump to go to kernel-level netchannels as it is to go from kernel-level netchannels to userspace netchannel-based TCP.

I think I was unclear; the possibility of userspace netchannels adds weight to the idea that we should rework netfilter hooks sooner rather than later.

> > What would the tuple look like? Off the top of my head: SRCIP/DSTIP/PROTO/SPT/DPT/IN/OUT (where IN and OUT are boolean values indicating whether the src/dest is local). Of course, it means rewriting all the userspace tools, documentation, and creating a complete new infrastructure for connection tracking and NAT, but if that's what's required, then so be it.
>
> I think we are finally able to talk seriously about revamping netfilter on this level, because we finally have a good incentive to do so and some kind of model exists to work against.
>
> Robert's trie might be able to handle your tuple very well, fwiw, perhaps even with prefixing.
>
> But something occurs to me. A socket has its ID when it is created and goes to established state. This means we have this tuple, and thus we can prelookup the netfilter rule and attach this cached lookup state to the socket. Your tuple in this case is defined to be: SRCIP/DSTIP/TCP/SPT/DPT/0/1
>
> I do not know how practical this is, it is just a suggestion.
>
> Would there be prefixing in these tuples? That's where the trouble starts. If you add prefixing, the troubles and limitations of today's lookups reappear. If you disallow prefixing, tables get very large but lookup becomes simpler and practical.

OK. AFAICT, there are three ideas in play here (ignoring netchannels). First, there should be a unified lookup for efficiency (Grand Unified Cache). Secondly, netfilter hook users need to publish information about what they are actually looking at if they are to use this lookup. Thirdly, smart cards can accelerate lookup.

(1) I am imagining some Grand Unified Flow Cache (Olsson trie?) that holds (some subset of?) flows. A successful lookup immediately after a packet comes off the NIC gives the destiny for the packet: what route, (optionally) what socket, what filtering, what connection tracking (& what NAT), etc. I don't know if this should be a general array of fn & data ptrs, or specialized fields for each one, or a mix. Maybe there's a "too hard, do slow path" bit, or maybe hard cases just never get put in the cache. Perhaps we need a separate one for locally-generated packets, a-la ip_route_output(). Anyway, we trade slightly more expensive flow setup for faster packet processing within flows.

(2) To make this work sanely in the presence of netfilter hooks, we need them to register the tuples they are interested in. Not at the hook level, but *in addition*. For example, we need to know which flows each packet filtering rule cares about. Connection tracking wants to see the first packet (and the first reply packet), but then probably only wants to see packets with RST/SYN/FIN set. (Erk, window tracking wants to see every packet, but maybe we could do something.) NAT definitely needs to see every packet on a connection which is NATed.

One way to do this is to add a have_interest callback into the hook_ops, which takes each about-to-be-inserted GUFC entry and adds any destinies this hook cares about (see the sketch after this mail). In the case of packet filtering this would do a traversal and append a fn/data ptr to the entry for each rule which could affect it.

The other way is to have the hooks register what they are interested in into a general data structure which GUFC entry creation then looks up itself. This general data structure will need to support wildcards, though.

We also need efficient ways of reflecting rule changes into the GUFC. We can be pretty slack with conntrack timeouts, but we either need to flush or handle callbacks from the GUFC on timed-out entries. Packet filtering changes need to be synchronous, definitely.

(3) Smart NICs that do some flowid work themselves can accelerate lookup implicitly (same flow goes to same CPU/thread) or explicitly (each CPU/thread maintains only the part of the GUFC which it needs, or even the NIC returns a flow cookie which is a pointer to a GUFC entry or subtree?). AFAICT this will magnify the payoff from the GUFC.

Sorry for the length,
Rusty.
-- 
Help! Save Australia from the worst of the DMCA: http://linux.org.au/law
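To make the have_interest idea concrete, a rough C sketch (all structure and function names here are invented for illustration; the tuple type is the one sketched earlier in the thread):

    /* One "destiny": an action to run on packets matching a GUFC entry. */
    struct gufc_destiny {
            void (*input)(struct sk_buff *skb, void *data);
            void *data;
    };

    #define GUFC_MAX_DESTINIES 8

    struct gufc_entry {
            struct nf_flow_tuple tuple;
            unsigned int too_hard : 1;      /* "do slow path" bit */
            unsigned int ndestinies;
            struct gufc_destiny destiny[GUFC_MAX_DESTINIES];
    };

    struct hook_ops {
            /* Called for each about-to-be-inserted GUFC entry; the hook
             * appends whatever destinies apply to this flow. */
            void (*have_interest)(struct gufc_entry *e);
    };

    struct filter_rule {
            struct filter_rule *next;
            /* match criteria elided */
    };
    static struct filter_rule *filter_rules;    /* head of the ruleset */

    /* Packet filtering: walk the ruleset once at flow setup time and
     * append a fn/data pair for each rule which could affect the flow. */
    static void filter_have_interest(struct gufc_entry *e)
    {
            struct filter_rule *r;

            for (r = filter_rules; r; r = r->next) {
                    if (!rule_could_match(r, &e->tuple))
                            continue;
                    if (e->ndestinies == GUFC_MAX_DESTINIES) {
                            e->too_hard = 1;        /* fall back to slow path */
                            break;
                    }
                    e->destiny[e->ndestinies].input = filter_rule_input;
                    e->destiny[e->ndestinies].data = r;
                    e->ndestinies++;
            }
    }

Per-packet processing then just runs the destiny array in order, with no per-packet ruleset traversal at all.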
Re: Netchannels: first stage has been completed. Further ideas.

On Wed, 2006-07-19 at 03:01 +0400, Alexey Kuznetsov wrote:
> Hello!
>
> Can I ask a couple of questions? Just as a person who looked at VJ's slides once and was confused. And startled, when I found that it is not considered as another joke of a genius. :-)

Hi Alexey!

> > About locks: is completely lockless (there is one irq lock when skb is queued/dequeued into netchannels queue in hard/soft irq,
>
> Equivalent of socket spinlock.

I don't think they are equivalent. In channels, this can be split into two locks, a queue lock and a dequeue lock, which operate independently. The socket spinlock cannot. Moreover, in the case where there is a guarantee about IRQs being bound to a single CPU (as in Dave's ideas on MSI), the queue lock is no longer required. In the case where there is a single reader of the socket (or, as VJ did, the other end is in userspace), no dequeue lock is required.

> VJ slides describe a totally different scheme, where the softirq part is omitted completely and protocol processing is moved to user space as a whole. It is an amazing toy. But I see nothing which could promote its status to practical. Exokernels used to do this thing for ages, and all the performance gains are compensated by an overcomplicated classification engine, which has to remain in kernel and essentially do the same work which the routing/firewalling/socket hash tables do.

My feeling is that modern cards will do partial demux for us; whether we use netchannels or not, we should use that to accelerate lookup. Making the card aim its MSI at the same CPU for the same flow is a start (and, as Dave said, much less code). As the next step, having the card give us a cookie too would allow us to explicitly skip the first level of lookup. This should allow us to identify which flows are simple enough to be directly accelerated (whether by channels or something else): no bonding, raw sockets, non-trivial netfilter rules, connection tracking changes, etc.

Thoughts?
Rusty.
-- 
Help! Save Australia from the worst of the DMCA: http://linux.org.au/law
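A sketch of what the cookie step could look like on the host side (illustrative names only; note the entry must be re-validated, since the flow may have died between the card's classification and the host seeing the packet):

    /* rx path: 'cookie' is an opaque value the host handed to the card
     * for this flow earlier; 0 means the card could not classify. */
    static void rx_packet(struct sk_buff *skb, uint64_t cookie)
    {
            struct gufc_entry *e;

            if (cookie) {
                    e = (struct gufc_entry *)(unsigned long)cookie;
                    if (gufc_entry_matches(e, skb)) {
                            gufc_deliver(e, skb);   /* first-level lookup skipped */
                            return;
                    }
                    /* Stale cookie: fall through. */
            }
            slow_path_classify(skb);    /* bonding, raw sockets, etc. */
    }

Handing raw pointers to hardware makes the synchronous-delete requirement mentioned elsewhere in this thread even stricter, so an index into a host-side table plus a generation count would probably be safer than a bare pointer.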
Re: Netchannels: first stage has been completed. Further ideas.

From: Rusty Russell [EMAIL PROTECTED]
Date: Thu, 27 Jul 2006 12:17:51 +1000

> On Wed, 2006-07-19 at 03:01 +0400, Alexey Kuznetsov wrote:
> > > About locks: is completely lockless (there is one irq lock when skb is queued/dequeued into netchannels queue in hard/soft irq,
> >
> > Equivalent of socket spinlock.
>
> I don't think they are equivalent. In channels, this can be split into two locks, a queue lock and a dequeue lock, which operate independently. The socket spinlock cannot. Moreover, in the case where there is a guarantee about IRQs being bound to a single CPU (as in Dave's ideas on MSI), the queue lock is no longer required. In the case where there is a single reader of the socket (or, as VJ did, the other end is in userspace), no dequeue lock is required.

Cost is a very interesting question here. I guess your main point is that eventually this lock can be made to go away, whereas Alexey speaks about the state of Evgeniy's specific implementation.

> My feeling is that modern cards will do partial demux for us; whether we use netchannels or not, we should use that to accelerate lookup. Making the card aim its MSI at the same CPU for the same flow is a start (and, as Dave said, much less code). As the next step, having the card give us a cookie too would allow us to explicitly skip the first level of lookup. This should allow us to identify which flows are simple enough to be directly accelerated (whether by channels or something else): no bonding, raw sockets, non-trivial netfilter rules, connection tracking changes, etc.

I read this as "we will be able to get around the problems", but with no specific answer as to how. I am an optimist too, but I want to start seeing concrete discussion about the way in which the problems will be dealt with. Alexey has some ideas, such as running the netfilter path from the netchannel consumer socket context. That is the kind of thing we need to be talking about.

Robert Olsson is also doing some work involving full flow classification using special trie structures in the routing cache that might be extendable to netchannels. His trick is to watch for the FIN shutdown sequence and GC the route cache entries for a flow when this is seen. This is in order to keep the trie shallow and thus have a better bound on memory accesses for routing lookups.

We are not a group of mathematicians discussing the tractability of some problem. Our interest is practice, not theory. :)
Re: Netchannels: first stage has been completed. Further ideas.

On Wed, 2006-07-26 at 22:17 -0700, David Miller wrote:
> I read this as "we will be able to get around the problems", but with no specific answer as to how. I am an optimist too, but I want to start seeing concrete discussion about the way in which the problems will be dealt with. Alexey has some ideas, such as running the netfilter path from the netchannel consumer socket context. That is the kind of thing we need to be talking about.

Yes, my first thought back in January was how netfilter would interact with this in a sane way. One answer is "don't": once someone registers on any hook we go into slow path. Another is to run the hooks in socket context, which is better, but precludes having the consumer in userspace, which still appeals to me 8)

So I don't like either. The mistake (?) with netfilter was that we are completely general: you will see all packets, do what you want. If, instead, we had forced all rules to be of the form "show me all packets matching this tuple", we would be in a position to combine it into a single lookup with routing etc.

What would the tuple look like? Off the top of my head: SRCIP/DSTIP/PROTO/SPT/DPT/IN/OUT (where IN and OUT are boolean values indicating whether the src/dest is local). Of course, it means rewriting all the userspace tools, documentation, and creating a complete new infrastructure for connection tracking and NAT, but if that's what's required, then so be it.

Rusty.
-- 
Help! Save Australia from the worst of the DMCA: http://linux.org.au/law
Re: Netchannels: first stage has been completed. Further ideas.

On Wed, 19 Jul 2006 13:01:50 -0700 (PDT) David Miller [EMAIL PROTECTED] wrote:

> From: Stephen Hemminger [EMAIL PROTECTED]
> Date: Wed, 19 Jul 2006 15:52:04 -0400
>
> > As a related note, I am looking into fixing the inet hash tables to use RCU. IBM had posted a patch a long time ago, which would be not so hard to munge into the current tree.
>
> See if you can spot it in the archives :)

Srivatsa Vaddagiri from IBM did a patch: http://lkml.org/lkml/2004/8/31/129
And Ben had a patch: http://lwn.net/Articles/174596/

Srivatsa's was more complete but pre-dates Acme's rearrangement. Also, there is some code for refcnts in it that looks wrong, or at a minimum is masking underlying design flaws:

     /* Ungrab socket and destroy it, if it was the last reference. */
     static inline void sock_put(struct sock *sk)
     {
    -	if (atomic_dec_and_test(&sk->sk_refcnt))
    -		sk_free(sk);
    +sp_loop:
    +	if (atomic_dec_and_test(&sk->sk_refcnt)) {
    +		/* Restore ref count and schedule callback.
    +		 * If we don't restore ref count, then the callback can be
    +		 * scheduled by more than one CPU.
    +		 */
    +		atomic_inc(&sk->sk_refcnt);
    +
    +		if (atomic_read(&sk->sk_refcnt) == 1)
    +			call_rcu(&sk->sk_rcu, sk_free_rcu);
    +		else
    +			goto sp_loop;
    +	}
     }

Ben's still left reader-writer locks, and needed IPv6 work. He said he plans to get back to it.

-- 
Stephen Hemminger [EMAIL PROTECTED]
"And in the Packet there writ down that doome"
Re: Netchannels: first stage has been completed. Further ideas.

Hello!

> Also, there is some code for refcnts in it that looks wrong.

Yes, it is disgusting. RCU does not allow increasing the socket refcnt in the lookup routine. Ben's version looks cleaner here; it does not touch the refcnt in rcu lookups. But it is dubious too:

    do_time_wait:
    +	sock_hold(sk);

is obviously in violation of the rule. Probably, the rcu lookup should do something like:

    if (!atomic_inc_not_zero(&sk->sk_refcnt))
    	pretend_it_is_not_found;

It is clear Ben did not look into the IBM patch, because one known place of trouble is missed: when a socket moves from established to timewait, the timewait bucket must be inserted before the established socket is removed.

Alexey
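Spelled out, that rule would make the lookup look roughly like this (a sketch; sk_match(), struct flow_key and the chain-walk iterator are stand-ins for the real hash chain code):

    /* RCU-protected established-hash lookup.  A socket whose refcnt has
     * already dropped to zero is on its way to being freed via RCU and
     * must be treated as not found. */
    static struct sock *rcu_sk_lookup(struct hlist_head *chain,
                                      const struct flow_key *key)
    {
            struct sock *sk, *found = NULL;

            rcu_read_lock();
            sk_for_each_rcu(sk, chain) {
                    if (!sk_match(sk, key))
                            continue;
                    if (atomic_inc_not_zero(&sk->sk_refcnt))
                            found = sk;     /* else: pretend it is not found */
                    break;
            }
            rcu_read_unlock();
            return found;   /* caller does sock_put() when finished */
    }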
RE: Netchannels: first stage has been completed. Further ideas.

[EMAIL PROTECTED] wrote:
> Evgeniy Polyakov wrote:
> > On Thu, Jul 20, 2006 at 02:21:57PM -0700, Ben Greear ([EMAIL PROTECTED]) wrote:
> > > Out of curiosity, is it possible to have the single producer logic if you have two+ ethernet interfaces handling frames for a single TCP connection? (I am assuming some sort of multi-path routing logic...)
> >
> > I do not think it is possible with additional logic like what is implemented in softirqs, i.e. per-cpu queues of data, which in turn will be converted into skbs one-by-one.
>
> Couldn't you have two NICs being handled by two separate CPUs, with both CPUs trying to write to the same socket queue? The receive path works with RCU locking from what I understand, so a protocol's receive function must be re-entrant.

Wouldn't it be easier simply not to have two NICs feed the same ring? What packets end up in which ring is fully controllable. On the rare occasion that a single connection must be fed by two NICs, a software merge of the two rings would be far cheaper than having to coordinate between producers all the time.
Re: Netchannels: first stage has been completed. Further ideas.

On Thu, Jul 20, 2006 at 09:55:04PM -0700, David Miller ([EMAIL PROTECTED]) wrote:
> From: Alexey Kuznetsov [EMAIL PROTECTED]
> Date: Fri, 21 Jul 2006 02:59:08 +0400
>
> > > Moving the protocol (no matter if it is TCP or not) closer to the user allows natural control of the dataflow - when the user can read that data (and _this_ is the main goal), the user acks; when it can not, it does not generate an ack. In theory
> >
> > To all that I remember, in theory the absence of feedback leads to loss of control yet. The same is in practice, unfortunately. You must say that the window is closed, otherwise the sender is totally confused.
>
> Correct, and too large a delay even results in retransmits. You can say that RTT will be adjusted by the delay of the ACK, but if the user context switches cleanly at the beginning, resulting in near-immediate ACKs, and then blocks later, you will get spurious retransmits. Alexey's example of blocking on a disk write is a good one. I really don't like when pure NULL data sinks are used for benchmarking these kinds of things, because real applications 1) touch the data, 2) do something with that data, and 3) have some life outside of TCP!

And what will happen with sockets? Data will arrive and acks will be generated until the queue is filled and duplicate acks start to be sent, thus reducing the window even more. The results _are_ the same, both will have duplicate acks and so on, but with netchannels there is no complex queue management, no two or more rings where data is processed (bh, process context and so on), no locks and ... hugh, I recall I wrote it already several times :)

My userspace applications do memset, and actually writing data into /dev/null through the stdout pipe does not change the overall picture. I have read a lot of your criticism of benchmarking, so I'm ready :)

> If you optimize an application that does nothing with the data it receives, you have likewise optimized nothing :-)

I've run that test - dump all data into a file through a pipe. 84-byte packet bulk receiving:

    netchannels: 8 Mb/sec (down to 6 when the VFS cache is filled)
    socket:      7 Mb/sec (down to 6 when the VFS cache is filled)

So you asked to create a narrow pipe, and the speed becomes equal to the speed of that pipe. No more, no less.

> All this talk reminds me of one thing, how expensive tcp_ack() is. And this expense has nothing to do with TCP really. The main cost is purging and freeing up the skbs which have been ACK'd in the retransmit queue.

Yes, allocation always takes first place in all profiles. I'm working to eliminate that - it is a side effect of the zero-copy networking design I'm working on right now.

> So tcp_ack() sort of inherits the cost of freeing a bunch of SKBs which haven't been touched by the cpu in some time and are thus nearly guaranteed to be cold in the cache. This is the kind of work we could think about batching to the user sleeping on some socket call. Also notice that the retransmit queue is potentially a good use of an array similar to the VJ netchannel lockless queue data structure. :)

An array has a lot of disadvantages with its resizing; there will be a lot of trouble with recv/send queue length changes. But it allows removing several pointers from the skb, which is always a good start.

> BTW, notice that TSO makes this work touch less skb state. TSO also decreases cpu utilization. I think these two things are no coincidence. :-)

TSO/GSO is definitely a good idea, but it is completely unrelated to the other problems. If it were implemented with netchannels we would have even better performance.

-- 
Evgeniy Polyakov
Re: Netchannels: first stage has been completed. Further ideas.

On Thu, Jul 20, 2006 at 02:21:57PM -0700, Ben Greear ([EMAIL PROTECTED]) wrote:
> Out of curiosity, is it possible to have the single producer logic if you have two+ ethernet interfaces handling frames for a single TCP connection? (I am assuming some sort of multi-path routing logic...)

I do not think it is possible with additional logic like what is implemented in softirqs, i.e. per-cpu queues of data, which in turn will be converted into skbs one-by-one.

> Thanks,
> Ben

-- 
Evgeniy Polyakov
Re: Netchannels: first stage has been completed. Further ideas.

On Fri, Jul 21, 2006 at 11:19:00AM +0400, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote:
> On Thu, Jul 20, 2006 at 02:21:57PM -0700, Ben Greear ([EMAIL PROTECTED]) wrote:
> > Out of curiosity, is it possible to have the single producer logic if you have two+ ethernet interfaces handling frames for a single TCP connection? (I am assuming some sort of multi-path routing logic...)
>
> I do not think it is possible with additional logic like what is

I think it is possible ...

-- 
Evgeniy Polyakov
Re: Netchannels: first stage has been completed. Further ideas.

On Fri, Jul 21, 2006 at 09:40:32AM +1200, Ian McDonald ([EMAIL PROTECTED]) wrote:
> > If we consider netchannels as Van Jacobson described them, then a mutex is not needed, since it is impossible to have several readers or writers. But in the socket case, even if there is only one userspace consumer, that lock must be held to protect against bh (or introduce several queues and complicate their management a lot (ucopy for example)).
>
> As I recall Van's talk, you don't need a lock with a ring buffer if you have a start and an end variable pointing to locations within the ring buffer. He didn't explain this in great depth as it is computer science 101, but here is how I would explain it: once the socket is initialised, the consumer is the only one that sets the start variable, and the network driver only reads it. It is the other way around for the end variable. As long as the writes are atomic then you are fine. You only need one ring buffer in this scenario and two atomic variables. Having atomic writes does have overhead, but far less than locking semantics.

With netchannels and one data producer it does not even need to be atomic. Problems start to appear when there are several producers or consumers - then either atomic or locking logic must indeed be implemented.

> --
> Ian McDonald
> Web: http://wand.net.nz/~iam4
> Blog: http://imcdnzl.blogspot.com
> WAND Network Research Group
> Department of Computer Science
> University of Waikato
> New Zealand

-- 
Evgeniy Polyakov
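Ian's description is the classic single-producer/single-consumer ring. A compilable userspace sketch, with C11 atomics standing in for whatever barriers a kernel version would use:

    #include <stdatomic.h>
    #include <stdbool.h>

    #define RING_SIZE 256                   /* power of two */

    struct ring {
            void *slot[RING_SIZE];
            _Atomic unsigned int head;      /* written only by the producer */
            _Atomic unsigned int tail;      /* written only by the consumer */
    };

    /* Producer (driver) side: returns false when the ring is full. */
    static bool ring_put(struct ring *r, void *pkt)
    {
            unsigned int h = atomic_load_explicit(&r->head, memory_order_relaxed);
            unsigned int t = atomic_load_explicit(&r->tail, memory_order_acquire);

            if (h - t == RING_SIZE)
                    return false;
            r->slot[h & (RING_SIZE - 1)] = pkt;
            /* Publish the slot contents before the new head becomes visible. */
            atomic_store_explicit(&r->head, h + 1, memory_order_release);
            return true;
    }

    /* Consumer (user context) side: returns NULL when empty. */
    static void *ring_get(struct ring *r)
    {
            unsigned int t = atomic_load_explicit(&r->tail, memory_order_relaxed);
            unsigned int h = atomic_load_explicit(&r->head, memory_order_acquire);
            void *pkt;

            if (t == h)
                    return NULL;
            pkt = r->slot[t & (RING_SIZE - 1)];
            atomic_store_explicit(&r->tail, t + 1, memory_order_release);
            return pkt;
    }

With exactly one producer and one consumer, neither side ever writes the other's index, so no lock or read-modify-write atomic is needed - only the ordering of the plain stores, which is the point being made above.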
Re: Netchannels: first stage has been completed. Further ideas.

From: Evgeniy Polyakov [EMAIL PROTECTED]
Date: Fri, 21 Jul 2006 11:10:10 +0400

> On Thu, Jul 20, 2006 at 09:55:04PM -0700, David Miller ([EMAIL PROTECTED]) wrote:
> > Correct, and too large a delay even results in retransmits. You can say that RTT will be adjusted by the delay of the ACK, but if the user context switches cleanly at the beginning, resulting in near-immediate ACKs, and then blocks later, you will get spurious retransmits. Alexey's example of blocking on a disk write is a good one. I really don't like when pure NULL data sinks are used for benchmarking these kinds of things, because real applications 1) touch the data, 2) do something with that data, and 3) have some life outside of TCP!
>
> And what will happen with sockets? Data will arrive and acks will be generated until the queue is filled and duplicate acks start to be sent, thus reducing the window even more. The results _are_ the same, both will have duplicate acks and so on, but with netchannels there is no complex queue management, no two or more rings where data is processed (bh, process context and so on), no locks and ... hugh, I recall I wrote it already several times :)

Packets will be retransmitted spuriously and unnecessarily, and we cannot over-stress how bad this is. Sure, your local 1gbit network can absorb this extra cost when the application is blocked for a long time, but in the real internet it is a real concern.

Please address the fact that your design makes for retransmits that are totally unnecessary. Your TCP stack is flawed if it allows this to happen. Proper closing of the window and timely ACKs are not some optional feature of TCP, they are in fact mandatory. If you want to bypass these things, this is fine, but do not name it TCP :-)))

As a related example, deeply stretched ACKs can help and are perfect when there is no packet loss. But in the event of packet loss a stretch ACK will kill performance, because it makes packet loss recovery take at least one extra round trip to occur. Therefore I disabled stretch ACKs in the input path of TCP last year.

> > If you optimize an application that does nothing with the data it receives, you have likewise optimized nothing :-)
>
> I've run that test - dump all data into a file through a pipe. 84-byte packet bulk receiving: netchannels: 8 Mb/sec (down to 6 when the VFS cache is filled); socket: 7 Mb/sec (down to 6 when the VFS cache is filled). So you asked to create a narrow pipe, and the speed becomes equal to the speed of that pipe. No more, no less.

If you cause unnecessary retransmits, you add unnecessary congestion to the network for other flows.

> > All this talk reminds me of one thing, how expensive tcp_ack() is. And this expense has nothing to do with TCP really. The main cost is purging and freeing up the skbs which have been ACK'd in the retransmit queue.
>
> Yes, allocation always takes first place in all profiles. I'm working to eliminate that - it is a side effect of the zero-copy networking design I'm working on right now.

When you say these things over and over again, people like Alexey and myself perceive it as "La la la la, I'm not listening to you guys".

Our point is not that your work cannot lead you to fixing these problems. Our point is that the existing TCP stack can have these problems fixed too, with the advantage that we don't need all the negative aspects of moving TCP into userspace. You can eliminate allocation overhead in our existing stack with the simple design I outlined. In fact, I outlined two approaches; there is such an abundance of ways to do it that you have a choice of which one you like best :)

> An array has a lot of disadvantages with its resizing; there will be a lot of trouble with recv/send queue length changes. But it allows removing several pointers from the skb, which is always a good start.

Yes, it is something to consider. Large pipes with 4000+ packet windows present considerable problems in this area.

> TSO/GSO is definitely a good idea, but it is completely unrelated to the other problems. If it were implemented with netchannels we would have even better performance.

I like TSO-like ideas because they point to solutions within the existing stack. Radical changes are great when they buy us something that is impossible with the current design. A lot of the things being shown and discussed here are indeed possible with the current design.

You have a nice toy and you should be proud of it, but do not make it into a panacea.
Re: Netchannels: first stage has been completed. Further ideas.

On Fri, Jul 21, 2006 at 12:47:13AM -0700, David Miller ([EMAIL PROTECTED]) wrote:
> > > Correct, and too large a delay even results in retransmits. You can say that RTT will be adjusted by the delay of the ACK, but if the user context switches cleanly at the beginning, resulting in near-immediate ACKs, and then blocks later, you will get spurious retransmits. Alexey's example of blocking on a disk write is a good one. I really don't like when pure NULL data sinks are used for benchmarking these kinds of things, because real applications 1) touch the data, 2) do something with that data, and 3) have some life outside of TCP!
> >
> > And what will happen with sockets? Data will arrive and acks will be generated until the queue is filled and duplicate acks start to be sent, thus reducing the window even more. The results _are_ the same, both will have duplicate acks and so on, but with netchannels there is no complex queue management, no two or more rings where data is processed (bh, process context and so on), no locks and ... hugh, I recall I wrote it already several times :)
>
> Packets will be retransmitted spuriously and unnecessarily, and we cannot over-stress how bad this is.

"In theory, practice and theory are the same, but in practice they are different" (c) Larry McVoy, as far as I recall :)

And even in theory Linux behaves the same. I see the only point of contention about process-context tcp processing in the following issue: we start a tcp connection, and acks are generated very fast; then suddenly the receiving userspace is blocked. In that case the BH-processing apologists state that the sending side starts to retransmit.

Let's see how it works. If the receiving side has worked at maximum speed for a long time, then the window is opened wide enough, so it can even exceed the socket buffer size (max 200k; I saw several-megabyte socket windows in my tests), and the sending side will continue to send until the window is filled. The receiving side, no matter if it is socket or netchannel, will drop packets (socket due to queue overfill; netchannels will not drop, but will not ack (its maximum queue len is 1mb)). So both approaches behave _exactly_ the same. Did I miss something?

Btw, here are the tests which were run with netchannels:
 * surfing the web (index pages of different remote sites only)
 * 1gb transfers
 * 1gb - 100mb transfers

> Sure, your local 1gbit network can absorb this extra cost when the application is blocked for a long time, but in the real internet it is a real concern.

Writing into a pipe (or into a 100mb NIC) and a file is a real internet example - data is blocked, acks and retransmits happen.

> Please address the fact that your design makes for retransmits that are totally unnecessary. Your TCP stack is flawed if it allows this to happen. Proper closing of the window and timely ACKs are not some optional feature of TCP, they are in fact mandatory. If you want to bypass these things, this is fine, but do not name it TCP :-)))

Hey, you did not look into atcp.c in my patches :)

> As a related example, deeply stretched ACKs can help and are perfect when there is no packet loss. But in the event of packet loss a stretch ACK will kill performance, because it makes packet loss recovery take at least one extra round trip to occur. Therefore I disabled stretch ACKs in the input path of TCP last year.

For slow start it is definitely a must. If the stretching algo is based on timers and round trip time, then I do not have that in atcp; proper delaying based on sequence numbers is used instead.

> > > If you optimize an application that does nothing with the data it receives, you have likewise optimized nothing :-)
> >
> > I've run that test - dump all data into a file through a pipe. 84-byte packet bulk receiving: netchannels: 8 Mb/sec (down to 6 when the VFS cache is filled); socket: 7 Mb/sec (down to 6 when the VFS cache is filled). So you asked to create a narrow pipe, and the speed becomes equal to the speed of that pipe. No more, no less.
>
> If you cause unnecessary retransmits, you add unnecessary congestion to the network for other flows.

Please refer to my description above. The situation is perfectly the same with socket code and with netchannels.

> > > All this talk reminds me of one thing, how expensive tcp_ack() is. And this expense has nothing to do with TCP really. The main cost is purging and freeing up the skbs which have been ACK'd in the retransmit queue.
> >
> > Yes, allocation always takes first place in all profiles. I'm working to eliminate that - it is a side effect of the zero-copy networking design I'm working on right now.
>
> When you say these things over and over again, people like Alexey and myself perceive it as "La la la la, I'm not listening to you guys".

Hmm, I've confirmed that allocation is a problem no matter which stack is used. My fix for that problem has nothing special to do with netchannels at all.

> Our point is not that your work cannot lead you to fixing these problems. Our point is that the existing TCP stack can have these problems fixed too!
Re: Netchannels: first stage has been completed. Further ideas.

From: Evgeniy Polyakov [EMAIL PROTECTED]
Date: Fri, 21 Jul 2006 13:06:11 +0400

> The receiving side, no matter if it is socket or netchannel, will drop packets (socket due to queue overfill; netchannels will not drop, but will not ack (its maximum queue len is 1mb)). So both approaches behave _exactly_ the same. Did I miss something?

The socket will not drop the packets on receive, because the sender will not violate the window which the receiver advertises; therefore there is no reason to drop the packets.
Re: Netchannels: first stage has been completed. Further ideas.

On Fri, Jul 21, 2006 at 02:19:55AM -0700, David Miller ([EMAIL PROTECTED]) wrote:
> From: Evgeniy Polyakov [EMAIL PROTECTED]
> Date: Fri, 21 Jul 2006 13:06:11 +0400
>
> > The receiving side, no matter if it is socket or netchannel, will drop packets (socket due to queue overfill; netchannels will not drop, but will not ack (its maximum queue len is 1mb)). So both approaches behave _exactly_ the same. Did I miss something?
>
> The socket will not drop the packets on receive, because the sender will not violate the window which the receiver advertises; therefore there is no reason to drop the packets.

How come? sk_stream_rmem_schedule(), sk_rmem_alloc and friends...

-- 
Evgeniy Polyakov
Re: Netchannels: first stage has been completed. Further ideas.

From: Evgeniy Polyakov [EMAIL PROTECTED]
Date: Fri, 21 Jul 2006 13:39:09 +0400

> On Fri, Jul 21, 2006 at 02:19:55AM -0700, David Miller ([EMAIL PROTECTED]) wrote:
> > From: Evgeniy Polyakov [EMAIL PROTECTED]
> > Date: Fri, 21 Jul 2006 13:06:11 +0400
> >
> > > The receiving side, no matter if it is socket or netchannel, will drop packets (socket due to queue overfill; netchannels will not drop, but will not ack (its maximum queue len is 1mb)). So both approaches behave _exactly_ the same. Did I miss something?
> >
> > The socket will not drop the packets on receive, because the sender will not violate the window which the receiver advertises; therefore there is no reason to drop the packets.
>
> How come? sk_stream_rmem_schedule(), sk_rmem_alloc and friends...

sk_stream_rmem_schedule() allocates bytes from the global memory pool quota for TCP sockets. It is not something that will trigger when, for example, an application blocks on a disk write.

In fact it will rarely trigger once the size of the window is known, since sk_forward_alloc will grow to fill that size and then statically stay at that value, able to service all allocation requests in the future. Only when there is severe global TCP memory pressure will it be decreased. And again, this isn't something which happens when a user simply blocks on some non-TCP operation.
Re: Netchannels: first stage has been completed. Further ideas.

Evgeniy Polyakov wrote:
> On Thu, Jul 20, 2006 at 02:21:57PM -0700, Ben Greear ([EMAIL PROTECTED]) wrote:
> > Out of curiosity, is it possible to have the single producer logic if you have two+ ethernet interfaces handling frames for a single TCP connection? (I am assuming some sort of multi-path routing logic...)
>
> I do not think it is possible with additional logic like what is implemented in softirqs, i.e. per-cpu queues of data, which in turn will be converted into skbs one-by-one.

Couldn't you have two NICs being handled by two separate CPUs, with both CPUs trying to write to the same socket queue? The receive path works with RCU locking from what I understand, so a protocol's receive function must be re-entrant.

-- 
Ben Greear [EMAIL PROTECTED]
Candela Technologies Inc http://www.candelatech.com
Re: Netchannels: first stage has been completed. Further ideas.

> All this talk reminds me of one thing, how expensive tcp_ack() is. And this expense has nothing to do with TCP really. The main cost is purging and freeing up the skbs which have been ACK'd in the retransmit queue. So tcp_ack() sort of inherits the cost of freeing a bunch of SKBs which haven't been touched by the cpu in some time and are thus nearly guaranteed to be cold in the cache. This is the kind of work we could think about batching to the user sleeping on some socket call.

Ultimately, isn't that just trying to squeeze the balloon?

rick jones
nice to see people seeing ACKs as expensive though :)
Re: Netchannels: first stage has been completed. Further ideas.

On Fri, Jul 21, 2006 at 09:14:39AM -0700, Ben Greear ([EMAIL PROTECTED]) wrote:
> > > Out of curiosity, is it possible to have the single producer logic if you have two+ ethernet interfaces handling frames for a single TCP connection? (I am assuming some sort of multi-path routing logic...)
> >
> > I do not think it is possible with additional logic like what is implemented in softirqs, i.e. per-cpu queues of data, which in turn will be converted into skbs one-by-one.
>
> Couldn't you have two NICs being handled by two separate CPUs, with both CPUs trying to write to the same socket queue? The receive path works with RCU locking from what I understand, so a protocol's receive function must be re-entrant.

There will not be a socket queue at that stage - only per-cpu queues, which will then be processed one-by-one by _exactly_ one user. That user can get the skbs in a round-robin manner, put them into the socket queue and call the protocol receiving function.

> --
> Ben Greear [EMAIL PROTECTED]
> Candela Technologies Inc http://www.candelatech.com

-- 
Evgeniy Polyakov
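A sketch of that single-consumer arrangement (names invented for illustration; the per-cpu queues here are the SPSC rings sketched earlier in the thread, one per input CPU, so no queue ever has more than one producer):

    #define NR_INPUT_CPUS 4     /* illustrative */

    struct netchannel {
            struct ring cpu_q[NR_INPUT_CPUS];   /* one lockless ring per CPU */
    };

    /* Exactly one consumer runs this, in process context.  Each ring has
     * exactly one producer (that CPU's rx path), so the merge needs no
     * locks at all.  deliver_to_socket_queue() is a hypothetical helper
     * which queues the packet and calls the protocol receive function. */
    static void netchannel_drain(struct netchannel *nc)
    {
            bool progress = true;
            void *pkt;
            int cpu;

            while (progress) {
                    progress = false;
                    for (cpu = 0; cpu < NR_INPUT_CPUS; cpu++) {
                            pkt = ring_get(&nc->cpu_q[cpu]);    /* round-robin */
                            if (!pkt)
                                    continue;
                            deliver_to_socket_queue(nc, pkt);
                            progress = true;
                    }
            }
    }

This is also the "software merge of the two rings" suggested earlier for the rare multi-NIC connection case.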
Re: Netchannles: first stage has been completed. Further ideas.
From: Rick Jones [EMAIL PROTECTED] Date: Fri, 21 Jul 2006 09:26:42 -0700 All this talk reminds me of one thing, how expensive tcp_ack() is. And this expense has nothing to do with TCP really. The main cost is purging and freeing up the skbs which have been ACK'd in the retransmit queue. So tcp_ack() sort of inherits the cost of freeing a bunch of SKBs which haven't been touched by the cpu in some time and are thus nearly guaranteed to be cold in the cache. This is the kind of work we could think about batching to the user sleeping on some socket call. Ultimately isn't that just trying to squeeze the balloon? In this case, the goal is not to eliminate the cost, but to move it to user context so that it: 1) gets charged to the user instead of being lost in the ether of anonymous software interrupt execution, and more importantly... 2) gets moved to the cpu where the user socket code is executing, instead of the cpu where the ACK packet arrives, which is basically arbitrary. #2 is in line with the system-level end-to-end principle goals of netchannels.
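(One way to picture the batching discussed above - a sketch only, not kernel code: the ACK path would merely unlink the acked skbs onto a private list, and the cache-cold freeing would happen later, in the socket caller's context, on the user's cpu. All names here are illustrative:)

struct skb_node {
	struct skb_node *next;
	/* ... skb payload would live here ... */
};

struct deferred_free {
	struct skb_node *head;		/* acked, not yet freed */
};

/* ACK processing (softirq analogue): just unlink, never touch the
 * cold skb contents here. */
static void ack_unlink(struct deferred_free *d, struct skb_node *skb)
{
	skb->next = d->head;
	d->head = skb;
}

/* recvmsg()/sendmsg() analogue (user context, user's cpu): pay the
 * cache-miss cost of the actual freeing here, charged to the user. */
static void flush_deferred(struct deferred_free *d,
			   void (*free_skb)(struct skb_node *))
{
	struct skb_node *skb, *next;

	for (skb = d->head; skb; skb = next) {
		next = skb->next;
		free_skb(skb);
	}
	d->head = NULL;
}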
Re: Netchannles: first stage has been completed. Further ideas.
Hello! Hello, Alexey. [ Sorry for the long delay, there are some problems with mail servers, so I can not access them remotely, so I create mail by hand; hopefully the thread will not be broken. ] There is no socket spinlock anymore. Above lock is the skb_queue lock which is held inside skb_dequeue/skb_queue_tail calls. The lock is named differently, but it is still here. BTW for UDP even the name is the same. There is no bh processing; that lock is needed for 4 operations when an skb is enqueued/dequeued. And if I changed skbs to different structures there would be no locks at all - it is extremely lightweight, it can not be compared with the socket lock at all. No bh/irq processing at all, natural speed management - that is the main idea behind netchannels. Equivalent of socket user lock. No, it is an equivalent for the hash lock in the socket table. OK. But you have to introduce a socket mutex somewhere in any case. Even in ATCP. Actually not - VJ's idea is to have only one consumer and one producer, so no locks are needed, but I agree, in the general case it is needed, but _only_ to protect against several netchannel userspace consumers. There is no BH protocol processing at all, so there is no need to protect against someone who will add data while you are processing your own chunk. Just an example - tcp_established() can be called with bh disabled under the socket lock. When we have a process context in hand, it is not. Did you ask yourself why we do not put all the packets to backlog/prequeue and just wait until the user reads the data? It would be 100% equivalent to netchannels. How many hacks just to be a bit closer to userspace processing, implemented in netchannels! The answer is simple: because we cannot wait. If the user delays for 200 msec, expect connection collapse due to retransmissions. If the segment is out of order, immediate attention is required. Any scheme which tries to wait for the user unconditionally at least has to run a watchdog timer, which fires before the sender senses the gap. If userspace is scheduled away for too much time, it is bloody wrong to ack data that is impossible to read due to the fact that the system is busy. It is just postponing the work from one end to another - ack now and stop when the queue is full, or postpone the ack generation until the segment is really being read. And this is what we have done for ages. Grep for VJ in the sources. :-) Netchannels have nothing to do with it, it is a much older idea. And it was Van who decided to move away from BH/irq processing. It was a slow and somewhat painful way (how many hacks with prequeue, with direct processing; it is enough just to look at how the TCP socket lock is taken in different contexts :) In that case one copies the whole data into userspace, so access to 20 bytes of headers completely does not matter. For short packets it matters. But I did not say this. I said it looks _worse_. A bit, but worse. At least for 80 bytes it does not matter at all. And it is very likely that data is misaligned, so half of the header will be in a cache line. And socket code has the same problem - skb->cb can be flushed away, and tcp_recvmsg() needs to get it again. And actually I never understood nano-optimisation behind more serious problems (i.e. one cache line vs. 50MB/sec speed). Hmm, for 80-byte sized packets the win was about 2.5 times. Could you please show me the lines inside the existing code which should be commented out, so that I get 50Mbyte/sec from that? If I knew, it would be done. :-) Actually, that is the action which I would expect. This, but not dropping the whole TCP stack.
I tried to use the existing one, and I had a speed and CPU usage win, but its magnitude was not what I expected, so I started a userspace network stack implementation. It succeeded, and there are _very_ major optimisations over the existing code when processing is fully moved into userspace, but there are also big problems, like one syscall per ack, so I decided to use that stack as a base for in-kernel process-context protocol processing, and I succeeded. Probably I will return to the userspace network stack idea when I complete zero-copy networking support. I showed there that using the existing stack it is impossible Please, understand, it is such statements that compromise your work. If it is impossible then it is not interesting. Do not mix warm with soft - I just post the facts, that the netchannel TCP implementation works (sometimes much) faster. It is the socket code that probably has some misoptimisations, and if it is impossible to fix them (well, at least it is very hard), then it is not interesting. I definitely do not say that it must be removed/replaced/anything - it works perfectly ok, but it is possible to have better performance by changing the architecture, and it was done. Alexey -- Evgeniy Polyakov
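(For reference, the queueing referred to in this exchange reduces to roughly the following pattern. skb_queue_tail() and skb_dequeue() are the real kernel primitives, each taking the queue's spinlock internally; the surrounding functions are sketched:)

#include <linux/skbuff.h>

/* Producer side, hard/soft irq: the spinlock taken inside
 * skb_queue_tail() is the only lock on this path. */
static void netchannel_enqueue(struct sk_buff_head *q,
			       struct sk_buff *skb)
{
	skb_queue_tail(q, skb);
}

/* Consumer side, process context: skb_dequeue() takes the same
 * queue lock for the unlink and nothing else. */
static struct sk_buff *netchannel_dequeue(struct sk_buff_head *q)
{
	return skb_dequeue(q);
}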
Re: Netchannles: first stage has been completed. Further ideas.
Hello. [ Sorry for the long delay, there are some problems with mail servers, so I can not access them remotely, so I create mail by hand; hopefully the thread will not be broken. ] Your description makes it sound as if you would take a huge leap, changing all in-kernel code _and_ the userspace interface in a single patch. Am I wrong? Or am I right and would it make sense to extract small incremental steps from your patch similar to those Van did in his non-published work? My first implementation used existing kernel code and showed a small performance win - there was binding of the socket to a netchannel and all protocol processing was moved into process context. Iirc, Van didn't show performance numbers but rather cpu utilization numbers. And those went down significantly without changing the userspace interface. At least the LCA presentation graphs show exactly different numbers - performance without CPU utilization (though not in his tables). Did you look at cpu utilization as well? If you did and your numbers are worse than Van's, he either did something smarter than you or forged his numbers (quite unlikely). Interesting sentence from a political correctness point of view :) I did both CPU and speed measurements when I used the socket code [1], and both of them showed a small gain, but I only tested a 1gbit setup, so they can not be compared with Van's. But even with 1gb I was not satisfied with them, so I started a different implementation, which I described in my e-mail to Alexey. 1. speed/cpu measurements of one of the netchannel implementations which used socket code: http://thread.gmane.org/gmane.linux.network/36609/focus=36614 -- Evgeniy Polyakov
Re: Netchannles: first stage has been completed. Further ideas.
Hello! Small question first: userspace, but also there are big problems, like one syscall per ack, I do not see redundant syscalls. Is it not expected to send ACKs only after receiving data, as you said? What is the problem? Now boring things: There is no BH protocol processing at all, so there is no need to protect against someone who will add data while you are processing your own chunk. The essential part of the socket user lock is the same mutex. Backlog is actually not a protection, but a thing equivalent to a netchannel. The difference is only that it tries to process something immediately, when it is safe. You can omit this and push everything to the backlog(=netchannel), which is processed only by syscalls, if you do not care about latency. How many hacks just to be a bit closer to userspace processing, implemented in netchannels! Moving processing closer to userspace is not a goal, it is a tool. Which is sometimes useful, but generally quite useless. F.e. in your tests it should not affect performance at all, the end user is just a sink. As for prequeueing, it is a bright example. Guess why is it useful? What does it save? Nothing, like netchannel. Answer is: it is just a tool to generate coarse ACKs in a controlled manner without essential violation of the protocol. (Well, and to combine checksumming and copy if you do not like how your card does this.) If userspace is scheduled away for too much time, it is bloody wrong to ack data that is impossible to read due to the fact that the system is busy. It is just postponing the work from one end to another - ack now and stop when the queue is full, or postpone the ack generation until the segment is really being read. ... when you get all the segments nicely aligned, blah-blah-blah. If you do not care about losses-congestion-delays-delacks-whatever, you have a totally different protocol. Sending window feedback is only a minor part of tcp. But even these boring tcp intrinsics are not so important, look at an ideal lossless network: Think what happens f.e. during a plain file transfer to your notebook. You get 110MB/sec for a few seconds, then writeback fires and the disk io subsystem discovers that the disk holds only 50MB/sec. If you are unlucky and some other application starts, the disk is so congested that it will take lots of seconds to make progress with io. For this time the other side will retransmit, because the poor thing thought the rtt is 100 usecs and you will never return to 50MB/sec. You have to _CLOSE_ the window in the case of a long delay, rather than forget to ack. See the difference? It is just because the actual end user is still far far away. And this happens all the time, when you relay the results to another application via a pipe, when... Well, the only case where the real end user is the user of a netchannel is when you receive to a sink. But I did not say this. I said it looks _worse_. A bit, but worse. At least for 80 bytes it does not matter at all. Hello-o, do you hear me? :-) I am asking: it looks not much better, but a bit worse; then what is the real reason for better performance, unless it is due to castration of the protocol? Simplify the protocol, move all the processing (even memory copies) to softirq, leave to user space only feeding pages to copy, and you will have unbeatable performance. Been there, done that; not with TCP of course, but if you do not care about losses and ACK clocking and send an ACK once per window, I do not see how it can spoil the situation. And actually I never understood nano-optimisation behind more serious problems (i.e. one cache line vs.
50MB/sec speed). You deal with 80-byte packets, as far as I understand. If you lose one cacheline per packet, it is a big problem. All that we can change is protocol overhead. Handling the data part is invariant anyway. You are scared of the complexity of tcp, but you obviously forget one thing: the cpu is fast. The code can look very complicated: some crazy hash functions, damn hairy protocol processing, but if you take care about caches etc., all this is dominated by the first look into the packet in eth_type_trans() or ip_rcv(). BTW, when you deal with a normal data flow, the cache can be not dirtied by data at all, it can be bypassed. works perfectly ok, but it is possible to have better performance by changing architecture, and it was done. It is exactly the point of trouble. From all that I see and you said, better performance is obtained not due to the change of architecture, but in spite of it. A proof that we can perform better by changing the protocol is not required, it is kinda obvious. The question is how to make the existing protocol perform better. I have no idea why your tcp performs better. It can be anything: absence of slow start, more coarse ACKs, whatever. I believe you were careful to check those reasons and to do a fair comparison, but then the only guess that remains is that you saved lots of i-cache by getting rid of a long code path. And none of those guesses can be attributed to
Re: Netchannles: first stage has been completed. Further ideas.
On Thu, Jul 20, 2006 at 08:41:00PM +0400, Alexey Kuznetsov ([EMAIL PROTECTED]) wrote: Hello! Hello, Alexey. Small question first: userspace, but also there are big problems, like one syscall per ack, I do not see redundant syscalls. Is it not expected to send ACKs only after receiving data, as you said? What is the problem? I mean that each ack is a pure syscall without any data, so the overhead is quite huge compared to the situation when acks are created in kernelspace. At least slow start will eat a lot of CPU with them. Now boring things: There is no BH protocol processing at all, so there is no need to protect against someone who will add data while you are processing your own chunk. The essential part of the socket user lock is the same mutex. Backlog is actually not a protection, but a thing equivalent to a netchannel. The difference is only that it tries to process something immediately, when it is safe. You can omit this and push everything to the backlog(=netchannel), which is processed only by syscalls, if you do not care about latency. If we consider netchannels as Van Jacobson described them, then the mutex is not needed, since it is impossible to have several readers or writers. But in the socket case, even if there is only one userspace consumer, that lock must be held to protect against bh (or introduce several queues and greatly complicate their management (ucopy for example)). How many hacks just to be a bit closer to userspace processing, implemented in netchannels! Moving processing closer to userspace is not a goal, it is a tool. Which is sometimes useful, but generally quite useless. F.e. in your tests it should not affect performance at all, the end user is just a sink. As for prequeueing, it is a bright example. Guess why is it useful? What does it save? Nothing, like netchannel. Answer is: it is just a tool to generate coarse ACKs in a controlled manner without essential violation of the protocol. (Well, and to combine checksumming and copy if you do not like how your card does this.) I can not agree here. The main goal of the protocol is data delivery to the user, not its blind acceptance, and data transmission from the user, not from some other ring. As you see, sending is already implemented in process context, but receiving is not directly connected to the user. The more elements between the user and its data we have, the higher the probability of problems there. And we already have two queues just to eliminate one of them. Moving the protocol (no matter if it is TCP or not) closer to the user allows naturally controlling the dataflow - when the user can read the data (and _this_ is the main goal), the user acks; when it can not, it does not generate an ack (see the sketch after this message). In theory that can lead to the complete absence of congestion, especially if the receiving window can be controlled in both directions. At least with the current state of routers it does not lead to broken connections. If userspace is scheduled away for too much time, it is bloody wrong to ack data that is impossible to read due to the fact that the system is busy. It is just postponing the work from one end to another - ack now and stop when the queue is full, or postpone the ack generation until the segment is really being read. ... when you get all the segments nicely aligned, blah-blah-blah. If you do not care about losses-congestion-delays-delacks-whatever, you have a totally different protocol. Sending window feedback is only a minor part of tcp. But even these boring tcp intrinsics are not so important, look at an ideal lossless network: Think what happens f.e.
during a plain file transfer to your notebook. You get 110MB/sec for a few seconds, then writeback fires and the disk io subsystem discovers that the disk holds only 50MB/sec. If you are unlucky and some other application starts, the disk is so congested that it will take lots of seconds to make progress with io. For this time the other side will retransmit, because the poor thing thought the rtt is 100 usecs and you will never return to 50MB/sec. You have to _CLOSE_ the window in the case of a long delay, rather than forget to ack. See the difference? It is just because the actual end user is still far far away. And this happens all the time, when you relay the results to another application via a pipe, when... Well, the only case where the real end user is the user of a netchannel is when you receive to a sink. There is one problem in your logic. RTT will not be so small, since acks are not sent when the user does not read data. But I did not say this. I said it looks _worse_. A bit, but worse. At least for 80 bytes it does not matter at all. Hello-o, do you hear me? :-) I am asking: it looks not much better, but a bit worse; then what is the real reason for better performance, unless it is due to castration of the protocol? Well, if speed were measured in lines of code, atcp has far fewer than the existing tcp, but the performance win is only 2.5 times.
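(A sketch of the receive-driven acking described above. All names are hypothetical, standing in for whatever a netchannel implementation would provide; the point is only that the ack is a side effect of the user actually consuming data, so an absent reader generates no acks:)

struct nc_sock;						/* assumed */
extern long nc_copy_to_user(struct nc_sock *nc, void *buf,
			    unsigned long len);		/* assumed */
extern void nc_send_ack(struct nc_sock *nc);		/* assumed */

/* The ack is emitted from the read syscall itself, after the copy
 * to userspace succeeds, never from irq context. */
static long nc_recv(struct nc_sock *nc, void *buf, unsigned long len)
{
	long copied = nc_copy_to_user(nc, buf, len);

	if (copied > 0)
		nc_send_ack(nc);	/* user really read the data */

	return copied;			/* no reader -> no ack */
}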
Re: Netchannles: first stage has been completed. Further ideas.
Evgeniy Polyakov wrote: Backlog is actually not a protection, but a thing equivalent to a netchannel. The difference is only that it tries to process something immediately, when it is safe. You can omit this and push everything to the backlog(=netchannel), which is processed only by syscalls, if you do not care about latency. If we consider netchannels as Van Jacobson described them, then the mutex is not needed, since it is impossible to have several readers or writers. But in the socket case, even if there is only one userspace consumer, that lock must be held to protect against bh (or introduce several queues and greatly complicate their management (ucopy for example)). Out of curiosity, is it possible to have the single producer logic if you have two+ ethernet interfaces handling frames for a single TCP connection? (I am assuming some sort of multi-path routing logic...) Thanks, Ben -- Ben Greear [EMAIL PROTECTED] Candela Technologies Inc http://www.candelatech.com
Re: Netchannles: first stage has been completed. Further ideas.
If we consider netchannels as Van Jacobson described them, then the mutex is not needed, since it is impossible to have several readers or writers. But in the socket case, even if there is only one userspace consumer, that lock must be held to protect against bh (or introduce several queues and greatly complicate their management (ucopy for example)). As I recall Van's talk, you don't need a lock with a ring buffer if you have a start and an end variable pointing to locations within the ring buffer. He didn't explain this in great depth as it is computer science 101, but here is how I would explain it: Once the socket is initialised, the consumer is the only one that sets the start variable, and the network driver only reads it. It is the other way around for the end variable. As long as the writes are atomic then you are fine. You only need one ring buffer in this scenario and two atomic variables. Atomic writes do have overhead, but far less than locking semantics. -- Ian McDonald Web: http://wand.net.nz/~iam4 Blog: http://imcdnzl.blogspot.com WAND Network Research Group Department of Computer Science University of Waikato New Zealand
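(A sketch of the scheme Ian describes, written with C11 atomics for the two index variables; kernel code would use its own barrier primitives instead. Single producer, single consumer, no lock:)

#include <stdatomic.h>
#include <stdbool.h>

#define RING_SIZE 256			/* power of two, illustrative */

struct ring {
	void *slot[RING_SIZE];
	_Atomic unsigned int head;	/* written only by the consumer */
	_Atomic unsigned int tail;	/* written only by the producer */
};

/* Producer (driver) side: writes tail, only reads head. */
static bool ring_put(struct ring *r, void *p)
{
	unsigned int tail = atomic_load_explicit(&r->tail,
						 memory_order_relaxed);
	unsigned int head = atomic_load_explicit(&r->head,
						 memory_order_acquire);

	if (tail - head == RING_SIZE)
		return false;				/* full */
	r->slot[tail & (RING_SIZE - 1)] = p;
	/* release: slot contents become visible before the new tail */
	atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
	return true;
}

/* Consumer (user) side: writes head, only reads tail. */
static void *ring_get(struct ring *r)
{
	unsigned int head = atomic_load_explicit(&r->head,
						 memory_order_relaxed);
	unsigned int tail = atomic_load_explicit(&r->tail,
						 memory_order_acquire);
	void *p;

	if (head == tail)
		return NULL;				/* empty */
	p = r->slot[head & (RING_SIZE - 1)];
	atomic_store_explicit(&r->head, head + 1, memory_order_release);
	return p;
}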
Re: Netchannles: first stage has been completed. Further ideas.
Hello! Moving the protocol (no matter if it is TCP or not) closer to the user allows naturally controlling the dataflow - when the user can read the data (and _this_ is the main goal), the user acks; when it can not, it does not generate an ack. In theory As far as I remember, in theory the absence of feedback still leads to loss of control. The same is true in practice, unfortunately. You must say that the window is closed, otherwise the sender is totally confused. There is one problem in your logic. RTT will not be so small, since acks are not sent when the user does not read data. It is arithmetic: rtt = window/rate (f.e. a 1 MB queue drained at 50 MB/sec gives ~20 msec). And the rto stays rounded up to 200 msec, unless you have messed up the connection so badly that it is not alive. Check. Simplify the protocol, move all the processing (even memory copies) to softirq, leave to user space only feeding pages to copy, and you will have unbeatable performance. Been there, done that; not with TCP of course, but if you do not care about losses and ACK clocking and send an ACK once per window, I do not see how it can spoil the situation. Do you live in a perfect world, where the user always consumes what it requested? All this time I have been trying to bring to your attention that you read to a sink. :-) At least read to disk, to move it a little closer to reality. Or at least do it from a terminal and press ^Z sometimes. You deal with 80-byte packets, as far as I understand. If you lose one cacheline per packet, it is a big problem. So actual netchannels speed is even better? :) atcp. If you get rid of netchannels and leave only atcp, the speed will be at least no worse. No doubts. tell me, why we should keep (enabled) that redundant functionality? Because it can work better in some other places, and that is correct, but why should it be enabled then in the majority of cases? Did I not tell you something like that? :-) Optimize the real thing, even trying to detect the situations when retransmissions are redundant, and eliminate the code. Let's draw the line. ... That was my opinion on the topic. It looks like neither you nor I will change our point of view about that right now :) I agree. :) Alexey
Re: Netchannles: first stage has been completed. Further ideas.
From: Alexey Kuznetsov [EMAIL PROTECTED] Date: Fri, 21 Jul 2006 02:59:08 +0400 Moving the protocol (no matter if it is TCP or not) closer to the user allows naturally controlling the dataflow - when the user can read the data (and _this_ is the main goal), the user acks; when it can not, it does not generate an ack. In theory As far as I remember, in theory the absence of feedback still leads to loss of control. The same is true in practice, unfortunately. You must say that the window is closed, otherwise the sender is totally confused. Correct, and too large a delay even results in retransmits. You can say that RTT will be adjusted by the delay of the ACK, but if the user context switches cleanly at the beginning, resulting in near-immediate ACKs, and then blocks later, you will get spurious retransmits. Alexey's example of blocking on a disk write is a good one. I really don't like when pure NULL data sinks are used for benchmarking these kinds of things, because real applications 1) touch the data, 2) do something with that data, and 3) have some life outside of TCP! If you optimize an application that does nothing with the data it receives, you have likewise optimized nothing :-) All this talk reminds me of one thing, how expensive tcp_ack() is. And this expense has nothing to do with TCP really. The main cost is purging and freeing up the skbs which have been ACK'd in the retransmit queue. So tcp_ack() sort of inherits the cost of freeing a bunch of SKBs which haven't been touched by the cpu in some time and are thus nearly guaranteed to be cold in the cache. This is the kind of work we could think about batching to the user sleeping on some socket call. Also notice that the retransmit queue is potentially a good use for an array similar to the VJ netchannel lockless queue data structure. :) BTW, notice that TSO makes this work touch less skb state. TSO also decreases cpu utilization. I think these two things are no coincidence. :-) I have even toyed with the idea of eventually abstracting the retransmit queue into a pure data representation. The skb_shinfo() page vector is very nearly this already. Or, a less extreme idea where we retain full huge TSO skbs, but do not chop them up to create smaller TSO frames. Instead, we add an offset GSO attribute which is used in the clones. Calls to tso_fragment() would be replaced with pure clones and adjustment of skb->len and the new skb->gso_offset in the clone. The rest of the logic would remain identical, except that non-linear data would start skb->gso_offset bytes into the skb_shinfo() described area. In this way we could also set tp->xmit_size_goal to its maximum possible value, always. Actually, I was looking at this the other day, and this clamping of xmit_size_goal to 1/2 max_window is extremely dubious. In fact it's downright wrong; only MSS needs this limiting, for sender-side SWS avoidance.
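(A sketch only of the gso_offset idea above: the skb->gso_offset field does not exist in the tree, and every name here is made up. The point is that transmit would clone the big retained skb and describe a window [gso_offset, gso_offset + len) into its shared page vector, instead of physically chopping it with tso_fragment():)

struct sketch_skb {
	unsigned int len;		/* bytes this clone covers       */
	unsigned int gso_offset;	/* start within the parent data  */
	/* ... page vector shared with the parent via cloning ...     */
};

static struct sketch_skb *clone_for_xmit(const struct sketch_skb *parent,
					 unsigned int offset,
					 unsigned int bytes,
					 struct sketch_skb *clone)
{
	/* A real clone would share the skb_shinfo() page vector; here
	 * we only record the window the clone is allowed to send. */
	*clone = *parent;
	clone->gso_offset = offset;
	clone->len = bytes;
	return clone;
}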
Re: Netchannles: first stage has been completed. Further ideas.
On Tue, 18 July 2006 23:08:01 +0400, Evgeniy Polyakov wrote: On Tue, Jul 18, 2006 at 02:15:17PM +0200, Jörn Engel ([EMAIL PROTECTED]) wrote: Your description makes it sound as if you would take a huge leap, changing all in-kernel code _and_ the userspace interface in a single patch. Am I wrong? Or am I right and would it make sense to extract small incremental steps from your patch similar to those Van did in his non-published work? My first implementation used existing kernel code and showed a small performance win - there was binding of the socket to a netchannel and all protocol processing was moved into process context. Iirc, Van didn't show performance numbers but rather cpu utilization numbers. And those went down significantly without changing the userspace interface. Did you look at cpu utilization as well? If you did and your numbers are worse than Van's, he either did something smarter than you or forged his numbers (quite unlikely). Jörn -- Sometimes, asking the right question is already the answer. -- Unknown
Re: Netchannles: first stage has been completed. Further ideas.
Hello! There is no socket spinlock anymore. Above lock is the skb_queue lock which is held inside skb_dequeue/skb_queue_tail calls. The lock is named differently, but it is still here. BTW for UDP even the name is the same. Equivalent of socket user lock. No, it is an equivalent for the hash lock in the socket table. OK. But you have to introduce a socket mutex somewhere in any case. Even in ATCP. Just an example - tcp_established() can be called with bh disabled under the socket lock. When we have a process context in hand, it is not. Did you ask yourself why we do not put all the packets to backlog/prequeue and just wait until the user reads the data? It would be 100% equivalent to netchannels. The answer is simple: because we cannot wait. If the user delays for 200 msec, expect connection collapse due to retransmissions. If the segment is out of order, immediate attention is required. Any scheme which tries to wait for the user unconditionally at least has to run a watchdog timer, which fires before the sender senses the gap. And this is what we have done for ages. Grep for VJ in the sources. :-) Netchannels have nothing to do with it, it is a much older idea. In that case one copies the whole data into userspace, so access to 20 bytes of headers completely does not matter. For short packets it matters. But I did not say this. I said it looks _worse_. A bit, but worse. Hmm, for 80-byte sized packets the win was about 2.5 times. Could you please show me the lines inside the existing code which should be commented out, so that I get 50Mbyte/sec from that? If I knew, it would be done. :-) Actually, that is the action which I would expect. This, but not dropping the whole TCP stack. I showed there that using the existing stack it is impossible Please, understand, it is such statements that compromise your work. If it is impossible then it is not interesting. Alexey
Re: Netchannles: first stage has been completed. Further ideas.
As a related note, I am looking into fixing inet hash tables to use RCU.
Re: Netchannles: first stage has been completed. Further ideas.
From: Stephen Hemminger [EMAIL PROTECTED] Date: Wed, 19 Jul 2006 15:52:04 -0400 As a related note, I am looking into fixing inet hash tables to use RCU. IBM posted a patch a long time ago which would not be so hard to munge into the current tree. See if you can spot it in the archives :)
Re: Netchannles: first stage has been completed. Further ideas.
On Wed, 19 Jul 2006 13:01:50 -0700 (PDT) David Miller [EMAIL PROTECTED] wrote: From: Stephen Hemminger [EMAIL PROTECTED] Date: Wed, 19 Jul 2006 15:52:04 -0400 As a related note, I am looking into fixing inet hash tables to use RCU. IBM posted a patch a long time ago which would not be so hard to munge into the current tree. See if you can spot it in the archives :) Ben posted a patch in March, and IBM did one a while ago. I am looking at both.
Re: Netchannles: first stage has been completed. Further ideas.
From: Evgeniy Polyakov [EMAIL PROTECTED] Date: Tue, 18 Jul 2006 12:16:26 +0400 I would ask to push netchannel support into the -mm tree, but I expect in advance that having two separate TCP stacks (one of which can contain some bugs (I mean atcp.c)) is not that good an idea, so I understand possible negative feedback on that issue, but it is much better than silence. Evgeniy, you are present in my queue of work to review. Perhaps I am mistaken with my priorities, but I tend to hit all the easy patches and bug fixes first, before significant new work. And even in the realm of new work, your things require the most serious thinking and consideration. I apologize for the time it takes me, therefore, to get to reviewing deep work such as yours. I will make a real effort to properly review your excellent work this week, and I encourage any other netdev hackers with some spare cycles to do the same. :) Thanks!
Re: Netchannles: first stage has been completed. Further ideas.
On Tue, Jul 18, 2006 at 01:34:37AM -0700, David Miller ([EMAIL PROTECTED]) wrote: Perhaps I am mistaken with my priorities, but I tend to hit all the easy patches and bug fixes first, before significant new work. And even in the realm of new work, your things require the most serious thinking and consideration. I apologize for the time it takes me, therefore, to get to reviewing deep work such as yours. I will make a real effort to properly review your excellent work this week, and I encourage any other netdev hackers with some spare cycles to do the same. :) That would be great! Please don't think that I wash people's minds with weekly "get it, get it" zombifying stuff; I completely understand that there are things with much higher priority than netchannels, so it can wait (for a while :). Thank you. -- Evgeniy Polyakov
Re: Netchannles: first stage has been completed. Further ideas.
Hello Evgeniy,

+asmlinkage long sys_netchannel_control(void __user *arg)
[...]
+	if (copy_from_user(ctl, arg, sizeof(struct unetchannel_control)))
+		return -ERESTARTSYS;
[...]
+	if (copy_to_user(arg, ctl, sizeof(struct unetchannel_control)))
+		return -ERESTARTSYS;

I think this should be -EFAULT instead of -ERESTARTSYS, right? -- Mit freundlichen Grüßen / Best Regards Christian Borntraeger Linux Software Engineer zSeries Linux Virtualization
Re: Netchannles: first stage has been completed. Further ideas.
On Tue, Jul 18, 2006 at 01:16:18PM +0200, Christian Borntraeger ([EMAIL PROTECTED]) wrote: Hello Evgeniy,

+asmlinkage long sys_netchannel_control(void __user *arg)
[...]
+	if (copy_from_user(ctl, arg, sizeof(struct unetchannel_control)))
+		return -ERESTARTSYS;
[...]
+	if (copy_to_user(arg, ctl, sizeof(struct unetchannel_control)))
+		return -ERESTARTSYS;

I think this should be -EFAULT instead of -ERESTARTSYS, right? I have no strong feeling about what must be returned in that case. As far as I see, copy*user can fail due to absence of the next destination page, so -ERESTARTSYS makes sense, but if the failure happens due to process size limitations, -EFAULT is correct. Let's change it to -EFAULT. -- Evgeniy Polyakov
Re: Netchannles: first stage has been completed. Further ideas.
On Tue, 18 July 2006 12:16:26 +0400, Evgeniy Polyakov wrote: Current tests with the latest netchannel patch show that netchannels outperform sockets in any type of bulk transfer (big-sized, small-sized, sending, receiving) over a 1gb wire. I omit graphs and numbers here, since I have posted them already several times. I also plan to proceed with some negotiations which would allow testing netchannel support in a 10gbit environment, but that can also happen after the second development stage is completed. [ I don't have enough time for a deeper look. So if my questions are stupid, please just tell me so and don't take it personally. ] After having seen Van Jacobson's presentation at LCA twice, it appeared to me that Van could get astonishing speedups with small incremental steps, only changing kernel code and leaving the kernel-userspace interface as is. Changing the userspace interface (or rather adding a new one) was just the last step, which also gave some performance benefits, but is also a change to the userspace interface and therefore easy to get wrong and hard to fix later. Your description makes it sound as if you would take a huge leap, changing all in-kernel code _and_ the userspace interface in a single patch. Am I wrong? Or am I right and would it make sense to extract small incremental steps from your patch similar to those Van did in his non-published work? Jörn -- When people work hard for you for a pat on the back, you've got to give them that pat. -- Robert Heinlein
Re: Netchannles: first stage has been completed. Further ideas.
On Tuesday 18 July 2006 13:51, Evgeniy Polyakov wrote: I think this should be -EFAULT instead of -ERESTARTSYS, right? I have no strong feeling about what must be returned in that case. As far as I see, copy*user can fail due to absence of the next destination page, so -ERESTARTSYS makes sense, but if the failure happens due to process size limitations, -EFAULT is correct. If I am not completely mistaken, ERESTARTSYS is wrong. include/linux/errno.h says userspace should never see ERESTARTSYS, therefore we should only return it if we were interrupted by a signal, as do_signal takes care of ERESTARTSYS. Furthermore, copy*user transparently faults in the necessary pages as long as the address is valid in the user context. Let's change it to -EFAULT. Thanks :-) -- Mit freundlichen Grüßen / Best Regards Christian Borntraeger Linux Software Engineer zSeries Linux Virtualization
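(For clarity, the agreed fix applied to the two hunks quoted above - surrounding context elided exactly as in the original patch excerpt:)

+	if (copy_from_user(ctl, arg, sizeof(struct unetchannel_control)))
+		return -EFAULT;
[...]
+	if (copy_to_user(arg, ctl, sizeof(struct unetchannel_control)))
+		return -EFAULT;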
Re: Netchannles: first stage has been completed. Further ideas.
On Tue, Jul 18, 2006 at 02:15:17PM +0200, Jörn Engel ([EMAIL PROTECTED]) wrote: Your description makes it sound as if you would take a huge leap, changing all in-kernel code _and_ the userspace interface in a single patch. Am I wrong? Or am I right and would it make sense to extract small incremental steps from your patch similar to those Van did in his non-published work? My first implementation used existing kernel code and showed a small performance win - there was binding of the socket to a netchannel and all protocol processing was moved into process context. It is actually the same as what the IBM folks do, but my investigation showed that the linux sending side has some issues which would not allow the speed to grow very noticeably (after creating yet another congestion control algo I now think that the problem is there, but I'm not 100% sure). And after looking into Van's presentation (and his words about _userspace_ protocol processing) I think they used their own stack too. So I reinvented the wheel and created my own too. -- Evgeniy Polyakov
Re: Netchannles: first stage has been completed. Further ideas.
From: Evgeniy Polyakov [EMAIL PROTECTED] Date: Tue, 18 Jul 2006 23:11:37 +0400 Actually userspace will not see ERESTARTSYS when it is returned from a syscall. This is true only when a signal is pending. It is the signal dispatch code that fixes up the return value, either by changing it to -EINTR or by resetting the register state so that the signal handler returns to re-execute the system call with the original set of argument register values. If a signal is not pending, you risk leaking ERESTARTSYS to userspace.
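(A sketch of the one pattern where returning -ERESTARTSYS is legitimate. wait_event_interruptible() is the real kernel macro and returns -ERESTARTSYS only when a signal is pending, so the dispatch code described above is guaranteed to rewrite the value before userspace sees it; the ready() predicate is a placeholder for real completion state:)

#include <linux/wait.h>

static long wait_for_data(wait_queue_head_t *wq, int (*ready)(void))
{
	long err = wait_event_interruptible(*wq, ready());

	if (err)
		return err;	/* -ERESTARTSYS: signal pending, safe */

	return 0;		/* data available */
}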
Re: Netchannles: first stage has been completed. Further ideas.
Hello! Can I ask a couple of questions? Just as a person who looked at VJ's slides once and was confused. And startled, when I found that it is not considered another joke of a genius. :-) About locks: is completely lockless (there is one irq lock when skb is queued/dequeued into netchannels queue in hard/soft irq, Equivalent of socket spinlock. one mutex for netchannel's bucket Equivalent of socket user lock. and some locks on qdisc/NIC driver layer, The same as in traditional code, right? From all that I see, this completely lockless code has no fewer locks than the traditional approach, even when doing no protocol processing. Where am I wrong? Frankly speaking, when talking about locks, I do not see anything which could be saved; only the TCP hash table lookup can be RCUized, but this optimization obviously has nothing to do with netchannels. The only improvement in this area suggested in VJ's slides is a lock-free producer-consumer ring. It is missing in your patch, and I could guess it is not a big loss; it is unlikely to improve something significantly until the lock is heavily contended, which never happens without massive network-level parallelism for a single bucket. The next question is about locality: To find the netchannel bucket in netif_receive_skb() you have to access all the headers of the packet. Right? Then you wait for processing in user context, and this information is washed out of the cache or even scheduled on another CPU. In the traditional approach you also fetch all the headers in softirq, but you do all the required work with them immediately and do not access them when the rest of processing is done in process context. I do not see how netchannels (without hardware classification) can improve something here. At first sight it makes locality worse. Honestly, I do not see how this approach could improve performance even a little. And it looks like your benchmarks confirm that all the win is not due to architectural changes, but just because some required bits of code are castrated. VJ's slides describe a totally different scheme, where the softirq part is omitted completely and protocol processing is moved to user space as a whole. It is an amazing toy. But I see nothing which could promote its status to practical. Exokernels used to do this thing for ages, and all the performance gains are compensated by an overcomplicated classification engine, which has to remain in kernel and essentially do the same work which routing/firewalling/socket hash tables do. advance that having two separate TCP stacks (one of which can contain some bugs (I mean atcp.c)) is not that good idea, so I understand possible negative feedback on that issue, but it is much better than silence. You are absolutely right here. Moreover, I can guess that the absence of feedback is a direct consequence of this thing. I would advise getting rid of it and never mentioning it again. :-) If you took VJ's suggestion seriously and moved the TCP engine to user space, it could remain unnoticed. But if TCP stays in the kernel (and it obviously has to), you want to work with the normal stack; you can improve, optimize and rewrite it infinitely, but do not start with a toy. It proves nothing and compromises the whole approach. Alexey
Re: Netchannles: first stage has been completed. Further ideas.
From: Alexey Kuznetsov [EMAIL PROTECTED] Date: Wed, 19 Jul 2006 03:01:21 +0400 The only improvement in this area suggested in VJ's slides is a lock-free producer-consumer ring. It is missing in your patch, and I could guess it is not a big loss; it is unlikely to improve something significantly until the lock is heavily contended, which never happens without massive network-level parallelism for a single bucket. And the gains from this ring can be obtained by stateless hardware classification pointing to unique MSI-X PCI interrupt vectors that get targeted to specific unique cpus. It is true zero cost in that case. I guess my excitement about VJ channels, from a practical viewpoint, begins to wane even further. How depressing :) Devices can move flow work to individual cpus via intelligent interrupt targeting, and the OS should just get out of the way and continue doing what it does today. This idea is actually very old, and PCI MSI-X interrupts just make it practical for commodity devices. At least there is less code to write. :-)))
Re: Netchannles: first stage has been completed. Further ideas.
On Wed, Jul 19, 2006 at 03:01:21AM +0400, Alexey Kuznetsov ([EMAIL PROTECTED]) wrote: Hello! Hello, Alexey. Can I ask a couple of questions? Just as a person who looked at VJ's slides once and was confused. And startled, when I found that it is not considered another joke of a genius. :-) About locks: is completely lockless (there is one irq lock when skb is queued/dequeued into netchannels queue in hard/soft irq, Equivalent of socket spinlock. There is no socket spinlock anymore. The above lock is the skb_queue lock which is held inside skb_dequeue/skb_queue_tail calls. one mutex for netchannel's bucket Equivalent of socket user lock. No, it is an equivalent for the hash lock in the socket table. and some locks on qdisc/NIC driver layer, The same as in traditional code, right? I use dst_output(), so it is possible to have as many locks inside the low-level NIC driver as you want. From all that I see, this completely lockless code has no fewer locks than the traditional approach, even when doing no protocol processing. Where am I wrong? Frankly speaking, when talking about locks, I do not see anything which could be saved; only the TCP hash table lookup can be RCUized, but this optimization obviously has nothing to do with netchannels. It looks like you should look at it again :) Just an example - tcp_established() can be called with bh disabled under the socket lock. In netchannels there is no need for that. The only improvement in this area suggested in VJ's slides is a lock-free producer-consumer ring. It is missing in your patch, and I could guess it is not a big loss; it is unlikely to improve something significantly until the lock is heavily contended, which never happens without massive network-level parallelism for a single bucket. That's because I decided to use skbs rather than special structures, and thus I use the same queue as the socket code (and have only the one lock inside skb_queue_tail()/skb_dequeue()). I will describe below why I did not change it to more hardware-friendly stuff. The next question is about locality: To find the netchannel bucket in netif_receive_skb() you have to access all the headers of the packet. Right? Then you wait for processing in user context, and this information is washed out of the cache or even scheduled on another CPU. In the traditional approach you also fetch all the headers in softirq, but you do all the required work with them immediately and do not access them when the rest of processing is done in process context. I do not see how netchannels (without hardware classification) can improve something here. At first sight it makes locality worse. In that case one copies the whole data into userspace, so access to 20 bytes of headers completely does not matter. Honestly, I do not see how this approach could improve performance even a little. And it looks like your benchmarks confirm that all the win is not due to architectural changes, but just because some required bits of code are castrated. Hmm, for 80-byte sized packets the win was about 2.5 times. Could you please show me the lines inside the existing code which should be commented out, so that I get 50Mbyte/sec from that? VJ's slides describe a totally different scheme, where the softirq part is omitted completely and protocol processing is moved to user space as a whole. It is an amazing toy. But I see nothing which could promote its status to practical.
Exokernels used to do this thing for ages, and all the performance gains are compensated by an overcomplicated classification engine, which has to remain in kernel and essentially do the same work which routing/firewalling/socket hash tables do. There are several ideas presented in his slides. In my personal opinion most of the performance win is obtained from userspace processing and memcpy instead of copy_to_user() (but my previous work showed that this is not the case in a lot of situations), so I created the first approach, tested the second, and am now moving into a fully zero-copy design. How skbs or other structures are delivered into the queue/array does not matter in my design - I can replace it in a moment, but I do not want to mess with drivers, since that is a huge break, which must be done after the high-level stuff is proven to work well. advance that having two separate TCP stacks (one of which can contain some bugs (I mean atcp.c)) is not that good idea, so I understand possible negative feedback on that issue, but it is much better than silence. You are absolutely right here. Moreover, I can guess that the absence of feedback is a direct consequence of this thing. I would advise getting rid of it and never mentioning it again. :-) If you took VJ's suggestion seriously and moved the TCP engine to user space, it could remain unnoticed. But if TCP stays in the kernel (and it obviously has to), you want to work with the normal stack; you can improve, optimize and rewrite it infinitely, but do not start with a toy. It proves nothing and compromises the