Re: Van Jacobson net channels
On Fri, 2006-02-03 at 18:48, Andi Kleen wrote:
> On Friday 03 February 2006 02:07, Greg Banks wrote:
> > > (Don't ask for code - it's not really in a usable state)
> >
> > Sure. I'm looking forward to it.
>
> I had actually shelved the idea because of TSO. But if you can get me
> some data from your NFS servers that shows TSO is not enough for them,
> that might change the picture.

We should be doing some NFS+TSO testing on SLES10 beta in the next few
weeks, time permitting.  I'll let you know how it goes.

Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.
Re: Van Jacobson net channels
On Friday 03 February 2006 02:07, Greg Banks wrote:
> > (Don't ask for code - it's not really in a usable state)
>
> Sure. I'm looking forward to it.

I had actually shelved the idea because of TSO. But if you can get me
some data from your NFS servers that shows TSO is not enough for them,
that might change the picture.

-Andi
Re: Van Jacobson net channels
From: Greg Banks <[EMAIL PROTECTED]>
Date: Fri, 03 Feb 2006 12:08:54 +1100

> So, given 2.6.16 on tg3 hardware, would your advice be to
> enable TSO by default?

Yes.

In fact I've been meaning to discuss with Michael Chan enabling it in
the driver by default.
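[For readers following along: until the driver default changes, enabling
TSO per interface is "ethtool -K ethX tso on".  A minimal sketch of the
same thing done directly against the ETHTOOL_STSO ioctl is below; the
interface name "eth0" and the bare-bones error handling are only
illustrative, not a recommendation.]

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/types.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(void)
{
	/* ETHTOOL_STSO takes a simple on/off value. */
	struct ethtool_value ev = { .cmd = ETHTOOL_STSO, .data = 1 };
	struct ifreq ifr;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);   /* example name */
	ifr.ifr_data = (char *)&ev;

	if (ioctl(fd, SIOCETHTOOL, &ifr) < 0)
		perror("ETHTOOL_STSO");
	close(fd);
	return 0;
}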
RE: Van Jacobson net channels
On Fri, 2006-02-03 at 01:41, Leonid Grossman wrote:
> As I mentioned earlier, it would be cool to get these moderation
> thresholds from NAPI, since it can make a better guess about the
> overall system utilization than the driver can.

Agreed.

> But even at the driver level, this works reasonably well.

Yep.

> - the moderation scheme is implemented in the ASIC on per channel
> basis.  So, if you have workloads with very distinct latency needs,
> you can just steer it to a separate channel and have an interrupt
> moderation that is different from other flows, for example keep an
> interrupt per packet always.

Wow, that's cool.  So I could configure a particular UDP port and a
particular TCP port to always have minimum latency, but keep all the
rest of the traffic on the same NIC at minimum interrupts?  Currently
we need to use separate NICs for the two traffic types (for a number
of reasons).

What's the interface, some kind of ethtool extension or /proc magic?

Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.
Re: Van Jacobson net channels
On Thu, 2006-02-02 at 18:51, David S. Miller wrote:
> From: Greg Banks <[EMAIL PROTECTED]>
> Date: Thu, 02 Feb 2006 18:31:49 +1100
>
> > On Thu, 2006-02-02 at 17:45, Andi Kleen wrote:
> > > Normally TSO was supposed to fix that.
> >
> > Sure, except that the last time SGI looked at TSO it was
> > extremely flaky.  I gather that's much better now, but TSO
> > still has a very small size limit imposed by the stack (not
> > the hardware).
>
> Oh you have TSO disabled?  That explains a lot.
>
> Yes, it's been a bumpy road, and there are still some
> e1000 lockups, but in general things should be smooth
> these days.

So, given 2.6.16 on tg3 hardware, would your advice be to
enable TSO by default?

Greg
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.
Re: Van Jacobson net channels
On Thu, 2006-02-02 at 18:48, Andi Kleen wrote:
> On Thursday 02 February 2006 08:31, Greg Banks wrote:
>
> > [...] SGI's solution is to ship a script that uses ethtool
> > at boot to tune rx-usecs, rx-frames, rx-usecs-irq, rx-frames-irq
> > up from the defaults.
>
> All user tuning like this is bad. The stack should all do that
> automatically.

That would be nice ;-)

> Would there be a drawback of making these settings default?

Yes, as mentioned elsewhere in this thread, applications which are
latency-sensitive will suffer.  For example, SGI sells a clustered
filesystem where overall performance is sensitive to the RTT of
intra-cluster RPCs, to which receive latency due to NIC interrupt
mitigation is a significant contributor.  The NICs which run that
traffic need to be using minimum mitigation, but the NICs which run
NFS traffic need to be using maximum mitigation.

> > This helps a lot, and we're very grateful ;-)  But a scheme
> > which used the interrupt mitigation hardware dynamically based on
> > load could reduce the irq rate and CPU usage even further without
> > compromising latency at low load.
>
> If you know what's needed perhaps you could investigate it?

Maybe, in a couple of months when I've the time.

> You mean the 64k limit?

Exactly.  Currently the NFS server is limited to a 32K blocksize so
the largest RPC reply size is about 33K.  However the NFS client in
Linus' tree, and other OS's NFS servers, have much larger limits.
A value of about 1.001 MiB would probably be best.  The next SGI
Linux NFS server release will probably include a patch to increase
the maximum blocksize on TCP to 1MiB.

> > Cool.  Wouldn't it mean rewriting the nontrivial qdiscs?
>
> It had some compat code that just split up the lists - same
> for netfilter.  And only an implementation for pfifo_fast.

Ok by me, in practice our servers only ever use pfifo.

> (Don't ask for code - it's not really in a usable state)

Sure. I'm looking forward to it.

Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.
RE: Van Jacobson net channels and NIC channels
> -----Original Message-----
> From: Andi Kleen [mailto:[EMAIL PROTECTED]
>
> Why are you saying it can't be used by the host? The stack
> should be fully ready for it.

Sorry, I should have said "it can't be used by the host to the full
potential of the feature" :-).  It does work for us now, as a "driver
only" implementation, but setting IRQ affinity from the kernel (as well
as a couple of other decisions that we would like the host to make,
rather than making them in the driver) should help quite a bit.

> The only small piece missing is a way to set the IRQ affinity
> from the kernel, but that can be simulated from user space by
> tweaking them in /proc. If you have a prototype patch adding
> the kernel interfaces wouldn't be that hard either.

Agreed, at this point we should put a patch forward and tweak the
kernel interface later on.

> Also how about per CPU TX completion interrupts?

Yes, a channel can have separate Tx completion and RX MSI-X interrupts
(and an exception MSI-X interrupt, if desired).  It's up to 64 MSI-X
interrupts total.
Re: Van Jacobson net channels
Andi Kleen wrote:
> On Thursday 02 February 2006 08:31, Greg Banks wrote:
> > The tg3 driver uses small hardcoded values for the RXCOL_TICKS
> > and RXMAX_FRAMES registers, and allows "ethtool -C" to change
> > them.  SGI's solution is to ship a script that uses ethtool
> > at boot to tune rx-usecs, rx-frames, rx-usecs-irq, rx-frames-irq
> > up from the defaults.
>
> All user tuning like this is bad. The stack should all do that
> automatically.  Would there be a drawback of making these settings
> default?

Larger settings (even the defaults) of the coalescing parms, while
giving decent CPU utilization for a bulk transfer and better CPU
utilization for a large aggregate workload, seem to mean bad things for
minimizing latency.  The "presentation" needs work but the data in:

ftp://ftp.cup.hp.com/dist/networking/briefs/nic_latency_vs_tput.txt

should show some of that.  The current executive summary:

   Executive Summary: By default, the e1000 driver used in conjunction
   with the A9900A PCI-X Dual-port Gigabit Ethernet adaptor strongly
   favors maximum packet per second throughput over minimum
   request/response latency.  Anyone desiring lowest possible
   request/response latency needs to alter the modprobe parameters used
   when the e1000 driver is loaded.  This appears to reduce round-trip
   latency by as much as 85%.  However, configuring the A9900A PCI-X
   Dual-port Gigabit Ethernet adaptor for minimum request/response
   latency will reduce maximum packet per second performance (as
   measured with the netperf TCP_RR test) by ~23% and increase the
   service demand for bulk data transfer by ~63% for sending and ~145%
   for receiving.

There is also some data in there for tg3 and for Xframe I (but with a
rather behind-the-times driver; I'm still trying to get cycles to run
with a newer driver).

> > This helps a lot, and we're very grateful ;-)  But a scheme
> > which used the interrupt mitigation hardware dynamically based on
> > load could reduce the irq rate and CPU usage even further without
> > compromising latency at low load.
>
> If you know what's needed perhaps you could investigate it?

I'm guessing that any automagic interrupt mitigation scheme might want
to know what it wants to enable for the single-stream TCP_RR
transaction/s as the base pps before it starts holding-off interrupts.
Even then however, the ability for the user to override needs to
remain, because there may be a workload that wants that PPS rate but
isn't concerned about the latency, only the CPU utilization, and so
indeed wants the interrupts mitigated.

So it would seem that an automagic coalescing might be an N% solution,
but I don't think it would be 100%.  Question then becomes whether or
not N is large enough to warrant it over defaults+manual config.

rick jones
RE: Van Jacobson net channels
Leonid Grossman writes:

> Right. Interrupt moderation is done on per channel basis.
> The only addition to the current NAPI mechanism I'd like to see is to
> have NAPI setting desired interrupt rate (once interrupts are ON),
> rather than use an interrupt per packet or a driver default. Arguably,
> NAPI can figure out desired interrupt rate a bit better than a driver
> can.

In the current scheme a driver can easily use a dynamic interrupt
scheme; in fact tulip has used this for years.  At low rates there are
no delays at all; if it reaches some threshold it increases interrupt
latency.  It can be done in several levels.  The best threshold,
luckily, seems just to be to count the number of packets sitting in the
RX ring when ->poll is called.  Jamal heavily experimented with this
and gave a talk at OLS 2000.

Yes, if the net channel classifier runs in hardirq we get back to the
livelock situation sooner or later.  IMO interrupts should just be a
signal to indicate work.

Cheers.
--ro
Re: Van Jacobson net channels
> Oh you have TSO disabled?  That explains a lot.
>
> Yes, it's been a bumpy road, and there are still some
> e1000 lockups, but in general things should be smooth
> these days.

I suspect that "these days" in kernel.org terms differs somewhat from
"these days" in RH/SuSE/etc terms, hence TSO being disabled.

rick jones
Re: Van Jacobson net channels
On Wed, 01 Feb 2006 16:29:11 -0800 (PST)
"David S. Miller" <[EMAIL PROTECTED]> wrote:

> From: Stephen Hemminger <[EMAIL PROTECTED]>
> Date: Wed, 1 Feb 2006 16:12:14 -0800
>
> > The bigger problem I see is scalability. All those mmap rings have
> > to be pinned in memory to be useful. It's fine for a single smart
> > application per server environment, but in real world with many dumb
> > thread monster applications on a single server it will be really
> > hard to get working.
>
> This is no different from when the thread blocks and the receive queue
> fills up, and in order to absorb scheduling latency.  We already lock
> memory into the kernel for socket buffer memory as it is.  At least
> the mmap() ring buffer method is optimized and won't have all of the
> overhead for struct sk_buff and friends.  So we have the potential to
> lock down less memory not more.
>
> This is just like when we started using BK or GIT for source
> management, everyone was against it and looking for holes while they
> tried to wrap their brains around the new concepts and ideas.  I guess
> it will take a while for people to understand all this new stuff, but
> we'll get there.

No, it just means we have to cover our bases and not regress while
moving forward.  Not that we never have any regressions ;=)

--
Stephen Hemminger <[EMAIL PROTECTED]>
OSDL http://developer.osdl.org/~shemminger
Re: Van Jacobson net channels and NIC channels
On Thursday 02 February 2006 17:27, Leonid Grossman wrote:

> By now we have submitted UFO, MSI-X and LRO patches. The one item on
> the TODO list that we did not submit a full driver patch for is the
> "support for distributing receive processing across multiple CPUs
> (using NIC hw queues)", mainly because at present the feature can't
> be fully used by the host anyways.

Why are you saying it can't be used by the host? The stack should be
fully ready for it.

The only small piece missing is a way to set the IRQ affinity from the
kernel, but that can be simulated from user space by tweaking them in
/proc.  If you have a prototype patch, adding the kernel interfaces
wouldn't be that hard either.

Also how about per CPU TX completion interrupts?

-Andi
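[The user-space workaround Andi refers to is simply writing a CPU mask
into /proc/irq/<n>/smp_affinity.  A minimal sketch follows; the IRQ
number and mask are made up for illustration, and set_irq_affinity() is
just a hypothetical helper name.]

#include <stdio.h>

/* Steer one interrupt (e.g. a per-channel MSI-X vector) to the CPUs in
 * cpu_mask by writing the hex mask to /proc/irq/<irq>/smp_affinity. */
static int set_irq_affinity(unsigned int irq, unsigned long cpu_mask)
{
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/irq/%u/smp_affinity", irq);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%lx\n", cpu_mask);
	return fclose(f);
}

int main(void)
{
	/* Example only: pin hypothetical vector 58 to CPU 2 (mask 0x4). */
	return set_irq_affinity(58, 0x4) ? 1 : 0;
}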
RE: Van Jacobson net channels
> -----Original Message-----
> From: Eric W. Biederman [mailto:[EMAIL PROTECTED]
>
> How do you classify channels?

Multiple rx steering criteria are available, for example tcp tuple (or
subset) hash, direct tcp tuple (or subset) match, MAC address, pkt
size, vlan tag, QOS bits, etc.

> If your channels can map directly to the Van Jacobson
> channels then when the kernel starts using them, it sounds
> like the ideal strategy is to use the current NAPI algorithm
> of disabling interrupts on a per channel basis (assuming
> MSI-X here) until that channel gets caught up.  Then enable
> interrupts again.

Right. Interrupt moderation is done on per channel basis.
The only addition to the current NAPI mechanism I'd like to see is to
have NAPI setting desired interrupt rate (once interrupts are ON),
rather than use an interrupt per packet or a driver default.  Arguably,
NAPI can figure out desired interrupt rate a bit better than a driver
can.

> I wonder if someone could make that the default policy in their NICs?

Some NICs can support this today.  If there is a consensus on a
channel-aware NIC driver interface (including interrupt mgmt per
channel), this will become a default NIC implementation.  Over time,
NIC development is always driven by the OS/stack requirements.
Re: Van Jacobson net channels
On Thu, 02 Feb 2006 08:35:28 -0700
[EMAIL PROTECTED] (Eric W. Biederman) wrote:

> "Christopher Friesen" <[EMAIL PROTECTED]> writes:
>
> > Eric W. Biederman wrote:
> > > Jeff Garzik <[EMAIL PROTECTED]> writes:
> >
> > > > This was discussed on the netdev list, and the conclusion was
> > > > that you want both NAPI and hw mitigation.  This was implemented
> > > > in a few drivers, at least.
> >
> > > How does that deal with the latency that hw mitigation introduces.
> > > When you have a workload that is bottle-necked waiting for that
> > > next packet and hw mitigation is turned on you can see some
> > > horrible unjustified slow downs.
> >
> > Presumably at low traffic you would disable hardware mitigation to
> > get the best possible latency.  As traffic ramps up you tune the
> > hardware mitigation appropriately.  At high traffic loads, you end
> > up with full hardware mitigation, but you have enough packets coming
> > in that the latency is minimal.
>
> The evil but real workload is when you have a high volume of dependent
> traffic.  RPC calls or MPI collectives are cases where you are likely
> to see this.
>
> Or even in TCP there is an element that once you hit your window limit
> you won't send more traffic until you get your ack.  But you may not
> get your ack promptly, because the interrupt is mitigated.
>
> NAPI handles this beautifully.  It disables the interrupts until it
> knows it needs to process more packets.  Then when it is just waiting
> around for packets from that card it enables interrupts on that card.

Also, NAPI handles the case where the receiver is getting DoS'd or
overrun with packets, and you want the hardware to send flow control.
Without NAPI it is easy to get stuck only processing packets and
nothing else.  I hope the VJ channels code has receive flow control.
Re: Van Jacobson net channels
"Leonid Grossman" <[EMAIL PROTECTED]> writes: > There two facilities (at least, in our ASIC, but there is no reason this > can't be part of the generic multi-channel driver interface that I will > get to shortly) to deal with it. > > - hardware supports more than one utilization-based interrupt rate (we > have four). For lowest utilization range, we always set interrupt rate > to one interrupt for every rx packet - exactly for the latency reasons > that you are bringing up. Also, cpu is not busy anyways so extra > interrupts do not hurt much. For highest utilization range, we set the > rate by default to something like an interrupt per 128 packets. There is > also timer-based interrupt, as a last resort option. > As I mentioned earlier, it would be cool to get these moderation > tresholds from NAPI, since it can make a better guess about the overall > system utilization than the driver can. But even at the driver level, > this works reasonably well. > > - the moderation scheme is implemented in the ASIC on per channel basis. > So, if you have workloads with very distinct latency needs, you can just > steer it to a separate channel and have an interrupt moderation that is > different from other flows, for example keep an interrupt per packet > always. How do you classify channels? If your channels can map directly to the VAN Jacobsen channels then when the kernel starts using them, it sounds like the ideal strategy is to use the current NAPI algorithm of disabling interrupts (on a per channel basis (assuming MSI-X here) until that channel gets caught up Then enable interrupts again. I wonder if someone could make that the default policy in their NICs? Eric - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Van Jacobson net channels
"Christopher Friesen" <[EMAIL PROTECTED]> writes: > Eric W. Biederman wrote: >> Jeff Garzik <[EMAIL PROTECTED]> writes: > >>> This was discussed on the netdev list, and the conclusion was that >>> you want both NAPI and hw mitigation. This was implemented in a >>> few drivers, at least. > >> How does that deal with the latency that hw mitigation introduces. When you >> have a workload that bottle-necked waiting for that next >> packet and hw mitigation is turned on you can see some horrible >> unjustified slow downs. > > Presumably at low traffic you would disable hardware mitigation to get the > best > possible latency. As traffic ramps up you tune the hardware mitigation > appropriately. At high traffic loads, you end up with full hardware > mitigation, > but you have enough packets coming in that the latency is minimal. The evil but real work load is when you have a high volume of dependent traffic. RPC calls or MPI collectives are cases where you are likely to see this. Or even in TCP there is an element that once you hit your window limit you won't send more traffic until you get your ack. But if you don't get your ack because the interrupt is mitigated. NAPI handles this beautifully. It disables the interrupts until it knows it needs to process more packets. Then when it is just waiting around for packets from that card it enables interrupts on that card. Eric - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: Van Jacobson net channels
> -----Original Message-----
> From: Andi Kleen [mailto:[EMAIL PROTECTED]
>
> > You just need to make sure that you don't leak data from
> > other peoples sockets.
>
> There are three basic ways I can see to do this:
>
> - You have really advanced hardware which can potentially
> manage tens of thousands of hardware queues with full
> classification down to the ports. Then everything is great.
> But who has such hardware?
> Perhaps Leonid will do it, but I expect the majority of Linux
> users to not have access to it in the foreseeable time. Also
> even with the advanced hardware that can handle e.g. 50k
> sockets what happens when you need 100k for some extreme situation?
>
> -Andi

You may be surprised here :-)

iWARP (RDMA over Ethernet) received a lot of funding and industry
support over the last several years, and rNIC development is already
pre-announced by multiple vendors, not just us.  I expect RDMA
deployment to be a long and bumpy multi-year road, since protocols and
applications will need to change to take full advantage of it.  And
this is a discussion for a totally separate thread anyways :-)

But in the meantime, these new ethernet adapters will have a huge
number of hw queue pairs (AKA channels), and at least some of the NICs
will have these channels at no incremental cost to the hardware.  You
may be able to use the channels for full socket traffic classification
if nothing else, and defer the rest of rNIC functionality until the
iWARP infrastructure is mature.

This is actually one of many reasons why VJ net channels and related
ideas look very promising - we can "extend" it to the driver/hw level
with the current NICs that have at least one channel per cpu, with a
good chance that the next wave of hardware will support many more
channels and will take advantage of the stack/NAPI improvements.

Leonid
Re: Van Jacobson net channels
Eric W. Biederman wrote:
> Jeff Garzik <[EMAIL PROTECTED]> writes:
>
> > This was discussed on the netdev list, and the conclusion was that
> > you want both NAPI and hw mitigation.  This was implemented in a
> > few drivers, at least.
>
> How does that deal with the latency that hw mitigation introduces.
> When you have a workload that is bottle-necked waiting for that next
> packet and hw mitigation is turned on you can see some horrible
> unjustified slow downs.

Presumably at low traffic you would disable hardware mitigation to get
the best possible latency.  As traffic ramps up you tune the hardware
mitigation appropriately.  At high traffic loads, you end up with full
hardware mitigation, but you have enough packets coming in that the
latency is minimal.

Chris
RE: Van Jacobson net channels
> -----Original Message-----
> From: Eric W. Biederman [mailto:[EMAIL PROTECTED]
> Sent: Thursday, February 02, 2006 4:29 AM
> To: Jeff Garzik
> Cc: Andi Kleen; Greg Banks; David S. Miller; Leonid Grossman;
> [EMAIL PROTECTED]; Linux Network Development list
> Subject: Re: Van Jacobson net channels
>
> Jeff Garzik <[EMAIL PROTECTED]> writes:
>
> > Andi Kleen wrote:
> > > There was already talk some time ago to make NAPI drivers use the
> > > hardware mitigation again. The reason is when you have
> >
> > This was discussed on the netdev list, and the conclusion was that
> > you want both NAPI and hw mitigation.  This was implemented in a few
> > drivers, at least.
>
> How does that deal with the latency that hw mitigation introduces.
> When you have a workload that is bottle-necked waiting for that next
> packet and hw mitigation is turned on you can see some horrible
> unjustified slow downs.

There are two facilities (at least, in our ASIC, but there is no reason
this can't be part of the generic multi-channel driver interface that I
will get to shortly) to deal with it.

- hardware supports more than one utilization-based interrupt rate (we
have four).  For lowest utilization range, we always set interrupt rate
to one interrupt for every rx packet - exactly for the latency reasons
that you are bringing up.  Also, cpu is not busy anyways so extra
interrupts do not hurt much.  For highest utilization range, we set the
rate by default to something like an interrupt per 128 packets.  There
is also timer-based interrupt, as a last resort option.
As I mentioned earlier, it would be cool to get these moderation
thresholds from NAPI, since it can make a better guess about the
overall system utilization than the driver can.  But even at the driver
level, this works reasonably well.

- the moderation scheme is implemented in the ASIC on per channel
basis.  So, if you have workloads with very distinct latency needs, you
can just steer it to a separate channel and have an interrupt
moderation that is different from other flows, for example keep an
interrupt per packet always.

Leonid
Re: Van Jacobson net channels
Jeff Garzik <[EMAIL PROTECTED]> writes:

> Andi Kleen wrote:
> > There was already talk some time ago to make NAPI drivers use
> > the hardware mitigation again. The reason is when you have
>
> This was discussed on the netdev list, and the conclusion was that you
> want both NAPI and hw mitigation.  This was implemented in a few
> drivers, at least.

How does that deal with the latency that hw mitigation introduces.
When you have a workload that is bottle-necked waiting for that next
packet and hw mitigation is turned on you can see some horrible
unjustified slow downs.

Eric
Re: Van Jacobson net channels
From: Greg Banks <[EMAIL PROTECTED]>
Date: Thu, 02 Feb 2006 18:31:49 +1100

> On Thu, 2006-02-02 at 17:45, Andi Kleen wrote:
> > Normally TSO was supposed to fix that.
>
> Sure, except that the last time SGI looked at TSO it was
> extremely flaky.  I gather that's much better now, but TSO
> still has a very small size limit imposed by the stack (not
> the hardware).

Oh you have TSO disabled?  That explains a lot.

Yes, it's been a bumpy road, and there are still some
e1000 lockups, but in general things should be smooth
these days.
Re: Van Jacobson net channels
On Thursday 02 February 2006 08:31, Greg Banks wrote:

> The tg3 driver uses small hardcoded values for the RXCOL_TICKS
> and RXMAX_FRAMES registers, and allows "ethtool -C" to change
> them.  SGI's solution is to ship a script that uses ethtool
> at boot to tune rx-usecs, rx-frames, rx-usecs-irq, rx-frames-irq
> up from the defaults.

All user tuning like this is bad. The stack should all do that
automatically.  Would there be a drawback of making these settings
default?

> This helps a lot, and we're very grateful ;-)  But a scheme
> which used the interrupt mitigation hardware dynamically based on
> load could reduce the irq rate and CPU usage even further without
> compromising latency at low load.

If you know what's needed perhaps you could investigate it?

> Sure, except that the last time SGI looked at TSO it was
> extremely flaky.

I believe David has done quite a lot of work on it and it should be
much better now.

> I gather that's much better now, but TSO
> still has a very small size limit imposed by the stack (not
> the hardware).

You mean the 64k limit?

> > I was playing with a design some time ago to let TCP batch
> > the lower level transactions even without that.  The idea
> > was instead of calling down into IP and dev_queue_xmit et.al.
> > for each packet generated by TCP first generate a list of packets
> > in sendmsg/sendpage and then just hand down the list
> > through all layers into the driver.
>
> Cool.  Wouldn't it mean rewriting the nontrivial qdiscs?

It had some compat code that just split up the lists - same
for netfilter.  And only an implementation for pfifo_fast.

(Don't ask for code - it's not really in a usable state)

-Andi
Re: Van Jacobson net channels
On Thursday 02 February 2006 00:50, David S. Miller wrote:

> Why not concentrate your thinking on how it can be made to _work_
> instead of punching holes in the idea?  Isn't that more productive?

What I think would be very practical to do would be to try to replace
the socket rx queue and the prequeues and perhaps the qdisc queues with
a netchannel-style array of pointers (just using pointers to skbs
instead of indexes), or a list of arrays, and see if it gives any cache
benefits.

What do you think?

-Andi
Re: Van Jacobson net channels
On Thursday 02 February 2006 00:08, Jeff Garzik wrote:

> Definitely not.  POSIX AIO is far more complex than the operation
> requires,

Ah, I sense a strong NIH field.

> and is particularly bad for implementations that find it wise
> to queue a bunch of to-be-filled buffers.

Why?  lio_listio seems to be very well suited for that task to me.

> Further, the current
> implementation of POSIX AIO uses a thread for almost every I/O, which
> is yet more overkill.

That's just an implementation detail of the current Linux aio.

> A simple mmap'd ring buffer is much closer to how the hardware
> actually behaves.  It's no surprise that the "ring buffer / doorbell"
> pattern pops up all over the place in computing these days.

If you really want you can just fill in the pointer to the lio list
into a mmaped ring buffer.  This can be hidden behind the POSIX
interfaces.

[I think Ben's early kernel aio had support for that, but it was
eliminated as unneeded complexity]

> Getting the TCP receive path out of the kernel isn't a requirement,
> just an improvement.

It's not clear to me yet how this is an improvement.

> But people who care about the performance of their networking apps
> are likely to want to switch over to this new userspace networking
> API, over the next decade, I think.

POSIX aio has the advantage that it already works on some other Unixes
and some big applications have support for it that just needs to be
enabled with the right ifdef.

-Andi
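[To make the lio_listio suggestion concrete: queueing a batch of
to-be-filled receive buffers in a single call might look roughly like
the sketch below.  The buffer count and sizes are invented, completion
handling is omitted, queue_reads() is just a hypothetical helper, and
as Andi notes glibc currently services these with threads; link with
-lrt.]

#include <aio.h>
#include <stdlib.h>
#include <string.h>

#define NBUFS 8
#define BUFSZ 4096

/* Queue NBUFS asynchronous reads on sockfd with one lio_listio() call. */
int queue_reads(int sockfd)
{
	static struct aiocb cbs[NBUFS];
	struct aiocb *list[NBUFS];
	int i;

	for (i = 0; i < NBUFS; i++) {
		memset(&cbs[i], 0, sizeof(cbs[i]));
		cbs[i].aio_fildes = sockfd;
		cbs[i].aio_buf = malloc(BUFSZ);
		cbs[i].aio_nbytes = BUFSZ;
		cbs[i].aio_lio_opcode = LIO_READ;
		list[i] = &cbs[i];
	}
	/* Submit the whole batch without waiting for completion. */
	return lio_listio(LIO_NOWAIT, list, NBUFS, NULL);
}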
Re: Van Jacobson net channels
On Thu, 2006-02-02 at 17:45, Andi Kleen wrote:
> There was already talk some time ago to make NAPI drivers use
> the hardware mitigation again.  The reason is when you have
> a workload that runs below overload and doesn't quite
> fill the queues and is a bit bursty, then NAPI tends to turn
> on/off the NIC interrupts quite often.

In SGI's experience, all it takes to get into this state is an even
workload and a sufficiently fast CPU.

On Thu, 2006-02-02 at 17:49, David S. Miller wrote:
> From: Andi Kleen <[EMAIL PROTECTED]>
> Date: Thu, 2 Feb 2006 07:45:26 +0100
>
> > Don't think it was ever implemented though.  In the end we just
> > eat the slowdown in that particular load.
>
> The tg3 driver uses the chip interrupt mitigation to help
> deal with the SGI NUMA issues resulting from NAPI.

The tg3 driver uses small hardcoded values for the RXCOL_TICKS
and RXMAX_FRAMES registers, and allows "ethtool -C" to change
them.  SGI's solution is to ship a script that uses ethtool
at boot to tune rx-usecs, rx-frames, rx-usecs-irq, rx-frames-irq
up from the defaults.

This helps a lot, and we're very grateful ;-)  But a scheme
which used the interrupt mitigation hardware dynamically based on
load could reduce the irq rate and CPU usage even further without
compromising latency at low load.

On Thu, 2006-02-02 at 17:51, Andi Kleen wrote:
> On Thursday 02 February 2006 04:19, Greg Banks wrote:
> > On Thu, 2006-02-02 at 14:13, David S. Miller wrote:
> > > From: Greg Banks <[EMAIL PROTECTED]>
> > Multiple trips down through TCP, qdisc, and the driver for each
> > NFS packet sent:
>
> Normally TSO was supposed to fix that.

Sure, except that the last time SGI looked at TSO it was
extremely flaky.  I gather that's much better now, but TSO
still has a very small size limit imposed by the stack (not
the hardware).

> I was playing with a design some time ago to let TCP batch
> the lower level transactions even without that.  The idea
> was instead of calling down into IP and dev_queue_xmit et.al.
> for each packet generated by TCP first generate a list of packets
> in sendmsg/sendpage and then just hand down the list
> through all layers into the driver.

Cool.  Wouldn't it mean rewriting the nontrivial qdiscs?

Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.
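[For anyone curious what such a boot-time script ends up doing, "ethtool
-C" drives the ETHTOOL_GCOALESCE/ETHTOOL_SCOALESCE ioctls.  A minimal
sketch is below; the interface name and the particular values are only
examples, not SGI's actual settings, and error handling is minimal.]

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/types.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(void)
{
	struct ethtool_coalesce ec;
	struct ifreq ifr;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);   /* example name */
	ifr.ifr_data = (char *)&ec;

	/* Read the current coalescing parameters... */
	memset(&ec, 0, sizeof(ec));
	ec.cmd = ETHTOOL_GCOALESCE;
	if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) {
		perror("ETHTOOL_GCOALESCE");
		return 1;
	}

	/* ...then raise the RX mitigation thresholds (example values). */
	ec.rx_coalesce_usecs = 300;
	ec.rx_max_coalesced_frames = 60;
	ec.rx_coalesce_usecs_irq = 300;
	ec.rx_max_coalesced_frames_irq = 60;
	ec.cmd = ETHTOOL_SCOALESCE;
	if (ioctl(fd, SIOCETHTOOL, &ifr) < 0)
		perror("ETHTOOL_SCOALESCE");

	close(fd);
	return 0;
}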
Re: Van Jacobson net channels
Andi Kleen wrote:
> There was already talk some time ago to make NAPI drivers use
> the hardware mitigation again.  The reason is when you have

This was discussed on the netdev list, and the conclusion was that you
want both NAPI and hw mitigation.  This was implemented in a few
drivers, at least.

	Jeff
Re: Van Jacobson net channels
On Thursday 02 February 2006 07:49, David S. Miller wrote:
> From: Andi Kleen <[EMAIL PROTECTED]>
> Date: Thu, 2 Feb 2006 07:45:26 +0100
>
> > Don't think it was ever implemented though.  In the end we just
> > eat the slowdown in that particular load.
>
> The tg3 driver uses the chip interrupt mitigation to help
> deal with the SGI NUMA issues resulting from NAPI.

Ok thanks for the correction.  It was indeed fixed then.  Great!

-Andi
Re: Van Jacobson net channels
On Thursday 02 February 2006 00:37, Mitchell Blank Jr wrote:
> Jeff Garzik wrote:
> > Once packets are classified to be delivered to a specific local host
> > socket, what further operations require privs?  What received packet
> > data cannot be exposed to userspace?
>
> You just need to make sure that you don't leak data from other peoples
> sockets.

There are three basic ways I can see to do this:

- You have really advanced hardware which can potentially manage tens
of thousands of hardware queues with full classification down to the
ports.  Then everything is great.  But who has such hardware?
Perhaps Leonid will do it, but I expect the majority of Linux users to
not have access to it in the foreseeable time.  Also even with the
advanced hardware that can handle e.g. 50k sockets, what happens when
you need 100k for some extreme situation?

- You use some high level easy classifier to distinguish between
classical "slower and isolated" streams and "fast and shared by
everybody" streams.  Let's say you use two IP addresses and program the
NIC's hardware RX queues to distinguish them.  Then you end up with two
receive rings - a standard one managed in the classical way and a
netchannel one mapped into all applications running the user level TCP
stack.  This requires moderately advanced hardware (like a current
Xframe and perhaps Tigon3?), but should be possible.
One problem is that you will have to preallocate a lot of memory for
the fast ring because mapping new memory this way is relatively costly
(potentially lots of TLB flushes on all CPUs).  And of course the data
will be all shared between all fast users.  Ok, assuming the internet
is considered a rogue place these days with sniffers everywhere, I
guess that's not too bad - everybody interested in privacy should use
encryption anyways.
Still, maintaining the separate IP address as the high level
classification anchor would be somewhat of an administrator burden.
You could avoid it by putting just all data into the fast ring and
allowing everybody interested to mmap it, but I'm not sure it's a good
idea to completely drop all backwards compatibility in "secure" stream
isolation.

- You do classification to sockets in software in the interrupt handler
and then copy the data once from the memory in the RX ring into a big
preallocated buffer per netchannel consumer.  That would work, but if
the user space TCP stack is to emulate a standard read() interface it
would likely need to copy again to get the data into the place the
application expects it.  This means you would have added an additional
copy over the current stack, which is not good.  Also, the question is
how this classification would work and whether it would be really
faster than what we do today.

All the ways I described have severe drawbacks imho.  Did I miss some
clever additional way?

-Andi
RE: Van Jacobson net channels
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Andi Kleen
> Sent: Wednesday, February 01, 2006 10:45 PM
>
> There was already talk some time ago to make NAPI drivers use
> the hardware mitigation again.  The reason is when you have a
> workload that runs below overload and doesn't quite fill the
> queues and is a bit bursty, then NAPI tends to turn on/off
> the NIC interrupts quite often.  At least on some chipsets
> (Tigon3 in particular) this seems to cause slowdowns compared
> to non NAPI.  The idea (from Jamal originally iirc) was to use
> the hardware mitigation to cycle less often from polling to
> non polling state.
>
> Don't think it was ever implemented though.  In the end we
> just eat the slowdown in that particular load.

Ideally, we want NAPI to set driver interrupt rate dynamically, to a
desired number of packets per interrupt.  More and more NICs support
this in hardware as a run-time option; switching interrupts ON and OFF
is indeed a bit of an "overdrive" but can be still used for legacy
NICs.
Re: Van Jacobson net channels
On Thursday 02 February 2006 04:19, Greg Banks wrote:
> On Thu, 2006-02-02 at 14:13, David S. Miller wrote:
> > From: Greg Banks <[EMAIL PROTECTED]>
> > Date: Thu, 02 Feb 2006 14:06:06 +1100
> >
> > > On Thu, 2006-02-02 at 13:46, David S. Miller wrote:
> > > > I know SAMBA is using sendfile() (when the client has the oplock
> > > > held, which basically is "always"), is NFS doing so as well?
> > >
> > > NFS is an in-kernel server, and uses sock->ops->sendpage directly.
> >
> > Great.
> >
> > Then where's all the TX overhead for NFS?  All the small
> > transactions and the sunrpc header munging?
>
> Multiple trips down through TCP, qdisc, and the driver for each
> NFS packet sent:

Normally TSO was supposed to fix that.

I was playing with a design some time ago to let TCP batch the lower
level transactions even without that.  The idea was instead of calling
down into IP and dev_queue_xmit et.al. for each packet generated by
TCP, first generate a list of packets in sendmsg/sendpage and then just
hand down the list through all layers into the driver.

It was inspired by Andrew Morton's 2.5 work in the VM layer, where he
used this trick very successfully with pages and BHs.  But I didn't
pursue it further when it turned out all interesting hardware was using
TSO already, which does a similar thing.  There was also some
trickiness about when to do the flush exactly.

> one for the header and one for each page.  Lots
> of locks need to be taken and dropped, all this while multiple nfsds
> on multiple CPUs are all trying to reply to NFS RPCs at the same
> time.  And in the particular case of the SN2 architecture, time
> spent flushing PCI writes in the driver (less of an issue now that
> host send rings are the default in tg3).

Hmm, maybe it would still be worth it for your case with multiple
connections going on at the same time.  But accumulating the packet
list somewhere between different connections would be a natural
congestion point and potential scalability issue.

-Andi
Re: Van Jacobson net channels
From: Andi Kleen <[EMAIL PROTECTED]>
Date: Thu, 2 Feb 2006 07:45:26 +0100

> Don't think it was ever implemented though.  In the end we just
> eat the slowdown in that particular load.

The tg3 driver uses the chip interrupt mitigation to help
deal with the SGI NUMA issues resulting from NAPI.
Re: Van Jacobson net channels
On Thursday 02 February 2006 02:53, Greg Banks wrote:
> On Thu, 2006-02-02 at 08:11, David S. Miller wrote:
> > Van is not against NAPI, in fact he's taking NAPI to the next level.
> > Softirq handling is overhead, and as this work shows, it is totally
> > unnecessary overhead.
>
> I got the impression that his code was dynamically changing the
> e1000 interrupt mitigation registers in response to load, in
> other words using the capabilities of the hardware in a way that
> NAPI deliberately avoids doing.

There was already talk some time ago to make NAPI drivers use
the hardware mitigation again.  The reason is when you have
a workload that runs below overload and doesn't quite
fill the queues and is a bit bursty, then NAPI tends to turn
on/off the NIC interrupts quite often.  At least on some chipsets
(Tigon3 in particular) this seems to cause slowdowns compared
to non NAPI.  The idea (from Jamal originally iirc) was to use
the hardware mitigation to cycle less often from polling to
non polling state.

Don't think it was ever implemented though.  In the end we just
eat the slowdown in that particular load.

> > How in the world can you not understand how incredible this is?
>
> Maybe "you had to be there".  Van's presentation was amazingly
> convincing in person, in a way the slides don't convey.  I've
> not seen a standing ovation at a technical talk before ;-)

Wish I had made it then.  Perhaps I would see the light then @)

> I'm very interested in vj channels for improving CPU usage of
> NFS and Samba servers.  However, after a few days to reflect,
> I'm curious as to how the tx is improved.

Yes, I was missing that too.  He hinted about getting rid of
hard_start_xmit somehow, but then never touched it again.

-Andi
Re: Van Jacobson net channels
On Thu, 2006-02-02 at 14:32, David S. Miller wrote:
> I see.
>
> Maybe we can be smarter about how the write(), CORK, sendfile,
> UNCORK sequence is done.

From the NFS server's point of view, the ideal interface would be
to pass an array of {page,offset,len} tuples, covering up to around
1 MiB + 1 KiB in total length.

Also, nfsd doesn't cork/uncork.

Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.
Re: Van Jacobson net channels
From: Greg Banks <[EMAIL PROTECTED]>
Date: Thu, 02 Feb 2006 14:19:43 +1100

> Multiple trips down through TCP, qdisc, and the driver for each
> NFS packet sent: one for the header and one for each page.  Lots
> of locks need to be taken and dropped, all this while multiple nfsds
> on multiple CPUs are all trying to reply to NFS RPCs at the same
> time.  And in the particular case of the SN2 architecture, time
> spent flushing PCI writes in the driver (less of an issue now that
> host send rings are the default in tg3).

I see.

Maybe we can be smarter about how the write(), CORK, sendfile,
UNCORK sequence is done.

Thanks for mentioning this.
Re: Van Jacobson net channels
On Thu, 2006-02-02 at 14:13, David S. Miller wrote:
> From: Greg Banks <[EMAIL PROTECTED]>
> Date: Thu, 02 Feb 2006 14:06:06 +1100
>
> > On Thu, 2006-02-02 at 13:46, David S. Miller wrote:
> > > I know SAMBA is using sendfile() (when the client has the oplock
> > > held, which basically is "always"), is NFS doing so as well?
> >
> > NFS is an in-kernel server, and uses sock->ops->sendpage directly.
>
> Great.
>
> Then where's all the TX overhead for NFS?  All the small transactions
> and the sunrpc header munging?

Multiple trips down through TCP, qdisc, and the driver for each
NFS packet sent: one for the header and one for each page.  Lots
of locks need to be taken and dropped, all this while multiple nfsds
on multiple CPUs are all trying to reply to NFS RPCs at the same
time.  And in the particular case of the SN2 architecture, time
spent flushing PCI writes in the driver (less of an issue now that
host send rings are the default in tg3).

Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.
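[The rough shape of the transmit path Greg describes is sketched below:
the in-kernel NFS server (net/sunrpc/svcsock.c) pushes each reply as a
header buffer plus one sock->ops->sendpage() call per data page, so
every reply makes several separate trips down through TCP, the qdisc
and the driver.  This is a schematic reconstruction, not the actual
svc_sendto() code; send_reply() and its header/page bookkeeping are
invented for illustration.]

#include <linux/net.h>
#include <linux/socket.h>
#include <linux/uio.h>
#include <linux/mm.h>

static int send_reply(struct socket *sock, struct kvec *head,
		      struct page **pages, int npages, size_t last_len)
{
	struct msghdr msg = { .msg_flags = MSG_MORE };
	ssize_t sent;
	int i, ret;

	/* One pass down the stack for the RPC reply header... */
	ret = kernel_sendmsg(sock, &msg, head, 1, head->iov_len);
	if (ret < 0)
		return ret;

	/* ...then one pass per page of READ data. */
	for (i = 0; i < npages; i++) {
		size_t len = (i == npages - 1) ? last_len : PAGE_SIZE;
		int flags = (i == npages - 1) ? 0 : MSG_MORE;

		sent = sock->ops->sendpage(sock, pages[i], 0, len, flags);
		if (sent < 0)
			return sent;
	}
	return 0;
}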
Re: Van Jacobson net channels
From: Greg Banks <[EMAIL PROTECTED]>
Date: Thu, 02 Feb 2006 14:06:06 +1100

> On Thu, 2006-02-02 at 13:46, David S. Miller wrote:
> > I know SAMBA is using sendfile() (when the client has the oplock
> > held, which basically is "always"), is NFS doing so as well?
>
> NFS is an in-kernel server, and uses sock->ops->sendpage directly.

Great.

Then where's all the TX overhead for NFS?  All the small transactions
and the sunrpc header munging?
Re: Van Jacobson net channels
On Thu, 2006-02-02 at 13:46, David S. Miller wrote:
> I know SAMBA is using sendfile() (when the client has the oplock held,
> which basically is "always"), is NFS doing so as well?

NFS is an in-kernel server, and uses sock->ops->sendpage directly.

> Van does have some ideas in mind for TX net channels that I touched
> upon briefly with him, and we'll see some more things in this area, we
> just need to be patient. :)

Cool.

Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.
Re: Van Jacobson net channels
From: Greg Banks <[EMAIL PROTECTED]>
Date: Thu, 02 Feb 2006 12:53:14 +1100

> I got the impression that his code was dynamically changing the
> e1000 interrupt mitigation registers in response to load, in
> other words using the capabilities of the hardware in a way that
> NAPI deliberately avoids doing.  I'm very curious to see the
> details.

Yes, once you stop doing NAPI and demux in the driver, we start
using the HW interrupt mitigation facilities again.

> cpu usage on tx is a significant part of the CPU usage
> issues for many interesting NFS workloads.

I know SAMBA is using sendfile() (when the client has the oplock held,
which basically is "always"), is NFS doing so as well?

Van does have some ideas in mind for TX net channels that I touched
upon briefly with him, and we'll see some more things in this area, we
just need to be patient. :)
Re: Van Jacobson net channels
David S. Miller wrote:
> From: Rick Jones <[EMAIL PROTECTED]>
> Date: Wed, 01 Feb 2006 17:32:24 -0800
>
> > How large is "the bulk?"
>
> The prequeue is always enabled when the app has blocked on read().

Actually I meant in terms of percentage of the cycles to process the
packet rather than frequency of occurrence, but that is an interesting
question - any read(), or a read() against the socket associated with
that connection?

What happens when the application has not blocked on a read - say an
application using (e)poll on M connections?  Does that resurrect my
supposition about the three degrees of parallelism?

> > > Ie. ACK goes out as fast as we can context switch
> > > to the app receiving the data.  This feedback makes all senders
> > > to a system send at a rate that system can handle.
> >
> > Once those senders have filled the TCP windows right?
>
> All you have to do for this to take effect is to fill the congestion
> window, which starts at 2 packets :-)

I think we've had that conversation before - it starts at 4380 bytes or
three segments, whichever comes first, right?  IIRC (I should look this
up but...) the two-segment case is when the MSS is larger than 1460?

rick jones
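[For reference, the rule being half-remembered here is the RFC 3390
initial window, IW = min(4*MSS, max(2*MSS, 4380 bytes)), which I believe
is also what Linux's tcp_init_cwnd() implements when no route metric
overrides it.  Restated in whole segments (an illustrative helper, not
kernel code; rfc3390_initial_cwnd_segs() is a made-up name):]

/* RFC 3390: IW = min(4*MSS, max(2*MSS, 4380 bytes)), i.e.
 *   4 segments for MSS <= 1095,
 *   3 segments for 1095 < MSS <= 1460 (3 * 1460 = 4380),
 *   2 segments for anything larger. */
static unsigned int rfc3390_initial_cwnd_segs(unsigned int mss)
{
	if (mss > 1460)
		return 2;
	return (mss > 1095) ? 3 : 4;
}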
Re: Van Jacobson net channels
On Thu, 2006-02-02 at 08:11, David S. Miller wrote:
> Van is not against NAPI, in fact he's taking NAPI to the next level.
> Softirq handling is overhead, and as this work shows, it is totally
> unnecessary overhead.

I got the impression that his code was dynamically changing the
e1000 interrupt mitigation registers in response to load, in
other words using the capabilities of the hardware in a way that
NAPI deliberately avoids doing.  I'm very curious to see the
details.

> How in the world can you not understand how incredible this is?

Maybe "you had to be there".  Van's presentation was amazingly
convincing in person, in a way the slides don't convey.  I've
not seen a standing ovation at a technical talk before ;-)

I'm very interested in vj channels for improving CPU usage of
NFS and Samba servers.  However, after a few days to reflect,
I'm curious as to how the tx is improved.  Van didn't touch upon
the tx side at all, and cpu usage on tx is a significant part of
the CPU usage issues for many interesting NFS workloads.

The other objections raised here are non-issues for an NFS or
Samba server.

Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.
Re: Van Jacobson net channels
From: Rick Jones <[EMAIL PROTECTED]>
Date: Wed, 01 Feb 2006 17:32:24 -0800

> How large is "the bulk?"

The prequeue is always enabled when the app has blocked on read().

> > Ie. ACK goes out as fast as we can context switch
> > to the app receiving the data.  This feedback makes all senders
> > to a system send at a rate that system can handle.
>
> Once those senders have filled the TCP windows right?

All you have to do for this to take effect is to fill the congestion
window, which starts at 2 packets :-)
Re: Van Jacobson net channels
> > Maybe I'm not sufficiently clued-in, but in broad handwaving terms,
> > it seems today that all three can be taking place in parallel for a
> > given TCP connection.  The application is doing its
> > application-level thing on request N on one CPU, while request N+1
> > is being processed by TCP on another CPU, while the NIC is DMA'ing
> > request N+2 into the host.
>
> That's not what happens in the Linux TCP stack even today.  The bulk
> of the TCP processing is done in user context via the kernel prequeue.

How large is "the bulk?"

> When we get a TCP packet, we simply find the socket and tack on the
> SKB, then wake up the task.  We do none of the TCP packet processing.
> Once the app wakes up, we do the TCP input path and copy the data
> directly into user space all in one go.

Sounds a little like TOPS but with the user context.  OK, I think I can
grasp that.

> This has several advantages.
>
> 1) TCP stack processing is accounted for in the user process
>
> 2) ACK emission is done at a rate that matches the load of the system.

ACK emission sounds like something tracked by the EPA :)

> Ie. ACK goes out as fast as we can context switch
> to the app receiving the data.  This feedback makes all senders
> to a system send at a rate that system can handle.

Once those senders have filled the TCP windows right?

> 3) checksum + copy in parallel into userspace is possible
>
> And we've been doing things like this for 6 years :-)  This prequeue
> is another Van Jacobson idea btw, and net channels just extend this
> concept further.

So the parallelism I gained by moving netperf from the interrupt CPU to
the non-interrupt CPU was strictly between the driver+ip on the
interrupt CPU and tcp+socket on the other, right?

Cpu0 :  0.1% us,  0.1% sy,  0.0% ni, 60.1% id,  0.2% wa,  0.5% hi, 39.1% si
Cpu1 :  0.2% us, 20.9% sy,  0.0% ni, 34.2% id,  0.0% wa,  0.0% hi, 44.8% si

(netperf was bound to CPU1, and assuming the top numbers are
trustworthy)

rick jones
onlist, no need to cc
Re: Van Jacobson net channels
From: Rick Jones <[EMAIL PROTECTED]>
Date: Wed, 01 Feb 2006 16:39:00 -0800

> My questions are meant to see if something is even a roadblock in
> the first place.

Fair enough.

> Maybe I'm not sufficiently clued-in, but in broad handwaving terms,
> it seems today that all three can be taking place in parallel for a
> given TCP connection.  The application is doing its
> application-level thing on request N on one CPU, while request N+1
> is being processed by TCP on another CPU, while the NIC is DMA'ing
> request N+2 into the host.

That's not what happens in the Linux TCP stack even today.  The bulk
of the TCP processing is done in user context via the kernel prequeue.

When we get a TCP packet, we simply find the socket and tack on the
SKB, then wake up the task.  We do none of the TCP packet processing.
Once the app wakes up, we do the TCP input path and copy the data
directly into user space all in one go.

This has several advantages.

1) TCP stack processing is accounted for in the user process

2) ACK emission is done at a rate that matches the load of the
   system.  Ie. ACK goes out as fast as we can context switch to the
   app receiving the data.  This feedback makes all senders to a
   system send at a rate that system can handle.

3) checksum + copy in parallel into userspace is possible

And we've been doing things like this for 6 years :-)  This prequeue is
another Van Jacobson idea btw, and net channels just extend this
concept further.
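[The path David describes, paraphrased from memory of the 2.6-era
sources (tcp_v4_rcv() and tcp_recvmsg()); this is a schematic fragment,
not a verbatim excerpt.]

/* Softirq side, tcp_v4_rcv(): if nobody holds the socket lock, try to
 * park the skb on the prequeue and wake the blocked reader instead of
 * doing the TCP input processing here. */
	if (!sock_owned_by_user(sk)) {
		if (!tcp_prequeue(sk, skb))        /* queued for the reader? */
			ret = tcp_v4_do_rcv(sk, skb);  /* no reader: do it now */
	} else
		sk_add_backlog(sk, skb);

/* Process context, tcp_recvmsg(): drain the prequeue, so the real TCP
 * input work - including the ACKs and the copy/checksum into the user
 * buffer - runs in the context (and on the CPU) of the receiving task. */
	tcp_prequeue_process(sk);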
Re: Van Jacobson net channels
David S. Miller wrote: From: Rick Jones <[EMAIL PROTECTED]> Date: Wed, 01 Feb 2006 15:50:38 -0800 [ What sucks about this whole thread is that only folks like Jeff and myself are attempting to think and use our imagination to consider how some roadblocks might be overcome ] My questions are meant to see if something is even a roadblock in the first place. If the TCP processing is put in the user context, that means there is no more parallelism between the application doing its non-TCP stuff, and the TCP stuff for say the next request, which presently could be processed on another CPU right? There is no such implicit limitation, really. Consider the userspace mmap()'d ring buffer being tagged with, say, connection IDs. Say, file descriptors. In this way the kernel could dump into a single net channel for multiple sockets, and then the app can demux this stuff however it likes. In particular, things like HTTP would want this because web servers get lots of tiny requests and using a net channel per socket could be very wasteful. I'm not meaning to talk about mux/demux of multiple connections, I'm asking about where all the cycles are consumed and how that affects parallelism between user space, "TCP/IP processing" and the NIC for a given flow/connection/whatever. Maybe I'm not sufficiently clued-in, but in broad handwaving terms, it seems today that all three can be taking place in parallel for a given TCP connection. The application is doing its application-level thing on request N on one CPU, while request N+1 is being processed by TCP on another CPU, while the NIC is DMA'ing request N+2 into the host. If the processing is pushed all the way up to user space, will it be the case that the single-threaded application code can be working on request N while the TCP code is processing request N+1? That's what I'm trying to ask about. I think the data I posted about saturating a GbE bidirectionally with a single TCP connection shows an example of advantage being taken of parallelism between the application doing its thing on request N, while TCP is processing N+1 on another CPU and the NIC is bringing N+2 into the RAM. ["Re: [RFC] Poor Network Performance with e1000 on 2.6.14.3" msg id <[EMAIL PROTECTED]> ] What I'm not sure of is if that fully matters. Hence the questions. rick jones So, other background... long ago and far away, in HP-UX 10.20 which was BSDish in its networking, with Inbound Packet Scheduling, the netisr handoff included a hash of the header info and a per-CPU netisr would be used for the "TCP processing" That got HP-UX parallelism for multiple TCP connections coming through a single NIC. It meant that a single threaded application, with multiple connections would have the inbound TCP processing possibly scattered across all the CPUs while it was running on only one CPU. Cache lines for socket structures going back and forth could indeed be a concern although moving a cache line from one CPU to another is not a priori evil (although the threshold is rather high IMO). In HP-UX 11.X IPS was replaced with Thread Optimized Packet Scheduling (TOPS). There was still a netisr-like hand-off (although not as low in the stack as I would have liked it) where a lookup took place that found where the application last accessed that connection (I think Solaris Fire Engine does something very similar today). The idea there was that the place where inbound processing would take place would be determined by where the application last accessed the socket. 
Still get advantage taken of multiple CPUs for multiple connections to multiple threads, but at the price of losing one part of the app/tcp/nic parallelism. Both TOPS and IPS have been successful in their days. I'm trying to come to grips with which might be "better" - if it is even possible to say that one was better than the other. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
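A rough sketch of the TOPS-style steering described above, under the assumption that a simple 4-tuple hash table is enough (all names and the hash are invented): the table remembers the CPU where the application last touched each connection, and the netisr-like handoff consults it instead of spreading work by a pure header hash as IPS did.

#include <stdint.h>

#define FLOW_TABLE_SIZE 1024

struct flow_entry {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
    int      last_cpu;      /* CPU where the app last touched the socket */
};

static struct flow_entry flow_table[FLOW_TABLE_SIZE];

static unsigned flow_hash(uint32_t saddr, uint32_t daddr,
                          uint16_t sport, uint16_t dport)
{
    uint32_t h = saddr ^ daddr ^ (((uint32_t)sport << 16) | dport);
    h ^= h >> 16;
    return h & (FLOW_TABLE_SIZE - 1);
}

/* Recorded whenever the application accesses the socket. */
static void flow_note_access(struct flow_entry *f, int this_cpu)
{
    f->last_cpu = this_cpu;
}

/* Consulted at the netisr-like handoff: inbound protocol processing is
 * steered to the CPU the application used last (TOPS), instead of
 * being spread by a pure header hash (IPS). */
static int flow_pick_cpu(uint32_t saddr, uint32_t daddr,
                         uint16_t sport, uint16_t dport, int default_cpu)
{
    struct flow_entry *f = &flow_table[flow_hash(saddr, daddr, sport, dport)];

    if (f->saddr == saddr && f->daddr == daddr &&
        f->sport == sport && f->dport == dport)
        return f->last_cpu;
    return default_cpu;
}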
Re: Van Jacobson net channels
From: Stephen Hemminger <[EMAIL PROTECTED]> Date: Wed, 1 Feb 2006 16:12:14 -0800 > The bigger problem I see is scalability. All those mmap rings have to > be pinned in memory to be useful. It's fine for a single smart application > per server environment, but in real world with many dumb thread monster > applications on a single server it will be really hard to get working. This is no different from when the thread blocks and the receive queue fills up, and in order to absorb scheduling latency. We already lock memory into the kernel for socket buffer memory as it is. At least the mmap() ring buffer method is optimized and won't have all of the overhead for struct sk_buff and friends. So we have the potential to lock down less memory not more. This is just like when we started using BK or GIT for source management, everyone was against it and looking for holes while they tried to wrap their brains around the new concepts and ideas. I guess it will take a while for people to understand all this new stuff, but we'll get there. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Van Jacobson net channels
On Wed, 01 Feb 2006 15:42:39 -0800 (PST) "David S. Miller" <[EMAIL PROTECTED]> wrote: > From: Andi Kleen <[EMAIL PROTECTED]> > Date: Wed, 1 Feb 2006 23:55:11 +0100 > > > On Wednesday 01 February 2006 21:26, Jeff Garzik wrote: > > > Andi Kleen wrote: > > > > But I don't think Van's design is supposed to be exposed to user space. > > > > > > It is supposed to be exposed to userspace AFAICS. > > > > Then it's likely insecure and root only, unless he knows some magic > > that we don't. > > > > I hope it's not just PF_PACKET mmap rings with a user space TCP library. > > Yes, that's it. > > If the user screws up the TCP connection and corrupts his data why > should the kernel care? > > > I mean the Linux implementation is in the kernel, but in user context. > > Right, but prequeue doesn't go nearly far enough. We still do up to 5 > demuxes on the input path (protocol, route, IPSEC, netfilter, socket) > plus the queueing at the softint layer. That's rediculious and we've > always understood this, and Van has presented a way to kill this stuff > off. > > But even if you don't like the userspace stuff, we don't necessarily > have to go there, we can just demux directly to sockets in the kernel > TCP stack and then revisit the userspace idea before committing to it. > The bigger problem I see is scalability. All those mmap rings have to be pinned in memory to be useful. It's fine for a single smart application per server environment, but in real world with many dumb thread monster applications on a single server it will be really hard to get working. -- Stephen Hemminger <[EMAIL PROTECTED]> OSDL http://developer.osdl.org/~shemminger - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Van Jacobson net channels
From: Rick Jones <[EMAIL PROTECTED]> Date: Wed, 01 Feb 2006 15:50:38 -0800 [ What sucks about this whole thread is that only folks like Jeff and myself are attempting to think and use our imagination to consider how some roadblocks might be overcome ] > If the TCP processing is put in the user context, that means there > is no more parallelism between the application doing its non-TCP > stuff, and the TCP stuff for say the next request, which presently > could be processed on another CPU right? There is no such implicit limitation, really. Consider the userspace mmap()'d ring buffer being tagged with, say, connection IDs. Say, file descriptors. In this way the kernel could dump into a single net channel for multiple sockets, and then the app can demux this stuff however it likes. In particular, things like HTTP would want this because web servers get lots of tiny requests and using a net channel per socket could be very wasteful. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
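As a sketch of the connection-ID-tagged ring idea above - the layout and names below are invented for illustration, not an existing ABI - each slot carries the owning file descriptor, so one channel can feed many sockets and the application performs the final demux itself, which is what a web server with lots of tiny requests would want.

#include <stdint.h>
#include <stddef.h>

#define RING_SLOTS 256

struct chan_slot {
    volatile uint32_t ready;    /* producer sets 1, consumer clears    */
    int               fd;       /* which connection this frame is for  */
    uint32_t          len;
    uint8_t           data[2048];
};

struct chan_ring {
    struct chan_slot slot[RING_SLOTS];
    unsigned         tail;      /* consumer's position                 */
};

/* Application-side demux: walk the ring and hand each frame to the
 * per-connection handler chosen by the fd tag. */
typedef void (*conn_handler)(int fd, const uint8_t *buf, size_t len);

static void channel_poll(struct chan_ring *r, conn_handler handle)
{
    for (;;) {
        struct chan_slot *s = &r->slot[r->tail % RING_SLOTS];

        if (!s->ready)
            break;                      /* nothing more to consume     */
        handle(s->fd, s->data, s->len);
        s->ready = 0;                   /* give the slot back          */
        r->tail++;
    }
}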
Re: Van Jacobson net channels
It almost feels like the channel concept wants a "thread per connection" model? No, it means only that your application must be asynchronous -- which all modern network apps are already. The INN model of a single process calling epoll(2) for 800 sockets should continue to work, as should the Apache N-sockets-per-thread model, as should the thread-per-connection model. All of that continues to be within the realm of application choice. I may not have been as clear as I should - I'm not meaning to ask about stuff like epoll continuing to function etc. If the TCP processing is put in the user context, that means there is no more parallelism between the application doing its non-TCP stuff, and the TCP stuff for say the next request, which presently could be processed on another CPU right? Like when I did the "yes, one can saturate GbE both ways on a single connection - when netperf runs on a CPU other than the one doing (some fraction) of the TCP processing" message earlier today in another thread. rick jones on list, no need to cc - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Van Jacobson net channels
From: Mitchell Blank Jr <[EMAIL PROTECTED]> Date: Wed, 1 Feb 2006 15:37:04 -0800 > So I agree that this would have to be CAP_NET_ADMIN only. I'm drowning in all of this pessimism, folks. Why not concentrate your thinking on how it can be made to _work_ instead of punching holes in the idea? Isn't that more productive? Or I suppose those studly numbers aren't incentive enough to find a solution and try to be optimistic? If so, I think that sucks. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Van Jacobson net channels
From: Andi Kleen <[EMAIL PROTECTED]> Date: Wed, 1 Feb 2006 23:55:11 +0100 > On Wednesday 01 February 2006 21:26, Jeff Garzik wrote: > > Andi Kleen wrote: > > > But I don't think Van's design is supposed to be exposed to user space. > > > > It is supposed to be exposed to userspace AFAICS. > > Then it's likely insecure and root only, unless he knows some magic > that we don't. > > I hope it's not just PF_PACKET mmap rings with a user space TCP library. Yes, that's it. If the user screws up the TCP connection and corrupts his data, why should the kernel care? > I mean the Linux implementation is in the kernel, but in user context. Right, but prequeue doesn't go nearly far enough. We still do up to 5 demuxes on the input path (protocol, route, IPSEC, netfilter, socket) plus the queueing at the softint layer. That's ridiculous and we've always understood this, and Van has presented a way to kill this stuff off. But even if you don't like the userspace stuff, we don't necessarily have to go there, we can just demux directly to sockets in the kernel TCP stack and then revisit the userspace idea before committing to it. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Van Jacobson net channels
Jeff Garzik wrote: > Once packets classified to be delivered to a specific local host socket, > what further operations are require privs? What received packet data > cannot be exposed to userspace? You just need to make sure that you don't leak data from other peoples sockets. Two issues I see: 1. If the card receives a long frame for application #1 and then receives a short frame for application #2, then you need to make sure that the data gets zeroed out first. So you need to limit this to only maximum-sized packets (or packets whose previous use was on the same flow). Probably not a big deal, since that's the performance-critical case anyway 2. More concerning is how you control what packets the app can see. If you made the memory frames all PAGE_SIZE then you could just give the app the packets to its flows by doing MMU tricks, but wouldn't that murder performance anyway? So I think the only real solution would be to allow the app to map all of the frames all of the time. So I agree that this would have to be CAP_NET_ADMIN only. -Mitch - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
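A minimal sketch of the first point above, with invented names: when a receive frame is recycled across flows and only partially overwritten, the stale tail has to be cleared (or the frame restricted to same-flow reuse) before it is exposed to the new owner.

#include <string.h>
#include <stdint.h>

struct rx_frame {
    uint32_t owner_id;      /* flow/app the frame was last used for */
    uint32_t frame_size;    /* full size of the DMA buffer          */
    uint8_t  data[2048];
};

/* Called before a refilled frame is made visible to userspace. */
static void frame_publish(struct rx_frame *f, uint32_t new_owner,
                          uint32_t new_len)
{
    /* Cheap cases: same flow as before, or the frame was completely
     * overwritten by a maximum-sized packet.  Otherwise zero the tail
     * so the previous owner's bytes cannot leak. */
    if (f->owner_id != new_owner && new_len < f->frame_size)
        memset(f->data + new_len, 0, f->frame_size - new_len);
    f->owner_id = new_owner;
}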
Re: Van Jacobson net channels
But people who care about the performance of their networking apps are likely to want to switch over to this new userspace networking API, over the next decade, I think. Yet there needs to be some cross-platform commonality for the API yes? That was the main thrust behind my simplistic asking about posix aio being sufficient (which of course could be more than necessary :) to the task - at least as an API, not specifically about any given implementation of it - I agree that doing aio by launching another thread is rather silly... rick jones on list, no need for cc - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Van Jacobson net channels
Andi Kleen wrote: On Wednesday 01 February 2006 21:26, Jeff Garzik wrote: Andi Kleen wrote: But I don't think Van's design is supposed to be exposed to user space. It is supposed to be exposed to userspace AFAICS. Then it's likely insecure and root only, unless he knows some magic that we don't. Once packets are classified to be delivered to a specific local host socket, what further operations require privs? What received packet data cannot be exposed to userspace? I hope it's not just PF_PACKET mmap rings with a user space TCP library. Why? It's still in the kernel, just in process context. Incorrect. It's in the userspace app (though usually via a library). See slides 26 and 27. I mean the Linux implementation is in the kernel, but in user context. Yes, I know what you meant. My answer still stands... Certainly older applications that use only read(2) and write(2) must continue to work. Jeff - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Van Jacobson net channels
Rick Jones wrote: what are the implications for having the application churning away doing application things while TCP is feeding it data? Or for an application that is processing more than one TCP connection in a given thread? It almost feels like the channel concept wants a "thread per connection" model? No, it means only that your application must be asynchronous -- which all modern network apps are already. The INN model of a single process calling epoll(2) for 800 sockets should continue to work, as should the Apache N-sockets-per-thread model, as should the thread-per-connection model. All of that continues to be within the realm of application choice. Jeff - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Van Jacobson net channels
Rick Jones wrote: Jeff Garzik wrote: Key point 1: Van's slides align closely with the design that I was already working on, for zero-copy RX. To have a fully async, zero copy network receive, POSIX read(2) is inadequate. Is there an aio_read() in POSIX adequate to the task? Definitely not. POSIX AIO is far more complex than the operation requires, and is particularly bad for implementations that find it wise to queue a bunch of to-be-filled buffers. Further, the current implementation of POSIX AIO uses a thread for almost every I/O, which is yet more overkill. A simple mmap'd ring buffer is much closer to how the hardware actually behaves. It's no surprise that the "ring buffer / doorbell" pattern pops up all over the place in computing these days. Are you speaking strictly in the context of a single TCP connection, or for multiple TCP connections? For the latter getting out of the kernel isn't a priori a requirement. Actually, I'm not even sure it is a priori a requirement for the former? Getting the TCP receive path out of the kernel isn't a requirement, just an improvement. You'll always have to have a basic path for existing applications that do normal read(2) and write(2). You can't break something that fundamental. But people who care about the performance of their networking apps are likely to want to switch over to this new userspace networking API, over the next decade, I think. Jeff - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Van Jacobson net channels
On Wednesday 01 February 2006 21:26, Jeff Garzik wrote: > Andi Kleen wrote: > > But I don't think Van's design is supposed to be exposed to user space. > > It is supposed to be exposed to userspace AFAICS. Then it's likely insecure and root only, unless he knows some magic that we don't. I hope it's not just PF_PACKET mmap rings with a user space TCP library. > > > It's still in the kernel, just in process context. > > Incorrect. Its in the userspace app (though usually via a library). > See slides 26 and 27. I mean the Linux implementation is in the kernel, but in user context. -Andi - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Van Jacobson net channels
On Wednesday 01 February 2006 22:11, David S. Miller wrote: > From: Andi Kleen <[EMAIL PROTECTED]> > Date: Wed, 1 Feb 2006 19:28:46 +0100 > > > http://www.lemis.com/grog/Documentation/vj/lca06vj.pdf > > I did a writeup in my blog about all of this, another good > reason to actively follow my blog: > > http://vger.kernel.org/~davem/cgi-bin/blog.cgi/index.html > > Go read. > > > -Andi (who prefers sourceware over slideware) > > People are definitely hung up on the details, and that means > they are analyzing Van's work from the absolute _wrong_ angle. The main reason I look for details is that it's unclear to me if his work is one copy or zero copy and how the actual data in the channels is managed. The netchannels seem to just pass indexes into some other buffer, so unless he found a much better e1000 than I have @) it's probably single copy from the RX ring into another big buffer. Right? Some of the other stuff sounded like an attempted zero copy. How is that other buffer managed? Is it sitting in user space? If yes, then how does the data end up in the simulated read() in user space? That would require another copy unless I'm missing something. The other way, if it's not copy-from-rx-ring to another buffer, would be to have a big shared pool that is always mapped to everybody (assuming no intelligent NIC queue support) - that would be insecure, right? I guess independent of any other stuff it would be an interesting experiment to change the socket and TCP prequeue into a linked list of arrays pointing to skbs and see if it really helps over the doubly linked lists (those are the points that should pass skbs between CPUs). Also the TX part is a bit unclear. > So when a TCP socket enters established state, we add an entry into > the classifier. The classifier is even smart enough to look for > a listening socket if the fully established classification fails. I think it's a pretty important detail. The current TCP demultiplex is a considerable part of the TCP processing cost and I haven't seen any good proposals yet to make it faster [except the old one of using a smaller hash ..] Is he using some kind of binary tree for this or a hash? > Van is not against NAPI, in fact he's taking NAPI to the next level. > Softirq handling is overhead, and as this work shows, it is totally > unnecessary overhead. > > Yes we do TCP prequeue now, and that's where the second stage net > channel stuff hooks into. But prequeue as we have it now is not > enough, we still run softirq, and IP input processing from softirq not > from user socket context. I don't quite get why this is a problem. Softirq is on the same CPU as the interrupt, so it should be pretty cheap (no bounced cachelines). Due to the way the stacking works the cache locality should also be OK (except for the big hash tables). > The RX net channel bypasses all of that > crap. > > The way we do softirq now we can feed one cpu with softirq work given > a single card, with Van's stuff we can feed socket users on multiple > cpus with a single card. The net channel data structure SMP > friendliness really helps here. OK, so the point is to not keep the softirq work on the CPU which has the interrupt affinity. MSI-X & receive hashing should solve that one mostly anyways, no? But I agree it would be nice to fix it on old hardware too. -Andi - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
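A sketch of the "linked list of arrays" experiment suggested above, with invented names and sizes: pointers to packets are batched into chunks linked together, so the point where packets are handed from one CPU to another touches far fewer cache lines than a doubly linked list of skbs would.

#include <stdlib.h>

#define CHUNK_ENTRIES 14            /* roughly a couple of cache lines */

struct pkt_chunk {
    struct pkt_chunk *next;
    unsigned          count;
    void             *pkt[CHUNK_ENTRIES];   /* would be struct sk_buff * */
};

struct array_queue {
    struct pkt_chunk *head, *tail;
};

/* Producer side: the common case is a single pointer store into a
 * chunk that is already cache hot; a new chunk is allocated only
 * every CHUNK_ENTRIES packets.  The consumer drains whole chunks. */
static int aq_enqueue(struct array_queue *q, void *pkt)
{
    struct pkt_chunk *c = q->tail;

    if (!c || c->count == CHUNK_ENTRIES) {
        c = calloc(1, sizeof(*c));
        if (!c)
            return -1;
        if (q->tail)
            q->tail->next = c;
        else
            q->head = c;
        q->tail = c;
    }
    c->pkt[c->count++] = pkt;
    return 0;
}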
Re: Van Jacobson net channels
At the risk of being told to launch myself towards a body of water... So, sort of linking with the data about saturating a GbE both ways on a single TCP connection, and how it required binding netperf to the CPU other than the one taking interrupts... If channels are taken to their limit, and the non-hard-irq processing of the packet is all in the user's context what are the implications for having the application churning away doing application things while TCP is feeding it data? Or for an application that is processing more than one TCP connection in a given thread? It almost feels like the channel concept wants a "thread per connection" model? rick jones - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Van Jacobson net channels
From: Jeff Garzik <[EMAIL PROTECTED]> Date: Wed, 01 Feb 2006 14:37:46 -0500 > So, I am not concerned with slideware. These are two good ideas that > are worth pursuing, even if Van produces zero additional output. Right. And, to all of you having trouble imagining how else you'd apply these net channel ideas. Consider the case where the RX classifier gives you the net channel which is the output queue of another device. That's routing and packet mirroring :-) The RX classifier could give you a net channel for a netfilter rule. And let's assume that after netfilter NAT's the packet, it's for the local host, and netfilter does another classification and ends up with a local socket net channel. Use your imagination, this stuff can be applied everywhere. It's like eating peanuts. You start eating them one by one and you just can't stop :-) - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Van Jacobson net channels
From: Andi Kleen <[EMAIL PROTECTED]> Date: Wed, 1 Feb 2006 20:50:31 +0100 > On Wednesday 01 February 2006 20:37, Jeff Garzik wrote: > > > To have a fully async, zero copy network receive, POSIX read(2) is > > inadequate. > > Agreed, but POSIX aio is adequate. No, it's a joke. To do this stuff right you want networking experts (not UNIX interface standards experts) to come up with how to do things, because folks like POSIX are going to make a rocking implementation next to impossible. Only folks like Van Jacobson can take us out of the myopic view we currently have of how networking receive is done. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Van Jacobson net channels
From: Andi Kleen <[EMAIL PROTECTED]> Date: Wed, 1 Feb 2006 19:28:46 +0100 > http://www.lemis.com/grog/Documentation/vj/lca06vj.pdf I did a writeup in my blog about all of this, another good reason to actively follow my blog: http://vger.kernel.org/~davem/cgi-bin/blog.cgi/index.html Go read. > -Andi (who prefers sourceware over slideware) People are definitely hung up on the details, and that means they are analyzing Van's work from the absolute _wrong_ angle. This surprised me, what I expected was for anyone knowledgable about networking to get this immediately, and as for the details, have an attitude of "I don't care how, let's find a way to make this work!" But since you're so hung up on the details, the basic idea is that there is a tiny classifier in the RX IRQ processing of the driver. We have to touch that first cache line of the packet headers anyway, so the classification comes for free. You'll notice that even though he's running this tiny classifier in the hard IRQ context, in order to put the packet on the right RX net channel, IRQ overhead remains the same. So when a TCP socket enters established state, we add an entry into the classifier. The classifier is even smart enough to look for a listening socket if the fully established classification fails. Van is not against NAPI, in fact he's taking NAPI to the next level. Softirq handling is overhead, and as this work shows, it is totally unnecessary overhead. Yes we do TCP prequeue now, and that's where the second stage net channel stuff hooks into. But prequeue as we have it now is not enough, we still run softirq, and IP input processing from softirq not from user socket context. The RX net channel bypasses all of that crap. The way we do softirq now we can feed one cpu with softirq work given a single card, with Van's stuff we can feed socket users on multiple cpus with a single card. The net channel data structure SMP friendliness really helps here. In one shot it does the input route lookup and the socket lookup. We just attach the packet to the socket's RX net channel, all from hard IRQ context, at zero cost (see above). This is just like the grand unified flow cache idea that we've been tossing around for the past few years. And the beauty of all of this is that it complements ideas like LRO, I/O AT, and cpu architectures like Niagara. How in the world can you not understand how incredible this is? - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
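To illustrate the classifier step being described - a sketch only, under the assumption that a flat 4-tuple hash is acceptable; the names and hash are invented, and this is neither Van's nor the kernel's code - an entry is installed when the socket reaches established state, and the RX interrupt path does one cheap lookup on the header cache line it already touched, falling back to the listening socket's channel on a miss.

#include <stdint.h>
#include <string.h>

#define CLS_BUCKETS 4096

struct cls_key {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
};

struct cls_entry {
    struct cls_key key;
    void          *channel;     /* the socket's RX net channel */
    int            in_use;
};

static struct cls_entry cls_tab[CLS_BUCKETS];

static unsigned cls_hash(const struct cls_key *k)
{
    uint32_t h = k->saddr ^ k->daddr ^ (((uint32_t)k->sport << 16) | k->dport);
    h ^= h >> 12;
    return h & (CLS_BUCKETS - 1);
}

/* Called when the TCP socket enters established state. */
static void cls_add(const struct cls_key *k, void *channel)
{
    struct cls_entry *e = &cls_tab[cls_hash(k)];

    e->key = *k;
    e->channel = channel;
    e->in_use = 1;
}

/* Called from RX interrupt context; the headers were just read to
 * build the key, so the lookup rides on a cache line already paid for. */
static void *cls_lookup(const struct cls_key *k, void *listener_channel)
{
    struct cls_entry *e = &cls_tab[cls_hash(k)];

    if (e->in_use && !memcmp(&e->key, k, sizeof(*k)))
        return e->channel;
    return listener_channel;    /* fall back to the listening socket */
}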
Re: Van Jacobson net channels
Andi Kleen wrote: But I don't think Van's design is supposed to be exposed to user space. It is supposed to be exposed to userspace AFAICS. It's still in the kernel, just in process context. Incorrect. It's in the userspace app (though usually via a library). See slides 26 and 27. But irrespective of the slides, think about the underlying concept: the most efficient pipe of this sort is for the NIC to hand [selected] packets directly to the userspace app, with minimal (hopefully zero) copying. The userspace app is then given the freedom to choose how it handles incoming TCP data, either via custom algorithms or a standard shared library. I agree that ACK latency may be one potential issue... but this overall design is what a lot of low latency systems are moving to. Jeff - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Van Jacobson net channels
On 2/1/06, Andi Kleen <[EMAIL PROTECTED]> wrote: > On Wednesday 01 February 2006 20:37, Jeff Garzik wrote: > > > To have a fully async, zero copy network receive, POSIX read(2) is > > inadequate. > > Agreed, but POSIX aio is adequate. > > > One needs a ring buffer, similar in API to the mmap'd > > packet socket, where you can queue a whole bunch of reads. Van's design > > seems similar to this. > > See lio_listio et.al. > > But I don't think Van's design is supposed to be exposed to user space. > It's just a better way to implement BSD sockets. Well, for DCCP it seems interesting, look at: http://www.icir.org/kohler/dccp/nsdiabstract.pdf # A Congestion-Controlled Unreliable Datagram API Junwen Lai and Eddie Kohler Describes a potential DCCP API based on a shared-memory packet ring. The API simultaneously achieves kernel-implemented congestion control, high throughput, and late data choice, where the app can change what's sent very late in the process. Shows that congestion-controlled DCCP API can improve the rate of "important" frames delivered, relative to non-congestion-controlled UDP, in some situations. - Arnaldo - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
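A sketch of the late-data-choice idea from that abstract, with an invented slot layout (a real shared-memory design would need a proper ownership handshake so the application cannot race the kernel's copy): the application may keep rewriting a posted datagram until congestion control actually claims it for transmission.

#include <stdint.h>
#include <string.h>

enum slot_state { SLOT_FREE, SLOT_READY, SLOT_SENT };

struct tx_slot {
    volatile uint32_t state;
    uint32_t          len;
    uint8_t           data[1400];
};

/* App side: publish a datagram, or refresh one that has not gone out
 * yet - e.g. replace a stale video frame with a newer one. */
static void slot_post(struct tx_slot *s, const void *buf, uint32_t len)
{
    if (s->state == SLOT_SENT)
        return;                       /* too late, already transmitted */
    memcpy(s->data, buf, len);        /* may overwrite an earlier post */
    s->len = len;
    s->state = SLOT_READY;
}

/* "Kernel" side: claim the freshest contents when congestion control
 * says a packet may go out now. */
static uint32_t slot_claim(struct tx_slot *s, void *out, uint32_t maxlen)
{
    uint32_t n;

    if (s->state != SLOT_READY)
        return 0;
    n = s->len < maxlen ? s->len : maxlen;
    memcpy(out, s->data, n);
    s->state = SLOT_SENT;
    return n;
}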
Re: Van Jacobson net channels
Andi writes: > But I don't think Van's design is supposed to be exposed to user space. > It's just a better way to implement BSD sockets. Actually, it can, indeed, go all the way to user space - connecting channels to the socket layer was one of the intermediate steps. FWIW, I did an article on this, going mostly from the slides. Here's a get-out-of-subscription-jail-free card for it: http://lwn.net/SubscriberLink/169961/776b6c53d1c1673a/ jon - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Van Jacobson net channels
Jeff Garzik wrote: Key point 1: Van's slides align closely with the design that I was already working on, for zero-copy RX. To have a fully async, zero copy network receive, POSIX read(2) is inadequate. Is there an aio_read() in POSIX adequate to the task? One needs a ring buffer, similar in API to the mmap'd packet socket, where you can queue a whole bunch of reads. Van's design seems similar to this. Key point 2: Once the kernel gets enough info to determine which channel should receive a packet, it's done. Van pushes TCP/IP receive processing into the userland app, which is quite an idea. This pushes work out of the kernel and into the app, which in turn, increases the amount of work that can be performed in parallel on multiple cpus/cores. Increased opportunity for parallelism is indeed goodness. Are you speaking strictly in the context of a single TCP connection, or for multiple TCP connections? For the latter getting out of the kernel isn't a priori a requirement. Actually, I'm not even sure it is a priori a requirement for the former? This also has the side effect of making TOE look even more like an inferior solution... Van's design is far more scalable than TOE. I'm not disagreeing, but it would be good to further define the axis for scalability. rick jones - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Van Jacobson net channels
On Wednesday 01 February 2006 20:37, Jeff Garzik wrote: > To have a fully async, zero copy network receive, POSIX read(2) is > inadequate. Agreed, but POSIX aio is adequate. > One needs a ring buffer, similar in API to the mmap'd > packet socket, where you can queue a whole bunch of reads. Van's design > seems similar to this. See lio_listio et al. But I don't think Van's design is supposed to be exposed to user space. It's just a better way to implement BSD sockets. > Key point 2: > Once the kernel gets enough info to determine which channel should > receive a packet, it's done. Van pushes TCP/IP receive processing into > the userland app, which is quite an idea. We have already done this since 2.3 (Alexey's work). The only difference in his scheme seems to be that the demultiplex to different sockets is somehow (he doesn't explain how) pushed into the driver. It's also unclear how this will simplify the drivers as the slides claim. Also I should add that the added ACK latency is a problem for a few workloads. > This pushes work out of the > kernel and into the app, It's still in the kernel, just in process context. > which in turn, increases the amount of work > that can be performed in parallel on multiple cpus/cores. Well, the current demultiplex already runs on all CPUs (assuming you have enough devices to send affinitized interrupts to each CPU - in the future with MSI-X this can hopefully be done better). > The overall > bottleneck in the kernel is reduced. What bottleneck exactly? -Andi - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
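For reference, the lio_listio() batching mentioned above looks like this - standard POSIX AIO, link with -lrt on Linux; the file and sizes are arbitrary, and whether this is an adequate substitute for a mapped ring is exactly what is being argued in the thread.

#include <aio.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

#define NREQ 4
#define BUFSZ 4096

int main(void)
{
    struct aiocb cbs[NREQ];
    struct aiocb *list[NREQ];
    static char bufs[NREQ][BUFSZ];
    int fd = open("/etc/services", O_RDONLY);
    int i;

    if (fd < 0) {
        perror("open");
        return 1;
    }
    memset(cbs, 0, sizeof(cbs));
    for (i = 0; i < NREQ; i++) {
        cbs[i].aio_fildes = fd;
        cbs[i].aio_buf = bufs[i];
        cbs[i].aio_nbytes = BUFSZ;
        cbs[i].aio_offset = (off_t)i * BUFSZ;
        cbs[i].aio_lio_opcode = LIO_READ;
        list[i] = &cbs[i];
    }
    /* Submit the whole batch with one call and wait for completion. */
    if (lio_listio(LIO_WAIT, list, NREQ, NULL) < 0) {
        perror("lio_listio");
        return 1;
    }
    for (i = 0; i < NREQ; i++)
        printf("request %d: %zd bytes\n", i, aio_return(&cbs[i]));
    return 0;
}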
Re: Van Jacobson net channels
Key point 1: Van's slides align closely with the design that I was already working on, for zero-copy RX. To have a fully async, zero copy network receive, POSIX read(2) is inadequate. One needs a ring buffer, similar in API to the mmap'd packet socket, where you can queue a whole bunch of reads. Van's design seems similar to this. Key point 2: Once the kernel gets enough info to determine which channel should receive a packet, it's done. Van pushes TCP/IP receive processing into the userland app, which is quite an idea. This pushes work out of the kernel and into the app, which in turn, increases the amount of work that can be performed in parallel on multiple cpus/cores. The overall bottleneck in the kernel is reduced. PCI MSI-X further reduces the bottleneck, after that. This also has the side effect of making TOE look even more like an inferior solution... Van's design is far more scalable than TOE. So, I am not concerned with slideware. These are two good ideas that are worth pursuing, even if Van produces zero additional output. Jeff - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
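For comparison, the existing mmap'd packet socket interface referenced above (PACKET_RX_RING) already gives a kernel-filled ring that userspace drains by flipping per-frame status words; a rough sketch, requiring CAP_NET_RAW, with arbitrary sizes and minimal error handling:

#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/mman.h>
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>

int main(void)
{
    struct tpacket_req req = {
        .tp_block_size = 4096,
        .tp_block_nr   = 64,
        .tp_frame_size = 2048,
        .tp_frame_nr   = (4096 / 2048) * 64,   /* frames/block * blocks */
    };
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    unsigned char *ring;
    unsigned int frame = 0;

    if (fd < 0 || setsockopt(fd, SOL_PACKET, PACKET_RX_RING,
                             &req, sizeof(req)) < 0) {
        perror("PACKET_RX_RING");
        return 1;
    }
    ring = mmap(NULL, (size_t)req.tp_block_size * req.tp_block_nr,
                PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (ring == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    for (;;) {
        struct tpacket_hdr *hdr =
            (void *)(ring + (size_t)frame * req.tp_frame_size);

        if (!(hdr->tp_status & TP_STATUS_USER)) {
            usleep(1000);           /* nothing yet; use poll() in real code */
            continue;
        }
        printf("frame %u: %u bytes\n", frame, hdr->tp_len);
        hdr->tp_status = TP_STATUS_KERNEL;     /* hand the slot back */
        frame = (frame + 1) % req.tp_frame_nr;
    }
}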
Re: Van Jacobson net channels
On Wednesday 01 February 2006 14:48, Leonid Grossman wrote: > David S. Miller wrote: > > > And with Van Jacobson net channels, none of this is going to > > matter and 512 is going to be your limit whether you like it > > or not. So this short term complexity gain is doubly not justified. > > > David, can you elaborate on this upcoming implementation (timeframe, > description if any, etc)? There are some slides on the web, but I failed to fully understand the concept just from them - maybe I'm just not clever enough @) http://www.lemis.com/grog/Documentation/vj/lca06vj.pdf Some details seem to be missing at least: - a description of TX (he hinted at eliminating hard_start_xmit somehow on one slide, but everything else was only on RX) - how the NIC selects the socket to connect to directly (e.g. with what does he replace the ETH/IP/TCP demux tables) - what his problem with softirqs was. They are fully parallelized in Linux and completely cache hot with their feeding interrupts, so I didn't quite follow why he wanted to eliminate them. - how exactly the drivers become simpler and what the generic functions he wants to replace driver code with do. I liked the concept of the cache-friendlier "array queues", but as far as I can see there should usually be only one handoff between different CPUs for RX, so I'm not sure they will make that much difference. And longer term this handoff can be eliminated with hardware queue support anyway, when we manage to get the NIC to send an MSI to the right CPU. Then everything will be CPU local. Hopefully there is a patch available soon to make this all clear. > I assume these net channels are per cpu? What's the relation to NAPI? > > This sounds like something we can "connect" our driver/ASIC queues to, > in order to extend the channels all the way to 10GbE PHY. I think the basic problem of that is still unsolved. The only really scalable way to do such connections seems to be RX hashing and sending MSIs to different CPUs based on that. But it is still not fully clear how to get the Linux scheduler to cooperate with that so that the socket consumers end up at the right CPUs selected by the driver hash. -Andi (who prefers sourceware over slideware) - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
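A sketch of the RX-hash steering mentioned in the last paragraph, with invented names: the driver (or NIC) hashes the 4-tuple to pick a queue whose MSI is affined to one CPU, so a given flow stays CPU-local; how to get the scheduler to keep the consuming task on that same CPU is the part the thread leaves open.

#include <stdint.h>

struct rx_queue {
    int irq;            /* MSI vector, affined to exactly one CPU */
    int cpu;
};

static uint32_t rx_hash(uint32_t saddr, uint32_t daddr,
                        uint16_t sport, uint16_t dport)
{
    /* A real NIC would use something like a Toeplitz hash; any
     * deterministic mix of the 4-tuple gives flow-sticky steering. */
    uint32_t h = saddr * 2654435761u ^ daddr * 2246822519u;
    h ^= ((uint32_t)sport << 16) | dport;
    h ^= h >> 13;
    return h;
}

/* Same flow -> same queue -> same CPU, for the life of the connection. */
static struct rx_queue *pick_queue(struct rx_queue *queues, int nqueues,
                                   uint32_t saddr, uint32_t daddr,
                                   uint16_t sport, uint16_t dport)
{
    return &queues[rx_hash(saddr, daddr, sport, dport) % nqueues];
}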