RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
> -Original Message-
> From: Patrick Geoffray [mailto:[EMAIL PROTECTED]
> Sent: Monday, August 20, 2007 1:34 PM
> To: Felix Marti
> Cc: Evgeniy Polyakov; David Miller; [EMAIL PROTECTED]; netdev@vger.kernel.org; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
>
> Felix Marti wrote:
> > Yes, the app will take the cache hits when accessing the data. However,
> > the fact remains that if there is a copy in the receive path, you
> > require an additional 3x memory BW (which is very significant at these
> > high rates and most likely the bottleneck for most current systems)...
> > and somebody always has to take the cache miss, be it the copy_to_user
> > or the app.
>
> The cache miss is going to cost you half the memory bandwidth of a full
> copy. If the data is already in cache, then the copy is cheaper.
>
> However, removing the copy removes the kernel from the picture on the
> receive side, so you lose demultiplexing, asynchronism, security,
> accounting, flow-control, swapping, etc. If it's ok with you to not use
> the kernel stack, then why expect to fit in the existing infrastructure
> anyway ?

Many of the things you're referring to are moved to the offload adapter, but from an ease-of-use point of view it would be great if the user could still collect stats the same way, i.e. netstat reports the 4-tuple in use and other network stats. In addition, security features and packet scheduling could be integrated so that the user configures them the same way as the network stack.

> > Yes, RDMA support is there... but we could make it better and easier to
>
> What do you need from the kernel for RDMA support beyond HW drivers ? A
> fast way to pin and translate user memory (ie registration). That is
> pretty much the sandbox that David referred to.
> Eventually, it would be useful to be able to track the VM space to
> implement a registration cache instead of using ugly hacks in user-space
> to hijack malloc, but this is completely independent from the net stack.
>
> > use. We have a problem today with port sharing and there was a proposal
>
> The port spaces are either totally separate and there is no issue, or
> completely identical and you should then run your connection manager in
> user-space or fix your middlewares.

When running on an iWarp device (and hence on top of TCP), I believe that the port space should be shared, i.e. netstat reports the 4-tuple in use.

> > and not for technical reasons. I believe this email thread shows in
> > detail how RDMA (a network technology) is treated as a bastard child by
> > the network folks, well at least by one of them.
>
> I don't think it's fair. This thread actually shows how pushy some RDMA
> folks are about not acknowledging that the current infrastructure is
> here for a reason, and about mistaking zero-copy and RDMA.

Zero-copy and RDMA are not the same, but in the context of this discussion I referred to RDMA as a superset (zero-copy is implied).

> This is a similar argument to the TOE discussion, and it was
> definitely a good decision to not mess up the Linux stack with TOEs.
>
> Patrick

- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
Felix Marti wrote:
> Yes, the app will take the cache hits when accessing the data. However,
> the fact remains that if there is a copy in the receive path, you
> require an additional 3x memory BW (which is very significant at these
> high rates and most likely the bottleneck for most current systems)...
> and somebody always has to take the cache miss, be it the copy_to_user
> or the app.

The cache miss is going to cost you half the memory bandwidth of a full copy. If the data is already in cache, then the copy is cheaper.

However, removing the copy removes the kernel from the picture on the receive side, so you lose demultiplexing, asynchronism, security, accounting, flow-control, swapping, etc. If it's ok with you to not use the kernel stack, then why expect to fit in the existing infrastructure anyway ?

> Yes, RDMA support is there... but we could make it better and easier to

What do you need from the kernel for RDMA support beyond HW drivers ? A fast way to pin and translate user memory (ie registration). That is pretty much the sandbox that David referred to.

Eventually, it would be useful to be able to track the VM space to implement a registration cache instead of using ugly hacks in user-space to hijack malloc, but this is completely independent from the net stack.

> use. We have a problem today with port sharing and there was a proposal

The port spaces are either totally separate and there is no issue, or completely identical and you should then run your connection manager in user-space or fix your middlewares.

> and not for technical reasons. I believe this email thread shows in
> detail how RDMA (a network technology) is treated as a bastard child by
> the network folks, well at least by one of them.

I don't think it's fair. This thread actually shows how pushy some RDMA folks are about not acknowledging that the current infrastructure is here for a reason, and about mistaking zero-copy and RDMA.
This is a similar argument to the TOE discussion, and it was definitely a good decision to not mess up the Linux stack with TOEs.

Patrick
Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
> GPUs have almost no influence on system security,

Unless you use direct rendering from user space.

-Andi
Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
* Felix Marti <[EMAIL PROTECTED]> 2007-08-20 12:02
> These graphic adapters provide a wealth of features that you can take
> advantage of to bring these amazing graphics to life. General purpose
> CPUs cannot keep up. Chelsio offload devices do the same thing in the
> realm of networking. - Will there be things you can't do, probably yes,
> but as I said, there are lots of knobs to turn (and the latest and
> greatest feature that gets hyped up might not always be the best thing
> since sliced bread anyway; what happened to BIC love? ;)

GPUs have almost no influence on system security; the network stack, OTOH, is probably the most vulnerable part of an operating system. That holds even if all vendors implemented all the features collected over the last years properly, which seems unlikely. Having such an essential and critical part depend on the vendor of my network card, without being able to even verify it properly, is truly frightening.
Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
Andi Kleen wrote:
> TSO is beneficial for the software again. The linux code currently
> takes several locks and does quite a few function calls for each
> packet and using larger packets lowers this overhead. At least with
> 10GbE saving CPU cycles is still quite important.

Some quick netperf TCP_RR tests between a pair of dual-core rx6600's running 2.6.23-rc3. The NICs are dual-core e1000's connected back-to-back with the interrupt throttle disabled. I like using TCP_RR to tickle path-length questions because it rarely runs into bandwidth limitations regardless of the link-type.

First, with TSO enabled on both sides, then with it disabled, netperf/netserver bound to the same CPU as takes interrupts, which is the "best" place to be for a TCP_RR test (although not always for a TCP_STREAM test...):

:~# netperf -T 1 -t TCP_RR -H 192.168.2.105 -I 99,1 -c -C
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.2.105 (192.168.2.105) port 0 AF_INET : +/-0.5% @ 99% conf. : first burst 0 : cpu bind
!!! WARNING
!!! Desired confidence was not achieved within the specified iterations.
!!! This implies that there was variability in the test environment that
!!! must be investigated before going further.
!!! Confidence intervals: Throughput      :  0.3%
!!!                       Local CPU util  : 39.3%
!!!                       Remote CPU util : 40.6%

Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

16384  87380  1       1      10.01   18611.32 20.96  22.35  22.522  24.017
16384  87380

:~# ethtool -K eth2 tso off
e1000: eth2: e1000_set_tso: TSO is Disabled
:~# netperf -T 1 -t TCP_RR -H 192.168.2.105 -I 99,1 -c -C
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.2.105 (192.168.2.105) port 0 AF_INET : +/-0.5% @ 99% conf. : first burst 0 : cpu bind
!!! WARNING
!!! Desired confidence was not achieved within the specified iterations.
!!! This implies that there was variability in the test environment that
!!! must be investigated before going further.
!!! Confidence intervals: Throughput      :  0.4%
!!!                       Local CPU util  : 21.0%
!!!                       Remote CPU util : 25.2%

Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

16384  87380  1       1      10.01   19812.51 17.81  17.19  17.983  17.358
16384  87380

While the confidence intervals for CPU util weren't hit, I suspect the differences in service demand were still real. On throughput we are talking about +/- 0.2%; for CPU util we are talking about +/- 20% (percent, not percentage points) in the first test and 12.5% in the second.

So, in broad handwaving terms, TSO increased the per-transaction service demand by something along the lines of (23.27 - 17.67)/17.67 or ~30%, and the transaction rate decreased by ~6%.

rick jones
bitrate blindness is a constant concern
RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Andi Kleen
> Sent: Monday, August 20, 2007 11:11 AM
> To: Felix Marti
> Cc: Evgeniy Polyakov; [EMAIL PROTECTED]; netdev@vger.kernel.org; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; David Miller
> Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
>
> "Felix Marti" <[EMAIL PROTECTED]> writes:
>
> > What I was referring to is that TSO(/LRO) have their own
> > issues, some alluded to by Roland and me. In fact, customers working on
> > the LSR couldn't use TSO due to the burstiness it introduces
>
> That was in old kernels where TSO didn't honor the initial cwnd
> correctly, right? I assume it's long fixed.
>
> If not please clarify what the problem was.

The problem is that Ethernet is about the only technology that discloses 'useable' throughput while everybody else talks about signaling rates ;) - OC-192 can carry about 9.128Gbps (or close to that number) and hence 10Gbps Ethernet was overwhelming the OC-192 network. The customer needed to schedule packets at about 98% of OC-192 throughput in order to avoid packet drop, and the scheduling needed to be done on a per-packet basis, not a per-'burst of packets' basis.

> > have a look at graphics.
> > Graphics used to be done by the host CPU and now we have dedicated
> > graphics adapters that do a much better job...
>
> Is your offload device as programmable as a modern GPU?

It has a lot of knobs to turn.

> > farfetched that offload devices can do a better job at a data-flow
> > problem?
>
> One big difference is that there is no potentially adverse and
> always varying internet between the graphics card and your monitor.

These graphic adapters provide a wealth of features that you can take advantage of to bring these amazing graphics to life. General purpose CPUs cannot keep up.
Chelsio offload devices do the same thing in the realm of networking. - Will there be things you can't do? Probably yes, but as I said, there are lots of knobs to turn (and the latest and greatest feature that gets hyped up might not always be the best thing since sliced bread anyway; what happened to BIC love? ;)

> -Andi
Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
"Felix Marti" <[EMAIL PROTECTED]> writes:

> What I was referring to is that TSO(/LRO) have their own
> issues, some alluded to by Roland and me. In fact, customers working on
> the LSR couldn't use TSO due to the burstiness it introduces

That was in old kernels where TSO didn't honor the initial cwnd correctly, right? I assume it's long fixed.

If not please clarify what the problem was.

> have a look at graphics.
> Graphics used to be done by the host CPU and now we have dedicated
> graphics adapters that do a much better job...

Is your offload device as programmable as a modern GPU?

> farfetched that offload devices can do a better job at a data-flow
> problem?

One big difference is that there is no potentially adverse and always varying internet between the graphics card and your monitor.

-Andi
RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
> -Original Message-
> From: Evgeniy Polyakov [mailto:[EMAIL PROTECTED]
> Sent: Monday, August 20, 2007 2:43 AM
> To: Felix Marti
> Cc: David Miller; [EMAIL PROTECTED]; netdev@vger.kernel.org; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
>
> On Sun, Aug 19, 2007 at 05:47:59PM -0700, Felix Marti ([EMAIL PROTECTED]) wrote:
> > [Felix Marti] David and Herbert, so you agree that the user<>kernel
> > space memory copy overhead is a significant overhead and we want to
> > enable zero-copy in both the receive and transmit path? - Yes, copy
>
> It depends. If you need to access that data after it is received, you will
> get a cache miss and performance will not be much better (if any) than
> with copy.

Yes, the app will take the cache hits when accessing the data. However, the fact remains that if there is a copy in the receive path, you require an additional 3x memory BW (which is very significant at these high rates and most likely the bottleneck for most current systems)... and somebody always has to take the cache miss, be it the copy_to_user or the app.

> > avoidance is mainly an API issue and unfortunately the so widely used
> > (synchronous) sockets API doesn't make copy avoidance easy, which is one
> > area where protocol offload can help. Yes, some apps can resort to
> > sendfile() but there are many apps which seem to have trouble switching
> > to that API... and what about the receive path?
>
> There are a number of implementations, and all they are suitable for is
> to have recvfile(), since this is likely the only case which can work
> without cache.
>
> And actually the RDMA stack exists and no one said it should be thrown away
> _until_ it messes with the main stack. It started to steal ports.
> What will
> happen when it gets all the port space and no new legal network connection
> can be opened, although there is no way to show the user who got it?
> What will happen if a hardware RDMA connection gets terminated and software
> could not free the port? Will RDMA request to export connection reset
> functions out of the stack to drop network connections which are on the
> ports which are supposed to be used by new RDMA connections?

Yes, RDMA support is there... but we could make it better and easier to use. We have a problem today with port sharing and there was a proposal to address the issue by tighter integration (see the beginning of the thread) but the proposal got shot down immediately... because it is RDMA and not for technical reasons. I believe this email thread shows in detail how RDMA (a network technology) is treated as a bastard child by the network folks, well at least by one of them.

> RDMA is not a problem, but how it influences the network stack is.
> Let's better think about how to work correctly with the network stack
> (since we already have that cr^Wdifferent hardware) instead of saying that
> others do bad work and do not allow a shiny new feature to exist.

By no means did I want to imply that others do bad work; are you referring to me using TSO implementation issues as an example? - If so, let me clarify: I understand that the TSO implementation took some time to get right. What I was referring to is that TSO(/LRO) have their own issues, some alluded to by Roland and me. In fact, customers working on the LSR couldn't use TSO due to the burstiness it introduces and had to fall back to our fine-grained packet scheduling done in the offload device. I am for variety, let us support new technologies that solve real problems (lots of folks are buying this stuff for a reason) instead of the 'ah, it's brain-dead and has no future' attitude... there is precedence for offloading the host CPUs: have a look at graphics.
Graphics used to be done by the host CPU and now we have dedicated graphics adapters that do a much better job... so, why is it so farfetched that offload devices can do a better job at a data-flow problem?

> --
> Evgeniy Polyakov
RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Andi Kleen
> Sent: Monday, August 20, 2007 4:07 AM
> To: Felix Marti
> Cc: David Miller; [EMAIL PROTECTED]; netdev@vger.kernel.org; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
>
> "Felix Marti" <[EMAIL PROTECTED]> writes:
>
> > avoidance gains of TSO and LRO are still a very worthwhile savings.
> > So, i.e. with TSO, you're saving about 16 headers (let us say 14 + 20 +
> > 20), 864B, when moving ~64KB of payload - looks like very much in the
> > noise to me.
>
> TSO is beneficial for the software again. The linux code currently
> takes several locks and does quite a few function calls for each
> packet and using larger packets lowers this overhead. At least with
> 10GbE saving CPU cycles is still quite important.
>
> > an option to get 'high performance'
>
> Shouldn't you qualify that?
>
> It is unlikely you really duplicated all the tuning for corner cases
> that went over many years into good software TCP stacks in your
> hardware. So e.g. for wide area networks with occasional packet loss
> the software might well perform better.

Yes, it used to be sufficient to submit performance data to show that a technology makes 'sense'. In fact, I believe it was Alan Cox who once said that Linux will have a look at offload once an offload device holds the land speed record (probably assuming that the day would never come ;). For the last few years it has been Chelsio offload devices that have been improving their own LSRs (as IO bus speeds have been increasing). It is worthwhile to point out that OC-192 doesn't offer full 10Gbps BW, and the fine-grained (per-packet and not per-TSO-burst) packet scheduler in the offload device played a crucial part in pushing performance to the limits of what OC-192 can do.
Most other customers use our offload products in low-latency cluster environments. - The problem with offload devices is that they are not all born equal and there have been a lot of poor implementations giving the technology a bad name. I can only speak for Chelsio and do claim that we have a solid implementation that scales from low-latency cluster environments to LFNs.

Andi, I could present performance numbers, i.e. throughput and CPU utilization as a function of IO size, number of connections, ... in a back-to-back environment and/or in a cluster environment... but what will it get me? I'd still get hit by the 'not integrated' hammer :(

> -Andi
Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
"Felix Marti" <[EMAIL PROTECTED]> writes:

> > avoidance gains of TSO and LRO are still a very worthwhile savings.
> So, i.e. with TSO, you're saving about 16 headers (let us say 14 + 20 +
> 20), 864B, when moving ~64KB of payload - looks like very much in the
> noise to me.

TSO is beneficial for the software again. The linux code currently takes several locks and does quite a few function calls for each packet and using larger packets lowers this overhead. At least with 10GbE saving CPU cycles is still quite important.

> an option to get 'high performance'

Shouldn't you qualify that?

It is unlikely you really duplicated all the tuning for corner cases that went over many years into good software TCP stacks in your hardware. So e.g. for wide area networks with occasional packet loss the software might well perform better.

-Andi
Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
On Sun, Aug 19, 2007 at 05:47:59PM -0700, Felix Marti ([EMAIL PROTECTED]) wrote:
> [Felix Marti] David and Herbert, so you agree that the user<>kernel
> space memory copy overhead is a significant overhead and we want to
> enable zero-copy in both the receive and transmit path? - Yes, copy

It depends. If you need to access that data after it is received, you will get a cache miss and performance will not be much better (if any) than with copy.

> avoidance is mainly an API issue and unfortunately the so widely used
> (synchronous) sockets API doesn't make copy avoidance easy, which is one
> area where protocol offload can help. Yes, some apps can resort to
> sendfile() but there are many apps which seem to have trouble switching
> to that API... and what about the receive path?

There are a number of implementations, and all they are suitable for is to have recvfile(), since this is likely the only case which can work without cache.

And actually the RDMA stack exists and no one said it should be thrown away _until_ it messes with the main stack. It started to steal ports. What will happen when it gets all the port space and no new legal network connection can be opened, although there is no way to show the user who got it? What will happen if a hardware RDMA connection gets terminated and software could not free the port? Will RDMA request to export connection reset functions out of the stack to drop network connections which are on the ports which are supposed to be used by new RDMA connections?

RDMA is not a problem, but how it influences the network stack is. Let's better think about how to work correctly with the network stack (since we already have that cr^Wdifferent hardware) instead of saying that others do bad work and do not allow a shiny new feature to exist.

--
Evgeniy Polyakov
RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Andi Kleen
> Sent: Sunday, August 19, 2007 4:28 PM
> To: Felix Marti
> Cc: David Miller; [EMAIL PROTECTED]; netdev@vger.kernel.org; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
>
> "Felix Marti" <[EMAIL PROTECTED]> writes:
>
> > what benefits does the TSO infrastructure give the
> > non-TSO capable devices?
>
> It improves performance on software queueing devices between guests
> and hypervisors. This is a more and more important application these
> days. Even when the system running the hypervisor has a non-TSO
> capable device in the end, it'll still save CPU cycles this way. Right
> now virtualized IO tends to be much more CPU intensive than direct IO,
> so any help it can get is beneficial.
>
> It also makes loopback faster, although that's probably not that useful.
>
> And a lot of the "TSO infrastructure" was needed for zero-copy TX
> anyway, which benefits most reasonably modern NICs (anything with
> hardware checksumming)

Hi Andi, yes, you're right. I should have chosen my example more carefully.

> -Andi
RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
> -Original Message-
> From: David Miller [mailto:[EMAIL PROTECTED]
> Sent: Sunday, August 19, 2007 6:06 PM
> To: Felix Marti
> Cc: [EMAIL PROTECTED]; netdev@vger.kernel.org; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
>
> From: "Felix Marti" <[EMAIL PROTECTED]>
> Date: Sun, 19 Aug 2007 17:47:59 -0700
>
> > [Felix Marti]
>
> Please stop using this to start your replies, thank you.

Better?

> > David and Herbert, so you agree that the user<>kernel
> > space memory copy overhead is a significant overhead and we want to
> > enable zero-copy in both the receive and transmit path? - Yes, copy
> > avoidance is mainly an API issue and unfortunately the so widely used
> > (synchronous) sockets API doesn't make copy avoidance easy, which is one
> > area where protocol offload can help. Yes, some apps can resort to
> > sendfile() but there are many apps which seem to have trouble switching
> > to that API... and what about the receive path?
>
> On the send side none of this is an issue. You either are sending
> static content, in which case using sendfile() is trivial, or you're
> generating data dynamically in which case the data copy is in the
> noise or too small to do zerocopy on, and if not you can use a shared
> mmap to generate your data into, and then sendfile out from that file,
> to avoid the copy that way.
>
> splice() helps a lot too.
>
> Splice has the capability to do away with the receive side too, and
> there are a few receivefile() implementations that could get cleaned
> up and merged in.

I don't believe it is as simple as that. Many apps synthesize their payload in user space buffers (i.e. malloced memory) and expect to receive their data in user space buffers _and_ expect the received data to have a certain alignment and to be contiguous - something not addressed by these 'new' APIs.
Look, people writing HPC apps tend to take advantage of whatever they can to squeeze some extra performance out of their apps and they are resorting to protocol offload technology for a reason, wouldn't you agree?

> Also, the I/O bus is still the more limiting factor and main memory
> bandwidth in all of this, it is the smallest data pipe for
> communications out to and from the network. So the protocol header
> avoidance gains of TSO and LRO are still a very worthwhile savings.

So, i.e. with TSO, you're saving about 16 headers (let us say 14 + 20 + 20), 864B, when moving ~64KB of payload - looks like very much in the noise to me. And again, PCI-E provides more bandwidth than the wire...

> But even if RDMA increases performance 100 fold, it still doesn't
> avoid the issue that it doesn't fit in with the rest of the networking
> stack and feature set.
>
> Any monkey can change the rules around ("ok I can make it go fast as
> long as you don't need firewalling, packet scheduling, classification,
> and you only need to talk to specific systems that speak this same
> special protocol") to make things go faster. On the other hand well
> designed solutions can give performance gains within the constraints
> of the full system design and without sacrificing functionality.

While I believe that you should give people an option to get 'high performance' _instead_ of other features and let them choose whatever they care about, I really do agree with what you're saying and believe that offload devices _should_ be integrated with the facilities that you mention (in fact, offload can do a much better job at lots of things that you mention ;) ... but you're not letting offload devices integrate and you're slowing down innovation in this field.
Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
From: "Felix Marti" <[EMAIL PROTECTED]>
Date: Sun, 19 Aug 2007 17:47:59 -0700

> [Felix Marti]

Please stop using this to start your replies, thank you.

> David and Herbert, so you agree that the user<>kernel
> space memory copy overhead is a significant overhead and we want to
> enable zero-copy in both the receive and transmit path? - Yes, copy
> avoidance is mainly an API issue and unfortunately the so widely used
> (synchronous) sockets API doesn't make copy avoidance easy, which is one
> area where protocol offload can help. Yes, some apps can resort to
> sendfile() but there are many apps which seem to have trouble switching
> to that API... and what about the receive path?

On the send side none of this is an issue. You either are sending static content, in which case using sendfile() is trivial, or you're generating data dynamically in which case the data copy is in the noise or too small to do zerocopy on, and if not you can use a shared mmap to generate your data into, and then sendfile out from that file, to avoid the copy that way.

splice() helps a lot too.

Splice has the capability to do away with the receive side too, and there are a few receivefile() implementations that could get cleaned up and merged in.

Also, the I/O bus is still the more limiting factor and main memory bandwidth in all of this; it is the smallest data pipe for communications out to and from the network. So the protocol header avoidance gains of TSO and LRO are still a very worthwhile savings.

But even if RDMA increases performance 100 fold, it still doesn't avoid the issue that it doesn't fit in with the rest of the networking stack and feature set.

Any monkey can change the rules around ("ok I can make it go fast as long as you don't need firewalling, packet scheduling, classification, and you only need to talk to specific systems that speak this same special protocol") to make things go faster.
On the other hand, well designed solutions can give performance gains
within the constraints of the full system design and without sacrificing
functionality.

- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
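The "generate into a shared mmap / static file, then sendfile() out" pattern David describes can be sketched in a few lines. The following is a minimal illustration using Python's `os.sendfile` (a thin wrapper over Linux `sendfile(2)`), with an AF_UNIX socketpair standing in for a real TCP connection; it is a sketch of the copy-avoidance idea, not a drop-in server loop:

```python
import os
import socket
import tempfile

def send_zero_copy(sock, path):
    """Transmit a file over a socket with sendfile(2), avoiding the
    user-space read()/write() copy. The kernel moves pages straight
    from the page cache to the socket; the application never touches
    the payload. Linux-specific."""
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        offset = 0
        while offset < size:
            sent = os.sendfile(sock.fileno(), f.fileno(), offset, size - offset)
            if sent == 0:
                break
            offset += sent
    return offset

# Usage sketch: write "generated" content to a file, then sendfile it out.
with tempfile.NamedTemporaryFile() as tmp:
    tmp.write(b"static content" * 100)
    tmp.flush()
    a, b = socket.socketpair()        # stand-in for a TCP connection
    sent = send_zero_copy(a, tmp.name)
    a.close()
    data = b""
    while True:
        chunk = b.recv(65536)
        if not chunk:
            break
        data += chunk
    b.close()
```

Note that the receive side (`recv` above) still copies into a user buffer; that is exactly the gap that splice()/receivefile() proposals aim at.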
RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
> -----Original Message-----
> From: David Miller [mailto:[EMAIL PROTECTED]
> Sent: Sunday, August 19, 2007 5:40 PM
> To: Felix Marti
> Cc: [EMAIL PROTECTED]; netdev@vger.kernel.org; [EMAIL PROTECTED];
> [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate
> PS_TCPportsfrom the host TCP port space.
>
> From: "Felix Marti" <[EMAIL PROTECTED]>
> Date: Sun, 19 Aug 2007 17:32:39 -0700
>
> [ Why do you put that "[Felix Marti]" everywhere you say something?
> It's annoying and superfluous. The quoting done by your mail client
> makes clear who is saying what. ]
>
> > Hmmm, interesting... I guess it is impossible to even have
> > a discussion on the subject.
>
> Nice try, Herbert Xu gave a great explanation.

[Felix Marti] David and Herbert, so you agree that the user<>kernel space
memory copy overhead is a significant overhead and we want to enable
zero-copy in both the receive and transmit path? - Yes, copy avoidance is
mainly an API issue and unfortunately the so widely used (synchronous)
sockets API doesn't make copy avoidance easy, which is one area where
protocol offload can help. Yes, some apps can resort to sendfile() but
there are many apps which seem to have trouble switching to that API...
and what about the receive path?
Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
From: "Felix Marti" <[EMAIL PROTECTED]> Date: Sun, 19 Aug 2007 17:32:39 -0700

[ Why do you put that "[Felix Marti]" everywhere you say something?
It's annoying and superfluous. The quoting done by your mail client
makes clear who is saying what. ]

> Hmmm, interesting... I guess it is impossible to even have
> a discussion on the subject.

Nice try, Herbert Xu gave a great explanation.
RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
> -----Original Message-----
> From: David Miller [mailto:[EMAIL PROTECTED]
> Sent: Sunday, August 19, 2007 4:04 PM
> To: Felix Marti
> Cc: [EMAIL PROTECTED]; netdev@vger.kernel.org; [EMAIL PROTECTED];
> [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate
> PS_TCPportsfrom the host TCP port space.
>
> From: "Felix Marti" <[EMAIL PROTECTED]>
> Date: Sun, 19 Aug 2007 12:49:05 -0700
>
> > You're not at all addressing the fact that RDMA does solve the
> > memory BW problem and stateless offload doesn't.
>
> It does, I just didn't retort to your claims because they were
> so blatantly wrong.

[Felix Marti] Hmmm, interesting... I guess it is impossible to even have
a discussion on the subject.
Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
Felix Marti <[EMAIL PROTECTED]> wrote:
>
> [Felix Marti] Aren't you confusing memory and bus BW here? - RDMA
> enables DMA from/to application buffers, removing the user-to-kernel/
> kernel-to-user memory copy, which is a significant overhead at the
> rates we're talking about: memory copy at 20Gbps (10Gbps in and 10Gbps
> out) requires 60Gbps of BW on most common platforms. So, receiving and
> transmitting at 10Gbps with LRO and TSO requires 80Gbps of system
> memory BW (which is beyond what most systems can do) whereas RDMA can
> do with 20Gbps!

Actually this is false. TSO only requires a copy if the user chooses to
use the sendmsg interface instead of sendpage.

The same is true for RDMA really. Except that instead of having to switch
your application to sendfile/splice, you're switching it to RDMA.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
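Felix's bandwidth arithmetic can be made explicit with a back-of-the-envelope model: each payload byte crosses the memory bus once for the NIC DMA, twice per kernel<->user copy (one read plus one write), and once more if the application reads the data afterwards. The accounting below is our reconstruction of which crossings his 60/80/20 Gbps figures count, not an authoritative model:

```python
def memory_bw_gbps(line_rate_gbps, directions=2, copies=1, app_touches=True):
    """Rough memory-bus traffic implied by a given network line rate.
    Per payload byte: 1x for the NIC DMA, 2x per kernel<->user copy
    (read + write), 1x if the application then touches the data.
    Purely illustrative; real platforms differ."""
    per_byte = 1 + 2 * copies + (1 if app_touches else 0)
    return line_rate_gbps * directions * per_byte

# "60Gbps for the copy path at 10Gbps each way": DMA + copy, app excluded.
assert memory_bw_gbps(10, copies=1, app_touches=False) == 60
# "80Gbps of system memory BW": add the application actually reading the data.
assert memory_bw_gbps(10, copies=1, app_touches=True) == 80
# "RDMA can do with 20Gbps": direct DMA to/from application buffers only.
assert memory_bw_gbps(10, copies=0, app_touches=False) == 20
```

Herbert's rejoinder fits the same model: sendpage/sendfile also sets `copies=0` on the transmit side, without changing the wire protocol.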
Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
From: Andi Kleen <[EMAIL PROTECTED]> Date: 20 Aug 2007 01:27:35 +0200

> "Felix Marti" <[EMAIL PROTECTED]> writes:
>
> > what benefits does the TSO infrastructure give the
> > non-TSO capable devices?
>
> It improves performance on software queueing devices between guests
> and hypervisors. This is a more and more important application these
> days. Even when the system running the hypervisor has a non-TSO-capable
> device in the end, it'll still save CPU cycles this way. Right now
> virtualized IO tends to be much more CPU intensive than direct IO, so
> any help it can get is beneficial.
>
> It also makes loopback faster, although that's probably not that
> useful.
>
> And a lot of the "TSO infrastructure" was needed for zero-copy TX
> anyway, which benefits most reasonably modern NICs (anything with
> hardware checksumming).

And also, you can enable TSO generation for a non-TSO-hw device and get
all of the segmentation overhead reduction gains, which works out as a
pure win as long as the device can at a minimum do checksumming.
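The software fallback being discussed, where the stack builds one large "super-packet" and only segments it into MSS-sized frames at the last moment, can be modeled roughly as follows. This is a simplified sketch of what the kernel's GSO path does; the function name is ours, and all header fixups (checksums, IP IDs, PSH/FIN placement) are omitted:

```python
def segment(payload: bytes, mss: int, seq: int = 0):
    """Split one large TCP payload into MSS-sized segments, the way
    software GSO does when the egress device cannot segment in
    hardware. Returns (sequence_number, chunk) pairs. The win is that
    the whole stack traversal is paid once per super-packet instead of
    once per wire-sized frame."""
    if mss <= 0:
        raise ValueError("mss must be positive")
    return [(seq + off, payload[off:off + mss])
            for off in range(0, len(payload), mss)]

# A 4000-byte write over a path with a 1448-byte MSS becomes 3 frames.
segs = segment(b"x" * 4000, mss=1448)
assert [len(p) for _, p in segs] == [1448, 1448, 1104]
assert segs[2][0] == 2896   # sequence numbers advance by the data sent
```

This is also why the feature helps guest-to-hypervisor paths: the "device" there is a software queue, and deferring segmentation amortizes per-packet costs even with no TSO hardware anywhere.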
Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
From: "Felix Marti" <[EMAIL PROTECTED]> Date: Sun, 19 Aug 2007 12:49:05 -0700

> You're not at all addressing the fact that RDMA does solve the
> memory BW problem and stateless offload doesn't.

It does, I just didn't retort to your claims because they were
so blatantly wrong.
Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
"Felix Marti" <[EMAIL PROTECTED]> writes:

> what benefits does the TSO infrastructure give the
> non-TSO capable devices?

It improves performance on software queueing devices between guests
and hypervisors. This is a more and more important application these
days. Even when the system running the hypervisor has a non-TSO-capable
device in the end, it'll still save CPU cycles this way. Right now
virtualized IO tends to be much more CPU intensive than direct IO, so any
help it can get is beneficial.

It also makes loopback faster, although that's probably not that useful.

And a lot of the "TSO infrastructure" was needed for zero-copy TX anyway,
which benefits most reasonably modern NICs (anything with hardware
checksumming).

-Andi
RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
> -----Original Message-----
> From: David Miller [mailto:[EMAIL PROTECTED]
> Sent: Sunday, August 19, 2007 12:32 PM
> To: Felix Marti
> Cc: [EMAIL PROTECTED]; netdev@vger.kernel.org; [EMAIL PROTECTED];
> [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate
> PS_TCPportsfrom the host TCP port space.
>
> From: "Felix Marti" <[EMAIL PROTECTED]>
> Date: Sun, 19 Aug 2007 10:33:31 -0700
>
> > I know that you don't agree that TSO has drawbacks, as outlined by
> > Roland, but its history shows something else: the addition of TSO
> > took a fair amount of time, network performance was erratic for
> > multiple kernel revisions, and the TSO code is sprinkled across the
> > network stack.
>
> This thing you call "sprinkled" is a necessity of any hardware
> offload when it is possible for a packet to later get "steered"
> to a device which cannot perform the offload.
>
> Therefore we need a software implementation of TSO so that those
> packets can still get output to the non-TSO-capable device.
>
> We do the same thing for checksum offloading.
>
> And for free we can use the software offloading mechanism to
> get batching to arbitrary network devices, even those which cannot
> do TSO.
>
> What benefits does RDMA infrastructure give to non-RDMA capable
> devices? None? I see, that's great.
>
> And again the TSO bugs and issues are being overstated and, also for
> the second time, these issues are more indicative of my bad
> programming skills than they are of intrinsic issues of TSO. The
> TSO implementation was looking for a good design, and it took me
> a while to find it because I personally suck.
>
> Face it, stateless offloads are always going to be better in the long
> term. And this is proven.
>
> You RDMA folks really do live in some kind of fantasy land.

[Felix Marti] You're not at all addressing the fact that RDMA does solve
the memory BW problem and stateless offload doesn't.
Apart from that, I don't quite understand your argument with respect to
the benefits of the RDMA infrastructure; what benefits does the TSO
infrastructure give the non-TSO capable devices? Isn't the answer none,
and yet you added TSO support?! I don't think that the argument is
stateless _versus_ stateful offload; both have their advantages and
disadvantages. Stateless offload does help, i.e. TSO/LRO do improve
performance in back-to-back benchmarks. It seems to me that _you_ claim
that there is no benefit to stateful offload, and that is where we're
disagreeing; there is benefit, and the much lower memory BW requirement is
just one example, yet an important one. We'll probably never agree, but it
seems to me that we're asking only for small changes to the software
stack, and then we can give the choice to the end users: they can opt for
stateless offload if it fits their performance needs, or for stateful
offload if their apps require the extra boost in performance.
Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
From: "Felix Marti" <[EMAIL PROTECTED]> Date: Sun, 19 Aug 2007 10:33:31 -0700

> I know that you don't agree that TSO has drawbacks, as outlined by
> Roland, but its history shows something else: the addition of TSO
> took a fair amount of time, network performance was erratic for
> multiple kernel revisions, and the TSO code is sprinkled across the
> network stack.

This thing you call "sprinkled" is a necessity of any hardware
offload when it is possible for a packet to later get "steered"
to a device which cannot perform the offload.

Therefore we need a software implementation of TSO so that those
packets can still get output to the non-TSO-capable device.

We do the same thing for checksum offloading.

And for free we can use the software offloading mechanism to
get batching to arbitrary network devices, even those which cannot
do TSO.

What benefits does RDMA infrastructure give to non-RDMA capable
devices? None? I see, that's great.

And again the TSO bugs and issues are being overstated and, also for
the second time, these issues are more indicative of my bad
programming skills than they are of intrinsic issues of TSO. The
TSO implementation was looking for a good design, and it took me
a while to find it because I personally suck.

Face it, stateless offloads are always going to be better in the long
term. And this is proven.

You RDMA folks really do live in some kind of fantasy land.
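The checksum-offload case works the same way: when a packet ends up steered to a device without hardware checksumming, the stack computes the RFC 1071 Internet checksum in software. A minimal sketch of that fallback computation (illustrative only, not the kernel's optimized implementation):

```python
import struct

def inet_checksum(data: bytes) -> int:
    """RFC 1071 Internet checksum: the software fallback used when a
    packet reaches a device that cannot checksum in hardware. Sums
    16-bit words in ones-complement arithmetic and returns the
    complement of the folded sum."""
    if len(data) % 2:
        data += b"\x00"                      # pad odd-length input
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:                       # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

# A receiver re-summing data plus its checksum gets 0 (i.e. all-ones
# before the final complement), which is how validity is checked.
pkt = b"\x45\x00\x00\x28\xab\xcd\x00\x00\x40\x06"   # sample header bytes
csum = inet_checksum(pkt)
assert inet_checksum(pkt + struct.pack("!H", csum)) == 0
```

Having this software path means checksum offload can be enabled per-device without any protocol-visible state, which is exactly the "stateless" property being argued for here.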
RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:general-[EMAIL PROTECTED] On Behalf Of
> David Miller
> Sent: Sunday, August 19, 2007 12:24 AM
> To: [EMAIL PROTECTED]
> Cc: netdev@vger.kernel.org; [EMAIL PROTECTED]; [EMAIL PROTECTED];
> [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate
> PS_TCPportsfrom the host TCP port space.
>
> From: "Sean Hefty" <[EMAIL PROTECTED]>
> Date: Sun, 19 Aug 2007 00:01:07 -0700
>
> > Millions of InfiniBand ports are in operation today. Over 25% of the
> > top 500 supercomputers use InfiniBand. The formation of the
> > OpenFabrics Alliance was pushed and has been continuously funded by
> > an RDMA customer - the US National Labs. RDMA technologies are backed
> > by Cisco, IBM, Intel, QLogic, Sun, Voltaire, Mellanox, NetApp, AMD,
> > Dell, HP, Oracle, Unisys, Emulex, Hitachi, NEC, Fujitsu, LSI, SGI,
> > Sandia, and at least two dozen other companies. IDC expects
> > InfiniBand adapter revenue to triple between 2006 and 2011, and
> > switch revenue to increase six-fold (combined revenues of 1 billion).
>
> Scale these numbers with reality and usage.
>
> These vendors pour huge amounts of money into a relatively small
> number of extremely large cluster installations. Besides the folks
> doing nuke and whole-earth simulations at some government lab, nobody
> cares. And part of the investment is not being done wholly for smart
> economic reasons, but also largely for publicity purposes.
>
> So present your great InfiniBand numbers with that being admitted up
> front, ok?
>
> Its relevance to Linux as a general purpose operating system that
> should be "good enough" for 99% of the world is close to NIL.
>
> People have been pouring tons of money and research into doing stupid
> things to make clusters go fast, and in such a way that makes zero
> sense for general purpose operating systems, for ages. RDMA is just
> one such example.
[Felix Marti] Ouch, and I believed Linux to be a leading edge OS, scaling
from small embedded systems to hundreds of CPUs, and hence I assumed that
the same 'scalability' applies to the network subsystem.

> BTW, I find it ironic that you mention memory bandwidth as a retort,
> as Roland's favorite stateless offload devil, TSO, deals explicitly
> with lowering the per-packet BUS bandwidth usage of TCP. LRO
> offloading does likewise.

[Felix Marti] Aren't you confusing memory and bus BW here? - RDMA enables
DMA from/to application buffers, removing the user-to-kernel/
kernel-to-user memory copy, which is a significant overhead at the rates
we're talking about: memory copy at 20Gbps (10Gbps in and 10Gbps out)
requires 60Gbps of BW on most common platforms. So, receiving and
transmitting at 10Gbps with LRO and TSO requires 80Gbps of system memory
BW (which is beyond what most systems can do) whereas RDMA can do with
20Gbps! In addition, BUS improvements are really not significant (nor are
buses the bottleneck anymore with the wide availability of PCI-E >= x8);
TSO avoids the DMA of a bunch of network headers... a typical example of
stateless offload - improving performance by a few percent while offload
technologies provide system improvements of hundreds of percent.

I know that you don't agree that TSO has drawbacks, as outlined by
Roland, but its history shows something else: the addition of TSO took a
fair amount of time, network performance was erratic for multiple kernel
revisions, and the TSO code is sprinkled across the network stack. It is
an example of an intrusive 'improvement' whereas Steve (who started this
thread) is asking for a relatively small change (decoupling the 4-tuple
allocation from the socket).
As Steve has outlined, your refusal of the change requires RDMA users to
work around the issue, which pushes the problem to the end-users and thus
slows down the acceptance of the technology, leading to a chicken-and-egg
problem: you only care if there are lots of users, but you make it hard
to use the technology in the first place. Clever ;)