RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.

2007-08-20 Thread Felix Marti


> -----Original Message-----
> From: Patrick Geoffray [mailto:[EMAIL PROTECTED]
> Sent: Monday, August 20, 2007 1:34 PM
> To: Felix Marti
> Cc: Evgeniy Polyakov; David Miller; [EMAIL PROTECTED];
> netdev@vger.kernel.org; [EMAIL PROTECTED];
> [EMAIL PROTECTED]; [EMAIL PROTECTED];
> [EMAIL PROTECTED]
> Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate
> PS_TCPportsfrom the host TCP port space.
> 
> Felix Marti wrote:
> > Yes, the app will take the cache hits when accessing the data.
> > However, the fact remains that if there is a copy in the receive
> > path, you require an additional 3x memory BW (which is very
> > significant at these high rates and most likely the bottleneck for
> > most current systems)... and somebody always has to take the cache
> > miss, be it the copy_to_user or the app.
> 
> The cache miss is going to cost you half the memory bandwidth of a
> full copy. If the data is already in cache, then the copy is cheaper.
> 
> However, removing the copy removes the kernel from the picture on the
> receive side, so you lose demultiplexing, asynchronism, security,
> accounting, flow-control, swapping, etc. If it's ok with you to not
> use the kernel stack, then why expect to fit in the existing
> infrastructure anyway ?
Many of the things you're referring to are moved to the offload adapter,
but from an ease-of-use point of view it would be great if the user
could still collect stats the same way, e.g. netstat reporting the
4-tuple in use and other network stats. In addition, security features
and packet scheduling could be integrated so that the user configures
them the same way as in the network stack.

> 
> > Yes, RDMA support is there... but we could make it better and easier
> > to
> 
> What do you need from the kernel for RDMA support beyond HW drivers ?
> A fast way to pin and translate user memory (ie registration). That is
> pretty much the sandbox that David referred to.
> 
> Eventually, it would be useful to be able to track the VM space to
> implement a registration cache instead of using ugly hacks in
> user-space to hijack malloc, but this is completely independent from
> the net stack.
> 
> > use. We have a problem today with port sharing and there was a
> > proposal
> 
> The port spaces are either totally separate and there is no issue, or
> completely identical and you should then run your connection manager
> in user-space or fix your middlewares.
When running on an iWARP device (and hence on top of TCP) I believe that
the port space should be shared, with e.g. netstat reporting the 4-tuple
in use.

> 
> > and not for technical reasons. I believe this email thread shows in
> > detail how RDMA (a network technology) is treated as a bastard child
> > by the network folks, well at least by one of them.
> 
> I don't think it's fair. This thread actually shows how pushy some RDMA
> folks are about not acknowledging that the current infrastructure is
> here for a reason, and about mistaking zero-copy for RDMA.
Zero-copy and RDMA are not the same but in the context of this
discussion I referred to RDMA as a superset (zero-copy is implied).

> 
> This is a similar argument to the TOE discussion, and it was
> definitely a good decision not to mess up the Linux stack with TOEs.
> 
> Patrick


Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.

2007-08-20 Thread Patrick Geoffray

Felix Marti wrote:

Yes, the app will take the cache hits when accessing the data. However,
the fact remains that if there is a copy in the receive path, you
require an additional 3x memory BW (which is very significant at these
high rates and most likely the bottleneck for most current systems)...
and somebody always has to take the cache miss, be it the copy_to_user
or the app.


The cache miss is going to cost you half the memory bandwidth of a full 
copy. If the data is already in cache, then the copy is cheaper.


However, removing the copy removes the kernel from the picture on the 
receive side, so you lose demultiplexing, asynchronism, security, 
accounting, flow-control, swapping, etc. If it's ok with you to not use 
the kernel stack, then why expect to fit in the existing infrastructure 
anyway ?



Yes, RDMA support is there... but we could make it better and easier to


What do you need from the kernel for RDMA support beyond HW drivers ? A 
fast way to pin and translate user memory (ie registration). That is 
pretty much the sandbox that David referred to.


Eventually, it would be useful to be able to track the VM space to 
implement a registration cache instead of using ugly hacks in user-space 
to hijack malloc, but this is completely independent from the net stack.
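For readers who have not seen the registration step Patrick refers to, a
minimal user-space sketch using the libibverbs API follows; the
protection-domain setup and all error handling are omitted, and this is
only an illustration, not code proposed anywhere in this thread:

#include <stdlib.h>
#include <infiniband/verbs.h>

/* Register (pin + translate) a buffer so the adapter can DMA to/from it. */
struct ibv_mr *register_buf(struct ibv_pd *pd, size_t len)
{
    void *buf = malloc(len);

    if (!buf)
        return NULL;

    /* ibv_reg_mr() pins the pages and sets up the HW address translation */
    return ibv_reg_mr(pd, buf, len,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_READ |
                      IBV_ACCESS_REMOTE_WRITE);
}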



use. We have a problem today with port sharing and there was a proposal


The port spaces are either totally separate and there is no issue, or 
completely identical and you should then run your connection manager in 
user-space or fix your middlewares.



and not for technical reasons. I believe this email thread shows in
detail how RDMA (a network technology) is treated as a bastard child by
the network folks, well at least by one of them.


I don't think it's fair. This thread actually shows how pushy some RDMA 
folks are about not acknowledging that the current infrastructure is 
here for a reason, and about mistaking zero-copy for RDMA.


This is a similar argument to the TOE discussion, and it was 
definitely a good decision not to mess up the Linux stack with TOEs.


Patrick


Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.

2007-08-20 Thread Andi Kleen
> GPUs have almost no influence on system security, 

Unless you use direct rendering from user space.

-Andi


Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.

2007-08-20 Thread Thomas Graf
* Felix Marti <[EMAIL PROTECTED]> 2007-08-20 12:02
> These graphic adapters provide a wealth of features that you can take
> advantage of to bring these amazing graphics to life. General purpose
> CPUs cannot keep up. Chelsio offload devices do the same thing in the
> realm of networking. Will there be things you can't do? Probably yes,
> but as I said, there are lots of knobs to turn (and the latest and
> greatest feature that gets hyped up might not always be the best thing
> since sliced bread anyway; what happened to BIC love? ;)

GPUs have almost no influence on system security; the network stack OTOH
is probably the most vulnerable part of an operating system. Even if all
vendors implemented all the features collected over the last years
properly, which seems unlikely, having such an essential and critical
part depend on the vendor of my network card without being able to even
verify it properly is truly frightening.


Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.

2007-08-20 Thread Rick Jones

Andi Kleen wrote:

TSO is beneficial for the software again. The linux code currently
takes several locks and does quite a few function calls for each
packet and using larger packets lowers this overhead. At least with
10GbE saving CPU cycles is still quite important.


Some quick netperf TCP_RR tests between a pair of dual-core rx6600's running 
2.6.23-rc3.  The NICs are dual-port e1000's connected back-to-back with the 
interrupt throttle disabled.  I like using TCP_RR to tickle path-length 
questions because it rarely runs into bandwidth limitations regardless of the 
link-type.


First, with TSO enabled on both sides, then with it disabled, with 
netperf/netserver bound to the same CPU that takes interrupts, which is the 
"best" place to be for a TCP_RR test (although not always for a TCP_STREAM 
test...):


:~# netperf -T 1 -t TCP_RR -H 192.168.2.105 -I 99,1 -c -C
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.2.105 
(192.168.2.105) port 0 AF_INET : +/-0.5% @ 99% conf.  : first burst 0 : cpu bind

!!! WARNING
!!! Desired confidence was not achieved within the specified iterations.
!!! This implies that there was variability in the test environment that
!!! must be investigated before going further.
!!! Confidence intervals: Throughput      :  0.3%
!!!                       Local CPU util  : 39.3%
!!!                       Remote CPU util : 40.6%

Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size     Size    Time     Rate     local  remote local   remote
bytes  bytes  bytes    bytes   secs.    per sec  % S    % S    us/Tr   us/Tr

16384  87380  1        1       10.01    18611.32 20.96  22.35  22.522  24.017
16384  87380
:~# ethtool -K eth2 tso off
e1000: eth2: e1000_set_tso: TSO is Disabled
:~# netperf -T 1 -t TCP_RR -H 192.168.2.105 -I 99,1 -c -C
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.2.105 
(192.168.2.105) port 0 AF_INET : +/-0.5% @ 99% conf.  : first burst 0 : cpu bind

!!! WARNING
!!! Desired confidence was not achieved within the specified iterations.
!!! This implies that there was variability in the test environment that
!!! must be investigated before going further.
!!! Confidence intervals: Throughput      :  0.4%
!!!                       Local CPU util  : 21.0%
!!!                       Remote CPU util : 25.2%

Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size     Size    Time     Rate     local  remote local   remote
bytes  bytes  bytes    bytes   secs.    per sec  % S    % S    us/Tr   us/Tr

16384  87380  1        1       10.01    19812.51 17.81  17.19  17.983  17.358
16384  87380

While the confidence intervals for CPU util weren't hit, I suspect the 
differences in service demand were still real.  On throughput we are talking 
about +/- 0.2%, for CPU util we are talking about +/- 20% (percent not 
percentage points) in the first test and 12.5% in the second.


So, in broad handwaving terms, TSO increased the per-transaction service demand 
by something along the lines of (23.27 - 17.67)/17.67 or ~30% and the 
transaction rate decreased by ~6%.
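For what it's worth, those percentages fall straight out of the two
result blocks above; my reading of the arithmetic (averaging the local
and remote service demand columns) is:

#include <stdio.h>

int main(void)
{
    double sd_tso = (22.522 + 24.017) / 2.0;  /* us/Tr with TSO on  (~23.27) */
    double sd_off = (17.983 + 17.358) / 2.0;  /* us/Tr with TSO off (~17.67) */
    double tr_tso = 18611.32, tr_off = 19812.51;  /* transactions per second */

    printf("service demand delta:   %+.1f%%\n",
           100.0 * (sd_tso - sd_off) / sd_off);       /* roughly +30% */
    printf("transaction rate delta: %+.1f%%\n",
           100.0 * (tr_tso - tr_off) / tr_off);       /* roughly -6%  */
    return 0;
}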


rick jones
bitrate blindness is a constant concern


RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.

2007-08-20 Thread Felix Marti


> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Andi Kleen
> Sent: Monday, August 20, 2007 11:11 AM
> To: Felix Marti
> Cc: Evgeniy Polyakov; [EMAIL PROTECTED]; netdev@vger.kernel.org;
> [EMAIL PROTECTED]; [EMAIL PROTECTED];
> [EMAIL PROTECTED]; David Miller
> Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate
> PS_TCPportsfrom the host TCP port space.
> 
> "Felix Marti" <[EMAIL PROTECTED]> writes:
> 
> > What I was referring to is that TSO(/LRO) have their own
> > issues, some eluded to by Roland and me. In fact, customers working
> on
> > the LSR couldn't use TSO due to the burstiness it introduces
> 
> That was in old kernels where TSO didn't honor the initial cwnd
> correctly,
> right? I assume it's long fixed.
> 
> If not please clarify what the problem was.
The problem is that Ethernet is about the only technology that
discloses 'usable' throughput while everybody else talks about
signaling rates ;) - OC-192 can carry about 9.128Gbps (or close to that
number) and hence 10Gbps Ethernet was overwhelming the OC-192 network.
The customer needed to schedule packets at about 98% of OC-192
throughput in order to avoid packet drop, and the scheduling needed to
be done on a per-packet basis and not on a per-'burst of packets' basis.
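Just to put a number on what per-packet scheduling at that rate implies
(the 1500-byte frame size below is an assumption of mine, not something
stated in the thread):

#include <stdio.h>

int main(void)
{
    double oc192_gbps = 9.128;              /* usable OC-192 payload rate  */
    double target     = 0.98 * oc192_gbps;  /* ~98% of it, to avoid drops  */
    double frame_bits = 1500.0 * 8.0;       /* assumed 1500-byte frames    */
    double pps        = target * 1e9 / frame_bits;

    /* A per-packet scheduler releases one frame every ~1.3 us; a TSO     */
    /* burst would emit dozens of frames back-to-back instead.            */
    printf("~%.0f packets/s, one every ~%.0f ns\n", pps, 1e9 / pps);
    return 0;
}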
 
> 
> > have a look at graphics.
> > Graphics used to be done by the host CPU and now we have dedicated
> > graphics adapters that do a much better job...
> 
> Is your offload device as programmable as a modern GPU?
It has a lot of knobs to turn.

> 
> > farfetched that offload devices can do a better job at a data-flow
> > problem?
> 
> One big difference is that there is no potentially adverse and
> always varying internet between the graphics card and your monitor.
These graphic adapters provide a wealth of features that you can take
advantage of to bring these amazing graphics to life. General purpose
CPUs cannot keep up. Chelsio offload devices do the same thing in the
realm of networking. Will there be things you can't do? Probably yes,
but as I said, there are lots of knobs to turn (and the latest and
greatest feature that gets hyped up might not always be the best thing
since sliced bread anyway; what happened to BIC love? ;)

> 
> -Andi


Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.

2007-08-20 Thread Andi Kleen
"Felix Marti" <[EMAIL PROTECTED]> writes:

> What I was referring to is that TSO(/LRO) have their own
> issues, some alluded to by Roland and me. In fact, customers working on
> the LSR couldn't use TSO due to the burstiness it introduces

That was in old kernels where TSO didn't honor the initial cwnd correctly, 
right? I assume it's long fixed.

If not please clarify what the problem was.

> have a look at graphics.
> Graphics used to be done by the host CPU and now we have dedicated
> graphics adapters that do a much better job...

Is your offload device as programmable as a modern GPU?

> farfetched that offload devices can do a better job at a data-flow
> problem?

One big difference is that there is no potentially adverse and
always varying internet between the graphics card and your monitor.

-Andi


RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.

2007-08-20 Thread Felix Marti


> -----Original Message-----
> From: Evgeniy Polyakov [mailto:[EMAIL PROTECTED]
> Sent: Monday, August 20, 2007 2:43 AM
> To: Felix Marti
> Cc: David Miller; [EMAIL PROTECTED]; netdev@vger.kernel.org;
> [EMAIL PROTECTED]; [EMAIL PROTECTED]; linux-
> [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate
> PS_TCPportsfrom the host TCP port space.
> 
> On Sun, Aug 19, 2007 at 05:47:59PM -0700, Felix Marti
> ([EMAIL PROTECTED]) wrote:
> > [Felix Marti] David and Herbert, so you agree that the user<>kernel
> > space memory copy overhead is a significant overhead and we want to
> > enable zero-copy in both the receive and transmit path? - Yes, copy
> 
> It depends. If you need to access that data after it is received, you
> will get a cache miss and performance will not be much better (if any)
> than with a copy.
Yes, the app will take the cache hits when accessing the data. However,
the fact remains that if there is a copy in the receive path, you
require an additional 3x memory BW (which is very significant at these
high rates and most likely the bottleneck for most current systems)...
and somebody always has to take the cache miss, be it the copy_to_user
or the app.
> 
> > avoidance is mainly an API issue and unfortunately the so widely used
> > (synchronous) sockets API doesn't make copy avoidance easy, which is
> > one area where protocol offload can help. Yes, some apps can resort to
> > sendfile() but there are many apps which seem to have trouble
> > switching to that API... and what about the receive path?
> 
> There are a number of implementations, and all they are suitable for is
> to have recvfile(), since this is likely the only case which can work
> without cache.
> 
> And actually the RDMA stack exists and no one said it should be thrown
> away _until_ it messes with the main stack. It started to steal ports.
> What will happen when it gets all the port space and no new legal
> network connection can be opened, although there is no way to show the
> user who got it? What will happen if a hardware RDMA connection gets
> terminated and software could not free the port? Will RDMA request that
> connection reset functions be exported out of the stack to drop network
> connections which are on the ports which are supposed to be used by new
> RDMA connections?
Yes, RDMA support is there... but we could make it better and easier to
use. We have a problem today with port sharing and there was a proposal
to address the issue by tighter integration (see the beginning of the
thread) but the proposal got shot down immediately... because it is RDMA
and not for technical reasons. I believe this email thread shows in
detail how RDMA (a network technology) is treated as a bastard child by
the network folks, well at least by one of them.
> 
> RDMA is not a problem, but how it influences the network stack is.
> Let's better think about how to work correctly with the network stack
> (since we already have that cr^Wdifferent hardware) instead of saying
> that others do bad work and do not allow the shiny new feature to
> exist.
By no means did I want to imply that others do bad work; are you
referring to me using TSO implementation issues as an example? - If so,
let me clarify: I understand that the TSO implementation took some time
to get right. What I was referring to is that TSO(/LRO) have their own
issues, some alluded to by Roland and me. In fact, customers working on
the LSR couldn't use TSO due to the burstiness it introduces and had to
fall back to our fine-grained packet scheduling done in the offload
device. I am for variety, let us support new technologies that solve
real problems (lots of folks are buying this stuff for a reason) instead
of the 'ah, it's brain-dead and has no future' attitude... there is
precedent for offloading the host CPUs: have a look at graphics.
Graphics used to be done by the host CPU and now we have dedicated
graphics adapters that do a much better job... so, why is it so
farfetched that offload devices can do a better job at a data-flow
problem?
> 
> --
>   Evgeniy Polyakov


RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.

2007-08-20 Thread Felix Marti


> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Andi Kleen
> Sent: Monday, August 20, 2007 4:07 AM
> To: Felix Marti
> Cc: David Miller; [EMAIL PROTECTED]; netdev@vger.kernel.org;
> [EMAIL PROTECTED]; [EMAIL PROTECTED]; linux-
> [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate
> PS_TCPportsfrom the host TCP port space.
> 
> "Felix Marti" <[EMAIL PROTECTED]> writes:
> > > avoidance gains of TSO and LRO are still a very worthwhile
savings.
> > So, i.e. with TSO, your saving about 16 headers (let us say 14 + 20
+
> > 20), 864B, when moving ~64KB of payload - looks like very much in
the
> > noise to me.
> 
> TSO is beneficial for the software again. The linux code currently
> takes several locks and does quite a few function calls for each
> packet and using larger packets lowers this overhead. At least with
> 10GbE saving CPU cycles is still quite important.
> 
> > an option to get 'high performance'
> 
> Shouldn't you qualify that?
> 
> It is unlikely you really duplicated all the tuning for corner cases
> that went over many years into good software TCP stacks in your
> hardware.  So e.g. for wide area networks with occasional packet loss
> the software might well perform better.
Yes, it used to be sufficient to submit performance data to show that a
technology makes 'sense'. In fact, I believe it was Alan Cox who once
said that linux will have a look at offload once an offload device holds
the land speed record (probably assuming that the day never comes ;).
For the last few years it has been Chelsio offload devices that have
been improving their own LSRs (as IO bus speeds have been increasing).
It is worthwhile to point out that OC-192 doesn't offer full 10Gbps BW
and the fine-grained (per packet and not per TSO-burst) packet scheduler
in the offload device played a crucial part in pushing performance to
the limits of what OC-192 can do. Most other customers use our offload
products in low-latency cluster environments. - The problem with offload
devices is that they are not all born equal and there have been a lot of
poor implementations giving the technology a bad name. I can only speak
for Chelsio and do claim that we have a solid implementation that scales
from low-latency cluster environments to LFNs.

Andi, I could present performance numbers, i.e. throughput and CPU
utilization as a function of IO size, number of connections, ... in a
back-to-back environment and/or in a cluster environment... but what
will it get me? I'd still get hit by the 'not integrated' hammer :(

> 
> -Andi


Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.

2007-08-20 Thread Andi Kleen
"Felix Marti" <[EMAIL PROTECTED]> writes:
> > avoidance gains of TSO and LRO are still a very worthwhile savings.
> So, i.e. with TSO, you're saving about 16 headers (let us say 14 + 20 +
> 20), 864B, when moving ~64KB of payload - looks like very much in the
> noise to me.

TSO is beneficial for the software again. The linux code currently
takes several locks and does quite a few function calls for each 
packet and using larger packets lowers this overhead. At least with
10GbE saving CPU cycles is still quite important.
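A toy illustration of that amortization (the MSS value below is an
assumption of mine): with TSO the stack's per-packet path is traversed
once per large segment instead of once per wire packet.

#include <stdio.h>

int main(void)
{
    unsigned payload = 64 * 1024;  /* one large TSO segment handed down   */
    unsigned mss     = 1448;       /* assumed MSS (1500 MTU + timestamps) */
    unsigned pkts    = (payload + mss - 1) / mss;

    /* Locks, qdisc and driver entry are paid once instead of ~46 times.  */
    printf("stack traversals: 1 with TSO, %u without\n", pkts);
    return 0;
}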

> an option to get 'high performance' 

Shouldn't you qualify that?

It is unlikely you really duplicated all the tuning for corner cases
that went over many years into good software TCP stacks in your
hardware.  So e.g. for wide area networks with occasional packet loss
the software might well perform better.

-Andi


Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.

2007-08-20 Thread Evgeniy Polyakov
On Sun, Aug 19, 2007 at 05:47:59PM -0700, Felix Marti ([EMAIL PROTECTED]) wrote:
> [Felix Marti] David and Herbert, so you agree that the user<>kernel
> space memory copy overhead is a significant overhead and we want to
> enable zero-copy in both the receive and transmit path? - Yes, copy

It depends. If you need to access that data after it is received, you
will get a cache miss and performance will not be much better (if any)
than with a copy.

> avoidance is mainly an API issue and unfortunately the so widely used
> (synchronous) sockets API doesn't make copy avoidance easy, which is one
> area where protocol offload can help. Yes, some apps can resort to
> sendfile() but there are many apps which seem to have trouble switching
> to that API... and what about the receive path?

There are a number of implementations, and all they are suitable for is 
to have recvfile(), since this is likely the only case which can work 
without cache.

And actually the RDMA stack exists and no one said it should be thrown
away _until_ it messes with the main stack. It started to steal ports.
What will happen when it gets all the port space and no new legal
network connection can be opened, although there is no way to show the
user who got it? What will happen if a hardware RDMA connection gets
terminated and software could not free the port? Will RDMA request that
connection reset functions be exported out of the stack to drop network
connections which are on the ports which are supposed to be used by new
RDMA connections?

RDMA is not a problem, but how it influences the network stack is.
Let's better think about how to work correctly with the network stack
(since we already have that cr^Wdifferent hardware) instead of saying
that others do bad work and do not allow the shiny new feature to exist.

-- 
Evgeniy Polyakov


RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.

2007-08-19 Thread Felix Marti


> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Andi Kleen
> Sent: Sunday, August 19, 2007 4:28 PM
> To: Felix Marti
> Cc: David Miller; [EMAIL PROTECTED]; netdev@vger.kernel.org;
> [EMAIL PROTECTED]; [EMAIL PROTECTED];
> [EMAIL PROTECTED]
> Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate
> PS_TCPportsfrom the host TCP port space.
> 
> "Felix Marti" <[EMAIL PROTECTED]> writes:
> 
> > what benefits does the TSO infrastructure give the
> > non-TSO capable devices?
> 
> It improves performance on software queueing devices between guests
> and hypervisors. This is a more and more important application these
> days.  Even when the system running the Hypervisor has a non TSO
> capable device in the end it'll still save CPU cycles this way. Right
> now virtualized IO tends to be much more CPU intensive than direct IO
> so any help it can get is beneficial.
> 
> It also makes loopback faster, although given that's probably not that
> useful.
> 
> And a lot of the "TSO infrastructure" was needed for zero copy TX
> anyways, which benefits most reasonable modern NICs (anything with
> hardware checksumming)
Hi Andi, yes, you're right. I should have chosen my example more
carefully.

> 
> -Andi


RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.

2007-08-19 Thread Felix Marti


> -----Original Message-----
> From: David Miller [mailto:[EMAIL PROTECTED]
> Sent: Sunday, August 19, 2007 6:06 PM
> To: Felix Marti
> Cc: [EMAIL PROTECTED]; netdev@vger.kernel.org; [EMAIL PROTECTED];
> [EMAIL PROTECTED]; [EMAIL PROTECTED];
> [EMAIL PROTECTED]
> Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate
> PS_TCPportsfrom the host TCP port space.
> 
> From: "Felix Marti" <[EMAIL PROTECTED]>
> Date: Sun, 19 Aug 2007 17:47:59 -0700
> 
> > [Felix Marti]
> 
> Please stop using this to start your replies, thank you.
Better?

> 
> > David and Herbert, so you agree that the user<>kernel
> > space memory copy overhead is a significant overhead and we want to
> > enable zero-copy in both the receive and transmit path? - Yes, copy
> > avoidance is mainly an API issue and unfortunately the so widely used
> > (synchronous) sockets API doesn't make copy avoidance easy, which is
> > one area where protocol offload can help. Yes, some apps can resort to
> > sendfile() but there are many apps which seem to have trouble
> > switching to that API... and what about the receive path?
> 
> On the send side none of this is an issue.  You either are sending
> static content, in which case using sendfile() is trivial, or you're
> generating data dynamically in which case the data copy is in the
> noise or too small to do zerocopy on and if not you can use a shared
> mmap to generate your data into, and then sendfile out from that file,
> to avoid the copy that way.
> 
> splice() helps a lot too.
> 
> Splice has the capability to do away with the receive side too, and
> there are a few receivefile() implementations that could get cleaned
> up and merged in.
I don't believe it is as simple as that. Many apps synthesize their
payload in user space buffers (i.e. malloced memory) and expect to
receive their data in user space buffers _and_ expect the received data
to have a certain alignment and to be contiguous - something not
addressed by these 'new' APIs. Look, people writing HPC apps tend to
take advantage of whatever they can to squeeze some extra performance
out of their apps and they are resorting to protocol offload technology
for a reason, wouldn't you agree? 

> 
> Also, the I/O bus is still the more limiting factor and main memory
> bandwidth in all of this, it is the smallest data pipe for
> communications out to and from the network.  So the protocol header
> avoidance gains of TSO and LRO are still a very worthwhile savings.
So, i.e. with TSO, you're saving about 16 headers (let us say 14 + 20 +
20), 864B, when moving ~64KB of payload - looks like very much in the
noise to me. And again, PCI-E provides more bandwidth than the wire...

> 
> But even if RDMA increases performance 100 fold, it still doesn't
> avoid the issue that it doesn't fit in with the rest of the networking
> stack and feature set.
> 
> Any monkey can change the rules around ("ok I can make it go fast as
> long as you don't need firewalling, packet scheduling, classification,
> and you only need to talk to specific systems that speak this same
> special protocol") to make things go faster.  On the other hand well
> designed solutions can give performance gains within the constraints
> of the full system design and without sacrificing functionality.
While I believe that you should give people an option to get 'high
performance' _instead_ of other features and let them choose whatever
they care about, I really do agree with what you're saying and believe
that offload devices _should_ be integrated with the facilities that you
mention (in fact, offload can do a much better job at lots of things
that you mention ;) ... but you're not letting offload devices integrate
and you're slowing down innovation in this field.



Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.

2007-08-19 Thread David Miller
From: "Felix Marti" <[EMAIL PROTECTED]>
Date: Sun, 19 Aug 2007 17:47:59 -0700

> [Felix Marti]

Please stop using this to start your replies, thank you.

> David and Herbert, so you agree that the user<>kernel
> space memory copy overhead is a significant overhead and we want to
> enable zero-copy in both the receive and transmit path? - Yes, copy
> avoidance is mainly an API issue and unfortunately the so widely used
> (synchronous) sockets API doesn't make copy avoidance easy, which is one
> area where protocol offload can help. Yes, some apps can resort to
> sendfile() but there are many apps which seem to have trouble switching
> to that API... and what about the receive path?

On the send side none of this is an issue.  You either are sending
static content, in which case using sendfile() is trivial, or you're
generating data dynamically in which case the data copy is in the
noise or too small to do zerocopy on and if not you can use a shared
mmap to generate your data into, and then sendfile out from that file,
to avoid the copy that way.

splice() helps a lot too.

Splice has the capability to do away with the receive side too, and
there are a few receivefile() implementations that could get cleaned
up and merged in.
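A rough sketch of the shared-mmap-plus-sendfile() pattern described
above; the temp file name, payload generation and error handling are
placeholders of mine, not anything from this mail:

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/sendfile.h>
#include <sys/types.h>
#include <unistd.h>

/* Generate payload into a shared mapping, then sendfile() it out so the
 * data never takes a user->kernel copy on the transmit path. */
int send_generated(int sock, size_t len)
{
    int fd = open("/tmp/gen.bin", O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd < 0 || ftruncate(fd, len) < 0)
        return -1;

    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 'x', len);                /* "generate" the data in place     */

    off_t off = 0;
    ssize_t n = sendfile(sock, fd, &off, len);  /* pages go out by reference */

    munmap(p, len);
    close(fd);
    return n < 0 ? -1 : 0;
}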

Also, the I/O bus is still the more limiting factor and main memory
bandwidth in all of this, it is the smallest data pipe for
communications out to and from the network.  So the protocol header
avoidance gains of TSO and LRO are still a very worthwhile savings.

But even if RDMA increases performance 100 fold, it still doesn't
avoid the issue that it doesn't fit in with the rest of the networking
stack and feature set.

Any monkey can change the rules around ("ok I can make it go fast as
long as you don't need firewalling, packet scheduling, classification,
and you only need to talk to specific systems that speak this same
special protocol") to make things go faster.  On the other hand well
designed solutions can give performance gains within the constraints
of the full system design and without sacrificing functionality.


RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.

2007-08-19 Thread Felix Marti


> -----Original Message-----
> From: David Miller [mailto:[EMAIL PROTECTED]
> Sent: Sunday, August 19, 2007 5:40 PM
> To: Felix Marti
> Cc: [EMAIL PROTECTED]; netdev@vger.kernel.org; [EMAIL PROTECTED];
> [EMAIL PROTECTED]; [EMAIL PROTECTED];
> [EMAIL PROTECTED]
> Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate
> PS_TCPportsfrom the host TCP port space.
> 
> From: "Felix Marti" <[EMAIL PROTECTED]>
> Date: Sun, 19 Aug 2007 17:32:39 -0700
> 
> [ Why do you put that "[Felix Marti]" everywhere you say something?
>   It's annoying and superfluous. The quoting done by your mail client
>   makes clear who is saying what. ]
> 
> > Hmmm, interesting... I guess it is impossible to even have
> > a discussion on the subject.
> 
> Nice try, Herbert Xu gave a great explanation.
[Felix Marti] David and Herbert, so you agree that the user<>kernel
space memory copy overhead is a significant overhead and we want to
enable zero-copy in both the receive and transmit path? - Yes, copy
avoidance is mainly an API issue and unfortunately the so widely used
(synchronous) sockets API doesn't make copy avoidance easy, which is one
area where protocol offload can help. Yes, some apps can resort to
sendfile() but there are many apps which seem to have trouble switching
to that API... and what about the receive path?


Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.

2007-08-19 Thread David Miller
From: "Felix Marti" <[EMAIL PROTECTED]>
Date: Sun, 19 Aug 2007 17:32:39 -0700

[ Why do you put that "[Felix Marti]" everywhere you say something?
  It's annoying and superfluous. The quoting done by your mail client
  makes clear who is saying what. ]

> Hmmm, interesting... I guess it is impossible to even have
> a discussion on the subject.

Nice try, Herbert Xu gave a great explanation.


RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.

2007-08-19 Thread Felix Marti


> -----Original Message-----
> From: David Miller [mailto:[EMAIL PROTECTED]
> Sent: Sunday, August 19, 2007 4:04 PM
> To: Felix Marti
> Cc: [EMAIL PROTECTED]; netdev@vger.kernel.org; [EMAIL PROTECTED];
> [EMAIL PROTECTED]; [EMAIL PROTECTED];
> [EMAIL PROTECTED]
> Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate
> PS_TCPportsfrom the host TCP port space.
> 
> From: "Felix Marti" <[EMAIL PROTECTED]>
> Date: Sun, 19 Aug 2007 12:49:05 -0700
> 
> > You're not at all addressing the fact that RDMA does solve the
> > memory BW problem and stateless offload doesn't.
> 
> It does, I just didn't retort to your claims because they were
> so blatantly wrong.
[Felix Marti] Hmmm, interesting... I guess it is impossible to even have
a discussion on the subject.


Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.

2007-08-19 Thread Herbert Xu
Felix Marti <[EMAIL PROTECTED]> wrote:
>
> [Felix Marti] Aren't you confusing memory and bus BW here? - RDMA 
> enables DMA from/to application buffers removing the user-to-kernel/
> kernel-to-user memory copy which is a significant overhead at the 
> rates we're talking about: memory copy at 20Gbps (10Gbps in and 10Gbps 
> out) requires 60Gbps of BW on most common platforms. So, receiving and
> transmitting at 10Gbps with LRO and TSO requires 80Gbps of system 
> memory BW (which is beyond what most systems can do) whereas RDMA can 
> do with 20Gbps!

Actually this is false.  TSO only requires a copy if the user
chooses to use the sendmsg interface instead of sendpage.  The
same is true for RDMA really.  Except that instead of having to
switch your application to sendfile/splice, you're switching it
to RDMA.
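For illustration only, a minimal user-space sketch of the two transmit
paths being contrasted here; socket and file setup, and all error
handling, are assumed to happen elsewhere:

#include <sys/sendfile.h>
#include <sys/types.h>
#include <unistd.h>

/* sendmsg/write path: the payload is copied from the user buffer into
 * kernel socket buffers before TSO ever sees it. */
void tx_with_copy(int sock, int fd, char *buf, size_t len)
{
    ssize_t got = read(fd, buf, len);   /* copy no. 1: file -> user buffer */
    if (got > 0)
        write(sock, buf, got);          /* copy no. 2: user buffer -> skb  */
}

/* sendpage path: sendfile() hands page references to the stack, so the
 * payload is never copied through user space. */
void tx_zero_copy(int sock, int fd, size_t len)
{
    off_t off = 0;
    sendfile(sock, fd, &off, len);
}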

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.

2007-08-19 Thread David Miller
From: Andi Kleen <[EMAIL PROTECTED]>
Date: 20 Aug 2007 01:27:35 +0200

> "Felix Marti" <[EMAIL PROTECTED]> writes:
> 
> > what benefits does the TSO infrastructure give the
> > non-TSO capable devices?
> 
> It improves performance on software queueing devices between guests
> and hypervisors. This is a more and more important application these
> days.  Even when the system running the Hypervisor has a non TSO
> capable device in the end it'll still save CPU cycles this way. Right now
> virtualized IO tends to be much more CPU intensive than direct IO so any
> help it can get is beneficial.
> 
> It also makes loopback faster, although given that's probably not that
> useful.
> 
> And a lot of the "TSO infrastructure" was needed for zero copy TX anyways,
> which benefits most reasonable modern NICs (anything with hardware 
> checksumming)

And also, you can enable TSO generation for a non-TSO-hw device and
get all of the segmentation overhead reduction gains which works out
as a pure win as long as the device can at a minimum do checksumming.


Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.

2007-08-19 Thread David Miller
From: "Felix Marti" <[EMAIL PROTECTED]>
Date: Sun, 19 Aug 2007 12:49:05 -0700

> You're not at all addressing the fact that RDMA does solve the
> memory BW problem and stateless offload doesn't.

It does, I just didn't retort to your claims because they were
so blatantly wrong.


Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.

2007-08-19 Thread Andi Kleen
"Felix Marti" <[EMAIL PROTECTED]> writes:

> what benefits does the TSO infrastructure give the
> non-TSO capable devices?

It improves performance on software queueing devices between guests
and hypervisors. This is a more and more important application these
days.  Even when the system running the Hypervisor has a non TSO
capable device in the end it'll still save CPU cycles this way. Right now
virtualized IO tends to be much more CPU intensive than direct IO so any
help it can get is beneficial.

It also makes loopback faster, although given that's probably not that
useful.

And a lot of the "TSO infrastructure" was needed for zero copy TX anyways,
which benefits most reasonable modern NICs (anything with hardware 
checksumming)

-Andi


RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.

2007-08-19 Thread Felix Marti


> -----Original Message-----
> From: David Miller [mailto:[EMAIL PROTECTED]
> Sent: Sunday, August 19, 2007 12:32 PM
> To: Felix Marti
> Cc: [EMAIL PROTECTED]; netdev@vger.kernel.org; [EMAIL PROTECTED];
> [EMAIL PROTECTED]; [EMAIL PROTECTED];
> [EMAIL PROTECTED]
> Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate
> PS_TCPportsfrom the host TCP port space.
> 
> From: "Felix Marti" <[EMAIL PROTECTED]>
> Date: Sun, 19 Aug 2007 10:33:31 -0700
> 
> > I know that you don't agree that TSO has drawbacks, as outlined by
> > Roland, but its history shows something else: the addition of TSO
> > took a fair amount of time and network performance was erratic for
> > multiple kernel revisions and the TSO code is sprinkled across the
> > network stack.
> 
> This thing you call "sprinkled" is a necessity of any hardware
> offload when it is possible for a packet to later get "steered"
> to a device which cannot perform the offload.
> 
> Therefore we need a software implementation of TSO so that those
> packets can still get output to the non-TSO-capable device.
> 
> We do the same thing for checksum offloading.
> 
> And for free we can use the software offloading mechanism to
> get batching to arbitrary network devices, even those which cannot
> do TSO.
> 
> What benefits does RDMA infrastructure give to non-RDMA capable
> devices?  None?  I see, that's great.
> 
> And again the TSO bugs and issues are being overstated and, also for
> the second time, these issues are more indicative of my bad
> programming skills than they are of intrinsic issues of TSO.  The
> TSO implementation was looking for a good design, and it took me
> a while to find it because I personally suck.
> 
> Face it, stateless offloads are always going to be better in the long
> term.  And this is proven.
> 
> You RDMA folks really do live in some kind of fantasy land.
[Felix Marti] You're not at all addressing the fact that RDMA does solve
the memory BW problem and stateless offload doesn't. Apart from that, I
don't quite understand your argument with respect to the benefits of the
RDMA infrastructure; what benefits does the TSO infrastructure give the
non-TSO capable devices? Isn't the answer none and yet you added TSO
support?! I don't think that the argument is stateless _versus_ stateful
offload; both have their advantages and disadvantages. Stateless offload
does help, i.e. TSO/LRO do improve performance in back-to-back
benchmarks. It seems to me that _you_ claim that there is no benefit to
stateful offload and that is where we're disagreeing; there is benefit,
and the much lower memory BW requirement is just one example, yet
an important one. We'll probably never agree but it seems to me that
we're asking only for small changes to the software stack and then we
can give the choice to the end users: they can opt for stateless offload
if it fits their performance needs or for stateful offload if their apps
require the extra boost in performance.



Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.

2007-08-19 Thread David Miller
From: "Felix Marti" <[EMAIL PROTECTED]>
Date: Sun, 19 Aug 2007 10:33:31 -0700

> I know that you don't agree that TSO has drawbacks, as outlined by
> Roland, but its history shows something else: the addition of TSO
> took a fair amount of time and network performance was erratic for
> multiple kernel revisions and the TSO code is sprinkled across the
> network stack.

This thing you call "sprinkled" is a necessity of any hardware
offload when it is possible for a packet to later get "steered"
to a device which cannot perform the offload.

Therefore we need a software implementation of TSO so that those
packets can still get output to the non-TSO-capable device.

We do the same thing for checksum offloading.

And for free we can use the software offloading mechanism to
get batching to arbitrary network devices, even those which cannot
do TSO.

What benefits does RDMA infrastructure give to non-RDMA capable
devices?  None?  I see, that's great.

And again the TSO bugs and issues are being overstated and, also for
the second time, these issues are more indicative of my bad
programming skills than they are of intrinsic issues of TSO.  The
TSO implementation was looking for a good design, and it took me
a while to find it because I personally suck.

Face it, stateless offloads are always going to be better in the long
term.  And this is proven.

You RDMA folks really do live in some kind of fantasy land.


RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.

2007-08-19 Thread Felix Marti


> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:general-
> [EMAIL PROTECTED] On Behalf Of David Miller
> Sent: Sunday, August 19, 2007 12:24 AM
> To: [EMAIL PROTECTED]
> Cc: netdev@vger.kernel.org; [EMAIL PROTECTED];
> [EMAIL PROTECTED]; [EMAIL PROTECTED];
> [EMAIL PROTECTED]
> Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate
> PS_TCPportsfrom the host TCP port space.
> 
> From: "Sean Hefty" <[EMAIL PROTECTED]>
> Date: Sun, 19 Aug 2007 00:01:07 -0700
> 
> > Millions of Infiniband ports are in operation today.  Over 25% of
the
> top 500
> > supercomputers use Infiniband.  The formation of the OpenFabrics
> Alliance was
> > pushed and has been continuously funded by an RDMA customer - the US
> National
> > Labs.  RDMA technologies are backed by Cisco, IBM, Intel, QLogic,
> Sun, Voltaire,
> > Mellanox, NetApp, AMD, Dell, HP, Oracle, Unisys, Emulex, Hitachi,
> NEC, Fujitsu,
> > LSI, SGI, Sandia, and at least two dozen other companies.  IDC
> expects
> > Infiniband adapter revenue to triple between 2006 and 2011, and
> switch revenue
> > to increase six-fold (combined revenues of 1 billion).
> 
> Scale these numbers with reality and usage.
> 
> These vendors pour in huge amounts of money into a relatively small
> number of extremely large cluster installations.  Besides the folks
> doing nuke and whole-earth simulations at some government lab, nobody
> cares.  And part of the investment is not being done wholly for smart
> economic reasons, but also largely publicity purposes.
> 
> So present your great Infiniband numbers with that being admitted up
> front, ok?
> 
> It's relevance to Linux as a general purpose operating system that
> should be "good enough" for %99 of the world is close to NIL.
> 
> People have been pouring tons of money and research into doing stupid
> things to make clusters go fast, and in such a way that make zero
> sense for general purpose operating systems, for ages.  RDMA is just
> one such example.
[Felix Marti] Ouch, and I believed linux to be a leading edge OS, 
scaling from small embedded systems to hundreds of CPUs and hence
I assumed that the same 'scalability' applies to the network subsystem.

> 
> BTW, I find it ironic that you mention memory bandwidth as a retort,
> as Roland's favorite stateless offload devil, TSO, deals explicitly
> with lowering the per-packet BUS bandwidth usage of TCP.  LRO
> offloading does likewise.

[Felix Marti] Aren't you confusing memory and bus BW here? - RDMA 
enables DMA from/to application buffers removing the user-to-kernel/
kernel-to-user memory copy which is a significant overhead at the 
rates we're talking about: memory copy at 20Gbps (10Gbps in and 10Gbps 
out) requires 60Gbps of BW on most common platforms. So, receiving and
transmitting at 10Gbps with LRO and TSO requires 80Gbps of system 
memory BW (which is beyond what most systems can do) whereas RDMA can 
do with 20Gbps!
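To spell out the arithmetic behind those numbers, here is a
back-of-envelope sketch; the factor of three is taken from the claim
above ("on most common platforms"), not something I measured:

#include <stdio.h>

int main(void)
{
    double rx = 10.0, tx = 10.0;             /* Gbit/s on the wire          */

    /* Copying at (rx + tx) Gbit/s needs ~3x that in memory BW, per the    */
    /* claim above, plus the NIC DMA traffic itself.                       */
    double copies    = 3.0 * (rx + tx);      /* ~60 Gbit/s for the copies   */
    double with_copy = copies + (rx + tx);   /* ~80 Gbit/s total            */

    /* RDMA path: only the DMA directly into/out of application buffers    */
    double zero_copy = rx + tx;              /* ~20 Gbit/s                  */

    printf("copy path: ~%.0f Gbit/s of memory traffic\n", with_copy);
    printf("zero-copy: ~%.0f Gbit/s of memory traffic\n", zero_copy);
    return 0;
}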

In addition, BUS improvements are really not significant (nor are buses
the bottleneck anymore with wide availability of PCI-E >= x8); TSO
avoids the DMA of a bunch of network headers... a typical example of
stateless offload - improving performance by a few percent while offload
technologies provide system improvements of hundreds of percent.

I know that you don't agree that TSO has drawbacks, as outlined by
Roland, but its history shows something else: the addition of TSO took a
fair amount of time and network performance was erratic for multiple
kernel revisions and the TSO code is sprinkled across the network stack.
It is an example of an intrusive 'improvement' whereas Steve (who
started this thread) is asking for a relatively small change (decoupling
the 4-tuple allocation from the socket). As Steve has outlined, your
refusal of the change requires RDMA users to work around the issue,
which pushes the issue to the end-users and thus slows down the
acceptance of the technology, leading to a chicken-and-egg problem: you
only care if there are lots of users but you make it hard to use the
technology in the first place, clever ;)
 