Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

2007-08-20 Thread Evgeniy Polyakov
On Sun, Aug 19, 2007 at 05:47:59PM -0700, Felix Marti ([EMAIL PROTECTED]) wrote:
 [Felix Marti] David and Herbert, so you agree that the user-kernel
 space memory copy overhead is a significant overhead and we want to
 enable zero-copy in both the receive and transmit path? - Yes, copy

It depends. If you need to access that data after it is received, you will
get cache misses and performance will not be much better (if any) than
with a copy.

 avoidance is mainly an API issue and unfortunately the so widely used
 (synchronous) sockets API doesn't make copy avoidance easy, which is one
 area where protocol offload can help. Yes, some apps can resort to
 sendfile() but there are many apps which seem to have trouble switching
 to that API... and what about the receive path?

There are a number of implementations, and all they are suitable for is
to have a recvfile(), since that is likely the only case which can work
without the cache.

And actually the RDMA stack exists and no one said it should be thrown
away _until_ it messes with the main stack. It has started to steal
ports. What will happen when it gets all of the port space and no new
legal network connection can be opened, while there is no way to show
the user who took them? What will happen if a hardware RDMA connection
gets terminated and software cannot free the port? Will RDMA ask for
connection reset functions to be exported out of the stack to drop
network connections sitting on ports that new RDMA connections are
supposed to use?

RDMA is not a problem, but how it influences the network stack is.
Let's rather think about how to work correctly with the network stack
(since we already have that cr^Wdifferent hardware) instead of saying
that others do bad work and do not allow a shiny new feature to exist.

-- 
Evgeniy Polyakov


Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

2007-08-20 Thread Andi Kleen
Felix Marti [EMAIL PROTECTED] writes:
  avoidance gains of TSO and LRO are still a very worthwhile savings.
 So, e.g. with TSO, you're saving about 16 headers (let us say 14 + 20 +
 20), 864B, when moving ~64KB of payload - looks like very much in the
 noise to me.

TSO is beneficial for the software again. The linux code currently
takes several locks and does quite a few function calls for each 
packet and using larger packets lowers this overhead. At least with
10GbE saving CPU cycles is still quite important.

 an option to get 'high performance' 

Shouldn't you qualify that?

It is unlikely you really duplicated all the tuning for corner cases
that went over many years into good software TCP stacks in your
hardware.  So e.g. for wide area networks with occasional packet loss
the software might well perform better.

-Andi


RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

2007-08-20 Thread Felix Marti


 -Original Message-
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Andi Kleen
 Sent: Monday, August 20, 2007 4:07 AM
 To: Felix Marti
 Cc: David Miller; [EMAIL PROTECTED]; netdev@vger.kernel.org;
 [EMAIL PROTECTED]; [EMAIL PROTECTED]; linux-
 [EMAIL PROTECTED]; [EMAIL PROTECTED]
 Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate
 PS_TCP ports from the host TCP port space.
 
 Felix Marti [EMAIL PROTECTED] writes:
   avoidance gains of TSO and LRO are still a very worthwhile savings.
  So, e.g. with TSO, you're saving about 16 headers (let us say 14 + 20 +
  20), 864B, when moving ~64KB of payload - looks like very much in the
  noise to me.
 
 TSO is beneficial for the software again. The linux code currently
 takes several locks and does quite a few function calls for each
 packet and using larger packets lowers this overhead. At least with
 10GbE saving CPU cycles is still quite important.
 
  an option to get 'high performance'
 
 Shouldn't you qualify that?
 
 It is unlikely you really duplicated all the tuning for corner cases
 that went over many years into good software TCP stacks in your
 hardware.  So e.g. for wide area networks with occasional packet loss
 the software might well perform better.
Yes, it used to be sufficient to submit performance data to show that a
technology makes 'sense'. In fact, I believe it was Alan Cox who once
said that Linux will have a look at offload once an offload device holds
the land speed record (probably assuming that the day never comes ;).
For the last few years it has been Chelsio offload devices that have
been improving their own LSRs (as IO bus speeds have been increasing).
It is worthwhile to point out that OC-192 doesn't offer full 10Gbps BW
and the fine-grained (per packet and not per TSO-burst) packet scheduler
in the offload device played a crucial part in pushing performance to
the limits of what OC-192 can do. Most other customers use our offload
products in low-latency cluster environments. - The problem with offload
devices is that they are not all born equal and there have been a lot of
poor implementations giving the technology a bad name. I can only speak
for Chelsio and do claim that we have a solid implementation that scales
from low-latency cluster environments to LFNs.

Andi, I could present performance numbers, e.g. throughput and CPU
utilization as a function of IO size, number of connections, ... in a
back-to-back environment and/or in a cluster environment... but what
will it get me? I'd still get hit by the 'not integrated' hammer :(

 
 -Andi


RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

2007-08-20 Thread Felix Marti


 -Original Message-
 From: Evgeniy Polyakov [mailto:[EMAIL PROTECTED]
 Sent: Monday, August 20, 2007 2:43 AM
 To: Felix Marti
 Cc: David Miller; [EMAIL PROTECTED]; netdev@vger.kernel.org;
 [EMAIL PROTECTED]; [EMAIL PROTECTED]; linux-
 [EMAIL PROTECTED]; [EMAIL PROTECTED]
 Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate
 PS_TCP ports from the host TCP port space.
 
 On Sun, Aug 19, 2007 at 05:47:59PM -0700, Felix Marti
 ([EMAIL PROTECTED]) wrote:
  [Felix Marti] David and Herbert, so you agree that the user-kernel
  space memory copy overhead is a significant overhead and we want to
  enable zero-copy in both the receive and transmit path? - Yes, copy
 
 It depends. If you need to access that data after it is received, you
 will get cache misses and performance will not be much better (if any)
 than with a copy.
Yes, the app will take the cache hits when accessing the data. However,
the fact remains that if there is a copy in the receive path, you
require an additional 3x memory BW (which is very significant at these
high rates and most likely the bottleneck for most current systems)...
and somebody always has to take the cache miss be it the copy_to_user or
the app.
 
  avoidance is mainly an API issue and unfortunately the so widely used
  (synchronous) sockets API doesn't make copy avoidance easy, which is one
  area where protocol offload can help. Yes, some apps can resort to
  sendfile() but there are many apps which seem to have trouble switching
  to that API... and what about the receive path?
 
 There are a number of implementations, and all they are suitable for is
 to have a recvfile(), since that is likely the only case which can work
 without the cache.
 
 And actually the RDMA stack exists and no one said it should be thrown
 away _until_ it messes with the main stack. It has started to steal
 ports. What will happen when it gets all of the port space and no new
 legal network connection can be opened, while there is no way to show
 the user who took them? What will happen if a hardware RDMA connection
 gets terminated and software cannot free the port? Will RDMA ask for
 connection reset functions to be exported out of the stack to drop
 network connections sitting on ports that new RDMA connections are
 supposed to use?
Yes, RDMA support is there... but we could make it better and easier to
use. We have a problem today with port sharing and there was a proposal
to address the issue by tighter integration (see the beginning of the
thread) but the proposal got shot down immediately... because it is RDMA
and not for technical reasons. I believe this email thread shows in
detail how RDMA (a network technology) is treated as a bastard child by
the network folks, well at least by one of them.
 
 RDMA is not a problem, but how it influences the network stack is.
 Let's rather think about how to work correctly with the network stack
 (since we already have that cr^Wdifferent hardware) instead of saying
 that others do bad work and do not allow a shiny new feature to exist.
By no means did I want to imply that others do bad work; are you
referring to me using TSO implementation issues as an example? - If so,
let me clarify: I understand that the TSO implementation took some time
to get right. What I was referring to is that TSO(/LRO) have their own
issues, some alluded to by Roland and me. In fact, customers working on
the LSR couldn't use TSO due to the burstiness it introduces and had to
fall back to our fine-grained packet scheduling done in the offload
device. I am for variety, let us support new technologies that solve
real problems (lots of folks are buying this stuff for a reason) instead
of the 'ah, it's brain-dead and has no future' attitude... there is
precedent for offloading the host CPUs: have a look at graphics.
Graphics used to be done by the host CPU and now we have dedicated
graphics adapters that do a much better job... so, why is it so
farfetched that offload devices can do a better job at a data-flow
problem?
 
 --
   Evgeniy Polyakov


Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

2007-08-20 Thread Andi Kleen
Felix Marti [EMAIL PROTECTED] writes:

 What I was referring to is that TSO(/LRO) have their own
 issues, some alluded to by Roland and me. In fact, customers working on
 the LSR couldn't use TSO due to the burstiness it introduces

That was in old kernels where TSO didn't honor the initial cwnd correctly, 
right? I assume it's long fixed.

If not please clarify what the problem was.

 have a look at graphics.
 Graphics used to be done by the host CPU and now we have dedicated
 graphics adapters that do a much better job...

Is your offload device as programmable as a modern GPU?

 farfetched that offload devices can do a better job at a data-flow
 problem?

One big difference is that there is no potentially adverse and
always varying internet between the graphics card and your monitor.

-Andi


RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

2007-08-20 Thread Felix Marti


 -Original Message-
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Andi Kleen
 Sent: Monday, August 20, 2007 11:11 AM
 To: Felix Marti
 Cc: Evgeniy Polyakov; [EMAIL PROTECTED]; netdev@vger.kernel.org;
 [EMAIL PROTECTED]; [EMAIL PROTECTED];
 [EMAIL PROTECTED]; David Miller
 Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate
 PS_TCP ports from the host TCP port space.
 
 Felix Marti [EMAIL PROTECTED] writes:
 
  What I was referring to is that TSO(/LRO) have their own
  issues, some alluded to by Roland and me. In fact, customers working on
  the LSR couldn't use TSO due to the burstiness it introduces
 
 That was in old kernels where TSO didn't honor the initial cwnd
 correctly, right? I assume it's long fixed.
 
 If not please clarify what the problem was.
The problem is that Ethernet is about the only technology that
discloses 'usable' throughput while everybody else talks about
signaling rates ;) - OC-192 can carry about 9.128Gbps (or close to that
number) and hence 10Gbps Ethernet was overwhelming the OC-192 network.
The customer needed to schedule packets at about 98% of OC-192
throughput in order to avoid packet drop, and the scheduling needed to
be done on a per-packet basis and not on a 'burst of packets' basis.
 
 
  have a look at graphics.
  Graphics used to be done by the host CPU and now we have dedicated
  graphics adapters that do a much better job...
 
 Is your offload device as programmable as a modern GPU?
It has a lot of knobs to turn.

 
  farfetched that offload devices can do a better job at a data-flow
  problem?
 
 One big difference is that there is no potentially adverse and
 always varying internet between the graphics card and your monitor.
These graphics adapters provide a wealth of features that you can take
advantage of to bring these amazing graphics to life. General-purpose
CPUs cannot keep up. Chelsio offload devices do the same thing in the
realm of networking. - Will there be things you can't do? Probably yes,
but as I said, there are lots of knobs to turn (and the latest and
greatest feature that gets hyped up might not always be the best thing
since sliced bread anyway; what happened to BIC love? ;)

 
 -Andi


Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

2007-08-20 Thread Rick Jones

Andi Kleen wrote:

TSO is beneficial for the software again. The linux code currently
takes several locks and does quite a few function calls for each
packet and using larger packets lowers this overhead. At least with
10GbE saving CPU cycles is still quite important.


Some quick netperf TCP_RR tests between a pair of dual-core rx6600's running
2.6.23-rc3.  The NICs are dual-port e1000's connected back-to-back with the
interrupt throttle disabled.  I like using TCP_RR to tickle path-length
questions because it rarely runs into bandwidth limitations regardless of the
link-type.


First, with TSO enabled on both sides, then with it disabled, netperf/netserver 
bound to the same CPU as takes interrupts, which is the best place to be for a 
TCP_RR test (although not always for a TCP_STREAM test...):


:~# netperf -T 1 -t TCP_RR -H 192.168.2.105 -I 99,1 -c -C
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.2.105 
(192.168.2.105) port 0 AF_INET : +/-0.5% @ 99% conf.  : first burst 0 : cpu bind

!!! WARNING
!!! Desired confidence was not achieved within the specified iterations.
!!! This implies that there was variability in the test environment that
!!! must be investigated before going further.
!!! Confidence intervals: Throughput  :  0.3%
!!!   Local CPU util  : 39.3%
!!!   Remote CPU util : 40.6%

Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size     Size    Time     Rate     local  remote local   remote
bytes  bytes  bytes    bytes   secs.    per sec  % S    % S    us/Tr   us/Tr

16384  87380  1   1  10.01   18611.32  20.96  22.35  22.522  24.017
16384  87380
:~# ethtool -K eth2 tso off
e1000: eth2: e1000_set_tso: TSO is Disabled
:~# netperf -T 1 -t TCP_RR -H 192.168.2.105 -I 99,1 -c -C
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.2.105 
(192.168.2.105) port 0 AF_INET : +/-0.5% @ 99% conf.  : first burst 0 : cpu bind

!!! WARNING
!!! Desired confidence was not achieved within the specified iterations.
!!! This implies that there was variability in the test environment that
!!! must be investigated before going further.
!!! Confidence intervals: Throughput  :  0.4%
!!!   Local CPU util  : 21.0%
!!!   Remote CPU util : 25.2%

Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size     Size    Time     Rate     local  remote local   remote
bytes  bytes  bytes    bytes   secs.    per sec  % S    % S    us/Tr   us/Tr

16384  87380  1   1  10.01   19812.51  17.81  17.19  17.983  17.358
16384  87380

While the confidence intervals for CPU util weren't hit, I suspect the 
differences in service demand were still real.  On throughput we are talking 
about +/- 0.2%, for CPU util we are talking about +/- 20% (percent not 
percentage points) in the first test and 12.5% in the second.


So, in broad handwaving terms, TSO increased the per-transaction service demand 
by something along the lines of (23.27 - 17.67)/17.67 or ~30% and the 
transaction rate decreased by ~6%.


rick jones
bitrate blindness is a constant concern


Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

2007-08-20 Thread Thomas Graf
* Felix Marti [EMAIL PROTECTED] 2007-08-20 12:02
 These graphic adapters provide a wealth of features that you can take
 advantage of to bring these amazing graphics to life. General purpose
 CPUs cannot keep up. Chelsio offload devices do the same thing in the
 realm of networking. - Will there be things you can't do, probably yes,
 but as I said, there are lots of knobs to turn (and the latest and
 greatest feature that gets hyped up might not always be the best thing
 since sliced bread anyway; what happened to BIC love? ;)

GPUs have almost no influence on system security; the network stack, OTOH,
is probably the most vulnerable part of an operating system. Even if all
vendors implemented all the features collected over the last years
properly (which seems unlikely), having such an essential and critical
part depend on the vendor of my network card, without being able to even
verify it properly, is truly frightening.


Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

2007-08-20 Thread Andi Kleen
 GPUs have almost no influence on system security, 

Unless you use direct rendering from user space.

-Andi


Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

2007-08-20 Thread Patrick Geoffray

Felix Marti wrote:

Yes, the app will take the cache hits when accessing the data. However,
the fact remains that if there is a copy in the receive path, you
require an additional 3x memory BW (which is very significant at these
high rates and most likely the bottleneck for most current systems)...
and somebody always has to take the cache miss be it the copy_to_user or
the app.


The cache miss is going to cost you half the memory bandwidth of a full 
copy. If the data is already in cache, then the copy is cheaper.


However, removing the copy removes the kernel from the picture on the 
receive side, so you lose demultiplexing, asynchronism, security, 
accounting, flow-control, swapping, etc. If it's ok with you to not use 
the kernel stack, then why expect to fit in the existing infrastructure 
anyway?



Yes, RDMA support is there... but we could make it better and easier to


What do you need from the kernel for RDMA support beyond HW drivers? A
fast way to pin and translate user memory (i.e. registration). That is
pretty much the sandbox that David referred to.


Eventually, it would be useful to be able to track the VM space to 
implement a registration cache instead of using ugly hacks in user-space 
to hijack malloc, but this is completely independent from the net stack.



use. We have a problem today with port sharing and there was a proposal


The port spaces are either totally separate and there is no issue, or 
completely identical and you should then run your connection manager in 
user-space or fix your middlewares.



and not for technical reasons. I believe this email thread shows in
detail how RDMA (a network technology) is treated as a bastard child by
the network folks, well at least by one of them.


I don't think it's fair. This thread actually shows how pushy some RDMA
folks are about not acknowledging that the current infrastructure is
here for a reason, and about mistaking zero-copy for RDMA.


This is a similar argument to the TOE discussion, and it was
definitely a good decision to not mess up the Linux stack with TOEs.


Patrick


RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

2007-08-20 Thread Felix Marti


 -Original Message-
 From: Patrick Geoffray [mailto:[EMAIL PROTECTED]
 Sent: Monday, August 20, 2007 1:34 PM
 To: Felix Marti
 Cc: Evgeniy Polyakov; David Miller; [EMAIL PROTECTED];
 netdev@vger.kernel.org; [EMAIL PROTECTED];
 [EMAIL PROTECTED]; [EMAIL PROTECTED];
 [EMAIL PROTECTED]
 Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate
 PS_TCP ports from the host TCP port space.
 
 Felix Marti wrote:
  Yes, the app will take the cache hits when accessing the data. However,
  the fact remains that if there is a copy in the receive path, you
  require an additional 3x memory BW (which is very significant at these
  high rates and most likely the bottleneck for most current systems)...
  and somebody always has to take the cache miss be it the copy_to_user or
  the app.
 
 The cache miss is going to cost you half the memory bandwidth of a full
 copy. If the data is already in cache, then the copy is cheaper.
 
 However, removing the copy removes the kernel from the picture on the
 receive side, so you lose demultiplexing, asynchronism, security,
 accounting, flow-control, swapping, etc. If it's ok with you to not use
 the kernel stack, then why expect to fit in the existing infrastructure
 anyway?
Many of the things you're referring to are moved to the offload adapter,
but from an ease-of-use point of view it would be great if the user
could still collect stats the same way, i.e. netstat reports the 4-tuple
in use and other network stats. In addition, security features and
packet scheduling could be integrated so that the user configures them
the same way as for the network stack.

 
  Yes, RDMA support is there... but we could make it better and easier to
 
 What do you need from the kernel for RDMA support beyond HW drivers? A
 fast way to pin and translate user memory (i.e. registration). That is
 pretty much the sandbox that David referred to.
 
 Eventually, it would be useful to be able to track the VM space to
 implement a registration cache instead of using ugly hacks in user-space
 to hijack malloc, but this is completely independent from the net stack.
 
  use. We have a problem today with port sharing and there was a proposal
 
 The port spaces are either totally separate and there is no issue, or
 completely identical and you should then run your connection manager in
 user-space or fix your middlewares.
When running on an iWARP device (and hence on top of TCP) I believe that
the port space should be shared so that e.g. netstat reports the 4-tuple
in use.
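
(One user-space workaround, sketched below purely for illustration: hold an ordinary bound, listening TCP socket on the port the iWARP listener uses, so the host stack will not hand the same port to another application and netstat at least shows it as taken. The function name is made up for the example, and this does not make the kernel aware of the offloaded connection itself.)

#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Reserve 'port' in the host TCP port space for the lifetime of the
 * returned fd; the RDMA/iWARP listener is then created on the same
 * port by other means. Returns the fd to keep open, or -1 on error. */
int reserve_tcp_port(uint16_t port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    struct sockaddr_in sin;
    memset(&sin, 0, sizeof(sin));
    sin.sin_family      = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_ANY);
    sin.sin_port        = htons(port);

    /* bind + listen so the port shows up in netstat as LISTEN and cannot
     * be handed out to another host-stack socket. */
    if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0 || listen(fd, 1) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}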

 
  and not for technical reasons. I believe this email thread shows in
  detail how RDMA (a network technology) is treated as a bastard child by
  the network folks, well at least by one of them.
 
 I don't think it's fair. This thread actually shows how pushy some RDMA
 folks are about not acknowledging that the current infrastructure is
 here for a reason, and about mistaking zero-copy for RDMA.
Zero-copy and RDMA are not the same but in the context of this
discussion I referred to RDMA as a superset (zero-copy is implied).

 
 This is a similar argument to the TOE discussion, and it was
 definitely a good decision to not mess up the Linux stack with TOEs.
 
 Patrick


RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

2007-08-19 Thread Felix Marti


 -Original Message-
 From: [EMAIL PROTECTED] [mailto:general-
 [EMAIL PROTECTED] On Behalf Of David Miller
 Sent: Sunday, August 19, 2007 12:24 AM
 To: [EMAIL PROTECTED]
 Cc: netdev@vger.kernel.org; [EMAIL PROTECTED];
 [EMAIL PROTECTED]; [EMAIL PROTECTED];
 [EMAIL PROTECTED]
 Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate
 PS_TCP ports from the host TCP port space.
 
 From: Sean Hefty [EMAIL PROTECTED]
 Date: Sun, 19 Aug 2007 00:01:07 -0700
 
  Millions of Infiniband ports are in operation today.  Over 25% of the
  top 500 supercomputers use Infiniband.  The formation of the OpenFabrics
  Alliance was pushed and has been continuously funded by an RDMA customer -
  the US National Labs.  RDMA technologies are backed by Cisco, IBM, Intel,
  QLogic, Sun, Voltaire, Mellanox, NetApp, AMD, Dell, HP, Oracle, Unisys,
  Emulex, Hitachi, NEC, Fujitsu, LSI, SGI, Sandia, and at least two dozen
  other companies.  IDC expects Infiniband adapter revenue to triple
  between 2006 and 2011, and switch revenue to increase six-fold (combined
  revenues of 1 billion).
 
 Scale these numbers with reality and usage.
 
 These vendors pour in huge amounts of money into a relatively small
 number of extremely large cluster installations.  Besides the folks
 doing nuke and whole-earth simulations at some government lab, nobody
 cares.  And part of the investment is not being done wholly for smart
 economic reasons, but also largely publicity purposes.
 
 So present your great Infiniband numbers with that being admitted up
 front, ok?
 
 Its relevance to Linux as a general purpose operating system that
 should be good enough for %99 of the world is close to NIL.
 
 People have been pouring tons of money and research into doing stupid
 things to make clusters go fast, and in such a way that make zero
 sense for general purpose operating systems, for ages.  RDMA is just
 one such example.
[Felix Marti] Ouch, and I believed Linux to be a leading-edge OS,
scaling from small embedded systems to hundreds of CPUs, and hence
I assumed that the same 'scalability' applies to the network subsystem.

 
 BTW, I find it ironic that you mention memory bandwidth as a retort,
 as Roland's favorite stateless offload devil, TSO, deals explicitly
 with lowering the per-packet BUS bandwidth usage of TCP.  LRO
 offloading does likewise.

[Felix Marti] Aren't you confusing memory and bus BW here? - RDMA
enables DMA from/to application buffers, removing the user-to-kernel/
kernel-to-user memory copy, which is a significant overhead at the
rates we're talking about: memory copy at 20Gbps (10Gbps in and 10Gbps
out) requires 60Gbps of BW on most common platforms. So, receiving and
transmitting at 10Gbps with LRO and TSO requires 80Gbps of system
memory BW (which is beyond what most systems can do) whereas RDMA can
make do with 20Gbps!
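
(As a back-of-the-envelope check, the figures above follow from the per-byte accounting used in this paragraph: three memory passes for the in-kernel copy and one more for the NIC DMA. The little C program below is illustrative only and simply restates that accounting; it is not a measurement.)

#include <stdio.h>

int main(void)
{
    double wire_gbps = 10.0 + 10.0;   /* 10Gbps receive + 10Gbps transmit */

    /* The claim above counts the copy as ~3 memory passes per byte and
     * the NIC DMA as one more pass in each direction. */
    double copy_bw = 3.0 * wire_gbps;           /* ~60Gbps for the copies */
    double dma_bw  = 1.0 * wire_gbps;           /* ~20Gbps for the DMA    */

    printf("copying stack:    ~%.0f Gbps of memory BW\n", copy_bw + dma_bw);
    printf("direct placement: ~%.0f Gbps of memory BW\n", dma_bw);
    return 0;
}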

In addition, BUS improvements are really not significant (nor are buses
the bottleneck anymore with wide availability of PCI-E >= x8); TSO avoids
the DMA of a bunch of network headers... a typical example of stateless
offload - improving performance by a few percent while offload
technologies provide system improvements of hundreds of percent.

I know that you don't agree that TSO has drawbacks, as outlined by
Roland, but its history shows something else: the addition of TSO took a
fair amount of time, network performance was erratic for multiple kernel
revisions, and the TSO code is sprinkled across the network stack. It is
an example of an intrusive 'improvement' whereas Steve (who started this
thread) is asking for a relatively small change (decoupling the 4-tuple
allocation from the socket). As Steve has outlined, your refusal of the
change requires RDMA users to work around the issue, which pushes the
issue to the end users, slowing down the acceptance of the technology
and leading to a chicken-and-egg problem: you only care if there are
lots of users but you make it hard to use the technology in the first
place, clever ;)
 


Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

2007-08-19 Thread David Miller
From: Felix Marti [EMAIL PROTECTED]
Date: Sun, 19 Aug 2007 10:33:31 -0700

 I know that you don't agree that TSO has drawbacks, as outlined by
 Roland, but its history shows something else: the addition of TSO
 took a fair amount of time and network performance was erratic for
 multiple kernel revisions and the TSO code is sprinkled across the
 network stack.

This thing you call sprinkled is a necessity of any hardware
offload when it is possible for a packet to later get steered
to a device which cannot perform the offload.

Therefore we need a software implementation of TSO so that those
packets can still get output to the non-TSO-capable device.

We do the same thing for checksum offloading.

And for free we can use the software offloading mechanism to
get batching to arbitrary network devices, even those which cannot
do TSO.
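
(For what it's worth, whether a given device currently has TSO enabled is visible from user space the same way ethtool reports it, via the ETHTOOL_GTSO ioctl; the sketch below is an illustration with minimal error handling, not a reference implementation. When this reports 0, that is roughly the situation in which the software segmentation described above has to step in.)

#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

/* Returns 1 if TSO is enabled on 'ifname', 0 if disabled, -1 on error. */
int tso_enabled(const char *ifname)
{
    struct ethtool_value ev = { .cmd = ETHTOOL_GTSO };
    struct ifreq ifr;
    int fd, ret;

    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
    ifr.ifr_data = (char *)&ev;

    ret = ioctl(fd, SIOCETHTOOL, &ifr) < 0 ? -1 : (int)ev.data;
    close(fd);
    return ret;
}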

What benefits does RDMA infrastructure give to non-RDMA capable
devices?  None?  I see, that's great.

And again the TSO bugs and issues are being overstated and, also for
the second time, these issues are more indicative of my bad
programming skills than they are of intrinsic issues of TSO.  The
TSO implementation was looking for a good design, and it took me
a while to find it because I personally suck.

Face it, stateless offloads are always going to be better in the long
term.  And this is proven.

You RDMA folks really do live in some kind of fantasy land.


RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

2007-08-19 Thread Felix Marti


 -Original Message-
 From: David Miller [mailto:[EMAIL PROTECTED]
 Sent: Sunday, August 19, 2007 12:32 PM
 To: Felix Marti
 Cc: [EMAIL PROTECTED]; netdev@vger.kernel.org; [EMAIL PROTECTED];
 [EMAIL PROTECTED]; [EMAIL PROTECTED];
 [EMAIL PROTECTED]
 Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate
 PS_TCP ports from the host TCP port space.
 
 From: Felix Marti [EMAIL PROTECTED]
 Date: Sun, 19 Aug 2007 10:33:31 -0700
 
  I know that you don't agree that TSO has drawbacks, as outlined by
  Roland, but its history shows something else: the addition of TSO
  took a fair amount of time and network performance was erratic for
  multiple kernel revisions and the TSO code is sprinkled across the
  network stack.
 
 This thing you call sprinkled is a necessity of any hardware
 offload when it is possible for a packet to later get steered
 to a device which cannot perform the offload.
 
 Therefore we need a software implementation of TSO so that those
 packets can still get output to the non-TSO-capable device.
 
 We do the same thing for checksum offloading.
 
 And for free we can use the software offloading mechanism to
 get batching to arbitrary network devices, even those which cannot
 do TSO.
 
 What benefits does RDMA infrastructure give to non-RDMA capable
 devices?  None?  I see, that's great.
 
 And again the TSO bugs and issues are being overstated and, also for
 the second time, these issues are more indicative of my bad
 programming skills than they are of intrinsic issues of TSO.  The
 TSO implementation was looking for a good design, and it took me
 a while to find it because I personally suck.
 
 Face it, stateless offloads are always going to be better in the long
 term.  And this is proven.
 
 You RDMA folks really do live in some kind of fantasy land.
[Felix Marti] You're not at all addressing the fact that RDMA does solve
the memory BW problem and stateless offload doesn't. Apart from that, I
don't quite understand your argument with respect to the benefits of the
RDMA infrastructure; what benefits does the TSO infrastructure give the
non-TSO capable devices? Isn't the answer none, and yet you added TSO
support?! I don't think that the argument is stateless _versus_ stateful
offload; both have their advantages and disadvantages. Stateless offload
does help, e.g. TSO/LRO do improve performance in back-to-back
benchmarks. It seems to me that _you_ claim that there is no benefit to
stateful offload and that is where we're disagreeing; there is benefit,
and the much lower memory BW requirement is just one example, yet
an important one. We'll probably never agree but it seems to me that
we're asking only for small changes to the software stack and then we
can give the choice to the end users: they can opt for stateless offload
if it fits their performance needs or for stateful offload if their apps
require the extra boost in performance.



Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

2007-08-19 Thread Andi Kleen
Felix Marti [EMAIL PROTECTED] writes:

 what benefits does the TSO infrastructure give the
 non-TSO capable devices?

It improves performance on software queueing devices between guests
and hypervisors. This is a more and more important application these
days.  Even when the system running the Hypervisor has a non-TSO-capable
device in the end, it'll still save CPU cycles this way. Right now
virtualized IO tends to be much more CPU intensive than direct IO, so any
help it can get is beneficial.

It also makes loopback faster, although given that's probably not that
useful.

And a lot of the TSO infrastructure was needed for zero-copy TX anyway,
which benefits most reasonably modern NICs (anything with hardware
checksumming).

-Andi


Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

2007-08-19 Thread David Miller
From: Felix Marti [EMAIL PROTECTED]
Date: Sun, 19 Aug 2007 12:49:05 -0700

 You're not at all addressing the fact that RDMA does solve the
 memory BW problem and stateless offload doesn't.

It does, I just didn't retort to your claims because they were
so blatantly wrong.


Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

2007-08-19 Thread David Miller
From: Andi Kleen [EMAIL PROTECTED]
Date: 20 Aug 2007 01:27:35 +0200

 Felix Marti [EMAIL PROTECTED] writes:
 
  what benefits does the TSO infrastructure give the
  non-TSO capable devices?
 
 It improves performance on software queueing devices between guests
 and hypervisors. This is a more and more important application these
 days.  Even when the system running the Hypervisor has a non-TSO-capable
 device in the end, it'll still save CPU cycles this way. Right now
 virtualized IO tends to be much more CPU intensive than direct IO, so any
 help it can get is beneficial.
 
 It also makes loopback faster, although given that's probably not that
 useful.
 
 And a lot of the TSO infrastructure was needed for zero-copy TX anyway,
 which benefits most reasonably modern NICs (anything with hardware
 checksumming).

And also, you can enable TSO generation for a non-TSO-hw device and
get all of the segmentation overhead reduction gains which works out
as a pure win as long as the device can at a minimum do checksumming.


Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

2007-08-19 Thread Herbert Xu
Felix Marti [EMAIL PROTECTED] wrote:

 [Felix Marti] Aren't you confusing memory and bus BW here? - RDMA
 enables DMA from/to application buffers, removing the user-to-kernel/
 kernel-to-user memory copy, which is a significant overhead at the
 rates we're talking about: memory copy at 20Gbps (10Gbps in and 10Gbps
 out) requires 60Gbps of BW on most common platforms. So, receiving and
 transmitting at 10Gbps with LRO and TSO requires 80Gbps of system
 memory BW (which is beyond what most systems can do) whereas RDMA can
 make do with 20Gbps!

Actually this is false.  TSO only requires a copy if the user
chooses to use the sendmsg interface instead of sendpage.  The
same is true for RDMA really.  Except that instead of having to
switch your application to sendfile/splice, you're switching it
to RDMA.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

2007-08-19 Thread Felix Marti


 -Original Message-
 From: David Miller [mailto:[EMAIL PROTECTED]
 Sent: Sunday, August 19, 2007 4:04 PM
 To: Felix Marti
 Cc: [EMAIL PROTECTED]; netdev@vger.kernel.org; [EMAIL PROTECTED];
 [EMAIL PROTECTED]; [EMAIL PROTECTED];
 [EMAIL PROTECTED]
 Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate
 PS_TCP ports from the host TCP port space.
 
 From: Felix Marti [EMAIL PROTECTED]
 Date: Sun, 19 Aug 2007 12:49:05 -0700
 
  You're not at all addressing the fact that RDMA does solve the
  memory BW problem and stateless offload doesn't.
 
 It does, I just didn't retort to your claims because they were
 so blatantly wrong.
[Felix Marti] Hmmm, interesting... I guess it is impossible to even have
a discussion on the subject.


Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

2007-08-19 Thread David Miller
From: Felix Marti [EMAIL PROTECTED]
Date: Sun, 19 Aug 2007 17:32:39 -0700

[ Why do you put that [Felix Marti] everywhere you say something?
  It's annoying and superfluous. The quoting done by your mail client
  makes clear who is saying what. ]

 Hmmm, interesting... I guess it is impossible to even have
 a discussion on the subject.

Nice try, Herbert Xu gave a great explanation.


RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

2007-08-19 Thread Felix Marti


 -Original Message-
 From: David Miller [mailto:[EMAIL PROTECTED]
 Sent: Sunday, August 19, 2007 5:40 PM
 To: Felix Marti
 Cc: [EMAIL PROTECTED]; netdev@vger.kernel.org; [EMAIL PROTECTED];
 [EMAIL PROTECTED]; [EMAIL PROTECTED];
 [EMAIL PROTECTED]
 Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate
 PS_TCP ports from the host TCP port space.
 
 From: Felix Marti [EMAIL PROTECTED]
 Date: Sun, 19 Aug 2007 17:32:39 -0700
 
 [ Why do you put that [Felix Marti] everywhere you say something?
   It's annoying and superfluous. The quoting done by your mail client
   makes clear who is saying what. ]
 
  Hmmm, interesting... I guess it is impossible to even have
  a discussion on the subject.
 
 Nice try, Herbert Xu gave a great explanation.
[Felix Marti] David and Herbert, so you agree that the user-kernel
space memory copy overhead is a significant overhead and we want to
enable zero-copy in both the receive and transmit path? - Yes, copy
avoidance is mainly an API issue and unfortunately the so widely used
(synchronous) sockets API doesn't make copy avoidance easy, which is one
area where protocol offload can help. Yes, some apps can resort to
sendfile() but there are many apps which seem to have trouble switching
to that API... and what about the receive path?


Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

2007-08-19 Thread David Miller
From: Felix Marti [EMAIL PROTECTED]
Date: Sun, 19 Aug 2007 17:47:59 -0700

 [Felix Marti]

Please stop using this to start your replies, thank you.

 David and Herbert, so you agree that the user-kernel
 space memory copy overhead is a significant overhead and we want to
 enable zero-copy in both the receive and transmit path? - Yes, copy
 avoidance is mainly an API issue and unfortunately the so widely used
 (synchronous) sockets API doesn't make copy avoidance easy, which is one
 area where protocol offload can help. Yes, some apps can resort to
 sendfile() but there are many apps which seem to have trouble switching
 to that API... and what about the receive path?

On the send side none of this is an issue.  You either are sending
static content, in which case using sendfile() is trivial, or you're
generating data dynamically, in which case the data copy is in the
noise or too small to do zerocopy on; and if not, you can use a shared
mmap to generate your data into, and then sendfile out from that file,
to avoid the copy that way.

splice() helps a lot too.

Splice has the capability to do away with the receive side too, and
there are a few receivefile() implementations that could get cleaned
up and merged in.
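
(A concrete illustration of the send-side path being described: a minimal sendfile()-based sender. 'sock' is assumed to be an already-connected TCP socket, the function name is invented for the example, and error handling is kept to a minimum.)

#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

/* Send a whole file over an established TCP socket without the payload
 * ever passing through a user-space buffer. Returns 0 on success. */
int send_file_zero_copy(int sock, const char *path)
{
    int fd = open(path, O_RDONLY);
    struct stat st;
    off_t offset = 0;

    if (fd < 0 || fstat(fd, &st) < 0) {
        if (fd >= 0)
            close(fd);
        return -1;
    }

    while (offset < st.st_size) {
        ssize_t n = sendfile(sock, fd, &offset, st.st_size - offset);
        if (n <= 0)
            break;              /* error, or the peer went away */
    }

    close(fd);
    return offset == st.st_size ? 0 : -1;
}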

Also, the I/O bus, along with main memory bandwidth, is still the more
limiting factor in all of this; it is the smallest data pipe for
communications out to and from the network.  So the protocol header
avoidance gains of TSO and LRO are still a very worthwhile savings.

But even if RDMA increases performance 100 fold, it still doesn't
avoid the issue that it doesn't fit in with the rest of the networking
stack and feature set.

Any monkey can change the rules around (ok I can make it go fast as
long as you don't need firewalling, packet scheduling, classification,
and you only need to talk to specific systems that speak this same
special protocol) to make things go faster.  On the other hand well
designed solutions can give performance gains within the constraints
of the full system design and without sacrificing functionality.


RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

2007-08-19 Thread Felix Marti


 -Original Message-
 From: David Miller [mailto:[EMAIL PROTECTED]
 Sent: Sunday, August 19, 2007 6:06 PM
 To: Felix Marti
 Cc: [EMAIL PROTECTED]; netdev@vger.kernel.org; [EMAIL PROTECTED];
 [EMAIL PROTECTED]; [EMAIL PROTECTED];
 [EMAIL PROTECTED]
 Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate
 PS_TCP ports from the host TCP port space.
 
 From: Felix Marti [EMAIL PROTECTED]
 Date: Sun, 19 Aug 2007 17:47:59 -0700
 
  [Felix Marti]
 
 Please stop using this to start your replies, thank you.
Better?

 
  David and Herbert, so you agree that the user-kernel
  space memory copy overhead is a significant overhead and we want to
  enable zero-copy in both the receive and transmit path? - Yes, copy
  avoidance is mainly an API issue and unfortunately the so widely used
  (synchronous) sockets API doesn't make copy avoidance easy, which is one
  area where protocol offload can help. Yes, some apps can resort to
  sendfile() but there are many apps which seem to have trouble switching
  to that API... and what about the receive path?
 
 On the send side none of this is an issue.  You either are sending
 static content, in which case using sendfile() is trivial, or you're
 generating data dynamically, in which case the data copy is in the
 noise or too small to do zerocopy on; and if not, you can use a shared
 mmap to generate your data into, and then sendfile out from that file,
 to avoid the copy that way.
 
 splice() helps a lot too.
 
 Splice has the capability to do away with the receive side too, and
 there are a few receivefile() implementations that could get cleaned
 up and merged in.
I don't believe it is as simple as that. Many apps synthesize their
payload in user space buffers (i.e. malloced memory) and expect to
receive their data in user space buffers _and_ expect the received data
to have a certain alignment and to be contiguous - something not
addressed by these 'new' APIs. Look, people writing HPC apps tend to
take advantage of whatever they can to squeeze some extra performance
out of their apps and they are resorting to protocol offload technology
for a reason, wouldn't you agree? 

 
 Also, the I/O bus, along with main memory bandwidth, is still the more
 limiting factor in all of this; it is the smallest data pipe for
 communications out to and from the network.  So the protocol header
 avoidance gains of TSO and LRO are still a very worthwhile savings.
So, e.g. with TSO, you're saving about 16 headers (let us say 14 + 20 +
20), 864B, when moving ~64KB of payload - looks like very much in the
noise to me. And again, PCI-E provides more bandwidth than the wire...
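
(Redoing that arithmetic with the numbers as given, 16 avoided headers of 14 + 20 + 20 bytes per ~64KB burst, purely as an illustration of why it looks small relative to the payload; the CPU-side per-packet costs Andi points to are not captured here.)

#include <stdio.h>

int main(void)
{
    int hdr_bytes     = 14 + 20 + 20;      /* Ethernet + IP + TCP headers    */
    int hdrs_avoided  = 16;                /* per TSO burst, as stated above */
    int payload_bytes = 64 * 1024;         /* ~64KB of payload per burst     */

    int saved = hdrs_avoided * hdr_bytes;  /* 864 bytes */
    printf("~%d header bytes saved per %d payload bytes (%.2f%% of the wire)\n",
           saved, payload_bytes, 100.0 * saved / payload_bytes);
    return 0;
}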

 
 But even if RDMA increases performance 100 fold, it still doesn't
 avoid the issue that it doesn't fit in with the rest of the networking
 stack and feature set.
 
 Any monkey can change the rules around (ok I can make it go fast as
 long as you don't need firewalling, packet scheduling, classification,
 and you only need to talk to specific systems that speak this same
 special protocol) to make things go faster.  On the other hand well
 designed solutions can give performance gains within the constraints
 of the full system design and without sacrificing functionality.
While I believe that you should give people an option to get 'high
performance' _instead_ of other features and let them choose whatever
they care about, I really do agree with what you're saying and believe
that offload devices _should_ be integrated with the facilities that you
mention (in fact, offload can do a much better job at lots of things
that you mention ;) ... but you're not letting offload devices integrate
and you're slowing down innovation in this field.



RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

2007-08-19 Thread Felix Marti


 -Original Message-
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Andi Kleen
 Sent: Sunday, August 19, 2007 4:28 PM
 To: Felix Marti
 Cc: David Miller; [EMAIL PROTECTED]; netdev@vger.kernel.org;
 [EMAIL PROTECTED]; [EMAIL PROTECTED];
 [EMAIL PROTECTED]
 Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate
 PS_TCP ports from the host TCP port space.
 
 Felix Marti [EMAIL PROTECTED] writes:
 
  what benefits does the TSO infrastructure give the
  non-TSO capable devices?
 
 It improves performance on software queueing devices between guests
 and hypervisors. This is a more and more important application these
 days.  Even when the system running the Hypervisor has a non-TSO-capable
 device in the end, it'll still save CPU cycles this way. Right now
 virtualized IO tends to be much more CPU intensive than direct IO, so any
 help it can get is beneficial.
 
 It also makes loopback faster, although given that's probably not that
 useful.
 
 And a lot of the TSO infrastructure was needed for zero-copy TX anyway,
 which benefits most reasonably modern NICs (anything with hardware
 checksumming).
Hi Andi, yes, you're right. I should have chosen my example more
carefully.

 
 -Andi