Re: [openib-general] Re: TSO and IPoIB performance degradation

2006-05-02 Thread Troy Benjegerdes
On Thu, Apr 27, 2006 at 05:16:29PM -0700, Greg Lindahl wrote:
> On Thu, Apr 27, 2006 at 04:22:40PM -0700, Grant Grundler wrote:
> 
> > Anything preventing such a gateway from routing SDP to ethernet?
> > Those gateways obviously will grok IB protocols.
> > I'm asking because I don't understand/know if there is a real
> > barrier to an IB -> ethernet gateway _without_ IPoIB.
> 
> I don't know if an SDP to ethernet gateway even exists, but I do know
> that it's a lot more work than just an IPoIB to ethernet gateway --
> the gateway is going to have to pass all its data through a TCP stack.
> So I would expect SDP to ethernet to not run very fast, especially on
> a gateway with lots of streams going.

And this is exactly the reason that we should not be playing games with
"InfiniBand-specific" TCP optimizations. If you stay on the IB network,
use SDP or verbs. If you are going to cross networks, you want to be
running the full host TCP stack that has been well tested and is robust
to all the kinds of failures you see crossing networks. This does not
mean that it won't be fast, but you *will* have more overhead than on a
single network fabric.

If someone has a configuration where full TCP processing on
the host is a bottleneck and not the IPoIB to ethernet gateway, then
let's have this discussion again. But I don't believe such a
configuration actually exists anywhere. If you think you have some
problem like this, I would love to be able to run some benchmarks on the
system.


Re: [openib-general] Re: TSO and IPoIB performance degradation

2006-04-27 Thread Greg Lindahl
On Thu, Apr 27, 2006 at 04:22:40PM -0700, Grant Grundler wrote:

> Anything preventing such a gateway from routing SDP to ethernet?
> Those gateways obviously will grok IB protocols.
> I'm asking because I don't understand/know if there is a real
> barrier to an IB -> ethernet gateway _without_ IPoIB.

I don't know if an SDP to ethernet gateway even exists, but I do know
that it's a lot more work than just an IPoIB to ethernet gateway --
the gateway is going to have to pass all its data through a TCP stack.
So I would expect SDP to ethernet to not run very fast, especially on
a gateway with lots of streams going.

-- greg



Re: [openib-general] Re: TSO and IPoIB performance degradation

2006-04-27 Thread Grant Grundler
On Thu, Apr 27, 2006 at 12:23:52AM -0700, Greg Lindahl wrote:
> On Wed, Apr 26, 2006 at 11:13:24PM -0500, Troy Benjegerdes wrote:
> 
> > David is right. If you care about performance, you are already using SDP
> > or verbs layer for the transport anyway. If I am going to be doing IPoIB,
> > it's because eventually I expect the packet might get off the IB network
> > and onto some other network and go halfway across the country.
> 
> This is going to be a surprise to lots of people who want high-speed
> gateways from IB to ethernet -- many clusters connect to fileservers
> and other performance-sensitive gizmos that way.

Anything preventing such a gateway from routing SDP to ethernet?
Those gateways obviously will grok IB protocols.
I'm asking because I don't understand/know if there is a real
barrier to an IB -> ethernet gateway _without_ IPoIB.

thanks,
grant


Re: [openib-general] Re: TSO and IPoIB performance degradation

2006-04-27 Thread Greg Lindahl
On Wed, Apr 26, 2006 at 11:13:24PM -0500, Troy Benjegerdes wrote:

> David is right. If you care about performance, you are already using SDP
> or verbs layer for the transport anyway. If I am going to be doing IPoIB,
> it's because eventually I expect the packet might get off the IB network
> and onto some other network and go halfway across the country.

This is going to be a surprise to lots of people who want high-speed
gateways from IB to ethernet -- many clusters connect to fileservers
and other performance-sensitive gizmos that way.

-- greg


[openib-general] Re: TSO and IPoIB performance degradation

2006-04-26 Thread Troy Benjegerdes
On Mon, Mar 20, 2006 at 02:37:04AM -0800, David S. Miller wrote:
> From: "Michael S. Tsirkin" <[EMAIL PROTECTED]>
> Date: Mon, 20 Mar 2006 12:22:34 +0200
> 
> > Quoting r. David S. Miller <[EMAIL PROTECTED]>:
> > > The path an SKB can take is opaque and unknown until the very last
> > > moment it is actually given to the device transmit function.
> > 
> > Why, I was proposing looking at dst cache. If that's NULL, well,
> > we won't stretch ACKs. Worst case we apply the wrong optimization.
> > Right?
> 
> Where you receive a packet from isn't very useful for determining
> even the full path on which that packet itself flowed.
> 
> More importantly, packets also do not necessarily go back out over the
> same path on which packets are received for a connection.  This is
> actually quite common.
> 
> Maybe packets for this connection come in via IPoIB but go out via
> gigabit ethernet and another route altogether.
> 
> > What I'd like to clarify, however: RFC 2581 explicitly states that in
> > some cases it might be OK to generate ACKs less frequently than
> > every second full-sized segment. Given Matt's measurements, TCP on
> > top of IP over InfiniBand on Linux seems to hit one of these cases.
> > Do you agree to that?
> 
> I disagree with Linux changing its behavior.  It would be great to
> turn off congestion control completely over local gigabit networks,
> but that isn't determinable in any way, so we don't do that.
> 
> The IPoIB situation is no different: you can set all the bits you want
> in incoming packets; the barrier to doing this remains the same.
> 
> It hurts performance if any packet drop occurs because it will require
> an extra round trip for recovery to begin to be triggered at the
> sender.
> 
> The network is a black box: routes to and from a destination are
> arbitrary, and so are packet rewriting and reflection, so being able to
> say "this all occurs on IPoIB" is simply infeasible.
> 
> I don't know how else to say this: we simply cannot special-case IPoIB
> or any other topology type.

David is right. If you care about performance, you are already using SDP
or verbs layer for the transport anyway. If I am going to be doing IPoIB,
it's because eventually I expect the packet might get off the IB network
and onto some other network and go halfway across the country.



[openib-general] Re: TSO and IPoIB performance degradation

2006-03-20 Thread David S. Miller
From: Benjamin LaHaise <[EMAIL PROTECTED]>
Date: Mon, 20 Mar 2006 10:09:42 -0500

> Wouldn't it make sense to stretch the ACK when the previous ACK is still in 
> the TX queue of the device?  I know that sort of behaviour was always an 
> issue on modem links where you don't want to send out redundant ACKs.

I thought about doing some similar trick with TSO, wherein we would
not defer a TSO send if all the previous packets sent are out of the
device transmit queue.  The idea was to prevent the pipe from ever
emptying, which is the danger of deferring too much for TSO.

This has several problems.  It's hard to implement.  You have to
decide if you want precise state, which means checking the TX
descriptors.  Or you go for imprecise but easier-to-implement state
(so imprecise that it is not very useful), by just checking the SKB
refcount or similar, which means you only find out the packet has left
the TX queue after the TX purge interrupt.  That can be a long time
after the event, and by then the pipe has emptied, which is exactly
what you were trying to prevent.

Lastly, you don't want to touch remote CPU state, which is what such
a hack is going to end up doing much of the time.
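For illustration only, and not code from the thread: a minimal sketch of the
"imprecise" refcount check described above, assuming a 2.6-era sk_buff (an
atomic users count) and assuming the sender keeps its own extra reference to
the last skb it handed to the device. The helper names are invented.

/*
 * Sketch: keep an extra reference to the last skb given to the device and
 * infer that it has left the TX queue once every other holder (qdisc,
 * driver) has dropped its reference.  As noted above, that only happens
 * after the TX purge/completion interrupt, which is why this signal is
 * imprecise and often far too late.
 */
#include <linux/skbuff.h>

static struct sk_buff *last_tx_skb;     /* our extra reference, if any */

/* Call just before handing skb to dev_queue_xmit(). */
static void remember_tx_skb(struct sk_buff *skb)
{
        if (last_tx_skb)
                kfree_skb(last_tx_skb);         /* drop previous extra ref */
        last_tx_skb = skb_get(skb);             /* take our own reference */
}

/* Returns nonzero once only our reference remains. */
static int last_tx_skb_has_left_queue(void)
{
        return last_tx_skb && atomic_read(&last_tx_skb->users) == 1;
}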


[openib-general] Re: TSO and IPoIB performance degradation

2006-03-20 Thread Rick Jones
Wouldn't it make sense to stretch the ACK when the previous ACK is still in 
the TX queue of the device?  I know that sort of behaviour was always an 
issue on modem links where you don't want to send out redundant ACKs.


Perhaps, but it isn't clear that it would be worth the cycles to check.
I doubt that a simple reference count on the ACK skb would do it, since
if it were a bare ACK I doubt that TCP keeps a reference to the skb in
the first place.


Also, what would be the "trigger" to send the next ACK after the 
previous one had left the building (Elvis-like)?  Receipt of N in-order 
segments?  A timeout?


If you are going to go ahead and try to do stretch-ACKs, then I suspect 
the way to go about doing it is to have it behave very much like HP-UX 
or Solaris, both of which have arguably reasonable ACK-avoidance 
heuristics in them.


But don't try to do it quick and dirty.

rick "likes ACK avoidance, just check the archives" jones
on netdev, no need to cc me directly


[openib-general] Re: TSO and IPoIB performance degradation

2006-03-20 Thread Benjamin LaHaise
On Mon, Mar 20, 2006 at 02:04:07PM +0200, Michael S. Tsirkin wrote:
> does not stretch ACKs anymore. RFC 2581 does mention that it might be OK to
> stretch ACKs "after careful consideration", and we are seeing that it helps
> IP over InfiniBand, so recent Linux kernels perform worse in that respect.
> 
> And since there does not seem to be a way to figure out automagically when
> doing this is a good idea, I proposed adding some kind of knob that will
> let the user apply the consideration for us.

Wouldn't it make sense to stretch the ACK when the previous ACK is still in 
the TX queue of the device?  I know that sort of behaviour was always an 
issue on modem links where you don't want to send out redundant ACKs.

-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[EMAIL PROTECTED]>.


[openib-general] Re: TSO and IPoIB performance degradation

2006-03-20 Thread Michael S. Tsirkin
Quoting Arjan van de Ven <[EMAIL PROTECTED]>:
> > I read it as if he was proposing to have a sysctl knob to turn off
> > TCP congestion control completely (which has so many issues it's not
> > even funny.)
> 
> owww that's so bad I didn't even consider that

No, I think that comment was taken out of thread context. We were talking about
stretching ACKs - while avoiding stretch ACKs is important for TCP congestion
control, it's not the only mechanism.

-- 
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies


[openib-general] Re: TSO and IPoIB performance degradation

2006-03-20 Thread Michael S. Tsirkin
Quoting r. Lennert Buytenhek <[EMAIL PROTECTED]>:
> > > > I disagree with Linux changing its behavior.  It would be great to
> > > > turn off congestion control completely over local gigabit networks,
> > > > but that isn't determinable in any way, so we don't do that.
> > > 
> > > Interesting. Would it make sense to make it another tunable knob in
> > > /proc, sysfs or sysctl then?
> > 
> > that's not the right level, since that is per interface. And you only
> > know the actual interface way too late (as per earlier posts).
> > Per socket... maybe.
> > But then again it's not impossible to have packets for one socket go out
> > to multiple interfaces
> > (think load-balancing bonding over 2 interfaces, one IB and another
> > ethernet).
> 
> I read it as if he was proposing to have a sysctl knob to turn off
> TCP congestion control completely (which has so many issues it's not
> even funny.)

Not really, that was David :)

What started this thread was the fact that since 2.6.11 Linux
does not stretch ACKs anymore. RFC 2581 does mention that it might be OK to
stretch ACKs "after careful consideration", and we are seeing that it helps
IP over InfiniBand, so recent Linux kernels perform worse in that respect.

And since there does not seem to be a way to figure out automagically when
doing this is a good idea, I proposed adding some kind of knob that will let
the user apply the consideration for us.

-- 
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies


[openib-general] Re: TSO and IPoIB performance degradation

2006-03-20 Thread Arjan van de Ven
On Mon, 2006-03-20 at 12:49 +0100, Lennert Buytenhek wrote:
> On Mon, Mar 20, 2006 at 12:47:03PM +0100, Arjan van de Ven wrote:
> 
> > > > I disagree with Linux changing its behavior.  It would be great to
> > > > turn off congestion control completely over local gigabit networks,
> > > > but that isn't determinable in any way, so we don't do that.
> > > 
> > > Interesting. Would it make sense to make it another tunable knob in
> > > /proc, sysfs or sysctl then?
> > 
> > that's not the right level, since that is per interface. And you only
> > know the actual interface way too late (as per earlier posts).
> > Per socket... maybe.
> > But then again it's not impossible to have packets for one socket go out
> > to multiple interfaces
> > (think load-balancing bonding over 2 interfaces, one IB and another
> > ethernet).
> 
> I read it as if he was proposing to have a sysctl knob to turn off
> TCP congestion control completely (which has so many issues it's not
> even funny.)

owww that's so bad I didn't even consider that



[openib-general] Re: TSO and IPoIB performance degradation

2006-03-20 Thread Lennert Buytenhek
On Mon, Mar 20, 2006 at 12:47:03PM +0100, Arjan van de Ven wrote:

> > > I disagree with Linux changing its behavior.  It would be great to
> > > turn off congestion control completely over local gigabit networks,
> > > but that isn't determinable in any way, so we don't do that.
> > 
> > Interesting. Would it make sense to make it another tunable knob in
> > /proc, sysfs or sysctl then?
> 
> that's not the right level, since that is per interface. And you only
> know the actual interface way too late (as per earlier posts).
> Per socket... maybe.
> But then again it's not impossible to have packets for one socket go out
> to multiple interfaces
> (think load-balancing bonding over 2 interfaces, one IB and another
> ethernet).

I read it as if he was proposing to have a sysctl knob to turn off
TCP congestion control completely (which has so many issues it's not
even funny.)


[openib-general] Re: TSO and IPoIB performance degradation

2006-03-20 Thread Arjan van de Ven
On Mon, 2006-03-20 at 13:27 +0200, Michael S. Tsirkin wrote:
> Quoting David S. Miller <[EMAIL PROTECTED]>:
> > I disagree with Linux changing its behavior.  It would be great to
> > turn off congestion control completely over local gigabit networks,
> > but that isn't determinable in any way, so we don't do that.
> 
> Interesting. Would it make sense to make it another tunable knob in
> /proc, sysfs or sysctl then?

that's not the right level, since that is per interface. And you only
know the actual interface way too late (as per earlier posts).
Per socket... maybe.
But then again it's not impossible to have packets for one socket go out
to multiple interfaces
(think load-balancing bonding over 2 interfaces, one IB and another
ethernet).




[openib-general] Re: TSO and IPoIB performance degradation

2006-03-20 Thread Michael S. Tsirkin
Quoting David S. Miller <[EMAIL PROTECTED]>:
> I disagree with Linux changing its behavior.  It would be great to
> turn off congestion control completely over local gigabit networks,
> but that isn't determinable in any way, so we don't do that.

Interesting. Would it make sense to make it another tunable knob in
/proc, sysfs or sysctl then?

-- 
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies


[openib-general] Re: TSO and IPoIB performance degradation

2006-03-20 Thread David S. Miller
From: "Michael S. Tsirkin" <[EMAIL PROTECTED]>
Date: Mon, 20 Mar 2006 12:22:34 +0200

> Quoting r. David S. Miller <[EMAIL PROTECTED]>:
> > The path an SKB can take is opaque and unknown until the very last
> > moment it is actually given to the device transmit function.
> 
> Why, I was proposing looking at dst cache. If that's NULL, well,
> we won't stretch ACKs. Worst case we apply the wrong optimization.
> Right?

Where you receive a packet from isn't very useful for determining
even the full path on which that packet itself flowed.

More importantly, packets also do not necessarily go back out over the
same path on which packets are received for a connection.  This is
actually quite common.

Maybe packets for this connection come in via IPoIB but go out via
gigabit ethernet and another route altogether.

> What I'd like to clarify, however: RFC 2581 explicitly states that in
> some cases it might be OK to generate ACKs less frequently than
> every second full-sized segment. Given Matt's measurements, TCP on
> top of IP over InfiniBand on Linux seems to hit one of these cases.
> Do you agree to that?

I disagree with Linux changing its behavior.  It would be great to
turn off congestion control completely over local gigabit networks,
but that isn't determinable in any way, so we don't do that.

The IPoIB situation is no different: you can set all the bits you want
in incoming packets; the barrier to doing this remains the same.

It hurts performance if any packet drop occurs because it will require
an extra round trip for recovery to begin to be triggered at the
sender.

The network is a black box: routes to and from a destination are
arbitrary, and so are packet rewriting and reflection, so being able to
say "this all occurs on IPoIB" is simply infeasible.

I don't know how else to say this: we simply cannot special-case IPoIB
or any other topology type.


[openib-general] Re: TSO and IPoIB performance degradation

2006-03-20 Thread Michael S. Tsirkin
Quoting r. David S. Miller <[EMAIL PROTECTED]>:
> The path an SKB can take is opaque and unknown until the very last
> moment it is actually given to the device transmit function.

Why, I was proposing looking at dst cache. If that's NULL, well,
we won't stretch ACKs. Worst case we apply the wrong optimization.
Right?

> People need to get the "special case this topology" ideas out of their
> heads. :-)

Okay, I get that.

What I'd like to clarify, however: RFC 2581 explicitly states that in some cases
it might be OK to generate ACKs less frequently than every second full-sized
segment. Given Matt's measurements, TCP on top of IP over InfiniBand on Linux
seems to hit one of these cases.  Do you agree to that?

-- 
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies


[openib-general] Re: TSO and IPoIB performance degradation

2006-03-20 Thread David S. Miller
From: "Michael S. Tsirkin" <[EMAIL PROTECTED]>
Date: Mon, 20 Mar 2006 11:06:29 +0200

> Is it the case then that this requirement is less essential on
> networks such as IP over InfiniBand, which are very low latency
> and essentially lossless (with explicit congestion notifications
> in hardware)?

You can never assume any attribute of the network whatsoever.
Even if initially the outgoing device is IPoIB, something in
the middle, like a traffic classification or netfilter rule,
could rewrite the packet and make it go somewhere else.

This even applies to loopback packets, because packets can
get rewritten and redirected even once they are passed in
via netif_receive_skb().

> And as Matt Leininger's research appears to show, stretch ACKs
> are good for performance in case of IP over InfiniBand.
>
> Given all this, would it make sense to add a per-netdevice (or per-neighbour)
> flag to re-enable the trick for these net devices (as was done before
> 314324121f9b94b2ca657a494cf2b9cb0e4a28cc)?
> IP over InfiniBand driver would then simply set this flag.

See above, this is not feasible.

The path an SKB can take is opaque and unknown until the very last
moment it is actually given to the device transmit function.

People need to get the "special case this topology" ideas out of their
heads. :-)



[openib-general] Re: TSO and IPoIB performance degradation

2006-03-20 Thread Michael S. Tsirkin
Quoting r. David S. Miller <[EMAIL PROTECTED]>:
> > well, there are stacks which do "stretch acks" (after a fashion) that 
> > make sure when they see packet loss to "do the right thing" wrt sending 
> > enough acks to allow cwnds to open again in a timely fashion.
> 
> Once a loss happens, it's too late to stop doing the stretch ACKs, the
> damage is done already.  It is going to take you at least one
> extra RTT to recover from the loss compared to if you were not doing
> stretch ACKs.
> 
> You have to keep giving consistent well spaced ACKs back to the
> receiver in order to recover from loss optimally.

Is it the case then that this requirement is less essential on
networks such as IP over InfiniBand, which are very low latency
and essentially lossless (with explicit congestion notifications
in hardware)?

> The ACK every 2 full sized frames behavior of TCP is absolutely
> essential.

Interestingly, I was pointed towards the following RFC draft
http://www.ietf.org/internet-drafts/draft-ietf-tcpm-rfc2581bis-00.txt

The requirement that an ACK "SHOULD" be generated for at least every
second full-sized segment is listed in [RFC1122] in one place as a
SHOULD and another as a MUST.  Here we unambiguously state it is a
SHOULD.  We also emphasize that this is a SHOULD, meaning that an
implementor should indeed only deviate from this requirement after
careful consideration of the implications.

And as Matt Leininger's research appears to show, stretch ACKs
are good for performance in case of IP over InfiniBand.

Given all this, would it make sense to add a per-netdevice (or per-neighbour)
flag to re-enable the trick for these net devices (as was done before
314324121f9b94b2ca657a494cf2b9cb0e4a28cc)?
IP over InfiniBand driver would then simply set this flag.

David, would you accept such a patch? It would be nice to get 2.6.17
back to within at least 10% of 2.6.11.

-- 
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies
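To make the proposal concrete, a hypothetical sketch (not an existing kernel
API, and not code from the thread): a per-netdevice feature bit that a
link-layer driver such as IPoIB could set. The NETIF_F_STRETCH_ACK bit and
both helper functions are invented purely for illustration.

#include <linux/netdevice.h>

#define NETIF_F_STRETCH_ACK  (1 << 20)          /* invented feature bit */

/* The IPoIB driver would opt in while setting up its net_device ... */
static void ipoib_allow_stretch_acks(struct net_device *dev)
{
        dev->features |= NETIF_F_STRETCH_ACK;
}

/* ... and the TCP receive path could consult the flag before delaying an
 * ACK beyond the usual two-full-sized-segments rule. */
static int dev_allows_stretch_acks(const struct net_device *dev)
{
        return (dev->features & NETIF_F_STRETCH_ACK) != 0;
}

David's reply earlier in this archive rejects exactly this kind of flag: the
device a packet will actually leave through is not known until the skb is
handed to the device transmit function, so a per-netdevice hint cannot be
trusted at the point where TCP decides whether to delay the ACK.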


[openib-general] Re: TSO and IPoIB performance degradation

2006-03-10 Thread Rick Jones

David S. Miller wrote:

From: Rick Jones <[EMAIL PROTECTED]>
Date: Thu, 09 Mar 2006 16:21:05 -0800


well, there are stacks which do "stretch acks" (after a fashion) that 
make sure when they see packet loss to "do the right thing" wrt sending 
enough acks to allow cwnds to open again in a timely fashion.



Once a loss happens, it's too late to stop doing the stretch ACKs, the
damage is done already.  It is going to take you at least one
extra RTT to recover from the loss compared to if you were not doing
stretch ACKs.


I must be dense (entirely possible), but how is that absolute?

If there is no more data in flight after the segment that was lost, the
"stretch ACK" stacks with which I'm familiar will generate the
standalone ACK within the deferred ACK interval (50 milliseconds). I
guess that can be the "one extra RTT".  However, if there is data in
flight after the point of loss, the immediate ACK upon receipt of
out-of-order data kicks in.



You have to keep giving consistent well spaced ACKs back to the
receiver in order to recover from loss optimally.


The key there is defining consistent and well spaced.  Certainly an ACK 
only after a window's worth of data would not be well spaced, but I 
believe that an ACK after more than two full sized frames could indeed 
be well-spaced.



The ACK every 2 full sized frames behavior of TCP is absolutely
essential.


I don't think it is _quite_ that cut and dried; otherwise HP-UX and
Solaris would have had big-time problems since before 1997.


rick jones


[openib-general] Re: TSO and IPoIB performance degradation

2006-03-09 Thread David S. Miller
From: Rick Jones <[EMAIL PROTECTED]>
Date: Thu, 09 Mar 2006 16:21:05 -0800

> well, there are stacks which do "stretch acks" (after a fashion) that 
> make sure when they see packet loss to "do the right thing" wrt sending 
> enough acks to allow cwnds to open again in a timely fashion.

Once a loss happens, it's too late to stop doing the stretch ACKs, the
damage is done already.  It is going to take you at least one
extra RTT to recover from the loss compared to if you were not doing
stretch ACKs.

You have to keep giving consistent well spaced ACKs back to the
receiver in order to recover from loss optimally.

The ACK every 2 full sized frames behavior of TCP is absolutely
essential.


[openib-general] Re: TSO and IPoIB performance degradation

2006-03-09 Thread David S. Miller
From: "Michael S. Tsirkin" <[EMAIL PROTECTED]>
Date: Fri, 10 Mar 2006 02:10:31 +0200

> But with the change we are discussing, could an ack now be sent even
> sooner than we have at least two full sized segments?  Or does
> __tcp_ack_snd_check delay until we have at least two full sized
> segments? David, could you explain please?

__tcp_ack_snd_check() delays until we have at least two full
sized segments.


[openib-general] Re: TSO and IPoIB performance degradation

2006-03-09 Thread Michael S. Tsirkin
Quoting r. Michael S. Tsirkin <[EMAIL PROTECTED]>:
> Or does __tcp_ack_snd_check delay until we have at least two full sized
> segments?

What I'm trying to say is: since RFC 2525, section 2.13 talks about
"every second full-sized segment", then following the code in
__tcp_ack_snd_check, why does it do

/* More than one full frame received... */
if (((tp->rcv_nxt - tp->rcv_wup) > inet_csk(sk)->icsk_ack.rcv_mss

rather than

/* At least two full frames received... */
if (((tp->rcv_nxt - tp->rcv_wup) >= 2 * inet_csk(sk)->icsk_ack.rcv_mss

-- 
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies
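To make the difference concrete, a small userspace sketch (not from the
thread), assuming an rcv_mss of 1448 bytes as in the tcpdump trace quoted
elsewhere in this thread: the existing "> rcv_mss" test fires as soon as one
full frame plus a single extra byte is pending, while ">= 2 * rcv_mss" would
wait for two complete frames.

#include <stdio.h>

int main(void)
{
    const unsigned int rcv_mss = 1448;
    /* "unacked" plays the role of tp->rcv_nxt - tp->rcv_wup */
    const unsigned int samples[] = { 1448, 1449, 2172, 2895, 2896 };
    unsigned int i;

    for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
        unsigned int unacked = samples[i];

        printf("unacked=%4u  more-than-one-frame=%d  at-least-two-frames=%d\n",
               unacked, unacked > rcv_mss, unacked >= 2 * rcv_mss);
    }
    return 0;
}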


[openib-general] Re: TSO and IPoIB performance degradation

2006-03-09 Thread Rick Jones

David S. Miller wrote:

From: "Michael S. Tsirkin" <[EMAIL PROTECTED]>
Date: Wed, 8 Mar 2006 14:53:11 +0200



What I was trying to figure out was, how can we re-enable the trick
without hurting TSO? Could a solution be to simply look at the frame
size, and call tcp_send_delayed_ack if the frame size is small?



The change is really not related to TSO.

By reverting it, you are reducing the number of ACKs on the wire, and
the number of context switches at the sender to push out new data.
That's why it can make things go faster, but it also leads to bursty
TCP sender behavior, which is bad for congestion on the internet.


naughty naughty Solaris and HP-UX TCP :)



When the receiver has a strong cpu and can keep up with the incoming
packet rate very well and we are in an environment with no congestion,
the old code helps a lot.  But if the receiver is cpu limited or we
have congestion of any kind, it does exactly the wrong thing.  It will
delay ACKs a very long time to the point where the pipe is depleted
and this kills performance in that case.  For congested environments,
due to the decreased ACK feedback, packet loss recovery will be
extremely poor.  This is the first reason behind my change.


well, there are stacks which do "stretch acks" (after a fashion) that 
make sure when they see packet loss to "do the right thing" wrt sending 
enough acks to allow cwnds to open again in a timely fashion.


That brings back all that stuff I posted ages ago about the performance
delta when using an HP-UX receiver and altering the number of segments
per ACK.  It should be in the netdev archive somewhere.


might have been around the time of the discussions about MacOS and its 
ack avoidance - which wasn't done very well at the time.





The behavior is also specifically frowned upon in the TCP implementor
community.  It is specifically mentioned in the Known TCP
Implementation Problems RFC2525, in section 2.13 "Stretch ACK
violation".

The entry, quoted below for reference, is very clear on the reasons
why stretch ACKs are bad.  And although it may help performance for
your case, in congested environments and also with cpu limited
receivers it will have a negative impact on performance.  So, this was
the second reason why I made this change.


I would have thought that a receiver "stretching ACKs" would be helpful 
when it was CPU limited since it was spending fewer CPU cycles 
generating ACKs?




So reverting the change isn't really an option.

   Name of Problem
  Stretch ACK violation

   Classification
  Congestion Control/Performance

   Description
  To improve efficiency (both computer and network) a data receiver
  may refrain from sending an ACK for each incoming segment,
  according to [RFC1122].  However, an ACK should not be delayed an
  inordinate amount of time.  Specifically, ACKs SHOULD be sent for
  every second full-sized segment that arrives.  If a second full-
  sized segment does not arrive within a given timeout (of no more
  than 0.5 seconds), an ACK should be transmitted, according to
  [RFC1122].  A TCP receiver which does not generate an ACK for
  every second full-sized segment exhibits a "Stretch ACK
  Violation".


How can it be a "violation" of a SHOULD?-)



   Significance
  TCP receivers exhibiting this behavior will cause TCP senders to
  generate burstier traffic, which can degrade performance in
  congested environments.  In addition, generating fewer ACKs
  increases the amount of time needed by the slow start algorithm to
  open the congestion window to an appropriate point, which
  diminishes performance in environments with large bandwidth-delay
  products.  Finally, generating fewer ACKs may cause needless
  retransmission timeouts in lossy environments, as it increases the
  possibility that an entire window of ACKs is lost, forcing a
  retransmission timeout.


Of those three, I think the most meaningful is the second, which can be 
dealt with by smarts in the ACK-stretching receiver.


For the first, it will only degrade performance if it triggers packet loss.

I'm not sure I've ever seen the third item happen.



   Implications
  When not in loss recovery, every ACK received by a TCP sender
  triggers the transmission of new data segments.  The burst size is
  determined by the number of previously unacknowledged segments
  each ACK covers.  Therefore, a TCP receiver ack'ing more than 2
  segments at a time causes the sending TCP to generate a larger
  burst of traffic upon receipt of the ACK.  This large burst of
  traffic can overwhelm an intervening gateway, leading to higher
  drop rates for both the connection and other connections passing
  through the congested gateway.


Doesn't RED mean that those other connections are rather less likely to 
be affected?




  In addition, the TCP slow start algorithm increases the congestion

[openib-general] Re: TSO and IPoIB performance degradation

2006-03-09 Thread Michael S. Tsirkin
Quoting David S. Miller <[EMAIL PROTECTED]>:
>Description
>   To improve efficiency (both computer and network) a data receiver
>   may refrain from sending an ACK for each incoming segment,
>   according to [RFC1122].  However, an ACK should not be delayed an
>   inordinate amount of time.  Specifically, ACKs SHOULD be sent for
>   every second full-sized segment that arrives.  If a second full-
>   sized segment does not arrive within a given timeout (of no more
>   than 0.5 seconds), an ACK should be transmitted, according to
>   [RFC1122].  A TCP receiver which does not generate an ACK for
>   every second full-sized segment exhibits a "Stretch ACK
>   Violation".

Thanks very much for the info!

So the longest we can delay, according to this spec, is until we have two full
sized segments.

But with the change we are discussing, could an ack now be sent even sooner than
we have at least two full sized segments?  Or does __tcp_ack_snd_check delay
until we have at least two full sized segments? David, could you explain please?

-- 
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies


[openib-general] Re: TSO and IPoIB performance degradation

2006-03-09 Thread David S. Miller
From: "Michael S. Tsirkin" <[EMAIL PROTECTED]>
Date: Wed, 8 Mar 2006 14:53:11 +0200

> What I was trying to figure out was, how can we re-enable the trick
> without hurting TSO? Could a solution be to simply look at the frame
> size, and call tcp_send_delayed_ack if the frame size is small?

The change is really not related to TSO.

By reverting it, you are reducing the number of ACKs on the wire, and
the number of context switches at the sender to push out new data.
That's why it can make things go faster, but it also leads to bursty
TCP sender behavior, which is bad for congestion on the internet.

When the receiver has a strong cpu and can keep up with the incoming
packet rate very well and we are in an environment with no congestion,
the old code helps a lot.  But if the receiver is cpu limited or we
have congestion of any kind, it does exactly the wrong thing.  It will
delay ACKs a very long time to the point where the pipe is depleted
and this kills performance in that case.  For congested environments,
due to the decreased ACK feedback, packet loss recovery will be
extremely poor.  This is the first reason behind my change.

The behavior is also specifically frowned upon in the TCP implementor
community.  It is specifically mentioned in the Known TCP
Implementation Problems RFC2525, in section 2.13 "Stretch ACK
violation".

The entry, quoted below for reference, is very clear on the reasons
why stretch ACKs are bad.  And although it may help performance for
your case, in congested environments and also with cpu limited
receivers it will have a negative impact on performance.  So, this was
the second reason why I made this change.

So reverting the change isn't really an option.

   Name of Problem
  Stretch ACK violation

   Classification
  Congestion Control/Performance

   Description
  To improve efficiency (both computer and network) a data receiver
  may refrain from sending an ACK for each incoming segment,
  according to [RFC1122].  However, an ACK should not be delayed an
  inordinate amount of time.  Specifically, ACKs SHOULD be sent for
  every second full-sized segment that arrives.  If a second full-
  sized segment does not arrive within a given timeout (of no more
  than 0.5 seconds), an ACK should be transmitted, according to
  [RFC1122].  A TCP receiver which does not generate an ACK for
  every second full-sized segment exhibits a "Stretch ACK
  Violation".

   Significance
  TCP receivers exhibiting this behavior will cause TCP senders to
  generate burstier traffic, which can degrade performance in
  congested environments.  In addition, generating fewer ACKs
  increases the amount of time needed by the slow start algorithm to
  open the congestion window to an appropriate point, which
  diminishes performance in environments with large bandwidth-delay
  products.  Finally, generating fewer ACKs may cause needless
  retransmission timeouts in lossy environments, as it increases the
  possibility that an entire window of ACKs is lost, forcing a
  retransmission timeout.

   Implications
  When not in loss recovery, every ACK received by a TCP sender
  triggers the transmission of new data segments.  The burst size is
  determined by the number of previously unacknowledged segments
  each ACK covers.  Therefore, a TCP receiver ack'ing more than 2
  segments at a time causes the sending TCP to generate a larger
  burst of traffic upon receipt of the ACK.  This large burst of
  traffic can overwhelm an intervening gateway, leading to higher
  drop rates for both the connection and other connections passing
  through the congested gateway.

  In addition, the TCP slow start algorithm increases the congestion
  window by 1 segment for each ACK received.  Therefore, increasing
  the ACK interval (thus decreasing the rate at which ACKs are
  transmitted) increases the amount of time it takes slow start to
  increase the congestion window to an appropriate operating point,
  and the connection consequently suffers from reduced performance.
  This is especially true for connections using large windows.

   Relevant RFCs
  RFC 1122 outlines delayed ACKs as a recommended mechanism.

   Trace file demonstrating it
  Trace file taken using tcpdump at host B, the data receiver (and
  ACK originator).  The advertised window (which never changed) and
  timestamp options have been omitted for clarity, except for the
  first packet sent by A:

   12:09:24.820187 A.1174 > B.3999: . 2049:3497(1448) ack 1
   win 33580  [tos 0x8]
   12:09:24.824147 A.1174 > B.3999: . 3497:4945(1448) ack 1
   12:09:24.832034 A.1174 > B.3999: . 4945:6393(1448) ack 1
   12:09:24.83 B.3999 > A.1174: . ack 6393
   12:09:24.934837 A.1174 > B.3999: . 6393:7841(1448) ack 1
   12:09:24.942721 A.1174 > B.3999: . 7841:9289(1448) ack 1
   12:09:24.950605

[openib-general] Re: TSO and IPoIB performance degradation

2006-03-08 Thread David S. Miller
From: "Michael S. Tsirkin" <[EMAIL PROTECTED]>
Date: Wed, 8 Mar 2006 14:53:11 +0200

> What I was trying to figure out was, how can we re-enable the trick without
> hurting TSO? Could a solution be to simply look at the frame size, and call
> tcp_send_delayed_ack if the frame size is small?

The problem is that this patch helps performance when the
receiver is CPU limited.

The old code would delay ACKs forever if the CPU of the
receiver was slow, because we'd wait for all received
packets to be copied into userspace before spitting out
the ACK.  This would allow the pipe to empty, since the
sender is waiting for ACKs in order to send more into
the pipe, and once the ACK did go out it would cause the
sender to emit an enormous burst of data.  Both of these
behaviors are highly frowned upon for a TCP stack.

I'll try to look at this some more later today.


Re: [openib-general] Re: TSO and IPoIB performance degradation

2006-03-07 Thread Roland Dreier
David> I wish you had started the thread by mentioning this
David> specific patch, we wasted an enormous amount of precious
David> developer time speculating and asking for arbitrary tests
David> to be run in order to narrow down the problem, yet you knew
David> the specific change that introduced the performance
David> regression already...

Sorry, you're right.  I was a little confused because I had a memory of
Michael's original email (http://lkml.org/lkml/2006/3/6/150) quoting a
changelog entry, but looking back at the message, it was quoting
something completely different and misleading.

I think the most interesting email in the old thread is
http://openib.org/pipermail/openib-general/2005-October/012482.html
which shows that reverting 314324121 (the "stretch ACK performance
killer" fix) gives ~400 Mbit/sec in extra IPoIB performance.

 - R.


Re: [openib-general] Re: TSO and IPoIB performance degradation

2006-03-07 Thread David S. Miller
From: Roland Dreier <[EMAIL PROTECTED]>
Date: Tue, 07 Mar 2006 17:17:30 -0800

> The reason TSO comes up is that reverting the patch described below
> helps (or helped at some point at least) IPoIB throughput quite a bit.

I wish you had started the thread by mentioning this specific
patch, we wasted an enormous amount of precious developer time
speculating and asking for arbitrary tests to be run in order
to narrow down the problem, yet you knew the specific change
that introduced the performance regression already...

This is a good example of how not to report a bug.


Re: [openib-general] Re: TSO and IPoIB performance degradation

2006-03-07 Thread Roland Dreier
David> How limited are the IPoIB devices, TX descriptor wise?

David> One side effect of the TSO changes is that one extra
David> descriptor will be used for outgoing packets.  This is
David> because we have to put the headers as well as the user
David> data, into page based buffers now.

We have essentially no limit on TX descriptors.  However I think
there's some confusion about TSO: IPoIB does _not_ do TSO -- generic
InfiniBand hardware does not have any TSO capability.  In the future
we might be able to implement TSO for certain hardware that does have
support, but even that requires some firmware help from the
HCA vendors, etc.  So right now the IPoIB driver does not do TSO.

The reason TSO comes up is that reverting the patch described below
helps (or helped at some point at least) IPoIB throughput quite a bit.
Clearly this was a bug fix so we can't revert it in general but I
think what Michael Tsirkin was suggesting at the beginning of this
thread is to do what the last paragraph of the changelog says -- find
some way to re-enable the trick.

diff-tree 3143241... (from e16fa6b...)
Author: David S. Miller <[EMAIL PROTECTED]>
Date:   Mon May 23 12:03:06 2005 -0700

[TCP]: Fix stretch ACK performance killer when doing ucopy.

When we are doing ucopy, we try to defer the ACK generation to
cleanup_rbuf().  This works most of the time very well, but if the
ucopy prequeue is large, this ACKing behavior kills performance.

With TSO, it is possible to fill the prequeue so large that by the
time the ACK is sent and gets back to the sender, most of the window
has emptied of data and performance suffers significantly.

This behavior does help in some cases, so we should think about
re-enabling this trick in the future, using some kind of limit in
order to avoid the bug case.

 - R.


Re: [openib-general] Re: TSO and IPoIB performance degradation

2006-03-07 Thread David S. Miller
From: Matt Leininger <[EMAIL PROTECTED]>
Date: Tue, 07 Mar 2006 16:11:37 -0800

>   I used the standard setting for tcp_rmem and tcp_wmem.   Here are a
> few other runs that change those variables.  I was able to improve
> performance by ~30MB/s to 403 MB/s, but this is still a ways from the
> 474 MB/s before the TSO patches.

How limited are the IPoIB devices, TX descriptor wise?

One side effect of the TSO changes is that one extra descriptor
will be used for outgoing packets.  This is because we have to
put the headers as well as the user data, into page based
buffers now.

Perhaps you can experiment with increasing the transmit descriptor
table size, if that's possible.


Re: [openib-general] Re: TSO and IPoIB performance degradation

2006-03-07 Thread Matt Leininger
On Tue, 2006-03-07 at 13:49 -0800, Stephen Hemminger wrote:
> On Tue, 07 Mar 2006 13:44:51 -0800
> Matt Leininger <[EMAIL PROTECTED]> wrote:
> 
> > On Mon, 2006-03-06 at 19:13 -0800, Shirley Ma wrote:
> > > 
> > > > More likely you are getting hit by the fact that TSO prevents the
> > > > congestion window from increasing properly. This was fixed in 2.6.15
> > > > (around mid of Nov 2005).
> > > 
> > > Yep, I noticed the same problem. After updating to the new kernel, the
> > > performance is much better, but it's still lower than before.
> > 
> >  Here is an updated version of OpenIB IPoIB performance for various
> > kernels with and without one of the TSO patches.  The netperf
> > performance for the latest kernels has not improved the TSO performance
> > drop.
> > 
> >   Any comments or suggestions would be appreciated.
> > 
> >   - Matt
> 
> Configuration information? like did you increase the tcp_rmem, tcp_wmem?
> Tcpdump traces of what is being sent and available window?
> Is IB using NAPI or just doing netif_rx()?

  I used the standard setting for tcp_rmem and tcp_wmem.   Here are a
few other runs that change those variables.  I was able to improve
performance by ~30MB/s to 403 MB/s, but this is still a ways from the
474 MB/s before the TSO patches.

 Thanks,

- Matt

All benchmarks are with RHEL4 x86_64 with HCA FW v4.7.0
dual EM64T 3.2 GHz PCIe IB HCA (memfull)
patch 1 - remove changeset 314324121f9b94b2ca657a494cf2b9cb0e4a28cc
msi_x=1 for all tests

Kernel        OpenIB      netperf (MB/s)
2.6.16-rc5    in-kernel   403
    tcp_wmem 4096 87380 16777216   tcp_rmem 4096 87380 16777216

2.6.16-rc5    in-kernel   395
    tcp_wmem 4096 102400 16777216  tcp_rmem 4096 102400 16777216

2.6.16-rc5    in-kernel   392
    tcp_wmem 4096 65536 16777216   tcp_rmem 4096 87380 16777216

2.6.16-rc5    in-kernel   394
    tcp_wmem 4096 131072 16777216  tcp_rmem 4096 102400 16777216

2.6.16-rc5    in-kernel   377
    tcp_wmem 4096 131072 16777216  tcp_rmem 4096 153600 16777216

2.6.16-rc5    in-kernel   377
    tcp_wmem 4096 131072 16777216  tcp_rmem 4096 131072 16777216

2.6.16-rc5    in-kernel   353
    tcp_wmem 4096 262144 16777216  tcp_rmem 4096 262144 16777216

2.6.16-rc5    in-kernel   305
    tcp_wmem 4096 262144 16777216  tcp_rmem 4096 524288 16777216

2.6.16-rc5    in-kernel   303
    tcp_wmem 4096 131072 16777216  tcp_rmem 4096 524288 16777216

2.6.16-rc5    in-kernel   290
    tcp_wmem 4096 524288 16777216  tcp_rmem 4096 524288 16777216

2.6.16-rc5    in-kernel   367   default tcp values


All with standard tcp settings
Kernel                 OpenIB      netperf (MB/s)
2.6.16-rc5             in-kernel   367
2.6.15                 in-kernel   382
2.6.14-rc4 patch 12    in-kernel   436
2.6.14-rc4 patch 1     in-kernel   434
2.6.14-rc4             in-kernel   385
2.6.14-rc3             in-kernel   374
2.6.13.2               svn3627     386
2.6.13.2 patch 1       svn3627     446
2.6.13.2               in-kernel   394
2.6.13-rc3 patch 12    in-kernel   442
2.6.13-rc3 patch 1     in-kernel   450
2.6.13-rc3             in-kernel   395
2.6.12.5-lustre        in-kernel   399
2.6.12.5 patch 1       in-kernel   464
2.6.12.5               in-kernel   402
2.6.12                 in-kernel   406
2.6.12-rc6 patch 1     in-kernel   470
2.6.12-rc6             in-kernel   407
2.6.12-rc5             in-kernel   405
2.6.12-rc5 patch 1     in-kernel   474
2.6.12-rc4             in-kernel   470
2.6.12-rc3             in-kernel   466
2.6.12-rc2             in-kernel   469
2.6.12-rc1             in-kernel   466
2.6.11                 in-kernel   464
2.6.11                 svn3687     464
2.6.9-11.ELsmp         svn3513     425   (Woody's results, 3.6 GHz EM64T)
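For reference, a small sketch (not part of Matt's setup) of how the
tcp_wmem/tcp_rmem triples in the first table above would typically be applied
before a netperf run. It assumes root privileges and the standard procfs
sysctl paths; the values are the first row of that table.

#include <stdio.h>

static int write_sysctl(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");

    if (!f) {
        perror(path);
        return -1;
    }
    fputs(value, f);
    return fclose(f);
}

int main(void)
{
    int rc = 0;

    rc |= write_sysctl("/proc/sys/net/ipv4/tcp_wmem", "4096 87380 16777216\n");
    rc |= write_sysctl("/proc/sys/net/ipv4/tcp_rmem", "4096 87380 16777216\n");
    return rc ? 1 : 0;
}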




Re: [openib-general] Re: TSO and IPoIB performance degradation

2006-03-07 Thread Michael S. Tsirkin
Quoting r. Stephen Hemminger <[EMAIL PROTECTED]>:
> Is IB using NAPI or just doing netif_rx()?

No, IPoIB doesn't use NAPI.

-- 
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies


Re: [openib-general] Re: TSO and IPoIB performance degradation

2006-03-07 Thread Stephen Hemminger
On Tue, 07 Mar 2006 13:44:51 -0800
Matt Leininger <[EMAIL PROTECTED]> wrote:

> On Mon, 2006-03-06 at 19:13 -0800, Shirley Ma wrote:
> > 
> > > More likely you are getting hit by the fact that TSO prevents the
> > > congestion window from increasing properly. This was fixed in 2.6.15
> > > (around mid of Nov 2005).
> > 
> > Yep, I noticed the same problem. After updating to the new kernel, the
> > performance is much better, but it's still lower than before.
> 
>  Here is an updated version of OpenIB IPoIB performance for various
> kernels with and without one of the TSO patches.  The netperf
> performance for the latest kernels has not improved the TSO performance
> drop.
> 
>   Any comments or suggestions would be appreciated.
> 
>   - Matt

Configuration information? like did you increase the tcp_rmem, tcp_wmem?
Tcpdump traces of what is being sent and available window?
Is IB using NAPI or just doing netif_rx()?


Re: [openib-general] Re: TSO and IPoIB performance degradation

2006-03-07 Thread Matt Leininger
On Mon, 2006-03-06 at 19:13 -0800, Shirley Ma wrote:
> 
> > More likely you are getting hit by the fact that TSO prevents the
> > congestion window from increasing properly. This was fixed in 2.6.15
> > (around mid of Nov 2005).
> 
> Yep, I noticed the same problem. After updating to the new kernel, the
> performance is much better, but it's still lower than before.

 Here is an updated version of OpenIB IPoIB performance for various
kernels with and without one of the TSO patches.  The netperf
performance for the latest kernels has not improved the TSO performance
drop.

  Any comments or suggestions would be appreciated.

  - Matt

> 
All benchmarks are with RHEL4 x86_64 with HCA FW v4.7.0
dual EM64T 3.2 GHz PCIe IB HCA (memfull)
patch 1 - remove changeset 314324121f9b94b2ca657a494cf2b9cb0e4a28cc

Kernel                 OpenIB      msi_x   netperf (MB/s)
2.6.16-rc5             in-kernel   1       367
2.6.15                 in-kernel   1       382
2.6.14-rc4 patch 1     in-kernel   1       434
2.6.14-rc4             in-kernel   1       385
2.6.14-rc3             in-kernel   1       374
2.6.13.2               svn3627     1       386
2.6.13.2 patch 1       svn3627     1       446
2.6.13.2               in-kernel   1       394
2.6.13-rc3 patch 12    in-kernel   1       442
2.6.13-rc3 patch 1     in-kernel   1       450
2.6.13-rc3             in-kernel   1       395
2.6.12.5-lustre        in-kernel   1       399
2.6.12.5 patch 1       in-kernel   1       464
2.6.12.5               in-kernel   1       402
2.6.12                 in-kernel   1       406
2.6.12-rc6 patch 1     in-kernel   1       470
2.6.12-rc6             in-kernel   1       407
2.6.12-rc5             in-kernel   1       405
2.6.12-rc5 patch 1     in-kernel   1       474
2.6.12-rc4             in-kernel   1       470
2.6.12-rc3             in-kernel   1       466
2.6.12-rc2             in-kernel   1       469
2.6.12-rc1             in-kernel   1       466
2.6.11                 in-kernel   1       464
2.6.11                 svn3687     1       464
2.6.9-11.ELsmp         svn3513     1       425   (Woody's results, 3.6 GHz EM64T)




Re: [openib-general] Re: TSO and IPoIB performance degradation

2006-03-06 Thread Shirley Ma

> More likely you are getting hit by the fact that TSO prevents the
> congestion window from increasing properly. This was fixed in 2.6.15
> (around mid of Nov 2005).

Yep, I noticed the same problem. After updating to the new kernel, the
performance is much better, but it's still lower than before.

Thanks,
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638

[openib-general] Re: TSO and IPoIB performance degradation

2006-03-06 Thread Stephen Hemminger
On Tue, 7 Mar 2006 00:34:38 +0200
"Michael S. Tsirkin" <[EMAIL PROTECTED]> wrote:

> Hello, Dave!
> As you might know, the TSO patches merged into mainline kernel
> since 2.6.11 have hurt performance for the simple (non-TSO)
> high-speed netdevice that is the IPoIB driver.
> 
> This was discussed at length here
> http://openib.org/pipermail/openib-general/2005-October/012271.html
> 
> I'm trying to figure out what can be done to improve the situation.
> In particular, I'm looking at the Super TSO patch
> http://oss.sgi.com/archives/netdev/2005-05/msg00889.html
> 
> merged into mainline here
> 
> http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=314324121f9b94b2ca657a494cf2b9cb0e4a28cc
> 
> There, you said:
> 
>   When we do ucopy receive (ie. copying directly to userspace
>   during tcp input processing) we attempt to delay the ACK
>   until cleanup_rbuf() is invoked.  Most of the time this
>   technique works very well, and we emit one ACK advertising
>   the largest window.
> 
>   But this explodes if the ucopy prequeue is large enough.
>   When the receiver is cpu limited and TSO frames are large,
>   the receiver is inundated with ucopy processing, such that
>   the ACK comes out very late.  Often, this is so late that
>   by the time the sender gets the ACK the window has emptied
>   too much to be kept full by the sender.
> 
>   The existing TSO code mostly avoided this by keeping the
>   TSO packets no larger than 1/8 of the available window.
>   But with the new code we can get much larger TSO frames.
> 
> So I'm trying to get a handle on it: could a solution be to simply
> look at the frame size, and call tcp_send_delayed_ack
> if the frame size is no larger than 1/8 of the available window?
> 
> Does this make sense?
> 
> Thanks,
> 
> 


More likely you are getting hit by the fact that TSO prevents the congestion
window from increasing properly. This was fixed in 2.6.15 (around mid of Nov 
2005).


[openib-general] Re: TSO and IPoIB performance degradation

2006-03-06 Thread David S. Miller
From: "Michael S. Tsirkin" <[EMAIL PROTECTED]>
Date: Tue, 7 Mar 2006 00:34:38 +0200

> So I'm trying to get a handle on it: could a solution be to simply
> look at the frame size, and call tcp_send_delayed_ack
> if the frame size is no larger than 1/8 of the available window?
> 
> Does this make sense?

The comment you mention is very old, and no longer applies.

Get full packet traces from the kernel TSO code in the 2.6.x
kernel, analyze them, and post here what you think is occurring
that is causing the performance problems.

One thing to note is that the newer TSO code really needs to
have large socket buffers, so you can experiment with that.
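One way to experiment with large socket buffers, as suggested above, is to
have the test program request them itself. A minimal sketch (assuming
net.core.rmem_max and net.core.wmem_max have been raised so the kernel will
honor a request this large):

#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int bufsz = 16 * 1024 * 1024;   /* 16 MB, matching the 16777216 cap used in the thread */

    if (fd < 0) {
        perror("socket");
        return 1;
    }
    /* The kernel clamps these to net.core.rmem_max / net.core.wmem_max. */
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bufsz, sizeof(bufsz)) < 0)
        perror("SO_RCVBUF");
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bufsz, sizeof(bufsz)) < 0)
        perror("SO_SNDBUF");
    return 0;
}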