Re: Multicast and hardware checksum

2007-06-08 Thread Baruch Even

Herbert Xu wrote:

On Fri, Jun 08, 2007 at 02:02:27PM +0300, Baruch Even wrote:
As far as IGMP and multicast handling goes, everything works; the packets are 
even forwarded over the ppp links, but they arrive at the client with a 
bad checksum. I don't have the trace in front of me, but I believe it was 
the UDP checksum that failed.


What kind of a ppp device is this?

If you run a tcpdump either side of the ppp link do you see the same
UDP checksum value?


This is a pptp link. I've checked the checksum on the receive side; I 
don't know about the sender side, and I'll only be able to try it on Sunday.


Baruch
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Multicast and hardware checksum

2007-06-08 Thread Baruch Even

Herbert Xu wrote:

Baruch Even [EMAIL PROTECTED] wrote:
I have a machine on which I have an application that sends multicast 
through an eth interface with hardware tx checksum enabled. On the same 
machine I have mrouted running, which routes the multicast traffic to a 
set of ppp interfaces. The packets that are received by the client have 
their checksum set to some value which is incorrect. If I disable tx 
checksum on the eth device, the packets arrive with the proper checksum.


Where is the client? On the same machine or behind a PPP link?


The clients are behind the ppp links.

As far as IGMP and multicast handling goes, everything works; the packets are 
even forwarded over the ppp links, but they arrive at the client with a 
bad checksum. I don't have the trace in front of me, but I believe it was 
the UDP checksum that failed.


Baruch




Re: Multicast and hardware checksum

2007-06-08 Thread Baruch Even

Baruch Even wrote:

Herbert Xu wrote:

On Fri, Jun 08, 2007 at 02:02:27PM +0300, Baruch Even wrote:
As far as IGMP and multicast handling goes, everything works; the packets 
are even forwarded over the ppp links, but they arrive at the client 
with a bad checksum. I don't have the trace in front of me, but I 
believe it was the UDP checksum that failed.


What kind of a ppp device is this?

If you run a tcpdump either side of the ppp link do you see the same
UDP checksum value?


This is a pptp link. I've checked the checksum on the receive side; I 
don't know about the sender side, and I'll only be able to try it on Sunday.


For completeness, the clients are Windows XP clients and the server is a 
Linux machine. The tunnel is mppe encrypted, so I believe that what 
goes out to the client is the same as what came in on the server.


Baruch
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Multicast and hardware checksum

2007-06-07 Thread Baruch Even

Hello,

I have a machine on which I have an application that sends multicast 
through an eth interface with hardware tx checksum enabled. On the same 
machine I have mrouted running, which routes the multicast traffic to a 
set of ppp interfaces. The packets that are received by the client have 
their checksum set to some value which is incorrect. If I disable tx 
checksum on the eth device, the packets arrive with the proper checksum.


I still haven't followed the code paths to see how to fix this; maybe 
someone who knows the relevant code can find it faster.


Baruch
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/9]: tcp-2.6 patchset

2007-05-27 Thread Baruch Even
* Ilpo Järvinen [EMAIL PROTECTED] [070527 14:16]:
 On Sun, 27 May 2007, David Miller wrote:
 
  From: Ilpo Järvinen [EMAIL PROTECTED]
  Date: Sun, 27 May 2007 10:58:27 +0300 (EEST)
  
   While you're in the right context (reviewing patch 8), you could also
   look if tcp_clean_rtx_queue does a right thing when passing a strange 
   pkts_acked to congestion control modules. I wonder if it really should 
   ignore GSO the way it does currently... I read some cc module code and 
   some was adding it to snd_cwnd_cnt, etc. which is a strong indication 
   that GSO should be considered... Also if the head is GSO skb that is not 
   completely acked, the loop breaks with pkts_acked being zero, I doubt
   that can be correct... 
  
  [...snip...]
  will likely take a look at these issues wrt. patch 8 tomorrow.
 
 [...snip...]
 
 Thus, my original question basically culminates in this: should cc
 modules be passed the number of packets acked or the number of skbs acked?
 ...The latter makes no sense to me unless the value is intended to
 be interpreted as the number of timestamps acked or something along those 
 lines. ...I briefly tried looking up documentation for the cc module 
 interface but didn't find anything useful about this, and thus asked in 
 the first place...

At least the htcp module that I wrote assumes that the number is the
actual number of TCP packets, so GSO should be considered.

The consequences of this bug are not too large, but it does make all
congestion control algorithms a lot less aggressive. On my machines GSO
is disabled by default (e1000 at 100mbps & Tigon3 @ 1Gbps).

Cheers,
Baruch
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Reduce frequency of cleanup timer in bridge

2007-05-19 Thread Baruch Even
The bridge cleanup timer is fired 10 times a second for timers that are
at least 15 seconds away and that are not critical to be cleaned up
immediately.

This patch calculates the next time to run the timer as the minimum of
all timers or a minimum based on the current state.

Signed-Off-By: Baruch Even [EMAIL PROTECTED]

--- 2.6.22-rc2/net/bridge/br_fdb.c  2007-05-20 00:51:11.0 +0300
+++ 2.6-rc2/net/bridge/br_fdb.c 2007-05-20 00:50:31.0 +0300
@@ -121,6 +121,7 @@
 {
	struct net_bridge *br = (struct net_bridge *)_data;
	unsigned long delay = hold_time(br);
+	unsigned long next_timer = jiffies + br->forward_delay;
	int i;
 
	spin_lock_bh(&br->hash_lock);
@@ -129,14 +130,21 @@
		struct hlist_node *h, *n;
 
		hlist_for_each_entry_safe(f, h, n, &br->hash[i], hlist) {
+			unsigned long this_timer;
+			if (f->is_static)
+				continue;
+			this_timer = f->ageing_timer + delay;
+			if (time_before_eq(this_timer, jiffies))
-			if (!f->is_static &&
-			    time_before_eq(f->ageing_timer + delay, jiffies))
				fdb_delete(f);
+			else if (this_timer < next_timer)
+				next_timer = this_timer;
		}
	}
	spin_unlock_bh(&br->hash_lock);
 
+	/* Add HZ/4 to ensure we round the jiffies upwards to be after the next
+	 * timer, otherwise we might round down and will have a no-op run. */
+	mod_timer(&br->gc_timer, round_jiffies(next_timer + HZ/4));
-	mod_timer(&br->gc_timer, jiffies + HZ/10);
 }
 
 /* Completely flush all dynamic entries in forwarding database.*/



Re: [PATCH] [TCP] Sysctl: document tcp_max_ssthresh (Limited Slow-Start)

2007-05-18 Thread Baruch Even
Ilpo Järvinen wrote:
 Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]
 ---
  Documentation/networking/ip-sysctl.txt |   13 +++++++++++++
  1 files changed, 13 insertions(+), 0 deletions(-)
 
 diff --git a/Documentation/networking/ip-sysctl.txt 
 b/Documentation/networking/ip-sysctl.txt
 index ce16e6a..44ba8d4 100644
 --- a/Documentation/networking/ip-sysctl.txt
 +++ b/Documentation/networking/ip-sysctl.txt
 @@ -239,6 +239,19 @@ tcp_max_orphans - INTEGER
   more aggressively. Let me to remind again: each orphan eats
   up to ~64K of unswappable memory.
  
 +tcp_max_ssthresh - INTEGER
 + Limited Slow-Start for TCP with Large Congestion Windows defined in
 + RFC3742. Limited slow-start is a mechanism to limit grow of the

s/grow/growth/

 + congestion window on the region where congestion window is larger than
 + tcp_max_ssthresh. A TCP connection with a large congestion window could
 + have its congestion window increased by thousand (or even more)
 + segments per RTT by the traditional slow-start procedure which might be
 + counter-productive to TCP performance when packet losses start to
 + occur. With limited slow-start TCP increments congestion window at
 + most tcp_max_ssthresh/2 segments per RTT when the congestion window is

I'm not a native English speaker, but "at most" sounds a bit awkward to
me; maybe change it to "by no more than". But I'm sure someone can find
a better phrasing.

Baruch
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: TCP congestion control for fast, short-distance networks ?

2007-04-04 Thread Baruch Even
* [EMAIL PROTECTED] [EMAIL PROTECTED] [070404 18:29]:
 Hello,
 
 We are currently using both 1 Gb & 10 Gb links that interconnect several
 servers that are very *local* to each other.
 Typical RTT times range from 0.2 ms - 0.3 ms.
 
 We are currently using TCP reno - is there a more suitable congestion
 control algorithm for our application, especially using the 10 Gb links?
 (Most of the High-Speed TCP algorithms seem suitable for large RTT,
 long-distance networks).

I'm not aware of any tests for high speed links with very low RTTs, but
I suspect that the new algorithms will not change much. If the
connections you have are indeed local, then the Ethernet pause mechanism
is more effective for the flow control you need.

Baruch
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: TCP congestion control for fast, short-distance networks ?

2007-04-04 Thread Baruch Even
* [EMAIL PROTECTED] [EMAIL PROTECTED] [070404 19:03]:
 Thanks - so you are suggesting we enable 802.3 flow-control / pause-frames?
 (it's currently disabled)

I do, but do test it before you bet on it. I've never tested such a
scenario, but in my experience, the lower the RTT, the smaller the
problems that the high speed algorithms are trying to solve.

Baruch

 
 -Original Message-
 From: [EMAIL PROTECTED]
 To: [EMAIL PROTECTED]
 Cc: [EMAIL PROTECTED]; netdev@vger.kernel.org
 Sent: Wed, 4 Apr 2007 8:39 AM
 Subject: Re: TCP congestion control for fast, short-distance networks ?
 
  * [EMAIL PROTECTED] [EMAIL PROTECTED] [070404 18:29]:
 Hello,
 
 We are currently using both 1 Gb & 10 Gb links that interconnect several
 servers that are very *local* to each other.
 Typical RTT times range from 0.2 ms - 0.3 ms.
 
 We are currently using TCP reno - is there a more suitable congestion
 control algorithm for our application, especially using the 10 Gb links?
 (Most of the High-Speed TCP algorithms seem suitable for large RTT,
 long-distance networks).
 
 I'm not aware of any tests for high speed links with very low RTTs, but
 I suspect that the new algorithms will not change much. If the
 connections you have are indeed local, then the Ethernet pause mechanism
 is more effective for the flow control you need.
 
 Baruch


Re: many sockets, slow sendto

2007-03-06 Thread Baruch Even
* Zaccomer Lajos [EMAIL PROTECTED] [070306 17:39]:
 Hi,
 
 
 
 I'm playing around with a simulation, in which many thousands of IP
 
 addresses (on interface aliases) are used to send/receive TCP/UDP
 
 packets. I noticed that the time of send/sendto increased linearly
 
 with the number of file descriptors, and I found it rather strange.

To better understand the reason for this problem, you should first
profile the kernel with oprofile. This will show you the hot spots:
where the kernel (or userspace) spends most of its time.

Baruch
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] [TCP]: Reworked recovery's TCPCB_LOST marking functions

2007-03-06 Thread Baruch Even
* Ilpo Järvinen [EMAIL PROTECTED] [070306 14:52]:
 Complete rewrite for update_scoreboard and mark_head_lost. Couple
 of hints became unnecessary because of this change. Changes
 !TCPCB_TAGBITS check from the original to !(S|L) but it shouldn't
 make a difference, and if there ever is an R only skb TCP will
 mark it as LOST too. The algorithm uses some ideas presented by
 David Miller and Baruch Even.
 
 Seqno lookups require fast lookups that are provided using
 RB-tree patch(+abstraction) from DaveM.
 
 Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]
 ---
 
 I'm sorry about the poorly chunked diff; is it possible to force git to 
 produce better (large block) diffs when a complete function is rewritten 
 from scratch in the patch (the manpage of git-diff-files hints at -B but it 
 did not work, perhaps it affects whole file rewrites only)?
 
 This probably conflicts with the other patches in the rbtree patchset of 
 DaveM (two first are required) because I tested this one (at least the 
 non-timedout part worked) and didn't want some random breakage 
 from the other patches (as such was reported).
 
  include/linux/tcp.h  |6 -
  include/net/tcp.h|6 +
  net/ipv4/tcp_input.c |  194 +-
  net/ipv4/tcp_minisocks.c |1 
  4 files changed, 130 insertions(+), 77 deletions(-)
 

[snip]

 + newtp->highest_sack = treq->snt_isn + 1;

That's the only initialization that you have for highest_sack. I think
that you should initialize it when a loss is detected, to the start_seq
of the first packet that wasn't acked.

Didn't review the rest; I still need to arrange a proper tree with the
preliminary patches to apply it on. Could you note the kernel you based
it on and include all patches applied before it?

Baruch
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] [TCP]: Reworked recovery's TCPCB_LOST marking functions

2007-03-06 Thread Baruch Even
* David Miller [EMAIL PROTECTED] [070306 23:47]:
 From: Baruch Even [EMAIL PROTECTED]
 Date: Tue, 6 Mar 2007 21:42:59 +0200
 
  * Ilpo Järvinen [EMAIL PROTECTED] [070306 14:52]:
   + newtp->highest_sack = treq->snt_isn + 1;
  
  That's the only initialization that you have for highest_sack, I think
  that you should initialize it when a loss is detected to the start_seq
  of the first packet that wasn't acked.
 
 He also sets it in tcp_sacktag_write_queue() like this:
 
 +
 + if (after(TCP_SKB_CB(skb)->seq,
 + tp->highest_sack))
 + tp->highest_sack = TCP_SKB_CB(skb)->seq;

Yes, but that's still not enough if, between the start of the connection
and the first SACK block, we already wrapped around to before the old
highest_sack. It might not be a common occurrence, but it's still
something to take care of.

Baruch
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/4]: Kill fastpath_{skb,cnt}_hint.

2007-03-03 Thread Baruch Even
* David Miller [EMAIL PROTECTED] [070303 08:22]:
 BTW, I think I figured out a way to get rid of
 lost_{skb,cnt}_hint.  The fact of the matter in this case is that
 the setting of the tag bits always propagates from front of the queue
 onward.  We don't get holes mid-way.
 
 So what we can do is search the RB-tree for high_seq and walk
 backwards.  Once we hit something with TCPCB_TAGBITS set, we
 stop processing as there are no earlier SKBs which we'd need
 to do anything with.
 
 Do you see any problems with that idea?

I think this will be a fairly long walk initially.

You can try an augmented walk: if you are on a node which is tagged,
anything on the right side will be tagged as well since it is smaller,
so you need to go left. This way you can find the first non-tagged item
in O(log n).

A bug in this logic is that sequence numbers can and do wrap around.

If you are willing to change the logic of the tree, you can remove any
sacked element from it; many of the operations are really only
interested in the non-sacked skbs. This would be similar to my patches
with the non-sacked list, but I still needed the hints, since the number
of lost packets could still be large and some operations (retransmit,
for example) need to get to the end of the list.


 scoreboard_skb_hint is a little bit trickier, but it is a similar
 case to the tcp_lost_skb_hint case.  Except here the termination
 condition is a relative timeout instead of a sequence number and
 packet count test.
 
 Perhaps for that we can remember some state from the
 tcp_mark_head_lost() we do first.  In fact, we can start
 the queue walk from the latest packet which tcp_mark_head_lost()
 marked with a tag bit.
 
 Basically these two algorithms are saying:
 
 1) Mark up to smallest of 'lost' or tp-high_seq.
 2) Mark packets after those processed in #1 which have
timed out.
 
 Right?

Yes. This makes sense; the two algorithms start from the same place. I'd
even consider merging them into a single walk, unless we know that
usually one happens without the other.

There is another case like that in tcp_xmit_retrans, where the forward
transmission should only start at the position where the retransmit
finished. I had that in my old patches and it improved performance at
the time.

Baruch
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/4]: Kill fastpath_{skb,cnt}_hint.

2007-03-01 Thread Baruch Even
* David Miller [EMAIL PROTECTED] [070228 21:49]:
 
 commit 71b270d966cd42e29eabcd39434c4ad4d33aa2be
 Author: David S. Miller [EMAIL PROTECTED]
 Date:   Tue Feb 27 19:28:07 2007 -0800
 
 [TCP]: Kill fastpath_{skb,cnt}_hint.
 
 Now that we have per-skb fack_counts and an interval
 search mechanism for the retransmit queue, we don't
 need these things any more.
 
 Instead, as we traverse the SACK blocks to tag the
 queue, we use the RB tree to lookup the first SKB
 covering the SACK block by sequence number.
 
 Signed-off-by: David S. Miller [EMAIL PROTECTED]

If you take this approach, it makes sense to also remove the sorting of
SACKs; the traversal of the SACK blocks will not start from the
beginning anyway, which was the reason for the sorting in the first
place.

One drawback of this approach is that you now walk the entire SACK
block when you advance one packet. If you consider a 10,000-packet queue
which had several losses at the beginning, and a large SACK block that
advances from the middle to the end, you'll walk a lot of packets for
that one last stretch of a SACK block.

One way to handle that is to use the still-existing SACK fast path to
detect this case and calculate the sequence number to search for. Since
you know what end_seq was handled last, you can search for it as the
start_seq and go on from there. Does that make sense?

Baruch
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Correct links in net/ipv4/Kconfig

2007-02-21 Thread Baruch Even
Correct dead/indirect links in net/ipv4/Kconfig

Signed-Off-By: Baruch Even [EMAIL PROTECTED]

Index: 2.6-gt/net/ipv4/Kconfig
===
--- 2.6-gt.orig/net/ipv4/Kconfig	2007-02-17 15:47:41.000000000 +0200
+++ 2.6-gt/net/ipv4/Kconfig	2007-02-17 15:55:53.000000000 +0200
@@ -442,7 +442,7 @@
---help---
  Support for INET (TCP, DCCP, etc) socket monitoring interface used by
  native Linux tools such as ss. ss is included in iproute2, currently
- downloadable at http://developer.osdl.org/dev/iproute2. 
+ downloadable at http://linux-net.osdl.org/index.php/Iproute2.
  
  If unsure, say Y.
 
@@ -550,7 +550,7 @@
Scalable TCP is a sender-side only change to TCP which uses a
MIMD congestion control algorithm which has some nice scaling
properties, though is known to have fairness issues.
-   See http://www-lce.eng.cam.ac.uk/~ctk21/scalable/
+   See http://www.deneholme.net/tom/scalable/
 
 config TCP_CONG_LP
	tristate "TCP Low Priority"


[PATCH] Correct links in net/ipv4/Kconfig

2007-02-17 Thread Baruch Even
Fix bug #6216: update the link in the CONFIG_IP_MCAST help message. The bug,
with the proposed fix, was submitted by [EMAIL PROTECTED]

Correct other dead/indirect links in the same file.

Signed-Off-By: Baruch Even [EMAIL PROTECTED]

Index: 2.6-gt/net/ipv4/Kconfig
===
--- 2.6-gt.orig/net/ipv4/Kconfig	2007-02-17 15:47:41.000000000 +0200
+++ 2.6-gt/net/ipv4/Kconfig	2007-02-17 15:55:53.000000000 +0200
@@ -9,7 +9,7 @@
  intend to participate in the MBONE, a high bandwidth network on top
  of the Internet which carries audio and video broadcasts. More
  information about the MBONE is on the WWW at
- http://www-itg.lbl.gov/mbone/. Information about the multicast
+ http://www.savetz.com/mbone/. Information about the multicast
  capabilities of the various network cards is contained in
  file:Documentation/networking/multicast.txt. For most people, it's
  safe to say N.
@@ -442,7 +442,7 @@
---help---
  Support for INET (TCP, DCCP, etc) socket monitoring interface used by
  native Linux tools such as ss. ss is included in iproute2, currently
- downloadable at http://developer.osdl.org/dev/iproute2. 
+ downloadable at http://linux-net.osdl.org/index.php/Iproute2.
  
  If unsure, say Y.
 
@@ -550,7 +550,7 @@
Scalable TCP is a sender-side only change to TCP which uses a
MIMD congestion control algorithm which has some nice scaling
properties, though is known to have fairness issues.
-   See http://www-lce.eng.cam.ac.uk/~ctk21/scalable/
+   See http://www.deneholme.net/tom/scalable/
 
 config TCP_CONG_LP
	tristate "TCP Low Priority"


[PATCH] Hostess SV-11 depends on INET

2007-02-17 Thread Baruch Even
The Comtrol Hostess SV-11 driver uses features from INET but doesn't depend on
it. The simple solution is to make it depend on INET, as is done for the
sealevel driver.

Fixes bug #7930.

Signed-Off-By: Baruch Even [EMAIL PROTECTED]

Index: 2.6-gt/drivers/net/wan/Kconfig
===
--- 2.6-gt.orig/drivers/net/wan/Kconfig 2007-02-17 16:26:22.0 +0200
+++ 2.6-gt/drivers/net/wan/Kconfig  2007-02-17 16:26:27.0 +0200
@@ -26,7 +26,7 @@
 # There is no way to detect a comtrol sv11 - force it modular for now.
 config HOSTESS_SV11
	tristate "Comtrol Hostess SV-11 support"
-	depends on WAN && ISA && m && ISA_DMA_API
+	depends on WAN && ISA && m && ISA_DMA_API && INET
help
  Driver for Comtrol Hostess SV-11 network card which
  operates on low speed synchronous serial links at up to


Re: [patch 3/3] tcp: remove experimental variants from default list

2007-02-13 Thread Baruch Even
* David Miller [EMAIL PROTECTED] [070213 00:53]:
 From: Baruch Even [EMAIL PROTECTED]
 Date: Tue, 13 Feb 2007 00:12:41 +0200
 
  The problem is that you actually put a mostly untested algorithm as the
  default for everyone to use. The BIC example is important, it was the
  default algorithm for a long while and had implementation bugs that no
  one cared for.
 
 And if our TCP Reno implementation had some bugs, what should
 we change the default to?  This is just idiotic logic.
 
 These kinds of comments are just wanking, and lead to nowhere,
 so please kill the noise.
 
 If we have bugs in a particular algorithm, we should just fix
 them.

I hope you've finished attempting to insult me. But I hope it won't
prevent you from getting back to the topic. The above quote of me was a
prelude to show the repeated behaviour where bic was added without
testing, modified by Stephen, and made default with no serious testing of
what was put in the kernel.

It seems this is happening again now with cubic. And you failed to
respond to this issue.

  The behaviour of cubic wasn't properly verified as the
  algorithm in the linux kernel is not the one that was actually proposed
  and you intend to make it the default without sufficient testing, that
  seems to me to be quite unreasonable.

According to claims by Doug Leith, the cubic algorithm that is in the
kernel is different from the one that was actually proposed and tested.
That's an important issue which is being deflected by personal attacks.

My main gripe is that there is a rush to make an untested algorithm the
default for all Linux installations. And saying that I should test it is
not an escape route: if it's untested, it shouldn't be made the default
algorithm.

My skimming of the PFLDNet 2007 proceedings showed only the works by
Injong and Doug on Cubic, and Injong tested some version on Linux
2.6.13(!), which might not be the version in the current tree. Doug shows
some weaknesses of the Cubic algorithm as implemented in Linux.

Do you still think that making Cubic the default is a good idea?

Baruch
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 3/3] tcp: remove experimental variants from default list

2007-02-13 Thread Baruch Even
* SANGTAE HA [EMAIL PROTECTED] [070213 18:50]:
 Hi Baruch,
 
 I would like to add some comments on your argument.
 
 On 2/13/07, Baruch Even [EMAIL PROTECTED] wrote:
 * David Miller [EMAIL PROTECTED] [070213 00:53]:
  From: Baruch Even [EMAIL PROTECTED]
  Date: Tue, 13 Feb 2007 00:12:41 +0200
 
   The problem is that you actually put a mostly untested algorithm as the
   default for everyone to use. The BIC example is important, it was the
   default algorithm for a long while and had implementation bugs that no
   one cared for.
 
  And if our TCP Reno implementation had some bugs, what should
  we change the default to?  This is just idiotic logic.
 
  These kinds of comments are just wanking, and lead to nowhere,
  so please kill the noise.
 
  If we have bugs in a particular algorithm, we should just fix
  them.
 
 I hope you've finished attempting to insult me. But I hope it won't
 prevent you from getting back to the topic. The above quote of me was a
 prelude to show the repeated behaviour where bic was added without
 testing, modified by Stephen, and made default with no serious testing of
 what was put in the kernel.
 
 
 What kind of serious testing do you want? I've been testing all
 highspeed protocols including BIC and CUBIC for two and a half years
 now. Even Stephen didn't test the CUBIC algorithm by himself; he might have
 seen the results from our experimental studies. I don't care what algorithm
 is default in the kernel; however, it is not appropriate to go back to
 Reno. As Windows decided to go with Compound TCP, why would we want to go
 back to an 80's algorithm?

I fail to see how Microsoft should be the reason for anything, if
anything Linux started the arms race.

 It seems this is happening again now with cubic. And you failed to respond to
 this issue.
 
   The behaviour of cubic wasn't properly verified as the
   algorithm in the linux kernel is not the one that was actually proposed
   and you intend to make it the default without sufficient testing, that
   seems to me to be quite unreasonable.
 
 According to claims by Doug Leith, the cubic algorithm that is in the
 kernel is different from the one that was actually proposed and tested.
 That's an important issue which is being deflected by personal attacks.
 
 Did you read that paper?
 http://wil.cs.caltech.edu/pfldnet2007/paper/CUBIC_analysis.pdf
 Then, please read the rebuttal for that paper.
 http://www.csc.ncsu.edu/faculty/rhee/Rebuttal-LSM-new.pdf
 
 Also, the implementation can be different. The cubic code inside the
 current kernel introduces a faster calculation of the cubic root. Even
 though we had some bugs in the CUBIC implementation, they are fixed now.

We have seen before with bic that a different implementation meant
that things didn't work as expected. I wouldn't like that to happen again.

 
  My main gripe is that there is a rush to make an untested algorithm the
 default for all Linux installations. And saying that I should test it is
 not an escape route: if it's untested, it shouldn't be made the default
 algorithm.
 
 What are the criteria for untested? Who judges that this algorithm is
 fully tested and is ready to use?

Did you do your tests on 2.6.20? Did you verify that the algorithm
actually behaves as it should? I don't think anyone did any real tests
on the cubic version in the kernel, and I fear a repeat of the bic issue.
Code that is untested is likely not to work, and as far as I understand
it, you didn't test the current kernel version but rather your own code
on an ancient kernel.

I'd be happy to be proven wrong and shown tests of cubic in the latest
kernel. Saying that I should do it myself if concerned is the wrong way
to go. I no longer have access to test equipment to do that, and we
should not make an algorithm the default without sufficient testing.

 We still do testing with latest kernel version on production
 networks(4ms, 6ms, 9ms, 45ms, and 200ms). I will post the results when
 those are ready.

That would be an important step indeed.

 My skimming of the PFLDNet 2007 proceedings showed only the works by
 Injong and Doug on Cubic, and Injong tested some version on Linux
 2.6.13(!), which might not be the version in the current tree. Doug shows
 some weaknesses of the Cubic algorithm as implemented in Linux.
 
 As I mentioned, please read the paper and rebuttal carefully. Also, in

I'll do that later on, but a quick reading shows web traffic with a
minimum of 137KB, which doesn't seem to be very realistic. But I need to
read deeper to see what goes on there.

 Do you still think that making Cubic the default is a good idea?
 
 Then, what do you want to make the default? You want to go back to BIC? Or 
 Reno?

I don't claim to have all the right answers. I would prefer to go to
Reno, and I don't buy DaveM's argument that we have to use some
fancy high speed algorithm; even if we do go for one, I'd prefer a safer
choice like HSTCP.

Baruch
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info

Re: [patch 3/3] tcp: remove experimental variants from default list

2007-02-13 Thread Baruch Even
* Injong Rhee [EMAIL PROTECTED] [070213 19:43]:
 
 On Feb 13, 2007, at 4:56 AM, Baruch Even wrote:
 
 
 According to claims of Doug Leith the cubic algorithm that is in the
 kernel is different from what was proposed and tested. That's an
 important issue which is deflected by personal attacks.
 
 It is not the algorithm that is untested -- it is the implementation that is
 not fully tested. This is exactly the reason we are proposing to build a
 common, convenient, accessible testbed equipped with a full set of automated
 testing scenarios. This would be useful to crack out these bugs. There could
 be a weakness in an algorithm, but there is no bug in the algorithm.

Yes. That was bad terminology on my part.

A testbed would be nice, I've heard several times about ideas to do that
but haven't seen anything materialize yet.

 Do you still think that making Cubic the default is a good idea?
 
 Do you think H-TCP could make a good candidate? I remember there are bugs in

I don't. And if you'd bother looking back at the thread you'd see that I
didn't even consider that an option. You automatically assume that I'm
only trying to further H-TCP; far from it. I've finished my MSc already
and am back to being a free man with his own thoughts. I've seen what
happened before and want to prevent it from happening again. I think
the BIC algorithm wasn't good enough, the implementation was even
buggier, and still it was made the default without much thought, and no
one thought to pull it back.

 H-TCP implementation (which went on unnoticed for a long time) -- Leith
 claims his team found the bugs -- but it seems a little of a coincidence
 that after we post our report on a strange behavior on H-TCP, D. Leith
 came back saying they found the bugs (no attribution..hmm).

I'm the one who found the issue, and I can assure you that I didn't see
any notice from you before I did. I was simply migrating my work from an
older kernel with our patches to the latest kernel at the time, with the
patches as committed by Stephen. There was a difference between what was
submitted by myself and what was committed, and it took us time to
detect that, for the same reason I'm worried about the existing cubic
implementation: we were using our own patches and not testing the Linux
implementation. This is the same thing that is happening now with cubic.

 We also found some problem in the weakness of H-TCP algorithm (not
 implementation) as well (please read our Convex ordering paper in
 PFLDnet07). Based on the same argument of yours, then H-TCP does not
 make the cut. I guess none of TCP protocols would have made the cut
 either.

Bingo.

Baruch


Re: [patch 3/3] tcp: remove experimental variants from default list

2007-02-13 Thread Baruch Even
* David Miller [EMAIL PROTECTED] [070213 21:56]:
 From: Baruch Even [EMAIL PROTECTED]
 Date: Tue, 13 Feb 2007 11:56:13 +0200
 
  Do you still think that making Cubic the default is a good idea?
 
 Can you propose a better alternative other than Reno?

The only other option would be HS-TCP. It is a very simple extension of
Reno and should be good enough for the current common high-BDP
connections. If not that, then the choice is to keep BIC and test the
existing implementation of Cubic before making it the default, if at
all.

I still think that defaulting to a high-speed algorithm is not the
right thing to do, but I'd rather have a sane default than go for the
latest proposal to catch the eye.

Baruch


Re: [patch 3/3] tcp: remove experimental variants from default list

2007-02-12 Thread Baruch Even
* Stephen Hemminger [EMAIL PROTECTED] [070212 18:04]:
 The TCP Vegas implementation is buggy, and BIC is too aggressive,
 so they should not be in the default list. Westwood is okay, but
 not well tested.

Since no one really agrees on the relative merits and problems of the
different algorithms, and since the users themselves don't know, don't
care, and have no clue what the correct behaviour should be in order to
report bugs (see the old bic bugs, the htcp bugs, the recent sack bugs),
I would suggest we avoid making the whole internet a guinea pig and get
back to reno. If someone really needs to push high BDP flows he should
test it himself and choose what works for his kernel at the time.

For myself, and anyone who asks me, I recommend setting the default to
reno. The few who really need high speed flows should test their kernel
and protocol combination.
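On kernels from the pluggable-congestion-control era (2.6.13 onward) the default can be switched without rebuilding; a sketch of the knobs involved, assuming root and the sysctl names of that era:

```shell
# List the algorithms compiled into this kernel, and the current default
cat /proc/sys/net/ipv4/tcp_available_congestion_control
cat /proc/sys/net/ipv4/tcp_congestion_control

# Switch the system-wide default back to Reno until the next reboot
sysctl -w net.ipv4.tcp_congestion_control=reno
```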

Baruch


Re: [patch 3/3] tcp: remove experimental variants from default list

2007-02-12 Thread Baruch Even
* David Miller [EMAIL PROTECTED] [070212 22:21]:
 From: Baruch Even [EMAIL PROTECTED]
 Date: Mon, 12 Feb 2007 21:11:01 +0200
 
  Since no one really agrees on the relative merits and problems of the
  different algorithms and since the users themselves dont know, dont care
  and have no clue on what should be the correct behaviour to report bugs
  (see the old bic bugs, the htcp bugs, the recent sack bugs) I would
  suggest to avoid making the whole internet a guinea pig and get back to
  reno. If someone really needs to push high BDP flows he should test it
  himself and choose what works for his kernel at the time.
  
  For myself and anyone who asks me I recommend to set the default to
  reno. For the few who really need high speed flows, they should test
  kernel and protocol combination.
 
 We have high BDP flows even going from between the east and the west
 coast of the United States.
 
 This doesn't even begin to touch upon extremely well connected
 countries like South Korea and what happens when people there try to
 access sites in Europe or the US.
 
 Good high BDP flow handling is necessary now and for everyday usage of
 the internet, it's not some obscure thing only researchers in fancy
 labs need.
 
 This also isn't the internet of 15 years ago where IETF members can
 spend 4 or 5 years masturbating over new ideas before deploying them.
 I know that's what conservative folks want, but it isn't going to
 happen.

The problem is that you actually put a mostly untested algorithm as the
default for everyone to use. The BIC example is important: it was the
default algorithm for a long while and had implementation bugs that no
one cared for. The behaviour of cubic wasn't properly verified, as the
algorithm in the Linux kernel is not the one that was actually proposed,
and you intend to make it the default without sufficient testing. That
seems to me quite unreasonable.

As to the reasoning that the new algorithms are supposed to act like
Reno, that needs to be verified as well; it's not evident from the code
itself.

Baruch


Re: Unexpected Acknowledgement / Stalled Connections

2007-02-04 Thread Baruch Even
* Parag Warudkar [EMAIL PROTECTED] [070205 00:57]:
 On 2/4/07, Parag Warudkar [EMAIL PROTECTED] wrote:
 I am running 2.6.20 and have trouble with stalled connections. For
 instance, if I try to download a debian ISO image using wget, the
 connection runs fine for a few seconds and then stalls forever.
 
 In my router logs I see a ton of messages like the below -
 
 [INFO] Sun Feb 04 17:22:03 2007 Blocked incoming TCP packet from
 192.168.0.174:34090 to 130.239.18.138:80 with unexpected
 acknowledgement 3269301836 (expected 3269343453 to 3269408989)
 
 Where 192.168.0.174 is my laptop running FC6 and kernel 2.6.20 and
 130.239.18.138 is whatever cdimage.debian.org resolves to atm.
 
 What's going on here? Any TCP/IP tunable that I can set/turn on/off to
 prevent this from happening?
 
 Turning tcp_sack off seems to cure it. Turning it on again makes the
 connections stall. Seems like the D-Link router doesn't like the SACKs
 linux sends?

You can also try to disable tcp_window_scaling.

Can you provide a tcpdump trace of the connection? Traces with and
without SACK would be appreciated; a trace with SACK but without window
scaling would also be useful. I only need the headers of the packets, so
-s 60 (the default) will be fine.
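A possible way to collect the requested traces (interface, host, and file names here are placeholders, not taken from the thread):

```shell
# Capture only packet headers; -s 60 truncates each packet after the
# TCP/IP headers, keeping the capture small
tcpdump -i eth0 -s 60 -w sack-on.pcap host cdimage.debian.org

# Repeat the download with SACK disabled, then again with window
# scaling disabled, capturing to separate files each time
sysctl -w net.ipv4.tcp_sack=0
sysctl -w net.ipv4.tcp_window_scaling=0
```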

There was a change in 2.6.20 that I put in, but it should only have
affected the sender side, and even then only the way it interpreted the
data, not what it sent on the wire.

Trying the download with previous kernels can also help, if only to try
to pinpoint where things went bad, assuming it is in the kernel. This
sounds like a bad implementation of the ACK window checks that ICSA
requires for firewall product certification.

Baruch


Re: [PATCH 3/3] Check num sacks in SACK fast path

2007-01-31 Thread Baruch Even
* David Miller [EMAIL PROTECTED] [070131 22:52]:
 From: Baruch Even [EMAIL PROTECTED]
 Date: Mon, 29 Jan 2007 09:13:49 +0200
 
  When we check for SACK fast path make sure that we also have the same
  number of SACK blocks in the cache and in the new SACK data. This
  prevents us from mistakenly taking the cache data if the old data in
  the SACK cache is the same as the data in the SACK block.
  
  Signed-Off-By: Baruch Even [EMAIL PROTECTED]
 
 We could implement this without extra state, for example by
 clearing out the rest of the recv_sack_cache entries.
 
 We should never see a SACK block from sequence zero to zero,
 which would be an empty SACK block.

That would work as well at the cost of extra writing to memory for each
ack packet. Though I won't guess what is worse, the extra memory used or
the extra writing.

 Something like the following?
 
 diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
 index c26076f..84cd722 100644
 --- a/net/ipv4/tcp_input.c
 +++ b/net/ipv4/tcp_input.c
 @@ -999,6 +1001,10 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
 		return 0;
 		}
 		}
 +	for (; i <= 4; i++) {

That won't work though; the <= should be <. I've actually used
ARRAY_SIZE just to be on the safe side.

 +	tp->recv_sack_cache[i].start_seq = 0;
 +	tp->recv_sack_cache[i].end_seq = 0;
 +	}
  
   if (flag)
   num_sacks = 1;

Baruch


Re: [PATCH 1/3] Advance fast path pointer for first block only

2007-01-31 Thread Baruch Even
* David Miller [EMAIL PROTECTED] [070131 22:48]:
 From: Baruch Even [EMAIL PROTECTED]
 Date: Mon, 29 Jan 2007 09:13:39 +0200
 
  Only advance the SACK fast-path pointer for the first block, the fast-path
  assumes that only the first block advances next time so we should not move 
  the
  skb for the next sack blocks.
  
  Signed-Off-By: Baruch Even [EMAIL PROTECTED]
  
  ---
  
  I'm not sure about the fack_count part, this patch changes the value that is
  maintained in this case.
 
 I'm not sure about this patch :-)

That's what we have a gatekeeper for: to keep us on our toes.

 The fastpath is being used for two things here, I believe.
 
 First, it's being used to monitor the expanding of the initial
 SACK block, as you note.  This is clear by the fact that we
 only use the fastpath cache on future calls when this code path
 triggers:
 
   if (flag)
   num_sacks = 1;

 because otherwise we clear the fastpath cache to NULL.

True.

 But it's also being used in this loop you are editing to remember
 where we stopped in the previous iteration of the loop.  With your
 change it will always walk the whole retransmit queue from the end of
 the first SACK block, whereas before it would iterate starting at the
 last SKB visited at the end of the previous sack block processed by
 the loop.

OK. When I did the patch I forgot that by this stage we have sorted the
SACK blocks and we can't rely on the expanding block to be the first.

 I'll use this opportunity to say I'm rather unhappy with all of
 these fastpath hinting schemes.  They chew up a ton of space in
 the tcp_sock because we have several pointers for each of the
 different queue states we can cache and those eat 8 bytes a
 piece on 64-bit.

Understood. I do intend to look at different ways to organise the data
and do the work, but I wouldn't hold my breath for it.

One option to limit the damage of slow SACK processing is to limit the
amount of work we are willing to do for it; if we cap the SACK walk at
1000 packets we will prevent the worst, and I believe we will still not
cause much harm to TCP recovery. This belief will need to be checked;
the value 1000 will also need to be checked and maybe made configurable.

This is obviously a hack, but it will guarantee that we never cause the
ACK clock to die like it can now.

 We can probably fix the bug and preserve the inter-iteration
 end-SKB caching by having local variable SKB/fack_count caches
 in this function, and only updating the tcp_sock cache values
 for the first SACK block.
 
 Maybe something like this?

Ah, now I see what you meant and it is indeed another issue with my
patch. I'll change my patch with this fix and fix the issue of i == 0
being the wrong way to cache only the originally first SACK block.

Baruch


Re: net-2.6.21 GIT tree

2007-01-28 Thread Baruch Even
* David Miller [EMAIL PROTECTED] [070129 02:54]:
 
 I just cut it at:
 
   kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6.21.git
 
 Feel free to send me feature patches for consideration.
 
 I'll probably toss things like Baruch's latest SACK fixes in
 there so they can cook for a while and then perhaps after some
 time we'll backport them into whatever -stable branch is active
 at the time.

The new fixes are much less important than the first one, so I have no
problem with them waiting a bit. I do think that only actively testing
the SACK processing correctness and performance will show anything; it's
been shown time and again that most people just take the existing
behaviour as OK even if it is inefficient at best.

I'll try to get access to a network testbed or setup a virtual one on my
laptop and actually test this code.

Baruch


[PATCH 0/3] Fix issues with SACK processing

2007-01-27 Thread Baruch Even
These patches are intended to fix the issues I've raised in a former
email, in addition to the sorting fix.

I was still not able to runtime-test these patches; they were only
compile-tested.

Baruch


[PATCH 1/3] Advance fast path pointer for first block only

2007-01-27 Thread Baruch Even
Only advance the SACK fast-path pointer for the first block, the fast-path
assumes that only the first block advances next time so we should not move the
skb for the next sack blocks.

Signed-Off-By: Baruch Even [EMAIL PROTECTED]

---

I'm not sure about the fack_count part, this patch changes the value that is
maintained in this case.

Index: 2.6-rc6/net/ipv4/tcp_input.c
===
--- 2.6-rc6.orig/net/ipv4/tcp_input.c   2007-01-27 14:53:27.0 +0200
+++ 2.6-rc6/net/ipv4/tcp_input.c2007-01-27 15:59:22.0 +0200
@@ -1048,8 +1048,13 @@
int in_sack, pcount;
u8 sacked;
 
-	tp->fastpath_skb_hint = skb;
-	tp->fastpath_cnt_hint = fack_count;
+	if (i == 0) {
+		/* Only advance the hint for the first SACK
+		 * block, the hint is for quickly handling the
+		 * advancing of the first SACK blocks only. */
+		tp->fastpath_skb_hint = skb;
+		tp->fastpath_cnt_hint = fack_count;
+	}
 
/* The retransmission queue is always in order, so
 * we can short-circuit the walk early.


[PATCH 2/3] Seperate DSACK from SACK fast path

2007-01-27 Thread Baruch Even
Move DSACK code outside the SACK fast-path checking code. If the DSACK
determined that the information was too old we stayed with a partial cache
copied. Most likely this matters very little since the next packet will not be
DSACK and we will find it in the cache. but it's still not good form and there
is little reason to couple the two checks.

Since the SACK receive cache doesn't need the data to be in host order we also
remove the ntohl in the checking loop.

Signed-Off-By: Baruch Even [EMAIL PROTECTED]

Index: 2.6-rc6/net/ipv4/tcp_input.c
===
--- 2.6-rc6.orig/net/ipv4/tcp_input.c   2007-01-27 15:06:30.0 +0200
+++ 2.6-rc6/net/ipv4/tcp_input.c2007-01-27 15:59:15.0 +0200
@@ -948,16 +948,43 @@
 	tp->fackets_out = 0;
 	prior_fackets = tp->fackets_out;
 
+	/* Check for D-SACK. */
+	if (before(ntohl(sp[0].start_seq), TCP_SKB_CB(ack_skb)->ack_seq)) {
+		dup_sack = 1;
+		tp->rx_opt.sack_ok |= 4;
+		NET_INC_STATS_BH(LINUX_MIB_TCPDSACKRECV);
+	} else if (num_sacks > 1 &&
+			!after(ntohl(sp[0].end_seq), ntohl(sp[1].end_seq)) &&
+			!before(ntohl(sp[0].start_seq), ntohl(sp[1].start_seq))) {
+		dup_sack = 1;
+		tp->rx_opt.sack_ok |= 4;
+		NET_INC_STATS_BH(LINUX_MIB_TCPDSACKOFORECV);
+	}
+
+	/* D-SACK for already forgotten data...
+	 * Do dumb counting. */
+	if (dup_sack &&
+			!after(ntohl(sp[0].end_seq), prior_snd_una) &&
+			after(ntohl(sp[0].end_seq), tp->undo_marker))
+		tp->undo_retrans--;
+
+	/* Eliminate too old ACKs, but take into
+	 * account more or less fresh ones, they can
+	 * contain valid SACK info.
+	 */
+	if (before(TCP_SKB_CB(ack_skb)->ack_seq, prior_snd_una - tp->max_window))
+		return 0;
+
/* SACK fastpath:
 * if the only SACK change is the increase of the end_seq of
 * the first block then only apply that SACK block
 * and use retrans queue hinting otherwise slowpath */
 	flag = 1;
-	for (i = 0; i < num_sacks; i++) {
-		__u32 start_seq = ntohl(sp[i].start_seq);
-		__u32 end_seq = ntohl(sp[i].end_seq);
+	for (i = 0; i < num_sacks; i++) {
+		__u32 start_seq = sp[i].start_seq;
+		__u32 end_seq = sp[i].end_seq;
 
-		if (i == 0){
+		if (i == 0) {
 			if (tp->recv_sack_cache[i].start_seq != start_seq)
 				flag = 0;
 		} else {
@@ -967,37 +994,6 @@
}
 		tp->recv_sack_cache[i].start_seq = start_seq;
 		tp->recv_sack_cache[i].end_seq = end_seq;
-
-		/* Check for D-SACK. */
-		if (i == 0) {
-			u32 ack = TCP_SKB_CB(ack_skb)->ack_seq;
-
-			if (before(start_seq, ack)) {
-				dup_sack = 1;
-				tp->rx_opt.sack_ok |= 4;
-				NET_INC_STATS_BH(LINUX_MIB_TCPDSACKRECV);
-			} else if (num_sacks > 1 &&
-					!after(end_seq, ntohl(sp[1].end_seq)) &&
-					!before(start_seq, ntohl(sp[1].start_seq))) {
-				dup_sack = 1;
-				tp->rx_opt.sack_ok |= 4;
-				NET_INC_STATS_BH(LINUX_MIB_TCPDSACKOFORECV);
-			}
-
-			/* D-SACK for already forgotten data...
-			 * Do dumb counting. */
-			if (dup_sack &&
-					!after(end_seq, prior_snd_una) &&
-					after(end_seq, tp->undo_marker))
-				tp->undo_retrans--;
-
-			/* Eliminate too old ACKs, but take into
-			 * account more or less fresh ones, they can
-			 * contain valid SACK info.
-			 */
-			if (before(ack, prior_snd_una - tp->max_window))
-				return 0;
-		}
}
 
if (flag)


[PATCH 3/3] Check num sacks in SACK fast path

2007-01-27 Thread Baruch Even
When we check for SACK fast path make sure that we also have the same number of
SACK blocks in the cache and in the new SACK data. This prevents us from
mistakenly taking the cache data if the old data in the SACK cache is the same
as the data in the SACK block.

Signed-Off-By: Baruch Even [EMAIL PROTECTED]

Index: 2.6-rc6/include/linux/tcp.h
===
--- 2.6-rc6.orig/include/linux/tcp.h2007-01-27 15:06:02.0 +0200
+++ 2.6-rc6/include/linux/tcp.h 2007-01-27 15:19:04.0 +0200
@@ -317,6 +317,7 @@
struct tcp_sack_block selective_acks[4]; /* The SACKS themselves*/
 
struct tcp_sack_block recv_sack_cache[4];
+   u32 recv_sack_cache_size;
 
/* from STCP, retrans queue hinting */
struct sk_buff* lost_skb_hint;
Index: 2.6-rc6/net/ipv4/tcp_input.c
===
--- 2.6-rc6.orig/net/ipv4/tcp_input.c   2007-01-27 15:18:30.0 +0200
+++ 2.6-rc6/net/ipv4/tcp_input.c2007-01-27 15:30:09.0 +0200
@@ -979,7 +979,8 @@
 * if the only SACK change is the increase of the end_seq of
 * the first block then only apply that SACK block
 * and use retrans queue hinting otherwise slowpath */
-   flag = 1;
+	flag = num_sacks == tp->recv_sack_cache_size;
+	tp->recv_sack_cache_size = num_sacks;
 	for (i = 0; i < num_sacks; i++) {
 		__u32 start_seq = sp[i].start_seq;
 		__u32 end_seq = sp[i].end_seq;


Re: [PATCH 2/3] Seperate DSACK from SACK fast path

2007-01-27 Thread Baruch Even
* David Miller [EMAIL PROTECTED] [070128 06:06]:
 From: Baruch Even [EMAIL PROTECTED]
 Date: Sat, 27 Jan 2007 18:49:49 +0200
 
  Since the SACK receive cache doesn't need the data to be in host
  order we also remove the ntohl in the checking loop.
  ...
  -   for (i = 0; i num_sacks; i++) {
  -   __u32 start_seq = ntohl(sp[i].start_seq);
  -   __u32 end_seq =  ntohl(sp[i].end_seq);
  +   for (i = 0; i  num_sacks; i++) {
  +   __u32 start_seq = sp[i].start_seq;
  +   __u32 end_seq = sp[i].end_seq;
  ...
  }
  tp-recv_sack_cache[i].start_seq = start_seq;
  tp-recv_sack_cache[i].end_seq = end_seq;
 
 Ok, and now the sack cache and the real sack blocks are
 stored in net-endian and this works out because we only
 make direct equality comparisons with the recv_sack_cache[]
 entry values?

Yes. The only comparison we do with recv_sack_cache entries is != and
that works for net-endian just fine.

The only reason recv_sack_cache was in host-order before that was that
start_seq and end_seq were used to do more before/after comparisons for
DSACK.

Baruch


[PATCH] Fix sorting of SACK blocks

2007-01-25 Thread Baruch Even
The sorting of SACK blocks actually munges them rather than sorting
them, causing the TCP stack to ignore some SACK information and breaking
the assumption of ordered SACK blocks after sorting.

The sort takes its data from a second buffer which isn't moved, causing
subsequent data moves to occur from the wrong location. The fix is to
use a temporary buffer, as a normal sort does.

Signed-Off-By: Baruch Even [EMAIL PROTECTED]

diff -X 2.6-rc6/Documentation/dontdiff -ur 2.6-rc6/net/ipv4/tcp_input.c 
2.6-mod/net/ipv4/tcp_input.c
--- 2.6-rc6/net/ipv4/tcp_input.c2007-01-25 19:04:20.0 +0200
+++ 2.6-mod/net/ipv4/tcp_input.c2007-01-25 19:52:04.0 +0200
@@ -1011,10 +1011,11 @@
 	for (j = 0; j < i; j++){
 		if (after(ntohl(sp[j].start_seq),
 			  ntohl(sp[j+1].start_seq))){
-			sp[j].start_seq = htonl(tp->recv_sack_cache[j+1].start_seq);
-			sp[j].end_seq = htonl(tp->recv_sack_cache[j+1].end_seq);
-			sp[j+1].start_seq = htonl(tp->recv_sack_cache[j].start_seq);
-			sp[j+1].end_seq = htonl(tp->recv_sack_cache[j].end_seq);
+			struct tcp_sack_block_wire tmp;
+
+			tmp = sp[j];
+			sp[j] = sp[j+1];
+			sp[j+1] = tmp;
 		}
 
 	}


Re: [PATCH] Fix sorting of SACK blocks

2007-01-25 Thread Baruch Even
* Stephen Hemminger [EMAIL PROTECTED] [070125 20:47]:
 On Thu, 25 Jan 2007 20:29:03 +0200
 Baruch Even [EMAIL PROTECTED] wrote:
 
  The sorting of SACK blocks actually munges them rather than sort, causing 
  the
  TCP stack to ignore some SACK information and breaking the assumption of
  ordered SACK blocks after sorting.
  
  The sort takes the data from a second buffer which isn't moved causing
  subsequent data moves to occur from the wrong location. The fix is to
  use a temporary buffer as a normal sort does.
  
  Signed-Off-By: Baruch Even [EMAIL PROTECTED]
  
  diff -X 2.6-rc6/Documentation/dontdiff -ur 2.6-rc6/net/ipv4/tcp_input.c 
  2.6-mod/net/ipv4/tcp_input.c
  --- 2.6-rc6/net/ipv4/tcp_input.c2007-01-25 19:04:20.0 +0200
  +++ 2.6-mod/net/ipv4/tcp_input.c2007-01-25 19:52:04.0 +0200
  @@ -1011,10 +1011,11 @@
  	for (j = 0; j < i; j++){
  		if (after(ntohl(sp[j].start_seq),
  			  ntohl(sp[j+1].start_seq))){
  -			sp[j].start_seq = htonl(tp->recv_sack_cache[j+1].start_seq);
  -			sp[j].end_seq = htonl(tp->recv_sack_cache[j+1].end_seq);
  -			sp[j+1].start_seq = htonl(tp->recv_sack_cache[j].start_seq);
  -			sp[j+1].end_seq = htonl(tp->recv_sack_cache[j].end_seq);
  +			struct tcp_sack_block_wire tmp;
  +
  +			tmp = sp[j];
  +			sp[j] = sp[j+1];
  +			sp[j+1] = tmp;
  		}
  
  	}
 
 This looks okay, but is there a test case that can be run?

There is nothing visible that shows the problem; the only option is to
add some code to print the SACK blocks after sorting and run it over a
large BDP connection that can be saturated. You'll obviously need to
have several holes. I believe the bug will be visible when you have ACK
packets with three SACK blocks where the first block is the highest,
which should be the normal case.

Cheers,
Baruch


Possible bugs in SACK processing

2007-01-25 Thread Baruch Even
In addition to the patch I've provided there are two more issues that I
believe are bugs in the SACK processing code. Since I'm not certain but
I don't have the time to look into them I'd like to raise them for other
folks to look at.

The first issue is the check for the applicability of the fast path. The
sack blocks are compared directly, but there is no comparison of the
number of sack blocks. If in the former sack we had two blocks and now
we have three, we will compare the third sack block against old or
uninitialised data. The chance of anything really bad happening might
not be high, but it is still bad behaviour.

The second issue is that there is no check that the fast-path hint is
actually behind the new SACK data. Consider a scenario where we have
three sack blocks and the first sack update is about an old location,
and then comes another sack packet with only an update to the old
location. The result is that after the former sack block the hint is at
the latest location it can be, and when the next sack packet arrives we
detect it is an increase only, but the fast-path hint is too far ahead
and we do no updating at all.

Baruch


Re: [PATCH] Fix sorting of SACK blocks

2007-01-25 Thread Baruch Even
* David Miller [EMAIL PROTECTED] [070126 01:55]:
 From: Baruch Even [EMAIL PROTECTED]
 Date: Thu, 25 Jan 2007 20:29:03 +0200
 
  The sorting of SACK blocks actually munges them rather than sort, causing 
  the
  TCP stack to ignore some SACK information and breaking the assumption of
  ordered SACK blocks after sorting.
  
  The sort takes the data from a second buffer which isn't moved causing
  subsequent data moves to occur from the wrong location. The fix is to
  use a temporary buffer as a normal sort does.
  
  Signed-Off-By: Baruch Even [EMAIL PROTECTED]
 
 BTW, in reviewing this I note that there is now only one remaining
 use of tp-recv_sack_cache[] and that is the code earlier in this
 function which is trying to detect if all we are doing is extending
 the leading edge of a SACK block.
 
 It would be nice to be able to clear out that usage as well, and
 remove recv_sack_cache[] and thus make tcp_sock smaller.

You actually need recv_sack_cache to detect whether you can use the fast
path. Another alternative is to somehow hash the values of the sack
blocks, but then you rely on probability to properly detect the ability
to use the fast path. Hashing will save some space, but you can't get
rid of the cache completely unless you go back to the old and slow
method of SACK processing.

There were thoughts thrown around a while back about using a different
data structure; I think you said you started working on something like
that. If that comes to fruition the cache might go.

FWIW, my other mail about possible bugs actually says that you might
need to add another value to check, the number of sack blocks in the
cache.

Baruch


IPv6 source address selection

2006-07-30 Thread Baruch Even
Hello,

My network has several IPv6 addresses and they don't route between
themselves. Due to the current source address selection this means that
many times the network is simply not operational, since Linux will
choose an address for a different network than the one targeted by the
connection.

I have seen in thread [1] that there is a patch for this and that
another patch was supposed to be applied for 2.6.15(!), yet I'm using
2.6.17 and nothing works.

Another report about this is at [2]

Could whatever difference exists be resolved and this issue fixed? IPv6
is completely unusable on my network due to this issue.

Thanks,
Baruch

[1] http://marc.theaimsgroup.com/?t=11304534653&r=1&w=2
[2] http://marc.theaimsgroup.com/?l=linux-net&m=111989050303975&w=2


Re: [PATCH] TCP Veno module for kernel 2.6.16.13

2006-05-24 Thread Baruch Even
#ZHOU BIN# wrote:
 From: Bin Zhou [EMAIL PROTECTED]
 +	else if (sysctl_tcp_abc) {
 +		/* RFC3465: Appropriate Byte Count
 +		 * increase once for each full cwnd acked.
 +		 * Veno has no idea about it so far, so we keep
 +		 * it as Reno.
 +		 */
 +		if (tp->bytes_acked >= tp->snd_cwnd*tp->mss_cache) {
 +			tp->bytes_acked -= tp->snd_cwnd*tp->mss_cache;
 +			if (tp->snd_cwnd < tp->snd_cwnd_clamp)
 +				tp->snd_cwnd++;
 +		}

You should prefer to ignore ABC instead. At least that's what everyone
else is doing; the only place where ABC is active is in NewReno.
Baruch


Re: Fw: [Bugme-new] [Bug 6197] New: unregister_netdevice: waiting for ppp9 to become free. Usage count = 658

2006-03-10 Thread Baruch Even
Herbert Xu wrote:
 Baruch Even [EMAIL PROTECTED] wrote:
 
+   case NETDEV_UNREGISTER:
   case NETDEV_GOING_DOWN:
   case NETDEV_DOWN:
   /* Find every socket on this device and kill it. */
 
 
 This brings up the question as to why we need to flush it on
 NETDEV_GOING_DOWN and NETDEV_DOWN as well.  If it's possible
 for things to get added after the flush then isn't it pointless
 to flush there?

It's the first time I've looked at this code and I'm not completely sure
I understand the whole state machine for net devices, which is why I
opted to do the simplest thing that might work. Someone more versed in
this code can do better.

I've taken the time to survey the other protocols/devices that handle
these events, and it does seem like there is a complete separation
between the actions for these events (which makes sense; otherwise why
have different events?).

It does look like it should be sufficient to remove all sockets only in
the unregister case, but I'm not sure why we should keep those sockets
around when a device is merely taken down.

Baruch


Re: Fw: [Bugme-new] [Bug 6197] New: unregister_netdevice: waiting for ppp9 to become free. Usage count = 658

2006-03-09 Thread Baruch Even
* Andrew Morton [EMAIL PROTECTED] [060309 12:19]:
 
 
 Begin forwarded message:
 
 Date: Thu, 9 Mar 2006 01:24:06 -0800
 From: [EMAIL PROTECTED]
 To: [EMAIL PROTECTED]
 Subject: [Bugme-new] [Bug 6197] New: unregister_netdevice: waiting for ppp9 
 to become free. Usage count = 658
 
 
 http://bugzilla.kernel.org/show_bug.cgi?id=6197
 
Summary: unregister_netdevice: waiting for ppp9 to become free.
 Usage count = 658
 Kernel Version: 2.6.15 and all 2.6 series
 Status: NEW
   Severity: blocking
  Owner: [EMAIL PROTECTED]
  Submitter: [EMAIL PROTECTED]
 
 
 Hi there! I've been experiencing a big problem with the latest kernel version
 and all 2.6 versions prior to it.
 I'm using Fedora Core 3, and the machine in question is a router and a dial-in
 server. I use pppoe-server in kernel mode (rp-pppoe-3.7 and pppd 2.4.3).
 When a user connects to the server with pppoe, then to the http daemon, and then
 disconnects, the kernel starts printing messages like this:
 
 Message from [EMAIL PROTECTED] at Thu Mar  9 10:51:14 2006 ...
 
 nextc kernel: unregister_netdevice: waiting for ppp9 to become free. Usage 
 count
 = 233

We seem to not handle the NETDEV_UNREGISTER notification in pppoe, can
you please try the following patch? It is against latest git snapshot
but it should apply to 2.6.15 as well.

Baruch
--

We need to remove all references to the device when we receive the
NETDEV_UNREGISTER notification.

Signed-off-by: Baruch Even [EMAIL PROTECTED]

--
 drivers/net/pppoe.c |1 +
 1 file changed, 1 insertion(+)

Index: pppcd/drivers/net/pppoe.c
===
--- pppcd.orig/drivers/net/pppoe.c
+++ pppcd/drivers/net/pppoe.c
@@ -305,6 +305,7 @@ static int pppoe_device_event(struct not
 * LCP re-negotiation.
 */
 
+   case NETDEV_UNREGISTER:
case NETDEV_GOING_DOWN:
case NETDEV_DOWN:
/* Find every socket on this device and kill it. */



Re: [RFC,NETLINK]: (v2) Add netlink_has_listeners() for checking for multicast listeners

2006-01-30 Thread Baruch Even
Patrick McHardy wrote:
 New version of the netlink_has_listeners() patch.
 
 Changes:
 
 - Fix missing listeners bitmap update when there was no delta in the
   number of subscribed groups
 - Use RCU to protect nltable listeners bitmap
 
 
 
 
 
 [NETLINK]: Add netlink_has_listeners() for checking for multicast listeners
 
 netlink_has_listeners() should be used to avoid unneccessary event message
 generation if there are no listeners.
 
...
   if (nlk->flags & NETLINK_KERNEL_SOCKET) {
 - netlink_table_grab();
 + unsigned long *listeners;
 +
 + listeners = nl_table[sk->sk_protocol].listeners;
 + nl_table[sk->sk_protocol].listeners = NULL;
 + synchronize_rcu();
 + kfree(nl_table[sk->sk_protocol].listeners);

Doesn't the NULL assignment need to use rcu_assign_pointer()?

And shouldn't the kfree be on the local listeners variable rather than
on the field that was just set to NULL?


Baruch


[PATCH] net: Fix H-TCP accounting

2006-01-26 Thread Baruch Even
This fixes the accounting in H-TCP; the ccount variable is also adjusted
a few lines above this one.

This line was not supposed to be there and wasn't in the patches as
originally submitted; the four submitted patches were merged into one,
and the bug was introduced in that merge.

Signed-Off-By: Baruch Even [EMAIL PROTECTED]

--

 net/ipv4/tcp_htcp.c |1 -
 1 file changed, 1 deletion(-)

Index: 2.6-git/net/ipv4/tcp_htcp.c
===
--- 2.6-git.orig/net/ipv4/tcp_htcp.c
+++ 2.6-git/net/ipv4/tcp_htcp.c
@@ -230,7 +230,6 @@ static void htcp_cong_avoid(struct sock 
 if (tp->snd_cwnd < tp->snd_cwnd_clamp)
 tp->snd_cwnd++;
 tp->snd_cwnd_cnt = 0;
-   ca->ccount++;
}
}
 }


Strange cwnd history

2006-01-25 Thread Baruch Even
Hi,

I'm testing Linux 2.6.16-rc1-git4 on a 500Mbps line with 220ms rtt. I'm
getting a very strange cwnd history and was wondering if anyone noticed
it before and knows why it happens. A graph is attached and you can find
a resizable version at http://hamilton.ie/person/baruch/linet/

The changes I have in my tree are only related to tracing the cwnd
history (and other details); the only functional change is commenting
out the tp->snd_cwnd = min(tp->snd_cwnd, tcp_packets_in_flight(tp)+1);
line in tcp_cwnd_down(). With it the history was flawed, but differently.

Settings of possible interest:
congestion control: htcp
ecn: 1
abc: 1
tso: off (performance is very low with it on, on the order of tens of Kbps)
sack: 1 (no losses due to the use of ECN)

The ECN marking strategy is to mark all packets that fill more than half
the buffer, the buffer is set to 40% of the BDP so ECN markings happens
at 20% of the BDP.

The strange part is where the graph goes up in stairs until some point
that it returns to normality and does the normal H-TCP fast increase.

Thanks in advance,
Baruch



SACK performance improvements - technical report and updated 2.6.6 patches

2005-12-19 Thread Baruch Even
Hello,

I wanted to post an update about my work for SACK performance
improvements, I've updated the patches on our website and added a
technical report on the work so far.

It can be found at:
http://hamilton.ie/net/research.htm#patches

In summary: the Linux stack so far is unable to effectively handle
single transfers at 1Gbps over high-rtt links (220 ms rtt is what we
tested). The sender is unable to process the ACK packets fast enough,
causing lost ACKs and increased transfer times. Our work resulted in a
set of patches that enable the Linux TCP stack to handle this load
without breaking a sweat.

Your comments on this work would be appreciated.

Regards,
Baruch


Re: [PATCH] cubic: pre-compute based on parameters

2005-12-14 Thread Baruch Even
David S. Miller wrote:
 From: Stephen Hemminger [EMAIL PROTECTED]
 Date: Mon, 12 Dec 2005 12:03:22 -0800
 
 
-d32 = d32 / HZ;
-
 /* (wmax-cwnd) * (srtt>>3 / HZ) / c * 2^(3*bictcp_HZ)  */
-d64 = (d64 * dist * d32) >> (count+3-BICTCP_HZ);
-
-/* cubic root */
-d64 = cubic_root(d64);
-
-result = (u32)d64;
-return result;
+ return cubic_root((cube_factor * dist) >> (cube_scale + 3 - BICTCP_HZ));
 
  ...
 
+ while (!(d32 & 0x80000000) && (cube_scale < BICTCP_HZ)) {
+ d32 = d32 << 1;
+ ++cube_scale;
+ }
+ cube_factor = d64 * d32 / HZ;
+
+
 
 
 I don't think this transformation is equivalent.
 
 In the old code only the d32 is scaled by HZ.
 
 So in the old code we're saying something like:
 
   d64 = (d64 * dist * (d32 / HZ)) >> (count + 3 - BICTCP_HZ);
 
 whereas the new code looks like:
 
   d64 = (((d64 * d32) / HZ) * dist) >> (count + 3 - BICTCP_HZ);
 
 Is that really equivalent?

Almost. It depends on how large the numbers in d64 and d32 are: if their
multiplication may overflow, then the first option is better, since it
has less of a chance of overflowing.

On the other hand, the second form can be more accurate.

Baruch


Re: Hardware assisted SACK processing (was: [PATCH] TCP Offload (TOE) - Chelsio)

2005-08-21 Thread Baruch Even
David S. Miller wrote:
 From: Wael Noureddine [EMAIL PROTECTED]
 Date: Sun, 21 Aug 2005 00:54:51 -0700
 
 
You could also tweak the LRO timeout in a similar fashion based upon
traffic patterns as well.  In fact, extremely sophisticated things can
be done here to deal with the LRO timing as seen on WAN vs. LAN
streams.

The accurate statement is extremely complicated things need to be done here 
to deal with the LRO timing as seen on WAN vs. LAN streams. Not to mention 
dealing with retransmissions and the dynamics of congestion control.
 
 
 LRO will just stop accumulating when out-of-sequence data arrives.
 Nothing complicated at all.
 
 And that's _EXACTLY_ what we want to happen.  We want Linux's TCP loss
 response algorithms to take care of things, which have been
 extensively tuned over many many years and gets several orders of
 magnitude more testing and exposure than any customized stack you guys
 put onto a network card.

Actually, at high speeds SACK processing becomes a huge bottleneck in
itself. If we could get some help from the hardware with pruning some
of the trivial cases, it would help, I think.

One thing I can think of, and which I implemented in software, is a SACK
cache feature that kicks in at high speeds and starts processing SACKs
only every 16 packets (exact parameters to be researched). This was shown
to increase performance and eliminate stalls that otherwise happen.

In this niche, a NIC that understands the SACKs and batches them for
processing, when they are just the common case of an additional x packets
added to the latest SACK block, would help, reducing quite a bit of the
work the CPU otherwise needs to do.

We would still have the full Linux stack processing and reacting to the
losses and doing the retransmits; we just get the hardware to batch some
of the work for us. This needs to be host assisted, since we don't want
it at the early stages of the connection, or for slow connections.

Baruch


Re: Stretch ACKs (was: [PATCH] TCP Offload (TOE) - Chelsio)

2005-08-21 Thread Baruch Even
David S. Miller wrote:
 From: Wael Noureddine [EMAIL PROTECTED]
 Date: Sun, 21 Aug 2005 00:17:17 -0700
 
 
How do you intend on avoiding huge stretch ACKs?
 
 
 The implication is that stretch ACKs are bad, which is wrong.
 Oh yes, that's right, you're the same person who earlier in this
 thread tried to teach us that bursty TCPs are non-standard :-)
 
 Stretch ACKs are actually a positive thing on a healthy connection and
 do indeed help the sender.  And when loss events occur, LRO stops
 immediately and delivers the packets as-is so that loss information
 via ACKs with SACK blocks can immediately make their way to the
 sender.
 
 Linux does actually currently generate stretch ACKs, when beneficial.

I do notice in my own tests that I'm seeing stretch ACKs of 7 and 8
packets quite often. Is there any intention to add ABC (Appropriate Byte
Counting) to Linux to offset the effect this has on cwnd growth?

I haven't seen anything critical happen because of this, but it
definitely changes the way TCP behaves.

Baruch