Re: No pmtu probing on retransmits?

2008-02-03 Thread John Heffner
Andi Kleen wrote:
> Hallo,
>
> While looking for something else in tcp_output.c I noticed that
> MTU probing seems to be only done in tcp_write_xmit (when
> packets come directly from process context), but not via the timer-driven
> retransmit path (tcp_retransmit_skb). Is that intentional?
> It looks quite weird. I would normally assume PMTU blackholes usually get
> detected on retransmit timeouts. Or am I missing something?

MTU probing occurs only when everything is going fine.  We are probing
a larger size than currently in use.  In the case of a timeout, we
want to retransmit with the safe smaller size.
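
To make the distinction concrete, here is a rough sketch of the policy (my
illustration only, not the actual tcp_write_xmit()/tcp_retransmit_skb() code):
probing a larger segment size is only attempted on the normal transmit path,
while the timeout path sticks to a size that is already known to work.

#include <stdio.h>

enum tx_path { NORMAL_XMIT, RTO_RETRANSMIT };

static int segment_size(enum tx_path path, int cur_mss, int probe_mss)
{
	if (path == NORMAL_XMIT)
		return probe_mss;   /* everything is going fine: try the larger size */
	return cur_mss;             /* a timeout is no time to experiment            */
}

int main(void)
{
	printf("normal xmit: %d\n", segment_size(NORMAL_XMIT, 1460, 2920));
	printf("rto rexmit:  %d\n", segment_size(RTO_RETRANSMIT, 1460, 2920));
	return 0;
}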


> You seem to have assumed interrupt context at least
> because tcp_mtu_probe() uses GFP_ATOMIC, which is only needed in
> interrupt context. Currently it is only called from process context, I think.

I'm pretty sure it'll get called on ACK processing in softirq, e.g. via this call chain:
tcp_mtu_probe()
tcp_write_xmit()
__tcp_push_pending_frames()
tcp_data_snd_check()
tcp_rcv_established()

Am I missing something?

  -John


Re: SO_RCVBUF doesn't change receiver advertised window

2008-01-16 Thread John Heffner

Ritesh Kumar wrote:

On 1/16/08, Bill Fink <[EMAIL PROTECTED]> wrote:

On Tue, 15 Jan 2008, Ritesh Kumar wrote:


Hi,
I am using linux 2.6.20 and am trying to limit the receiver window
size for a TCP connection. However, it seems that auto tuning is not
turning itself off even after I use the syscall

rwin=65536
setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &rwin, sizeof(rwin));

and verify using

getsockopt(sock, SOL_SOCKET, SO_RCVBUF, &rwin, &rwin_size);

that RCVBUF indeed is getting set (the value returned from getsockopt
is double that, 131072).

Linux doubles what you requested, and then uses (by default) 1/4
of the socket space for overhead, so you effectively get 1.5 times
what you requested as an actual advertised receiver window, which
means since you specified 64 KB, you actually get 96 KB.
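
For concreteness, the arithmetic Bill describes works out like this (a minimal
sketch, assuming the default tcp_adv_win_scale setting that reserves 1/4 of
the buffer for overhead):

#include <stdio.h>

int main(void)
{
	int requested = 65536;             /* value passed to setsockopt(SO_RCVBUF) */
	int rcvbuf = 2 * requested;        /* the kernel stores double: 131072      */
	int window = rcvbuf - rcvbuf / 4;  /* ~1/4 reserved for skb overhead        */

	printf("max advertised window ~= %d bytes (~%d KB)\n", window, window / 1024);
	return 0;
}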


The above calls are made before connect() on the client side and
before bind(), accept() on the server side. Bulk data is being sent
from the client to the server. The client and the server machines also
have tcp_moderate_rcvbuf set to 0 (though I don't think that's really
needed; setting SO_RCVBUF should automatically turn off autotuning).

However the tcp trace shows the SYN, SYN/ACK and the first few packets as:
14:34:18.831703 IP 192.168.1.153.45038 > 192.168.2.204.: S 3947298186:3947298186(0) win 5840
14:34:18.836000 IP 192.168.2.204. > 192.168.1.153.45038: S 3955381015:3955381015(0) ack 3947298187 win 5792
14:34:18.837654 IP 192.168.1.153.45038 > 192.168.2.204.: . ack 1 win 183
14:34:18.837849 IP 192.168.1.153.45038 > 192.168.2.204.: . 1:1449(1448) ack 1 win 183
14:34:18.837851 IP 192.168.1.153.45038 > 192.168.2.204.: P 1449:1461(12) ack 1 win 183
14:34:18.839001 IP 192.168.2.204. > 192.168.1.153.45038: . ack 1449 win 2172
14:34:18.839011 IP 192.168.2.204. > 192.168.1.153.45038: . ack 1461 win 2172
14:34:18.840875 IP 192.168.1.153.45038 > 192.168.2.204.: . 1461:2909(1448) ack 1 win 183
14:34:18.840997 IP 192.168.1.153.45038 > 192.168.2.204.: . 2909:4357(1448) ack 1 win 183
14:34:18.841120 IP 192.168.1.153.45038 > 192.168.2.204.: . 4357:5805(1448) ack 1 win 183
14:34:18.841244 IP 192.168.1.153.45038 > 192.168.2.204.: . 5805:7253(1448) ack 1 win 183
14:34:18.841388 IP 192.168.2.204. > 192.168.1.153.45038: . ack 2909 win 2896
14:34:18.841399 IP 192.168.2.204. > 192.168.1.153.45038: . ack 4357 win 3620
14:34:18.841413 IP 192.168.2.204. > 192.168.1.153.45038: . ack 5805 win 4344

As you can see, the SYN and SYN/ACK show receive windows of 5840 and
5792, and the receiver's advertised window automatically increases to
values from 2172 up to 4344, and later in the trace up to 24214.

Since the window scale was 2, the final advertised receiver window
you indicate of 24214 gives 2^2*24214 or right around 96 KB, which
is what is expected given the way Linux works.

-Bill


Thanks for the explanation, Bill. That certainly clears up part of my doubt.
However, why doesn't Linux advertise 24214 in the SYN packet? I was
hoping that the moment I set SO_RCVBUF, Linux would pre-allocate
buffers and drop any autotuning. Doesn't the above behavior count as
autotuning?



Linux also starts all connections with a small advertised window.  It 
only grows the window after observing the ratio of data to overhead in 
received packets.  If it receives only small packets from the sender 
with a high overhead ratio, it will only open the window just far enough 
that it doesn't overflow the receive buffer.  This algorithm (look for 
rcv_ssthresh in the code) controls the advertised window given a receive 
buffer size.  This is separate from autotuning, which adjusts the buffer 
size.  You're correct that autotuning is disabled when SO_RCVBUF is set, 
but the "receive slow-start" is always used.
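
To illustrate what I mean, here is a toy model of that receive slow-start (my
sketch, not the actual kernel algorithm): the advertised window starts small
and is allowed to grow by roughly the amount of well-behaved data received
each round trip, capped by the usable part of the receive buffer (assuming
the default 1/4 overhead reservation).

#include <stdio.h>

int main(void)
{
	int rcvbuf = 131072;            /* SO_RCVBUF after the kernel doubles it */
	int cap = rcvbuf - rcvbuf / 4;  /* ~96 KB usable for payload             */
	int win = 5840;                 /* initial advertised window             */
	int delivered = 5792;           /* payload accepted in the first round   */
	int round;

	for (round = 1; round <= 6; round++) {
		win += delivered;       /* grow only as fast as data actually arrives */
		if (win > cap)
			win = cap;
		printf("round %d: advertised window %d\n", round, win);
		delivered = win;        /* next round the sender can fill the window  */
	}
	return 0;
}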


  -John


Re: SACK scoreboard

2008-01-09 Thread John Heffner

SANGTAE HA wrote:

On Jan 9, 2008 9:56 AM, John Heffner <[EMAIL PROTECTED]> wrote:

I also wonder how much of a problem this is (for now, with window sizes
of order 1 packets).  My understanding is that the biggest problems
arise from O(N^2) time for recovery because every ACK was expensive.
Have current tests shown the final ACK to be a major source of problems?

Yes, several people have reported this.

I may have missed some of this.  Does anyone have a link to some recent
data?


I did some testing on this a month ago.
A small set of recent results with Linux 2.6.23.9 is at
http://netsrv.csc.ncsu.edu/net-2.6.23.9/sack_efficiency
One of the serious cases, with a large number of packet losses (initial
loss is around 8000 packets), is at
http://netsrv.csc.ncsu.edu/net-2.6.23.9/sack_efficiency/600--TCP-TCP-NONE--400-3-1.0--1000-120-0-0-1-1-5-500--1.0-0.5-133000-73-300-0.93-150--3/

Also, there is a comparison among three Linux kernels (2.6.13,
2.6.18-rc4, 2.6.20.3) at
http://netsrv.csc.ncsu.edu/wiki/index.php/Efficiency_of_SACK_processing



If I'm reading this right, all these tests occur with large amounts of 
loss and tons of sack processing.  What would be most pertinent to this 
discussion would be a test with a large window, with delayed ack and 
sack disabled, and a single loss repaired by fast retransmit.  This 
would isolate the "single big ack" processing from other factors such as 
doubling the ack rate and sack processing.


I could probably set up such a test, but I don't want to duplicate 
effort if someone else already has done something similar.


Thanks,
  -John


Re: SACK scoreboard

2008-01-09 Thread John Heffner

David Miller wrote:

From: John Heffner <[EMAIL PROTECTED]>
Date: Tue, 08 Jan 2008 23:27:08 -0500

I also wonder how much of a problem this is (for now, with window sizes
of order 1 packets).  My understanding is that the biggest problems
arise from O(N^2) time for recovery because every ACK was expensive.
Have current tests shown the final ACK to be a major source of problems?


Yes, several people have reported this.


I may have missed some of this.  Does anyone have a link to some recent 
data?


  -John


Re: SACK scoreboard

2008-01-08 Thread John Heffner

Andi Kleen wrote:

David Miller <[EMAIL PROTECTED]> writes:

The big problem is that recovery from even a single packet loss in a
window makes us run kfree_skb() for a all the packets in a full
window's worth of data when recovery completes.


Why exactly is it a problem to free them all at once? Are you worried
about kernel preemption latencies?

-Andi



I also wonder how much of a problem this is (for now, with window sizes
of order 1 packets).  My understanding is that the biggest problems
arise from O(N^2) time for recovery because every ACK was expensive.
Have current tests shown the final ACK to be a major source of problems?


  -John


Re: SACK scoreboard

2008-01-08 Thread John Heffner

David Miller wrote:

Ilpo, just trying to keep an old conversation from dying off.

Did you happen to read a recent blog posting of mine?

http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2007/12/31#tcp_overhead

I've been thinking more and more and I think we might be able
to get away with enforcing that SACKs are always increasing in
coverage.

I doubt there are any real systems out there that drop out of order
packets that are properly formed and are in window, even though the
SACK specification (foolishly, in my opinion) allows this.

If we could free packets as SACK blocks cover them, all the problems
go away.

For one thing, this will allow the retransmit queue liberation during
loss recovery to be spread out over the event, instead of batched up
like crazy to the point where the cumulative ACK finally moves and
releases an entire window's worth of data.

Next, it would simplify all of this scanning code trying to figure out
which holes to fill during recovery.

And for SACK scoreboard marking, the RB trie would become very nearly
unnecessary as far as I can tell.

I would not even entertain this kind of crazy idea unless I thought
the fundamental complexity simplification payback was enormous.  And
in this case I think it is.

What we could do is put some experimental hack in there for developers
to start playing with, which would enforce that SACKs always increase
in coverage.  If violated, the connection is reset and a verbose log
message is logged so we can analyze any cases that occur.

Sounds crazy, but maybe has potential.  What do you think?



Linux has a code path where this can happen under memory over-commit, in 
tcp_prune_queue().  Also, I think one of the motivations for making SACK 
strictly advisory is there was some concern about buggy SACK 
implementations.  Keeping data in your retransmit queue allows you to 
fall back to timeout and go-back-n if things completely fall apart.  For 
better or worse, we have to deal with the spec the way it is.


Even if you made this assumption of "hard" SACKs, you still have to 
worry about large ACKs if SACK is disabled, though I guess you could say 
people running with large windows without SACK deserve what they get. :)



I haven't thought about this too hard, but can we approximate this by 
moving sacked data into a sacked queue, then if something bad happens 
merge this back into the retransmit queue?  The code will have to deal 
with non-contiguous data in the retransmit queue; I'm not sure offhand 
if that violates any assumptions.  You still have a single expensive ACK 
at the end of recovery, though I wonder how much this really hurts.  If 
you want to ameliorate this, you could save this sacked queue to be 
batch processed later, in application context for instance.
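
Roughly what I have in mind, as a toy illustration (my sketch, not kernel
code): segments covered by SACK blocks move to a side list instead of being
freed, and get spliced back into the retransmit queue if the SACK information
has to be abandoned.  A real implementation would additionally need to keep
the queue sequence-ordered and cope with the non-contiguous ranges this
creates.

#include <stdio.h>
#include <stdlib.h>

struct seg { unsigned int seq; struct seg *next; };

static struct seg *rtx_q;     /* not yet cumulatively ACKed       */
static struct seg *sacked_q;  /* currently covered by SACK blocks */

static void push(struct seg **q, unsigned int seq)
{
	struct seg *s = malloc(sizeof(*s));
	s->seq = seq;
	s->next = *q;
	*q = s;
}

static void mark_sacked(unsigned int seq)   /* move rtx_q -> sacked_q */
{
	struct seg **p;
	for (p = &rtx_q; *p; p = &(*p)->next) {
		if ((*p)->seq == seq) {
			struct seg *s = *p;
			*p = s->next;
			s->next = sacked_q;
			sacked_q = s;
			return;
		}
	}
}

static void merge_back(void)                /* "something bad happened" */
{
	while (sacked_q) {
		struct seg *s = sacked_q;
		sacked_q = s->next;
		s->next = rtx_q;
		rtx_q = s;
	}
}

int main(void)
{
	unsigned int seq;
	struct seg *s;

	for (seq = 3; seq >= 1; seq--)
		push(&rtx_q, seq);
	mark_sacked(2);   /* a SACK block covers segment 2                  */
	merge_back();     /* fall back: everything is retransmittable again */
	for (s = rtx_q; s; s = s->next)
		printf("rtx seq %u\n", s->seq);
	return 0;
}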


  -John




Re: TSO trimming question

2007-12-20 Thread John Heffner

David Miller wrote:

From: "Ilpo Järvinen" <[EMAIL PROTECTED]>
Date: Thu, 20 Dec 2007 13:40:51 +0200 (EET)


[PATCH] [TCP]: Fix TSO deferring

I'd say that most of what tcp_tso_should_defer had in between
there was dead code because of this.

Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]>


Yikes!

John, we've been living a lie for more than a year. :-/

On the bright side this explains a lot of small TSO frames I've been
seeing in traces over the past year but never got a chance to
investigate.


Ouch.  This fix may improve some benchmarks.

Re-checking this function was on my list of things to do because I had 
also noticed some TSO frames that seemed a bit small.  This clearly 
explains it.


  -John


Re: TCP event tracking via netlink...

2007-12-05 Thread John Heffner

David Miller wrote:

Ilpo, I was pondering the kind of debugging one does to find
congestion control issues and even SACK bugs and it's currently too
painful because there is no standard way to track state changes.

I assume you're using something like carefully crafted printk's,
kprobes, or even ad-hoc statistic counters.  That's what I used to do
:-)

With that in mind it occurred to me that we might want to do something
like a state change event generator.

Basically some application or even a daemon listens on this generic
netlink socket family we create.  The header of each event packet
indicates what socket the event is for and then there is some state
information.

Then you can look at a tcpdump and this state dump side by side and
see what the kernel decided to do.

Now there is the question of granularity.

A very important consideration in this is that we want this thing to
be enabled in the distributions, therefore it must be cheap.  Perhaps
one test at the end of the packet input processing.

So I say we pick some state to track (perhaps start with tcp_info)
and just push that at the end of every packet input run.  Also,
we add some minimal filtering capability (match on specific IP
address and/or port, for example).

Maybe if we want to get really fancy we can have some more-expensive
debug mode where detailed specific events get generated via some
macros we can scatter all over the place.  This won't be useful
for general user problem analysis, but it will be excellent for
developers.

Let me know if you think this is useful enough and I'll work on
an implementation we can start playing with.



FWIW, sounds similar to what these guys are doing with SIFTR for FreeBSD:
http://caia.swin.edu.au/urp/newtcp/tools.html
http://caia.swin.edu.au/reports/070824A/CAIA-TR-070824A.pdf

  -John


Re: [PATCH net-2.6 0/3]: Three TCP fixes

2007-12-04 Thread John Heffner

Ilpo Järvinen wrote:

On Tue, 4 Dec 2007, John Heffner wrote:


Ilpo Järvinen wrote:

...I'm still to figure out why tcp_cwnd_down uses snd_ssthresh/2
as lower bound even though the ssthresh was already halved, so snd_ssthresh
should suffice.

I remember this coming up at least once before, so it's probably worth a
comment in the code.  Rate-halving attempts to actually reduce cwnd to half
the delivered window.  Here, cwnd/4 (ssthresh/2) is a lower bound on how far
rate-halving can reduce cwnd.  See the "Bounding Parameters" section of
<http://www.psc.edu/networking/papers/FACKnotes/current/>.


Thanks for the info! Sadly enough, it makes NewReno recovery quite 
inefficient when there are enough losses on a high-BDP link (in my case 
384k/200ms, BDP-sized buffer). There might be yet another bug in it as 
well (it is still a bit unclear how the TCP variables behaved during my 
scenario and I'll investigate further), but the reduction in the transfer 
rate is going to last longer than a short moment (which is used as 
motivation in those FACK notes). In fact, if I just use an RFC 2581-like 
setting without rate-halving (and accept the initial "pause" in sending), 
the ACK clock sending out new data works very nicely, beating rate-halving 
fair and square. For SACK/FACK it works much better because recovery is 
finished much earlier and slow start recovers cwnd quickly.


I believe this is exactly the reason why Matt (CC'd) and Jamshid 
abandoned this line of work in the late 90's.  In my opinion, it's 
probably not such a bad idea to use cwnd/2 as the bound.  In some 
situations, the current rate-halving code will work better, but as you 
point out, in others the cwnd is lowered too much.



...Mind if I ask another similar one, any idea why prior_ssthresh is 
smaller (3/4 of it) than cwnd used to be (see tcp_current_ssthresh)?


Not sure on that one.  I'm not aware of any publications this is based 
on.  Maybe Alexey knows?


  -John


Re: [PATCH net-2.6 0/3]: Three TCP fixes

2007-12-04 Thread John Heffner

Ilpo Järvinen wrote:

...I'm still to figure out why tcp_cwnd_down uses snd_ssthresh/2
as lower bound even though the ssthresh was already halved, 
so snd_ssthresh should suffice.


I remember this coming up at least once before, so it's probably worth a 
comment in the code.  Rate-halving attempts to actually reduce cwnd to 
half the delivered window.  Here, cwnd/4 (ssthresh/2) is a lower bound 
on how far rate-halving can reduce cwnd.  See the "Bounding Parameters" 
section of <http://www.psc.edu/networking/papers/FACKnotes/current/>.
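
For reference, the bound works out like this (a rough sketch of the idea, not
the exact tcp_cwnd_down() code): rate-halving walks cwnd down roughly one
segment per two ACKs, and the bound in question stops it from going below a
quarter of the pre-recovery window.

#include <stdio.h>

int main(void)
{
	unsigned int old_cwnd = 40;
	unsigned int ssthresh = old_cwnd / 2;   /* set on entering recovery: 20 */
	unsigned int floor = ssthresh / 2;      /* the cwnd/4 lower bound:   10 */
	unsigned int cwnd = old_cwnd;
	unsigned int acks = 0;

	while (cwnd > floor) {
		acks++;
		if (acks % 2 == 0)              /* decrement on every other ACK */
			cwnd--;
	}
	printf("cwnd bottoms out at %u (started at %u)\n", cwnd, old_cwnd);
	return 0;
}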


  -John


Re: [RFC PATCH 1/2] [TCP]: MTUprobe: receiver window & data available checks fixed

2007-11-21 Thread John Heffner

Ilpo Järvinen wrote:

It seems that the checked range for the receiver window check should
begin from the first rather than from the last skb that is going
to be included in the probe. And that can be achieved without
reference to skbs at all; snd_nxt and write_seq already provide the
correct seqnos. Plus, it SHOULD account for the packets that are
necessary to trigger fast retransmit [RFC4821].

The location of the snd_wnd < probe_size/size_needed check is bogus
because it will cause the other if() to match as well (due to the
snd_nxt >= snd_una invariant).

Removed a dead-obvious comment.

Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]>


Acked-by: John Heffner <[EMAIL PROTECTED]>



---
 net/ipv4/tcp_output.c |   17 ++++++++---------
 1 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 30d6737..ff22ce8 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1289,6 +1289,7 @@ static int tcp_mtu_probe(struct sock *sk)
struct sk_buff *skb, *nskb, *next;
int len;
int probe_size;
+   int size_needed;
unsigned int pif;
int copy;
int mss_now;
@@ -1307,6 +1308,7 @@ static int tcp_mtu_probe(struct sock *sk)
/* Very simple search strategy: just double the MSS. */
mss_now = tcp_current_mss(sk, 0);
probe_size = 2*tp->mss_cache;
+   size_needed = probe_size + (tp->reordering + 1) * mss_now;
if (probe_size > tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_high)) {
/* TODO: set timer for probe_converge_event */
return -1;
@@ -1316,18 +1318,15 @@ static int tcp_mtu_probe(struct sock *sk)
len = 0;
if ((skb = tcp_send_head(sk)) == NULL)
return -1;
-   while ((len += skb->len) < probe_size && !tcp_skb_is_last(sk, skb))
+   while ((len += skb->len) < size_needed && !tcp_skb_is_last(sk, skb))
skb = tcp_write_queue_next(sk, skb);
-   if (len < probe_size)
+   if (len < size_needed)
return -1;
 
-	/* Receive window check. */
-	if (after(TCP_SKB_CB(skb)->seq + probe_size, tp->snd_una + tp->snd_wnd)) {
-		if (tp->snd_wnd < probe_size)
-			return -1;
-		else
-			return 0;
-	}
+	if (tp->snd_wnd < size_needed)
+		return -1;
+	if (after(tp->snd_nxt + size_needed, tp->snd_una + tp->snd_wnd))
+		return 0;
 
 	/* Do we need to wait to drain cwnd? */
 	pif = tcp_packets_in_flight(tp);




Re: [RFC PATCH 2/2] [TCP] MTUprobe: Cleanup send queue check (no need to loop)

2007-11-21 Thread John Heffner

Ilpo Järvinen wrote:

The original code has striking complexity to perform a query
which can be reduced to a very simple compare.

The FIN seqno may be included in write_seq, but it should not make
any significant difference here compared to skb->len, which was
used previously. One won't end up there with a SYN still queued.

Use of the write_seq check guarantees that there's a valid skb in
send_head, so I removed the extra check.

Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]>


Acked-by: John Heffner <[EMAIL PROTECTED]>



---
 net/ipv4/tcp_output.c |    7 +------
 1 files changed, 1 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index ff22ce8..1822ce6 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1315,12 +1315,7 @@ static int tcp_mtu_probe(struct sock *sk)
}
 
 	/* Have enough data in the send queue to probe? */
-	len = 0;
-	if ((skb = tcp_send_head(sk)) == NULL)
-		return -1;
-	while ((len += skb->len) < size_needed && !tcp_skb_is_last(sk, skb))
-		skb = tcp_write_queue_next(sk, skb);
-	if (len < size_needed)
+	if (tp->write_seq - tp->snd_nxt < size_needed)
 		return -1;
 
 	if (tp->snd_wnd < size_needed)




Re: Fw: [Bug 9189] New: Oops in kernel 2.6.21-rc4 through 2.6.23, page allocation failure

2007-10-19 Thread John Heffner

Stephen Hemminger wrote:

Looks like a memory over commit with small machines??

Begin forwarded message:

Date: Fri, 19 Oct 2007 01:35:33 -0700 (PDT)
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Subject: [Bug 9189] New: Oops in kernel 2.6.21-rc4 through 2.6.23, page 
allocation failure

[snip]

Problem Description: After a recent upgrade to kernel 2.6.23 (from 2.6.20) I have
started seeing kernel oopses in networking code. The problem is 100%
reproducible in my environment. I've seen two slightly different backtraces, but
both seem to be caused by the same commit.

I've performed the git bisect and tracked down the problem to the commit:
53cdcc04c1e85d4e423b2822b66149b6f2e52c2c [TCP]: Fix tcp_mem[] initialization

Once I reverse this commit in 2.6.23 the problem goes away (this is true also
for the kernel version generated by git bisect, 2.6.21-rc4).

Backtrace #1:
page allocation failure. order:1, mode:0x20
 [] __alloc_pages+0x2e1/0x300   
 [] cache_alloc_refill+0x29e/0x4b0

 [] __kmalloc+0x6e/0x80
 [] __alloc_skb+0x53/0x110
 [] tcp_collapse+0x1ac/0x370
 [] tcp_prune_queue+0xfd/0x2c0
 [] tcp_data_queue+0x7cd/0xbb0
 [] skb_checksum+0x4d/0x2a0
 [] tcp_rcv_established+0x36e/0x6a0
 [] tcp_v4_do_rcv+0xb4/0x2a0
 [] __alloc_pages+0xd9/0x300
 [] tcp_v4_rcv+0x6a9/0x6c0
 [] ip_local_deliver+0x91/0x110
 [] ip_rcv+0x230/0x3c0
 [] __alloc_skb+0x53/0x110
 [] netif_receive_skb+0x152/0x1e0
 [] process_backlog+0x6f/0xe0
 [] net_rx_action+0x5c/0xf0
 [] __do_softirq+0x42/0x90
 [] do_softirq+0x27/0x30
 [] do_IRQ+0x3d/0x70
 [] sys_gettimeofday+0x28/0x80
 [] common_interrupt+0x23/0x28
 ===



I'm not surprised that this commit would make a difference in this 
situation, since it does change the fraction of memory TCP is allowed to 
use.  (If it really is too much in this situation, we should tweak the 
function.)  However, I don't think this is the root cause.  Why does it 
oops here when the allocation fails?


  -John


Re: Question on TSO maximum segment sizes.

2007-10-11 Thread John Heffner

Ben Greear wrote:
I just tried turning off my explicit SO_SNDBUF/SO_RCVBUF settings in my
app, and the connection ran very poorly through a link with even a small
bit of latency (~2-4ms I believe).


I often run at full gigabit or faster with latencies of 100+ ms.  Can 
you give a bit more detail?


  -John


Re: tcp bw in 2.6

2007-10-02 Thread John Heffner

Larry McVoy wrote:

More data, we've conclusively eliminated the card / cpu from the mix.
We've got 2 ia64 boxes with e1000 interfaces.  One box is running
linux 2.6.12 and the other is running hpux 11.

I made sure the linux one was running at gigabit and reran the tests
from the linux/ia64 <=> hp/ia64.  Same results, when linux sends
it is slow, when it receives it is fast.

And note carefully: we've removed hpux from the equation, we can do
the same tests from linux to multiple linux clients and see the same
thing, sending from the server is slow, receiving on the server is
fast.



I think I'm still missing some basic data here (probably because this 
thread did not originate on netdev).  Let me try to nail down some of 
the basics.  You have a Linux ia64 box (running 2.6.12 or 2.6.18?) that 
sends slowly, and receives faster, but not quite at 1 Gbps?  And this is 
true regardless of which peer it sends or receives from?  And the 
behavior is different depending on which kernel?  How, and which kernel 
versions?  Do you have other hardware running the same kernel that 
behaves the same or differently?


Have you done ethernet cable tests?  Have you tried measuring the udp 
sending rate?  (Iperf can do this.)  Are there any error counters on the 
interface?


  -John


Re: tcp bw in 2.6

2007-10-02 Thread John Heffner

Larry McVoy wrote:

On Tue, Oct 02, 2007 at 06:52:54PM +0800, Herbert Xu wrote:

One of my clients also has gigabit so I played around with just that
one and it (itanium running hpux w/ broadcom gigabit) can push the load
as well.  One weird thing is that it is dependent on the direction the
data is flowing.  If the hp is sending then I get 46MB/sec, if linux is
sending then I get 18MB/sec.  Weird.  Linux is debian, running 

First of all check the CPU load on both sides to see if either
of them is saturating.  If the CPU's fine then look at the tcpdump
output to see if both receivers are using the same window settings.


tcpdump is a good idea, take a look at this.  The window starts out
at 46 and never opens up in my test case, but in the rsh case it 
starts out the same but does open up.  Ideas?


(Binary tcpdumps are always better than ascii.)

The window on the sender (linux box) starts at 46.  It doesn't open up, 
but it's not receiving data so it doesn't matter, and you don't expect 
it to.  The HP box always announces a window of 32768.


Looks like you have TSO enabled.  Does it behave differently if it's 
disabled?  I think Rick Jones is on to something with the HP ack 
avoidance.  Looks like a pretty low ack ratio, and it might not be 
interacting well with TSO, especially at such a small window size.


  -John


Re: tcp bw in 2.6

2007-10-02 Thread John Heffner

Larry McVoy wrote:

A short summary is "can someone please post a test program that sources
and sinks data at the wire speed?"  because apparently I'm too old and
clueless to write such a thing.


Here's a simple reference TCP source/sink that I've used for years.
For example, on a couple of gigabit machines:


$ ./tcpsend -t10 dew
Sent 1240415312 bytes in 10.033101 seconds
Throughput: 123632294 B/s

  -John

/*
 * discard.c
 * A simple discard server.
 *
 * Copyright 2003 John Heffner.
 */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <errno.h>
#include <signal.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#if 0
#define RATELIMIT
#define RATE10  /* bytes/sec */
#define WAIT_TIME   (100/HZ-1)
#define READ_SIZE   (RATE/HZ)
#else
#define READ_SIZE   (1024*1024)
#endif

void child_handler(int sig)
{
int status;

wait(&status);
}

int main(int argc, char *argv[])
{
int port = 9000;
int lfd;
struct sockaddr_in laddr;
int newfd;
struct sockaddr_in newaddr;
int pid;
socklen_t len;

if (argc > 2) {
fprintf(stderr, "usage: discard [port]\n");
exit(1);
}
if (argc == 2) {
		if (sscanf(argv[1], "%d", &port) != 1 || port < 0 || port > 65535) {
fprintf(stderr, "discard: error: not a port number\n");
exit(1);
}
}

if (signal(SIGCHLD, child_handler) == SIG_ERR) {
perror("signal");
exit(1);
}

memset(&laddr, 0, sizeof (laddr));
laddr.sin_family = AF_INET;
laddr.sin_port = htons(port);
laddr.sin_addr.s_addr = INADDR_ANY;

if ((lfd = socket(PF_INET, SOCK_STREAM, 0)) < 0) {
perror("socket");
exit(1);
}
if (bind(lfd, (struct sockaddr *)&laddr, sizeof (laddr)) != 0) {
perror("bind");
exit(1);
}
if (listen(lfd, 5) != 0) {
perror("listen");
exit(1);
}

for (;;) {
		len = sizeof (newaddr);
		if ((newfd = accept(lfd, (struct sockaddr *)&newaddr, &len)) < 0) {
if (errno == EINTR)
continue;
perror("accept");
exit(1);
}

if ((pid = fork()) < 0) {
perror("fork");
exit(1);
} else if (pid == 0) {
int n;
char buf[READ_SIZE];
int64_t data_rcvd = 0;
struct timeval stime, etime;
float time;

gettimeofday(&stime, NULL);
while ((n = read(newfd, buf, READ_SIZE)) > 0) {
data_rcvd += n;
#ifdef RATELIMIT
usleep(WAIT_TIME);
#endif
}
gettimeofday(&etime, NULL);
close(newfd);

			time = (float)(1000000*(etime.tv_sec - stime.tv_sec) +
			               etime.tv_usec - stime.tv_usec) / 1000000.0;
			printf("Received %lld bytes in %f seconds\n",
			       (long long)data_rcvd, time);
			printf("Throughput: %d B/s\n",
			       (int)((float)data_rcvd / time));

exit(0);
        }

close(newfd);
}

return 1;
}
/*
 * tcpsend.c
 * Send pseudo-random data through a TCP connection.
 *
 * Copyright 2003 John Heffner.
 */

#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#ifdef __linux__
#include 
#endif

#define SNDSIZE (1024 * 10)
#define BUFSIZE (1024 * 1024)

#define max(a,b)(a > b ? a : b)
#define min(a,b)(a < b ? a : b)

int time_done = 0;
int interrupt_done = 0;

struct timeval starttime;

void int_handler(int sig)
{
interrupt_done = 1;
}

void alarm_handler(int sig)
{
time_done = 1;
}

static void usage_error(int err) {
	fprintf(stderr, "usage: tcpsend [-z] [-b max_bytes] [-t max_time] hostname [port]\n");
exit(err);
}

static void cleanup_exit(int fd, char *filename, int status)
{
if (fd > 0)
close(fd);
if (filename)
unlink(filename);
exit(status);
}

int main(int argc, char *argv[])
{
char *hostname = "localhost";

Re: sk98lin, jumbo frames, and memory fragmentation

2007-10-01 Thread John Heffner

Yes it has this problem.  I've observed it in practice on a busy firewall.

  -John


Chris Friesen wrote:


Hi all,

We're considering some hardware that uses the sk98lin network hardware, 
and we'll be using jumbo frames.  Looking at the driver, when using a 
9KB MTU it seems like it would end up trying to atomically allocate a 
16KB buffer.


Has anyone heard of this being a problem?  It would seem like trying to 
atomically allocate four physically contiguous pages could become tricky 
after the system has been running for a while.


The reason I ask is that we ran into this with the e1000.  Before they 
added the new jumbo frame code it was trying to atomically allocate 32KB 
buffers and we would start getting allocation failures after a month or 
so of uptime.


Any information anyone can provide would be appreciated.


Thanks,

Chris




Re: [RFC] Make TCP prequeue configurable

2007-09-27 Thread John Heffner

Stephen Hemminger wrote:

On Fri, 28 Sep 2007 00:08:33 +0200
Eric Dumazet <[EMAIL PROTECTED]> wrote:


Hi all

I am sure some of you are going to tell me that prequeue is not
all black :)

Thank you

[RFC] Make TCP prequeue configurable

The TCP prequeue thing is based on old facts, and has drawbacks.

1) It adds 48 bytes per 'struct tcp_sock'
2) It adds some ugly code in hot paths
3) It has a small hit ratio on typical servers using many sockets
4) It may have a high hit ratio on UP machines running one process,
where the prequeue adds little gain. (In fact, letting the user
do the copy after being woken up is better for cache reuse.)
5) Doing a copy to user in the softirq handler is not good, because of
potential page faults :(
6) Maybe the NET_DMA thing is the only thing that might need the prequeue.

This patch introduces a CONFIG_TCP_PREQUEUE, automatically selected if 
CONFIG_NET_DMA is on.


Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]>



Rather than having two more compile cases and test cases to deal
with: if you can prove it is useless, make a case for killing
it completely.



I think it really does help in case (4) with old NICs that don't do rx 
checksumming.  I'm not sure how many people really care about this 
anymore, but probably some...?


OTOH, it would be nice to get rid of sysctl_tcp_low_latency.

  -John


Re: [PATCH] include listenq max/backlog in tcp_info and related reports - correct version/signorder

2007-09-17 Thread John Heffner

Rick Jones wrote:

John Heffner wrote:
Any reason you're overloading tcpi_unacked and tcpi_sacked?  It seems 
that setting idiag_rqueue and idiag_wqueue are sufficient.


Different fields for different structures.   The tcp_info struct doesn't 
have the idiag_mumble, so to get the two values shown in /proc/net/tcp I 
use tcpi_unacked and tcpi_sacked.


For the INET_DIAG_INFO stuff the idiag_mumble fields are used and that 
then covers ss.


Maybe I'm missing something.  get_tcp[46]_sock() does not use struct 
tcp_info.  The only way I see using this is by doing 
getsockopt(TCP_INFO) on your listen socket.  Is this the intention?
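
For concreteness, this is roughly the usage I have in mind (a minimal sketch
that assumes the patch is applied; error handling omitted and the unbound
listening socket is just for illustration):

#include <stdio.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

int main(void)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	struct tcp_info info;
	socklen_t len = sizeof(info);

	listen(fd, 128);
	if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &info, &len) == 0)
		/* with the patch, these carry the backlog in the LISTEN state */
		printf("backlog %u of max %u\n", info.tcpi_unacked, info.tcpi_sacked);
	return 0;
}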


  -John



Re: [PATCH] include listenq max/backlog in tcp_info and related reports - correct version/signorder

2007-09-17 Thread John Heffner
Any reason you're overloading tcpi_unacked and tcpi_sacked?  It seems 
that setting idiag_rqueue and idiag_wqueue are sufficient.


  -John


Rick Jones wrote:

Return some useful information such as the maximum listen backlog and the
current listen backlog in the tcp_info structure and have that match what
one can see in /proc/net/tcp, /proc/net/tcp6, and INET_DIAG_INFO.

Signed-off-by: Rick Jones <[EMAIL PROTECTED]>
Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]>
---

diff -r bdcdd0e1ee9d Documentation/networking/proc_net_tcp.txt
--- a/Documentation/networking/proc_net_tcp.txt Sat Sep 01 07:00:31 2007 +
+++ b/Documentation/networking/proc_net_tcp.txt Tue Sep 11 10:38:23 2007 -0700
@@ -20,8 +20,8 @@ up into 3 parts because of the length of
   || | |   |--> number of unrecovered RTO timeouts
   || | |--> number of jiffies until timer expires
   || |> timer_active (see below)
-  ||--> receive-queue
-  |---> transmit-queue
+  ||--> receive-queue or connection backlog
+  |---> transmit-queue or connection limit
 
10000 54165785 4 cd1e6040 25 4 27 3 -1
 |  || || |  | |  | |--> slow start size threshold, 
diff -r bdcdd0e1ee9d net/ipv4/tcp.c

--- a/net/ipv4/tcp.cSat Sep 01 07:00:31 2007 +
+++ b/net/ipv4/tcp.cTue Sep 11 10:38:23 2007 -0700
@@ -2030,8 +2030,14 @@ void tcp_get_info(struct sock *sk, struc
info->tcpi_snd_mss = tp->mss_cache;
info->tcpi_rcv_mss = icsk->icsk_ack.rcv_mss;
 
-	info->tcpi_unacked = tp->packets_out;

-   info->tcpi_sacked = tp->sacked_out;
+   if (sk->sk_state == TCP_LISTEN) {
+   info->tcpi_unacked = sk->sk_ack_backlog;
+   info->tcpi_sacked = sk->sk_max_ack_backlog;
+   }
+   else {
+   info->tcpi_unacked = tp->packets_out;
+   info->tcpi_sacked = tp->sacked_out;
+   }
info->tcpi_lost = tp->lost_out;
info->tcpi_retrans = tp->retrans_out;
info->tcpi_fackets = tp->fackets_out;
diff -r bdcdd0e1ee9d net/ipv4/tcp_diag.c
--- a/net/ipv4/tcp_diag.c   Sat Sep 01 07:00:31 2007 +
+++ b/net/ipv4/tcp_diag.c   Tue Sep 11 10:38:23 2007 -0700
@@ -25,11 +25,14 @@ static void tcp_diag_get_info(struct soc
const struct tcp_sock *tp = tcp_sk(sk);
struct tcp_info *info = _info;
 
-	if (sk->sk_state == TCP_LISTEN)

+   if (sk->sk_state == TCP_LISTEN) {
r->idiag_rqueue = sk->sk_ack_backlog;
-   else
+   r->idiag_wqueue = sk->sk_max_ack_backlog;
+   }
+   else {
r->idiag_rqueue = tp->rcv_nxt - tp->copied_seq;
-   r->idiag_wqueue = tp->write_seq - tp->snd_una;
+   r->idiag_wqueue = tp->write_seq - tp->snd_una;
+   }
if (info != NULL)
tcp_get_info(sk, info);
 }
diff -r bdcdd0e1ee9d net/ipv4/tcp_ipv4.c
--- a/net/ipv4/tcp_ipv4.c   Sat Sep 01 07:00:31 2007 +
+++ b/net/ipv4/tcp_ipv4.c   Tue Sep 11 10:38:23 2007 -0700
@@ -2320,7 +2320,8 @@ static void get_tcp4_sock(struct sock *s
sprintf(tmpbuf, "%4d: %08X:%04X %08X:%04X %02X %08X:%08X %02X:%08lX "
"%08X %5d %8d %lu %d %p %u %u %u %u %d",
i, src, srcp, dest, destp, sk->sk_state,
-   tp->write_seq - tp->snd_una,
+   sk->sk_state == TCP_LISTEN ? sk->sk_max_ack_backlog :
+(tp->write_seq - tp->snd_una),
sk->sk_state == TCP_LISTEN ? sk->sk_ack_backlog :
 (tp->rcv_nxt - tp->copied_seq),
timer_active,
diff -r bdcdd0e1ee9d net/ipv6/tcp_ipv6.c
--- a/net/ipv6/tcp_ipv6.c   Sat Sep 01 07:00:31 2007 +
+++ b/net/ipv6/tcp_ipv6.c   Tue Sep 11 10:38:23 2007 -0700
@@ -2005,8 +2005,10 @@ static void get_tcp6_sock(struct seq_fil
   dest->s6_addr32[0], dest->s6_addr32[1],
   dest->s6_addr32[2], dest->s6_addr32[3], destp,
   sp->sk_state,
-  tp->write_seq-tp->snd_una,
-  (sp->sk_state == TCP_LISTEN) ? sp->sk_ack_backlog : 
(tp->rcv_nxt - tp->copied_seq),
+  (sp->sk_state == TCP_LISTEN) ? sp->sk_max_ack_backlog:
+ tp->write_seq-tp->snd_una,
+		   (sp->sk_state == TCP_LISTEN) ? sp->sk_ack_backlog : 
+	(tp->rcv_nxt - tp->copied_seq),

   timer_active,
   jiffies_to_clock_t(timer_expires - jiffies),
   icsk->icsk_retransmits,



[PATCH 1/2] [IPROUTE2] Add missing LIBUTIL for dependencies.

2007-09-11 Thread John Heffner

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 Makefile |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/Makefile b/Makefile
index af0d5e4..7e4605c 100644
--- a/Makefile
+++ b/Makefile
@@ -29,7 +29,8 @@ LDLIBS += -L../lib -lnetlink -lutil
 
 SUBDIRS=lib ip tc misc netem genl
 
-LIBNETLINK=../lib/libnetlink.a ../lib/libutil.a
+LIBUTIL=../lib/libutil.a
+LIBNETLINK=../lib/libnetlink.a $(LIBUTIL)
 
 all: Config
@set -e; \
-- 
1.5.3.rc4.29.g74276-dirty



[PATCH 2/2] [IPROUTE2] ss: parse bare integers are port numbers rather than IP addresses

2007-09-11 Thread John Heffner

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 misc/ss.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index 5d14f13..d617f6d 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -953,6 +953,10 @@ void *parse_hostcond(char *addr)
memset(&a, 0, sizeof(a));
a.port = -1;
 
+   /* Special case: integer by itself is considered a port number */
+   if (!get_integer(&a.port, addr, 0))
+   goto out;
+
if (fam == AF_UNIX || strncmp(addr, "unix:", 5) == 0) {
char *p;
a.addr.family = AF_UNIX;
-- 
1.5.3.rc4.29.g74276-dirty



[PATCH 2/2] [NET] Change type of owner in sock_lock_t to int, rename

2007-09-11 Thread John Heffner
The type of owner in sock_lock_t is currently (struct sock_iocb *),
presumably for historical reasons.  It is never used as this type, only
tested as NULL or set to (void *)1.  For clarity, this changes it to type
int, and renames to owned, to avoid any possible type casting errors.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 include/net/sock.h |7 +++
 net/core/sock.c|6 +++---
 2 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 802c670..5ed9fa4 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -76,10 +76,9 @@
  * between user contexts and software interrupt processing, whereas the
  * mini-semaphore synchronizes multiple users amongst themselves.
  */
-struct sock_iocb;
 typedef struct {
spinlock_t  slock;
-   struct sock_iocb*owner;
+   int owned;
wait_queue_head_t   wq;
/*
 * We express the mutex-alike socket_lock semantics
@@ -737,7 +736,7 @@ static inline int sk_stream_wmem_schedule(struct sock *sk, 
int size)
  * Since ~2.3.5 it is also exclusive sleep lock serializing
  * accesses from user process context.
  */
-#define sock_owned_by_user(sk) ((sk)->sk_lock.owner)
+#define sock_owned_by_user(sk) ((sk)->sk_lock.owned)
 
 /*
  * Macro so as to not evaluate some arguments when
@@ -748,7 +747,7 @@ static inline int sk_stream_wmem_schedule(struct sock *sk, 
int size)
  */
 #define sock_lock_init_class_and_name(sk, sname, skey, name, key)  \
 do {   \
-   sk->sk_lock.owner = NULL;   \
+   sk->sk_lock.owned = 0;  \
init_waitqueue_head(&sk->sk_lock.wq);   \
spin_lock_init(&(sk)->sk_lock.slock);   \
debug_check_no_locks_freed((void *)&(sk)->sk_lock,  \
diff --git a/net/core/sock.c b/net/core/sock.c
index cfed7d4..edbc562 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1575,9 +1575,9 @@ void fastcall lock_sock_nested(struct sock *sk, int 
subclass)
 {
might_sleep();
spin_lock_bh(&sk->sk_lock.slock);
-   if (sk->sk_lock.owner)
+   if (sk->sk_lock.owned)
__lock_sock(sk);
-   sk->sk_lock.owner = (void *)1;
+   sk->sk_lock.owned = 1;
spin_unlock(&sk->sk_lock.slock);
/*
 * The sk_lock has mutex_lock() semantics here:
@@ -1598,7 +1598,7 @@ void fastcall release_sock(struct sock *sk)
spin_lock_bh(&sk->sk_lock.slock);
if (sk->sk_backlog.tail)
__release_sock(sk);
-   sk->sk_lock.owner = NULL;
+   sk->sk_lock.owned = 0;
if (waitqueue_active(&sk->sk_lock.wq))
wake_up(&sk->sk_lock.wq);
spin_unlock_bh(&sk->sk_lock.slock);
-- 
1.5.3.rc7.30.g947ad2



[PATCH 0/2] Clean up owner field in sock_lock_t

2007-09-11 Thread John Heffner
I don't know why the owner field is a (struct sock_iocb *).  I'm assuming
it's historical.  Can someone check this out?  Did I miss some alternate
usage?

These patches are against net-2.6.24.


[PATCH 1/2] [NET] Cleanup: Use sock_owned_by_user() macro

2007-09-11 Thread John Heffner
Changes asserts in sunrpc to use sock_owned_by_user() macro instead of
referencing sock_lock.owner directly.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 net/sunrpc/svcsock.c  |2 +-
 net/sunrpc/xprtsock.c |2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index ed17a50..3a95612 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -104,7 +104,7 @@ static struct lock_class_key svc_slock_key[2];
 static inline void svc_reclassify_socket(struct socket *sock)
 {
struct sock *sk = sock->sk;
-   BUG_ON(sk->sk_lock.owner != NULL);
+   BUG_ON(sock_owned_by_user(sk));
switch (sk->sk_family) {
case AF_INET:
sock_lock_init_class_and_name(sk, "slock-AF_INET-NFSD",
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 4ae7eed..282efd4 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -1186,7 +1186,7 @@ static struct lock_class_key xs_slock_key[2];
 static inline void xs_reclassify_socket(struct socket *sock)
 {
struct sock *sk = sock->sk;
-   BUG_ON(sk->sk_lock.owner != NULL);
+   BUG_ON(sock_owned_by_user(sk));
switch (sk->sk_family) {
case AF_INET:
sock_lock_init_class_and_name(sk, "slock-AF_INET-NFS",
-- 
1.5.3.rc7.30.g947ad2



Re: [PATCH] make _minimum_ TCP retransmission timeout configurable take 2

2007-08-30 Thread John Heffner

Rick Jones wrote:
Like I said, the consumers of this are a trifle, well, 
"anxious" :)


Just curious, did you or this customer try with F-RTO enabled?  Or is 
this case you're dealing with truly hopeless?


  -John


Re: [PATCH] make _minimum_ TCP retransmission timeout configurable

2007-08-29 Thread John Heffner

David Miller wrote:

From: Rick Jones <[EMAIL PROTECTED]>
Date: Wed, 29 Aug 2007 16:06:27 -0700

I believe the biggest component comes from link-layer retransmissions. 
There can also be some short outages thanks to signal blocking, 
tunnels, people with big hats and whatnot that the link-layer 
retransmissions are trying to address.  The three seconds seems to be a 
value that gives the certainty that 99 times out of 100 the segment was 
indeed lost.


The trace I've been sent shows clean RTTs ranging from ~200 milliseconds 
to ~7000 milliseconds.


Thanks for the info.

It's pretty easy to generate examples where we might have some sockets
talking over interfaces on such a network and others which are not.
Therefore, if we do this, a per-route metric is probably the best bet.


This is exactly what I was thinking.  It might even help discourage 
users from playing with this setting who should not. ;)


  -John


Re: NCR, was [PATCH] make _minimum_ TCP retransmission timeout configurable

2007-08-29 Thread John Heffner

Stephen Hemminger wrote:

On Wed, 29 Aug 2007 15:28:12 -0700 (PDT)
David Miller <[EMAIL PROTECTED]> wrote:

And reading NCR some more, we already have something similar in the
form of Alexey's reordering detection, in fact it handles exactly the
case NCR supposedly deals with.  We do not trigger loss recovery
strictly on the 3rd duplicate ACK, and we've known about and dealt
with the reordering issue explicitly for years.



Yeah, it looked like another case of BSD RFC writers reinventing
Linux algorithms, but it is worth getting the behaviour standardized
and more widely reviewed.


I don't believe this was the case.  NCR is substantially different, and 
came out of work at Texas A&M.  The original (only) implementation was 
in Linux IIRC.  Its goal was to do better.  Their papers say it does. 
It might be worth looking at.


In my own experience with reordering, Alexey's code had some 
hard-to-track-down bugs (look at all the work Ilpo's been doing), and 
the relative simplicity of NCR may be one of the reasons it does well in 
tests.


  -John


Re: [PATCH] make _minimum_ TCP retransmission timeout configurable

2007-08-29 Thread John Heffner

John Heffner wrote:

What exactly causes such a huge delay?  What is the TCP measured RTO
in these circumstances where spurious RTOs happen and a 3 second
minimum RTO makes things better?


I haven't done a lot of work on wireless myself, but my understanding is 
that one of the biggest problems is the behavior of link-layer 
retransmission schemes.  They can suddenly increase the delay of packets 
by a significant amount when you get a burst of radio interference. It's 
hard for TCP to gracefully handle this kind of jump without some minimum 
RTO, especially since wlan RTTs can often be quite small.


(Replying to myself) Though F-RTO does often help in this case.

  -John


Re: [PATCH] make _minimum_ TCP retransmission timeout configurable

2007-08-29 Thread John Heffner

David Miller wrote:

From: Rick Jones <[EMAIL PROTECTED]>
Date: Wed, 29 Aug 2007 15:29:03 -0700


David Miller wrote:

None of the research folks want to commit to saying a lower value is
OK, even though it's quite clear that on a local 10 gigabit link a
minimum value of even 200 is absolutely and positively absurd.

So what do these cellphone network people want to do, increase the
minimum RTO or decrease it?  Exactly how does it help them?
They want to increase it.  The folks who triggered this want to make it 
3 seconds to avoid spurious RTOs.  Their experience with the "other 
platform" they wish to replace suggests that 3 seconds is a good value 
for their network.



If the issue is wireless loss, algorithms like FRTO might help them,
because FRTO tries to make a distinction between capacity losses
(which should adjust cwnd) and radio losses (which are not capacity
based and therefore should not affect cwnd).
I was looking at that.  FRTO seems only to affect the cwnd calculations, 
and not the RTO calculation, so it seems to "deal with" spurious RTOs 
rather than preclude them.  There is a strong desire here to not have 
spurious RTOs in the first place.  Each spurious retransmission will 
increase a user's charges.


All of this seems to suggest that the RTO calculation is wrong.


I think there's definitely room for improving the RTO calculation. 
However, this may not be the end-all fix...




It seems that packets in this network can be delayed several orders of
magnitude longer than the usual round trip as measured by TCP.

What exactly causes such a huge delay?  What is the TCP measured RTO
in these circumstances where spurious RTOs happen and a 3 second
minimum RTO makes things better?


I haven't done a lot of work on wireless myself, but my understanding is 
that one of the biggest problems is the behavior of link-layer 
retransmission schemes.  They can suddenly increase the delay of packets 
by a significant amount when you get a burst of radio interference. 
It's hard for TCP to gracefully handle this kind of jump without some 
minimum RTO, especially since wlan RTTs can often be quite small.
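
For reference, here is a sketch of the standard RTO estimator being discussed
(RFC 2988 style, a simplification rather than the exact kernel code).  With a
steady ~20 ms wlan RTT the computed RTO just sits on the 200 ms floor, so a
sudden multi-second stall from link-layer retransmissions blows right past it:

#include <stdio.h>
#include <stdlib.h>

#define RTO_MIN_MS 200   /* Linux's default minimum RTO, the knob at issue */

static int srtt, rttvar, rto;

static void rtt_sample(int m)
{
	if (srtt == 0) {                              /* first measurement */
		srtt = m;
		rttvar = m / 2;
	} else {
		rttvar += (abs(m - srtt) - rttvar) / 4;   /* rttvar: gain 1/4 */
		srtt += (m - srtt) / 8;                   /* srtt:   gain 1/8 */
	}
	rto = srtt + 4 * rttvar;
	if (rto < RTO_MIN_MS)
		rto = RTO_MIN_MS;
}

int main(void)
{
	int i;

	for (i = 0; i < 20; i++)
		rtt_sample(20);                           /* steady 20 ms RTT */
	printf("RTO after steady 20 ms RTTs: %d ms\n", rto);
	printf("so a ~2000 ms link-layer stall looks like a loss\n");
	return 0;
}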


  -John


Re: [PATCH 2.6.22] TCP: Make TCP_RTO_MAX a variable (take 2)

2007-08-28 Thread John Heffner

OBATA Noboru wrote:

Is it correct that you think my problem can be addressed by
either of the following?

(1) Make the application timeouts longer.  (Steve has shown that
making an application timeouts twice the failover detection
timeout would be a solution.)


Right.  Is there something wrong with this approach?



(2) Let TCP have a notification of some kind.


There was some work on this in the IETF a while back (google trigtran 
linkup), but it never went anywhere to my knowledge.  In principle it's 
possible, but it's not clear that it's worth doing.  It's really just an 
optimization anyway.  Imagine the link that's failing over is one hop or 
more away from the endpoint.  You're back to the same problem again.


  -John


Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB

2007-08-26 Thread John Heffner

Bill Fink wrote:

Here's the beforeafter delta of the receiver's "netstat -s"
statistics for the TSO enabled case:

Ip:
3659898 total packets received
3659898 incoming packets delivered
80050 requests sent out
Tcp:
2 passive connection openings
3659897 segments received
80050 segments send out
TcpExt:
33 packets directly queued to recvmsg prequeue.
104956 packets directly received from backlog
705528 packets directly received from prequeue
3654842 packets header predicted
193 packets header predicted and directly queued to user
4 acknowledgments not containing data received
6 predicted acknowledgments

And here it is for the TSO disabled case (GSO also disabled):

Ip:
4107083 total packets received
4107083 incoming packets delivered
1401376 requests sent out
Tcp:
2 passive connection openings
4107083 segments received
1401376 segments send out
TcpExt:
2 TCP sockets finished time wait in fast timer
48486 packets directly queued to recvmsg prequeue.
1056111048 packets directly received from backlog
2273357712 packets directly received from prequeue
1819317 packets header predicted
2287497 packets header predicted and directly queued to user
4 acknowledgments not containing data received
10 predicted acknowledgments

For the TSO disabled case, there are a much larger number of TCP segments
sent out (1401376 versus 80050), which I assume are ACKs, and which
could possibly contribute to the higher throughput for the TSO disabled
case due to faster feedback, but that does not explain the lower CPU utilization.
There are many more packets directly queued to recvmsg prequeue
(48486 versus 33).  The numbers for packets directly received from
backlog and prequeue in the TSO disabled case seem bogus to me, so
I don't know how to interpret that.
many packets header predicted (1819317 versus 3654842), but there
are many more packets header predicted and directly queued to user
(2287497 versus 193).  I'll leave the analysis of all this to those
who might actually know what it all means.


There are a few interesting things here.  For one, the bursts caused by 
TSO seem to be causing the receiver to do stretch acks.  This may have a 
negative impact on flow performance, but it's hard to say for sure how 
much.  Interestingly, it will even further reduce the CPU load on the 
sender, since it has to process fewer acks.


As I suspected, in the non-TSO case the receiver gets lots of packets 
directly queued to user.  This should result in somewhat lower CPU 
utilization on the receiver.  I don't know if it can account for all the 
difference you see.


The backlog and prequeue values are probably correct, but netstat's 
description is wrong.  A quick look at the code reveals these values are 
in units of bytes, not packets.


  -John


Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB

2007-08-24 Thread John Heffner

Bill Fink wrote:

Here you can see there is a major difference in the TX CPU utilization
(99 % with TSO disabled versus only 39 % with TSO enabled), although
the TSO disabled case was able to squeeze out a little extra performance
from its extra CPU utilization.  Interestingly, with TSO enabled, the
receiver actually consumed more CPU than with TSO disabled, so I guess
the receiver CPU saturation in that case (99 %) was what restricted
its performance somewhat (this was consistent across a few test runs).



One possibility is that I think the receive-side processing tends to do 
better when receiving into an empty queue.  When the (non-TSO) sender is 
the flow's bottleneck, this is going to be the case.  But when you 
switch to TSO, the receiver becomes the bottleneck and you're always 
going to have to put the packets at the back of the receive queue.  This 
might help account for the reason why you have both lower throughput and 
higher CPU utilization -- there's a point of instability right where the 
receiver becomes the bottleneck and you end up pushing it over to the 
bad side. :)


Just a theory.  I'm honestly surprised this effect would be so 
significant.  What do the numbers from netstat -s look like in the two 
cases?


  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Problem with implementation of TCP_DEFER_ACCEPT?

2007-08-24 Thread John Heffner

TJ wrote:

Right now Juniper are claiming the issue that brought this to the
surface (the bug linked to in my original post) is a problem with the
implementation of TCP_DEFER_ACCEPT.

My position so far is that the Juniper DX OS is not following the HTTP
standard because it doesn't send a request with the connection, and as I
read the end of section 1.4 of RFC2616, an HTTP connection should be
accompanied by a request.

Can anyone confirm my interpretation or provide references to firm it
up, or refute it?


You can think of TCP_DEFER_ACCEPT as an implicit application close() 
after a certain timeout, when not receiving a request.  All HTTP servers 
do this anyway (though I think technically they're supposed to send a 
408 Request Timeout error, it seems many do not).  It's a very valid 
question for Juniper as to why their box is failing to fill requests 
when its back-end connection has gone away, instead of re-establishing 
the connection and filling the request.


  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Problem with implementation of TCP_DEFER_ACCEPT?

2007-08-23 Thread John Heffner

TJ wrote:

client SYN > server LISTENING
client < SYN ACK server SYN_RECEIVED (time-out 3s)
 server: inet_rsk(req)->acked = 1

client ACK > server (discarded)

client < SYN ACK (DUP) server (time-out 6s)
client ACK (DUP) > server (discarded)

client < SYN ACK (DUP) server (time-out 12s)
client ACK (DUP) > server (discarded)

client < SYN ACK (DUP) server (time-out 24s)
client ACK (DUP) > server (discarded)

client < SYN ACK (DUP) server (time-out 48s)
client ACK (DUP) > server (discarded)

client < SYN ACK (DUP) server (time-out 96s)
client ACK (DUP) > server (discarded)

server: half-open socket closed.

With each client ACK being dropped by the kernel's TCP_DEFER_ACCEPT
mechanism eventually the handshake fails after the 'SYN ACK' retries and
time-outs expire.

There is a case for arguing the kernel should be operating in an
enhanced handshaking mode when TCP_DEFER_ACCEPT is enabled, not an
alternative mode, and therefore should accept *both* RFC 793 and
TCP_DEFER_ACCEPT. I've been unable to find a specification or RFC for
implementing TCP_DEFER_ACCEPT aka BSD's SO_ACCEPTFILTER to give me firm
guidance.

It seems incorrect to penalise a client that is trying to complete the
handshake according to the RFC 793 specification, especially as the
client has no way of knowing ahead of time whether or not the server is
operating deferred accept.


Interesting problem.  TCP_DEFER_ACCEPT does not conform to any standard 
I'm aware of.  (In fact, I'd say it's in violation of RFC 793.)  The 
implementation does exactly what it claims, though -- it "allows a 
listener to be awakened only  when  data  arrives  on  the  socket."


I think a more useful spec might have been "allows a listener to be 
awakened only when data arrives on the socket, unless the specified 
timeout has expired."  Once the timeout expires, it should process the 
embryonic connection as if TCP_DEFER_ACCEPT is not set.  Unfortunately, 
I don't think we can retroactively change this definition, as an 
application might depend on data being available and do a non-blocking 
read() after the accept(), expecting data to be there.  Is this worth 
trying to fix?


Also, a listen socket with a backlog and TCP_DEFER_ACCEPT will have reqs 
sit in the backlog for the full defer timeout, even if they've received 
data, which is not really the right thing to do.


I've attached a patch implementing this suggestion (compile tested only 
-- I think I got the logic right but it's late ;).  Kind of ugly, and 
uses up a bit in struct inet_request_sock.  Maybe can be done better...


  -John
diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index 62daf21..f9f64a5 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -72,7 +72,8 @@ struct inet_request_sock {
sack_ok: 1,
wscale_ok  : 1,
ecn_ok : 1,
-   acked  : 1;
+   acked  : 1,
+   deferred   : 1;
struct ip_options   *opt;
 };
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 185c7ec..cad2490 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -978,6 +978,7 @@ static inline void tcp_openreq_init(struct request_sock 
*req,
ireq->snd_wscale = rx_opt->snd_wscale;
ireq->wscale_ok = rx_opt->wscale_ok;
ireq->acked = 0;
+   ireq->deferred = 0;
ireq->ecn_ok = 0;
ireq->rmt_port = tcp_hdr(skb)->source;
 }
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index fbe7714..1207fb8 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -444,9 +444,6 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
}
}
 
-   if (queue->rskq_defer_accept)
-   max_retries = queue->rskq_defer_accept;
-
budget = 2 * (lopt->nr_table_entries / (timeout / interval));
i = lopt->clock_hand;
 
@@ -455,7 +452,9 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
while ((req = *reqp) != NULL) {
if (time_after_eq(now, req->expires)) {
if ((req->retrans < thresh ||
-(inet_rsk(req)->acked && req->retrans < 
max_retries))
+(inet_rsk(req)->acked && req->retrans < 
max_retries) ||
+(inet_rsk(req)->deferred && req->retrans <
+ queue->rskq_defer_accept + max_retries))
&& !req->rsk_ops->rtx_syn_ack(parent, req, 
NULL)) {
unsigned long timeo;
 
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index a12b08f..c4867f3 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -637,8 +637,10 @@ struct sock *tcp_check_req(struct

Re: TCP's initial cwnd setting correct?...

2007-08-08 Thread John Heffner
I believe the current calculation is correct.  The RFC specifies a 
window of no more than 4380 bytes unless 2*MSS > 4380.  If you change 
the code in this way, then MSS=1461 will give you an initial window of 
3*MSS == 4383, violating the spec.  Reading the pseudocode in RFC 
3390 is a bit misleading because it uses a clamp at 4380 bytes rather 
than a multiplier in the relevant range.


  -John


David Miller wrote:

From: "Ilpo_Järvinen" <[EMAIL PROTECTED]>
Date: Mon, 6 Aug 2007 15:37:15 +0300 (EEST)


@@ -805,13 +805,13 @@ void tcp_update_metrics(struct sock *sk)
}
 }
 
-/* Numbers are taken from RFC2414.  */

+/* Numbers are taken from RFC3390.  */
 __u32 tcp_init_cwnd(struct tcp_sock *tp, struct dst_entry *dst)
 {
__u32 cwnd = (dst ? dst_metric(dst, RTAX_INITCWND) : 0);
 
 	if (!cwnd) {

-   if (tp->mss_cache > 1460)
+   if (tp->mss_cache >= 2190)
cwnd = 2;
else
cwnd = (tp->mss_cache > 1095) ? 3 : 4;


I remember suggesting something similar about 5 or 6 years
ago and Alexey Kuznetsov at the time explained the numbers
which are there and why they should not be changed.

I forget the reasons though, and I'll try to do the research.

These numbers have been like this forever, FWIW.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: TCP's initial cwnd setting correct?...

2007-08-08 Thread John Heffner

That sounds right to me.

  -John


Ilpo Järvinen wrote:

On Mon, 6 Aug 2007, Ilpo Järvinen wrote:

...Goto logic could be cleaner (somebody has any suggestion for better 
way to structure it?)


...I could probably move the setting of snd_cwnd earlier to avoid 
this problem if this seems a valid fix at all.




-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [TCP] Sysctl: document tcp_max_ssthresh (Limited Slow-Start)

2007-05-18 Thread John Heffner

Rick Jones wrote:
as an asside, "tcp_max_ssthresh" sounds like the maximum value ssthresh 
can take-on.  is that correct, or is this more of a "once ssthresh is 
above this, behave in this new way?"  If that is the case, while the 


I don't like it either, but you'll have to talk to Sally Floyd about 
that one.. ;)


In general, I would like the documentation to emphasize more how to set 
the parameter than describe the algorithm.  The max_ssthresh parameter 
should ideally be set to the bottleneck queue size, or more 
realistically a conservative value that's likely to be smaller than the 
bottleneck queue size.  When max_ssthresh is smaller than the bottleneck 
queue, (limited) slow start will not overflow it until cwnd has fully 
ramped up to the appropriate size.


  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] TCP FIN gets dropped prematurely, results in ack storm

2007-05-01 Thread John Heffner

Benjamin LaHaise wrote:

On Tue, May 01, 2007 at 09:41:28PM +0400, Evgeniy Polyakov wrote:

Hmm, 2.2 machine in your test seems to behave incorrectly:


I am aware of that.  However, I think that the loss of certain packets and 
reordering can result in the same behaviour.  What's more, is that this 
behaviour can occur in real deployed systems.  "Be strict in what you send 
and liberal in what you accept."  Both systems should be fixed, which is 
what I'm trying to do.


Actually, you cannot get in this situation by loss or reordering of 
packets, only by corruption of state on one side.  It sends the FIN, 
which effectively increases the sequence number by one.  However, all 
later segments it sends have an old lower sequence number, which are now 
out of window.


Being liberal in what you accept is good to a point, but sometimes you 
have to draw the line.


  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] TCP FIN gets dropped prematurely, results in ack storm

2007-05-01 Thread John Heffner

Benjamin LaHaise wrote:

According to your patch, several packets with fin bit might be sent,
including one with data. If another host does not receive fin
retransmit, then that logic is broken, and it can not be fixed by
duplicating fins, I would even say, that remote box should drop second
packet with fin, while it can carry data, which will break higher
connection logic.


The FIN hasn't been ack'd by the other side, though, and yet Linux is no 
longer transmitting packets with it set.  Read the beginning of the trace.


I agree completely with Evgeniy.  The patch you sent would cause bad 
breakage by sending the FIN bit on segments with different sequence numbers.


Looking at your trace, it seems like the behavior of the test system 
192.168.2.2 is broken in two ways.  First, like you said it has broken 
state in that it has forgotten that it sent the FIN.  Once you do that, 
the connection state is corrupt and all bets are off.  It's sending an 
out-of-window segment that's getting tossed by Linux, and Linux 
generates an ack in response.  This is in direct RFC compliance.  The 
second problem is that the other system is generating these broken acks 
in response to the legitimate acks Linux is sending, causing the ack 
war.  I can't really guess why it's doing that...


You might be able to change Linux to prevent this ack war, but doing so 
would break RFC compliance, and given the buggy nature of the other end, 
it sounds to me like a bad idea.


  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [NET] Add IP(V6)_PMTUDISC_PROBE

2007-04-18 Thread John Heffner
Sorry, forgot the -n flag on git-format-patch.  Patches resent with 
correct sequence numbers.


Thanks,
  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/4] Revert "[NET] Add IP(V6)_PMTUDISC_PROBE"

2007-04-18 Thread John Heffner
This reverts commit d21d2a90b879c0cf159df5944847e6d9833816eb.

Must be backed out because commit 87e927a0583bd4a8ba9e97cd75b58d8aa1c76e37
does not work.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 include/linux/in.h   |1 -
 include/linux/in6.h  |1 -
 include/linux/skbuff.h   |3 +--
 include/net/ip.h |2 +-
 net/core/skbuff.c|2 --
 net/ipv4/ip_output.c |   14 --
 net/ipv4/ip_sockglue.c   |2 +-
 net/ipv4/raw.c   |3 ---
 net/ipv6/ip6_output.c|   12 
 net/ipv6/ipv6_sockglue.c |2 +-
 net/ipv6/raw.c   |3 ---
 11 files changed, 12 insertions(+), 33 deletions(-)

diff --git a/include/linux/in.h b/include/linux/in.h
index 2dc1f8a..1912e7c 100644
--- a/include/linux/in.h
+++ b/include/linux/in.h
@@ -83,7 +83,6 @@ struct in_addr {
 #define IP_PMTUDISC_DONT   0   /* Never send DF frames */
 #define IP_PMTUDISC_WANT   1   /* Use per route hints  */
 #define IP_PMTUDISC_DO 2   /* Always DF*/
-#define IP_PMTUDISC_PROBE  3   /* Ignore dst pmtu  */
 
 #define IP_MULTICAST_IF    32
 #define IP_MULTICAST_TTL   33
diff --git a/include/linux/in6.h b/include/linux/in6.h
index d559fac..4e8350a 100644
--- a/include/linux/in6.h
+++ b/include/linux/in6.h
@@ -179,7 +179,6 @@ struct in6_flowlabel_req
 #define IPV6_PMTUDISC_DONT 0
 #define IPV6_PMTUDISC_WANT 1
 #define IPV6_PMTUDISC_DO   2
-#define IPV6_PMTUDISC_PROBE3
 
 /* Flowlabel */
 #define IPV6_FLOWLABEL_MGR 32
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 8bf9b9f..7f17cfc 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -277,8 +277,7 @@ struct sk_buff {
nfctinfo:3;
__u8pkt_type:3,
fclone:2,
-   ipvs_property:1,
-   ign_dst_mtu:1;
+   ipvs_property:1;
__be16  protocol;
 
void(*destructor)(struct sk_buff *skb);
diff --git a/include/net/ip.h b/include/net/ip.h
index 6a08b65..75f226d 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -206,7 +206,7 @@ int ip_decrease_ttl(struct iphdr *iph)
 static inline
 int ip_dont_fragment(struct sock *sk, struct dst_entry *dst)
 {
-   return (inet_sk(sk)->pmtudisc >= IP_PMTUDISC_DO ||
+   return (inet_sk(sk)->pmtudisc == IP_PMTUDISC_DO ||
(inet_sk(sk)->pmtudisc == IP_PMTUDISC_WANT &&
!(dst_metric(dst, RTAX_LOCK)&(1<<RTAX_MTU)));
	n->destructor = NULL;
C(mark);
@@ -543,7 +542,6 @@ static void copy_skb_header(struct sk_buff *new, const 
struct sk_buff *old)
 #if defined(CONFIG_IP_VS) || defined(CONFIG_IP_VS_MODULE)
new->ipvs_property = old->ipvs_property;
 #endif
-   new->ign_dst_mtu= old->ign_dst_mtu;
 #ifdef CONFIG_NET_SCHED
 #ifdef CONFIG_NET_CLS_ACT
new->tc_verd = old->tc_verd;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 704bc44..79e71ee 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -198,8 +198,7 @@ static inline int ip_finish_output(struct sk_buff *skb)
return dst_output(skb);
}
 #endif
-   if (skb->len > dst_mtu(skb->dst) &&
-   !skb->ign_dst_mtu && !skb_is_gso(skb))
+   if (skb->len > dst_mtu(skb->dst) && !skb_is_gso(skb))
return ip_fragment(skb, ip_finish_output2);
else
return ip_finish_output2(skb);
@@ -788,9 +787,7 @@ int ip_append_data(struct sock *sk,
inet->cork.addr = ipc->addr;
}
dst_hold(&rt->u.dst);
-   inet->cork.fragsize = mtu = inet->pmtudisc == IP_PMTUDISC_PROBE 
?
-   rt->u.dst.dev->mtu :
-   dst_mtu(rt->u.dst.path);
+   inet->cork.fragsize = mtu = dst_mtu(rt->u.dst.path);
inet->cork.rt = rt;
inet->cork.length = 0;
sk->sk_sndmsg_page = NULL;
@@ -1208,16 +1205,13 @@ int ip_push_pending_frames(struct sock *sk)
 * to fragment the frame generated here. No matter, what transforms
 * how transforms change size of the packet, it will come out.
 */
-   if (inet->pmtudisc < IP_PMTUDISC_DO)
+   if (inet->pmtudisc != IP_PMTUDISC_DO)
skb->local_df = 1;
 
-   if (inet->pmtudisc == IP_PMTUDISC_PROBE)
-   skb->ign_dst_mtu = 1;
-
/* DF bit is set when we want to see DF on outgoing frames.
 * If local_df is set too, we still allow to fragment this frame
 

[PATCH 4/4] [NET] Add IP(V6)_PMTUDISC_PROBE

2007-04-18 Thread John Heffner
Add IP(V6)_PMTUDISC_PROBE value for IP(V6)_MTU_DISCOVER.  This option forces
us not to fragment, but does not make use of the kernel path MTU discovery.
That is, it allows for user-mode MTU probing (or, packetization-layer path
MTU discovery).  This is particularly useful for diagnostic utilities, like
traceroute/tracepath.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 include/linux/in.h   |1 +
 include/linux/in6.h  |1 +
 net/ipv4/ip_output.c |   20 +++-
 net/ipv4/ip_sockglue.c   |2 +-
 net/ipv6/ip6_output.c|   15 ---
 net/ipv6/ipv6_sockglue.c |2 +-
 6 files changed, 31 insertions(+), 10 deletions(-)

diff --git a/include/linux/in.h b/include/linux/in.h
index 1912e7c..3975cbf 100644
--- a/include/linux/in.h
+++ b/include/linux/in.h
@@ -83,6 +83,7 @@ struct in_addr {
 #define IP_PMTUDISC_DONT   0   /* Never send DF frames */
 #define IP_PMTUDISC_WANT   1   /* Use per route hints  */
 #define IP_PMTUDISC_DO 2   /* Always DF*/
+#define IP_PMTUDISC_PROBE  3   /* Ignore dst pmtu  */
 
 #define IP_MULTICAST_IF    32
 #define IP_MULTICAST_TTL   33
diff --git a/include/linux/in6.h b/include/linux/in6.h
index 4e8350a..d559fac 100644
--- a/include/linux/in6.h
+++ b/include/linux/in6.h
@@ -179,6 +179,7 @@ struct in6_flowlabel_req
 #define IPV6_PMTUDISC_DONT 0
 #define IPV6_PMTUDISC_WANT 1
 #define IPV6_PMTUDISC_DO   2
+#define IPV6_PMTUDISC_PROBE3
 
 /* Flowlabel */
 #define IPV6_FLOWLABEL_MGR 32
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 34606ef..66e2c3a 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -189,6 +189,14 @@ static inline int ip_finish_output2(struct sk_buff *skb)
return -EINVAL;
 }
 
+static inline int ip_skb_dst_mtu(struct sk_buff *skb)
+{
+   struct inet_sock *inet = skb->sk ? inet_sk(skb->sk) : NULL;
+
+   return (inet && inet->pmtudisc == IP_PMTUDISC_PROBE) ?
+  skb->dst->dev->mtu : dst_mtu(skb->dst);
+}
+
 static inline int ip_finish_output(struct sk_buff *skb)
 {
 #if defined(CONFIG_NETFILTER) && defined(CONFIG_XFRM)
@@ -198,7 +206,7 @@ static inline int ip_finish_output(struct sk_buff *skb)
return dst_output(skb);
}
 #endif
-   if (skb->len > dst_mtu(skb->dst) && !skb_is_gso(skb))
+   if (skb->len > ip_skb_dst_mtu(skb) && !skb_is_gso(skb))
return ip_fragment(skb, ip_finish_output2);
else
return ip_finish_output2(skb);
@@ -422,7 +430,7 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct 
sk_buff*))
if (unlikely((iph->frag_off & htons(IP_DF)) && !skb->local_df)) {
IP_INC_STATS(IPSTATS_MIB_FRAGFAILS);
icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
- htonl(dst_mtu(&rt->u.dst)));
+ htonl(ip_skb_dst_mtu(skb)));
kfree_skb(skb);
return -EMSGSIZE;
}
@@ -787,7 +795,9 @@ int ip_append_data(struct sock *sk,
inet->cork.addr = ipc->addr;
}
dst_hold(&rt->u.dst);
-   inet->cork.fragsize = mtu = dst_mtu(rt->u.dst.path);
+   inet->cork.fragsize = mtu = inet->pmtudisc == IP_PMTUDISC_PROBE 
?
+   rt->u.dst.dev->mtu : 
+   dst_mtu(rt->u.dst.path);
inet->cork.rt = rt;
inet->cork.length = 0;
sk->sk_sndmsg_page = NULL;
@@ -1203,13 +1213,13 @@ int ip_push_pending_frames(struct sock *sk)
 * to fragment the frame generated here. No matter, what transforms
 * how transforms change size of the packet, it will come out.
 */
-   if (inet->pmtudisc != IP_PMTUDISC_DO)
+   if (inet->pmtudisc < IP_PMTUDISC_DO)
skb->local_df = 1;
 
/* DF bit is set when we want to see DF on outgoing frames.
 * If local_df is set too, we still allow to fragment this frame
 * locally. */
-   if (inet->pmtudisc == IP_PMTUDISC_DO ||
+   if (inet->pmtudisc >= IP_PMTUDISC_DO ||
(skb->len <= dst_mtu(&rt->u.dst) &&
 ip_dont_fragment(sk, &rt->u.dst)))
df = htons(IP_DF);
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index c199d23..4d54457 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -542,7 +542,7 @@ static int do_ip_setsockopt(struct sock *sk, int level,
inet->hdrincl = val ? 1 : 0;
break;
case IP_MTU_DISCOVER:
-   if (val<0 || val>2)
+  

[PATCH 3/4] [NET] MTU discovery check in ip6_fragment()

2007-04-18 Thread John Heffner
Adds a check in ip6_fragment() mirroring ip_fragment() for packets
that we can't fragment, and sends an ICMP Packet Too Big message
in response.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 net/ipv6/ip6_output.c |   13 +
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 4cfdad4..5a5b7d4 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -567,6 +567,19 @@ static int ip6_fragment(struct sk_buff *skb, int 
(*output)(struct sk_buff *))
nexthdr = *prevhdr;
 
mtu = dst_mtu(&rt->u.dst);
+
+   /* We must not fragment if the socket is set to force MTU discovery
+* or if the skb is not generated by a local socket.  (This last
+* check should be redundant, but it's free.)
+*/
+   if (!np || np->pmtudisc >= IPV6_PMTUDISC_DO) {
+   skb->dev = skb->dst->dev;
+   icmpv6_send(skb, ICMPV6_PKT_TOOBIG, 0, mtu, skb->dev);
+   IP6_INC_STATS(ip6_dst_idev(skb->dst), IPSTATS_MIB_FRAGFAILS);
+   kfree_skb(skb);
+   return -EMSGSIZE;
+   }
+
if (np && np->frag_size < mtu) {
if (np->frag_size)
mtu = np->frag_size;
-- 
1.5.1.rc3.30.ga8f4-dirty

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/4] Revert "[NET] Do pmtu check in transport layer"

2007-04-18 Thread John Heffner
This reverts commit 87e927a0583bd4a8ba9e97cd75b58d8aa1c76e37.

This idea does not work, as pointed at by Patrick McHardy.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 net/ipv4/ip_output.c  |4 +---
 net/ipv4/raw.c|8 +++-
 net/ipv6/ip6_output.c |   11 +--
 net/ipv6/raw.c|7 ++-
 4 files changed, 11 insertions(+), 19 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 79e71ee..34606ef 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -810,9 +810,7 @@ int ip_append_data(struct sock *sk,
fragheaderlen = sizeof(struct iphdr) + (opt ? opt->optlen : 0);
maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen;
 
-   if (inet->cork.length + length > 0xFFFF - fragheaderlen ||
-   (inet->pmtudisc >= IP_PMTUDISC_DO &&
-inet->cork.length + length > mtu)) {
+   if (inet->cork.length + length > 0xFFFF - fragheaderlen) {
ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->dport, 
mtu-exthdrlen);
return -EMSGSIZE;
}
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index c60aadf..24d7c9f 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -271,12 +271,10 @@ static int raw_send_hdrinc(struct sock *sk, void *from, 
size_t length,
struct iphdr *iph;
struct sk_buff *skb;
int err;
-   int mtu;
 
-   mtu = inet->pmtudisc == IP_PMTUDISC_DO ? dst_mtu(&rt->u.dst) :
-rt->u.dst.dev->mtu;
-   if (length > mtu) {
-   ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->dport, mtu);
+   if (length > rt->u.dst.dev->mtu) {
+   ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->dport,
+  rt->u.dst.dev->mtu);
return -EMSGSIZE;
}
if (flags&MSG_PROBE)
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index b8e307a..4cfdad4 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1079,12 +1079,11 @@ int ip6_append_data(struct sock *sk, int getfrag(void 
*from, char *to,
fragheaderlen = sizeof(struct ipv6hdr) + rt->u.dst.nfheader_len + (opt 
? opt->opt_nflen : 0);
maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen - 
sizeof(struct frag_hdr);
 
-   if ((mtu <= sizeof(struct ipv6hdr) + IPV6_MAXPLEN &&
-inet->cork.length + length > sizeof(struct ipv6hdr) + IPV6_MAXPLEN 
- fragheaderlen) ||
-   (np->pmtudisc >= IPV6_PMTUDISC_DO &&
-inet->cork.length + length > mtu)) {
-   ipv6_local_error(sk, EMSGSIZE, fl, mtu-exthdrlen);
-   return -EMSGSIZE;
+   if (mtu <= sizeof(struct ipv6hdr) + IPV6_MAXPLEN) {
+   if (inet->cork.length + length > sizeof(struct ipv6hdr) + 
IPV6_MAXPLEN - fragheaderlen) {
+   ipv6_local_error(sk, EMSGSIZE, fl, mtu-exthdrlen);
+   return -EMSGSIZE;
+   }
}
 
/*
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index f4cd90b..f65fcd7 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -558,12 +558,9 @@ static int rawv6_send_hdrinc(struct sock *sk, void *from, 
int length,
struct sk_buff *skb;
unsigned int hh_len;
int err;
-   int mtu;
 
-   mtu = np->pmtudisc == IPV6_PMTUDISC_DO ? dst_mtu(&rt->u.dst) :
-rt->u.dst.dev->mtu;
-   if (length > mtu) {
-   ipv6_local_error(sk, EMSGSIZE, fl, mtu);
+   if (length > rt->u.dst.dev->mtu) {
+   ipv6_local_error(sk, EMSGSIZE, fl, rt->u.dst.dev->mtu);
return -EMSGSIZE;
}
if (flags&MSG_PROBE)
-- 
1.5.1.rc3.30.ga8f4-dirty

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Revert "[NET] Add IP(V6)_PMTUDISC_PROBE"

2007-04-18 Thread John Heffner
This reverts commit d21d2a90b879c0cf159df5944847e6d9833816eb.

Must be backed out because commit 87e927a0583bd4a8ba9e97cd75b58d8aa1c76e37
does not work.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 include/linux/in.h   |1 -
 include/linux/in6.h  |1 -
 include/linux/skbuff.h   |3 +--
 include/net/ip.h |2 +-
 net/core/skbuff.c|2 --
 net/ipv4/ip_output.c |   14 --
 net/ipv4/ip_sockglue.c   |2 +-
 net/ipv4/raw.c   |3 ---
 net/ipv6/ip6_output.c|   12 
 net/ipv6/ipv6_sockglue.c |2 +-
 net/ipv6/raw.c   |3 ---
 11 files changed, 12 insertions(+), 33 deletions(-)

diff --git a/include/linux/in.h b/include/linux/in.h
index 2dc1f8a..1912e7c 100644
--- a/include/linux/in.h
+++ b/include/linux/in.h
@@ -83,7 +83,6 @@ struct in_addr {
 #define IP_PMTUDISC_DONT   0   /* Never send DF frames */
 #define IP_PMTUDISC_WANT   1   /* Use per route hints  */
 #define IP_PMTUDISC_DO 2   /* Always DF*/
-#define IP_PMTUDISC_PROBE  3   /* Ignore dst pmtu  */
 
 #define IP_MULTICAST_IF32
 #define IP_MULTICAST_TTL   33
diff --git a/include/linux/in6.h b/include/linux/in6.h
index d559fac..4e8350a 100644
--- a/include/linux/in6.h
+++ b/include/linux/in6.h
@@ -179,7 +179,6 @@ struct in6_flowlabel_req
 #define IPV6_PMTUDISC_DONT 0
 #define IPV6_PMTUDISC_WANT 1
 #define IPV6_PMTUDISC_DO   2
-#define IPV6_PMTUDISC_PROBE3
 
 /* Flowlabel */
 #define IPV6_FLOWLABEL_MGR 32
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 8bf9b9f..7f17cfc 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -277,8 +277,7 @@ struct sk_buff {
nfctinfo:3;
__u8pkt_type:3,
fclone:2,
-   ipvs_property:1,
-   ign_dst_mtu:1;
+   ipvs_property:1;
__be16  protocol;
 
void(*destructor)(struct sk_buff *skb);
diff --git a/include/net/ip.h b/include/net/ip.h
index 6a08b65..75f226d 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -206,7 +206,7 @@ int ip_decrease_ttl(struct iphdr *iph)
 static inline
 int ip_dont_fragment(struct sock *sk, struct dst_entry *dst)
 {
-   return (inet_sk(sk)->pmtudisc >= IP_PMTUDISC_DO ||
+   return (inet_sk(sk)->pmtudisc == IP_PMTUDISC_DO ||
(inet_sk(sk)->pmtudisc == IP_PMTUDISC_WANT &&
!(dst_metric(dst, RTAX_LOCK)&(1<<RTAX_MTU)));
	n->destructor = NULL;
C(mark);
@@ -543,7 +542,6 @@ static void copy_skb_header(struct sk_buff *new, const 
struct sk_buff *old)
 #if defined(CONFIG_IP_VS) || defined(CONFIG_IP_VS_MODULE)
new->ipvs_property = old->ipvs_property;
 #endif
-   new->ign_dst_mtu= old->ign_dst_mtu;
 #ifdef CONFIG_NET_SCHED
 #ifdef CONFIG_NET_CLS_ACT
new->tc_verd = old->tc_verd;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 704bc44..79e71ee 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -198,8 +198,7 @@ static inline int ip_finish_output(struct sk_buff *skb)
return dst_output(skb);
}
 #endif
-   if (skb->len > dst_mtu(skb->dst) &&
-   !skb->ign_dst_mtu && !skb_is_gso(skb))
+   if (skb->len > dst_mtu(skb->dst) && !skb_is_gso(skb))
return ip_fragment(skb, ip_finish_output2);
else
return ip_finish_output2(skb);
@@ -788,9 +787,7 @@ int ip_append_data(struct sock *sk,
inet->cork.addr = ipc->addr;
}
dst_hold(&rt->u.dst);
-   inet->cork.fragsize = mtu = inet->pmtudisc == IP_PMTUDISC_PROBE 
?
-   rt->u.dst.dev->mtu :
-   dst_mtu(rt->u.dst.path);
+   inet->cork.fragsize = mtu = dst_mtu(rt->u.dst.path);
inet->cork.rt = rt;
inet->cork.length = 0;
sk->sk_sndmsg_page = NULL;
@@ -1208,16 +1205,13 @@ int ip_push_pending_frames(struct sock *sk)
 * to fragment the frame generated here. No matter, what transforms
 * how transforms change size of the packet, it will come out.
 */
-   if (inet->pmtudisc < IP_PMTUDISC_DO)
+   if (inet->pmtudisc != IP_PMTUDISC_DO)
skb->local_df = 1;
 
-   if (inet->pmtudisc == IP_PMTUDISC_PROBE)
-   skb->ign_dst_mtu = 1;
-
/* DF bit is set when we want to see DF on outgoing frames.
 * If local_df is set too, we still allow to fragment this frame
 

[PATCH] [NET] Add IP(V6)_PMTUDISC_PROBE

2007-04-18 Thread John Heffner
Add IP(V6)_PMTUDISC_PROBE value for IP(V6)_MTU_DISCOVER.  This option forces
us not to fragment, but does not make use of the kernel path MTU discovery.
That is, it allows for user-mode MTU probing (or, packetization-layer path
MTU discovery).  This is particularly useful for diagnostic utilities, like
traceroute/tracepath.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 include/linux/in.h   |1 +
 include/linux/in6.h  |1 +
 net/ipv4/ip_output.c |   20 +++-
 net/ipv4/ip_sockglue.c   |2 +-
 net/ipv6/ip6_output.c|   15 ---
 net/ipv6/ipv6_sockglue.c |2 +-
 6 files changed, 31 insertions(+), 10 deletions(-)

diff --git a/include/linux/in.h b/include/linux/in.h
index 1912e7c..3975cbf 100644
--- a/include/linux/in.h
+++ b/include/linux/in.h
@@ -83,6 +83,7 @@ struct in_addr {
 #define IP_PMTUDISC_DONT   0   /* Never send DF frames */
 #define IP_PMTUDISC_WANT   1   /* Use per route hints  */
 #define IP_PMTUDISC_DO 2   /* Always DF*/
+#define IP_PMTUDISC_PROBE  3   /* Ignore dst pmtu  */
 
 #define IP_MULTICAST_IF    32
 #define IP_MULTICAST_TTL   33
diff --git a/include/linux/in6.h b/include/linux/in6.h
index 4e8350a..d559fac 100644
--- a/include/linux/in6.h
+++ b/include/linux/in6.h
@@ -179,6 +179,7 @@ struct in6_flowlabel_req
 #define IPV6_PMTUDISC_DONT 0
 #define IPV6_PMTUDISC_WANT 1
 #define IPV6_PMTUDISC_DO   2
+#define IPV6_PMTUDISC_PROBE3
 
 /* Flowlabel */
 #define IPV6_FLOWLABEL_MGR 32
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 34606ef..66e2c3a 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -189,6 +189,14 @@ static inline int ip_finish_output2(struct sk_buff *skb)
return -EINVAL;
 }
 
+static inline int ip_skb_dst_mtu(struct sk_buff *skb)
+{
+   struct inet_sock *inet = skb->sk ? inet_sk(skb->sk) : NULL;
+
+   return (inet && inet->pmtudisc == IP_PMTUDISC_PROBE) ?
+  skb->dst->dev->mtu : dst_mtu(skb->dst);
+}
+
 static inline int ip_finish_output(struct sk_buff *skb)
 {
 #if defined(CONFIG_NETFILTER) && defined(CONFIG_XFRM)
@@ -198,7 +206,7 @@ static inline int ip_finish_output(struct sk_buff *skb)
return dst_output(skb);
}
 #endif
-   if (skb->len > dst_mtu(skb->dst) && !skb_is_gso(skb))
+   if (skb->len > ip_skb_dst_mtu(skb) && !skb_is_gso(skb))
return ip_fragment(skb, ip_finish_output2);
else
return ip_finish_output2(skb);
@@ -422,7 +430,7 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct 
sk_buff*))
if (unlikely((iph->frag_off & htons(IP_DF)) && !skb->local_df)) {
IP_INC_STATS(IPSTATS_MIB_FRAGFAILS);
icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
- htonl(dst_mtu(&rt->u.dst)));
+ htonl(ip_skb_dst_mtu(skb)));
kfree_skb(skb);
return -EMSGSIZE;
}
@@ -787,7 +795,9 @@ int ip_append_data(struct sock *sk,
inet->cork.addr = ipc->addr;
}
dst_hold(&rt->u.dst);
-   inet->cork.fragsize = mtu = dst_mtu(rt->u.dst.path);
+   inet->cork.fragsize = mtu = inet->pmtudisc == IP_PMTUDISC_PROBE 
?
+   rt->u.dst.dev->mtu : 
+   dst_mtu(rt->u.dst.path);
inet->cork.rt = rt;
inet->cork.length = 0;
sk->sk_sndmsg_page = NULL;
@@ -1203,13 +1213,13 @@ int ip_push_pending_frames(struct sock *sk)
 * to fragment the frame generated here. No matter, what transforms
 * how transforms change size of the packet, it will come out.
 */
-   if (inet->pmtudisc != IP_PMTUDISC_DO)
+   if (inet->pmtudisc < IP_PMTUDISC_DO)
skb->local_df = 1;
 
/* DF bit is set when we want to see DF on outgoing frames.
 * If local_df is set too, we still allow to fragment this frame
 * locally. */
-   if (inet->pmtudisc == IP_PMTUDISC_DO ||
+   if (inet->pmtudisc >= IP_PMTUDISC_DO ||
(skb->len <= dst_mtu(&rt->u.dst) &&
 ip_dont_fragment(sk, &rt->u.dst)))
df = htons(IP_DF);
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index c199d23..4d54457 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -542,7 +542,7 @@ static int do_ip_setsockopt(struct sock *sk, int level,
inet->hdrincl = val ? 1 : 0;
break;
case IP_MTU_DISCOVER:
-   if (val<0 || val>2)
+  

[PATCH] Revert "[NET] Do pmtu check in transport layer"

2007-04-18 Thread John Heffner
This reverts commit 87e927a0583bd4a8ba9e97cd75b58d8aa1c76e37.

This idea does not work, as pointed at by Patrick McHardy.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 net/ipv4/ip_output.c  |4 +---
 net/ipv4/raw.c|8 +++-
 net/ipv6/ip6_output.c |   11 +--
 net/ipv6/raw.c|7 ++-
 4 files changed, 11 insertions(+), 19 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 79e71ee..34606ef 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -810,9 +810,7 @@ int ip_append_data(struct sock *sk,
fragheaderlen = sizeof(struct iphdr) + (opt ? opt->optlen : 0);
maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen;
 
-   if (inet->cork.length + length > 0xFFFF - fragheaderlen ||
-   (inet->pmtudisc >= IP_PMTUDISC_DO &&
-inet->cork.length + length > mtu)) {
+   if (inet->cork.length + length > 0xFFFF - fragheaderlen) {
ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->dport, 
mtu-exthdrlen);
return -EMSGSIZE;
}
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index c60aadf..24d7c9f 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -271,12 +271,10 @@ static int raw_send_hdrinc(struct sock *sk, void *from, 
size_t length,
struct iphdr *iph;
struct sk_buff *skb;
int err;
-   int mtu;
 
-   mtu = inet->pmtudisc == IP_PMTUDISC_DO ? dst_mtu(&rt->u.dst) :
-rt->u.dst.dev->mtu;
-   if (length > mtu) {
-   ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->dport, mtu);
+   if (length > rt->u.dst.dev->mtu) {
+   ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->dport,
+  rt->u.dst.dev->mtu);
return -EMSGSIZE;
}
if (flags&MSG_PROBE)
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index b8e307a..4cfdad4 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1079,12 +1079,11 @@ int ip6_append_data(struct sock *sk, int getfrag(void 
*from, char *to,
fragheaderlen = sizeof(struct ipv6hdr) + rt->u.dst.nfheader_len + (opt 
? opt->opt_nflen : 0);
maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen - 
sizeof(struct frag_hdr);
 
-   if ((mtu <= sizeof(struct ipv6hdr) + IPV6_MAXPLEN &&
-inet->cork.length + length > sizeof(struct ipv6hdr) + IPV6_MAXPLEN 
- fragheaderlen) ||
-   (np->pmtudisc >= IPV6_PMTUDISC_DO &&
-inet->cork.length + length > mtu)) {
-   ipv6_local_error(sk, EMSGSIZE, fl, mtu-exthdrlen);
-   return -EMSGSIZE;
+   if (mtu <= sizeof(struct ipv6hdr) + IPV6_MAXPLEN) {
+   if (inet->cork.length + length > sizeof(struct ipv6hdr) + 
IPV6_MAXPLEN - fragheaderlen) {
+   ipv6_local_error(sk, EMSGSIZE, fl, mtu-exthdrlen);
+   return -EMSGSIZE;
+   }
}
 
/*
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index f4cd90b..f65fcd7 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -558,12 +558,9 @@ static int rawv6_send_hdrinc(struct sock *sk, void *from, 
int length,
struct sk_buff *skb;
unsigned int hh_len;
int err;
-   int mtu;
 
-   mtu = np->pmtudisc == IPV6_PMTUDISC_DO ? dst_mtu(&rt->u.dst) :
-rt->u.dst.dev->mtu;
-   if (length > mtu) {
-   ipv6_local_error(sk, EMSGSIZE, fl, mtu);
+   if (length > rt->u.dst.dev->mtu) {
+   ipv6_local_error(sk, EMSGSIZE, fl, rt->u.dst.dev->mtu);
return -EMSGSIZE;
}
if (flags&MSG_PROBE)
-- 
1.5.1.rc3.30.ga8f4-dirty

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] [NET] MTU discovery check in ip6_fragment()

2007-04-18 Thread John Heffner
Adds a check in ip6_fragment() mirroring ip_fragment() for packets
that we can't fragment, and sends an ICMP Packet Too Big message
in response.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 net/ipv6/ip6_output.c |   13 +
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 4cfdad4..5a5b7d4 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -567,6 +567,19 @@ static int ip6_fragment(struct sk_buff *skb, int 
(*output)(struct sk_buff *))
nexthdr = *prevhdr;
 
mtu = dst_mtu(&rt->u.dst);
+
+   /* We must not fragment if the socket is set to force MTU discovery
+* or if the skb is not generated by a local socket.  (This last
+* check should be redundant, but it's free.)
+*/
+   if (!np || np->pmtudisc >= IPV6_PMTUDISC_DO) {
+   skb->dev = skb->dst->dev;
+   icmpv6_send(skb, ICMPV6_PKT_TOOBIG, 0, mtu, skb->dev);
+   IP6_INC_STATS(ip6_dst_idev(skb->dst), IPSTATS_MIB_FRAGFAILS);
+   kfree_skb(skb);
+   return -EMSGSIZE;
+   }
+
if (np && np->frag_size < mtu) {
if (np->frag_size)
mtu = np->frag_size;
-- 
1.5.1.rc3.30.ga8f4-dirty

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/0] Re-try changes for PMTUDISC_PROBE

2007-04-18 Thread John Heffner
This backs out the transport layer MTU checks that don't work.  As a 
consequence, I had to back out the PMTUDISC_PROBE patch as well.  These 
patches should fix the problem with ipv6 that the transport layer change 
tried to address, and re-implement PMTUDISC_PROBE.  I think this 
approach is nicer than the last one, since it doesn't require a bit in 
struct sk_buff.


Thanks,
  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: TCP connection stops after high load.

2007-04-17 Thread John Heffner

David Miller wrote:

From: "Robert Iakobashvili" <[EMAIL PROTECTED]>
Date: Tue, 17 Apr 2007 10:58:04 +0300


David,

On 4/16/07, David Miller <[EMAIL PROTECTED]> wrote:

Commit: 53cdcc04c1e85d4e423b2822b66149b6f2e52c2c
Author: John Heffner <[EMAIL PROTECTED]> Fri, 16 Mar 2007 15:04:03 -0700

 [TCP]: Fix tcp_mem[] initialization.
 Change tcp_mem initialization function.  The fraction of total memory
 is now a continuous function of memory size, and independent of page
 size.


Kernels 2.6.19 and 2.6.20 series are effectively broken right now.
Don't you wish to patch them?

Can you verify that this patch actually fixes your problem?

Yes, it fixes.


Thanks, I will submit it to -stable branch.


My only reservation in submitting this to -stable is that it will in 
many cases increase the default tcp_mem values, which in turn can 
increase the default tcp_rmem values, and therefore the window scale. 
There will be some set of people with broken firewalls who trigger that 
problem for the first time by upgrading along the stable branch.  While 
it's not our fault, it could cause some complaints...


Thanks,
  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: bug in tcp?

2007-04-16 Thread John Heffner

Stephen Hemminger wrote:

A guess: maybe something related to a PAWS wraparound problem.
Does turning off sysctl net.ipv4.tcp_timestamps fix it?


That was my first thought too (aside from netfilter), but a failed PAWS 
check should not result in a reset..


  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: TCP connection stops after high load.

2007-04-16 Thread John Heffner

Robert Iakobashvili wrote:

Kernels 2.6.19 and 2.6.20 series are effectively broken right now.
Don't you wish to patch them?



I don't know if this qualifies as an unconditional bug.  The commit 
above was actually a bugfix so that the limits were not higher than 
total memory on some systems, but had the side effect that it made them 
even smaller on your particular configuration.  Also, having initial 
sysctl values that are conservatively small probably doesn't qualify as 
a bug (for patching stable trees).  You might ask the -stable 
maintainers if they have a different opinion.


For most people, 2.6.19 and 2.6.20 work fine.  For those who really care 
about the tcp_mem values (are using a substantial fraction of physical 
memory for TCP connections), the best bet is to set the tcp_mem sysctl 
values in the startup scripts, or use the new initialization function in 
2.6.21.


Thanks,
  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: TCP connection stops after high load.

2007-04-16 Thread John Heffner

Robert Iakobashvili wrote:

Hi John,

On 4/15/07, John Heffner <[EMAIL PROTECTED]> wrote:

Robert Iakobashvili wrote:
> Vanilla 2.6.18.3 works for me perfectly, whereas 2.6.19.5 and
> 2.6.20.6 do not.
>
> Looking into the tcp /proc entries of 2.6.18.3 versus 2.6.19.5
> tcp_rmem and tcp_wmem are the same, whereas tcp_mem are
> much different:
>
> kernel  tcp_mem
> ---
> 2.6.18.3    12288  16384  24576
> 2.6.19.5     3072   4096   6144



Another patch that went in right around that time:

commit 52bf376c63eebe72e862a1a6e713976b038c3f50
Author: John Heffner <[EMAIL PROTECTED]>
Date:   Tue Nov 14 20:25:17 2006 -0800

 [TCP]: Fix up sysctl_tcp_mem initialization.
(This has been changed again for 2.6.21.)

In the dmesg, there should be some messages like this:
IP route cache hash table entries: 32768 (order: 5, 131072 bytes)
TCP established hash table entries: 131072 (order: 8, 1048576 bytes)
TCP bind hash table entries: 65536 (order: 6, 262144 bytes)
TCP: Hash tables configured (established 131072 bind 65536)

What do yours say?


For the 2.6.19.5, where we have this problem:

From dmsg:

IP route cache hash table entries: 4096 (order: 2, 16384 bytes)
TCP established hash table entries: 16384 (order: 5, 131072 bytes)
TCP bind hash table entries: 8192 (order: 4, 65536 bytes)

#cat /proc/sys/net/ipv4/tcp_mem
3072    4096    6144

MemTotal:   484368 kB
CONFIG_HIGHMEM4G=y



Yes, this difference is caused by the commit above.  The old way didn't 
really make a lot of sense, since it was different based on smp/non-smp 
and page size, and had large discontinuities at 512MB and every power of 
two.  It was hard to make the limit never larger than the memory pool 
but never too small either, when based on the hash table size.


The current net-2.6 (2.6.21) has a redesigned tcp_mem initialization 
that should give you more appropriate values, something like 45408 60546 
90816.  For reference:


Commit: 53cdcc04c1e85d4e423b2822b66149b6f2e52c2c
Author: John Heffner <[EMAIL PROTECTED]> Fri, 16 Mar 2007 15:04:03 -0700

[TCP]: Fix tcp_mem[] initialization.

Change tcp_mem initialization function.  The fraction of total memory
is now a continuous function of memory size, and independent of page
size.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
Signed-off-by: David S. Miller <[EMAIL PROTECTED]>

Thanks,
  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: TCP connection stops after high load.

2007-04-15 Thread John Heffner

Robert Iakobashvili wrote:

Vanilla 2.6.18.3 works for me perfectly, whereas 2.6.19.5 and
2.6.20.6 do not.

Looking into the tcp /proc entries of 2.6.18.3 versus 2.6.19.5
tcp_rmem and tcp_wmem are the same, whereas tcp_mem are
much different:

kernel  tcp_mem
---
2.6.18.3    12288  16384  24576
2.6.19.5     3072   4096   6144


Is not it done deliberately by the below patch:

commit 9e950efa20dc8037c27509666cba6999da9368e8
Author: John Heffner <[EMAIL PROTECTED]>
Date:   Mon Nov 6 23:10:51 2006 -0800

   [TCP]: Don't use highmem in tcp hash size calculation.

   This patch removes consideration of high memory when determining TCP
   hash table sizes.  Taking into account high memory results in tcp_mem
   values that are too large.

Is it a feature?

My machine has:
MemTotal:   484368 kB
and
for all kernel configurations are actually the same with
CONFIG_HIGHMEM4G=y

Thanks,



Another patch that went in right around that time:

commit 52bf376c63eebe72e862a1a6e713976b038c3f50
Author: John Heffner <[EMAIL PROTECTED]>
Date:   Tue Nov 14 20:25:17 2006 -0800

[TCP]: Fix up sysctl_tcp_mem initialization.

Fix up tcp_mem initial settings to take into account the size of the
hash entries (different on SMP and non-SMP systems).

    Signed-off-by: John Heffner <[EMAIL PROTECTED]>
Signed-off-by: David S. Miller <[EMAIL PROTECTED]>

(This has been changed again for 2.6.21.)

In the dmesg, there should be some messages like this:

IP route cache hash table entries: 32768 (order: 5, 131072 bytes)
TCP established hash table entries: 131072 (order: 8, 1048576 bytes)
TCP bind hash table entries: 65536 (order: 6, 262144 bytes)
TCP: Hash tables configured (established 131072 bind 65536)

What do yours say?

Thanks,
  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] [NET] Do pmtu check in transport layer

2007-04-09 Thread John Heffner

Patrick McHardy wrote:

John Heffner wrote:

Check the pmtu check at the transport layer (for UDP, ICMP and raw), and
send a local error if socket is PMTUDISC_DO and packet is too big.  This is
actually a pure bugfix for ipv6.  For ipv4, it allows us to do pmtu checks
in the same way as for ipv6.

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index d096332..593acf7 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -822,7 +822,9 @@ int ip_append_data(struct sock *sk,
fragheaderlen = sizeof(struct iphdr) + (opt ? opt->optlen : 0);
maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen;
 
-	if (inet->cork.length + length > 0xFFFF - fragheaderlen) {

+   if (inet->cork.length + length > 0xFFFF - fragheaderlen ||
+   (inet->pmtudisc >= IP_PMTUDISC_DO &&
+inet->cork.length + length > mtu)) {
ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->dport, 
mtu-exthdrlen);
return -EMSGSIZE;
}



This makes ping report an incorrect MTU when IPsec is used since we're
only accounting for the additional header_len, not the trailer_len
(which is not easily changeable). Additionally it will report different
MTUs for the first and following fragments when the socket is corked
because only the first fragment includes the header_len. It also can't
deal with things like NAT and routing by fwmark that change the route.
The old behaviour was that we get an ICMP frag. required with the MTU
of the final route, while this will always report the MTU of the
initially chosen route.

For all these reasons I think it should be reverted to the old
behaviour.


You're right, this is no good.  I think the other problems are fixable, 
but NAT really screws this.


Unfortunately, there is still a real problem with ipv6, in that the 
output side does not generate a packet too big ICMP like ipv4.  Also, it 
feels kind of undesirable be rely on local ICMP instead of direct error 
message delivery.  I'll try to generate a new patch.


Thanks,
  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] [iputils] Re-probe at same TTL after MTU reduction.

2007-04-03 Thread John Heffner
This fixes a bug that would miss a hop after an ICMP packet too big message,
since it would continue to increase the TTL without probing again.
---
 tracepath.c  |6 ++
 tracepath6.c |6 ++
 2 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/tracepath.c b/tracepath.c
index d035a1e..19b2c6b 100644
--- a/tracepath.c
+++ b/tracepath.c
@@ -352,8 +352,14 @@ main(int argc, char **argv)
exit(1);
}
 
+restart:
for (i=0; i<3; i++) {
+   int old_mtu;
+   
+   old_mtu = mtu;
res = probe_ttl(fd, ttl);
+   if (mtu != old_mtu)
+   goto restart;
if (res == 0)
goto done;
if (res > 0)
diff --git a/tracepath6.c b/tracepath6.c
index a010218..65c4a4a 100644
--- a/tracepath6.c
+++ b/tracepath6.c
@@ -422,8 +422,14 @@ int main(int argc, char **argv)
exit(1);
}
 
+restart:
for (i=0; i<3; i++) {
+   int old_mtu;
+   
+   old_mtu = mtu;
res = probe_ttl(fd, ttl);
+   if (mtu != old_mtu)
+   goto restart;
if (res == 0)
goto done;
if (res > 0)
-- 
1.5.0.2.gc260-dirty

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] [iputils] Fix asymm messages.

2007-04-03 Thread John Heffner
We should only print the asymm messages in tracepath/6 when you receive a
TTL expired message, because this is the only time when we'd expect the
same number of hops back as our TTL was set to for a symmetric path.
---
 tracepath.c  |   25 -
 tracepath6.c |   25 -
 2 files changed, 24 insertions(+), 26 deletions(-)

diff --git a/tracepath.c b/tracepath.c
index a562d88..d035a1e 100644
--- a/tracepath.c
+++ b/tracepath.c
@@ -163,19 +163,6 @@ restart:
}
}
 
-   if (rethops>=0) {
-   if (rethops<=64)
-   rethops = 65-rethops;
-   else if (rethops<=128)
-   rethops = 129-rethops;
-   else
-   rethops = 256-rethops;
-   if (sndhops>=0 && rethops != sndhops)
-   printf("asymm %2d ", rethops);
-   else if (sndhops<0 && rethops != ttl)
-   printf("asymm %2d ", rethops);
-   }
-
if (rettv) {
int diff = 
(tv.tv_sec-rettv->tv_sec)*100+(tv.tv_usec-rettv->tv_usec);
printf("%3d.%03dms ", diff/1000, diff%1000);
@@ -204,6 +191,18 @@ restart:
if (e->ee_origin == SO_EE_ORIGIN_ICMP &&
e->ee_type == 11 &&
e->ee_code == 0) {
+   if (rethops>=0) {
+   if (rethops<=64)
+   rethops = 65-rethops;
+   else if (rethops<=128)
+   rethops = 129-rethops;
+   else
+   rethops = 256-rethops;
+   if (sndhops>=0 && rethops != sndhops)
+   printf("asymm %2d ", rethops);
+   else if (sndhops<0 && rethops != ttl)
+   printf("asymm %2d ", rethops);
+   }
printf("\n");
break;
}
diff --git a/tracepath6.c b/tracepath6.c
index 6f13a51..a010218 100644
--- a/tracepath6.c
+++ b/tracepath6.c
@@ -176,19 +176,6 @@ restart:
}
}
 
-   if (rethops>=0) {
-   if (rethops<=64)
-   rethops = 65-rethops;
-   else if (rethops<=128)
-   rethops = 129-rethops;
-   else
-   rethops = 256-rethops;
-   if (sndhops>=0 && rethops != sndhops)
-   printf("asymm %2d ", rethops);
-   else if (sndhops<0 && rethops != ttl)
-   printf("asymm %2d ", rethops);
-   }
-
if (rettv) {
int diff = 
(tv.tv_sec-rettv->tv_sec)*100+(tv.tv_usec-rettv->tv_usec);
printf("%3d.%03dms ", diff/1000, diff%1000);
@@ -220,6 +207,18 @@ restart:
(e->ee_origin == SO_EE_ORIGIN_ICMP6 &&
 e->ee_type == 3 &&
 e->ee_code == 0)) {
+   if (rethops>=0) {
+   if (rethops<=64)
+   rethops = 65-rethops;
+   else if (rethops<=128)
+   rethops = 129-rethops;
+   else
+   rethops = 256-rethops;
+   if (sndhops>=0 && rethops != sndhops)
+   printf("asymm %2d ", rethops);
+   else if (sndhops<0 && rethops != ttl)
+   printf("asymm %2d ", rethops);
+   }
printf("\n");
break;
}
-- 
1.5.0.2.gc260-dirty

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] [iputils] Add documentation for the -l flag.

2007-04-03 Thread John Heffner
---
 doc/tracepath.sgml |   13 +
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/doc/tracepath.sgml b/doc/tracepath.sgml
index 71eaa8d..c0f308b 100644
--- a/doc/tracepath.sgml
+++ b/doc/tracepath.sgml
@@ -15,6 +15,7 @@ traces path to a network host discovering MTU along this 
path
 
 
 tracepath
+-l 
 
 
 
@@ -39,6 +40,18 @@ of UDP ports to maintain trace history.
 
 
 
+OPTIONS
+
+ 
+  
+  
+Sets the initial packet length to 
+ 
+
+
+
 OUTPUT
 
 
-- 
1.5.0.2.gc260-dirty

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] [iputils] Document -n flag.

2007-04-03 Thread John Heffner
---
 doc/tracepath.sgml |9 +
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/doc/tracepath.sgml b/doc/tracepath.sgml
index c0f308b..1bc83b9 100644
--- a/doc/tracepath.sgml
+++ b/doc/tracepath.sgml
@@ -15,6 +15,7 @@ traces path to a network host discovering MTU along this 
path
 
 
 tracepath
+-n
 -l 
 
 
@@ -42,6 +43,14 @@ of UDP ports to maintain trace history.
 
 OPTIONS
 
+
+ 
+  
+  
+Do not look up host names.  Only print IP addresses numerically.
+  
+ 
+
  
   
   
-- 
1.5.0.2.gc260-dirty

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] NET: Add TCP connection abort IOCTL

2007-03-27 Thread John Heffner

John Heffner wrote:
I also believe this is a useful thing to have.  I'm not 100% sure this 
ioctl is the way to go, but it seems reasonable.  This directly 
corresponds to writing deleteTcb to the tcpConnectionState variable in 
the TCP MIB (RFC 4022).  I don't think it constitutes a protocol violation.


Responding to myself in good form :P  I'll add that there are other ways 
to do this currently but all I know of are hackish, f.e. using a raw 
socket to send RST packets to yourself.


  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] NET: Add TCP connection abort IOCTL

2007-03-27 Thread John Heffner

Mark Huth wrote:



David Miller wrote:

From: [EMAIL PROTECTED] (David Griego)
Date: Tue, 27 Mar 2007 14:47:54 -0700

 

Adds an IOCTL for aborting established TCP connections, and is
designed to be an HA performance improvement for cleaning up, failure 
notification, and application termination.


Signed-off-by:  David Griego <[EMAIL PROTECTED]>



SO_LINGER with a zero linger time plus close() isn't working
properly?

There is no reason for this ioctl at all.  Either existing
facilities provide what you need or what you want is a
protocol violation we can't do.
  
Actually, there are legitimate uses for this sort of API.  The patch 
allows an administrator to kill specific connections that are in use by 
other applications, where the close is not available, since the socket 
is owned by another process.  Say one of your large applications has 
hundreds or even thousands of open connections and you have determined 
that a particular connection is causing trouble.  This API allows the 
admin to kill that particular connection, and doesn't appear to violate 
any RFC offhand, since an abort is sent  to the peer.


One may argue that the applications should be modified, but that is not 
always possible in the case of various ISVs.  As Linux gains market 
share in the large server market, more and more applications are being 
ported from other platforms that have this sort of 
management/administrative interfaces.


Mark Huth


I also believe this is a useful thing to have.  I'm not 100% sure this 
ioctl is the way to go, but it seems reasonable.  This directly 
corresponds to writing deleteTcb to the tcpConnectionState variable in 
the TCP MIB (RFC 4022).  I don't think it constitutes a protocol violation.


As a concrete example, one way I've used this type of feature is to 
defend against a netkill [1] style attack, where the defense involves 
making decisions about which connections to kill when memory gets 
scarce.  It makes sense to do this with a system daemon, since an admin 
might have an arbitrarily complicated policy as to which applications 
and peers have priority for the memory.  This is too complicated to 
distribute and enforce across all applications.  You could do this in 
the kernel, but why if you don't have to?
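
For comparison, here is a minimal sketch (mine, not part of the patch) of 
the existing facility David mentioned -- SO_LINGER with a zero linger time 
plus close() -- which lets a process abort a connection it owns itself.  The 
point of the ioctl is to let an administrator do the same to a socket owned 
by another process.

/* Minimal sketch, not from the patch: abort your *own* connection.
 * Closing with a zero linger time makes the kernel send an RST and
 * discard any unsent data.
 */
#include <sys/socket.h>
#include <unistd.h>

static int abort_own_connection(int fd)
{
	struct linger lng = { .l_onoff = 1, .l_linger = 0 };

	if (setsockopt(fd, SOL_SOCKET, SO_LINGER, &lng, sizeof(lng)) < 0)
		return -1;
	return close(fd);	/* sends RST instead of FIN */
}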


  -John

[1] http://shlang.com/netkill/
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] ip(7) IP_PMTUDISC_PROBE

2007-03-27 Thread John Heffner
Document new IP_PMTUDISC_PROBE value for IP_MTU_DISCOVERY.  (Going into 
2.6.22).


Thanks,
  -John
diff -rU3 man-pages-2.43-a/man7/ip.7 man-pages-2.43-b/man7/ip.7
--- man-pages-2.43-a/man7/ip.7  2006-09-26 09:54:29.0 -0400
+++ man-pages-2.43-b/man7/ip.7  2007-03-27 15:46:18.0 -0400
@@ -515,6 +515,7 @@
 IP_PMTUDISC_WANT:Use per-route settings.
 IP_PMTUDISC_DONT:Never do Path MTU Discovery.
 IP_PMTUDISC_DO:Always do Path MTU Discovery. 
+IP_PMTUDISC_PROBE:Set DF but ignore Path MTU.
 .TE   
 
 When PMTU discovery is enabled the kernel automatically keeps track of
@@ -550,6 +551,15 @@
 with the
 .B IP_MTU
 option. 
+
+It is possible to implement RFC 4821 MTU probing with
+.B SOCK_DGRAM
+or
+.B SOCK_RAW
+sockets by setting a value of IP_PMTUDISC_PROBE.  This is also particularly
+useful for diagnostic tools such as
+.BR tracepath (8)
+that wish to deliberately send probe packets larger than the observed Path MTU.
 .TP
 .B IP_MTU
 Retrieve the current known path MTU of the current socket. 
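
As an illustration of the new option (this sketch is mine, not part of the 
patch), a probing application might set the mode as below, falling back to 
IP_PMTUDISC_DO on kernels without PROBE support, and read the kernel's 
current estimate back with IP_MTU on a connected socket:

/* Illustrative sketch only.  IP_PMTUDISC_PROBE sets DF but ignores the
 * cached path MTU, so the application can send probes larger than it.
 */
#include <netinet/in.h>
#include <sys/socket.h>

#ifndef IP_PMTUDISC_PROBE
#define IP_PMTUDISC_PROBE 3	/* matches the kernel patch */
#endif

static int set_probe_mode(int fd)
{
	int on = IP_PMTUDISC_PROBE;

	if (setsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER, &on, sizeof(on)) == 0)
		return 0;
	on = IP_PMTUDISC_DO;	/* older kernel: at least set DF */
	return setsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER, &on, sizeof(on));
}

static int get_path_mtu(int fd)	/* fd must be connected */
{
	int mtu;
	socklen_t len = sizeof(mtu);

	return getsockopt(fd, IPPROTO_IP, IP_MTU, &mtu, &len) < 0 ? -1 : mtu;
}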


[PATCH 2/2] [iputils] Use PMTUDISC_PROBE mode if it exists.

2007-03-23 Thread John Heffner
Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 tracepath.c  |   10 --
 tracepath6.c |   10 --
 2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/tracepath.c b/tracepath.c
index 1f901ba..a562d88 100644
--- a/tracepath.c
+++ b/tracepath.c
@@ -24,6 +24,10 @@
 #include 
 #include 
 
+#ifndef IP_PMTUDISC_PROBE
+#define IP_PMTUDISC_PROBE  3
+#endif
+
 struct hhistory
 {
int hops;
@@ -322,8 +326,10 @@ main(int argc, char **argv)
}
memcpy(&target.sin_addr, he->h_addr, 4);
 
-   on = IP_PMTUDISC_DO;
-   if (setsockopt(fd, SOL_IP, IP_MTU_DISCOVER, &on, sizeof(on))) {
+   on = IP_PMTUDISC_PROBE;
+   if (setsockopt(fd, SOL_IP, IP_MTU_DISCOVER, &on, sizeof(on)) &&
+   (on = IP_PMTUDISC_DO,
+    setsockopt(fd, SOL_IP, IP_MTU_DISCOVER, &on, sizeof(on)))) {
perror("IP_MTU_DISCOVER");
exit(1);
}
diff --git a/tracepath6.c b/tracepath6.c
index d65230d..6f13a51 100644
--- a/tracepath6.c
+++ b/tracepath6.c
@@ -30,6 +30,10 @@
 #define SOL_IPV6 IPPROTO_IPV6
 #endif
 
+#ifndef IPV6_PMTUDISC_PROBE
+#define IPV6_PMTUDISC_PROBE3
+#endif
+
 int overhead = 48;
 int mtu = 128000;
 int hops_to = -1;
@@ -369,8 +373,10 @@ int main(int argc, char **argv)
mapped = 1;
}
 
-   on = IPV6_PMTUDISC_DO;
-   if (setsockopt(fd, SOL_IPV6, IPV6_MTU_DISCOVER, &on, sizeof(on))) {
+   on = IPV6_PMTUDISC_PROBE;
+   if (setsockopt(fd, SOL_IPV6, IPV6_MTU_DISCOVER, &on, sizeof(on)) &&
+   (on = IPV6_PMTUDISC_DO,
+    setsockopt(fd, SOL_IPV6, IPV6_MTU_DISCOVER, &on, sizeof(on)))) {
perror("IPV6_MTU_DISCOVER");
exit(1);
}
-- 
1.5.0.2.gc260-dirty

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] [iputils] Add length flag to set initial MTU.

2007-03-23 Thread John Heffner
Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 tracepath.c  |   10 --
 tracepath6.c |   10 --
 2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/tracepath.c b/tracepath.c
index c3f6f74..1f901ba 100644
--- a/tracepath.c
+++ b/tracepath.c
@@ -265,7 +265,7 @@ static void usage(void) __attribute((noreturn));
 
 static void usage(void)
 {
-   fprintf(stderr, "Usage: tracepath [-n] [/]\n");
+   fprintf(stderr, "Usage: tracepath [-n] [-l ] 
[/]\n");
exit(-1);
 }
 
@@ -279,11 +279,17 @@ main(int argc, char **argv)
char *p;
int ch;
 
-   while ((ch = getopt(argc, argv, "nh?")) != EOF) {
+   while ((ch = getopt(argc, argv, "nh?l:")) != EOF) {
switch(ch) {
case 'n':   
no_resolve = 1;
break;
+   case 'l':
+   if ((mtu = atoi(optarg)) <= overhead) {
+   fprintf(stderr, "Error: length must be >= 
%d\n", overhead);
+   exit(1);
+   }
+   break;
default:
usage();
}
diff --git a/tracepath6.c b/tracepath6.c
index 23d6a8c..d65230d 100644
--- a/tracepath6.c
+++ b/tracepath6.c
@@ -280,7 +280,7 @@ static void usage(void) __attribute((noreturn));
 
 static void usage(void)
 {
-   fprintf(stderr, "Usage: tracepath6 [-n] [-b] [/]\n");
+   fprintf(stderr, "Usage: tracepath6 [-n] [-b] [-l ] 
[/]\n");
exit(-1);
 }
 
@@ -297,7 +297,7 @@ int main(int argc, char **argv)
int gai;
char pbuf[NI_MAXSERV];
 
-   while ((ch = getopt(argc, argv, "nbh?")) != EOF) {
+   while ((ch = getopt(argc, argv, "nbh?l:")) != EOF) {
switch(ch) {
case 'n':   
no_resolve = 1;
@@ -305,6 +305,12 @@ int main(int argc, char **argv)
case 'b':   
show_both = 1;
break;
+   case 'l':
+   if ((mtu = atoi(optarg)) <= overhead) {
+   fprintf(stderr, "Error: length must be >= 
%d\n", overhead);
+   exit(1);
+   }
+   break;
default:
usage();
}
-- 
1.5.0.2.gc260-dirty

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/2] [iputils] MTU discovery changes

2007-03-23 Thread John Heffner
These add some changes that make tracepath a little more useful for 
diagnosing MTU issues.  The length flag helps distinguish between MTU 
black holes and other types of black holes by allowing you to vary the 
probe packet lengths.  Using PMTUDISC_PROBE gives you the same results 
on each run without having to flush the route cache, so you can see 
where MTU changes in the path actually occur.


Whether the PMTUDISC_PROBE patch goes in should be conditional on whether the 
corresponding kernel patch (just sent) goes in.


  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/3] [NET] Move DF check to ip_forward

2007-03-23 Thread John Heffner
Do fragmentation check in ip_forward, similar to ipv6 forwarding.  Also add
a debug printk in the DF check in ip_fragment since we should now never
reach it.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 net/ipv4/ip_forward.c |8 
 net/ipv4/ip_output.c  |2 ++
 2 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c
index 369e721..0efb1f5 100644
--- a/net/ipv4/ip_forward.c
+++ b/net/ipv4/ip_forward.c
@@ -85,6 +85,14 @@ int ip_forward(struct sk_buff *skb)
if (opt->is_strictroute && rt->rt_dst != rt->rt_gateway)
goto sr_failed;
 
+   if (unlikely(skb->len > dst_mtu(&rt->u.dst) &&
+(skb->nh.iph->frag_off & htons(IP_DF))) && !skb->local_df) {
+   IP_INC_STATS(IPSTATS_MIB_FRAGFAILS);
+   icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
+ htonl(dst_mtu(&rt->u.dst)));
+   goto drop;
+   }
+
/* We are about to mangle packet. Copy it! */
if (skb_cow(skb, LL_RESERVED_SPACE(rt->u.dst.dev)+rt->u.dst.header_len))
goto drop;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 593acf7..90bdd53 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -433,6 +433,8 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff*))
iph = skb->nh.iph;
 
if (unlikely((iph->frag_off & htons(IP_DF)) && !skb->local_df)) {
+   if (net_ratelimit())
+   printk(KERN_DEBUG "ip_fragment: requested fragment of packet with DF set\n");
IP_INC_STATS(IPSTATS_MIB_FRAGFAILS);
icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
  htonl(dst_mtu(&rt->u.dst)));
-- 
1.5.0.2.gc260-dirty

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/3] [NET] Add IP(V6)_PMTUDISC_PROBE

2007-03-23 Thread John Heffner
Add IP(V6)_PMTUDISC_PROBE value for IP(V6)_MTU_DISCOVER.  This option forces
us not to fragment, but does not make use of the kernel path MTU discovery. 
That is, it allows for user-mode MTU probing (or, packetization-layer path
MTU discovery).  This is particularly useful for diagnostic utilities, like
traceroute/tracepath.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 include/linux/in.h   |1 +
 include/linux/in6.h  |1 +
 include/linux/skbuff.h   |3 ++-
 include/net/ip.h |2 +-
 net/core/skbuff.c|2 ++
 net/ipv4/ip_output.c |   14 ++
 net/ipv4/ip_sockglue.c   |2 +-
 net/ipv4/raw.c   |3 +++
 net/ipv6/ip6_output.c|   12 
 net/ipv6/ipv6_sockglue.c |2 +-
 net/ipv6/raw.c   |3 +++
 11 files changed, 33 insertions(+), 12 deletions(-)

diff --git a/include/linux/in.h b/include/linux/in.h
index 1912e7c..2dc1f8a 100644
--- a/include/linux/in.h
+++ b/include/linux/in.h
@@ -83,6 +83,7 @@ struct in_addr {
 #define IP_PMTUDISC_DONT   0   /* Never send DF frames */
 #define IP_PMTUDISC_WANT   1   /* Use per route hints  */
 #define IP_PMTUDISC_DO 2   /* Always DF*/
+#define IP_PMTUDISC_PROBE  3   /* Ignore dst pmtu  */
 
 #define IP_MULTICAST_IF32
 #define IP_MULTICAST_TTL   33
diff --git a/include/linux/in6.h b/include/linux/in6.h
index 4e8350a..d559fac 100644
--- a/include/linux/in6.h
+++ b/include/linux/in6.h
@@ -179,6 +179,7 @@ struct in6_flowlabel_req
 #define IPV6_PMTUDISC_DONT 0
 #define IPV6_PMTUDISC_WANT 1
 #define IPV6_PMTUDISC_DO   2
+#define IPV6_PMTUDISC_PROBE3
 
 /* Flowlabel */
 #define IPV6_FLOWLABEL_MGR 32
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 4ff3940..64038b4 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -284,7 +284,8 @@ struct sk_buff {
nfctinfo:3;
__u8pkt_type:3,
fclone:2,
-   ipvs_property:1;
+   ipvs_property:1,
+   ign_dst_mtu:1;
__be16  protocol;
 
void(*destructor)(struct sk_buff *skb);
diff --git a/include/net/ip.h b/include/net/ip.h
index e79c3e3..f5874a3 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -201,7 +201,7 @@ int ip_decrease_ttl(struct iphdr *iph)
 static inline
 int ip_dont_fragment(struct sock *sk, struct dst_entry *dst)
 {
-   return (inet_sk(sk)->pmtudisc == IP_PMTUDISC_DO ||
+   return (inet_sk(sk)->pmtudisc >= IP_PMTUDISC_DO ||
(inet_sk(sk)->pmtudisc == IP_PMTUDISC_WANT &&
 !(dst_metric(dst, RTAX_LOCK)&(1<destructor = NULL;
C(mark);
@@ -549,6 +550,7 @@ static void copy_skb_header(struct sk_buff *new, const 
struct sk_buff *old)
 #if defined(CONFIG_IP_VS) || defined(CONFIG_IP_VS_MODULE)
new->ipvs_property = old->ipvs_property;
 #endif
+   new->ign_dst_mtu= old->ign_dst_mtu;
 #ifdef CONFIG_BRIDGE_NETFILTER
new->nf_bridge  = old->nf_bridge;
nf_bridge_get(old->nf_bridge);
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 90bdd53..a7e8944 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -201,7 +201,8 @@ static inline int ip_finish_output(struct sk_buff *skb)
return dst_output(skb);
}
 #endif
-   if (skb->len > dst_mtu(skb->dst) && !skb_is_gso(skb))
+   if (skb->len > dst_mtu(skb->dst) &&
+   !skb->ign_dst_mtu && !skb_is_gso(skb))
return ip_fragment(skb, ip_finish_output2);
else
return ip_finish_output2(skb);
@@ -801,7 +802,9 @@ int ip_append_data(struct sock *sk,
inet->cork.addr = ipc->addr;
}
dst_hold(&rt->u.dst);
-   inet->cork.fragsize = mtu = dst_mtu(rt->u.dst.path);
+   inet->cork.fragsize = mtu = inet->pmtudisc == IP_PMTUDISC_PROBE ?
+   rt->u.dst.dev->mtu :
+   dst_mtu(rt->u.dst.path);
inet->cork.rt = rt;
inet->cork.length = 0;
sk->sk_sndmsg_page = NULL;
@@ -1220,13 +1223,16 @@ int ip_push_pending_frames(struct sock *sk)
 * to fragment the frame generated here. No matter, what transforms
 * how transforms change size of the packet, it will come out.
 */
-   if (inet->pmtudisc != IP_PMTUDISC_DO)
+   if (inet->pmtudisc < IP_PMTUDISC_DO)
skb->local_df = 1;
 
+   if (inet->pmtudisc == IP_PMTUDISC_PROBE)
+   s

[PATCH 1/3] [NET] Do pmtu check in transport layer

2007-03-23 Thread John Heffner
Do the pmtu check at the transport layer (for UDP, ICMP and raw), and
send a local error if socket is PMTUDISC_DO and packet is too big.  This is
actually a pure bugfix for ipv6.  For ipv4, it allows us to do pmtu checks
in the same way as for ipv6.
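
For illustration only -- this is not part of the patch -- the user-visible
effect on a UDP socket in IP_PMTUDISC_DO mode is that an oversized send()
now fails locally with EMSGSIZE instead of going out and being fragmented
or blackholed:

/* Hypothetical probe helper; fd is a connected UDP socket. */
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

static void send_probe(int fd, size_t len)
{
	static char buf[65536];
	int on = IP_PMTUDISC_DO;

	setsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER, &on, sizeof(on));
	if (send(fd, buf, len, 0) < 0 && errno == EMSGSIZE)
		printf("%zu bytes exceed the path MTU\n", len);
}
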

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 net/ipv4/ip_output.c  |4 +++-
 net/ipv4/raw.c|8 +---
 net/ipv6/ip6_output.c |   11 ++-
 net/ipv6/raw.c|7 +--
 4 files changed, 19 insertions(+), 11 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index d096332..593acf7 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -822,7 +822,9 @@ int ip_append_data(struct sock *sk,
fragheaderlen = sizeof(struct iphdr) + (opt ? opt->optlen : 0);
maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen;
 
-   if (inet->cork.length + length > 0xFFFF - fragheaderlen) {
+   if (inet->cork.length + length > 0xFFFF - fragheaderlen ||
+   (inet->pmtudisc >= IP_PMTUDISC_DO &&
+inet->cork.length + length > mtu)) {
ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->dport, mtu-exthdrlen);
return -EMSGSIZE;
}
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 87e9c16..f252f4e 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -271,10 +271,12 @@ static int raw_send_hdrinc(struct sock *sk, void *from, size_t length,
struct iphdr *iph;
struct sk_buff *skb;
int err;
+   int mtu;
 
-   if (length > rt->u.dst.dev->mtu) {
-   ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->dport,
-  rt->u.dst.dev->mtu);
+   mtu = inet->pmtudisc == IP_PMTUDISC_DO ? dst_mtu(&rt->u.dst) :
+rt->u.dst.dev->mtu;
+   if (length > mtu) {
+   ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->dport, mtu);
return -EMSGSIZE;
}
if (flags&MSG_PROBE)
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 3055169..711dfc3 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1044,11 +1044,12 @@ int ip6_append_data(struct sock *sk, int getfrag(void *from, char *to,
	fragheaderlen = sizeof(struct ipv6hdr) + rt->u.dst.nfheader_len + (opt ? opt->opt_nflen : 0);
	maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen - sizeof(struct frag_hdr);
 
-   if (mtu <= sizeof(struct ipv6hdr) + IPV6_MAXPLEN) {
-   if (inet->cork.length + length > sizeof(struct ipv6hdr) + IPV6_MAXPLEN - fragheaderlen) {
-   ipv6_local_error(sk, EMSGSIZE, fl, mtu-exthdrlen);
-   return -EMSGSIZE;
-   }
+   if ((mtu <= sizeof(struct ipv6hdr) + IPV6_MAXPLEN &&
+inet->cork.length + length > sizeof(struct ipv6hdr) + IPV6_MAXPLEN - fragheaderlen) ||
+   (np->pmtudisc >= IPV6_PMTUDISC_DO &&
+inet->cork.length + length > mtu)) {
+   ipv6_local_error(sk, EMSGSIZE, fl, mtu-exthdrlen);
+   return -EMSGSIZE;
}
 
/*
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 306d5d8..75db277 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -556,9 +556,12 @@ static int rawv6_send_hdrinc(struct sock *sk, void *from, int length,
struct sk_buff *skb;
unsigned int hh_len;
int err;
+   int mtu;
 
-   if (length > rt->u.dst.dev->mtu) {
-   ipv6_local_error(sk, EMSGSIZE, fl, rt->u.dst.dev->mtu);
+   mtu = np->pmtudisc == IPV6_PMTUDISC_DO ? dst_mtu(&rt->u.dst) :
+rt->u.dst.dev->mtu;
+   if (length > mtu) {
+   ipv6_local_error(sk, EMSGSIZE, fl, mtu);
return -EMSGSIZE;
}
if (flags&MSG_PROBE)
-- 
1.5.0.2.gc260-dirty

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/3] [NET] MTU discovery changes

2007-03-23 Thread John Heffner
These are a few changes to fix/clean up some of the MTU discovery 
processing with non-stream sockets, and add a probing mode.  See also 
matching patches to tracepath to take advantage of this.


  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] tcp_mem initialization

2007-03-15 Thread John Heffner

David Miller wrote:

From: John Heffner <[EMAIL PROTECTED]>
Date: Wed, 14 Mar 2007 17:25:22 -0400

The current tcp_mem initialization gives values that are really too 
small for systems with ~256-768 MB of memory, and also for systems with 
larger page sizes (ia64).  This patch gives an alternate method of 
initialization that doesn't depend on the cache allocation functions, 
but I think should still provide a nice curve that gives a smaller 
fraction of total memory with small-memory systems, while maintaining 
the same upper bound (pressure at 1/2, max as 3/4) on larger memory systems.


Indeed, it's really dumb for any of these calculations to be
dependant upon the page size.

Your patch looks good, and I'll review it further tomorrow and
push upstream unless I find some issues with it.

Thanks John.



The way it's coded is somewhat opaque since it has to be done with 
32-bit integer arithmetic.  These plots might help make the motivation 
behind the code a little clearer.


Thanks,
  -John




[PATCH] tcp_mem initialization

2007-03-14 Thread John Heffner
The current tcp_mem initialization gives values that are really too 
small for systems with ~256-768 MB of memory, and also for systems with 
larger page sizes (ia64).  This patch gives an alternate method of 
initialization that doesn't depend on the cache allocation functions, 
but I think should still provide a nice curve that gives a smaller 
fraction of total memory with small-memory systems, while maintaining 
the same upper bound (pressure at 1/2, max as 3/4) on larger memory systems.


  -John

Change tcp_mem initialization function.  The fraction of total memory is now
a continuous function of memory size, and independent of page size.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>

---
commit a4461a36efb376bf01399cfd6f1ad15dc89a8794
tree 23b2fb9da52b45de8008fc7ea6bb8c10e3a3724b
parent 8b9909ded6922c33c221b105b26917780cfa497d
author John Heffner <[EMAIL PROTECTED]> Wed, 14 Mar 2007 17:15:06 -0400
committer John Heffner <[EMAIL PROTECTED]> Wed, 14 Mar 2007 17:15:06 -0400

 net/ipv4/tcp.c |   13 ++---
 1 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 74c4d10..3834b10 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2458,11 +2458,18 @@ void __init tcp_init(void)
sysctl_max_syn_backlog = 128;
}
 
-   /* Allow no more than 3/4 kernel memory (usually less) allocated to TCP */
-   sysctl_tcp_mem[0] = (1536 / sizeof (struct inet_bind_hashbucket)) << order;
-   sysctl_tcp_mem[1] = sysctl_tcp_mem[0] * 4 / 3;
+   /* Set the pressure threshold to be a fraction of global memory that
+* is up to 1/2 at 256 MB, decreasing toward zero with the amount of
+* memory, with a floor of 128 pages.
+*/
+   limit = min(nr_all_pages, 1UL<<(28-PAGE_SHIFT)) >> (20-PAGE_SHIFT);
+   limit = (limit * (nr_all_pages >> (20-PAGE_SHIFT))) >> (PAGE_SHIFT-11);
+   limit = max(limit, 128UL);
+   sysctl_tcp_mem[0] = limit / 4 * 3;
+   sysctl_tcp_mem[1] = limit;
sysctl_tcp_mem[2] = sysctl_tcp_mem[0] * 2;
 
+   /* Set per-socket limits to no more than 1/128 the pressure threshold */
limit = ((unsigned long)sysctl_tcp_mem[1]) << (PAGE_SHIFT - 7);
max_share = min(4UL*1024*1024, limit);
 


Re: SWS for rcvbuf < MTU

2007-03-13 Thread John Heffner

Alex Sidorenko wrote:
Here are the values from live kernel (obtained with 'crash') when the host was 
in SWS state:


full_space=708  full_space/2=354
free_space=393
window=76

In this case the test from my original fix, (window < full_space/2),  
succeeds. But John's test


free_space > window + full_space/2
393  430

does not. So I suspect that the new fix will not always work. From tcpdump 
traces we can see that both hosts exchange 76-byte packets for a long 
time. From the customer's application log we see that it continues to read 
76-byte chunks on each read() call - even though more than that is available 
in the receive buffer. Technically it's OK for read() to return even after 
reading one byte, so if sk->receive_queue contains multiple 76-byte skbuffs 
we may return after processing just one skbuff (but we don't understand 
the details of why this happens on the customer's system).


Are there any particular reasons why you want to postpone window update until 
free_space becomes > window + full_space/2 and not as soon as 
free_space > full_space/2? As the only real-life occurance of SWS shows 
free_space oscillating slightly above full_space/2, I created the fix 
specifically to match this phenomena as seen on customer's host. We reach the 
modified section only when (free_space > full_space/2) so it should be OK to 
update the window at this point if mss==full_space. 

So yes, we can test John's fix on customer's host but I doubt it will work for 
the reasons mentioned above, in brief:


'window = free_space' instead of 'window=full_space/2' is OK,
but the test 'free_space > window + full_space/2' is not for the specific 
pattern customer sees on his hosts.



Sorry for the long delay in response, I've been on vacation.  I'm okay 
with your patch, and I can't think of any real problem with it, except 
that the behavior is non-standard.  Then again, Linux acking in general 
is non-standard, which is what created the bug in the first place. :)  The 
only case I can think of where it might still ack too often is if 
free_space frequently drops just below full_space/2 for a bit and then rises 
back above full_space/2.


I've also attached a corrected version of my earlier patch that I think 
solves the problem you noted.


Thanks,
  -John
Do full receiver-side SWS avoidance when rcvbuf < mss.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>

---
commit f4333661026621e15549fb75b37be785e4a1c443
tree 30d46b64ea19634875fdd4656d33f76db526a313
parent 562aa1d4c6a874373f9a48ac184f662fbbb06a04
author John Heffner <[EMAIL PROTECTED]> Tue, 13 Mar 2007 14:17:03 -0400
committer John Heffner <[EMAIL PROTECTED]> Tue, 13 Mar 2007 14:17:03 -0400

 net/ipv4/tcp_output.c |9 -
 1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index dc15113..e621a63 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1605,8 +1605,15 @@ u32 __tcp_select_window(struct sock *sk)
 * We also don't do any window rounding when the free space
 * is too small.
 */
-   if (window <= free_space - mss || window > free_space)
+   if (window <= free_space - mss || window > free_space) {
window = (free_space/mss)*mss;
+   } else if (mss == full_space) {
+   /* Do full receive-side SWS avoidance
+* when rcvbuf <= mss */
+   window = tcp_receive_window(tp);
+   if (free_space > window + full_space/2)
+   window = free_space;
+   }
}
 
return window;


Re: SWS for rcvbuf < MTU

2007-03-03 Thread John Heffner

David Miller wrote:

From: John Heffner <[EMAIL PROTECTED]>
Date: Fri, 02 Mar 2007 16:16:39 -0500

Please don't apply the patch I sent.  I've been thinking about this a 
bit harder, and it may not fix this particular problem.  (Hard to say 
without knowing exactly what it is.)  As the comment above 
__tcp_select_window() states, we do not do full receive-side SWS 
avoidance because of header prediction.


Alex, you're right I missed that special zero-window case.  I'm still 
not quite sure I'm completely happy with this patch.  I'd like to think 
about this a little bit harder...


Ok


Alright, I've thought about it a bit more, and I think the patch I sent 
should work.  Alex, any opinion?  Any way you can test this out?


Thanks,
  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SWS for rcvbuf < MTU

2007-03-02 Thread John Heffner

David Miller wrote:

From: Alex Sidorenko <[EMAIL PROTECTED]>
Date: Fri, 2 Mar 2007 15:21:58 -0500

they told us that they use small rcvbuf to throttle bandwidth for this 
application. I explained it would be better to use TC for this purpose. They 
agreed and will probably redesign their application in the future, but they 
cannot do it right now. For the same reason they have to use the old 2.4.20 
for a while - in big companies the important production software cannot be 
changed quickly. 

The fix I suggested is trivial and should have no impact the case of 
rcvfbuf>mtu, so I think it makes sense to include it in upstream kernel.


I have no objection to the fix, especially John's version.

I was just curious about the app, thanks for the info :)


Please don't apply the patch I sent.  I've been thinking about this a 
bit harder, and it may not fix this particular problem.  (Hard to say 
without knowing exactly what it is.)  As the comment above 
__tcp_select_window() states, we do not do full receive-side SWS 
avoidance because of header prediction.


Alex, you're right I missed that special zero-window case.  I'm still 
not quite sure I'm completely happy with this patch.  I'd like to think 
about this a little bit harder...


Thanks,
  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SWS for rcvbuf < MTU

2007-03-02 Thread John Heffner

Alex Sidorenko wrote:
[snip]

--- net/ipv4/tcp_output.c.orig  Wed May  3 20:40:43 2006
+++ net/ipv4/tcp_output.c   Tue Jan 30 14:24:56 2007
@@ -641,6 +641,7 @@
  * Note, we don't "adjust" for TIMESTAMP or SACK option bytes.
  * Regular options like TIMESTAMP are taken into account.
  */
+static const char *SWS_id_string="@#SWS-fix-2";
 u32 __tcp_select_window(struct sock *sk)
 {
struct tcp_opt *tp = &sk->tp_pinfo.af_tcp;
@@ -682,6 +683,9 @@
window = tp->rcv_wnd;
if (window <= free_space - mss || window > free_space)
window = (free_space/mss)*mss;
+/* A fix for small rcvbuf [EMAIL PROTECTED] */
+   else if (mss == full_space && window < full_space/2)
+   window = full_space/2;

return window;
 }


Good analysis of the problem, but the patch does not look quite right. 
In particular, you can't ever announce a zero window. :)


I think this attached patch does the correct SWS avoidance.

Thanks,
  -John

Do receiver-side SWS avoidance for rcvbuf < MSS.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>

---
commit 38d33181c93a28cf7fb2f9f3377305a04636c054
tree 503f8a9de6e78694bae9fc2eb1c9dd5d26a0b5ed
parent 562aa1d4c6a874373f9a48ac184f662fbbb06a04
author John Heffner <[EMAIL PROTECTED]> Fri, 02 Mar 2007 13:47:44 -0500
committer John Heffner <[EMAIL PROTECTED]> Fri, 02 Mar 2007 13:47:44 -0500

 net/ipv4/tcp_output.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index dc15113..688b955 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1607,6 +1607,9 @@ u32 __tcp_select_window(struct sock *sk)
 */
if (window <= free_space - mss || window > free_space)
window = (free_space/mss)*mss;
+   else if (mss == full_space &&
+free_space > window + full_space/2)
+   window = free_space;
}
 
return window;


[PATCH 2/3] TCP sysctl documentation: tcp_no_metrics_save

2007-02-26 Thread John Heffner
 Document sysctl tcp_no_metrics_save.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>

---
commit 17cb799000caef3b2fed28cc5d0601bb2311efa8
tree c27ccf561065b145bc48d0b8dbbaa3c608015e03
parent 4c5fd9d3a9ea8b939aed1afda2ac0fc54e3df592
author John Heffner <[EMAIL PROTECTED]> Mon, 26 Feb 2007 19:51:50 -0500
committer John Heffner <[EMAIL PROTECTED]> Mon, 26 Feb 2007 19:51:50 -0500

 Documentation/networking/ip-sysctl.txt |8 
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index a9ad96b..891f389 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -249,6 +249,14 @@ tcp_moderate_rcvbuf - BOOLEAN
match the size required by the path for full throughput.  Enabled by
default.
 
+tcp_no_metrics_save - BOOLEAN
+   By default, TCP saves various connection metrics in the route cache
+   when the connection closes, so that connections established in the
+   near future can use these to set initial conditions.  Usually, this
+   increases overall performance, but may sometimes cause performance
+   degradation.  If set, TCP will not cache metrics on closing
+   connections.
+
 tcp_orphan_retries - INTEGER
How may times to retry before killing TCP connection, closed
by our side. Default value 7 corresponds to ~50sec-16min


[PATCH 2/3] TCP sysctl documentation: MTU probing

2007-02-26 Thread John Heffner
 Documentation for sysctls tcp_mtu_probing and tcp_base_mss.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>

---
commit 6da0563572e0a6d0abda9d950f30902844c37862
tree 6f21ae02c11a1340412a926e8e2f568f5ed3b5a8
parent 17cb799000caef3b2fed28cc5d0601bb2311efa8
author John Heffner <[EMAIL PROTECTED]> Mon, 26 Feb 2007 20:02:35 -0500
committer John Heffner <[EMAIL PROTECTED]> Mon, 26 Feb 2007 20:02:35 -0500

 Documentation/networking/ip-sysctl.txt |   12 
 1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 891f389..d3aae1f 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -147,6 +147,11 @@ tcp_available_congestion_control - STRING
More congestion control algorithms may be available as modules,
but not loaded.
 
+tcp_base_mss - INTEGER
+   The initial value of search_low to be used by Packetization Layer
+   Path MTU Discovery (MTU probing).  If MTU probing is enabled,
+   this is the initial MSS used by the connection.
+
 tcp_congestion_control - STRING
Set the congestion control algorithm to be used for new
connections. The algorithm "reno" is always available, but
@@ -249,6 +254,13 @@ tcp_moderate_rcvbuf - BOOLEAN
match the size required by the path for full throughput.  Enabled by
default.
 
+tcp_mtu_probing - INTEGER
+   Controls TCP Packetization-Layer Path MTU Discovery.  Takes three
+   values:
+ 0 - Disabled
+ 1 - Disabled by default, enabled when an ICMP black hole detected
+ 2 - Always enabled, use initial MSS of tcp_base_mss.
+
 tcp_no_metrics_save - BOOLEAN
By default, TCP saves various connection metrics in the route cache
when the connection closes, so that connections established in the


[PATCH 1/3] TCP sysctl documentation: tcp_moderate_rcvbuf

2007-02-26 Thread John Heffner
 Document sysctl tcp_moderate_rcvbuf.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>

---
commit 4c5fd9d3a9ea8b939aed1afda2ac0fc54e3df592
tree c25c2fd01e076fbb7356a8c37d06d2e22c60f263
parent aef8811abbc9249a2bd59bd2331bbe523df05d17
author John Heffner <[EMAIL PROTECTED]> Mon, 26 Feb 2007 19:44:58 -0500
committer John Heffner <[EMAIL PROTECTED]> Mon, 26 Feb 2007 19:44:58 -0500

 Documentation/networking/ip-sysctl.txt |6 ++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index a0f6842..a9ad96b 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -243,6 +243,12 @@ tcp_mem - vector of 3 INTEGERs: min, pressure, max
Defaults are calculated at boot time from amount of available
memory.
 
+tcp_moderate_rcvbuf - BOOLEAN
+   If set, TCP performs receive buffer autotuning, attempting to
+   automatically size the buffer (no greater than tcp_rmem[2]) to
+   match the size required by the path for full throughput.  Enabled by
+   default.
+
 tcp_orphan_retries - INTEGER
How may times to retry before killing TCP connection, closed
by our side. Default value 7 corresponds to ~50sec-16min


Re: [PATCH] fix limited slow start bug

2007-02-22 Thread John Heffner

Ilpo Järvinen wrote:
BTW, while looking at this patch, I noticed that snd_cwnd_clamp is only u16 
while snd_cwnd is u32, which seems rather strange since snd_cwnd is being 
limited by the clamp value here and there?!?! And tcp_highspeed.c is 
clearly assuming even more than this (but the problem is hidden as 
snd_cwnd_clamp is fed back to the min_t and the used 32-bit constant 
could be safely cut to 16-bits anyway):


  tp->snd_cwnd_clamp = min_t(u32, tp->snd_cwnd_clamp, 0x/128);

Has the type been changed somewhere in the past, or why is this so?


It's been that way as long as I can remember.  It's always been a 
mystery to me as well.  I suspect the tcp_highspeed code is that way 
because this patch originally came out of the Web100-patched kernel, 
which at one point was using a 32 bit snd_cwnd_clamp IIRC.


I think it's not unreasonable to change clamp to 32 bits now, since with 
1500 byte packets, this corresponds to a max cwnd of ~94MB.  This is 
pretty big, but we are currently right at this limit with 10 GigE.


  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] fix limited slow start bug

2007-02-22 Thread John Heffner
 Fix arithmetic order bug in limited slow start.  The subtraction needs to be
done before snd_cwnd is incremented.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>

---
commit 244e7411d99443df7b7ae849ba6ebbec4c2342bc
tree e6d5985a22448f59f8bef393542e1d5497ee5684
parent 97033fa201705e6cfc68ce66f34ede3277c3d645
author John Heffner <[EMAIL PROTECTED]> Thu, 22 Feb 2007 13:54:01 -0500
committer John Heffner <[EMAIL PROTECTED]> Thu, 22 Feb 2007 13:54:01 -0500

 net/ipv4/tcp_cong.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index 7fd2910..a0c894f 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -303,9 +303,9 @@ void tcp_slow_start(struct tcp_sock *tp)

tp->snd_cwnd_cnt += cnt;
while (tp->snd_cwnd_cnt >= tp->snd_cwnd) {
+   tp->snd_cwnd_cnt -= tp->snd_cwnd;
if (tp->snd_cwnd < tp->snd_cwnd_clamp)
tp->snd_cwnd++;
-   tp->snd_cwnd_cnt -= tp->snd_cwnd;
}
 }
 EXPORT_SYMBOL_GPL(tcp_slow_start);
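
To see why the order matters, consider the boundary case where
snd_cwnd_cnt has reached exactly snd_cwnd.  This is my illustration, not
part of the patch, and it assumes snd_cwnd_cnt is an unsigned 16-bit
counter as in kernels of this era:

/* Old order: increment cwnd first, then subtract the larger cwnd, so the
 * counter goes below zero and wraps; the while loop then keeps running on
 * the wrapped count and inflates cwnd.  New order: the counter simply
 * drops to zero and cwnd grows by one, as intended.
 */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint32_t cwnd = 10;
	uint16_t cnt = 10;		/* one full window's worth of ACKs */

	uint32_t old_cwnd = cwnd + 1;
	uint16_t old_cnt = cnt - old_cwnd;	/* 10 - 11 wraps to 65535 */

	uint16_t new_cnt = cnt - cwnd;		/* 0 */
	uint32_t new_cwnd = cwnd + 1;

	printf("old order: cwnd=%u cnt=%u\n", old_cwnd, old_cnt);
	printf("new order: cwnd=%u cnt=%u\n", new_cwnd, new_cnt);
	return 0;
}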


Re: [PATCH 2/2][TCP] YeAH-TCP: limited slow start exported function

2007-02-22 Thread John Heffner
My patch is meant as a replacement for YeAH patch 2/2, not meant to back 
it out.  You do still need the second hunk below.  Sorry 'bout that.


If you're going to apply YeAH patch 2/2 first, you will also need to 
remove the declaration of tcp_limited_slow_start() in include/net/tcp.h.


Thanks,
  -John


David Miller wrote:

From: David Miller <[EMAIL PROTECTED]>
Date: Thu, 22 Feb 2007 00:27:04 -0800 (PST)


I'll apply this, but could you please also when making suggestions
like this provide the patch necessary to kill the function added for
YeaH and the call site in the YeaH algorithm?


Here is how I'm resolving this:

diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index 2b4142b..5ee79f3 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -310,29 +310,6 @@ void tcp_slow_start(struct tcp_sock *tp)
 }
 EXPORT_SYMBOL_GPL(tcp_slow_start);
 
-void tcp_limited_slow_start(struct tcp_sock *tp)

-{
-   /* RFC3742: limited slow start
-* the window is increased by 1/K MSS for each arriving ACK,
-* for K = int(cwnd/(0.5 max_ssthresh))
-*/
-
-   const int max_ssthresh = 100;
-
-   if (max_ssthresh > 0 && tp->snd_cwnd > max_ssthresh) {
-   u32 k = max(tp->snd_cwnd / (max_ssthresh >> 1), 1U);
-   if (++tp->snd_cwnd_cnt >= k) {
-   if (tp->snd_cwnd < tp->snd_cwnd_clamp)
-   tp->snd_cwnd++;
-   tp->snd_cwnd_cnt = 0;
-   }
-   } else {
-   if (tp->snd_cwnd < tp->snd_cwnd_clamp)
-   tp->snd_cwnd++;
-   }
-}
-EXPORT_SYMBOL_GPL(tcp_limited_slow_start);
-
 /*
  * TCP Reno congestion control
  * This is special case used for fallback as well.
diff --git a/net/ipv4/tcp_yeah.c b/net/ipv4/tcp_yeah.c
index 2d971d1..815e020 100644
--- a/net/ipv4/tcp_yeah.c
+++ b/net/ipv4/tcp_yeah.c
@@ -104,7 +104,7 @@ static void tcp_yeah_cong_avoid(struct sock *sk, u32 ack,
return;
 
 	if (tp->snd_cwnd <= tp->snd_ssthresh) {

-   tcp_limited_slow_start(tp);
+   tcp_slow_start(tp);
} else if (!yeah->doing_reno_now) {
/* Scalable */
 
-

To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2][TCP] YeAH-TCP: limited slow start exported function

2007-02-19 Thread John Heffner

Angelo P. Castellani wrote:

John Heffner ha scritto:
Note the patch is compile-tested only!  I can do some real testing if 
you'd like to apply this Dave.
The date you read on the patch is due to the fact I've splitted this 
patchset into 2 diff files. This isn't compile-tested only, I've used 
this piece of code for about 3 months.


Sorry for the confusion.  The patch I attached to my message was 
compile-tested only.


Thanks,
  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2][TCP] YeAH-TCP: limited slow start exported function

2007-02-19 Thread John Heffner
I'd prefer to make it apply automatically across all congestion controls 
that do slow-start, and also make the max_ssthresh parameter 
controllable via sysctl.  This patch (attached) should implement this. 
Note the default value for sysctl_tcp_max_ssthresh = 0, which disables 
limited slow-start.  This should make ABC apply during LSS as well.


Note the patch is compile-tested only!  I can do some real testing if 
you'd like to apply this Dave.


Thanks,
  -John


Angelo P. Castellani wrote:

Forgot the patch..

Angelo P. Castellani ha scritto:

From: Angelo P. Castellani <[EMAIL PROTECTED]>

RFC3742: limited slow start

See http://www.ietf.org/rfc/rfc3742.txt

Signed-off-by: Angelo P. Castellani <[EMAIL PROTECTED]>
---

To allow code reutilization I've added the limited slow start 
procedure as an exported symbol of linux tcp congestion control.


On large BDP networks canonical slow start should be avoided because 
it requires large packet losses to converge, whereas at lower BDPs 
slow start and limited slow start are identical. Large BDP is defined 
through the max_ssthresh variable.


I think limited slow start could safely replace the canonical slow 
start procedure in Linux.


Regards,
Angelo P. Castellani

p.s.: in the attached patch is added an exported function currently 
used only by YeAH TCP


include/net/tcp.h   |1 +
net/ipv4/tcp_cong.c |   23 +++
2 files changed, 24 insertions(+)







diff -uprN linux-2.6.20-a/include/net/tcp.h linux-2.6.20-c/include/net/tcp.h
--- linux-2.6.20-a/include/net/tcp.h2007-02-04 19:44:54.0 +0100
+++ linux-2.6.20-c/include/net/tcp.h2007-02-19 10:54:10.0 +0100
@@ -669,6 +669,7 @@ extern void tcp_get_allowed_congestion_c
 extern int tcp_set_allowed_congestion_control(char *allowed);
 extern int tcp_set_congestion_control(struct sock *sk, const char *name);
 extern void tcp_slow_start(struct tcp_sock *tp);
+extern void tcp_limited_slow_start(struct tcp_sock *tp);
 
 extern struct tcp_congestion_ops tcp_init_congestion_ops;

 extern u32 tcp_reno_ssthresh(struct sock *sk);
diff -uprN linux-2.6.20-a/net/ipv4/tcp_cong.c linux-2.6.20-c/net/ipv4/tcp_cong.c
--- linux-2.6.20-a/net/ipv4/tcp_cong.c  2007-02-04 19:44:54.0 +0100
+++ linux-2.6.20-c/net/ipv4/tcp_cong.c  2007-02-19 10:54:10.0 +0100
@@ -297,6 +297,29 @@ void tcp_slow_start(struct tcp_sock *tp)
 }
 EXPORT_SYMBOL_GPL(tcp_slow_start);
 
+void tcp_limited_slow_start(struct tcp_sock *tp)

+{
+   /* RFC3742: limited slow start
+* the window is increased by 1/K MSS for each arriving ACK,
+* for K = int(cwnd/(0.5 max_ssthresh))
+*/
+
+   const int max_ssthresh = 100;
+
+   if (max_ssthresh > 0 && tp->snd_cwnd > max_ssthresh) {
+   u32 k = max(tp->snd_cwnd / (max_ssthresh >> 1), 1U);
+   if (++tp->snd_cwnd_cnt >= k) {
+   if (tp->snd_cwnd < tp->snd_cwnd_clamp)
+   tp->snd_cwnd++;
+   tp->snd_cwnd_cnt = 0;
+   }
+   } else {
+   if (tp->snd_cwnd < tp->snd_cwnd_clamp)
+   tp->snd_cwnd++;
+   }
+}
+EXPORT_SYMBOL_GPL(tcp_limited_slow_start);
+
 /*
  * TCP Reno congestion control
  * This is special case used for fallback as well.


Add RFC3742 Limited Slow-Start, controlled by variable sysctl_tcp_max_ssthresh.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>

---
commit 97033fa201705e6cfc68ce66f34ede3277c3d645
tree 5df4607728abce93aa05b31015a90f2ce369abff
parent 8a03d9a498eaf02c8a118752050a5154852c13bf
author John Heffner <[EMAIL PROTECTED]> Mon, 19 Feb 2007 15:52:16 -0500
committer John Heffner <[EMAIL PROTECTED]> Mon, 19 Feb 2007 15:52:16 -0500

 include/linux/sysctl.h |1 +
 include/net/tcp.h  |1 +
 net/ipv4/sysctl_net_ipv4.c |8 
 net/ipv4/tcp_cong.c|   33 +++--
 4 files changed, 33 insertions(+), 10 deletions(-)

diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index 2c5fb38..a2dce72 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -438,6 +438,7 @@ enum
NET_CIPSOV4_RBM_STRICTVALID=121,
NET_TCP_AVAIL_CONG_CONTROL=122,
NET_TCP_ALLOWED_CONG_CONTROL=123,
+   NET_TCP_MAX_SSTHRESH=124,
 };
 
 enum {
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 5c472f2..521da28 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -230,6 +230,7 @@ extern int sysctl_tcp_mtu_probing;
 extern int sysctl_tcp_base_mss;
 extern int sysctl_tcp_workaround_signed_windows;
 extern int sysctl_tcp_slow_start_after_idle;
+extern int sysctl_tcp_max_ssthresh;
 
 extern atomic_t tcp_memory_allocated;
 extern atomic_t tcp_sockets_allocated;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
i
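
As a quick worked example of the RFC 3742 rule in the function above (my
addition), with max_ssthresh = 100 and assuming one ACK per segment:

/* K = max(cwnd / (max_ssthresh/2), 1); cwnd grows by one segment per K
 * ACKs, so per-RTT growth is roughly cwnd/K, which levels off near
 * max_ssthresh/2 once cwnd is large.
 */
#include <stdio.h>

int main(void)
{
	const unsigned max_ssthresh = 100;
	unsigned cwnd;

	for (cwnd = 50; cwnd <= 1600; cwnd *= 2) {
		unsigned k = cwnd / (max_ssthresh >> 1);

		if (k < 1)
			k = 1;
		printf("cwnd=%4u  K=%2u  growth per RTT ~ %u segments\n",
		       cwnd, k, cwnd / k);
	}
	return 0;
}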

Re: [patch 3/3] tcp: remove experimental variants from default list

2007-02-13 Thread John Heffner
This isn't really a reply to anyone in particular, but I wanted to touch 
on a few points.




Reno. As Windows decided to go with "Compound TCP", why do we want to
go back to an '80s algorithm?


It's worth noting that Microsoft is not using Compound TCP by default, 
except in Beta versions so they can get more experience with it.  It is 
available to turn on in production versions, but Reno is still default. 
 Take this how you will, but that's the current state of affairs.




I fail to see how Microsoft should be the reason for anything, if
anything Linux started the arms race.


I'd like to put to bed this notion of an arms race.  A number of people 
have accused Linux and Windows of competing with each other to be more 
aggressive, which is just not the case.  The use of non-standard 
congestion control algorithms is due to a real need to fill underused 
large pipes.  In fact, if Linux or Windows stomped on top of other TCPs 
in production, it would lead to a bad reputation for the one doing the 
stomping, and is something everyone is eager to avoid.  It would be 
easier to design an extremely aggressive control algorithm.  The hard 
work is in achieving the desired properties of fairness, stability, 
etc., in addition to high utilization.


Some care has been taken (okay, with varying success) in designing each 
of the default candidate algorithms to avoid harming standard Reno-style 
flows under "normal" conditions.  If an algorithms meets this 
requirement, then there's almost no reason at this point not to use it. 
 The main issue for the future is dealing with the interaction between 
various (possibly unknown) congestion control algorithms.  From an 
academic point of view, it's very difficult to say anything about how 
they might interact.  At least it's more difficult than modeling how 
flows using a single algorithm interact with each other.  This is 
something of a concern, but we must weigh this against the pressing 
demand for something better than reno.  Further, there's all sorts of 
traffic out there on the Internet with varying responsiveness, as there 
is no enforcement of any particular model of congestion control.  This 
must be taken into account, regardless of what Linux chooses as its 
default at any point in time.


  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] apply cwnd rules to FIN packets with data

2007-02-05 Thread John Heffner

Rick Jones wrote:

John Heffner wrote:

David Miller wrote:

However, I can't think of any reason why the cwnd test should not 
apply.



Care to elaborate here?  You can view the FIN special case as an off
by one error in the CWND test, it's not going to melt the internet.
:-)



True, it's not going to melt the internet, but why stop at one when 
two would finish the connection even faster?  Not sure I buy this 
argument.  Was there some benchmarking data that was a justification 
for this in the first place?


Is the cwnd in the stack byte based, or packet based?

While "all" the RFCs tend to discuss things in terms of byte-based cwnds 
and assumptions based on MSSes and such, the underlying principle was/is 
a conservation of packets.  As David said, a packet is a packet, and if 
one were going to be sending a FIN segment, it might as well carry data. 
 And if one isn't comfortable sending that one last data segment with 
the FIN because cwnd wasn't large enough at the time, should the FIN be 
 sent at that point, even if it is wafer thin?


The most conservative thing is to apply congestion control exactly as 
you would to any other segment, that is, just take the special case out 
entirely.  An empty FIN is not too likely to cause problems, a full-MSS 
FIN somewhat more so, 2-MSS, yet more, a 64k TSO segment even more. :) 
I don't have hard data to argue for or against any particular 
optimization, but it seems there should be some if we're ignoring the 
standard cwnd rules.


  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] apply cwnd rules to FIN packets with data

2007-02-05 Thread John Heffner

David Miller wrote:
However, I can't think of any reason why the cwnd test should not 
apply.


Care to elaborate here?  You can view the FIN special case as an off
by one error in the CWND test, it's not going to melt the internet.
:-)


True, it's not going to melt the internet, but why stop at one when two 
would finish the connection even faster?  Not sure I buy this argument. 
 Was there some benchmarking data that was a justification for this in 
the first place?


My first patch was broken anyway (should not have pulled the test from 
tso_should_defer), and the change is not needed to the nagle test since 
it's implicit.  This patch just restores the old behavior from before 
TSO, sending the FIN when it's the last true segment.  We can debate the 
merits of applying congestion control to the FIN separately. :)


  -John
Don't apply FIN exception to full TSO segments.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>

---
commit 89de0d8cb75958b0315c076b31a597143e30f7a4
tree 7e9c321e62729c6ef76e3886fe9edf2ac78a680c
parent c0d4d573feed199b16094c072e7cb07afb01c598
author John Heffner <[EMAIL PROTECTED]> Mon, 05 Feb 2007 18:42:31 -0500
committer John Heffner <[EMAIL PROTECTED]> Mon, 05 Feb 2007 18:42:31 -0500

 net/ipv4/tcp_output.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 975f447..58b7111 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -965,7 +965,8 @@ static inline unsigned int tcp_cwnd_test
u32 in_flight, cwnd;
 
/* Don't be strict about the congestion window for the final FIN.  */
-   if (TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN)
+   if ((TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN) &&
+   tcp_skb_pcount(skb) == 1)
return 1;
 
in_flight = tcp_packets_in_flight(tp);


Re: [PATCH] apply cwnd rules to FIN packets with data

2007-02-05 Thread John Heffner

David Miller wrote:

From: John Heffner <[EMAIL PROTECTED]>
Date: Mon, 05 Feb 2007 16:58:18 -0500

This is especially important with TSO enabled.  Currently, it will send 
a burst of up to 64k at the end of a connection, even when cwnd is much 
smaller than 64k.  This patch still lets out empty FIN packets, but does 
not apply the special case to FINs carrying data.


Good catch John.

But I think the correct test on skb->len would be to just make
sure that it is <= REAL_MSS.

What do you think about that?  This would match the original intention
of the logic in the pre-TSO days.


What was the intention of that logic?

Actually, I think it would be better to leave the Nagle test as it was 
(which is implicitly < real_mss), because there is obviously no point in 
doing the nagle test when you know there is no more data that will be 
sent.  However, I can't think of any reason why the cwnd test should not 
apply.


  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] apply cwnd rules to FIN packets with data

2007-02-05 Thread John Heffner
This is especially important with TSO enabled.  Currently, it will send 
a burst of up to 64k at the end of a connection, even when cwnd is much 
smaller than 64k.  This patch still lets out empty FIN packets, but does 
not apply the special case to FINs carrying data.


  -John

Apply cwnd rules to FIN packets that contain data.

---
commit af319609eee705e0791a1a58c33b216e8d0254bf
tree 5a1afcc506e09f5adfd74efb7e0cbbc82ec4d5b0
parent c0d4d573feed199b16094c072e7cb07afb01c598
author John Heffner <[EMAIL PROTECTED]> Mon, 05 Feb 2007 16:25:46 -0500
committer John Heffner <[EMAIL PROTECTED]> Mon, 05 Feb 2007 16:25:46 -0500

 net/ipv4/tcp_output.c |7 ++-
 1 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 975f447..215c99d 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -965,7 +965,7 @@ static inline unsigned int tcp_cwnd_test
u32 in_flight, cwnd;
 
/* Don't be strict about the congestion window for the final FIN.  */
-   if (TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN)
+   if ((TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN) && skb->len == 0)
return 1;
 
in_flight = tcp_packets_in_flight(tp);
@@ -1034,7 +1034,7 @@ static inline int tcp_nagle_test(struct 
 
/* Don't use the nagle rule for urgent data (or for the final FIN).  */
if (tp->urg_mode ||
-   (TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN))
+   ((TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN) && skb->len == 0))
return 1;
 
if (!tcp_nagle_check(tp, skb, cur_mss, nonagle))
@@ -1156,9 +1156,6 @@ static int tcp_tso_should_defer(struct s
const struct inet_connection_sock *icsk = inet_csk(sk);
u32 send_win, cong_win, limit, in_flight;
 
-   if (TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN)
-   goto send_now;
-
if (icsk->icsk_ca_state != TCP_CA_Open)
goto send_now;
 


Re: [PATCH] fix up sysctl_tcp_mem initialization

2006-11-15 Thread John Heffner

David Miller wrote:
However, I wonder if we want to set this differently than the way this 
patch does it.  Depending on how far off the memory size is from a power 
of two (exactly equal to a power of two is the worst case), and if total 
memory <128M, it can be substantially less than 3/4.


Longer term, yes, probably a better way exists.

So you concern is that when we round to a power of 2 like we do
now, we often mis-shoot?


I'm not that concerned about it, but basically yes, there are big (x2) 
jumps on power-of-two memory size boundaries.  There's also a bigger 
(x8) discontinuity at 128k pages.  It could be smoother.


  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] fix up sysctl_tcp_mem initialization

2006-11-14 Thread John Heffner
The initial values of sysctl_tcp_mem are sometimes greater than the 
total memory in the system (particularly on SMP systems).  This patch 
ensures that tcp_mem[2] is always <= 3/4 nr_kernel_pages.


However, I wonder if we want to set this differently than the way this 
patch does it.  Depending on how far off the memory size is from a power 
of two (exactly equal to a power of two is the worst case), and if total 
memory <128M, it can be substantially less than 3/4.


  -John
Fix up tcp_mem initial settings to take into account the size of the
hash entries (different on SMP and non-SMP systems).

Signed-off-by: John Heffner <[EMAIL PROTECTED]>

---
commit d4ef8c8245c0a033622ce9ba9e25d379475254f6
tree 5377b8af0bac3b92161188e7369a84e472b5acb2
parent ea55b7c31b47edf90132baea9a088da3bbe2bb5c
author John Heffner <[EMAIL PROTECTED]> Tue, 14 Nov 2006 14:53:27 -0500
committer John Heffner <[EMAIL PROTECTED]> Tue, 14 Nov 2006 14:53:27 -0500

 net/ipv4/tcp.c |7 ---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 4322318..c05e8ed 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2316,9 +2316,10 @@ void __init tcp_init(void)
sysctl_max_syn_backlog = 128;
}
 
-   sysctl_tcp_mem[0] =  768 << order;
-   sysctl_tcp_mem[1] = 1024 << order;
-   sysctl_tcp_mem[2] = 1536 << order;
+   /* Allow no more than 3/4 kernel memory (usually less) allocated to TCP */
+   sysctl_tcp_mem[0] = (1536 / sizeof (struct inet_bind_hashbucket)) << order;
+   sysctl_tcp_mem[1] = sysctl_tcp_mem[0] * 4 / 3;
+   sysctl_tcp_mem[2] = sysctl_tcp_mem[0] * 2;
 
limit = ((unsigned long)sysctl_tcp_mem[1]) << (PAGE_SHIFT - 7);
max_share = min(4UL*1024*1024, limit);


Re: 2.6.19-rc1: Volanomark slowdown

2006-11-07 Thread John Heffner

David Miller wrote:

From: John Heffner <[EMAIL PROTECTED]>
Date: Tue, 07 Nov 2006 16:50:33 -0500

The only stack I know of that does this currently is linux, and in doing 
so does not conform to the spec. ;)  Sending to a BSD receiver will 
result in the same behavior, so the "right place" to fix this is on the 
sending side.  (I know the issue of packet vs. byte counting has come up 
many times over the last 10 years or so, and many arguments have been 
made on either side... I don't mean this to be flame bait but it's clear 
what will happen in this scenario.)


John, you cannot change the N-million existing Linux systems
out there doing congestion control via byte counting.  You
cannot do this no matter how much you wish it so :-)


That would make our lives easier, wouldn't it? ;)  Clearly there are 
some combinations of TCP stacks out there that won't interoperate well 
under certain workloads.  Making new versions of the stack work well is 
the best we can hope for...


Fixing the sending side does not mean we have to back out the 
work-around on the receiving side.


  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.6.19-rc1: Volanomark slowdown

2006-11-07 Thread John Heffner

David Miller wrote:

If we don't ACK every two segments, stacks which grow the congestion
window based upon packet counting will not grow the congestion window
properly when they are sending smaller than MSS sized segments.


The only stack I know of that does this currently is linux, and in doing 
so does not conform to the spec. ;)  Sending to a BSD receiver will 
result in the same behavior, so the "right place" to fix this is on the 
sending side.  (I know the issue of packet vs. byte counting has come up 
many times over the last 10 years or so, and many arguments have been 
made on either side... I don't mean this to be flame bait but it's clear 
what will happen in this scenario.)


One way of viewing the current situation is that linux's packet counting 
plus ABC is more conservative than byte counting -- sometimes much more 
so.  Packet counting without ABC may be more or less conservative than 
byte counting, depending on segment sizes and receiver ACK strategy. 
Without ABC, linux is vulnerable to aggressive ACKing to inflate the 
cwnd.  This is a kind of ugly state of affairs.


Unfortunately I see no clear way to reconcile these issues short of 
switching to byte counting.  Obviously this would be a big change as 
packet counting is deeply ingrained in not only the congestion control 
but also the recovery code.


  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] don't use highmem in tcp hash size calculation

2006-11-06 Thread John Heffner
 This patch removes consideration of high memory when determining TCP hash
table sizes.  Taking into account high memory results in tcp_mem values that
are too large.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>

---
commit ea55b7c31b47edf90132baea9a088da3bbe2bb5c
tree 82311e12d4e4e006fba1688cb537de06cf7a4e4b
parent 4f6f9ba021f8a2149238f7c081cd7cf55c70c775
author John Heffner <[EMAIL PROTECTED]> Mon, 06 Nov 2006 20:03:01 -0500
committer John Heffner <[EMAIL PROTECTED]> Mon, 06 Nov 2006 20:03:01 -0500

 net/ipv4/tcp.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 66e9a72..4322318 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2270,7 +2270,7 @@ void __init tcp_init(void)
thash_entries,
(num_physpages >= 128 * 1024) ?
13 : 15,
-   HASH_HIGHMEM,
+   0,
&tcp_hashinfo.ehash_size,
NULL,
0);
@@ -2286,7 +2286,7 @@ void __init tcp_init(void)
tcp_hashinfo.ehash_size,
(num_physpages >= 128 * 1024) ?
13 : 15,
-   HASH_HIGHMEM,
+   0,
&tcp_hashinfo.bhash_size,
NULL,
64 * 1024);


Re: [PATCH] tcp: don't allow unfair congestion control to be built without warning

2006-10-27 Thread John Heffner
I think "unfair" is a difficult word.  Unfair to what?  It's true that 
Scalable TCP is unfair to itself in that flows with unequal shares do 
not converge, but it's not clear what its interactions are with other 
congestion control algorithms.  It's not clear to me that it's 
significantly more unfair wrt. reno than BIC, etc.  "Known to be broken" 
might be more correct language. :)


One thought would be to use a module parameter that sets one bit of 
state: allow unprivileged use.  Each module could have a sensible 
default value.
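
Concretely, the idea could look something like this (a hypothetical
sketch; no such parameter exists today):

/* Hypothetical per-module policy bit; illustration only. */
#include <linux/module.h>
#include <linux/moduleparam.h>

static int allow_unprivileged;  /* each module picks a sensible default */
module_param(allow_unprivileged, int, 0644);
MODULE_PARM_DESC(allow_unprivileged,
                 "Permit tasks without CAP_NET_ADMIN to select this algorithm");

/*
 * tcp_set_congestion_control() would then refuse the switch when this
 * is clear and the caller lacks CAP_NET_ADMIN.
 */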


  -John


Stephen Hemminger wrote:

My proposed method of restricting TCP choices to fair algorithms.
This is a net-wide, not system-wide, issue; it should not be handled
by a kernel policy choice (capability), but by a build choice.

--- sky2.orig/net/ipv4/Kconfig  2006-10-27 10:10:47.0 -0700
+++ sky2/net/ipv4/Kconfig   2006-10-27 10:15:56.0 -0700
@@ -470,6 +470,16 @@
 
 if TCP_CONG_ADVANCED
 
+config TCP_CONG_UNFAIR
+   bool "Allow unfair congestion control algorithms"
+   depends on EXPERIMENTAL
+   ---help---
+     Some of the congestion control algorithms are for testing
+     and research purposes and should not be deployed on public
+     networks because of the possibility of unfair behavior.
+     These algorithms may be useful for future development
+     or comparison purposes.
+
 config TCP_CONG_BIC
tristate "Binary Increase Congestion (BIC) control"
default m
@@ -551,7 +561,7 @@
 
 config TCP_CONG_SCALABLE
    tristate "Scalable TCP"
-   depends on EXPERIMENTAL
+   depends on TCP_CONG_UNFAIR
    default n
    ---help---
    Scalable TCP is a sender-side only change to TCP which uses a


-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] tcp: setsockopt congestion control autoload

2006-10-26 Thread John Heffner

Hagen Paul Pfeifer wrote:

* John Heffner | 2006-10-26 13:29:26 [-0400]:

My reservation in doing this would be that as an administrator, I may 
want to choose exactly what congestion control is available at any 
given time.  The different congestion control algorithms are not 
necessarily fair to each other.


ACK, completely right. A user without CAP_NET_ADMIN MUST NOT change the
algorithm.  We know that there is some unfairness out there, and maybe
some time ago someone introduced a satellite algorithm which is by
definition completely unfair to vanilla TCP.
We should guard this with the CAP_NET_ADMIN capability so that even
built-in algorithms cannot be enabled by unprivileged users.


I don't know if I'd want to go that far.  For example, there's a nice 
protocol TCP-LP which is by design unfair in the other direction -- it 
yields to other traffic so that you can basically run a scavenger service.


If you really care about this, you could try to rank protocols based on 
aggressiveness (note this is not trivial) and do something like 'nice' 
where mortals can only nice up not down.  Practically speaking, I'm not 
sure this is necessary (worth the effort).
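
For what it's worth, the 'nice'-style check could look roughly like this
hypothetical sketch -- none of these names or ranks exist in the tree,
and ranking real algorithms is the hard part:

/* Illustration only: a made-up aggressiveness table and check. */
struct ca_rank {
        const char *name;
        int aggressiveness;     /* higher = more aggressive, arbitrary scale */
};

static const struct ca_rank ranks[] = {
        { "lp",       -1 },     /* scavenger: yields to other traffic */
        { "reno",      0 },
        { "bic",       1 },
        { "scalable",  2 },
};

/* Like nice(): unprivileged callers may only move to an equal or less
 * aggressive algorithm. */
static int may_switch(int cur_rank, int new_rank, int has_net_admin)
{
        return has_net_admin || new_rank <= cur_rank;
}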


  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] tcp: setsockopt congestion control autoload

2006-10-26 Thread John Heffner
My reservation in doing this would be that as an administrator, I may 
want to choose exactly what congestion control is available at any 
given time.  The different congestion control algorithms are not 
necessarily fair to each other.


If the modules are autoloaded, I could still enforce this by moving the 
modules out of /lib/modules, but I think it's cleaner to do it by 
loading/unloading modules as appropriate.


  -John


Stephen Hemminger wrote:

If a user asks for a congestion control type with setsockopt() then it
may be available as a module not included in the kernel already.
It should be autoloaded if needed.  This is done already when
the default selection is changed with sysctl, but not when an
application requests one via setsockopt().

Only reservation is are there any bad security implications from this?

Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]>

--- orig/net/ipv4/tcp_cong.c    2006-10-25 13:55:34.0 -0700
+++ new/net/ipv4/tcp_cong.c     2006-10-25 13:58:39.0 -0700
@@ -153,9 +153,19 @@
 
    rcu_read_lock();
    ca = tcp_ca_find(name);
+   /* no change asking for existing value */
    if (ca == icsk->icsk_ca_ops)
        goto out;
 
+#ifdef CONFIG_KMOD
+   /* not found attempt to autoload module */
+   if (!ca) {
+       rcu_read_unlock();
+       request_module("tcp_%s", name);
+       rcu_read_lock();
+       ca = tcp_ca_find(name);
+   }
+#endif
    if (!ca)
        err = -ENOENT;
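
From userspace the case this covers is the name-based socket option; a
minimal example (the algorithm name here is only an example, and
TCP_CONGESTION is defined locally in case older libc headers lack it):

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

#ifndef TCP_CONGESTION
#define TCP_CONGESTION 13       /* value from the kernel's tcp.h */
#endif

int main(void)
{
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        const char name[] = "westwood"; /* example algorithm name */

        if (fd < 0)
                return 1;
        /* With the patch above, an unloaded tcp_<name> module could be
         * request_module()'d here instead of the call failing with ENOENT. */
        if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, name, strlen(name)) < 0)
                perror("setsockopt(TCP_CONGESTION)");
        return 0;
}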
 

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Bound TSO defer time (resend)

2006-10-17 Thread John Heffner

David Miller wrote:

From: John Heffner <[EMAIL PROTECTED]>
Date: Tue, 17 Oct 2006 00:18:33 -0400


Stephen Hemminger wrote:

On Mon, 16 Oct 2006 20:53:20 -0400 (EDT)
John Heffner <[EMAIL PROTECTED]> wrote:

This patch limits the amount of time you will defer sending a TSO segment
to less than two clock ticks, or the time between two acks, whichever is
longer.

Okay, but doing any timing on clock ticks makes the behavior dependent
on the value of HZ, which doesn't seem desirable. Should this be based
on RTT or real-time values?
It would be nice to use a high res clock so you don't depend on HZ, but 
this is still expensive on most SMP arch's as I understand it.


Right so we do need to use a jiffies based solution.

Since HZ is variable, I have a feeling that the thing to do here
is pick some timeout in msec.  Then replace the "2 clock ticks"
with some msec_to_jiffies() calls, bottoming out at 1 jiffie.

How does that sound?


That's actually how I originally coded it. :)  But then it occurred to 
me that if you've already been waiting for a full clock tick, the 
marginal CPU savings of waiting longer will not be great.  Which is why 
I chose the value of 2 ticks so you're guaranteed to have waited at 
least one full tick.
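
A rough sketch of the bound with made-up names (not the actual
tcp_output.c code):

#include <linux/jiffies.h>
#include <linux/kernel.h>

/* Stop deferring a TSO send once we have waited at least two jiffies,
 * or one inter-ACK gap if that is longer. */
static int tso_defer_expired(unsigned long defer_start,
                             unsigned long ack_gap /* in jiffies */)
{
        unsigned long limit = max(2UL, ack_gap);

        return time_after_eq(jiffies, defer_start + limit);
}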


  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Bound TSO defer time (resend)

2006-10-16 Thread John Heffner

Stephen Hemminger wrote:

On Mon, 16 Oct 2006 20:53:20 -0400 (EDT)
John Heffner <[EMAIL PROTECTED]> wrote:



This patch limits the amount of time you will defer sending a TSO segment
to less than two clock ticks, or the time between two acks, whichever is
longer.




Okay, but doing any timing on clock ticks makes the behavior dependent
on the value of HZ, which doesn't seem desirable. Should this be based
on RTT or real-time values?


It would be nice to use a high res clock so you don't depend on HZ, but 
this is still expensive on most SMP arch's as I understand it.


  -John

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

