from:"\"john\""

David Miller wrote:

From: Rick Jones <[EMAIL PROTECTED]>
Date: Wed, 29 Aug 2007 15:29:03 -0700

David Miller wrote:

None of the research folks want to commit to saying a lower value is
OK, even though it's quite clear that on a local 10 gigabit link a
minimum value of even 200 is absolutely and positively absurd.

So what do these cellphone network people want to do, increate the
minimum RTO or increase it?  Exactly how does it help them?
They want to increase it.  The folks who triggered this want to make it 
3 seconds to avoid spurrious RTOs.  Their experience the "other 
platform" they widh to replace suggests that 3 seconds is a good value 
for their network.

If the issue is wireless loss, algorithms like FRTO might help them,
because FRTO tries to make a distinction between capacity losses
(which should adjust cwnd) and radio losses (which are not capacity
based and therefore should not affect cwnd).
I was looking at that.  FRTO seems only to affect the cwnd calculations, 
and not the RTO calculation, so it seems to "deal with" spurrious RTOs 
rather than preclude them.  There is a strong desire here to not have 
spurrious RTO's in the first place.  Each spurrious retransmission will 
increase a user's charges.

All of this seems to suggest that the RTO calculation is wrong.

I think there's definitely room for improving the RTO calculation. 
However, this may not be the end-all fix...

It seems that packets in this network can be delayed several orders of
magnitude longer than the usual round trip as measured by TCP.

What exactly causes such a huge delay?  What is the TCP measured RTO
in these circumstances where spurious RTOs happen and a 3 second
minimum RTO makes things better?

I haven't done a lot of work on wireless myself, but my understanding is 
that one of the biggest problems is the behavior link-layer 
retransmission schemes.  They can suddenly increase the delay of packets 
by a significant amount when you get a burst of radio interference. 
It's hard for TCP to gracefully handle this kind of jump without some 
minimum RTO, especially since wlan RTTs can often be quite small.

  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] make _minimum_ TCP retransmission timeout configurable


John Heffner wrote:

What exactly causes such a huge delay?  What is the TCP measured RTO
in these circumstances where spurious RTOs happen and a 3 second
minimum RTO makes things better?


I haven't done a lot of work on wireless myself, but my understanding is 
that one of the biggest problems is the behavior link-layer 
retransmission schemes.  They can suddenly increase the delay of packets 
by a significant amount when you get a burst of radio interference. It's 
hard for TCP to gracefully handle this kind of jump without some minimum 
RTO, especially since wlan RTTs can often be quite small.


(Replying to myself) Though F-RTO does often help in this case.

  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: NCR, was [PATCH] make _minimum_ TCP retransmission timeout configurable


Stephen Hemminger wrote:

On Wed, 29 Aug 2007 15:28:12 -0700 (PDT)
David Miller <[EMAIL PROTECTED]> wrote:

And reading NCR some more, we already have something similar in the
form of Alexey's reordering detection, in fact it handles exactly the
case NCR supposedly deals with.  We do not trigger loss recovery
strictly on the 3rd duplicate ACK, and we've known about and dealt
with the reordering issue explicitly for years.



Yeah, it looked like another case of BSD RFC writers reinventing
Linux algorithms, but it is worth getting the behaviour standardized
and more widely reviewed.


I don't believe this was the case.  NCR is substantially different, and 
came out of work at Texas A&M.  The original (only) implementation was 
in Linux IIRC.  Its goal was to do better.  Their papers say it does. 
It might be worth looking at.


In my own experience with reordering, Alexey's code had some 
hard-to-track-down bugs (look at all the work Ilpo's been doing), and 
the relative simplicity of NCR may be one of the reasons it does well in 
tests.


  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] make _minimum_ TCP retransmission timeout configurable

David Miller wrote:

From: Rick Jones <[EMAIL PROTECTED]>
Date: Wed, 29 Aug 2007 16:06:27 -0700

I belive the biggest component comes from link-layer retransmissions. 
There can also be some short outtages thanks to signal blocking, 
tunnels, people with big hats and whatnot that the link-layer 
retransmissions are trying to address.  The three seconds seems to be a 
value that gives the certainty that 99 times out of 10 the segment was 
indeed lost.

The trace I've been sent shows clean RTTs ranging from ~200 milliseconds 
to ~7000 milliseconds.

Thanks for the info.

It's pretty easy to generate examples where we might have some sockets
talking over interfaces on such a network and others which are not.
Therefore, if we do this, a per-route metric is probably the best bet.

This is exactly what I was thinking.  It might even help discourage 
users from playing with this setting who should not. ;)

  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] make _minimum_ TCP retransmission timeout configurable take 2

2007-08-30 Thread John Heffner


Rick Jones wrote:
Like I said the consumers of this are a triffle well, 
"anxious" :)


Just curious, did you or this customer try with F-RTO enabled?  Or is 
this case you're dealing with truly hopeless?


  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

82557/8/9 Ethernet Pro 100 interrupt mitigation support

2007-09-03 Thread John Sigler


(Please ignore previous message, it was sent from the wrong account.)

Hello everyone,

I have several systems with three integrated Intel 82559 (I *think*).

Does someone know if these boards support hardware interrupt mitigation?
I.e. is it possible to configure them to raise an IRQ only if their
hardware buffer is full OR if some given time (say 1 ms) has passed and
packets are available in their hardware buffer.

I've been using the eepro100 driver up to now, but I'm about to try the
e100 driver. Would I have to use NAPI? Or is this an orthogonal feature?

Regards.

00:08.0 Ethernet controller: Intel Corporation 82557/8/9 Ethernet Pro 100 (rev 
08)
Subsystem: Intel Corporation EtherExpress PRO/100B (TX)
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
Stepping- SERR- FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- 
SERR- TAbort- 
SERR- TAbort- 
SERR-

Re: 82557/8/9 Ethernet Pro 100 interrupt mitigation support

2007-09-03 Thread John Sigler

John Sigler wrote:

I have several systems with three integrated Intel 82559 (I *think*).

Does someone know if these boards support hardware interrupt mitigation?
I.e. is it possible to configure them to raise an IRQ only if their
hardware buffer is full OR if some given time (say 1 ms) has passed and
packets are available in their hardware buffer.

I've been using the eepro100 driver up to now, but I'm about to try the
e100 driver. Would I have to use NAPI? Or is this an orthogonal feature?

00:08.0 Ethernet controller: Intel Corporation 82557/8/9 Ethernet Pro 100 (rev
08)
Subsystem: Intel Corporation EtherExpress PRO/100B (TX)
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- SERR- TAbort- SERR- TAbort- SERR-

Here is Intel's page for the 82559:
http://www.intel.com/design/network/products/lan/controllers/82559.htm

The "82559ER Fast Ethernet PCI Controller" data sheet mentions a 3 KB
receive FIFO. I suppose that's too small to aggregate several frames?

The "8255x Controller Family Open Source Software Developer Manual"
mentions the features supported by the 82559. I don't see anything
related to interrupt mitigation support.

Does NAPI work well when there is no hardware interrupt mitigation support?

Regards.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: 82557/8/9 Ethernet Pro 100 interrupt mitigation support

2007-09-04 Thread John Sigler


Jesse Brandeburg wrote:


Auke Kok wrote:


Marc Sigler wrote:


I have several systems with three integrated Intel 82559 (I *think*).

Does someone know if these boards support hardware interrupt
mitigation? I.e. is it possible to configure them to raise an IRQ
only if their hardware buffer is full OR if some given time (say 1
ms) has passed and packets are available in their hardware buffer.

I've been using the eepro100 driver up to now, but I'm about to try
the e100 driver. Would I have to use NAPI? Or is this an orthogonal
feature? 


e100 hardware (as far as I can see from the specs) doesn't support
any irq mitigation, so you'll need to run in NAPI mode if you want to
throttle irq's. the in-kernel e100 already runs in NAPI mode, so
that's already covered. 


beware that the eepro100 driver is scheduled for removal (2.6.25 or so).


We support mitigation of interrupts in a downloadable microcode on only
a few pieces of hardware (revision id specific) in e100.c (see
e100_setup_ucode)


http://lxr.linux.no/source/drivers/net/e100.c#L1176

OK.

How do I tell which revision id I have?

00:08.0 0200: 8086:1229 (rev 08)
00:09.0 0200: 8086:1229 (rev 08)
00:0a.0 0200: 8086:1229 (rev 08)

How much memory is available on the board to bundle packets? 3000 bytes?


If you really really wanted mitigation you could probably backport the
microcode from the e100 driver in the 2.4.35 kernel for your specific
hardware.  This driver is versioned 2.X.


I forgot to mention I'm running 2.6.22.1-rt9.
I'm not sure why you mention 2.4.35?
The problem with e100 is that it fails to properly set up all three 
interfaces, which is why I'm stuck with eepro100.


Regards.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 0/2] Clean up owner field in sock_lock_t

I don't know why the owner field is a (struct sock_iocb *).  I'm assuming
it's historical.  Can someone check this out?  Did I miss some alternate
usage?

These patches are against net-2.6.24.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 1/2] [NET] Cleanup: Use sock_owned_by_user() macro

Changes asserts in sunrpc to use sock_owned_by_user() macro instead of
referencing sock_lock.owner directly.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 net/sunrpc/svcsock.c  |2 +-
 net/sunrpc/xprtsock.c |2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index ed17a50..3a95612 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -104,7 +104,7 @@ static struct lock_class_key svc_slock_key[2];
 static inline void svc_reclassify_socket(struct socket *sock)
 {
struct sock *sk = sock->sk;
-   BUG_ON(sk->sk_lock.owner != NULL);
+   BUG_ON(sock_owned_by_user(sk));
switch (sk->sk_family) {
case AF_INET:
sock_lock_init_class_and_name(sk, "slock-AF_INET-NFSD",
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 4ae7eed..282efd4 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -1186,7 +1186,7 @@ static struct lock_class_key xs_slock_key[2];
 static inline void xs_reclassify_socket(struct socket *sock)
 {
struct sock *sk = sock->sk;
-   BUG_ON(sk->sk_lock.owner != NULL);
+   BUG_ON(sock_owned_by_user(sk));
switch (sk->sk_family) {
case AF_INET:
sock_lock_init_class_and_name(sk, "slock-AF_INET-NFS",
-- 
1.5.3.rc7.30.g947ad2

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 2/2] [NET] Change type of owner in sock_lock_t to int, rename

The type of owner in sock_lock_t is currently (struct sock_iocb *),
presumably for historical reasons.  It is never used as this type, only
tested as NULL or set to (void *)1.  For clarity, this changes it to type
int, and renames to owned, to avoid any possible type casting errors.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 include/net/sock.h |7 +++
 net/core/sock.c|6 +++---
 2 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 802c670..5ed9fa4 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -76,10 +76,9 @@
  * between user contexts and software interrupt processing, whereas the
  * mini-semaphore synchronizes multiple users amongst themselves.
  */
-struct sock_iocb;
 typedef struct {
spinlock_t  slock;
-   struct sock_iocb*owner;
+   int owned;
wait_queue_head_t   wq;
/*
 * We express the mutex-alike socket_lock semantics
@@ -737,7 +736,7 @@ static inline int sk_stream_wmem_schedule(struct sock *sk, 
int size)
  * Since ~2.3.5 it is also exclusive sleep lock serializing
  * accesses from user process context.
  */
-#define sock_owned_by_user(sk) ((sk)->sk_lock.owner)
+#define sock_owned_by_user(sk) ((sk)->sk_lock.owned)
 
 /*
  * Macro so as to not evaluate some arguments when
@@ -748,7 +747,7 @@ static inline int sk_stream_wmem_schedule(struct sock *sk, 
int size)
  */
 #define sock_lock_init_class_and_name(sk, sname, skey, name, key)  \
 do {   \
-   sk->sk_lock.owner = NULL;   \
+   sk->sk_lock.owned = 0;  \
init_waitqueue_head(&sk->sk_lock.wq);   \
spin_lock_init(&(sk)->sk_lock.slock);   \
debug_check_no_locks_freed((void *)&(sk)->sk_lock,  \
diff --git a/net/core/sock.c b/net/core/sock.c
index cfed7d4..edbc562 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1575,9 +1575,9 @@ void fastcall lock_sock_nested(struct sock *sk, int 
subclass)
 {
might_sleep();
spin_lock_bh(&sk->sk_lock.slock);
-   if (sk->sk_lock.owner)
+   if (sk->sk_lock.owned)
__lock_sock(sk);
-   sk->sk_lock.owner = (void *)1;
+   sk->sk_lock.owned = 1;
spin_unlock(&sk->sk_lock.slock);
/*
 * The sk_lock has mutex_lock() semantics here:
@@ -1598,7 +1598,7 @@ void fastcall release_sock(struct sock *sk)
spin_lock_bh(&sk->sk_lock.slock);
if (sk->sk_backlog.tail)
__release_sock(sk);
-   sk->sk_lock.owner = NULL;
+   sk->sk_lock.owned = 0;
if (waitqueue_active(&sk->sk_lock.wq))
wake_up(&sk->sk_lock.wq);
spin_unlock_bh(&sk->sk_lock.slock);
-- 
1.5.3.rc7.30.g947ad2

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 2/2] [IPROUTE2] ss: parse bare integers are port numbers rather than IP addresses


Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 misc/ss.c |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index 5d14f13..d617f6d 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -953,6 +953,10 @@ void *parse_hostcond(char *addr)
memset(&a, 0, sizeof(a));
a.port = -1;
 
+   /* Special case: integer by itself is considered a port number */
+   if (!get_integer(&a.port, addr, 0))
+   goto out;
+
if (fam == AF_UNIX || strncmp(addr, "unix:", 5) == 0) {
char *p;
a.addr.family = AF_UNIX;
-- 
1.5.3.rc4.29.g74276-dirty

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 1/2] [IPROUTE2] Add missing LIBUTIL for dependencies.


Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 Makefile |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/Makefile b/Makefile
index af0d5e4..7e4605c 100644
--- a/Makefile
+++ b/Makefile
@@ -29,7 +29,8 @@ LDLIBS += -L../lib -lnetlink -lutil
 
 SUBDIRS=lib ip tc misc netem genl
 
-LIBNETLINK=../lib/libnetlink.a ../lib/libutil.a
+LIBUTIL=../lib/libutil.a
+LIBNETLINK=../lib/libnetlink.a $(LIBUTIL)
 
 all: Config
@set -e; \
-- 
1.5.3.rc4.29.g74276-dirty

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] include listenq max/backlog in tcp_info and related reports - correct version/signorder

2007-09-17 Thread John Heffner

Any reason you're overloading tcpi_unacked and tcpi_sacked?  It seems 
that setting idiag_rqueue and idiag_wqueue are sufficient.


  -John


Rick Jones wrote:

Return some useful information such as the maximum listen backlog and the
current listen backlog in the tcp_info structure and have that match what
one can see in /proc/net/tcp, /proc/net/tcp6, and INET_DIAG_INFO.

Signed-off-by: Rick Jones <[EMAIL PROTECTED]>
Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]>
---

diff -r bdcdd0e1ee9d Documentation/networking/proc_net_tcp.txt
--- a/Documentation/networking/proc_net_tcp.txt Sat Sep 01 07:00:31 2007 +
+++ b/Documentation/networking/proc_net_tcp.txt Tue Sep 11 10:38:23 2007 -0700
@@ -20,8 +20,8 @@ up into 3 parts because of the length of
   || | |   |--> number of unrecovered RTO timeouts
   || | |--> number of jiffies until timer expires
   || |> timer_active (see below)
-  ||--> receive-queue
-  |---> transmit-queue
+  ||--> receive-queue or connection backlog
+  |---> transmit-queue or connection limit
 
10000 54165785 4 cd1e6040 25 4 27 3 -1
 |  || || |  | |  | |--> slow start size threshold, 
diff -r bdcdd0e1ee9d net/ipv4/tcp.c

--- a/net/ipv4/tcp.cSat Sep 01 07:00:31 2007 +
+++ b/net/ipv4/tcp.cTue Sep 11 10:38:23 2007 -0700
@@ -2030,8 +2030,14 @@ void tcp_get_info(struct sock *sk, struc
info->tcpi_snd_mss = tp->mss_cache;
info->tcpi_rcv_mss = icsk->icsk_ack.rcv_mss;
 
-	info->tcpi_unacked = tp->packets_out;

-   info->tcpi_sacked = tp->sacked_out;
+   if (sk->sk_state == TCP_LISTEN) {
+   info->tcpi_unacked = sk->sk_ack_backlog;
+   info->tcpi_sacked = sk->sk_max_ack_backlog;
+   }
+   else {
+   info->tcpi_unacked = tp->packets_out;
+   info->tcpi_sacked = tp->sacked_out;
+   }
info->tcpi_lost = tp->lost_out;
info->tcpi_retrans = tp->retrans_out;
info->tcpi_fackets = tp->fackets_out;
diff -r bdcdd0e1ee9d net/ipv4/tcp_diag.c
--- a/net/ipv4/tcp_diag.c   Sat Sep 01 07:00:31 2007 +
+++ b/net/ipv4/tcp_diag.c   Tue Sep 11 10:38:23 2007 -0700
@@ -25,11 +25,14 @@ static void tcp_diag_get_info(struct soc
const struct tcp_sock *tp = tcp_sk(sk);
struct tcp_info *info = _info;
 
-	if (sk->sk_state == TCP_LISTEN)

+   if (sk->sk_state == TCP_LISTEN) {
r->idiag_rqueue = sk->sk_ack_backlog;
-   else
+   r->idiag_wqueue = sk->sk_max_ack_backlog;
+   }
+   else {
r->idiag_rqueue = tp->rcv_nxt - tp->copied_seq;
-   r->idiag_wqueue = tp->write_seq - tp->snd_una;
+   r->idiag_wqueue = tp->write_seq - tp->snd_una;
+   }
if (info != NULL)
tcp_get_info(sk, info);
 }
diff -r bdcdd0e1ee9d net/ipv4/tcp_ipv4.c
--- a/net/ipv4/tcp_ipv4.c   Sat Sep 01 07:00:31 2007 +
+++ b/net/ipv4/tcp_ipv4.c   Tue Sep 11 10:38:23 2007 -0700
@@ -2320,7 +2320,8 @@ static void get_tcp4_sock(struct sock *s
sprintf(tmpbuf, "%4d: %08X:%04X %08X:%04X %02X %08X:%08X %02X:%08lX "
"%08X %5d %8d %lu %d %p %u %u %u %u %d",
i, src, srcp, dest, destp, sk->sk_state,
-   tp->write_seq - tp->snd_una,
+   sk->sk_state == TCP_LISTEN ? sk->sk_max_ack_backlog :
+(tp->write_seq - tp->snd_una),
sk->sk_state == TCP_LISTEN ? sk->sk_ack_backlog :
 (tp->rcv_nxt - tp->copied_seq),
timer_active,
diff -r bdcdd0e1ee9d net/ipv6/tcp_ipv6.c
--- a/net/ipv6/tcp_ipv6.c   Sat Sep 01 07:00:31 2007 +
+++ b/net/ipv6/tcp_ipv6.c   Tue Sep 11 10:38:23 2007 -0700
@@ -2005,8 +2005,10 @@ static void get_tcp6_sock(struct seq_fil
   dest->s6_addr32[0], dest->s6_addr32[1],
   dest->s6_addr32[2], dest->s6_addr32[3], destp,
   sp->sk_state,
-  tp->write_seq-tp->snd_una,
-  (sp->sk_state == TCP_LISTEN) ? sp->sk_ack_backlog : 
(tp->rcv_nxt - tp->copied_seq),
+  (sp->sk_state == TCP_LISTEN) ? sp->sk_max_ack_backlog:
+ tp->write_seq-tp->snd_una,
+		   (sp->sk_state == TCP_LISTEN) ? sp->sk_ack_backlog : 
+	(tp->rcv_nxt - tp->copied_seq),

   timer_active,
   jiffies_to_clock_t(timer_expires - jiffies),
   icsk->icsk_retransmits,

Re: [PATCH] include listenq max/backlog in tcp_info and related reports - correct version/signorder

2007-09-17 Thread John Heffner


Rick Jones wrote:

John Heffner wrote:
Any reason you're overloading tcpi_unacked and tcpi_sacked?  It seems 
that setting idiag_rqueue and idiag_wqueue are sufficient.


Different fields for different structures.   The tcp_info struct doesn't 
have the idiag_mumble, so to get the two values shown in /proc/net/tcp I 
use tcpi_unacked and tcpi_sacked.


For the INET_DIAG_INFO stuff the idiag_mumble fields are used and that 
then covers ss.


Maybe I'm missing something.  get_tcp[46]_sock() does not use struct 
tcp_info.  The only way I see using this is by doing 
getsockopt(TCP_INFO) on your listen socket.  Is this the intention?


  -John

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH: 2.6.13-15-SMP 3/3] network: concurrently run softirq network code on SMP

2007-09-20 Thread john ye

Bottom Softirq Implementation. John Ye, 2007.08.27

Why this patch:
Make kernel be able to concurrently execute softirq's net code on SMP 
system.
Takes full advantages of SMP to handle more packets and greatly raises NIC 
throughput.
The current kernel's net packet processing logic is:
1) The CPU which handles a hardirq must be executing its related softirq.
2) One softirq instance(irqs handled by 1 CPU) can't be executed on more 
than 2 CPUs
at the same time.
The limitation make kernel network be hard to take the advantages of SMP.

How this patch:
It splits the current softirq code into 2 parts: the cpu-sensitive top half,
and the cpu-insensitive bottom half, then make bottom half(calld BS) be
executed on SMP concurrently.
The two parts are not equal in terms of size and load. Top part has constant 
code
size(mainly, in net/core/dev.c and NIC drivers), while bottom part involves
netfilter(iptables) whose load varies very much. An iptalbes with 1000 rules 
to match
will make the bottom part's load be very high. So, if the bottom part 
softirq
can be randomly distributed to processors and run concurrently on them, the 
network will
gain much more packet handling capacity, network throughput will be be 
increased
remarkably.

Where useful:
It's useful on SMP machines that meet the following 2 conditions:
1) have high kernel network load, for example, running iptables with 
thousands of rules, etc).
2) have more CPUs than active NICs, e.g. a 4 CPUs machine with 2 NICs).
On these system, with the increase of softirq load, some CPUs will be idle
while others(number is equal to # of NIC) keeps busy.
IRQBALANCE will help, but it only shifts IRQ among CPUS, makes no softirq 
concurrency.
Balancing the load of each cpus will not remarkably increase network speed.

Where NOT useful:
If the bottom half of softirq is too small(without running iptables), or the 
network
is too idle, BS patch will not be seen to have visible effect. But It has no
negative affect either.
User can turn on/off BS functionality by /proc/sys/net/bs_enable switch.

How to test:
On a linux box, run iptables, add 2000 rules to table filter & table nat to 
simulate huge
softirq load. Then, open 20 ftp sessions to download big file. On another 
machine(who
use this test machine as gateway), open 20 more ftp download sessions. 
Compare the speed,
without BS enabled, and with BS enabled.
cat /proc/sys/net/bs_enable. this is a switch to turn on/off BS
cat /proc/sys/net/bs_status. this shows the usage of each CPUs
Test shown that when bottom softirq load is high, the network throughput can 
be nearly
doubled on 2 CPUs machine. hopefully it may be quadrupled on a 4 cpus linux 
box.

Bugs:
It will NOT allow hotpug CPU.
It only allows incremental CPUs ids, starting from 0 to num_online_cpus().
for example, 0,1,2,3 is OK. 0,1,8,9 is KO.

Some considerations in the future:
1) With BS patch, the irq balance code on arch/i386/kernel/io_apic.c seems 
no need any more,
at least not for network irq.
2) Softirq load will become very small. It only run the top half of old 
softirq, which
is much less expensive than bottom half---the netfilter program.
To let top softirq process more packets, can these 3 network parameters be 
given a larger value?
   extern int netdev_max_backlog = 1000;
   extern int netdev_budget = 300;
   extern int weight_p = 64;
3) Now, BS are running on built-in keventd thread, we can create new 
workqueues to let it run on?

Signed-off-by: John Ye (Seeker) <[EMAIL PROTECTED]>


--- old/net/ipv4/ip_input.c 2007-09-20 20:50:31.0 +0800
+++ new/net/ipv4/ip_input.c 2007-09-21 05:52:40.0 +0800
@@ -362,6 +362,198 @@
 return NET_RX_DROP;
 }

+
+#define CONFIG_BOTTOM_SOFTIRQ_SMP
+#define CONFIG_BOTTOM_SOFTIRQ_SMP_SYSCTL
+
+#ifdef CONFIG_BOTTOM_SOFTIRQ_SMP
+
+/*
+ *
+Bottom Softirq Implementation. John Ye, 2007.08.27
+
+Why this patch:
+Make kernel be able to concurrently execute softirq's net code on SMP 
system.
+Takes full advantages of SMP to handle more packets and greatly raises NIC 
throughput.
+The current kernel's net packet processing logic is:
+1) The CPU which handles a hardirq must be executing its related softirq.
+2) One softirq instance(irqs handled by 1 CPU) can't be executed on more 
than 2 CPUs
+at the same time.
+The limitation make kernel network be hard to take the advantages of SMP.
+
+How this patch:
+It splits the current softirq code into 2 parts: the cpu-sensitive top 
half,
+and the cpu-insensitive bottom half, then make bottom half(calld BS) be
+executed on SMP concurrently.
+The two parts are not equal in terms of size and load. Top part has 
constant code
+size(mainly, in net/core/dev.c and NIC drivers), while bottom part involves
+netfilter(iptables) whose load varies very much. An iptalbes with 1000 
rules to match
+will make the bottom part's load be very high. So, if the bottom part 
softirq
+can be randomly distributed to processor

Re: [PATCH: 2.6.13-15-SMP 3/3] network: concurrently run softirq network code on SMP

2007-09-21 Thread John Ye

David,

Thanks for your reply. I understand it's not worth to do.

I have made it a loadable module to fulfill the function. it mainly for busy
NAT gateway server with SMP to speed up.

John Ye



- Original Message -
From: "David Miller" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Cc: ; <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>;
<[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Friday, September 21, 2007 1:46 AM
Subject: Re: [PATCH: 2.6.13-15-SMP 3/3] network: concurrently run softirq
network code on SMP


>
> The whole reason the queues are per-cpu is so that we do not
> have to touch remote processor state nor use locks of any
> kind whatsoever.
>
> With multi-queue networking cards becoming more and more
> available, which will split up the packet workload in
> hardware across all available cpus, there is less and less
> reason to make a patch like this one.
>
> We've known about this issue for ages, and if we felt it
> was appropriate to make this change, we would have done
> so years ago.
>


-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

want same order in /sys/class/net/eth as /sys/bus/pci/devices

2007-09-21 Thread John Reiser

I'd like to see the same order of devices in /sys/class/net/eth*
as in /sys/bus/pci/devices.  This would make administration easier.
On Fedora 8 tests, the order I see is reversed:
  http://bugzilla.redhat.com/show_bug.cgi?id=291431

Perhaps the reversal is a result of the alias order listed in
/etc/modprobe.conf.  But the alias order was obtained from some
source.  Was the first reversal due to a user-space program
(such as the anaconda installer), or due to something within
the kernel?

-- 
John Reiser, [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH: 2.6.13-15-SMP 3/3] network: concurrently run softirqnetwork code on SMP

2007-09-22 Thread john ye

Dear Jamal,

Sorry, I sent to you all a not-good-formatted mail.
Thanks for instructions and corrections from you all.

I have thought that packet re-ordering for upper TCP protocol will become 
more intensive and this will make the network even slower.

I do randomly select a CPU to dispatch the skb to. Previously, I dispatch 
skb evenly to all CPUs( round robin, one by one). but I didn't find a quick 
coding. for_each_online_cpu is not quick enough.

According to my test result, it did make packet INPUT speed doubled because 
another CPU is used concurrently.
It seems the packets still keep "roughly ordering" after turning on BS 
patch.

The test is simple: use an 2400 lines of iptables -t filter -A INPUT -p 
tcp -s x.x.x.x --dport yy -j .
these rules make the current softirq be very busy on one CPU and make the 
incoming net very slow. after turning on BS, the speed doubled.

For NAT test, I didn't get a good result like INPUT because real environment 
limitation.
The test is very basic and is far from "full".

It seems to me that the cross-cpu spinlock_ for the queue doesn't have 
big cost and is allowable in terms of CPU time consumption, compared with 
the gains by making other CPUs joint in the work.

I have made BS patch into a loadable module. 
http://linux.chinaunix.net/bbs/thread-909725-2-1.html and let others help 
with testing.

John Ye

- Original Message - 
From: "jamal" <[EMAIL PROTECTED]>
To: "John Ye" <[EMAIL PROTECTED]>
Cc: "David Miller" <[EMAIL PROTECTED]>; ; 
<[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>; 
<[EMAIL PROTECTED]>
Sent: Friday, September 21, 2007 7:43 PM
Subject: Re: [PATCH: 2.6.13-15-SMP 3/3] network: concurrently run 
softirqnetwork code on SMP

> On Fri, 2007-21-09 at 17:25 +0800, John Ye wrote:
>> David,
>>
>> Thanks for your reply. I understand it's not worth to do.
>>
>> I have made it a loadable module to fulfill the function. it mainly for 
>> busy
>> NAT gateway server with SMP to speed up.
>>
>
> John,
>
> It was a little hard to read your code; however, it does seems to me
> like will cause a massive amount of packet reordering to the end hosts
> using you as the gateway especially when it is receiving a lot of
> packets/second.
> You have a queue per CPU that connects your bottom and top half and
> several CPUs that may service a single NIC in your bottom half.
> one cpu in either bottom/top half has to be slightly loaded and you
> loose the ordering where incoming doesnt match outgoing packet order.
>
> cheers,
> jamal
>
> 

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH: 2.6.13-15-SMP 3/3] network: concurrently runsoftirqnetwork code on SMP

2007-09-23 Thread john ye

Dear Jamal,

Yes. you are right. I do "need some real fast traffic generator; possibly 
one that can do
thousands of tcp sessions." to get some kind of convincing result.

Also, the packet reordering is also my big concern. round-robin doesn't have 
much help.

"The INPUT speed is doubled by using 2 CPUs" is shown by these steps:
1) without intables, ftp get a 50M file from another machine, ftp can show 
speed 10M/s.
2) run iptables and add many intpalbes rules, ftp get the same file, the 
speed is down to 3M/s, top shows CPU0 busy in softirq. CPU1 idle.
3) insmod my module BS, then ftp get the same file, the speed can reach 
6M/s, top shows both CPU0 and CPU1 are busy in keventd/0/1

I will try my best to do further test. the best test should be done on a 4 
CPU GATEWAY machine. In China, there are many companies who use linux box 
running iptables as a gateway to serve 1000 around clients, for example. On 
those machines, a lot conntracking, and they have the "idle CPUs while net 
is too busy" problem.

In my BS module (If you got it), only 2 functions are needed to see: 
REP_ip_rcv(), and bs_func(). Others have nothing to do with the BS patch ---  
they are there only for accessing non-EXPORT_SYMBOLed kernel variables.

Thanks a lot for your thought.

John Ye

- Original Message - 
From: "jamal" <[EMAIL PROTECTED]>
To: "john ye" <[EMAIL PROTECTED]>
Cc: "David Miller" <[EMAIL PROTECTED]>; ; 
<[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>; 
<[EMAIL PROTECTED]>
Sent: Sunday, September 23, 2007 8:43 PM
Subject: Re: [PATCH: 2.6.13-15-SMP 3/3] network: concurrently 
runsoftirqnetwork code on SMP

> On Sun, 2007-23-09 at 12:45 +0800, john ye wrote:
>
>>  I do randomly select a CPU to dispatch the skb to. Previously, I
>> dispatch
>>  skb evenly to all CPUs( round robin, one by one). but I didn't find a
>> quick
>>  coding. for_each_online_cpu is not quick enough.
>
> for_each_online_cpu doenst look that expensive - but even round robin
> wont fix the reordering problem. What you need to do is make sure that a
> flow always goes to the same cpu over some period of time.
>
>>  According to my test result, it did make packet INPUT speed doubled
>> because
>>  another CPU is used concurrently.
>
> How did you measure "speed" - was it throughput? Did you measure how
> much cpu was being utilized?
>
>>  It seems the packets still keep "roughly ordering" after turning on
>> BS patch.
>
> Linux TCP is very resilient to reordering compared to other OSes, but
> even then if you hit it with enough packets it is going to start
> sweating it.
>
>>  The test is simple: use an 2400 lines of iptables -t filter -A INPUT
>> -p
>>  tcp -s x.x.x.x --dport yy -j .
>>  these rules make the current softirq be very busy on one CPU and make
>> the
>>  incoming net very slow. after turning on BS, the speed doubled.
>>
> Ok, but how do you observe "doubled"?
> Do you have conntrack on? It maybe that what you have just found is
> netfilter needs to have its work defered from packet rcv.
> You need some real fast traffic generator; possibly one that can do
> thousands of tcp sessions.
>
>>  For NAT test, I didn't get a good result like INPUT because real
>> environment limitation.
>>  The test is very basic and is far from "full".
>
> What happens when you totally compile out netfilter and you just use
> this machine as a server?
>
>>  It seems to me that the cross-cpu spinlock_ for the queue doesn't
>> have
>>  big cost and is allowable in terms of CPU time consumption, compared
>> with
>>  the gains by making other CPUs joint in the work.
>>
>>  I have made BS patch into a loadable module.
>>  http://linux.chinaunix.net/bbs/thread-909725-2-1.html and let others
>> help with testing.
>
> It is still very hard to read; and i am not sure how you are going to
> get the performance you claim eventually - you are registering as a tap
> for ip packets, which means you will process two of each incoming
> packets.
>
> cheers,
> jamal
>
> 

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH: 2.6.13-15-SMP 3/3] network: concurrentlyrunsoftirqnetwork code on SMP

2007-09-23 Thread John Ye

Dear Jamal,

Thanks, bothered you all.

I will look into the 2 issues. re-ordering and spinlock, and do extensive
test.
Once having result, no matter positive or negative, I will contact you.
The format will not be a mess any more.

John Ye

- Original Message -
From: "jamal" <[EMAIL PROTECTED]>
To: "john ye" <[EMAIL PROTECTED]>
Cc: "David Miller" <[EMAIL PROTECTED]>; ;
<[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>;
<[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Monday, September 24, 2007 2:07 AM
Subject: Re: [PATCH: 2.6.13-15-SMP 3/3] network:
concurrentlyrunsoftirqnetwork code on SMP


> John,
> It will NEVER be an acceptable solution as long as you have re-ordering.
> I will look at it - but i have to run out for now. In the meantime,
> I have indented it for you to be in proper kernel format so others can
> also look it. Attached.
>
> cheers,
> jamal
>
>


-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH: 2.6.13-15-SMP 3/3] network: concurrently runsoftirqnetwork code on SMP

2007-09-25 Thread john ye

Jamal,

You pointed out a key point: it's NOT acceptable if massive packet re-ordering 
couldn¡¯t be avoided.
I debugged function tcp_ofo_queue in net/ipv4/tcp_input.c & monitored 
out_of_order_queue, found that re-ordering
becomes unacceptable with the softirq load grows.

It's simple to avoid out-of-order packets by changing random dispatch into 
dispatch based on source ip address.
e.g. cpu = iph->saddr % nr_cpus. while cpu is like a hash entry.
Considering that BS patch is mainly used on server with many incoming 
connections,
dispatch by IP should balance CPU load well.

The test is under way, it's not bad so far.
The queue spin_lock seems not cost much.

Below is the bcpp beautified module code. Last time code mess is caused by 
outlook express which killed tabs.

Thanks.

John Ye



/*
 *  BOTTOM_SOFTIRQ_NET
 *  An implementation of bottom softirq concurrent execution on SMP
 *  This is implemented by splitting current net softirq into top 
half
 *  and bottom half, dispatch the bottom half to each cpu's 
workqueue.
 *  Hopefully, it can raise the throughput of NIC when running 
iptalbes
 *  on SMP machine.
 *
 *  Version:$Id: bs_smp.c, v 2.6.13-15 for kernel 2.6.13-15-smp
 *
 *  Authors:John Ye & QianYu Ye, 2007.08.27
 */

#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

static spinlock_t *p_ptype_lock;
static struct list_head *p_ptype_base;/* 16 way hashed list */

int (*Pip_options_rcv_srr)(struct sk_buff *skb);
int (*Pnf_rcv_postxfrm_nonlocal)(struct sk_buff *skb);
struct ip_rt_acct *ip_rt_acct;
struct ipv4_devconf *Pipv4_devconf;

#define ipv4_devconf (*Pipv4_devconf)
//#define ip_rt_acct Pip_rt_acct
#define ip_options_rcv_srr Pip_options_rcv_srr
#define nf_rcv_postxfrm_nonlocal Pnf_rcv_postxfrm_nonlocal
//extern int nf_rcv_postxfrm_local(struct sk_buff *skb);
//extern int ip_options_rcv_srr(struct sk_buff *skb);
static struct workqueue_struct **Pkeventd_wq;
#define keventd_wq (*Pkeventd_wq)

#define INSERT_CODE_HERE

static inline int ip_rcv_finish(struct sk_buff *skb)
{
struct net_device *dev = skb->dev;
struct iphdr *iph = skb->nh.iph;
int err;

/*
 * Initialise the virtual path cache for the packet. It describes
 * how the packet travels inside Linux networking.
 */
if (skb->dst == NULL)
{
if ((err = ip_route_input(skb, iph->daddr, iph->saddr, 
iph->tos, dev)))
{
if (err == -EHOSTUNREACH)
IP_INC_STATS_BH(IPSTATS_MIB_INADDRERRORS);
goto drop;
}
}

if (nf_xfrm_nonlocal_done(skb))
return nf_rcv_postxfrm_nonlocal(skb);

#ifdef CONFIG_NET_CLS_ROUTE
if (skb->dst->tclassid)
{
struct ip_rt_acct *st = ip_rt_acct + 256*smp_processor_id();
u32 idx = skb->dst->tclassid;
st[idx&0xFF].o_packets++;
st[idx&0xFF].o_bytes+=skb->len;
st[(idx>>16)&0xFF].i_packets++;
st[(idx>>16)&0xFF].i_bytes+=skb->len;
}
#endif

if (iph->ihl > 5)
{
struct ip_options *opt;

/* It looks as overkill, because not all
   IP options require packet mangling.
   But it is the easiest for now, especially taking
   into account that combination of IP options
   and running sniffer is extremely rare condition.
  --ANK (980813)
*/

if (skb_cow(skb, skb_headroom(skb)))
{
IP_INC_STATS_BH(IPSTATS_MIB_INDISCARDS);
goto drop;
}
iph = skb->nh.iph;

if (ip_options_compile(NULL, skb))
goto inhdr_error;

opt = &(IPCB(skb)->opt);
if (opt->srr)
{
struct in_device *in_dev = in_dev_get(dev);

Re: [PATCH: 2.6.13-15-SMP 3/3] network: concurrentlyrunsoftirqnetwork code on SMP

2007-09-25 Thread John Ye

Jamal & Stephen,

I found BSS-hash paper you mentioned and have browsed it briefly.
The issue "may end sending all your packets to one cpu" might be dealt with
by
cpu hash (srcip + dstip) % nr_cpus, plus checking cpu balance periodically,
shift cpu by an extra seed value?

Any way, the cpu hash code must not be too expensive because every incoming
packet hits the path.

We are going to do further study on this BSS thing.

__do_IRQ has a tendency to collect same IRQ on different CPUs into one CPU
when NIC is busy(by IRQ_PENDING & IRQ_INPROGRESS control skill). so,
dispatch the load to SMP here may be good thing(?).

Thanks.

John Ye

- Original Message -
From: "jamal" <[EMAIL PROTECTED]>
To: "Stephen Hemminger" <[EMAIL PROTECTED]>
Cc: "john ye" <[EMAIL PROTECTED]>; "David Miller" <[EMAIL PROTECTED]>;
; <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>;
<[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Wednesday, September 26, 2007 6:22 AM
Subject: Re: [PATCH: 2.6.13-15-SMP 3/3] network:
concurrentlyrunsoftirqnetwork code on SMP

> On Tue, 2007-25-09 at 09:03 -0700, Stephen Hemminger wrote:
>
> > There is a standard hash called RSS, that many drivers support because
it is
> > used by other operating systems.
>
> I think any stateless/simple thing will do (something along the lines
> what 802.1ad does for trunk, a 5 classical five tuple etc).
>
> Having solved the reordering problem in such a stateless way introduces
> a loadbalancing setback; you may end sending all your packets to one cpu
> (a problem Mr Ye didnt have when he was re-orderding ;->).
>
> cheers,
> jamal
>
>

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC] Make TCP prequeue configurable

2007-09-27 Thread John Heffner


Stephen Hemminger wrote:

On Fri, 28 Sep 2007 00:08:33 +0200
Eric Dumazet <[EMAIL PROTECTED]> wrote:


Hi all

I am sure some of you are going to tell me that prequeue is not
all black :)

Thank you

[RFC] Make TCP prequeue configurable

The TCP prequeue thing is based on old facts, and has drawbacks.

1) It adds 48 bytes per 'struct tcp_sock'
2) It adds some ugly code in hot paths
3) It has a small hit ratio on typical servers using many sockets
4) It may have a high hit ratio on UP machines running one process,
where the prequeue adds litle gain. (In fact, letting the user
doing the copy after being woke up is better for cache reuse)
5) Doing a copy to user in softirq handler is not good, because of
potential page faults :(
6) Maybe the NET_DMA thing is the only thing that might need prequeue.

This patch introduces a CONFIG_TCP_PREQUEUE, automatically selected if 
CONFIG_NET_DMA is on.


Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]>



Rather than having a two more compile cases and test cases to deal
with.  If you can prove it is useless, make a case for killing
it completely.



I think it really does help in case (4) with old NICs that don't do rx 
checksumming.  I'm not sure how many people really care about this 
anymore, but probably some...?


OTOH, it would be nice to get rid of sysctl_tcp_low_latency.

  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: SWS for rcvbuf < MTU

2007-03-13 Thread John Heffner


Alex Sidorenko wrote:
Here are the values from live kernel (obtained with 'crash') when the host was 
in SWS state:


full_space=708  full_space/2=354
free_space=393
window=76

In this case the test from my original fix, (window < full_space/2),  
succeeds. But John's test


free_space > window + full_space/2
393  430

does not. So I suspect that the new fix will not always work. From tcpdump 
traces we can see that both hosts exchange with 76-byte packets for a long 
time. From customer's application log we see that it continues to read 
76-byte chunks per each read() call - even though more than that is available 
in the receive buffer. Technically it's OK for read() to return even after 
reading one byte, so if sk->receive_queue contains multiple 76-byte skbuffs 
we may return after processing just one skbuff (but we we don't understand 
the details of why this happens on customer's system).


Are there any particular reasons why you want to postpone window update until 
free_space becomes > window + full_space/2 and not as soon as 
free_space > full_space/2? As the only real-life occurance of SWS shows 
free_space oscillating slightly above full_space/2, I created the fix 
specifically to match this phenomena as seen on customer's host. We reach the 
modified section only when (free_space > full_space/2) so it should be OK to 
update the window at this point if mss==full_space. 

So yes, we can test John's fix on customer's host but I doubt it will work for 
the reasons mentioned above, in brief:


'window = free_space' instead of 'window=full_space/2' is OK,
but the test 'free_space > window + full_space/2' is not for the specific 
pattern customer sees on his hosts.



Sorry for the long delay in response, I've been on vacation.  I'm okay 
with your patch, and I can't think of any real problem with it, except 
that the behavior is non-standard.  Then again, Linux acking in general 
is non-standard, which has created the bug in the first place. :)  The 
only thing I can think where it might still ack too often is if 
free_space frequently drops just below full_space/2 for a bit then rises 
above full_space/2.


I've also attached a corrected version of my earlier patch that I think 
solves the problem you noted.


Thanks,
  -John
Do full receiver-side SWS avoidance when rcvbuf < mss.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>

---
commit f4333661026621e15549fb75b37be785e4a1c443
tree 30d46b64ea19634875fdd4656d33f76db526a313
parent 562aa1d4c6a874373f9a48ac184f662fbbb06a04
author John Heffner <[EMAIL PROTECTED]> Tue, 13 Mar 2007 14:17:03 -0400
committer John Heffner <[EMAIL PROTECTED]> Tue, 13 Mar 2007 14:17:03 -0400

 net/ipv4/tcp_output.c |9 -
 1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index dc15113..e621a63 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1605,8 +1605,15 @@ u32 __tcp_select_window(struct sock *sk)
 * We also don't do any window rounding when the free space
 * is too small.
 */
-   if (window <= free_space - mss || window > free_space)
+   if (window <= free_space - mss || window > free_space) {
window = (free_space/mss)*mss;
+   } else if (mss == full_space) {
+   /* Do full receive-side SWS avoidance
+* when rcvbuf <= mss */
+   window = tcp_receive_window(tp);
+   if (free_space > window + full_space/2)
+   window = free_space;
+   }
}
 
return window;

[PATCH] tcp_mem initialization

2007-03-14 Thread John Heffner

The current tcp_mem initialization gives values that are really too 
small for systems with ~256-768 MB of memory, and also for systems with 
larger page sizes (ia64).  This patch gives an alternate method of 
initialization that doesn't depend on the cache allocation functions, 
but I think should still provide a nice curve that gives a smaller 
fraction of total memory with small-memory systems, while maintaining 
the same upper bound (pressure at 1/2, max as 3/4) on larger memory systems.


  -John

Change tcp_mem initialization function.  The fraction of total memory is now
a continuous function of memory size, and independent of page size.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>

---
commit a4461a36efb376bf01399cfd6f1ad15dc89a8794
tree 23b2fb9da52b45de8008fc7ea6bb8c10e3a3724b
parent 8b9909ded6922c33c221b105b26917780cfa497d
author John Heffner <[EMAIL PROTECTED]> Wed, 14 Mar 2007 17:15:06 -0400
committer John Heffner <[EMAIL PROTECTED]> Wed, 14 Mar 2007 17:15:06 -0400

 net/ipv4/tcp.c |   13 ++---
 1 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 74c4d10..3834b10 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2458,11 +2458,18 @@ void __init tcp_init(void)
sysctl_max_syn_backlog = 128;
}
 
-   /* Allow no more than 3/4 kernel memory (usually less) allocated to TCP 
*/
-   sysctl_tcp_mem[0] = (1536 / sizeof (struct inet_bind_hashbucket)) << 
order;
-   sysctl_tcp_mem[1] = sysctl_tcp_mem[0] * 4 / 3;
+   /* Set the pressure threshold to be a fraction of global memory that
+* is up to 1/2 at 256 MB, decreasing toward zero with the amount of
+* memory, with a floor of 128 pages.
+*/
+   limit = min(nr_all_pages, 1UL<<(28-PAGE_SHIFT)) >> (20-PAGE_SHIFT);
+   limit = (limit * (nr_all_pages >> (20-PAGE_SHIFT))) >> (PAGE_SHIFT-11);
+   limit = max(limit, 128UL);
+   sysctl_tcp_mem[0] = limit / 4 * 3;
+   sysctl_tcp_mem[1] = limit;
sysctl_tcp_mem[2] = sysctl_tcp_mem[0] * 2;
 
+   /* Set per-socket limits to no more than 1/128 the pressure threshold */
limit = ((unsigned long)sysctl_tcp_mem[1]) << (PAGE_SHIFT - 7);
max_share = min(4UL*1024*1024, limit);

Re: [PATCH] tcp_mem initialization

2007-03-15 Thread John Heffner

David Miller wrote:

From: John Heffner <[EMAIL PROTECTED]>
Date: Wed, 14 Mar 2007 17:25:22 -0400

The current tcp_mem initialization gives values that are really too 
small for systems with ~256-768 MB of memory, and also for systems with 
larger page sizes (ia64).  This patch gives an alternate method of 
initialization that doesn't depend on the cache allocation functions, 
but I think should still provide a nice curve that gives a smaller 
fraction of total memory with small-memory systems, while maintaining 
the same upper bound (pressure at 1/2, max as 3/4) on larger memory systems.

Indeed, it's really dumb for any of these calculations to be
dependant upon the page size.

Your patch looks good, and I'll review it further tomorrow and
push upstream unless I find some issues with it.

Thanks John.

The way it's coded is somewhat opaque since it has to be done with 
32-bit integer arithmetic.  These plots might help make the motivation 
behind the code a little clearer.

Thanks,
  -John

[PATCH 0/3] [NET] MTU discovery changes

These are a few changes to fix/clean up some of the MTU discovery 
processing with non-stream sockets, and add a probing mode.  See also 
matching patches to tracepath to take advantage of this.


  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 1/3] [NET] Do pmtu check in transport layer

Check the pmtu check at the transport layer (for UDP, ICMP and raw), and
send a local error if socket is PMTUDISC_DO and packet is too big.  This is
actually a pure bugfix for ipv6.  For ipv4, it allows us to do pmtu checks
in the same way as for ipv6.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 net/ipv4/ip_output.c  |4 +++-
 net/ipv4/raw.c|8 +---
 net/ipv6/ip6_output.c |   11 ++-
 net/ipv6/raw.c|7 +--
 4 files changed, 19 insertions(+), 11 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index d096332..593acf7 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -822,7 +822,9 @@ int ip_append_data(struct sock *sk,
fragheaderlen = sizeof(struct iphdr) + (opt ? opt->optlen : 0);
maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen;
 
-   if (inet->cork.length + length > 0x - fragheaderlen) {
+   if (inet->cork.length + length > 0x - fragheaderlen ||
+   (inet->pmtudisc >= IP_PMTUDISC_DO &&
+inet->cork.length + length > mtu)) {
ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->dport, 
mtu-exthdrlen);
return -EMSGSIZE;
}
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 87e9c16..f252f4e 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -271,10 +271,12 @@ static int raw_send_hdrinc(struct sock *sk, void *from, 
size_t length,
struct iphdr *iph;
struct sk_buff *skb;
int err;
+   int mtu;
 
-   if (length > rt->u.dst.dev->mtu) {
-   ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->dport,
-  rt->u.dst.dev->mtu);
+   mtu = inet->pmtudisc == IP_PMTUDISC_DO ? dst_mtu(&rt->u.dst) :
+rt->u.dst.dev->mtu;
+   if (length > mtu) {
+   ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->dport, mtu);
return -EMSGSIZE;
}
if (flags&MSG_PROBE)
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 3055169..711dfc3 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1044,11 +1044,12 @@ int ip6_append_data(struct sock *sk, int getfrag(void 
*from, char *to,
fragheaderlen = sizeof(struct ipv6hdr) + rt->u.dst.nfheader_len + (opt 
? opt->opt_nflen : 0);
maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen - 
sizeof(struct frag_hdr);
 
-   if (mtu <= sizeof(struct ipv6hdr) + IPV6_MAXPLEN) {
-   if (inet->cork.length + length > sizeof(struct ipv6hdr) + 
IPV6_MAXPLEN - fragheaderlen) {
-   ipv6_local_error(sk, EMSGSIZE, fl, mtu-exthdrlen);
-   return -EMSGSIZE;
-   }
+   if ((mtu <= sizeof(struct ipv6hdr) + IPV6_MAXPLEN &&
+inet->cork.length + length > sizeof(struct ipv6hdr) + IPV6_MAXPLEN 
- fragheaderlen) ||
+   (np->pmtudisc >= IPV6_PMTUDISC_DO &&
+inet->cork.length + length > mtu)) {
+   ipv6_local_error(sk, EMSGSIZE, fl, mtu-exthdrlen);
+   return -EMSGSIZE;
}
 
/*
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 306d5d8..75db277 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -556,9 +556,12 @@ static int rawv6_send_hdrinc(struct sock *sk, void *from, 
int length,
struct sk_buff *skb;
unsigned int hh_len;
int err;
+   int mtu;
 
-   if (length > rt->u.dst.dev->mtu) {
-   ipv6_local_error(sk, EMSGSIZE, fl, rt->u.dst.dev->mtu);
+   mtu = np->pmtudisc == IPV6_PMTUDISC_DO ? dst_mtu(&rt->u.dst) :
+rt->u.dst.dev->mtu;
+   if (length > mtu) {
+   ipv6_local_error(sk, EMSGSIZE, fl, mtu);
return -EMSGSIZE;
}
if (flags&MSG_PROBE)
-- 
1.5.0.2.gc260-dirty

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 2/3] [NET] Move DF check to ip_forward

Do fragmentation check in ip_forward, similar to ipv6 forwarding.  Also add
a debug printk in the DF check in ip_fragment since we should now never
reach it.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 net/ipv4/ip_forward.c |8 
 net/ipv4/ip_output.c  |2 ++
 2 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c
index 369e721..0efb1f5 100644
--- a/net/ipv4/ip_forward.c
+++ b/net/ipv4/ip_forward.c
@@ -85,6 +85,14 @@ int ip_forward(struct sk_buff *skb)
if (opt->is_strictroute && rt->rt_dst != rt->rt_gateway)
goto sr_failed;
 
+   if (unlikely(skb->len > dst_mtu(&rt->u.dst) &&
+(skb->nh.iph->frag_off & htons(IP_DF))) && !skb->local_df) 
{
+   IP_INC_STATS(IPSTATS_MIB_FRAGFAILS);
+   icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
+ htonl(dst_mtu(&rt->u.dst)));
+   goto drop;
+   }
+
/* We are about to mangle packet. Copy it! */
if (skb_cow(skb, LL_RESERVED_SPACE(rt->u.dst.dev)+rt->u.dst.header_len))
goto drop;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 593acf7..90bdd53 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -433,6 +433,8 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct 
sk_buff*))
iph = skb->nh.iph;
 
if (unlikely((iph->frag_off & htons(IP_DF)) && !skb->local_df)) {
+   if (net_ratelimit())
+   printk(KERN_DEBUG "ip_fragment: requested fragment of 
packet with DF set\n");
IP_INC_STATS(IPSTATS_MIB_FRAGFAILS);
icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
  htonl(dst_mtu(&rt->u.dst)));
-- 
1.5.0.2.gc260-dirty

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 3/3] [NET] Add IP(V6)_PMTUDISC_RPOBE

Add IP(V6)_PMTUDISC_PROBE value for IP(V6)_MTU_DISCOVER.  This option forces
us not to fragment, but does not make use of the kernel path MTU discovery. 
That is, it allows for user-mode MTU probing (or, packetization-layer path
MTU discovery).  This is particularly useful for diagnostic utilities, like
traceroute/tracepath.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 include/linux/in.h   |1 +
 include/linux/in6.h  |1 +
 include/linux/skbuff.h   |3 ++-
 include/net/ip.h |2 +-
 net/core/skbuff.c|2 ++
 net/ipv4/ip_output.c |   14 ++
 net/ipv4/ip_sockglue.c   |2 +-
 net/ipv4/raw.c   |3 +++
 net/ipv6/ip6_output.c|   12 
 net/ipv6/ipv6_sockglue.c |2 +-
 net/ipv6/raw.c   |3 +++
 11 files changed, 33 insertions(+), 12 deletions(-)

diff --git a/include/linux/in.h b/include/linux/in.h
index 1912e7c..2dc1f8a 100644
--- a/include/linux/in.h
+++ b/include/linux/in.h
@@ -83,6 +83,7 @@ struct in_addr {
 #define IP_PMTUDISC_DONT   0   /* Never send DF frames */
 #define IP_PMTUDISC_WANT   1   /* Use per route hints  */
 #define IP_PMTUDISC_DO 2   /* Always DF*/
+#define IP_PMTUDISC_PROBE  3   /* Ignore dst pmtu  */
 
 #define IP_MULTICAST_IF32
 #define IP_MULTICAST_TTL   33
diff --git a/include/linux/in6.h b/include/linux/in6.h
index 4e8350a..d559fac 100644
--- a/include/linux/in6.h
+++ b/include/linux/in6.h
@@ -179,6 +179,7 @@ struct in6_flowlabel_req
 #define IPV6_PMTUDISC_DONT 0
 #define IPV6_PMTUDISC_WANT 1
 #define IPV6_PMTUDISC_DO   2
+#define IPV6_PMTUDISC_PROBE3
 
 /* Flowlabel */
 #define IPV6_FLOWLABEL_MGR 32
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 4ff3940..64038b4 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -284,7 +284,8 @@ struct sk_buff {
nfctinfo:3;
__u8pkt_type:3,
fclone:2,
-   ipvs_property:1;
+   ipvs_property:1,
+   ign_dst_mtu;
__be16  protocol;
 
void(*destructor)(struct sk_buff *skb);
diff --git a/include/net/ip.h b/include/net/ip.h
index e79c3e3..f5874a3 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -201,7 +201,7 @@ int ip_decrease_ttl(struct iphdr *iph)
 static inline
 int ip_dont_fragment(struct sock *sk, struct dst_entry *dst)
 {
-   return (inet_sk(sk)->pmtudisc == IP_PMTUDISC_DO ||
+   return (inet_sk(sk)->pmtudisc >= IP_PMTUDISC_DO ||
(inet_sk(sk)->pmtudisc == IP_PMTUDISC_WANT &&
 !(dst_metric(dst, RTAX_LOCK)&(1<destructor = NULL;
C(mark);
@@ -549,6 +550,7 @@ static void copy_skb_header(struct sk_buff *new, const 
struct sk_buff *old)
 #if defined(CONFIG_IP_VS) || defined(CONFIG_IP_VS_MODULE)
new->ipvs_property = old->ipvs_property;
 #endif
+   new->ign_dst_mtu= old->ign_dst_mtu;
 #ifdef CONFIG_BRIDGE_NETFILTER
new->nf_bridge  = old->nf_bridge;
nf_bridge_get(old->nf_bridge);
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 90bdd53..a7e8944 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -201,7 +201,8 @@ static inline int ip_finish_output(struct sk_buff *skb)
return dst_output(skb);
}
 #endif
-   if (skb->len > dst_mtu(skb->dst) && !skb_is_gso(skb))
+   if (skb->len > dst_mtu(skb->dst) &&
+   !skb->ign_dst_mtu && !skb_is_gso(skb))
return ip_fragment(skb, ip_finish_output2);
else
return ip_finish_output2(skb);
@@ -801,7 +802,9 @@ int ip_append_data(struct sock *sk,
inet->cork.addr = ipc->addr;
}
dst_hold(&rt->u.dst);
-   inet->cork.fragsize = mtu = dst_mtu(rt->u.dst.path);
+   inet->cork.fragsize = mtu = inet->pmtudisc == IP_PMTUDISC_PROBE 
?
+   rt->u.dst.dev->mtu :
+   dst_mtu(rt->u.dst.path);
inet->cork.rt = rt;
inet->cork.length = 0;
sk->sk_sndmsg_page = NULL;
@@ -1220,13 +1223,16 @@ int ip_push_pending_frames(struct sock *sk)
 * to fragment the frame generated here. No matter, what transforms
 * how transforms change size of the packet, it will come out.
 */
-   if (inet->pmtudisc != IP_PMTUDISC_DO)
+   if (inet->pmtudisc < IP_PMTUDISC_DO)
skb->local_df = 1;
 
+   if (inet->pmtudisc == IP_PMTUDISC_PROBE)
+   s

[PATCH 0/2] [iputils] MTU discovery changes

These add some changes that make tracepath a little more useful for 
diagnosing MTU issues.  The length flag helps distinguish between MTU 
black holes and other types of black holes by allowing you to vary the 
probe packet lengths.  Using PMTUDISC_PROBE gives you the same results 
on each run without having to flush the route cache, so you can see 
where MTU changes in the path actually occur.


The PMTUDISC_PROBE patch goes in should be conditional on whether the 
corresponding kernel patch (just sent) goes in.


  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 2/2] [iputils] Use PMTUDISC_PROBE mode if it exists.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 tracepath.c  |   10 --
 tracepath6.c |   10 --
 2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/tracepath.c b/tracepath.c
index 1f901ba..a562d88 100644
--- a/tracepath.c
+++ b/tracepath.c
@@ -24,6 +24,10 @@
 #include 
 #include 
 
+#ifndef IP_PMTUDISC_PROBE
+#define IP_PMTUDISC_PROBE  3
+#endif
+
 struct hhistory
 {
int hops;
@@ -322,8 +326,10 @@ main(int argc, char **argv)
}
memcpy(&target.sin_addr, he->h_addr, 4);
 
-   on = IP_PMTUDISC_DO;
-   if (setsockopt(fd, SOL_IP, IP_MTU_DISCOVER, &on, sizeof(on))) {
+   on = IP_PMTUDISC_PROBE;
+   if (setsockopt(fd, SOL_IP, IP_MTU_DISCOVER, &on, sizeof(on)) &&
+   (on = IP_PMTUDISC_DO,
+setsockopt(fd, SOL_IP, IP_MTU_DISCOVER, &on, sizeof(on {
perror("IP_MTU_DISCOVER");
exit(1);
}
diff --git a/tracepath6.c b/tracepath6.c
index d65230d..6f13a51 100644
--- a/tracepath6.c
+++ b/tracepath6.c
@@ -30,6 +30,10 @@
 #define SOL_IPV6 IPPROTO_IPV6
 #endif
 
+#ifndef IPV6_PMTUDISC_PROBE
+#define IPV6_PMTUDISC_PROBE3
+#endif
+
 int overhead = 48;
 int mtu = 128000;
 int hops_to = -1;
@@ -369,8 +373,10 @@ int main(int argc, char **argv)
mapped = 1;
}
 
-   on = IPV6_PMTUDISC_DO;
-   if (setsockopt(fd, SOL_IPV6, IPV6_MTU_DISCOVER, &on, sizeof(on))) {
+   on = IPV6_PMTUDISC_PROBE;
+   if (setsockopt(fd, SOL_IPV6, IPV6_MTU_DISCOVER, &on, sizeof(on)) &&
+   (on = IPV6_PMTUDISC_DO,
+setsockopt(fd, SOL_IPV6, IPV6_MTU_DISCOVER, &on, sizeof(on {
perror("IPV6_MTU_DISCOVER");
exit(1);
}
-- 
1.5.0.2.gc260-dirty

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 1/2] [iputils] Add length flag to set initial MTU.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 tracepath.c  |   10 --
 tracepath6.c |   10 --
 2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/tracepath.c b/tracepath.c
index c3f6f74..1f901ba 100644
--- a/tracepath.c
+++ b/tracepath.c
@@ -265,7 +265,7 @@ static void usage(void) __attribute((noreturn));
 
 static void usage(void)
 {
-   fprintf(stderr, "Usage: tracepath [-n] [/]\n");
+   fprintf(stderr, "Usage: tracepath [-n] [-l ] 
[/]\n");
exit(-1);
 }
 
@@ -279,11 +279,17 @@ main(int argc, char **argv)
char *p;
int ch;
 
-   while ((ch = getopt(argc, argv, "nh?")) != EOF) {
+   while ((ch = getopt(argc, argv, "nh?l:")) != EOF) {
switch(ch) {
case 'n':   
no_resolve = 1;
break;
+   case 'l':
+   if ((mtu = atoi(optarg)) <= overhead) {
+   fprintf(stderr, "Error: length must be >= 
%d\n", overhead);
+   exit(1);
+   }
+   break;
default:
usage();
}
diff --git a/tracepath6.c b/tracepath6.c
index 23d6a8c..d65230d 100644
--- a/tracepath6.c
+++ b/tracepath6.c
@@ -280,7 +280,7 @@ static void usage(void) __attribute((noreturn));
 
 static void usage(void)
 {
-   fprintf(stderr, "Usage: tracepath6 [-n] [-b] [/]\n");
+   fprintf(stderr, "Usage: tracepath6 [-n] [-b] [-l ] 
[/]\n");
exit(-1);
 }
 
@@ -297,7 +297,7 @@ int main(int argc, char **argv)
int gai;
char pbuf[NI_MAXSERV];
 
-   while ((ch = getopt(argc, argv, "nbh?")) != EOF) {
+   while ((ch = getopt(argc, argv, "nbh?l:")) != EOF) {
switch(ch) {
case 'n':   
no_resolve = 1;
@@ -305,6 +305,12 @@ int main(int argc, char **argv)
case 'b':   
show_both = 1;
break;
+   case 'l':
+   if ((mtu = atoi(optarg)) <= overhead) {
+   fprintf(stderr, "Error: length must be >= 
%d\n", overhead);
+   exit(1);
+   }
+   break;
default:
usage();
}
-- 
1.5.0.2.gc260-dirty

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] ip(7) IP_PMTUDISC_PROBE

2007-03-27 Thread John Heffner

Document new IP_PMTUDISC_PROBE value for IP_MTU_DISCOVERY.  (Going into 
2.6.22).


Thanks,
  -John
diff -rU3 man-pages-2.43-a/man7/ip.7 man-pages-2.43-b/man7/ip.7
--- man-pages-2.43-a/man7/ip.7  2006-09-26 09:54:29.0 -0400
+++ man-pages-2.43-b/man7/ip.7  2007-03-27 15:46:18.0 -0400
@@ -515,6 +515,7 @@
 IP_PMTUDISC_WANT:Use per-route settings.
 IP_PMTUDISC_DONT:Never do Path MTU Discovery.
 IP_PMTUDISC_DO:Always do Path MTU Discovery. 
+IP_PMTUDISC_PROBE:Set DF but ignore Path MTU.
 .TE   
 
 When PMTU discovery is enabled the kernel automatically keeps track of
@@ -550,6 +551,15 @@
 with the
 .B IP_MTU
 option. 
+
+It is possible to implement RFC 4821 MTU probing with
+.B SOCK_DGRAM
+of
+.B SOCK_RAW
+sockets by setting a value of IP_PMTUDISC_PROBE.  This is also particularly
+useful for diagnostic tools such as
+.BR tracepath (8)
+that wish to deliberately send probe packets larger than the observed Path MTU.
 .TP
 .B IP_MTU
 Retrieve the current known path MTU of the current socket.

Re: [PATCH] NET: Add TCP connection abort IOCTL

2007-03-27 Thread John Heffner

Mark Huth wrote:

David Miller wrote:

From: [EMAIL PROTECTED] (David Griego)
Date: Tue, 27 Mar 2007 14:47:54 -0700

Adds an IOCTL for aborting established TCP connections, and is
designed to be an HA performance improvement for cleaning up, failure 
notification, and application termination.

Signed-off-by:  David Griego <[EMAIL PROTECTED]>

SO_LINGER with a zero linger time plus close() isn't working
properly?

There is no reason for this ioctl at all.  Either existing
facilities provide what you need or what you want is a
protocol violation we can't do.

Actually, there are legitimate uses for this sort of API.  The patch 
allows an administrator to kill specific connections that are in use by 
other applications, where the close is not available, since the socket 
is owned by another process.  Say one of your large applications has 
hundreds or even thousands of open connections and you have determined 
that a particular connection is causing trouble.  This API allows the 
admin to kill that particular connection, and doesn't appear to violate 
any RFC offhand, since an abort is sent  to the peer.

One may argue that the applications should be modified, but that is not 
always possible in the case of various ISVs.  As Linux gains market 
share in the large server market, more and more applications are being 
ported from other platforms that have this sort of 
management/administrative interfaces.

Mark Huth

I also believe this is a useful thing to have.  I'm not 100% sure this 
ioctl is the way to go, but it seems reasonable.  This directly 
corresponds to writing deleteTcb to the tcpConnectionState variable in 
the TCP MIB (RFC 4022).  I don't think it constitutes a protocol violation.

As a concrete example of a way I've used this type of feature is to 
defend against a netkill [1] style attack, where the defense involves 
making decisions about which connections to kill when memory gets 
scarce.  It makes sense to do this with a system daemon, since an admin 
might have an arbitrarily complicated policy as to which applications 
and peers have priority for the memory.  This is too complicated to 
distribute and enforce across all applications.  You could do this in 
the kernel, but why if you don't have to?

  -John

[1] http://shlang.com/netkill/
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] NET: Add TCP connection abort IOCTL

2007-03-27 Thread John Heffner


John Heffner wrote:
I also believe this is a useful thing to have.  I'm not 100% sure this 
ioctl is the way to go, but it seems reasonable.  This directly 
corresponds to writing deleteTcb to the tcpConnectionState variable in 
the TCP MIB (RFC 4022).  I don't think it constitutes a protocol violation.


Responding to myself in good form :P  I'll add that there are other ways 
to do this currently but all I know of are hackish, f.e. using a raw 
socket to send RST packets to yourself.


  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] [iputils] Add documentation for the -l flag.

---
 doc/tracepath.sgml |   13 +
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/doc/tracepath.sgml b/doc/tracepath.sgml
index 71eaa8d..c0f308b 100644
--- a/doc/tracepath.sgml
+++ b/doc/tracepath.sgml
@@ -15,6 +15,7 @@ traces path to a network host discovering MTU along this 
path
 
 
 tracepath
+-l 
 
 
 
@@ -39,6 +40,18 @@ of UDP ports to maintain trace history.
 
 
 
+OPTIONS
+
+ 
+  
+  
+Sets the initial packet length to 
+ 
+
+
+
 OUTPUT
 
 
-- 
1.5.0.2.gc260-dirty

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] [iputils] Document -n flag.

---
 doc/tracepath.sgml |9 +
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/doc/tracepath.sgml b/doc/tracepath.sgml
index c0f308b..1bc83b9 100644
--- a/doc/tracepath.sgml
+++ b/doc/tracepath.sgml
@@ -15,6 +15,7 @@ traces path to a network host discovering MTU along this 
path
 
 
 tracepath
+-n
 -l 
 
 
@@ -42,6 +43,14 @@ of UDP ports to maintain trace history.
 
 OPTIONS
 
+
+ 
+  
+  
+Do not look up host names.  Only print IP addresses numerically.
+  
+ 
+
  
   
   
-- 
1.5.0.2.gc260-dirty

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 2/2] [iputils] Re-probe at same TTL after MTU reduction.

This fixes a bug that would miss a hop after an ICMP packet too big message,
since it would continue increase the TTL without probing again.
---
 tracepath.c  |6 ++
 tracepath6.c |6 ++
 2 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/tracepath.c b/tracepath.c
index d035a1e..19b2c6b 100644
--- a/tracepath.c
+++ b/tracepath.c
@@ -352,8 +352,14 @@ main(int argc, char **argv)
exit(1);
}
 
+restart:
for (i=0; i<3; i++) {
+   int old_mtu;
+   
+   old_mtu = mtu;
res = probe_ttl(fd, ttl);
+   if (mtu != old_mtu)
+   goto restart;
if (res == 0)
goto done;
if (res > 0)
diff --git a/tracepath6.c b/tracepath6.c
index a010218..65c4a4a 100644
--- a/tracepath6.c
+++ b/tracepath6.c
@@ -422,8 +422,14 @@ int main(int argc, char **argv)
exit(1);
}
 
+restart:
for (i=0; i<3; i++) {
+   int old_mtu;
+   
+   old_mtu = mtu;
res = probe_ttl(fd, ttl);
+   if (mtu != old_mtu)
+   goto restart;
if (res == 0)
goto done;
if (res > 0)
-- 
1.5.0.2.gc260-dirty

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 1/2] [iputils] Fix asymm messages.

We should only print the asymm messages in tracepath/6 when you receive a
TTL expired message, because this is the only time when we'd expect the
same number of hops back as our TTL was set to for a symmetric path.
---
 tracepath.c  |   25 -
 tracepath6.c |   25 -
 2 files changed, 24 insertions(+), 26 deletions(-)

diff --git a/tracepath.c b/tracepath.c
index a562d88..d035a1e 100644
--- a/tracepath.c
+++ b/tracepath.c
@@ -163,19 +163,6 @@ restart:
}
}
 
-   if (rethops>=0) {
-   if (rethops<=64)
-   rethops = 65-rethops;
-   else if (rethops<=128)
-   rethops = 129-rethops;
-   else
-   rethops = 256-rethops;
-   if (sndhops>=0 && rethops != sndhops)
-   printf("asymm %2d ", rethops);
-   else if (sndhops<0 && rethops != ttl)
-   printf("asymm %2d ", rethops);
-   }
-
if (rettv) {
int diff = 
(tv.tv_sec-rettv->tv_sec)*100+(tv.tv_usec-rettv->tv_usec);
printf("%3d.%03dms ", diff/1000, diff%1000);
@@ -204,6 +191,18 @@ restart:
if (e->ee_origin == SO_EE_ORIGIN_ICMP &&
e->ee_type == 11 &&
e->ee_code == 0) {
+   if (rethops>=0) {
+   if (rethops<=64)
+   rethops = 65-rethops;
+   else if (rethops<=128)
+   rethops = 129-rethops;
+   else
+   rethops = 256-rethops;
+   if (sndhops>=0 && rethops != sndhops)
+   printf("asymm %2d ", rethops);
+   else if (sndhops<0 && rethops != ttl)
+   printf("asymm %2d ", rethops);
+   }
printf("\n");
break;
}
diff --git a/tracepath6.c b/tracepath6.c
index 6f13a51..a010218 100644
--- a/tracepath6.c
+++ b/tracepath6.c
@@ -176,19 +176,6 @@ restart:
}
}
 
-   if (rethops>=0) {
-   if (rethops<=64)
-   rethops = 65-rethops;
-   else if (rethops<=128)
-   rethops = 129-rethops;
-   else
-   rethops = 256-rethops;
-   if (sndhops>=0 && rethops != sndhops)
-   printf("asymm %2d ", rethops);
-   else if (sndhops<0 && rethops != ttl)
-   printf("asymm %2d ", rethops);
-   }
-
if (rettv) {
int diff = 
(tv.tv_sec-rettv->tv_sec)*100+(tv.tv_usec-rettv->tv_usec);
printf("%3d.%03dms ", diff/1000, diff%1000);
@@ -220,6 +207,18 @@ restart:
(e->ee_origin == SO_EE_ORIGIN_ICMP6 &&
 e->ee_type == 3 &&
 e->ee_code == 0)) {
+   if (rethops>=0) {
+   if (rethops<=64)
+   rethops = 65-rethops;
+   else if (rethops<=128)
+   rethops = 129-rethops;
+   else
+   rethops = 256-rethops;
+   if (sndhops>=0 && rethops != sndhops)
+   printf("asymm %2d ", rethops);
+   else if (sndhops<0 && rethops != ttl)
+   printf("asymm %2d ", rethops);
+   }
printf("\n");
break;
}
-- 
1.5.0.2.gc260-dirty

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/3] [NET] Do pmtu check in transport layer

2007-04-09 Thread John Heffner


Patrick McHardy wrote:

John Heffner wrote:

Check the pmtu check at the transport layer (for UDP, ICMP and raw), and
send a local error if socket is PMTUDISC_DO and packet is too big.  This is
actually a pure bugfix for ipv6.  For ipv4, it allows us to do pmtu checks
in the same way as for ipv6.

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index d096332..593acf7 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -822,7 +822,9 @@ int ip_append_data(struct sock *sk,
fragheaderlen = sizeof(struct iphdr) + (opt ? opt->optlen : 0);
maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen;
 
-	if (inet->cork.length + length > 0x - fragheaderlen) {

+   if (inet->cork.length + length > 0x - fragheaderlen ||
+   (inet->pmtudisc >= IP_PMTUDISC_DO &&
+inet->cork.length + length > mtu)) {
ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->dport, 
mtu-exthdrlen);
return -EMSGSIZE;
}



This makes ping report an incorrect MTU when IPsec is used since we're
only accounting for the additional header_len, not the trailer_len
(which is not easily changeable). Additionally it will report different
MTUs for the first and following fragments when the socket is corked
because only the first fragment includes the header_len. It also can't
deal with things like NAT and routing by fwmark that change the route.
The old behaviour was that we get an ICMP frag. required with the MTU
of the final route, while this will always report the MTU of the
initially chosen route.

For all these reasons I think it should be reverted to the old
behaviour.


You're right, this is no good.  I think the other problems are fixable, 
but NAT really screws this.


Unfortunately, there is still a real problem with ipv6, in that the 
output side does not generate a packet too big ICMP like ipv4.  Also, it 
feels kind of undesirable be rely on local ICMP instead of direct error 
message delivery.  I'll try to generate a new patch.


Thanks,
  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: TCP connection stops after high load.

2007-04-15 Thread John Heffner


Robert Iakobashvili wrote:

Vanilla 2.6.18.3 works for me perfectly, whereas 2.6.19.5 and
2.6.20.6 do not.

Looking into the tcp /proc entries of 2.6.18.3 versus 2.6.19.5
tcp_rmem and tcp_wmem are the same, whereas tcp_mem are
much different:

kernel  tcp_mem
---
2.6.18.312288 16384 24576
2.6.19.5  30724096   6144


Is not it done deliberately by the below patch:

commit 9e950efa20dc8037c27509666cba6999da9368e8
Author: John Heffner <[EMAIL PROTECTED]>
Date:   Mon Nov 6 23:10:51 2006 -0800

   [TCP]: Don't use highmem in tcp hash size calculation.

   This patch removes consideration of high memory when determining TCP
   hash table sizes.  Taking into account high memory results in tcp_mem
   values that are too large.

Is it a feature?

My machine has:
MemTotal:   484368 kB
and
for all kernel configurations are actually the same with
CONFIG_HIGHMEM4G=y

Thanks,



Another patch that went in right around that time:

commit 52bf376c63eebe72e862a1a6e713976b038c3f50
Author: John Heffner <[EMAIL PROTECTED]>
Date:   Tue Nov 14 20:25:17 2006 -0800

[TCP]: Fix up sysctl_tcp_mem initialization.

Fix up tcp_mem initial settings to take into account the size of the
hash entries (different on SMP and non-SMP systems).

    Signed-off-by: John Heffner <[EMAIL PROTECTED]>
Signed-off-by: David S. Miller <[EMAIL PROTECTED]>

(This has been changed again for 2.6.21.)

In the dmesg, there should be some messages like this:

IP route cache hash table entries: 32768 (order: 5, 131072 bytes)
TCP established hash table entries: 131072 (order: 8, 1048576 bytes)
TCP bind hash table entries: 65536 (order: 6, 262144 bytes)
TCP: Hash tables configured (established 131072 bind 65536)

What do yours say?

Thanks,
  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: TCP connection stops after high load.

2007-04-16 Thread John Heffner

Robert Iakobashvili wrote:

Hi John,

On 4/15/07, John Heffner <[EMAIL PROTECTED]> wrote:

Robert Iakobashvili wrote:
> Vanilla 2.6.18.3 works for me perfectly, whereas 2.6.19.5 and
> 2.6.20.6 do not.
>
> Looking into the tcp /proc entries of 2.6.18.3 versus 2.6.19.5
> tcp_rmem and tcp_wmem are the same, whereas tcp_mem are
> much different:
>
> kernel  tcp_mem
> ---
> 2.6.18.312288 16384 24576
> 2.6.19.5  30724096   6144

Another patch that went in right around that time:

commit 52bf376c63eebe72e862a1a6e713976b038c3f50
Author: John Heffner <[EMAIL PROTECTED]>
Date:   Tue Nov 14 20:25:17 2006 -0800

 [TCP]: Fix up sysctl_tcp_mem initialization.
(This has been changed again for 2.6.21.)

In the dmesg, there should be some messages like this:
IP route cache hash table entries: 32768 (order: 5, 131072 bytes)
TCP established hash table entries: 131072 (order: 8, 1048576 bytes)
TCP bind hash table entries: 65536 (order: 6, 262144 bytes)
TCP: Hash tables configured (established 131072 bind 65536)

What do yours say?

For the 2.6.19.5, where we have this problem:

From dmsg:

IP route cache hash table entries: 4096 (order: 2, 16384 bytes)
TCP established hash table entries: 16384 (order: 5, 131072 bytes)
TCP bind hash table entries: 8192 (order: 4, 65536 bytes)

#cat /proc/sys/net/ipv4/tcp_mem
307240966144

MemTotal:   484368 kB
CONFIG_HIGHMEM4G=y

Yes, this difference is caused by the commit above.  The old way didn't 
really make a lot of sense, since it was different based on smp/non-smp 
and page size, and had large discontinuities at 512MB and every power of 
two.  It was hard to make the limit never larger than the memory pool 
but never too small either, when based on the hash table size.

The current net-2.6 (2.6.21) has a redesigned tcp_mem initialization 
that should give you more appropriate values, something like 45408 60546 
90816.  For reference:

Commit: 53cdcc04c1e85d4e423b2822b66149b6f2e52c2c
Author: John Heffner <[EMAIL PROTECTED]> Fri, 16 Mar 2007 15:04:03 -0700

[TCP]: Fix tcp_mem[] initialization.

Change tcp_mem initialization function.  The fraction of total memory
is now a continuous function of memory size, and independent of page
size.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
Signed-off-by: David S. Miller <[EMAIL PROTECTED]>

Thanks,
  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: TCP connection stops after high load.

2007-04-16 Thread John Heffner


Robert Iakobashvili wrote:

Kernels 2.6.19 and 2.6.20 series are effectively broken right now.
Don't you wish to patch them?



I don't know if this qualifies as an unconditional bug.  The commit 
above was actually a bugfix so that the limits were not higher than 
total memory on some systems, but had the side effect that it made them 
even smaller on your particular configuration.  Also, having initial 
sysctl values that are conservatively small probably doesn't qualify as 
a bug (for patching stable trees).  You might ask the -stable 
maintainers if they have a different opinion.


For most people, 2.6.19 and 2.6.20 work fine.  For those who really care 
about the tcp_mem values (are using a substantial fraction of physical 
memory for TCP connections), the best bet is to set the tcp_mem sysctl 
values in the startup scripts, or use the new initialization function in 
2.6.21.


Thanks,
  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: bug in tcp?

2007-04-16 Thread John Heffner


Stephen Hemminger wrote:

A guess: maybe something related to a PAWS wraparound problem.
Does turning off sysctl net.ipv4.tcp_timestamps fix it?


That was my first thought too (aside from netfilter), but a failed PAWS 
check should not result in a reset..


  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: TCP connection stops after high load.

2007-04-17 Thread John Heffner

David Miller wrote:

From: "Robert Iakobashvili" <[EMAIL PROTECTED]>
Date: Tue, 17 Apr 2007 10:58:04 +0300

David,

On 4/16/07, David Miller <[EMAIL PROTECTED]> wrote:

Commit: 53cdcc04c1e85d4e423b2822b66149b6f2e52c2c
Author: John Heffner <[EMAIL PROTECTED]> Fri, 16 Mar 2007 15:04:03 -0700

 [TCP]: Fix tcp_mem[] initialization.
 Change tcp_mem initialization function.  The fraction of total memory
 is now a continuous function of memory size, and independent of page
 size.

Kernels 2.6.19 and 2.6.20 series are effectively broken right now.
Don't you wish to patch them?

Can you verify that this patch actually fixes your problem?

Yes, it fixes.

Thanks, I will submit it to -stable branch.

My only reservation in submitting this to -stable is that it will in 
many cases increase the default tcp_mem values, which in turn can 
increase the default tcp_rmem values, and therefore the window scale. 
There will be some set of people with broken firewalls who trigger that 
problem for the first time by upgrading along the stable branch.  While 
it's not our fault, it could cause some complaints...

Thanks,
  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 0/0] Re-try changes for PMTUDISC_PROBE

This backs out the the transport layer MTU checks that don't work.  As a 
consequence, I had to back out the PMTUDISC_PROBE patch as well.  These 
patches should fix the problem with ipv6 that the transport layer change 
tried to address, and re-implement PMTUDISC_PROBE.  I think this 
approach is nicer than the last one, since it doesn't require a bit in 
struct sk_buff.


Thanks,
  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] Revert "[NET] Do pmtu check in transport layer"

This reverts commit 87e927a0583bd4a8ba9e97cd75b58d8aa1c76e37.

This idea does not work, as pointed at by Patrick McHardy.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 net/ipv4/ip_output.c  |4 +---
 net/ipv4/raw.c|8 +++-
 net/ipv6/ip6_output.c |   11 +--
 net/ipv6/raw.c|7 ++-
 4 files changed, 11 insertions(+), 19 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 79e71ee..34606ef 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -810,9 +810,7 @@ int ip_append_data(struct sock *sk,
fragheaderlen = sizeof(struct iphdr) + (opt ? opt->optlen : 0);
maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen;
 
-   if (inet->cork.length + length > 0x - fragheaderlen ||
-   (inet->pmtudisc >= IP_PMTUDISC_DO &&
-inet->cork.length + length > mtu)) {
+   if (inet->cork.length + length > 0x - fragheaderlen) {
ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->dport, 
mtu-exthdrlen);
return -EMSGSIZE;
}
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index c60aadf..24d7c9f 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -271,12 +271,10 @@ static int raw_send_hdrinc(struct sock *sk, void *from, 
size_t length,
struct iphdr *iph;
struct sk_buff *skb;
int err;
-   int mtu;
 
-   mtu = inet->pmtudisc == IP_PMTUDISC_DO ? dst_mtu(&rt->u.dst) :
-rt->u.dst.dev->mtu;
-   if (length > mtu) {
-   ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->dport, mtu);
+   if (length > rt->u.dst.dev->mtu) {
+   ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->dport,
+  rt->u.dst.dev->mtu);
return -EMSGSIZE;
}
if (flags&MSG_PROBE)
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index b8e307a..4cfdad4 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1079,12 +1079,11 @@ int ip6_append_data(struct sock *sk, int getfrag(void 
*from, char *to,
fragheaderlen = sizeof(struct ipv6hdr) + rt->u.dst.nfheader_len + (opt 
? opt->opt_nflen : 0);
maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen - 
sizeof(struct frag_hdr);
 
-   if ((mtu <= sizeof(struct ipv6hdr) + IPV6_MAXPLEN &&
-inet->cork.length + length > sizeof(struct ipv6hdr) + IPV6_MAXPLEN 
- fragheaderlen) ||
-   (np->pmtudisc >= IPV6_PMTUDISC_DO &&
-inet->cork.length + length > mtu)) {
-   ipv6_local_error(sk, EMSGSIZE, fl, mtu-exthdrlen);
-   return -EMSGSIZE;
+   if (mtu <= sizeof(struct ipv6hdr) + IPV6_MAXPLEN) {
+   if (inet->cork.length + length > sizeof(struct ipv6hdr) + 
IPV6_MAXPLEN - fragheaderlen) {
+   ipv6_local_error(sk, EMSGSIZE, fl, mtu-exthdrlen);
+   return -EMSGSIZE;
+   }
}
 
/*
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index f4cd90b..f65fcd7 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -558,12 +558,9 @@ static int rawv6_send_hdrinc(struct sock *sk, void *from, 
int length,
struct sk_buff *skb;
unsigned int hh_len;
int err;
-   int mtu;
 
-   mtu = np->pmtudisc == IPV6_PMTUDISC_DO ? dst_mtu(&rt->u.dst) :
-rt->u.dst.dev->mtu;
-   if (length > mtu) {
-   ipv6_local_error(sk, EMSGSIZE, fl, mtu);
+   if (length > rt->u.dst.dev->mtu) {
+   ipv6_local_error(sk, EMSGSIZE, fl, rt->u.dst.dev->mtu);
return -EMSGSIZE;
}
if (flags&MSG_PROBE)
-- 
1.5.1.rc3.30.ga8f4-dirty

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] [NET] MTU discovery check in ip6_fragment()

Adds a check in ip6_fragment() mirroring ip_fragment() for packets
that we can't fragment, and sends an ICMP Packet Too Big message
in response.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 net/ipv6/ip6_output.c |   13 +
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 4cfdad4..5a5b7d4 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -567,6 +567,19 @@ static int ip6_fragment(struct sk_buff *skb, int 
(*output)(struct sk_buff *))
nexthdr = *prevhdr;
 
mtu = dst_mtu(&rt->u.dst);
+
+   /* We must not fragment if the socket is set to force MTU discovery
+* or if the skb it not generated by a local socket.  (This last
+* check should be redundant, but it's free.)
+*/
+   if (!np || np->pmtudisc >= IPV6_PMTUDISC_DO) {
+   skb->dev = skb->dst->dev;
+   icmpv6_send(skb, ICMPV6_PKT_TOOBIG, 0, mtu, skb->dev);
+   IP6_INC_STATS(ip6_dst_idev(skb->dst), IPSTATS_MIB_FRAGFAILS);
+   kfree_skb(skb);
+   return -EMSGSIZE;
+   }
+
if (np && np->frag_size < mtu) {
if (np->frag_size)
mtu = np->frag_size;
-- 
1.5.1.rc3.30.ga8f4-dirty

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] Revert "[NET] Add IP(V6)_PMTUDISC_RPOBE"

This reverts commit d21d2a90b879c0cf159df5944847e6d9833816eb.

Must be backed out because commit 87e927a0583bd4a8ba9e97cd75b58d8aa1c76e37
does not work.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 include/linux/in.h   |1 -
 include/linux/in6.h  |1 -
 include/linux/skbuff.h   |3 +--
 include/net/ip.h |2 +-
 net/core/skbuff.c|2 --
 net/ipv4/ip_output.c |   14 --
 net/ipv4/ip_sockglue.c   |2 +-
 net/ipv4/raw.c   |3 ---
 net/ipv6/ip6_output.c|   12 
 net/ipv6/ipv6_sockglue.c |2 +-
 net/ipv6/raw.c   |3 ---
 11 files changed, 12 insertions(+), 33 deletions(-)

diff --git a/include/linux/in.h b/include/linux/in.h
index 2dc1f8a..1912e7c 100644
--- a/include/linux/in.h
+++ b/include/linux/in.h
@@ -83,7 +83,6 @@ struct in_addr {
 #define IP_PMTUDISC_DONT   0   /* Never send DF frames */
 #define IP_PMTUDISC_WANT   1   /* Use per route hints  */
 #define IP_PMTUDISC_DO 2   /* Always DF*/
-#define IP_PMTUDISC_PROBE  3   /* Ignore dst pmtu  */
 
 #define IP_MULTICAST_IF32
 #define IP_MULTICAST_TTL   33
diff --git a/include/linux/in6.h b/include/linux/in6.h
index d559fac..4e8350a 100644
--- a/include/linux/in6.h
+++ b/include/linux/in6.h
@@ -179,7 +179,6 @@ struct in6_flowlabel_req
 #define IPV6_PMTUDISC_DONT 0
 #define IPV6_PMTUDISC_WANT 1
 #define IPV6_PMTUDISC_DO   2
-#define IPV6_PMTUDISC_PROBE3
 
 /* Flowlabel */
 #define IPV6_FLOWLABEL_MGR 32
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 8bf9b9f..7f17cfc 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -277,8 +277,7 @@ struct sk_buff {
nfctinfo:3;
__u8pkt_type:3,
fclone:2,
-   ipvs_property:1,
-   ign_dst_mtu:1;
+   ipvs_property:1;
__be16  protocol;
 
void(*destructor)(struct sk_buff *skb);
diff --git a/include/net/ip.h b/include/net/ip.h
index 6a08b65..75f226d 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -206,7 +206,7 @@ int ip_decrease_ttl(struct iphdr *iph)
 static inline
 int ip_dont_fragment(struct sock *sk, struct dst_entry *dst)
 {
-   return (inet_sk(sk)->pmtudisc >= IP_PMTUDISC_DO ||
+   return (inet_sk(sk)->pmtudisc == IP_PMTUDISC_DO ||
(inet_sk(sk)->pmtudisc == IP_PMTUDISC_WANT &&
 !(dst_metric(dst, RTAX_LOCK)&(1<destructor = NULL;
C(mark);
@@ -543,7 +542,6 @@ static void copy_skb_header(struct sk_buff *new, const 
struct sk_buff *old)
 #if defined(CONFIG_IP_VS) || defined(CONFIG_IP_VS_MODULE)
new->ipvs_property = old->ipvs_property;
 #endif
-   new->ign_dst_mtu= old->ign_dst_mtu;
 #ifdef CONFIG_NET_SCHED
 #ifdef CONFIG_NET_CLS_ACT
new->tc_verd = old->tc_verd;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 704bc44..79e71ee 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -198,8 +198,7 @@ static inline int ip_finish_output(struct sk_buff *skb)
return dst_output(skb);
}
 #endif
-   if (skb->len > dst_mtu(skb->dst) &&
-   !skb->ign_dst_mtu && !skb_is_gso(skb))
+   if (skb->len > dst_mtu(skb->dst) && !skb_is_gso(skb))
return ip_fragment(skb, ip_finish_output2);
else
return ip_finish_output2(skb);
@@ -788,9 +787,7 @@ int ip_append_data(struct sock *sk,
inet->cork.addr = ipc->addr;
}
dst_hold(&rt->u.dst);
-   inet->cork.fragsize = mtu = inet->pmtudisc == IP_PMTUDISC_PROBE 
?
-   rt->u.dst.dev->mtu :
-   dst_mtu(rt->u.dst.path);
+   inet->cork.fragsize = mtu = dst_mtu(rt->u.dst.path);
inet->cork.rt = rt;
inet->cork.length = 0;
sk->sk_sndmsg_page = NULL;
@@ -1208,16 +1205,13 @@ int ip_push_pending_frames(struct sock *sk)
 * to fragment the frame generated here. No matter, what transforms
 * how transforms change size of the packet, it will come out.
 */
-   if (inet->pmtudisc < IP_PMTUDISC_DO)
+   if (inet->pmtudisc != IP_PMTUDISC_DO)
skb->local_df = 1;
 
-   if (inet->pmtudisc == IP_PMTUDISC_PROBE)
-   skb->ign_dst_mtu = 1;
-
/* DF bit is set when we want to see DF on outgoing frames.
 * If local_df is set too, we still allow to fragment this frame

[PATCH] [NET] Add IP(V6)_PMTUDISC_RPOBE

Add IP(V6)_PMTUDISC_PROBE value for IP(V6)_MTU_DISCOVER.  This option forces
us not to fragment, but does not make use of the kernel path MTU discovery.
That is, it allows for user-mode MTU probing (or, packetization-layer path
MTU discovery).  This is particularly useful for diagnostic utilities, like
traceroute/tracepath.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 include/linux/in.h   |1 +
 include/linux/in6.h  |1 +
 net/ipv4/ip_output.c |   20 +++-
 net/ipv4/ip_sockglue.c   |2 +-
 net/ipv6/ip6_output.c|   15 ---
 net/ipv6/ipv6_sockglue.c |2 +-
 6 files changed, 31 insertions(+), 10 deletions(-)

diff --git a/include/linux/in.h b/include/linux/in.h
index 1912e7c..3975cbf 100644
--- a/include/linux/in.h
+++ b/include/linux/in.h
@@ -83,6 +83,7 @@ struct in_addr {
 #define IP_PMTUDISC_DONT   0   /* Never send DF frames */
 #define IP_PMTUDISC_WANT   1   /* Use per route hints  */
 #define IP_PMTUDISC_DO 2   /* Always DF*/
+#define IP_PMTUDISC_PROBE  3   /* Ignore dst pmtu  */
 
 #define IP_MULTICAST_IF32
 #define IP_MULTICAST_TTL   33
diff --git a/include/linux/in6.h b/include/linux/in6.h
index 4e8350a..d559fac 100644
--- a/include/linux/in6.h
+++ b/include/linux/in6.h
@@ -179,6 +179,7 @@ struct in6_flowlabel_req
 #define IPV6_PMTUDISC_DONT 0
 #define IPV6_PMTUDISC_WANT 1
 #define IPV6_PMTUDISC_DO   2
+#define IPV6_PMTUDISC_PROBE3
 
 /* Flowlabel */
 #define IPV6_FLOWLABEL_MGR 32
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 34606ef..66e2c3a 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -189,6 +189,14 @@ static inline int ip_finish_output2(struct sk_buff *skb)
return -EINVAL;
 }
 
+static inline int ip_skb_dst_mtu(struct sk_buff *skb)
+{
+   struct inet_sock *inet = skb->sk ? inet_sk(skb->sk) : NULL;
+
+   return (inet && inet->pmtudisc == IP_PMTUDISC_PROBE) ?
+  skb->dst->dev->mtu : dst_mtu(skb->dst);
+}
+
 static inline int ip_finish_output(struct sk_buff *skb)
 {
 #if defined(CONFIG_NETFILTER) && defined(CONFIG_XFRM)
@@ -198,7 +206,7 @@ static inline int ip_finish_output(struct sk_buff *skb)
return dst_output(skb);
}
 #endif
-   if (skb->len > dst_mtu(skb->dst) && !skb_is_gso(skb))
+   if (skb->len > ip_skb_dst_mtu(skb) && !skb_is_gso(skb))
return ip_fragment(skb, ip_finish_output2);
else
return ip_finish_output2(skb);
@@ -422,7 +430,7 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct 
sk_buff*))
if (unlikely((iph->frag_off & htons(IP_DF)) && !skb->local_df)) {
IP_INC_STATS(IPSTATS_MIB_FRAGFAILS);
icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
- htonl(dst_mtu(&rt->u.dst)));
+ htonl(ip_skb_dst_mtu(skb)));
kfree_skb(skb);
return -EMSGSIZE;
}
@@ -787,7 +795,9 @@ int ip_append_data(struct sock *sk,
inet->cork.addr = ipc->addr;
}
dst_hold(&rt->u.dst);
-   inet->cork.fragsize = mtu = dst_mtu(rt->u.dst.path);
+   inet->cork.fragsize = mtu = inet->pmtudisc == IP_PMTUDISC_PROBE 
?
+   rt->u.dst.dev->mtu : 
+   dst_mtu(rt->u.dst.path);
inet->cork.rt = rt;
inet->cork.length = 0;
sk->sk_sndmsg_page = NULL;
@@ -1203,13 +1213,13 @@ int ip_push_pending_frames(struct sock *sk)
 * to fragment the frame generated here. No matter, what transforms
 * how transforms change size of the packet, it will come out.
 */
-   if (inet->pmtudisc != IP_PMTUDISC_DO)
+   if (inet->pmtudisc < IP_PMTUDISC_DO)
skb->local_df = 1;
 
/* DF bit is set when we want to see DF on outgoing frames.
 * If local_df is set too, we still allow to fragment this frame
 * locally. */
-   if (inet->pmtudisc == IP_PMTUDISC_DO ||
+   if (inet->pmtudisc >= IP_PMTUDISC_DO ||
(skb->len <= dst_mtu(&rt->u.dst) &&
 ip_dont_fragment(sk, &rt->u.dst)))
df = htons(IP_DF);
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index c199d23..4d54457 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -542,7 +542,7 @@ static int do_ip_setsockopt(struct sock *sk, int level,
inet->hdrincl = val ? 1 : 0;
break;
case IP_MTU_DISCOVER:
-   if (val<0 || val>2)
+

[PATCH 2/4] Revert "[NET] Do pmtu check in transport layer"

This reverts commit 87e927a0583bd4a8ba9e97cd75b58d8aa1c76e37.

This idea does not work, as pointed at by Patrick McHardy.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 net/ipv4/ip_output.c  |4 +---
 net/ipv4/raw.c|8 +++-
 net/ipv6/ip6_output.c |   11 +--
 net/ipv6/raw.c|7 ++-
 4 files changed, 11 insertions(+), 19 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 79e71ee..34606ef 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -810,9 +810,7 @@ int ip_append_data(struct sock *sk,
fragheaderlen = sizeof(struct iphdr) + (opt ? opt->optlen : 0);
maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen;
 
-   if (inet->cork.length + length > 0x - fragheaderlen ||
-   (inet->pmtudisc >= IP_PMTUDISC_DO &&
-inet->cork.length + length > mtu)) {
+   if (inet->cork.length + length > 0x - fragheaderlen) {
ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->dport, 
mtu-exthdrlen);
return -EMSGSIZE;
}
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index c60aadf..24d7c9f 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -271,12 +271,10 @@ static int raw_send_hdrinc(struct sock *sk, void *from, 
size_t length,
struct iphdr *iph;
struct sk_buff *skb;
int err;
-   int mtu;
 
-   mtu = inet->pmtudisc == IP_PMTUDISC_DO ? dst_mtu(&rt->u.dst) :
-rt->u.dst.dev->mtu;
-   if (length > mtu) {
-   ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->dport, mtu);
+   if (length > rt->u.dst.dev->mtu) {
+   ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->dport,
+  rt->u.dst.dev->mtu);
return -EMSGSIZE;
}
if (flags&MSG_PROBE)
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index b8e307a..4cfdad4 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1079,12 +1079,11 @@ int ip6_append_data(struct sock *sk, int getfrag(void 
*from, char *to,
fragheaderlen = sizeof(struct ipv6hdr) + rt->u.dst.nfheader_len + (opt 
? opt->opt_nflen : 0);
maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen - 
sizeof(struct frag_hdr);
 
-   if ((mtu <= sizeof(struct ipv6hdr) + IPV6_MAXPLEN &&
-inet->cork.length + length > sizeof(struct ipv6hdr) + IPV6_MAXPLEN 
- fragheaderlen) ||
-   (np->pmtudisc >= IPV6_PMTUDISC_DO &&
-inet->cork.length + length > mtu)) {
-   ipv6_local_error(sk, EMSGSIZE, fl, mtu-exthdrlen);
-   return -EMSGSIZE;
+   if (mtu <= sizeof(struct ipv6hdr) + IPV6_MAXPLEN) {
+   if (inet->cork.length + length > sizeof(struct ipv6hdr) + 
IPV6_MAXPLEN - fragheaderlen) {
+   ipv6_local_error(sk, EMSGSIZE, fl, mtu-exthdrlen);
+   return -EMSGSIZE;
+   }
}
 
/*
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index f4cd90b..f65fcd7 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -558,12 +558,9 @@ static int rawv6_send_hdrinc(struct sock *sk, void *from, 
int length,
struct sk_buff *skb;
unsigned int hh_len;
int err;
-   int mtu;
 
-   mtu = np->pmtudisc == IPV6_PMTUDISC_DO ? dst_mtu(&rt->u.dst) :
-rt->u.dst.dev->mtu;
-   if (length > mtu) {
-   ipv6_local_error(sk, EMSGSIZE, fl, mtu);
+   if (length > rt->u.dst.dev->mtu) {
+   ipv6_local_error(sk, EMSGSIZE, fl, rt->u.dst.dev->mtu);
return -EMSGSIZE;
}
if (flags&MSG_PROBE)
-- 
1.5.1.rc3.30.ga8f4-dirty

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 1/4] Revert "[NET] Add IP(V6)_PMTUDISC_RPOBE"

This reverts commit d21d2a90b879c0cf159df5944847e6d9833816eb.

Must be backed out because commit 87e927a0583bd4a8ba9e97cd75b58d8aa1c76e37
does not work.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 include/linux/in.h   |1 -
 include/linux/in6.h  |1 -
 include/linux/skbuff.h   |3 +--
 include/net/ip.h |2 +-
 net/core/skbuff.c|2 --
 net/ipv4/ip_output.c |   14 --
 net/ipv4/ip_sockglue.c   |2 +-
 net/ipv4/raw.c   |3 ---
 net/ipv6/ip6_output.c|   12 
 net/ipv6/ipv6_sockglue.c |2 +-
 net/ipv6/raw.c   |3 ---
 11 files changed, 12 insertions(+), 33 deletions(-)

diff --git a/include/linux/in.h b/include/linux/in.h
index 2dc1f8a..1912e7c 100644
--- a/include/linux/in.h
+++ b/include/linux/in.h
@@ -83,7 +83,6 @@ struct in_addr {
 #define IP_PMTUDISC_DONT   0   /* Never send DF frames */
 #define IP_PMTUDISC_WANT   1   /* Use per route hints  */
 #define IP_PMTUDISC_DO 2   /* Always DF*/
-#define IP_PMTUDISC_PROBE  3   /* Ignore dst pmtu  */
 
 #define IP_MULTICAST_IF32
 #define IP_MULTICAST_TTL   33
diff --git a/include/linux/in6.h b/include/linux/in6.h
index d559fac..4e8350a 100644
--- a/include/linux/in6.h
+++ b/include/linux/in6.h
@@ -179,7 +179,6 @@ struct in6_flowlabel_req
 #define IPV6_PMTUDISC_DONT 0
 #define IPV6_PMTUDISC_WANT 1
 #define IPV6_PMTUDISC_DO   2
-#define IPV6_PMTUDISC_PROBE3
 
 /* Flowlabel */
 #define IPV6_FLOWLABEL_MGR 32
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 8bf9b9f..7f17cfc 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -277,8 +277,7 @@ struct sk_buff {
nfctinfo:3;
__u8pkt_type:3,
fclone:2,
-   ipvs_property:1,
-   ign_dst_mtu:1;
+   ipvs_property:1;
__be16  protocol;
 
void(*destructor)(struct sk_buff *skb);
diff --git a/include/net/ip.h b/include/net/ip.h
index 6a08b65..75f226d 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -206,7 +206,7 @@ int ip_decrease_ttl(struct iphdr *iph)
 static inline
 int ip_dont_fragment(struct sock *sk, struct dst_entry *dst)
 {
-   return (inet_sk(sk)->pmtudisc >= IP_PMTUDISC_DO ||
+   return (inet_sk(sk)->pmtudisc == IP_PMTUDISC_DO ||
(inet_sk(sk)->pmtudisc == IP_PMTUDISC_WANT &&
 !(dst_metric(dst, RTAX_LOCK)&(1<destructor = NULL;
C(mark);
@@ -543,7 +542,6 @@ static void copy_skb_header(struct sk_buff *new, const 
struct sk_buff *old)
 #if defined(CONFIG_IP_VS) || defined(CONFIG_IP_VS_MODULE)
new->ipvs_property = old->ipvs_property;
 #endif
-   new->ign_dst_mtu= old->ign_dst_mtu;
 #ifdef CONFIG_NET_SCHED
 #ifdef CONFIG_NET_CLS_ACT
new->tc_verd = old->tc_verd;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 704bc44..79e71ee 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -198,8 +198,7 @@ static inline int ip_finish_output(struct sk_buff *skb)
return dst_output(skb);
}
 #endif
-   if (skb->len > dst_mtu(skb->dst) &&
-   !skb->ign_dst_mtu && !skb_is_gso(skb))
+   if (skb->len > dst_mtu(skb->dst) && !skb_is_gso(skb))
return ip_fragment(skb, ip_finish_output2);
else
return ip_finish_output2(skb);
@@ -788,9 +787,7 @@ int ip_append_data(struct sock *sk,
inet->cork.addr = ipc->addr;
}
dst_hold(&rt->u.dst);
-   inet->cork.fragsize = mtu = inet->pmtudisc == IP_PMTUDISC_PROBE 
?
-   rt->u.dst.dev->mtu :
-   dst_mtu(rt->u.dst.path);
+   inet->cork.fragsize = mtu = dst_mtu(rt->u.dst.path);
inet->cork.rt = rt;
inet->cork.length = 0;
sk->sk_sndmsg_page = NULL;
@@ -1208,16 +1205,13 @@ int ip_push_pending_frames(struct sock *sk)
 * to fragment the frame generated here. No matter, what transforms
 * how transforms change size of the packet, it will come out.
 */
-   if (inet->pmtudisc < IP_PMTUDISC_DO)
+   if (inet->pmtudisc != IP_PMTUDISC_DO)
skb->local_df = 1;
 
-   if (inet->pmtudisc == IP_PMTUDISC_PROBE)
-   skb->ign_dst_mtu = 1;
-
/* DF bit is set when we want to see DF on outgoing frames.
 * If local_df is set too, we still allow to fragment this frame

[PATCH 4/4] [NET] Add IP(V6)_PMTUDISC_RPOBE

Add IP(V6)_PMTUDISC_PROBE value for IP(V6)_MTU_DISCOVER.  This option forces
us not to fragment, but does not make use of the kernel path MTU discovery.
That is, it allows for user-mode MTU probing (or, packetization-layer path
MTU discovery).  This is particularly useful for diagnostic utilities, like
traceroute/tracepath.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 include/linux/in.h   |1 +
 include/linux/in6.h  |1 +
 net/ipv4/ip_output.c |   20 +++-
 net/ipv4/ip_sockglue.c   |2 +-
 net/ipv6/ip6_output.c|   15 ---
 net/ipv6/ipv6_sockglue.c |2 +-
 6 files changed, 31 insertions(+), 10 deletions(-)

diff --git a/include/linux/in.h b/include/linux/in.h
index 1912e7c..3975cbf 100644
--- a/include/linux/in.h
+++ b/include/linux/in.h
@@ -83,6 +83,7 @@ struct in_addr {
 #define IP_PMTUDISC_DONT   0   /* Never send DF frames */
 #define IP_PMTUDISC_WANT   1   /* Use per route hints  */
 #define IP_PMTUDISC_DO 2   /* Always DF*/
+#define IP_PMTUDISC_PROBE  3   /* Ignore dst pmtu  */
 
 #define IP_MULTICAST_IF32
 #define IP_MULTICAST_TTL   33
diff --git a/include/linux/in6.h b/include/linux/in6.h
index 4e8350a..d559fac 100644
--- a/include/linux/in6.h
+++ b/include/linux/in6.h
@@ -179,6 +179,7 @@ struct in6_flowlabel_req
 #define IPV6_PMTUDISC_DONT 0
 #define IPV6_PMTUDISC_WANT 1
 #define IPV6_PMTUDISC_DO   2
+#define IPV6_PMTUDISC_PROBE3
 
 /* Flowlabel */
 #define IPV6_FLOWLABEL_MGR 32
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 34606ef..66e2c3a 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -189,6 +189,14 @@ static inline int ip_finish_output2(struct sk_buff *skb)
return -EINVAL;
 }
 
+static inline int ip_skb_dst_mtu(struct sk_buff *skb)
+{
+   struct inet_sock *inet = skb->sk ? inet_sk(skb->sk) : NULL;
+
+   return (inet && inet->pmtudisc == IP_PMTUDISC_PROBE) ?
+  skb->dst->dev->mtu : dst_mtu(skb->dst);
+}
+
 static inline int ip_finish_output(struct sk_buff *skb)
 {
 #if defined(CONFIG_NETFILTER) && defined(CONFIG_XFRM)
@@ -198,7 +206,7 @@ static inline int ip_finish_output(struct sk_buff *skb)
return dst_output(skb);
}
 #endif
-   if (skb->len > dst_mtu(skb->dst) && !skb_is_gso(skb))
+   if (skb->len > ip_skb_dst_mtu(skb) && !skb_is_gso(skb))
return ip_fragment(skb, ip_finish_output2);
else
return ip_finish_output2(skb);
@@ -422,7 +430,7 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct 
sk_buff*))
if (unlikely((iph->frag_off & htons(IP_DF)) && !skb->local_df)) {
IP_INC_STATS(IPSTATS_MIB_FRAGFAILS);
icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
- htonl(dst_mtu(&rt->u.dst)));
+ htonl(ip_skb_dst_mtu(skb)));
kfree_skb(skb);
return -EMSGSIZE;
}
@@ -787,7 +795,9 @@ int ip_append_data(struct sock *sk,
inet->cork.addr = ipc->addr;
}
dst_hold(&rt->u.dst);
-   inet->cork.fragsize = mtu = dst_mtu(rt->u.dst.path);
+   inet->cork.fragsize = mtu = inet->pmtudisc == IP_PMTUDISC_PROBE 
?
+   rt->u.dst.dev->mtu : 
+   dst_mtu(rt->u.dst.path);
inet->cork.rt = rt;
inet->cork.length = 0;
sk->sk_sndmsg_page = NULL;
@@ -1203,13 +1213,13 @@ int ip_push_pending_frames(struct sock *sk)
 * to fragment the frame generated here. No matter, what transforms
 * how transforms change size of the packet, it will come out.
 */
-   if (inet->pmtudisc != IP_PMTUDISC_DO)
+   if (inet->pmtudisc < IP_PMTUDISC_DO)
skb->local_df = 1;
 
/* DF bit is set when we want to see DF on outgoing frames.
 * If local_df is set too, we still allow to fragment this frame
 * locally. */
-   if (inet->pmtudisc == IP_PMTUDISC_DO ||
+   if (inet->pmtudisc >= IP_PMTUDISC_DO ||
(skb->len <= dst_mtu(&rt->u.dst) &&
 ip_dont_fragment(sk, &rt->u.dst)))
df = htons(IP_DF);
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index c199d23..4d54457 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -542,7 +542,7 @@ static int do_ip_setsockopt(struct sock *sk, int level,
inet->hdrincl = val ? 1 : 0;
break;
case IP_MTU_DISCOVER:
-   if (val<0 || val>2)
+

[PATCH 3/4] [NET] MTU discovery check in ip6_fragment()