Re: Intel 82559 NIC corrupted EEPROM

2006-11-15 Thread John

John wrote:


-0009 : System RAM
000a-000b : Video RAM area
000f-000f : System ROM
0010-0ffe : System RAM
  0010-00296a1a : Kernel code
  00296a1b-0031bbe7 : Kernel data
0fff-0fff2fff : ACPI Non-volatile Storage
0fff3000-0fff : ACPI Tables
2000-200f : :00:08.0
2010-201f : :00:09.0
2020-202f : :00:0a.0
e000-e3ff : :00:00.0
e500-e50f : :00:08.0
e510-e51f : :00:09.0
e520-e52f : :00:0a.0
e530-e5300fff : :00:08.0
e5301000-e5301fff : :00:0a.0
e5302000-e5302fff : :00:09.0
- : reserved

I've also attached:

o config-2.6.18.1-adlink used to compile this kernel
o dmesg output after the machine boots


I suppose the information I've sent is not enough to locate the
root of the problem. Is there more I can provide?
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Intel 82559 NIC corrupted EEPROM

2006-11-27 Thread John

John wrote:


-0009 : System RAM
000a-000b : Video RAM area
000f-000f : System ROM
0010-0ffe : System RAM
  0010-00296a1a : Kernel code
  00296a1b-0031bbe7 : Kernel data
0fff-0fff2fff : ACPI Non-volatile Storage
0fff3000-0fff : ACPI Tables
2000-200f : :00:08.0
2010-201f : :00:09.0
2020-202f : :00:0a.0
e000-e3ff : :00:00.0
e500-e50f : :00:08.0
e510-e51f : :00:09.0
e520-e52f : :00:0a.0
e530-e5300fff : :00:08.0
e5301000-e5301fff : :00:0a.0
e5302000-e5302fff : :00:09.0
- : reserved

I've also attached:

o config-2.6.18.1-adlink used to compile this kernel
o dmesg output after the machine boots


I suppose the information I've sent is not enough to locate the
root of the problem. Is there more I can provide?


Here is some context for those who have been added to the CC list:
http://groups.google.com/group/linux.kernel/browse_frm/thread/bdc8fd08fb601c26

As far as I understand, some consider the eepro100 driver to be 
obsolete, and it has been considered for removal.


What is the current status?

Unfortunately, e100 does not work out-of-the-box on this system.

Is there something I can do to improve the situation?

--
Regards,

John

[ E-mail address is a bit-bucket. I *do* monitor the mailing lists. ]



Re: Intel 82559 NIC corrupted EEPROM

2006-11-29 Thread John

Jesse Brandeburg wrote:


John wrote:


Here is some context for those who have been added to the CC list:
http://groups.google.com/group/linux.kernel/browse_frm/thread/bdc8fd08fb601c26

As far as I understand, some consider the eepro100 driver to be
obsolete, and it has been considered for removal.

What is the current status?

Unfortunately, e100 does not work out-of-the-box on this system.

Is there something I can do to improve the situation?


Let's go ahead and print the output from e100_eeprom_load; debug
patch attached.


Loading (then unloading) e100.ko fails the first few times (i.e. the 
driver claims one of the EEPROMs is corrupted). Thereafter, sometimes it 
fails, other times it works. Sounds like a race, no?


$ cat load_unload
: > /var/log/kern.log
insmod e100.ko debug=16
sleep 1
cp /var/log/kern.log insmod_$I.txt
ip link > ip_link_$I.txt
sleep 2
rmmod e100
let "I=I+1"

(cf. attached compressed archive)

FAILURE:
insmod_100.txt
insmod_101.txt
insmod_102.txt
insmod_105.txt
insmod_107.txt
insmod_108.txt
insmod_110.txt
insmod_111.txt
insmod_114.txt

SUCCESS:
insmod_103.txt
insmod_104.txt
insmod_106.txt
insmod_109.txt
insmod_112.txt
insmod_113.txt
insmod_115.txt
insmod_116.txt

On an unrelated note, insmod_100.txt is truncated at the beginning, and 
insmod_110.txt is truncated in the middle (!!) cf. line 14. What would 
cause klogd to behave like that?


Regards.


TEST-e100.tar.bz2
Description: Binary data


Realtek RTL8111B serious performance issues

2007-07-17 Thread john



Hi,

I originally sent this email to the linux-net list before realizing it
probably belonged on the netdev list.

I just subscribed to this list, so I apologize if this is a known issue.  I
did try looking through the archives, and did not see it there either.

We just put together a new "app server" based on a P35 chipset motherboard,
4 gigabytes of RAM, a Q6600 processor, and an integrated Realtek RTL8111B gigabit
NIC.  When we SSH or RSH into this machine and try to run any X application
(emacs, firefox), the application's graphics are drawn *extremely* slowly.
It can take 10 seconds from the time an emacs window pops up until it is
done drawing all of its icons.

Firefox is even worse.  Loading pages is painful.  The "spinning dots", in the
upper right-hand corner, never actually spin.  It takes a long time for a
page to be displayed, and when it is drawn, it appears all at once.  Scrolling a
page up/down is extremely jerky.

We are currently running kernel 2.6.22.1, but I have also tried going back
to 2.6.20.x without any change in behavior.

The NIC driver is loaded as:

kernel: eth0: RTL8168b/8111b at 0xc264, 00:1a:4d:43:db:d4, IRQ 17

I tried going to Realtek's site to see if there was a newer driver, but the
only driver there seems to be for older kernels.

I finally put an old Linksys 10/100 PCI NIC in the system, and that has
SOLVED the problem.  We would prefer using the integrated NIC, however.


04:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI 
Express Gigabit Ethernet controller (rev 01)
 Subsystem: Giga-byte Technology Unknown device e000
 Flags: bus master, fast devsel, latency 0, IRQ 17
 I/O ports at c000 [size=256]
 Memory at f800 (64-bit, non-prefetchable) [size=4K]
 [virtual] Expansion ROM at fb20 [disabled] [size=64K]
 Capabilities: [40] Power Management version 2
 Capabilities: [48] Vital Product Data
 Capabilities: [50] Message Signalled Interrupts: Mask- 64bit+ 
Queue=0/1 Enable-
 Capabilities: [60] Express Endpoint IRQ 0
 Capabilities: [84] Vendor Specific Information
 Capabilities: [100] Advanced Error Reporting
 Capabilities: [12c] Virtual Channel
 Capabilities: [148] Device Serial Number 68-81-ec-10-00-00-00-25
 Capabilities: [154] Power Budgeting

Anyone have any suggestions for solving this problem?

Thanks,

John


--

| |
+--+  ==  |  John Patrick Poet Blue Sky Tours
|  |  |  Director of Systems Development   10832 Prospect Ave., N.E.
| +---+  [EMAIL PROTECTED] Albuquerque, N.M. 87112
| |  Ph. 505 293 9462  Fx. 505 293 6902


Re: Realtek RTL8111B serious performance issues

2007-07-18 Thread john


On Wed, 18 Jul 2007, Francois Romieu wrote:


[EMAIL PROTECTED] <[EMAIL PROTECTED]> :
[...]

Anyone have any suggestions for solving this problem?


Try 2.6.23-rc1 when it is published or apply against 2.6.22 one of:
http://www.fr.zoreil.com/people/francois/misc/20070628-2.6.22-rc6-r8169-test.patch


Unfortunately, the 20070628 patch did not make any difference.



http://www.fr.zoreil.com/linux/kernel/2.6.x/2.6.22-rc6/r8169-20070628/



I tried various patches from that directory (aren't most or all of them
included in the 20070628 patch?), but none of them helped either.


This problem could be very difficult to track down.  Like I said, it
definitely affects how emacs and firefox are "drawn" on a remote computer.
Ping times, however, are not that bad:

PING 192.168.26.150: 56 data bytes
64 bytes from dyn26-1.blueskytours.com (192.168.26.150): icmp_seq=0. time=0.287 ms
64 bytes from dyn26-1.blueskytours.com (192.168.26.150): icmp_seq=1. time=0.279 ms
64 bytes from dyn26-1.blueskytours.com (192.168.26.150): icmp_seq=2. time=0.196 ms
64 bytes from dyn26-1.blueskytours.com (192.168.26.150): icmp_seq=3. time=0.201 ms
64 bytes from dyn26-1.blueskytours.com (192.168.26.150): icmp_seq=4. time=0.159 ms
64 bytes from dyn26-1.blueskytours.com (192.168.26.150): icmp_seq=5. time=0.148 ms
64 bytes from dyn26-1.blueskytours.com (192.168.26.150): icmp_seq=6. time=0.150 ms

Also, wget gets good throughput when retrieving files.

It just seems to be X traffic which is extremely slow.  Using the old
Linksys 10/100 PCI NIC, emacs comes up virtually instantaneously.  Using the
integrated Realtek 8111B, emacs takes 10 seconds to draw.

Thank you very much for trying to help.

John


Re: Realtek RTL8111B serious performance issues

2007-07-19 Thread john


On Thu, 19 Jul 2007, Bill Fink wrote:


Hi John,

On Wed, 18 Jul 2007, [EMAIL PROTECTED] wrote:


On Wed, 18 Jul 2007, Francois Romieu wrote:


[EMAIL PROTECTED] <[EMAIL PROTECTED]> :
[...]

Anyone have any suggestions for solving this problem?


Try 2.6.23-rc1 when it is published or apply against 2.6.22 one of:
http://www.fr.zoreil.com/people/francois/misc/20070628-2.6.22-rc6-r8169-test.patch


Unfortunately, the 20070628 patch did not make any difference.



http://www.fr.zoreil.com/linux/kernel/2.6.x/2.6.22-rc6/r8169-20070628/



I tried various patches from that directory (aren't most or all of them
included in the 20070628 patch?), but none of them helped either.


This problem could be very difficult to track down.  Like I said, it
definitely affects how emacs and firefox are "drawn" on a remote computer.
Ping times, however, are not that bad:

PING 192.168.26.150: 56 data bytes
64 bytes from dyn26-1.blueskytours.com (192.168.26.150): icmp_seq=0. time=0.287 ms
64 bytes from dyn26-1.blueskytours.com (192.168.26.150): icmp_seq=1. time=0.279 ms
64 bytes from dyn26-1.blueskytours.com (192.168.26.150): icmp_seq=2. time=0.196 ms
64 bytes from dyn26-1.blueskytours.com (192.168.26.150): icmp_seq=3. time=0.201 ms
64 bytes from dyn26-1.blueskytours.com (192.168.26.150): icmp_seq=4. time=0.159 ms
64 bytes from dyn26-1.blueskytours.com (192.168.26.150): icmp_seq=5. time=0.148 ms
64 bytes from dyn26-1.blueskytours.com (192.168.26.150): icmp_seq=6. time=0.150 ms

Also, wget gets good throughput when retrieving files.

It just seems to be X traffic which is extremely slow.  Using the old
Linksys 10/100 PCI NIC, emacs comes up virtually instantaneously.  Using the
integrated Realtek 8111B, emacs takes 10 seconds to draw.

Thank you very much for trying to help.


Any chance that the Realtek 8111B is sharing interrupts with another
device ("cat /proc/interrupts")?  Perhaps it is, and the Linksys isn't,
which could explain the difference in behavior.  Just something simple
to check and either rule in or out.



Yes it was, however "fixing" that did not solve the problem.

Thanks for the thought.

John

P.S. I did send the pcap files to Francois Romieu, but I did not CC the list
because they were large.


bug in tcp/ip stack

2007-07-22 Thread john

I tracked down something that appears to be a small bug in networking code.
The way in which I can reproduce it is a complex one, but it works 100%, so
here are the details:


I noticed strange packets on my firewall, coming from the mail server with
the RST/ACK flags set, from a source port that nothing is listening on and
to which no connection attempts were made from outside.
There are a few messages on forums describing the same problem and calling
them "alien" ACK/RST packets.


The Postfix mail server exhibits this behavior if, for some reason, the
client resets the connection but some packets from the client arrive after
the RST: the server box responds with an RST and then with an RST/ACK (with
the wrong source port number).


Here is the packet dump:

1  0.000000  10.0.0.254  10.0.0.68   TCP   5 > smtp [SYN] Seq=0 Len=0
2  0.001036  10.0.0.68   10.0.0.254  TCP   smtp > 5 [SYN, ACK] Seq=0 Ack=1 Win=5840 Len=0 MSS=1460
3  0.001096  10.0.0.254  10.0.0.68   TCP   5 > smtp [ACK] Seq=1 Ack=1 Win=1500 Len=0
4  0.001125  10.0.0.254  10.0.0.68   SMTP  Command: EHLO localhost
5  0.001150  10.0.0.254  10.0.0.68   TCP   5 > smtp [RST] Seq=17 Len=0
6  0.001175  10.0.0.254  10.0.0.68   TCP   5 > smtp [FIN, ACK] Seq=17 Ack=1 Win=1500 Len=0
7  0.001251  10.0.0.68   10.0.0.254  TCP   smtp > 5 [ACK] Seq=1 Ack=17 Win=5840 Len=0
8  0.001284  10.0.0.68   10.0.0.254  TCP   smtp > 5 [RST] Seq=1 Len=0
!!! 9  0.218427  10.0.0.68   10.0.0.254  TCP   32768 > 5 [RST, ACK] Seq=0 Ack=0 Win=5840 Len=0


It is not a Postfix bug; it is present in current 2.6.x and 2.4.x
kernel versions but not in the 2.2.x tree. After investigation, I
found it was introduced in 2.4.0-test9-pre3 back in the year 2000 and
has survived for 7 years. WOW :)



The whole 2.4.0-test9-pre3 diff is pretty big, but I managed to find the
lines responsible for this.

they are located in include/net/tcp.h

in function tcp_enter_cwr

if (sk->prev && !(sk->userlocks&SOCK_BINDPORT_LOCK))
 tcp_put_port(sk);



It is not a big problem, but under some setups the firewall's conntrack
table can get filled pretty quickly, because the wrong port number changes
every time.


Can you please check this out?

Evalds


small bug in tcp

2007-07-28 Thread john
When an application closes a socket with unread data in its receive buffer,
the TCP stack sends an RST packet from the wrong source port, not the
source port of the socket being closed.


This is the same problem that was described in my first post, which
unfortunately nobody cared to look into.


This problem appeared in 2.4.0-test9-pre3 and is still present in the kernel.


strange tcp behavior

2007-08-01 Thread john
1186035057.207629  127.0.0.1 -> 127.0.0.1  TCP   5 > smtp [SYN] Seq=0 Len=0
1186035057.207632  127.0.0.1 -> 127.0.0.1  TCP   smtp > 5 [SYN, ACK] Seq=0 Ack=1 Win=32792 Len=0 MSS=16396
1186035057.207666  127.0.0.1 -> 127.0.0.1  TCP   5 > smtp [ACK] Seq=1 Ack=1 Win=1500 Len=0
1186035057.207699  127.0.0.1 -> 127.0.0.1  SMTP  Command: EHLO localhost
1186035057.207718  127.0.0.1 -> 127.0.0.1  TCP   smtp > 5 [ACK] Seq=1 Ack=17 Win=32792 Len=0
1186035057.207736  127.0.0.1 -> 127.0.0.1  TCP   5 > smtp [RST] Seq=17 Len=0
1186035057.223934  127.0.0.1 -> 127.0.0.1  TCP   33787 > 5 [RST, ACK] Seq=0 Ack=0 Win=32792 Len=0



Can someone please comment on why the TCP stack sends an RST packet from
the wrong source port in this situation.

This is the same problem that was described in my first two posts, which 
unfortunately nobody seemed to notice.

Here is source code which can reproduce the behavior described; the client
side code is a complete mess, but it works.

Server:

#include <string.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>

void main(void) {
int ms;
int ss;
struct sockaddr_in sa;
char *str = "HELLO FRIEND";
struct pollfd fd;
int flags;

ms = socket(PF_INET, SOCK_STREAM, IPPROTO_TCP);
flags = fcntl(ms, F_GETFL, 0);
fcntl(ms, F_SETFL, flags | O_NONBLOCK);

memset(&sa, 0, sizeof(sa));
sa.sin_family = AF_INET;
sa.sin_addr.s_addr = htonl(INADDR_ANY);
sa.sin_port = htons(25);

bind(ms, (struct sockaddr *) &sa, sizeof(sa));

listen(ms, 0);

fd.fd = ms;
fd.events = POLLIN;

while(poll(&fd, 1, -1)) {
ss = accept(ms, NULL, NULL);

usleep(1);
send(ss, str, strlen(str), MSG_NOSIGNAL);
close(ss);

memset(&fd, 0, sizeof(fd));
fd.fd = ms;
fd.events = POLLIN;
}
}
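The trigger described above (closing with unread data pending) can also be exercised without raw sockets. The following is my own sketch for illustration, not part of the original report: a loopback TCP pair where one end closes while data sits unread in its receive buffer, which makes the kernel emit an RST so that further sends from the peer fail.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/* Close a TCP socket that still has unread data in its receive buffer;
 * the kernel then emits an RST, so further sends from the peer fail.
 * Returns 1 if the reset was observed. */
static int rst_demo(void)
{
    int ls = socket(AF_INET, SOCK_STREAM, 0);
    int cs = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in sa;
    socklen_t salen = sizeof(sa);
    int as, reset = 0;

    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    bind(ls, (struct sockaddr *)&sa, sizeof(sa));   /* any free port */
    getsockname(ls, (struct sockaddr *)&sa, &salen);
    listen(ls, 1);
    connect(cs, (struct sockaddr *)&sa, sizeof(sa));
    as = accept(ls, NULL, NULL);

    send(cs, "unread", 6, MSG_NOSIGNAL);  /* data the peer never reads */
    usleep(50000);
    close(as);              /* unread data pending: kernel sends RST */
    usleep(50000);

    if (send(cs, "x", 1, MSG_NOSIGNAL) < 0) {
        reset = 1;          /* ECONNRESET: the RST arrived */
    } else {
        usleep(50000);
        if (send(cs, "x", 1, MSG_NOSIGNAL) < 0)
            reset = 1;      /* fails on the retry at the latest */
    }
    close(cs);
    close(ls);
    return reset;
}
```

A packet capture on lo while this runs should show the RST (and, per the reports in this thread, the bogus follow-up RST/ACK on affected kernels).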

Client:


#include <string.h>
#include <strings.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/time.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <arpa/inet.h>
#include <linux/if_ether.h>

struct sockaddr_in localaddr;
struct sockaddr_in remoteaddr;

struct sockaddr rawaddr;

int sdl, sdr;

struct tcphdr header;

struct pheader_t {
uint32_t saddr;
uint32_t daddr;
uint8_t r;
uint8_t protocol;
uint16_t length;
};

struct pheader_t pheader;

unsigned short tbuf[2048];
unsigned char buf[2048];

char *msg = "EHLO localhost\r\n";

unsigned char *p;

char *src_addr = "127.0.0.1";
char *dst_addr = "127.0.0.1";

unsigned short sprt = 5;
unsigned short dprt = 25;


struct timeval tv;

unsigned seq, ack_seq;

int data;

void mysend(void) {
int i, sum;
int len;

if(data) {
len = strlen(msg);
memcpy((char *) tbuf + sizeof(pheader) + sizeof(header),
msg, len);
} else
len = 0;

bzero(&pheader, sizeof(pheader));
pheader.saddr = (in_addr_t) inet_addr(src_addr);
pheader.daddr = (in_addr_t) inet_addr(dst_addr);
pheader.protocol = 6;
pheader.length = htons(sizeof(header) + len);

memcpy(tbuf, &pheader, sizeof(pheader));
memcpy((char *) tbuf + sizeof(pheader), &header, sizeof(header));



sum = 0;

for(i = 0; i < (sizeof(pheader) + sizeof(header)) / 2 + len / 2;
i++) {
sum += tbuf[i];
sum = (sum & 0x) + (sum >> 16);
}

header.check = ~sum;

memcpy((char *) tbuf + sizeof(pheader), &header, sizeof(header));

sendto(sdr,  (char *) tbuf + sizeof(pheader), sizeof(header) +
len, 0, (struct sockaddr *) &remoteaddr, sizeof(remoteaddr));
}


void main(void)
{
gettimeofday(&tv, NULL);
srand(tv.tv_sec & tv.tv_usec);

remoteaddr.sin_family = AF_INET;
remoteaddr.sin_addr.s_addr = (in_addr_t) inet_addr(dst_addr);


sdl = socket(PF_INET, SOCK_PACKET, htons(ETH_P_ALL));
strcpy(rawaddr.sa_data, "lo");
bind(sdl, (struct sockaddr *) &rawaddr, sizeof(rawaddr));

sdr = socket(AF_INET, SOCK_RAW, IPPROTO_TCP);


bzero(&header, sizeof(header));
header.source = htons(sprt);
header.dest = htons(dprt);

seq = rand();
ack_seq = 0;

header.seq = htonl(seq);
header.ack_seq = htonl(ack_seq);

header.doff = sizeof(header) / 4;

header.syn = 1;

header.window = htons(1500);

mysend();

while(1) {
recvfrom(sdl, buf, sizeof(buf), 0, NULL, NULL);
//  p = buf + (*buf & 0x0f) * 4;
p = (buf + 14) + (*(buf + 14) & 0x0f) * 4;
if(ntohs(((struct tcphdr *)p)->source) == dprt &&
ntohs(((struct tcphdr *)p)->dest) == sprt && ((struct
tcphdr *)p)->syn == 1 && ((struct tcphdr *)p)->ack == 1)
break;
}


bzero(&header, sizeof(header));
header.source = htons(sprt);
header.dest = htons(dpr
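For reference, the carry-folding loop in mysend() above computes the standard Internet checksum (a 16-bit one's-complement sum, complemented). A standalone, endianness-independent version, my own sketch rather than part of the original post, makes the algorithm easier to verify in isolation:

```c
#include <stddef.h>
#include <stdint.h>

/* Standard Internet checksum: sum the data as 16-bit big-endian words,
 * fold any carries back into the low 16 bits, then complement.
 * Returns the checksum in host order. */
static uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    if (len & 1)                      /* pad an odd trailing byte with zero */
        sum += (uint32_t)data[len - 1] << 8;
    while (sum >> 16)                 /* fold carries */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

For TCP this routine would be run over the pseudo-header plus the TCP header and payload, exactly as mysend() assembles them in tbuf.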

Re: r8169: slow samba performance

2007-08-27 Thread john


On Wed, 22 Aug 2007, Bruce Cole wrote:


Shane wrote:

On Wed, Aug 22, 2007 at 09:39:47AM -0700, Bruce Cole wrote:


Shane, join the crowd :)  Try the fix I just re-posted over here:



Bruce, gigabit speeds! Thanks for the pointer.  This fix works well for me,
though I just added the three or so lines in the else-if statement, as the
full patch was rejected against r8169-20070818.  I suppose I could've
merged the whole thing; if you need that tested, let me know, but this is
looking good.

Glad it works for you.  I'm not the maintainer, and also don't have adequate 
specs from Realtek to definitively explain why the NPQ bit apparently needs 
to be re-enabled when some but not all of the TX FIFO is dequeued.  It is 
documented as if it isn't cleared until the FIFO is empty.  So I assume an 
official patch will have to wait until Francois is back.



I have had abysmal performance trying to remotely run X apps via ssh on a
computer with a RTL8111 NIC.  Saw this message and decided to give this
patch a try --- success!  Much, much better.

Thanks,

John


Re: r8169: slow samba performance

2007-09-04 Thread john


On Mon, 3 Sep 2007, Francois Romieu wrote:


[EMAIL PROTECTED] <[EMAIL PROTECTED]> :
[...]

I have had abysmal performance trying to remotely run X apps via ssh on a
computer with a RTL8111 NIC.  Saw this message and decided to give this
patch a try --- success!  Much, much better.


Can you give a try to:

http://www.fr.zoreil.com/people/francois/misc/20070903-2.6.23-rc5-r8169-test.patch

or just patches #0001 + #0002 at:

http://www.fr.zoreil.com/linux/kernel/2.6.x/2.6.23-rc5/r8169-20070903/



20070903-2.6.23-rc5-r8169-test.patch applied against 2.6.23-rc5 works fine.
Performance is acceptable.

Would you like me to *just* try patches 1 & 2, to help narrow down anything?

Thanks,

John



Re: r8169: slow samba performance

2007-09-04 Thread john


On Tue, 4 Sep 2007, Francois Romieu wrote:


[EMAIL PROTECTED] <[EMAIL PROTECTED]> :
[...]

20070903-2.6.23-rc5-r8169-test.patch applied against 2.6.23-rc5 works fine.
Performance is acceptable.


Does "acceptable" mean that there is a noticeable difference when compared
to the patch based on a busy-waiting loop ?



Without this patch, latency in bringing up emacs, or in display of pages in
firefox is extremely high.  With the patch, latency is pretty much what I
see when using an old tulip based NIC.

Is there a specific test you wish me to try?



Would you like me to *just* try patches 1 & 2, to help narrow down anything?


I expect patch #2 alone to be enough to enhance the performance. If it gets
proven, the patch would be a good candidate for a quick merge upstream.



Okay, I will build another kernel with just #2 applied.


John




Re: Intel 82559 NIC corrupted EEPROM

2007-02-07 Thread John

Jesse Brandeburg wrote:


John wrote:


Jesse Brandeburg wrote:


can you try adding mdelay(100); in e100_eeprom_load before the for loop,
and then change the multiple udelay(4) to mdelay(1) in e100_eeprom_read


I applied the attached patch.

Loading the driver now takes around one minute :-)


ouch, but yep, that's what happens when you use "super extra delay"


I ran 'source load_unload' 25 times in a loop.

The first 12 times were successful. The last 13 times failed.
(cf. attached archive)

I noticed something very strange.

The number of words obviously in error (0xffff) returned by the EEPROM
on 00:09.0 is not constant.


That is very strange. I would think that maybe you have something else
on the bus with the e100 that may be hogging bus cycles, or you have
failing hardware (maybe a bad eeprom, or possibly a bad mac chip)


$ grep -c 0xffff insmod*
insmod_300.txt:0
insmod_301.txt:0
insmod_302.txt:0
insmod_303.txt:0
insmod_304.txt:0
insmod_305.txt:0
insmod_306.txt:0
insmod_307.txt:0
insmod_308.txt:0
insmod_309.txt:0
insmod_310.txt:0
insmod_311.txt:0
insmod_312.txt:1
insmod_313.txt:5
insmod_314.txt:24
insmod_315.txt:45
insmod_316.txt:243
insmod_317.txt:256
insmod_318.txt:256
insmod_319.txt:256
insmod_320.txt:256
insmod_321.txt:256
insmod_322.txt:256
insmod_323.txt:253
insmod_324.txt:240


this is even stranger, does it cycle back down (sine wave) to zero
again?  The delays did seem to work, at least sometimes.  This
indicates that something needs that extra delay to successfully read
the eeprom.  I might try changing all the udelay(4) to udelay(40) (a 10x
increase) and see if that gives you a happy medium of "most times the
driver loads without error"

John, this problem seems to be very specific to your hardware.  I know
that you have put in a lot of time debugging this, but I'm not sure
what we can do from here.  If this were a generic code problem more
people would be reporting the issue.

What would you like to do?  At this stage I would like e100 to work
better than it is, but I'm not sure what to do next.


Hello everyone,

I'm resurrecting this thread because it appears we'll need to support 
these motherboards for several months to come, yet Adrian Bunk has 
scheduled the removal of eepro100 in January 2007.


To recap, we have to support ~30 EBC-2000T motherboards.
http://www.adlinktech.com/PD/web/PD_detail.php?pid=213
These motherboards come with three on-board Intel 82559 NICs.

Last time I checked, i.e. two months ago, e100 did not correctly 
initialize all three NICs on these motherboards. Therefore, we've been 
using eepro100.


I will be testing the latest 2.6.20 kernel to see if the situation has 
changed, but I wanted to let you all know that there are still some 
eepro100 users out there, out of necessity.


Regards,

John



CLOCK_MONOTONIC datagram timestamps by the kernel

2007-02-28 Thread John

Hello,

I know it's possible to have Linux timestamp incoming datagrams as soon 
as they are received, and then to retrieve this timestamp later with an 
ioctl command or a recvmsg call.


As far as I understand, one can either do

  const int on = 1;
  setsockopt(sock, SOL_SOCKET, SO_TIMESTAMP, &on, sizeof on);

then use recvmsg()

or not set the SO_TIMESTAMP socket option and just call

  ioctl(sock, SIOCGSTAMP, &tv);

after each datagram has been received.

SIOCGSTAMP
Return a struct timeval with the receive timestamp of the last
packet passed to the user. This is useful for accurate round trip time
measurements. See setitimer(2) for a description of struct timeval.


As far as I understand, this timestamp is given by the CLOCK_REALTIME 
clock. However, I would like to obtain a timestamp given by the 
CLOCK_MONOTONIC clock.


Relevant parts of the code (I think):

net/core/dev.c

void net_enable_timestamp(void)
{
  atomic_inc(&netstamp_needed);
}

void __net_timestamp(struct sk_buff *skb)
{
  struct timeval tv;

  do_gettimeofday(&tv);
  skb_set_timestamp(skb, &tv);
}

static inline void net_timestamp(struct sk_buff *skb)
{
  if (atomic_read(&netstamp_needed))
__net_timestamp(skb);
  else {
skb->tstamp.off_sec = 0;
skb->tstamp.off_usec = 0;
  }
}

do_gettimeofday() just calls __get_realtime_clock_ts()

Would it be possible to replace do_gettimeofday() by ktime_get_ts() with 
the appropriate division by 1000 to convert the struct timespec back 
into a struct timeval?


void __net_timestamp(struct sk_buff *skb)
{
  struct timespec now;
  struct timeval tv;

  ktime_get_ts(&now);
  tv.tv_sec = now.tv_sec;
  tv.tv_usec = now.tv_nsec / 1000;
  skb_set_timestamp(skb, &tv);
}

How many apps / drivers would this break?

Is there perhaps a different way to achieve this?

Regards.



Re: CLOCK_MONOTONIC datagram timestamps by the kernel

2007-02-28 Thread John

John wrote:


I know it's possible to have Linux timestamp incoming datagrams as soon
as they are received, then for one to retrieve this timestamp later with
an ioctl command or a recvmsg call.


Has it ever been proposed to modify struct skb_timeval to hold 
nanosecond stamps instead of just microsecond stamps, and then make the 
improved precision somehow available to user space?


On a related note, the comment for skb_set_timestamp() states:

/**
 * skb_set_timestamp - set timestamp of a skb
 * @skb: skb to set stamp of
 * @stamp: pointer to struct timeval to get stamp from
 *
 * Timestamps are stored in the skb as offsets to a base timestamp.
 * This function converts a struct timeval to an offset and stores
 * it in the skb.
 */

But there is no mention of an offset in the code:

static inline void skb_set_timestamp(
  struct sk_buff *skb, const struct timeval *stamp)
{
  skb->tstamp.off_sec  = stamp->tv_sec;
  skb->tstamp.off_usec = stamp->tv_usec;
}

Likewise for skb_get_timestamp:

/**
 * skb_get_timestamp - get timestamp from a skb
 * @skb: skb to get stamp from
 * @stamp: pointer to struct timeval to store stamp in
 *
 * Timestamps are stored in the skb as offsets to a base timestamp.
 * This function converts the offset back to a struct timeval and stores
 * it in stamp.
 */

static inline void skb_get_timestamp(
  const struct sk_buff *skb, struct timeval *stamp)
{
  stamp->tv_sec  = skb->tstamp.off_sec;
  stamp->tv_usec = skb->tstamp.off_usec;
}

Are the comments related to code that has since been modified?

Regards.


Re: CLOCK_MONOTONIC datagram timestamps by the kernel

2007-02-28 Thread John

Eric Dumazet wrote:


John wrote:


I know it's possible to have Linux timestamp incoming datagrams as soon
as they are received, then for one to retrieve this timestamp later with
an ioctl command or a recvmsg call.

Has it ever been proposed to modify struct skb_timeval to hold
nanosecond stamps instead of just microsecond stamps? Then make the
improved precision somehow available to user space.


Most modern NICs are able to delay packet delivery, in order to reduce the
number of interrupts and benefit from better cache hits.


You are referring to NAPI interrupt mitigation, right?

AFAIU, it is possible to disable this feature.

I'm dealing with 200-4000 packets per second. I don't think I'd save 
much with interrupt mitigation. Please correct any misconception.


Then, the kernel is not real-time, and some delays can occur between the 
hardware interrupt and the very moment we timestamp the packet. If the CPU 
caches are cold, even the instruction fetches could easily add some us.


I've applied the real-time patch.
http://rt.wiki.kernel.org/index.php/Main_Page
This doesn't make Linux hard real-time, but the interrupt handlers can 
run with the highest priority (even kernel threads are preempted).


Enabling nanosecond stamps would be a lie to users, because the real 
accuracy is not nanoseconds, but on the order of 10 us (at least)


POSIX is moving to nanosecond interfaces.
http://www.opengroup.org/onlinepubs/009695399/functions/clock_settime.html

struct timeval and struct timespec take the same amount of space (64 bits).

If the hardware can indeed manage sub-microsecond accuracy, a struct 
timeval forces the kernel to discard valuable information.


If you depend on a < 50 us precision, then linux might be the wrong OS for 
your application. Or maybe you need a NIC that is able to provide a timestamp 
in the packet itself (well... along with the packet...) , so that kernel 
latencies are not a problem.


Does Linux support NICs that can do that?

Regards.


Re: CLOCK_MONOTONIC datagram timestamps by the kernel

2007-02-28 Thread John

Eric Dumazet wrote:

On Wednesday 28 February 2007 15:23, John wrote:

Eric Dumazet wrote:

John wrote:

I know it's possible to have Linux timestamp incoming datagrams as soon
as they are received, then for one to retrieve this timestamp later
with an ioctl command or a recvmsg call.

Has it ever been proposed to modify struct skb_timeval to hold
nanosecond stamps instead of just microsecond stamps? Then make the
improved precision somehow available to user space.

Most modern NICs are able to delay packet delivery, in order to reduce the
number of interrupts and benefit from better cache hits.


You are referring to NAPI interrupt mitigation, right?


Nope; I am referring to hardware features. NAPI is software.

See ethtool -c eth0

# ethtool -c eth0
Coalesce parameters for eth0:
Adaptive RX: off  TX: off
stats-block-usecs: 100
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0

rx-usecs: 300
rx-frames: 60
rx-usecs-irq: 300
rx-frames-irq: 60

tx-usecs: 200
tx-frames: 53
tx-usecs-irq: 200
tx-frames-irq: 53

You can see on this setup, rx interrupts can be delayed up to 300 us (up to 60
packets might be delayed)


One can disable interrupt mitigation. Your argument that it introduces 
latency therefore becomes irrelevant.



POSIX is moving to nanoseconds interfaces.
http://www.opengroup.org/onlinepubs/009695399/functions/clock_settime.html


You snipped too much. I also wrote:

struct timeval and struct timespec occupy the same amount of space (64 bits each).

If the hardware can indeed manage sub-microsecond accuracy, a struct
timeval forces the kernel to discard valuable information.

The fact that you are able to take nanosecond timestamps inside the kernel is not 
sufficient. It is necessary, of course, but not sufficient. This precision is 
fine for timing locally generated events; but by the time you ask for a 
'nanosecond' timestamp, it is usually long before or after the real event.


If you rely on nanosecond precision on network packets, then something is 
wrong with your algorithm. Even the rt patches won't make sure your CPU caches 
are pre-filled, or that the routers/links between your machines are not busy.
A cache miss costs around 40 ns, for example, and a typical interrupt handler 
or rx processing path can trigger 100 cache misses, or none at all if the cache is hot.


Consider an idle Linux 2.6.20-rt8 system, equipped with a single PCI-E 
gigabit Ethernet NIC, running on a modern CPU (e.g. Core 2 Duo E6700). 
All this system does is time stamp 1000 packets per second.


Are you claiming that this platform *cannot* handle most packets within 
less than 1 microsecond of their arrival?


If there are platforms that can achieve sub-microsecond precision, and 
if it is not more expensive to support nanosecond resolution (I said 
resolution not precision), then it makes sense to support nanosecond 
resolution in Linux. Right?



You said that rt gives the highest priority to interrupt handlers:
if you have several NICs, what will happen if you receive packets on both 
NICs, or if the NIC interrupt happens at the same time as the timer interrupt?
One timestamp will be wrong for sure.


Again, this is irrelevant. We are discussing whether it would make sense 
to support sub-microsecond resolution. If there is one platform that can 
achieve sub-microsecond precision, there is a need for sub-microsecond 
resolution. As long as we are changing the resolution, we might as well 
use something standard like struct timespec.


For sure we could timestamp packets with nanosecond resolution, and possibly 
with a CLOCK_MONOTONIC value too, but it will give you (and others) false 
confidence in the real precision. Microsecond timestamps are already wrong...


IMHO, this is not true for all platforms.

Regards.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: CLOCK_MONOTONIC datagram timestamps by the kernel

2007-03-02 Thread John

Eric Dumazet wrote:


John wrote:


Consider an idle Linux 2.6.20-rt8 system, equipped with a single PCI-E
gigabit Ethernet NIC, running on a modern CPU (e.g. Core 2 Duo E6700).
All this system does is time stamp 1000 packets per second.

Are you claiming that this platform *cannot* handle most packets within
less than 1 microsecond of their arrival?


Yes I claim it. You expect too much of this platform, unless "most" means
10 % for you ;)


By "most" I meant more than 50%.

Has someone tried to measure interrupt latency in Linux? I'd like to 
plot the distribution of network IRQ to interrupt handler latencies.


If you replace "1 us" by "50 us", then yes, it probably can do it, if "most" 
means 99%, (not 99.999 %)


I think we need cold, hard numbers at this point :-)

Anyway, if you want to play, you can apply this patch on top of 
linux-2.6.21-rc2  (nanosecond resolution infrastructure needs 2.6.21)

I let you do the adjustments for rt kernel.


Why does it require 2.6.21?


This patch converts sk_buff timestamp to use new nanosecond infra
(added in 2.6.21)


Is this mentioned somewhere in the 2.6.21-rc1 ChangeLog?
http://kernel.org/pub/linux/kernel/v2.6/testing/ChangeLog-2.6.21-rc1

Regards.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Mellanox ConnectX3 Pro and kernel 4.4 low throughput bug

2016-02-09 Thread John

I'm running into a bug with kernel 4.4.0 where a VM-to-VM test between two
different baremetal hosts (HP ProLiant DL360 Gen9) shows receive-side
throughput about 25% lower than expected with a Mellanox ConnectX-3 Pro NIC.
The VMs are connected over a VXLAN tunnel that I set up with Open vSwitch
2.4.90 on both hosts. When the Mellanox NIC is the endpoint of the VXLAN
tunnel, its VM gets about 6.65 Gb/s in a receive throughput test, where other
NICs get ~8.3 Gb/s (8.04 for niantic, 8.65 for broadcom). When I test the
Mellanox on a (patched) 3.14.57 kernel, I get 8.9 Gb/s between VMs.

I have traced the issue as far as a TUN interface that 'plugs in' to
Open vSwitch and takes packets for the VM. If I run tcpdump on this tun
interface (called vnet0 in my case) during a VM-to-VM test, I get small TCP
packets - they are all 1398 bytes long. I also see high CPU usage for the
vhost kernel thread. If I run ftrace during a throughput test, grep for the
vhost thread (once done), and wc -l the result, there is an order of magnitude
more function calls in this thread than in the same test with the Broadcom.
If I do the same test with a Broadcom NIC as the endpoint of the VXLAN tunnel,
I get large packets - the size varies, but it is generally in the five-digit
range, some almost 65535 - and fewer calls in the vhost thread, as mentioned
above. This is also visible in top: with the Mellanox, the vhost kernel thread
and the libvirt+ process both have noticeably higher CPU usage.

I've tried bisecting the kernel to figure out where the change occurred that
allows the Broadcom NIC to benefit from GRO but not the Mellanox. I know that
the tun device started to perform GRO between 4.2 and 4.3, and this is where
the difference in throughput started. However, something between these two
versions breaks my setup completely, and I can't get any kind of traffic to
or from the VM at all. I tried to draw a diagram here:

|-high CPU%
->[mlx4_en/core]>[vxlan]--->[openvswitch]--->[tun]>[vhost]--->VM
   |-small packets (1398)

|-low CPU%
->[bnx2x ]>[vxlan]--->[openvswitch]--->[tun]>[vhost]--->VM
   |-big packets (~65535)


NIC info:

root@hLinux-ovstest-1:/home/john# ethtool -i rename8
driver: mlx4_en
version: 2.2-1 (Feb 2014)
firmware-version: 2.34.5010
bus-info: :08:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes

root@hLinux-ovstest-1:/home/john# ethtool -k rename8
Features for rename8:
rx-checksumming: on
tx-checksumming: on
tx-checksum-ipv4: on
tx-checksum-ip-generic: off [fixed]
tx-checksum-ipv6: on
tx-checksum-fcoe-crc: off [fixed]
tx-checksum-sctp: off [fixed]
scatter-gather: on
tx-scatter-gather: on
tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
tx-tcp-segmentation: on
tx-tcp-ecn-segmentation: off [fixed]
tx-tcp6-segmentation: on
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: on [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: off [fixed]
tx-ipip-segmentation: off [fixed]
tx-sit-segmentation: off [fixed]
tx-udp_tnl-segmentation: on [requested off]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off
rx-fcs: off
rx-all: off
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
busy-poll: on [fixed]

root@hLinux-ovstest-1:/home/john# lspci -vvs :08:00.0
08:00.0 Ethernet controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]
Subsystem: Hewlett-Packard Company Device 801f
Physical Slot: 1
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR- 
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 0
Region 0: Memory at 9600 (64-bit, non-prefetchable) [size=1M]
Region 2: Memory at 9400 (64-bit, prefetchable) [size=32M]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [48] Vital Product Data
Product Name: HP Ethernet 10G 2-port 546SFP+ Adapter
Read-only fiel

Re: Kernel memory leak in bnx2x driver with vxlan tunnel

2016-01-20 Thread John



On 01/19/2016 06:31 PM, Thomas Graf wrote:

On 01/19/16 at 04:51pm, Jesse Gross wrote:

On Tue, Jan 19, 2016 at 4:17 PM, Eric Dumazet  wrote:

So what is the purpose of having a dst if we need to drop it ?

Adding code in GRO would be fine if someone explains me the purpose of
doing apparently useless work.

(refcounting on dst is not exactly free)

In the GRO case, the dst is only dropped on the packets which have
been merged and therefore need to be freed (the GRO_MERGED_FREE case).
It's not being thrown away for the overall frame, just metadata that
has been duplicated on each individual frame, similar to the metadata
in struct sk_buff itself. And while it is not used by the IP stack
there are other consumers (eBPF/OVS/etc.). This entire process is
controlled by the COLLECT_METADATA flag on tunnels, so there is no
cost in situations where it is not actually used.

Right. There were thoughts around leveraging a per CPU scratch
buffer without a refcount and turn it into a full reference when
the packet gets enqueued somewhere but the need hasn't really come
up yet.

Jesse, is this what you have in mind:

diff --git a/net/core/dev.c b/net/core/dev.c
index cc9e365..3a5e96d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4548,9 +4548,10 @@ static gro_result_t napi_skb_finish(gro_result_t ret, 
struct sk_buff *skb)
 break;
  
 case GRO_MERGED_FREE:

-   if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD)
+   if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD) {
+   skb_release_head_state(skb);
 kmem_cache_free(skbuff_head_cache, skb);
-   else
+   } else
 __kfree_skb(skb);
 break;
So I've tested the patch below (the same as the one above, with minor 
modifications to make it compile) and it worked - no memory leak. 
Should I submit this, or...?


diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 4355129..a8fac63 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2829,6 +2829,7 @@ int skb_zerocopy(struct sk_buff *to, struct 
sk_buff *from,

 void skb_split(struct sk_buff *skb, struct sk_buff *skb1, const u32 len);
 int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen);
 void skb_scrub_packet(struct sk_buff *skb, bool xnet);
+void skb_release_head_state(struct sk_buff *skb);
 unsigned int skb_gso_transport_seglen(const struct sk_buff *skb);
 struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t 
features);

 struct sk_buff *skb_vlan_untag(struct sk_buff *skb);
diff --git a/net/core/dev.c b/net/core/dev.c
index ae00b89..76e3623 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4337,9 +4337,10 @@ static gro_result_t napi_skb_finish(gro_result_t 
ret, struct sk_buff *skb)

 break;

 case GRO_MERGED_FREE:
-if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD)
+if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD) {
+skb_release_head_state(skb);
 kmem_cache_free(skbuff_head_cache, skb);
-else
+} else
 __kfree_skb(skb);
 break;

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b2df375..45f6f50 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -633,7 +633,7 @@ fastpath:
 kmem_cache_free(skbuff_fclone_cache, fclones);
 }

-static void skb_release_head_state(struct sk_buff *skb)
+void skb_release_head_state(struct sk_buff *skb)
 {
 skb_dst_drop(skb);
 #ifdef CONFIG_XFRM


Re: Intel 82559 NIC corrupted EEPROM

2006-11-08 Thread John
If this bit equals 0b, the idle recognition circuit is disabled 
and the 82559 always remains in an active state. Thus, the 82559 always 
requests PCI CLK using the Clockrun mechanism.


Auke, do you agree with Donald Becker's warning?

If I disable STB, the NICs will waste a bit more power when idle,
is that correct? Are there other implications?

Thanks for reading this far!

John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Intel 82559 NIC corrupted EEPROM

2006-11-09 Thread John

Auke Kok wrote:

This is what I was afraid of: even though the code allows you to bypass 
the EEPROM checksum, the probe fails on a further check to see if the 
MAC address is valid.


Since something with this NIC specifically made the EEPROM return all 
0xff's, the MAC address is automatically invalid, and thus probe fails.


I don't understand why you think there is something wrong with one
specific NIC.

In 2.6.14.7, e100.ko fails to read the EEPROM on :00:08.0 (eth0)
In 2.6.18.1, e100.ko fails to read the EEPROM on :00:09.0 (eth1)
In both kernels, eepro100.ko successfully reads all the EEPROMs.

It seems that the driver has more problems with this NIC than just the 
eeprom checksum being bad. Needless to say this might need fixing.


Can you load the eepro driver and send me the full eeprom dump?
Perhaps I can duplicate things over here.


00:08.0 EEPROM contents, size 64x16

  3000 0464 e4e6 0e03  0201 4701 
  7213 8310 40a2 0001 8086   
         
         
         
         
  0128       
         92f7

00:09.0 EEPROM contents, size 64x16

  3000 0464 e5e6 0e03  0201 4701 
  7213 8310 40a2 0001 8086   
         
         
         
         
  0128       
         91f7

00:0a.0 EEPROM contents, size 64x16

  3000 0464 e6e6 0e03  0201 4701 
  7213 8310 40a2 0001 8086   
         
         
         
         
  0128       
         90f7
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Intel 82559 NIC corrupted EEPROM

2006-11-09 Thread John

Jesse Brandeburg wrote:


I suspect that one reason Becker's code works is that it uses IO
based access (slower, and different method) to the adapter rather
than memory mapped access.


I've noticed this difference.


The second thought is that the adapter is in D3, and something about
your kernel or the driver doesn't successfully wake it up to D0.


On my NICs, the EEPROM ID (Word 0Ah) is set to 0x40a2.
Thus DDPD (bit 6) is set to 0.

DDPD is the "Disable Deep Power Down while PME is disabled" bit.
0 - Deep Power Down is enabled in D3 state while PME-disabled.
1 - Deep Power Down disabled in D3 state while PME-disabled.
This bit should be set to 1b if a TCO controller is being used via the 
SMB because it requires receive functionality at all power states.


Are you suggesting I try and set DDPD to 1?
Or is this completely unrelated?


An indication of this would be looking at lspci -vv before/after
loading the driver.


$ diff -u lspci_vv_before_e100.txt lspci_vv_after_e100.txt
--- lspci_vv_before_e100.txt	2006-11-09 14:51:30.0 +0100
+++ lspci_vv_after_e100.txt	2006-11-09 14:51:30.0 +0100
@@ -74,21 +74,20 @@
 Expansion ROM at 2000 [disabled] [size=1M]
 Capabilities: [dc] Power Management version 2
 Flags: PMEClk- DSI+ D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
-   Status: D0 PME-Enable+ DSel=0 DScale=2 PME-
+   Status: D0 PME-Enable- DSel=0 DScale=2 PME-
 00:09.0 Ethernet controller: Intel Corporation 82557/8/9 [Ethernet Pro 100] (rev 08)
 Subsystem: Intel Corporation EtherExpress PRO/100B (TX)
-   Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
+   Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
 Status: Cap+ 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- SERR- 
-   Latency: 32 (2000ns min, 14000ns max), cache line size 08
 Interrupt: pin A routed to IRQ 10
-   Region 0: Memory at e5302000 (32-bit, non-prefetchable) [size=4K]
-   Region 1: I/O ports at dc00 [size=64]
-   Region 2: Memory at e510 (32-bit, non-prefetchable) [size=1M]
+   Region 0: Memory at e5302000 (32-bit, non-prefetchable) [disabled] [size=4K]
+   Region 1: I/O ports at dc00 [disabled] [size=64]
+   Region 2: Memory at e510 (32-bit, non-prefetchable) [disabled] [size=1M]
 Expansion ROM at 2010 [disabled] [size=1M]
 Capabilities: [dc] Power Management version 2
 Flags: PMEClk- DSI+ D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
-   Status: D0 PME-Enable+ DSel=0 DScale=2 PME-
+   Status: D0 PME-Enable- DSel=0 DScale=2 PME-
 00:0a.0 Ethernet controller: Intel Corporation 82557/8/9 [Ethernet Pro 100] (rev 08)
 Subsystem: Intel Corporation EtherExpress PRO/100B (TX)

Also, after loading/unloading eepro100 does the e100 driver work?


No.


A third idea is to look for a master abort in lspci after e100 fails to
load.


I don't understand that one.

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Intel 82559 NIC corrupted EEPROM

2006-11-10 Thread John

Jesse Brandeburg wrote:


Can you send output of cat /proc/iomem


-0009 : System RAM
000a-000b : Video RAM area
000f-000f : System ROM
0010-0ffe : System RAM
  0010-00296a1a : Kernel code
  00296a1b-0031bbe7 : Kernel data
0fff-0fff2fff : ACPI Non-volatile Storage
0fff3000-0fff : ACPI Tables
2000-200f : :00:08.0
2010-201f : :00:09.0
2020-202f : :00:0a.0
e000-e3ff : :00:00.0
e500-e50f : :00:08.0
e510-e51f : :00:09.0
e520-e52f : :00:0a.0
e530-e5300fff : :00:08.0
e5301000-e5301fff : :00:0a.0
e5302000-e5302fff : :00:09.0
- : reserved

I've also attached:

o config-2.6.18.1-adlink used to compile this kernel
o dmesg output after the machine boots


try something like the attached patch


Loading e100-debug.ko reports:

e100: Intel(R) PRO/100 Network Driver, 3.5.10-k2-NAPI
e100: Copyright(c) 1999-2005 Intel Corporation

***e100 debug: unable to set power state (error 0)
ACPI: PCI Interrupt Link [LNKA] enabled at IRQ 12
PCI: setting IRQ 12 as level-triggered
ACPI: PCI Interrupt :00:08.0[A] -> Link [LNKA]
 -> GSI 12 (level, low) -> IRQ 12
***e100 debug: read 0100/ from the same register
e100: eth0: e100_probe: addr 0xe530, irq 12, MAC addr 00:30:64:04:E6:E4

***e100 debug: unable to set power state (error 0)
ACPI: PCI Interrupt Link [LNKB] enabled at IRQ 10
PCI: setting IRQ 10 as level-triggered
ACPI: PCI Interrupt :00:09.0[A] -> Link [LNKB]
 -> GSI 10 (level, low) -> IRQ 10
***e100 debug: read 0100/ from the same register
e100: :00:09.0: e100_eeprom_load: EEPROM corrupted
ACPI: PCI interrupt for device :00:09.0 disabled
e100: probe of :00:09.0 failed with error -11

***e100 debug: unable to set power state (error 0)
ACPI: PCI Interrupt Link [LNKC] enabled at IRQ 11
PCI: setting IRQ 11 as level-triggered
ACPI: PCI Interrupt :00:0a.0[A] -> Link [LNKC]
 -> GSI 11 (level, low) -> IRQ 11
***e100 debug: read 0100/ from the same register
e100: eth1: e100_probe: addr 0xe5301000, irq 11, MAC addr 00:30:64:04:E6:E6


In other words, the behavior is the same for all three NICs.

pci_set_power_state(pdev, PCI_D0) returns 0
pci_iomap returns something != NULL

Can I provide more information to help locate the problem?
#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.18.1-hrt
# Tue Nov  7 17:52:26 2006
#
CONFIG_X86_32=y
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_X86=y
CONFIG_MMU=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_DMI=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# Code maturity level options
#
CONFIG_EXPERIMENTAL=y
CONFIG_BROKEN_ON_SMP=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32

#
# General setup
#
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
# CONFIG_POSIX_MQUEUE is not set
# CONFIG_BSD_PROCESS_ACCT is not set
# CONFIG_TASKSTATS is not set
# CONFIG_AUDIT is not set
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
# CONFIG_RELAY is not set
CONFIG_INITRAMFS_SOURCE=""
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
# CONFIG_EMBEDDED is not set
CONFIG_UID16=y
CONFIG_SYSCTL=y
CONFIG_KALLSYMS=y
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_EPOLL=y
CONFIG_SHMEM=y
CONFIG_SLAB=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_RT_MUTEXES=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0
# CONFIG_SLOB is not set

#
# Loadable module support
#
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
# CONFIG_MODVERSIONS is not set
# CONFIG_MODULE_SRCVERSION_ALL is not set
# CONFIG_KMOD is not set

#
# Block layer
#
# CONFIG_LBD is not set
# CONFIG_BLK_DEV_IO_TRACE is not set
# CONFIG_LSF is not set

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
# CONFIG_IOSCHED_AS is not set
# CONFIG_IOSCHED_DEADLINE is not set
CONFIG_IOSCHED_CFQ=y
# CONFIG_DEFAULT_AS is not set
# CONFIG_DEFAULT_DEADLINE is not set
CONFIG_DEFAULT_CFQ=y
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="cfq"

#
# Processor type and features
#
# CONFIG_HIGH_RES_TIMERS is not set
# CONFIG_SMP is not set
CONFIG_X86_PC=y
# CONFIG_X86_ELAN is not set
# CONFIG_X86_VOYAGER is not set
# CONFIG_X86_NUMAQ is not set
# CONFIG_X86_SUMMIT is not set
# CONFIG_X86_BIGSMP is not set
# CONFIG_X86_VISWS is not set
# CONFIG_X86_GENERICARCH is not set
# CONFIG_X86_ES7000 is not set
# CONFIG_M386 is not set
# CONFIG_M486 is not set
# CONFIG_M586 is not set
# CONFIG_M586TSC is not set
# CONFIG_M586MMX is not set
# CONFIG_M686 is not set
# CONFIG_MPENTIUMII is not set
CONFIG_MPENTIUMIII=y
# CONFIG_MPENTIUMM is not set
# CONFIG_MPENTIUM4 is not set
# CONFIG_MK6 is not set
# CONFIG_MK7 is not set
# CON

[PATCH] fix up sysctl_tcp_mem initialization

2006-11-14 Thread John Heffner
The initial values of sysctl_tcp_mem are sometimes greater than the 
total memory in the system (particularly on SMP systems).  This patch 
ensures that tcp_mem[2] is always <= 3/4 nr_kernel_pages.


However, I wonder if we want to set this differently than the way this 
patch does it.  Depending on how far off the memory size is from a power 
of two (exactly equal to a power of two is the worst case), and if total 
memory <128M, it can be substantially less than 3/4.


  -John
Fix up tcp_mem initial settings to take into account the size of the
hash entries (different on SMP and non-SMP systems).

Signed-off-by: John Heffner <[EMAIL PROTECTED]>

---
commit d4ef8c8245c0a033622ce9ba9e25d379475254f6
tree 5377b8af0bac3b92161188e7369a84e472b5acb2
parent ea55b7c31b47edf90132baea9a088da3bbe2bb5c
author John Heffner <[EMAIL PROTECTED]> Tue, 14 Nov 2006 14:53:27 -0500
committer John Heffner <[EMAIL PROTECTED]> Tue, 14 Nov 2006 14:53:27 -0500

 net/ipv4/tcp.c |7 ---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 4322318..c05e8ed 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2316,9 +2316,10 @@ void __init tcp_init(void)
sysctl_max_syn_backlog = 128;
}
 
-   sysctl_tcp_mem[0] =  768 << order;
-   sysctl_tcp_mem[1] = 1024 << order;
-   sysctl_tcp_mem[2] = 1536 << order;
+   /* Allow no more than 3/4 kernel memory (usually less) allocated to TCP 
*/
+   sysctl_tcp_mem[0] = (1536 / sizeof (struct inet_bind_hashbucket)) << 
order;
+   sysctl_tcp_mem[1] = sysctl_tcp_mem[0] * 4 / 3;
+   sysctl_tcp_mem[2] = sysctl_tcp_mem[0] * 2;
 
limit = ((unsigned long)sysctl_tcp_mem[1]) << (PAGE_SHIFT - 7);
max_share = min(4UL*1024*1024, limit);


Re: [PATCH] fix up sysctl_tcp_mem initialization

2006-11-15 Thread John Heffner

David Miller wrote:
However, I wonder if we want to set this differently than the way this 
patch does it.  Depending on how far off the memory size is from a power 
of two (exactly equal to a power of two is the worst case), and if total 
memory <128M, it can be substantially less than 3/4.


Longer term, yes, probably a better way exists.

So your concern is that when we round to a power of 2 like we do
now, we often mis-shoot?


I'm not that concerned about it, but basically yes, there are big (x2) 
jumps on power-of-two memory size boundaries.  There's also a bigger 
(x8) discontinuity at 128k pages.  It could be smoother.


  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 2/2] [TCP] MTUprobe: Cleanup send queue check (no need to loop)

2007-11-21 Thread John Heffner

Ilpo Järvinen wrote:

The original code has striking complexity to perform a query
which can be reduced to a very simple compare.

FIN seqno may be included in write_seq, but it should not make
any significant difference here compared to skb->len, which was
used previously. One won't end up here with SYN still queued.

Use of write_seq check guarantees that there's a valid skb in
send_head so I removed the extra check.

Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]>


Acked-by: John Heffner <[EMAIL PROTECTED]>



---
 net/ipv4/tcp_output.c |7 +--
 1 files changed, 1 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index ff22ce8..1822ce6 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1315,12 +1315,7 @@ static int tcp_mtu_probe(struct sock *sk)
}
 
 	/* Have enough data in the send queue to probe? */

-   len = 0;
-   if ((skb = tcp_send_head(sk)) == NULL)
-   return -1;
-   while ((len += skb->len) < size_needed && !tcp_skb_is_last(sk, skb))
-   skb = tcp_write_queue_next(sk, skb);
-   if (len < size_needed)
+   if (tp->write_seq - tp->snd_nxt < size_needed)
return -1;
 
 	if (tp->snd_wnd < size_needed)


-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 1/2] [TCP]: MTUprobe: receiver window & data available checks fixed

2007-11-21 Thread John Heffner

Ilpo Järvinen wrote:

It seems that the checked range for receiver window check should
begin from the first rather than from the last skb that is going
to be included to the probe. And that can be achieved without
reference to skbs at all, snd_nxt and write_seq provides the
correct seqno already. Plus, it SHOULD account packets that are
necessary to trigger fast retransmit [RFC4821].

Location of snd_wnd < probe_size/size_needed check is bogus
because it will cause the other if() match as well (due to
snd_nxt >= snd_una invariant).

Removed dead obvious comment.

Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]>


Acked-by: John Heffner <[EMAIL PROTECTED]>



---
 net/ipv4/tcp_output.c |   17 -
 1 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 30d6737..ff22ce8 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1289,6 +1289,7 @@ static int tcp_mtu_probe(struct sock *sk)
struct sk_buff *skb, *nskb, *next;
int len;
int probe_size;
+   int size_needed;
unsigned int pif;
int copy;
int mss_now;
@@ -1307,6 +1308,7 @@ static int tcp_mtu_probe(struct sock *sk)
/* Very simple search strategy: just double the MSS. */
mss_now = tcp_current_mss(sk, 0);
probe_size = 2*tp->mss_cache;
+   size_needed = probe_size + (tp->reordering + 1) * mss_now;
if (probe_size > tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_high)) {
/* TODO: set timer for probe_converge_event */
return -1;
@@ -1316,18 +1318,15 @@ static int tcp_mtu_probe(struct sock *sk)
len = 0;
if ((skb = tcp_send_head(sk)) == NULL)
return -1;
-   while ((len += skb->len) < probe_size && !tcp_skb_is_last(sk, skb))
+   while ((len += skb->len) < size_needed && !tcp_skb_is_last(sk, skb))
skb = tcp_write_queue_next(sk, skb);
-   if (len < probe_size)
+   if (len < size_needed)
return -1;
 
-	/* Receive window check. */

-   if (after(TCP_SKB_CB(skb)->seq + probe_size, tp->snd_una + 
tp->snd_wnd)) {
-   if (tp->snd_wnd < probe_size)
-   return -1;
-   else
-   return 0;
-   }
+   if (tp->snd_wnd < size_needed)
+   return -1;
+   if (after(tp->snd_nxt + size_needed, tp->snd_una + tp->snd_wnd))
+   return 0;
 
 	/* Do we need to wait to drain cwnd? */

pif = tcp_packets_in_flight(tp);


-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-2.6 0/3]: Three TCP fixes

2007-12-04 Thread John Heffner

Ilpo Järvinen wrote:

...I'm still to figure out why tcp_cwnd_down uses snd_ssthresh/2
as lower bound even though the ssthresh was already halved, 
so snd_ssthresh should suffice.


I remember this coming up at least once before, so it's probably worth a 
comment in the code.  Rate-halving attempts to actually reduce cwnd to 
half the delivered window.  Here, cwnd/4 (ssthresh/2) is a lower bound 
on how far rate-halving can reduce cwnd.  See the "Bounding Parameters" 
section of <http://www.psc.edu/networking/papers/FACKnotes/current/>.


  -John
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-2.6 0/3]: Three TCP fixes

2007-12-04 Thread John Heffner

Ilpo Järvinen wrote:

On Tue, 4 Dec 2007, John Heffner wrote:


Ilpo Järvinen wrote:

...I'm still to figure out why tcp_cwnd_down uses snd_ssthresh/2
as lower bound even though the ssthresh was already halved, so snd_ssthresh
should suffice.

I remember this coming up at least once before, so it's probably worth a
comment in the code.  Rate-halving attempts to actually reduce cwnd to half
the delivered window.  Here, cwnd/4 (ssthresh/2) is a lower bound on how far
rate-halving can reduce cwnd.  See the "Bounding Parameters" section of
<http://www.psc.edu/networking/papers/FACKnotes/current/>.


Thanks for the info! Sadly enough it makes NewReno recovery quite 
inefficient when there are enough losses and high BDP link (in my case 
384k/200ms, BDP sized buffer). There might be yet another bug in it as 
well (it is still a bit unclear how tcp variables behaved during my 
scenario and I'll investigate further) but reduction in the transfer 
rate is going to last longer than a short moment (which is used as 
motivation in those FACK notes). In fact, if I just use RFC2581 like 
setting w/o rate-halving (and experience the initial "pause" in sending), 
the ACK clock to send out new data works very nicely beating rate halving 
fair and square. For SACK/FACK it works much nicer because recovery is 
finished much earlier and slow start recovers cwnd quickly.


I believe this is exactly the reason why Matt (CC'd) and Jamshid 
abandoned this line of work in the late 90's.  In my opinion, it's 
probably not such a bad idea to use cwnd/2 as the bound.  In some 
situations, the current rate-halving code will work better, but as you 
point out, in others the cwnd is lowered too much.



...Mind if I ask another similar one, any idea why prior_ssthresh is 
smaller (3/4 of it) than cwnd used to be (see tcp_current_ssthresh)?


Not sure on that one.  I'm not aware of any publications this is based 
on.  Maybe Alexey knows?


  -John
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: TCP event tracking via netlink...

2007-12-05 Thread John Heffner

David Miller wrote:

Ilpo, I was pondering the kind of debugging one does to find
congestion control issues and even SACK bugs and it's currently too
painful because there is no standard way to track state changes.

I assume you're using something like carefully crafted printk's,
kprobes, or even ad-hoc statistic counters.  That's what I used to do
:-)

With that in mind it occurred to me that we might want to do something
like a state change event generator.

Basically some application or even a daemon listens on this generic
netlink socket family we create.  The header of each event packet
indicates what socket the event is for and then there is some state
information.

Then you can look at a tcpdump and this state dump side by side and
see what the kernel decided to do.

Now there is the question of granularity.

A very important consideration in this is that we want this thing to
be enabled in the distributions, therefore it must be cheap.  Perhaps
one test at the end of the packet input processing.

So I say we pick some state to track (perhaps start with tcp_info)
and just push that at the end of every packet input run.  Also,
we add some minimal filtering capability (match on specific IP
address and/or port, for example).

Maybe if we want to get really fancy we can have some more-expensive
debug mode where detailed specific events get generated via some
macros we can scatter all over the place.  This won't be useful
for general user problem analysis, but it will be excellent for
developers.

Let me know if you think this is useful enough and I'll work on
an implementation we can start playing with.



FWIW, sounds similar to what these guys are doing with SIFTR for FreeBSD:
http://caia.swin.edu.au/urp/newtcp/tools.html
http://caia.swin.edu.au/reports/070824A/CAIA-TR-070824A.pdf

  -John


Re: TCP's initial cwnd setting correct?...

2007-08-08 Thread John Heffner

That sounds right to me.

  -John


Ilpo Järvinen wrote:

On Mon, 6 Aug 2007, Ilpo Järvinen wrote:

...Goto logic could be cleaner (does anybody have a suggestion for a 
better way to structure it?)


...I could probably move the setting of snd_cwnd earlier to avoid 
this problem if this seems a valid fix at all.






Re: TCP's initial cwnd setting correct?...

2007-08-08 Thread John Heffner
I believe the current calculation is correct.  The RFC specifies a 
window of no more than 4380 bytes unless 2*MSS > 4380.  If you change 
the code in this way, then MSS=1461 will give you an initial window of 
3*MSS == 4383 bytes, violating the spec.  The pseudocode in RFC 3390 is 
a bit misleading because it uses a clamp at 4380 bytes rather than a 
multiplier in the relevant range.


  -John


David Miller wrote:

From: "Ilpo Järvinen" <[EMAIL PROTECTED]>
Date: Mon, 6 Aug 2007 15:37:15 +0300 (EEST)


@@ -805,13 +805,13 @@ void tcp_update_metrics(struct sock *sk)
}
 }
 
-/* Numbers are taken from RFC2414.  */

+/* Numbers are taken from RFC3390.  */
 __u32 tcp_init_cwnd(struct tcp_sock *tp, struct dst_entry *dst)
 {
__u32 cwnd = (dst ? dst_metric(dst, RTAX_INITCWND) : 0);
 
 	if (!cwnd) {

-   if (tp->mss_cache > 1460)
+   if (tp->mss_cache >= 2190)
cwnd = 2;
else
cwnd = (tp->mss_cache > 1095) ? 3 : 4;


I remember suggesting something similar about 5 or 6 years
ago and Alexey Kuznetsov at the time explained the numbers
which are there and why they should not be changed.

I forget the reasons though, and I'll try to do the research.

These numbers have been like this forever, FWIW.


2.6.23-rc2: WARNING: at kernel/irq/resend.c:70 check_irq_resend()

2007-08-09 Thread John Stoffel

Hi,

I'm opening this ticket as a new subject, even though it looks like it
might be related to the thread "Networking dies after random time".
Sorry for the wide CC list, but since my network hasn't died since I
rebooted into 2.6.23-rc2 (after 30+ days at 2.6.22-rc7), I'm wondering
if the problem is more than networking related.  

Honestly, I haven't gone back over the previous thread in detail, so I
might be missing info here.

System details: Dell Precision 610MT, Intel 440GX chipset, Dual PIII
Xeon, 550Mhz, 2gb RAM (upgraded from 768Mb last night), a mix of IDE,
SCSI and SATA disks in the system.  My poor PCI bus!  Just upgraded to
2.6.23-rc2.  Interrupts looks like this:

> cat /proc/interrupts 
   CPU0   CPU1   
  0:280  1   IO-APIC-edge  timer
  1:788  0   IO-APIC-edge  i8042
  6:  1  4   IO-APIC-edge  floppy
  8:  0  1   IO-APIC-edge  rtc
  9:  0  0   IO-APIC-fasteoi   acpi
 11:  82410   1239   IO-APIC-edge  Cyclom-Y
 12:279106   IO-APIC-edge  i8042
 14: 440901   4266   IO-APIC-edge  libata
 15:  0  0   IO-APIC-edge  libata
 16:2394727  42983   IO-APIC-fasteoi   ohci_hcd:usb3, Ensoniq
   AudioPCI, [EMAIL PROTECTED]::01:00.0
 17:2237362   1110   IO-APIC-fasteoi   sata_sil,
   ehci_hcd:usb1, eth0
 18: 126520  31978   IO-APIC-fasteoi   aic7xxx, aic7xxx, ide2,
   ide3, ohci1394
 19:  0  0   IO-APIC-fasteoi   ohci_hcd:usb2,
   uhci_hcd:usb4
NMI:  0  0 
LOC:   40672484   40672246 
ERR:  0
MIS:  0

I've only seen the one Warning oops, and backups and other system
processes have been running for the past 12 hours without a problem.  


[  187.747442] Probing IDE interface ide2...
[  188.011634] hde: WDC WD1200JB-00CRA1, ATA DISK drive
[  188.623038] WARNING: at kernel/irq/resend.c:70 check_irq_resend()
[  188.623105]  [] check_irq_resend+0xa8/0xc0
[  188.623204]  [] enable_irq+0xc3/0xd0
[  188.623295]  [] probe_hwif+0x670/0x7c0 [ide_core]
[  188.623448]  [] do_ide_setup_pci_device+0x154/0x480
[ide_core]
[  188.623571]  [] probe_hwif_init_with_fixup+0xc/0x90
[ide_core]
[  188.623690]  [] init_setup_hpt302+0x0/0x30 [hpt366]
[  188.623791]  [] ide_setup_pci_device+0x7b/0xc0 [ide_core]
[  188.623909]  [] init_setup_hpt302+0x0/0x30 [hpt366]
[  188.624004]  [] hpt366_init_one+0x8d/0xa0 [hpt366]
[  188.624095]  [] init_setup_hpt302+0x0/0x30 [hpt366]
[  188.624187]  [] init_chipset_hpt366+0x0/0x680 [hpt366]
[  188.624281]  [] init_hwif_hpt366+0x0/0x380 [hpt366]
[  188.624372]  [] init_dma_hpt366+0x0/0xe0 [hpt366]
[  188.624466]  [] pci_device_probe+0x56/0x80
[  188.624565]  [] driver_probe_device+0x8e/0x190
[  188.624669]  [] __driver_attach+0x9e/0xa0
[  188.624756]  [] bus_for_each_dev+0x3a/0x60
[  188.624845]  [] driver_attach+0x16/0x20
[  188.624932]  [] __driver_attach+0x0/0xa0
[  188.625017]  [] bus_add_driver+0x8a/0x1b0
[  188.625107]  [] __pci_register_driver+0x53/0xa0
[  188.625197]  [] sys_init_module+0x13d/0x1820
[  188.625315]  [] snd_timer_find+0x0/0x90 [snd_timer]
[  188.625424]  [] disable_irq+0x0/0x30
[  188.625513]  [] sys_mmap2+0xcd/0xd0
[  188.625612]  [] syscall_call+0x7/0xb
[  188.625701]  [] rpc_get_inode+0x0/0x80
[  188.625798]  ===
[  188.625871] hde: selected mode 0x45
[  188.626817] ide2 at 0xecf8-0xecff,0xecf2 on irq 18
[  188.627080] Probing IDE interface ide3...
[  188.891165] hdg: WDC WD1200JB-00EVA0, ATA DISK drive
[  189.502580] hdg: selected mode 0x45
[  189.503698] ide3 at 0xece0-0xece7,0xecda on irq 18


Let 


Re: [PATCH] TCP FIN gets dropped prematurely, results in ack storm

2007-05-01 Thread John Heffner

Benjamin LaHaise wrote:

According to your patch, several packets with fin bit might be sent,
including one with data. If another host does not receive fin
retransmit, then that logic is broken, and it can not be fixed by
duplicating fins, I would even say, that remote box should drop second
packet with fin, while it can carry data, which will break higher
connection logic.


The FIN hasn't been ack'd by the other side, though, and yet Linux is no 
longer transmitting packets with it set.  Read the beginning of the trace.


I agree completely with Evgeniy.  The patch you sent would cause bad 
breakage by sending the FIN bit on segments with different sequence numbers.


Looking at your trace, it seems like the behavior of the test system 
192.168.2.2 is broken in two ways.  First, like you said it has broken 
state in that it has forgotten that it sent the FIN.  Once you do that, 
the connection state is corrupt and all bets are off.  It's sending an 
out-of-window segment that's getting tossed by Linux, and Linux 
generates an ack in response.  This is in direct RFC compliance.  The 
second problem is that the other system is generating these broken acks 
in response to the legitimate acks Linux is sending, causing the ack 
war.  I can't really guess why it's doing that...


You might be able to change Linux to prevent this ack war, but doing so 
would break RFC compliance, and given the buggy nature of the other end, 
it sounds to me like a bad idea.


  -John


Re: [PATCH] TCP FIN gets dropped prematurely, results in ack storm

2007-05-01 Thread John Heffner

Benjamin LaHaise wrote:

On Tue, May 01, 2007 at 09:41:28PM +0400, Evgeniy Polyakov wrote:

Hmm, 2.2 machine in your test seems to behave incorrectly:


I am aware of that.  However, I think that the loss of certain packets and 
reordering can result in the same behaviour.  What's more, this 
behaviour can occur in real deployed systems.  "Be strict in what you send 
and liberal in what you accept."  Both systems should be fixed, which is 
what I'm trying to do.


Actually, you cannot get into this situation by loss or reordering of 
packets, only by corruption of state on one side.  It sends the FIN, 
which effectively increases the sequence number by one.  However, all 
later segments it sends have an old lower sequence number, which are now 
out of window.


Being liberal in what you accept is good to a point, but sometimes you 
have to draw the line.


  -John


Re: [PATCH] [TCP] Sysctl: document tcp_max_ssthresh (Limited Slow-Start)

2007-05-18 Thread John Heffner

Rick Jones wrote:
as an asside, "tcp_max_ssthresh" sounds like the maximum value ssthresh 
can take-on.  is that correct, or is this more of a "once ssthresh is 
above this, behave in this new way?"  If that is the case, while the 


I don't like it either, but you'll have to talk to Sally Floyd about 
that one.. ;)


In general, I would like the documentation to emphasize more how to set 
the parameter than describe the algorithm.  The max_ssthresh parameter 
should ideally be set to the bottleneck queue size, or more 
realistically a conservative value that's likely to be smaller than the 
bottleneck queue size.  When max_ssthresh is smaller than the bottleneck 
queue, (limited) slow start will not overflow it until cwnd has fully 
ramped up to the appropriate size.


  -John


Re: UDP packet loss when running lsof

2007-05-21 Thread John Miller
 kB
VmallocUsed:  6924 kB
VmallocChunk: 34359731259 kB
HugePages_Total: 0
HugePages_Free:  0
HugePages_Rsvd:  0
Hugepagesize: 2048 kB

Thanks for your help!
Regards,
John





Re: UDP packet loss when running lsof

2007-05-22 Thread John Miller


Hi Eric,


> It's a HP system with two dual core CPUs at 3GHz, the



Then you might try to bind network IRQ to one CPU
(echo 1 >/proc/irq/XX/smp_affinity)



XX being your NIC interrupt (cat /proc/interrupts to catch it)



and bind your user program to another cpu(s)


the NIC was already fixed at CPU0 and the irq_balancer switched
the timer interrupt between all CPUs and the storage HBA between
CPU1 and CPU4. Stopping the balancer and leaving NIC alone on CPU0
and the other interrupts and my program on CPU2-4 did not improve
the situation.
At least I could not see an improvement over just adding
thash_entries=2048.


You might hit a cond_resched_softirq() bug that Ingo and others
are sorting out right now. Using separate CPU for softirq
handling and your programs should help a lot here.


Shouldn't I get some syslog messages if this bug is triggered?

Nevertheless, I also opened a call with Novell about this issue,
as the current cond_resched_softirq() looks completely
different from the one in 2.6.18.


> This did help a lot, I tried thash_entries=10 and now only a
> while loop around the "cat ...tcp" triggers packet loss. Tests



I dont understand here : using a small thash_entries makes
the bug always appear ?


No. thash_entries=10 improves the situation. Without the param
nearly every look at /proc/net/tcp leads to packet loss; with
thash_entries=10 (or 2048, it does not matter) I have to start a
"while true; do cat /proc/net/tcp ; done" loop to get packet loss
every minute.

But even with thash_entries=10, if I leave my program alone
on the system I still get packet loss every few hours.

Regards,
John






Re: Problem with implementation of TCP_DEFER_ACCEPT?

2007-08-23 Thread John Heffner

TJ wrote:

client SYN > server LISTENING
client < SYN ACK server SYN_RECEIVED (time-out 3s)
 server: inet_rsk(req)->acked = 1

client ACK > server (discarded)

client < SYN ACK (DUP) server (time-out 6s)
client ACK (DUP) > server (discarded)

client < SYN ACK (DUP) server (time-out 12s)
client ACK (DUP) > server (discarded)

client < SYN ACK (DUP) server (time-out 24s)
client ACK (DUP) > server (discarded)

client < SYN ACK (DUP) server (time-out 48s)
client ACK (DUP) > server (discarded)

client < SYN ACK (DUP) server (time-out 96s)
client ACK (DUP) > server (discarded)

server: half-open socket closed.

With each client ACK being dropped by the kernel's TCP_DEFER_ACCEPT
mechanism eventually the handshake fails after the 'SYN ACK' retries and
time-outs expire.

There is a case for arguing the kernel should be operating in an
enhanced handshaking mode when TCP_DEFER_ACCEPT is enabled, not an
alternative mode, and therefore should accept *both* RFC 793 and
TCP_DEFER_ACCEPT. I've been unable to find a specification or RFC for
implementing TCP_DEFER_ACCEPT aka BSD's SO_ACCEPTFILTER to give me firm
guidance.

It seems incorrect to penalise a client that is trying to complete the
handshake according to the RFC 793 specification, especially as the
client has no way of knowing ahead of time whether or not the server is
operating deferred accept.


Interesting problem.  TCP_DEFER_ACCEPT does not conform to any standard 
I'm aware of.  (In fact, I'd say it's in violation of RFC 793.)  The 
implementation does exactly what it claims, though -- it "allows a 
listener to be awakened only when data arrives on the socket."


I think a more useful spec might have been "allows a listener to be 
awakened only when data arrives on the socket, unless the specified 
timeout has expired."  Once the timeout expires, it should process the 
embryonic connection as if TCP_DEFER_ACCEPT is not set.  Unfortunately, 
I don't think we can retroactively change this definition, as an 
application might depend on data being available and do a non-blocking 
read() after the accept(), expecting data to be there.  Is this worth 
trying to fix?


Also, a listen socket with a backlog and TCP_DEFER_ACCEPT will have reqs 
sit in the backlog for the full defer timeout, even if they've received 
data, which is not really the right thing to do.


I've attached a patch implementing this suggestion (compile tested only 
-- I think I got the logic right but it's late ;).  Kind of ugly, and 
uses up a bit in struct inet_request_sock.  Maybe can be done better...


  -John
diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index 62daf21..f9f64a5 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -72,7 +72,8 @@ struct inet_request_sock {
sack_ok: 1,
wscale_ok  : 1,
ecn_ok : 1,
-   acked  : 1;
+   acked  : 1,
+   deferred   : 1;
struct ip_options   *opt;
 };
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 185c7ec..cad2490 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -978,6 +978,7 @@ static inline void tcp_openreq_init(struct request_sock *req,
ireq->snd_wscale = rx_opt->snd_wscale;
ireq->wscale_ok = rx_opt->wscale_ok;
ireq->acked = 0;
+   ireq->deferred = 0;
ireq->ecn_ok = 0;
ireq->rmt_port = tcp_hdr(skb)->source;
 }
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index fbe7714..1207fb8 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -444,9 +444,6 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
}
}
 
-   if (queue->rskq_defer_accept)
-   max_retries = queue->rskq_defer_accept;
-
budget = 2 * (lopt->nr_table_entries / (timeout / interval));
i = lopt->clock_hand;
 
@@ -455,7 +452,9 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
while ((req = *reqp) != NULL) {
if (time_after_eq(now, req->expires)) {
 			if ((req->retrans < thresh ||
-			     (inet_rsk(req)->acked && req->retrans < max_retries))
+			     (inet_rsk(req)->acked && req->retrans < max_retries) ||
+			     (inet_rsk(req)->deferred && req->retrans <
+			      queue->rskq_defer_accept + max_retries))
 			    && !req->rsk_ops->rtx_syn_ack(parent, req, NULL)) {
 

Re: Problem with implementation of TCP_DEFER_ACCEPT?

2007-08-24 Thread John Heffner

TJ wrote:

Right now Juniper are claiming the issue that brought this to the
surface (the bug linked to in my original post) is a problem with the
implementation of TCP_DEFER_ACCEPT.

My position so far is that the Juniper DX OS is not following the HTTP
standard because it doesn't send a request with the connection, and as I
read the end of section 1.4 of RFC2616, an HTTP connection should be
accompanied by a request.

Can anyone confirm my interpretation or provide references to firm it
up, or refute it?


You can think of TCP_DEFER_ACCEPT as an implicit application close() 
after a certain timeout, when not receiving a request.  All HTTP servers 
do this anyway (though I think technically they're supposed to send a 
408 Request Timeout error, it seems many do not).  It's a very valid 
question for Juniper as to why their box is failing to fill requests 
when its back-end connection has gone away, instead of re-establishing 
the connection and filling the request.


  -John


Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB

2007-08-24 Thread John Heffner

Bill Fink wrote:

Here you can see there is a major difference in the TX CPU utilization
(99 % with TSO disabled versus only 39 % with TSO enabled), although
the TSO disabled case was able to squeeze out a little extra performance
from its extra CPU utilization.  Interestingly, with TSO enabled, the
receiver actually consumed more CPU than with TSO disabled, so I guess
the receiver CPU saturation in that case (99 %) was what restricted
its performance somewhat (this was consistent across a few test runs).



One possibility is that I think the receive-side processing tends to do 
better when receiving into an empty queue.  When the (non-TSO) sender is 
the flow's bottleneck, this is going to be the case.  But when you 
switch to TSO, the receiver becomes the bottleneck and you're always 
going to have to put the packets at the back of the receive queue.  This 
might help account for the reason why you have both lower throughput and 
higher CPU utilization -- there's a point of instability right where the 
receiver becomes the bottleneck and you end up pushing it over to the 
bad side. :)


Just a theory.  I'm honestly surprised this effect would be so 
significant.  What do the numbers from netstat -s look like in the two 
cases?


  -John


Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB

2007-08-26 Thread John Heffner

Bill Fink wrote:

Here's the beforeafter delta of the receiver's "netstat -s"
statistics for the TSO enabled case:

Ip:
3659898 total packets received
3659898 incoming packets delivered
80050 requests sent out
Tcp:
2 passive connection openings
3659897 segments received
80050 segments send out
TcpExt:
33 packets directly queued to recvmsg prequeue.
104956 packets directly received from backlog
705528 packets directly received from prequeue
3654842 packets header predicted
193 packets header predicted and directly queued to user
4 acknowledgments not containing data received
6 predicted acknowledgments

And here it is for the TSO disabled case (GSO also disabled):

Ip:
4107083 total packets received
4107083 incoming packets delivered
1401376 requests sent out
Tcp:
2 passive connection openings
4107083 segments received
1401376 segments send out
TcpExt:
2 TCP sockets finished time wait in fast timer
48486 packets directly queued to recvmsg prequeue.
1056111048 packets directly received from backlog
2273357712 packets directly received from prequeue
1819317 packets header predicted
2287497 packets header predicted and directly queued to user
4 acknowledgments not containing data received
10 predicted acknowledgments

For the TSO disabled case, there are a huge amount more TCP segments
sent out (1401376 versus 80050), which I assume are ACKs, and which
could possibly contribute to the higher throughput for the TSO disabled
case due to faster feedback, but not explain the lower CPU utilization.
There are many more packets directly queued to recvmsg prequeue
(48486 versus 33).  The numbers for packets directly received from
backlog and prequeue in the TSO disabled case seem bogus to me so
I don't know how to interpret that.  There are only about half as
many packets header predicted (1819317 versus 3654842), but there
are many more packets header predicted and directly queued to user
(2287497 versus 193).  I'll leave the analysis of all this to those
who might actually know what it all means.


There are a few interesting things here.  For one, the bursts caused by 
TSO seem to be causing the receiver to do stretch acks.  This may have a 
negative impact on flow performance, but it's hard to say for sure how 
much.  Interestingly, it will even further reduce the CPU load on the 
sender, since it has to process fewer acks.


As I suspected, in the non-TSO case the receiver gets lots of packets 
directly queued to user.  This should result in somewhat lower CPU 
utilization on the receiver.  I don't know if it can account for all the 
difference you see.


The backlog and prequeue values are probably correct, but netstat's 
description is wrong.  A quick look at the code reveals these values are 
in units of bytes, not packets.


  -John


Re: [PATCH 2.6.22] TCP: Make TCP_RTO_MAX a variable (take 2)

2007-08-28 Thread John Heffner

OBATA Noboru wrote:

Is it correct that you think my problem can be addressed either
by the followings?

(1) Make the application timeouts longer.  (Steve has shown that
making an application timeouts twice the failover detection
timeout would be a solution.)


Right.  Is there something wrong with this approach?



(2) Let TCP have a notification of some kind.


There was some work on this in the IETF a while back (google trigtran 
linkup), but it never went anywhere to my knowledge.  In principle it's 
possible, but it's not clear that it's worth doing.  It's really just an 
optimization anyway.  Imagine the link that's failing over is one hop or 
more away from the endpoint.  You're back to the same problem again.


  -John


Re: [PATCH] make _minimum_ TCP retransmission timeout configurable

2007-08-29 Thread John Heffner

David Miller wrote:

From: Rick Jones <[EMAIL PROTECTED]>
Date: Wed, 29 Aug 2007 15:29:03 -0700


David Miller wrote:

None of the research folks want to commit to saying a lower value is
OK, even though it's quite clear that on a local 10 gigabit link a
minimum value of even 200 is absolutely and positively absurd.

So what do these cellphone network people want to do, increase the
minimum RTO or decrease it?  Exactly how does it help them?

They want to increase it.  The folks who triggered this want to make it 
3 seconds to avoid spurious RTOs.  Their experience with the "other 
platform" they wish to replace suggests that 3 seconds is a good value 
for their network.



If the issue is wireless loss, algorithms like FRTO might help them,
because FRTO tries to make a distinction between capacity losses
(which should adjust cwnd) and radio losses (which are not capacity
based and therefore should not affect cwnd).
I was looking at that.  FRTO seems only to affect the cwnd calculations, 
and not the RTO calculation, so it seems to "deal with" spurious RTOs 
rather than preclude them.  There is a strong desire here not to have 
spurious RTOs in the first place.  Each spurious retransmission will 
increase a user's charges.


All of this seems to suggest that the RTO calculation is wrong.


I think there's definitely room for improving the RTO calculation. 
However, this may not be the end-all fix...




It seems that packets in this network can be delayed several orders of
magnitude longer than the usual round trip as measured by TCP.

What exactly causes such a huge delay?  What is the TCP measured RTO
in these circumstances where spurious RTOs happen and a 3 second
minimum RTO makes things better?


I haven't done a lot of work on wireless myself, but my understanding is 
that one of the biggest problems is the behavior of link-layer 
retransmission schemes.  They can suddenly increase the delay of packets 
by a significant amount when you get a burst of radio interference. 
It's hard for TCP to gracefully handle this kind of jump without some 
minimum RTO, especially since WLAN RTTs can often be quite small.


  -John


Re: [PATCH] make _minimum_ TCP retransmission timeout configurable

2007-08-29 Thread John Heffner

John Heffner wrote:

What exactly causes such a huge delay?  What is the TCP measured RTO
in these circumstances where spurious RTOs happen and a 3 second
minimum RTO makes things better?


I haven't done a lot of work on wireless myself, but my understanding is 
that one of the biggest problems is the behavior of link-layer 
retransmission schemes.  They can suddenly increase the delay of packets 
by a significant amount when you get a burst of radio interference. It's 
hard for TCP to gracefully handle this kind of jump without some minimum 
RTO, especially since wlan RTTs can often be quite small.


(Replying to myself) Though F-RTO does often help in this case.

  -John


Re: NCR, was [PATCH] make _minimum_ TCP retransmission timeout configurable

2007-08-29 Thread John Heffner

Stephen Hemminger wrote:

On Wed, 29 Aug 2007 15:28:12 -0700 (PDT)
David Miller <[EMAIL PROTECTED]> wrote:

And reading NCR some more, we already have something similar in the
form of Alexey's reordering detection, in fact it handles exactly the
case NCR supposedly deals with.  We do not trigger loss recovery
strictly on the 3rd duplicate ACK, and we've known about and dealt
with the reordering issue explicitly for years.



Yeah, it looked like another case of BSD RFC writers reinventing
Linux algorithms, but it is worth getting the behaviour standardized
and more widely reviewed.


I don't believe this was the case.  NCR is substantially different, and 
came out of work at Texas A&M.  The original (only) implementation was 
in Linux IIRC.  Its goal was to do better.  Their papers say it does. 
It might be worth looking at.


In my own experience with reordering, Alexey's code had some 
hard-to-track-down bugs (look at all the work Ilpo's been doing), and 
the relative simplicity of NCR may be one of the reasons it does well in 
tests.


  -John


Re: [PATCH] make _minimum_ TCP retransmission timeout configurable

2007-08-29 Thread John Heffner

David Miller wrote:

From: Rick Jones <[EMAIL PROTECTED]>
Date: Wed, 29 Aug 2007 16:06:27 -0700

I believe the biggest component comes from link-layer retransmissions. 
There can also be some short outages thanks to signal blocking, 
tunnels, people with big hats and whatnot that the link-layer 
retransmissions are trying to address.  The three seconds seems to be a 
value that gives the certainty that 99 times out of 100 the segment was 
indeed lost.


The trace I've been sent shows clean RTTs ranging from ~200 milliseconds 
to ~7000 milliseconds.


Thanks for the info.

It's pretty easy to generate examples where we might have some sockets
talking over interfaces on such a network and others which are not.
Therefore, if we do this, a per-route metric is probably the best bet.


This is exactly what I was thinking.  It might even help discourage 
users from playing with this setting who should not. ;)


  -John


Re: [PATCH] make _minimum_ TCP retransmission timeout configurable take 2

2007-08-30 Thread John Heffner

Rick Jones wrote:
Like I said, the consumers of this are a trifle, well, "anxious" :)


Just curious, did you or this customer try with F-RTO enabled?  Or is 
this case you're dealing with truly hopeless?


  -John


82557/8/9 Ethernet Pro 100 interrupt mitigation support

2007-09-03 Thread John Sigler

(Please ignore previous message, it was sent from the wrong account.)

Hello everyone,

I have several systems with three integrated Intel 82559 (I *think*).

Does someone know if these boards support hardware interrupt mitigation?
I.e. is it possible to configure them to raise an IRQ only if their
hardware buffer is full OR if some given time (say 1 ms) has passed and
packets are available in their hardware buffer.

I've been using the eepro100 driver up to now, but I'm about to try the
e100 driver. Would I have to use NAPI? Or is this an orthogonal feature?

Regards.

00:08.0 Ethernet controller: Intel Corporation 82557/8/9 Ethernet Pro 100 (rev 
08)
Subsystem: Intel Corporation EtherExpress PRO/100B (TX)
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
Stepping- SERR- FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- 
<TAbort- <MAbort- >SERR- <PERR-

Re: 82557/8/9 Ethernet Pro 100 interrupt mitigation support

2007-09-03 Thread John Sigler

John Sigler wrote:


I have several systems with three integrated Intel 82559 (I *think*).

Does someone know if these boards support hardware interrupt mitigation?
I.e. is it possible to configure them to raise an IRQ only if their
hardware buffer is full OR if some given time (say 1 ms) has passed and
packets are available in their hardware buffer.

I've been using the eepro100 driver up to now, but I'm about to try the
e100 driver. Would I have to use NAPI? Or is this an orthogonal feature?

00:08.0 Ethernet controller: Intel Corporation 82557/8/9 Ethernet Pro 100 (rev 
08)
Subsystem: Intel Corporation EtherExpress PRO/100B (TX)
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
Stepping- SERR- FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- 
<MAbort- >SERR- <PERR-

Here is Intel's page for the 82559:
http://www.intel.com/design/network/products/lan/controllers/82559.htm

The "82559ER Fast Ethernet PCI Controller" data sheet mentions a 3 KB 
receive FIFO. I suppose that's too small to aggregate several frames?


The "8255x Controller Family Open Source Software Developer Manual" 
mentions the features supported by the 82559. I don't see anything 
related to interrupt mitigation support.


Does NAPI work well when there is no hardware interrupt mitigation support?

Regards.


Re: 82557/8/9 Ethernet Pro 100 interrupt mitigation support

2007-09-04 Thread John Sigler

Jesse Brandeburg wrote:


Auke Kok wrote:


Marc Sigler wrote:


I have several systems with three integrated Intel 82559 (I *think*).

Does someone know if these boards support hardware interrupt
mitigation? I.e. is it possible to configure them to raise an IRQ
only if their hardware buffer is full OR if some given time (say 1
ms) has passed and packets are available in their hardware buffer.

I've been using the eepro100 driver up to now, but I'm about to try
the e100 driver. Would I have to use NAPI? Or is this an orthogonal
feature? 


e100 hardware (as far as I can see from the specs) doesn't support
any irq mitigation, so you'll need to run in NAPI mode if you want to
throttle irq's. the in-kernel e100 already runs in NAPI mode, so
that's already covered. 


beware that the eepro100 driver is scheduled for removal (2.6.25 or so).


We support mitigation of interrupts in a downloadable microcode on only
a few pieces of hardware (revision id specific) in e100.c (see
e100_setup_ucode)


http://lxr.linux.no/source/drivers/net/e100.c#L1176

OK.

How do I tell which revision id I have?

00:08.0 0200: 8086:1229 (rev 08)
00:09.0 0200: 8086:1229 (rev 08)
00:0a.0 0200: 8086:1229 (rev 08)

How much memory is available on the board to bundle packets? 3000 bytes?


If you really really wanted mitigation you could probably backport the
microcode from the e100 driver in the 2.4.35 kernel for your specific
hardware.  This driver is versioned 2.X.


I forgot to mention I'm running 2.6.22.1-rt9.
I'm not sure why you mention 2.4.35?
The problem with e100 is that it fails to properly set up all three 
interfaces, which is why I'm stuck with eepro100.


Regards.


[PATCH 0/2] Clean up owner field in sock_lock_t

2007-09-11 Thread John Heffner
I don't know why the owner field is a (struct sock_iocb *).  I'm assuming
it's historical.  Can someone check this out?  Did I miss some alternate
usage?

These patches are against net-2.6.24.


[PATCH 1/2] [NET] Cleanup: Use sock_owned_by_user() macro

2007-09-11 Thread John Heffner
Changes asserts in sunrpc to use sock_owned_by_user() macro instead of
referencing sock_lock.owner directly.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 net/sunrpc/svcsock.c  |2 +-
 net/sunrpc/xprtsock.c |2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index ed17a50..3a95612 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -104,7 +104,7 @@ static struct lock_class_key svc_slock_key[2];
 static inline void svc_reclassify_socket(struct socket *sock)
 {
struct sock *sk = sock->sk;
-   BUG_ON(sk->sk_lock.owner != NULL);
+   BUG_ON(sock_owned_by_user(sk));
switch (sk->sk_family) {
case AF_INET:
sock_lock_init_class_and_name(sk, "slock-AF_INET-NFSD",
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 4ae7eed..282efd4 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -1186,7 +1186,7 @@ static struct lock_class_key xs_slock_key[2];
 static inline void xs_reclassify_socket(struct socket *sock)
 {
struct sock *sk = sock->sk;
-   BUG_ON(sk->sk_lock.owner != NULL);
+   BUG_ON(sock_owned_by_user(sk));
switch (sk->sk_family) {
case AF_INET:
sock_lock_init_class_and_name(sk, "slock-AF_INET-NFS",
-- 
1.5.3.rc7.30.g947ad2



[PATCH 2/2] [NET] Change type of owner in sock_lock_t to int, rename

2007-09-11 Thread John Heffner
The type of owner in sock_lock_t is currently (struct sock_iocb *),
presumably for historical reasons.  It is never used as this type, only
tested as NULL or set to (void *)1.  For clarity, this changes it to type
int, and renames to owned, to avoid any possible type casting errors.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 include/net/sock.h |7 +++
 net/core/sock.c|6 +++---
 2 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 802c670..5ed9fa4 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -76,10 +76,9 @@
  * between user contexts and software interrupt processing, whereas the
  * mini-semaphore synchronizes multiple users amongst themselves.
  */
-struct sock_iocb;
 typedef struct {
spinlock_t  slock;
-   struct sock_iocb*owner;
+   int owned;
wait_queue_head_t   wq;
/*
 * We express the mutex-alike socket_lock semantics
@@ -737,7 +736,7 @@ static inline int sk_stream_wmem_schedule(struct sock *sk, 
int size)
  * Since ~2.3.5 it is also exclusive sleep lock serializing
  * accesses from user process context.
  */
-#define sock_owned_by_user(sk) ((sk)->sk_lock.owner)
+#define sock_owned_by_user(sk) ((sk)->sk_lock.owned)
 
 /*
  * Macro so as to not evaluate some arguments when
@@ -748,7 +747,7 @@ static inline int sk_stream_wmem_schedule(struct sock *sk, 
int size)
  */
 #define sock_lock_init_class_and_name(sk, sname, skey, name, key)  \
 do {   \
-   sk->sk_lock.owner = NULL;   \
+   sk->sk_lock.owned = 0;  \
init_waitqueue_head(&sk->sk_lock.wq);   \
spin_lock_init(&(sk)->sk_lock.slock);   \
debug_check_no_locks_freed((void *)&(sk)->sk_lock,  \
diff --git a/net/core/sock.c b/net/core/sock.c
index cfed7d4..edbc562 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1575,9 +1575,9 @@ void fastcall lock_sock_nested(struct sock *sk, int 
subclass)
 {
might_sleep();
spin_lock_bh(&sk->sk_lock.slock);
-   if (sk->sk_lock.owner)
+   if (sk->sk_lock.owned)
__lock_sock(sk);
-   sk->sk_lock.owner = (void *)1;
+   sk->sk_lock.owned = 1;
spin_unlock(&sk->sk_lock.slock);
/*
 * The sk_lock has mutex_lock() semantics here:
@@ -1598,7 +1598,7 @@ void fastcall release_sock(struct sock *sk)
spin_lock_bh(&sk->sk_lock.slock);
if (sk->sk_backlog.tail)
__release_sock(sk);
-   sk->sk_lock.owner = NULL;
+   sk->sk_lock.owned = 0;
if (waitqueue_active(&sk->sk_lock.wq))
wake_up(&sk->sk_lock.wq);
spin_unlock_bh(&sk->sk_lock.slock);
-- 
1.5.3.rc7.30.g947ad2



[PATCH 2/2] [IPROUTE2] ss: parse bare integers as port numbers rather than IP addresses

2007-09-11 Thread John Heffner

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 misc/ss.c |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index 5d14f13..d617f6d 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -953,6 +953,10 @@ void *parse_hostcond(char *addr)
memset(&a, 0, sizeof(a));
a.port = -1;
 
+   /* Special case: integer by itself is considered a port number */
+   if (!get_integer(&a.port, addr, 0))
+   goto out;
+
if (fam == AF_UNIX || strncmp(addr, "unix:", 5) == 0) {
char *p;
a.addr.family = AF_UNIX;
-- 
1.5.3.rc4.29.g74276-dirty



[PATCH 1/2] [IPROUTE2] Add missing LIBUTIL for dependencies.

2007-09-11 Thread John Heffner

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 Makefile |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/Makefile b/Makefile
index af0d5e4..7e4605c 100644
--- a/Makefile
+++ b/Makefile
@@ -29,7 +29,8 @@ LDLIBS += -L../lib -lnetlink -lutil
 
 SUBDIRS=lib ip tc misc netem genl
 
-LIBNETLINK=../lib/libnetlink.a ../lib/libutil.a
+LIBUTIL=../lib/libutil.a
+LIBNETLINK=../lib/libnetlink.a $(LIBUTIL)
 
 all: Config
@set -e; \
-- 
1.5.3.rc4.29.g74276-dirty



Re: [PATCH] include listenq max/backlog in tcp_info and related reports - correct version/signorder

2007-09-17 Thread John Heffner
Any reason you're overloading tcpi_unacked and tcpi_sacked?  It seems 
that setting idiag_rqueue and idiag_wqueue are sufficient.


  -John


Rick Jones wrote:

Return some useful information such as the maximum listen backlog and the
current listen backlog in the tcp_info structure and have that match what
one can see in /proc/net/tcp, /proc/net/tcp6, and INET_DIAG_INFO.

Signed-off-by: Rick Jones <[EMAIL PROTECTED]>
Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]>
---

diff -r bdcdd0e1ee9d Documentation/networking/proc_net_tcp.txt
--- a/Documentation/networking/proc_net_tcp.txt Sat Sep 01 07:00:31 2007 +
+++ b/Documentation/networking/proc_net_tcp.txt Tue Sep 11 10:38:23 2007 -0700
@@ -20,8 +20,8 @@ up into 3 parts because of the length of
   || | |   |--> number of unrecovered RTO timeouts
   || | |--> number of jiffies until timer expires
   || |> timer_active (see below)
-  ||--> receive-queue
-  |---> transmit-queue
+  ||--> receive-queue or connection backlog
+  |---> transmit-queue or connection limit
 
10000 54165785 4 cd1e6040 25 4 27 3 -1
 |  || || |  | |  | |--> slow start size threshold, 
diff -r bdcdd0e1ee9d net/ipv4/tcp.c

--- a/net/ipv4/tcp.cSat Sep 01 07:00:31 2007 +
+++ b/net/ipv4/tcp.cTue Sep 11 10:38:23 2007 -0700
@@ -2030,8 +2030,14 @@ void tcp_get_info(struct sock *sk, struc
info->tcpi_snd_mss = tp->mss_cache;
info->tcpi_rcv_mss = icsk->icsk_ack.rcv_mss;
 
-	info->tcpi_unacked = tp->packets_out;

-   info->tcpi_sacked = tp->sacked_out;
+   if (sk->sk_state == TCP_LISTEN) {
+   info->tcpi_unacked = sk->sk_ack_backlog;
+   info->tcpi_sacked = sk->sk_max_ack_backlog;
+   }
+   else {
+   info->tcpi_unacked = tp->packets_out;
+   info->tcpi_sacked = tp->sacked_out;
+   }
info->tcpi_lost = tp->lost_out;
info->tcpi_retrans = tp->retrans_out;
info->tcpi_fackets = tp->fackets_out;
diff -r bdcdd0e1ee9d net/ipv4/tcp_diag.c
--- a/net/ipv4/tcp_diag.c   Sat Sep 01 07:00:31 2007 +
+++ b/net/ipv4/tcp_diag.c   Tue Sep 11 10:38:23 2007 -0700
@@ -25,11 +25,14 @@ static void tcp_diag_get_info(struct soc
const struct tcp_sock *tp = tcp_sk(sk);
struct tcp_info *info = _info;
 
-	if (sk->sk_state == TCP_LISTEN)

+   if (sk->sk_state == TCP_LISTEN) {
r->idiag_rqueue = sk->sk_ack_backlog;
-   else
+   r->idiag_wqueue = sk->sk_max_ack_backlog;
+   }
+   else {
r->idiag_rqueue = tp->rcv_nxt - tp->copied_seq;
-   r->idiag_wqueue = tp->write_seq - tp->snd_una;
+   r->idiag_wqueue = tp->write_seq - tp->snd_una;
+   }
if (info != NULL)
tcp_get_info(sk, info);
 }
diff -r bdcdd0e1ee9d net/ipv4/tcp_ipv4.c
--- a/net/ipv4/tcp_ipv4.c   Sat Sep 01 07:00:31 2007 +
+++ b/net/ipv4/tcp_ipv4.c   Tue Sep 11 10:38:23 2007 -0700
@@ -2320,7 +2320,8 @@ static void get_tcp4_sock(struct sock *s
sprintf(tmpbuf, "%4d: %08X:%04X %08X:%04X %02X %08X:%08X %02X:%08lX "
"%08X %5d %8d %lu %d %p %u %u %u %u %d",
i, src, srcp, dest, destp, sk->sk_state,
-   tp->write_seq - tp->snd_una,
+   sk->sk_state == TCP_LISTEN ? sk->sk_max_ack_backlog :
+(tp->write_seq - tp->snd_una),
sk->sk_state == TCP_LISTEN ? sk->sk_ack_backlog :
 (tp->rcv_nxt - tp->copied_seq),
timer_active,
diff -r bdcdd0e1ee9d net/ipv6/tcp_ipv6.c
--- a/net/ipv6/tcp_ipv6.c   Sat Sep 01 07:00:31 2007 +
+++ b/net/ipv6/tcp_ipv6.c   Tue Sep 11 10:38:23 2007 -0700
@@ -2005,8 +2005,10 @@ static void get_tcp6_sock(struct seq_fil
   dest->s6_addr32[0], dest->s6_addr32[1],
   dest->s6_addr32[2], dest->s6_addr32[3], destp,
   sp->sk_state,
-  tp->write_seq-tp->snd_una,
-  (sp->sk_state == TCP_LISTEN) ? sp->sk_ack_backlog : 
(tp->rcv_nxt - tp->copied_seq),
+  (sp->sk_state == TCP_LISTEN) ? sp->sk_max_ack_backlog:
+ tp->write_seq-tp->snd_una,
+		   (sp->sk_state == TCP_LISTEN) ? sp->sk_ack_backlog : 
+	(tp->rcv_nxt - tp->copied_seq),

   timer_active,
   jiffies_to_clock_t(timer_expires - jiffies),
   icsk->icsk_retransmits,

Re: [PATCH] include listenq max/backlog in tcp_info and related reports - correct version/signorder

2007-09-17 Thread John Heffner

Rick Jones wrote:

John Heffner wrote:
Any reason you're overloading tcpi_unacked and tcpi_sacked?  It seems 
that setting idiag_rqueue and idiag_wqueue are sufficient.


Different fields for different structures.   The tcp_info struct doesn't 
have the idiag_mumble, so to get the two values shown in /proc/net/tcp I 
use tcpi_unacked and tcpi_sacked.


For the INET_DIAG_INFO stuff the idiag_mumble fields are used and that 
then covers ss.


Maybe I'm missing something.  get_tcp[46]_sock() does not use struct 
tcp_info.  The only way I see using this is by doing 
getsockopt(TCP_INFO) on your listen socket.  Is this the intention?


  -John



[PATCH: 2.6.13-15-SMP 3/3] network: concurrently run softirq network code on SMP

2007-09-20 Thread john ye
Bottom Softirq Implementation. John Ye, 2007.08.27

Why this patch:
Make the kernel able to execute softirq network code concurrently on SMP 
systems.
This takes full advantage of SMP to handle more packets and greatly raises 
NIC throughput.
The current kernel's net packet processing logic is:
1) The CPU that handles a hardirq must also execute the related softirq.
2) One softirq instance (irqs handled by one CPU) can't run on more than 
one CPU at the same time.
These limitations make it hard for kernel networking to take advantage of SMP.

How this patch:
It splits the current softirq code into two parts: the cpu-sensitive top half
and the cpu-insensitive bottom half, then makes the bottom half (called BS)
execute concurrently across CPUs.
The two parts are not equal in size and load. The top part has constant code
size (mainly in net/core/dev.c and the NIC drivers), while the bottom part
involves netfilter (iptables), whose load varies greatly. An iptables setup
with 1000 rules to match will make the bottom part's load very high. So, if
the bottom-half softirq can be distributed across processors and run
concurrently on them, the network gains much more packet-handling capacity
and throughput increases remarkably.

Where useful:
It's useful on SMP machines that meet the following two conditions:
1) high kernel network load (for example, running iptables with thousands 
of rules);
2) more CPUs than active NICs (e.g. a 4-CPU machine with 2 NICs).
On these systems, as the softirq load increases, some CPUs sit idle
while others (equal in number to the NICs) stay busy.
IRQBALANCE helps, but it only shifts IRQs among CPUs and provides no softirq 
concurrency.
Balancing the load of each CPU will not remarkably increase network speed.

Where NOT useful:
If the bottom half of the softirq is too small (no iptables running), or the 
network is too idle, the BS patch will have no visible effect. But it has no
negative effect either.
Users can turn BS functionality on/off via the /proc/sys/net/bs_enable switch.

How to test:
On a Linux box, run iptables and add 2000 rules to the filter and nat tables 
to simulate a huge softirq load. Then open 20 ftp sessions downloading a big 
file. On another machine (which uses this test machine as its gateway), open 
20 more ftp download sessions. Compare the speed without BS enabled and with 
BS enabled.
cat /proc/sys/net/bs_enable: a switch to turn BS on/off
cat /proc/sys/net/bs_status: shows the usage of each CPU
Tests showed that when the bottom-softirq load is high, network throughput 
can be nearly doubled on a 2-CPU machine; hopefully it may be quadrupled on 
a 4-CPU Linux box.

Bugs:
It does NOT support CPU hotplug.
It only supports sequential CPU ids, starting from 0 up to num_online_cpus();
for example, 0,1,2,3 is OK; 0,1,8,9 is not.

Some considerations for the future:
1) With the BS patch, the irq-balance code in arch/i386/kernel/io_apic.c 
seems unnecessary, at least for network irqs.
2) Softirq load will become very small: it only runs the top half of the old 
softirq, which is much less expensive than the bottom half (the netfilter 
work). To let the top softirq process more packets, could these three 
network parameters be given larger values?
   extern int netdev_max_backlog = 1000;
   extern int netdev_budget = 300;
   extern int weight_p = 64;
3) BS currently runs on the built-in keventd thread; should we create new 
workqueues for it to run on?

Signed-off-by: John Ye (Seeker) <[EMAIL PROTECTED]>


--- old/net/ipv4/ip_input.c 2007-09-20 20:50:31.0 +0800
+++ new/net/ipv4/ip_input.c 2007-09-21 05:52:40.0 +0800
@@ -362,6 +362,198 @@
 return NET_RX_DROP;
 }

+
+#define CONFIG_BOTTOM_SOFTIRQ_SMP
+#define CONFIG_BOTTOM_SOFTIRQ_SMP_SYSCTL
+
+#ifdef CONFIG_BOTTOM_SOFTIRQ_SMP
+
+/*
+ *
+Bottom Softirq Implementation. John Ye, 2007.08.27
+
+Why this patch:
+Make kernel be able to concurrently execute softirq's net code on SMP 
system.
+Takes full advantages of SMP to handle more packets and greatly raises NIC 
throughput.
+The current kernel's net packet processing logic is:
+1) The CPU which handles a hardirq must be executing its related softirq.
+2) One softirq instance(irqs handled by 1 CPU) can't be executed on more 
than 2 CPUs
+at the same time.
+The limitation make kernel network be hard to take the advantages of SMP.
+
+How this patch:
+It splits the current softirq code into 2 parts: the cpu-sensitive top 
half,
+and the cpu-insensitive bottom half, then make bottom half(calld BS) be
+executed on SMP concurrently.
+The two parts are not equal in terms of size and load. Top part has 
constant code
+size(mainly, in net/core/dev.c and NIC drivers), while bottom part involves
+netfilter(iptables) whose load varies very much. An iptalbes with 1000 
rules to match
+will make the bottom part's load be very high. So, if the bottom part 
softirq
+can be randomly distributed to processor

Re: [PATCH: 2.6.13-15-SMP 3/3] network: concurrently run softirq network code on SMP

2007-09-21 Thread John Ye
David,

Thanks for your reply. I understand it's not worth doing.

I have made it a loadable module to fulfill the function; it is mainly for 
busy NAT gateway servers with SMP, to speed them up.

John Ye



- Original Message -
From: "David Miller" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Cc: ; <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>;
<[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Friday, September 21, 2007 1:46 AM
Subject: Re: [PATCH: 2.6.13-15-SMP 3/3] network: concurrently run softirq
network code on SMP


>
> The whole reason the queues are per-cpu is so that we do not
> have to touch remote processor state nor use locks of any
> kind whatsoever.
>
> With multi-queue networking cards becoming more and more
> available, which will split up the packet workload in
> hardware across all available cpus, there is less and less
> reason to make a patch like this one.
>
> We've known about this issue for ages, and if we felt it
> was appropriate to make this change, we would have done
> so years ago.
>




want same order in /sys/class/net/eth as /sys/bus/pci/devices

2007-09-21 Thread John Reiser
I'd like to see the same order of devices in /sys/class/net/eth*
as in /sys/bus/pci/devices.  This would make administration easier.
On Fedora 8 tests, the order I see is reversed:
  http://bugzilla.redhat.com/show_bug.cgi?id=291431

Perhaps the reversal is a result of the alias order listed in
/etc/modprobe.conf.  But the alias order was obtained from some
source.  Was the first reversal due to a user-space program
(such as the anaconda installer), or due to something within
the kernel?

-- 
John Reiser, [EMAIL PROTECTED]


Re: [PATCH: 2.6.13-15-SMP 3/3] network: concurrently run softirqnetwork code on SMP

2007-09-22 Thread john ye
Dear Jamal,

Sorry, I sent you all a badly formatted mail.
Thanks for the instructions and corrections from you all.

I had thought that packet re-ordering, as seen by the TCP layer above, would 
become more severe and make the network even slower.

I currently select a random CPU to dispatch the skb to. Previously, I 
dispatched skbs evenly to all CPUs (round robin, one by one), but I couldn't 
find a fast way to code it; for_each_online_cpu is not quick enough.

According to my test results, it did double packet INPUT speed, because 
another CPU is used concurrently.
The packets seem to keep a "rough ordering" after turning on the BS patch.

The test is simple: use 2400 lines of iptables -t filter -A INPUT -p 
tcp -s x.x.x.x --dport yy -j .
These rules make the current softirq very busy on one CPU and make incoming 
traffic very slow. After turning on BS, the speed doubled.

For the NAT test, I didn't get a result as good as INPUT, because of 
real-environment limitations.
The test is very basic and is far from "full".

It seems to me that the cross-CPU spinlock for the queue doesn't cost much 
and is acceptable in terms of CPU time, compared with the gains from having 
other CPUs join in the work.

I have made the BS patch into a loadable module 
(http://linux.chinaunix.net/bbs/thread-909725-2-1.html) so others can help 
with testing.

John Ye


- Original Message - 
From: "jamal" <[EMAIL PROTECTED]>
To: "John Ye" <[EMAIL PROTECTED]>
Cc: "David Miller" <[EMAIL PROTECTED]>; ; 
<[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>; 
<[EMAIL PROTECTED]>
Sent: Friday, September 21, 2007 7:43 PM
Subject: Re: [PATCH: 2.6.13-15-SMP 3/3] network: concurrently run 
softirqnetwork code on SMP


> On Fri, 2007-21-09 at 17:25 +0800, John Ye wrote:
>> David,
>>
>> Thanks for your reply. I understand it's not worth to do.
>>
>> I have made it a loadable module to fulfill the function. it mainly for 
>> busy
>> NAT gateway server with SMP to speed up.
>>
>
> John,
>
> It was a little hard to read your code; however, it does seems to me
> like will cause a massive amount of packet reordering to the end hosts
> using you as the gateway especially when it is receiving a lot of
> packets/second.
> You have a queue per CPU that connects your bottom and top half and
> several CPUs that may service a single NIC in your bottom half.
> one cpu in either bottom/top half has to be slightly loaded and you
> loose the ordering where incoming doesnt match outgoing packet order.
>
> cheers,
> jamal
>
> 




Re: [PATCH: 2.6.13-15-SMP 3/3] network: concurrently runsoftirqnetwork code on SMP

2007-09-23 Thread john ye
Dear Jamal,

Yes, you are right. I do "need some real fast traffic generator; possibly 
one that can do thousands of tcp sessions" to get a convincing result.

Packet reordering is also a big concern of mine; round robin doesn't help 
much.

"The INPUT speed is doubled by using 2 CPUs" is shown by these steps:
1) Without iptables, ftp-get a 50 MB file from another machine; ftp shows a 
speed of 10 MB/s.
2) Run iptables and add many iptables rules, then ftp-get the same file; the 
speed drops to 3 MB/s, and top shows CPU0 busy in softirq while CPU1 is idle.
3) insmod my BS module, then ftp-get the same file; the speed reaches 
6 MB/s, and top shows both CPU0 and CPU1 busy in keventd/0/1.

I will try my best to do further tests. The best test would be on a 4-CPU 
gateway machine. In China, for example, many companies use a Linux box 
running iptables as a gateway serving around 1000 clients. Those machines do 
a lot of conntracking and have the "idle CPUs while the network is too busy" 
problem.

In my BS module (if you got it), only two functions need to be read: 
REP_ip_rcv() and bs_func(). The others have nothing to do with the BS patch; 
they are there only for accessing kernel variables that are not 
EXPORT_SYMBOLed.

Thanks a lot for your thought.

John Ye


- Original Message - 
From: "jamal" <[EMAIL PROTECTED]>
To: "john ye" <[EMAIL PROTECTED]>
Cc: "David Miller" <[EMAIL PROTECTED]>; ; 
<[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>; 
<[EMAIL PROTECTED]>
Sent: Sunday, September 23, 2007 8:43 PM
Subject: Re: [PATCH: 2.6.13-15-SMP 3/3] network: concurrently 
runsoftirqnetwork code on SMP


> On Sun, 2007-23-09 at 12:45 +0800, john ye wrote:
>
>>  I do randomly select a CPU to dispatch the skb to. Previously, I
>> dispatch
>>  skb evenly to all CPUs( round robin, one by one). but I didn't find a
>> quick
>>  coding. for_each_online_cpu is not quick enough.
>
> for_each_online_cpu doenst look that expensive - but even round robin
> wont fix the reordering problem. What you need to do is make sure that a
> flow always goes to the same cpu over some period of time.
>
>>  According to my test result, it did make packet INPUT speed doubled
>> because
>>  another CPU is used concurrently.
>
> How did you measure "speed" - was it throughput? Did you measure how
> much cpu was being utilized?
>
>>  It seems the packets still keep "roughly ordering" after turning on
>> BS patch.
>
> Linux TCP is very resilient to reordering compared to other OSes, but
> even then if you hit it with enough packets it is going to start
> sweating it.
>
>>  The test is simple: use an 2400 lines of iptables -t filter -A INPUT
>> -p
>>  tcp -s x.x.x.x --dport yy -j .
>>  these rules make the current softirq be very busy on one CPU and make
>> the
>>  incoming net very slow. after turning on BS, the speed doubled.
>>
> Ok, but how do you observe "doubled"?
> Do you have conntrack on? It maybe that what you have just found is
> netfilter needs to have its work defered from packet rcv.
> You need some real fast traffic generator; possibly one that can do
> thousands of tcp sessions.
>
>>  For NAT test, I didn't get a good result like INPUT because real
>> environment limitation.
>>  The test is very basic and is far from "full".
>
> What happens when you totally compile out netfilter and you just use
> this machine as a server?
>
>>  It seems to me that the cross-cpu spinlock_ for the queue doesn't
>> have
>>  big cost and is allowable in terms of CPU time consumption, compared
>> with
>>  the gains by making other CPUs joint in the work.
>>
>>  I have made BS patch into a loadable module.
>>  http://linux.chinaunix.net/bbs/thread-909725-2-1.html and let others
>> help with testing.
>
> It is still very hard to read; and i am not sure how you are going to
> get the performance you claim eventually - you are registering as a tap
> for ip packets, which means you will process two of each incoming
> packets.
>
> cheers,
> jamal
>
> 




Re: [PATCH: 2.6.13-15-SMP 3/3] network: concurrentlyrunsoftirqnetwork code on SMP

2007-09-23 Thread John Ye
Dear Jamal,

Thanks, bothered you all.

I will look into the two issues, re-ordering and the spinlock, and do 
extensive testing.
Once I have results, positive or negative, I will contact you.
The formatting will not be a mess any more.

John Ye

- Original Message -
From: "jamal" <[EMAIL PROTECTED]>
To: "john ye" <[EMAIL PROTECTED]>
Cc: "David Miller" <[EMAIL PROTECTED]>; ;
<[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>;
<[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Monday, September 24, 2007 2:07 AM
Subject: Re: [PATCH: 2.6.13-15-SMP 3/3] network:
concurrentlyrunsoftirqnetwork code on SMP


> John,
> It will NEVER be an acceptable solution as long as you have re-ordering.
> I will look at it - but i have to run out for now. In the meantime,
> I have indented it for you to be in proper kernel format so others can
> also look it. Attached.
>
> cheers,
> jamal
>
>




Re: [PATCH: 2.6.13-15-SMP 3/3] network: concurrently runsoftirqnetwork code on SMP

2007-09-25 Thread john ye
Jamal,

You pointed out a key point: it's NOT acceptable if massive packet 
re-ordering can't be avoided.
I debugged the function tcp_ofo_queue in net/ipv4/tcp_input.c and monitored 
out_of_order_queue, and found that re-ordering becomes unacceptable as the 
softirq load grows.

It's simple to avoid out-of-order packets by changing the random dispatch 
into a dispatch based on source IP address,
e.g. cpu = iph->saddr % nr_cpus, where cpu acts like a hash bucket.
Considering that the BS patch is mainly used on servers with many incoming 
connections, dispatching by IP should balance CPU load well.

The test is under way; it's not bad so far.
The queue spin_lock doesn't seem to cost much.

Below is the bcpp-beautified module code. Last time the code mess was caused 
by Outlook Express, which killed the tabs.
Thanks.

John Ye



/*
 *  BOTTOM_SOFTIRQ_NET
 *  An implementation of bottom softirq concurrent execution on SMP
 *  This is implemented by splitting current net softirq into top 
half
 *  and bottom half, dispatch the bottom half to each cpu's 
workqueue.
 *  Hopefully, it can raise the throughput of the NIC when running iptables
 *  on an SMP machine.
 *
 *  Version:$Id: bs_smp.c, v 2.6.13-15 for kernel 2.6.13-15-smp
 *
 *  Authors:John Ye & QianYu Ye, 2007.08.27
 */

/* #include directives elided: the header names were lost in the list archive */

static spinlock_t *p_ptype_lock;
static struct list_head *p_ptype_base;/* 16 way hashed list */

int (*Pip_options_rcv_srr)(struct sk_buff *skb);
int (*Pnf_rcv_postxfrm_nonlocal)(struct sk_buff *skb);
struct ip_rt_acct *ip_rt_acct;
struct ipv4_devconf *Pipv4_devconf;

#define ipv4_devconf (*Pipv4_devconf)
//#define ip_rt_acct Pip_rt_acct
#define ip_options_rcv_srr Pip_options_rcv_srr
#define nf_rcv_postxfrm_nonlocal Pnf_rcv_postxfrm_nonlocal
//extern int nf_rcv_postxfrm_local(struct sk_buff *skb);
//extern int ip_options_rcv_srr(struct sk_buff *skb);
static struct workqueue_struct **Pkeventd_wq;
#define keventd_wq (*Pkeventd_wq)

#define INSERT_CODE_HERE

static inline int ip_rcv_finish(struct sk_buff *skb)
{
struct net_device *dev = skb->dev;
struct iphdr *iph = skb->nh.iph;
int err;

/*
 * Initialise the virtual path cache for the packet. It describes
 * how the packet travels inside Linux networking.
 */
if (skb->dst == NULL)
{
if ((err = ip_route_input(skb, iph->daddr, iph->saddr, 
iph->tos, dev)))
{
if (err == -EHOSTUNREACH)
IP_INC_STATS_BH(IPSTATS_MIB_INADDRERRORS);
goto drop;
}
}

if (nf_xfrm_nonlocal_done(skb))
return nf_rcv_postxfrm_nonlocal(skb);

#ifdef CONFIG_NET_CLS_ROUTE
if (skb->dst->tclassid)
{
struct ip_rt_acct *st = ip_rt_acct + 256*smp_processor_id();
u32 idx = skb->dst->tclassid;
st[idx&0xFF].o_packets++;
st[idx&0xFF].o_bytes+=skb->len;
st[(idx>>16)&0xFF].i_packets++;
st[(idx>>16)&0xFF].i_bytes+=skb->len;
}
#endif

if (iph->ihl > 5)
{
struct ip_options *opt;

/* It looks as overkill, because not all
   IP options require packet mangling.
   But it is the easiest for now, especially taking
   into account that combination of IP options
   and running sniffer is extremely rare condition.
  --ANK (980813)
*/

if (skb_cow(skb, skb_headroom(skb)))
{
IP_INC_STATS_BH(IPSTATS_MIB_INDISCARDS);
goto drop;
}
iph = skb->nh.iph;

if (ip_options_compile(NULL, skb))
goto inhdr_error;

opt = &(IPCB(skb)->opt);
if (opt->srr)
{
struct in_device *in_dev = in_dev_get(dev);
   

Re: [PATCH: 2.6.13-15-SMP 3/3] network: concurrentlyrunsoftirqnetwork code on SMP

2007-09-25 Thread John Ye
Jamal & Stephen,

I found the RSS hash paper you mentioned and have browsed it briefly.
The issue that it "may end up sending all your packets to one cpu" might be
dealt with by
a cpu hash (srcip + dstip) % nr_cpus, plus checking cpu balance periodically
and shifting the cpu by an extra seed value.

Anyway, the cpu hash code must not be too expensive, because every incoming
packet hits this path.

We are going to do further study on this RSS thing.

__do_IRQ has a tendency to collect the same IRQ from different CPUs onto one
CPU when the NIC is busy (via the IRQ_PENDING & IRQ_INPROGRESS mechanism), so
dispatching the load across SMP here may be a good thing(?).

Thanks.

John Ye


- Original Message -
From: "jamal" <[EMAIL PROTECTED]>
To: "Stephen Hemminger" <[EMAIL PROTECTED]>
Cc: "john ye" <[EMAIL PROTECTED]>; "David Miller" <[EMAIL PROTECTED]>;
; <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>;
<[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Wednesday, September 26, 2007 6:22 AM
Subject: Re: [PATCH: 2.6.13-15-SMP 3/3] network:
concurrentlyrunsoftirqnetwork code on SMP


> On Tue, 2007-25-09 at 09:03 -0700, Stephen Hemminger wrote:
>
> > There is a standard hash called RSS, that many drivers support because
it is
> > used by other operating systems.
>
> I think any stateless/simple thing will do (something along the lines
> what 802.1ad does for trunk, a 5 classical five tuple etc).
>
> Having solved the reordering problem in such a stateless way introduces
> a loadbalancing setback; you may end sending all your packets to one cpu
> (a problem Mr Ye didnt have when he was re-orderding ;->).
>
> cheers,
> jamal
>
>


-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Make TCP prequeue configurable

2007-09-27 Thread John Heffner

Stephen Hemminger wrote:

On Fri, 28 Sep 2007 00:08:33 +0200
Eric Dumazet <[EMAIL PROTECTED]> wrote:


Hi all

I am sure some of you are going to tell me that prequeue is not
all black :)

Thank you

[RFC] Make TCP prequeue configurable

The TCP prequeue thing is based on old facts, and has drawbacks.

1) It adds 48 bytes per 'struct tcp_sock'
2) It adds some ugly code in hot paths
3) It has a small hit ratio on typical servers using many sockets
4) It may have a high hit ratio on UP machines running one process,
where the prequeue adds little gain. (In fact, letting the user
do the copy after being woken up is better for cache reuse.)
5) Doing a copy to user space in the softirq handler is not good, because of
potential page faults :(
6) Maybe the NET_DMA thing is the only thing that might need prequeue.

This patch introduces a CONFIG_TCP_PREQUEUE, automatically selected if 
CONFIG_NET_DMA is on.


Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]>



Rather than having two more compile cases and test cases to deal
with: if you can prove it is useless, make a case for killing
it completely.



I think it really does help in case (4) with old NICs that don't do rx 
checksumming.  I'm not sure how many people really care about this 
anymore, but probably some...?


OTOH, it would be nice to get rid of sysctl_tcp_low_latency.

  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SWS for rcvbuf < MTU

2007-03-13 Thread John Heffner

Alex Sidorenko wrote:
Here are the values from live kernel (obtained with 'crash') when the host was 
in SWS state:


full_space=708  full_space/2=354
free_space=393
window=76

In this case the test from my original fix, (window < full_space/2),  
succeeds. But John's test


free_space > window + full_space/2
393 > 76 + 354 = 430  (false)

does not. So I suspect that the new fix will not always work. From tcpdump 
traces we can see that both hosts exchange with 76-byte packets for a long 
time. From customer's application log we see that it continues to read 
76-byte chunks per each read() call - even though more than that is available 
in the receive buffer. Technically it's OK for read() to return even after 
reading one byte, so if sk->receive_queue contains multiple 76-byte skbuffs 
we may return after processing just one skbuff (but we don't understand
the details of why this happens on the customer's system).


Are there any particular reasons why you want to postpone window update until 
free_space becomes > window + full_space/2 and not as soon as 
free_space > full_space/2? As the only real-life occurrence of SWS shows
free_space oscillating slightly above full_space/2, I created the fix
specifically to match this phenomenon as seen on the customer's host. We reach the
modified section only when (free_space > full_space/2) so it should be OK to 
update the window at this point if mss==full_space. 

So yes, we can test John's fix on customer's host but I doubt it will work for 
the reasons mentioned above, in brief:


'window = free_space' instead of 'window=full_space/2' is OK,
but the test 'free_space > window + full_space/2' is not for the specific 
pattern customer sees on his hosts.



Sorry for the long delay in response, I've been on vacation.  I'm okay 
with your patch, and I can't think of any real problem with it, except 
that the behavior is non-standard.  Then again, Linux acking in general 
is non-standard, which has created the bug in the first place. :)  The 
only thing I can think where it might still ack too often is if 
free_space frequently drops just below full_space/2 for a bit then rises 
above full_space/2.


I've also attached a corrected version of my earlier patch that I think 
solves the problem you noted.


Thanks,
  -John
Do full receiver-side SWS avoidance when rcvbuf < mss.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>

---
commit f4333661026621e15549fb75b37be785e4a1c443
tree 30d46b64ea19634875fdd4656d33f76db526a313
parent 562aa1d4c6a874373f9a48ac184f662fbbb06a04
author John Heffner <[EMAIL PROTECTED]> Tue, 13 Mar 2007 14:17:03 -0400
committer John Heffner <[EMAIL PROTECTED]> Tue, 13 Mar 2007 14:17:03 -0400

 net/ipv4/tcp_output.c |9 -
 1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index dc15113..e621a63 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1605,8 +1605,15 @@ u32 __tcp_select_window(struct sock *sk)
 * We also don't do any window rounding when the free space
 * is too small.
 */
-   if (window <= free_space - mss || window > free_space)
+   if (window <= free_space - mss || window > free_space) {
window = (free_space/mss)*mss;
+   } else if (mss == full_space) {
+   /* Do full receive-side SWS avoidance
+* when rcvbuf <= mss */
+   window = tcp_receive_window(tp);
+   if (free_space > window + full_space/2)
+   window = free_space;
+   }
}
 
return window;


[PATCH] tcp_mem initialization

2007-03-14 Thread John Heffner
The current tcp_mem initialization gives values that are really too 
small for systems with ~256-768 MB of memory, and also for systems with 
larger page sizes (ia64).  This patch gives an alternate method of 
initialization that doesn't depend on the cache allocation functions, 
but I think it should still provide a nice curve that gives a smaller
fraction of total memory on small-memory systems, while maintaining
the same upper bound (pressure at 1/2, max at 3/4) on larger memory systems.


  -John

Change tcp_mem initialization function.  The fraction of total memory is now
a continuous function of memory size, and independent of page size.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>

---
commit a4461a36efb376bf01399cfd6f1ad15dc89a8794
tree 23b2fb9da52b45de8008fc7ea6bb8c10e3a3724b
parent 8b9909ded6922c33c221b105b26917780cfa497d
author John Heffner <[EMAIL PROTECTED]> Wed, 14 Mar 2007 17:15:06 -0400
committer John Heffner <[EMAIL PROTECTED]> Wed, 14 Mar 2007 17:15:06 -0400

 net/ipv4/tcp.c |   13 ++---
 1 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 74c4d10..3834b10 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2458,11 +2458,18 @@ void __init tcp_init(void)
sysctl_max_syn_backlog = 128;
}
 
-   /* Allow no more than 3/4 kernel memory (usually less) allocated to TCP 
*/
-   sysctl_tcp_mem[0] = (1536 / sizeof (struct inet_bind_hashbucket)) << 
order;
-   sysctl_tcp_mem[1] = sysctl_tcp_mem[0] * 4 / 3;
+   /* Set the pressure threshold to be a fraction of global memory that
+* is up to 1/2 at 256 MB, decreasing toward zero with the amount of
+* memory, with a floor of 128 pages.
+*/
+   limit = min(nr_all_pages, 1UL<<(28-PAGE_SHIFT)) >> (20-PAGE_SHIFT);
+   limit = (limit * (nr_all_pages >> (20-PAGE_SHIFT))) >> (PAGE_SHIFT-11);
+   limit = max(limit, 128UL);
+   sysctl_tcp_mem[0] = limit / 4 * 3;
+   sysctl_tcp_mem[1] = limit;
sysctl_tcp_mem[2] = sysctl_tcp_mem[0] * 2;
 
+   /* Set per-socket limits to no more than 1/128 the pressure threshold */
limit = ((unsigned long)sysctl_tcp_mem[1]) << (PAGE_SHIFT - 7);
max_share = min(4UL*1024*1024, limit);
 


Re: [PATCH] tcp_mem initialization

2007-03-15 Thread John Heffner

David Miller wrote:

From: John Heffner <[EMAIL PROTECTED]>
Date: Wed, 14 Mar 2007 17:25:22 -0400

The current tcp_mem initialization gives values that are really too 
small for systems with ~256-768 MB of memory, and also for systems with 
larger page sizes (ia64).  This patch gives an alternate method of 
initialization that doesn't depend on the cache allocation functions, 
but I think should still provide a nice curve that gives a smaller 
fraction of total memory with small-memory systems, while maintaining 
the same upper bound (pressure at 1/2, max as 3/4) on larger memory systems.


Indeed, it's really dumb for any of these calculations to be
dependent upon the page size.

Your patch looks good, and I'll review it further tomorrow and
push upstream unless I find some issues with it.

Thanks John.



The way it's coded is somewhat opaque since it has to be done with 
32-bit integer arithmetic.  These plots might help make the motivation 
behind the code a little clearer.


Thanks,
  -John




[PATCH 0/3] [NET] MTU discovery changes

2007-03-23 Thread John Heffner
These are a few changes to fix/clean up some of the MTU discovery 
processing with non-stream sockets, and add a probing mode.  See also 
matching patches to tracepath to take advantage of this.


  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/3] [NET] Do pmtu check in transport layer

2007-03-23 Thread John Heffner
Do the pmtu check at the transport layer (for UDP, ICMP and raw), and
send a local error if socket is PMTUDISC_DO and packet is too big.  This is
actually a pure bugfix for ipv6.  For ipv4, it allows us to do pmtu checks
in the same way as for ipv6.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 net/ipv4/ip_output.c  |4 +++-
 net/ipv4/raw.c|8 +---
 net/ipv6/ip6_output.c |   11 ++-
 net/ipv6/raw.c|7 +--
 4 files changed, 19 insertions(+), 11 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index d096332..593acf7 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -822,7 +822,9 @@ int ip_append_data(struct sock *sk,
fragheaderlen = sizeof(struct iphdr) + (opt ? opt->optlen : 0);
maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen;
 
-   if (inet->cork.length + length > 0xFFFF - fragheaderlen) {
+   if (inet->cork.length + length > 0xFFFF - fragheaderlen ||
+   (inet->pmtudisc >= IP_PMTUDISC_DO &&
+inet->cork.length + length > mtu)) {
ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->dport, mtu-exthdrlen);
return -EMSGSIZE;
return -EMSGSIZE;
}
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 87e9c16..f252f4e 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -271,10 +271,12 @@ static int raw_send_hdrinc(struct sock *sk, void *from, 
size_t length,
struct iphdr *iph;
struct sk_buff *skb;
int err;
+   int mtu;
 
-   if (length > rt->u.dst.dev->mtu) {
-   ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->dport,
-  rt->u.dst.dev->mtu);
+   mtu = inet->pmtudisc == IP_PMTUDISC_DO ? dst_mtu(&rt->u.dst) :
+rt->u.dst.dev->mtu;
+   if (length > mtu) {
+   ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->dport, mtu);
return -EMSGSIZE;
}
if (flags&MSG_PROBE)
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 3055169..711dfc3 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1044,11 +1044,12 @@ int ip6_append_data(struct sock *sk, int getfrag(void 
*from, char *to,
fragheaderlen = sizeof(struct ipv6hdr) + rt->u.dst.nfheader_len + (opt 
? opt->opt_nflen : 0);
maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen - 
sizeof(struct frag_hdr);
 
-   if (mtu <= sizeof(struct ipv6hdr) + IPV6_MAXPLEN) {
-   if (inet->cork.length + length > sizeof(struct ipv6hdr) + 
IPV6_MAXPLEN - fragheaderlen) {
-   ipv6_local_error(sk, EMSGSIZE, fl, mtu-exthdrlen);
-   return -EMSGSIZE;
-   }
+   if ((mtu <= sizeof(struct ipv6hdr) + IPV6_MAXPLEN &&
+inet->cork.length + length > sizeof(struct ipv6hdr) + IPV6_MAXPLEN 
- fragheaderlen) ||
+   (np->pmtudisc >= IPV6_PMTUDISC_DO &&
+inet->cork.length + length > mtu)) {
+   ipv6_local_error(sk, EMSGSIZE, fl, mtu-exthdrlen);
+   return -EMSGSIZE;
}
 
/*
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 306d5d8..75db277 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -556,9 +556,12 @@ static int rawv6_send_hdrinc(struct sock *sk, void *from, 
int length,
struct sk_buff *skb;
unsigned int hh_len;
int err;
+   int mtu;
 
-   if (length > rt->u.dst.dev->mtu) {
-   ipv6_local_error(sk, EMSGSIZE, fl, rt->u.dst.dev->mtu);
+   mtu = np->pmtudisc == IPV6_PMTUDISC_DO ? dst_mtu(&rt->u.dst) :
+rt->u.dst.dev->mtu;
+   if (length > mtu) {
+   ipv6_local_error(sk, EMSGSIZE, fl, mtu);
return -EMSGSIZE;
}
if (flags&MSG_PROBE)
-- 
1.5.0.2.gc260-dirty

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/3] [NET] Move DF check to ip_forward

2007-03-23 Thread John Heffner
Do fragmentation check in ip_forward, similar to ipv6 forwarding.  Also add
a debug printk in the DF check in ip_fragment since we should now never
reach it.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 net/ipv4/ip_forward.c |8 
 net/ipv4/ip_output.c  |2 ++
 2 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c
index 369e721..0efb1f5 100644
--- a/net/ipv4/ip_forward.c
+++ b/net/ipv4/ip_forward.c
@@ -85,6 +85,14 @@ int ip_forward(struct sk_buff *skb)
if (opt->is_strictroute && rt->rt_dst != rt->rt_gateway)
goto sr_failed;
 
+   if (unlikely(skb->len > dst_mtu(&rt->u.dst) &&
+(skb->nh.iph->frag_off & htons(IP_DF))) && !skb->local_df) {
+   IP_INC_STATS(IPSTATS_MIB_FRAGFAILS);
+   icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
+ htonl(dst_mtu(&rt->u.dst)));
+   goto drop;
+   }
+
/* We are about to mangle packet. Copy it! */
if (skb_cow(skb, LL_RESERVED_SPACE(rt->u.dst.dev)+rt->u.dst.header_len))
goto drop;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 593acf7..90bdd53 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -433,6 +433,8 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct 
sk_buff*))
iph = skb->nh.iph;
 
if (unlikely((iph->frag_off & htons(IP_DF)) && !skb->local_df)) {
+   if (net_ratelimit())
+   printk(KERN_DEBUG "ip_fragment: requested fragment of packet with DF set\n");
IP_INC_STATS(IPSTATS_MIB_FRAGFAILS);
icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
  htonl(dst_mtu(&rt->u.dst)));
-- 
1.5.0.2.gc260-dirty

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/3] [NET] Add IP(V6)_PMTUDISC_PROBE

2007-03-23 Thread John Heffner
Add IP(V6)_PMTUDISC_PROBE value for IP(V6)_MTU_DISCOVER.  This option forces
us not to fragment, but does not make use of the kernel path MTU discovery. 
That is, it allows for user-mode MTU probing (or, packetization-layer path
MTU discovery).  This is particularly useful for diagnostic utilities, like
traceroute/tracepath.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 include/linux/in.h   |1 +
 include/linux/in6.h  |1 +
 include/linux/skbuff.h   |3 ++-
 include/net/ip.h |2 +-
 net/core/skbuff.c|2 ++
 net/ipv4/ip_output.c |   14 ++
 net/ipv4/ip_sockglue.c   |2 +-
 net/ipv4/raw.c   |3 +++
 net/ipv6/ip6_output.c|   12 
 net/ipv6/ipv6_sockglue.c |2 +-
 net/ipv6/raw.c   |3 +++
 11 files changed, 33 insertions(+), 12 deletions(-)

diff --git a/include/linux/in.h b/include/linux/in.h
index 1912e7c..2dc1f8a 100644
--- a/include/linux/in.h
+++ b/include/linux/in.h
@@ -83,6 +83,7 @@ struct in_addr {
 #define IP_PMTUDISC_DONT   0   /* Never send DF frames */
 #define IP_PMTUDISC_WANT   1   /* Use per route hints  */
 #define IP_PMTUDISC_DO 2   /* Always DF*/
+#define IP_PMTUDISC_PROBE  3   /* Ignore dst pmtu  */
 
 #define IP_MULTICAST_IF32
 #define IP_MULTICAST_TTL   33
diff --git a/include/linux/in6.h b/include/linux/in6.h
index 4e8350a..d559fac 100644
--- a/include/linux/in6.h
+++ b/include/linux/in6.h
@@ -179,6 +179,7 @@ struct in6_flowlabel_req
 #define IPV6_PMTUDISC_DONT 0
 #define IPV6_PMTUDISC_WANT 1
 #define IPV6_PMTUDISC_DO   2
+#define IPV6_PMTUDISC_PROBE3
 
 /* Flowlabel */
 #define IPV6_FLOWLABEL_MGR 32
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 4ff3940..64038b4 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -284,7 +284,8 @@ struct sk_buff {
nfctinfo:3;
__u8pkt_type:3,
fclone:2,
-   ipvs_property:1;
+   ipvs_property:1,
+   ign_dst_mtu:1;
__be16  protocol;
 
void(*destructor)(struct sk_buff *skb);
diff --git a/include/net/ip.h b/include/net/ip.h
index e79c3e3..f5874a3 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -201,7 +201,7 @@ int ip_decrease_ttl(struct iphdr *iph)
 static inline
 int ip_dont_fragment(struct sock *sk, struct dst_entry *dst)
 {
-   return (inet_sk(sk)->pmtudisc == IP_PMTUDISC_DO ||
+   return (inet_sk(sk)->pmtudisc >= IP_PMTUDISC_DO ||
(inet_sk(sk)->pmtudisc == IP_PMTUDISC_WANT &&
 !(dst_metric(dst, RTAX_LOCK)&(1<destructor = NULL;
C(mark);
@@ -549,6 +550,7 @@ static void copy_skb_header(struct sk_buff *new, const 
struct sk_buff *old)
 #if defined(CONFIG_IP_VS) || defined(CONFIG_IP_VS_MODULE)
new->ipvs_property = old->ipvs_property;
 #endif
+   new->ign_dst_mtu= old->ign_dst_mtu;
 #ifdef CONFIG_BRIDGE_NETFILTER
new->nf_bridge  = old->nf_bridge;
nf_bridge_get(old->nf_bridge);
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 90bdd53..a7e8944 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -201,7 +201,8 @@ static inline int ip_finish_output(struct sk_buff *skb)
return dst_output(skb);
}
 #endif
-   if (skb->len > dst_mtu(skb->dst) && !skb_is_gso(skb))
+   if (skb->len > dst_mtu(skb->dst) &&
+   !skb->ign_dst_mtu && !skb_is_gso(skb))
return ip_fragment(skb, ip_finish_output2);
else
return ip_finish_output2(skb);
@@ -801,7 +802,9 @@ int ip_append_data(struct sock *sk,
inet->cork.addr = ipc->addr;
}
dst_hold(&rt->u.dst);
-   inet->cork.fragsize = mtu = dst_mtu(rt->u.dst.path);
+   inet->cork.fragsize = mtu = inet->pmtudisc == IP_PMTUDISC_PROBE 
?
+   rt->u.dst.dev->mtu :
+   dst_mtu(rt->u.dst.path);
inet->cork.rt = rt;
inet->cork.length = 0;
sk->sk_sndmsg_page = NULL;
@@ -1220,13 +1223,16 @@ int ip_push_pending_frames(struct sock *sk)
 * to fragment the frame generated here. No matter, what transforms
 * how transforms change size of the packet, it will come out.
 */
-   if (inet->pmtudisc != IP_PMTUDISC_DO)
+   if (inet->pmtudisc < IP_PMTUDISC_DO)
skb->local_df = 1;
 
+   if (inet->pmtudisc == IP_PMTUDISC_PROBE)
+   s

[PATCH 0/2] [iputils] MTU discovery changes

2007-03-23 Thread John Heffner
These add some changes that make tracepath a little more useful for 
diagnosing MTU issues.  The length flag helps distinguish between MTU 
black holes and other types of black holes by allowing you to vary the 
probe packet lengths.  Using PMTUDISC_PROBE gives you the same results 
on each run without having to flush the route cache, so you can see 
where MTU changes in the path actually occur.


Whether the PMTUDISC_PROBE patch goes in should be conditional on whether the
corresponding kernel patch (just sent) goes in.


  -John
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] [iputils] Use PMTUDISC_PROBE mode if it exists.

2007-03-23 Thread John Heffner
Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 tracepath.c  |   10 --
 tracepath6.c |   10 --
 2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/tracepath.c b/tracepath.c
index 1f901ba..a562d88 100644
--- a/tracepath.c
+++ b/tracepath.c
@@ -24,6 +24,10 @@
 #include 
 #include 
 
+#ifndef IP_PMTUDISC_PROBE
+#define IP_PMTUDISC_PROBE  3
+#endif
+
 struct hhistory
 {
int hops;
@@ -322,8 +326,10 @@ main(int argc, char **argv)
}
memcpy(&target.sin_addr, he->h_addr, 4);
 
-   on = IP_PMTUDISC_DO;
-   if (setsockopt(fd, SOL_IP, IP_MTU_DISCOVER, &on, sizeof(on))) {
+   on = IP_PMTUDISC_PROBE;
+   if (setsockopt(fd, SOL_IP, IP_MTU_DISCOVER, &on, sizeof(on)) &&
+   (on = IP_PMTUDISC_DO,
setsockopt(fd, SOL_IP, IP_MTU_DISCOVER, &on, sizeof(on)))) {
perror("IP_MTU_DISCOVER");
exit(1);
}
diff --git a/tracepath6.c b/tracepath6.c
index d65230d..6f13a51 100644
--- a/tracepath6.c
+++ b/tracepath6.c
@@ -30,6 +30,10 @@
 #define SOL_IPV6 IPPROTO_IPV6
 #endif
 
+#ifndef IPV6_PMTUDISC_PROBE
+#define IPV6_PMTUDISC_PROBE3
+#endif
+
 int overhead = 48;
 int mtu = 128000;
 int hops_to = -1;
@@ -369,8 +373,10 @@ int main(int argc, char **argv)
mapped = 1;
}
 
-   on = IPV6_PMTUDISC_DO;
-   if (setsockopt(fd, SOL_IPV6, IPV6_MTU_DISCOVER, &on, sizeof(on))) {
+   on = IPV6_PMTUDISC_PROBE;
+   if (setsockopt(fd, SOL_IPV6, IPV6_MTU_DISCOVER, &on, sizeof(on)) &&
+   (on = IPV6_PMTUDISC_DO,
setsockopt(fd, SOL_IPV6, IPV6_MTU_DISCOVER, &on, sizeof(on)))) {
perror("IPV6_MTU_DISCOVER");
exit(1);
}
-- 
1.5.0.2.gc260-dirty

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] [iputils] Add length flag to set initial MTU.

2007-03-23 Thread John Heffner
Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 tracepath.c  |   10 --
 tracepath6.c |   10 --
 2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/tracepath.c b/tracepath.c
index c3f6f74..1f901ba 100644
--- a/tracepath.c
+++ b/tracepath.c
@@ -265,7 +265,7 @@ static void usage(void) __attribute((noreturn));
 
 static void usage(void)
 {
-   fprintf(stderr, "Usage: tracepath [-n] <destination>[/<port>]\n");
+   fprintf(stderr, "Usage: tracepath [-n] [-l <len>] <destination>[/<port>]\n");
exit(-1);
 }
 
@@ -279,11 +279,17 @@ main(int argc, char **argv)
char *p;
int ch;
 
-   while ((ch = getopt(argc, argv, "nh?")) != EOF) {
+   while ((ch = getopt(argc, argv, "nh?l:")) != EOF) {
switch(ch) {
case 'n':   
no_resolve = 1;
break;
+   case 'l':
+   if ((mtu = atoi(optarg)) <= overhead) {
+   fprintf(stderr, "Error: length must be >= %d\n", overhead);
+   exit(1);
+   }
+   break;
default:
usage();
}
diff --git a/tracepath6.c b/tracepath6.c
index 23d6a8c..d65230d 100644
--- a/tracepath6.c
+++ b/tracepath6.c
@@ -280,7 +280,7 @@ static void usage(void) __attribute((noreturn));
 
 static void usage(void)
 {
-   fprintf(stderr, "Usage: tracepath6 [-n] [-b] <destination>[/<port>]\n");
+   fprintf(stderr, "Usage: tracepath6 [-n] [-b] [-l <len>] <destination>[/<port>]\n");
exit(-1);
 }
 
@@ -297,7 +297,7 @@ int main(int argc, char **argv)
int gai;
char pbuf[NI_MAXSERV];
 
-   while ((ch = getopt(argc, argv, "nbh?")) != EOF) {
+   while ((ch = getopt(argc, argv, "nbh?l:")) != EOF) {
switch(ch) {
case 'n':   
no_resolve = 1;
@@ -305,6 +305,12 @@ int main(int argc, char **argv)
case 'b':   
show_both = 1;
break;
+   case 'l':
+   if ((mtu = atoi(optarg)) <= overhead) {
+   fprintf(stderr, "Error: length must be >= %d\n", overhead);
+   exit(1);
+   }
+   break;
default:
usage();
}
-- 
1.5.0.2.gc260-dirty

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] ip(7) IP_PMTUDISC_PROBE

2007-03-27 Thread John Heffner
Document the new IP_PMTUDISC_PROBE value for IP_MTU_DISCOVER.  (Going into
2.6.22.)


Thanks,
  -John
diff -rU3 man-pages-2.43-a/man7/ip.7 man-pages-2.43-b/man7/ip.7
--- man-pages-2.43-a/man7/ip.7  2006-09-26 09:54:29.0 -0400
+++ man-pages-2.43-b/man7/ip.7  2007-03-27 15:46:18.0 -0400
@@ -515,6 +515,7 @@
 IP_PMTUDISC_WANT:Use per-route settings.
 IP_PMTUDISC_DONT:Never do Path MTU Discovery.
 IP_PMTUDISC_DO:Always do Path MTU Discovery. 
+IP_PMTUDISC_PROBE:Set DF but ignore Path MTU.
 .TE   
 
 When PMTU discovery is enabled the kernel automatically keeps track of
@@ -550,6 +551,15 @@
 with the
 .B IP_MTU
 option. 
+
+It is possible to implement RFC 4821 MTU probing with
+.B SOCK_DGRAM
+or
+.B SOCK_RAW
+sockets by setting a value of IP_PMTUDISC_PROBE.  This is also particularly
+useful for diagnostic tools such as
+.BR tracepath (8)
+that wish to deliberately send probe packets larger than the observed Path MTU.
 .TP
 .B IP_MTU
 Retrieve the current known path MTU of the current socket. 


Re: [PATCH] NET: Add TCP connection abort IOCTL

2007-03-27 Thread John Heffner

Mark Huth wrote:



David Miller wrote:

From: [EMAIL PROTECTED] (David Griego)
Date: Tue, 27 Mar 2007 14:47:54 -0700

 

Adds an IOCTL for aborting established TCP connections, and is
designed to be an HA performance improvement for cleaning up, failure 
notification, and application termination.


Signed-off-by:  David Griego <[EMAIL PROTECTED]>



SO_LINGER with a zero linger time plus close() isn't working
properly?

There is no reason for this ioctl at all.  Either existing
facilities provide what you need or what you want is a
protocol violation we can't do.
  
Actually, there are legitimate uses for this sort of API.  The patch 
allows an administrator to kill specific connections that are in use by 
other applications, where the close is not available, since the socket 
is owned by another process.  Say one of your large applications has 
hundreds or even thousands of open connections and you have determined 
that a particular connection is causing trouble.  This API allows the 
admin to kill that particular connection, and doesn't appear to violate 
any RFC offhand, since an abort is sent to the peer.


One may argue that the applications should be modified, but that is not 
always possible in the case of various ISVs.  As Linux gains market 
share in the large server market, more and more applications are being 
ported from other platforms that have this sort of 
management/administrative interfaces.


Mark Huth


I also believe this is a useful thing to have.  I'm not 100% sure this 
ioctl is the way to go, but it seems reasonable.  This directly 
corresponds to writing deleteTcb to the tcpConnectionState variable in 
the TCP MIB (RFC 4022).  I don't think it constitutes a protocol violation.


As a concrete example of a way I've used this type of feature is to 
defend against a netkill [1] style attack, where the defense involves 
making decisions about which connections to kill when memory gets 
scarce.  It makes sense to do this with a system daemon, since an admin 
might have an arbitrarily complicated policy as to which applications 
and peers have priority for the memory.  This is too complicated to 
distribute and enforce across all applications.  You could do this in 
the kernel, but why if you don't have to?


  -John

[1] http://shlang.com/netkill/
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] NET: Add TCP connection abort IOCTL

2007-03-27 Thread John Heffner

John Heffner wrote:
I also believe this is a useful thing to have.  I'm not 100% sure this 
ioctl is the way to go, but it seems reasonable.  This directly 
corresponds to writing deleteTcb to the tcpConnectionState variable in 
the TCP MIB (RFC 4022).  I don't think it constitutes a protocol violation.


Responding to myself in good form :P  I'll add that there are other ways 
to do this currently, but all I know of are hackish, e.g. using a raw
socket to send RST packets to yourself.


  -John


[PATCH] [iputils] Add documentation for the -l flag.

2007-04-03 Thread John Heffner
---
 doc/tracepath.sgml |   13 +
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/doc/tracepath.sgml b/doc/tracepath.sgml
index 71eaa8d..c0f308b 100644
--- a/doc/tracepath.sgml
+++ b/doc/tracepath.sgml
@@ -15,6 +15,7 @@ traces path to a network host discovering MTU along this 
path
 
 
 tracepath
+-l 
 
 
 
@@ -39,6 +40,18 @@ of UDP ports to maintain trace history.
 
 
 
+OPTIONS
+
+ 
+  
+  
+Sets the initial packet length to 
+ 
+
+
+
 OUTPUT
 
 
-- 
1.5.0.2.gc260-dirty



[PATCH] [iputils] Document -n flag.

2007-04-03 Thread John Heffner
---
 doc/tracepath.sgml |9 +
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/doc/tracepath.sgml b/doc/tracepath.sgml
index c0f308b..1bc83b9 100644
--- a/doc/tracepath.sgml
+++ b/doc/tracepath.sgml
@@ -15,6 +15,7 @@ traces path to a network host discovering MTU along this 
path
 
 
 tracepath
+-n
 -l 
 
 
@@ -42,6 +43,14 @@ of UDP ports to maintain trace history.
 
 OPTIONS
 
+
+ 
+  
+  
+Do not look up host names.  Only print IP addresses numerically.
+  
+ 
+
  
   
   
-- 
1.5.0.2.gc260-dirty



[PATCH 2/2] [iputils] Re-probe at same TTL after MTU reduction.

2007-04-03 Thread John Heffner
This fixes a bug that would miss a hop after an ICMP Packet Too Big message,
since it would continue to increase the TTL without probing again.
---
 tracepath.c  |6 ++
 tracepath6.c |6 ++
 2 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/tracepath.c b/tracepath.c
index d035a1e..19b2c6b 100644
--- a/tracepath.c
+++ b/tracepath.c
@@ -352,8 +352,14 @@ main(int argc, char **argv)
exit(1);
}
 
+restart:
for (i=0; i<3; i++) {
+   int old_mtu;
+   
+   old_mtu = mtu;
res = probe_ttl(fd, ttl);
+   if (mtu != old_mtu)
+   goto restart;
if (res == 0)
goto done;
if (res > 0)
diff --git a/tracepath6.c b/tracepath6.c
index a010218..65c4a4a 100644
--- a/tracepath6.c
+++ b/tracepath6.c
@@ -422,8 +422,14 @@ int main(int argc, char **argv)
exit(1);
}
 
+restart:
for (i=0; i<3; i++) {
+   int old_mtu;
+   
+   old_mtu = mtu;
res = probe_ttl(fd, ttl);
+   if (mtu != old_mtu)
+   goto restart;
if (res == 0)
goto done;
if (res > 0)
-- 
1.5.0.2.gc260-dirty



[PATCH 1/2] [iputils] Fix asymm messages.

2007-04-03 Thread John Heffner
We should only print the asymm messages in tracepath/6 when you receive a
TTL expired message, because this is the only time when we'd expect the
same number of hops back as our TTL was set to for a symmetric path.
---
 tracepath.c  |   25 -
 tracepath6.c |   25 -
 2 files changed, 24 insertions(+), 26 deletions(-)

diff --git a/tracepath.c b/tracepath.c
index a562d88..d035a1e 100644
--- a/tracepath.c
+++ b/tracepath.c
@@ -163,19 +163,6 @@ restart:
}
}
 
-   if (rethops>=0) {
-   if (rethops<=64)
-   rethops = 65-rethops;
-   else if (rethops<=128)
-   rethops = 129-rethops;
-   else
-   rethops = 256-rethops;
-   if (sndhops>=0 && rethops != sndhops)
-   printf("asymm %2d ", rethops);
-   else if (sndhops<0 && rethops != ttl)
-   printf("asymm %2d ", rethops);
-   }
-
if (rettv) {
int diff = (tv.tv_sec-rettv->tv_sec)*1000000+(tv.tv_usec-rettv->tv_usec);
printf("%3d.%03dms ", diff/1000, diff%1000);
@@ -204,6 +191,18 @@ restart:
if (e->ee_origin == SO_EE_ORIGIN_ICMP &&
e->ee_type == 11 &&
e->ee_code == 0) {
+   if (rethops>=0) {
+   if (rethops<=64)
+   rethops = 65-rethops;
+   else if (rethops<=128)
+   rethops = 129-rethops;
+   else
+   rethops = 256-rethops;
+   if (sndhops>=0 && rethops != sndhops)
+   printf("asymm %2d ", rethops);
+   else if (sndhops<0 && rethops != ttl)
+   printf("asymm %2d ", rethops);
+   }
printf("\n");
break;
}
diff --git a/tracepath6.c b/tracepath6.c
index 6f13a51..a010218 100644
--- a/tracepath6.c
+++ b/tracepath6.c
@@ -176,19 +176,6 @@ restart:
}
}
 
-   if (rethops>=0) {
-   if (rethops<=64)
-   rethops = 65-rethops;
-   else if (rethops<=128)
-   rethops = 129-rethops;
-   else
-   rethops = 256-rethops;
-   if (sndhops>=0 && rethops != sndhops)
-   printf("asymm %2d ", rethops);
-   else if (sndhops<0 && rethops != ttl)
-   printf("asymm %2d ", rethops);
-   }
-
if (rettv) {
int diff = (tv.tv_sec-rettv->tv_sec)*1000000+(tv.tv_usec-rettv->tv_usec);
printf("%3d.%03dms ", diff/1000, diff%1000);
@@ -220,6 +207,18 @@ restart:
(e->ee_origin == SO_EE_ORIGIN_ICMP6 &&
 e->ee_type == 3 &&
 e->ee_code == 0)) {
+   if (rethops>=0) {
+   if (rethops<=64)
+   rethops = 65-rethops;
+   else if (rethops<=128)
+   rethops = 129-rethops;
+   else
+   rethops = 256-rethops;
+   if (sndhops>=0 && rethops != sndhops)
+   printf("asymm %2d ", rethops);
+   else if (sndhops<0 && rethops != ttl)
+   printf("asymm %2d ", rethops);
+   }
printf("\n");
break;
}
-- 
1.5.0.2.gc260-dirty



Re: [PATCH 1/3] [NET] Do pmtu check in transport layer

2007-04-09 Thread John Heffner

Patrick McHardy wrote:

John Heffner wrote:

Do the pmtu check at the transport layer (for UDP, ICMP and raw), and
send a local error if the socket is PMTUDISC_DO and the packet is too big.
This is actually a pure bugfix for ipv6.  For ipv4, it allows us to do pmtu
checks in the same way as for ipv6.

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index d096332..593acf7 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -822,7 +822,9 @@ int ip_append_data(struct sock *sk,
fragheaderlen = sizeof(struct iphdr) + (opt ? opt->optlen : 0);
maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen;
 
-	if (inet->cork.length + length > 0xFFFF - fragheaderlen) {
+	if (inet->cork.length + length > 0xFFFF - fragheaderlen ||
+	    (inet->pmtudisc >= IP_PMTUDISC_DO &&
+	     inet->cork.length + length > mtu)) {
ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->dport, 
mtu-exthdrlen);
return -EMSGSIZE;
}



This makes ping report an incorrect MTU when IPsec is used since we're
only accounting for the additional header_len, not the trailer_len
(which is not easily changeable). Additionally it will report different
MTUs for the first and following fragments when the socket is corked
because only the first fragment includes the header_len. It also can't
deal with things like NAT and routing by fwmark that change the route.
The old behaviour was that we get an ICMP frag. required with the MTU
of the final route, while this will always report the MTU of the
initially chosen route.

For all these reasons I think it should be reverted to the old
behaviour.


You're right, this is no good.  I think the other problems are fixable, 
but NAT really screws this.


Unfortunately, there is still a real problem with ipv6, in that the 
output side does not generate a packet too big ICMP like ipv4.  Also, it 
feels kind of undesirable be rely on local ICMP instead of direct error 
message delivery.  I'll try to generate a new patch.


Thanks,
  -John


Re: TCP connection stops after high load.

2007-04-15 Thread John Heffner

Robert Iakobashvili wrote:

Vanilla 2.6.18.3 works for me perfectly, whereas 2.6.19.5 and
2.6.20.6 do not.

Looking into the tcp /proc entries of 2.6.18.3 versus 2.6.19.5
tcp_rmem and tcp_wmem are the same, whereas tcp_mem are
much different:

kernel     tcp_mem
---------  -----------------
2.6.18.3   12288 16384 24576
2.6.19.5    3072  4096  6144


Is not it done deliberately by the below patch:

commit 9e950efa20dc8037c27509666cba6999da9368e8
Author: John Heffner <[EMAIL PROTECTED]>
Date:   Mon Nov 6 23:10:51 2006 -0800

   [TCP]: Don't use highmem in tcp hash size calculation.

   This patch removes consideration of high memory when determining TCP
   hash table sizes.  Taking into account high memory results in tcp_mem
   values that are too large.

Is it a feature?

My machine has:
MemTotal:   484368 kB
and
for all kernel configurations are actually the same with
CONFIG_HIGHMEM4G=y

Thanks,



Another patch that went in right around that time:

commit 52bf376c63eebe72e862a1a6e713976b038c3f50
Author: John Heffner <[EMAIL PROTECTED]>
Date:   Tue Nov 14 20:25:17 2006 -0800

[TCP]: Fix up sysctl_tcp_mem initialization.

Fix up tcp_mem initial settings to take into account the size of the
hash entries (different on SMP and non-SMP systems).

    Signed-off-by: John Heffner <[EMAIL PROTECTED]>
Signed-off-by: David S. Miller <[EMAIL PROTECTED]>

(This has been changed again for 2.6.21.)

In the dmesg, there should be some messages like this:

IP route cache hash table entries: 32768 (order: 5, 131072 bytes)
TCP established hash table entries: 131072 (order: 8, 1048576 bytes)
TCP bind hash table entries: 65536 (order: 6, 262144 bytes)
TCP: Hash tables configured (established 131072 bind 65536)

What do yours say?

Thanks,
  -John


Re: TCP connection stops after high load.

2007-04-16 Thread John Heffner

Robert Iakobashvili wrote:

Hi John,

On 4/15/07, John Heffner <[EMAIL PROTECTED]> wrote:

Robert Iakobashvili wrote:
> Vanilla 2.6.18.3 works for me perfectly, whereas 2.6.19.5 and
> 2.6.20.6 do not.
>
> Looking into the tcp /proc entries of 2.6.18.3 versus 2.6.19.5
> tcp_rmem and tcp_wmem are the same, whereas tcp_mem are
> much different:
>
> kernel     tcp_mem
> ---------  -----------------
> 2.6.18.3   12288 16384 24576
> 2.6.19.5    3072  4096  6144



Another patch that went in right around that time:

commit 52bf376c63eebe72e862a1a6e713976b038c3f50
Author: John Heffner <[EMAIL PROTECTED]>
Date:   Tue Nov 14 20:25:17 2006 -0800

 [TCP]: Fix up sysctl_tcp_mem initialization.
(This has been changed again for 2.6.21.)

In the dmesg, there should be some messages like this:
IP route cache hash table entries: 32768 (order: 5, 131072 bytes)
TCP established hash table entries: 131072 (order: 8, 1048576 bytes)
TCP bind hash table entries: 65536 (order: 6, 262144 bytes)
TCP: Hash tables configured (established 131072 bind 65536)

What do yours say?


For the 2.6.19.5, where we have this problem:

From dmesg:

IP route cache hash table entries: 4096 (order: 2, 16384 bytes)
TCP established hash table entries: 16384 (order: 5, 131072 bytes)
TCP bind hash table entries: 8192 (order: 4, 65536 bytes)

#cat /proc/sys/net/ipv4/tcp_mem
3072    4096    6144

MemTotal:   484368 kB
CONFIG_HIGHMEM4G=y



Yes, this difference is caused by the commit above.  The old way didn't 
really make a lot of sense, since it was different based on smp/non-smp 
and page size, and had large discontinuities at 512MB and every power of 
two.  It was hard to make the limit never larger than the memory pool 
but never too small either, when based on the hash table size.


The current net-2.6 (2.6.21) has a redesigned tcp_mem initialization 
that should give you more appropriate values, something like 45408 60546 
90816.  For reference:


Commit: 53cdcc04c1e85d4e423b2822b66149b6f2e52c2c
Author: John Heffner <[EMAIL PROTECTED]> Fri, 16 Mar 2007 15:04:03 -0700

[TCP]: Fix tcp_mem[] initialization.

Change tcp_mem initialization function.  The fraction of total memory
is now a continuous function of memory size, and independent of page
size.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
Signed-off-by: David S. Miller <[EMAIL PROTECTED]>

Thanks,
  -John


Re: TCP connection stops after high load.

2007-04-16 Thread John Heffner

Robert Iakobashvili wrote:

Kernels 2.6.19 and 2.6.20 series are effectively broken right now.
Don't you wish to patch them?



I don't know if this qualifies as an unconditional bug.  The commit 
above was actually a bugfix so that the limits were not higher than 
total memory on some systems, but had the side effect that it made them 
even smaller on your particular configuration.  Also, having initial 
sysctl values that are conservatively small probably doesn't qualify as 
a bug (for patching stable trees).  You might ask the -stable 
maintainers if they have a different opinion.


For most people, 2.6.19 and 2.6.20 work fine.  For those who really care 
about the tcp_mem values (are using a substantial fraction of physical 
memory for TCP connections), the best bet is to set the tcp_mem sysctl 
values in the startup scripts, or use the new initialization function in 
2.6.21.


Thanks,
  -John


Re: bug in tcp?

2007-04-16 Thread John Heffner

Stephen Hemminger wrote:

A guess: maybe something related to a PAWS wraparound problem.
Does turning off sysctl net.ipv4.tcp_timestamps fix it?


That was my first thought too (aside from netfilter), but a failed PAWS 
check should not result in a reset.


  -John


Re: TCP connection stops after high load.

2007-04-17 Thread John Heffner

David Miller wrote:

From: "Robert Iakobashvili" <[EMAIL PROTECTED]>
Date: Tue, 17 Apr 2007 10:58:04 +0300


David,

On 4/16/07, David Miller <[EMAIL PROTECTED]> wrote:

Commit: 53cdcc04c1e85d4e423b2822b66149b6f2e52c2c
Author: John Heffner <[EMAIL PROTECTED]> Fri, 16 Mar 2007 15:04:03 -0700

 [TCP]: Fix tcp_mem[] initialization.
 Change tcp_mem initialization function.  The fraction of total memory
 is now a continuous function of memory size, and independent of page
 size.


Kernels 2.6.19 and 2.6.20 series are effectively broken right now.
Don't you wish to patch them?

Can you verify that this patch actually fixes your problem?

Yes, it fixes.


Thanks, I will submit it to -stable branch.


My only reservation in submitting this to -stable is that it will in 
many cases increase the default tcp_mem values, which in turn can 
increase the default tcp_rmem values, and therefore the window scale. 
There will be some set of people with broken firewalls who trigger that 
problem for the first time by upgrading along the stable branch.  While 
it's not our fault, it could cause some complaints...


Thanks,
  -John


[PATCH 0/0] Re-try changes for PMTUDISC_PROBE

2007-04-18 Thread John Heffner
This backs out the transport layer MTU checks that don't work.  As a 
consequence, I had to back out the PMTUDISC_PROBE patch as well.  These 
patches should fix the problem with ipv6 that the transport layer change 
tried to address, and re-implement PMTUDISC_PROBE.  I think this 
approach is nicer than the last one, since it doesn't require a bit in 
struct sk_buff.


Thanks,
  -John


[PATCH] Revert "[NET] Do pmtu check in transport layer"

2007-04-18 Thread John Heffner
This reverts commit 87e927a0583bd4a8ba9e97cd75b58d8aa1c76e37.

This idea does not work, as pointed out by Patrick McHardy.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 net/ipv4/ip_output.c  |4 +---
 net/ipv4/raw.c|8 +++-
 net/ipv6/ip6_output.c |   11 +--
 net/ipv6/raw.c|7 ++-
 4 files changed, 11 insertions(+), 19 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 79e71ee..34606ef 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -810,9 +810,7 @@ int ip_append_data(struct sock *sk,
fragheaderlen = sizeof(struct iphdr) + (opt ? opt->optlen : 0);
maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen;
 
-	if (inet->cork.length + length > 0xFFFF - fragheaderlen ||
-	    (inet->pmtudisc >= IP_PMTUDISC_DO &&
-	     inet->cork.length + length > mtu)) {
+	if (inet->cork.length + length > 0xFFFF - fragheaderlen) {
ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->dport, 
mtu-exthdrlen);
return -EMSGSIZE;
}
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index c60aadf..24d7c9f 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -271,12 +271,10 @@ static int raw_send_hdrinc(struct sock *sk, void *from, 
size_t length,
struct iphdr *iph;
struct sk_buff *skb;
int err;
-   int mtu;
 
-   mtu = inet->pmtudisc == IP_PMTUDISC_DO ? dst_mtu(&rt->u.dst) :
-rt->u.dst.dev->mtu;
-   if (length > mtu) {
-   ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->dport, mtu);
+   if (length > rt->u.dst.dev->mtu) {
+   ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->dport,
+  rt->u.dst.dev->mtu);
return -EMSGSIZE;
}
if (flags&MSG_PROBE)
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index b8e307a..4cfdad4 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1079,12 +1079,11 @@ int ip6_append_data(struct sock *sk, int getfrag(void 
*from, char *to,
fragheaderlen = sizeof(struct ipv6hdr) + rt->u.dst.nfheader_len + (opt 
? opt->opt_nflen : 0);
maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen - 
sizeof(struct frag_hdr);
 
-   if ((mtu <= sizeof(struct ipv6hdr) + IPV6_MAXPLEN &&
-inet->cork.length + length > sizeof(struct ipv6hdr) + IPV6_MAXPLEN 
- fragheaderlen) ||
-   (np->pmtudisc >= IPV6_PMTUDISC_DO &&
-inet->cork.length + length > mtu)) {
-   ipv6_local_error(sk, EMSGSIZE, fl, mtu-exthdrlen);
-   return -EMSGSIZE;
+   if (mtu <= sizeof(struct ipv6hdr) + IPV6_MAXPLEN) {
+   if (inet->cork.length + length > sizeof(struct ipv6hdr) + 
IPV6_MAXPLEN - fragheaderlen) {
+   ipv6_local_error(sk, EMSGSIZE, fl, mtu-exthdrlen);
+   return -EMSGSIZE;
+   }
}
 
/*
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index f4cd90b..f65fcd7 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -558,12 +558,9 @@ static int rawv6_send_hdrinc(struct sock *sk, void *from, 
int length,
struct sk_buff *skb;
unsigned int hh_len;
int err;
-   int mtu;
 
-   mtu = np->pmtudisc == IPV6_PMTUDISC_DO ? dst_mtu(&rt->u.dst) :
-rt->u.dst.dev->mtu;
-   if (length > mtu) {
-   ipv6_local_error(sk, EMSGSIZE, fl, mtu);
+   if (length > rt->u.dst.dev->mtu) {
+   ipv6_local_error(sk, EMSGSIZE, fl, rt->u.dst.dev->mtu);
return -EMSGSIZE;
}
if (flags&MSG_PROBE)
-- 
1.5.1.rc3.30.ga8f4-dirty



[PATCH] [NET] MTU discovery check in ip6_fragment()

2007-04-18 Thread John Heffner
Adds a check in ip6_fragment() mirroring ip_fragment() for packets
that we can't fragment, and sends an ICMP Packet Too Big message
in response.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 net/ipv6/ip6_output.c |   13 +
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 4cfdad4..5a5b7d4 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -567,6 +567,19 @@ static int ip6_fragment(struct sk_buff *skb, int 
(*output)(struct sk_buff *))
nexthdr = *prevhdr;
 
mtu = dst_mtu(&rt->u.dst);
+
+   /* We must not fragment if the socket is set to force MTU discovery
+* or if the skb is not generated by a local socket.  (This last
+* check should be redundant, but it's free.)
+*/
+   if (!np || np->pmtudisc >= IPV6_PMTUDISC_DO) {
+   skb->dev = skb->dst->dev;
+   icmpv6_send(skb, ICMPV6_PKT_TOOBIG, 0, mtu, skb->dev);
+   IP6_INC_STATS(ip6_dst_idev(skb->dst), IPSTATS_MIB_FRAGFAILS);
+   kfree_skb(skb);
+   return -EMSGSIZE;
+   }
+
if (np && np->frag_size < mtu) {
if (np->frag_size)
mtu = np->frag_size;
-- 
1.5.1.rc3.30.ga8f4-dirty



[PATCH] Revert "[NET] Add IP(V6)_PMTUDISC_PROBE"

2007-04-18 Thread John Heffner
This reverts commit d21d2a90b879c0cf159df5944847e6d9833816eb.

Must be backed out because commit 87e927a0583bd4a8ba9e97cd75b58d8aa1c76e37
does not work.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 include/linux/in.h   |1 -
 include/linux/in6.h  |1 -
 include/linux/skbuff.h   |3 +--
 include/net/ip.h |2 +-
 net/core/skbuff.c|2 --
 net/ipv4/ip_output.c |   14 --
 net/ipv4/ip_sockglue.c   |2 +-
 net/ipv4/raw.c   |3 ---
 net/ipv6/ip6_output.c|   12 
 net/ipv6/ipv6_sockglue.c |2 +-
 net/ipv6/raw.c   |3 ---
 11 files changed, 12 insertions(+), 33 deletions(-)

diff --git a/include/linux/in.h b/include/linux/in.h
index 2dc1f8a..1912e7c 100644
--- a/include/linux/in.h
+++ b/include/linux/in.h
@@ -83,7 +83,6 @@ struct in_addr {
 #define IP_PMTUDISC_DONT   0   /* Never send DF frames */
 #define IP_PMTUDISC_WANT   1   /* Use per route hints  */
 #define IP_PMTUDISC_DO 2   /* Always DF*/
-#define IP_PMTUDISC_PROBE  3   /* Ignore dst pmtu  */
 
 #define IP_MULTICAST_IF32
 #define IP_MULTICAST_TTL   33
diff --git a/include/linux/in6.h b/include/linux/in6.h
index d559fac..4e8350a 100644
--- a/include/linux/in6.h
+++ b/include/linux/in6.h
@@ -179,7 +179,6 @@ struct in6_flowlabel_req
 #define IPV6_PMTUDISC_DONT 0
 #define IPV6_PMTUDISC_WANT 1
 #define IPV6_PMTUDISC_DO   2
-#define IPV6_PMTUDISC_PROBE3
 
 /* Flowlabel */
 #define IPV6_FLOWLABEL_MGR 32
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 8bf9b9f..7f17cfc 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -277,8 +277,7 @@ struct sk_buff {
nfctinfo:3;
__u8pkt_type:3,
fclone:2,
-   ipvs_property:1,
-   ign_dst_mtu:1;
+   ipvs_property:1;
__be16  protocol;
 
void(*destructor)(struct sk_buff *skb);
diff --git a/include/net/ip.h b/include/net/ip.h
index 6a08b65..75f226d 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -206,7 +206,7 @@ int ip_decrease_ttl(struct iphdr *iph)
 static inline
 int ip_dont_fragment(struct sock *sk, struct dst_entry *dst)
 {
-   return (inet_sk(sk)->pmtudisc >= IP_PMTUDISC_DO ||
+   return (inet_sk(sk)->pmtudisc == IP_PMTUDISC_DO ||
(inet_sk(sk)->pmtudisc == IP_PMTUDISC_WANT &&
 !(dst_metric(dst, RTAX_LOCK)&(1<<RTAX_MTU))));
n->destructor = NULL;
C(mark);
@@ -543,7 +542,6 @@ static void copy_skb_header(struct sk_buff *new, const 
struct sk_buff *old)
 #if defined(CONFIG_IP_VS) || defined(CONFIG_IP_VS_MODULE)
new->ipvs_property = old->ipvs_property;
 #endif
-   new->ign_dst_mtu= old->ign_dst_mtu;
 #ifdef CONFIG_NET_SCHED
 #ifdef CONFIG_NET_CLS_ACT
new->tc_verd = old->tc_verd;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 704bc44..79e71ee 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -198,8 +198,7 @@ static inline int ip_finish_output(struct sk_buff *skb)
return dst_output(skb);
}
 #endif
-   if (skb->len > dst_mtu(skb->dst) &&
-   !skb->ign_dst_mtu && !skb_is_gso(skb))
+   if (skb->len > dst_mtu(skb->dst) && !skb_is_gso(skb))
return ip_fragment(skb, ip_finish_output2);
else
return ip_finish_output2(skb);
@@ -788,9 +787,7 @@ int ip_append_data(struct sock *sk,
inet->cork.addr = ipc->addr;
}
dst_hold(&rt->u.dst);
-   inet->cork.fragsize = mtu = inet->pmtudisc == IP_PMTUDISC_PROBE ?
-   rt->u.dst.dev->mtu :
-   dst_mtu(rt->u.dst.path);
+   inet->cork.fragsize = mtu = dst_mtu(rt->u.dst.path);
inet->cork.rt = rt;
inet->cork.length = 0;
sk->sk_sndmsg_page = NULL;
@@ -1208,16 +1205,13 @@ int ip_push_pending_frames(struct sock *sk)
 * to fragment the frame generated here. No matter, what transforms
 * how transforms change size of the packet, it will come out.
 */
-   if (inet->pmtudisc < IP_PMTUDISC_DO)
+   if (inet->pmtudisc != IP_PMTUDISC_DO)
skb->local_df = 1;
 
-   if (inet->pmtudisc == IP_PMTUDISC_PROBE)
-   skb->ign_dst_mtu = 1;
-
/* DF bit is set when we want to see DF on outgoing frames.
 * If local_df is set too, we still allow to fragment this frame
 

[PATCH] [NET] Add IP(V6)_PMTUDISC_PROBE

2007-04-18 Thread John Heffner
Add IP(V6)_PMTUDISC_PROBE value for IP(V6)_MTU_DISCOVER.  This option forces
us not to fragment, but does not make use of the kernel path MTU discovery.
That is, it allows for user-mode MTU probing (or, packetization-layer path
MTU discovery).  This is particularly useful for diagnostic utilities, like
traceroute/tracepath.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 include/linux/in.h   |1 +
 include/linux/in6.h  |1 +
 net/ipv4/ip_output.c |   20 +++-
 net/ipv4/ip_sockglue.c   |2 +-
 net/ipv6/ip6_output.c|   15 ---
 net/ipv6/ipv6_sockglue.c |2 +-
 6 files changed, 31 insertions(+), 10 deletions(-)

diff --git a/include/linux/in.h b/include/linux/in.h
index 1912e7c..3975cbf 100644
--- a/include/linux/in.h
+++ b/include/linux/in.h
@@ -83,6 +83,7 @@ struct in_addr {
 #define IP_PMTUDISC_DONT   0   /* Never send DF frames */
 #define IP_PMTUDISC_WANT   1   /* Use per route hints  */
 #define IP_PMTUDISC_DO 2   /* Always DF*/
+#define IP_PMTUDISC_PROBE  3   /* Ignore dst pmtu  */
 
 #define IP_MULTICAST_IF32
 #define IP_MULTICAST_TTL   33
diff --git a/include/linux/in6.h b/include/linux/in6.h
index 4e8350a..d559fac 100644
--- a/include/linux/in6.h
+++ b/include/linux/in6.h
@@ -179,6 +179,7 @@ struct in6_flowlabel_req
 #define IPV6_PMTUDISC_DONT 0
 #define IPV6_PMTUDISC_WANT 1
 #define IPV6_PMTUDISC_DO   2
+#define IPV6_PMTUDISC_PROBE3
 
 /* Flowlabel */
 #define IPV6_FLOWLABEL_MGR 32
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 34606ef..66e2c3a 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -189,6 +189,14 @@ static inline int ip_finish_output2(struct sk_buff *skb)
return -EINVAL;
 }
 
+static inline int ip_skb_dst_mtu(struct sk_buff *skb)
+{
+   struct inet_sock *inet = skb->sk ? inet_sk(skb->sk) : NULL;
+
+   return (inet && inet->pmtudisc == IP_PMTUDISC_PROBE) ?
+  skb->dst->dev->mtu : dst_mtu(skb->dst);
+}
+
 static inline int ip_finish_output(struct sk_buff *skb)
 {
 #if defined(CONFIG_NETFILTER) && defined(CONFIG_XFRM)
@@ -198,7 +206,7 @@ static inline int ip_finish_output(struct sk_buff *skb)
return dst_output(skb);
}
 #endif
-   if (skb->len > dst_mtu(skb->dst) && !skb_is_gso(skb))
+   if (skb->len > ip_skb_dst_mtu(skb) && !skb_is_gso(skb))
return ip_fragment(skb, ip_finish_output2);
else
return ip_finish_output2(skb);
@@ -422,7 +430,7 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct 
sk_buff*))
if (unlikely((iph->frag_off & htons(IP_DF)) && !skb->local_df)) {
IP_INC_STATS(IPSTATS_MIB_FRAGFAILS);
icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
- htonl(dst_mtu(&rt->u.dst)));
+ htonl(ip_skb_dst_mtu(skb)));
kfree_skb(skb);
return -EMSGSIZE;
}
@@ -787,7 +795,9 @@ int ip_append_data(struct sock *sk,
inet->cork.addr = ipc->addr;
}
dst_hold(&rt->u.dst);
-   inet->cork.fragsize = mtu = dst_mtu(rt->u.dst.path);
+   inet->cork.fragsize = mtu = inet->pmtudisc == IP_PMTUDISC_PROBE ?
+   rt->u.dst.dev->mtu : 
+   dst_mtu(rt->u.dst.path);
inet->cork.rt = rt;
inet->cork.length = 0;
sk->sk_sndmsg_page = NULL;
@@ -1203,13 +1213,13 @@ int ip_push_pending_frames(struct sock *sk)
 * to fragment the frame generated here. No matter, what transforms
 * how transforms change size of the packet, it will come out.
 */
-   if (inet->pmtudisc != IP_PMTUDISC_DO)
+   if (inet->pmtudisc < IP_PMTUDISC_DO)
skb->local_df = 1;
 
/* DF bit is set when we want to see DF on outgoing frames.
 * If local_df is set too, we still allow to fragment this frame
 * locally. */
-   if (inet->pmtudisc == IP_PMTUDISC_DO ||
+   if (inet->pmtudisc >= IP_PMTUDISC_DO ||
(skb->len <= dst_mtu(&rt->u.dst) &&
 ip_dont_fragment(sk, &rt->u.dst)))
df = htons(IP_DF);
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index c199d23..4d54457 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -542,7 +542,7 @@ static int do_ip_setsockopt(struct sock *sk, int level,
inet->hdrincl = val ? 1 : 0;
break;
case IP_MTU_DISCOVER:
-   if (val<0 || val>2)
+   if (val<0 || val>3)

[PATCH 2/4] Revert "[NET] Do pmtu check in transport layer"

2007-04-18 Thread John Heffner
This reverts commit 87e927a0583bd4a8ba9e97cd75b58d8aa1c76e37.

This idea does not work, as pointed out by Patrick McHardy.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 net/ipv4/ip_output.c  |4 +---
 net/ipv4/raw.c|8 +++-
 net/ipv6/ip6_output.c |   11 +--
 net/ipv6/raw.c|7 ++-
 4 files changed, 11 insertions(+), 19 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 79e71ee..34606ef 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -810,9 +810,7 @@ int ip_append_data(struct sock *sk,
fragheaderlen = sizeof(struct iphdr) + (opt ? opt->optlen : 0);
maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen;
 
-	if (inet->cork.length + length > 0xFFFF - fragheaderlen ||
-	    (inet->pmtudisc >= IP_PMTUDISC_DO &&
-	     inet->cork.length + length > mtu)) {
+	if (inet->cork.length + length > 0xFFFF - fragheaderlen) {
ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->dport, 
mtu-exthdrlen);
return -EMSGSIZE;
}
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index c60aadf..24d7c9f 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -271,12 +271,10 @@ static int raw_send_hdrinc(struct sock *sk, void *from, 
size_t length,
struct iphdr *iph;
struct sk_buff *skb;
int err;
-   int mtu;
 
-   mtu = inet->pmtudisc == IP_PMTUDISC_DO ? dst_mtu(&rt->u.dst) :
-rt->u.dst.dev->mtu;
-   if (length > mtu) {
-   ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->dport, mtu);
+   if (length > rt->u.dst.dev->mtu) {
+   ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->dport,
+  rt->u.dst.dev->mtu);
return -EMSGSIZE;
}
if (flags&MSG_PROBE)
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index b8e307a..4cfdad4 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1079,12 +1079,11 @@ int ip6_append_data(struct sock *sk, int getfrag(void 
*from, char *to,
fragheaderlen = sizeof(struct ipv6hdr) + rt->u.dst.nfheader_len + (opt 
? opt->opt_nflen : 0);
maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen - 
sizeof(struct frag_hdr);
 
-   if ((mtu <= sizeof(struct ipv6hdr) + IPV6_MAXPLEN &&
-inet->cork.length + length > sizeof(struct ipv6hdr) + IPV6_MAXPLEN 
- fragheaderlen) ||
-   (np->pmtudisc >= IPV6_PMTUDISC_DO &&
-inet->cork.length + length > mtu)) {
-   ipv6_local_error(sk, EMSGSIZE, fl, mtu-exthdrlen);
-   return -EMSGSIZE;
+   if (mtu <= sizeof(struct ipv6hdr) + IPV6_MAXPLEN) {
+   if (inet->cork.length + length > sizeof(struct ipv6hdr) + 
IPV6_MAXPLEN - fragheaderlen) {
+   ipv6_local_error(sk, EMSGSIZE, fl, mtu-exthdrlen);
+   return -EMSGSIZE;
+   }
}
 
/*
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index f4cd90b..f65fcd7 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -558,12 +558,9 @@ static int rawv6_send_hdrinc(struct sock *sk, void *from, 
int length,
struct sk_buff *skb;
unsigned int hh_len;
int err;
-   int mtu;
 
-   mtu = np->pmtudisc == IPV6_PMTUDISC_DO ? dst_mtu(&rt->u.dst) :
-rt->u.dst.dev->mtu;
-   if (length > mtu) {
-   ipv6_local_error(sk, EMSGSIZE, fl, mtu);
+   if (length > rt->u.dst.dev->mtu) {
+   ipv6_local_error(sk, EMSGSIZE, fl, rt->u.dst.dev->mtu);
return -EMSGSIZE;
}
if (flags&MSG_PROBE)
-- 
1.5.1.rc3.30.ga8f4-dirty

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/4] Revert "[NET] Add IP(V6)_PMTUDISC_RPOBE"

2007-04-18 Thread John Heffner
This reverts commit d21d2a90b879c0cf159df5944847e6d9833816eb.

Must be backed out because commit 87e927a0583bd4a8ba9e97cd75b58d8aa1c76e37
does not work.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 include/linux/in.h   |1 -
 include/linux/in6.h  |1 -
 include/linux/skbuff.h   |3 +--
 include/net/ip.h |2 +-
 net/core/skbuff.c|2 --
 net/ipv4/ip_output.c |   14 --
 net/ipv4/ip_sockglue.c   |2 +-
 net/ipv4/raw.c   |3 ---
 net/ipv6/ip6_output.c|   12 
 net/ipv6/ipv6_sockglue.c |2 +-
 net/ipv6/raw.c   |3 ---
 11 files changed, 12 insertions(+), 33 deletions(-)

diff --git a/include/linux/in.h b/include/linux/in.h
index 2dc1f8a..1912e7c 100644
--- a/include/linux/in.h
+++ b/include/linux/in.h
@@ -83,7 +83,6 @@ struct in_addr {
 #define IP_PMTUDISC_DONT   0   /* Never send DF frames */
 #define IP_PMTUDISC_WANT   1   /* Use per route hints  */
 #define IP_PMTUDISC_DO 2   /* Always DF*/
-#define IP_PMTUDISC_PROBE  3   /* Ignore dst pmtu  */
 
 #define IP_MULTICAST_IF32
 #define IP_MULTICAST_TTL   33
diff --git a/include/linux/in6.h b/include/linux/in6.h
index d559fac..4e8350a 100644
--- a/include/linux/in6.h
+++ b/include/linux/in6.h
@@ -179,7 +179,6 @@ struct in6_flowlabel_req
 #define IPV6_PMTUDISC_DONT 0
 #define IPV6_PMTUDISC_WANT 1
 #define IPV6_PMTUDISC_DO   2
-#define IPV6_PMTUDISC_PROBE3
 
 /* Flowlabel */
 #define IPV6_FLOWLABEL_MGR 32
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 8bf9b9f..7f17cfc 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -277,8 +277,7 @@ struct sk_buff {
nfctinfo:3;
__u8pkt_type:3,
fclone:2,
-   ipvs_property:1,
-   ign_dst_mtu:1;
+   ipvs_property:1;
__be16  protocol;
 
void(*destructor)(struct sk_buff *skb);
diff --git a/include/net/ip.h b/include/net/ip.h
index 6a08b65..75f226d 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -206,7 +206,7 @@ int ip_decrease_ttl(struct iphdr *iph)
 static inline
 int ip_dont_fragment(struct sock *sk, struct dst_entry *dst)
 {
-   return (inet_sk(sk)->pmtudisc >= IP_PMTUDISC_DO ||
+   return (inet_sk(sk)->pmtudisc == IP_PMTUDISC_DO ||
(inet_sk(sk)->pmtudisc == IP_PMTUDISC_WANT &&
 !(dst_metric(dst, RTAX_LOCK)&(1<<RTAX_MTU))));
 }
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
	n->destructor = NULL;
	C(mark);
@@ -543,7 +542,6 @@ static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 #if defined(CONFIG_IP_VS) || defined(CONFIG_IP_VS_MODULE)
new->ipvs_property = old->ipvs_property;
 #endif
-   new->ign_dst_mtu= old->ign_dst_mtu;
 #ifdef CONFIG_NET_SCHED
 #ifdef CONFIG_NET_CLS_ACT
new->tc_verd = old->tc_verd;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 704bc44..79e71ee 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -198,8 +198,7 @@ static inline int ip_finish_output(struct sk_buff *skb)
return dst_output(skb);
}
 #endif
-   if (skb->len > dst_mtu(skb->dst) &&
-   !skb->ign_dst_mtu && !skb_is_gso(skb))
+   if (skb->len > dst_mtu(skb->dst) && !skb_is_gso(skb))
return ip_fragment(skb, ip_finish_output2);
else
return ip_finish_output2(skb);
@@ -788,9 +787,7 @@ int ip_append_data(struct sock *sk,
inet->cork.addr = ipc->addr;
}
dst_hold(&rt->u.dst);
-   inet->cork.fragsize = mtu = inet->pmtudisc == IP_PMTUDISC_PROBE ?
-   rt->u.dst.dev->mtu :
-   dst_mtu(rt->u.dst.path);
+   inet->cork.fragsize = mtu = dst_mtu(rt->u.dst.path);
inet->cork.rt = rt;
inet->cork.length = 0;
sk->sk_sndmsg_page = NULL;
@@ -1208,16 +1205,13 @@ int ip_push_pending_frames(struct sock *sk)
 * to fragment the frame generated here. No matter, what transforms
 * how transforms change size of the packet, it will come out.
 */
-   if (inet->pmtudisc < IP_PMTUDISC_DO)
+   if (inet->pmtudisc != IP_PMTUDISC_DO)
skb->local_df = 1;
 
-   if (inet->pmtudisc == IP_PMTUDISC_PROBE)
-   skb->ign_dst_mtu = 1;
-
/* DF bit is set when we want to see DF on outgoing frames.
 * If local_df is set too, we still allow to fragment this frame
 

[PATCH 4/4] [NET] Add IP(V6)_PMTUDISC_RPOBE

2007-04-18 Thread John Heffner
Add IP(V6)_PMTUDISC_PROBE value for IP(V6)_MTU_DISCOVER.  This option forces
us not to fragment, but does not make use of the kernel path MTU discovery.
That is, it allows for user-mode MTU probing (or, packetization-layer path
MTU discovery).  This is particularly useful for diagnostic utilities, like
traceroute/tracepath.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 include/linux/in.h   |1 +
 include/linux/in6.h  |1 +
 net/ipv4/ip_output.c |   20 +++-
 net/ipv4/ip_sockglue.c   |2 +-
 net/ipv6/ip6_output.c|   15 ---
 net/ipv6/ipv6_sockglue.c |2 +-
 6 files changed, 31 insertions(+), 10 deletions(-)

diff --git a/include/linux/in.h b/include/linux/in.h
index 1912e7c..3975cbf 100644
--- a/include/linux/in.h
+++ b/include/linux/in.h
@@ -83,6 +83,7 @@ struct in_addr {
 #define IP_PMTUDISC_DONT   0   /* Never send DF frames */
 #define IP_PMTUDISC_WANT   1   /* Use per route hints  */
 #define IP_PMTUDISC_DO 2   /* Always DF*/
+#define IP_PMTUDISC_PROBE  3   /* Ignore dst pmtu  */
 
 #define IP_MULTICAST_IF32
 #define IP_MULTICAST_TTL   33
diff --git a/include/linux/in6.h b/include/linux/in6.h
index 4e8350a..d559fac 100644
--- a/include/linux/in6.h
+++ b/include/linux/in6.h
@@ -179,6 +179,7 @@ struct in6_flowlabel_req
 #define IPV6_PMTUDISC_DONT 0
 #define IPV6_PMTUDISC_WANT 1
 #define IPV6_PMTUDISC_DO   2
+#define IPV6_PMTUDISC_PROBE3
 
 /* Flowlabel */
 #define IPV6_FLOWLABEL_MGR 32
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 34606ef..66e2c3a 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -189,6 +189,14 @@ static inline int ip_finish_output2(struct sk_buff *skb)
return -EINVAL;
 }
 
+static inline int ip_skb_dst_mtu(struct sk_buff *skb)
+{
+   struct inet_sock *inet = skb->sk ? inet_sk(skb->sk) : NULL;
+
+   return (inet && inet->pmtudisc == IP_PMTUDISC_PROBE) ?
+  skb->dst->dev->mtu : dst_mtu(skb->dst);
+}
+
 static inline int ip_finish_output(struct sk_buff *skb)
 {
 #if defined(CONFIG_NETFILTER) && defined(CONFIG_XFRM)
@@ -198,7 +206,7 @@ static inline int ip_finish_output(struct sk_buff *skb)
return dst_output(skb);
}
 #endif
-   if (skb->len > dst_mtu(skb->dst) && !skb_is_gso(skb))
+   if (skb->len > ip_skb_dst_mtu(skb) && !skb_is_gso(skb))
return ip_fragment(skb, ip_finish_output2);
else
return ip_finish_output2(skb);
@@ -422,7 +430,7 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff*))
if (unlikely((iph->frag_off & htons(IP_DF)) && !skb->local_df)) {
IP_INC_STATS(IPSTATS_MIB_FRAGFAILS);
icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
- htonl(dst_mtu(&rt->u.dst)));
+ htonl(ip_skb_dst_mtu(skb)));
kfree_skb(skb);
return -EMSGSIZE;
}
@@ -787,7 +795,9 @@ int ip_append_data(struct sock *sk,
inet->cork.addr = ipc->addr;
}
dst_hold(&rt->u.dst);
-   inet->cork.fragsize = mtu = dst_mtu(rt->u.dst.path);
+   inet->cork.fragsize = mtu = inet->pmtudisc == IP_PMTUDISC_PROBE ?
+   rt->u.dst.dev->mtu : 
+   dst_mtu(rt->u.dst.path);
inet->cork.rt = rt;
inet->cork.length = 0;
sk->sk_sndmsg_page = NULL;
@@ -1203,13 +1213,13 @@ int ip_push_pending_frames(struct sock *sk)
 * to fragment the frame generated here. No matter, what transforms
 * how transforms change size of the packet, it will come out.
 */
-   if (inet->pmtudisc != IP_PMTUDISC_DO)
+   if (inet->pmtudisc < IP_PMTUDISC_DO)
skb->local_df = 1;
 
/* DF bit is set when we want to see DF on outgoing frames.
 * If local_df is set too, we still allow to fragment this frame
 * locally. */
-   if (inet->pmtudisc == IP_PMTUDISC_DO ||
+   if (inet->pmtudisc >= IP_PMTUDISC_DO ||
(skb->len <= dst_mtu(&rt->u.dst) &&
 ip_dont_fragment(sk, &rt->u.dst)))
df = htons(IP_DF);
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index c199d23..4d54457 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -542,7 +542,7 @@ static int do_ip_setsockopt(struct sock *sk, int level,
inet->hdrincl = val ? 1 : 0;
break;
case IP_MTU_DISCOVER:
-   if (val<0 || val>2)
+   if (val<0 || val>3)

[PATCH 3/4] [NET] MTU discovery check in ip6_fragment()

2007-04-18 Thread John Heffner
Adds a check in ip6_fragment() mirroring ip_fragment() for packets
that we can't fragment, and sends an ICMP Packet Too Big message
in response.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 net/ipv6/ip6_output.c |   13 +
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 4cfdad4..5a5b7d4 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -567,6 +567,19 @@ static int ip6_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
nexthdr = *prevhdr;
 
mtu = dst_mtu(&rt->u.dst);
+
+   /* We must not fragment if the socket is set to force MTU discovery
+* or if the skb it not generated by a local socket.  (This last
+* check should be redundant, but it's free.)
+*/
+   if (!np || np->pmtudisc >= IPV6_PMTUDISC_DO) {
+   skb->dev = skb->dst->dev;
+   icmpv6_send(skb, ICMPV6_PKT_TOOBIG, 0, mtu, skb->dev);
+   IP6_INC_STATS(ip6_dst_idev(skb->dst), IPSTATS_MIB_FRAGFAILS);
+   kfree_skb(skb);
+   return -EMSGSIZE;
+   }
+
if (np && np->frag_size < mtu) {
if (np->frag_size)
mtu = np->frag_size;
-- 
1.5.1.rc3.30.ga8f4-dirty



  1   2   3   4   5   6   7   8   9   10   >