Re: Intel 82559 NIC corrupted EEPROM
John wrote:

-0009 : System RAM
000a-000b : Video RAM area
000f-000f : System ROM
0010-0ffe : System RAM
0010-00296a1a : Kernel code
00296a1b-0031bbe7 : Kernel data
0fff-0fff2fff : ACPI Non-volatile Storage
0fff3000-0fff : ACPI Tables
2000-200f : :00:08.0
2010-201f : :00:09.0
2020-202f : :00:0a.0
e000-e3ff : :00:00.0
e500-e50f : :00:08.0
e510-e51f : :00:09.0
e520-e52f : :00:0a.0
e530-e5300fff : :00:08.0
e5301000-e5301fff : :00:0a.0
e5302000-e5302fff : :00:09.0
- : reserved

I've also attached:
o config-2.6.18.1-adlink used to compile this kernel
o dmesg output after the machine boots

I suppose the information I've sent is not enough to locate the root of the problem. Is there more I can provide?

- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Intel 82559 NIC corrupted EEPROM
John wrote:
[...] Is there more I can provide?

Here is some context for those who have been added to the CC list:
http://groups.google.com/group/linux.kernel/browse_frm/thread/bdc8fd08fb601c26

As far as I understand, some consider the eepro100 driver to be obsolete, and it has been considered for removal. What is the current status? Unfortunately, e100 does not work out-of-the-box on this system. Is there something I can do to improve the situation?

--
Regards, John
[ E-mail address is a bit-bucket. I *do* monitor the mailing lists. ]
Re: Intel 82559 NIC corrupted EEPROM
Jesse Brandeburg wrote:
John wrote:
[...] Is there something I can do to improve the situation?
Let's go ahead and print the output from the e100_load_eeprom debug patch, attached.

Loading (then unloading) e100.ko fails the first few times (i.e. the driver claims one of the EEPROMs is corrupted). Thereafter, sometimes it fails, other times it works. Sounds like a race, no?

$ cat load_unload
: > /var/log/kern.log
insmod e100.ko debug=16
sleep 1
cp /var/log/kern.log insmod_$I.txt
ip link > ip_link_$I.txt
sleep 2
rmmod e100
let "I=I+1"

(cf. attached compressed archive)

FAILURE: insmod_100.txt insmod_101.txt insmod_102.txt insmod_105.txt insmod_107.txt insmod_108.txt insmod_110.txt insmod_111.txt insmod_114.txt
SUCCESS: insmod_103.txt insmod_104.txt insmod_106.txt insmod_109.txt insmod_112.txt insmod_113.txt insmod_115.txt insmod_116.txt

On an unrelated note, insmod_100.txt is truncated at the beginning, and insmod_110.txt is truncated in the middle (!!), cf. line 14. What would cause klogd to behave like that?

Regards.

TEST-e100.tar.bz2
Description: Binary data
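For readers following along: the "corrupted EEPROM" complaint comes from a checksum test. The e100 driver sums every 16-bit word of the EEPROM and expects the (16-bit) total to equal the magic constant 0xBABA; the last word is programmed so the sum comes out right. A minimal sketch of that check (function names are mine, not the driver's):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* e100 expects the 16-bit sum of all EEPROM words to equal 0xBABA. */
#define EEPROM_SUM 0xbaba

static int eeprom_checksum_ok(const uint16_t *eeprom, size_t nwords)
{
    uint16_t sum = 0;
    for (size_t i = 0; i < nwords; i++)
        sum += eeprom[i];              /* wraps at 16 bits, as intended */
    return sum == EEPROM_SUM;
}

/* Compute the final word so that the total sums to 0xBABA. */
static uint16_t eeprom_checksum_word(const uint16_t *eeprom, size_t nwords)
{
    uint16_t sum = 0;
    for (size_t i = 0; i + 1 < nwords; i++)
        sum += eeprom[i];
    return (uint16_t)(EEPROM_SUM - sum);
}
```

This is why a single misread word (e.g. a bus glitch returning all-ones) is enough to make the whole load fail.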
Realtek RTL8111B serious performance issues
Hi, I originally sent this email to the linux-net list before realizing it probably belonged on the netdev list. I just subscribed to this list, so I apologize if this is a known issue. I did try looking through the archives, and did not see it there either.

We just put together a new "app server" based on a P35 chipset motherboard, 4 gigabytes of RAM, a Q6600 processor, and an integrated Realtek RTL8111B gigabit NIC. When we SSH or RSH into this machine and try to run any X application (emacs, firefox), the application's graphics are drawn *extremely* slowly. It can take 10 seconds from the time an emacs window pops up until it is done drawing all of its icons. Firefox is even worse. Loading pages is painful. The "spinning dots", in the upper right-hand corner, never actually spin. It takes a long time for a page to be displayed, and when it is drawn, it is all at once. Scrolling a page up/down is extremely jerky.

We are currently running kernel 2.6.22.1, but I have also tried going back to 2.6.20.x without any change in behavior. The NIC driver is loaded as:

kernel: eth0: RTL8168b/8111b at 0xc264, 00:1a:4d:43:db:d4, IRQ 17

I tried going to Realtek's site to see if there was a newer driver, but the only driver there seems to be for older kernels. I finally put an old Linksys 10/100 PCI NIC in the system, and that has SOLVED the problem. We would prefer using the integrated NIC, however.

04:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 01)
	Subsystem: Giga-byte Technology Unknown device e000
	Flags: bus master, fast devsel, latency 0, IRQ 17
	I/O ports at c000 [size=256]
	Memory at f800 (64-bit, non-prefetchable) [size=4K]
	[virtual] Expansion ROM at fb20 [disabled] [size=64K]
	Capabilities: [40] Power Management version 2
	Capabilities: [48] Vital Product Data
	Capabilities: [50] Message Signalled Interrupts: Mask- 64bit+ Queue=0/1 Enable-
	Capabilities: [60] Express Endpoint IRQ 0
	Capabilities: [84] Vendor Specific Information
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [12c] Virtual Channel
	Capabilities: [148] Device Serial Number 68-81-ec-10-00-00-00-25
	Capabilities: [154] Power Budgeting

Anyone have any suggestions for solving this problem?

Thanks,
John

John Patrick Poet                   Blue Sky Tours
Director of Systems Development     10832 Prospect Ave., N.E.
[EMAIL PROTECTED]                   Albuquerque, N.M. 87112
Ph. 505 293 9462  Fx. 505 293 6902
Re: Realtek RTL8111B serious performance issues
On Wed, 18 Jul 2007, Francois Romieu wrote:
[EMAIL PROTECTED] <[EMAIL PROTECTED]> :
[...] Anyone have any suggestions for solving this problem?
Try 2.6.23-rc1 when it is published or apply against 2.6.22 one of:
http://www.fr.zoreil.com/people/francois/misc/20070628-2.6.22-rc6-r8169-test.patch

Unfortunately, the 20070628 patch did not make any difference.

http://www.fr.zoreil.com/linux/kernel/2.6.x/2.6.22-rc6/r8169-20070628/

I tried various patches from that directory (aren't most or all of them included in the 20070628 patch?), but none of them helped either. This problem could be very difficult to track down. Like I said, it definitely affects emacs and firefox being "drawn" on a remote computer. Ping times, however, are not that bad:

PING 192.168.26.150: 56 data bytes
64 bytes from dyn26-1.blueskytours.com (192.168.26.150): icmp_seq=0. time=0.287 ms
64 bytes from dyn26-1.blueskytours.com (192.168.26.150): icmp_seq=1. time=0.279 ms
64 bytes from dyn26-1.blueskytours.com (192.168.26.150): icmp_seq=2. time=0.196 ms
64 bytes from dyn26-1.blueskytours.com (192.168.26.150): icmp_seq=3. time=0.201 ms
64 bytes from dyn26-1.blueskytours.com (192.168.26.150): icmp_seq=4. time=0.159 ms
64 bytes from dyn26-1.blueskytours.com (192.168.26.150): icmp_seq=5. time=0.148 ms
64 bytes from dyn26-1.blueskytours.com (192.168.26.150): icmp_seq=6. time=0.150 ms

Also, wget gets good throughput when retrieving files. It just seems to be X traffic which is extremely slow. Using the old Linksys 10/100 PCI NIC, emacs comes up virtually instantaneously. Using the integrated Realtek 8111B, emacs takes 10 seconds to draw.

Thank you very much for trying to help.

John
Re: Realtek RTL8111B serious performance issues
On Thu, 19 Jul 2007, Bill Fink wrote:
Hi John,
On Wed, 18 Jul 2007, [EMAIL PROTECTED] wrote:
[...]
Any chance that the Realtek 8111B is sharing interrupts with another device ("cat /proc/interrupts")? Perhaps it is, and the Linksys isn't, which could explain the difference in behavior. Just something simple to check and either rule in or out.
Yes it was, however "fixing" that did not solve the problem. Thanks for the thought.

John

P.S. I did send the pcap files to Francois Romieu, but I did not CC the list because they were large.
bug in tcp/ip stack
I tracked down something that appears to be a small bug in the networking code. The way in which I can reproduce it is a complex one, but it works 100% of the time, so here come the details. I noticed strange packets on my firewall coming from the mail server with the RST/ACK flags set, coming from a source port with no one listening on it and no connection attempts made to it from outside. There are a few messages on forums describing the same problem and calling them alien ACK/RST packets. The Postfix mail server gives this behavior if for some reason the client resets the connection but some packets from the client arrive after the RST; the server box responds with RST and then with RST/ACK (with the wrong source port number). Here is the packet dump:

1    0.00      10.0.0.254  10.0.0.68   TCP   5 > smtp [SYN] Seq=0 Len=0
2    0.001036  10.0.0.68   10.0.0.254  TCP   smtp > 5 [SYN, ACK] Seq=0 Ack=1 Win=5840 Len=0 MSS=1460
3    0.001096  10.0.0.254  10.0.0.68   TCP   5 > smtp [ACK] Seq=1 Ack=1 Win=1500 Len=0
4    0.001125  10.0.0.254  10.0.0.68   SMTP  Command: EHLO localhost
5    0.001150  10.0.0.254  10.0.0.68   TCP   5 > smtp [RST] Seq=17 Len=0
6    0.001175  10.0.0.254  10.0.0.68   TCP   5 > smtp [FIN, ACK] Seq=17 Ack=1 Win=1500 Len=0
7    0.001251  10.0.0.68   10.0.0.254  TCP   smtp > 5 [ACK] Seq=1 Ack=17 Win=5840 Len=0
8    0.001284  10.0.0.68   10.0.0.254  TCP   smtp > 5 [RST] Seq=1 Len=0
!!!9 0.218427  10.0.0.68   10.0.0.254  TCP   32768 > 5 [RST, ACK] Seq=0 Ack=0 Win=5840 Len=0

It is not a Postfix bug; it is present in the current 2.6.x and 2.4.x kernel versions but not in the 2.2.x tree. After investigation I found it was introduced in 2.4.0-test9-pre3 back in the year 2000 and has survived for 7 years. WOW :) The whole 2.4.0-test9-pre3 diff is pretty big, but I managed to find the lines responsible for this. They are located in include/net/tcp.h in the function tcp_enter_cwr:

	if (sk->prev && !(sk->userlocks & SOCK_BINDPORT_LOCK))
		tcp_put_port(sk);

It is not a big problem, but under some setups the firewall's conntrack table can get filled pretty quickly, because the wrong port number changes every time. Can you please check this out?
Evalds
small bug in tcp
When an application closes a socket with unread data in the receive buffer, the TCP stack sends the RST packet from the wrong source port, not the source port of the socket being closed. This is the same problem that was described in my first post, which unfortunately nobody cared to look into. This problem appeared in 2.4.0-test9-pre3 and is still present in the kernel.
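The trigger described above — close() on a socket that still has unread received data — is the case where Linux sends a RST instead of a normal FIN close. A small self-contained sketch of that trigger on loopback (the function name and timing sleeps are mine; the RST is observed on the peer as ECONNRESET):

```c
/* Demonstrate (on Linux loopback) that close()ing a TCP socket with
 * unread data in its receive buffer makes the kernel send a RST:
 * the peer's next recv() fails with ECONNRESET instead of seeing EOF. */
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int close_with_unread_data_resets_peer(void)
{
    struct sockaddr_in sa;
    socklen_t len = sizeof(sa);
    char buf[16];

    int ls = socket(AF_INET, SOCK_STREAM, 0);
    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    sa.sin_port = 0;                      /* pick an ephemeral port */
    bind(ls, (struct sockaddr *)&sa, sizeof(sa));
    listen(ls, 1);
    getsockname(ls, (struct sockaddr *)&sa, &len);

    int cs = socket(AF_INET, SOCK_STREAM, 0);
    connect(cs, (struct sockaddr *)&sa, sizeof(sa));
    send(cs, "EHLO", 4, 0);               /* queue data the server never reads */

    int ss = accept(ls, NULL, NULL);
    usleep(100000);                       /* let the data reach ss's queue */
    close(ss);                            /* unread data -> kernel sends RST */
    usleep(100000);                       /* let the RST reach cs */

    int rc = (recv(cs, buf, sizeof(buf), 0) == -1 && errno == ECONNRESET);
    close(cs);
    close(ls);
    return rc;
}
```

Capturing this with tcpdump on lo is how one would then check which source port the RST carries.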
strange tcp behavior
1186035057.207629  127.0.0.1 -> 127.0.0.1  TCP   5 > smtp [SYN] Seq=0 Len=0
1186035057.207632  127.0.0.1 -> 127.0.0.1  TCP   smtp > 5 [SYN, ACK] Seq=0 Ack=1 Win=32792 Len=0 MSS=16396
1186035057.207666  127.0.0.1 -> 127.0.0.1  TCP   5 > smtp [ACK] Seq=1 Ack=1 Win=1500 Len=0
1186035057.207699  127.0.0.1 -> 127.0.0.1  SMTP  Command: EHLO localhost
1186035057.207718  127.0.0.1 -> 127.0.0.1  TCP   smtp > 5 [ACK] Seq=1 Ack=17 Win=32792 Len=0
1186035057.207736  127.0.0.1 -> 127.0.0.1  TCP   5 > smtp [RST] Seq=17 Len=0
1186035057.223934  127.0.0.1 -> 127.0.0.1  TCP   33787 > 5 [RST, ACK] Seq=0 Ack=0 Win=32792 Len=0

Can someone please comment as to why the TCP stack sends the RST packet from the wrong source port in this situation? This is the same problem that was described in my first two posts, which unfortunately nobody seemed to notice. Here is source code which can reproduce the behavior described. The client side code is a complete mess, but it works.

Server:

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>

void main(void)
{
	int ms;
	int ss;
	struct sockaddr_in sa;
	char *str = "HELLO FRIEND";
	struct pollfd fd;
	int flags;

	ms = socket(PF_INET, SOCK_STREAM, IPPROTO_TCP);
	flags = fcntl(ms, F_GETFL, 0);
	fcntl(ms, F_SETFL, flags | O_NONBLOCK);
	memset(&sa, 0, sizeof(sa));
	sa.sin_family = AF_INET;
	sa.sin_addr.s_addr = htonl(INADDR_ANY);
	sa.sin_port = htons(25);
	bind(ms, (struct sockaddr *) &sa, sizeof(sa));
	listen(ms, 0);
	fd.fd = ms;
	fd.events = POLLIN;
	while (poll(&fd, 1, -1)) {
		ss = accept(ms, NULL, NULL);
		usleep(1);
		send(ss, str, strlen(str), MSG_NOSIGNAL);
		close(ss);
		memset(&fd, 0, sizeof(fd));
		fd.fd = ms;
		fd.events = POLLIN;
	}
}

Client:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <arpa/inet.h>
#include <linux/if_ether.h>
//#include
//#include

struct sockaddr_in localaddr;
struct sockaddr_in remoteaddr;
struct sockaddr rawaddr;
int sdl, sdr;
struct tcphdr header;
struct pheader_t {
	uint32_t saddr;
	uint32_t daddr;
	uint8_t r;
	uint8_t protocol;
	uint16_t length;
};
struct pheader_t pheader;
unsigned short tbuf[2048];
unsigned char buf[2048];
char *msg = "EHLO localhost\r\n";
unsigned char *p;
char *src_addr = "127.0.0.1";
char *dst_addr = "127.0.0.1";
unsigned short sprt = 5;
unsigned short dprt = 25;
struct timeval tv;
unsigned seq, ack_seq;
int data;

void mysend(void)
{
	int i, sum;
	int len;

	if (data) {
		len = strlen(msg);
		memcpy((char *) tbuf + sizeof(pheader) + sizeof(header), msg, len);
	} else
		len = 0;
	bzero(&pheader, sizeof(pheader));
	pheader.saddr = (in_addr_t) inet_addr(src_addr);
	pheader.daddr = (in_addr_t) inet_addr(dst_addr);
	pheader.protocol = 6;
	pheader.length = htons(sizeof(header) + len);
	memcpy(tbuf, &pheader, sizeof(pheader));
	memcpy((char *) tbuf + sizeof(pheader), &header, sizeof(header));
	sum = 0;
	for (i = 0; i < (sizeof(pheader) + sizeof(header)) / 2 + len / 2; i++) {
		sum += tbuf[i];
		sum = (sum & 0xffff) + (sum >> 16);
	}
	header.check = ~sum;
	memcpy((char *) tbuf + sizeof(pheader), &header, sizeof(header));
	sendto(sdr, (char *) tbuf + sizeof(pheader), sizeof(header) + len, 0,
	       (struct sockaddr *) &remoteaddr, sizeof(remoteaddr));
}

void main(void)
{
	gettimeofday(&tv, NULL);
	srand(tv.tv_sec & tv.tv_usec);
	remoteaddr.sin_family = AF_INET;
	remoteaddr.sin_addr.s_addr = (in_addr_t) inet_addr(dst_addr);
	sdl = socket(PF_INET, SOCK_PACKET, htons(ETH_P_ALL));
	strcpy(rawaddr.sa_data, "lo");
	bind(sdl, (struct sockaddr *) &rawaddr, sizeof(rawaddr));
	sdr = socket(AF_INET, SOCK_RAW, IPPROTO_TCP);
	bzero(&header, sizeof(header));
	header.source = htons(sprt);
	header.dest = htons(dprt);
	seq = rand();
	ack_seq = 0;
	header.seq = htonl(seq);
	header.ack_seq = htonl(ack_seq);
	header.doff = sizeof(header) / 4;
	header.syn = 1;
	header.window = htons(1500);
	mysend();
	while (1) {
		recvfrom(sdl, buf, sizeof(buf), 0, NULL, NULL);
		// p = buf + (*buf & 0x0f) * 4;
		p = (buf + 14) + (*(buf + 14) & 0x0f) * 4;
		if (ntohs(((struct tcphdr *) p)->source) == dprt &&
		    ntohs(((struct tcphdr *) p)->dest) == sprt &&
		    ((struct tcphdr *) p)->syn == 1 &&
		    ((struct tcphdr *) p)->ack == 1)
			break;
	}
	bzero(&header, sizeof(header));
	header.source = htons(sprt);
	header.dest = htons(dpr
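The folding loop in the client's mysend() is the standard Internet (ones'-complement) checksum from RFC 1071, applied over the TCP pseudo-header plus TCP header. Pulled out as a standalone function for clarity (the function name is mine):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* RFC 1071 Internet checksum: ones'-complement sum of 16-bit words,
 * carries folded back in, result complemented. */
static uint16_t inet_checksum(const uint16_t *words, size_t nwords)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < nwords; i++) {
        sum += words[i];
        sum = (sum & 0xffff) + (sum >> 16);   /* fold carry back in */
    }
    return (uint16_t)~sum;
}
```

With the worked example from RFC 1071 (words 0x0001, 0xf203, 0xf4f5, 0xf6f7) the folded sum is 0xddf2, so the checksum is 0x220d; summing the data together with its checksum then yields zero, which is how a receiver verifies it.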
Re: r8169: slow samba performance
On Wed, 22 Aug 2007, Bruce Cole wrote:
Shane wrote:
On Wed, Aug 22, 2007 at 09:39:47AM -0700, Bruce Cole wrote:
Shane, join the crowd :) Try the fix I just re-posted over here:
Bruce, thanks for the pointer. This fix works well for me at gigabit speeds, though I just added the three or so lines in the elseif statement as it was rejected with the r8169-20070818 patch. I suppose I could've merged the whole thing, and if you need that tested, let me know, but this is looking good.
Glad it works for you. I'm not the maintainer, and also don't have adequate specs from Realtek to definitively explain why the NPQ bit apparently needs to be re-enabled when some but not all of the TX FIFO is dequeued. It is documented as if it isn't cleared until the FIFO is empty. So I assume an official patch will have to wait until Francois is back.

I have had abysmal performance trying to remotely run X apps via ssh on a computer with a RTL8111 NIC. Saw this message and decided to give this patch a try --- success! Much, much better.

Thanks,
John
Re: r8169: slow samba performance
On Mon, 3 Sep 2007, Francois Romieu wrote: [EMAIL PROTECTED] <[EMAIL PROTECTED]> : [...] I have had abysmal performance trying to remotely run X apps via ssh on a computer with a RTL8111 NIC. Saw this message and decided to give this patch a try --- success! Much, much better. Can you give a try to: http://www.fr.zoreil.com/people/francois/misc/20070903-2.6.23-rc5-r8169-test.patch or just patches #0001 + #0002 at: http://www.fr.zoreil.com/linux/kernel/2.6.x/2.6.23-rc5/r8169-20070903/ 20070903-2.6.23-rc5-r8169-test.patch applied against 2.6.23-rc5 works fine. Performance is acceptable. Would you like me to *just* try patches 1 & 2, to help narrow down anything? Thanks, John
Re: r8169: slow samba performance
On Tue, 4 Sep 2007, Francois Romieu wrote: [EMAIL PROTECTED] <[EMAIL PROTECTED]> : [...] 20070903-2.6.23-rc5-r8169-test.patch applied against 2.6.23-rc5 works fine. Performance is acceptable. Does "acceptable" mean that there is a noticeable difference when compared to the patch based on a busy-waiting loop? Without this patch, latency in bringing up emacs, or in display of pages in firefox, is extremely high. With the patch, latency is pretty much what I see when using an old tulip based NIC. Is there a specific test you wish me to try? Would you like me to *just* try patches 1 & 2, to help narrow down anything? I expect patch #2 alone to be enough to enhance the performance. If it gets proven, the patch would be a good candidate for a quick merge upstream. Okay, I will build another kernel with just #2 applied. John
Re: Intel 82559 NIC corrupted EEPROM
Jesse Brandeburg wrote:
John wrote:
Jesse Brandeburg wrote:
can you try adding mdelay(100); in e100_eeprom_load before the for loop, and then change the multiple udelay(4) to mdelay(1) in e100_eeprom_read
I applied the attached patch. Loading the driver now takes around one minute :-)
ouch, but yep, that's what happens when you use "super extra delay"
I ran 'source load_unload' 25 times in a loop. The first 12 times were successful. The last 13 times failed. (cf. attached archive) I noticed something very strange. The number of words obviously in error (0x) returned by the EEPROM on 00:09.0 is not constant.
That is very strange. I would think that maybe you have something else on the bus with the e100 that may be hogging bus cycles, or you have failing hardware (maybe a bad eeprom, or possibly a bad mac chip).

$ grep -c 0x insmod*
insmod_300.txt:0
insmod_301.txt:0
insmod_302.txt:0
insmod_303.txt:0
insmod_304.txt:0
insmod_305.txt:0
insmod_306.txt:0
insmod_307.txt:0
insmod_308.txt:0
insmod_309.txt:0
insmod_310.txt:0
insmod_311.txt:0
insmod_312.txt:1
insmod_313.txt:5
insmod_314.txt:24
insmod_315.txt:45
insmod_316.txt:243
insmod_317.txt:256
insmod_318.txt:256
insmod_319.txt:256
insmod_320.txt:256
insmod_321.txt:256
insmod_322.txt:256
insmod_323.txt:253
insmod_324.txt:240

this is even stranger, does it cycle back down (sine wave) to zero again?
The delays did seem to work, at least sometimes. This indicates that something needs that extra delay to successfully read the eeprom. I might try changing all the udelay(4) to udelay(40) (x10 increase) and see if that gives you a happy medium of "most times the driver loads without error".
John, this problem seems to be very specific to your hardware. I know that you have put in a lot of time debugging this, but I'm not sure what we can do from here. If this were a generic code problem more people would be reporting the issue. What would you like to do?

At this stage I would like e100 to work better than it does, but I'm not sure what to do next.

Hello everyone, I'm resurrecting this thread because it appears we'll need to support these motherboards for several months to come, yet Adrian Bunk has scheduled the removal of eepro100 in January 2007. To recap, we have to support ~30 EBC-2000T motherboards.

http://www.adlinktech.com/PD/web/PD_detail.php?pid=213

These motherboards come with three on-board Intel 82559 NICs. Last time I checked, i.e. two months ago, e100 did not correctly initialize all three NICs on these motherboards. Therefore, we've been using eepro100. I will be testing the latest 2.6.20 kernel to see if the situation has changed, but I wanted to let you all know that there are still some eepro100 users out there, out of necessity.

Regards, John
CLOCK_MONOTONIC datagram timestamps by the kernel
Hello,

I know it's possible to have Linux timestamp incoming datagrams as soon as they are received, then for one to retrieve this timestamp later with an ioctl command or a recvmsg call. As far as I understand, one can either do

	const int on = 1;
	setsockopt(sock, SOL_SOCKET, SO_TIMESTAMP, &on, sizeof on);

then use recvmsg(), or not set the SO_TIMESTAMP socket option and just call

	ioctl(sock, SIOCGSTAMP, &tv);

after each datagram has been received.

SIOCGSTAMP
	Return a struct timeval with the receive timestamp of the last packet passed to the user. This is useful for accurate round trip time measurements. See setitimer(2) for a description of struct timeval.

As far as I understand, this timestamp is given by the CLOCK_REALTIME clock. However, I would like to obtain a timestamp given by the CLOCK_MONOTONIC clock.

Relevant parts of the code (I think): net/core/dev.c

void net_enable_timestamp(void)
{
	atomic_inc(&netstamp_needed);
}

void __net_timestamp(struct sk_buff *skb)
{
	struct timeval tv;

	do_gettimeofday(&tv);
	skb_set_timestamp(skb, &tv);
}

static inline void net_timestamp(struct sk_buff *skb)
{
	if (atomic_read(&netstamp_needed))
		__net_timestamp(skb);
	else {
		skb->tstamp.off_sec = 0;
		skb->tstamp.off_usec = 0;
	}
}

do_gettimeofday() just calls __get_realtime_clock_ts(). Would it be possible to replace do_gettimeofday() by ktime_get_ts(), with the appropriate division by 1000 to convert the struct timespec back into a struct timeval?

void __net_timestamp(struct sk_buff *skb)
{
	struct timespec now;
	struct timeval tv;

	ktime_get_ts(&now);
	tv.tv_sec = now.tv_sec;
	tv.tv_usec = now.tv_nsec / 1000;
	skb_set_timestamp(skb, &tv);
}

How many apps / drivers would this break? Is there perhaps a different way to achieve this?

Regards.
Re: CLOCK_MONOTONIC datagram timestamps by the kernel
John wrote:
I know it's possible to have Linux timestamp incoming datagrams as soon as they are received, then for one to retrieve this timestamp later with an ioctl command or a recvmsg call.

Has it ever been proposed to modify struct skb_timeval to hold nanosecond stamps instead of just microsecond stamps? Then make the improved precision somehow available to user space.

On a related note, the comment for skb_set_timestamp() states:

/**
 * skb_set_timestamp - set timestamp of a skb
 * @skb: skb to set stamp of
 * @stamp: pointer to struct timeval to get stamp from
 *
 * Timestamps are stored in the skb as offsets to a base timestamp.
 * This function converts a struct timeval to an offset and stores
 * it in the skb.
 */

But there is no mention of an offset in the code:

static inline void skb_set_timestamp(struct sk_buff *skb, const struct timeval *stamp)
{
	skb->tstamp.off_sec = stamp->tv_sec;
	skb->tstamp.off_usec = stamp->tv_usec;
}

Likewise for skb_get_timestamp:

/**
 * skb_get_timestamp - get timestamp from a skb
 * @skb: skb to get stamp from
 * @stamp: pointer to struct timeval to store stamp in
 *
 * Timestamps are stored in the skb as offsets to a base timestamp.
 * This function converts the offset back to a struct timeval and stores
 * it in stamp.
 */
static inline void skb_get_timestamp(const struct sk_buff *skb, struct timeval *stamp)
{
	stamp->tv_sec = skb->tstamp.off_sec;
	stamp->tv_usec = skb->tstamp.off_usec;
}

Are the comments related to code that has since been modified?

Regards.
Re: CLOCK_MONOTONIC datagram timestamps by the kernel
Eric Dumazet wrote:
John wrote:
I know it's possible to have Linux timestamp incoming datagrams as soon as they are received, then for one to retrieve this timestamp later with an ioctl command or a recvmsg call. Has it ever been proposed to modify struct skb_timeval to hold nanosecond stamps instead of just microsecond stamps? Then make the improved precision somehow available to user space.
Most modern NICs are able to delay packet delivery, in order to reduce the number of interrupts and benefit from better cache hits.

You are referring to NAPI interrupt mitigation, right? AFAIU, it is possible to disable this feature. I'm dealing with 200-4000 packets per second. I don't think I'd save much with interrupt mitigation. Please correct any misconception.

Then the kernel is not realtime and some delays can occur between the hardware interrupt and the very moment we timestamp the packet. If CPU caches are cold, even the instruction fetches could easily add some us.

I've applied the real-time patch.
http://rt.wiki.kernel.org/index.php/Main_Page
This doesn't make Linux hard real-time, but the interrupt handlers can run with the highest priority (even kernel threads are preempted).

Enabling nanosecond stamps would be a lie to users, because the real accuracy is not nanosecond, but on the order of 10 us (at least)

POSIX is moving to nanosecond interfaces.
http://www.opengroup.org/onlinepubs/009695399/functions/clock_settime.html
struct timeval and struct timespec take as much space (64 bits). If the hardware can indeed manage sub-microsecond accuracy, a struct timeval forces the kernel to discard valuable information.

If you depend on a < 50 us precision, then linux might be the wrong OS for your application. Or maybe you need a NIC that is able to provide a timestamp in the packet itself (well... along with the packet...), so that kernel latencies are not a problem.

Does Linux support NICs that can do that?

Regards.
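The "discard valuable information" point is just the truncation in the timespec-to-timeval conversion proposed earlier in the thread: anything below one microsecond is lost. A trivial illustration (the helper name is mine):

```c
/* Converting a nanosecond timespec to a microsecond timeval, as the
 * proposed __net_timestamp() change would do: the sub-microsecond
 * part of the reading is truncated away. */
#include <sys/time.h>
#include <time.h>

static struct timeval timespec_to_timeval(struct timespec ts)
{
    struct timeval tv;
    tv.tv_sec = ts.tv_sec;
    tv.tv_usec = ts.tv_nsec / 1000;   /* 567 ns of 1234567 ns are lost */
    return tv;
}
```

So a stamp of 1.001234567 s becomes 1.001234 s, which is irreversible once stored in a struct timeval.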
Re: CLOCK_MONOTONIC datagram timestamps by the kernel
Eric Dumazet wrote:
On Wednesday 28 February 2007 15:23, John wrote:
[...] You are referring to NAPI interrupt mitigation, right?
Nope; I am referring to hardware features. NAPI is software. See ethtool -c eth0

# ethtool -c eth0
Coalesce parameters for eth0:
Adaptive RX: off  TX: off
stats-block-usecs: 100
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0
rx-usecs: 300
rx-frames: 60
rx-usecs-irq: 300
rx-frames-irq: 60
tx-usecs: 200
tx-frames: 53
tx-usecs-irq: 200
tx-frames-irq: 53

You can see that on this setup, rx interrupts can be delayed up to 300 us (up to 60 packets might be delayed).

One can disable interrupt mitigation. Your argument that it introduces latency therefore becomes irrelevant.

POSIX is moving to nanosecond interfaces.
http://www.opengroup.org/onlinepubs/009695399/functions/clock_settime.html

You snipped too much. I also wrote: struct timeval and struct timespec take as much space (64 bits). If the hardware can indeed manage sub-microsecond accuracy, a struct timeval forces the kernel to discard valuable information.

The fact that you are able to give nanosecond timestamps inside the kernel is not sufficient. It is necessary, of course, but not sufficient. This precision is OK to time locally generated events. The moment you ask for a 'nanosecond' timestamp, it's usually long before/after the real event. If you rely on nanosecond precision on network packets, then something is wrong with your algo.
Even rt patches won't make sure your CPU caches are pre-filled, or that the routers/links between your machines are not busy. A cache miss costs 40 ns for example. A typical interrupt handler or rx processing can trigger 100 cache misses, or none at all if the cache is hot.

Consider an idle Linux 2.6.20-rt8 system, equipped with a single PCI-E gigabit Ethernet NIC, running on a modern CPU (e.g. Core 2 Duo E6700). All this system does is timestamp 1000 packets per second. Are you claiming that this platform *cannot* handle most packets within less than 1 microsecond of their arrival?

If there are platforms that can achieve sub-microsecond precision, and if it is not more expensive to support nanosecond resolution (I said resolution, not precision), then it makes sense to support nanosecond resolution in Linux. Right?

You said that rt gives highest priority to interrupt handlers: if you have several NICs, what will happen if you receive packets on both NICs, or if the NIC interrupt happens at the same time as the timer interrupt? One timestamp will be wrong for sure.

Again, this is irrelevant. We are discussing whether it would make sense to support sub-microsecond resolution. If there is one platform that can achieve sub-microsecond precision, there is a need for sub-microsecond resolution. As long as we are changing the resolution, we might as well use something standard like struct timespec.

For sure we could timestamp packets with nanosecond resolution, and eventually with a MONOTONIC value too, but it will give you (and others) false confidence in the real precision. us timestamps are already wrong...

IMHO, this is not true for all platforms.

Regards.
Re: CLOCK_MONOTONIC datagram timestamps by the kernel
Eric Dumazet wrote: John wrote: Consider an idle Linux 2.6.20-rt8 system, equipped with a single PCI-E gigabit Ethernet NIC, running on a modern CPU (e.g. Core 2 Duo E6700). All this system does is time stamp 1000 packets per second. Are you claiming that this platform *cannot* handle most packets within less than 1 microsecond of their arrival? Yes, I claim it. You expect too much of this platform, unless "most" means 10 % for you ;) By "most" I meant more than 50%. Has someone tried to measure interrupt latency in Linux? I'd like to plot the distribution of network IRQ to interrupt handler latencies. If you replace "1 us" by "50 us", then yes, it probably can do it, if "most" means 99% (not 99.999 %). I think we need cold, hard numbers at this point :-) Anyway, if you want to play, you can apply this patch on top of linux-2.6.21-rc2 (the nanosecond resolution infrastructure needs 2.6.21). I'll let you do the adjustments for the rt kernel. Why does it require 2.6.21? This patch converts the sk_buff timestamp to use the new nanosecond infrastructure (added in 2.6.21). Is this mentioned somewhere in the 2.6.21-rc1 ChangeLog? http://kernel.org/pub/linux/kernel/v2.6/testing/ChangeLog-2.6.21-rc1 Regards.
Mellanox ConnectX3 Pro and kernel 4.4 low throughput bug
I'm running into a bug with kernel 4.4.0 where a VM-VM test between two different baremetal hosts (HP ProLiant DL360 Gen9) has receive-side throughput that's about 25% lower than expected with a Mellanox ConnectX3-Pro NIC. The VMs are connected over a VXLAN tunnel that I used Open vSwitch 2.4.90 to set up on both hosts. When the Mellanox NIC is the endpoint of the VXLAN tunnel and its VM receives a throughput test, the VM gets about 6.65Gb/s throughput where other NICs get ~8.3Gb/s (8.04 for Niantic, 8.65 for Broadcom). When I test the Mellanox in a (patched) 3.14.57 kernel, I get 8.9Gb/s between VMs. I have traced the issue as far as a TUN interface that 'plugs in' to Open vSwitch, which takes packets for the VM. If I run tcpdump on this tun interface (called vnet0 in my case), I get small TCP packets - they're all 1398 bytes in length - when I do a VM-VM test. I also see high CPU usage for the vhost kernel thread. If I run ftrace during a throughput test and grep for the vhost thread (once done), and wc -l the result, there are an order of magnitude more function calls in this thread versus the same thing with the Broadcom. If I do the same test with a Broadcom NIC as the endpoint for the VXLAN tunnel, I get large packets - the size varies but generally it's in the five digit range - some are almost 65535. There are fewer calls in the vhost thread, as mentioned above. This is also visible in top: the vhost kernel thread and the libvirt+ process both have noticeably higher CPU usage. I've tried doing a bisect of the kernel to figure out where the change occurred that allowed the Broadcom NIC to perform GRO but not the Mellanox. I know that between 4.2 and 4.3 the tun device started to perform GRO and this is where the difference in throughput started. However there's something between these two versions that breaks my setup completely and I can't get any kind of traffic to or from the VM from anywhere. 
I tried to draw a diagram here:

                 |- high CPU%
 ->[mlx4_en/core]>[vxlan]--->[openvswitch]--->[tun]>[vhost]--->VM
                 |- small packets (1398)

                 |- low CPU%
 ->[bnx2x       ]>[vxlan]--->[openvswitch]--->[tun]>[vhost]--->VM
                 |- big packets (~65535)

NIC info:

root@hLinux-ovstest-1:/home/john# ethtool -i rename8
driver: mlx4_en
version: 2.2-1 (Feb 2014)
firmware-version: 2.34.5010
bus-info: 0000:08:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes

root@hLinux-ovstest-1:/home/john# ethtool -k rename8
Features for rename8:
rx-checksumming: on
tx-checksumming: on
	tx-checksum-ipv4: on
	tx-checksum-ip-generic: off [fixed]
	tx-checksum-ipv6: on
	tx-checksum-fcoe-crc: off [fixed]
	tx-checksum-sctp: off [fixed]
scatter-gather: on
	tx-scatter-gather: on
	tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
	tx-tcp-segmentation: on
	tx-tcp-ecn-segmentation: off [fixed]
	tx-tcp6-segmentation: on
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: on [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: off [fixed]
tx-ipip-segmentation: off [fixed]
tx-sit-segmentation: off [fixed]
tx-udp_tnl-segmentation: on [requested off]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off
rx-fcs: off
rx-all: off
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
busy-poll: on [fixed]

root@hLinux-ovstest-1:/home/john# lspci -vvs 0000:08:00.0
08:00.0 Ethernet controller: Mellanox Technologies MT27520 Family [ConnectX-3 Pro]
	Subsystem: Hewlett-Packard Company Device 801f
	Physical Slot: 1
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
	ParErr+ Stepping- SERR+ FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 0
	Region 0: Memory at 96000000 (64-bit, non-prefetchable) [size=1M]
	Region 2: Memory at 94000000 (64-bit, prefetchable) [size=32M]
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [48] Vital Product Data
		Product Name: HP Ethernet 10G 2-port 546SFP+ Adapter
		Read-only fiel
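[Editor's note: the 1398-byte segments seen on vnet0 are exactly what a 1500-byte outer MTU leaves after VXLAN encapsulation plus TCP options, which is consistent with GRO simply not merging on the mlx4 receive path. A back-of-envelope sketch, assuming standard header sizes and the TCP timestamp option enabled - these values are illustrative, not taken from the report:]

```c
/* Why VM-VM TCP segments over a VXLAN tunnel come out at 1398 bytes
 * when nothing merges them: the encapsulation eats 50 bytes of the
 * 1500-byte outer MTU, and the inner IP/TCP headers plus the TCP
 * timestamp option eat another 52. */
enum {
    OUTER_MTU = 1500,
    INNER_ETH = 14,   /* inner Ethernet frame carried inside the tunnel */
    OUTER_IP  = 20,   /* outer IPv4 header */
    OUTER_UDP = 8,    /* outer UDP header */
    VXLAN_HDR = 8,    /* VXLAN header */
    INNER_IP  = 20,
    INNER_TCP = 20,
    TCP_TSOPT = 12,   /* TCP timestamp option, on by default */
};

static int vxlan_tcp_mss(void)
{
    int encap = INNER_ETH + OUTER_IP + OUTER_UDP + VXLAN_HDR; /* 50  */
    int inner_mtu = OUTER_MTU - encap;                        /* 1450 */
    return inner_mtu - INNER_IP - INNER_TCP - TCP_TSOPT;      /* 1398 */
}
```

When GRO does merge, the aggregate can grow toward the 65535-byte IP maximum, matching the "almost 65535" sizes seen on the bnx2x path.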
Re: Kernel memory leak in bnx2x driver with vxlan tunnel
On 01/19/2016 06:31 PM, Thomas Graf wrote: On 01/19/16 at 04:51pm, Jesse Gross wrote: On Tue, Jan 19, 2016 at 4:17 PM, Eric Dumazet wrote: So what is the purpose of having a dst if we need to drop it? Adding code in GRO would be fine if someone explains to me the purpose of doing apparently useless work. (Refcounting on a dst is not exactly free.) In the GRO case, the dst is only dropped on the packets which have been merged and therefore need to be freed (the GRO_MERGED_FREE case). It's not being thrown away for the overall frame, just metadata that has been duplicated on each individual frame, similar to the metadata in struct sk_buff itself. And while it is not used by the IP stack, there are other consumers (eBPF/OVS/etc.). This entire process is controlled by the COLLECT_METADATA flag on tunnels, so there is no cost in situations where it is not actually used. Right. There were thoughts around leveraging a per-CPU scratch buffer without a refcount and turning it into a full reference when the packet gets enqueued somewhere, but the need hasn't really come up yet. Jesse, is this what you have in mind:

diff --git a/net/core/dev.c b/net/core/dev.c
index cc9e365..3a5e96d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4548,9 +4548,10 @@ static gro_result_t napi_skb_finish(gro_result_t ret, struct sk_buff *skb)
 		break;
 
 	case GRO_MERGED_FREE:
-		if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD)
+		if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD) {
+			skb_release_head_state(skb);
 			kmem_cache_free(skbuff_head_cache, skb);
-		else
+		} else
 			__kfree_skb(skb);
 		break;

So I've tested the below patch (same as the one above, with minor modifications made to make it compile) and it worked - no memory leak. Should I submit this or...? 
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 4355129..a8fac63 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2829,6 +2829,7 @@ int skb_zerocopy(struct sk_buff *to, struct sk_buff *from,
 void skb_split(struct sk_buff *skb, struct sk_buff *skb1, const u32 len);
 int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen);
 void skb_scrub_packet(struct sk_buff *skb, bool xnet);
+void skb_release_head_state(struct sk_buff *skb);
 unsigned int skb_gso_transport_seglen(const struct sk_buff *skb);
 struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features);
 struct sk_buff *skb_vlan_untag(struct sk_buff *skb);
diff --git a/net/core/dev.c b/net/core/dev.c
index ae00b89..76e3623 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4337,9 +4337,10 @@ static gro_result_t napi_skb_finish(gro_result_t ret, struct sk_buff *skb)
 		break;
 
 	case GRO_MERGED_FREE:
-		if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD)
+		if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD) {
+			skb_release_head_state(skb);
 			kmem_cache_free(skbuff_head_cache, skb);
-		else
+		} else
 			__kfree_skb(skb);
 		break;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b2df375..45f6f50 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -633,7 +633,7 @@ fastpath:
 	kmem_cache_free(skbuff_fclone_cache, fclones);
 }
 
-static void skb_release_head_state(struct sk_buff *skb)
+void skb_release_head_state(struct sk_buff *skb)
 {
 	skb_dst_drop(skb);
 #ifdef CONFIG_XFRM
Re: Intel 82559 NIC corrupted EEPROM
If this bit equals 0b, the idle recognition circuit is disabled and the 82559 always remains in an active state. Thus, the 82559 always requests PCI CLK using the Clockrun mechanism. Auke, do you agree with Donald Becker's warning? If I disable STB, the NICs will waste a bit more power when idle, is that correct? Are there other implications? Thanks for reading this far! John
Re: Intel 82559 NIC corrupted EEPROM
Auke Kok wrote: This is what I was afraid of: even though the code allows you to bypass the EEPROM checksum, the probe fails on a further check to see if the MAC address is valid. Since something with this NIC specifically made the EEPROM return all 0xff's, the MAC address is automatically invalid, and thus probe fails. I don't understand why you think there is something wrong with a specific NIC? In 2.6.14.7, e100.ko fails to read the EEPROM on 0000:00:08.0 (eth0). In 2.6.18.1, e100.ko fails to read the EEPROM on 0000:00:09.0 (eth1). In both kernels, eepro100.ko successfully reads all the EEPROMs. It seems that the driver has more problems with this NIC than just the eeprom checksum being bad. Needless to say this might need fixing. Can you load the eepro driver and send me the full eeprom dump? Perhaps I can duplicate things over here.

00:08.0 EEPROM contents, size 64x16
3000 0464 e4e6 0e03 0201 4701 7213 8310 40a2 0001 8086 0128 92f7

00:09.0 EEPROM contents, size 64x16
3000 0464 e5e6 0e03 0201 4701 7213 8310 40a2 0001 8086 0128 91f7

00:0a.0 EEPROM contents, size 64x16
3000 0464 e6e6 0e03 0201 4701 7213 8310 40a2 0001 8086 0128 90f7
Re: Intel 82559 NIC corrupted EEPROM
Jesse Brandeburg wrote: I suspect that one reason Becker's code works is that it uses IO based access (slower, and different method) to the adapter rather than memory mapped access. I've noticed this difference. The second thought is that the adapter is in D3, and something about your kernel or the driver doesn't successfully wake it up to D0. On my NICs, the EEPROM ID (Word 0Ah) is set to 0x40a2. Thus DDPD (bit 6) is set to 0. DDPD is the "Disable Deep Power Down while PME is disabled" bit. 0 - Deep Power Down is enabled in D3 state while PME-disabled. 1 - Deep Power Down disabled in D3 state while PME-disabled. This bit should be set to 1b if a TCO controller is being used via the SMB because it requires receive functionality at all power states. Are you suggesting I try and set DDPD to 1? Or is this completely unrelated? An indication of this would be looking at lspci -vv before/after loading the driver. $ diff -u lspci_vv_before_e100.txt lspci_vv_after_e100.txt --- lspci_vv_before_e100.txt2006-11-09 14:51:30.0 +0100 +++ lspci_vv_after_e100.txt 2006-11-09 14:51:30.0 +0100 @@ -74,21 +74,20 @@ Expansion ROM at 2000 [disabled] [size=1M] Capabilities: [dc] Power Management version 2 Flags: PMEClk- DSI+ D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+) - Status: D0 PME-Enable+ DSel=0 DScale=2 PME- + Status: D0 PME-Enable- DSel=0 DScale=2 PME- 00:09.0 Ethernet controller: Intel Corporation 82557/8/9 [Ethernet Pro 100] (rev 08) Subsystem: Intel Corporation EtherExpress PRO/100B (TX) - Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- + Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap+ 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- SERR- - Latency: 32 (2000ns min, 14000ns max), cache line size 08 Interrupt: pin A routed to IRQ 10 - Region 0: Memory at e5302000 (32-bit, non-prefetchable) [size=4K] - Region 1: I/O ports at dc00 [size=64] - Region 2: Memory at e510 
(32-bit, non-prefetchable) [size=1M] + Region 0: Memory at e5302000 (32-bit, non-prefetchable) [disabled] [size=4K] + Region 1: I/O ports at dc00 [disabled] [size=64] + Region 2: Memory at e510 (32-bit, non-prefetchable) [disabled] [size=1M] Expansion ROM at 2010 [disabled] [size=1M] Capabilities: [dc] Power Management version 2 Flags: PMEClk- DSI+ D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+) - Status: D0 PME-Enable+ DSel=0 DScale=2 PME- + Status: D0 PME-Enable- DSel=0 DScale=2 PME- 00:0a.0 Ethernet controller: Intel Corporation 82557/8/9 [Ethernet Pro 100] (rev 08) Subsystem: Intel Corporation EtherExpress PRO/100B (TX) Also, after loading/unloading eepro100 does the e100 driver work? No. A third idea is to look for a master abort in lspci after e100 fails to load. I don't understand that one.
Re: Intel 82559 NIC corrupted EEPROM
Jesse Brandeburg wrote: Can you send output of cat /proc/iomem

00000000-0009ffff : System RAM
000a0000-000bffff : Video RAM area
000f0000-000fffff : System ROM
00100000-0ffeffff : System RAM
  00100000-00296a1a : Kernel code
  00296a1b-0031bbe7 : Kernel data
0fff0000-0fff2fff : ACPI Non-volatile Storage
0fff3000-0fffffff : ACPI Tables
20000000-200fffff : 0000:00:08.0
20100000-201fffff : 0000:00:09.0
20200000-202fffff : 0000:00:0a.0
e0000000-e3ffffff : 0000:00:00.0
e5000000-e50fffff : 0000:00:08.0
e5100000-e51fffff : 0000:00:09.0
e5200000-e52fffff : 0000:00:0a.0
e5300000-e5300fff : 0000:00:08.0
e5301000-e5301fff : 0000:00:0a.0
e5302000-e5302fff : 0000:00:09.0
- : reserved

I've also attached:
 o config-2.6.18.1-adlink used to compile this kernel
 o dmesg output after the machine boots

try something like the attached patch

Loading e100-debug.ko reports:

e100: Intel(R) PRO/100 Network Driver, 3.5.10-k2-NAPI
e100: Copyright(c) 1999-2005 Intel Corporation
***e100 debug: unable to set power state (error 0)
ACPI: PCI Interrupt Link [LNKA] enabled at IRQ 12
PCI: setting IRQ 12 as level-triggered
ACPI: PCI Interrupt 0000:00:08.0[A] -> Link [LNKA] -> GSI 12 (level, low) -> IRQ 12
***e100 debug: read 0100/ from the same register
e100: eth0: e100_probe: addr 0xe5300000, irq 12, MAC addr 00:30:64:04:E6:E4
***e100 debug: unable to set power state (error 0)
ACPI: PCI Interrupt Link [LNKB] enabled at IRQ 10
PCI: setting IRQ 10 as level-triggered
ACPI: PCI Interrupt 0000:00:09.0[A] -> Link [LNKB] -> GSI 10 (level, low) -> IRQ 10
***e100 debug: read 0100/ from the same register
e100: 0000:00:09.0: e100_eeprom_load: EEPROM corrupted
ACPI: PCI interrupt for device 0000:00:09.0 disabled
e100: probe of 0000:00:09.0 failed with error -11
***e100 debug: unable to set power state (error 0)
ACPI: PCI Interrupt Link [LNKC] enabled at IRQ 11
PCI: setting IRQ 11 as level-triggered
ACPI: PCI Interrupt 0000:00:0a.0[A] -> Link [LNKC] -> GSI 11 (level, low) -> IRQ 11
***e100 debug: read 0100/ from the same register
e100: eth1: e100_probe: addr 0xe5301000, irq 11, MAC addr 00:30:64:04:E6:E6

In other words, the behavior is the same for all three NICs. 
pci_set_power_state(pdev, PCI_D0) returns 0 pci_iomap returns something != NULL Can I provide more information to help locate the problem? # # Automatically generated make config: don't edit # Linux kernel version: 2.6.18.1-hrt # Tue Nov 7 17:52:26 2006 # CONFIG_X86_32=y CONFIG_GENERIC_TIME=y CONFIG_GENERIC_CLOCKEVENTS=y CONFIG_LOCKDEP_SUPPORT=y CONFIG_STACKTRACE_SUPPORT=y CONFIG_SEMAPHORE_SLEEPERS=y CONFIG_X86=y CONFIG_MMU=y CONFIG_GENERIC_ISA_DMA=y CONFIG_GENERIC_IOMAP=y CONFIG_GENERIC_HWEIGHT=y CONFIG_ARCH_MAY_HAVE_PC_FDC=y CONFIG_DMI=y CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config" # # Code maturity level options # CONFIG_EXPERIMENTAL=y CONFIG_BROKEN_ON_SMP=y CONFIG_LOCK_KERNEL=y CONFIG_INIT_ENV_ARG_LIMIT=32 # # General setup # CONFIG_LOCALVERSION="" # CONFIG_LOCALVERSION_AUTO is not set CONFIG_SWAP=y CONFIG_SYSVIPC=y # CONFIG_POSIX_MQUEUE is not set # CONFIG_BSD_PROCESS_ACCT is not set # CONFIG_TASKSTATS is not set # CONFIG_AUDIT is not set CONFIG_IKCONFIG=y CONFIG_IKCONFIG_PROC=y # CONFIG_RELAY is not set CONFIG_INITRAMFS_SOURCE="" # CONFIG_CC_OPTIMIZE_FOR_SIZE is not set # CONFIG_EMBEDDED is not set CONFIG_UID16=y CONFIG_SYSCTL=y CONFIG_KALLSYMS=y # CONFIG_KALLSYMS_EXTRA_PASS is not set CONFIG_HOTPLUG=y CONFIG_PRINTK=y CONFIG_BUG=y CONFIG_ELF_CORE=y CONFIG_BASE_FULL=y CONFIG_FUTEX=y CONFIG_EPOLL=y CONFIG_SHMEM=y CONFIG_SLAB=y CONFIG_VM_EVENT_COUNTERS=y CONFIG_RT_MUTEXES=y # CONFIG_TINY_SHMEM is not set CONFIG_BASE_SMALL=0 # CONFIG_SLOB is not set # # Loadable module support # CONFIG_MODULES=y CONFIG_MODULE_UNLOAD=y # CONFIG_MODULE_FORCE_UNLOAD is not set # CONFIG_MODVERSIONS is not set # CONFIG_MODULE_SRCVERSION_ALL is not set # CONFIG_KMOD is not set # # Block layer # # CONFIG_LBD is not set # CONFIG_BLK_DEV_IO_TRACE is not set # CONFIG_LSF is not set # # IO Schedulers # CONFIG_IOSCHED_NOOP=y # CONFIG_IOSCHED_AS is not set # CONFIG_IOSCHED_DEADLINE is not set CONFIG_IOSCHED_CFQ=y # CONFIG_DEFAULT_AS is not set # CONFIG_DEFAULT_DEADLINE is not 
set CONFIG_DEFAULT_CFQ=y # CONFIG_DEFAULT_NOOP is not set CONFIG_DEFAULT_IOSCHED="cfq" # # Processor type and features # # CONFIG_HIGH_RES_TIMERS is not set # CONFIG_SMP is not set CONFIG_X86_PC=y # CONFIG_X86_ELAN is not set # CONFIG_X86_VOYAGER is not set # CONFIG_X86_NUMAQ is not set # CONFIG_X86_SUMMIT is not set # CONFIG_X86_BIGSMP is not set # CONFIG_X86_VISWS is not set # CONFIG_X86_GENERICARCH is not set # CONFIG_X86_ES7000 is not set # CONFIG_M386 is not set # CONFIG_M486 is not set # CONFIG_M586 is not set # CONFIG_M586TSC is not set # CONFIG_M586MMX is not set # CONFIG_M686 is not set # CONFIG_MPENTIUMII is not set CONFIG_MPENTIUMIII=y # CONFIG_MPENTIUMM is not set # CONFIG_MPENTIUM4 is not set # CONFIG_MK6 is not set # CONFIG_MK7 is not set # CON
[PATCH] fix up sysctl_tcp_mem initialization
The initial values of sysctl_tcp_mem are sometimes greater than the total memory in the system (particularly on SMP systems). This patch ensures that tcp_mem[2] is always <= 3/4 nr_kernel_pages. However, I wonder if we want to set this differently than the way this patch does it. Depending on how far off the memory size is from a power of two (exactly equal to a power of two is the worst case), and if total memory is <128M, it can be substantially less than 3/4.

-John

Fix up tcp_mem initial settings to take into account the size of the hash entries (different on SMP and non-SMP systems).

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
commit d4ef8c8245c0a033622ce9ba9e25d379475254f6
tree 5377b8af0bac3b92161188e7369a84e472b5acb2
parent ea55b7c31b47edf90132baea9a088da3bbe2bb5c
author John Heffner <[EMAIL PROTECTED]> Tue, 14 Nov 2006 14:53:27 -0500
committer John Heffner <[EMAIL PROTECTED]> Tue, 14 Nov 2006 14:53:27 -0500

 net/ipv4/tcp.c | 7 ++++---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 4322318..c05e8ed 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2316,9 +2316,10 @@ void __init tcp_init(void)
 		sysctl_max_syn_backlog = 128;
 	}
 
-	sysctl_tcp_mem[0] = 768 << order;
-	sysctl_tcp_mem[1] = 1024 << order;
-	sysctl_tcp_mem[2] = 1536 << order;
+	/* Allow no more than 3/4 kernel memory (usually less) allocated to TCP */
+	sysctl_tcp_mem[0] = (1536 / sizeof (struct inet_bind_hashbucket)) << order;
+	sysctl_tcp_mem[1] = sysctl_tcp_mem[0] * 4 / 3;
+	sysctl_tcp_mem[2] = sysctl_tcp_mem[0] * 2;
 
 	limit = ((unsigned long)sysctl_tcp_mem[1]) << (PAGE_SHIFT - 7);
 	max_share = min(4UL*1024*1024, limit);
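[Editor's illustration: the patched arithmetic can be sketched in user space. struct fake_bind_hashbucket below is a hypothetical stand-in for the kernel's inet_bind_hashbucket; its real size differs between SMP and non-SMP builds, which is exactly what the division compensates for.]

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct inet_bind_hashbucket:
 * a lock plus a hash-chain head.  On SMP the spinlock is larger, so
 * each bucket -- and hence the whole table for a given 'order' --
 * costs more memory. */
struct fake_bind_hashbucket {
    unsigned long lock;
    void *chain;
};

/* Patched initialization: tcp_mem[0] shrinks as the bucket grows,
 * and tcp_mem[1]/tcp_mem[2] keep fixed ratios (4/3 and 2) to it. */
static void tcp_mem_init(long mem[3], int order, size_t bucket_size)
{
    mem[0] = (long)(1536 / bucket_size) << order;
    mem[1] = mem[0] * 4 / 3;
    mem[2] = mem[0] * 2;
}
```

With a 16-byte bucket and order 0 this gives 96/128/192 pages; doubling the bucket size (as an SMP spinlock might) halves all three, which is the intended "take the hash entry size into account" behavior.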
Re: [PATCH] fix up sysctl_tcp_mem initialization
David Miller wrote: However, I wonder if we want to set this differently than the way this patch does it. Depending on how far off the memory size is from a power of two (exactly equal to a power of two is the worst case), and if total memory is <128M, it can be substantially less than 3/4. Longer term, yes, probably a better way exists. So your concern is that when we round to a power of 2 like we do now, we often mis-shoot? I'm not that concerned about it, but basically yes, there are big (x2) jumps on power-of-two memory size boundaries. There's also a bigger (x8) discontinuity at 128k pages. It could be smoother. -John
Re: [RFC PATCH 2/2] [TCP] MTUprobe: Cleanup send queue check (no need to loop)
Ilpo Järvinen wrote: The original code has striking complexity to perform a query which can be reduced to a very simple compare. FIN seqno may be included to write_seq but it should not make any significant difference here compared to skb->len which was used previously. One won't end up there with SYN still queued. Use of write_seq check guarantees that there's a valid skb in send_head so I removed the extra check.

Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]>
Acked-by: John Heffner <[EMAIL PROTECTED]>
---
 net/ipv4/tcp_output.c | 7 +------
 1 files changed, 1 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index ff22ce8..1822ce6 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1315,12 +1315,7 @@ static int tcp_mtu_probe(struct sock *sk)
 	}
 
 	/* Have enough data in the send queue to probe? */
-	len = 0;
-	if ((skb = tcp_send_head(sk)) == NULL)
-		return -1;
-	while ((len += skb->len) < size_needed && !tcp_skb_is_last(sk, skb))
-		skb = tcp_write_queue_next(sk, skb);
-	if (len < size_needed)
+	if (tp->write_seq - tp->snd_nxt < size_needed)
 		return -1;
 
 	if (tp->snd_wnd < size_needed)
Re: [RFC PATCH 1/2] [TCP]: MTUprobe: receiver window & data available checks fixed
Ilpo Järvinen wrote: It seems that the checked range for the receiver window check should begin from the first rather than from the last skb that is going to be included in the probe. And that can be achieved without reference to skbs at all; snd_nxt and write_seq provide the correct seqnos already. Plus, it SHOULD account for packets that are necessary to trigger fast retransmit [RFC4821]. The location of the snd_wnd < probe_size/size_needed check is bogus because it will cause the other if() to match as well (due to the snd_nxt >= snd_una invariant). Removed a dead obvious comment.

Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]>
Acked-by: John Heffner <[EMAIL PROTECTED]>
---
 net/ipv4/tcp_output.c | 17 ++++++++---------
 1 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 30d6737..ff22ce8 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1289,6 +1289,7 @@ static int tcp_mtu_probe(struct sock *sk)
 	struct sk_buff *skb, *nskb, *next;
 	int len;
 	int probe_size;
+	int size_needed;
 	unsigned int pif;
 	int copy;
 	int mss_now;
@@ -1307,6 +1308,7 @@ static int tcp_mtu_probe(struct sock *sk)
 	/* Very simple search strategy: just double the MSS. */
 	mss_now = tcp_current_mss(sk, 0);
 	probe_size = 2*tp->mss_cache;
+	size_needed = probe_size + (tp->reordering + 1) * mss_now;
 	if (probe_size > tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_high)) {
 		/* TODO: set timer for probe_converge_event */
 		return -1;
@@ -1316,18 +1318,15 @@
 	len = 0;
 	if ((skb = tcp_send_head(sk)) == NULL)
 		return -1;
-	while ((len += skb->len) < probe_size && !tcp_skb_is_last(sk, skb))
+	while ((len += skb->len) < size_needed && !tcp_skb_is_last(sk, skb))
 		skb = tcp_write_queue_next(sk, skb);
-	if (len < probe_size)
+	if (len < size_needed)
 		return -1;
 
-	/* Receive window check. */
-	if (after(TCP_SKB_CB(skb)->seq + probe_size, tp->snd_una + tp->snd_wnd)) {
-		if (tp->snd_wnd < probe_size)
-			return -1;
-		else
-			return 0;
-	}
+	if (tp->snd_wnd < size_needed)
+		return -1;
+	if (after(tp->snd_nxt + size_needed, tp->snd_una + tp->snd_wnd))
+		return 0;
 
 	/* Do we need to wait to drain cwnd? */
 	pif = tcp_packets_in_flight(tp);
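[Editor's illustration: the resulting checks can be sketched outside the kernel with plain 32-bit sequence-number arithmetic. mtu_probe_check and its flat argument list are hypothetical names for this sketch, and after() is reimplemented here to handle seqno wraparound.]

```c
#include <stdint.h>

/* after(a, b): is a later than b in 32-bit wrapping sequence space? */
static int seq_after(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) > 0;
}

/* Simplified shape of the checks in tcp_mtu_probe() after the two
 * patches: -1 = give up, 0 = wait for the window to open, 1 = probe.
 * size_needed reserves room for the probe itself plus enough extra
 * segments to trigger fast retransmit if the probe is lost (RFC4821). */
static int mtu_probe_check(uint32_t write_seq, uint32_t snd_nxt,
                           uint32_t snd_una, uint32_t snd_wnd,
                           uint32_t probe_size, uint32_t mss,
                           uint32_t reordering)
{
    uint32_t size_needed = probe_size + (reordering + 1) * mss;

    if (write_seq - snd_nxt < size_needed)   /* not enough queued data */
        return -1;
    if (snd_wnd < size_needed)               /* window can never fit it */
        return -1;
    if (seq_after(snd_nxt + size_needed, snd_una + snd_wnd))
        return 0;                            /* wait for more window */
    return 1;
}
```

With mss = 1460, probe_size = 2*mss and reordering = 3, size_needed is 8760 bytes; the three outcomes correspond to the three return paths in the patched kernel function.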
Re: [PATCH net-2.6 0/3]: Three TCP fixes
Ilpo Järvinen wrote: ...I'm still to figure out why tcp_cwnd_down uses snd_ssthresh/2 as lower bound even though the ssthresh was already halved, so snd_ssthresh should suffice. I remember this coming up at least once before, so it's probably worth a comment in the code. Rate-halving attempts to actually reduce cwnd to half the delivered window. Here, cwnd/4 (ssthresh/2) is a lower bound on how far rate-halving can reduce cwnd. See the "Bounding Parameters" section of <http://www.psc.edu/networking/papers/FACKnotes/current/>. -John
Re: [PATCH net-2.6 0/3]: Three TCP fixes
Ilpo Järvinen wrote: On Tue, 4 Dec 2007, John Heffner wrote: Ilpo Järvinen wrote: ...I'm still to figure out why tcp_cwnd_down uses snd_ssthresh/2 as lower bound even though the ssthresh was already halved, so snd_ssthresh should suffice. I remember this coming up at least once before, so it's probably worth a comment in the code. Rate-halving attempts to actually reduce cwnd to half the delivered window. Here, cwnd/4 (ssthresh/2) is a lower bound on how far rate-halving can reduce cwnd. See the "Bounding Parameters" section of <http://www.psc.edu/networking/papers/FACKnotes/current/>. Thanks for the info! Sadly enough it makes NewReno recovery quite inefficient when there are enough losses and high BDP link (in my case 384k/200ms, BDP sized buffer). There might be yet another bug in it as well (it is still a bit unclear how tcp variables behaved during my scenario and I'll investigate further) but reduction in the transfer rate is going to last longer than a short moment (which is used as motivation in those FACK notes). In fact, if I just use RFC2581 like setting w/o rate-halving (and experience the initial "pause" in sending), the ACK clock to send out new data works very nicely beating rate halving fair and square. For SACK/FACK it works much nicer because recovery is finished much earlier and slow start recovers cwnd quickly. I believe this is exactly the reason why Matt (CC'd) and Jamshid abandoned this line of work in the late 90's. In my opinion, it's probably not such a bad idea to use cwnd/2 as the bound. In some situations, the current rate-halving code will work better, but as you point out, in others the cwnd is lowered too much. ...Mind if I ask another similar one, any idea why prior_ssthresh is smaller (3/4 of it) than cwnd used to be (see tcp_current_ssthresh)? Not sure on that one. I'm not aware of any publications this is based on. Maybe Alexey knows? 
-John
Re: TCP event tracking via netlink...
David Miller wrote: Ilpo, I was pondering the kind of debugging one does to find congestion control issues and even SACK bugs and it's currently too painful because there is no standard way to track state changes. I assume you're using something like carefully crafted printk's, kprobes, or even ad-hoc statistic counters. That's what I used to do :-) With that in mind it occurred to me that we might want to do something like a state change event generator. Basically some application or even a daemon listens on this generic netlink socket family we create. The header of each event packet indicates what socket the event is for and then there is some state information. Then you can look at a tcpdump and this state dump side by side and see what the kernel decided to do. Now there is the question of granularity. A very important consideration in this is that we want this thing to be enabled in the distributions, therefore it must be cheap. Perhaps one test at the end of the packet input processing. So I say we pick some state to track (perhaps start with tcp_info) and just push that at the end of every packet input run. Also, we add some minimal filtering capability (match on specific IP address and/or port, for example). Maybe if we want to get really fancy we can have some more-expensive debug mode where detailed specific events get generated via some macros we can scatter all over the place. This won't be useful for general user problem analysis, but it will be excellent for developers. Let me know if you think this is useful enough and I'll work on an implementation we can start playing with. FWIW, sounds similar to what these guys are doing with SIFTR for FreeBSD: http://caia.swin.edu.au/urp/newtcp/tools.html http://caia.swin.edu.au/reports/070824A/CAIA-TR-070824A.pdf -John
Re: TCP's initial cwnd setting correct?...
That sounds right to me. -John Ilpo Järvinen wrote: On Mon, 6 Aug 2007, Ilpo Järvinen wrote: ...Goto logic could be cleaner (does somebody have a suggestion for a better way to structure it?) ...I could probably move the setting of snd_cwnd earlier to avoid this problem if this seems a valid fix at all.
Re: TCP's initial cwnd setting correct?...
I believe the current calculation is correct. The RFC specifies a window of no more than 4380 bytes unless 2*MSS > 4380. If you change the code in this way, then MSS=1461 will give you an initial window of 3*MSS == 4383, violating the spec. Reading the pseudocode in RFC 3390 is a bit misleading because it uses a clamp at 4380 bytes rather than a multiplier in the relevant range.

-John

David Miller wrote: From: "Ilpo_Järvinen" <[EMAIL PROTECTED]> Date: Mon, 6 Aug 2007 15:37:15 +0300 (EEST)

@@ -805,13 +805,13 @@ void tcp_update_metrics(struct sock *sk)
 	}
 }
 
-/* Numbers are taken from RFC2414. */
+/* Numbers are taken from RFC3390. */
 __u32 tcp_init_cwnd(struct tcp_sock *tp, struct dst_entry *dst)
 {
 	__u32 cwnd = (dst ? dst_metric(dst, RTAX_INITCWND) : 0);
 
 	if (!cwnd) {
-		if (tp->mss_cache > 1460)
+		if (tp->mss_cache >= 2190)
 			cwnd = 2;
 		else
 			cwnd = (tp->mss_cache > 1095) ? 3 : 4;

I remember suggesting something similar about 5 or 6 years ago and Alexey Kuznetsov at the time explained the numbers which are there and why they should not be changed. I forget the reasons though, and I'll try to do the research. These numbers have been like this forever, FWIW.
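[Editor's sketch: the existing clamp being defended here can be cross-checked by writing RFC 3390's byte formula, min(4*MSS, max(2*MSS, 4380)), in full segments next to the kernel's threshold version. Both functions below are illustrative, not kernel code.]

```c
/* Initial window in segments, as the current kernel computes it.
 * Note the clamp at MSS > 1460: 3 segments of 1461 bytes would be
 * 4383 bytes, just over RFC 3390's 4380-byte (= 3*1460) cap. */
static int rfc3390_init_cwnd(int mss)
{
    if (mss > 1460)
        return 2;
    return (mss > 1095) ? 3 : 4;
}

/* Equivalent byte-based form: how many whole segments fit in the
 * 4380-byte budget, clamped to the RFC's [2, 4] segment range. */
static int rfc3390_init_cwnd_bytes(int mss)
{
    int iw = 4380 / mss;
    if (iw < 2) iw = 2;
    if (iw > 4) iw = 4;
    return iw;
}
```

The two forms agree for every MSS, which is why the proposed change of the 1460 threshold to 2190 would loosen the byte cap rather than tighten the match to the RFC.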
2.6.23-rc2: WARNING: at kernel/irq/resend.c:70 check_irq_resend()
Hi, I'm opening this ticket as a new subject, even though it looks like it might be related to the thread "Networking dies after random time". Sorry for the wide CC list, but since my network hasn't died since I rebooted into 2.6.23-rc2 (after 30+ days at 2.6.22-rc7), I'm wondering if the problem is more than networking-related. Honestly, I haven't gone back over the previous thread in detail, so I might be missing info here. System details: Dell Precision 610MT, Intel 440GX chipset, dual PIII Xeon, 550 MHz, 2 GB RAM (upgraded from 768 MB last night), a mix of IDE, SCSI and SATA disks in the system. My poor PCI bus! Just upgraded to 2.6.23-rc2. Interrupts look like this: > cat /proc/interrupts CPU0 CPU1 0:280 1 IO-APIC-edge timer 1:788 0 IO-APIC-edge i8042 6: 1 4 IO-APIC-edge floppy 8: 0 1 IO-APIC-edge rtc 9: 0 0 IO-APIC-fasteoi acpi 11: 82410 1239 IO-APIC-edge Cyclom-Y 12:279106 IO-APIC-edge i8042 14: 440901 4266 IO-APIC-edge libata 15: 0 0 IO-APIC-edge libata 16:2394727 42983 IO-APIC-fasteoi ohci_hcd:usb3, Ensoniq AudioPCI, [EMAIL PROTECTED]::01:00.0 17:2237362 1110 IO-APIC-fasteoi sata_sil, ehci_hcd:usb1, eth0 18: 126520 31978 IO-APIC-fasteoi aic7xxx, aic7xxx, ide2, ide3, ohci1394 19: 0 0 IO-APIC-fasteoi ohci_hcd:usb2, uhci_hcd:usb4 NMI: 0 0 LOC: 40672484 40672246 ERR: 0 MIS: 0 I've only seen the one WARNING oops, and backups and other system processes have been running for the past 12 hours without a problem. [ 187.747442] Probing IDE interface ide2...
[ 188.011634] hde: WDC WD1200JB-00CRA1, ATA DISK drive [ 188.623038] WARNING: at kernel/irq/resend.c:70 check_irq_resend() [ 188.623105] [] check_irq_resend+0xa8/0xc0 [ 188.623204] [] enable_irq+0xc3/0xd0 [ 188.623295] [] probe_hwif+0x670/0x7c0 [ide_core] [ 188.623448] [] do_ide_setup_pci_device+0x154/0x480 [ide_core] [ 188.623571] [] probe_hwif_init_with_fixup+0xc/0x90 [ide_core] [ 188.623690] [] init_setup_hpt302+0x0/0x30 [hpt366] [ 188.623791] [] ide_setup_pci_device+0x7b/0xc0 [ide_core] [ 188.623909] [] init_setup_hpt302+0x0/0x30 [hpt366] [ 188.624004] [] hpt366_init_one+0x8d/0xa0 [hpt366] [ 188.624095] [] init_setup_hpt302+0x0/0x30 [hpt366] [ 188.624187] [] init_chipset_hpt366+0x0/0x680 [hpt366] [ 188.624281] [] init_hwif_hpt366+0x0/0x380 [hpt366] [ 188.624372] [] init_dma_hpt366+0x0/0xe0 [hpt366] [ 188.624466] [] pci_device_probe+0x56/0x80 [ 188.624565] [] driver_probe_device+0x8e/0x190 [ 188.624669] [] __driver_attach+0x9e/0xa0 [ 188.624756] [] bus_for_each_dev+0x3a/0x60 [ 188.624845] [] driver_attach+0x16/0x20 [ 188.624932] [] __driver_attach+0x0/0xa0 [ 188.625017] [] bus_add_driver+0x8a/0x1b0 [ 188.625107] [] __pci_register_driver+0x53/0xa0 [ 188.625197] [] sys_init_module+0x13d/0x1820 [ 188.625315] [] snd_timer_find+0x0/0x90 [snd_timer] [ 188.625424] [] disable_irq+0x0/0x30 [ 188.625513] [] sys_mmap2+0xcd/0xd0 [ 188.625612] [] syscall_call+0x7/0xb [ 188.625701] [] rpc_get_inode+0x0/0x80 [ 188.625798] === [ 188.625871] hde: selected mode 0x45 [ 188.626817] ide2 at 0xecf8-0xecff,0xecf2 on irq 18 [ 188.627080] Probing IDE interface ide3... [ 188.891165] hdg: WDC WD1200JB-00EVA0, ATA DISK drive [ 189.502580] hdg: selected mode 0x45 [ 189.503698] ide3 at 0xece0-0xece7,0xecda on irq 18 Let
Re: [PATCH] TCP FIN gets dropped prematurely, results in ack storm
Benjamin LaHaise wrote: According to your patch, several packets with the FIN bit might be sent, including one with data. If the other host does not receive the FIN retransmit, then that logic is broken, and it cannot be fixed by duplicating FINs. I would even say that the remote box should drop the second packet with a FIN while it can carry data, which will break higher connection logic. The FIN hasn't been ack'd by the other side, though, and yet Linux is no longer transmitting packets with it set. Read the beginning of the trace. I agree completely with Evgeniy. The patch you sent would cause bad breakage by sending the FIN bit on segments with different sequence numbers. Looking at your trace, it seems like the behavior of the test system 192.168.2.2 is broken in two ways. First, like you said, it has broken state in that it has forgotten that it sent the FIN. Once you do that, the connection state is corrupt and all bets are off. It's sending an out-of-window segment that's getting tossed by Linux, and Linux generates an ack in response. This is in direct RFC compliance. The second problem is that the other system is generating these broken acks in response to the legitimate acks Linux is sending, causing the ack war. I can't really guess why it's doing that... You might be able to change Linux to prevent this ack war, but doing so would break RFC compliance, and given the buggy nature of the other end, it sounds to me like a bad idea. -John
Re: [PATCH] TCP FIN gets dropped prematurely, results in ack storm
Benjamin LaHaise wrote: On Tue, May 01, 2007 at 09:41:28PM +0400, Evgeniy Polyakov wrote: Hmm, the 2.2 machine in your test seems to behave incorrectly: I am aware of that. However, I think that the loss of certain packets and reordering can result in the same behaviour. What's more, this behaviour can occur in real deployed systems. "Be strict in what you send and liberal in what you accept." Both systems should be fixed, which is what I'm trying to do. Actually, you cannot get into this situation by loss or reordering of packets, only by corruption of state on one side. It sends the FIN, which effectively increases the sequence number by one. However, all later segments it sends have an old lower sequence number, which are now out of window. Being liberal in what you accept is good to a point, but sometimes you have to draw the line. -John
Re: [PATCH] [TCP] Sysctl: document tcp_max_ssthresh (Limited Slow-Start)
Rick Jones wrote: as an aside, "tcp_max_ssthresh" sounds like the maximum value ssthresh can take on. is that correct, or is this more of a "once ssthresh is above this, behave in this new way?" If that is the case, while the ... I don't like it either, but you'll have to talk to Sally Floyd about that one.. ;) In general, I would like the documentation to emphasize more how to set the parameter than describe the algorithm. The max_ssthresh parameter should ideally be set to the bottleneck queue size, or more realistically a conservative value that's likely to be smaller than the bottleneck queue size. When max_ssthresh is smaller than the bottleneck queue, (limited) slow start will not overflow it until cwnd has fully ramped up to the appropriate size. -John
Re: UDP packet loss when running lsof
kB VmallocUsed: 6924 kB VmallocChunk: 34359731259 kB HugePages_Total: 0 HugePages_Free: 0 HugePages_Rsvd: 0 Hugepagesize: 2048 kB Thanks for your help! Regards, John
Re: UDP packet loss when running lsof
Hi Eric, > It's an HP system with two dual-core CPUs at 3 GHz, the Then you might try to bind the network IRQ to one CPU (echo 1 >/proc/irq/XX/smp_affinity) XX being your NIC interrupt (cat /proc/interrupts to catch it) and bind your user program to another CPU (or CPUs) the NIC was already fixed at CPU0, and the irq_balancer switched the timer interrupt between all CPUs and the storage HBA between CPU1 and CPU4. Stopping the balancer and leaving the NIC alone on CPU0 with the other interrupts and my program on CPU2-4 did not improve the situation. At least I could not see an improvement over just adding thash_entries=2048. You might hit a cond_resched_softirq() bug that Ingo and others are sorting out right now. Using a separate CPU for softirq handling and your programs should help a lot here. Shouldn't I get some syslog messages if this bug is triggered? Nevertheless I also opened a call on Novell about this issue, as the current cond_resched_softirq() looks completely different than in 2.6.18 > This did help a lot, I tried thash_entries=10 and now only a > while loop around the "cat ...tcp" triggers packet loss. Tests I don't understand here: using a small thash_entries makes the bug always appear? No. thash_entries=10 improves the situation. Without the param nearly every look at /proc/net/tcp leads to packet loss; with thash_entries=10 (or 2048, does not matter) I have to start a "while true; do cat /proc/net/tcp ; done" to get packet loss every minute. But even with thash_entries=10, and if I leave my program alone on the system, I get packet loss every few hours. Regards, John
Re: Problem with implementation of TCP_DEFER_ACCEPT?
TJ wrote: client SYN > server LISTENING client < SYN ACK server SYN_RECEIVED (time-out 3s) server: inet_rsk(req)->acked = 1 client ACK > server (discarded) client < SYN ACK (DUP) server (time-out 6s) client ACK (DUP) > server (discarded) client < SYN ACK (DUP) server (time-out 12s) client ACK (DUP) > server (discarded) client < SYN ACK (DUP) server (time-out 24s) client ACK (DUP) > server (discarded) client < SYN ACK (DUP) server (time-out 48s) client ACK (DUP) > server (discarded) client < SYN ACK (DUP) server (time-out 96s) client ACK (DUP) > server (discarded) server: half-open socket closed. With each client ACK being dropped by the kernel's TCP_DEFER_ACCEPT mechanism eventually the handshake fails after the 'SYN ACK' retries and time-outs expire. There is a case for arguing the kernel should be operating in an enhanced handshaking mode when TCP_DEFER_ACCEPT is enabled, not an alternative mode, and therefore should accept *both* RFC 793 and TCP_DEFER_ACCEPT. I've been unable to find a specification or RFC for implementing TCP_DEFER_ACCEPT aka BSD's SO_ACCEPTFILTER to give me firm guidance. It seems incorrect to penalise a client that is trying to complete the handshake according to the RFC 793 specification, especially as the client has no way of knowing ahead of time whether or not the server is operating deferred accept. Interesting problem. TCP_DEFER_ACCEPT does not conform to any standard I'm aware of. (In fact, I'd say it's in violation of RFC 793.) The implementation does exactly what it claims, though -- it "allows a listener to be awakened only when data arrives on the socket." I think a more useful spec might have been "allows a listener to be awakened only when data arrives on the socket, unless the specified timeout has expired." Once the timeout expires, it should process the embryonic connection as if TCP_DEFER_ACCEPT is not set. 
Unfortunately, I don't think we can retroactively change this definition, as an application might depend on data being available and do a non-blocking read() after the accept(), expecting data to be there. Is this worth trying to fix? Also, a listen socket with a backlog and TCP_DEFER_ACCEPT will have reqs sit in the backlog for the full defer timeout, even if they've received data, which is not really the right thing to do. I've attached a patch implementing this suggestion (compile tested only -- I think I got the logic right but it's late ;). Kind of ugly, and uses up a bit in struct inet_request_sock. Maybe can be done better... -John diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h index 62daf21..f9f64a5 100644 --- a/include/net/inet_sock.h +++ b/include/net/inet_sock.h @@ -72,7 +72,8 @@ struct inet_request_sock { sack_ok: 1, wscale_ok : 1, ecn_ok : 1, - acked : 1; + acked : 1, + deferred : 1; struct ip_options *opt; }; diff --git a/include/net/tcp.h b/include/net/tcp.h index 185c7ec..cad2490 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -978,6 +978,7 @@ static inline void tcp_openreq_init(struct request_sock *req, ireq->snd_wscale = rx_opt->snd_wscale; ireq->wscale_ok = rx_opt->wscale_ok; ireq->acked = 0; + ireq->deferred = 0; ireq->ecn_ok = 0; ireq->rmt_port = tcp_hdr(skb)->source; } diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c index fbe7714..1207fb8 100644 --- a/net/ipv4/inet_connection_sock.c +++ b/net/ipv4/inet_connection_sock.c @@ -444,9 +444,6 @@ void inet_csk_reqsk_queue_prune(struct sock *parent, } } - if (queue->rskq_defer_accept) - max_retries = queue->rskq_defer_accept; - budget = 2 * (lopt->nr_table_entries / (timeout / interval)); i = lopt->clock_hand; @@ -455,7 +452,9 @@ void inet_csk_reqsk_queue_prune(struct sock *parent, while ((req = *reqp) != NULL) { if (time_after_eq(now, req->expires)) { if ((req->retrans < thresh || -(inet_rsk(req)->acked && req->retrans < max_retries)) 
+(inet_rsk(req)->acked && req->retrans < max_retries) || +(inet_rsk(req)->deferred && req->retrans < + queue->rskq_defer_accept + max_retries)) && !req->rsk_ops->rtx_syn_ack(parent, req, NULL)) {
Re: Problem with implementation of TCP_DEFER_ACCEPT?
TJ wrote: Right now Juniper are claiming the issue that brought this to the surface (the bug linked to in my original post) is a problem with the implementation of TCP_DEFER_ACCEPT. My position so far is that the Juniper DX OS is not following the HTTP standard because it doesn't send a request with the connection, and as I read the end of section 1.4 of RFC2616, an HTTP connection should be accompanied by a request. Can anyone confirm my interpretation or provide references to firm it up, or refute it? You can think of TCP_DEFER_ACCEPT as an implicit application close() after a certain timeout, when not receiving a request. All HTTP servers do this anyway (though I think technically they're supposed to send a 408 Request Timeout error, it seems many do not). It's a very valid question for Juniper as to why their box is failing to fill requests when its back-end connection has gone away, instead of re-establishing the connection and filling the request. -John
Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
Bill Fink wrote: Here you can see there is a major difference in the TX CPU utilization (99 % with TSO disabled versus only 39 % with TSO enabled), although the TSO disabled case was able to squeeze out a little extra performance from its extra CPU utilization. Interestingly, with TSO enabled, the receiver actually consumed more CPU than with TSO disabled, so I guess the receiver CPU saturation in that case (99 %) was what restricted its performance somewhat (this was consistent across a few test runs). One possibility: I think receive-side processing tends to do better when receiving into an empty queue. When the (non-TSO) sender is the flow's bottleneck, this is going to be the case. But when you switch to TSO, the receiver becomes the bottleneck and you're always going to have to put the packets at the back of the receive queue. This might help account for why you see both lower throughput and higher CPU utilization -- there's a point of instability right where the receiver becomes the bottleneck and you end up pushing it over to the bad side. :) Just a theory. I'm honestly surprised this effect would be so significant. What do the numbers from netstat -s look like in the two cases? -John
Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB
Bill Fink wrote: Here's the before/after delta of the receiver's "netstat -s" statistics for the TSO enabled case: Ip: 3659898 total packets received 3659898 incoming packets delivered 80050 requests sent out Tcp: 2 passive connection openings 3659897 segments received 80050 segments send out TcpExt: 33 packets directly queued to recvmsg prequeue. 104956 packets directly received from backlog 705528 packets directly received from prequeue 3654842 packets header predicted 193 packets header predicted and directly queued to user 4 acknowledgments not containing data received 6 predicted acknowledgments And here it is for the TSO disabled case (GSO also disabled): Ip: 4107083 total packets received 4107083 incoming packets delivered 1401376 requests sent out Tcp: 2 passive connection openings 4107083 segments received 1401376 segments send out TcpExt: 2 TCP sockets finished time wait in fast timer 48486 packets directly queued to recvmsg prequeue. 1056111048 packets directly received from backlog 2273357712 packets directly received from prequeue 1819317 packets header predicted 2287497 packets header predicted and directly queued to user 4 acknowledgments not containing data received 10 predicted acknowledgments For the TSO disabled case, there are far more TCP segments sent out (1401376 versus 80050), which I assume are ACKs, and which could possibly contribute to the higher throughput for the TSO disabled case due to faster feedback, but not explain the lower CPU utilization. There are many more packets directly queued to recvmsg prequeue (48486 versus 33). The numbers for packets directly received from backlog and prequeue in the TSO disabled case seem bogus to me, so I don't know how to interpret that. There are only about half as many packets header predicted (1819317 versus 3654842), but there are many more packets header predicted and directly queued to user (2287497 versus 193).
I'll leave the analysis of all this to those who might actually know what it all means. There are a few interesting things here. For one, the bursts caused by TSO seem to be causing the receiver to do stretch acks. This may have a negative impact on flow performance, but it's hard to say for sure how much. Interestingly, it will even further reduce the CPU load on the sender, since it has to process fewer acks. As I suspected, in the non-TSO case the receiver gets lots of packets directly queued to user. This should result in somewhat lower CPU utilization on the receiver. I don't know if it can account for all the difference you see. The backlog and prequeue values are probably correct, but netstat's description is wrong. A quick look at the code reveals these values are in units of bytes, not packets. -John
Re: [PATCH 2.6.22] TCP: Make TCP_RTO_MAX a variable (take 2)
OBATA Noboru wrote: Is it correct that you think my problem can be addressed by either of the following? (1) Make the application timeouts longer. (Steve has shown that making application timeouts twice the failover detection timeout would be a solution.) Right. Is there something wrong with this approach? (2) Let TCP have a notification of some kind. There was some work on this in the IETF a while back (google trigtran linkup), but it never went anywhere to my knowledge. In principle it's possible, but it's not clear that it's worth doing. It's really just an optimization anyway. Imagine the link that's failing over is one hop or more away from the endpoint. You're back to the same problem again. -John
Re: [PATCH] make _minimum_ TCP retransmission timeout configurable
David Miller wrote: From: Rick Jones <[EMAIL PROTECTED]> Date: Wed, 29 Aug 2007 15:29:03 -0700 David Miller wrote: None of the research folks want to commit to saying a lower value is OK, even though it's quite clear that on a local 10 gigabit link a minimum value of even 200 is absolutely and positively absurd. So what do these cellphone network people want to do, increase the minimum RTO or decrease it? Exactly how does it help them? They want to increase it. The folks who triggered this want to make it 3 seconds to avoid spurious RTOs. Their experience with the "other platform" they wish to replace suggests that 3 seconds is a good value for their network. If the issue is wireless loss, algorithms like FRTO might help them, because FRTO tries to make a distinction between capacity losses (which should adjust cwnd) and radio losses (which are not capacity based and therefore should not affect cwnd). I was looking at that. FRTO seems only to affect the cwnd calculations, and not the RTO calculation, so it seems to "deal with" spurious RTOs rather than preclude them. There is a strong desire here to not have spurious RTOs in the first place. Each spurious retransmission will increase a user's charges. All of this seems to suggest that the RTO calculation is wrong. I think there's definitely room for improving the RTO calculation. However, this may not be the end-all fix... It seems that packets in this network can be delayed several orders of magnitude longer than the usual round trip as measured by TCP. What exactly causes such a huge delay? What is the TCP measured RTO in these circumstances where spurious RTOs happen and a 3 second minimum RTO makes things better? I haven't done a lot of work on wireless myself, but my understanding is that one of the biggest problems is the behavior of link-layer retransmission schemes. They can suddenly increase the delay of packets by a significant amount when you get a burst of radio interference.
It's hard for TCP to gracefully handle this kind of jump without some minimum RTO, especially since wlan RTTs can often be quite small. -John
Re: [PATCH] make _minimum_ TCP retransmission timeout configurable
John Heffner wrote: What exactly causes such a huge delay? What is the TCP measured RTO in these circumstances where spurious RTOs happen and a 3 second minimum RTO makes things better? I haven't done a lot of work on wireless myself, but my understanding is that one of the biggest problems is the behavior of link-layer retransmission schemes. They can suddenly increase the delay of packets by a significant amount when you get a burst of radio interference. It's hard for TCP to gracefully handle this kind of jump without some minimum RTO, especially since wlan RTTs can often be quite small. (Replying to myself) Though F-RTO does often help in this case. -John
Re: NCR, was [PATCH] make _minimum_ TCP retransmission timeout configurable
Stephen Hemminger wrote: On Wed, 29 Aug 2007 15:28:12 -0700 (PDT) David Miller <[EMAIL PROTECTED]> wrote: And reading NCR some more, we already have something similar in the form of Alexey's reordering detection, in fact it handles exactly the case NCR supposedly deals with. We do not trigger loss recovery strictly on the 3rd duplicate ACK, and we've known about and dealt with the reordering issue explicitly for years. Yeah, it looked like another case of BSD RFC writers reinventing Linux algorithms, but it is worth getting the behaviour standardized and more widely reviewed. I don't believe this was the case. NCR is substantially different, and came out of work at Texas A&M. The original (only) implementation was in Linux IIRC. Its goal was to do better. Their papers say it does. It might be worth looking at. In my own experience with reordering, Alexey's code had some hard-to-track-down bugs (look at all the work Ilpo's been doing), and the relative simplicity of NCR may be one of the reasons it does well in tests. -John
Re: [PATCH] make _minimum_ TCP retransmission timeout configurable
David Miller wrote: From: Rick Jones <[EMAIL PROTECTED]> Date: Wed, 29 Aug 2007 16:06:27 -0700 I believe the biggest component comes from link-layer retransmissions. There can also be some short outages thanks to signal blocking, tunnels, people with big hats and whatnot that the link-layer retransmissions are trying to address. The three seconds seems to be a value that gives the certainty that 99 times out of 100 the segment was indeed lost. The trace I've been sent shows clean RTTs ranging from ~200 milliseconds to ~7000 milliseconds. Thanks for the info. It's pretty easy to generate examples where we might have some sockets talking over interfaces on such a network and others which are not. Therefore, if we do this, a per-route metric is probably the best bet. This is exactly what I was thinking. It might even help discourage users from playing with this setting who should not. ;) -John
Re: [PATCH] make _minimum_ TCP retransmission timeout configurable take 2
Rick Jones wrote: Like I said, the consumers of this are a trifle, well, "anxious" :) Just curious, did you or this customer try with F-RTO enabled? Or is this case you're dealing with truly hopeless? -John
82557/8/9 Ethernet Pro 100 interrupt mitigation support
(Please ignore previous message, it was sent from the wrong account.) Hello everyone, I have several systems with three integrated Intel 82559 (I *think*). Does someone know if these boards support hardware interrupt mitigation? I.e. is it possible to configure them to raise an IRQ only if their hardware buffer is full OR if some given time (say 1 ms) has passed and packets are available in their hardware buffer. I've been using the eepro100 driver up to now, but I'm about to try the e100 driver. Would I have to use NAPI? Or is this an orthogonal feature? Regards. 00:08.0 Ethernet controller: Intel Corporation 82557/8/9 Ethernet Pro 100 (rev 08) Subsystem: Intel Corporation EtherExpress PRO/100B (TX) Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- SERR- TAbort- SERR- TAbort- SERR-
Re: 82557/8/9 Ethernet Pro 100 interrupt mitigation support
John Sigler wrote: I have several systems with three integrated Intel 82559 (I *think*). Does someone know if these boards support hardware interrupt mitigation? I.e. is it possible to configure them to raise an IRQ only if their hardware buffer is full OR if some given time (say 1 ms) has passed and packets are available in their hardware buffer. I've been using the eepro100 driver up to now, but I'm about to try the e100 driver. Would I have to use NAPI? Or is this an orthogonal feature? 00:08.0 Ethernet controller: Intel Corporation 82557/8/9 Ethernet Pro 100 (rev 08) Subsystem: Intel Corporation EtherExpress PRO/100B (TX) Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- SERR- TAbort- SERR- TAbort- SERR- Here is Intel's page for the 82559: http://www.intel.com/design/network/products/lan/controllers/82559.htm The "82559ER Fast Ethernet PCI Controller" data sheet mentions a 3 KB receive FIFO. I suppose that's too small to aggregate several frames? The "8255x Controller Family Open Source Software Developer Manual" mentions the features supported by the 82559. I don't see anything related to interrupt mitigation support. Does NAPI work well when there is no hardware interrupt mitigation support? Regards.
Re: 82557/8/9 Ethernet Pro 100 interrupt mitigation support
Jesse Brandeburg wrote: Auke Kok wrote: Marc Sigler wrote: I have several systems with three integrated Intel 82559 (I *think*). Does someone know if these boards support hardware interrupt mitigation? I.e. is it possible to configure them to raise an IRQ only if their hardware buffer is full OR if some given time (say 1 ms) has passed and packets are available in their hardware buffer. I've been using the eepro100 driver up to now, but I'm about to try the e100 driver. Would I have to use NAPI? Or is this an orthogonal feature? e100 hardware (as far as I can see from the specs) doesn't support any irq mitigation, so you'll need to run in NAPI mode if you want to throttle irq's. the in-kernel e100 already runs in NAPI mode, so that's already covered. beware that the eepro100 driver is scheduled for removal (2.6.25 or so). We support mitigation of interrupts in a downloadable microcode on only a few pieces of hardware (revision id specific) in e100.c (see e100_setup_ucode) http://lxr.linux.no/source/drivers/net/e100.c#L1176 OK. How do I tell which revision id I have? 00:08.0 0200: 8086:1229 (rev 08) 00:09.0 0200: 8086:1229 (rev 08) 00:0a.0 0200: 8086:1229 (rev 08) How much memory is available on the board to bundle packets? 3000 bytes? If you really really wanted mitigation you could probably backport the microcode from the e100 driver in the 2.4.35 kernel for your specific hardware. This driver is versioned 2.X. I forgot to mention I'm running 2.6.22.1-rt9. I'm not sure why you mention 2.4.35? The problem with e100 is that it fails to properly set up all three interfaces, which is why I'm stuck with eepro100. Regards.
[PATCH 0/2] Clean up owner field in sock_lock_t
I don't know why the owner field is a (struct sock_iocb *). I'm assuming it's historical. Can someone check this out? Did I miss some alternate usage? These patches are against net-2.6.24.
[PATCH 1/2] [NET] Cleanup: Use sock_owned_by_user() macro
Changes asserts in sunrpc to use sock_owned_by_user() macro instead of referencing sock_lock.owner directly. Signed-off-by: John Heffner <[EMAIL PROTECTED]> --- net/sunrpc/svcsock.c |2 +- net/sunrpc/xprtsock.c |2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c index ed17a50..3a95612 100644 --- a/net/sunrpc/svcsock.c +++ b/net/sunrpc/svcsock.c @@ -104,7 +104,7 @@ static struct lock_class_key svc_slock_key[2]; static inline void svc_reclassify_socket(struct socket *sock) { struct sock *sk = sock->sk; - BUG_ON(sk->sk_lock.owner != NULL); + BUG_ON(sock_owned_by_user(sk)); switch (sk->sk_family) { case AF_INET: sock_lock_init_class_and_name(sk, "slock-AF_INET-NFSD", diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c index 4ae7eed..282efd4 100644 --- a/net/sunrpc/xprtsock.c +++ b/net/sunrpc/xprtsock.c @@ -1186,7 +1186,7 @@ static struct lock_class_key xs_slock_key[2]; static inline void xs_reclassify_socket(struct socket *sock) { struct sock *sk = sock->sk; - BUG_ON(sk->sk_lock.owner != NULL); + BUG_ON(sock_owned_by_user(sk)); switch (sk->sk_family) { case AF_INET: sock_lock_init_class_and_name(sk, "slock-AF_INET-NFS", -- 1.5.3.rc7.30.g947ad2
[PATCH 2/2] [NET] Change type of owner in sock_lock_t to int, rename
The type of owner in sock_lock_t is currently (struct sock_iocb *), presumably for historical reasons. It is never used as this type, only tested as NULL or set to (void *)1. For clarity, this changes it to type int, and renames to owned, to avoid any possible type casting errors. Signed-off-by: John Heffner <[EMAIL PROTECTED]> --- include/net/sock.h |7 +++ net/core/sock.c|6 +++--- 2 files changed, 6 insertions(+), 7 deletions(-) diff --git a/include/net/sock.h b/include/net/sock.h index 802c670..5ed9fa4 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -76,10 +76,9 @@ * between user contexts and software interrupt processing, whereas the * mini-semaphore synchronizes multiple users amongst themselves. */ -struct sock_iocb; typedef struct { spinlock_t slock; - struct sock_iocb*owner; + int owned; wait_queue_head_t wq; /* * We express the mutex-alike socket_lock semantics @@ -737,7 +736,7 @@ static inline int sk_stream_wmem_schedule(struct sock *sk, int size) * Since ~2.3.5 it is also exclusive sleep lock serializing * accesses from user process context. 
*/ -#define sock_owned_by_user(sk) ((sk)->sk_lock.owner) +#define sock_owned_by_user(sk) ((sk)->sk_lock.owned) /* * Macro so as to not evaluate some arguments when @@ -748,7 +747,7 @@ static inline int sk_stream_wmem_schedule(struct sock *sk, int size) */ #define sock_lock_init_class_and_name(sk, sname, skey, name, key) \ do { \ - sk->sk_lock.owner = NULL; \ + sk->sk_lock.owned = 0; \ init_waitqueue_head(&sk->sk_lock.wq); \ spin_lock_init(&(sk)->sk_lock.slock); \ debug_check_no_locks_freed((void *)&(sk)->sk_lock, \ diff --git a/net/core/sock.c b/net/core/sock.c index cfed7d4..edbc562 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -1575,9 +1575,9 @@ void fastcall lock_sock_nested(struct sock *sk, int subclass) { might_sleep(); spin_lock_bh(&sk->sk_lock.slock); - if (sk->sk_lock.owner) + if (sk->sk_lock.owned) __lock_sock(sk); - sk->sk_lock.owner = (void *)1; + sk->sk_lock.owned = 1; spin_unlock(&sk->sk_lock.slock); /* * The sk_lock has mutex_lock() semantics here: @@ -1598,7 +1598,7 @@ void fastcall release_sock(struct sock *sk) spin_lock_bh(&sk->sk_lock.slock); if (sk->sk_backlog.tail) __release_sock(sk); - sk->sk_lock.owner = NULL; + sk->sk_lock.owned = 0; if (waitqueue_active(&sk->sk_lock.wq)) wake_up(&sk->sk_lock.wq); spin_unlock_bh(&sk->sk_lock.slock); -- 1.5.3.rc7.30.g947ad2
[PATCH 2/2] [IPROUTE2] ss: parse bare integers as port numbers rather than IP addresses
Signed-off-by: John Heffner <[EMAIL PROTECTED]> --- misc/ss.c |4 1 files changed, 4 insertions(+), 0 deletions(-) diff --git a/misc/ss.c b/misc/ss.c index 5d14f13..d617f6d 100644 --- a/misc/ss.c +++ b/misc/ss.c @@ -953,6 +953,10 @@ void *parse_hostcond(char *addr) memset(&a, 0, sizeof(a)); a.port = -1; + /* Special case: integer by itself is considered a port number */ + if (!get_integer(&a.port, addr, 0)) + goto out; + if (fam == AF_UNIX || strncmp(addr, "unix:", 5) == 0) { char *p; a.addr.family = AF_UNIX; -- 1.5.3.rc4.29.g74276-dirty
[PATCH 1/2] [IPROUTE2] Add missing LIBUTIL for dependencies.
Signed-off-by: John Heffner <[EMAIL PROTECTED]> --- Makefile |3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/Makefile b/Makefile index af0d5e4..7e4605c 100644 --- a/Makefile +++ b/Makefile @@ -29,7 +29,8 @@ LDLIBS += -L../lib -lnetlink -lutil SUBDIRS=lib ip tc misc netem genl -LIBNETLINK=../lib/libnetlink.a ../lib/libutil.a +LIBUTIL=../lib/libutil.a +LIBNETLINK=../lib/libnetlink.a $(LIBUTIL) all: Config @set -e; \ -- 1.5.3.rc4.29.g74276-dirty
Re: [PATCH] include listenq max/backlog in tcp_info and related reports - correct version/signorder
Any reason you're overloading tcpi_unacked and tcpi_sacked? It seems that setting idiag_rqueue and idiag_wqueue are sufficient. -John Rick Jones wrote: Return some useful information such as the maximum listen backlog and the current listen backlog in the tcp_info structure and have that match what one can see in /proc/net/tcp, /proc/net/tcp6, and INET_DIAG_INFO. Signed-off-by: Rick Jones <[EMAIL PROTECTED]> Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]> --- diff -r bdcdd0e1ee9d Documentation/networking/proc_net_tcp.txt --- a/Documentation/networking/proc_net_tcp.txt Sat Sep 01 07:00:31 2007 + +++ b/Documentation/networking/proc_net_tcp.txt Tue Sep 11 10:38:23 2007 -0700 @@ -20,8 +20,8 @@ up into 3 parts because of the length of || | | |--> number of unrecovered RTO timeouts || | |--> number of jiffies until timer expires || |> timer_active (see below) - ||--> receive-queue - |---> transmit-queue + ||--> receive-queue or connection backlog + |---> transmit-queue or connection limit 10000 54165785 4 cd1e6040 25 4 27 3 -1 | || || | | | | |--> slow start size threshold, diff -r bdcdd0e1ee9d net/ipv4/tcp.c --- a/net/ipv4/tcp.cSat Sep 01 07:00:31 2007 + +++ b/net/ipv4/tcp.cTue Sep 11 10:38:23 2007 -0700 @@ -2030,8 +2030,14 @@ void tcp_get_info(struct sock *sk, struc info->tcpi_snd_mss = tp->mss_cache; info->tcpi_rcv_mss = icsk->icsk_ack.rcv_mss; - info->tcpi_unacked = tp->packets_out; - info->tcpi_sacked = tp->sacked_out; + if (sk->sk_state == TCP_LISTEN) { + info->tcpi_unacked = sk->sk_ack_backlog; + info->tcpi_sacked = sk->sk_max_ack_backlog; + } + else { + info->tcpi_unacked = tp->packets_out; + info->tcpi_sacked = tp->sacked_out; + } info->tcpi_lost = tp->lost_out; info->tcpi_retrans = tp->retrans_out; info->tcpi_fackets = tp->fackets_out; diff -r bdcdd0e1ee9d net/ipv4/tcp_diag.c --- a/net/ipv4/tcp_diag.c Sat Sep 01 07:00:31 2007 + +++ b/net/ipv4/tcp_diag.c Tue Sep 11 10:38:23 2007 -0700 @@ -25,11 +25,14 @@ static void tcp_diag_get_info(struct soc const struct 
tcp_sock *tp = tcp_sk(sk); struct tcp_info *info = _info; - if (sk->sk_state == TCP_LISTEN) + if (sk->sk_state == TCP_LISTEN) { r->idiag_rqueue = sk->sk_ack_backlog; - else + r->idiag_wqueue = sk->sk_max_ack_backlog; + } + else { r->idiag_rqueue = tp->rcv_nxt - tp->copied_seq; - r->idiag_wqueue = tp->write_seq - tp->snd_una; + r->idiag_wqueue = tp->write_seq - tp->snd_una; + } if (info != NULL) tcp_get_info(sk, info); } diff -r bdcdd0e1ee9d net/ipv4/tcp_ipv4.c --- a/net/ipv4/tcp_ipv4.c Sat Sep 01 07:00:31 2007 + +++ b/net/ipv4/tcp_ipv4.c Tue Sep 11 10:38:23 2007 -0700 @@ -2320,7 +2320,8 @@ static void get_tcp4_sock(struct sock *s sprintf(tmpbuf, "%4d: %08X:%04X %08X:%04X %02X %08X:%08X %02X:%08lX " "%08X %5d %8d %lu %d %p %u %u %u %u %d", i, src, srcp, dest, destp, sk->sk_state, - tp->write_seq - tp->snd_una, + sk->sk_state == TCP_LISTEN ? sk->sk_max_ack_backlog : +(tp->write_seq - tp->snd_una), sk->sk_state == TCP_LISTEN ? sk->sk_ack_backlog : (tp->rcv_nxt - tp->copied_seq), timer_active, diff -r bdcdd0e1ee9d net/ipv6/tcp_ipv6.c --- a/net/ipv6/tcp_ipv6.c Sat Sep 01 07:00:31 2007 + +++ b/net/ipv6/tcp_ipv6.c Tue Sep 11 10:38:23 2007 -0700 @@ -2005,8 +2005,10 @@ static void get_tcp6_sock(struct seq_fil dest->s6_addr32[0], dest->s6_addr32[1], dest->s6_addr32[2], dest->s6_addr32[3], destp, sp->sk_state, - tp->write_seq-tp->snd_una, - (sp->sk_state == TCP_LISTEN) ? sp->sk_ack_backlog : (tp->rcv_nxt - tp->copied_seq), + (sp->sk_state == TCP_LISTEN) ? sp->sk_max_ack_backlog: + tp->write_seq-tp->snd_una, + (sp->sk_state == TCP_LISTEN) ? sp->sk_ack_backlog : + (tp->rcv_nxt - tp->copied_seq), timer_active, jiffies_to_clock_t(timer_expires - jiffies), icsk->icsk_retransmits,
Re: [PATCH] include listenq max/backlog in tcp_info and related reports - correct version/signorder
Rick Jones wrote: John Heffner wrote: Any reason you're overloading tcpi_unacked and tcpi_sacked? It seems that setting idiag_rqueue and idiag_wqueue is sufficient. Different fields for different structures. The tcp_info struct doesn't have the idiag_mumble, so to get the two values shown in /proc/net/tcp I use tcpi_unacked and tcpi_sacked. For the INET_DIAG_INFO stuff the idiag_mumble fields are used and that then covers ss. Maybe I'm missing something. get_tcp[46]_sock() does not use struct tcp_info. The only way I see using this is by doing getsockopt(TCP_INFO) on your listen socket. Is this the intention? -John
[PATCH: 2.6.13-15-SMP 3/3] network: concurrently run softirq network code on SMP
Bottom Softirq Implementation. John Ye, 2007.08.27 Why this patch: Make the kernel able to concurrently execute softirq net code on SMP systems. Takes full advantage of SMP to handle more packets and greatly raises NIC throughput. The current kernel's net packet processing logic is: 1) The CPU which handles a hardirq must be executing its related softirq. 2) One softirq instance (irqs handled by 1 CPU) can't be executed on more than one CPU at the same time. These limitations make it hard for kernel networking to take advantage of SMP. How this patch: It splits the current softirq code into 2 parts: the cpu-sensitive top half, and the cpu-insensitive bottom half, then makes the bottom half (called BS) execute on SMP concurrently. The two parts are not equal in terms of size and load. The top part has constant code size (mainly in net/core/dev.c and NIC drivers), while the bottom part involves netfilter (iptables), whose load varies very much. An iptables setup with 1000 rules to match will make the bottom part's load very high. So, if the bottom-part softirq can be randomly distributed to processors and run concurrently on them, the network will gain much more packet handling capacity, and network throughput will be increased remarkably. Where useful: It's useful on SMP machines that meet the following 2 conditions: 1) high kernel network load (for example, running iptables with thousands of rules); 2) more CPUs than active NICs (e.g. a 4-CPU machine with 2 NICs). On these systems, with the increase of softirq load, some CPUs will be idle while others (as many as there are NICs) keep busy. irqbalance will help, but it only shifts IRQs among CPUs and provides no softirq concurrency. Balancing the load of each CPU will not remarkably increase network speed. Where NOT useful: If the bottom half of the softirq is too small (without running iptables), or the network is too idle, the BS patch will not be seen to have a visible effect. But it has no negative effect either. 
The user can turn BS functionality on/off via the /proc/sys/net/bs_enable switch. How to test: On a Linux box, run iptables and add 2000 rules to table filter & table nat to simulate a huge softirq load. Then open 20 ftp sessions downloading a big file. On another machine (which uses this test machine as its gateway), open 20 more ftp download sessions. Compare the speed without BS enabled and with BS enabled. cat /proc/sys/net/bs_enable: this is a switch to turn BS on/off. cat /proc/sys/net/bs_status: this shows the usage of each CPU. Tests showed that when the bottom softirq load is high, network throughput can be nearly doubled on a 2-CPU machine; hopefully it may be quadrupled on a 4-CPU Linux box. Bugs: It will NOT allow CPU hotplug. It only allows incremental CPU ids, starting from 0 to num_online_cpus(). For example, 0,1,2,3 is OK; 0,1,8,9 is KO. Some considerations for the future: 1) With the BS patch, the irq balance code in arch/i386/kernel/io_apic.c seems no longer needed, at least not for network irqs. 2) Softirq load will become very small. It only runs the top half of the old softirq, which is much less expensive than the bottom half---the netfilter program. To let the top softirq process more packets, can these 3 network parameters be given a larger value? extern int netdev_max_backlog = 1000; extern int netdev_budget = 300; extern int weight_p = 64; 3) Now that BS runs on the built-in keventd thread, could we create new workqueues for it to run on? Signed-off-by: John Ye (Seeker) <[EMAIL PROTECTED]> --- old/net/ipv4/ip_input.c 2007-09-20 20:50:31.0 +0800 +++ new/net/ipv4/ip_input.c 2007-09-21 05:52:40.0 +0800 @@ -362,6 +362,198 @@ return NET_RX_DROP; } + +#define CONFIG_BOTTOM_SOFTIRQ_SMP +#define CONFIG_BOTTOM_SOFTIRQ_SMP_SYSCTL + +#ifdef CONFIG_BOTTOM_SOFTIRQ_SMP + +/* + * +Bottom Softirq Implementation. John Ye, 2007.08.27 + +Why this patch: +Make kernel be able to concurrently execute softirq's net code on SMP system. 
+Takes full advantages of SMP to handle more packets and greatly raises NIC throughput. +The current kernel's net packet processing logic is: +1) The CPU which handles a hardirq must be executing its related softirq. +2) One softirq instance(irqs handled by 1 CPU) can't be executed on more than 2 CPUs +at the same time. +The limitation make kernel network be hard to take the advantages of SMP. + +How this patch: +It splits the current softirq code into 2 parts: the cpu-sensitive top half, +and the cpu-insensitive bottom half, then make bottom half(calld BS) be +executed on SMP concurrently. +The two parts are not equal in terms of size and load. Top part has constant code +size(mainly, in net/core/dev.c and NIC drivers), while bottom part involves +netfilter(iptables) whose load varies very much. An iptalbes with 1000 rules to match +will make the bottom part's load be very high. So, if the bottom part softirq +can be randomly distributed to processor
Re: [PATCH: 2.6.13-15-SMP 3/3] network: concurrently run softirq network code on SMP
David, Thanks for your reply. I understand it's not worth doing. I have made it a loadable module to fulfill the function; it is mainly for busy NAT gateway servers with SMP, to speed them up. John Ye - Original Message - From: "David Miller" <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]> Cc: ; <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]> Sent: Friday, September 21, 2007 1:46 AM Subject: Re: [PATCH: 2.6.13-15-SMP 3/3] network: concurrently run softirq network code on SMP > > The whole reason the queues are per-cpu is so that we do not > have to touch remote processor state nor use locks of any > kind whatsoever. > > With multi-queue networking cards becoming more and more > available, which will split up the packet workload in > hardware across all available cpus, there is less and less > reason to make a patch like this one. > > We've known about this issue for ages, and if we felt it > was appropriate to make this change, we would have done > so years ago. >
want same order in /sys/class/net/eth as /sys/bus/pci/devices
I'd like to see the same order of devices in /sys/class/net/eth* as in /sys/bus/pci/devices. This would make administration easier. On Fedora 8 tests, the order I see is reversed: http://bugzilla.redhat.com/show_bug.cgi?id=291431 Perhaps the reversal is a result of the alias order listed in /etc/modprobe.conf. But the alias order was obtained from some source. Was the first reversal due to a user-space program (such as the anaconda installer), or due to something within the kernel? -- John Reiser, [EMAIL PROTECTED]
Re: [PATCH: 2.6.13-15-SMP 3/3] network: concurrently run softirq network code on SMP
Dear Jamal, Sorry, I sent you all a badly formatted mail. Thanks for the instructions and corrections from you all. I had thought that packet re-ordering for the upper TCP protocol would become more intensive and that this would make the network even slower. I do randomly select a CPU to dispatch the skb to. Previously, I dispatched skbs evenly to all CPUs (round robin, one by one), but I didn't find a quick way to code it; for_each_online_cpu is not quick enough. According to my test result, it did make packet INPUT speed double, because another CPU is used concurrently. It seems the packets still keep "rough ordering" after turning on the BS patch. The test is simple: use 2400 lines of iptables -t filter -A INPUT -p tcp -s x.x.x.x --dport yy -j . These rules make the current softirq very busy on one CPU and make incoming traffic very slow; after turning on BS, the speed doubled. For the NAT test, I didn't get a result as good as INPUT, because of real-environment limitations. The test is very basic and is far from "full". It seems to me that the cross-cpu spinlock for the queue doesn't have a big cost and is allowable in terms of CPU time consumption, compared with the gains from making other CPUs join in the work. I have made the BS patch into a loadable module: http://linux.chinaunix.net/bbs/thread-909725-2-1.html and let others help with testing. John Ye - Original Message - From: "jamal" <[EMAIL PROTECTED]> To: "John Ye" <[EMAIL PROTECTED]> Cc: "David Miller" <[EMAIL PROTECTED]>; ; <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]> Sent: Friday, September 21, 2007 7:43 PM Subject: Re: [PATCH: 2.6.13-15-SMP 3/3] network: concurrently run softirq network code on SMP > On Fri, 2007-21-09 at 17:25 +0800, John Ye wrote: >> David, >> >> Thanks for your reply. I understand it's not worth to do. >> >> I have made it a loadable module to fulfill the function. it mainly for >> busy >> NAT gateway server with SMP to speed up. 
>> > > John, > > It was a little hard to read your code; however, it does seem to me > like it will cause a massive amount of packet reordering to the end hosts > using you as the gateway, especially when it is receiving a lot of > packets/second. > You have a queue per CPU that connects your bottom and top half and > several CPUs that may service a single NIC in your bottom half. > one cpu in either bottom/top half has to be slightly loaded and you > lose the ordering where incoming doesn't match outgoing packet order. > > cheers, > jamal > >
Re: [PATCH: 2.6.13-15-SMP 3/3] network: concurrently run softirq network code on SMP
Dear Jamal, Yes, you are right. I do "need some real fast traffic generator; possibly one that can do thousands of tcp sessions." to get some kind of convincing result. The packet reordering is also my big concern; round-robin doesn't help much with it. "The INPUT speed is doubled by using 2 CPUs" is shown by these steps: 1) without iptables, ftp get a 50M file from another machine; ftp shows a speed of 10M/s. 2) run iptables and add many iptables rules, then ftp get the same file; the speed drops to 3M/s, and top shows CPU0 busy in softirq, CPU1 idle. 3) insmod my module BS, then ftp get the same file; the speed can reach 6M/s, and top shows both CPU0 and CPU1 busy in keventd/0/1. I will try my best to do further tests. The best test should be done on a 4-CPU GATEWAY machine. In China, there are many companies who use a Linux box running iptables as a gateway to serve around 1000 clients, for example. On those machines there is a lot of conntracking, and they have the "idle CPUs while the net is too busy" problem. In my BS module (if you got it), only 2 functions need to be looked at: REP_ip_rcv() and bs_func(). The others have nothing to do with the BS patch --- they are there only for accessing non-EXPORT_SYMBOLed kernel variables. Thanks a lot for your thoughts. John Ye - Original Message - From: "jamal" <[EMAIL PROTECTED]> To: "john ye" <[EMAIL PROTECTED]> Cc: "David Miller" <[EMAIL PROTECTED]>; ; <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]> Sent: Sunday, September 23, 2007 8:43 PM Subject: Re: [PATCH: 2.6.13-15-SMP 3/3] network: concurrently run softirq network code on SMP > On Sun, 2007-23-09 at 12:45 +0800, john ye wrote: > >> I do randomly select a CPU to dispatch the skb to. Previously, I >> dispatch >> skb evenly to all CPUs( round robin, one by one). but I didn't find a >> quick >> coding. for_each_online_cpu is not quick enough. > > for_each_online_cpu doenst look that expensive - but even round robin > wont fix the reordering problem. 
What you need to do is make sure that a > flow always goes to the same cpu over some period of time. > >> According to my test result, it did make packet INPUT speed doubled >> because >> another CPU is used concurrently. > > How did you measure "speed" - was it throughput? Did you measure how > much cpu was being utilized? > >> It seems the packets still keep "roughly ordering" after turning on >> BS patch. > > Linux TCP is very resilient to reordering compared to other OSes, but > even then if you hit it with enough packets it is going to start > sweating it. > >> The test is simple: use an 2400 lines of iptables -t filter -A INPUT >> -p >> tcp -s x.x.x.x --dport yy -j . >> these rules make the current softirq be very busy on one CPU and make >> the >> incoming net very slow. after turning on BS, the speed doubled. >> > Ok, but how do you observe "doubled"? > Do you have conntrack on? It maybe that what you have just found is > netfilter needs to have its work defered from packet rcv. > You need some real fast traffic generator; possibly one that can do > thousands of tcp sessions. > >> For NAT test, I didn't get a good result like INPUT because real >> environment limitation. >> The test is very basic and is far from "full". > > What happens when you totally compile out netfilter and you just use > this machine as a server? > >> It seems to me that the cross-cpu spinlock_ for the queue doesn't >> have >> big cost and is allowable in terms of CPU time consumption, compared >> with >> the gains by making other CPUs joint in the work. >> >> I have made BS patch into a loadable module. >> http://linux.chinaunix.net/bbs/thread-909725-2-1.html and let others >> help with testing. > > It is still very hard to read; and i am not sure how you are going to > get the performance you claim eventually - you are registering as a tap > for ip packets, which means you will process two of each incoming > packets. 
> > cheers, > jamal > >
Re: [PATCH: 2.6.13-15-SMP 3/3] network: concurrently run softirq network code on SMP
Dear Jamal, Thanks, and sorry to have bothered you all. I will look into the 2 issues, re-ordering and spinlock cost, and do extensive testing. Once I have a result, no matter positive or negative, I will contact you. The format will not be a mess any more. John Ye - Original Message - From: "jamal" <[EMAIL PROTECTED]> To: "john ye" <[EMAIL PROTECTED]> Cc: "David Miller" <[EMAIL PROTECTED]>; ; <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]> Sent: Monday, September 24, 2007 2:07 AM Subject: Re: [PATCH: 2.6.13-15-SMP 3/3] network: concurrently run softirq network code on SMP > John, > It will NEVER be an acceptable solution as long as you have re-ordering. > I will look at it - but i have to run out for now. In the meantime, > I have indented it for you to be in proper kernel format so others can > also look at it. Attached. > > cheers, > jamal > >
Re: [PATCH: 2.6.13-15-SMP 3/3] network: concurrently run softirq network code on SMP
Jamal, You pointed out a key point: it's NOT acceptable if massive packet re-ordering couldn't be avoided. I debugged the function tcp_ofo_queue in net/ipv4/tcp_input.c & monitored out_of_order_queue, and found that re-ordering becomes unacceptable as the softirq load grows. It's simple to avoid out-of-order packets by changing the random dispatch into a dispatch based on source ip address, e.g. cpu = iph->saddr % nr_cpus, where cpu acts like a hash entry. Considering that the BS patch is mainly used on servers with many incoming connections, dispatch by IP should balance CPU load well. The test is under way; it's not bad so far. The queue spin_lock seems not to cost much. Below is the bcpp-beautified module code. Last time's code mess was caused by Outlook Express, which killed tabs. Thanks. John Ye /* * BOTTOM_SOFTIRQ_NET * An implementation of bottom softirq concurrent execution on SMP * This is implemented by splitting current net softirq into top half * and bottom half, dispatch the bottom half to each cpu's workqueue. * Hopefully, it can raise the throughput of NIC when running iptables * on SMP machine. 
 * * Version:$Id: bs_smp.c, v 2.6.13-15 for kernel 2.6.13-15-smp * * Authors:John Ye & QianYu Ye, 2007.08.27 */ #include <...> /* the ~70 kernel header names here were lost when the list archive stripped the angle-bracketed arguments */ static spinlock_t *p_ptype_lock; static struct list_head *p_ptype_base;/* 16 way hashed list */ int (*Pip_options_rcv_srr)(struct sk_buff *skb); int (*Pnf_rcv_postxfrm_nonlocal)(struct sk_buff *skb); struct ip_rt_acct *ip_rt_acct; struct ipv4_devconf *Pipv4_devconf; #define ipv4_devconf (*Pipv4_devconf) //#define ip_rt_acct Pip_rt_acct #define ip_options_rcv_srr Pip_options_rcv_srr #define nf_rcv_postxfrm_nonlocal Pnf_rcv_postxfrm_nonlocal //extern int nf_rcv_postxfrm_local(struct sk_buff *skb); //extern int ip_options_rcv_srr(struct sk_buff *skb); static struct workqueue_struct **Pkeventd_wq; #define keventd_wq (*Pkeventd_wq) #define INSERT_CODE_HERE static inline int ip_rcv_finish(struct sk_buff *skb) { struct net_device *dev = skb->dev; struct iphdr *iph = skb->nh.iph; int err; /* * Initialise the virtual path cache for the packet. It describes * how the packet travels inside Linux networking. 
*/ if (skb->dst == NULL) { if ((err = ip_route_input(skb, iph->daddr, iph->saddr, iph->tos, dev))) { if (err == -EHOSTUNREACH) IP_INC_STATS_BH(IPSTATS_MIB_INADDRERRORS); goto drop; } } if (nf_xfrm_nonlocal_done(skb)) return nf_rcv_postxfrm_nonlocal(skb); #ifdef CONFIG_NET_CLS_ROUTE if (skb->dst->tclassid) { struct ip_rt_acct *st = ip_rt_acct + 256*smp_processor_id(); u32 idx = skb->dst->tclassid; st[idx&0xFF].o_packets++; st[idx&0xFF].o_bytes+=skb->len; st[(idx>>16)&0xFF].i_packets++; st[(idx>>16)&0xFF].i_bytes+=skb->len; } #endif if (iph->ihl > 5) { struct ip_options *opt; /* It looks as overkill, because not all IP options require packet mangling. But it is the easiest for now, especially taking into account that combination of IP options and running sniffer is extremely rare condition. --ANK (980813) */ if (skb_cow(skb, skb_headroom(skb))) { IP_INC_STATS_BH(IPSTATS_MIB_INDISCARDS); goto drop; } iph = skb->nh.iph; if (ip_options_compile(NULL, skb)) goto inhdr_error; opt = &(IPCB(skb)->opt); if (opt->srr) { struct in_device *in_dev = in_dev_get(dev);
Re: [PATCH: 2.6.13-15-SMP 3/3] network: concurrentlyrunsoftirqnetwork code on SMP
Jamal & Stephen, I found the RSS-hash paper you mentioned and have browsed it briefly. The issue "may end sending all your packets to one cpu" might be dealt with by a cpu hash (srcip + dstip) % nr_cpus, plus checking cpu balance periodically and shifting cpus by an extra seed value? Anyway, the cpu hash code must not be too expensive, because every incoming packet hits the path. We are going to do further study on this RSS thing. __do_IRQ has a tendency to collect the same IRQ on different CPUs onto one CPU when the NIC is busy (by the IRQ_PENDING & IRQ_INPROGRESS control skill), so dispatching the load to SMP here may be a good thing(?). Thanks. John Ye - Original Message - From: "jamal" <[EMAIL PROTECTED]> To: "Stephen Hemminger" <[EMAIL PROTECTED]> Cc: "john ye" <[EMAIL PROTECTED]>; "David Miller" <[EMAIL PROTECTED]>; ; <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]> Sent: Wednesday, September 26, 2007 6:22 AM Subject: Re: [PATCH: 2.6.13-15-SMP 3/3] network: concurrently run softirq network code on SMP > On Tue, 2007-25-09 at 09:03 -0700, Stephen Hemminger wrote: > > > There is a standard hash called RSS, that many drivers support because it is > > used by other operating systems. > > I think any stateless/simple thing will do (something along the lines > of what 802.1ad does for a trunk, a classical five tuple etc). > > Having solved the reordering problem in such a stateless way introduces > a loadbalancing setback; you may end up sending all your packets to one cpu > (a problem Mr Ye didn't have when he was re-ordering ;->). > > cheers, > jamal > >
Re: [RFC] Make TCP prequeue configurable
Stephen Hemminger wrote: On Fri, 28 Sep 2007 00:08:33 +0200 Eric Dumazet <[EMAIL PROTECTED]> wrote: Hi all I am sure some of you are going to tell me that prequeue is not all black :) Thank you [RFC] Make TCP prequeue configurable The TCP prequeue thing is based on old facts, and has drawbacks. 1) It adds 48 bytes per 'struct tcp_sock' 2) It adds some ugly code in hot paths 3) It has a small hit ratio on typical servers using many sockets 4) It may have a high hit ratio on UP machines running one process, where the prequeue adds little gain. (In fact, letting the user do the copy after being woken up is better for cache reuse) 5) Doing a copy to user in a softirq handler is not good, because of potential page faults :( 6) Maybe the NET_DMA thing is the only thing that might need prequeue. This patch introduces a CONFIG_TCP_PREQUEUE, automatically selected if CONFIG_NET_DMA is on. Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]> Rather than having two more compile cases and test cases to deal with: if you can prove it is useless, make a case for killing it completely. I think it really does help in case (4) with old NICs that don't do rx checksumming. I'm not sure how many people really care about this anymore, but probably some...? OTOH, it would be nice to get rid of sysctl_tcp_low_latency. -John
Re: SWS for rcvbuf < MTU
Alex Sidorenko wrote: Here are the values from the live kernel (obtained with 'crash') when the host was in the SWS state: full_space=708 full_space/2=354 free_space=393 window=76 In this case the test from my original fix, (window < full_space/2), succeeds. But John's test, free_space > window + full_space/2 (393 vs. 430), does not. So I suspect that the new fix will not always work. From tcpdump traces we can see that both hosts exchange 76-byte packets for a long time. From the customer's application log we see that it continues to read 76-byte chunks per each read() call - even though more than that is available in the receive buffer. Technically it's OK for read() to return even after reading one byte, so if sk->receive_queue contains multiple 76-byte skbuffs we may return after processing just one skbuff (but we don't understand the details of why this happens on the customer's system). Are there any particular reasons why you want to postpone the window update until free_space becomes > window + full_space/2 and not as soon as free_space > full_space/2? As the only real-life occurrence of SWS shows free_space oscillating slightly above full_space/2, I created the fix specifically to match this phenomenon as seen on the customer's host. We reach the modified section only when (free_space > full_space/2), so it should be OK to update the window at this point if mss==full_space. So yes, we can test John's fix on the customer's host, but I doubt it will work for the reasons mentioned above; in brief: 'window = free_space' instead of 'window=full_space/2' is OK, but the test 'free_space > window + full_space/2' is not, for the specific pattern the customer sees on his hosts. Sorry for the long delay in response, I've been on vacation. I'm okay with your patch, and I can't think of any real problem with it, except that the behavior is non-standard. Then again, Linux acking in general is non-standard, which has created the bug in the first place. 
:) The only case I can think of where it might still ack too often is if free_space frequently drops just below full_space/2 for a bit, then rises above full_space/2. I've also attached a corrected version of my earlier patch that I think solves the problem you noted. Thanks, -John

Do full receiver-side SWS avoidance when rcvbuf < mss.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
commit f4333661026621e15549fb75b37be785e4a1c443
tree 30d46b64ea19634875fdd4656d33f76db526a313
parent 562aa1d4c6a874373f9a48ac184f662fbbb06a04
author John Heffner <[EMAIL PROTECTED]> Tue, 13 Mar 2007 14:17:03 -0400
committer John Heffner <[EMAIL PROTECTED]> Tue, 13 Mar 2007 14:17:03 -0400

 net/ipv4/tcp_output.c | 9 +++++++++-
 1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index dc15113..e621a63 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1605,8 +1605,15 @@ u32 __tcp_select_window(struct sock *sk)
 		 * We also don't do any window rounding when the free space
 		 * is too small.
 		 */
-		if (window <= free_space - mss || window > free_space)
+		if (window <= free_space - mss || window > free_space) {
 			window = (free_space/mss)*mss;
+		} else if (mss == full_space) {
+			/* Do full receive-side SWS avoidance
+			 * when rcvbuf <= mss */
+			window = tcp_receive_window(tp);
+			if (free_space > window + full_space/2)
+				window = free_space;
+		}
 	}

 	return window;
[PATCH] tcp_mem initialization
The current tcp_mem initialization gives values that are really too small for systems with ~256-768 MB of memory, and also for systems with larger page sizes (ia64). This patch gives an alternate method of initialization that doesn't depend on the cache allocation functions, but I think should still provide a nice curve that gives a smaller fraction of total memory to small-memory systems, while maintaining the same upper bound (pressure at 1/2, max at 3/4) on larger memory systems. -John

Change tcp_mem initialization function. The fraction of total memory is now a continuous function of memory size, and independent of page size.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
commit a4461a36efb376bf01399cfd6f1ad15dc89a8794
tree 23b2fb9da52b45de8008fc7ea6bb8c10e3a3724b
parent 8b9909ded6922c33c221b105b26917780cfa497d
author John Heffner <[EMAIL PROTECTED]> Wed, 14 Mar 2007 17:15:06 -0400
committer John Heffner <[EMAIL PROTECTED]> Wed, 14 Mar 2007 17:15:06 -0400

 net/ipv4/tcp.c | 13 ++++++++---
 1 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 74c4d10..3834b10 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2458,11 +2458,18 @@ void __init tcp_init(void)
 		sysctl_max_syn_backlog = 128;
 	}

-	/* Allow no more than 3/4 kernel memory (usually less) allocated to TCP */
-	sysctl_tcp_mem[0] = (1536 / sizeof (struct inet_bind_hashbucket)) << order;
-	sysctl_tcp_mem[1] = sysctl_tcp_mem[0] * 4 / 3;
+	/* Set the pressure threshold to be a fraction of global memory that
+	 * is up to 1/2 at 256 MB, decreasing toward zero with the amount of
+	 * memory, with a floor of 128 pages.
+	 */
+	limit = min(nr_all_pages, 1UL<<(28-PAGE_SHIFT)) >> (20-PAGE_SHIFT);
+	limit = (limit * (nr_all_pages >> (20-PAGE_SHIFT))) >> (PAGE_SHIFT-11);
+	limit = max(limit, 128UL);
+	sysctl_tcp_mem[0] = limit / 4 * 3;
+	sysctl_tcp_mem[1] = limit;
 	sysctl_tcp_mem[2] = sysctl_tcp_mem[0] * 2;

+	/* Set per-socket limits to no more than 1/128 the pressure threshold */
 	limit = ((unsigned long)sysctl_tcp_mem[1]) << (PAGE_SHIFT - 7);
 	max_share = min(4UL*1024*1024, limit);
Re: [PATCH] tcp_mem initialization
David Miller wrote: From: John Heffner <[EMAIL PROTECTED]> Date: Wed, 14 Mar 2007 17:25:22 -0400 The current tcp_mem initialization gives values that are really too small for systems with ~256-768 MB of memory, and also for systems with larger page sizes (ia64). This patch gives an alternate method of initialization that doesn't depend on the cache allocation functions, but I think should still provide a nice curve that gives a smaller fraction of total memory to small-memory systems, while maintaining the same upper bound (pressure at 1/2, max at 3/4) on larger memory systems. Indeed, it's really dumb for any of these calculations to be dependent upon the page size. Your patch looks good, and I'll review it further tomorrow and push upstream unless I find some issues with it. Thanks John. The way it's coded is somewhat opaque, since it has to be done with 32-bit integer arithmetic. These plots might help make the motivation behind the code a little clearer. Thanks, -John
[PATCH 0/3] [NET] MTU discovery changes
These are a few changes to fix/clean up some of the MTU discovery processing with non-stream sockets, and add a probing mode. See also matching patches to tracepath to take advantage of this. -John
[PATCH 1/3] [NET] Do pmtu check in transport layer
Do the pmtu check at the transport layer (for UDP, ICMP and raw), and send a local error if the socket is PMTUDISC_DO and the packet is too big. This is actually a pure bugfix for ipv6. For ipv4, it allows us to do pmtu checks in the same way as for ipv6.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 net/ipv4/ip_output.c  |  4 +++-
 net/ipv4/raw.c        |  8 +++++---
 net/ipv6/ip6_output.c | 11 ++++++-----
 net/ipv6/raw.c        |  7 +++++--
 4 files changed, 19 insertions(+), 11 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index d096332..593acf7 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -822,7 +822,9 @@ int ip_append_data(struct sock *sk,
 	fragheaderlen = sizeof(struct iphdr) + (opt ? opt->optlen : 0);
 	maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen;

-	if (inet->cork.length + length > 0xFFFF - fragheaderlen) {
+	if (inet->cork.length + length > 0xFFFF - fragheaderlen ||
+	    (inet->pmtudisc >= IP_PMTUDISC_DO &&
+	     inet->cork.length + length > mtu)) {
 		ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->dport, mtu-exthdrlen);
 		return -EMSGSIZE;
 	}
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 87e9c16..f252f4e 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -271,10 +271,12 @@ static int raw_send_hdrinc(struct sock *sk, void *from, size_t length,
 	struct iphdr *iph;
 	struct sk_buff *skb;
 	int err;
+	int mtu;

-	if (length > rt->u.dst.dev->mtu) {
-		ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->dport,
-			       rt->u.dst.dev->mtu);
+	mtu = inet->pmtudisc == IP_PMTUDISC_DO ? dst_mtu(&rt->u.dst) :
+						 rt->u.dst.dev->mtu;
+	if (length > mtu) {
+		ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->dport, mtu);
 		return -EMSGSIZE;
 	}
 	if (flags&MSG_PROBE)
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 3055169..711dfc3 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1044,11 +1044,12 @@ int ip6_append_data(struct sock *sk, int getfrag(void *from, char *to,
 	fragheaderlen = sizeof(struct ipv6hdr) + rt->u.dst.nfheader_len + (opt ? opt->opt_nflen : 0);
 	maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen - sizeof(struct frag_hdr);

-	if (mtu <= sizeof(struct ipv6hdr) + IPV6_MAXPLEN) {
-		if (inet->cork.length + length > sizeof(struct ipv6hdr) + IPV6_MAXPLEN - fragheaderlen) {
-			ipv6_local_error(sk, EMSGSIZE, fl, mtu-exthdrlen);
-			return -EMSGSIZE;
-		}
+	if ((mtu <= sizeof(struct ipv6hdr) + IPV6_MAXPLEN &&
+	     inet->cork.length + length > sizeof(struct ipv6hdr) + IPV6_MAXPLEN - fragheaderlen) ||
+	    (np->pmtudisc >= IPV6_PMTUDISC_DO &&
+	     inet->cork.length + length > mtu)) {
+		ipv6_local_error(sk, EMSGSIZE, fl, mtu-exthdrlen);
+		return -EMSGSIZE;
 	}

 	/*
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 306d5d8..75db277 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -556,9 +556,12 @@ static int rawv6_send_hdrinc(struct sock *sk, void *from, int length,
 	struct sk_buff *skb;
 	unsigned int hh_len;
 	int err;
+	int mtu;

-	if (length > rt->u.dst.dev->mtu) {
-		ipv6_local_error(sk, EMSGSIZE, fl, rt->u.dst.dev->mtu);
+	mtu = np->pmtudisc == IPV6_PMTUDISC_DO ? dst_mtu(&rt->u.dst) :
+						 rt->u.dst.dev->mtu;
+	if (length > mtu) {
+		ipv6_local_error(sk, EMSGSIZE, fl, mtu);
 		return -EMSGSIZE;
 	}
 	if (flags&MSG_PROBE)
--
1.5.0.2.gc260-dirty
[PATCH 2/3] [NET] Move DF check to ip_forward
Do fragmentation check in ip_forward, similar to ipv6 forwarding. Also add a debug printk in the DF check in ip_fragment since we should now never reach it.

Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 net/ipv4/ip_forward.c | 8 ++++++++
 net/ipv4/ip_output.c  | 2 ++
 2 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c
index 369e721..0efb1f5 100644
--- a/net/ipv4/ip_forward.c
+++ b/net/ipv4/ip_forward.c
@@ -85,6 +85,14 @@ int ip_forward(struct sk_buff *skb)
 	if (opt->is_strictroute && rt->rt_dst != rt->rt_gateway)
 		goto sr_failed;

+	if (unlikely(skb->len > dst_mtu(&rt->u.dst) &&
+		     (skb->nh.iph->frag_off & htons(IP_DF))) && !skb->local_df) {
+		IP_INC_STATS(IPSTATS_MIB_FRAGFAILS);
+		icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
+			  htonl(dst_mtu(&rt->u.dst)));
+		goto drop;
+	}
+
 	/* We are about to mangle packet. Copy it! */
 	if (skb_cow(skb, LL_RESERVED_SPACE(rt->u.dst.dev)+rt->u.dst.header_len))
 		goto drop;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 593acf7..90bdd53 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -433,6 +433,8 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff*))
 	iph = skb->nh.iph;

 	if (unlikely((iph->frag_off & htons(IP_DF)) && !skb->local_df)) {
+		if (net_ratelimit())
+			printk(KERN_DEBUG "ip_fragment: requested fragment of packet with DF set\n");
 		IP_INC_STATS(IPSTATS_MIB_FRAGFAILS);
 		icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
 			  htonl(dst_mtu(&rt->u.dst)));
--
1.5.0.2.gc260-dirty
[PATCH 3/3] [NET] Add IP(V6)_PMTUDISC_PROBE
Add IP(V6)_PMTUDISC_PROBE value for IP(V6)_MTU_DISCOVER. This option forces us not to fragment, but does not make use of the kernel path MTU discovery. That is, it allows for user-mode MTU probing (or, packetization-layer path MTU discovery). This is particularly useful for diagnostic utilities, like traceroute/tracepath. Signed-off-by: John Heffner <[EMAIL PROTECTED]> --- include/linux/in.h |1 + include/linux/in6.h |1 + include/linux/skbuff.h |3 ++- include/net/ip.h |2 +- net/core/skbuff.c|2 ++ net/ipv4/ip_output.c | 14 ++ net/ipv4/ip_sockglue.c |2 +- net/ipv4/raw.c |3 +++ net/ipv6/ip6_output.c| 12 net/ipv6/ipv6_sockglue.c |2 +- net/ipv6/raw.c |3 +++ 11 files changed, 33 insertions(+), 12 deletions(-) diff --git a/include/linux/in.h b/include/linux/in.h index 1912e7c..2dc1f8a 100644 --- a/include/linux/in.h +++ b/include/linux/in.h @@ -83,6 +83,7 @@ struct in_addr { #define IP_PMTUDISC_DONT 0 /* Never send DF frames */ #define IP_PMTUDISC_WANT 1 /* Use per route hints */ #define IP_PMTUDISC_DO 2 /* Always DF*/ +#define IP_PMTUDISC_PROBE 3 /* Ignore dst pmtu */ #define IP_MULTICAST_IF32 #define IP_MULTICAST_TTL 33 diff --git a/include/linux/in6.h b/include/linux/in6.h index 4e8350a..d559fac 100644 --- a/include/linux/in6.h +++ b/include/linux/in6.h @@ -179,6 +179,7 @@ struct in6_flowlabel_req #define IPV6_PMTUDISC_DONT 0 #define IPV6_PMTUDISC_WANT 1 #define IPV6_PMTUDISC_DO 2 +#define IPV6_PMTUDISC_PROBE3 /* Flowlabel */ #define IPV6_FLOWLABEL_MGR 32 diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 4ff3940..64038b4 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -284,7 +284,8 @@ struct sk_buff { nfctinfo:3; __u8pkt_type:3, fclone:2, - ipvs_property:1; + ipvs_property:1, + ign_dst_mtu; __be16 protocol; void(*destructor)(struct sk_buff *skb); diff --git a/include/net/ip.h b/include/net/ip.h index e79c3e3..f5874a3 100644 --- a/include/net/ip.h +++ b/include/net/ip.h @@ -201,7 +201,7 @@ int ip_decrease_ttl(struct iphdr *iph) static 
inline int ip_dont_fragment(struct sock *sk, struct dst_entry *dst) { - return (inet_sk(sk)->pmtudisc == IP_PMTUDISC_DO || + return (inet_sk(sk)->pmtudisc >= IP_PMTUDISC_DO || (inet_sk(sk)->pmtudisc == IP_PMTUDISC_WANT && !(dst_metric(dst, RTAX_LOCK)&(1<destructor = NULL; C(mark); @@ -549,6 +550,7 @@ static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old) #if defined(CONFIG_IP_VS) || defined(CONFIG_IP_VS_MODULE) new->ipvs_property = old->ipvs_property; #endif + new->ign_dst_mtu= old->ign_dst_mtu; #ifdef CONFIG_BRIDGE_NETFILTER new->nf_bridge = old->nf_bridge; nf_bridge_get(old->nf_bridge); diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index 90bdd53..a7e8944 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -201,7 +201,8 @@ static inline int ip_finish_output(struct sk_buff *skb) return dst_output(skb); } #endif - if (skb->len > dst_mtu(skb->dst) && !skb_is_gso(skb)) + if (skb->len > dst_mtu(skb->dst) && + !skb->ign_dst_mtu && !skb_is_gso(skb)) return ip_fragment(skb, ip_finish_output2); else return ip_finish_output2(skb); @@ -801,7 +802,9 @@ int ip_append_data(struct sock *sk, inet->cork.addr = ipc->addr; } dst_hold(&rt->u.dst); - inet->cork.fragsize = mtu = dst_mtu(rt->u.dst.path); + inet->cork.fragsize = mtu = inet->pmtudisc == IP_PMTUDISC_PROBE ? + rt->u.dst.dev->mtu : + dst_mtu(rt->u.dst.path); inet->cork.rt = rt; inet->cork.length = 0; sk->sk_sndmsg_page = NULL; @@ -1220,13 +1223,16 @@ int ip_push_pending_frames(struct sock *sk) * to fragment the frame generated here. No matter, what transforms * how transforms change size of the packet, it will come out. */ - if (inet->pmtudisc != IP_PMTUDISC_DO) + if (inet->pmtudisc < IP_PMTUDISC_DO) skb->local_df = 1; + if (inet->pmtudisc == IP_PMTUDISC_PROBE) + s
[PATCH 0/2] [iputils] MTU discovery changes
These add some changes that make tracepath a little more useful for diagnosing MTU issues. The length flag helps distinguish between MTU black holes and other types of black holes by allowing you to vary the probe packet lengths. Using PMTUDISC_PROBE gives you the same results on each run without having to flush the route cache, so you can see where MTU changes in the path actually occur. Whether the PMTUDISC_PROBE patch goes in should be conditional on whether the corresponding kernel patch (just sent) goes in. -John
[PATCH 2/2] [iputils] Use PMTUDISC_PROBE mode if it exists.
Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 tracepath.c  | 10 ++++++++--
 tracepath6.c | 10 ++++++++--
 2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/tracepath.c b/tracepath.c
index 1f901ba..a562d88 100644
--- a/tracepath.c
+++ b/tracepath.c
@@ -24,6 +24,10 @@
 #include
 #include

+#ifndef IP_PMTUDISC_PROBE
+#define IP_PMTUDISC_PROBE	3
+#endif
+
 struct hhistory
 {
 	int	hops;
@@ -322,8 +326,10 @@ main(int argc, char **argv)
 	}
 	memcpy(&target.sin_addr, he->h_addr, 4);

-	on = IP_PMTUDISC_DO;
-	if (setsockopt(fd, SOL_IP, IP_MTU_DISCOVER, &on, sizeof(on))) {
+	on = IP_PMTUDISC_PROBE;
+	if (setsockopt(fd, SOL_IP, IP_MTU_DISCOVER, &on, sizeof(on)) &&
+	    (on = IP_PMTUDISC_DO,
+	     setsockopt(fd, SOL_IP, IP_MTU_DISCOVER, &on, sizeof(on)))) {
 		perror("IP_MTU_DISCOVER");
 		exit(1);
 	}
diff --git a/tracepath6.c b/tracepath6.c
index d65230d..6f13a51 100644
--- a/tracepath6.c
+++ b/tracepath6.c
@@ -30,6 +30,10 @@
 #define SOL_IPV6 IPPROTO_IPV6
 #endif

+#ifndef IPV6_PMTUDISC_PROBE
+#define IPV6_PMTUDISC_PROBE	3
+#endif
+
 int overhead = 48;
 int mtu = 128000;
 int hops_to = -1;
@@ -369,8 +373,10 @@ int main(int argc, char **argv)
 		mapped = 1;
 	}

-	on = IPV6_PMTUDISC_DO;
-	if (setsockopt(fd, SOL_IPV6, IPV6_MTU_DISCOVER, &on, sizeof(on))) {
+	on = IPV6_PMTUDISC_PROBE;
+	if (setsockopt(fd, SOL_IPV6, IPV6_MTU_DISCOVER, &on, sizeof(on)) &&
+	    (on = IPV6_PMTUDISC_DO,
+	     setsockopt(fd, SOL_IPV6, IPV6_MTU_DISCOVER, &on, sizeof(on)))) {
 		perror("IPV6_MTU_DISCOVER");
 		exit(1);
 	}
--
1.5.0.2.gc260-dirty
[PATCH 1/2] [iputils] Add length flag to set initial MTU.
Signed-off-by: John Heffner <[EMAIL PROTECTED]>
---
 tracepath.c  | 10 ++++++++--
 tracepath6.c | 10 ++++++++--
 2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/tracepath.c b/tracepath.c
index c3f6f74..1f901ba 100644
--- a/tracepath.c
+++ b/tracepath.c
@@ -265,7 +265,7 @@ static void usage(void) __attribute((noreturn));

 static void usage(void)
 {
-	fprintf(stderr, "Usage: tracepath [-n] <destination>[/<port>]\n");
+	fprintf(stderr, "Usage: tracepath [-n] [-l <length>] <destination>[/<port>]\n");
 	exit(-1);
 }

@@ -279,11 +279,17 @@ main(int argc, char **argv)
 	char *p;
 	int ch;

-	while ((ch = getopt(argc, argv, "nh?")) != EOF) {
+	while ((ch = getopt(argc, argv, "nh?l:")) != EOF) {
 		switch(ch) {
 		case 'n':
 			no_resolve = 1;
 			break;
+		case 'l':
+			if ((mtu = atoi(optarg)) <= overhead) {
+				fprintf(stderr, "Error: length must be >= %d\n", overhead);
+				exit(1);
+			}
+			break;
 		default:
 			usage();
 		}
diff --git a/tracepath6.c b/tracepath6.c
index 23d6a8c..d65230d 100644
--- a/tracepath6.c
+++ b/tracepath6.c
@@ -280,7 +280,7 @@ static void usage(void) __attribute((noreturn));

 static void usage(void)
 {
-	fprintf(stderr, "Usage: tracepath6 [-n] [-b] <destination>[/<port>]\n");
+	fprintf(stderr, "Usage: tracepath6 [-n] [-b] [-l <length>] <destination>[/<port>]\n");
 	exit(-1);
 }

@@ -297,7 +297,7 @@ int main(int argc, char **argv)
 	int gai;
 	char pbuf[NI_MAXSERV];

-	while ((ch = getopt(argc, argv, "nbh?")) != EOF) {
+	while ((ch = getopt(argc, argv, "nbh?l:")) != EOF) {
 		switch(ch) {
 		case 'n':
 			no_resolve = 1;
@@ -305,6 +305,12 @@ int main(int argc, char **argv)
 		case 'b':
 			show_both = 1;
 			break;
+		case 'l':
+			if ((mtu = atoi(optarg)) <= overhead) {
+				fprintf(stderr, "Error: length must be >= %d\n", overhead);
+				exit(1);
+			}
+			break;
 		default:
 			usage();
 		}
--
1.5.0.2.gc260-dirty
[PATCH] ip(7) IP_PMTUDISC_PROBE
Document the new IP_PMTUDISC_PROBE value for IP_MTU_DISCOVER. (Going into 2.6.22.) Thanks, -John

diff -rU3 man-pages-2.43-a/man7/ip.7 man-pages-2.43-b/man7/ip.7
--- man-pages-2.43-a/man7/ip.7	2006-09-26 09:54:29.000000000 -0400
+++ man-pages-2.43-b/man7/ip.7	2007-03-27 15:46:18.000000000 -0400
@@ -515,6 +515,7 @@
 IP_PMTUDISC_WANT:Use per-route settings.
 IP_PMTUDISC_DONT:Never do Path MTU Discovery.
 IP_PMTUDISC_DO:Always do Path MTU Discovery.
+IP_PMTUDISC_PROBE:Set DF but ignore Path MTU.
 .TE

 When PMTU discovery is enabled the kernel automatically keeps track of
@@ -550,6 +551,15 @@
 with the
 .B IP_MTU
 option.
+
+It is possible to implement RFC 4821 MTU probing with
+.B SOCK_DGRAM
+or
+.B SOCK_RAW
+sockets by setting a value of IP_PMTUDISC_PROBE. This is also particularly
+useful for diagnostic tools such as
+.BR tracepath (8)
+that wish to deliberately send probe packets larger than the observed Path MTU.
 .TP
 .B IP_MTU
 Retrieve the current known path MTU of the current socket.
Re: [PATCH] NET: Add TCP connection abort IOCTL
Mark Huth wrote: David Miller wrote: From: [EMAIL PROTECTED] (David Griego) Date: Tue, 27 Mar 2007 14:47:54 -0700 Adds an IOCTL for aborting established TCP connections, and is designed to be an HA performance improvement for cleaning up, failure notification, and application termination. Signed-off-by: David Griego <[EMAIL PROTECTED]> SO_LINGER with a zero linger time plus close() isn't working properly? There is no reason for this ioctl at all. Either existing facilities provide what you need or what you want is a protocol violation we can't do. Actually, there are legitimate uses for this sort of API. The patch allows an administrator to kill specific connections that are in use by other applications, where close() is not available, since the socket is owned by another process. Say one of your large applications has hundreds or even thousands of open connections and you have determined that a particular connection is causing trouble. This API allows the admin to kill that particular connection, and doesn't appear to violate any RFC offhand, since an abort is sent to the peer. One may argue that the applications should be modified, but that is not always possible in the case of various ISVs. As Linux gains market share in the large server market, more and more applications are being ported from other platforms that have this sort of management/administrative interface. Mark Huth I also believe this is a useful thing to have. I'm not 100% sure this ioctl is the way to go, but it seems reasonable. This directly corresponds to writing deleteTcb to the tcpConnectionState variable in the TCP MIB (RFC 4022). I don't think it constitutes a protocol violation. As a concrete example, I've used this type of feature to defend against a netkill [1] style attack, where the defense involves making decisions about which connections to kill when memory gets scarce.
It makes sense to do this with a system daemon, since an admin might have an arbitrarily complicated policy as to which applications and peers have priority for the memory. This is too complicated to distribute and enforce across all applications. You could do this in the kernel, but why if you don't have to? -John [1] http://shlang.com/netkill/
Re: [PATCH] NET: Add TCP connection abort IOCTL
John Heffner wrote: I also believe this is a useful thing to have. I'm not 100% sure this ioctl is the way to go, but it seems reasonable. This directly corresponds to writing deleteTcb to the tcpConnectionState variable in the TCP MIB (RFC 4022). I don't think it constitutes a protocol violation. Responding to myself in good form :P I'll add that there are other ways to do this currently, but all I know of are hackish, e.g. using a raw socket to send RST packets to yourself. -John
[PATCH] [iputils] Add documentation for the -l flag.
--- doc/tracepath.sgml | 13 + 1 files changed, 13 insertions(+), 0 deletions(-) diff --git a/doc/tracepath.sgml b/doc/tracepath.sgml index 71eaa8d..c0f308b 100644 --- a/doc/tracepath.sgml +++ b/doc/tracepath.sgml @@ -15,6 +15,7 @@ traces path to a network host discovering MTU along this path tracepath +-l @@ -39,6 +40,18 @@ of UDP ports to maintain trace history. +OPTIONS + + + + +Sets the initial packet length to + + + + OUTPUT -- 1.5.0.2.gc260-dirty
[PATCH] [iputils] Document -n flag.
--- doc/tracepath.sgml |9 + 1 files changed, 9 insertions(+), 0 deletions(-) diff --git a/doc/tracepath.sgml b/doc/tracepath.sgml index c0f308b..1bc83b9 100644 --- a/doc/tracepath.sgml +++ b/doc/tracepath.sgml @@ -15,6 +15,7 @@ traces path to a network host discovering MTU along this path tracepath +-n -l @@ -42,6 +43,14 @@ of UDP ports to maintain trace history. OPTIONS + + + + +Do not look up host names. Only print IP addresses numerically. + + + -- 1.5.0.2.gc260-dirty
[PATCH 2/2] [iputils] Re-probe at same TTL after MTU reduction.
This fixes a bug that would miss a hop after an ICMP packet too big message, since it would continue to increase the TTL without probing again.
---
 tracepath.c  | 6 ++++++
 tracepath6.c | 6 ++++++
 2 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/tracepath.c b/tracepath.c
index d035a1e..19b2c6b 100644
--- a/tracepath.c
+++ b/tracepath.c
@@ -352,8 +352,14 @@ main(int argc, char **argv)
 		exit(1);
 	}

+restart:
 	for (i=0; i<3; i++) {
+		int old_mtu;
+
+		old_mtu = mtu;
 		res = probe_ttl(fd, ttl);
+		if (mtu != old_mtu)
+			goto restart;
 		if (res == 0)
 			goto done;
 		if (res > 0)
diff --git a/tracepath6.c b/tracepath6.c
index a010218..65c4a4a 100644
--- a/tracepath6.c
+++ b/tracepath6.c
@@ -422,8 +422,14 @@ int main(int argc, char **argv)
 		exit(1);
 	}

+restart:
 	for (i=0; i<3; i++) {
+		int old_mtu;
+
+		old_mtu = mtu;
 		res = probe_ttl(fd, ttl);
+		if (mtu != old_mtu)
+			goto restart;
 		if (res == 0)
 			goto done;
 		if (res > 0)
--
1.5.0.2.gc260-dirty
[PATCH 1/2] [iputils] Fix asymm messages.
We should only print the asymm messages in tracepath/6 when you receive a TTL expired message, because this is the only time when we'd expect the same number of hops back as our TTL was set to for a symmetric path.
---
 tracepath.c  | 25 -------------------------
 tracepath6.c | 25 -------------------------
 2 files changed, 24 insertions(+), 26 deletions(-)

diff --git a/tracepath.c b/tracepath.c
index a562d88..d035a1e 100644
--- a/tracepath.c
+++ b/tracepath.c
@@ -163,19 +163,6 @@ restart:
 		}
 	}

-	if (rethops>=0) {
-		if (rethops<=64)
-			rethops = 65-rethops;
-		else if (rethops<=128)
-			rethops = 129-rethops;
-		else
-			rethops = 256-rethops;
-		if (sndhops>=0 && rethops != sndhops)
-			printf("asymm %2d ", rethops);
-		else if (sndhops<0 && rethops != ttl)
-			printf("asymm %2d ", rethops);
-	}
-
 	if (rettv) {
 		int diff = (tv.tv_sec-rettv->tv_sec)*1000000+(tv.tv_usec-rettv->tv_usec);
 		printf("%3d.%03dms ", diff/1000, diff%1000);
@@ -204,6 +191,18 @@
 			if (e->ee_origin == SO_EE_ORIGIN_ICMP &&
 			    e->ee_type == 11 &&
 			    e->ee_code == 0) {
+				if (rethops>=0) {
+					if (rethops<=64)
+						rethops = 65-rethops;
+					else if (rethops<=128)
+						rethops = 129-rethops;
+					else
+						rethops = 256-rethops;
+					if (sndhops>=0 && rethops != sndhops)
+						printf("asymm %2d ", rethops);
+					else if (sndhops<0 && rethops != ttl)
+						printf("asymm %2d ", rethops);
+				}
 				printf("\n");
 				break;
 			}
diff --git a/tracepath6.c b/tracepath6.c
index 6f13a51..a010218 100644
--- a/tracepath6.c
+++ b/tracepath6.c
@@ -176,19 +176,6 @@ restart:
 		}
 	}

-	if (rethops>=0) {
-		if (rethops<=64)
-			rethops = 65-rethops;
-		else if (rethops<=128)
-			rethops = 129-rethops;
-		else
-			rethops = 256-rethops;
-		if (sndhops>=0 && rethops != sndhops)
-			printf("asymm %2d ", rethops);
-		else if (sndhops<0 && rethops != ttl)
-			printf("asymm %2d ", rethops);
-	}
-
 	if (rettv) {
 		int diff = (tv.tv_sec-rettv->tv_sec)*1000000+(tv.tv_usec-rettv->tv_usec);
 		printf("%3d.%03dms ", diff/1000, diff%1000);
@@ -220,6 +207,18 @@
 			    (e->ee_origin == SO_EE_ORIGIN_ICMP6 &&
 			     e->ee_type == 3 &&
 			     e->ee_code == 0)) {
+				if (rethops>=0) {
+					if (rethops<=64)
+						rethops = 65-rethops;
+					else if (rethops<=128)
+						rethops = 129-rethops;
+					else
+						rethops = 256-rethops;
+					if (sndhops>=0 && rethops != sndhops)
+						printf("asymm %2d ", rethops);
+					else if (sndhops<0 && rethops != ttl)
+						printf("asymm %2d ", rethops);
+				}
 				printf("\n");
 				break;
 			}
--
1.5.0.2.gc260-dirty
Re: [PATCH 1/3] [NET] Do pmtu check in transport layer
Patrick McHardy wrote: John Heffner wrote: Do the pmtu check at the transport layer (for UDP, ICMP and raw), and send a local error if the socket is PMTUDISC_DO and the packet is too big. This is actually a pure bugfix for ipv6. For ipv4, it allows us to do pmtu checks in the same way as for ipv6.

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index d096332..593acf7 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -822,7 +822,9 @@ int ip_append_data(struct sock *sk,
 	fragheaderlen = sizeof(struct iphdr) + (opt ? opt->optlen : 0);
 	maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen;

-	if (inet->cork.length + length > 0xFFFF - fragheaderlen) {
+	if (inet->cork.length + length > 0xFFFF - fragheaderlen ||
+	    (inet->pmtudisc >= IP_PMTUDISC_DO &&
+	     inet->cork.length + length > mtu)) {
 		ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->dport, mtu-exthdrlen);
 		return -EMSGSIZE;
 	}

This makes ping report an incorrect MTU when IPsec is used, since we're only accounting for the additional header_len, not the trailer_len (which is not easily changeable). Additionally, it will report different MTUs for the first and following fragments when the socket is corked, because only the first fragment includes the header_len. It also can't deal with things like NAT and routing by fwmark that change the route. The old behaviour was that we get an ICMP frag. required with the MTU of the final route, while this will always report the MTU of the initially chosen route. For all these reasons I think it should be reverted to the old behaviour.

You're right, this is no good. I think the other problems are fixable, but NAT really screws this up. Unfortunately, there is still a real problem with ipv6, in that the output side does not generate a packet too big ICMP like ipv4. Also, it feels kind of undesirable to rely on local ICMP instead of direct error message delivery. I'll try to generate a new patch.
Thanks, -John
Re: TCP connection stops after high load.
Robert Iakobashvili wrote: Vanilla 2.6.18.3 works for me perfectly, whereas 2.6.19.5 and 2.6.20.6 do not. Looking into the tcp /proc entries of 2.6.18.3 versus 2.6.19.5, tcp_rmem and tcp_wmem are the same, whereas the tcp_mem values are much different:

kernel    tcp_mem
---------------------------
2.6.18.3  12288 16384 24576
2.6.19.5   3072  4096  6144

Is not it done deliberately by the below patch: commit 9e950efa20dc8037c27509666cba6999da9368e8 Author: John Heffner <[EMAIL PROTECTED]> Date: Mon Nov 6 23:10:51 2006 -0800 [TCP]: Don't use highmem in tcp hash size calculation. This patch removes consideration of high memory when determining TCP hash table sizes. Taking into account high memory results in tcp_mem values that are too large. Is it a feature? My machine has: MemTotal: 484368 kB and all kernel configurations are actually the same, with CONFIG_HIGHMEM4G=y Thanks,

Another patch that went in right around that time:

commit 52bf376c63eebe72e862a1a6e713976b038c3f50
Author: John Heffner <[EMAIL PROTECTED]>
Date: Tue Nov 14 20:25:17 2006 -0800

    [TCP]: Fix up sysctl_tcp_mem initialization.

    Fix up tcp_mem initial settings to take into account the size of the
    hash entries (different on SMP and non-SMP systems).

    Signed-off-by: John Heffner <[EMAIL PROTECTED]>
    Signed-off-by: David S. Miller <[EMAIL PROTECTED]>

(This has been changed again for 2.6.21.) In the dmesg, there should be some messages like this:

IP route cache hash table entries: 32768 (order: 5, 131072 bytes)
TCP established hash table entries: 131072 (order: 8, 1048576 bytes)
TCP bind hash table entries: 65536 (order: 6, 262144 bytes)
TCP: Hash tables configured (established 131072 bind 65536)

What do yours say? Thanks, -John
Re: TCP connection stops after high load.
Robert Iakobashvili wrote: Hi John, On 4/15/07, John Heffner <[EMAIL PROTECTED]> wrote: Robert Iakobashvili wrote: > Vanilla 2.6.18.3 works for me perfectly, whereas 2.6.19.5 and > 2.6.20.6 do not. > > Looking into the tcp /proc entries of 2.6.18.3 versus 2.6.19.5 > tcp_rmem and tcp_wmem are the same, whereas tcp_mem are > much different: > > kernel tcp_mem > --- > 2.6.18.3  12288 16384 24576 > 2.6.19.5   3072  4096  6144 Another patch that went in right around that time: commit 52bf376c63eebe72e862a1a6e713976b038c3f50 Author: John Heffner <[EMAIL PROTECTED]> Date: Tue Nov 14 20:25:17 2006 -0800 [TCP]: Fix up sysctl_tcp_mem initialization. (This has been changed again for 2.6.21.) In the dmesg, there should be some messages like this: IP route cache hash table entries: 32768 (order: 5, 131072 bytes) TCP established hash table entries: 131072 (order: 8, 1048576 bytes) TCP bind hash table entries: 65536 (order: 6, 262144 bytes) TCP: Hash tables configured (established 131072 bind 65536) What do yours say? For the 2.6.19.5, where we have this problem: From dmesg: IP route cache hash table entries: 4096 (order: 2, 16384 bytes) TCP established hash table entries: 16384 (order: 5, 131072 bytes) TCP bind hash table entries: 8192 (order: 4, 65536 bytes) #cat /proc/sys/net/ipv4/tcp_mem 3072 4096 6144 MemTotal: 484368 kB CONFIG_HIGHMEM4G=y Yes, this difference is caused by the commit above. The old way didn't really make a lot of sense, since it was different based on smp/non-smp and page size, and had large discontinuities at 512MB and every power of two. It was hard to make the limit never larger than the memory pool but never too small either, when based on the hash table size. The current net-2.6 (2.6.21) has a redesigned tcp_mem initialization that should give you more appropriate values, something like 45408 60546 90816.
For reference:

Commit: 53cdcc04c1e85d4e423b2822b66149b6f2e52c2c
Author: John Heffner <[EMAIL PROTECTED]>
Date: Fri, 16 Mar 2007 15:04:03 -0700

    [TCP]: Fix tcp_mem[] initialization.

    Change tcp_mem initialization function. The fraction of total memory
    is now a continuous function of memory size, and independent of page
    size.

    Signed-off-by: John Heffner <[EMAIL PROTECTED]>
    Signed-off-by: David S. Miller <[EMAIL PROTECTED]>

Thanks,
-John
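The redesigned initialization can be sketched roughly as follows. This is a hypothetical simplification, not the kernel's code: `pool` stands in for whatever the kernel derives from its free-buffer-page count, and it is an assumption here that feeding in the poster's quoted figure reproduces the example values mentioned above.

```python
def tcp_mem_init(pool):
    """Illustrative sketch of a 2.6.21-style tcp_mem computation:
    a pressure threshold of 1/8 of the pool (floored at 128), a low
    watermark at 3/4 of it, and a hard limit at twice the low mark."""
    limit = max(pool // 8, 128)   # pressure threshold
    low = limit // 4 * 3          # below this, no memory pressure
    high = low * 2                # hard limit on TCP memory
    return low, limit, high

# Assumption: using the poster's quoted figure as the pool size happens
# to reproduce the example values "45408 60546 90816" cited above.
print(tcp_mem_init(484368))  # -> (45408, 60546, 90816)
```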
Re: TCP connection stops after high load.
Robert Iakobashvili wrote:
> Kernels in the 2.6.19 and 2.6.20 series are effectively broken right
> now. Don't you wish to patch them?

I don't know if this qualifies as an unconditional bug. The commit above was actually a bugfix, so that the limits were not higher than total memory on some systems, but it had the side effect of making them even smaller on your particular configuration. Also, having initial sysctl values that are conservatively small probably doesn't qualify as a bug (for patching stable trees). You might ask the -stable maintainers if they have a different opinion.

For most people, 2.6.19 and 2.6.20 work fine. For those who really care about the tcp_mem values (i.e., those using a substantial fraction of physical memory for TCP connections), the best bet is to set the tcp_mem sysctl values in the startup scripts, or use the new initialization function in 2.6.21.

Thanks,
-John
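Pinning the values at boot, as suggested above, can be done with a sysctl configuration fragment. The numbers below are illustrative only (they match the 2.6.18.3 defaults quoted earlier in the thread) and should be sized to the machine's memory and workload:

```
# /etc/sysctl.conf -- illustrative tcp_mem override.
# Three values, in pages: low watermark, pressure threshold, hard limit.
net.ipv4.tcp_mem = 12288 16384 24576
```

Applied at boot by the distribution's sysctl init step, this overrides whatever default the kernel computed at startup.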
Re: bug in tcp?
Stephen Hemminger wrote:
> A guess: maybe something related to a PAWS wraparound problem. Does
> turning off the sysctl net.ipv4.tcp_timestamps fix it?

That was my first thought too (aside from netfilter), but a failed PAWS check should not result in a reset.

-John
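For context on why a PAWS failure shouldn't surface as a reset: under RFC 1323, a segment whose timestamp is older than the last one recorded for the connection is simply discarded (and acknowledged), never answered with a RST. A rough sketch of that acceptance test, with wraparound-aware 32-bit comparison (a simplification; the real check also involves idle-time and sequence-number conditions):

```python
def paws_accept(ts_val, ts_recent):
    """True if the segment's timestamp is not 'before' the most recently
    seen one, using signed 32-bit wraparound arithmetic (sketch)."""
    delta = (ts_val - ts_recent) & 0xFFFFFFFF
    return delta < 0x80000000

def on_segment(ts_val, ts_recent):
    if not paws_accept(ts_val, ts_recent):
        return "drop-and-ack"   # discard and send a duplicate ACK -- no RST
    return "process"

print(on_segment(100, 200))  # stale timestamp -> 'drop-and-ack'
print(on_segment(300, 200))  # fresh timestamp -> 'process'
```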
Re: TCP connection stops after high load.
David Miller wrote:
> From: "Robert Iakobashvili" <[EMAIL PROTECTED]>
> Date: Tue, 17 Apr 2007 10:58:04 +0300
>
>> David,
>>
>> On 4/16/07, David Miller <[EMAIL PROTECTED]> wrote:
>>>> Commit: 53cdcc04c1e85d4e423b2822b66149b6f2e52c2c
>>>> Author: John Heffner <[EMAIL PROTECTED]>
>>>> Date: Fri, 16 Mar 2007 15:04:03 -0700
>>>>
>>>>     [TCP]: Fix tcp_mem[] initialization.
>>>>
>>>>     Change tcp_mem initialization function. The fraction of total
>>>>     memory is now a continuous function of memory size, and
>>>>     independent of page size.
>>>>
>>>> Kernels in the 2.6.19 and 2.6.20 series are effectively broken right
>>>> now. Don't you wish to patch them?
>>>
>>> Can you verify that this patch actually fixes your problem?
>>
>> Yes, it fixes the problem.
>
> Thanks, I will submit it to the -stable branch.

My only reservation in submitting this to -stable is that it will in many cases increase the default tcp_mem values, which in turn can increase the default tcp_rmem values, and therefore the window scale. There will be some set of people with broken firewalls who trigger that problem for the first time by upgrading along the stable branch. While it's not our fault, it could cause some complaints...

Thanks,
-John
[PATCH 0/4] Re-try changes for PMTUDISC_PROBE
This backs out the transport layer MTU checks that don't work. As a consequence, I had to back out the PMTUDISC_PROBE patch as well. These patches should fix the problem with ipv6 that the transport layer change tried to address, and re-implement PMTUDISC_PROBE. I think this approach is nicer than the last one, since it doesn't require a bit in struct sk_buff.

Thanks,
-John
[PATCH] Revert "[NET] Do pmtu check in transport layer"
This reverts commit 87e927a0583bd4a8ba9e97cd75b58d8aa1c76e37. This idea does not work, as pointed at by Patrick McHardy. Signed-off-by: John Heffner <[EMAIL PROTECTED]> --- net/ipv4/ip_output.c |4 +--- net/ipv4/raw.c|8 +++- net/ipv6/ip6_output.c | 11 +-- net/ipv6/raw.c|7 ++- 4 files changed, 11 insertions(+), 19 deletions(-) diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index 79e71ee..34606ef 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -810,9 +810,7 @@ int ip_append_data(struct sock *sk, fragheaderlen = sizeof(struct iphdr) + (opt ? opt->optlen : 0); maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen; - if (inet->cork.length + length > 0x - fragheaderlen || - (inet->pmtudisc >= IP_PMTUDISC_DO && -inet->cork.length + length > mtu)) { + if (inet->cork.length + length > 0x - fragheaderlen) { ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->dport, mtu-exthdrlen); return -EMSGSIZE; } diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c index c60aadf..24d7c9f 100644 --- a/net/ipv4/raw.c +++ b/net/ipv4/raw.c @@ -271,12 +271,10 @@ static int raw_send_hdrinc(struct sock *sk, void *from, size_t length, struct iphdr *iph; struct sk_buff *skb; int err; - int mtu; - mtu = inet->pmtudisc == IP_PMTUDISC_DO ? dst_mtu(&rt->u.dst) : -rt->u.dst.dev->mtu; - if (length > mtu) { - ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->dport, mtu); + if (length > rt->u.dst.dev->mtu) { + ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->dport, + rt->u.dst.dev->mtu); return -EMSGSIZE; } if (flags&MSG_PROBE) diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index b8e307a..4cfdad4 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -1079,12 +1079,11 @@ int ip6_append_data(struct sock *sk, int getfrag(void *from, char *to, fragheaderlen = sizeof(struct ipv6hdr) + rt->u.dst.nfheader_len + (opt ? 
opt->opt_nflen : 0); maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen - sizeof(struct frag_hdr); - if ((mtu <= sizeof(struct ipv6hdr) + IPV6_MAXPLEN && -inet->cork.length + length > sizeof(struct ipv6hdr) + IPV6_MAXPLEN - fragheaderlen) || - (np->pmtudisc >= IPV6_PMTUDISC_DO && -inet->cork.length + length > mtu)) { - ipv6_local_error(sk, EMSGSIZE, fl, mtu-exthdrlen); - return -EMSGSIZE; + if (mtu <= sizeof(struct ipv6hdr) + IPV6_MAXPLEN) { + if (inet->cork.length + length > sizeof(struct ipv6hdr) + IPV6_MAXPLEN - fragheaderlen) { + ipv6_local_error(sk, EMSGSIZE, fl, mtu-exthdrlen); + return -EMSGSIZE; + } } /* diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c index f4cd90b..f65fcd7 100644 --- a/net/ipv6/raw.c +++ b/net/ipv6/raw.c @@ -558,12 +558,9 @@ static int rawv6_send_hdrinc(struct sock *sk, void *from, int length, struct sk_buff *skb; unsigned int hh_len; int err; - int mtu; - mtu = np->pmtudisc == IPV6_PMTUDISC_DO ? dst_mtu(&rt->u.dst) : -rt->u.dst.dev->mtu; - if (length > mtu) { - ipv6_local_error(sk, EMSGSIZE, fl, mtu); + if (length > rt->u.dst.dev->mtu) { + ipv6_local_error(sk, EMSGSIZE, fl, rt->u.dst.dev->mtu); return -EMSGSIZE; } if (flags&MSG_PROBE) -- 1.5.1.rc3.30.ga8f4-dirty - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] [NET] MTU discovery check in ip6_fragment()
Adds a check in ip6_fragment() mirroring ip_fragment() for packets that we can't fragment, and sends an ICMP Packet Too Big message in response. Signed-off-by: John Heffner <[EMAIL PROTECTED]> --- net/ipv6/ip6_output.c | 13 + 1 files changed, 13 insertions(+), 0 deletions(-) diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index 4cfdad4..5a5b7d4 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -567,6 +567,19 @@ static int ip6_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *)) nexthdr = *prevhdr; mtu = dst_mtu(&rt->u.dst); + + /* We must not fragment if the socket is set to force MTU discovery +* or if the skb it not generated by a local socket. (This last +* check should be redundant, but it's free.) +*/ + if (!np || np->pmtudisc >= IPV6_PMTUDISC_DO) { + skb->dev = skb->dst->dev; + icmpv6_send(skb, ICMPV6_PKT_TOOBIG, 0, mtu, skb->dev); + IP6_INC_STATS(ip6_dst_idev(skb->dst), IPSTATS_MIB_FRAGFAILS); + kfree_skb(skb); + return -EMSGSIZE; + } + if (np && np->frag_size < mtu) { if (np->frag_size) mtu = np->frag_size; -- 1.5.1.rc3.30.ga8f4-dirty - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Revert "[NET] Add IP(V6)_PMTUDISC_PROBE"
This reverts commit d21d2a90b879c0cf159df5944847e6d9833816eb. Must be backed out because commit 87e927a0583bd4a8ba9e97cd75b58d8aa1c76e37 does not work. Signed-off-by: John Heffner <[EMAIL PROTECTED]> --- include/linux/in.h |1 - include/linux/in6.h |1 - include/linux/skbuff.h |3 +-- include/net/ip.h |2 +- net/core/skbuff.c|2 -- net/ipv4/ip_output.c | 14 -- net/ipv4/ip_sockglue.c |2 +- net/ipv4/raw.c |3 --- net/ipv6/ip6_output.c| 12 net/ipv6/ipv6_sockglue.c |2 +- net/ipv6/raw.c |3 --- 11 files changed, 12 insertions(+), 33 deletions(-) diff --git a/include/linux/in.h b/include/linux/in.h index 2dc1f8a..1912e7c 100644 --- a/include/linux/in.h +++ b/include/linux/in.h @@ -83,7 +83,6 @@ struct in_addr { #define IP_PMTUDISC_DONT 0 /* Never send DF frames */ #define IP_PMTUDISC_WANT 1 /* Use per route hints */ #define IP_PMTUDISC_DO 2 /* Always DF*/ -#define IP_PMTUDISC_PROBE 3 /* Ignore dst pmtu */ #define IP_MULTICAST_IF32 #define IP_MULTICAST_TTL 33 diff --git a/include/linux/in6.h b/include/linux/in6.h index d559fac..4e8350a 100644 --- a/include/linux/in6.h +++ b/include/linux/in6.h @@ -179,7 +179,6 @@ struct in6_flowlabel_req #define IPV6_PMTUDISC_DONT 0 #define IPV6_PMTUDISC_WANT 1 #define IPV6_PMTUDISC_DO 2 -#define IPV6_PMTUDISC_PROBE3 /* Flowlabel */ #define IPV6_FLOWLABEL_MGR 32 diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 8bf9b9f..7f17cfc 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -277,8 +277,7 @@ struct sk_buff { nfctinfo:3; __u8pkt_type:3, fclone:2, - ipvs_property:1, - ign_dst_mtu:1; + ipvs_property:1; __be16 protocol; void(*destructor)(struct sk_buff *skb); diff --git a/include/net/ip.h b/include/net/ip.h index 6a08b65..75f226d 100644 --- a/include/net/ip.h +++ b/include/net/ip.h @@ -206,7 +206,7 @@ int ip_decrease_ttl(struct iphdr *iph) static inline int ip_dont_fragment(struct sock *sk, struct dst_entry *dst) { - return (inet_sk(sk)->pmtudisc >= IP_PMTUDISC_DO || + return (inet_sk(sk)->pmtudisc == 
IP_PMTUDISC_DO || (inet_sk(sk)->pmtudisc == IP_PMTUDISC_WANT && !(dst_metric(dst, RTAX_LOCK)&(1<destructor = NULL; C(mark); @@ -543,7 +542,6 @@ static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old) #if defined(CONFIG_IP_VS) || defined(CONFIG_IP_VS_MODULE) new->ipvs_property = old->ipvs_property; #endif - new->ign_dst_mtu= old->ign_dst_mtu; #ifdef CONFIG_NET_SCHED #ifdef CONFIG_NET_CLS_ACT new->tc_verd = old->tc_verd; diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index 704bc44..79e71ee 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -198,8 +198,7 @@ static inline int ip_finish_output(struct sk_buff *skb) return dst_output(skb); } #endif - if (skb->len > dst_mtu(skb->dst) && - !skb->ign_dst_mtu && !skb_is_gso(skb)) + if (skb->len > dst_mtu(skb->dst) && !skb_is_gso(skb)) return ip_fragment(skb, ip_finish_output2); else return ip_finish_output2(skb); @@ -788,9 +787,7 @@ int ip_append_data(struct sock *sk, inet->cork.addr = ipc->addr; } dst_hold(&rt->u.dst); - inet->cork.fragsize = mtu = inet->pmtudisc == IP_PMTUDISC_PROBE ? - rt->u.dst.dev->mtu : - dst_mtu(rt->u.dst.path); + inet->cork.fragsize = mtu = dst_mtu(rt->u.dst.path); inet->cork.rt = rt; inet->cork.length = 0; sk->sk_sndmsg_page = NULL; @@ -1208,16 +1205,13 @@ int ip_push_pending_frames(struct sock *sk) * to fragment the frame generated here. No matter, what transforms * how transforms change size of the packet, it will come out. */ - if (inet->pmtudisc < IP_PMTUDISC_DO) + if (inet->pmtudisc != IP_PMTUDISC_DO) skb->local_df = 1; - if (inet->pmtudisc == IP_PMTUDISC_PROBE) - skb->ign_dst_mtu = 1; - /* DF bit is set when we want to see DF on outgoing frames. * If local_df is set too, we still allow to fragment this frame
[PATCH] [NET] Add IP(V6)_PMTUDISC_PROBE
Add IP(V6)_PMTUDISC_PROBE value for IP(V6)_MTU_DISCOVER. This option forces us not to fragment, but does not make use of the kernel path MTU discovery. That is, it allows for user-mode MTU probing (or, packetization-layer path MTU discovery). This is particularly useful for diagnostic utilities, like traceroute/tracepath. Signed-off-by: John Heffner <[EMAIL PROTECTED]> --- include/linux/in.h |1 + include/linux/in6.h |1 + net/ipv4/ip_output.c | 20 +++- net/ipv4/ip_sockglue.c |2 +- net/ipv6/ip6_output.c| 15 --- net/ipv6/ipv6_sockglue.c |2 +- 6 files changed, 31 insertions(+), 10 deletions(-) diff --git a/include/linux/in.h b/include/linux/in.h index 1912e7c..3975cbf 100644 --- a/include/linux/in.h +++ b/include/linux/in.h @@ -83,6 +83,7 @@ struct in_addr { #define IP_PMTUDISC_DONT 0 /* Never send DF frames */ #define IP_PMTUDISC_WANT 1 /* Use per route hints */ #define IP_PMTUDISC_DO 2 /* Always DF*/ +#define IP_PMTUDISC_PROBE 3 /* Ignore dst pmtu */ #define IP_MULTICAST_IF32 #define IP_MULTICAST_TTL 33 diff --git a/include/linux/in6.h b/include/linux/in6.h index 4e8350a..d559fac 100644 --- a/include/linux/in6.h +++ b/include/linux/in6.h @@ -179,6 +179,7 @@ struct in6_flowlabel_req #define IPV6_PMTUDISC_DONT 0 #define IPV6_PMTUDISC_WANT 1 #define IPV6_PMTUDISC_DO 2 +#define IPV6_PMTUDISC_PROBE3 /* Flowlabel */ #define IPV6_FLOWLABEL_MGR 32 diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index 34606ef..66e2c3a 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -189,6 +189,14 @@ static inline int ip_finish_output2(struct sk_buff *skb) return -EINVAL; } +static inline int ip_skb_dst_mtu(struct sk_buff *skb) +{ + struct inet_sock *inet = skb->sk ? inet_sk(skb->sk) : NULL; + + return (inet && inet->pmtudisc == IP_PMTUDISC_PROBE) ? 
+ skb->dst->dev->mtu : dst_mtu(skb->dst); +} + static inline int ip_finish_output(struct sk_buff *skb) { #if defined(CONFIG_NETFILTER) && defined(CONFIG_XFRM) @@ -198,7 +206,7 @@ static inline int ip_finish_output(struct sk_buff *skb) return dst_output(skb); } #endif - if (skb->len > dst_mtu(skb->dst) && !skb_is_gso(skb)) + if (skb->len > ip_skb_dst_mtu(skb) && !skb_is_gso(skb)) return ip_fragment(skb, ip_finish_output2); else return ip_finish_output2(skb); @@ -422,7 +430,7 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff*)) if (unlikely((iph->frag_off & htons(IP_DF)) && !skb->local_df)) { IP_INC_STATS(IPSTATS_MIB_FRAGFAILS); icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, - htonl(dst_mtu(&rt->u.dst))); + htonl(ip_skb_dst_mtu(skb))); kfree_skb(skb); return -EMSGSIZE; } @@ -787,7 +795,9 @@ int ip_append_data(struct sock *sk, inet->cork.addr = ipc->addr; } dst_hold(&rt->u.dst); - inet->cork.fragsize = mtu = dst_mtu(rt->u.dst.path); + inet->cork.fragsize = mtu = inet->pmtudisc == IP_PMTUDISC_PROBE ? + rt->u.dst.dev->mtu : + dst_mtu(rt->u.dst.path); inet->cork.rt = rt; inet->cork.length = 0; sk->sk_sndmsg_page = NULL; @@ -1203,13 +1213,13 @@ int ip_push_pending_frames(struct sock *sk) * to fragment the frame generated here. No matter, what transforms * how transforms change size of the packet, it will come out. */ - if (inet->pmtudisc != IP_PMTUDISC_DO) + if (inet->pmtudisc < IP_PMTUDISC_DO) skb->local_df = 1; /* DF bit is set when we want to see DF on outgoing frames. * If local_df is set too, we still allow to fragment this frame * locally. 
*/ - if (inet->pmtudisc == IP_PMTUDISC_DO || + if (inet->pmtudisc >= IP_PMTUDISC_DO || (skb->len <= dst_mtu(&rt->u.dst) && ip_dont_fragment(sk, &rt->u.dst))) df = htons(IP_DF); diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c index c199d23..4d54457 100644 --- a/net/ipv4/ip_sockglue.c +++ b/net/ipv4/ip_sockglue.c @@ -542,7 +542,7 @@ static int do_ip_setsockopt(struct sock *sk, int level, inet->hdrincl = val ? 1 : 0; break; case IP_MTU_DISCOVER: - if (val<0 || val>2) +
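In plain terms, the PMTUDISC_PROBE patch above makes the fragmentation check consult the outgoing device's MTU instead of the route's cached path MTU when the socket is in probe mode, so an oversized probe packet goes out whole (with DF set) rather than being fragmented by the kernel. A small sketch of that selection logic, mirroring the ip_skb_dst_mtu() helper in the patch (names and structure here are illustrative, not the kernel's C):

```python
# pmtudisc modes, matching the values defined in the patch
IP_PMTUDISC_DONT, IP_PMTUDISC_WANT, IP_PMTUDISC_DO, IP_PMTUDISC_PROBE = 0, 1, 2, 3

def skb_dst_mtu(pmtudisc, dev_mtu, path_mtu):
    """In PROBE mode, ignore the cached path MTU and use the raw device
    MTU, so user space (e.g. tracepath) can do its own MTU probing."""
    if pmtudisc == IP_PMTUDISC_PROBE:
        return dev_mtu
    return path_mtu   # normal case: the route's (possibly lowered) PMTU

def must_fragment(pkt_len, pmtudisc, dev_mtu, path_mtu):
    return pkt_len > skb_dst_mtu(pmtudisc, dev_mtu, path_mtu)

# A 1500-byte probe on a path cached at 1400: PROBE mode lets it out whole.
print(must_fragment(1500, IP_PMTUDISC_PROBE, dev_mtu=1500, path_mtu=1400))  # False
print(must_fragment(1500, IP_PMTUDISC_DO,    dev_mtu=1500, path_mtu=1400))  # True
```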
[PATCH 2/4] Revert "[NET] Do pmtu check in transport layer"
This reverts commit 87e927a0583bd4a8ba9e97cd75b58d8aa1c76e37. This idea does not work, as pointed at by Patrick McHardy. Signed-off-by: John Heffner <[EMAIL PROTECTED]> --- net/ipv4/ip_output.c |4 +--- net/ipv4/raw.c|8 +++- net/ipv6/ip6_output.c | 11 +-- net/ipv6/raw.c|7 ++- 4 files changed, 11 insertions(+), 19 deletions(-) diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index 79e71ee..34606ef 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -810,9 +810,7 @@ int ip_append_data(struct sock *sk, fragheaderlen = sizeof(struct iphdr) + (opt ? opt->optlen : 0); maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen; - if (inet->cork.length + length > 0x - fragheaderlen || - (inet->pmtudisc >= IP_PMTUDISC_DO && -inet->cork.length + length > mtu)) { + if (inet->cork.length + length > 0x - fragheaderlen) { ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->dport, mtu-exthdrlen); return -EMSGSIZE; } diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c index c60aadf..24d7c9f 100644 --- a/net/ipv4/raw.c +++ b/net/ipv4/raw.c @@ -271,12 +271,10 @@ static int raw_send_hdrinc(struct sock *sk, void *from, size_t length, struct iphdr *iph; struct sk_buff *skb; int err; - int mtu; - mtu = inet->pmtudisc == IP_PMTUDISC_DO ? dst_mtu(&rt->u.dst) : -rt->u.dst.dev->mtu; - if (length > mtu) { - ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->dport, mtu); + if (length > rt->u.dst.dev->mtu) { + ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->dport, + rt->u.dst.dev->mtu); return -EMSGSIZE; } if (flags&MSG_PROBE) diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index b8e307a..4cfdad4 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -1079,12 +1079,11 @@ int ip6_append_data(struct sock *sk, int getfrag(void *from, char *to, fragheaderlen = sizeof(struct ipv6hdr) + rt->u.dst.nfheader_len + (opt ? 
opt->opt_nflen : 0); maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen - sizeof(struct frag_hdr); - if ((mtu <= sizeof(struct ipv6hdr) + IPV6_MAXPLEN && -inet->cork.length + length > sizeof(struct ipv6hdr) + IPV6_MAXPLEN - fragheaderlen) || - (np->pmtudisc >= IPV6_PMTUDISC_DO && -inet->cork.length + length > mtu)) { - ipv6_local_error(sk, EMSGSIZE, fl, mtu-exthdrlen); - return -EMSGSIZE; + if (mtu <= sizeof(struct ipv6hdr) + IPV6_MAXPLEN) { + if (inet->cork.length + length > sizeof(struct ipv6hdr) + IPV6_MAXPLEN - fragheaderlen) { + ipv6_local_error(sk, EMSGSIZE, fl, mtu-exthdrlen); + return -EMSGSIZE; + } } /* diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c index f4cd90b..f65fcd7 100644 --- a/net/ipv6/raw.c +++ b/net/ipv6/raw.c @@ -558,12 +558,9 @@ static int rawv6_send_hdrinc(struct sock *sk, void *from, int length, struct sk_buff *skb; unsigned int hh_len; int err; - int mtu; - mtu = np->pmtudisc == IPV6_PMTUDISC_DO ? dst_mtu(&rt->u.dst) : -rt->u.dst.dev->mtu; - if (length > mtu) { - ipv6_local_error(sk, EMSGSIZE, fl, mtu); + if (length > rt->u.dst.dev->mtu) { + ipv6_local_error(sk, EMSGSIZE, fl, rt->u.dst.dev->mtu); return -EMSGSIZE; } if (flags&MSG_PROBE) -- 1.5.1.rc3.30.ga8f4-dirty - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/4] Revert "[NET] Add IP(V6)_PMTUDISC_PROBE"
This reverts commit d21d2a90b879c0cf159df5944847e6d9833816eb. Must be backed out because commit 87e927a0583bd4a8ba9e97cd75b58d8aa1c76e37 does not work. Signed-off-by: John Heffner <[EMAIL PROTECTED]> --- include/linux/in.h |1 - include/linux/in6.h |1 - include/linux/skbuff.h |3 +-- include/net/ip.h |2 +- net/core/skbuff.c|2 -- net/ipv4/ip_output.c | 14 -- net/ipv4/ip_sockglue.c |2 +- net/ipv4/raw.c |3 --- net/ipv6/ip6_output.c| 12 net/ipv6/ipv6_sockglue.c |2 +- net/ipv6/raw.c |3 --- 11 files changed, 12 insertions(+), 33 deletions(-) diff --git a/include/linux/in.h b/include/linux/in.h index 2dc1f8a..1912e7c 100644 --- a/include/linux/in.h +++ b/include/linux/in.h @@ -83,7 +83,6 @@ struct in_addr { #define IP_PMTUDISC_DONT 0 /* Never send DF frames */ #define IP_PMTUDISC_WANT 1 /* Use per route hints */ #define IP_PMTUDISC_DO 2 /* Always DF*/ -#define IP_PMTUDISC_PROBE 3 /* Ignore dst pmtu */ #define IP_MULTICAST_IF32 #define IP_MULTICAST_TTL 33 diff --git a/include/linux/in6.h b/include/linux/in6.h index d559fac..4e8350a 100644 --- a/include/linux/in6.h +++ b/include/linux/in6.h @@ -179,7 +179,6 @@ struct in6_flowlabel_req #define IPV6_PMTUDISC_DONT 0 #define IPV6_PMTUDISC_WANT 1 #define IPV6_PMTUDISC_DO 2 -#define IPV6_PMTUDISC_PROBE3 /* Flowlabel */ #define IPV6_FLOWLABEL_MGR 32 diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 8bf9b9f..7f17cfc 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -277,8 +277,7 @@ struct sk_buff { nfctinfo:3; __u8pkt_type:3, fclone:2, - ipvs_property:1, - ign_dst_mtu:1; + ipvs_property:1; __be16 protocol; void(*destructor)(struct sk_buff *skb); diff --git a/include/net/ip.h b/include/net/ip.h index 6a08b65..75f226d 100644 --- a/include/net/ip.h +++ b/include/net/ip.h @@ -206,7 +206,7 @@ int ip_decrease_ttl(struct iphdr *iph) static inline int ip_dont_fragment(struct sock *sk, struct dst_entry *dst) { - return (inet_sk(sk)->pmtudisc >= IP_PMTUDISC_DO || + return (inet_sk(sk)->pmtudisc == 
IP_PMTUDISC_DO || (inet_sk(sk)->pmtudisc == IP_PMTUDISC_WANT && !(dst_metric(dst, RTAX_LOCK)&(1<destructor = NULL; C(mark); @@ -543,7 +542,6 @@ static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old) #if defined(CONFIG_IP_VS) || defined(CONFIG_IP_VS_MODULE) new->ipvs_property = old->ipvs_property; #endif - new->ign_dst_mtu= old->ign_dst_mtu; #ifdef CONFIG_NET_SCHED #ifdef CONFIG_NET_CLS_ACT new->tc_verd = old->tc_verd; diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index 704bc44..79e71ee 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -198,8 +198,7 @@ static inline int ip_finish_output(struct sk_buff *skb) return dst_output(skb); } #endif - if (skb->len > dst_mtu(skb->dst) && - !skb->ign_dst_mtu && !skb_is_gso(skb)) + if (skb->len > dst_mtu(skb->dst) && !skb_is_gso(skb)) return ip_fragment(skb, ip_finish_output2); else return ip_finish_output2(skb); @@ -788,9 +787,7 @@ int ip_append_data(struct sock *sk, inet->cork.addr = ipc->addr; } dst_hold(&rt->u.dst); - inet->cork.fragsize = mtu = inet->pmtudisc == IP_PMTUDISC_PROBE ? - rt->u.dst.dev->mtu : - dst_mtu(rt->u.dst.path); + inet->cork.fragsize = mtu = dst_mtu(rt->u.dst.path); inet->cork.rt = rt; inet->cork.length = 0; sk->sk_sndmsg_page = NULL; @@ -1208,16 +1205,13 @@ int ip_push_pending_frames(struct sock *sk) * to fragment the frame generated here. No matter, what transforms * how transforms change size of the packet, it will come out. */ - if (inet->pmtudisc < IP_PMTUDISC_DO) + if (inet->pmtudisc != IP_PMTUDISC_DO) skb->local_df = 1; - if (inet->pmtudisc == IP_PMTUDISC_PROBE) - skb->ign_dst_mtu = 1; - /* DF bit is set when we want to see DF on outgoing frames. * If local_df is set too, we still allow to fragment this frame
[PATCH 4/4] [NET] Add IP(V6)_PMTUDISC_PROBE
Add IP(V6)_PMTUDISC_PROBE value for IP(V6)_MTU_DISCOVER. This option forces us not to fragment, but does not make use of the kernel path MTU discovery. That is, it allows for user-mode MTU probing (or, packetization-layer path MTU discovery). This is particularly useful for diagnostic utilities, like traceroute/tracepath. Signed-off-by: John Heffner <[EMAIL PROTECTED]> --- include/linux/in.h |1 + include/linux/in6.h |1 + net/ipv4/ip_output.c | 20 +++- net/ipv4/ip_sockglue.c |2 +- net/ipv6/ip6_output.c| 15 --- net/ipv6/ipv6_sockglue.c |2 +- 6 files changed, 31 insertions(+), 10 deletions(-) diff --git a/include/linux/in.h b/include/linux/in.h index 1912e7c..3975cbf 100644 --- a/include/linux/in.h +++ b/include/linux/in.h @@ -83,6 +83,7 @@ struct in_addr { #define IP_PMTUDISC_DONT 0 /* Never send DF frames */ #define IP_PMTUDISC_WANT 1 /* Use per route hints */ #define IP_PMTUDISC_DO 2 /* Always DF*/ +#define IP_PMTUDISC_PROBE 3 /* Ignore dst pmtu */ #define IP_MULTICAST_IF32 #define IP_MULTICAST_TTL 33 diff --git a/include/linux/in6.h b/include/linux/in6.h index 4e8350a..d559fac 100644 --- a/include/linux/in6.h +++ b/include/linux/in6.h @@ -179,6 +179,7 @@ struct in6_flowlabel_req #define IPV6_PMTUDISC_DONT 0 #define IPV6_PMTUDISC_WANT 1 #define IPV6_PMTUDISC_DO 2 +#define IPV6_PMTUDISC_PROBE3 /* Flowlabel */ #define IPV6_FLOWLABEL_MGR 32 diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index 34606ef..66e2c3a 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -189,6 +189,14 @@ static inline int ip_finish_output2(struct sk_buff *skb) return -EINVAL; } +static inline int ip_skb_dst_mtu(struct sk_buff *skb) +{ + struct inet_sock *inet = skb->sk ? inet_sk(skb->sk) : NULL; + + return (inet && inet->pmtudisc == IP_PMTUDISC_PROBE) ? 
+ skb->dst->dev->mtu : dst_mtu(skb->dst); +} + static inline int ip_finish_output(struct sk_buff *skb) { #if defined(CONFIG_NETFILTER) && defined(CONFIG_XFRM) @@ -198,7 +206,7 @@ static inline int ip_finish_output(struct sk_buff *skb) return dst_output(skb); } #endif - if (skb->len > dst_mtu(skb->dst) && !skb_is_gso(skb)) + if (skb->len > ip_skb_dst_mtu(skb) && !skb_is_gso(skb)) return ip_fragment(skb, ip_finish_output2); else return ip_finish_output2(skb); @@ -422,7 +430,7 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff*)) if (unlikely((iph->frag_off & htons(IP_DF)) && !skb->local_df)) { IP_INC_STATS(IPSTATS_MIB_FRAGFAILS); icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, - htonl(dst_mtu(&rt->u.dst))); + htonl(ip_skb_dst_mtu(skb))); kfree_skb(skb); return -EMSGSIZE; } @@ -787,7 +795,9 @@ int ip_append_data(struct sock *sk, inet->cork.addr = ipc->addr; } dst_hold(&rt->u.dst); - inet->cork.fragsize = mtu = dst_mtu(rt->u.dst.path); + inet->cork.fragsize = mtu = inet->pmtudisc == IP_PMTUDISC_PROBE ? + rt->u.dst.dev->mtu : + dst_mtu(rt->u.dst.path); inet->cork.rt = rt; inet->cork.length = 0; sk->sk_sndmsg_page = NULL; @@ -1203,13 +1213,13 @@ int ip_push_pending_frames(struct sock *sk) * to fragment the frame generated here. No matter, what transforms * how transforms change size of the packet, it will come out. */ - if (inet->pmtudisc != IP_PMTUDISC_DO) + if (inet->pmtudisc < IP_PMTUDISC_DO) skb->local_df = 1; /* DF bit is set when we want to see DF on outgoing frames. * If local_df is set too, we still allow to fragment this frame * locally. 
*/ - if (inet->pmtudisc == IP_PMTUDISC_DO || + if (inet->pmtudisc >= IP_PMTUDISC_DO || (skb->len <= dst_mtu(&rt->u.dst) && ip_dont_fragment(sk, &rt->u.dst))) df = htons(IP_DF); diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c index c199d23..4d54457 100644 --- a/net/ipv4/ip_sockglue.c +++ b/net/ipv4/ip_sockglue.c @@ -542,7 +542,7 @@ static int do_ip_setsockopt(struct sock *sk, int level, inet->hdrincl = val ? 1 : 0; break; case IP_MTU_DISCOVER: - if (val<0 || val>2) +
[PATCH 3/4] [NET] MTU discovery check in ip6_fragment()
Adds a check in ip6_fragment() mirroring ip_fragment() for packets that we can't fragment, and sends an ICMP Packet Too Big message in response. Signed-off-by: John Heffner <[EMAIL PROTECTED]> --- net/ipv6/ip6_output.c | 13 + 1 files changed, 13 insertions(+), 0 deletions(-) diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index 4cfdad4..5a5b7d4 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -567,6 +567,19 @@ static int ip6_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *)) nexthdr = *prevhdr; mtu = dst_mtu(&rt->u.dst); + + /* We must not fragment if the socket is set to force MTU discovery +* or if the skb it not generated by a local socket. (This last +* check should be redundant, but it's free.) +*/ + if (!np || np->pmtudisc >= IPV6_PMTUDISC_DO) { + skb->dev = skb->dst->dev; + icmpv6_send(skb, ICMPV6_PKT_TOOBIG, 0, mtu, skb->dev); + IP6_INC_STATS(ip6_dst_idev(skb->dst), IPSTATS_MIB_FRAGFAILS); + kfree_skb(skb); + return -EMSGSIZE; + } + if (np && np->frag_size < mtu) { if (np->frag_size) mtu = np->frag_size; -- 1.5.1.rc3.30.ga8f4-dirty - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html