Hi Iljitsch,

You wrote:
> Ok, so this is not the same subnet, right? Note that if you feed
> tcpdump a few -v's you don't have to do as much header decoding in
> your head.

Please take a look at my page, which documents the situation clearly,
complete with the commands I give tcpdump:

  http://www.firstpr.com.au/ip/ivip/ipv4-bits/actual-packets.html#jumbo1

> What it looks like to me is that you're actually tcpdumping rather
> than ipdumping: what you see is an initial two segment
> transmission but as a single packet. Could it be that you're
> tcpdumping some virtual interface rather than real packets on the
> wire?

A complete packet dump, with lengths easily visible in a different
colour, is:

  http://www.firstpr.com.au/ip/ivip/ipv4-bits/jumbo-mss.ht << ml

(Type in the "ml" at the end manually; I don't want links to this big
file.)

This is the full hex dump of the packets: the TCP handshake, the HTTP
request, the data packets and the ACKs. They all have full TCP
headers, checksums etc.

If you search for 7060 you will see the longest packet, carrying 5
times the MSS amount of data. If you look at its Identification field
(3rd block of hex) it is 3d47. In the next outgoing packet, the
Identification field is 5 more: 3d4c.

> Can you capture ethernet headers?

Yes. Just add 't' to this URL:

  http://www.firstpr.com.au/ip/ivip/ipv4-bits/with-ethernet-headers.tx

I just did this - it is a capture of a similar packet transfer to a
client here at home. This shows an ACK packet and, in response, just
13 microseconds later, a jumboframe going out.

07:15:57.497378 IP 72.36.140.10.80 > 150.101.162.123.3941: . 18357:25417(7060) ack 634 win 6963
        0x0000:  0012 807c 117f 0015 609b 0c04 0800 4500
        0x0010:  1bbc 244c 4000 4006 ede0 4824 8c0a 9665
        ...
        0x1ba0:  af6b 6664 0dcc 168e a084 dc28 a81b efb6
        0x1bb0:  26ac 8aeb 4149 413c dba6 f5e8 0e1a 5d2d
        0x1bc0:  bc10 7fa0 5d33 0310 695c

The IPv4 packet is:

       Length = 7100 bytes
       ||||
  4500 1bbc 244c 4000 4006 ede0 4824 8c0a
                 |
                 DF=1

The TCP segment size is 7060 - exactly 5 times the lower of the two
MSS values, 1412.

This section:

  http://www.firstpr.com.au/ip/ivip/ipv4-bits/actual-packets.html#2008-08-12

shows the timing of outgoing packets with respect to the ACKs which
seem to prompt them. Two of the packets are 8512 bytes. This is a TCP
segment of 6 x 1412 = 8472.

> Maybe some work is being offloaded to the NIC?

That wouldn't fit with the complete IP header for the whole
jumboframe, or the ethernet packet dump. I am sure this is a true
record of the physical packet leaving the machine.

> If not, I'd say that all of this is a bug in the linux networking
> code (which is weird to begin with)

I can't imagine it is a bug. It is conceivable it is httpd doing this,
but the short turnaround time between each ACK arriving and the large
packet being sent out makes me think there is something in the kernel
which is bunching together the contents of TCP packets created by
httpd, and then having a go at firing them out to the Net with DF=1 -
presumably being able to fire the same stuff out in smaller, or even
normal-sized, TCP packets if it gets a PTB.

> but I have no explanation about why you would be seeing normal
> size packets without fragmentation. I'm pretty sure ISPs wouldn't
> want to expend CPU cycles to do this on behalf of their hosted
> customers...

I have no explanation for either of these things - the server bundling
together TCP data in flagrant violation of the RFCs as I understand
them, and (as best I can guess) the PPPoE router taking it upon itself
to recreate the individual RFC-conformant TCP packets.

> (BTW, I thought having a server 1600 km away was impressive...)

It's 14,500 km or so (9,000 miles) from Melbourne to Dallas Fort
Worth.
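The header fields above can be checked mechanically. This is a minimal sketch (the function name `parse_ipv4_header` is my own, not from any tool mentioned in the thread) that pulls the Total Length, Identification and DF bit out of the hex words of the 7100-byte packet:

```python
def parse_ipv4_header(hex_words):
    """Decode a few fields from the first 16 bytes of an IPv4 header,
    given as a list of 4-hex-digit words as printed by tcpdump -x."""
    raw = bytes.fromhex("".join(hex_words))
    version = raw[0] >> 4                        # 0x45 -> version 4
    ihl = (raw[0] & 0x0F) * 4                    # header length in bytes
    total_length = int.from_bytes(raw[2:4], "big")
    identification = int.from_bytes(raw[4:6], "big")
    flags_frag = int.from_bytes(raw[6:8], "big")
    df = bool(flags_frag & 0x4000)               # Don't Fragment bit
    return version, ihl, total_length, identification, df

# The IPv4 header of the jumboframe from the dump above:
words = ["4500", "1bbc", "244c", "4000", "4006", "ede0", "4824", "8c0a"]
version, ihl, total_length, ident, df = parse_ipv4_header(words)

print(total_length, df)      # 7100, DF set
# 7100 - 20 (IP header) - 20 (TCP header) = 7060 = 5 x 1412, the
# lower of the two MSS values.
```

This confirms the arithmetic in the text: the length field 0x1bbc is 7100, and the flags word 0x4000 has DF=1 set.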
Editing text files on the server involves my keystroke going through a
planet thickness of quartz to the server, and the new character coming
back the same way, all in about 230 ms.

>> I couldn't easily find the thread at:
>
>> http://www.merit.edu/mail.archives/nanog/
>
> Look for "microsoft".

It's not in the subject lines, and there is no search facility.

>>> ??? Why would advertising a large MSS be a problem? You send
>>> what the other advertises he/she can handle and obviously _they_
>>> will be sending you what they can handle.
>>
>> Yes, but what if, for some reason, there is a router in the path
>> with a smaller MTU than is generally seen by the client or by
>> Google?
>
> MSS is end-to-end, you still need PMTUD or fragmentation.

Yes - and with Google sending out large packets with DF=0, it is
expecting any hapless router in the middle, with a lower next-hop MTU
than this length, to do a lot of work without complaint.

>> I think there is no workable alternative to RFC 1191 PMTUD.
>
> What they should have done back then was create a mechanism that
> allows the receiver of fragments to tell the sender that the
> packet was fragmented and what the size of the largest fragment
> was.
>
> This would have been harder to deploy (changes on both ends) but
> more robust.

OK - so packets would not be dropped, just chopped into two or more
fragments, which themselves might be fragmented too. Then you rely on
the destination host to tell the sender about the fragmentation,
rather than the router which fragments.

I think this is a lot of work for the router. Better to drop the
too-big packet and send a PTB. That would be faster and so bring
forward the time when the sending host creates packets which are
suitable for the whole path.

RFC 1191 has the router send the exact MTU, which is better than what
I think you are suggesting, since the destination host wouldn't be
able to tell what the MTU limit was which caused the fragmentation.
That would leave the sending host no option but to use trial and
error - sending more, not quite so long, packets which would have a
high chance of being fragmented too. I think RFC 1191 is a better
approach than what I understand of your suggestion.

>> RFC 4821 is so difficult to implement
>
> Indeed. Still, it probably has to be done at some point,
> especially if we ever want to move away from 1500 as the
> internet's maximum packet size.

We should all stop burning fossil fuel at some point too. I don't see
a problem with the widespread use of 1500 byte MTU gear being
incompatible with RFC 1191.

It would be nice if the sending host had a better clue about the
outside world than the simple fact that its Ethernet link has an MTU
of 9k or so. This is pretty dumb, and it might involve each session
sending a 9k packet towards the destination host, where some poor
1500 MTU next-hop router goes "Not again . . . " and sends back a PTB
for the millionth time.

As long as most of the PTBs get back to the RFC 1191 compliant
sending host, I think it will work fine. However, designing a
map-encap system which does not completely disrupt this in terms of
MTU limits between the ITR and ETR is very challenging.

>>> The first mistake was to invent the DF bit in the first place.
>
>> I guess you mean that all packets should always have been
>> non-fragmentable and that something like RFC 1191 should always
>> have been in existence.
>
> No: if you have fragmentation anyway, there is no reason to have a
> source say it can't be done. It would arguably be useful for the
> destination to say that, but this isn't what DF does so before RFC
> 1191 came along it was useless.

I can't understand why the PTB message was first defined without also
including the MTU value which the packet's length exceeded.
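The reason the exact MTU in the PTB matters can be sketched in a few lines. This is a toy model of the RFC 1191 sender side, not any real stack's code, and the function names are my own: on receiving a PTB carrying the constricting next-hop MTU, the sender adopts it in one step and re-segments accordingly, with no trial and error.

```python
def on_ptb(pmtu_estimate, reported_mtu):
    """RFC 1191 sender reaction: the PTB message carries the next-hop
    MTU of the router that dropped the packet, so the sender can
    lower its path-MTU estimate to that value directly."""
    return min(pmtu_estimate, reported_mtu)

def packetize(data_len, pmtu, ip_tcp_overhead=40):
    """Split data into TCP segment sizes that fit the current
    path-MTU estimate (20 bytes IP + 20 bytes TCP assumed)."""
    mss = pmtu - ip_tcp_overhead
    full, rem = divmod(data_len, mss)
    return [mss] * full + ([rem] if rem else [])

# A sender starts from its local 9k-class MTU and learns from one PTB:
pmtu = 9000
pmtu = on_ptb(pmtu, 1500)        # a router with a 1500-byte next hop replies
print(packetize(7060, pmtu))     # the 7060-byte burst re-sent as 1460-byte MSS segments
```

Under the largest-fragment-report alternative discussed above, `on_ptb` would have no exact MTU to adopt, which is the trial-and-error problem the text describes.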
It seems like such a no-brainer - and the RFC states that the MTU
Discovery Working Group spent months reinventing what "was first
suggested by Geof Cooper, who in two short paragraphs set out all the
basic ideas".

>>> The second mistake is to suggest that the DF bit be set for ALL
>>> packets to do PMTUD in RFC 1191.
>>
>> I don't understand your objection.
>
> Set it only for 10% of your packets and you still have
> connectivity when there is a black hole and the PMTUD works just
> fine.

OK - so routers would fragment 90% of the packets and the PTB only
goes back when one of the 10% of packets has its DF flag set? That
just seems to slow down the sending host's response to the MTU
situation.

I still like RFC 1191 better. There's no fragmentation and the
sending host gets the fastest possible feedback that it needs to send
smaller packets. That fragmented situation is less reliable than when
the packets sail straight through, so I think it is a good thing to
get the sending host properly adapted to the PMTU ASAP.

>> Removing fragmentation from the network is a really good aspect
>> of IPv6, I think. Ideally, I think, all packets should be sent
>> DF=1 and all applications should be ready to cope
>
> No. This is a layer 3 job, not a layer 7 job.

I don't understand this.

> An interesting approach would be to simply truncate packets that
> are too big rather than fragment or drop them. A difference
> between the IP length field and the actual length of the packet
> indicates truncation.
>
> Transports would have to be changed to semi-ACK truncated data so
> the sender only retransmits a checksum over the semi-ACKed data
> after which a full ACK/NAK is possible.

I don't clearly understand this either - but it sounds messy.

> The semi-ACK also implicitly signals the maximum path MTU.
Yes, this would have the advantage that if there were a series of
narrower bottlenecks - 1400, 1300, 1170 - then in a single round-trip
time the sending host would know the full PMTU to the destination
host. With RFC 1191, this would take three round-trip times.

Unfortunately, the surviving packet fragment isn't much use to the
destination host, so it still takes 1.5 RTTs to get the data there.
Still, that is better than 3.5 RTTs with RFC 1191.

>> The only reasonable solution seems to be send all packets DF=0
>> and expect all routers to report PMTU troubles with a PTB
>> message.
>
> You mean DF=1?

Oops - yes.

> DF=0 is not a solution for IPv6...
>
>> Networks which block PTB packets are doing themselves and anyone
>> who connects to them a grave disservice.
>
> Yes, but they've been getting away with it so far because
> "everyone" supports a 1500-byte MTU. So now breaking _that_
> assumption creates problems.

My understanding of this is that if all hosts have a next-hop MTU of
1500, and the core has an MTU of 9000, then it is no problem if the
destination network blocks PTBs from leaving that network, since no
host would be sending packets bigger than 1500 anyway.

But plenty of servers - probably most by now - have gigabit ethernet
and so have a real PMTU for most of the core, and into quite a few
edge networks, of 9k or so. When they send a packet to some edge
network with 1500 MTU links, which blocks the PTBs which should go
back to the sending host, then there is a black hole.

It gets messier with various hosts having various next-hop and nearby
MTU limits in the ~1460 to 1500 range, various host settings and
various ICMP-blocking destination networks with their own ~1460 to
1500 MTU limits.

I guess the majority of websites now can send jumboframes, like my
server can. But my server offers an MSS of 1460 for a packet size of
1500. Then, if an ICMP-filtering edge network has a 1500 MTU, there
is no problem. I am not sure where the MSS is configured.
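The round-trip count in the 1400/1300/1170 bottleneck example can be checked with a toy simulation. This is purely illustrative (the function name and model are my own, not any real implementation): each undersized link drops the too-big DF=1 packet and returns a PTB with its next-hop MTU, costing one round trip per bottleneck before a packet finally fits.

```python
def pmtud_round_trips(link_mtus, initial_mtu):
    """Count send attempts (one per round trip, including the final
    successful one) for RFC 1191 PMTUD across a path whose links
    have the given MTUs, in order."""
    pmtu = initial_mtu
    attempts = 0
    while True:
        attempts += 1
        # The first link whose MTU is below our packet size drops the
        # packet and sends back a PTB carrying its next-hop MTU.
        blocker = next((m for m in link_mtus if m < pmtu), None)
        if blocker is None:
            return attempts          # the packet got through
        pmtu = blocker               # adopt the MTU from the PTB

# Three successively narrower bottlenecks, as in the example above:
print(pmtud_round_trips([1400, 1300, 1170], 1500))   # 3 PTBs, then success
```

With the bottlenecks in narrowing order, three attempts fail (one PTB each) before the fourth succeeds, matching the "three round-trip times" in the text; a truncation/semi-ACK scheme would learn the whole path in one.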
But for reasons unknown, my server is trying its luck with DF=1
jumboframes way longer than 1500 or the MSS from the client. I would
think that this strategy would come unstuck with an ICMP-filtering
edge network with a 1500 MTU - unless this "TCP bundling" facility
also disabled itself in the absence of ACKs.

If this mysterious system did come unstuck due to networks blocking
PTBs, then I would get complaints that people couldn't access some
things (larger files) on my website - but there are no such
complaints.

>>> I'm not sure if implicitly making IPv6 packets unfragmentable
>>> was a mistake, but relying on ICMP messages was.
>
>> Do you suggest some other kind of message, or do you think PMTUD
>> should be done on the basis of positive acknowledgements alone,
>> with silent discarding of a too-big packet at whichever router
>> can't handle it?
>
> With IPv6, it would have been possible to come up with a
> truncation approach or maybe something where routers write a
> maximum packet size in certain packets.
>
> But now the only way forward is RFC 4821 etc. while working hard to
> fix PMTUD black holes until 4821 is widely implemented.
>
>> Google: No results found for "RFC 4821 deployment".
>
> Yeah, none for "RFC 791 deployment" either...

Touché - but where is the evidence of applications and operating
systems actually implementing RFC 4821? Is there any site, any
working group or whatever where this is discussed?

 - Robin

--
to unsubscribe send a message to [EMAIL PROTECTED] with the word
'unsubscribe' in a single line as the message text body.
archive: <http://psg.com/lists/rrg/> & ftp://psg.com/pub/lists/rrg
