>>> 1. do not impede 1500-byte operation
>>> 2. discover and utilize jumboframe capability where possible
>>> 3. discover and utilize (close to) the maximum MTU
>>> 4. recover from sudden MTU reductions fast enough for TCP and
>>>    similar to survive

>> 5. Must be fully automatic and not require any admin intervention to
>>    do the "right" thing.
>> 6. Minimise the resources used.

> Agree, except that packets are cheap on a 1000 Mbps LAN, so those
> don't count much towards 6.
Packet rate, however, starts becoming a problem at faster speeds; at
gigabit rates it becomes hard for hosts to keep up unless they are
careful. And not all networks are fast: 3G networks are becoming more
prevalent. We should not waste resources needlessly :)

>>> However, this doesn't accommodate finding out jumboframe support at
>>> reduced sizes very well. For this, I think we should use an
>>> additional exchange, but this one should probably happen over
>>> multicast.

>> I disagree. There is no need for every host to have a full
>> understanding of the layer 2 topology of the network it is on.

> That's not what the mechanism that I outlined does. What it does do
> is let all jumbo-capable systems send at least one packet at their
> maximum size and get feedback on whether it was received by anyone.
> For that, every packet may update information that the system (host
> or router) holds, but then the old information is gone, so there is
> no per-(potential-)correspondent state.

Ok, I sat down and reread it a few more times, and I think I have a
better idea of what you're trying to get at now.

>> We're starting to see some very large L2 networks as MANs (eg
>> NLR[1]) and IPv6's /64 per subnet puts no real practical limit on
>> how large a single L2 segment can be.

> Hm, they invented routers for a reason.

You know, that was exactly my thought when I heard about it too. :)

> I don't think we have to bend over backwards to accommodate
> unreasonably large layer 2 networks. But: what would be reasonable to
> accommodate? 1000 systems?

I think 1000 systems on one segment is quite feasible. It's also
possible that as the scarcity of addresses is relieved, people will
move back towards IP-based virtual hosting: how many SSL sites would a
single ISP want to host if it could do so easily? And if you have a big
outdoor concert with some APs covering it, you could have thousands of
people all ending up sharing one L2.
> I'm not sure this is a problem: as far as I know (and that's not too
> far) switches use per-port buffer space, so although such a packet
> uses up buffer space for a lot of ports, there is nothing
> unreasonable about that (the packet is gone again in 72 microseconds
> anyway). But some vendor feedback would be good here.

Fair enough.

>> What happens on L2s where not every node can see every other node?

> Neighbor discovery fails?

Host A can talk to Host B ok. Host A can talk to Host C ok. Host B
can't talk to Host C. This happens in ad hoc wireless networks. With
your system I'm not entirely sure how you decide whose "turn" it is
next if not all nodes can see all other nodes. Host A should still be
able to talk to Host B.

>> Some L2s only allow end hosts to talk to a master host. What about a
>> network with VLANs where some hosts are on differing sets of VLANs?

> I don't think those exist in that way.

Ad hoc wireless was a much better example ;)

>>>   A                C
>>>   |                |
>>> +-+--+  +----+  +--+-+
>>> |9000+--+3000+--+8000|
>>> +-+--+  +----+  +--+-+
>>>   |                |
>>>   B                D

>> If A and B are talking to each other and C and D are talking to each
>> other, why do (A and B) need to talk to C and D?

> Ah, but how do you know that A doesn't talk to D, and is never going
> to?

How do you know it will in the time before the topology of the network
changes? Given that the topology changes every time a host comes or
goes, the chance that you'll want to talk to most of the users during
that time is rather low.

>> Why not a simpler protocol?

>> Host A sends an NS to Host B with Host A's MRU.
>> Host B replies with an NA to Host A with Host B's MRU.
>> Host A can now start transmitting the data it wants to send.
>> Host A now sends Host B an ICMP MTU Probe, at some size less than or
>> equal to Host B's MRU.
>> If Host B receives the packet, it replies with an ICMP MTU Probe
>> reply saying what it received.
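As a sketch, the exchange above could look like this (the function
names and the probe-ceiling rule are my own illustration, and `probe`
stands in for the ICMP MTU Probe/reply round trip):

```python
def exchange_mrus(a_mru: int, b_mru: int) -> int:
    """The NS from A carries A's MRU; the NA from B carries B's MRU.
    A may then probe with any size up to min(A's MRU, B's MRU)."""
    return min(a_mru, b_mru)

def probe(path_mtu: int, size: int) -> bool:
    """Stand-in for an ICMP MTU Probe: True iff the L2 path passed it."""
    return size <= path_mtu

# A (MRU 9000) talks to B (MRU 8000) across a path limited to 4464 bytes:
ceiling = exchange_mrus(9000, 8000)            # probes never exceed 8000
assert probe(4464, 4464) and not probe(4464, ceiling)
```

Note the asymmetry: the MRU exchange only bounds what is worth probing;
the actual usable size still has to be found by probing the path.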
> The problem here is that if the switches in the middle don't support
> the jumboframe sizes A and B do, you can't use jumboframes at all.
> Since there is no standard jumboframe size, this is a problem. Of
> course A and B can discover the maximum size between them, but doing
> that between every pair of correspondents seems suboptimal.
>
> On the other hand, if we reuse the information learned for previous
> correspondents, this could work well.
>
> So:
>
> When ND indicates a neighbor supports jumboframes, we start by
> sending an MTU probe with an MTU that worked towards another
> correspondent fairly recently (last hour or so). If the correspondent
> receives this packet, it sends an ack and both sides can increase the
> MTU.

I'd start at the minimum "MTU" size. A colleague of mine (Matthew
Luckie) has done some research into path MTUs. He has a
work-in-progress paper
( http://www.wand.net.nz/~mjl12/debugging-pmtud.pdf ) where he
enumerates all the common MTUs he's seen on the Internet. I'd start
with a similar table: try the lowest size, and if that probe is
received, try the next lowest size, and so on until you don't get a
reply. When you don't get a reply, try the previous-mtu-that-worked+1;
if that succeeds, start a binary search between the
previous-mtu-that-worked and the one that didn't.

For a "common" MTU, you only have to endure two timeouts (the next
highest common MTU, and the +1 test). For an uncommon MTU you can
quickly raise the MTU to the largest "common" MTU below it, and endure
the timeouts only from then on. If you instead pick a common MTU and
hope, you have to endure timeouts whenever the guess is wrong.

If you don't want to hard-code that table into your stack, you can at
least do something simple like incrementing the probed MTU by, say,
512.
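The ramp-up strategy can be sketched as follows. The table of common
sizes here is illustrative only (classic Ethernet/FDDI/token ring/jumbo
values, not taken from the paper), and `probe()` again stands in for a
probe/ack round trip to the neighbour:

```python
COMMON_MTUS = [1500, 2002, 4352, 4464, 8166, 9000, 9216]  # illustrative

def discover_mtu(probe, ceiling):
    """Walk up the table of common sizes; on the first failure test
    last-success+1, and binary-search only if that unexpectedly works.
    probe(size) returns True when a probe of `size` bytes was acked."""
    best = 1280                          # IPv6 minimum MTU always works
    failed = None
    for size in (s for s in COMMON_MTUS if s <= ceiling):
        if probe(size):
            best = size                  # usable immediately, keep going
        else:
            failed = size
            break
    if failed is None:
        return best                      # the whole table worked
    if not probe(best + 1):
        return best                      # common MTU: only two timeouts
    # Uncommon MTU: search between best+1 (works) and failed (doesn't).
    lo, hi = best + 1, failed
    while lo < hi - 1:
        mid = (lo + hi) // 2
        if probe(mid):
            lo = mid
        else:
            hi = mid
    return lo
```

For a path limited to a common size such as 4464, this pays exactly two
timeouts (the 8166 probe and the 4465 check); for an uncommon limit it
still ramps up to the nearest common size quickly before bisecting the
remainder.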
> However, it's possible we can improve on this MTU, so now we do a
> binary search between the current MTU and the maximum usable one (=
> minimum of the local and remote MTUs/MRUs), at one packet per second
> or so.

You want to avoid a plain binary search because roughly half of the
probes are going to time out. log2(9000-1500) is about 13 probes, so
expect half a dozen or more timeouts; if each timeout is a second or
so, as you suggest, that's many seconds for a successful MTU probe. If
you start low and move higher, you get a successful response from the
remote end immediately, increasing your MTU every round trip. As soon
as you get a timeout, you hope that the real MTU is the previously
successful "common MTU", so you try that one +1 to make sure it fails.
If it succeeds, then it's a completely "unknown" MTU, and only then do
you fall back to a binary search.

[later, after reading below] hrm, you're suggesting sending two packets
at once, which is fine so long as reordering doesn't occur (which seems
unlikely at L2).

> If the first probe with a recently used MTU fails, then we do a
> binary search between that value and an initial one, which should
> probably be 1508 (some NICs don't do jumboframes but support 1504 for
> VLAN use, nearly all MTUs are 32-bit aligned, and the maximum 3 extra
> bytes aren't worth it anyway).

Yep. See the table I quoted above.

> I think we can assume the MTU is the same in both directions.

I agree.

> So when system A tries with 3000 bytes (worked with C!) towards B, B
> sets an ack flag and tries with 9216, which fails, so A sends a NAK
> and tries with 6108, and so on.

Hang on: if they don't receive a packet, how can they know to send a
NAK? And if they're just waiting for a timeout, how can they know
whether the packet got lost on the way there or on the way back?

> With each successful packet (either received or acknowledged) the MTU
> towards the correspondent can be increased immediately.

Yep!
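To put rough numbers on that objection, here is a toy count of probes
and timeouts for a pure bisection between 1500 and 9000 against a
hypothetical path that carries 4464 bytes:

```python
def bisect_mtu(path_mtu, lo=1500, hi=9000):
    """Return (discovered MTU, probes sent, probes that timed out) for
    a plain binary search; any probe above path_mtu times out."""
    probes = timeouts = 0
    while lo < hi - 1:
        mid = (lo + hi) // 2
        probes += 1
        if mid <= path_mtu:
            lo = mid                 # probe answered: raise the floor
        else:
            hi = mid                 # probe lost: wait out a timeout
            timeouts += 1
    return lo, probes, timeouts
```

For `path_mtu = 4464` this sends 13 probes, 6 of which time out; at a
second per timeout that is 6 seconds of dead time before convergence,
versus none at all for a ramp-up that lands on a common size.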
So you want to increase your chances of successful packets getting
through :)

>> By working "up" known sizes instead of doing a binary search (or
>> working down), Host A can quickly ratchet up sizes without waiting
>> for a timeout, gaining immediate benefits from larger MTUs as they
>> are discovered.

> You can do this with binary search too, as long as you send a
> separate "now testing YYYY, ack XXXX" packet.

Doubling the number of packets that need to be sent. Chances are it's
going to be one of a very few sizes, and as you say, it'll probably
even be the size that the other hosts on the link are using.

If instead of using special "ICMP MTU Probes" we use "ICMP Echo
Request"/"ICMP Echo Reply" messages, no changes to any packet formats
are needed; all that has to happen is for it to be implemented in a
TCP/IP stack, and the concept is even reusable for IPv4. Other hosts
don't even have to be upgraded to support this. Magic! Stacks would be
free to do as you suggest (a binary search) or as I suggest (ramp up,
and do a binary search only as a last resort).

So the general approach would be:

* If a packet arrives from a host that is larger than the cached MTU
  for that neighbour, increase the cached MTU to the size of the
  arriving packet.
* When receiving an NS (but not an NA!), and you have no cached MTU
  for that neighbour, start the MTU discovery process, using any
  mechanism for selecting the probe sizes the implementation deems
  appropriate (i.e. either yours, or mine, or, if someone can come up
  with a method that's even better than ours, that one!).

> I don't think we want to do this at top speed, though, because the
> control traffic could get in the way of real traffic.

Hrm, now that's a good point. Ah, but you should only be putting one
packet onto the network after another packet has come off, so you
can't overflow any queues with it.

>> Assuming no reordering, you then don't have to wait for a timeout.
>> If reordering does occur, you then send a "Whoops! reordering!
>> didn't expect that on the same L2!" and everyone flags that
>> interface as "possible reordering" and then always waits for a
>> timeout.

>> In the common case of no reordering this will be much faster, due to
>> not waiting for timeouts.

> If the "I sent..." packet comes in before the actual test packet,
> then this would look like the test packet didn't make it, so the
> receiver would send a NAK. However, if the packet does make it and
> comes in late, then obviously the receiver notices this and it can
> send out an ACK as well. Since we don't want to hammer the layer 2
> network with possibly invalid packets, the receiver (well, sender of
> the original probe) would probably want to wait long enough for that
> ACK to come in before sending the next packet anyway.

Hammering shouldn't be too bad -- you're only putting a packet onto
the network when a packet is taken off it, so you can't overflow any
buffers. As for packet rate issues, any host which is "slow" just
slows the rate of packets being sent down to the rate it can cope
with.

>> Not everything has a MAC address.

> Yikes! You really are a modern day René Descartes, aren't you? :-)

Well, this needs to work on L2s that aren't Ethernet (even if there
aren't many of them left!), so the assumption that everything has a
MAC may be premature.

>> Difference in link local addresses?
>> This sounds very much like turning Ethernet into token ring <grin>.

> Ring networks are very cool, too bad we don't have them anymore.

Heh, indeed. :)

>>> Alternatively, we could add an RA option that administrators can
>>> use to tell hosts the jumboframe size the layer 2 network supports.
>>> (The RA option doesn't say anything about the capabilities of the
>>> _router_.) Then the whole multicast taking-turns discovery isn't
>>> necessary, and we can suffice with a quick one-to-one verification
>>> before jumboframes are used.
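The two per-neighbour rules in the bullet list above could be kept as a
small cache; everything here (the class name, the choice of 1280 as the
floor, the empty discovery hook) is my own illustration:

```python
IPV6_MIN_MTU = 1280  # every IPv6 link must carry at least this

class NeighborMtuCache:
    """Per-neighbour MTU state, following the two rules above (sketch)."""

    def __init__(self):
        self.mtu = {}                # link-layer neighbour -> known MTU

    def packet_received(self, neighbor, size):
        # Rule 1: an arriving packet proves the path carries its size.
        if size > self.mtu.get(neighbor, IPV6_MIN_MTU):
            self.mtu[neighbor] = size

    def solicitation_received(self, neighbor):
        # Rule 2: a solicitation (not an advertisement) from a
        # neighbour we hold no entry for triggers discovery, so both
        # ends don't probe the same path at once.
        if neighbor not in self.mtu:
            self.mtu[neighbor] = IPV6_MIN_MTU
            self.start_discovery(neighbor)

    def start_discovery(self, neighbor):
        # Placeholder: ramp up through common sizes or binary-search,
        # whichever the implementation prefers.
        pass
```

Because rule 1 runs on every received packet, a neighbour that has
already finished its own discovery raises our cached value for free.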
>> This still seems to fall foul of either requiring the administrator
>> to configure the router

> Well, that's what administrators do, isn't it?

Not when it's my flatmate who wants to plug his shiny new console into
the switch and have it "just work". Even if I understand MTUs, I
really don't want to start trying to figure out what MTUs all the
switches on a large campus have. What was the MTU of the 3-port switch
in my VoIP phone? Who knows? And to be perfectly honest, who the hell
cares? :) I should be able to plug everything in and have it "just go"
with the best possible performance.

>> or degrading the entire network to the level of the router.

> No, the announcement "the switch can handle 4500 bytes" wouldn't have
> anything to do with "I can handle 1500".

Which switch? I live in a flat with 3 other people, and we have at
least 4 devices that act like switches on one segment (2 switches, a
VoIP phone you can daisy-chain a PC off, and an AP). I have no idea
what the maximum MTU of all those switches is, let alone of all the
end hosts around here. Sure, I could spend an afternoon configuring
everything just right so that I get the maximum possible efficiency
out of my network, but it's unreasonable to think that everyone else
in the world will.

> It's probably a good idea to make announcements like this part of the
> protocol, but not as RA options. That way, switches can announce
> their own MTU capabilities, even if they don't otherwise support
> IPv6. So if the switch says that it can do 4500, we only have to try
> 4500 (ack) and 4504 (nak) and everything is much faster. (Unless the
> layer 2 network is more complex, of course, but then either 4500 gets
> a nak/timeout or 4504 gets an ack.)
>
> Hm, maybe 4 bytes larger than an earlier maximum is always a good
> idea...
> It would be even better if we could ask the switch what our port
> supports, but I'm not sure how to do this: a switch that doesn't
> support this protocol would flood the request, so the results would
> be meaningless.

Hrm, so Ethernet has capability negotiation (which is how speed,
duplex, pause frame support etc. are negotiated). I have no idea
whether it covers jumboframe support; IEEE specs make my head hurt.

[hrm, this email is getting rather long :)]

--------------------------------------------------------------------
IETF IPv6 working group mailing list
ipv6@ietf.org
Administrative Requests: https://www1.ietf.org/mailman/listinfo/ipv6
--------------------------------------------------------------------