>>> 1. do not impede 1500-byte operation
>>> 2. discover and utilize jumboframe capability where possible
>>> 3. discover and utilize (close to) the maximum MTU
>>> 4. recover from sudden MTU reductions fast enough for TCP and similar
>>> to survive
>> 5. Must be fully automatic and not require any admin intervention  to do
>> the "right" thing.
>> 6. Minimise the resources used.
> 
> Agree, except that packets are cheap on a 1000 Mbps LAN, so those  don't
> count much towards 6.

Packet rate, however, starts becoming a problem at faster speeds; at
gigabit rates it becomes hard for hosts to keep up unless they are
careful.  And not all networks are fast: 3G networks are becoming more
prevalent.  We should not waste resources needlessly :)

>>> However, this doesn't accommodate finding out jumboframe support at
>>> reduced sizes very well. For this, I think we should use an   additional
>>> exchange, but this one should probably happen over  multicast.
> 
> 
>> I disagree.  There is no need to for every host to have a full
>> understanding of the layer 2 topology of the network it is on.
> 
> 
> That's not what the mechanism that I outlined does. What it does do is
> let all jumbo-capable systems send at least one packet at their
> maximum size and get feedback on whether it was received by anyone. For
> that, every packet may update information that the system (host or
> router) holds, but then the old information is gone so there is no
> per-(potential)-correspondent state.

Ok, I sat down and reread it a few more times, and I think I now have a
better handle on what you're trying to get at.

>> We're
>> starting to see some very large L2 networks as MANs (e.g. NLR[1]), and
>> IPv6's /64 per subnet puts no real practical limit on how large a single
>> L2 segment can be.
> 
> 
> Hm, they invented routers for a reason.

You know, that was exactly my thought when I heard about it too. :)

> I don't think we have to bend over backwards to accommodate unreasonably
> large layer 2 networks. But: what would be reasonable to accommodate?
> 1000 systems?

I think it's probable that 1000 systems would be quite feasible.  It's
also possible that, as the scarcity of addresses is relieved, people
will move back towards IP-based virtual hosting.  How many SSL sites
would a single ISP want to host if it could do so easily?

If you have a big outdoor concert with some APs covering it, you could
have thousands of people all ending up sharing an L2.

> I'm not sure this is a problem: as far as I know (and that's not too 
> far) switches use per-port buffer space, so although such a packet  uses
> up buffer space for a lot of ports, there is nothing  unreasonable about
> that (packet is gone again in 72 microseconds  anyway). But some vendor
> feedback would be good here.

Fair enough.

>> What happens on l2's where not every node can see every other node?
> 
> Neighbor discovery fails?

Host A can talk to Host B ok.
Host A can talk to Host C ok.
Host B can't talk to Host C.

This happens in ad hoc wireless networks.  With your system I'm not
entirely sure how you deal with whose "turn" it is next if not all nodes
can see all other nodes.  Host A should still be able to talk to Host B.

>> Some L2's only allow end hosts to talk to a master host.  What about a
>> network with vlans where some hosts are on differing sets of vlans?
> 
> I don't think those exist in that way.

Ad hoc wireless was a much better example ;)

>>>   A                C
>>>   |                |
>>> +-+--+  +----+  +--+-+
>>> |9000+--+3000+--+8000|
>>> +-+--+  +----+  +--+-+
>>>   |                |
>>>   B                D
> 
> 
>> If A and B are talking to each other and C and D are talking to each
>> other, why do (A and B) need to talk to C and D?
> 
> Ah, but how do you know that A doesn't talk to D, and is never going to?

How do you know it will in the time before the topology of the network
changes?  Given that the topology of the network changes every time a
host comes and goes, the chance that you'll want to talk to most of the
users during that time is rather low.

>> Why not a simpler protocol?
> 
> 
>> Host A sends an NS to Host B with Host A's MRU.
>> Host B replies with an NA to Host A with Host B's MRU.
>> Host A can now start transmitting the data it wants to send.
>> Host A now sends Host B an ICMP MTU Probe, at some size less than or
>> equal to Host B's MRU.
>> If Host B receives the packet, it replies with an ICMP MTU Probe reply
>> saying what it received.
> 
> 
> The problem here is that if the switches in the middle don't support
> the jumboframe sizes A and B do, you can't use jumboframes at all. Since
> there is no standard jumboframe size, this is a problem. Of course A
> and B can discover the maximum size between them, but doing that
> between any set of correspondents seems suboptimal.

> On the other hand, if we reuse the information learned for previous 
> correspondents, this could work well.
> 
> So:
> 
> When ND indicates a neighbor supports jumboframes, we start by sending
> an MTU probe with an MTU that worked towards another
> correspondent fairly recently (last hour or so). If the correspondent
> receives this packet it sends an ack and both sides can increase the MTU.

I'd start at the minimum "MTU" size.

A colleague of mine (Matthew Luckie) has done some research into path
MTUs.  He has a work-in-progress paper (
http://www.wand.net.nz/~mjl12/debugging-pmtud.pdf ) where he enumerates
all the common MTUs he's seen on the Internet.

I'd start with a similar table, trying the lowest size first; if that's
received, try the next size up, and so on, until you don't get a reply.
When you don't get a reply, try the previous-MTU-that-worked+1; if that
succeeds, start a binary search between the previous MTU that worked and
the one that didn't.

For a "common" MTU, you only have to endure two timeouts (the next
highest common MTU, and the +1 test). For an uncommon MTU you can
increase the MTU to maximum "common" MTU that's lower than your MTU
quickly, and can endure the timeouts from then on.

If you pick a common MTU and hope, you end up having to endure timeouts
in the case of an incorrect guess.
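
To make the ramp-up concrete, here's roughly the logic I have in mind,
in Python-ish form (illustrative only: probe() just stands for "send a
probe of this size and say whether a reply came back", and the table of
sizes is only an example, not Matthew's actual list):

    # Illustrative sketch only.  probe(size) stands for "send a probe packet
    # of this size and return True on a reply, False on a timeout".
    COMMON_MTUS = [1500, 1508, 2002, 4352, 4464, 8166, 9000]   # example values

    def discover_mtu(probe, local_max):
        """Ramp up through common sizes, confirm with a +1 probe, and only
        fall back to a binary search for 'unknown' MTUs."""
        best = 1280                    # IPv6 minimum link MTU, assumed to work
        failed = local_max + 1         # smallest size assumed to fail
        for size in (m for m in COMMON_MTUS if m <= local_max):
            if probe(size):
                best = size            # every success is usable immediately
            else:
                failed = size
                break
        if failed - best <= 1:
            return best
        if not probe(best + 1):        # it really is exactly a 'common' MTU
            return best
        best += 1                      # uncommon MTU: binary search the gap
        while failed - best > 1:
            mid = (best + failed) // 2
            if probe(mid):
                best = mid
            else:
                failed = mid
        return best

For a link whose real MTU is a common size (say 4464), that's exactly
the two timeouts described above; only an uncommon MTU pays for the
binary search.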

If you don't want to hard code that table into your stack, you can at
least do something simple like increment the probed MTU by say 512.

> However, it's possible we can improve on this MTU, so now we do a 
> binary search between the current MTU and the maximum usable one (= 
> minimum of local and remote MTU/MRUs), at one packet per second or so.

You want to try to avoid doing a binary search, because roughly half of
the probes are going to end in a timeout.  log2(9000-1500) is about 13,
so that's around 13 probes; at a second or so each, as you suggest,
that's over 10 seconds for a successful MTU probe.
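
As a back-of-the-envelope check (the "true" MTU of 4464 below is just an
arbitrary example), counting what a plain binary search between 1500 and
9000 costs:

    import math

    def binary_search_cost(true_mtu, low=1500, high=9000):
        """Count probes (and how many of them time out) for a plain binary
        search for the largest size <= true_mtu."""
        probes = timeouts = 0
        while high - low > 1:
            mid = (low + high) // 2
            probes += 1
            if mid <= true_mtu:        # probe of this size would get a reply
                low = mid
            else:                      # probe would be lost: wait out a timeout
                high = mid
                timeouts += 1
        return probes, timeouts

    print(math.ceil(math.log2(9000 - 1500)))   # 13
    print(binary_search_cost(4464))            # (13, 6)

So at one probe per second that's 13 seconds or so before the search
converges, which is what the ramp-up avoids in the common case.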

If you start low and move higher then you'll get a successful response
from the remote end immediately, increasing your MTU every round trip.
As soon as you get a timeout, you hope that the real MTU is the "common"
MTU that last succeeded, so you try that one +1 to make sure it fails.
If that succeeds, then it's a completely "unknown" MTU, and only then do
you fall back to a binary search.

[later, after reading below] hrm, you're suggesting sending two packets
at once, which is fine so long as reordering doesn't occur (which seems
unlikely at L2).

> If the first probe with a recently used MTU fails then we do a binary
> search between that value and an initial one, which should probably be
> 1508 (some NICs don't do jumboframes but support 1504 for VLAN use;
> nearly all MTUs are 32-bit aligned, and the maximum 3 extra bytes aren't
> worth it anyway).

Yep. See the table I quoted above.

> I think we can assume the MTU is the same in both directions. 

I agree.

> So when system A tries with 3000 bytes (worked with C!) towards B, B sets an 
> ack flag and tries with 9216, which fails, so A sends a NAK and tries 
> with 6108, and so on. 

Hang on: if they don't receive a packet, how can they know to send a
NAK?  If they're just waiting for a timeout, how can they know whether
the packet got lost on the way there or on the way back?

> With each successful packet (either received or
> acknowledged) the MTU towards the correspondent can be increased 
> immediately.

Yep!  So you want to increase your chances of successful packets getting
through :)

>> By working "up" known sizes instead of doing a binary search (or  working
>> down), Host A can quickly ratchet up sizes without waiting for a  timeout
>> gaining immediate benefits from larger MTUs as they are discovered.
> 
> 
> You can do this with binary search too, as long as you send a  separate
> "now testing YYYY, ack XXXX" packet.

That doubles the number of packets that need to be sent.  Chances are
it's going to be one of a very few sizes, and as you say, it'll probably
even be the same size the other hosts on the link are using.

If instead of using special "ICMP MTU Probes" we use "ICMP Echo
Request"/"ICMP Echo Reply" messages, no changes to any packet formats
are needed; all that has to be done is to implement it in a TCP/IP
stack, and the concept is even reusable for IPv4.  Other hosts don't
even have to be upgraded to support this.  Magic!

Stacks would be free to do as you suggest (doing a binary search) or as
I suggest (ramp up and do a binary search only as a last resort).
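
Just to show how little is needed (a sketch only, assuming Linux iputils
ping and a made-up link-local address; a real stack would of course
generate the echoes itself), the probe() in the earlier sketch could
simply be an echo of the right size:

    import subprocess

    IPV6_HEADERS = 40 + 8              # IPv6 header + ICMPv6 echo header

    def echo_probe(addr, size, timeout=1):
        """Send one echo request so the whole packet is `size` bytes and
        report whether a reply came back before the timeout."""
        result = subprocess.run(
            ["ping", "-6", "-M", "do",          # don't fragment locally
             "-c", "1", "-W", str(timeout),
             "-s", str(size - IPV6_HEADERS), addr],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        return result.returncode == 0

    # e.g. discover_mtu(lambda s: echo_probe("fe80::1%eth0", s), local_max=9000)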

So the general approach would be:
* If a packet arrives from a neighbour that is larger than the cached
MTU for that neighbour, increase the cached MTU to the size of the
arriving packet.
* When receiving an NA (but not an NS!) and you have no cached MTU for
that neighbour, start the MTU discovery process, using whatever
mechanism for selecting the probe sizes the implementation deems
appropriate (i.e., either yours, or mine, or, if someone comes up with a
method that's even better than ours, that one!).  A rough sketch of this
cache logic is below.
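
Something like this per-neighbour cache is all the state I have in mind
(names and structure are mine, purely illustrative):

    class NeighbourMtuCache:
        """Per-neighbour usable-MTU cache; start_discovery is whatever
        probing routine the implementation prefers."""
        def __init__(self, start_discovery, default_mtu=1500):
            self.mtu = {}              # neighbour address -> usable MTU
            self.start_discovery = start_discovery
            self.default_mtu = default_mtu

        def usable_mtu(self, neighbour):
            return self.mtu.get(neighbour, self.default_mtu)

        def packet_received(self, neighbour, size):
            # A packet bigger than the cached value proves the larger size
            # works, so raise the cached MTU immediately.
            if size > self.usable_mtu(neighbour):
                self.mtu[neighbour] = size

        def neighbour_advertisement(self, neighbour):
            # An NA (not an NS) with no cached entry kicks off probing.
            if neighbour not in self.mtu:
                self.mtu[neighbour] = self.default_mtu
                self.start_discovery(neighbour)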

> I don't think we want to do this at top speed, though, because the 
> control traffic could get in the way of real traffic.

Hrm, now that's a good point.  Ah, but you should only be putting one
packet onto the network after another packet has come off, so you can't
overflow any queues with it.

>> Assuming no reordering you then don't have to wait for a timeout.  If
>> reordering does occur you then send a "Whoops! reordering! didn't  expect
>> that on the same L2!" and then everyone flags that interface as
>> "possible reordering" and then always waits for a timeout.
> 
> 
>> In the common case of no reordering this will be much faster due to  not
>> waiting for timeouts.
> 
> 
> If the "I sent..." packet comes in before the actual test packet,  then
> this would look like the test packet didn't make it, so the  receiver
> would send a NAK. However, if the packet does make it and  comes in
> late, then obviously the receiver notices this and it can  send out an
> ACK as well. Since we don't want to hammer the layer 2  network with
> possibly invalid packets, the receiver (well, sender of  the original
> probe) would probably want to wait long enough for that  ACK to come in
> before sending the next packet anyway.

Hammering shouldn't be too bad -- you're only putting a packet onto the
network when a packet is taken off it, so you can't overflow any
buffers.  As for packet rate issues, any host which is "slow" just slows
the rate of packets being sent down to the rate it can cope with.

>> Not everything has a MAC address.
> 
> Yikes! You really are a modern day René Descartes, aren't you?  :-)

Well, this needs to work on L2s that aren't Ethernet (even if there
aren't many of them left!), so assuming that everything has a MAC may be
premature.

>> Difference in link local addresses?
>> This sounds very much like turning Ethernet into token ring <grin>.
> 
> Ring networks are very cool, too bad we don't have them anymore.

Heh, indeed. :)

>>> Alternatively, we could add an RA option that administrators can use to
>>> tell hosts the jumboframe size the layer 2 network supports. (The RA
>>> option doesn't say anything about the capabilities of the _router_.)
>>> Then the whole multicast taking-turns discovery isn't necessary, and we
>>> can suffice with a quick one-to-one verification before jumboframes are
>>> used.
> 
>> This still seems to fall foul of either requiring the administrator to
>> configure the router
> 
> Well, that's what administrators do, isn't it?

Not when it's my flatmate who wants to plug his shiny new console into
the switch and have it "just work".  Even if I understand MTUs, I really
don't want to start trying to figure out what the MTUs of all the
switches on a large campus are.  What was the MTU of the 3-port switch
in my VoIP phone?  Who knows?  And to be perfectly honest, who the hell
cares? :)  I should be able to plug everything in and have it "just go"
with the best possible performance.

>> or degrading the entire network to the level of the
>> router.
> 
> No, the announcement "the switch can handle 4500 bytes" wouldn't have 
> anything to do with "I can handle 1500".

Which switch?  I live in a flat with 3 other people; we have at least 4
devices that act like switches on one segment (2 switches, a VoIP phone
you can daisy-chain a PC off, and an AP).  I have no idea what the
maximum MTUs of all those switches are, let alone all the end hosts
around here.  Sure, I could spend an afternoon configuring everything
just right so that I get the maximum possible efficiency through my
network, but it's unreasonable to think that everyone else in the world
will.

> It's probably a good idea to make announcements like this part of the 
> protocol, but not as RA options. That way, switches can announce  their
> own MTU capabilities, even if they don't otherwise support  IPv6. So if
> the switch says that it can do 4500, we only have to try  4500 (ack) and
> 4504 (nak) and everything is much faster. (Unless the  layer 2 network
> is more complex, of course, but then either 4500 gets  a nack/timeout or
> 4504 gets an ack.)
> 
> Hm, maybe 4 bytes larger than an earlier maximum is always a good  idea...
> 
> It would be even better if we could ask the switch what our port
> supports, but I'm not sure how to do this in such a way that a switch
> that doesn't support this protocol doesn't just flood the request,
> making the results meaningless.

Hrm, so Ethernet has capability negotiation (which is how speed, duplex,
pause frame support etc. are negotiated).  I have no idea whether it
says anything about jumbo frame support; IEEE specs make my head hurt.

[hrm, this email is getting rather long :)]

--------------------------------------------------------------------
IETF IPv6 working group mailing list
ipv6@ietf.org
Administrative Requests: https://www1.ietf.org/mailman/listinfo/ipv6
--------------------------------------------------------------------
