Re: jumbo frame of GbE and IPv6 -- A proposal

Iljitsch van Beijnum Sun, 24 Jul 2005 11:11:01 -0700

On 24-jul-2005, at 15:16, Perry Lorier wrote:

I think our requirements are:

1. do not impede 1500-byte operation
2. discover and utilize jumboframe capability where possible
3. discover and utilize (close to) the maximum MTU
4. recover from sudden MTU reductions fast enough for TCP and similar
to survive

I'd add to this list:
5. Must be fully automatic and not require any admin intervention to do
the "right" thing.
6. Minimise the resources used.

Agree, except that packets are cheap on a 1000 Mbps LAN, so those
don't count much towards 6.

Whenever two systems (hosts or routers) on a link perform neighbor
discovery, they can trigger the MTU verification immediately
afterward, and if jumboframe support is confirmed by receiving the
larger packets, the MTU for the neighbor can be updated. If the
larger packets don't make it to the neighbor there is no complexity
and no delay: communication was already underway at 1500 bytes and
continues without the need for further action.

Yep!  I'd considered this, but at 3am didn't want to confuse the issue
by introducing too many differing ideas at the same time. (Also, I
wanted to go to sleep :) You seem to have thought through some of
the issues a bit better than I had too :)

(-:

However, this doesn't accommodate finding out jumboframe support at
reduced sizes very well. For this, I think we should use an additional
exchange, but this one should probably happen over multicast.

I disagree.  There is no need for every host to have a full
understanding of the layer 2 topology of the network it is on.

That's not what the mechanism that I outlined does. What it does is
let all jumbo-capable systems send at least one packet at their
maximum size and get feedback on whether it was received by anyone.
For that, every packet may update information that the system (host
or router) holds, but then the old information is gone, so there is
no per-(potential) correspondent state.

We're starting to see some very large L2 networks as MANs (eg NLR[1])
and IPv6's /64 per subnet puts no real practical limit on how large a
single L2 segment can be.

Hm, they invented routers for a reason, I don't think we have to bend
over backwards to accommodate unreasonably large layer 2 networks.
But: what would be reasonable to accommodate? 1000 systems?

Multicasting out "jumbo" packets is going to tax switch queue sizes to
the extreme.  A 48 port switch, receiving a single multicast 9k packet
(that it can forward) may end up with 48*9k = roughly 430,000 bytes (or
nearly 1/2 a megabyte) of buffer space used with a single packet
arriving.

I'm not sure this is a problem: as far as I know (and that's not too
far) switches use per-port buffer space, so although such a packet
uses up buffer space for a lot of ports, there is nothing
unreasonable about that (packet is gone again in 72 microseconds
anyway). But some vendor feedback would be good here.
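
For reference, the two figures above can be checked with a few lines of Python. This is only a sketch: the function names are made up for this example, and it assumes a 9000-byte frame on a 1 Gbps link with one copy of the frame queued per port.

```python
def multicast_buffer_bytes(ports: int, frame_bytes: int) -> int:
    """Total buffer space if one copy of the frame is queued on every port."""
    return ports * frame_bytes


def serialization_time_us(frame_bytes: int, link_bps: int = 1_000_000_000) -> float:
    """Time in microseconds to clock one frame out of a port at link_bps."""
    return frame_bytes * 8 * 1_000_000 / link_bps


if __name__ == "__main__":
    # 48 ports, 9000-byte frame (assumed values taken from the discussion)
    print(multicast_buffer_bytes(48, 9000))
    print(serialization_time_us(9000))
```

So the total is a little under half a megabyte, but each port only holds one frame for about the time it takes to transmit it.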

What happens on l2's where not every node can see every other node?


Neighbor discovery fails?

Some L2's only allow end hosts to talk to a master host.  What about a
network with vlans where some hosts are on differing sets of vlans?


I don't think those exist in that way.

  A                C
  |                |
+-+--+  +----+  +--+-+
|9000+--+3000+--+8000|
+-+--+  +----+  +--+-+
  |                |
  B                D

If A and B are talking to each other and C and D are talking to each
other, why do (A and B) need to talk to C and D?


Ah, but how do you know that A doesn't talk to D, and is never going to?

Why not a simpler protocol?

Host A sends a ND to Host B with Host A's MRU.
Host B replies with a NS to Host A with Host B's MRU.
Host A can now start transmitting the data it wants to send.
Host A now sends Host B a ICMP MTU Probe, at some size less than or
equal to Host B's MRU
If Host B receives the packet, it replies with an ICMP MTU Probe reply
saying what it received.

The problem here is that if the switches in the middle don't support
the jumboframe sizes A and B do, you can't use jumboframes at all.
Since there is no standard jumboframe size, this is a problem. Of
course A and B can discover the maximum size between them, but doing
that between any set of correspondents seems suboptimal.

On the other hand, if we reuse the information learned for previous
correspondents, this could work well.

So:

When ND indicates a neighbor supports jumboframes, we start by
sending an MTU probe with an MTU that worked towards another
correspondent fairly recently (within the last hour or so). If the
correspondent receives this packet it sends an ack and both sides can
increase the MTU.

However, it's possible we can improve on this MTU, so now we do a
binary search between the current MTU and the maximum usable one (=
minimum of local and remote MTU/MRUs), at one packet per second or so.

If the first probe with a recently used MTU fails then we do a binary
search between that value and an initial one, which should probably be
1508 (some NICs don't do jumboframes but support 1504 for VLAN use,
nearly all MTUs are 32 bit aligned, and the maximum 3 extra bytes
aren't worth it anyway).

I think we can assume the MTU is the same in both directions. So when
system A tries with 3000 bytes (worked with C!) towards B, B sets an
ack flag and tries with 9216, which fails, so A sends a NAK and tries
with 6108, and so on. With each successful packet (either received or
acknowledged) the MTU towards the correspondent can be increased
immediately.
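
To make the search order concrete, here is a minimal Python sketch of the seeded binary search described above. It is an illustration, not a protocol definition: `discover_mtu`, `probe` and the constants are names invented for this example, `probe(size)` stands in for "send a test packet of this size and wait for the ack", and the one-packet-per-second pacing and the alternation between the two sides are left out.

```python
from typing import Callable, Optional

BASE_MTU = 1500    # always works, per requirement 1
FLOOR_MTU = 1508   # smallest jumbo size worth testing, as suggested above


def align_down(size: int, step: int = 4) -> int:
    """Round down to a 32-bit boundary."""
    return size - (size % step)


def discover_mtu(probe: Callable[[int], bool], max_mtu: int,
                 recent_mtu: Optional[int] = None) -> int:
    """Return the largest size that was acked (BASE_MTU if none was).

    max_mtu is the minimum of the local and remote MTU/MRU; recent_mtu is a
    size that worked towards another correspondent recently, if any.
    """
    if max_mtu < FLOOR_MTU:
        return BASE_MTU
    good, bad = BASE_MTU, None          # largest acked / smallest failed size
    # Step 1: try the size that worked for someone else.
    if recent_mtu is not None and FLOOR_MTU <= recent_mtu <= max_mtu:
        if probe(recent_mtu):
            good = recent_mtu
        else:
            bad = recent_mtu
    # Step 2: nothing has failed yet, so try the maximum directly.
    if bad is None:
        if good >= max_mtu or probe(max_mtu):
            return max_mtu
        bad = max_mtu
    # Step 3: without a working jumbo size yet, check the floor first.
    if good == BASE_MTU:
        if FLOOR_MTU >= bad or not probe(FLOOR_MTU):
            return BASE_MTU
        good = FLOOR_MTU
    # Step 4: binary search on 4-byte aligned sizes between good and bad.
    while bad - good > 4:
        mid = align_down((good + bad) // 2)
        if mid <= good:
            mid = good + 4
        if probe(mid):
            good = mid                  # MTU can be raised immediately
        else:
            bad = mid
    return good


if __name__ == "__main__":
    sent = []

    def simulated_probe(size: int) -> bool:
        sent.append(size)
        return size <= 4500             # pretend a switch in the path does 4500

    print(discover_mtu(simulated_probe, max_mtu=9216, recent_mtu=3000))
    print(sent[:3])
```

With a simulated 4500-byte bottleneck, the first three sizes tried are 3000, 9216 and 6108, matching the example above.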

By working "up" known sizes instead of doing a binary search (or
working down), Host A can quickly ratchet up sizes without waiting for
a timeout, gaining immediate benefits from larger MTUs as they are
discovered.

You can do this with binary search too, as long as you send a
separate "now testing YYYY, ack XXXX" packet.

I don't think we want to do this at top speed, though, because the
control traffic could get in the way of real traffic.

If you persist with this idea, switch the order of the packets sent.
Send the packet first, then send the announcement that it was sent.

Yes, that makes sense. I thought there would be a possibility that
the switch or hub would need some time to recover after receiving a
packet that's too large, though.

Assuming no reordering you then don't have to wait for a timeout.  If
reordering does occur you then send a "Whoops! reordering! didn't
expect that on the same L2!" and then everyone flags that interface as
"possible reordering" and then always waits for a timeout.

In the common case of no reordering this will be much faster due to
not waiting for timeouts.

If the "I sent..." packet comes in before the actual test packet,
then this would look like the test packet didn't make it, so the
receiver would send a NAK. However, if the packet does make it and
comes in late, then obviously the receiver notices this and it can
send out an ACK as well. Since we don't want to hammer the layer 2
network with possibly invalid packets, the receiver of that ACK
(that is, the sender of the original probe) would probably want to
wait long enough for the ACK to come in before sending the next packet
anyway.
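
The receiver behaviour described here could look roughly like the following Python sketch. The class and method names are invented for illustration, and the messages are reduced to a size plus "ACK" or "NAK"; nothing here is meant as a wire format.

```python
from typing import Optional, Set, Tuple

Reply = Tuple[str, int]   # ("ACK" or "NAK", probe size)


class ProbeReceiver:
    """Tracks test packets and "I sent..." announcements from one neighbor."""

    def __init__(self) -> None:
        self.seen: Set[int] = set()     # sizes of test packets that arrived
        self.nacked: Set[int] = set()   # sizes we NAKed because nothing arrived

    def on_test_packet(self, size: int) -> Optional[Reply]:
        self.seen.add(size)
        if size in self.nacked:
            # The test packet was reordered behind its announcement:
            # correct the earlier NAK with a late ACK.
            self.nacked.discard(size)
            return ("ACK", size)
        return None                     # wait for the announcement

    def on_announcement(self, size: int) -> Reply:
        if size in self.seen:
            return ("ACK", size)        # common case, no timeout needed
        self.nacked.add(size)
        return ("NAK", size)            # lost, or possibly just late


if __name__ == "__main__":
    rx = ProbeReceiver()
    rx.on_test_packet(4500)
    print(rx.on_announcement(4500))     # in order
    print(rx.on_announcement(6000))     # announcement arrives first
    print(rx.on_test_packet(6000))      # test packet shows up late
```

The prober would still hold off on the next size for a short while after a NAK, so that a late ACK has a chance to arrive.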

Not everything has a MAC address.


Yikes! You really are a modern day René Descartes, aren't you?  :-)

Difference in link local addresses?
This sounds very much like turning Ethernet into token ring <grin>.


Ring networks are very cool, too bad we don't have them anymore.

Alternatively, we could add an RA option that administrators can use to
tell hosts the jumboframe size the layer 2 network supports. (The RA
option doesn't say anything about the capabilities of the _router_.)
Then the whole multicast taking turns discovery isn't necessary, and we
can suffice with a quick one-to-one verification before jumboframes are
used.

This still seems to fall foul of either requiring the administrator to
configure the router


Well, that's what administrators do, isn't it?

or degrading the entire network to the level of the
router.

No, the announcement "the switch can handle 4500 bytes" wouldn't have
anything to do with "I can handle 1500".

It's probably a good idea to make announcements like this part of the
protocol, but not as RA options. That way, switches can announce
their own MTU capabilities, even if they don't otherwise support
IPv6. So if the switch says that it can do 4500, we only have to try
4500 (ack) and 4504 (nak) and everything is much faster. (Unless the
layer 2 network is more complex, of course, but then either 4500 gets
a nak/timeout or 4504 gets an ack.)
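
In code, that two-probe check might look like this. Again only a sketch: `verify_announced_mtu` and `probe` are hypothetical names, and `probe(size)` is assumed to return True on an ack and False on a nak or timeout.

```python
from typing import Callable


def verify_announced_mtu(probe: Callable[[int], bool], announced: int,
                         step: int = 4) -> str:
    """Check an announced size with two probes.

    Returns "confirmed" if the announced size works and the next aligned size
    does not, "lower" if the announced size itself fails, and "higher" if the
    next size also gets through (so a full search may still pay off).
    """
    if not probe(announced):
        return "lower"
    if probe(announced + step):
        return "higher"
    return "confirmed"


if __name__ == "__main__":
    # Simulated path that really does carry at most 4500 bytes.
    print(verify_announced_mtu(lambda size: size <= 4500, 4500))
```

Only the "lower" and "higher" outcomes would need to fall back to the slower search.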

Hm, maybe 4 bytes larger than an earlier maximum is always a good
idea...

It would be even better if we could ask the switch what our port
supports, but I'm not sure how to do this without a switch that
doesn't support the protocol flooding the request, which would make
the results meaningless.

--------------------------------------------------------------------
IETF IPv6 working group mailing list
ipv6@ietf.org
Administrative Requests: https://www1.ietf.org/mailman/listinfo/ipv6
--------------------------------------------------------------------
