On 24-jul-2005, at 15:16, Perry Lorier wrote:

I think our requirements are:

1. do not impede 1500-byte operation
2. discover and utilize jumboframe capability where possible
3. discover and utilize (close to) the maximum MTU
4. recover from sudden MTU reductions fast enough for TCP and similar
to survive

I'd add to this list:
5. Must be fully automatic and not require any admin intervention to do
the "right" thing.
6. Minimise the resources used.

Agree, except that packets are cheap on a 1000 Mbps LAN, so those don't count much towards 6.

Whenever two systems (hosts or routers) on a link perform neighbor
discovery, they can trigger the MTU verification immediately
afterward, and if jumboframe support is confirmed by receiving the
larger packets, the MTU for the neighbor can be updated. If the
larger packets don't make it to the neighbor there is no complexity
and no delay: communication was already underway at 1500 bytes and
continues without the need for further action.

Yep!  I'd considered this, but at 3am didn't want to confuse the issue
by introducing too many differing ideas at the same time. (Also, I
wanted to go to sleep :) You seem to have thought through some of
the issues a bit better than I had too :)

(-:

However, this doesn't accommodate finding out jumboframe support at
reduced sizes very well. For this, I think we should use an additional
exchange, but this one should probably happen over multicast.

I disagree.  There is no need for every host to have a full
understanding of the layer 2 topology of the network it is on.

That's not what the mechanism that I outlined does. What it does do is let all jumbo-capable systems send at least one packet at their maximum size and get feedback on whether it was received by anyone. For that, every packet may update information that the system (host or router) holds, but then the old information is gone, so there is no per-(potential) correspondent state.

We're
starting to see some very large L2 networks as MAN's (eg NLR[1]) and
IPv6's /64 per subnet puts no real practical limit on how large a single
L2 segment can be.

Hm, they invented routers for a reason, I don't think we have to bend over backwards to accommodate unreasonably large layer 2 networks. But: what would be reasonable to accommodate? 1000 systems?

Multicasting out "jumbo" packets is going to tax switch queue sizes to
the extreme.  A 48 port switch, receiving a single multicast 9k packet
(that it can forward) may end up with 48*9k=430,000 bytes (or nearly 1/2
a megabyte) of buffer space used with a single packet arriving.

I'm not sure this is a problem: as far as I know (and that's not too far) switches use per-port buffer space, so although such a packet uses up buffer space for a lot of ports, there is nothing unreasonable about that (packet is gone again in 72 microseconds anyway). But some vendor feedback would be good here.
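
To put numbers on the two figures above, here is a rough back-of-the-envelope sketch. It assumes 9000-byte frames, a 48-port switch and a 1000 Mbps port; the function names are made up for illustration and say nothing about how any real switch manages its buffers.

```python
def fanout_buffer_bytes(ports: int, frame_bytes: int) -> int:
    """Buffer space used if one frame is queued once on every port."""
    return ports * frame_bytes


def serialization_time_us(frame_bytes: int, link_bps: int) -> float:
    """Time in microseconds to clock one frame out of a port."""
    return frame_bytes * 8 * 1_000_000 / link_bps


total = fanout_buffer_bytes(48, 9000)             # bytes across all ports
drain = serialization_time_us(9000, 1_000_000_000)  # microseconds per port
print(total, drain)
```

So the total is a bit over 430,000 bytes, but each port only holds one 9000-byte frame, and only for about 72 microseconds.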

What happens on l2's where not every node can see every other node?

Neighbor discovery fails?

Some L2's only allow end hosts to talk to a master host.  What about a
network with vlans where some hosts are on differing sets of vlans?

I don't think those exist in that way.

  A                C
  |                |
+-+--+  +----+  +--+-+
|9000+--+3000+--+8000|
+-+--+  +----+  +--+-+
  |                |
  B                D

If A and B are talking to each other and C and D are talking to each
other, why do (A and B) need to talk to C and D?

Ah, but how do you know that A doesn't talk to D, and is never going to?

Why not a simpler protocol?

Host A sends an NS to Host B with Host A's MRU.
Host B replies with an NA to Host A with Host B's MRU.
Host A can now start transmitting the data it wants to send.
Host A now sends Host B an ICMP MTU Probe, at some size less than or
equal to Host B's MRU
If Host B receives the packet, it replies with an ICMP MTU Probe reply
saying what it received.

The problem here is that if the switches in the middle don't support the jumboframe sizes that A and B do, you can't use jumboframes at all. Since there is no standard jumboframe size, this is a problem. Of course A and B can discover the maximum size between them, but doing that separately for every pair of correspondents seems suboptimal.
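
To make the failure case concrete, here is a minimal sketch of the single-probe exchange. Everything in it is hypothetical: the `Host` class, the `path_limit` parameter standing in for the smallest frame size the switches in between will forward, and the choice to probe at exactly min(MRU of A, MRU of B).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Host:
    name: str
    mru: int                 # largest frame this host can receive
    mtu_to_peer: int = 1500  # what we currently send towards the peer


def probe(receiver: Host, path_limit: int, size: int) -> Optional[int]:
    """One probe; returns the size the receiver reports, or None if lost."""
    if size > receiver.mru or size > path_limit:
        return None          # dropped by the receiver or by a switch
    return size


def simple_exchange(a: Host, b: Host, path_limit: int) -> int:
    # The neighbor discovery exchange has told each side the other's MRU.
    size = min(a.mru, b.mru)  # "some size less than or equal to B's MRU"
    reply = probe(b, path_limit, size)
    if reply is not None:
        a.mtu_to_peer = reply
    return a.mtu_to_peer


# Switches pass 9216: the probe gets through and A can use 8000.
ok = simple_exchange(Host("A", 9000), Host("B", 8000), path_limit=9216)
# A 3000-byte switch in the middle: the probe is lost, A stays at 1500.
stuck = simple_exchange(Host("A", 9000), Host("B", 8000), path_limit=3000)
print(ok, stuck)
```

In the second case 3000 would have worked, but a single probe at the hosts' own maximum never finds that out.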

On the other hand, if we reuse the information learned for previous correspondents, this could work well.

So:

When ND indicates a neighbor supports jumboframes, we start by sending an MTU probe with an MTU that worked towards another correspondent fairly recently (last hour or so). If the correspondent receives this packet it sends an ack and both sides can increase the MTU.

However, it's possible we can improve on this MTU, so now we do a binary search between the current MTU and the maximum usable one (= minimum of local and remote MTU/MRUs), at one packet per second or so.

If the first probe with a recently used MTU fails then we do a binary search between that value and an initial one, which should probably be 1508 (some NICs don't do jumboframes but support 1504 for VLAN use; nearly all MTUs are 32-bit aligned; the maximum 3 extra bytes aren't worth it anyway).

I think we can assume the MTU is the same in both directions. So when system A tries with 3000 bytes (worked with C!) towards B, B sets an ack flag and tries with 9216, which fails, so A sends a NAK and tries with 6108, and so on. With each successful packet (either received or acknowledged) the MTU towards the correspondent can be increased immediately.
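
The search described above could look roughly like this. It is only a sketch under assumptions: `probe_ok` is a hypothetical stand-in for "send a probe of this size and get an ack", all sizes are kept 4-byte aligned, the maximum usable size is tried once before bisecting (as in the 3000 / 9216 / 6108 example), and the one-packet-per-second pacing is left out.

```python
from typing import Callable, Optional


def align4(n: int) -> int:
    """Round down to a multiple of 4 bytes."""
    return n - n % 4


def discover_mtu(probe_ok: Callable[[int], bool], max_usable: int,
                 cached: Optional[int] = None, floor: int = 1508,
                 base: int = 1500) -> int:
    good = base                  # largest size confirmed so far
    lower = floor                # smallest size still worth testing
    upper = align4(max_usable)   # largest size still worth testing
    try_max = True
    if cached is not None:
        cached = align4(cached)
        if lower <= cached <= upper:
            if probe_ok(cached):           # recently used MTU works here too
                good, lower = cached, cached + 4
            else:                          # search below the cached value
                upper, try_max = cached - 4, False
    if try_max and lower <= upper:
        if probe_ok(upper):                # maximum usable size just works
            return upper
        upper -= 4
    while lower <= upper:                  # plain binary search
        mid = align4((lower + upper) // 2)
        if probe_ok(mid):
            good, lower = mid, mid + 4     # MTU can be raised immediately
        else:
            upper = mid - 4
    return good


tried = []

def path_4500(size: int) -> bool:
    """Pretend the layer 2 path forwards frames up to 4500 bytes."""
    tried.append(size)
    return size <= 4500

result = discover_mtu(path_4500, max_usable=9216, cached=3000)
print(result, tried[:3])
```

With a 4500-byte path the first three probes are 3000, 9216 and 6108, the same sequence as in the example, and the search ends at 4500.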

By working "up" known sizes instead of doing a binary search (or working down), Host A can quickly ratchet up sizes without waiting for a timeout,
gaining immediate benefits from larger MTUs as they are discovered.

You can do this with binary search too, as long as you send a separate "now testing YYYY, ack XXXX" packet.

I don't think we want to do this at top speed, though, because the control traffic could get in the way of real traffic.

If you persist with this idea, switch the order of the packets sent.
Send the packet first, then send the announcement that it was sent.

Yes, that makes sense. I thought there would be a possibility that the switch or hub would need some time to recover after receiving a packet that's too large, though.

Assuming no reordering you then don't have to wait for a timeout.  If
reordering does occur you then send a "Whoops! reordering! didn't expect
that on the same L2!" and then everyone flags that interface as
"possible reordering" and then always waits for a timeout.

In the common case of no reordering this will be much faster due to not
waiting for timeouts.

If the "I sent..." packet comes in before the actual test packet, then this would look like the test packet didn't make it, so the receiver would send a NAK. However, if the packet does make it and comes in late, then obviously the receiver notices this and it can send out an ACK as well. Since we don't want to hammer the layer 2 network with possibly invalid packets, the receiver (well, sender of the original probe) would probably want to wait long enough for that ACK to come in before sending the next packet anyway.
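
A small sketch of that receiver behaviour, with made-up event names ("probe" for the test packet, "announce" for the "I sent..." packet). It assumes the receiver acks every test packet it sees and only naks an announcement when the matching test packet hasn't shown up yet; the prober then waits out a grace period and lets any ACK override a NAK.

```python
from typing import List, Tuple

Event = Tuple[str, int]  # ("probe" | "announce", size)


def receiver_responses(events: List[Event]) -> List[Tuple[str, int]]:
    """Replies the receiver sends, given packets in arrival order."""
    seen = set()
    replies = []
    for kind, size in events:
        if kind == "probe":
            seen.add(size)
            replies.append(("ACK", size))   # also corrects an earlier NAK
        elif kind == "announce" and size not in seen:
            replies.append(("NAK", size))   # test packet not here (yet)
    return replies


def prober_verdict(replies: List[Tuple[str, int]], size: int) -> bool:
    """After the grace period: one ACK is enough, NAKs alone mean failure."""
    return ("ACK", size) in replies


in_order = receiver_responses([("probe", 9000), ("announce", 9000)])
reordered = receiver_responses([("announce", 9000), ("probe", 9000)])
lost = receiver_responses([("announce", 9000)])
print(in_order, reordered, lost)
```

In the reordered case the prober sees a NAK followed by an ACK, which is why it shouldn't move on to the next size the moment the NAK arrives.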

Not everything has a MAC address.

Yikes! You really are a modern day René Descartes, aren't you?  :-)

Difference in link local addresses?
This sounds very much like turning Ethernet into token ring <grin>.

Ring networks are very cool, too bad we don't have them anymore.

Alternatively, we could add an RA option that administrators can use to
tell hosts the jumboframe size the layer 2 network supports. (The RA
option doesn't say anything about the capabilities of the _router_.)
Then the whole multicast taking-turns discovery isn't necessary, and we
can make do with a quick one-to-one verification before jumboframes are
used.

This still seems to fall foul of either requiring the administrator to
configure the router

Well, that's what administrators do, isn't it?

or degrading the entire network to the level of the
router.

No, the announcement "the switch can handle 4500 bytes" wouldn't have anything to do with "I can handle 1500".

It's probably a good idea to make announcements like this part of the protocol, but not as RA options. That way, switches can announce their own MTU capabilities, even if they don't otherwise support IPv6. So if the switch says that it can do 4500, we only have to try 4500 (ack) and 4504 (nak) and everything is much faster. (Unless the layer 2 network is more complex, of course, but then either 4500 gets a nak/timeout or 4504 gets an ack.)
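
As a sketch of that two-probe check: `probe_ok` is again a hypothetical "send this size, did an ack come back" callback, and the second return value just says whether the full search is still needed. Falling back to 1500 when the announced size fails is my assumption, not something a switch announcement would dictate.

```python
from typing import Callable, Tuple


def verify_announced(probe_ok: Callable[[int], bool],
                     announced: int) -> Tuple[int, bool]:
    """Check a switch-announced size; returns (usable MTU, search needed)."""
    if not probe_ok(announced):
        # Announced size doesn't make it end to end: more complex network.
        return 1500, True
    if probe_ok(announced + 4):
        # More room than announced: keep what works and search upward.
        return announced + 4, True
    # Expected outcome: announced size acked, 4 bytes more naked.
    return announced, False


simple = verify_announced(lambda s: s <= 4500, 4500)
smaller_path = verify_announced(lambda s: s <= 3000, 4500)
bigger_path = verify_announced(lambda s: s <= 9000, 4500)
print(simple, smaller_path, bigger_path)
```

In the simple case two probes settle it; the other two cases are the "more complex" layer 2 networks where the regular search takes over.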

Hm, maybe 4 bytes larger than an earlier maximum is always a good idea...

It would be even better if we could ask the switch what our port supports, but I'm not sure how to do this in such a way that a switch that doesn't support this protocol doesn't simply flood the request, which would make the results meaningless.
--------------------------------------------------------------------
IETF IPv6 working group mailing list
ipv6@ietf.org
Administrative Requests: https://www1.ietf.org/mailman/listinfo/ipv6
--------------------------------------------------------------------
