On 23-jul-2005, at 17:12, Perry Lorier wrote:

Before two hosts can transmit packets on Ethernet they have to undergo
neighbour solicitation to find the remote end's hardware address anyway.

Right.

When you send the neighbour solicitation you could pad the packet out
with multiple "MTU padding" options to make the packet the size of the
MTU you want to use.

I don't think we want to do this in the neighbor _solicitation_ because there is a good chance that the neighbor doesn't support the larger MTU so the packet is lost.

If the original host does not receive a reply, it updates its neighbour cache to say that the MTU supported is the MTU announced by the RAs (if no MTU is announced, it uses a configured minimum MTU for that link) and retransmits the query without any "MTU padding" options.

This is problematic for two reasons. First, when the packet gets lost because either the receiver or the layer 2 network doesn't support a sufficiently large MTU/MRU, there is a timeout, which wastes time. Second, if the receiver does in fact support jumboframes but of a smaller size than the sender supports, this isn't detected.

If this sounds insane, in my defense it's 3am and sounded like a good
idea at the time :)

:-)

I think our requirements are:

1. do not impede 1500-byte operation
2. discover and utilize jumboframe capability where possible
3. discover and utilize (close to) the maximum MTU
4. recover from sudden MTU reductions fast enough for TCP and similar to survive

First of all, we need for hosts to find out that their correspondents support a larger MTU/MRU. This can easily be done in an ND option.
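To make the ND option idea concrete, here is a minimal sketch of what such an option could look like on the wire. It simply borrows the layout of the existing MTU option (type, length in units of 8 octets, 16 reserved bits, 32-bit value). The option type value and the function names are placeholders I made up for illustration; no such option type is assigned.

```python
import struct

# Placeholder option type, NOT an assigned ND option number.
OPT_JUMBO_MRU = 253


def pack_mru_option(mru: int) -> bytes:
    """Encode an 8-byte option: type, length (8-octet units), reserved, MRU."""
    return struct.pack("!BBHI", OPT_JUMBO_MRU, 1, 0, mru)


def unpack_mru_option(data: bytes) -> int:
    """Decode the option produced by pack_mru_option and return the MRU."""
    opt_type, length, _reserved, mru = struct.unpack("!BBHI", data[:8])
    if opt_type != OPT_JUMBO_MRU or length != 1:
        raise ValueError("not a jumbo MRU option")
    return mru
```

A host would attach something like pack_mru_option(9216) to its neighbor advertisements, and the receiver would record the decoded value as the largest size worth testing toward that neighbor.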

Since we're not going to get cooperation from switches, let alone hubs, it's important that we send test packets to see whether the jumboframes actually make it to the other side. I think using ND for this isn't a good idea: in its current form, it doesn't support the required packet size, and when the padded packet doesn't make it, there are recovery complexities and delays.

So it makes sense to come up with a new protocol for this. An interesting notion here is that this protocol doesn't have to be IPv6-specific. However, in IPv6 we have neighbor unreachability detection, which we can use to find MTU reductions fast enough to fall back to 1500 bytes before bad things happen. In IPv4 or pure ethernet, we don't have that, and we also don't have neighbor discovery to exchange per-host MRU/MTU information.

If we use IPv6 for this, I think a new ICMP type makes sense. Whenever two systems (hosts or routers) on a link perform neighbor discovery, they can trigger the MTU verification immediately afterward, and if jumboframe support is confirmed by receiving the larger packets, the MTU for the neighbor can be updated. If the larger packets don't make it to the neighbor, there is no complexity and no delay: communication was already underway at 1500 bytes and continues without the need for further action.

However, this doesn't accommodate finding out jumboframe support at reduced sizes very well. For this, I think we should use an additional exchange, but this one should probably happen over multicast. Hosts/routers could take turns in a distributed search for the largest supported framesize. I think it's important that all jumbo-capable systems take part in this in order to deal with unusual topologies. For instance, consider a network with three switches: one supports 9000 bytes, another 8000, and the two are connected through a third switch that only supports 3000 bytes:


  A                C
  |                |
+-+--+  +----+  +--+-+
|9000+--+3000+--+8000|
+-+--+  +----+  +--+-+
  |                |
  B                D

Suppose all hosts support 9216 byte jumboframes.
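For anyone who wants to check the walkthrough that follows, the topology above can be modeled in a few lines. This is only a sketch: the switch labels S1, S2 and S3 and the function names are mine, and it assumes a frame is delivered exactly when its size fits through every switch on the path and the receiver's MRU.

```python
# Largest frame each switch forwards (values from the diagram above).
# The labels S1..S3 are made up for this sketch.
SWITCH_LIMIT = {"S1": 9000, "S2": 3000, "S3": 8000}
CHAIN = ["S1", "S2", "S3"]  # switches are connected in a line
ATTACHED = {"A": "S1", "B": "S1", "C": "S3", "D": "S3"}
HOST_MRU = 9216  # all hosts support 9216-byte jumboframes


def path_limit(src: str, dst: str) -> int:
    """Largest frame that survives every switch between src and dst."""
    i, j = sorted((CHAIN.index(ATTACHED[src]), CHAIN.index(ATTACHED[dst])))
    return min([HOST_MRU] + [SWITCH_LIMIT[s] for s in CHAIN[i:j + 1]])


def receivers(src: str, size: int) -> set:
    """Hosts that would receive a multicast test frame of `size` from src."""
    return {h for h in ATTACHED if h != src and size <= path_limit(src, h)}
```

For example, receivers("A", 9216) is empty because even the 9000-byte switch drops the frame, while receivers("A", 5596) contains only B.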

I think the most efficient way to handle this is to do two concurrent searches: one for the maximum packet size that can be used to at least one correspondent, and one for the minimum jumboframe size that is supported by all jumboframe supporting systems.

So first A sends out an announcement that it's going to send a 9216 byte and a 5596 (1500 + 4096) byte packet, and then sends the packets. Nobody receives the first packet, but everyone knows A sent it because of the preceding announcement, and B receives the second packet.

Then B would (for instance) send out its 9216 byte packet along with a 1500 + 2048 = 3548 byte packet, and also indicate the largest size that worked (5596) and the smallest size that didn't work (9216). A receives the 3548 byte packet but not the 9216 byte one.

C is next and sends out 9216 and (1500 + 1024 = ) 2524 byte packets, along with the information that no jumboframe size has worked so far. A, B and D all receive the 2524 byte packet.

D then sends out 9216 and (1500 + 1536 = ) 3036 byte packets with information that it received 2524 but not 3548. C receives the 3036 byte packet.

It's now A's turn again. A knows that the size that everyone can receive is between 2524 and 3036 and the size that at least one correspondent can receive is between 5596 and 9216. So it sends out 2780 and 7406 byte packets.

And so on.
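The probe sizes in these rounds are consistent with plain halving of the interval between the largest size known to work and the smallest size known to fail. A small sketch, assuming integer midpoints (the function name is only for illustration):

```python
def next_probe(largest_ok: int, smallest_failed: int) -> int:
    """Next test size: halfway between the known-good and known-bad sizes."""
    return (largest_ok + smallest_failed) // 2


# "Everyone" search, starting from 1500 after 5596 failed to reach all hosts.
assert next_probe(1500, 5596) == 3548
assert next_probe(1500, 3548) == 2524
assert next_probe(2524, 3548) == 3036
assert next_probe(2524, 3036) == 2780

# "At least one correspondent" search: 5596 worked, 9216 did not.
assert next_probe(5596, 9216) == 7406
```

Each system would keep two such (largest_ok, smallest_failed) pairs, one per search, and update them from what it receives and from what others report.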

After a few rounds like this, each system knows the maximum jumboframe size it can send/receive (so it can adjust its announcements in the ND option), and the minimum jumboframe size that everyone supports. It's probably doable to generalize this into any given number of levels, but I doubt that more than 3 is worth the trouble, and maybe even having two levels isn't. On the other hand, if some hosts support 9000 but the majority support 8192, it may be a good idea to forget 9000 and just do 8192.

This may sound horribly complex, but it really isn't.  :-)

The biggest challenge is probably making the different systems talk in turn, but that can probably be done by having a timer that depends on the difference in MAC address between the last system to transmit and prospective next one.
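One way such a timer could work, purely as an illustration: each system waits in proportion to how far its MAC address lies "after" the last transmitter's address, wrapping around the 48-bit address space, so the next-higher address speaks first and the last transmitter comes back last. The scaling, the 1-second window and the names are all assumptions of this sketch.

```python
MAC_SPACE = 1 << 48  # MAC addresses treated as 48-bit integers


def turn_delay(last_mac: int, own_mac: int, window: float = 1.0) -> float:
    """Seconds to wait before transmitting, given who transmitted last."""
    distance = (own_mac - last_mac) % MAC_SPACE
    if distance == 0:
        # We were the last to transmit, so we go to the back of the queue.
        distance = MAC_SPACE
    return window * distance / MAC_SPACE
```

A system that hears someone else transmit before its timer expires would restart the timer relative to the new transmitter. Note that addresses sharing a vendor prefix end up with nearly identical delays under this scaling, so a real design would need some way to spread them out.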

Extra credit: monitor spanning tree events for quick adaptation to changing layer 2 topologies.

Alternatively, we could add an RA option that administrators can use to tell hosts the jumboframe size the layer 2 network supports. (The RA option doesn't say anything about the capabilities of the _router_.) Then the whole multicast turn-taking discovery isn't necessary, and we can make do with a quick one-to-one verification before jumboframes are used.

--------------------------------------------------------------------
IETF IPv6 working group mailing list
ipv6@ietf.org
Administrative Requests: https://www1.ietf.org/mailman/listinfo/ipv6
--------------------------------------------------------------------
