On 23-jul-2005, at 17:12, Perry Lorier wrote:
> Before two hosts can transmit packets on Ethernet they have to undergo
> neighbour solicitation to find the remote end's hardware address
> anyway.
Right.
> When you send the neighbour solicitation you could pad the packet out
> with multiple "MTU padding" options to make the packet the size of the
> MTU you want to use.
I don't think we want to do this in the neighbor _solicitation_
because there is a good chance that the neighbor doesn't support the
larger MTU, so the packet is lost.
> If the original host does not receive a reply, it updates its neighbour
> cache to say that the MTU supported is the MTU announced by the RA's, if
> no MTU is announced then it uses a configured minimum MTU for that link
> and retransmits the query without any "MTU padding" options.
This is problematic for two reasons: when the packet gets lost
because either the receiver or the layer 2 network doesn't support a
sufficiently large MTU/MRU, there is a timeout, which wastes time;
and if the receiver does in fact support jumboframes but of a smaller
size than the sender supports, this isn't detected.
> If this sounds insane, in my defense it's 3am and sounded like a good
> idea at the time :)
:-)
I think our requirements are:
1. do not impede 1500-byte operation
2. discover and utilize jumboframe capability where possible
3. discover and utilize (close to) the maximum MTU
4. recover from sudden MTU reductions fast enough for TCP and similar
to survive
First of all, we need for hosts to find out that their correspondents
support a larger MTU/MRU. This can easily be done in an ND option.
Since we're not going to get cooperation from switches, let alone
hubs, it's important that we send test packets to see whether the
jumboframes actually make it to the other side. I think using ND for
this isn't a good idea: in its current form, it doesn't support the
required packet size, and when the padded packet doesn't make it,
there are recovery complexities and delays.
So it makes sense to come up with a new protocol for this. An
interesting notion here is that this protocol doesn't have to be IPv6-
specific. However, in IPv6 we have neighbor unreachability detection
which we can use to find MTU reductions fast enough to fall back to
1500 bytes before bad things happen. In IPv4 or pure ethernet, we
don't have that, and we also don't have neighbor discovery to
exchange per-host MRU/MTU information.
If we use IPv6 for this, I think a new ICMP type makes sense.
Whenever two systems (hosts or routers) on a link perform neighbor
discovery, they can trigger the MTU verification immediately
afterward, and if jumboframe support is confirmed by receiving the
larger packets, the MTU for the neighbor can be updated. If the
larger packets don't make it to the neighbor there is no complexity
and no delay: communication was already underway at 1500 bytes and
continues without the need for further action.
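The per-neighbor logic described above could look roughly like the following Python sketch. Everything in it is an assumption made for illustration: the `Neighbor` fields, the function names, and the idea that the confirmation carries the received size are hypothetical, and no message format is implied.

```python
from dataclasses import dataclass

BASE_MTU = 1500  # standard Ethernet MTU; always safe to use


@dataclass
class Neighbor:
    """Per-neighbor state; the field names are hypothetical."""
    advertised_mru: int      # largest size the neighbor says it can receive
    mtu: int = BASE_MTU      # size actually used when sending to it
    pending_probe: int = 0   # size of the outstanding test packet, 0 if none


def start_verification(n: Neighbor, local_mtu: int) -> int:
    """Called right after neighbor discovery completes.

    Returns the size of the test packet to send, or 0 if no probe is needed.
    """
    size = min(local_mtu, n.advertised_mru)
    if size <= BASE_MTU:
        return 0
    n.pending_probe = size
    return size


def on_probe_confirmed(n: Neighbor, size: int) -> None:
    """The neighbor reported that it received a test packet of `size` bytes."""
    if size == n.pending_probe:
        n.mtu = size
        n.pending_probe = 0


def on_probe_timeout(n: Neighbor) -> None:
    """No confirmation arrived: nothing to undo, traffic stays at 1500 bytes."""
    n.pending_probe = 0


def on_neighbor_unreachable(n: Neighbor) -> None:
    """Neighbor unreachability detection fired: fall back to 1500 bytes."""
    n.mtu = BASE_MTU
    n.pending_probe = 0
```

The point of the sketch is that the MTU is only ever raised on positive confirmation, so a lost test packet costs nothing beyond the probe itself.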
However, this doesn't accommodate finding out jumboframe support at
reduced sizes very well. For this, I think we should use an
additional exchange, but this one should probably happen over
multicast. Hosts/routers could take turns in a distributed search for
the largest supported framesize. I think it's important that all
jumbo-capable systems take part in this in order to deal with unusual
topologies. For instance, consider a network with three switches: one
supports 9000 bytes, another 8000, and the two are connected through a
third switch that only supports 3000 bytes:
  A                C
  |                |
+-+--+  +----+  +--+-+
|9000+--+3000+--+8000|
+-+--+  +----+  +--+-+
  |                |
  B                D
Suppose all hosts support 9216 byte jumboframes.
I think the most efficient way to handle this is to do two concurrent
searches: one for the maximum packet size that can be used to at
least one correspondent, and one for the minimum jumboframe size that
is supported by all jumboframe supporting systems.
So first A sends out an announcement that it's going to send a 9216
byte and a 5596 (1500 + 4096) byte packet, and then sends the
packets. Nobody receives the first packet, but everyone knows A sent
it because of the preceding announcement, and B receives the second
packet.
Then B would (for instance) send out its 9216 byte packet along with
a 1500 + 2048 = 3548 byte packet, and also indicate the largest size
that worked (5596) and the smallest size that didn't work (9216). A
receives the 3548 byte packet but not the 9216 byte one.
C is next and sends out 9216 and (1500 + 1024 = ) 2524 byte packets,
along with the information that no jumboframe size has worked so far.
A, B and D all receive the 2524 byte packet.
D then sends out 9216 and (1500 + 1536 = ) 3036 byte packets with
information that it received 2524 but not 3548. C receives the 3036
byte packet.
It's now A's turn again. A knows that the size that everyone can
receive is between 2524 and 3036 and the size that at least one
correspondent can receive is between 5596 and 9216. So it sends out
2780 and 7406 byte packets.
And so on.
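To make the two searches concrete, here is a small Python simulation of the example network. It is only a sketch under simplifying assumptions: the topology table, the function names, and the use of plain bisection from the first probe onward are all made up for illustration (the walkthrough above starts with 1500 plus power-of-two steps), and announcements and packet loss are not modelled.

```python
from itertools import cycle

# Example topology: which switch each host hangs off, and switch limits.
SWITCH_OF = {"A": "s9000", "B": "s9000", "C": "s8000", "D": "s8000"}
SWITCH_LIMIT = {"s9000": 9000, "s8000": 8000}
MIDDLE_LIMIT = 3000   # the switch connecting the other two
HOST_MAX = 9216       # all hosts support 9216 byte jumboframes


def path_limit(src, dst):
    """Largest frame that survives from src to dst in the example network."""
    a, b = SWITCH_OF[src], SWITCH_OF[dst]
    if a == b:
        return SWITCH_LIMIT[a]
    return min(SWITCH_LIMIT[a], MIDDLE_LIMIT, SWITCH_LIMIT[b])


def receivers(src, size):
    """Hosts that receive a multicast test packet of `size` bytes from src."""
    return {h for h in SWITCH_OF if h != src and size <= path_limit(src, h)}


def common_size(hosts="ABCD", lo=1500, hi=HOST_MAX + 1):
    """Shared search: largest size that every other host receives.

    lo is the largest size known to work, hi the smallest known to fail;
    hosts take turns sending the probe halfway between the two.
    """
    turn = cycle(hosts)
    while hi - lo > 1:
        size = (lo + hi) // 2
        sender = next(turn)
        if receivers(sender, size) == set(hosts) - {sender}:
            lo = size
        else:
            hi = size
    return lo


def max_size(sender, lo=1500, hi=HOST_MAX + 1):
    """Per-host search: largest size at least one other host receives."""
    while hi - lo > 1:
        size = (lo + hi) // 2
        if receivers(sender, size):
            lo = size
        else:
            hi = size
    return lo
```

In this model `common_size()` converges on 3000, `max_size("A")` on 9000 and `max_size("C")` on 8000. The midpoint rule also reproduces A's second-round probes: (2524 + 3036) // 2 = 2780 and (5596 + 9216) // 2 = 7406.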
After a few rounds like this, each system knows the maximum jumboframe
size it can send/receive (so it can adjust its announcements in the
ND option), and the minimum jumboframe size that everyone supports.
It's probably doable to generalize this into any given number of
levels, but I doubt that more than 3 is worth the trouble, and maybe
even having two levels isn't. On the other hand, if some hosts
support 9000 but the majority support 8192 it may be a good idea to
forget 9000 and just do 8192.
This may sound horribly complex, but it really isn't. :-)
The biggest challenge is probably making the different systems talk
in turn, but that can probably be done by having a timer that depends
on the difference in MAC address between the last system to transmit
and prospective next one.
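One possible reading of such a timer, as a Python sketch: each system waits in proportion to how far its MAC address lies "after" the last sender's, wrapping around at the top of the 48-bit space. The function names, the linear mapping and the `max_wait` parameter are assumptions for illustration only.

```python
MAC_SPACE = 1 << 48  # number of possible 48-bit MAC addresses


def mac_to_int(mac: str) -> int:
    """Parse 'aa:bb:cc:dd:ee:ff' (or with dashes) into an integer."""
    return int(mac.replace(":", "").replace("-", ""), 16)


def turn_delay(last_mac: str, my_mac: str, max_wait: float = 1.0) -> float:
    """Seconds this system waits before taking its turn.

    The delay grows with the distance from the last sender's MAC address
    to ours, counted upward and wrapping around.
    """
    distance = (mac_to_int(my_mac) - mac_to_int(last_mac)) % MAC_SPACE
    if distance == 0:
        # We were the last sender ourselves: wait the full period.
        distance = MAC_SPACE
    return max_wait * distance / MAC_SPACE


def next_sender(last_mac: str, candidates: list[str]) -> str:
    """The candidate whose timer would expire first."""
    return min(candidates, key=lambda mac: turn_delay(last_mac, mac))
```

A purely linear mapping like this gives nearly identical delays to interfaces with nearby addresses (for example from the same vendor), so a real design would need some way to spread those out.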
Extra credit: monitor spanning tree events for quick adaptation to
changing layer 2 topologies.
Alternatively, we could add an RA option that administrators can use
to tell hosts the jumboframe size the layer 2 network supports. (The
RA option doesn't say anything about the capabilities of the
_router_.) Then the whole turn-taking multicast discovery isn't
necessary, and a quick one-to-one verification before jumboframes
are used is enough.
--------------------------------------------------------------------
IETF IPv6 working group mailing list
ipv6@ietf.org
Administrative Requests: https://www1.ietf.org/mailman/listinfo/ipv6
--------------------------------------------------------------------