On 09/15/2012 06:06 AM, Simon Wilkinson wrote:
My best guess for what may have been attempted in the code: when the
calling application defines an error via rx_SetMsgsizeRetryErr, we kill
the call immediately with an error (e.g. RX_MSGSIZE). Otherwise, we try
to force the 1400-byte packet through, and lower packet sizes back to
the discovered MTU as soon as we can. If fragments can't get through,
the call dies with a network error.
Hi Andrew,
I think your understanding of the PMTU code is roughly correct. It pretty much
matches what I worked out last time I looked at this.
One critical thing that I think your overview misses is that we have two
different types of MTU discovery. In my notes, I've taken to calling these low
and high MTU.
High MTU is where we attempt to discover if the MTU of the link is larger than
the RX packet size. Code to do this has been in the tree for a while - Derrick
reworked this as part of the YFS grant work, but I don't think he ever got
something that worked. High MTU discovery uses ICMP errors, the DF flag, and
works in approximately the same way as TCP PMTU discovery, with the exception
(as you note) that we can't resize existing RX packets.
When I looked at this last, my intention was to use high MTU discovery as a
means of safely enabling jumbograms. Rather than using jumbograms to go over
the known MTU (which causes fragmentation, and all of the problems that
jumbograms are known for), you'd use jumbograms to combine RX packets to just
below the discovered MTU. Doing this avoids all of the problems of jumbograms,
and means that we don't have to get into creating oversize RX packets, which
has its own pitfalls.
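
As a rough illustration of the packing idea above (a sketch, not taken from the RX source): given a discovered MTU, work out how many full RX packets could be combined into one jumbogram without exceeding it. The header and buffer sizes below are assumptions chosen for the example, and packets_per_jumbogram is a hypothetical name.

```python
# Sketch only: all constants below are assumptions, not read from the RX source.
IP_HEADER = 20        # IPv4 header without options
UDP_HEADER = 8
RX_HEADER = 28        # assumed size of the RX wire header
JUMBO_BUFFER = 1412   # assumed payload size of each full jumbogram segment
JUMBO_HEADER = 4      # assumed per-segment jumbo header


def packets_per_jumbogram(discovered_mtu: int) -> int:
    """Return how many full RX packets fit in one datagram within the MTU."""
    room = discovered_mtu - IP_HEADER - UDP_HEADER - RX_HEADER
    if room < JUMBO_BUFFER:
        return 1  # no space to combine; send ordinary single packets
    # Every segment except the last is followed by a jumbo header.
    return 1 + (room - JUMBO_BUFFER) // (JUMBO_BUFFER + JUMBO_HEADER)
```

Under these assumed sizes, a 1500-byte MTU leaves room for only one packet per datagram, while a larger discovered MTU allows several to be combined with no IP fragmentation.
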
Low MTU is where the MTU of the link is smaller than the RX packet size. This
is the case that Derrick discovered at the conference at UIUC and wrote code to
work around. Low MTU detection doesn't use the traditional path MTU discovery
code, but instead uses padded RX ping packets. If we don't get a response to a
ping packet of a certain size, then we resend the ping with a lower size. When
we eventually get a response, that's the MTU of the link. This is the code that
uses rx_SetMsgsizeRetryErr - if that's registered, and we aren't making
progress because of MTU, then the call will be failed with that error, and the
application can retry, and thus get a smaller packet size.
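
The probing loop described above might be sketched as follows. This is only an illustration: send_padded_ping is a hypothetical callable standing in for "send a padded RX ping and wait for a reply", and the candidate sizes are made up for the example rather than taken from the code.

```python
from typing import Callable, Optional, Sequence


def probe_low_mtu(send_padded_ping: Callable[[int], bool],
                  sizes: Sequence[int]) -> Optional[int]:
    """Try ping sizes from largest to smallest; return the first one answered.

    send_padded_ping is hypothetical: it sends one padded ping of the given
    size and returns True if a response arrived before a timeout.
    """
    for size in sorted(sizes, reverse=True):
        if send_padded_ping(size):
            return size  # largest probed size that got through
    return None  # nothing answered; the peer may simply be unreachable


# Simulated path that silently drops anything over 1200 bytes.
def fake_path(size: int) -> bool:
    return size <= 1200


found = probe_low_mtu(fake_path, [1444, 1300, 1200, 1000, 576])
```

Note that the result is only as precise as the set of sizes probed: the real link MTU lies somewhere between the returned size and the next larger candidate.
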
To my mind, keeping the two of these separate makes sense at present. There are
a lot of questions around support for setting the DF flag, and getting the ICMP
errors delivered to the RX stack, especially when that stack is in userspace.
The low MTU detection should work everywhere. Last time I looked, low MTU had
some issues - in particular, it was using hard ACKs to determine whether a call
was making progress, when actually the presence of soft ACKs is sufficient (you
don't care that the packet has reached the application, just that it has been
successfully received by the network stack).
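
To make the hard/soft ACK distinction concrete, here is a minimal sketch of a progress check that accepts either kind. AckSnapshot and its fields are hypothetical summaries of an ACK packet invented for this example, not the actual RX structures.

```python
from dataclasses import dataclass


@dataclass
class AckSnapshot:
    # Hypothetical summary of one received ACK packet.
    first_unacked: int  # sequence number of the first packet not yet hard-ACKed
    soft_acked: int     # count of later packets the peer's RX layer has received


def made_progress(prev: AckSnapshot, cur: AckSnapshot) -> bool:
    """Treat either a hard-ACK advance or newly soft-ACKed packets as progress."""
    hard_advance = cur.first_unacked > prev.first_unacked
    # Total packets acknowledged in any form, hard or soft.
    total_prev = prev.first_unacked + prev.soft_acked
    total_cur = cur.first_unacked + cur.soft_acked
    return hard_advance or total_cur > total_prev
```

A check based on hard ACKs alone would report no progress when first_unacked stays put, even though new soft ACKs show that packets of the current size are getting across the path.
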
It would be good to keep discussing this. Like most of RX, this code is all a
bit tangled, and I think discussing overall design intent is a great way to
make sure that the patches do what we all expect them to!
Is this already documented somewhere outside of the source code? Should
this be in the wiki?
Jason