Re: [Int-area] Please review: draft-savola-mtufrag-network-tunneling-04.txt

Joe Touch Thu, 15 Sep 2005 11:12:50 -0700


Pekka Savola wrote:
> On Tue, 30 Aug 2005, Joe Touch wrote:
> 
>>> I think we just have to agree to disagree.  Over half a dozen people
>>> have stated that they find the draft very useful, and at least two ADs
>>> have been willing to sponsor it.  There has also been discussion on
>>> various lists, particularly from the ops perspective (e.g.,
>>> http://www.merit.edu/mail.archives/nanog/2004-10/threads.html#00125).
>>
>>
>> A quick scan of that list didn't note anyone who saw the relationship to
>> RFC2003. Since the I-D didn't cite that, the (IMO) definitive current
>> work on the topic, it's not clear what the positive input on this draft
>> really means (maybe that they should read RFC2003, i.e.).
> 
> If it makes you happier, I could try querying the folks on whether they
> think the draft is useful after taking a look at RFC 2003 -- if you
> provide the list of section numbers of RFC 2003 I should refer them to
> read in particular.
> 
> I'm pretty sure I already know the answer, but if that helps you drop
> the argument, I can try it..


It'd be useful either way. The fact that they didn't notice 2003's
omission in the first place suggests they might not be the appropriate
group to poll on whether a new RFC on this issue is useful, though.

...
>>>> This is just making a decision at the tunnel that you prefer efficiency
>>>> to PMTUD support. 2003 doesn't say it's OK to make that decision, esp.
>>>> since it's not detectable from the endpoints. I don't think this is a
>>>> compelling reason, as a result.
>>>
>>>
>>> That may or may not be a compelling reason.  Many operators are doing so
>>> regardless, and vendors are supporting it.  I don't think there is
>>> anything to be gained (but rather lost) by having the IETF be in denial
>>> about this.  This document is precisely not a BCP because I want to
>>> document the *operational* issues (and solutions, even if not optimal
>>> ones).
>>>
>>> However, having disclaimers e.g., noting clearly that the behaviour is
>>> incompliant and may break things is fine with me, and I certainly want
>>> to make such a point myself.
>>>
>>> So, could you check if there's text which should be added or reworded to
>>> make this clearer?  Suggestions would be welcome.
>>
>> Text that describes the things it WILL break. Why are vendors supporting
>> it? Is it just further denial by others that RFC2003 exists or has
>> precedence here? Or is there real utility?
> 
> I already described one example of a utility w/ avoiding high-speed
> reassembly at a router, which just simply does not work.  I could query
> for others from vendors, but I don't think those use cases belong in the
> draft at least in a verbose manner. Including the use cases would seem
> to make even more compelling case for causing more violations in those
> cases than just being silent.  Further, it would cause ratholing
> discussions on whether folks see those use cases being really required
> or not, even though the specifics (and discussion about them) is not the
> point, just that the operators ARE using them and probably for good
> reasons.

Whether vendors fail to support requirements and whether there are good
reasons for their doing so are two different things; ignorance and
apathy can be as likely as real cause. You're right that we don't need
to go into a poll of current products or deployments, but I'm
uncomfortable with asserting that we have to live with it or change the
requirement because vendors don't correctly support it. If that were the
case, there would be no point to docs like RFC2525 (known TCP
implementation problems).

The key here is whether the point of this doc is to suggest that the
spec be modified for a real reason, or to document current
violations of spec in this regard. The third option - which is how it's
currently described - is "here's what people do; it's a violation, but
people do it". I don't believe it's appropriate to even indirectly
validate broken implementations because of a herd mentality, so one of
the first two is preferable. I.e., either it has to be "there are real
reasons and the spec should be changed", or "people do it, it's a
violation, and here's where it causes problems".

And I don't buy the claim that people do it because they can't implement
it; ATM segmentation and reassembly works fine at very high speeds too,
and although segments must arrive in order, they are much smaller data
chunks (typically). Sure, it's expensive, but that's not a reason to
violate a spec.

>> We already know - as described in the new PMTUD doc, as well as in other
>> places - that this practice has issues. Why do we need another document
>> that says so? Or if we do, why should it be an RFC (if the purpose is to
>> educate or discuss, a tech report or magazine paper would suffice).
> 
> PMTUD docs are describing new PMTUD mechanisms, not focusing on the
> narrow problem and solutions which are used *today*.  This is much more
> focused and shorter, hence more likely to be read.
> 
> I think the RFC series is most appropriate here because it causes most
> dissemination (and maybe follow-up activity) among the IETF and
> operators from my perspective.  But that decision is irrelevant to this
> discussion.

This argues that an article might be useful, but that an RFC is
definitely not. If this is a snapshot of current use, why clutter the
RFC series with it?

We don't issue hurricane warnings by RFC either, so the "widely read"
case doesn't wash IMO either.

>>>> Reassembly doesn't require infinite buffers; the fragments need be held
>>>> only for a network MSL, and the IP ID numbers are required not to wrap
>>>> during that period, so there's no wrap problem if everybody stays to
>>>> spec.
>>>
>>>
>>> That's a lot of assumptions.  60 seconds (mentioned in RFC1122, sect
>>> 3.2.2) on 10 Gbit/s interface is basically infinite buffers.  Even if
>>> the time was 1 second, it would basically mean infinite buffers.
>>
>>
>> That presumes that the entire bandwidth is used between two hosts, and
>> it does mean that fragmentation should be avoided in those cases.
> 
> 
> Sure.  In this case, we're talking about high-speed encap/decap (on
> hardware) between routers.  While the whole bandwidth is probably not
> used up there, it could be -- and the problem is sufficiently serious
> with much lower rates as well.

ATM does this at MUCH higher rates - and has for more than a few years.
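
For scale, here is a rough sketch of the arithmetic behind the buffer
figures quoted above. It assumes the worst case Pekka describes (every
bit arriving on the link belongs to a datagram still awaiting
reassembly); the function name is illustrative only.

```python
def reassembly_buffer_bytes(link_bits_per_s: float, hold_seconds: float) -> float:
    """Worst-case bytes held if the whole link is fragments awaiting reassembly."""
    return link_bits_per_s * hold_seconds / 8


TEN_GBPS = 10e9  # 10 Gbit/s interface, as in the quoted example

for hold in (60, 1):  # 60 s reassembly timeout vs. a 1 s hold time
    gigabytes = reassembly_buffer_bytes(TEN_GBPS, hold) / 1e9
    print(f"{hold:>2} s hold at 10 Gbit/s: {gigabytes:g} GB worst case")
```

Under that assumption the bound is 75 GB for a 60 second hold and
1.25 GB for a 1 second hold; it shrinks in proportion to the share of
traffic that is actually fragmented.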

>> It's
>> even OK to drop packets that are fragments altogether if they hit a
>> tunnel that won't carry them.
> 
> If we assume MTU=1500, would it be OK to drop all the packets with size
> 1472 (or something thereabouts) which have DF bit set? 

Not only "OK", but _required_ as I read the specs. Yes, things that
don't adjust will see a blackhole - which we already know.

> That's what
> we're talking about here (and the unfeasibility of signalling back
> "Packet too Big" to all those sources sending >1472 (or whatever)
> packets).  While RFC791 says it's OK, it certainly isn't in practise.

http://www.caida.org/analysis/workload/fragments/sdscposter.xml

That shows that "A significant portion of the fragmented traffic that
crosses the UCSD-CERF link is tunneled traffic.". So it's practice by at
least one analysis.

Having seen it on other tunnels all over the place (I do a lot of tunnel
research, FWIW), I would agree that it's possible to hit places that do
violate spec, but it hasn't affected the majority of paths.

>> It's still NOT OK to clear the DF bit in the outer header when it's set
>> in the inner, though.
> 
> Sure, sure -- but that's still being done widely out there.  What would
> you prefer? Be hush-hush about it?  I want to bring the problems out in
> the open, with appropriate disclaimers of course.

As above, there are three options, and only the first two are defensible:

        1. argue that the spec should be changed for cause
        2. argue that the violating implementations should be changed

"Everyone does it", or even "many implementations do it" isn't a
sufficient reason to change a spec that has compliant alternatives (drop
the packets) especially when the correct operation of some protocols
(PMTU) depends on that behavior.

>>> Further, if you look at
>>> http://www.watersprings.org/pub/id/draft-mathis-frag-harmful-00.txt, the
>>> IP ID numbers wrap many times during that period.
>>
>> Since that doc is an expired ID, I presume you know why I won't address
>> its contents. When IP ID numbers wrap, that is a violation of RFC791 and
>> has known problems. There are ways to overcome it, esp. using aliases of
>> multiple IP addresses per physical interface, to avoid such wrap. But
>> ignoring the problem or accepting violating implementations is not the
>> appropriate solution.
> 
> I have no control on why the authors didn't continue the work; I
> certainly provided some suggestions for enhancements myself (note: if
> the authors are listening, I could consider picking up the draft if
> you'd like).  But that aside..
> 
> Requiring hundreds or thousands of IP aliases to overcome IP ID wrapping
> is not a [real] solution.  Could you please state a better one if one
> exists?

Huh? You want to change everyone's spec for tunneling, but the IP ID
space is not on the table? Why not have a 'large ID option'?

Besides, most hosts don't have an IP ID problem; even though gigabit
interfaces are common, the wrap problem comes up in supercomputer
contexts primarily.
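
To put numbers on when wrap becomes an issue, a back-of-the-envelope
sketch follows. The assumptions are mine, not from any spec: a single
(source, destination, protocol) triple sending back-to-back 1500-byte
packets at line rate, drawing from one 16-bit ID counter.

```python
ID_SPACE = 2 ** 16  # 65,536 values in the 16-bit IPv4 Identification field


def seconds_to_wrap(link_bits_per_s: float, packet_bytes: int) -> float:
    """Time for one sender to use every ID once at a sustained line rate."""
    packets_per_s = link_bits_per_s / (packet_bytes * 8)
    return ID_SPACE / packets_per_s


for name, rate in (("100 Mbit/s", 100e6), ("1 Gbit/s", 1e9), ("10 Gbit/s", 10e9)):
    print(f"{name}: IDs wrap after about {seconds_to_wrap(rate, 1500):.2f} s")
```

The result only matters when one flow really sustains that rate toward
one destination, which is the supercomputer-style case rather than the
typical host.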

> Wrt. the incompliancy, are you referring to this:
> 
>     The choice of the Identifier for a datagram is based on the need to
>     provide a way to uniquely identify the fragments of a particular
>     datagram.  The protocol module assembling fragments judges fragments
>     to belong to the same datagram if they have the same source,
>     destination, protocol, and Identifier.  **Thus, the sender must choose
>     the Identifier to be unique for this source, destination pair and
>     protocol for the time the datagram (or any fragment of it) could be
>     alive in the internet.**
> 
>     It seems then that a sending protocol module needs to keep a table
>     of Identifiers, one entry for each destination it has communicated
>     with in the last maximum packet lifetime for the internet.

It needs to ensure that the first paragraph isn't violated, which a
later paragraph addresses via a simpler mechanism that is more typical
in current use:
    However, since the Identifier field allows 65,536 different values,
    some host may be able to simply use unique identifiers independent
    of destination.
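
A toy sketch of why the uniqueness rule matters: the reassembler
matches fragments on (source, destination, protocol, Identifier), so a
reused ID can splice two datagrams together. The class, the byte-based
offsets (RFC791 uses 8-octet units), and the addresses are all
simplifications for illustration.

```python
from collections import defaultdict


class Reassembler:
    """Toy reassembly table keyed by (src, dst, protocol, identification)."""

    def __init__(self):
        # key -> {byte offset: (payload, more_fragments flag)}
        self.pending = defaultdict(dict)

    def add(self, src, dst, proto, ident, offset, payload, more_fragments):
        """Store one fragment; return the datagram if it is now complete."""
        key = (src, dst, proto, ident)
        self.pending[key][offset] = (payload, more_fragments)
        return self._try_complete(key)

    def _try_complete(self, key):
        data, expected = b"", 0
        for offset in sorted(self.pending[key]):
            payload, more = self.pending[key][offset]
            if offset != expected:
                return None  # a gap remains, keep waiting
            data += payload
            expected += len(payload)
            if not more:  # last fragment reached with no gaps
                del self.pending[key]
                return data
        return None


r = Reassembler()
# Datagram A, ID 7: first fragment arrives, its second fragment is lost.
r.add("10.0.0.1", "10.0.0.2", 6, 7, 0, b"A" * 8, True)
# The ID wraps while A's fragment is still held; datagram B reuses ID 7
# and only its last fragment arrives.
spliced = r.add("10.0.0.1", "10.0.0.2", 6, 7, 8, b"B" * 8, False)
print(spliced)  # half of A joined to half of B
```

The table has no way to tell the two datagrams apart, which is why the
ID must stay unique for as long as any fragment could still be alive.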

> (esp last sentence of 1st paragraph and 2nd para.)
> 
> I don't see any real solutions here.  I don't think it's acceptable to
> refuse to send any more packets to destination X until an ID slot is
> freed (a buffering problem, a denial of service issue), as it is not
> acceptable to switch source IP addresses because the peer would likely
> no longer recognize the session because the IPs changed.

If it's not acceptable, then we need to redo RFC791. Right now, even
when DF is set, the rule still applies.

Note that although RFC791 defines the ID field as for fragmentation, it
never says that, if the DF bit is set, the frag field is not
meaningful or may be set to 0.

RFC1122 does indicate that the IP ID field can be used by routers to
omit duplicate IP packets:

                (2) a
                 congested gateway might use the IP Identification field
                 (and Fragment Offset) to discard duplicate datagrams
                 from the queue.

> But no matter what, that is not really relevant for this draft.  The
> practice exists for (probably) valid operational reasons.  We can't deny
> that.

See above as to why this is exactly at issue IMO.

> Again my question would be, what changes would you like to see in the
> draft to make this clearer?

I made some above...

>>> If you'll look at the details in the draft, the frag/reasm does work --
>>> after the PMTUD violation -- because the *recipient* (a host!) performs
>>> reassembly and because the stream is thus small enough not to cause IP
>>> ID wrapping in that (src, dst, ID) context.  Therefore the reasm is no
>>> longer too "high speed". So, I think the approach "works" to some
>>> definition of "works" -- it certainly seems to solve problems some
>>> operators have been having.
>>
>>
>> "For some definition of works" isn't what the IETF is about, IMO.
> 
> I think that's a direct consequence of the IETF not providing a better
> solution, so the operators and vendors have to do with what's out there
> -- but that's not the point here.

I think it is. RFCs are not places to document current practice per se;
the point of the practice is at issue, and it's either a bug (which I
think this is) or a reason to change the spec, and if the latter is the
point, the doc needs a better argument than "everyone does it".

>>> Breaking PMTUD is of course unfortunate especially for v6, but maybe
>>> more or less a consequence of the IETF being in denial about the
>>> problem.
>>
>>
>> The IETF may be in denial about the problem, but addressing violations
>> of spec as anything other than such is worse - it doesn't move us
>> forward to change the spec in useful ways (increasing the IP ID space,
>> dealing with fragmentation more usefully, etc.).
> 
> I'm open to text suggestions on how to make the spec violations (or
> whatever) clearer, and possibly add a "call for action" statements if
> you'd like.  Any other text suggestions as long as it includes a
> description of this operationally-used technique would probably be fine
> as well.

Contact me off-list if I can clarify ways to do what I proposed above,
if they're not obvious from the point I'm making. The key is to pick a
side - change the spec, or describe violations as such only.

>>>> A packet with the DF bit set with MF also set is an error as per
>>>> RFC791.
>>>> The discussion implies that this can exist - or are you talking about
>>>> clearing only the outer DF?
>>>
>>>
>>> I'm not 100% certain as this was text submitted to me by an earlier
>>> reviewer.  But I'm not sure if I understand -- what is the scenario
>>> where you would have both DF and MF?  I don't see one -- if you clear
>>> the DF bit, by definition you don't get any packets with MF from the
>>> path before clearing the DF bit; and if you have cleared the DF bit, you
>>> may get MF bit on some packets after that point, but no longer DF bit.
>>
>>
>> The text isn't clear on that. The example you've given would be useful
>> to add.
> 
> OK.  I tried to see how to add that in the text, but couldn't figure a
> way how to incorporate so it would fit in nicely.  Could you propose
> which exact clarification wording (and where) would be helpful?

It may be in multiple places. The part where you mention setting the DF
needs to always examine the MF bit. This is an issue only in headers you
do not generate - i.e., for the inner header. The case where it seemed
ambiguous, and a simple way to fix it:

--- existing ---
   When desiring to avoid fragmentation, IPv4 allows two options: copy
   the DF bit from the inner packets to the encapsulating header, or
   always set the DF bit.  The latter is better especially in controlled
   environments, because it forces PMTUD to converge immediately.
--- proposed ---
   When desiring to avoid fragmentation, IPv4 allows two options: copy
   the DF bit from the inner packets to the encapsulating header, or
   always set the DF bit of the outer header.  The latter is
   better especially in controlled
   environments, because it forces PMTUD to converge immediately.
---
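
A minimal sketch of the two outer-header options in the proposed text,
plus the compliant ingress decision discussed earlier (drop and signal
when the inner DF is set and the packet won't fit). The function names,
the policy strings, and the 20-byte encapsulation overhead are
assumptions for illustration; only the DF and MF bit positions come
from RFC791.

```python
DF = 0x4000  # Don't Fragment bit in the 16-bit flags/fragment-offset word
MF = 0x2000  # More Fragments bit in the same word


def outer_flags(inner_flags_word: int, policy: str) -> int:
    """Flags for the freshly generated outer header under one of two policies."""
    if policy == "copy":
        # Copy only DF; the encapsulator never copies the inner MF bit.
        return inner_flags_word & DF
    if policy == "always-set":
        return DF
    raise ValueError(f"unknown policy: {policy}")


def ingress_action(inner_len: int, inner_flags_word: int,
                   tunnel_mtu: int, overhead: int = 20) -> str:
    """What a compliant tunnel ingress does with one inner packet."""
    if inner_len + overhead <= tunnel_mtu:
        return "forward"
    if inner_flags_word & DF:
        # Too big and the sender forbade fragmentation: drop it and send
        # ICMP "fragmentation needed" so PMTUD can converge.
        return "drop-and-signal"
    return "fragment"


print(ingress_action(1480, DF, 1500))  # fits once the outer header is added
print(ingress_action(1481, DF, 1500))  # too big, inner DF set
print(ingress_action(1481, 0, 1500))   # too big, fragmentation allowed
```

Clearing the outer DF when the inner DF is set is deliberately not
modelled here, since that is the behavior under dispute.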

Joe


_______________________________________________
Int-area mailing list
[email protected]
https://www1.ietf.org/mailman/listinfo/int-area
