Short version: Discussion of SEAL and in particular of the differing approaches Fred and I have about PMTUD. Fred seems to support continued use of IPv4 DF=0 "fragmentable in the network" packets while I think it should be deprecated. (It is banned in IPv6.)
Fred seems to support RFC 4821 PMTUD, at least for packets longer than 1500 bytes. I am opposed to RFC 4821 - because I think it is over-defensively trying to cope with problems which should be fixed, rather than tolerated and therefore encouraged. ETEs (ETRs) can't be expected to try to detect packets arriving supposedly from legitimate ITEs (ITRs) but which are sent by an attacker.

Hi Fred,

I plan to write an "IRON: SEAL summary V2" based on what I learnt from your two recent on-list messages and one off-list message. Here is my response to your first on-list message and some elements of your off-list message.

From your off-list message, I understand:

* The longest IPv6 prefix length IRON/RANGER is intended to support is /56. This sounds OK to me, but at present I plan Ivip to work in integer units of /64. Maybe that can be scaled back to /56 nearer the time of deployment.

* I will assume that the route redirect messages currently specified in RANGER (native ICMP IPv4 and IPv6 redirects) will be replaced by SEAL messages. This will allow the inclusion of a caching time.

* I understand that the North Island IRON router will first of all send the traffic packet to the Seattle router, based on its VP prefix 43.0.0.0 /16 in its FIB. The Seattle router somehow gets the packet to the correct IRON router in the South Island - and sends a redirect to the North Island IRON router. That redirect causes the North Island IRON router to install a more-specific prefix in its FIB, for 43.0.56.76 /30, with the path leading to the correct South Island router. So subsequent traffic packets addressed to this prefix will be tunneled direct to the correct South Island router - this is RANGER's Route Optimization process.

* I suggested, without any real thought, a 10 minute time by which an IRON router would purge its FIB of any more-specific route for a particular EUN (End User Network) EID prefix if there was no traffic for it. You suggest a 2 minute STALETIME - which seems fine to me.
  So a router which receives a SEAL redirect would maintain it in its FIB for the caching time, as long as packets keep arriving for it at intervals of less than STALETIME - and would purge it after STALETIME if no packets arrive in that time.

* I only have a rough idea of how the EUN router creates "bubbles" with RANGER's (IPv6's?) RA (Router Advertisement) messages, as a means of the one or more IRON routers of its ISPs securely registering themselves as the correct destination for packets whose destination address matches 43.0.56.76 /30. To what extent does this involve adding things to IPv6 - and to what extent is it practical, and with what additions, for IPv4?

* I think we have very different goals regarding support for DF=0 packets, and for dealing with problems in the network such as tunnels which don't support RFC 1191 / RFC 1981 PMTUD and filters which drop PTB packets. I think you support creating protocols which can cope with this stuff, including potentially very long DF=0 packets. I think these filtering and bad tunnel arrangements need to be fixed - and that DF=0 should be deprecated.

>>> SEAL explicitly turns off PMTUD and uses its own tunnel
>>> endpoint-to-endpoint MTU determination, so in the normal
>>> case it does not expect to receive any ICMP PTBs from
>>> routers within the tunnel.
>>
>> My understanding is that this is only true for IPv4, because the SEAL
>> ITE (Ingress Tunnel Endpoint) sends packets with DF=0 to the ETE
>> (Egress Tunnel Endpoint). For IPv6, the ITE can get PTBs from
>> routers in the tunnel, since no packets are fragmentable. So I think
>> it would not be true to state that the SEAL ITE "turns off" the
>> traditional IPv6 RFC 1981 PMTUD mechanism when it tunnels packets to
>> the ETE.
>
> Yes, that's right. My mind has been so locked into the
> IPv4 case that I forget that IPv6 does not allow
> fragmentation in the network. So, you are right that
> IPv6 as the outer protocol requires RFC1981 PMTUD
> feedback from the network.
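Going back to the STALETIME point above: here is a minimal sketch, in Python, of that purge-or-refresh logic. All the names (RedirectCache, the prefix and next-hop strings) are invented for illustration, and the 120 second value is just Fred's suggested 2 minute STALETIME - nothing here comes from the SEAL or RANGER drafts themselves.

```python
import time

STALETIME = 120  # seconds; the 2 minute value suggested by Fred

class RedirectCache:
    """More-specific routes learned from SEAL redirects, aged by last use."""

    def __init__(self):
        self._routes = {}  # prefix -> (next_hop, time of last matching packet)

    def install(self, prefix, next_hop, now=None):
        self._routes[prefix] = (next_hop, now if now is not None else time.time())

    def lookup(self, prefix, now=None):
        """Return the next hop for a traffic packet, refreshing the entry's
        timer; purge the entry if no packet has matched it for STALETIME."""
        now = now if now is not None else time.time()
        entry = self._routes.get(prefix)
        if entry is None:
            return None
        next_hop, last_used = entry
        if now - last_used > STALETIME:
            del self._routes[prefix]            # no traffic for STALETIME: purge
            return None
        self._routes[prefix] = (next_hop, now)  # traffic refreshes the timer
        return next_hop
```

Each arriving packet that matches the more-specific route refreshes its timer, so the route survives indefinitely under steady traffic but disappears two minutes after the last packet.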
I understand from what follows that you intend to rewrite SEAL not to use the IPv6 Fragment Header, but to use an explicit SEAL header by which the ITE can request that the ETE acknowledge receipt of the packet. This would mean that even if no PTBs were arriving at the ITE due to an MTU limit in a router in the ITE -> ETE path, the ITE could try various packet lengths until it found a length short enough to avoid the lowest MTU limit.

>>> SEAL *can* enable PMTUD for certain "expendable" packets,
>>
>> I don't recall what these would be.
>
> Out-of-band probes, e.g.

OK - I guess such as those I just mentioned.

>> Is there a mechanism for SEAL, in IPv4, to send these "expendable"
>> packets with DF=1?
>
> Yes; just set DF=1 in the outer IPv4 header and send it.

OK.

>>>>> In some environments, it may be necessary to insert a
>>>>> mid-layer UDP header in order to give ECMP/LAG routers
>>>>> a handle to support multipath traffic flow separation.
>>>>
>>>> http://en.wikipedia.org/wiki/Equal-cost_multi-path_routing
>>>>
>>>> http://www.force10networks.com/CSPortal20/TechTips/0065_HowDoIConfigureLoadBalancing.aspx
>>>>
>>>> As far as I know, these techniques are not something to consider with
>>>> the RANGER CES, or with LISP or Ivip. If the routers can handle
>>>> ordinary traffic packets they can handle encapsulated packets too. I
>>>> haven't read about these techniques in detail. I guess that within
>>>> RANGER, beyond its use as a CES scalable routing solution, you may
>>>> want to support ECMP and LAG.
>>>
>>> There has been a great deal of talk about taking care
>>> of ECMP/LAG routers within the network that only
>>> recognize common-case protocols (i.e., TCP and UDP),
>>> which is why LISP has locked into using UDP encaps.
>>
>> Do you expect this to be the case for IRON? If so, then I guess that
>> SEAL in IRON must always use this UDP header before the SEAL header -
>> since no ITE could know for sure whether ECMP/LAG is in use on the
>> path to the ETE.
>
> Yes, I guess so.

OK, but ...

>> If anyone can point me to good references on ECMP used with LAG, I
>> would really appreciate it. I am keen to read more about this.

Here you seem to agree with my suggestion that the "mid-layer" headers go after the IPv4/6 header and before the SEAL header. However, I see from Figures 1 and 2 in:

  http://tools.ietf.org/html/draft-templin-intarea-seal-08

that these "mid-layer" headers are after the SEAL header. As far as I know, placing a UDP header after the SEAL header would make it invisible to ECMP/LAG routers. So I think that if the packets have to be UDP packets to keep these ECMP/LAG routers happy, the SEAL header and all that follows must be part of the UDP payload.

Assuming you did this (and I am not suggesting you need to, since I haven't yet read up on ECMP/LAG), and if you wanted to send IPv4 DF=1 packets in the tunnels, and if the MTU-limiting router sent back a PTB with only the IPv4 header and the next 8 bytes (the UDP header), then the ITE could still authenticate the PTB by caching the 16 bit UDP checksum of each packet it sends. This is because the UDP checksum is affected by the full 32 bit value of the SEAL_ID in the SEAL header, and the most significant 16 bits of the SEAL_ID are in the returned IPv4 header's Identification field anyway. This might be handy for the IPv4 probing packets you mentioned above.

Regarding my suggestions about a timer-like algorithm by which the ITE could decide which range of SEAL_IDs it had "recently" sent to an ETE - for the purposes of authenticating messages from that ETE, or authenticating PTBs from routers in the ITE -> ETE path - you wrote:

> OK, that sounds good on the ITE side but what about the
> ETE side? If the ETE is going to be tracking the SEAL_ID
> for this ITE, can't it similarly keep a sliding window
> based on the packets received within the last ~3sec?
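To make the ITE-side bookkeeping concrete, here is a sketch in Python of the sliding window of recently sent packets, each remembered as its (IPv4 Identification, UDP checksum) pair, against which a truncated PTB quoting only the IP header plus 8 bytes could be checked. The class name, the fixed-size window standing in for "packets sent in the last ~3 sec", and the field values are all my assumptions - none of this is from the SEAL draft.

```python
from collections import OrderedDict

class PtbAuthenticator:
    """Sketch: validate truncated IPv4 PTBs against recently sent packets.

    For each tunneled packet the ITE remembers the pair (IPv4
    Identification, UDP checksum).  The Identification field carries the
    top 16 bits of the 32-bit SEAL_ID, and the UDP checksum is influenced
    by the full SEAL_ID, so together the two 16-bit values let the ITE
    check a PTB that quotes only the outer IP header + 8 bytes."""

    def __init__(self, window=1024):
        self._sent = OrderedDict()  # (ident, udp_csum) -> True, oldest first
        self._window = window       # stands in for "sent in the last ~3 sec"

    def record_sent(self, ident, udp_csum):
        self._sent[(ident, udp_csum)] = True
        while len(self._sent) > self._window:
            self._sent.popitem(last=False)  # forget the oldest packet

    def ptb_is_plausible(self, quoted_ident, quoted_udp_csum):
        """True if the quoted fields match some recently sent packet."""
        return (quoted_ident, quoted_udp_csum) in self._sent
```

A forged PTB would have to guess both 16 bit values of a packet sent within the window, which is the point of the authentication argument above.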
I don't recall any prior mention of the ETE attempting to decide whether a packet apparently from an ITE was really from that ITE or not. Maybe you could try this for ITE <-> ETE communications, but I think it may be impossible, for reasons similar or identical to the arguments below.

Here is an argument about why it would be pointless or worse for the ETE, in an IRON/RANGER setting, to try to use SEAL_ID to decide whether or not to accept tunneled packets containing traffic packets.

1 - Why it can't defend against an attacker.

I assume the attacker's purpose is to get bogus packets to the Destination Host (DH). There's no need for an attacker to try to spoof a packet arriving from an active ITE - one which has recently tunneled traffic packets to this ETE. If the attacker could get the ETE to accept such packets, then yes, the ETE would dutifully forward them towards the DH, and the DH would get the bogus packet. However, the attacker doesn't need to do this in order to achieve his or her goal. They can send a packet, tunneled just as if it were tunneled by some ITE which the ETE had not recently received packets from. To the ETE, this would be a "new" ITE, and the SEAL_ID would appear to have been set to some random value by this ITE. The attacker could keep up this flow of packets and the ETE would keep accepting them.

There's no point in making ETEs respond to such packets by sending a packet to the supposed ITE, and then ignoring the tunneled packets if that ITE doesn't confirm it sent them. This would lead to extra network traffic, and the attacker could simply spoof the traffic packets so that they appeared to come from a legitimate ITE.

2 - Why it would be worse than useless.

For the ETE to use SEAL_ID to accept or reject packets would create an easy DoS vulnerability. Suppose the attacker wants to clobber the ability of ITE X to tunnel packets to ETE Y, before ITE X has done so.
The attacker crafts a single packet which appears to ETE Y to have been sent by ITE X, with some random SEAL_ID value. Then the ETE would use this value and reject any packets genuinely coming from ITE X, because the real ITE X would have chosen a different random starting point for its SEAL_ID.

Just as there is no way of fully preventing attackers from sending packets to any host now, with any source address they choose, nor is there any way of preventing such problems with a CES system. Furthermore, since in a CES system EUNs (End User Networks) using "edge" addresses could be connected to any ISP, if these ISPs are going to support these EUNs, then they need to allow the forwarding of all packets from these EUNs which have source addresses matching any "edge" address. At present, ISPs can do their bit to stop spoofing by dropping packets with source addresses not matching the prefix of the network they came from - but that can't be applied to packets with "edge" addresses in a CES system, unless the ISP is prepared to be extremely fussy and watch the mapping system to see which "edge" prefixes are currently being mapped to an ETR which serves the particular EUN.

>> My view is that for IPv4, RFC 1191 PMTUD is an excellent system -
>> except that the PTB message should be made to follow the RFC 1981
>> requirement of sending back as much of the original packet as would
>> not make the PTB exceed 576 octets:
>
> How can a system with a going-in strategy of *throwing
> away good data* be "excellent"?

Because it is unreasonable of hosts or networks to emit some kinds of packets - specifically, any packet which is too long for the path to the DH, where the host or network expects the rest of the network to fuss about chopping the packet into fragments, then to carry those fragments, and then for the DH to have to reassemble those fragments - which is a complex task. The Post Office has maximum packet sizes, and so does the IPv6 Internet.
I think DF=0 packets were always a mistake. I guess they made sense in the early days of very dumb hosts, but I think it is a host responsibility to alter its behaviour so as to send packets which are the right size for the network to deliver in a single piece. If the network dutifully fragments DF=0 packets and attempts to deliver them - as it does in IPv4 - then this allows, and therefore encourages, hosts to send packets which require this inefficient, unfair and unreliable form of handling.

Likewise, if the network attempted to deliver packets which were too big - by sending a PTB and then fragmenting them so the final packet could be reassembled at the DH - this would allow, and therefore encourage, SHs to keep sending such packets. It would also involve the DH getting some of the first packet again in the second, since the SH couldn't be sure that the longer packet was successfully reassembled at the DH from its fragments. The too-big packet should be thrown away - except for sending enough of it back to the SH to enable the SH to authenticate the PTB.

Also, since there can be multiple levels of tunnel, I think the PTB should contain a few hundred bytes of the packet. Then the SH of an inner tunnel - which is a router in the path of an outer tunnel - can construct a PTB to its own sending host (the ingress router of a still further outer tunnel) which contains enough information not just for authenticating the PTB, but to allow the ingress router of the outermost tunnel to construct a PTB with sufficient length for the real SH to authenticate. Without this, ingress tunnel routers would need to cache substantial parts of the packets they send in order to be able to generate a valid PTB. I guess many IPv4 tunnels don't do this, which is part of the reason for the lousy PMTUD situation today.
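The size budget behind "a few hundred bytes" can be illustrated with a little arithmetic. Each ingress that translates a PTB outward strips one layer of tunnel headers from the quoted data, so the innermost router's PTB must quote the SH's required amount plus the sum of all the tunnel overheads. The function name and the header sizes in the example are assumptions for illustration, not values from any spec.

```python
def required_ptb_quote(sh_quote, tunnel_header_sizes):
    """Bytes of the offending packet the innermost router's PTB must
    quote so that, after each ingress strips its own tunnel headers
    while translating the PTB outward, the original sending host still
    receives `sh_quote` bytes of its own packet."""
    return sh_quote + sum(tunnel_header_sizes)

# Two nested tunnels, each assumed to add outer IPv4 (20) + UDP (8) +
# SEAL (8) = 36 bytes: to leave the SH a 128-byte quote, the innermost
# PTB must quote 128 + 36 + 36 = 200 bytes of the packet it dropped.
```

With several levels of tunneling and a generous per-host quote, this total easily reaches a few hundred bytes, which is the point of the argument above.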
I think that DF=0 packets should be deprecated, and that the RFC 1191 PTB message should be revised to return the same ~540 bytes of the packet which are returned in an IPv6 RFC 1981 PTB.

I am opposed to what I think you are attempting with SEAL, in at least some circumstances - fragmenting or segmenting packets which are too long, without an attempt to tell the sending host to send a suitably shortened packet. DF=0 packets do not allow the SH to be told this - so I think they should be deprecated. They were considered unworthy of inclusion in IPv6 in the mid-1990s. I think they should have been deprecated in IPv4 from that time.

>> If the RFC 1191 designers had correctly anticipated the need for one
>> or more levels of tunneling to support their PMTUD system, then I
>> think they would have altered the PTB requirements to be as long as
>> those for IPv6's RFC 1981. Then we probably would have tunnels today
>> which properly support RFC 1191 PMTUD.
>
> But, if any one of those tunnels uses IPsec encryption
> or the like there is no opportunity for performing the
> necessary translation function. So if there were a
> decent segmentation and reassembly capability it seems
> like IPsec implementations would be wise to use it.

If an IPSEC ingress tunnel router can't decipher the initial part of the packet returned in a PTB, then it needs to cache a copy of the initial part of each packet it receives as input to the tunnel, for the purpose of constructing a PTB which would be recognised by a SH. That cached portion also needs to be long enough that, if this tunnel is nested within other tunnels, by the time this router's PTB contents have been passed back up the line to the other ingress tunnel routers, the outermost one still has enough of the original traffic packet to generate a valid PTB for the SH. Without this, the IPSEC tunnel is not supporting RFC 1191 / RFC 1981 PMTUD.
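A sketch of that caching idea: the ingress keeps the head of each inner packet it tunnels, keyed by something it can recover from a PTB arriving from inside the tunnel. Everything here is my assumption - the class name, the choice of the outer IPv4 Identification as the key, the 540-byte head, and the capacity - none of it is specified for IPsec or SEAL.

```python
from collections import OrderedDict

HEAD_BYTES = 540  # roughly the RFC 1981-style quote the SH will need

class IngressHeadCache:
    """Sketch: an IPsec-style ingress keeps the head of each inner
    packet it tunnels so it can synthesize a PTB for the sending host
    when it cannot decipher the encrypted data quoted in a PTB from
    inside the tunnel.  Keyed here (an assumption) by the outer IPv4
    Identification the ingress used when encapsulating."""

    def __init__(self, capacity=4096):
        self._heads = OrderedDict()  # outer ident -> inner packet head
        self._capacity = capacity

    def remember(self, outer_ident, inner_packet):
        self._heads[outer_ident] = bytes(inner_packet[:HEAD_BYTES])
        while len(self._heads) > self._capacity:
            self._heads.popitem(last=False)  # drop the oldest head

    def build_ptb_quote(self, outer_ident):
        """Return the cached inner-packet head to quote in a PTB back to
        the original sending host, or None if it has been evicted."""
        return self._heads.get(outer_ident)
```

The cost of this is exactly the "caching substantial parts of the packets they send" mentioned earlier, which is why a longer quoted portion in PTBs would let tunnels avoid it.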
I understand such support is mandatory, so how could a self-respecting tunnel, IPSEC or not, fail to support it?

The world turns to rot if RFC 1191 and RFC 1981 PMTUD are not supported. People have to write messy things like RFC 4821 in an effort to get around the mess caused by such tunnels, or by the filtering of PTBs. I think we should not be so defensive about stuff happening in the network that we allow it and adapt to cope with it, when the practices are fundamentally inefficient and do not support the best way of doing things.

>> Also, I think that DF=0 packets should be deprecated - unless perhaps
>> they are shorter than some constant such as 1200 bytes or so. I
>> think it would be bad to expect ITRs and ETRs and the whole CES
>> system to work over paths with MTUs below this. People shouldn't use
>> such short PMTU links in the DFZ, and shouldn't place their ITRs or
>> ETRs anywhere where there are such short PMTU links between them and
>> the DFZ.
>
> DF=0 has two benefits - it can allow good data to
> get through in cases where DF=1 would have dropped
> the data, and it can allow MTU indication through
> to the ETE which can report back to the ITE.
>
>> My view is that for IPv6, RFC 1981 is an excellent system.
>
> How can a system that places blind faith in the network
> be "excellent"?

People pay to use services with tunnels. If the tunnels screw up the only efficient, practical approach to PMTUD (RFC 1191 / RFC 1981), then people shouldn't use such tunnels or pay for any service which uses them. If they do, then it's all downhill from there - with people fixing packet lengths to avoid trouble, and busying themselves updating stacks and applications to no longer rely on the perfectly good RFC 1191 / RFC 1981 PMTUD approach (assuming RFC 1191 had mandated returning ~540 bytes of the packet).
>> From your research (msg05910), it seems that the current state of
>> PMTUD in IPv4 is a shambles - with some networks blocking PTBs, some
>> tunnels (or combinations of tunnels) not generating PTBs, and with
>> some hosts ignoring PTBs, or not responding properly to them. Also,
>> some hosts send DF=0 packets of 1470 bytes (Google at least).
>>
>> As far as I know, everything generally works because many hosts are
>> configured not to send packets long enough to run into PMTU problems.
>
> Agree.
>
>> From the current basis, there's no way we can generally adopt
>> jumboframe paths in the DFZ as they appear.
>
> Also agree.
>
>> Nor is there a way of introducing a tunneling-based CES architecture
>> which relies for its PMTUD on PTBs. My IPTM approach and I think
>> your SEAL approach should be able to cope without relying on PTBs
>> from within the tunnel (but see my forthcoming message). But what if
>> the ITRs (ITEs) can correctly sense the PMTU to the ETRs (ETEs) but
>> are unable to alter the sending host's packet lengths?
>>
>> This could be due to:
>>
>> 1   A PTB sent by the ITR is dropped by some filtering system
>>     before it can get to the SH. This seems more likely if
>>     the ITR is outside the ISP or end-user network where the
>>     SH is located than within it.
>>
>>     If people filter PTBs from entering their system, or use an
>>     ISP which does the same, this is their own fault.
>>
>>     The trouble is, they get away with it now, because the packets
>>     their hosts send are generally short enough not to run into MTU
>>     problems. Unfortunately, such networks will perceive the
>>     difficulties resulting from their choices as being caused by
>>     sending packets to a host with an SPI ("edge") address in the
>>     CES architecture - and may not think it is their own filtering
>>     which is causing the trouble.
>>
>> 2   The SH ignoring or responding incorrectly to the PTB.
>>
>>     As above - they get away with it now, and would perceive the
>>     problem as being caused by the destination network which
>>     is using the CES system's "edge" space.
>
> Cases 1 and 2 are a problem of the end site and not of
> the ITE. If the ITE as an edge router of the site is
> sending PTBs and the source host is either not
> getting them or not responding correctly, then the end
> site has to find the problems and fix them.

I agree.

>> 3   The SH sends DF=0 packets which are too long, after
>>     encapsulation, for some, many or all paths to ETRs.
>>
>>     Again, as above, they get away with it now - but would blame
>>     the CES system, or rather the destination network which they
>>     may not know has adopted the "edge" space provided by the
>>     CES system.
>>
>>     So does a CES system have to fragment every such packet?
>>     It seems so.
>
> The CES needs to select a "safe" size for performing inner
> fragmentation while not choosing one so excessively small
> as to invoke inner fragmentation very often.

Then you are always making things less efficient than they could be, and ruling out the use of jumboframe paths in the DFZ until such time as every path is jumboframe compatible - which may be never.

I am opposed to the CES scheme continually fragmenting packets if we can possibly avoid it. Maybe we have to for DF=0 packets which are 1470 bytes, when the CES scheme can only get a little less than this into each tunnel packet. But this would be BAD, considering that Google sends out a lot of 1470 byte DF=0 packets. I imagine Google could be talked into lowering this, or better still into using DF=1.

>> I think that to implement defensive, complex protocols such as RFC
>> 4821 would be to accept and allow all these bad practices, and would
>> forever doom us to having to do extra work, and suffer extra
>> flakiness, just because of these bad practices.
>>
>> RFC 4821 will always be a slower and less accurate method of
>> determining the PMTU to a given host than RFC 1191 or RFC 1981. It
>> would also be subject to choosing a lower-than-proper value if there
>> was an outage for a while and it interpreted this as a PMTU
>> limitation.
>
> My belief is that SEAL used correctly has a chance
> to establish a minimum "Internet cell size" of 1500.

I can't see how you could do this, since there will always be 1500 byte MTU limits in the DFZ, in ISPs and in other networks for years to come, and there will at times be tunneling, such as with PPPoE in DSL services.

> Then, if end systems adopt the strategy of "use
> classic PMTUD for packets no larger than 1500 and
> use RFC4821 or equivalent for packets larger than
> 1500" then we would have a path to an MTU-clean
> Internet that can scale to any future packet sizes.

I think this would be very messy. Some hosts would be putting out 9 kbyte packets and ignoring PTBs, just to see if they could get them to the other host - and trying several times to make sure any failure was due to a genuine MTU problem and not to random packet loss.

  - Robin

_______________________________________________
rrg mailing list
rrg@irtf.org
http://www.irtf.org/mailman/listinfo/rrg