Short version:      Tunnel ingress routers accept a PTB packet
                    from within the tunnel and update their MTU
                    for the tunnel - but do not send a PTB.  Then
                    they send a PTB for the second too-long traffic
                    packet.

                    Three nested tunnels like this means the SH has
                    to send four too-long packets before it gets
                    a PTB.

                    I don't think there is such a thing as "a
                    little bit of fragmentation".

                    Fred suggests an ETE could decide whether or
                    not to accept a traffic packet tunneled by an
                    ITE based on the ETE determining whether or
                    not that ITE was authorised to handle packets
                    whose source address is that of the traffic
                    packets.  I can't imagine a practical way of
                    doing this.


Hi Fred,

Thanks for your continued conversation, from which I have learned a
great deal.  You wrote, in part:


>>>> If anyone can point me to good references on ECMP used with LAG, I
>>>> would really appreciate it.
>>
>> I am keen to read more about this.

> My understanding is that a router with multiple equal
> cost routes toward the final destination can select
> which next hop to use based on a hash of the packet's
> (src,dst,sport,dport,proto)-tuple (i.e., the packet's
> "flow identifier"). I believe RFCs 2991 and 2992 may
> have more to say on this.

Thanks for these pointers.  I will check out:

  Multipath Issues in Unicast and Multicast Next-Hop Selection
  http://tools.ietf.org/html/rfc2991

  Analysis of an Equal-Cost Multi-Path Algorithm
  http://tools.ietf.org/html/rfc2992

Searching for "Link Aggregation Group" at the IETF site turns up
a number of documents.
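My current (possibly wrong) mental model of hash-based next-hop
selection, as a sketch - the hash choice and all the names here are
mine, not taken from RFC 2991/2992:

```python
import hashlib

def select_next_hop(next_hops, src, dst, sport, dport, proto):
    """Pick one of several equal-cost next hops by hashing the
    packet's 5-tuple (its "flow identifier"), so every packet of
    a given flow takes the same path and stays in order."""
    flow_id = f"{src}|{dst}|{sport}|{dport}|{proto}".encode()
    digest = hashlib.sha256(flow_id).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]
```

The point being: the router only sees the fields it can parse, which
is why the position of the UDP header matters below.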


>> Here you seem to agree with my suggestion that the "mid-layer
>> headers" are after the IPv4/6 header and before the SEAL header.
>>
>> However, I see from the Figures 1 and 2 in:
>>
>>   http://tools.ietf.org/html/draft-templin-intarea-seal-08
>>
>> that these "mid-layer" headers are after the SEAL header.
>>
>> As far as I know, placing a UDP header after the SEAL header would
>> make it invisible to ECMP/LAG routers.  So I think that if the
>> packets have to be UDP packets to keep these ECMP/LAG routers happy,
>> the SEAL header and all that follows is part of the UDP payload.
> 
> I am actually meaning to use UDP as an "outer"
> encapsulation; not a "mid-layer" encapsulation.
> So, the order of encapsulating headers on the wire
> (left to right) would be:  IPvX/UDP/SEAL/IPvY.
> So yes, the ECMP/LAG routers get to see the UDP
> header in the clear and the SEAL header looks
> like UDP data.

OK.  I see after Figure 2 in:

  http://tools.ietf.org/html/draft-templin-intarea-seal-08

the third dot point mentions UDP as an outer layer header.  I
suggest you add a note explaining why this would be needed - to
support ECMP/LAG, or for any other reason.
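So, as I now understand it, the assembly on the wire would be the
following (a trivial sketch; the header contents are placeholders):

```python
def encapsulate(inner_packet, seal_header, udp_header, outer_ip_header):
    """Wire order for SEAL with an outer UDP shim, left to right:
    IPvX / UDP / SEAL / IPvY.  ECMP/LAG routers parse only as far
    as the UDP header; the SEAL header and everything after it is
    opaque UDP payload to them."""
    return outer_ip_header + udp_header + seal_header + inner_packet
```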



>> Assuming you did this (and I am not suggesting you need to, since I
>> haven't yet read up on ECMP/LAG) and if you wanted to send IPv4 DF=1
>> packets in the tunnels, and if the MTU limiting router sent back a
>> PTB with only the IPv4 header and the next 8 bytes (the UDP header)
>> then the ITE could still authenticate the PTB by caching its 16 bit
>> UDP checksum.  This is because the UDP checksum would be affected by
>> the full 32 bit value of the SEAL_ID in the SEAL header, and the most
>> significant 16 bits of the SEAL_ID are in the returned IPv4 header's
>> Identification field anyway.
> 
> For IPv4, we will run with UDP checksums set to zero.
> This just means that we may not be able to use the
> "expendable packet" approach to probing when the UDP
> header is required.

OK.
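For what it's worth, the checksum-caching idea I floated above could
be sketched like this (the names are mine, and the idea is moot if
the UDP checksum is always zero):

```python
# The ITE caches the 16-bit UDP checksum of each packet it emits.
# A returned PTB quoting only the outer IPv4 header plus the next
# 8 bytes (the UDP header) can then be checked against the cache,
# since the checksum covers the full 32-bit SEAL_ID.

recent_checksums = set()          # per-ETE cache kept at the ITE

def record_sent(udp_checksum):
    recent_checksums.add(udp_checksum & 0xFFFF)

def ptb_is_plausible(quoted_udp_checksum):
    return (quoted_udp_checksum & 0xFFFF) in recent_checksums
```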


>>      However, the attacker doesn't need to do this in order to
>>      achieve his or her goal.  They can send a packet, tunneled
>>      just as if it were tunneled by some ITE which the ETE had
>>      not recently received packets from.  To the ETE, this would
>>      be a "new" ITE, and the SEAL_ID would be set to some random
>>      value by this ITE.
>>
>>      The attacker could keep up this flow of packets and the
>>      ETE would keep accepting them.
>>
>>      There's no point in making ETEs respond to such packets
>>      by sending a packet to the supposed ITE, and then by
>>      ignoring those packets if that ITE doesn't confirm it
>>      sent them.  This would lead to extra network traffic and
>>      the attacker could simply spoof the traffic packets
>>      themselves and allow them to be forwarded to a legitimate
>>      ITE.
> 
> No; an attacker pretending to be an ITE first needs
> to prove to the ETE that it is authorized to source
> packets from a specific prefix. Secure Neighbor
> Discovery (SEND) is the way that good ITEs prove
> their authorization to source packets.

I don't understand this at all.  How can an ETE firstly know
that the tunneled packet it received really came from the ITE
whose IP address is in the outer source field?  This would
require it to send a message to that IP address and get back a
response it could authenticate as resulting from its message.

This is not a practical thing to do when an ETE first gets a
packet.

More on this below:


>>  2 - Why it would be worse than useless.
>>
>>      For the ETE to use SEAL_ID to accept or reject packets would
>>      lead to an easy DoS vulnerability.
>>
>>      The attacker wants to clobber the ability of ITE X to tunnel
>>      packets to ETE Y, before ITE X has done so.  The attacker
>>      crafts a single packet which appears to ETE Y to have been
>>      sent by ITE X, with some random SEAL_ID value.  Then, the ETE
>>      would use this and reject any packets genuinely coming from
>>      ITE X, because the real ITE X would have chosen a different
>>      random starting point for its SEAL_ID.
>>
>> Just as there is no way of fully preventing attackers sending packets
>> to any host now, with any source address they choose, nor is there
>> any way of preventing such problems with a CES system.
>>
>> Furthermore, since in a CES system, EUNs (End User Networks) using
>> "edge" addresses could be connected to any ISP, if these ISPs are
>> going to support these EUNs, then they need to allow the forwarding
>> of all packets from these EUNs which have source addresses matching
>> any "edge" address.
>>
>> At present, ISPs can do their bit to stop spoofing by dropping
>> packets with source addresses not matching the prefix of the network
>> which they came from - but that can't be applied to packets with
>> "edge" addresses in a CES system, unless the ISP is prepared to be
>> extremely fussy and watch the mapping system to see which "edge"
>> prefixes are currently being mapped to an ETR which serves the
>> particular EUN.
> 
> What RANGER/VET/SEAL are providing is a way for the
> ETE to do its own ingress filtering in case the ISPs
> that contain end user networks are not doing ingress
> filtering. This works when the ETE has a way of
> authenticating that a specific ITE is authorized to
> source packets from a given prefix.

I don't see a practical way an ITE could prove to any ETE
that it was authorised to send packets with a particular source address.

Theoretically, perhaps, the ITE might be able to establish that
it was located in a particular ISP's network or whatever and that
network was advertising a bunch of prefixes - so it would be OK for
the ITE to tunnel traffic packets to any ETE where the source address
of those traffic packets matched one of these ISP prefixes.

But what if the ISP's network includes one or more EUNs (end-user
networks) using "edge" space managed by IRON-RANGER?

These networks will be sending out packets with their own "edge"
addresses in the source field - and the ITE will be handling these
packets.  There could be thousands of such EUNs, each using one or
more "edge" prefixes - the total set of such prefixes would be large
and subject to considerable churn.

Even if the ITE could somehow "prove" all this, via some
PKI-compatible signed message, the message would be excessively long
and there's no way an ETE should hold onto a traffic packet and have
to check the authority of the ITE to send one with its particular
source address.



>> The too-big packet should be thrown away - except for sending enough
>> of it back to the SH to enable the SH to authenticate the PTB.  Also,
>> since there can be multiple levels of tunnel, I think the PTB should
>> contain a few hundred bytes of the packet, so that the SH of an inner
>> tunnel, which is a router in the path of an outer tunnel, can
>> construct a PTB to its sending host (the ingress router of a still
>> further outer tunnel) which contains enough information not just for
>> authenticating the PTB, but to allow the final outer router of the
>> first tunnel to construct a PTB with sufficient length for the real
>> SH to authenticate.   Without this, ingress tunnel routers would need
>> to cache substantial parts of the packets they send, in order to be
>> able to generate a valid PTB.  I guess many IPv4 tunnels don't do
>> this, which is part of the reason for the lousy PMTUD situation today.
> 
> Having the PTBs contain enough of the too-big packet
> may not be very helpful if some of the nested tunnels
> use IP encryption. 

To support the proposal I made, if an IPSEC ingress tunnel router
(ITE in SEAL) can't decipher the first part of its own emitted packet
to create a valid PTB, then it should cache enough of the packet to
do so.
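Concretely, I imagine the caching working something like this (the
names and sizes here are my own illustration, not from any draft):

```python
from collections import OrderedDict

CACHE_LIMIT = 1024        # how many packet heads to keep
HEAD_BYTES = 540          # enough for nested-tunnel PTB transcription

packet_heads = OrderedDict()

def remember(ident, packet):
    """Cache the head of an emitted packet, keyed by a value that
    will reappear in a returned PTB (e.g. the outer IPv4
    Identification field)."""
    packet_heads[ident] = packet[:HEAD_BYTES]
    if len(packet_heads) > CACHE_LIMIT:
        packet_heads.popitem(last=False)   # evict the oldest entry

def head_for_ptb(ident):
    """Return the cached head so the ITE can synthesise a valid
    PTB toward the previous hop, or None if it has aged out."""
    return packet_heads.get(ident)
```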

Below we discuss the idea of the ITE not sending a PTB, but updating
its MTU value, so that if (as would often be the case) the next
traffic packet exceeds that size, that it would generate a PTB from
that packet instead.


> In that case, there may be no
> opportunity to transcribe the PTB into a corresponding
> message to send to the prior tunnel endpoint in the
> nesting. This may be a minor point, however, since
> the ITEs can still cache the new MTU for use when
> later large packets arrive. So, I don't think there is
> a need for the ITE to keep old packets around just in
> case a PTB needs to be generated later.

This "PTB on the second too-large-packet" arrangement is inexpensive
and in some ways appears to be adequate.

However, if there were three nested tunnels which did this, with a
single MTU limit in a router along the inner tunnel path, then it
would take several traffic packets being sent by the SH before the SH
would get a PTB:

  Packet 1        PTB-1 generated by router in inner tunnel path.

                  ITE of inner tunnel receives the packet, adjusts
                  its MTU value and does not generate a PTB.

  Packet 2        PTB-2 generated by ITE of inner tunnel.

                  ITE of middle tunnel receives the packet, adjusts
                  its MTU value and does not generate a PTB.

  Packet 3        PTB-3 generated by ITE of middle tunnel.

                  ITE of outer tunnel receives the packet, adjusts
                  its MTU value and does not generate a PTB.

  Packet 4        PTB-4 generated by ITE of outer tunnel - goes
                  to sending host, which adjusts its MTU.

  Packet 5        This has the correct length to get through to the
                  destination host.

So three tunnels which don't immediately send a valid PTB means three
more full length packets the SH has to send before it gets a PTB.
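The pattern generalises - each nested tunnel that silently absorbs
the first PTB costs the SH one more too-long packet.  A toy model of
this (my own sketch):

```python
def packets_before_sh_gets_ptb(nested_tunnels):
    """Count the too-long packets the sending host (SH) must emit
    before it receives a PTB, when every nested tunnel's ITE
    silently absorbs the first PTB (updating its tunnel MTU) and
    only sends a PTB for the *next* too-long packet.  With N
    nested tunnels, the SH sends N + 1 packets."""
    silent_ites = nested_tunnels   # ITEs that have not yet adjusted
    packets = 0
    while True:
        packets += 1               # SH sends another too-long packet
        if silent_ites == 0:
            return packets         # a PTB finally reaches the SH
        silent_ites -= 1           # one more ITE adjusts, sends nothing
```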



>> I think that DF=0 packets should be deprecated, and that the RFC 1191
>> PTB message should be revised to be the same as the ~540 bytes of
>> packet which are returned in an IPv6 RFC 1981 PTB.
>>
>> I am opposed to what I think you are attempting with SEAL, in at
>> least some circumstances - of fragmenting or segmenting packets which
>> are too long, without an attempt to tell the sending host to send a
>> suitably shortened packet.
>>
>> DF=0 packets do not allow the SH to be told this - so I think they
>> should be deprecated.  They were considered unworthy of inclusion in
>> IPv6 in the mid-1990s.  I think they should have been deprecated in
>> IPv4 from that time.
> 
> I'm afraid I must continue to disagree. Your line of thinking
> seems to be paralleling that of the Internet architects of the
> 1980's who rushed to proclaim "Fragmentation Considered Harmful".
> IMHO, that was an all-or-nothing proclamation that did not take
> into account scenarios in which fragmentation is beneficial. AFAICT,
> the community took the easy way out by seeking to abolish
> fragmentation altogether. Instead, they could have done the
> obvious, which was to *fix it* and *identify its use cases*.

I am not expecting you to agree - I was just stating my beliefs.

I don't see how any attempt to fragment DF=0 packets could be
regarded as a long-term solution.  It should be the responsibility of
hosts not to send packets which the network can't handle efficiently
and reliably.  This is because it is a lot easier, overall, for the
host to do this small amount of work than to continually burden
routers with fragmentation tasks and then other routers with handling
more packets, which winds up being still more total data due to the
need for one or more extra headers.  Then there is the worsening of
reliability, and consequently the need to retransmit.

Can you suggest how you think DF=0 packets should be handled, in the
long term?

Other than a tunnel which segments and reassembles ordinary-length
(~1500 byte) packets because the particular link technology uses very
small packet sizes, can you suggest when, in general, long DF=1
packets should be segmented - and how this would be, overall, the
most efficient and robust approach compared to sending a PTB and
requiring the host to emit shorter packets?


>>>> If the RFC 1191 designers had correctly anticipated the need for one
>>>> or more levels of tunneling to support their PMTUD system, then I
>>>> think they would have altered the PTB requirements to be as long as
>>>> those for IPv6's RFC 1981.  Then we probably would have tunnels today
>>>> which properly support RFC 1191 PMTUD.
>>>
>>> But, if any one of those tunnels uses IPsec encryption
>>> or the like there is no opportunity for performing the
>>> necessary translation function. So if there were a
>>> decent segmentation and reassembly capability it seems
>>> like IPsec implementations would be wise to use it.
>>
>> If an IPSEC ingress tunnel router can't decipher the initial part of
>> the packet returned in a PTB, then it needs to cache a copy of the
>> initial part of the packet it received as input to the tunnel, for
>> the purpose of constructing a PTB which would be recognised by a SH,
>> and furthermore which is long enough that if this tunnel is within
>> other tunnels, then that by the time this router's PTB contents have
>> been passed back up the line to the other ingress tunnel routers,
>> that the outer one still has enough of the original traffic packet to
>> be able to generate a valid PTB for the SH.
> 
> But, it seems that tunnels are more and more taking
> the "lazy" approach by simply discarding initial
> PTBs silently then generating PTBs when subsequent
> large packets arrive. The theory being that since
> the network might drop an initial PTB anyway, why
> not just have the ITE drop the initial PTB on purpose?
> With a willingness to accept such loss, there is no
> need to cache copies of packets.

It's a slippery slope which leads to the above scenario of nested
tunnels requiring a bunch of too-long packets before the SH gets a PTB.

I wonder if the tests done recently by Ben Stasiewicz:

http://listserver.internetnz.net.nz/pipermail/ipv6-techsig/2009-October/000708.html

might have been affected by this.  Matthew Luckie has written to the
RRG about this research.

>> Without this, the IPSEC tunnel is not supporting RFC 1191 / RFC 1981
>> PMTUD.  I understand such support is mandatory, so how could a
>> self-respecting tunnel, IPSEC or not, not support it?
> 
> Based on what I have heard, implementations routinely
> use rate limiting for the PTBs they send the same as
> for any ICMP message. So, already today widely deployed
> implementations drop some/many PTBs.

Yes - I think this PMTUD stuff is about as big a can of worms as
scalable routing, perhaps also deserving of a conference and
concerted research effort to decide what to do.

The trouble is, I believe the only practical scalable routing
proposals which don't require updates to all DFZ routers will require
tunneling, and such proposals (LISP, IRON-RANGER and Ivip) would run
straight into these PMTUD problems.   Even without a scalable routing
solution, the IPv4 and IPv6 Internets seem to be boxed into current
~<=1500 byte packet sizes, with no opportunity to break out of this
and use jumboframe ~9k MTU paths as they appear across the DFZ.


>> People pay to use services with tunnels.  If the tunnels screw up the
>> only efficient, practical, approach to PMTUD (RFC 1191 / RFC 1981)
>> then people shouldn't use such tunnels or pay for any service which
>> uses them.  If they do, then it's all downhill from there - with
>> people fixing packet lengths to avoid trouble, and busying themselves
>> updating stacks and applications to no longer rely on the perfectly
>> good RFC 1191 / RFC 1981 PMTUD approach (if RFC 1191 had mandatory
>> ~540 byte packet fragments).
> 
> Your sudden faith in the network amazes me, but IMHO
> it is unfounded. The Internet is a wild and woolly
> place; not neat and orderly.

I don't have faith in it - I just think it would be easier to fix it
than to route around it.

I don't have complete faith in passenger jets or their ability to be
defended against terrorists.  (Though since flying to Australia from
England in a Boeing 707 in 1961 as a child, I feel happiest flying
one of your company's aircraft.)

No-one has complete faith in the ability to prevent passenger jets
from killing all on board.  However, collectively, we figure it is
easiest to do our best to make them safe and use them, rather than go
by bus, car or ship.

This is not argument by analogy - just demonstrating that I don't
have to think the Internet is perfectly orderly to decide that it is
easier and better for us all to work to make it more orderly, than to
take an overly defensive path around the disorder (and so allow it to
remain and grow), which is how I assess the task of converting all
stacks and application packetization layers to RFC 4821 PMTUD.


>> I am opposed to the CES scheme continually fragmenting packets if we
>> can possibly avoid it.  Maybe we have to for DF=0 packets which are
>> 1470 bytes and the CES scheme can only get a little less than this
>> into each tunnel packet.  But this would be BAD considering how
>> Google sends out a lot of 1470 byte DF=0 packets.  I imagine Google
>> could be talked into lowering this or better still into using DF=1.
> 
> I don't think there is a way to stop websites from setting
> DF=0 if they want to. There are more websites than just
> Google that do this.

I think it will be easier overall to change this behaviour than to
have routers forever fragmenting packets and then have following
routers needing to carry two, with consequent inefficiencies and
greater chance of packet loss.


> I am not sure as to whether SEAL needs to be tweaked to
> give more efficient treatment of inner fragmentation for
> DF=0 packets. That seems to me to only be an issue when
> IPv4 is used as EID space; when IPv6 is used as EID space,
> DF=1 always.

I think it is well worth crafting protocols to work well with IPv4,
given it will be decades, if ever, before IPv6 or some other
alternative takes most of the traffic away from the IPv4 Internet.


>>>> I think that to implement defensive, complex protocols such as RFC
>>>> 4821 would be to accept and allow all these bad practices, and would
>>>> forever doom us to having to do extra work, and suffer extra
>>>> flakiness, just because of these bad practices.
>>>>
>>>> RFC 4821 will always be a slower and less accurate method of
>>>> determining PMTU to a given host than RFC 1191 or RFC 1981.  It would
>>>> be subject to choosing a lower than proper value, if there was an
>>>> outage for a while and it interpreted this as a PMTU limitation.
>>>
>>> My belief is that SEAL used correctly has a chance
>>> to establish a minimum "Internet cell size" of 1500.
>>
>> I can't see how you could do this, since there will always be 1500
>> limits in the DFZ, ISP and other networks for years to come, and
>> there will at times be tunneling, such as with PPPoE in DSL services.
> 
> Well, I have been told that links in the DFZ by and large
> set an MTU of ~4KB or larger (I have no evidence of this).
> So, if the core can handle 1500 w/o fragmentation then
> we should be able to get the edges to also handle 1500
> even if it requires a little bit of fragmentation.

I don't think there's such a thing as "a little bit of
fragmentation".  Splitting large numbers of traffic packets into two
packets, and then transporting and reassembling them is at least
twice the work, in many respects.
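Rough arithmetic shows why (a sketch only; the header size is the
IPv4 minimum, and the function names are mine):

```python
import math

def fragmentation_overhead(payload, tunnel_mtu, ip_header=20):
    """How many pieces a DF=0 packet becomes inside a tunnel, and
    the total bytes on the wire once each piece carries its own
    IP header.  Illustrative arithmetic only."""
    room = tunnel_mtu - ip_header        # payload room per fragment
    room -= room % 8                     # non-final fragments align to 8 bytes
    n = math.ceil(payload / room)
    total_bytes = payload + n * ip_header
    return n, total_bytes
```

A 1480-byte payload through a tunnel whose effective MTU is 1460
becomes two packets and 1520 bytes on the wire: twice the packets to
forward, queue and reassemble, plus an extra header.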

 - Robin
_______________________________________________
rrg mailing list
rrg@irtf.org
http://www.irtf.org/mailman/listinfo/rrg
