Re: [rrg] SEAL critique, PMTUD, RFC4821 = vapourware

Robin Whittle Sat, 30 Jan 2010 10:01:19 -0800

Short version:     Mainly explaining my understanding of how
                   SEAL does PMTUD when used as part of RANGER
                   - when RANGER is acting as a Core-Edge
                   Separation solution to the routing scaling
                   problem.


                   Both RANGER and SEAL can do other things
                   besides this, and I find it hard to envisage
                   the subset of their operations which would
                   be used in a Core-Edge Separation setting.

                   I still think it is wrong to develop a complex
                   and difficult protocol such as RFC4821,
                   because of some badly designed tunnels which
                   don't generate PTBs or due to a few networks
                   which filter out PTBs.

                   There's very little sign of anyone wanting to
                   develop or use RFC4821 - so I guess the
                   need or desire for it isn't very strong.



http://tools.ietf.org/html/draft-templin-intarea-seal-08


Hi Fred,

Thanks for your reply:

>> So I disagree with your statement except to the extent of
>> applications which do no packetization and just rely on the stack's
>> TCP (or SCTP or whatever) packetization layers which may or may not
>> implement RFC4821.
> 
> Yes; thanks for the correction. RFC4821 concerns packetization
> layer path MTU discovery, where the application itself is a
> packetization layer when, e.g., UDP is used as the transport.

OK.  My understanding is that in the stack there is, for each
destination address, an MTU number which can be read and written by
any packetization layer.  The TCP packetization layer in the stack is
the most obvious one, but I guess if there was an SCTP layer there
too, this would also read and write this variable (or is it multiple
variables for each destination address?).  Then any applications
which have their own packetization layers would be able to read
and write these variables too.


>> When IPTM sends a big packet (containing most of a traffic packet) as
>> part of its PMTU probing, if this hits an MTU limit, then it is
>> dropped, with a short PTB going back to the ITR.  With SEAL, in IPv4
>> mode, the limiting router has to split the packet up and forward it,
>> so other routers and the ETR have to carry all the fragments.  So my
>> IPTM approach is arguably lighter weight than your SEAL approach.
> 
> With SEAL-FS, the ETE does not carry all the fragments.
> Instead, it uses first-fragments for reporting purposes
> only and otherwise discards all fragments.

OK - thanks.


>> IPTM doesn't rely on the PTB. (See below for how it will be able to
>> work with minimal length IPv4 PTBs.)  As long as a PTB does get to
>> the ITR - which it would in most cases - then the ITR knows about the
>> MTU problem without having to wait for the ETR to time out and send a
>> message to the ITR saying the big packet did not arrive.  Also, the
>> ITR gets an exact MTU value from this PTB, rather than having to do
>> what SEAL does - hunt back and forth to find a packet size which is
>> reliably delivered without MTU problems.
> 
> SEAL doesn't hunt back and forth. 

4.3.9.1.2 mentions an "iterative searching strategy" - which sounds
like a fancy term for "hunt"!  This occurs only in IPv4, when the ETE
gets a first fragment shorter than 576 bytes.  This is interpreted as
a "runt fragment" and so is not regarded as a true measurement of the
limiting MTU.


> In SEAL, every data
> packet is an implicit probe, and the ETE uses IPv4
> fragmentation as an indication that it needs to tell
> the ITE to reduce the size of the packets it is sending.

To function correctly in IPv4, SEAL also relies on the first fragment
arriving at the ETE, and that fragment being the length of the
shortest MTU between the ITE and the ETE.  There's not much chance of
the first fragment being lost, so this is fine.

As far as I know, IPv4 routers are not required to make their
fragments (all but the last usually) the length of their MTU limit,
but I guess most of them do.

Assuming they all do, your IPv4 fragmentation approach has an
advantage where there are two or more MTU limits in succession.

If the first limit is 1400 and the second 1300, then by relying on
PTBs, as you do with IPv6, the ITE first discovers the 1400 limit and
sends a PTB to the sending host (SH) with the correct value for that:
1400 - 24 (IPv4 header + SEAL header) = 1376.  The SH tries again, and
the encapsulated packet generates a PTB at the router with the 1300
limit.  This is propagated back to the SH in the same way, with a PTB
with a 1276 byte MTU - and then all is well.

With your fragmentation approach, the ETE should get a first fragment
of length 1300 bytes.  The ETE reports this back to the ITE, which
sends a single PTB to the SH, with a value of 1276 bytes.
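
The arithmetic of the two cases can be sketched as follows.  This is
only an illustration: the function names are mine, the 24 byte overhead
is the IPv4 header + SEAL header figure used above, and the model
assumes every limiting router either returns a PTB carrying its
next-hop MTU or emits a first fragment exactly that long.

```python
SEAL_IPV4_OVERHEAD = 20 + 4  # outer IPv4 header + SEAL header, as assumed above


def ptb_rounds(path_mtus, first_packet_len):
    """PTB-driven discovery: each round, the first hop that is too small
    returns a PTB.  Returns the MTU values the ITE would pass to the SH."""
    reported = []
    inner_len = first_packet_len
    while True:
        outer_len = inner_len + SEAL_IPV4_OVERHEAD
        limit = next((m for m in path_mtus if m < outer_len), None)
        if limit is None:
            return reported
        inner_len = limit - SEAL_IPV4_OVERHEAD
        reported.append(inner_len)


def fragmentation_report(path_mtus, first_packet_len):
    """Fragmentation-driven discovery: the ETE sees a first fragment as
    long as the smallest MTU on the path, so one report is enough."""
    outer_len = first_packet_len + SEAL_IPV4_OVERHEAD
    smallest = min(path_mtus)
    if smallest >= outer_len:
        return []
    return [smallest - SEAL_IPV4_OVERHEAD]


print(ptb_rounds([1400, 1300], 1500))            # [1376, 1276]
print(fragmentation_report([1400, 1300], 1500))  # [1276]
```

The PTB-driven case costs the SH two rounds but each round trip is
short; the fragmentation-driven case costs one round over the full
ITE to ETE path.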

Which would be faster depends on the circumstances - since the PTB
back to the ITE will often be faster than the longer path all the way
to the ETE.

If the limiting router sends a fragment of length less than the
limiting MTU, then the SEAL ITE would adopt an unrealistically low
MTU value.  I guess this is unlikely.


IPTM doesn't hunt back and forth.  When a long enough packet arrives
at the ITR for the encapsulated length to fall between its two
current markers for the Zone of Uncertainty (low and high water
markers, I think, are the terms used in other protocols) the dual
packet probe technique will usually result in reliable new
information in most instances, due to one of:

  1 - The traffic packet is delivered to the ETR and the ITR is
      informed about this - so the ITR raises the low marker.

  2 - The traffic packet was not delivered and the ITR gets a PTB.
      This enables the ITR to lower its high marker and send a
      PTB to the SH with the correct MTU value, so that subsequent
      packets the SH sends, once encapsulated, will fit the MTU
      limit exactly.

  3 - If the long packet does not arrive at the ETR, but the short
      one does, and the ITR receives no PTB, then perhaps the long
      packet was merely lost.  However, if this happens repeatedly
      then the ITR should discern that there is an MTU limit below
      this length, and so adjust its upper marker downwards.  This
      will result in a PTB going to the SH, which will send
      shorter packets.  Over multiple iterations, the ITR will
      discover the MTU.  If the steps downwards taken by the ITR
      in reducing its upper marker are small, this will take a long
      time.  If they are larger, the true MTU will be found faster,
      but there may be more overshoot, resulting in somewhat
      smaller MTU for all hosts whose packets are tunneled to this
      ETR than is really needed.

      Once a pair of probe packets does arrive at the ETR, the ITR
      is informed of this by the ETR and it raises its upper marker
      accordingly.

Pretty quickly, the two markers would reach the same value and the
ITR would have a reliable measure of the MTU to this ETR.
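
A minimal sketch of how the two markers might be kept per ETR is given
below.  None of this is from a written-up IPTM specification: the class
and method names, the initial marker values and the 16 byte step are
all assumptions for illustration, and "repeated loss" is reduced to a
single call.

```python
from dataclasses import dataclass


@dataclass
class ZoneOfUncertainty:
    low: int        # longest encapsulated length known to reach the ETR
    high: int       # lowest upper bound on the PMTU learned so far
    step: int = 16  # hypothetical step used when no PTB arrives

    def in_zone(self, encapsulated_len):
        return self.low < encapsulated_len <= self.high

    def both_arrived(self, encapsulated_len):
        # Outcome 1: the ETR reports that the probe pair arrived.
        self.low = max(self.low, encapsulated_len)

    def ptb_received(self, ptb_mtu):
        # Outcome 2: a router reported its next-hop MTU.
        self.high = min(self.high, ptb_mtu)
        self.low = min(self.low, self.high)

    def repeated_silent_loss(self, encapsulated_len):
        # Outcome 3: no PTB and the long packet keeps vanishing, so the
        # upper marker is stepped down, but never below the low marker.
        self.high = max(self.low, min(self.high, encapsulated_len - self.step))

    def settled(self):
        return self.low == self.high


zone = ZoneOfUncertainty(low=1200, high=9000)  # assumed starting values
zone.ptb_received(1400)     # first long packet hits a 1400 byte limit
zone.both_arrived(1400)     # the SH's resized packet gets through
print(zone.settled())       # True
```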

The most likely pattern would be this:  Initially (before the ITR has
ever sent packets to this ETR address) the markers are wide apart.
Then the SH sends a packet which is longer, once encapsulated, than
the low marker and also longer than the actual tunnel MTU, minus
encapsulation overhead.  So the ITR does the two packet probing
protocol and gets a PTB.  Assuming this is from the router which sets
the lowest PMTU limit in the tunnel, then this PTB enables the ITR to
send a PTB straight back to the SH.  This also enables the ITR to
lower the upper marker to match the MTU reported in the PTB it just
received.

The SH will create another packet of the correct length to suit the
tunnel (after encapsulation it will be the length specified in the
PTB from the limiting router).  The ITR gets this and because it
would be longer, after encapsulation, than the low marker, the ITR
again sends it to the ETR with the two-packet tunneling protocol.
This time, it gets through, and the ETR reports this.  As soon as
this report gets to the ITR, the ITR adjusts its lower marker to the
same value as the upper marker - so there is no more Zone of
Uncertainty.

The SH will generally continue to send packets of this length or
shorter.  If some other SH (or another application in the same SH)
sends a packet which needs to be tunneled to the same ETR and which
is longer than this now reliably known PMTU value, then the ITR will
drop it and send back a PTB, without trying to send it into the
tunnel.  That application or SH will then send packets of the right
length.
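
The ITR's per-packet choice described above can be sketched like this.
The function name and the returned labels are hypothetical; the lengths
are encapsulated lengths, compared against the two markers the ITR
holds for the ETR in question.

```python
def itr_action(encapsulated_len, low_marker, high_marker):
    """Decide what the ITR does with one traffic packet for this ETR."""
    if encapsulated_len > high_marker:
        # Known to be too long: never enters the tunnel.
        return "drop_and_send_ptb"
    if encapsulated_len > low_marker:
        # Inside the Zone of Uncertainty: use the B and A probe pair.
        return "probe_with_B_and_A"
    # At or below a length already known to be delivered.
    return "tunnel_normally"


# Once both markers sit at 1400 there is no zone left to probe.
for length in (1500, 1400, 1300):
    print(length, itr_action(length, 1400, 1400))
```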

I haven't figured out every detail of this - at some stage I intend
to work on it more and write it up as an ID.  For now, it is at:

   http://www.firstpr.com.au/ip/ivip/pmtud-frag/



>> In Ivip, most traffic packets are encapsulated by the ITR with the
>> sending host's address as the outer header's source address.  Any PTB
>> which results from those goes to the sending host, which will not
>> recognise it.
> 
> If the source address of the original packet also
> goes as the source address of the outer packet, then
> wouldn't that constitute mixing both EID and RLOC
> addressing within the same routing region? I thought
> the whole purpose of the CES approach was to keep the
> EID and RLOC routing and addressing spaces separate.

The "Separation" means that a subset of the global unicast address
space is used as "SPI" space (Ivip) or "EID" space (LISP).  This is
not a separate namespace, just a subset of the global unicast address
space, in the form of multiple DFZ-advertised prefixes which have no
name in LISP, but which are called Mapped Address Blocks (MABs) in Ivip.

The ITR tunneling uses the sending host's source address as the outer
source address for all ordinarily tunneled packets.  This enables any
ISP BR filtering - which drops incoming packets due to them having a
source address from any one of the ISP's prefixes - to be enforced on
the inner packet by all ETRs, by the simple method of dropping any
inner packet whose source address is different from the outer
header's source address.
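
As a sketch of why this works, assuming the ETR sits behind the ISP's
BR and using made-up function names and documentation prefixes: a
packet which spoofs one of the ISP's own addresses either fails at the
BR (outer header) or fails the ETR's equality check (inner header).

```python
import ipaddress


def br_drops(outer_src, isp_prefixes):
    """BR filter: drop incoming packets whose source address claims to
    be inside one of the ISP's own prefixes."""
    src = ipaddress.ip_address(outer_src)
    return any(src in ipaddress.ip_network(p) for p in isp_prefixes)


def etr_drops(outer_src, inner_src):
    """ETR check: inner source must equal outer source."""
    return ipaddress.ip_address(outer_src) != ipaddress.ip_address(inner_src)


def reaches_destination(outer_src, inner_src, isp_prefixes):
    return not br_drops(outer_src, isp_prefixes) and not etr_drops(outer_src, inner_src)


prefixes = ["192.0.2.0/24"]
print(reaches_destination("192.0.2.10", "192.0.2.10", prefixes))      # False
print(reaches_destination("198.51.100.7", "192.0.2.10", prefixes))    # False
print(reaches_destination("198.51.100.7", "198.51.100.7", prefixes))  # True
```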

This functionality is also enforced with the 2 packet probing
arrangement - the short A packet is also sent with its outer source
address being that of the sending host.

When a packet would be, after encapsulation, a length within the
"Zone of Uncertainty" then the ITR uses the special long (B) and
short (A) protocol.  The B packet's outer source address is that of
the ITR - so the ITR would normally get any PTB which arises from the
B packet.

The "sending host" address could be a conventional (non-SPI) address
such as from a host on PA or PI space - or it could be on an SPI
address.  This is the source address of the packet the ITR is
processing.  Whether that is the actual address of the sending host,
or just that of a NAT box which the sending host is behind, is not
known to the ITR.

All hosts and most routers make no distinction between addresses in
the SPI subset and the rest of the addresses in the global unicast
address range.  Only ITRs treat them differently - and then only when
they appear in the destination address of a packet which is forwarded
to them.  Instead of forwarding the packet according to its
destination address, the ITR's FIB processes the packet differently.

If the ITR's FIB (which may be all in software, since the ITR may be
in the sending host or be implemented on a COTS server) already has
mapping for a micronet which matches this SPI address, it uses that
mapping (a single ETR address) to tunnel the packet.  If not, it
buffers the packet, requests mapping from a nearby QSD (full database
query server), installs the mapping (a micronet start and end
address, with a single ETR address) in its FIB and then tunnels the
packet accordingly.
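
Here is a rough sketch of that lookup, with the packet buffering left
out.  The class, its methods and the shape of the QSD reply (start,
end, ETR address) are assumptions for illustration, and micronets are
assumed not to overlap.

```python
import bisect
import ipaddress


class ItrFib:
    def __init__(self, query_qsd):
        self._starts = []    # sorted micronet start addresses
        self._entries = []   # parallel list of (start, end, etr)
        self._query_qsd = query_qsd

    def _find(self, addr):
        i = bisect.bisect_right(self._starts, addr) - 1
        if i >= 0:
            start, end, etr = self._entries[i]
            if start <= addr <= end:
                return etr
        return None

    def _install(self, start, end, etr):
        i = bisect.bisect_left(self._starts, start)
        self._starts.insert(i, start)
        self._entries.insert(i, (start, end, etr))

    def etr_for(self, dest):
        addr = int(ipaddress.ip_address(dest))
        etr = self._find(addr)
        if etr is None:
            # Miss: ask the query server, install the mapping, then use it.
            start, end, etr = self._query_qsd(addr)
            self._install(start, end, etr)
        return etr


def fake_qsd(addr):
    # Stand-in for a real query server: one 8 address micronet.
    base = int(ipaddress.ip_address("203.0.113.0"))
    return base, base + 7, "198.51.100.9"


fib = ItrFib(fake_qsd)
print(fib.etr_for("203.0.113.5"))  # queried, then installed
print(fib.etr_for("203.0.113.6"))  # answered from the FIB
```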

There's no concept of "routing region" in Ivip.  ITRs in any place
would do exactly the same thing.  All other devices - hosts and
ordinary routers - make no distinction between SPI addresses and
the remainder of the addresses in the global unicast address range.



>> In this scenario, the ITR gets back just the IPv4 header and the UDP
>> header.  The attacker has to guess the 16 bit ID field in the IPv4
>> header, which is tricky - but it could eventually succeed in doing
>> so.  Here are the components of the UDP header:
>>
>>   Source port     The ITR could use a randomized source port.  This,
>>                   combined with the 16 bit ID field, could extend
>>                   the number of bits to be guessed to 32 - which
>>                   I think is sufficiently secure, considering a
>>                   successful attack only degrades efficiency, rather
>>                   than causes actual loss of connectivity.
>>
>>   Destination port   Currently, I assume there is a single UDP port
>>                   on all ETRs to send the long (B) packet to.  If
>>                   I could easily randomize this too - such as making
>>                   the most significant 8 bits fixed, and the others
>>                   up to the ITR to choose.
>>
>>                   This would be 40 random bits - perfectly secure
>>                   considering the moderate level of DoS the attack
>>                   could result in.
>>
>>   Length          If the attacker created the traffic packet, they
>>                   would know the length of what follows the UDP
>>                   header.
>>
>>   Checksum        Ahhh - this is not a header checksum.  This covers
>>                   the data behind the UDP header.  This data is
>>                   mainly from the traffic packet, but it contains
>>                   a nonce.  So the 16 bit checksum is affected by
>>                   the nonce.
>>
>> I hadn't realised this before - the UDP checksum contains another
>> 16 bits the attacker has to guess.  Combined with the IPv4 header's
>> 16 bit ID field, I think this makes it highly secure.  If this is
>> not enough, the 16 bit random ITR source UDP port should be sufficient.
>>
>> So the ITR doesn't need any more bits than are necessarily supplied
>> by a minimally compliant RFC1191 implementation in the router which
>> sends the PTB.
>>
>> How would this work for SEAL?
> 
> Using the UDP/TCP checksum as a nonce requires that the
> ITE cache copies of its recently-sent packets. 

The above procedure is only for the long (B) packet when the ITR is
still uncertain of the PMTU to a given ETR and the packet, if
normally encapsulated, would be of a length within this Zone of
Uncertainty.  The B packet is the same length, but is sent with the
ITR's address in the outer header.  Like the short A packet, it is a
UDP packet with special headers.  All the normally encapsulated
packets are IP-in-IP, with no other headers, and with the sending
host's address in the outer source address.  So normally encapsulated
packets can't generate PTBs to the ITR - only to the sending host,
which would not recognise them, since they arise from an encapsulated
packet.

So it is only for those traffic packets which the ITR uses to
generate the B and A probe pair that it needs to cache enough of the
initial packet to be able to generate a valid PTB to the sending
host.  The ITR would also cache the nonce, which it uses to secure
the ETR's response to these packets.

It would also cache the UDP header of the B packet, which includes
the checksum which is almost impossible for an attacker to guess due
to its dependence on the nonce.  I think the combination of an
unpredictable 16 bit ID in the outer header of the B packet, and the
influence of the nonce on the 16 bit checksum, would be sufficient to
prevent attacks succeeding at a significant rate.  If that wasn't
enough, the ITR could use 16 bit randomization of its UDP source
port, and randomize 8 or more bits of the destination UDP port too.
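
To illustrate the nonce's influence on the checksum, here is a small
sketch.  It is simplified: the payload layout (32 bit nonce followed
directly by traffic bytes) is my assumption, and a real UDP checksum
also covers the pseudo-header and the UDP header itself, which are
left out here.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """16 bit ones' complement of the ones' complement sum of the data."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def b_packet_checksum(nonce: int, traffic: bytes) -> int:
    # Hypothetical layout: nonce first, then the traffic packet bytes.
    return internet_checksum(struct.pack("!I", nonce) + traffic)


traffic = bytes(range(64))
print(hex(b_packet_checksum(0x00000001, traffic)))
print(hex(b_packet_checksum(0x00000002, traffic)))

# Bits an off-path attacker must guess, using the figures in the text.
bits = {"IPv4 ID": 16, "UDP checksum": 16,
        "UDP source port": 16, "UDP destination port": 8}
total_bits = sum(bits.values())
print(total_bits, 2.0 ** -total_bits)
```

The same traffic bytes give a different checksum for each of the two
nonces, so an attacker who knows the traffic packet still cannot
predict the checksum field without the nonce.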


> But then,
> it would need to do this for every tunnel it belongs to
> and it has no way of knowing for how long it will have
> to retain the cached copies. 

As I noted above, the ITR doesn't cache its ordinarily encapsulated
packets.  The PTB would not go to the ITR.  It only needs to cache
the start of those packets it is sending with the two-packet probing
protocol.

If the ITR estimates the PMTU to a given ETR, and gets it right - and
then later the MTU falls, then there is a difficulty.  The normally
encapsulated packets which are too long will be dropped, the ITR will
not get any PTBs and the sending host will not recognise the PTBs
which are sent to it.

I can think of two approaches to minimising the impact of this.

One is to have the ITR periodically, such as every 30 seconds, send a
packet which (once encapsulated) is at, or close to, the MTU limit,
as a B and A probe pair.  This is assuming the ITR is continually
sending long packets to this ETR.

This will normally deliver the packet fine, and the ITR will be able
to confirm that the PMTU has not become any less than what it
assumes.  If the packet doesn't arrive, or if the ITR gets a PTB,
then perhaps the MTU has dropped and the ITR can find out what it has
dropped to.

It will usually find out the new PMTU from a PTB, but if there is no
PTB, then it will need to send more probes of various lengths until
one size does get through to the ETR.

It would do this as described above, without having to generate
special probe packets, by lowering its upper marker for the MTU
estimate by some value, such as 8 or 16 bytes, sending a PTB to the
sending host (or multiple sending hosts as they send packets which
would exceed this length, once encapsulated) and allowing the sending
hosts to create traffic packets which the ITR will send using the 2
packet probing technique.  This will rapidly reduce the value of the
upper marker to being equal to, or somewhat less than, the real PMTU
limit.  This will drag down the lower marker too.
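
The trade-off between step size and overshoot can be sketched as
below.  This assumes no PTB ever arrives, that one lost probe per step
is enough for the ITR to decide to step down, and that the SH always
resizes to the new upper marker; the function name is made up.

```python
def step_down_search(true_pmtu, upper, step):
    """Lower the upper marker by `step` until a probe of that length
    would fit.  Returns (final_estimate, probes_lost)."""
    lost = 0
    while upper > true_pmtu:
        lost += 1       # the B packet of this length never arrives
        upper -= step   # ITR lowers the marker and sends a PTB to the SH
    return upper, lost


print(step_down_search(1400, 1500, 16))  # (1388, 7)
print(step_down_search(1400, 1500, 8))   # (1396, 13)
```

With a 16 byte step the estimate ends 12 bytes below the real limit
after 7 lost probes; with an 8 byte step it ends 4 bytes below after
13.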


The following second approach would only work if the returned part of
the packet was long enough to show the SH that it really did result
from a packet the SH sent.  In practice, this should always be the
case with IPv6, and I have gained the impression that it is
common for IPv4 routers to send back more than the bare RFC1191
minimum anyway.  This would require the sending host to have a
modified stack which was ready to analyse PTBs which resulted from
packets in the ITR to ETR tunnel.  Assuming this modified stack code
could verify that the PTB was genuine, it would compute a new MTU for
this destination address, by subtracting the encapsulation overhead
(20 for IPv4, 40 for IPv6) from the MTU value in the PTB.

This one PTB would not help the SH learn about a reduced PMTU to
other SPI destination addresses it was sending to which also were
being tunneled to the same ETR.  But the same code would receive PTBs
from those packets too.  The ITR would be none-the-wiser, so the
first technique and the one below would still be important - but this
optional host upgrade would enable the SH to respond immediately and
correctly to a reduction in PMTU to an ETR.  Other applications in
other SHs would need to repeat this exercise, since the ITR doesn't
know these PTBs are occurring.

After 10 minutes, a sending host is allowed (RFC1191 / 1981) to try
sending a longer packet than was allowed by a previous PTB.  Then,
the ITR needs to recognise the time which has elapsed and use this
with the B and A probe technique.

This may be a little complex, considering multiple applications in
one sending host, multiple hosts and multiple destination SPI
addresses may all result in packets being sent to the one ETR address.

I can see ways of coping with this stuff - but some of it requires
carefully designed algorithms and considerable logic and state in the
ITR.

If we can upgrade the DFZ and other routers with firmware, then all
this encapsulation and PMTUD stuff can be ignored - by using Modified
Header Forwarding instead.  That is the way to do it in the long-term
future, even if we start with encapsulation.


> With SEAL, the ITE never
> has to cache packets in order to match them up with
> any PTB feedback.

Yes - here's my understanding of how your SEAL ID specifies that the
ITE, and therefore the SH, perform PMTUD in the tunnel path between
the ITE and the ETE.

I assume in all cases that the ITE initially sets S_MRU for this ETE
to "infinity" as described in 4.3.3, and then uses one of the
following methods to reduce it.  This is me trying to imagine how
SEAL would be used for ITE to ETE tunneling when RANGER is used as a
Core-Edge Separation architecture.

RANGER can be used for many more purposes than this, and so can SEAL,
so it is quite a challenge to decide which parts of the IDs to
ignore.  I understand that in this application, the ITE will be
reducing the S_MRU value for each ETE it tunnels to.  I think that in
other SEAL applications, this may not occur, and so you have
arrangements for using SEAL segmentation to send long traffic packets
as multiple SEAL-segmented packets to the other end of the tunnel.
But this would never be invoked in a Core-Edge Separation
application, since the ITE always sends PTBs to the SHs to have them
reduce their packet length.

In this application of SEAL, I understand there is no need for any
mid-layer protocol between the IPv4 or IPv6 header and the SEAL
header, or between the SEAL header and the traffic packet.  This is
not clearly specified anywhere, since the SEAL and RANGER documents
are general purpose, and their use for a scalable routing solution as
a Core-Edge Separation architecture is only one thing they could be
used for.


Firstly I describe my understanding of what your ID specifies for IPv4.

Secondly I describe two other ways you might do PMTUD with IPv4,
without using DF=0 packets.  These would avoid whatever risk there
might be of setting the ITE's PMTU estimate too low due to a limiting
router sending out fragments which are shorter than the limiting next
hop MTU.

Finally, I describe my understanding of what your ID specifies for IPv6.

This is partly for my own reference, since it took me many hours to
discern this by reading the SEAL ID and corresponding with you.


IPv4:

  The ITE sends a DF=0 packet into the tunnel.  This starts with
  an IPv4 header, then has a SEAL header (there's no mid-level
  protocol in this Core-Edge Separation usage of RANGER and SEAL)
  and then the inner packet, the original IPv4 traffic packet.

  The source address in the outer header is that of the ITE and
  the destination address is that of the ETE.  The 32 bit
  SEAL ID is split in two.  16 bits go into the IPv4 header's
  ID field and 16 into the SEAL header's ID Extension field.

  The limiting router in the tunnel (the one where the next-hop
  MTU is less than the length of this whole packet) fragments
  it into at least two fragments.

  Now the second para in 4.4.2 comes into play:

        When the ETE processes the IP first-fragment (i.e.,
        one with MF=1 and Offset=0 in the IP header) of a
        fragmented SEAL packet, ...

  The first para was for reassembling packets which had been
  fragmented by the SEAL protocol.  But the second para is for
  SEAL packets, as was just sent, being fragmented by a
  router between the ITE and the ETE.  This only occurs for
  IPv4 and I think it would be helpful to mention IPv4 in this
  paragraph.  Maybe it needs its own section.

       ...  it sends a "Reassembly Report - Fragmentation
       Experienced" message back to the ITE with the S_MSS field
       set to the length of the first-fragment and with the
       S_MRU field set to no more than the size of the reassembly
       buffer (see Section 4.4.5).

  I think this last part about the value of S_MRU is not clear
  enough.  What value should it be set to?

  I will assume it is set to some non-zero value.

  Assuming the limiting router sent out the first fragment with
  a length equal to the limiting next-hop MTU, then this MTU
  value is now in the S_MSS field of the message sent to the
  ITE.


  This message arrives at the ITE.  This message, according to
  Figure 4, contains:

       As much of invoking packet as possible without the
       message exceeding 576 bytes.

  Maybe your ID specifies this, but I am having trouble
  following it - there has to be a way the ITE securely
  accepts this "Fragmentation Experienced" message.

  As far as I know, the ITE looks into the message, finding
  the initial part of the packet which the ETE received as
  a first fragment.  That will contain the outer IPv4 header
  and the SEAL header, and from these the 32 bit SEAL ID
  in the SEAL encapsulated packet can be found.

  I think you either cache the recently sent 32 bit SEAL IDs
  or maintain a sliding window function over their range so
  you can easily identify a value which was used in the last
  second or two.  In a given ITE, each ETE has its own SEAL
  ID counter.  Its value is initialized randomly when the
  state for this ETE is created.  After that, its value
  increments with each packet sent to the ETE.

  (I may adopt this incrementing value per ETR arrangement,
  with its sliding window, rather than using a nonce.)

  The wider the window in time, the longer you can accept these
  messages.  Since the ETE and the ITE could be on opposite
  sides of the Net, I guess you need to have a window which
  accepts SEAL IDs sent at least a second ago.

  The longer the window in time, and the more packets the
  ITE sends to this ETE, the wider the window is numerically
  and the easier it is for an attacker to guess a valid value
  and have the ITE accept a PTB with a low enough value
  to cause lost efficiency - for the next 10 minutes or so.
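
  A sketch of such a per-ETE counter and window follows.  The class
  is hypothetical, and for simplicity the window is counted in
  packets rather than in time, which is an assumption rather than
  anything the ID states.

```python
import secrets

MASK32 = 0xFFFFFFFF


class SealIdWindow:
    def __init__(self, window=1024):
        self.window = window                 # accepted IDs, counted in packets
        self.next_id = secrets.randbits(32)  # random initial value per ETE
        self.sent = 0

    def allocate(self):
        """Return the SEAL ID for the next packet and advance the counter."""
        seal_id = self.next_id
        self.next_id = (self.next_id + 1) & MASK32
        self.sent += 1
        return seal_id

    def is_recent(self, seal_id):
        """True if seal_id is one of the last `window` IDs actually sent."""
        age = (self.next_id - 1 - seal_id) & MASK32  # wraps modulo 2**32
        return age < min(self.window, self.sent)

    def blind_guess_probability(self):
        return min(self.window, self.sent) / 2 ** 32


w = SealIdWindow(window=1024)
sent_id = w.allocate()
print(w.is_recent(sent_id))                  # True
print(w.is_recent((sent_id + 5) & MASK32))   # False, never sent
```

  The numerical width of the window is what sets the attacker's odds:
  with 1024 accepted values out of 2^32, a blind guess succeeds with
  probability 2^-22.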

  Now to 4.3.9.1.2:

       4.3.9.1.2. Fragmentation Experienced (Code=1)

       If the value in the S_MRU field is non-zero, the
       ITE records the value in its soft state for this ETE.

  This means this value is stored in the S_MRU variable for
  this ETE, as defined in 4.3.3.  As noted above, I am not
  clear on what value was written into this field of the
  report by the ETE.

       The ITE then adjusts the S_MSS value in its soft state.

  This means this value is stored in the S_MSS variable for
  this ETE, as defined in 4.3.3, subject to the instructions
  in the next few sentences.

  I am a bit confused about the differing roles of these two
  variables.

       If the S_MSS value in the Reassembly Report is greater
       than 576 (i.e., the nominal minimum MTU for IPv4 links),
       the ITE records this new value in its soft state.

  OK - this is based on the assumption that the length of the
  first fragment received by the ETE reflects the limiting
  MTU of the ITE to ETE path.

       If the S_MSS value in the report is less than the current
       soft state value and also less than 576,

  How could the ITE's S_MSS value for this ETE be less than
  576?  I can't see how.  If it can't be, then the first part
  of the above sentence may be redundant.

       the ITE can discern that IP fragmentation is occurring
       but it cannot determine the true MTU of the restricting
       link due to a router on the path generating runt
       first-fragments.

  Then the next paragraph describes the "iterative searching
  strategy" to find the correct (or near enough, but perhaps
  lower) value for the S_MSS variable for this ETE.

  I think this paragraph is unclear.  I think it should state
  that the probes are occurring only due to traffic packets
  arriving at the ITE and being tunneled to this ETE, and
  these being long enough.  Since SEAL treats all packets
  as probes, this use of the term "probe" may be confusing -
  since in fact all packets may be probes.

  I think this paragraph should describe the process in detail
  - I guess it only occurs with real traffic packets.

  The reference to section 5 of RFC1191 is not very helpful
  because it describes several algorithms.
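
  My reading of the quoted 4.3.9.1.2 text, as a sketch only: the
  function, the dict used for soft state and the starting S_MSS of
  1500 are my own inventions.  The quoted text does not say what to
  do when the reported S_MSS is exactly 576, so this leaves the
  state unchanged in that case.

```python
IPV4_MIN_MTU = 576  # "nominal minimum MTU for IPv4 links" in the quoted text


def on_fragmentation_experienced(state, report_s_mru, report_s_mss):
    """Update the ITE's soft state for one ETE from a Reassembly Report.

    Returns True when the report looks like a runt first-fragment, so
    that an iterative search for the real MTU would be needed."""
    if report_s_mru != 0:
        state["S_MRU"] = report_s_mru
    if report_s_mss > IPV4_MIN_MTU:
        state["S_MSS"] = report_s_mss
        return False
    if report_s_mss < state["S_MSS"] and report_s_mss < IPV4_MIN_MTU:
        return True
    return False


state = {"S_MRU": float("inf"), "S_MSS": 1500}  # assumed starting values
print(on_fragmentation_experienced(state, 2048, 1300), state)
print(on_fragmentation_experienced(state, 0, 296), state)
```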



IPv4 - my suggestion for doing it with DF=1 packets

  A - If you can be sure the routers send back more than
      the bare minimum IPv4 header + 32 bits.  (It's just
      firmware updates to have routers do this and maybe
      most or all of them already do.)

      Send the SEAL packets as noted above, but with DF=1.

      If there is an MTU problem, the ITE will get a PTB
      with the MTU value it needs, plus enough of the
      SEAL packet to extract sufficient of the traffic
      packet to make a valid PTB for the SH.

      The MTU value from the received PTB is written into
      the S_MTU for this ETE.  This gives an exact value
      without the problem of potential "runt packets" which
      arises with DF=0 in your current process.

      The ITE subtracts 24 from the MTU value it received
      in the PTB from the limiting router (20 bytes of IPv4
      header + 4 bytes of SEAL header) and uses this value
      in the PTB to the SH.

      This will work fine - the SH will then send packets
      of the correct size, so when they are SEAL encapsulated
      they will not have a problem with this MTU limit.

      If any other SH, or another application in the same
      SH, sends a packet whose length exceeds the new MTU
      value minus 24, then the ITE will send back a PTB
      accordingly.

      If there is a further, lower, MTU limit en-route to
      the ETE, then the above process will be repeated.
      This is similar in principle to your IPv6 approach.


  B - If you have to assume that some or all routers between
      the ITE and the ETE only send back the bare minimum
      amount of packet in their PTB, then you can still
      accept these packets securely, and calculate a proper
      MTU value to send to the SH, as described above.

      In order to be able to generate a valid PTB, you need
      the ITE to have cached the IPv4 header and the next
      32 bits which follow (24 bytes) for each packet sent
      which you think might give rise to a PTB.  You don't
      need to do this with packets shorter than some constant
      - depending on whatever is the lowest PMTU you ever
      expect to find between an ITE and an ETE.

      I guess you only need to cache these 24 byte items for a
      second or so.  You need to be able to index into the
      cache by using the 32 bit SEAL ID retrieved from the
      16 bit IPv4 ID and the 16 bit SEAL ID Extension in the
      initial part of the encapsulated packet, which is in
      the first fragment, as returned in the PTB.
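
      A sketch of such a cache is below.  The names are made up,
      the 24 byte and one second figures are the ones guessed at
      above, and which 16 bit half of the SEAL ID goes in the IPv4
      ID field is an assumption here, not something I have checked
      in the ID.

```python
import time


def seal_id_from_headers(ipv4_id, id_extension):
    # Assumption: the SEAL ID Extension carries the high 16 bits.
    return ((id_extension & 0xFFFF) << 16) | (ipv4_id & 0xFFFF)


class HeaderCache:
    def __init__(self, lifetime=1.0, min_len=0, clock=time.monotonic):
        self.lifetime = lifetime  # seconds to keep each item
        self.min_len = min_len    # shorter packets are never cached
        self.clock = clock
        self._items = {}          # seal_id -> (expiry, first 24 bytes)

    def _expire(self):
        now = self.clock()
        # dicts keep insertion order, so the oldest items come first.
        for key in list(self._items):
            if self._items[key][0] > now:
                break
            del self._items[key]

    def remember(self, seal_id, traffic_packet: bytes):
        if len(traffic_packet) < self.min_len:
            return
        self._expire()
        self._items.pop(seal_id, None)  # re-insert at the newest position
        self._items[seal_id] = (self.clock() + self.lifetime, traffic_packet[:24])

    def lookup(self, seal_id):
        self._expire()
        item = self._items.get(seal_id)
        return item[1] if item else None


cache = HeaderCache(lifetime=1.0, min_len=576)
sid = seal_id_from_headers(0x1234, 0xABCD)
cache.remember(sid, bytes(1400))
print(hex(sid), cache.lookup(sid) is not None)
```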


IPv6:

  The ITE creates an IPv6 header and a Fragment Header.  As far
  as I know, there is no SEAL header.  I think this should be
  made more clear in the final part of 3.4.3.

  The 32 bit SEAL ID is written into the Identification field
  of the Fragment Header.  The ITE then appends the traffic
  packet.

  The result is forwarded towards the ETE.  If it is too big for
  a next-hop MTU in any router en-route to the ETE, that router
  sends back a PTB to the ITE with an MTU value, and enough of
  the original packet for the ITE to construct a valid PTB to the
  SH.

     (I can't find where your ID describes the reception of
      the PTB.  Section 4.3.8 should cover this, but makes no
      specific mention of PTB messages.)

  The ITE secures the acceptance of the PTB by comparing
  the 32 bit SEAL ID, as noted for IPv4 above, via a cached set
  of recently used values or some kind of window function.

  The MTU value is written to the S_MTU for this ETE.

  The ITE subtracts 48 from the MTU value (40 bytes for the IPv6
  header and 8 bytes for the Fragment Header) and uses this to set
  the MTU value in the PTB which is sent to the SH.

  As with the IPv4 approach, any packets arriving at the ITE
  which will be tunneled to this ETE, if longer than the MTU
  value minus 48, will be dropped and used to send a PTB to
  the SH.

  This is identical in principle to my IPv4 suggestion A above.
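
  The arithmetic above, as a small sketch (function names are
  hypothetical):

```python
IPV6_HEADER = 40
FRAGMENT_HEADER = 8
ENCAP_OVERHEAD = IPV6_HEADER + FRAGMENT_HEADER   # 48 bytes


def mtu_for_sending_host(ptb_mtu):
    """MTU value to place in the PTB which the ITE sends to the SH."""
    return ptb_mtu - ENCAP_OVERHEAD


def must_send_ptb(traffic_packet_len, s_mtu):
    """True if a traffic packet is too long to tunnel to this ETE."""
    return traffic_packet_len > s_mtu - ENCAP_OVERHEAD
```

  For example, a PTB from a router reporting an MTU of 1400 leads
  to the SH being told 1352, and any later traffic packet longer
  than 1352 bytes is dropped at the ITE and answered with a PTB.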


Returning to Ivip's IPTM protocol:

With IPv6, I could avoid caching any part of the packet if I could
rely on the ITR getting a PTB, since the PTB is guaranteed to contain
plenty of the inner packet - enough for the sending host to
recognise.  (Actually, the minimum amount of original packet returned
could be less than what should be returned to the SH, due to the
encapsulation IP, UDP and IPTM headers which precede it.  Still,
enough should be there that any SH should be able to recognise it.)

I still need to send two packets, since the long B packet does not
contain the full traffic packet.  It has extra things - a UDP header
and an IPTM header, with a 32 bit nonce.   The last part of the
traffic packet is not in the B packet.

The part which doesn't fit is contained in the A packet.

If both A and B arrive at the ETR, the whole traffic packet is
delivered.  The A packet is matched to its B packet with the nonce
they both contain in their IPTM headers.

The ETR will send a message to the ITR, also secured by the nonce,
to tell it that both parts arrived.

If the B part doesn't arrive, after a ~0.5 second time-out, the ETR
will send a message to the ITR telling it that only the A part arrived.

If only the B part arrives, the ETR sends a message to the ITR to
that effect too.
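
The ETR side of this matching could be sketched roughly as follows.
The class name and report strings are hypothetical, and applying the
same ~0.5 second time-out to a lone B packet is an assumption (the
text above only gives the time-out for a lone A packet):

```python
TIMEOUT = 0.5   # seconds, the "~0.5 second time-out"


class ProbeMatcher:
    """Pairs A and B packets by nonce and reports what arrived."""

    def __init__(self, timeout=TIMEOUT):
        self.timeout = timeout
        self.pending = {}   # nonce -> (kind, arrival time)

    def on_packet(self, kind, nonce, now):
        """Record an "A" or "B" packet; report once both have arrived."""
        other = self.pending.get(nonce)
        if other is not None and other[0] != kind:
            del self.pending[nonce]
            return (nonce, "both")
        # First packet of the pair (or a duplicate): wait for its partner.
        self.pending.setdefault(nonce, (kind, now))
        return None

    def on_timer(self, now):
        """Report every nonce whose partner packet never showed up."""
        reports = []
        for nonce, (kind, arrived) in list(self.pending.items()):
            if now - arrived >= self.timeout:
                del self.pending[nonce]
                reports.append((nonce, "A only" if kind == "A" else "B only"))
        return reports
```

Each report would be sent back to the ITR in a message secured by
the same nonce.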

With IPv4, if it could be assumed that all routers would return
enough of the packet to include the first 24 bytes of the inner
packet, then I could use the same approach as for IPv6 - and so avoid
caching any part of the traffic packet in the ITR.

If I couldn't assume this, then there are two approaches:

  1 - Do as for my suggestion B above - have the ITR cache the
      first 24 bytes of traffic packets used for this 2 packet
      probing technique.  The cache would be indexable via the
      nonce sent with each packet - and the caching time would
      be about a second.

  2 - Avoid caching in the ITR, by including the first 24 bytes
      of the traffic packet in the A packet.  If the A packet
      arrives at the ETR, and the B packet doesn't, the ETR
      can report this, and return the 24 bytes from the A
      packet.

The ITR is only doing this 2 packet probing technique infrequently,
so the caching approach is not particularly expensive.  Caching the
first 24 bytes of the packet has an advantage that when the ITR gets
a PTB - as would normally be the case if the B packet was too long -
then the ITR can send the PTB immediately, rather than waiting for a
message from the ETR, which would necessarily arrive at least a
second after the ITR tunneled the packet.


The detection of a PMTU limit doesn't have to be absolutely
bullet-proof.  It should not result in the ITR deciding that the PMTU
is lower than it actually is, but if, for some reason, the probe
process produces an indeterminate result - such as the ITR not
getting anything back from the ETR as a result of the A or B packet,
and no PTB either - then the ITR takes no further action.  This is
indistinguishable from ordinary packet loss.  The most likely outcome
is that the SH will try again, with a similar sized packet (unless it
is doing RFC 4821 ... which no hosts appear to be doing at present)
and the ITR will again generate the B and A packets.  Then, the most
likely outcome will be that the ITR learns something definite about
the PMTU, and so reduces its Zone of Uncertainty.
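
A minimal sketch of this bookkeeping, treating the Zone of
Uncertainty as the range between the largest packet size known to
reach the ETR and the smallest size known not to.  That reading, and
all of the names below, are assumptions made for illustration:

```python
class PmtuEstimate:
    """Tracks what an ITR knows about the PMTU to one ETR."""

    def __init__(self, known_ok, known_too_big):
        self.known_ok = known_ok             # largest size known to get through
        self.known_too_big = known_too_big   # smallest size known not to

    def zone_of_uncertainty(self):
        """Inclusive range of sizes whose fate is still unknown."""
        return (self.known_ok + 1, self.known_too_big - 1)

    def update(self, probe_len, outcome, ptb_mtu=None):
        if outcome == "both":
            # The long B packet arrived, so probe_len fits the path.
            self.known_ok = max(self.known_ok, probe_len)
        elif outcome in ("A only", "ptb"):
            limit = probe_len
            if ptb_mtu is not None:
                # Anything longer than the reported MTU is also too big.
                limit = min(limit, ptb_mtu + 1)
            self.known_too_big = min(self.known_too_big, limit)
        # Any other outcome is indeterminate: change nothing, exactly
        # as for ordinary packet loss.
```

Only a definite result narrows the range, so an indeterminate probe
can never cause the ITR to settle on a PMTU lower than the real one.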

Doing this PMTUD stuff in the FIB of a big router handling gigabit
and 10 gigabit links could be quite challenging.  It doesn't have to
be done this way, since Ivip ITRs and ETRs can be implemented in
software on ordinary servers, which are inexpensive and can still
handle (I guess) gigabit traffic rates.  Also, having the ITR in the
sending host is a zero cost way of ensuring each ITR doesn't have to
juggle too many of these PMTUD probing sessions at once.


>> At present, for IPv4 and IPv6, your ITE (ITR) functions emit packets
>> with an outer header of IPv4 or IPv6, followed by a 32 bit SEAL header.
>>
>> Immediately following the SEAL header you may have some "mid-layer
>> headers" which I don't properly understand.  Then you have the IPv4
>> or IPv6 traffic packet, or perhaps a segment of it.
>>
>> You could make the SEAL ITE work fine with minimal length IPv4 PTBs
>> if the SEAL header was extended to 64 bits, with the additional 32
>> bits being a nonce.  That would always be returned in any PTB.
> 
> SEAL uses the 32bit ID (gotten from the 16bit IPv4 ID
> concatenated with SEAL's 16bit ID extension) as a nonce.
> There is no need that I can see for including an
> additional nonce.

OK.


>> So I think your objection to using RFC1191 PTBs should only be based
>> on your concern about the PTBs being systematically dropped due to
>> filtering.
>>
>> I assert that such filtering is a symptom of a badly administered
>> network - and that it should be fixed in the network, not worked
>> around with a protocol such as SEAL or IPTM.
> 
> In my understanding, in the interdomain routing region of
> the Internet there is no close coordination regarding the
> way "the network" is administered. There is also a wide
> variety of network vendor equipment deployed in the
> Internet which may have widely varying default behaviors.
> So, in general it seems overly optimistic to assume that
> all of the diverse policies, implementations and operational
> practices out there could be brought into strict uniformity.

OK - but at some point we need to stop adopting band-aid measures
like artificially limiting MSS or MTU values.  That just lets the PTB
filtering and lousy tunnels be less noticed.   We should not be
trying to upgrade the stacks of all hosts in the world because a few
end-user networks filter PTBs, or because ISPs and perhaps end-user
networks run tunnels which don't support the otherwise perfectly good
RFC 1191 / 1981 PMTUD techniques.

We would just be heaping limitations and complications on ourselves
in an overly-defensive, expensive and inefficient attempt to cope
with the failure of a few ISPs and end-user networks to run the Internet
as it needs to be run.  We are paying the ISPs.  The end-user
networks which are filtering PTBs are disrupting a subset of their
own communications.

I just think it wrong in principle to develop messy new protocols
such as RFC 4821 to cope with these failings.

  - Robin

_______________________________________________
rrg mailing list
rrg@irtf.org
http://www.irtf.org/mailman/listinfo/rrg
