Moin,

I have run into some issues with v6 PMTUD on OpenBSD 7.4, and am
somewhat at a loss as to how to find a proper reproducer.

I first ran into MTU issues when some of my mailers suddenly
started to put out ~50 Mbit/s of traffic for no apparent reason. Back
then, further debugging led to the following observations:

- I received connections from a host behind an HE IPv6 tunnel; this
  host communicated an MSS of 1440 (MTU 1500).
- When sending return packets, I received Packet-Too-Big ICMP messages
  from the HE tunnel host, indicating an MTU of 1480.
- OpenBSD reset the MTU to 1480 and resent the packet.
- I would then receive another Packet-Too-Big ICMP message from the HE
  tunnel host, again indicating an MTU of 1480; OpenBSD would set the
  MTU to 1480 and resend the packet, and so on.

The root cause back then was some form of (legacy?) misconfiguration on
the HE side: the link actually had an MTU of 1472, which the HE router
reported incorrectly in its Packet-Too-Big messages.

However, the additional issue seems to be that OpenBSD retransmits
endlessly on Packet-Too-Big if the reported MTU is the same as the
already discovered PMTU.
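
As a toy model of the loop described above (this is not OpenBSD's
actual code; `Router` and `simulate` are hypothetical names, and the
sender logic is an assumption based purely on the observed behavior),
the following sketch shows why a PTB that reports a too-large MTU never
lets the packet through if the sender simply adopts the reported value
and resends:

```python
from dataclasses import dataclass


@dataclass
class Router:
    """A hop that drops oversized packets and answers with a PTB."""
    real_mtu: int      # what the link can actually carry
    reported_mtu: int  # what the PTB message claims


def simulate(packet_size: int, router: Router, max_rounds: int = 10) -> int:
    """Return how many PTBs were received before the packet got through.

    Returns max_rounds if the packet was never delivered, i.e. the
    sender kept retransmitting for as long as we let it.
    """
    pmtu = packet_size
    for ptb_count in range(max_rounds):
        size = min(packet_size, pmtu)
        if size <= router.real_mtu:
            return ptb_count        # delivered
        pmtu = router.reported_mtu  # naive: adopt whatever the PTB says
    return max_rounds


# Correctly reporting router: one PTB, then the packet fits.
print(simulate(1500, Router(real_mtu=1472, reported_mtu=1472)))
# Misreporting router (the HE case): never converges.
print(simulate(1500, Router(real_mtu=1472, reported_mtu=1480)))
```

In the second case every round produces another PTB with MTU 1480, so
the only bound on the traffic is how fast the sender retransmits.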

I had initially shelved that issue, putting a proper write-up and a
tool to trigger this remotely on my todo list. There might be some
amplification potential here by abusing, e.g., high-bandwidth HE tunnel
endpoints to make some destination send a large amount of outbound
traffic, but I could not get this working reliably with scapy. Very
scrappy, cobbled-together code for Linux, based on an example snippet
that does an HTTP request 'by foot', can be found here; it might need
some fixing before it works:

https://rincewind.home.aperture-labs.org/~tfiebig/pmtud_code/http_request.py
https://rincewind.home.aperture-labs.org/~tfiebig/pmtud_code/http_request_v6.py

Note that this needs additional firewalling on the client so that the
Linux kernel does not interfere with the TCP sessions, i.e., to prevent
the client from sending RSTs.

Also, this is specific to the IPv6 implementation. For IPv4, OpenBSD
runs down to a minimal MTU (below the minimum MTU for v4, btw) when
repeatedly receiving PTB ICMP messages. For v6 it does not do this,
likely because the logic differs due to the higher (1280) minimum MTU.
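
For reference, the MSS/MTU numbers used in this mail follow from simple
header arithmetic. The sketch below is only an illustration (function
names are made up; it assumes fixed-size headers without IP or TCP
options, and takes 68 as the IPv4 floor from RFC 1191 and 1280 as the
IPv6 minimum link MTU):

```python
IPV4_HEADER = 20   # bytes, without options
IPV6_HEADER = 40   # bytes, fixed header only
TCP_HEADER = 20    # bytes, without options

IPV4_MIN_MTU = 68    # floor named in RFC 1191
IPV6_MIN_MTU = 1280  # IPv6 minimum link MTU


def tcp_mss(mtu: int, ipv6: bool = True) -> int:
    """TCP payload that fits into one packet of the given MTU."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    return mtu - ip_header - TCP_HEADER


def clamp_pmtu(candidate: int, ipv6: bool = True) -> int:
    """Never let a PMTU estimate drop below the protocol minimum."""
    floor = IPV6_MIN_MTU if ipv6 else IPV4_MIN_MTU
    return max(candidate, floor)


print(tcp_mss(1500))                # MSS announced by the remote host
print(tcp_mss(1480))                # MSS after the first PTB
print(clamp_pmtu(40, ipv6=False))   # a v4 estimate pushed below the floor
```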

Recently, this hit me again, when gw02.dus01.as59645.net put
~1 Gbit/s of traffic on the path to gw01.ams01.as59645.net. This
occurred after I had set up a test setup in a third location, which is
connected to gw01.ams01 via an MTU 1400 link (VXLAN tunnel over IPv6,
due to lack of fragmentation for v6).

When I installed a test device (gw02.dlft01), I connected it via an MTU
1500 link to gw01.dlft01 and, to test something unrelated, via a second
MTU 1500 link to gw01.ams01 (tunnel over v4 without fragmentation,
handled transparently by an additional device that just pushes around
VLANs).

All hosts have a BGP underlay using private ASNs (one per host) to
distribute the global unicast addresses on the direct links. In
addition, there is an iBGP setup between the hosts, exchanging
full tables and the non-router networks in use. These sessions are
handled via loopback addresses, which are also distributed via the BGP
underlay.

See the diagram below:

+-----------------------+              +-----------------------+
|gw01.dus01.as59645.net |              |gw01.ams01.as59645.net |
|         JunOS         +--------------+      VyOS (Linux)     +---+
|   lo: 2a06:d1c0::1    |              |   lo: 2a06:d1c0::a    |   |
+-----------+-----------+              +-----------+-----------+   |
            |                                      /               |
            |                                      \               |
            |                         MTU: 1400 -> /               |
            |                                      \               |
            |                                      /               |
            |                                      \               |
+-----------+-----------+             +------------+----------+    |
|gw02.dus01.as59645.net |             |gw01.dlft01.as59645.net|    |
|       OpenBSD 7.4     |             |      VyOS (Linux)     |    |
|   lo: 2a06:d1c0::2    |             |   lo: 2a06:d1c0::9    |    |
+-----------------------+             +------------+----------+    |
                                                   |               |
                                                   |               |
             +-------------------------------------+               |
             |                                                     |
+------------+----------+                                          |
|gw02.dlft01.as59645.net+------------------------------------------+
|         JunOS         |
|   lo: 2a06:d1c0::9    |
+-----------------------+

What happened now is that gw02.dlft01 opened a connection to gw02.dus01
to start an iBGP session. These packets flowed gw02.dlft01 -> gw01.ams01
-> gw01.dus01 -> gw02.dus01. Return packets, however, flowed gw02.dus01
-> gw01.dus01 -> gw01.ams01 -> gw01.dlft01. Hence, on the link
gw01.ams01 -> gw01.dlft01, they exceeded the path MTU, and gw01.ams01
started to send PTB ICMP messages announcing the correct MTU of 1400.

However, gw02.dus01 only retransmits the previous packet and does not
decrease the MSS/MTU, leading to another PTB ICMP message, and so on,
until the link is saturated (or rather: until the virtio NIC of
gw02.dus01 caps transmission).

Pcap here: https://rincewind.home.aperture-labs.org/~tfiebig/mtu.pcap
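
To make the asymmetry described above concrete, here is a small sketch
that computes the path MTU as the minimum link MTU along each
direction. The 1400 and the two 1500 values for the gw02.dlft01 links
come from the description above; the MTUs of the dus01 links are not
stated in this mail and are assumed to be 1500 purely for illustration.
`LINK_MTU` and `path_mtu` are hypothetical names.

```python
# Link MTUs keyed by the unordered pair of endpoints.
LINK_MTU = {
    frozenset({"gw02.dus01", "gw01.dus01"}): 1500,    # assumed
    frozenset({"gw01.dus01", "gw01.ams01"}): 1500,    # assumed
    frozenset({"gw01.ams01", "gw01.dlft01"}): 1400,   # VXLAN over IPv6
    frozenset({"gw01.dlft01", "gw02.dlft01"}): 1500,
    frozenset({"gw02.dlft01", "gw01.ams01"}): 1500,   # tunnel over v4
}


def path_mtu(hops: list[str]) -> int:
    """Smallest link MTU along a list of consecutive hops."""
    return min(LINK_MTU[frozenset(pair)] for pair in zip(hops, hops[1:]))


forward = ["gw02.dlft01", "gw01.ams01", "gw01.dus01", "gw02.dus01"]
ret = ["gw02.dus01", "gw01.dus01", "gw01.ams01", "gw01.dlft01"]

print(path_mtu(forward))  # direction of the SYN
print(path_mtu(ret))      # direction of the large return packets
```

Under these assumptions the MSS negotiated on the forward path is fine
for that direction, while full-sized return packets only hit the 1400
link once they reach gw01.ams01, which is where the PTBs originate.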

Further digging around a couple of OpenBSD 7.4 oob-ish boxes that also
learn their routes to these hosts via BGP and face the same asymmetric
routing, I found that they show the same behavior, even if there is no
loopback-bound address involved. Furthermore, I could also see this
behavior when manually starting a TCP connection that would create
more-than-MTU-sized packets.

However, for OpenBSD hosts just holding a default route, this does not
occur, even when hard-setting the MTU in a more specific route.
Similarly, all other routers (mostly running Linux/VyOS and omitted in
the diagram) do not exhibit this MTU behavior.

I also set up a test network similar to the above, but could not
reproduce the issue there so far. This leads me to suspect that, for
the BGP issue, there is also an interop component going wrong.
Finally, the issue also occurred when gw01.dus01 was a VyOS.

At the moment I see two direct issues:
- Until-timeout retransmission when receiving PTB ICMP6 messages that
  report the same MTU as the already discovered one
- Going below the minimum MTU for IPv4 when continuously facing
  Packet-Too-Big messages asking for an MTU >= the size of the sent
  packet
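
One possible policy that would avoid both issues listed above is to
treat a PTB that does not actually lower the PMTU as non-informative
and to never go below the protocol minimum. The sketch below is an
assumption on my side, not what OpenBSD does and not something the RFCs
mandate in this form; `Action` and `on_ptb` are hypothetical names.

```python
from enum import Enum


class Action(Enum):
    IGNORE = "ignore"                          # leave it to the normal RTO
    LOWER_AND_RETRANSMIT = "lower+retransmit"  # PMTU shrank, resend now


def on_ptb(current_pmtu: int, reported_mtu: int, min_mtu: int = 1280):
    """Decide what to do with a Packet-Too-Big message.

    min_mtu defaults to the IPv6 minimum; pass 68 for IPv4.
    Returns (action, pmtu_to_use).
    """
    new_pmtu = max(reported_mtu, min_mtu)   # never drop below the floor
    if new_pmtu >= current_pmtu:
        # The PTB tells us nothing new; an immediate resend would only
        # trigger the same PTB again.
        return Action.IGNORE, current_pmtu
    return Action.LOWER_AND_RETRANSMIT, new_pmtu


print(on_ptb(1500, 1480))  # first PTB from the HE router
print(on_ptb(1480, 1480))  # every following PTB from the HE router
```

With a misreporting router the connection would still stall, but it
would do so at the pace of the regular retransmission timer instead of
at line rate.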

Please note, btw, that RFC 1191, describing PMTUD in general, leaves the
question of 'what to do when the requested MTU is >= the size of the
sent packet' undefined. However, RFC 4443 notes that a host must limit
the rate of ICMP6 error messages it sends, which Linux apparently
ignores here; that looks like a bug over there.
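
RFC 4443 mentions a token bucket as one way to implement that rate
limit. A minimal sketch of such a limiter follows; the class name and
the rate and burst values are made up for illustration, and the clock
is passed in explicitly so the behavior is easy to follow.

```python
class TokenBucket:
    """Allow on average `rate` messages per second, with bursts up to `burst`."""

    def __init__(self, rate: float, burst: int, now: float = 0.0):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)  # start with a full bucket
        self.last = now

    def allow(self, now: float) -> bool:
        """Return True if one ICMP error may be sent at time `now`."""
        elapsed = now - self.last
        self.last = now
        # Refill according to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(rate=10, burst=5)
# 1000 oversized packets arriving at the same instant:
sent = sum(bucket.allow(now=0.0) for _ in range(1000))
print(sent)  # number of PTBs actually emitted
```

A router limiting its PTBs like this would have capped the feedback
loop at a handful of messages per second, regardless of how fast the
sender retransmits.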

The issue that still needs to be found is why gw02.dus01 ignores the
PMTUD packets when the route is learned via BGP. However, I am
currently at a loss re: finding a reproducing configuration and can
only find this issue on the live boxes. There is also a very real
chance of me just 'holding things wrong', though.

Looking forward to further input.

With best regards,
Tobias
