Moin,

I have run into some issues with IPv6 PMTUD on OpenBSD 7.4, and am
somewhat at a loss on how to proceed with finding a proper reproducer.

I first ran into MTU issues when some of my mailers suddenly started to
put out ~50mbit of traffic for no apparent reason. Back then, further
debugging led to the following observations:

- I received connections from a host behind a HE IPv6 tunnel; this
  communicated an MSS of 1440 (MTU 1500)
- Sending return packets, I received Packet-Too-Big (PTB) ICMP messages
  from the HE tunnel host, indicating an MTU of 1480
- OpenBSD reset the MTU to 1480 and resent the packet
- I would receive another PTB ICMP message from the HE tunnel host,
  indicating an MTU of 1480; OpenBSD would set the MTU to 1480 and
  resend the packet

The root cause back then was some form of (legacy?) misconfiguration on
the HE side, as the link actually had an MTU of 1472, which was
incorrectly reported as 1480 in the PTB messages by the HE router.
However, the additional issue seems to be that OpenBSD retransmits
endlessly on PTB if the reported MTU is the same as the already
discovered PMTU.

I had initially benched that issue, putting on my todo list to do a
proper write-up and build a tool to remotely trigger this. There might
be some amplification potential here by abusing, e.g., high-BW HE tunnel
endpoints to make some dst. send a large amount of outbound traffic; but
I could not get this working reliably with scapy.

Very scrappy, cobbled-together code for Linux, based on an example
snippet to do an HTTP request 'by foot', can be found here; it might
need some fixing before it works:

https://rincewind.home.aperture-labs.org/~tfiebig/pmtud_code/http_request.py
https://rincewind.home.aperture-labs.org/~tfiebig/pmtud_code/http_request_v6.py

Note that this needs additional firewalling on the client so the Linux
kernel does not interfere with the TCP sessions, i.e., preventing the
client from sending RST.

Also, this is specific to the IPv6 implementation; for IPv4, OpenBSD
runs down to a minimal MTU (below the min. MTU for v4, btw) when
re-receiving PTB ICMP messages.
For v6 it does not do this, likely due to the logic being different in
relation to the higher (1280) min MTU.

Recently, this hit me again, when gw02.dus01.as59645.net put ~1gbit of
traffic on the path to gw01.ams01.as59645.net. This occurred after I had
set up a test setup in a third location; this location is connected to
gw01.ams01 via an MTU 1400 link (vxlan tunnel over IPv6, due to lack of
fragmentation for v6). When I installed a test device (gw02.dlft01), I
connected it via an MTU 1500 link to gw01.dlft01, and--to test something
unrelated--via an MTU 1500 link to gw01.ams01 (tunnel over v4, with
fragmentation handled transparently by an additional device that just
pushes around VLANs).

All hosts have a BGP underlay using private ASNs (one per host) to
distribute the global unicast addresses on the direct links. In
addition, there is an iBGP setup between the hosts, exchanging full
tables and the non-router networks in use. These sessions are handled
via loopback addresses, which are also distributed via the BGP underlay.
See the diagram below:

+-----------------------+              +-----------------------+
|gw01.dus01.as59645.net |              |gw01.ams01.as59645.net |
|        JunOS          +--------------+     VyOS (Linux)      +---+
|   lo: 2a06:d1c0::1    |              |   lo: 2a06:d1c0::a    |   |
+-----------+-----------+              +-----------+-----------+   |
            |                                      /               |
            |                                       \              |
            |                         MTU: 1400 -> /               |
            |                                       \              |
            |                                      /               |
            |                                       \              |
+-----------+-----------+              +------------+----------+   |
|gw02.dus01.as59645.net |              |gw01.dlft01.as59645.net|   |
|     OpenBSD 7.4       |              |     VyOS (Linux)      |   |
|   lo: 2a06:d1c0::2    |              |   lo: 2a06:d1c0::9    |   |
+-----------------------+              +------------+----------+   |
                                                    |              |
                                                    |              |
             +--------------------------------------+              |
             |                                                     |
+------------+----------+                                          |
|gw02.dlft01.as59645.net+------------------------------------------+
|        JunOS          |
|   lo: 2a06:d1c0::9    |
+-----------------------+

What now happened is that gw02.dlft01 opened a connection to gw02.dus01
to start an iBGP session. These packets flowed gw02.dlft01 -> gw01.ams01
-> gw01.dus01 -> gw02.dus01.
Return packets, however, flowed gw02.dus01 -> gw01.dus01 -> gw01.ams01
-> gw01.dlft01. Hence, on the link gw01.ams01 -> gw01.dlft01, they
exceeded the path MTU, and gw01.ams01 started to send PTB ICMP messages
announcing the correct MTU of 1400. However, gw02.dus01 only retransmits
the previous packet and does not decrease the MSS/MTU, leading to
another PTB ICMP message etc., up until the link is saturated (or
rather: the virtio NIC of gw02.dus01 capping transmission).

Pcap here:

https://rincewind.home.aperture-labs.org/~tfiebig/mtu.pcap

Further digging around a couple of OpenBSD 7.4 oob-ish boxes that also
learn their routes to these hosts via BGP and face the same asymmetric
routing, I found that they show the same behavior, even if there is no
loopback-bound address involved. Furthermore, I could also see this
behavior when manually starting a TCP connection that would create
more-than-MTU-sized packets.

However, for OpenBSD hosts just holding a default route, even when
hard-setting the MTU in a more specific route, this does not occur.
Similarly, all other routers (running mostly Linux/VyOS and omitted in
the diagram) do not exhibit this MTU behavior.

I also set up a test network similar to the above, but could not
reproduce the issue there so far; this leads me to suspect that--for the
BGP issue--there is also an inter-op component going wrong. Finally, the
issue also occurred when gw01.dus01 was a VyOS.

At the moment I do see two direct issues:

- Until-timeout retransmission when receiving same-MTU sized PTB ICMP6
  messages
- Going below minimum MTU for IPv4 when continuously facing PTB messages
  asking for an MTU >= the size of the sent packet

Please note, btw, that RFC1191, describing PMTUD in general, leaves the
question of 'what to do when the requested MTU is >= the size of the
sent packet' undefined. However, RFC4443 notes that a host must limit
the rate of ICMP6 error messages it sends, which Linux apparently
ignores; that would be a bug over there.
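
The two failure modes described above can be sketched with a small toy
model in Python. This is only an illustration under stated assumptions,
not OpenBSD's actual logic: the function name simulate_sender, the
honor_ptb switch, and the step_down mitigation are hypothetical and
exist only to mirror the behavior observed in the captures.

```python
IPV6_MIN_MTU = 1280  # minimum link MTU every IPv6 path must support (RFC 8200)


def simulate_sender(packet_size, link_mtu, reported_mtu,
                    honor_ptb=True, step_down=0, max_rounds=10):
    """Toy model: count transmissions until a packet fits the bottleneck.

    packet_size  -- size the sender initially tries (bytes)
    link_mtu     -- real MTU of the bottleneck link
    reported_mtu -- MTU the router puts into its PTB messages
    honor_ptb    -- False models a sender that ignores PTB entirely
    step_down    -- hypothetical mitigation: bytes to shave off when a
                    PTB does not lower the current estimate
    Returns (delivered, transmissions, final_pmtu).
    """
    pmtu = packet_size
    for sent in range(1, max_rounds + 1):
        if pmtu <= link_mtu:
            return True, sent, pmtu      # packet fits, no PTB comes back
        if not honor_ptb:
            continue                     # PTB ignored: same size is resent
        if reported_mtu < pmtu:
            # Normal PMTUD update; never go below the IPv6 minimum and
            # never raise the estimate.
            pmtu = min(pmtu, max(reported_mtu, IPV6_MIN_MTU))
        elif step_down:
            # Assumed fallback for a PTB that reports an MTU >= what was sent.
            pmtu = min(pmtu, max(pmtu - step_down, IPV6_MIN_MTU))
        # Otherwise the estimate is unchanged and the next transmission
        # is identical to this one.
    return False, max_rounds, pmtu


if __name__ == "__main__":
    # HE tunnel case: real MTU 1472, router reports 1480.
    print(simulate_sender(1500, 1472, 1480))
    # Correctly reported MTU 1400, PTB honored.
    print(simulate_sender(1500, 1400, 1400))
    # Correctly reported MTU 1400, PTB ignored (the BGP-learned route case).
    print(simulate_sender(1500, 1400, 1400, honor_ptb=False))
    # HE tunnel case with the hypothetical step-down fallback.
    print(simulate_sender(1500, 1472, 1480, step_down=8))
```

In this model the HE tunnel case gets stuck at a PMTU of 1480 and never
delivers within max_rounds, while the correctly reported MTU 1400 case
delivers on the second transmission. With honor_ptb=False the sender
keeps resending at 1500, which is what gw02.dus01 appears to do. The
step_down variant only shows one possible way out of the same-MTU loop;
whether something like it is appropriate is exactly the open question.
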

The issue that still needs to be found is why gw02.dus01 ignores the
PMTUD packets when the route is learned via BGP. However, I am currently
at a loss re: finding a reproducing configuration and can only find this
issue on the live boxes; there also is a very real chance of me just
'holding things wrong', though.

Looking forward to further input.

With best regards,
Tobias