Re: [Babel-users] MTU based routing for tunnel based babel networks?
On Wed, Jul 26, 2023 at 11:10 AM Juliusz Chroboczek wrote:

> > CONNECT-UDP
>
> Come on, David, we all know that MASQUE is an elaborate practical joke.
> With draft-asedeno-masque-connect-ethernet, you guys are obviously trying
> to see how far you can go before people realise you're taking the piss.

On that note, you know a lot about 802.11, can you help us with our upcoming draft-masque-connect-layer-1?

But more seriously, L2 VPNs already exist (cf. L2TP, OpenVPN, etc.); MASQUE is just trying to match the state of the art here. Whether the state of the art is where we want it to be is a different question, but for that we need a few chairs and beers. Are you planning on attending the Prague IETF in November?

> > (You could even perform dichotomy there to measure the exact MTU and update
> > the OS link MTU based on that,
>
> Sure. With v4-via-v6, we're already silently enabling IPv6 transit, so
> there's some precedent to fixing the system without the admin's knowledge
> :-)

Nice. :-)

___
Babel-users mailing list
Babel-users@alioth-lists.debian.net
https://alioth-lists.debian.net/cgi-bin/mailman/listinfo/babel-users
Re: [Babel-users] MTU based routing for tunnel based babel networks?
> CONNECT-UDP

Come on, David, we all know that MASQUE is an elaborate practical joke. With draft-asedeno-masque-connect-ethernet, you guys are obviously trying to see how far you can go before people realise you're taking the piss.

> (You could even perform dichotomy there to measure the exact MTU and update
> the OS link MTU based on that,

Sure. With v4-via-v6, we're already silently enabling IPv6 transit, so there's some precedent to fixing the system without the admin's knowledge :-)
Re: [Babel-users] MTU based routing for tunnel based babel networks?
On Wed, Jul 26, 2023 at 5:18 AM Juliusz Chroboczek wrote:

> > While you're absolutely right that this MUST NOT happen, in practice it
> > does.
>
> I think we're in at least partial agreement. The point I'm making is that
> this configuration is not something that's supported by IP, and that VPN
> implementations that cause MTU blackholes are quite simply buggy.

Agreed.

> (There's an argument to be made that IPv6 should support variable MTU
> links. Good luck pushing this idea at the IETF, which, of late, appears
> to be mostly interested in breaking the e2e principle and proxying
> everything at the application layer. Sorry for the rant.)

(As a proxy enthusiast, I have thoughts :P. In my view, the e2e principle as we knew it broke when people started deploying TCP "accelerators". We brought back transport-layer e2e with QUIC thanks to e2e encryption. So in my view, QUIC is e2e but TCP, UDP, and IP are not. In that world, CONNECT-UDP allows you to maintain e2e because it allows QUIC. Sorry for the rant reply, but I couldn't resist.)

> Of course, in practice misconfiguration happens, and so it's a good thing
> to be able to automatically detect misconfiguration and discard
> the link.

Definitely. Thanks for implementing and deploying that by the way.

> It would be even better to be able to notify the network
> administrator of the issue, but that would be a little more work than I'm
> willing to do right now.

babeld automatically emailing sysadmins sounds like a fun time :-)

> (For example, we could send Hellos in small packets, in order to
> discover neighbours, and then send a small number of Ack Requests padded
> to MTU to every discovered neighbour. If a neighbour never answers the
> Ack Request, then it's fairly strong evidence that there's something
> wrong.)

(You could even perform dichotomy there to measure the exact MTU and update the OS link MTU based on that, but I agree that's not necessarily babeld's job.)

David
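The "dichotomy" idea above can be sketched as a plain binary search. This is only an illustration, not anything babeld does: the `probe` callback is a hypothetical stand-in for "send an Ack Request padded to this size and report whether it was answered", and the 1280..1500 bounds are assumptions (the IPv6 minimum link MTU and a typical Ethernet MTU).

```python
from typing import Callable


def find_link_mtu(probe: Callable[[int], bool], lo: int = 1280, hi: int = 1500) -> int:
    """Return the largest size in [lo, hi] for which probe(size) succeeds.

    Assumes probe is monotonic: if a size gets through, so does every
    smaller size. `probe` is a hypothetical callback, not a babeld API.
    """
    if not probe(lo):
        raise ValueError("even the smallest probe size did not get through")
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if probe(mid):
            lo = mid       # mid got through: the answer is mid or larger
        else:
            hi = mid - 1   # mid was dropped: the answer is strictly smaller
    return lo


if __name__ == "__main__":
    # Simulated link that silently drops anything larger than 1420 bytes.
    print(find_link_mtu(lambda size: size <= 1420))
```

With these bounds the search needs at most eight probes after the initial one, so a real implementation would mostly be limited by how long it waits for each missing acknowledgement.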
Re: [Babel-users] MTU based routing for tunnel based babel networks?
> While you're absolutely right that this MUST NOT happen, in practice it does.

I think we're in at least partial agreement. The point I'm making is that this configuration is not something that's supported by IP, and that VPN implementations that cause MTU blackholes are quite simply buggy.

(There's an argument to be made that IPv6 should support variable MTU links. Good luck pushing this idea at the IETF, which, of late, appears to be mostly interested in breaking the e2e principle and proxying everything at the application layer. Sorry for the rant.)

Of course, in practice misconfiguration happens, and so it's a good thing to be able to automatically detect misconfiguration and discard the link. It would be even better to be able to notify the network administrator of the issue, but that would be a little more work than I'm willing to do right now.

(For example, we could send Hellos in small packets, in order to discover neighbours, and then send a small number of Ack Requests padded to MTU to every discovered neighbour. If a neighbour never answers the Ack Request, then it's fairly strong evidence that there's something wrong.)

-- Juliusz
Re: [Babel-users] MTU based routing for tunnel based babel networks?
On Wed, Jul 26, 2023 at 02:02:14PM +0200, Juliusz Chroboczek wrote:
> > Uups, never mind this. I was looking at the other node's hellos. The
> > neighbour relationship goes down properly as you'd expect.
>
> Merged into master. Shall I release 13.1?

I think you mean 1.13, but that's already released so it'll have to be 14.1, err, 1.14 :)

I would appreciate it, yeah.

--Daniel
Re: [Babel-users] MTU based routing for tunnel based babel networks?
> Uups, never mind this. I was looking at the other node's hellos. The
> neighbour relationship goes down properly as you'd expect.

Merged into master. Shall I release 13.1?
Re: [Babel-users] MTU based routing for tunnel based babel networks?
> I can observe (some) hellos using the padding depending on the option
> setting. Problem is when I force the interface MTU to 1280 instead of the
> initial 1420 the padded hellos get dropped and don't reach the other side
> as you'd expect, but the regular sized hellos still make it through and so
> the neighbourship relationship stays up.

Clarified in the other mail, good.

> Here's an idea: what if we pad the IHU response instead of all hellos? That
> might have slightly less control overhead when RTT isn't enabled as you
> don't need to respond to every hello then? I'm not sure how babeld
> schedules IHU sending exactly.

The Hellos are periodic, so the overhead is constant. The number of IHUs is proportional to the number of neighbours, so there might be arbitrarily many of those. I'm pretty sure it doesn't matter much in practice, but at least with Hellos the amount of overhead is easy to predict.

-- Juliusz
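To put rough numbers on the Hello-versus-IHU comparison above, here is a back-of-the-envelope sketch. The 4 s Hello interval and 12 s IHU interval are assumed example values, not something stated in this thread, and the model assumes each padded IHU goes out in its own MTU-sized packet.

```python
def padded_hello_overhead(mtu: int, hello_interval_s: float) -> float:
    """Bytes per second per interface when every Hello is padded to the MTU."""
    return mtu / hello_interval_s


def padded_ihu_overhead(mtu: int, ihu_interval_s: float, neighbours: int) -> float:
    """Bytes per second per interface when each IHU is padded to the MTU.

    Assumes one MTU-sized packet per neighbour per IHU interval.
    """
    return neighbours * mtu / ihu_interval_s


if __name__ == "__main__":
    mtu = 1420
    print("hellos:", padded_hello_overhead(mtu, 4.0))
    for n in (1, 3, 30):
        print(n, "neighbours:", padded_ihu_overhead(mtu, 12.0, n))
```

Under these assumptions the two schemes break even at three neighbours; beyond that, padded IHUs cost more, and the cost keeps growing with the neighbour count, which is the predictability argument made above.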
Re: [Babel-users] MTU based routing for tunnel based babel networks?
On Thu, Jul 20, 2023 at 02:40:40PM +0200, Daniel Gröber wrote:
> On Wed, Jul 19, 2023 at 11:25:52PM +0200, Juliusz Chroboczek wrote:
> > Could you please test the new branch "probe-mtu"? It's now using the
> > IPV6_DONTFRAG cmsg in sendmsg, so it's enough to say
> >
> >     default probe-mtu true
> >
> > (No global options, only per-interface options.)
>
> I can observe (some) hellos using the padding depending on the option
> setting. Problem is when I force the interface MTU to 1280 instead of the
> initial 1420 the padded hellos get dropped and don't reach the other side
> as you'd expect, but the regular sized hellos still make it through and so
> the neighbourship relationship stays up.

Uups, never mind this. I was looking at the other node's hellos. The neighbour relationship goes down properly as you'd expect.

--Daniel
Re: [Babel-users] MTU based routing for tunnel based babel networks?
Hi Juliusz,

On Wed, Jul 19, 2023 at 11:25:52PM +0200, Juliusz Chroboczek wrote:
> Could you please test the new branch "probe-mtu"? It's now using the
> IPV6_DONTFRAG cmsg in sendmsg, so it's enough to say
>
>     default probe-mtu true
>
> (No global options, only per-interface options.)

I can observe (some) hellos using the padding depending on the option setting. Problem is when I force the interface MTU to 1280 instead of the initial 1420 the padded hellos get dropped and don't reach the other side as you'd expect, but the regular sized hellos still make it through and so the neighbourship relationship stays up.

Here's an idea: what if we pad the IHU response instead of all hellos? That might have slightly less control overhead when RTT isn't enabled as you don't need to respond to every hello then? I'm not sure how babeld schedules IHU sending exactly.

--Daniel
Re: [Babel-users] MTU based routing for tunnel based babel networks?
Daniel,

Could you please test the new branch "probe-mtu"? It's now using the IPV6_DONTFRAG cmsg in sendmsg, so it's enough to say

    default probe-mtu true

(No global options, only per-interface options.)

-- Juliusz
Re: [Babel-users] MTU based routing for tunnel based babel networks?
> To test dont-fragment I first set it to disabled and changed the
> (wireguard) interface MTU from 1420 to 1280 at runtime. Doing this I can
> observe babel hellos being fragmented in tcpdump.
>
> When setting dont-fragment true this trick doesn't work and the neighbour
> relationship to the other node doesn't get established.
>
> So it looks like it's working.

Thanks for the report. I think I'll rework it to use the per-message option as you suggested in a previous mail, so that probe-mtu automatically triggers dont-fragment on the affected interface. One configuration option less.

Then, after you confirm it still works for you, I'll merge into master.

Thanks again,

-- Juliusz
Re: [Babel-users] MTU based routing for tunnel based babel networks?
Hi Juliusz,

While my (now fixed) tunnel stacking mitigation works for locally generated wg packets, it doesn't when they are being routed for another host on the (ethernet) network. This was what motivated the MTU probing idea in the first place.

I believe the probe-mtu option is still useful in general and even somewhat in the wireguard case. The hello packet padding will force PMTU discovery on the tunnel endpoint address to happen, which in turn allows my nftables rule to trigger even when the apparent interface MTU is 1500 :)

Since that's a bit of a hack I've added another rule to my mitigation to just filter fragmented wireguard packets outright:

    meta mark 0x1000 meta protocol ip6 exthdr frag != missing counter drop

On Wed, Jul 19, 2023 at 12:04:02AM +0200, Juliusz Chroboczek wrote:
> Completely untested. Please checkout the branch "probe-mtu", then say
> this in your config file:
>
>     dont-fragment true
>     default probe-mtu true

The padding logic looks good. I can see hello packets of the right (interface MTU) size leaving when probe-mtu is enabled.

To test dont-fragment I first set it to disabled and changed the (wireguard) interface MTU from 1420 to 1280 at runtime. Doing this I can observe babel hellos being fragmented in tcpdump.

When setting dont-fragment true this trick doesn't work and the neighbour relationship to the other node doesn't get established.

So it looks like it's working.

Thanks,
--Daniel
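For readers wondering what "padding a hello to the interface MTU" amounts to on the wire, here is a minimal sketch using the Pad1 (type 0) and PadN (type 1) TLVs from the Babel specification. It is not taken from the probe-mtu branch, whose actual code may differ; the function names are made up for illustration, and the header sizes assume plain IPv6 with no extension headers.

```python
PAD1 = 0  # Babel Pad1 TLV: a single zero octet
PADN = 1  # Babel PadN TLV: type, length, then `length` zero octets

IPV6_HEADER = 40
UDP_HEADER = 8
BABEL_HEADER = 4  # magic, version, 16-bit body length


def padding_tlvs(n: int) -> bytes:
    """Return exactly n octets of Babel padding TLVs."""
    out = bytearray()
    while n > 0:
        if n == 1:
            out.append(PAD1)
            n -= 1
        else:
            body = min(n - 2, 255)  # the PadN length field is one octet
            out += bytes([PADN, body]) + bytes(body)
            n -= 2 + body
    return bytes(out)


def pad_body_to_mtu(body: bytes, link_mtu: int) -> bytes:
    """Append padding so the resulting IPv6 packet is exactly link_mtu octets."""
    target = link_mtu - IPV6_HEADER - UDP_HEADER - BABEL_HEADER
    if len(body) > target:
        raise ValueError("body already exceeds the link MTU")
    return body + padding_tlvs(target - len(body))


if __name__ == "__main__":
    hello = bytes([4, 6]) + bytes(6)  # placeholder Hello TLV with zeroed fields
    print(len(pad_body_to_mtu(hello, 1420)))
```

Because one PadN TLV carries at most 255 octets of payload, filling a 1420-byte packet takes several of them back to back.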
Re: [Babel-users] MTU based routing for tunnel based babel networks?
Hi,

after some more testing and tcpdumping I have to revise my theory of what's going on; turns out Juliusz was right all along (as usual :)

On Mon, Jul 17, 2023 at 02:26:40AM +0200, Daniel Gröber wrote:
> Let me try to give some more context:
>
> My mesh network deploys two wg tunnels per node. One wg-over-v6 and one
> wg-over-v4 tunnel to support dualstack, v4-only and v6-only underlay
> networks.
>
> Nodes run babel over all wg interfaces and will receive a default route
> covering the wg-over-v6 tunnel endpoint addresses. Some nodes are served by
> IPv6 routers that are themselves part of the wg mesh network and only have
> v6 connectivity via wg-over-v4.
>
> This can cause wg-over-v6 tunnels on such nodes to want to cross a
> wg-over-v4 tunnel.
>
> All wg interfaces have MTU 1420 configured which is the worst case for
> wg-over-v6 or v4 (with MTU 1500). In the wg-over-wg-over-v4 case this
> results in packets that are too big for the v4 underlay network
> (1420+80+60=1560).
>
> Wireguard drops packets when they exceed the underlay network's MTU.

Not true, wireguard will fragment its UDP packets based on PMTU results if available in the route cache (ip -6 route show cache). It does this by setting skb->ignore_df=1.

> When this happens no PTB ICMP errors are generated by wireguard inside
> the tunnel

This is still true, wireguard does not forward ICMP PTB errors from the endpoint to inside the tunnel, but it doesn't need to since fragmentation happens on its UDP packets.

Now one would expect wg tunnel stacking to just work, despite fragmentation of the encapsulated packets being inefficient and still undesirable. However it turns out two of my tunnel stacking mitigation attempts taken together were conspiring against me!

My first approach to fixing the tunnel stacking was to force wireguard output packets to be sent over the (ethernet) upstream interface only, using policy routing. This turns out to be ineffective, see below.

On top of this I applied the following nftables rule to prevent wg output from ever going over interfaces with MTU less than 1500. This was originally conceived to accommodate workstations rather than "core" routers but was rolled out on the routers too.

    meta mark 0x1000 meta protocol ip6 rt mtu < 1440 \
        counter reject with icmpx type admin-prohibited \
        comment "wg endpoint loopback prevention"

Note the `rt mtu` match is misnamed and is actually in terms of TCP MSS, so 1440+60=1500 (depends on the underlying IP protocol though). Fwmark 0x1000 is what the wg tunnels tag their encapsulated packets with.

The fatal problem here is that the first mitigation will cause the upstream router to just hairpin the wg packets back at us, since we're (usually) also announcing the endpoint's prefix via BGP. This will cause the fwmark to get stripped obviously, so the otherwise effective nftables loopback prevention rule was being bypassed. doh!

After removing the policy routing bit, stacked tunnels seem to get pruned as they should now.

Thanks,
--Daniel
Re: [Babel-users] MTU based routing for tunnel based babel networks?
On Tue, Jul 18, 2023 at 03:37:14PM -0700, David Schinazi wrote:
> [ ... an hour passes by with this email half written ... ]
>
> Oh, and in the meantime Juliusz just went ahead and implemented probe-mtu.
> Nicely done, sir! Looking at the PR it validates that the kernel-provided
> MTU gets through the network. I wonder if that breaks popular tunnel
> implementations today, as I suspect many don't set that correctly.

Ha, good thing you mentioned it, I was just about to go back to patch writing.

Interesting approach. IPV6_DONTFRAG is (again) not documented in ipv6(7) so I had no idea this exists :)

FYI: From looking at the linux code it looks like it's possible to set IPV6_DONTFRAG per sendmsg() call (in the cmsg field, see ip6_datagram_send_ctl() in linux) so this could also be a per-interface option.

Awesome work Juliusz!

--Daniel
Re: [Babel-users] MTU based routing for tunnel based babel networks?
Hi Juliusz,

While you're absolutely right that this MUST NOT happen, in practice it does. A rare scenario is when routes change deep in a network, causing the e2e PMTU to change without the endpoints observing any change in their link MTU. This phenomenon happens much more commonly on tunnels when the tunnel takes a new path (e.g., moving IKEv2/IPsec to a different underlying interface via RFC 4555). In that scenario the endpoint experiencing the migration (e.g. the cell phone) knows that something changed but the e2e peer does not. In IPv4 this can be (poorly) solved by in-network fragmentation, but that's not allowed in v6.

If Babel were to magically know the MTU of its interfaces (including tunnels), it would make sense to consider that information as part of route metrics. The remaining question is where to perform the PMTUD; it feels like the responsibility of the tunnel but could also be reused across different tunnel types.

[ ... an hour passes by with this email half written ... ]

Oh, and in the meantime Juliusz just went ahead and implemented probe-mtu. Nicely done, sir! Looking at the PR it validates that the kernel-provided MTU gets through the network. I wonder if that breaks popular tunnel implementations today, as I suspect many don't set that correctly.

David

On Tue, Jul 18, 2023 at 1:42 PM Juliusz Chroboczek wrote:

> >> RFC 2460: "link MTU - the maximum transmission unit, i.e., maximum packet
> >> size in octets, that can be conveyed over a link."
>
> > I read this as "link MTU" being the maximum packet size that you could ever
> > hope to be able to send but the link technology could very well not allow the
> > maximum at times.
>
> Daniel, the specs are perfectly clear: there is no licence given to nodes
> to systematically drop packets smaller than MTU. In fact, such links
> break TCP, as you've discovered.
>
> > I'm still not sold on your argument, but it hardly matters. Tunnels on top
> > of the internet exist so we kind of just have to deal with it.
>
> Nobody is denying that. Please see RFC 4459, which describes how to make
> them work reasonably well.
>
> -- Juliusz
Re: [Babel-users] MTU based routing for tunnel based babel networks?
Completely untested. Please check out the branch "probe-mtu", then say this in your config file:

    dont-fragment true
    default probe-mtu true

-- Juliusz
Re: [Babel-users] MTU based routing for tunnel based babel networks?
>> RFC 2460: "link MTU - the maximum transmission unit, i.e., maximum packet
>> size in octets, that can be conveyed over a link."

> I read this as "link MTU" being the maximum packet size that you could ever
> hope to be able to send but the link technology could very well not allow the
> maximum at times.

Daniel, the specs are perfectly clear: there is no licence given to nodes to systematically drop packets smaller than MTU. In fact, such links break TCP, as you've discovered.

> I'm still not sold on your argument, but it hardly matters. Tunnels on top
> of the internet exist so we kind of just have to deal with it.

Nobody is denying that. Please see RFC 4459, which describes how to make them work reasonably well.

-- Juliusz
Re: [Babel-users] MTU based routing for tunnel based babel networks?
On Mon, Jul 17, 2023 at 11:41:01AM +0200, Juliusz Chroboczek wrote:
> >> Sorry, I wasn't clear. IP requires every link to have a well-defined
> >> MTU: all the nodes connected to a link must agree on the link's MTU.
>
> > I don't think that can be true either. PMTU can vary and paths can be
> > asymmetric so two nodes could very well see different MTUs across the
> > internet. There's just not many ASen that run with less than 1500 MTU :)
>
> I'm not speaking about PMTU. I'm speaking about link MTU.

Yeah I got that confused. That's what happens when you write technical emails at 2am ;)

> > Do you have a reference for this "MTU well-definedness" criterion, I don't
> > think I ever heard of this.
>
> RFC 2460: "link MTU - the maximum transmission unit, i.e., maximum packet
> size in octets, that can be conveyed over a link."

I read this as "link MTU" being the maximum packet size that you could ever hope to be able to send but the link technology could very well not allow the maximum at times. Unfortunately they didn't use the usual RFC2119 requirement level terminology here so who knows :)

> RFC 4861: "All nodes on a link must use the same MTU (or Maximum Receive
> Unit) in order for multicast to work properly."

I mean that only applies when you want to run NDP over the link, so that's hardly relevant for L3 tunnel interfaces or internet backbone links in general.

I'm still not sold on your argument, but it hardly matters. Tunnels on top of the internet exist so we kind of just have to deal with it.

> > Wireguard drops packets when they exceed the underlay network's MTU. When
> > this happens no PTB ICMP errors are generated by wireguard inside the
> > tunnel,
>
> If true, that's very surprising, and looks to me like a bug in Wireguard.
>
> But yeah, I'll add an option to probe for MTU on each Hello.

I've been looking at how to implement this probing. The IPV6_MTU_DISCOVER sockopt used to configure the kernel behaviour unfortunately conflates multiple behaviours (oh joy). This list is for IPv6; on v4, DF also comes in, but thankfully babel only uses a v6 socket:

 - whether EMSGSIZE is returned to send() when a UDP packet is too big (or the packet is simply dropped)

 - whether the interface MTU or PMTU result controls the above error condition when enabled

 - whether UDP send() calls with too large a size are automatically fragmented locally or return the error

 - whether ICMP PTB messages are interpreted at all (a DNS-over-UDP security feature apparently)

That got me wondering: is babeld currently relying on the kernel to fragment large UPDATE packets? From my reading of the code it doesn't look like it. If my reading is right, `(struct buffered).size` determines the maximum UDP payload size and this is initialized from the interface MTU.

This means we can probably just set IPV6_MTU_DISCOVER to the undocumented IP_PMTUDISC_INTERFACE[1] to maximally disable PMTU behaviour. This option 1) prevents local fragmentation of any sort (interface MTU or PMTU), 2) disables updating the PMTU cache from ICMP-PTB messages for this socket since we don't need that anyway and 3) causes too big send() calls to fail with EMSGSIZE (if my reading of the kernel code is right).

[1]: Introduced around 2013, see kernel commits 482fc6094a 93b36cf342 1b34657635 0b95227a7b for the full story.

In principle we could also use the older IP_PMTUDISC_DONT since we don't technically have to turn off ICMP-PTB interpretation, but I feel like it's neater if we disable that too.

--Daniel
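The two kernel knobs discussed in this thread (the per-socket IPV6_MTU_DISCOVER option and the per-message IPV6_DONTFRAG cmsg) look roughly like this from userspace. This is a Linux-only sketch, not babeld code: the numeric constants are copied from linux/in6.h as I remember them and should be checked against your headers, and the helper names are invented for illustration.

```python
import errno
import socket
import struct

# Linux-specific values (linux/in6.h); defined here because Python's socket
# module does not export all of them on every build. Treat as assumptions.
IPV6_MTU_DISCOVER = 23
IPV6_PMTUDISC_INTERFACE = 4
IPV6_DONTFRAG = 62


def disable_pmtu(sock: socket.socket) -> None:
    """Per-socket variant: use the interface MTU only, never fragment locally."""
    sock.setsockopt(socket.IPPROTO_IPV6, IPV6_MTU_DISCOVER, IPV6_PMTUDISC_INTERFACE)


def dontfrag_ancdata() -> list:
    """Per-message variant: ancillary data asking the kernel not to fragment."""
    return [(socket.IPPROTO_IPV6, IPV6_DONTFRAG, struct.pack("i", 1))]


def send_unfragmented(sock: socket.socket, payload: bytes, dest: tuple) -> bool:
    """Send one datagram; return False if the kernel reports it as too big."""
    try:
        sock.sendmsg([payload], dontfrag_ancdata(), 0, dest)
        return True
    except OSError as e:
        if e.errno == errno.EMSGSIZE:
            return False
        raise
```

The per-message form is what makes a per-interface option possible on a single shared socket, since the decision is taken on each sendmsg() call rather than once for the socket.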
Re: [Babel-users] MTU based routing for tunnel based babel networks?
Hi,

I also wish there was some form of ensuring a path with a minimum MTU.

My use case is providing a minimum MTU for VXLAN overlay networks in very heterogeneous networks consisting of different tunnel mechanisms (gre, wireguard, via v4 and v6), direct ethernet links and ptp connections.

To allow proper L2 connections with an MTU of 1500, links must at least have an MTU of 1570 to have room for unfragmented VXLAN packets. This is important since VXLAN VTEPs must not fragment packets [0], which is very annoying in this case.

Having some sort of mechanism within babel that propagates routes between (not necessarily directly connected) VTEPs and ensures a minimum MTU along a path would be very welcome. Otherwise packets might choose a path with lower metric and insufficient MTU, which will result in dropped packets.

Cheers

[0]: https://datatracker.ietf.org/doc/html/rfc7348#section-4.3

On 16.07.23 20:51, Daniel Gröber wrote:
> Hi babelers,
>
> I've been running babel on top of my wireguard IPv6 network for a while now
> and I have a problem that keeps biting me and I can't find a good solution
> for: babel is oblivious to a link's MTU and picks paths that involve
> wireguard-in-wireguard tunnels even though paths without this stacking are
> available.
>
> The stacking (and subsequent path MTU reduction) is I believe not even
> bounded, so there is no static MTU I could configure on all my hosts to
> take care of this like one would do with a plain wireguard setup.
>
> I was able to fix this on my routers by configuring the firewall to drop
> UDP tunnel packets that are going to traverse interfaces with MTU<=1440.
> This works alright but I also have babel running on workstations that are
> behind these routers and there is no good way to classify which UDP packets
> are part of my network's wireguard tunnels and which aren't.
>
> So this got me thinking (for the hundredth time) perhaps this should be
> something the routing protocol takes care of? Babeld would essentially have
> to pad its hello packets to a (configurable) size to detect if
> fragmentation is required (or they are being blackholed outright).
>
> My use-case would be well served if I could just specify a minimum MTU all
> paths must satisfy though more elaborate things could be done I suppose
> (metric based on MTU?).
>
> Opinions? Anybody have any better ideas on how to prevent this sort of
> tunnel stacking?
>
> Thanks,
> --Daniel
>
> PS: Just to clarify why the tunnel stacking happens in my setup: my network
> tunnels IPv6 over IPv4 (most of the time), but I want to support IPv6-only
> underlay networks so I have wireguard tunnels with IPv6 endpoints which can
> in turn get routed over V6-over-V4 wg tunnels (when the ether is flowing
> just right).
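The "minimum MTU all paths must satisfy" idea from this message can be sketched as a filter applied before the usual lowest-metric selection. Babel as specified does not carry a path MTU, so the `path_mtu` field and everything else here is hypothetical and only illustrates the selection rule being asked for.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Route:
    next_hop: str
    metric: int
    path_mtu: int  # hypothetical attribute: not part of Babel today


def select_route(routes: Iterable[Route], min_mtu: int) -> Optional[Route]:
    """Pick the lowest-metric route among those whose path MTU is sufficient."""
    usable = [r for r in routes if r.path_mtu >= min_mtu]
    return min(usable, key=lambda r: r.metric, default=None)


if __name__ == "__main__":
    candidates = [
        Route("wg-tunnel", metric=96, path_mtu=1420),
        Route("ethernet", metric=256, path_mtu=9000),
    ]
    # A VXLAN underlay needing 1570 bytes skips the cheaper tunnel route.
    print(select_route(candidates, 1570))
    # A plain IPv6 flow is happy with the lower-metric tunnel.
    print(select_route(candidates, 1280))
```

Treating the MTU as a hard constraint rather than folding it into the metric keeps the choice simple: a route either fits the traffic or is not considered at all.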
Re: [Babel-users] MTU based routing for tunnel based babel networks?
>> Sorry, I wasn't clear. IP requires every link to have a well-defined
>> MTU: all the nodes connected to a link must agree on the link's MTU.

> I don't think that can be true either. PMTU can vary and paths can be
> asymmetric so two nodes could very well see different MTUs across the
> internet. There's just not many ASen that run with less than 1500 MTU :)

I'm not speaking about PMTU. I'm speaking about link MTU.

> Do you have a reference for this "MTU well-definedness" criterion, I don't
> think I ever heard of this.

RFC 2460: "link MTU - the maximum transmission unit, i.e., maximum packet size in octets, that can be conveyed over a link."

RFC 4861: "All nodes on a link must use the same MTU (or Maximum Receive Unit) in order for multicast to work properly."

> Wireguard drops packets when they exceed the underlay network's MTU. When
> this happens no PTB ICMP errors are generated by wireguard inside the
> tunnel,

If true, that's very surprising, and looks to me like a bug in Wireguard.

But yeah, I'll add an option to probe for MTU on each Hello.

-- Juliusz
Re: [Babel-users] MTU based routing for tunnel based babel networks?
On Mon, Jul 17, 2023 at 12:47:30AM +0200, Juliusz Chroboczek wrote: > >> IP does not support variable MTU links. > > > > Excuse me, but that's plain false. IP was designed in an environment where > > (non-ethernet) networks with various MTU standards were commonplace > > Sorry, I wasn't clear. IP requires every link to have a well-defined > MTU: all the nodes connected to a link must agree on the link's MTU. I don't think that can be true either. PMTU can vary and paths can be asymmetric so two nodes could very well see different MTUs across the internet. There's just not many ASen that run with less than 1500 MTU :) Do you have a referece for this "MTU well-definedness" criteria, I don't think I ever heard of this. > > There is a way: My routing protocol just has to stop picking links that are > > obviously going to cause a problem. > > Could you please describe the problem in detail? Because I'm probably > missing something. Let me try to give some more context: My mesh network deploys two wg tunnels per node. One wg-over-v6 and one wg-over-v4 tunnel to support dualstack, v4-only and v6-only underlay networks. Nodes run babel over all wg interfaces and will receive a default route covering the wg-over-v6 tunnel endpoint addresses. Some nodes are served by IPv6 routers that are themselves part of the wg mesh network and only have v6 connectivity via wg-over-v4. This can cause wg-over-v6 tunnels on such nodes to want to cross a wg-over-v4 tunnel. All wg interfaces have MTU 1420 configured which is the worst case for wg-over-v6 or v4 (with MTU 1500). In the wg-over-wg-over-v4 case this results in packets that are too big for the v4 underlay network (1420+80+60=1560). Wireguard drops packets when they exceed the underlay network's MTU. When this happens no PTB ICMP errors are generated by wireguard inside the tunnel, packets are simply dropped and TCP applications running on the overlay IPv6 network break badly as no ICMP errors reach the sender. 
This can be avoided by simply ignoring the wg-over-v6 tunnel which only exists for deployment consistency as a wg-over-v4 tunnel with (actual) 1440 MTU is available too which can reach the entire network. Worth mentioning: The reason I have to run two wg tunnels per node to begin with is that wireguard's strategy for dual-stack support is that it doesn't have one. It supports only one endpoint address per tunnel (well wg-peer really) and if you pick wrong because, say, IPv6 addresses are available but dont work, the tunnel simply blackholes everything. Yey, joy is me. > If Wireguard implements RFC 4459 Section 3.2, then pushing a too large > packet over the tunnel, then Wireguard should synthesise an ICMP "packet > too large", which will cause the sender to retry with a smaller packet. > Is that not the case? Yeah, having wg forward PTB errors from the underlay to inside the tunnel was something I considered for fixing this but I belive that would be called "insecure" by the wg project since the ICMP erros aren't signed like normal wireguard packets. So what happens when an attacker sends spoofed PTB with MTU=0 etc. ;) Furthermore on IPv4 which unfortunately is the underlay in my network more often than not ICMP blackholes are very common so breakage would could ensue again. This really is just putting lipstick on a pig. It would "work" I suppose but I don't want my network to use these paths because the double encapsulation is just plain inefficient! Prune thy inefficient paths I say :] > I'm not opposed to your probing idea, but I'd really prefer to fully > understand the problem first. Sure thing, I'm not opposed to working the problem. I've just been dealing with this problem (and ducktape "solutions" surrounding it) for a while now and I just want to get this squared away so I can go back to my (mostly) IPv6-only bliss :D I think RFC4459 simply didn't consider L3 routing protocol based solutions. 
Probably since the usual network vendor suspects would never implement something uncouth like this, but we need not be constrained by the inefficiencies of the commercial world in the free software community, now do we :)

Speaking of which, I'm working on a babeld patch to see if my idea works. Just have to dig through the kernel code first to figure out which one of the amazingly (badly) named IP_PMTUDISC_* options I want to use to force it to neither do fragmentation nor attempt PMTU discovery for the babel socket.

Thanks,
--Daniel
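As a hedged illustration of the socket options mentioned above: on Linux, ip(7) documents IP_PMTUDISC_PROBE as "set DF but ignore Path MTU", which appears to match "neither do fragmentation nor attempt PMTU discovery". The thread does not say which option the babeld patch ended up using, so the choice of PROBE below is an assumption. The numeric fallbacks are the Linux header values, used in case Python's socket module does not export the names, and the helper names are hypothetical.

```python
import socket

# Linux values from <linux/in.h> and <linux/in6.h>; fallbacks are assumptions
# for platforms where Python's socket module does not export the names.
IP_MTU_DISCOVER = getattr(socket, "IP_MTU_DISCOVER", 10)
IPV6_MTU_DISCOVER = getattr(socket, "IPV6_MTU_DISCOVER", 23)

PMTUDISC_DONT = 0   # never set DF, fragmentation allowed
PMTUDISC_WANT = 1   # use per-route hints
PMTUDISC_DO = 2     # always DF; sends above the known path MTU fail
PMTUDISC_PROBE = 3  # set DF but ignore the cached path MTU


def probe_sockopt(family):
    """Return the (level, optname, value) triple for the given address family."""
    if family == socket.AF_INET:
        return (socket.IPPROTO_IP, IP_MTU_DISCOVER, PMTUDISC_PROBE)
    if family == socket.AF_INET6:
        return (socket.IPPROTO_IPV6, IPV6_MTU_DISCOVER, PMTUDISC_PROBE)
    raise ValueError("unsupported address family: %r" % (family,))


def make_probe_socket(family=socket.AF_INET6):
    """Create a UDP socket whose datagrams are sent unfragmented (Linux only)."""
    sock = socket.socket(family, socket.SOCK_DGRAM)
    sock.setsockopt(*probe_sockopt(family))
    return sock
```

A padded datagram sent on such a socket either arrives whole or is dropped somewhere along the path, which is the property the proposed probing relies on.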
Re: [Babel-users] MTU based routing for tunnel based babel networks?
>> IP does not support variable MTU links.
>
> Excuse me, but that's plain false. IP was designed in an environment where
> (non-ethernet) networks with various MTU standards were commonplace

Sorry, I wasn't clear. IP requires every link to have a well-defined MTU: all the nodes connected to a link must agree on the link's MTU.

Now, I agree that it is possible to simulate a variable-MTU link, as described in RFC 4459 Section 3.2, and it will mostly work. But that's not what IP was designed for, and I don't know whether it's possible to make it reliable.

> There is a way: My routing protocol just has to stop picking links that are
> obviously going to cause a problem.

Could you please describe the problem in detail? Because I'm probably missing something.

If Wireguard implements RFC 4459 Section 3.2, then when a too-large packet is pushed over the tunnel, Wireguard should synthesise an ICMP "packet too large", which will cause the sender to retry with a smaller packet. Is that not the case?

I'm not opposed to your probing idea, but I'd really prefer to fully understand the problem first.

-- Juliusz
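The behaviour asked about here (a tunnel ingress signalling a lower MTU back to the sender) can be sketched roughly as follows. This is a simplified model under stated assumptions, not WireGuard's implementation: the function name and return values are hypothetical, and the inner packet is assumed to be non-fragmentable (IPv6, or IPv4 with DF set).

```python
def tunnel_ingress(inner_len, underlay_mtu, overhead):
    """Decide what a tunnel ingress does with one inner packet.

    Returns ("forward", outer_len) when the encapsulated packet fits the
    underlay, otherwise ("too_big", tunnel_mtu), where tunnel_mtu is the
    value that would be reported to the sender in an ICMP "packet too big".
    """
    tunnel_mtu = underlay_mtu - overhead
    if inner_len <= tunnel_mtu:
        return ("forward", inner_len + overhead)
    return ("too_big", tunnel_mtu)


if __name__ == "__main__":
    # A 1500-byte packet entering a tunnel with 60 bytes of overhead
    # on a 1500-byte underlay does not fit.
    print(tunnel_ingress(1500, 1500, 60))
    # A 1420-byte packet through the same tunnel is forwarded.
    print(tunnel_ingress(1420, 1500, 60))
```

The sender is then expected to retry with packets no larger than the reported value; the scheme only works if that error message actually reaches the sender.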
Re: [Babel-users] MTU based routing for tunnel based babel networks?
On Sun, Jul 16, 2023 at 11:43:44PM +0200, Juliusz Chroboczek wrote:
> IP does not support variable MTU links.

Excuse me, but that's plain false. IP was designed in an environment where (non-ethernet) networks with various MTU standards were commonplace, and this is very much supported. Why else would we have standards for Path MTU discovery (cf. RFC1191/RFC1981) that have become mandatory for IPv6?

> And every tunnel is able to carry packets up to its MTU? If that's not
> the case, then there's no way your network can work,

There is a way: My routing protocol just has to stop picking links that are obviously going to cause a problem. The way my network is structured (remember: mesh network) there always is a path that avoids the tunnel overhead stacking problem, but since babel is blind to it, it can and does pick problematic paths sometimes.

> > Enable a config option for "minimum path MTU" on each babel node. Nodes
> > then pad all hello packets to this size and set appropriate sockopts to
> > stop the kernel from doing PMTUdisc behind our backs (on IPv6) and setting
> > DF=1 (on IPv4).
>
> We can only control fragmentation in the overlay.

True, but controlling fragmentation in the underlay is simply not necessary. If the tunnel underlay were to fragment[1], my tunnel MTU wouldn't be impacted, so it doesn't break anything and babel can feel free to use that path. Only the case where the underlay drops packets instead of fragmenting is relevant.

[1]: Which has pps performance implications and is hence usually avoided. Wireguard in particular doesn't allow fragmentation.

FYI: Do note that with IPv6 in-network fragmentation is not a "thing" anymore, this is IPv4 legacy think :) Endpoints fragment, nobody else.

> Can you explain what the tunnelling protocol will do, and whether it will
> prevent fragmentation in the underlay?

From what I observed it's clear Wireguard never fragments its UDP packets, so it likely sets DF=1 when run on top of v4 and ignores PMTU on v6.
IMO that's a reasonable behaviour for a tunnel protocol.

Thanks,
--Daniel
Re: [Babel-users] MTU based routing for tunnel based babel networks?
> Problem is when the underlay L3 network is composed of more tunnels and not
> 1500 MTU ethernet links, then at each hop the path MTU could be reduced by
> the tunnel overhead again and again and again (across the entire
> path). Hence no predictable MTU I can deploy across all my interfaces
> exists. QED :)

I'm still not following. Every tunnel has an MTU, right? And every tunnel is able to carry packets up to its MTU? If that's not the case, then there's no way your network can work, since IP does not support variable MTU links.

> Enable a config option for "minimum path MTU" on each babel node. Nodes
> then pad all hello packets to this size and set appropriate sockopts to
> stop the kernel from doing PMTUdisc behind our backs (on IPv6) and setting
> DF=1 (on IPv4).

We can only control fragmentation in the overlay. Can you explain what the tunnelling protocol will do, and whether it will prevent fragmentation in the underlay?

-- Juliusz
Re: [Babel-users] MTU based routing for tunnel based babel networks?
Hi Juliusz,

On Sun, Jul 16, 2023 at 09:22:40PM +0200, Juliusz Chroboczek wrote:
> > I've been running babel on top of my wireguard IPv6 network for a while now
> > and I have a problem that keeps biting me and I can't find a good solution
> > for: babel is oblivious to a link's MTU and picks paths that involve
> > wireguard-in-wireguard tunnels even though paths without this stacking are
> > available.
>
> Is the MTU of your interfaces set correctly? Please type
>
>     ip link show
>
> and check that the value is right.
>
> Babeld already checks the interface's MTU, so if the MTU is set correctly,
> it's a simple matter of tweaking this code:
>
>     https://github.com/jech/babeld/blob/master/interface.c#L300
>
> If the MTU is not set correctly, then you'll run into trouble with
> higher-layer protocols.

I must not have explained the problem sufficiently, because the interface MTU doesn't matter at all here. All that is important is that tunnel interfaces are involved in the L3 network carrying tunnel packets.

"Usually" the underlying L3 network is the IPv4 internet, which has a (more or less) predictable 1500 MTU, though I would call that a very 1500MTU-normative assessment. So the tunnel interface's MTU will be 1500 minus overhead. Easy.

Problem is when the underlay L3 network is composed of more tunnels and not 1500 MTU ethernet links, then at each hop the path MTU could be reduced by the tunnel overhead again and again and again (across the entire path). Hence no predictable MTU I can deploy across all my interfaces exists. QED :)

Babeld really has to take care of the *PATH* MTU, not just look at whatever is configured on the local interfaces, for this to work. Here's one way this could be done:

Enable a config option for "minimum path MTU" on each babel node. Nodes then pad all hello packets to this size and set appropriate sockopts to stop the kernel from doing PMTUdisc behind our backs (on IPv6) and setting DF=1 (on IPv4).
When paths with lesser MTU are encountered these packets will simply get dropped by the network, preventing neighbour relationships from forming. Problem solved :)

--Daniel
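The padding step of this proposal could look roughly like the sketch below. It is not babeld code: the helper names are hypothetical, the header sizes assume no IP options or extension headers, and the padding uses the Pad1 (type 0) and PadN (type 1) TLVs defined by the Babel specification (RFC 8966). Updating the body-length field of the Babel packet header is left out.

```python
# IP + UDP header bytes in front of the Babel packet (assumes no options).
IP_UDP_HEADERS = {4: 20 + 8, 6: 40 + 8}


def padding_needed(min_path_mtu, packet_len, ip_version=6):
    """Padding bytes so that the resulting IP packet is min_path_mtu long."""
    target_payload = min_path_mtu - IP_UDP_HEADERS[ip_version]
    return max(0, target_payload - packet_len)


def pad_tlvs(n):
    """Return exactly n bytes of padding built from PadN and Pad1 TLVs."""
    out = bytearray()
    while n > 0:
        if n == 1:
            out.append(0)                 # Pad1: a single zero byte
            n -= 1
        else:
            body = min(n - 2, 255)        # PadN body is at most 255 bytes
            out += bytes([1, body]) + bytes(body)
            n -= body + 2
    return bytes(out)


if __name__ == "__main__":
    hello_len = 100                       # hypothetical unpadded packet size
    pad = pad_tlvs(padding_needed(1420, hello_len, ip_version=6))
    print(len(pad), hello_len + len(pad) + IP_UDP_HEADERS[6])
```

Sent on a socket that forbids fragmentation, such a padded hello only reaches neighbours over paths that can carry the configured minimum path MTU.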
Re: [Babel-users] MTU based routing for tunnel based babel networks?
> I've been running babel on top of my wireguard IPv6 network for a while now
> and I have a problem that keeps biting me and I can't find a good solution
> for: babel is oblivious to a link's MTU and picks paths that involve
> wireguard-in-wireguard tunnels even though paths without this stacking are
> available.

Is the MTU of your interfaces set correctly? Please type

    ip link show

and check that the value is right.

Babeld already checks the interface's MTU, so if the MTU is set correctly, it's a simple matter of tweaking this code:

    https://github.com/jech/babeld/blob/master/interface.c#L300

If the MTU is not set correctly, then you'll run into trouble with higher-layer protocols.

> So this got me thinking (for the hundredth time) perhaps this should be
> something the routing protocol takes care of? Babeld would essentially have
> to pad its hello packets to a (configurable) size to detect if
> fragmentation is required (or they are being blackholed outright).

That's certainly a good idea, it would allow us to discard interfaces whose MTU is set incorrectly. I'll think it over.

> PS: Just to clarify why the tunnel stacking happens in my setup: my network
> tunnels IPv6 over IPv4 (most of the time), but I want to support IPv6-only
> underlay networks so I have wireguard tunnels with IPv6 endpoints which can
> in turn get routed over V6-over-V4 wg tunnels (when the ether is flowing
> just right).

Hehe.

-- Juliusz