Re: [Babel-users] Babel RTT settings for low and high latencies
On Thu, Jul 18, 2024 at 12:38:20PM +0200, Juliusz Chroboczek wrote: > > A year ago you said [1] that bird had not yet implemented "hysteresis on > > metrics" [...] Has bird implemented this in the meantime? > > I haven't checked the BIRD implementation recently (I know, I should be > doing more work on Babel, but I'm currently spending all my mana on > fighting the French administration), you'll need to wait until Toke comes > back from his Viking raid. We do not have oscillation prevention measures in BIRD yet. I was doing work on proto/babel most recently and what's complicating that right now is BIRD 3 (thread-next branch) development disincentivising the BIRD team from applying potentially conflicting changes as both branches will need to be maintained going forward. Perhaps it's time for me to start releasing a babel-bird fork while that's going on so we can carry on development in the meantime? It seems I lost the motivation to work on it since nothing is getting merged/released which makes testing by the community significantly less likeley and doing it in a way the team *might* pick up patches (I discussed this with them privately) is too much work for me to overcome my activation energy requirements ;) --Daniel signature.asc Description: PGP signature
Re: Current bird via bookworm-backports?
Hi Jakub and Ondřej, I'm also interested in using more recent bird versions in my deployment, but I don't use third-party repos on principle. Since I'm a DD I could just start uploading and maintaining a backports for bird2, is there I reason to not do that? My guess is you just want to avoid doing the work twice since you're already maintaining that repo outside of Debian, right? --Daniel
Re: BGP on /32 (/128) interfaces
Hi Arzhel, On Fri, Apr 12, 2024 at 11:57:38AM +0200, Arzhel Younsi wrote: > But for IPv6, it's cleaner to only require the router's link local address: > testvm2006:~$ ip -6 addr > inet6 2620:0:860:140:10:192:24:4/128 scope global > testvm2006:~$ ip -6 route > default via fe80::2022:22ff:fe22:2201 dev ens13 metric 1024 pref medium > > In Bird: > neighbor fe80::2022:22ff:fe22:2201%ens13 external; > > But then the link local address doesn't work with multihop (for obvious > reason). > bird: /etc/bird/bird.conf:22:1 Multihop BGP cannot be used with link-local > addresses I use lladdrs for BGP endpoints in my network and that works fine. I think using `direct` instead of `multihop` in the v6-lladdr case would make it work for you. One word of advice: don't use the %scope syntax, use the `interface` directive instead. I don't recall exactly why but I had some subtle problem with that. As for your v4/32 problem, give `multihop 1` a try. That enforces no routers on the path to the peer like direct but allows off-subnet endpoints. Do keep in mind the docs recommend setting the source address explicitly when enabling multihop. multihop [number] Configure multihop BGP session to a neighbor that isn't directly connected. Accurately, this option should be used if the configured neighbor IP address does not match with any local network subnets. Such IP address have to be reachable through system routing table. The alternative is the direct option. For multihop BGP it is recommended to explicitly configure the source address to have it stable. Optional number argument can be used to specify the number of hops (used for TTL). Note that the number of networks (edges) in a path is counted; i.e., if two BGP speakers are separated by one router, the number of hops is 2. Default: enabled for iBGP. The reason why direct isn't working is also clear from the docs: direct Specify that the neighbor is directly connected. The IP address of the neighbor must be from a directly reachable IP range (i.e. associated with one of your router's interfaces), >>>otherwise the BGP session wouldn't start but it would wait for such interface to appear<<<. The alternative is the multihop option. Default: enabled for eBGP. Hope that helps, --Daniel
Re: [Babel-users] Coexistence of multiple babel daemons on the same host
Hi Juliusz, On Sun, Mar 31, 2024 at 04:35:28PM +0200, Juliusz Chroboczek wrote: > > I've just come across a reason I'd want to run babel with both bird and > > babeld on the same node and have them become neighbours. > > I hope you know what you're doing. I sure hope so too, haha. No it's really quite benign. Just working around BIRD still not being quite as featureful as I need to run my network but babeld's non-atomic route replacement interacting badly with BIRD's proto/radv. See below. > The simplest solution would be to run each daemon in a different container. Yeah, I am always trying to avoid that to keep things simple and manageable for me. Keep in mind my reason for even raising this on the MLs and not just shooting patches is that I think the current UX is broken. I don't want "accidentally running two babel daemons" to cause a network outage like it did for me. > If you don't want to use containers, you'll want to use separate routing > tables for the two daemons, and then set up the correct routing rules. > First configure the routing rule: I just want to import routes into bird for use with radv "trigger". These aren't going to get exported anywhere at all so no need for multiple routing tables in the kernel and all that. I do appreciate you taking the time to write down how to do it however :) > > From what I gather by a quick skimming of the relevant details in > > RFC8966 this could work by assigning each daemon a unique LL address so > > each gets it's own entry in the neighbour table, right? > > No, there's no relation between the link-local address and the routing > table. *neighbour table*. I think the crux of the issue I was seeing is that both daemons will use eg. fe80::1 to source their babel hellos and since the neighbour table is indexed by this address whoever sees both of these announcements will experience "fighting" over who gets to own that entry or something? I'm not actually sure, should have probably kept some pcaps from when this was going on. Let me know if this makes no sense to you and I'll try to recreate the circumstances. After some careful reading of the BIRD code I've actually found I can set the LL saddr it uses using `protocol device { interface "eth0" { preferred fe80::1234; }; }` so I think I can actually get this working without code changes. I'm labbing that now. Thanks, --Daniel
Coexistence of multiple babel daemons on the same host
Hi Babelers, I've just come across a reason I'd want to run babel with both bird and babeld on the same node and have them become neighbours. The details are tedious -- my usual disclamer applies ;) This got me to think about something I'd observed in the past and been meaning to write about though: when I accidentally had both babeld and bird running on a node it would casuse severe route flapping and consequently packet loss. I wonder if we can do better here? Maybe by warning about this situation in the logs, shutting down both sides or just turning this into a supported use-case? :) >From what I gather by a quick skimming of the relevant details in RFC8966 this could work by assigning each daemon a unique LL address so each gets it's own entry in the neighbour table, right? Thanks, --Daniel
Re: [Babel-users] [RFC] Replace WireGuard AllowedIPs with IP route attribute
Hi Alexander, On Wed, Nov 22, 2023 at 12:17:49AM +0100, Alexander Zubkov wrote: > > Can you think of a use-case where fpRPF isn't enough? > > Yes. IMHO, the problem with RPF is that routing table doesn't reflect the > network topology, but only a subset of it. Right that is the fundamental problem, so my solution to that is: routing should "just represent the full network topology" :) As the routing protocol sees it anyway, since the whole point of RFP is to only allow paths that the routing system chooses. Do note that while I implement the topology information using ECMP routes there's no reason you actually have to use ECMP. You could still have regular routes in your (main) routing table and use a separate table with ECMP routes for RPF and this is very much something I want us to support. > I mean in topologies where multiple pathes are possible, you can choose > to use or even learn only a subset of those pathes. If I undestand correctly you're talking about (local) routing daemon policy here. Yes this is something you can do and my current approach of (abusing) ECMP only works when your routing policy satisfies some symmetry criteria. However as Juliusz pointed out integrating this idea into the routing protocol proper could allow using arbitrary policy without ever breaking RPF, but figuring out the details is (exciting) future work. > In that sense might be yes, the original sin is that the routing table > doesn't reflect all the topology, not only the pathes we choose for egress. > Not sure though if it is a sin, in that case routing table would be too > overcomplicated. Right routing table (modification) performance and clutter is certainly a reason to forgoe this approach but I find that for the kind of (small) networks I want to run and that many people might run using wireguard this is perfectly fine. > If I understand correctly, such fpRPF approach works only if you both learn > all possible pathes and use all of them in a multi-nexthop route. But for > example in the Internet with its advanced BGP announcement policies it is > not true at all. Right to deply fpRPF on a large scale you really need some kind of support from the routing protocol. AFAIK there's nothing like that for BGP yet? I don't think it's completely inapplicable either though, might still work for iBGP with appropriately designed routing policy. My interest lies mostly with doing this using babel though. > So from my point of view it is good to split the topology definition > (ingress decapsulation) and the chosen pathes (egress routing). Because it > is related, but still different processes. So the system can be more > flexible. Although we need to repeat common things and keep ingress and > egress consistent/synced. To me flexibility is only desirable insofar as it doesn't conflict with system security. Source address authenticity is an important property I wouldn't want to give up here. If it's easier to ignore source address filtering than it is to implement it nobody is going to do it (cf. the internet) and I think that's the crux of the problem with "encap". Wireguard gifted us this amazing state of source filtering being the easy default and I want to keep it that way. > my point is that RPF (with its variations too) has its bounds and cannot > be a universal solution, there is no silver bullet here. No, ofc. nothing we do can possibly "fix everything for everyone" but that's no reason not to try a new approach for a particular problem in a particular use-case :) --Daniel signature.asc Description: PGP signature
Re: [Babel-users] [RFC] Replace WireGuard AllowedIPs with IP route attribute
Hi Erin, Juliusz, On Sat, Nov 18, 2023 at 11:21:57AM +0100, Erin Shepherd wrote: > On Sat, 18 Nov 2023, at 03:19, Daniel Gröber wrote: > > That would be a problem as I specifically want to tie the source address > > filtering to this too. I'll have a look at the internals (if and) when I > > get around to starting work on this. > > Is tying source address filtering to the routing table the right thing to > do here? It seems to me that it would cause issues similar to those we > see more generally with Unicast Reverse Path Filtering IMO not providing a way to do source address filtering at the routing level was the original sin :) There is certianly the multihoming challange to be overcome as traditional BCP38 style filtering doesn't cut it in the general case. I have some ideas on how to deal with this. I've done some experiments and found that in Linux multi-nexthop routes actually match reverse path lookups (using nftables "rt") for _any_ of the source interfaces involved. I think this can be used to build RFC 3704 style Feasible Path Reverse Path Forwarding when the routing daemon involved supports ECMP. This experiment is what got me interested in having via-wgpeer in the routing table in the first place, once we have that we can apply the above idea not just at the interface level but at the wg peer level. Neat. Can you think of a use-case where fpRPF isn't enough? It's also noteworthy that once we have this support for via-wgpeer it'd be possible to apply ip-rule policy to the filtering decision. Perhaps that gives some additional power for more fun use-cases :) On Sat, Nov 18, 2023 at 01:22:03PM +0100, Juliusz Chroboczek wrote: > Issues are caused by the kernel performing filtering that the routing > protocol is not aware of: it causes the routing daemon's routing table to > no longer match the effective forwarding table (the kernel's routing > table). That's the reason why uRPF breaks most routing protocols, that's > the reason why we have trouble making Wireguard work with Babel, and also > the reason behind https://github.com/jech/babeld/issues/111. Right on the money as always. This idea has been on my mind too. > Contrariwise, we can teach Babel to explicitly take into account the > kernel features that we're interested in using. Thus, Babel could be > aware of the restrictions placed on a wireguard interface, and collaborate > with Wireguard so that the routing table and the forwarding table remain > congruent. I haven't looked at the issue in detail, but I believe that > would be an interesting (short-term) research project, one that I would be > glad to collaborate with (but not necessarily lead, at least not right now). Sounds interesting do you have a funding source in mind? > For the specific case of source address filtering, Babel already has an > (implemented) extension to deal with source addresses, and I encourage you > to consider whether it can be used to deal with the issue at hand. Please > see https://arxiv.org/pdf/1403.0445.pdf and RFC 9079. I don't think I mentioned this to you yet, but I have another one of my crazy ideas of doing something vaguely similar to BGP flowspec with babel. Restricted to IP source/destination address, so no L4 stuff. I just want to represent firewall policy using ipv6 subtrees and distribute it in realtime using babel :) Unfortunately this is currently stalled due to an apparent nft rt match kernel bug preventing me from representing multiple possible outcomes since I want to support dropping, accepting but also stateful firewalling of matching flows. --Daniel
Re: [Babel-users] [RFC] Replace WireGuard AllowedIPs with IP route attribute
Hi Alexander, On Thu, Nov 09, 2023 at 12:57:26PM +0100, Alexander Zubkov wrote: > I heard recently about the lightweight tunnel infrastructure in Linux > kernel (ip route ... encap ...). And I think this might be helpful in > the context of this thread. I hadn't seen that yet, thanks for pointing it out. > Linux kernel allows already to add encapsulation parameters to the route > entry in its table. So you do not need to create tunnel devices for > that. And wireguard encapsulation and destination might be added there > too. Right, I think ultimately it's going to come down to either technical constraints or in the absence of that, maintainer preference whether via-wgpeer or "encap wg" is the way. The idea is very similar anyway. > But as I understood the technology, it works only in one way (for > outgoing packets) and the decapsulation should be processed separately, > for example in case of VXLAN and MPLS they have their own tables. That would be a problem as I specifically want to tie the source address filtering to this too. I'll have a look at the internals (if and) when I get around to starting work on this. Thanks, --Daniel signature.asc Description: PGP signature
Re: babel rtt in bird: How to set the RTT equal to the latency?
Hi Marek, On Wed, Oct 18, 2023 at 10:49:34PM +0200, Marek Küthe wrote: > I have seen that there is both rtt-min and rtt-max setting in bird, but > no max-rtt-penalty. So my question is how would I represent the above > expression in bird? The `rtt cost` option is the equivalent in bird. --Daniel
Re: Doing something wrong with VRF's
On Thu, Sep 28, 2023 at 04:03:33PM +, Nigel Kukard via Bird-users wrote: > Well .. I feel embarrassed now. The issue was a missing "kernel table ;" > in the kernel protocol. Don't feel bad. I was going to look into using VRFs with bird and now there's a nice worked example ready to play with at in my inbox :) --Daniel
Re: [RFC] Replace WireGuard AllowedIPs with IP route attribute
Hi Juliusz, On Mon, Aug 28, 2023 at 07:40:51PM +0200, Juliusz Chroboczek wrote: > I've read the whole discussion, and I'm still not clear what advantages > the proposed route attribute has over having one interface per peer. Is > it because interfaces are expensive in the Linux kernel? Or is there some > other reason why it is better to run all WG tunnels over a single interface? Off the top of my head UDP port exhaustion is a scalability concern here, just as an example, not that I'd actually ever need that many peers in my network :) One wg-device per-peer means we need one UDP port per-peer and since currently binding to a specific IP is also not supported by wg (I have a patch pending for this though) there's no good way to work around this. Frankly having tons of interfaces is just an operational PITA in all sorts of ways. Apart from the port exhaustion having more than one wg device also means I have to _allocate_ a new port for each node in my managment system somehow instead of just using a static port for the entire network. This gets dicy fast as I want to move in the direction of dynamic peering as in tinc. Other than that my `ip -br a` output is getting unmanagably long and having more than one device means I have to keep ACL lists in sync all over the system. This is a problem for daemons that don't support automatic reload (babeld for example :P). I also have to sync the set of interface to nftables which is easy to get wrong as it's still manual in my setup. All of that could be solved, but I would also like to get my wg+babel VPN setup deployed more widely at some point and all that friction isn't going to help with that so I'd rather have this supported properly. --Daniel signature.asc Description: PGP signature
Re: [Babel-users] [RFC] Replace WireGuard AllowedIPs with IP route attribute
Hi Kyle, On Mon, Aug 28, 2023 at 11:40:48AM -0400, Kyle Rose wrote: > On Sat, Aug 19, 2023 at 5:25 PM Daniel Gröber wrote: > > Having read Kyle's use-case I'm thinking my original plan to extend the wg > > internal source-address filtering to use a rt lookup with our new attribute > > would not be maximally useful so now my thinking is we should just have a > > boolean toggle to disable it explicitly per device. > > If there is interest among the maintainers in eventually merging a > change with a per-interface knob to turn off the source IP check, I > will go through the trouble of putting together an initial pass at > this. I don't want to spend the time if there is firm opposition to > the idea. I think just a patch to turn off the wg source IP check is not very useful at the moment. It would encourage bad source IP filtering practices when multiple peers are involved as no mechanism for identifying the sending peer is available at the policy routing or netfilter level currently. I think such a patch would have to get merged after some kind of mechanism to identify and filter based on the sending wg peer is available. So if you want to move this along I would suggest working on this first. Since I'm also interested in having this feature I'm happy collaborate. It's just hard to find the motivation for writing more wg patches when my pending ones have (mostly) been lying around for a year without a response, but if you're also keen on this feature I'm sure it's easier to stay motivated together :) If my kernel patches go ignored for too long too I'll probably just resort to getting a forked DKMS wireguard module into Debian with this work. Perhaps that approach (or a package in a different distro) would work for your use-case too? --Daniel
Re: Pending bird patches and discussions for proto/babel
Hi Maria, On Tue, Jul 18, 2023 at 12:28:35AM +0200, Daniel Gröber wrote: > On Thu, Jun 01, 2023 at 05:33:42PM +0200, Maria Matejka via Bird-users wrote: > > > Lastly my v5 route selection patch should also be ready, all threads of > > > discussion should be resolved and there are no known bugs. However I've > > > not > > > received any testing reports, hint, hint peeps °>° > > > > OK, will look at it as well. Please send me an owl, a pigeon or another > > avian carrier compatible with RFC 1149 if we don't react in a reasonable > > time (est. a fortnight). > > Just a reminder on this :) I humbly beseech thee to consider deinself reminded again :3 --Daniel
Re: [Babel-users] [RFC] Replace WireGuard AllowedIPs with IP route attribute
Hi Steffen, from the questions you ask I fear you've misunderstood my intention to "replace" AllowedIPs. I'm strictly talking of the _implementation_ (allowedips.c) in Linux particular. The netlink interface would naturally stay unchanged. On Sat, Aug 19, 2023 at 10:00:17PM +0200, Steffen Vogel wrote: > Interesting ideas! I am wondering if this complexity is really necessary? I think so, right now wg's behaviour just prevents a number of advanced use-cases which is a pitty. > My biggest concern about the introduction of a route attribute is that > this adds complexity for users. WireGuard's simplicity (and portability) > have been important factors for its success. Nothing would change for users that don't use this feature. > A route attribute would introduce another source for the crypto-routing > peer selection process. What happens if the two mechanisms select > different peers? Which one would have precedence? If you read my previous mail carefully you'll find I specified how this would shake out. The rt approach will necessarily have to override AllowedIPs to be useful. > Similarly also for incoming packets. WireGuard's current principle is > really easy to understand. If the source address in in the peers > AllowedIP list, we will accept the packet. If not its discarded. This is > a central part of WireGuard's crypto-key routing feature which would > become more complex. Having read Kyle's use-case I'm thinking my original plan to extend the wg internal source-address filtering to use a rt lookup with our new attribute would not be maximally useful so now my thinking is we should just have a boolean toggle to disable it explicitly per device. Then users can do whatever fancy rt (reverse-path) lookups they want with nft. I'm particularly happy that nft rt lookups will actually do the right thing with multipath/ECMP routes (any of the involved devices satisfies the lookup) so this should actually work out fine in my case at least. Mmore prototyping is required here though. > A second concern I have with the use of route attributes is limited > portability. Not all platforms support them. How do we handle WireGuard > userspace implementations? No need. The Linux's rt table is Linux specific I have no plans to introduce this on other platforms, that's for other intersted souls to tackle. Again "replace" was referring to implementation concerns. > - Networks which are found in a Peers AllowedIP list will be installed as a > kernel route That's configurable. I always turn this off when dealing with rt daemons. > - Kernel routes with the peers unique IP address as next-hop will be added to > the Peers AllowedIPs list. > > This rather simple feature allows user to pair cunicu with a software > routing daemon like Bird2 while using a single WireGuard interface with > multiple peer-to-peer links. Sounds like you do what I want to do at kernel level in userspace, then at least we can agree this is a useful thing :) > In my setup a periodic synchronization worked fine. But I agree that it > would be nice if we could have a Netlink multicast group for subscribing > to changes like we also have for other parts of the Linux network stack > like routing tables, or link states. This feature was already discussed > on the WireGuard mailing list [7]. But unfortunately the patch was never > accepted. Maybe we can revisit this patch? Sounds like a plan, I'll have a look at it. > [1] Others planned features are: > - IP-autoconfiguration by deriving link-local addresses from peers public keys That's been discussed so many times before on the ML and someone always realises Jason is right and there's no point to this in the end. Key distribution is the crux of the problem. --Daniel
Re: [RFC] Replace WireGuard AllowedIPs with IP route attribute
Hi Bernd, On Sat, Aug 19, 2023 at 07:50:38PM +0200, Bernd Naumann wrote: > Chances are high I do miss something, but I've just set AllowedIPs to > 0.0.0.0/0 and ::/0 and just used the routing protocol of my choice and > filters to select which routes got exported and imported... :shrug: Right, let me expand a bit. You are absolutely right, right now if you want to use wg with dynamic routing daemons you essentially have to have one wg tunnel per remote node with AllowedIPs=::/0 and that works just fine at small scales. The idea here is that we would like to go back to having just one tunnel for all nodes involved in this particular network instead, due to general operations scalability, mine is a mesh network so the number of tunnels gets rather large quickly :) Lots of tunnels suck for various reasons, monitoring if they're all up and configured properly is one example but my understanding from previous discussions is the performance is probably not ideal either. --Daniel
[RFC] Replace WireGuard AllowedIPs with IP route attribute
Hi wireguard, birds, and babelers, tl;dr I want to add a new Linux route attribute (think "via $wgpeer") to supplement wireguard's internal AllowedIPs logic for both routing and source address filtering. I've been pondering how to better integrate wireguard into dynamic routing daemons, particularly BIRD and babeld. Essentially we want to be able to dynamically add/remove AllowedIPs depending on current reachability and/or link quality stats. Looking at the wg netlink API I see two major efficiency/scalability problems: 1) there is no way to be notified of changes in AllowedIPs made by other processes meaning we have to do periodic scans and 2) a peer's AllowedIPs set can only be replaced wholesale, not modified incrementally. This is problematic as "someone" might, in the worst case, want to install an entire internet routing table's worth of AllowedIPs and the set will likely change frequently. FYI: The IPv4 table has ~1M entries at present, yikes. Assuming external AllowedIPs changes are infrequent occationally dumping them all to keep a consistent view of the state shouldn't be too much of an issue as long as the netlink interface is performant enoug, so I'm going to concentrate on the add/remove API for now. Instead of doing the obvious thing and adding a more efficient incremental AllowedIPs netlink interface I figure why not just add a route attribute to select a target wg peer on a device. That way we could not only save memory (no separate AllowedIPs trie) but also simplify routing daemon implementation considerably. This would mirror how on ethernet you can have `dev eth0 via $router_ip`. I'm still reviewing the net/ code to find the best way to do this, but I'm thinking either a new RTA_WGPEER, like: `default dev wg0 via-wgpeer $peer_pubkey` or perhaps re-using RTA_VIA and keying off a statically configured AllowedIP addresses. To start I'd make this an opt-in replacement for our usual AllowedIPs logic, making sure to only activate it if any via* RTAs are active on a particular device, but if it proves to work well I don't see why we couldn't adapt the netlink code to maintain AllowedIPs using this RTA (but invisible to userspace) to re-use the same code and get rid of allowedips.c altogether. That's assuming this ends up being less code overall or perhaps more performant. Happy to hear your thoughts, --Daniel
Re: Pending bird patches and discussions for proto/babel
Hi Maria, On Thu, Jun 01, 2023 at 05:33:42PM +0200, Maria Matejka via Bird-users wrote: > > Lastly my v5 route selection patch should also be ready, all threads of > > discussion should be resolved and there are no known bugs. However I've not > > received any testing reports, hint, hint peeps °>° > > OK, will look at it as well. Please send me an owl, a pigeon or another > avian carrier compatible with RFC 1149 if we don't react in a reasonable > time (est. a fortnight). Just a reminder on this :) --Daniel
Re: Recursive nexthop via kernel route in proto static not working
Hi Maria, So I just did a bird restart on the router I was testing on after poking around some more and now the recursive route is suddenly working without a change in config. On Tue, Jun 27, 2023 at 09:50:14PM +0200, Maria Matejka wrote: > this looks fishy. I tried a trivial config like this on my laptop as follows > and it works. Could you please disclose the full table? My master6 just has a full-table plus kernel/device routes. For some reason before the restart all of them were "unreachable"[1] but now they seem to resolve via the default route now. I didn't save the dump I did before the restart unfortunately. [1]: Not sure why, but it doesn't matter since I just use the BGP feed on this router as a reflector and BGP routes are not exported to the kernel. None of the BGP routes were covering :: (I checked with `route show for ::`) so I don't really see why `recursive ::` suddenly works now. I tried this on another one of my routers that has a similar setup and it's also working there. I guess I'll just keep an eye on this for now. > There are some details in the recursive nexthop resolution algorithm > (preventing infinite resolution loops) which may apply to your case. Can you point me to the right bits in the code so I can have a look? > BTW, is there really no message in the log? I don't see any log messages, but maybe my verbosity itn's high enough. What options do I have to twiddle to make sure? Thanks, --Daniel
Re: Recursive nexthop via kernel route in proto static not working
Hi Alexander, On Tue, Jun 27, 2023 at 04:48:46PM +0200, Alexander Zubkov wrote: > Not sure, but I would guess it can be related to the local address. It > might try to pick the first interface with such network. > Could you try your setup with some route that has the nexthop from a > unique subnet configured on the interface? At least to check if it will > become reachable or not. Ok, if I add a route like `2000::/64 dev enp1s0 via fe80::fc00:3ff:fec7:cd05` to the kernel and recurse through that it does work. protocol static static_export_kernel2 { ipv6; igp table master6; route 2001:db8::/64 recursive 2000::1234; } $ ip -6 route add 2000::/64 dev enp1s0 via fe80::fc00:3ff:fec7:cd05 $ birdc show route protocol static_export_kernel2 all 2001:db8::/64unicast [static_export_kernel2 16:01:20.131] * (200) via fe80::fc00:3ff:fec7:cd05 on enp1s0 Type: static univ > Or it might be that routes imported from the kernel are marked as > recursive, so it does not resolve because Bird does not allow double > recursion. If I remove the 2000::/64 route the recursive route goes back to unreachable. So something seems to be preventing recursion through the default route specifically, but it evidently doesn't have anything to do with the route coming from the kernel. $ ip -6 route del 2000::/64 dev enp1s0 via fe80::fc00:3ff:fec7:cd05 $ birdc show route protocol static_export_kernel2 all 2001:db8::/64unreachable [static_export_kernel2 16:01:20.131] * (200) Type: static univ Weird. Thanks for your suggestions, --Daniel
Recursive nexthop via kernel route in proto static not working
Hi, I'm trying to configure a static route that uses the system's default router. The default router is out of my control and is announced via IPv6 RA. Since the MAC might not be stable I'd prefer not to hardcode the router's link-local address. So I tried: protocol static static_export_kernel2 { ipv6; igp table master6; route 2001:db8::/48 recursive ::; } The thinking being that :: ought to be resolvable through the default route imported from proto kernel: birdc show route for :: all Table master6: ::/0 unicast [kernel1 2023-03-22] * (10) via fe80::fc00:3ff:fec7:cd05 on enp1s0 Type: inherit univ Kernel.source: 9 Kernel.metric: 1024 Kernel.hoplimit: 64 However the static route ends up being "unreachable": 2a0d:f302:114::/48 unreachable [static_export_kernel2 13:59:32.819] * (200) Type: static univ Any ideas why this isn't working? Thanks, --Daniel PS: The reason I have to create this static route when it's already covered by the default route has to do with an esoteric policy routing setup and multiple routing daemons ask for details at your own peril :)
Re: Graceful shutdown request signal
Hi Erin, On Tue, Jun 20, 2023 at 08:20:50PM +0200, Erin Shepherd wrote: > I run bird on a system which uses systemd as a service supervisor, and > would like to implement graceful restart in a way which works well with > it. I'm also interested in getting this working. I'm wondering how graceful restart is supposed to behave to begin with though. Last time I just tried `birdc graceful restart` I was surprised that this actually makes bird exit with rv=0 instead of ... well actually restarting. Is that normal or a bug in the Debian packaging/systemd service? I suppose if you're supposed to use graceful restart just before rebooting the system it makes sense but at the time I just wanted to do a full restart to update the running bird executable without causing traffic disruption and was rather surprised when it didn't come back up. > In particular, what I'd like to do is: > • If I restart the bird service (e.g. for an upgrade), Bird performs a > graceful restart > • If I manually stop bird (systemctl stop bird2), shutdown the system, or > any other action which is liable to cause a longer period of downtime, > perform a graceful restart (for BGP at least) When you want to tickle special behaviour out of systemd I found it best to not try and do everything in one unit. Using the dependency system you can stop another unit before bird.service gets stopped, this can then call birdc graceful restart, like this: bird-graceful-restart.service: [Unit] After=bird.service Requires=bird.service [Service] Type=oneshot RemainAfterExit=yes ExecStop=birdc graceful restart [Install] WantedBy=bird.service That way you can enable/disable system wide graceful bird restart by `systemctl enable bird-graceful-restart.service`. The BindsTo+After means stopping bird.service (which is part of restart) will first stop bird-graceful-restart.service, which can then casually signal bird to perform graceful restart without a race (I think). WanteBy=bird.service is needed as ExecStop will only run when a service has actually been started. Also some special handling for the case where bird is already gone (crashed or exited) and birdc would fail might be needed. Excercise for the reader :) Doing some quick testing this actually seems to do what I want. Somewhat unexpectedly `systemctl restart bird.servce` with the above enabled does then actually do what you'd expect since restart is just stop+start. Bird will exit on getting the graceful restart command so systemd can't try to stop it some more but the subsequent start will make it come back up. Neat. On system shutdown only the stop part will happen so bird exiting is actully useful now and that too should work as expected. > Ideally systemd would have an ExecStopRestart option which could be used > to distinguish between stops and restarts, but unfortunately it doesn't > (I might submit a feature request for this regardless). However, it does > support `KillSignal and RestartKillSignal options.` I've been perplexed by this in the past also. Don't even get me started on the fact that ExecStop= runs even when a service already terminated on it's own, to like stop it some more... :) The semantics are really bonkers sometimes but it's actually rather flexible if you look at the nuaunces of the different dependency types a bit. --Daniel
Re: [PATCH] [RFC] Babel: Implement route daming with fixed delay
Hi Juliusz, On Tue, Mar 07, 2023 at 01:20:28PM +0100, Juliusz Chroboczek wrote: > To be honest, we hacked things until we had acceptable worst-case behaviour. > > We had two networks to experiment with: Nexedi's production network > (hundreds of tunnels over the public IPv6 Internet) and a simulated > network we built ourselves which we believed represented the worst case (a > bufferbloated diamond network). We built a first prototype, which we > instrumented to log RTT samples and route flaps, and noticed three things: > > 1. in the production network, the RTT signal is noisy (see figures 4 and 6 >of [1]); > > 2. in the bufferbloated diamond network, when we switch away from >a congested route, we switch back too early, before the buffers have >had time to drain; > > 3. in the diamond network, we tend to switch routes as we oscillate around >a common value. I'm working on comparing the exponential filter behaviour to my damping approach, I was wondering if you still have any scripts/notes on how the simulations for Fig 10 and 11 were setup? > References: > > [1] https://arxiv.org/pdf/1403.3488.pdf > Hence, the three mechanisms: > > 1. smoothing of the RTT, to makes the signal less noisy; the smoothing is >exponential just because it's easy to implement; > > 2. saturating map from RTT to metric, so that congested routes all appear >equally bad, and we don't switch back before the buffers drain; Re-reading this, was the time until a route is reconsidered made dependent on the metric/RTT difference on purpose to get this draining behaviour? While my timer approach doesn't mirror this currently I'm not opposed to it, I just wonder if metric difference is the right variable here. Thanks, --Daniel
Re: Pending bird patches and discussions for proto/babel
Hi Maria, On Thu, Jun 01, 2023 at 05:33:42PM +0200, Maria Matejka via Bird-users wrote: > > Toke's first two patches are ready for merging, but the third (metric > > smoothing) is contested since it conflicts with my route selection patch in > > its current form and if adapted it conflicts with my scalability goals for > > babel instead. Possibly premature optimization admittedly ;) > > We generally thought that the patches were still kinda work in progress, > sorry if our non-responsivity could be interpreted as negligence. No worries, I figured that's what was going on. > > That being said since RTT metric is a new feature anyway I would prefer we > > just go with my route damping approach, mark RTT as experimental and get > > some more wider testing through a released version. > > That looks like a feasible way to go. Shall we then just try to merge these > patches as they have been sent to the list? I have some more changes to make to the dampening patch, but if we're in agreement that this direction is the way to go I'll get a v1 out ASAP. You can merge Toke's patches independently of that though. RTT without smoothing is still useful :) > > Lastly my v5 route selection patch should also be ready, all threads of > > discussion should be resolved and there are no known bugs. However I've not > > received any testing reports, hint, hint peeps °>° > > OK, will look at it as well. Please send me an owl, a pigeon or another > avian carrier compatible with RFC 1149 if we don't react in a reasonable > time (est. a fortnight). I'm afraid quality-of-life concerns for the carriers involved prevent this as a method of delivery. Perhaps once RFC 7511 is more widely deployed? I'd be open to sending a reminder mail over conflcit-free silica transmission line though :D Chirp-chirp, --Daniel
Pending bird patches and discussions for proto/babel
Hello birds, I just thought I'd put together an overview of as yet unresolved proto/babel patches and discussions to stimulate some activity :) Toke's RTT series: - lib/timer: Add current_time_now() function for immediate timestamp - babel: Add support for the RTT extension - babel: Add route metric smoothing My route selection revamp: - [v5] Babel: Replace internal route selection by bird's nest - [DRAFT] Babel: Implement route daming with fixed delay(s/daming/damping/) Toke's first two patches are ready for merging, but the third (metric smoothing) is contested since it conflicts with my route selection patch in its current form and if adapted it conflicts with my scalability goals for babel instead. Possibly premature optimization admittedly ;) Initially Toke was playing with the idea of exposing the smoothed metric to route filters in a followup patch. Me and Juliusz oppose this, for different reasons and the consensus is to not do that. My stance is that we could choose to merge smoothing as long as we don't expose it's implementation internals (the smoothed metric) to users. This way we can replace it with a more performant approach later. That being said since RTT metric is a new feature anyway I would prefer we just go with my route damping approach, mark RTT as experimental and get some more wider testing through a released version. Lastly my v5 route selection patch should also be ready, all threads of discussion should be resolved and there are no known bugs. However I've not received any testing reports, hint, hint peeps °>° --Daniel
Re: Radv proto sending adverts on wrong interface
Hi Kees, On Mon, Mar 13, 2023 at 07:23:17AM +0100, Kees Meijs | Nefos wrote: > About VLAN configuration: I guess one should never ever use VLAN_DEFAULT > c.q. VLAN 1 at all. Vendors often think differently about this case, > sometimes allowing to have a .1Q tag, sometimes not. Or sort of "both" as > well. In combination with protocols such as LACP or (variations on) STP > resulting in great fun, which of course is very sarcastic. I changed the default using `default-vlan-id 4095` for that very reason. Perhaps this is still a default vid related bug. I'll see if I can reproduce this with some other vid. > My two cents after opening up a box with a new switch, starting your > configuration: explicitly disable VLAN1 and continue using others. Same goes > for VLAN > 4090 by the way. These numbers are sometimes "reserved". The brocade docs list the reserved vlans (4091, 4092) explicitly and allow to change them like the default vid. The 4095 default vid I'm using is the example they use in the docs so I'd be surprised if that's broken. Using such low vlan ids is a holdover from using some ath9k wifi APs that only supported 4-bit vids so we had to make every number count ;) I'll definetly reconsider that in the future haha. Thanks, --Daniel
[PATCH] Fix sk sending link-local scope multicast on wrong interface
When radv sends to the ipv6 all-nodes multicast address (ff02::1) we don't set sin6_scope_id because ipa_is_link_local isn't satisfied. This will cause packets to be sent via the wrong interface. To fix this we use ip6_classify, which returns SCOPE_LINK for unicast fe80::/64 as well as multicast ff02::/8 destinations. --- Tested with bird 2.0.10 on Debian 11. sysdep/unix/io.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sysdep/unix/io.c b/sysdep/unix/io.c index e131ca41..f77b4a51 100644 --- a/sysdep/unix/io.c +++ b/sysdep/unix/io.c @@ -229,7 +229,7 @@ sockaddr_fill6(struct sockaddr_in6 *sa, ip_addr a, struct iface *ifa, uint port) sa->sin6_flowinfo = 0; sa->sin6_addr = ipa_to_in6(a); - if (ifa && ipa_is_link_local(a)) + if (ifa && ip6_classify(&a) & SCOPE_LINK) sa->sin6_scope_id = ifa->index; } -- 2.30.2
Radv proto sending adverts on wrong interface
Hi, I'm using bird as a replacement for radvd since the latter has a longstanding issue with sending adverts on unconfigured interfaces under complex conditions. Turns out bird has a similar issue :) Looking at the code, when opening the socket for an interface in radv_sk_open we set sk->iface as you'd expect, which should cause packets to be sent directly via this interface. However radv sends packets to the all-nodes multicasts address for periodic adverts, see radv_send_ra. This then calls sk_send_to which (eventually) calls sockaddr_fill6. Here rther we find this code: if (ifa && ipa_is_link_local(a)) sa->sin6_scope_id = ifa->index; This would seem to be the problem to me, since a=ff02::1 doesn't pass this check so the sendmsg call goes out without the interface-index being communicated to the kernel. It looks like that causes it to just pick a random interface though I would expect it to just multicast this. Not sure why that is but it's broken on our side either way. I'd add a check here for whether the saddr is link-local too. That should cover this case. Any comments/objections? Thanks, --Daniel
[PATCH] [RFC] Babel: Implement route daming with fixed delay
In order to prevent RTT based routing from causing persistent traffic oscillations we delay core rte announcement of each prefix by a configurable but metric invariant amount of time. Initial announcements and withdrawals will go through undelayed but a chnage in route for a prefix will cancel the previous announcement and reset the delay. --- This is just quickly thrown together to demonstrate the idea of how to replace metric smoothing with something more simplistic. Obviously the performance of using a linked list for keeping the scheduled announcements is suboptimal, our binary heap could be used instead. I'm not quite sure all the logic works out but let's discuss the concept before spending more time on it :) This could be extended to make the delay depend on metric difference if it turns out that's necessary I just skipped it for simplicity. proto/babel/babel.c | 79 +++-- proto/babel/babel.h | 11 +-- 2 files changed, 86 insertions(+), 4 deletions(-) diff --git a/proto/babel/babel.c b/proto/babel/babel.c index 2243fc89..d6d99a37 100644 --- a/proto/babel/babel.c +++ b/proto/babel/babel.c @@ -62,6 +62,7 @@ static inline int gt_mod64k(uint a, uint b) static void babel_expire_requests(struct babel_proto *p, struct babel_entry *e); static void babel_announce_rte(struct babel_proto *p, struct babel_entry *e, struct babel_route *r); +static void babel_sched_announce_rte(struct babel_proto *p, struct babel_entry *e, struct babel_route *r); static void babel_rte_update_unreachable(struct babel_proto *p, struct babel_entry *e, u8 announce); static void babel_send_route_request(struct babel_proto *p, struct babel_entry *e, struct babel_neighbor *n); static void babel_send_seqno_request(struct babel_proto *p, struct babel_entry *e, struct babel_seqno_request *sr, struct babel_neighbor *n); @@ -653,7 +654,7 @@ done: WALK_LIST2(r, n, nbr->routes, neigh_route) { r->metric = babel_compute_metric(nbr, r->advert_metric); - babel_announce_rte(p, r->e, r); + babel_sched_announce_rte(p, r->e, r); } } } @@ -772,6 +773,50 @@ babel_announce_rte(struct babel_proto *p, struct babel_entry *e, struct babel_ro } } +/* + * When route damping is enabled we have to delay updates to the kernel FIB. We + * do this by queueing some updates before actually announcing them to core. + */ +static void +babel_sched_announce_rte(struct babel_proto *p, struct babel_entry *e, struct babel_route *r) +{ + struct babel_config *cf = (void *) p->p.cf; + + if (!cf->route_damping || + !e->selected || + e->selected->metric < r->metric || + /*^ When the best route's metric is worse than the one we're about to + * announce it's safe to do so immediately even if route damping is on as + * it won't get exported to most other protocols. Note add-path BGP is + * still notified but we're not concerned about micromanaging the + * stability of exotic setups, only our local kernel FIB's impact. */ + r->metric == BABEL_INFINITY || !r->feasible) + /*^ This checks if announce_rte will do a withdrawal. */ + { +if (e->scheduled) { + rem_node(&e->sched); + e->scheduled = NULL; +} + +babel_announce_rte(p, r->e, r); + } + else + { +struct babel_iface *ifa = r->neigh->ifa; + +if (!e->scheduled) + add_tail(&ifa->sched_rtes, &e->sched); +e->scheduled = r; + +r->sched_time = current_time() + cf->route_damping; + +if (!ifa->want_sched) { + ifa->next_sched = r->sched_time; + ifa->want_sched = 1; +} + } +} + /* * Functions to send replies */ @@ -1371,7 +1416,7 @@ babel_handle_update(union babel_msg *m, struct babel_iface *ifa) e->updated = current_time(); } - babel_announce_rte(p, e, r); + babel_sched_announce_rte(p, e, r); } void @@ -1683,6 +1728,32 @@ babel_iface_timer(timer *t) p->triggered = 0; } + if (ifa->want_sched && now_ >= ifa->next_sched) { +TRACE(D_EVENTS, "Announcing scheduled rtes on %s", ifa->ifname); + +struct babel_entry *e, *ex; + +ifa->next_sched = TIME_INFINITY; + +u8 all_announced = 1; +WALK_LIST_DELSAFE(e, ex, ifa->sched_rtes) { + struct babel_route *r = e->scheduled; + ASSERT(r); + if (now_ >= r->sched_time) { + rem_node(&e->sched); + babel_announce_rte(p, e, r); + } else { + all_announced = 0; + if (r->sched_time < ifa->next_sched) + ifa->next_sched = r->sched_time; + } +} + +if (all_announced) { + ifa->want_sched = 0; +} + } + btime next_event = MIN(ifa->next_hello, ifa->next_regular); if (ifa->want_triggered) next_event = MIN(next_event, ifa->next_triggered); tm_set(ifa->timer, next_event); @@ -1862,6 +1933,8 @@ babel_add_iface(struct babel_proto *p, struct iface *new, struct babel_iface_con init_list(&ifa->msg_queue); ifa->send_event = ev_new_init(ifa->pool, babel_send_queue, ifa)
Re: [PATCH 0/3] babel: Add support for the RTT extension
Hi Maria, Hi Toke, On Tue, Feb 28, 2023 at 02:07:06PM +0100, Maria Matejka via Bird-users wrote: > > > I think it's probably simpler to just re-announce any route that's still > > > converging every time we go through the routing table. > > > > Simpler, yes, but I do want to be able to maintain internet scale routing > > tables through babel eventually so slashing every little bit helps :) > > In version 2, update of non-best route is propagated only to some protocols > like pipes, add-path BGPs and alike. Ok, that's good for either approach. > In version 3, this is even more smoothed as all updates of one prefix are > exported asynchronously to each protocol, being notified after your Babel > ends the task (event, socket, timer), dampening best route oscillation or > other flaps. I don't quite understand why this would damp oscillations? Do you mean there's explicit route flap damping support in v3 or just that this is a side-effect of the new async world? I'd like to know more about either :) If we do have actual BGP style damping in nest in v3 I'm not sure there's much point in doing essentially the same thing in our proto. At the very least that would be a good reason to keep the babel specific daming easy to remove if it's about to become superceeded by direct nest support anyway. > This way, I'm not so scared about Babel periodically updating many routes. > BIRD has to withstand it. I still think doing uneccessary work/computation is just dumb if we can avoid it :) On Tue, Feb 28, 2023 at 04:45:35PM +0100, Toke Høiland-Jørgensen wrote: > Right, sure, that's a nice property, but I'm not actually sure how what > you sketched out above accomplishes that? Don't worry I'll send an RFC patch soon if I can make it work out, just got tied up by some (mild) covid. > > Simpler, yes, but I do want to be able to maintain internet scale routing > > tables through babel eventually so slashing every little bit helps :) > > Heh, yeah, I would like to eventually be able to do that as well, but I > think there are other optimisations we need to do first. For instance, > walking the entire routing table every second is not going to work in > the first place in this case :) True, but might as well throw myself at this RTT stuff while I have the time and energy. Large scale route table performance testing will have to wait for another day since there's not much point making it performant if the features I want/need aren't supported (and performant! haha). > >> Bear in mind that the currently selected route can also be converging, so > >> predicting when two routes "cross" gets complicated quickly. Simpler to > >> just do a periodic update and redo the comparison every time this update > >> happens. > > > > I feel like that's an artifact specific to the "metric smoothing" approach > > to route dampening not a feature though. The way I see it the behaviour we > > really want is to delay any change in selected route for a time related to > > the metric difference. > > > > Think back to what the purpose of the metric smoothing is in the first > > place: to limit oscillations of the selected route, which this will do just > > as well. > > I'm OK with finding another solution, but I think you're going to have > to explain in more detail how what you propose actually represents such > a solution, then :) Will do, I've been looking throug some network stability under dynamic routing literature to see if there's any well founded science we can apply here. Haven't really found anything good yet. The RTT paper does admit that "we lack an in-depth theoretical understanding of the performance of our algorithm, in particular of its stability." ;) There is one thing I'm unsure about: does the delay before propagating a route change to the kernel FIB actually have to depend on the metric difference to provide the network stability properties we're looking for? I think just strictly for stability a fixed delay should be fine, despite not being optimal in terms of convergence time. > > I don't agree with that. It's not as if I want per-hop information. Just a > > sum of RTTs along the path and a sum of administrative metric along the > > path rather than have those jumbled together into one number. > > > > Since babel is quite flexible in the actual metric math that would allow > > interesting ways of weighing each metric component rather than just having > > everything be linear. > > It also introduces dependencies, though. I.e., with the current approach > you can have a subset of the routers speak the RTT extension, and other > parts of the network will just have that incorporated into the metric. > Whereas if it is carried as a separate metric your entire network has to > know about the extension for it to be useful. I don't see why you couldn't do both. Incorporate the rtt (or other measures) into the metric for oblivious nodes and expose optional TLVs for ones that care about the different components. > > Fo
DHCPv6 PD server as client route originator (Was: iBGP RR IPv6 link-local next-hop not kept)
Hi Mirai, On Tue, Feb 28, 2023 at 02:57:54PM -0500, Mirai Azayaka wrote: > I am trying to send routes from my DHCPv6 prefix delegation server to > my router using iBGP. Those delegated prefix routes on the DHCPv6 > server are installed in its kernel table, such as 2001:db8:db8::/56 > via . Wow, that's actually a really interesting way to deploy DHCPv6 PD! I've been researching how to deploy PD outside of the proprietary vendor sandboxes recently (mainly still focusing on using DHCPv6 relays with route snooping), but haven't found any good free-software options so far. Perhaps your approach could be more viable for a small network like mine. Would you mind elaborating on your setup a bit more? Thanks, --Daniel (AS212704)
Re: [PATCH 0/3] babel: Add support for the RTT extension
Hi Toke, On Mon, Feb 27, 2023 at 12:16:01PM +0100, Toke Høiland-Jørgensen wrote: > >> - Add the smoothed metric as a new route attribute (so it's also > >> available to filters) > > > > I think doing that is a bad idea. If we keep filters from changing this we > > might just be able to optimize by only announcing smooted metric updates > > when the resulting route would actually be better than the currently > > selected one. > > > > If we let filters meddle with this however that becomes impossible. Unless > > you were thinking of a r/o attribute? > > I meant read-only. I agree that the filters shouldn't be able to modify > the smoothed metric. I've thought about this some more, I think we absolutely shouldn't expose the smooted metric to filters. It's an implementation detail. There's a bunch of other valid ways to implement this sort of dampening/hysteresis no reason to expose it to filter users really. How would you even use this for any practical filtering anyway? Reject routes based on `smoothed_metric - metric > threshold` or even better, increase metric based on smoothed_metric difference (oscillations galore haha). Hardly seems useful to me :) AFAICT it's perfectly fine to have EAs that filters can't access though, that's the situation we're currently in for the router-id after all so that's not really a blocker. I've been playing with rebasing your smoothing patch on mine now and I see a couple of options. My favorite so far is strategically replacing relevant babel_announce_rte calls with this: static void babel_sched_announce_rte(struct babel_proto *p, struct babel_route *r) { struct babel_config *cf = (void *) p->p.cf; if (!cf->metric_decay || !e->selected || e->selected->metric < r->metric) /* When best route's metric is less than the one we're about to announce * it's safe to do so immediately even if metric smoothing is on. Note * e->selected takes export filters into account. */ babel_announce_rte(p, r->e, r); else e->scheduled = r; } The clever bit is that announcing a route that's no better than the selected one will have no impact on route selection. Same applies if another protocol's route is active (i.e. when e->selected is NULL). Only think I'm still missing is a way to reframe the metric smoothing math such that I can install a timer for when the next scheduled route update should go through, though perhaps polling in the global timer hook is fine? > Another question is if the smoothing should incorporate any changes to > the metric that the filters do? It probably should, right? I don't think that'll give us very useful behaviour TBH. Perhaps the way to look at it is that we're not trying to smooth the (administrative) metric but rather only the link cost component which is the bit that is actually fluctuating? Problem is we only have these as seperate numbers for our local metrics/costs not the ones from other nodes. Maybe we need a protocol extension to seperate these two concepts, hmm. I have been thinking it would be nice to have seperate rtt and administrative metric numbers just for network debugging, perhaps it would make sense for general operation too. > That may be OK, but I'm not actually sure how the nest comparison works > (when we announce a new route, does the rte_better function get called > pairwise against every other candidate with the same prefix, or just > against the current best?). >From adding some quick tracing code it certainly looks like the comparison is only against the current best. There's only ever a couple of rte_better calls per prefix once things are initialized. Also see above for how "lovely" all the necessary lookups are going to be if we go down this path, haha. --Daniel
Re: [PATCH 3/3] babel: Add route metric smoothing
Hi Toke, See below. On Sun, Feb 26, 2023 at 11:10:06PM +0100, Toke Høiland-Jørgensen via Bird-users wrote: > The Babel RTT extension employs metric smoothing to dampen route > oscillations in the face of varying RTT values between two peers[0]. > > This patch implements such dampening in Bird, roughly following the > implementation in babeld (i.e., using the same exponential function > definition). The main difference is that we calculate everything in the > native Bird microsecond time unit (and increase constants accordingly), and > that we split out the smoothed metric calculation in two function variants, > one that has no side effects and one that does. > > [0] https://arxiv.org/pdf/1403.3488.pdf > > Signed-off-by: Toke Høiland-Jørgensen > --- > doc/bird.sgml| 12 + > proto/babel/babel.c | 121 --- > proto/babel/babel.h | 16 ++ > proto/babel/config.Y | 10 +++- > 4 files changed, 150 insertions(+), 9 deletions(-) > > diff --git a/doc/bird.sgml b/doc/bird.sgml > index 451dff4031d5..82f0ded13133 100644 > --- a/doc/bird.sgml > +++ b/doc/bird.sgml > @@ -1915,6 +1915,7 @@ protocol babel [] { > ipv4 { }; > ipv6 [sadr] { }; > randomize router id ; > +metric decay ; > interface { > type ; > rxcost ; > @@ -1965,6 +1966,17 @@ protocol babel [] { >router ID every time it starts up, which avoids this problem at the > cost >of not having stable router IDs in the network. Default: no. > > + metric decay > + The metric smoothing decay time. When route metrics vary (because of > + varying quality of a wireless link, or varying RTT when timestamps are > + enabled), Babel applies an exponential smoothing procedure to the > metric > + to dampen route oscillations. This parameter specifies the half-life of > + the convergence of the smoothed metric to the actual metric of the > route. > + I.e., the distance between the smoothed and the actual metric will be > + halfed for each time period specified here (until they converge). Set > to 0 > + to disable metric smoothing; if set, the value must be in the interval > + 1-180 s. Default: 4 s > + >type wired|wireless|tunnel >This option specifies the interface type: Wired, wireless or tunnel. On >wired interfaces a neighbor is considered unreachable after a small > number > diff --git a/proto/babel/babel.c b/proto/babel/babel.c > index 04613788303c..a95e37f5c413 100644 > --- a/proto/babel/babel.c > +++ b/proto/babel/babel.c > @@ -144,6 +144,103 @@ babel_expire_sources(struct babel_proto *p UNUSED, > struct babel_entry *e) >} > } > > +static u16 > +babel_calc_smoothed_metric(struct babel_proto *p, struct babel_route *r, u8 > update) > +{ > + struct babel_config *cf = (void *) p->p.cf; > + uint metric = r->metric, smoothed_metric = r->smoothed_metric; > + btime smoothed_time = r->smoothed_time, now = current_time(); > + > + if (!cf->metric_decay || metric == BABEL_INFINITY || > + metric == smoothed_metric || !smoothed_time) > + { > +smoothed_metric = metric; > +smoothed_time = now; > +goto out; > + } > + > + int diff = metric - smoothed_metric; > + > + /* > + * The decay defines the half-life of the metric convergence, so first > iterate > + * in halving steps > + */ > + while (smoothed_time < now - cf->metric_decay && diff) { > +smoothed_metric += diff/2; > +smoothed_time += cf->metric_decay; > +diff = metric - smoothed_metric; > + } > + > + /* > + * Then, update remainder in BABEL_SMOOTHING_STEP intervals using the > + * exponential function (approximated via the pre-computed reciprocal). > + */ > + while (smoothed_time < now - BABEL_SMOOTHING_STEP && diff) { > +smoothed_metric += (BABEL_SMOOTHING_STEP * diff * > +(cf->smooth_recp - BABEL_SMOOTHING_UNIT) / > BABEL_SMOOTHING_UNIT); > +smoothed_time += BABEL_SMOOTHING_STEP; > +diff = metric - smoothed_metric; > + } > + > + /* Consider the metric converged once we're close enough */ > + if (ABS(diff) < BABEL_SMOOTHING_MIN_DIFF) > +smoothed_metric = metric; > + > +out: > + if (update) { > +r->smoothed_metric = smoothed_metric; > +r->smoothed_time = smoothed_time; > + } > + > + return smoothed_metric; > +} > + > +static u16 > +babel_update_smoothed_metric(struct babel_proto *p, struct babel_route *r) > +{ > + if (!r->metric) > +return 0; > + > + if (!r->smoothed_metric) { > +r->smoothed_metric = r->metric; > +r->smoothed_time = current_time(); > +return r->smoothed_metric; > + } > + > + u16 smoothed = babel_calc_smoothed_metric(p, r, 1); > + DBG("Updated smoothed metric for prefix %N: router-id %lR metric %d/%d\n", > + r->e->n.addr, r->router_id, r->metric, smoothed); > + > + return smoothed; > +} > + > +static u16 > +babel_smoothed_metric(struct babel_prot
Re: Babel: Clarifications on seqno request handling in bird
On Mon, Feb 27, 2023 at 07:25:13PM +0100, Juliusz Chroboczek wrote: > > That's a bug in the new RFC text then ;) > > Agreed. https://www.rfc-editor.org/how-to-report/ Done :) --Daniel
[PATCH v5] Babel: Replace internal route selection by bird's nest
This introduces the ability to filter routes from specific interfaces and neighbours. With the current internal route selection proto babel exports only up to one route and an admin cannot do fine-grained filtering. To fix this we rip out the internal route selection entirely and export evey babel route as a distinct route into the bird's nest instead. In the presence of esoteric route filters this change may be observable by users however we feel the additional filtering power is more important. A release note entry for this change is recommended. --- Changes in v5: - Fix crash due missing rt_lock_source from rte_src pruning - Fix broken rt_notify retraction logic (!internal) which would just crash proto/babel/babel.c | 285 proto/babel/babel.h | 3 + 2 files changed, 135 insertions(+), 153 deletions(-) diff --git a/proto/babel/babel.c b/proto/babel/babel.c index 04613788..c030784a 100644 --- a/proto/babel/babel.c +++ b/proto/babel/babel.c @@ -28,10 +28,12 @@ * possible routes for the prefix are tracked as &babel_route entries and the * feasibility distance is maintained through &babel_source structures. * - * The main route selection is done in babel_select_route(). This is called when - * an entry is updated by receiving updates from the network or when modified by - * internal timers. The function selects from feasible and reachable routes the - * one with the lowest metric to be announced to the core. + * The main route selection is done by bird's nest (heh). For each prefix we + * export all feasible routes to the core with a distinct source (rte_src) per + * neihbour so bird handles them as distinct routes. Nest will then notify us of + * one best route for each prefix, either our own (internal) or from another + * protocol, by calling babel_rt_notify. For internal best routes we remember + * which babel_route it selected in babel_entry.selected. * * Supported standards: * RFC 8966 - The Babel Routing Protocol @@ -59,8 +61,8 @@ static inline int gt_mod64k(uint a, uint b) { return ge_mod64k(a, b) && a != b; } static void babel_expire_requests(struct babel_proto *p, struct babel_entry *e); -static void babel_select_route(struct babel_proto *p, struct babel_entry *e, struct babel_route *mod); -static inline void babel_announce_retraction(struct babel_proto *p, struct babel_entry *e); +static void babel_announce_rte(struct babel_proto *p, struct babel_entry *e, struct babel_route *r); +static void babel_rte_update_unreachable(struct babel_proto *p, struct babel_entry *e, u8 announce); static void babel_send_route_request(struct babel_proto *p, struct babel_entry *e, struct babel_neighbor *n); static void babel_send_seqno_request(struct babel_proto *p, struct babel_entry *e, struct babel_seqno_request *sr, struct babel_neighbor *n); static void babel_update_cost(struct babel_neighbor *n); @@ -178,9 +180,7 @@ static inline void babel_retract_route(struct babel_proto *p, struct babel_route *r) { r->metric = r->advert_metric = BABEL_INFINITY; - - if (r == r->e->selected) -babel_select_route(p, r->e, r); + babel_announce_rte(p, r->e, r); } static void @@ -192,9 +192,6 @@ babel_flush_route(struct babel_proto *p UNUSED, struct babel_route *r) rem_node(NODE r); rem_node(&r->neigh_route); - if (r->e->selected == r) -r->e->selected = NULL; - sl_free(r); } @@ -210,6 +207,7 @@ babel_expire_route(struct babel_proto *p, struct babel_route *r) { r->metric = r->advert_metric = BABEL_INFINITY; r->expires = current_time() + cf->hold_time; +babel_announce_rte(p, r->e, r); } else { @@ -239,8 +237,6 @@ babel_expire_routes_(struct babel_proto *p, struct fib *rtable) loop: FIB_ITERATE_START(rtable, &fit, struct babel_entry, e) { -int changed = 0; - WALK_LIST_DELSAFE(r, rx, e->routes) { if (r->refresh_time && r->refresh_time <= now_) @@ -248,23 +244,12 @@ loop: if (r->expires && r->expires <= now_) { - changed = changed || (r == e->selected); + FIB_ITERATE_PUT(&fit); babel_expire_route(p, r); + goto loop; } } -if (changed) -{ - /* - * We have to restart the iteration because there may be a cascade of - * synchronous events babel_select_route() -> nest table change -> - * babel_rt_notify() -> rtable change, invalidating hidden variables. - */ - FIB_ITERATE_PUT(&fit); - babel_select_route(p, e, NULL); - goto loop; -} - /* Clean up stale entries */ if ((e->valid == BABEL_ENTRY_STALE) && ((e->updated + cf->hold_time) <= now_)) e->valid = BABEL_ENTRY_DUMMY; @@ -273,7 +258,7 @@ loop: if (e->unreachable && (!e->valid || (e->router_id == p->router_id))) { FIB_ITERATE_PUT(&fit); - babel_announce_retraction(p, e); + babel_rte_update_unreachable(p, e, 0); goto loop; } @@ -457,6 +442,8 @@ babel_get
Re: Babel: Clarifications on seqno request handling in bird
Hi Juliusz, On Mon, Feb 27, 2023 at 01:58:21PM +0100, Juliusz Chroboczek wrote: > > I don't think RFC8966 is really framed in bird's "multi protocol" mindset > > See the beginning of Section 3.7, which describes how a route > redistributed from another protocol has router-id set to the local > router's id. Babel updates for the same prefix are processed as usual, > with the routes inserted and maintained in Babel's route table but not > necessarily installed into the kernel's forwarding table. Right, what was missing from my thinking was the realization that sending seqno requests for routes originated by us in this way doesn't really make sense :) > > so it's unclear to me whether this is something we have to fix or > > not. > > I'm not familiar with BIRD's internals, so I don't know. There are two > possible approaches: > > - drop all Babel routes for the static prefix ; or > - maintain routes as usual, but don't select them. > > The latter is what babeld does (see xroute.c). I belive my current rt_notify approach also has this behaviour, so that should be fine then. > >> A node that has lost all feasible routes to a given destination but still > >> has unexpired unfeasible routes to that destination MUST send a seqno > >> request; > > > > I could for example read this as the above mentioned static route > > constituting a feasible route received from a bird "internal" babel > > neighbour which would make the behaviour described above perfectly fine, > > no? > > It is always safe to send a seqno request. However, unless it is known > that there is a Babel speaker at that address, it's pointless: a non-Babel > node will not know how to react to a Babel message. > > However, if there's a static route that's selected, there's not much point > maintaining the feasibility of the Babel route; just let it become > unfeasible, and optionally send a request when the static route is > retracted. When the static route is retracted we'll only send a seqno request when we receive an unfeasable update. The problem in the implementation is we don't know the router-id/seqno for the request anymore at this point. I think this is fine since we're still sending seqno requests for any unfeasable updates received while the static route is installed, so sending another one when that is retracted shouldn't make a difference. > > Further, looking at the code implementing RFC section 3.8.2.2 in > > babel_handle_update: > > > > if (!feasible && (metric != BABEL_INFINITY) && > > (!best || (r == best) || (metric < best->metric))) > > /*^ */ > > babel_generate_seqno_request(p, e, s->router_id, s->seqno + 1, nbr); > > > > It's unclear to me what the (!best) condition corresponds to from the RFC > > text. > > That used to be clear in RFC 6126 (Section 3.8.2.2): > > Additionally, a node SHOULD send a unicast seqno request whenever it > receives an unfeasible update from a currently unselected neighbour > that is "good enough", i.e., that would lead to the received route > becoming selected were it feasible. > > It became less clear in RFC 8966: > > Additionally, since metric computation does not necessarily coincide > with the delay in propagating updates, a node might receive an > unfeasible update from a currently unselected neighbour that is > preferable to the currently selected route (e.g., because it has > a much smaller metric); in that case, the node SHOULD send a unicast > seqno request to the neighbour that advertised the preferable update. > > It is definitely a good idea to send a request in this case, especially if > you ignore unfeasible updates for unknown routes (first clause of Section > 3.5.3). That's a bug in the new RFC text then ;) Actually I also missed another reason for the !best condition: to prevent (metric < best->metric) from segfaulting. Thanks, --Daniel
Re: [PATCH 0/3] babel: Add support for the RTT extension
Hi Toke, On Mon, Feb 27, 2023 at 12:14:23AM +0100, Toke Høiland-Jørgensen wrote: > > To clarify: it's really only the metric smoothing patch that's in conflict > > with my patch. I would advocate for merging only the other two patches for > > now while we figure out how to rework the smoothing on top of my patch. I'm > > happy to do the rework we just need to come up with a plan for that :) > > Hmm, I think the way to handle this is basically: > > - Add the smoothed metric as a new route attribute (so it's also > available to filters) I think doing that is a bad idea. If we keep filters from changing this we might just be able to optimize by only announcing smooted metric updates when the resulting route would actually be better than the currently selected one. If we let filters meddle with this however that becomes impossible. Unless you were thinking of a r/o attribute? > - Change babel_rte_better() to incorporate the smoothed metric (from the > attribute) in its comparison > - Change the decay logic to be timer-based instead of calculating the > smoothed metric on demand > > That last bit is probably the biggest change. We can't really do the > cached on-demand calculation of the smoothed metric if we're sticking it > in an attribute. Can you elaborate on why you think that's not possible? The way I see it there isn't much harm in the smooted metric attribute being outdated so long as we ensure we announce an update if an actually better route comes along. Suggesting this kind of seems like we're brining back the select_route function, but I think we can do it in a way that only looks at the route that changed without iterating over all of them, the nest will do that bit for us now. --Daniel
Re: [PATCH v4] Babel: Replace internal route selection by bird's nest
Hi, I've observed the following crash in DEBUGGING mode due to src->proto being poisoned, so I think the neighbour must have been flushed too soon somehow. #0 rte_recalculate (c=0x556e1970, net=0x77fc57e0, new=0x7703cce0, src=0x77fc8370) at nest/rt-table.c:1392 #1 0x555dd72f in rte_update2 (c=0x556e1970, n=0x77fc41f0, new=0x7703cce0, src=0x77fc8370) at nest/rt-table.c:1636 #2 0x555e8e69 in babel_announce_rte (p=0x556e1700, e=0x77fc4168, r=0x7736d980) at proto/babel/babel.c:726 #3 0x555e8a2e in babel_update_cost (nbr=0x556ddea0) at proto/babel/babel.c:625 #4 0x555e9f49 in babel_handle_ihu (m=0x759e10a8, ifa=0x55705390) at proto/babel/babel.c:1132 #5 0x555f11b2 in babel_process_packet (ifa=0x55705390, pkt=0x5570f700, len=34, saddr=..., sport=6696, daddr=..., dport=6696) at proto/babel/packets.c:1525 #6 0x555f13fa in babel_rx_hook (sk=0x5570f5b0, len=34) at proto/babel/packets.c:1584 #7 0x55647707 in sk_read_noflush (s=0x5570f5b0, revents=1) at sysdep/unix/io.c:1914 #8 0x55647732 in sk_read (s=0x5570f5b0, revents=1) at sysdep/unix/io.c:1923 #9 0x556485e2 in io_loop () at sysdep/unix/io.c:2368 #10 0x5564dae0 in main (argc=8, argv=0x7fffe568) at sysdep/unix/main.c:959 I haven't been able to find a reproducer yet, I think it happened after a system suspend thogh. More notes below: On Sun, Feb 26, 2023 at 09:05:24PM +0100, Daniel Gröber wrote: > This introduces the ability to filter routes from specific interfaces and > neighbours. With the current internal route selection proto babel exports > only up to one route and an admin cannot do fine-grained filtering. > > To fix this we rip out the internal route selection entirely and export > evey babel route as a distinct route into the bird's nest instead. > > In the presence of esoteric route filters this change may be observable by > users however we feel the additional filtering power is more important. A > release note entry for this change is recommended. > --- > Changes in v4: > - We've seen the light, use FIB_ITERATE as intended, thanks Ondrej. > - Reinstate seqno request logic on loss of all feasible routes, thanks >Toke. Discussion of behaviour still ongoing, see[1] > - Try to prepare for async rt_notify in BIRD 3, thanks Maria. > > [1]: http://trubka.network.cz/pipermail/bird-users/2023-February/016689.html > > > proto/babel/babel.c | 278 > proto/babel/babel.h | 3 + > 2 files changed, 128 insertions(+), 153 deletions(-) > > diff --git a/proto/babel/babel.c b/proto/babel/babel.c > index ff8b6b52..eb347a85 100644 > --- a/proto/babel/babel.c > +++ b/proto/babel/babel.c > @@ -29,10 +29,12 @@ > * possible routes for the prefix are tracked as &babel_route entries and the > * feasibility distance is maintained through &babel_source structures. > * > - * The main route selection is done in babel_select_route(). This is called > when > - * an entry is updated by receiving updates from the network or when > modified by > - * internal timers. The function selects from feasible and reachable routes > the > - * one with the lowest metric to be announced to the core. > + * The main route selection is done by bird's nest (heh). For each prefix we > + * export all feasible routes to the core with a distinct source (rte_src) > per > + * neihbour so bird handles them as distinct routes. Nest will then notify > us of > + * one best route for each prefix, either our own (internal) or from another > + * protocol, by calling babel_rt_notify. For internal best routes we remember > + * which babel_route it selected in babel_entry.selected. > */ > > #include > @@ -50,8 +52,8 @@ static inline int ge_mod64k(uint a, uint b) > { return (u16)(a - b) < 0x8000; } > > static void babel_expire_requests(struct babel_proto *p, struct babel_entry > *e); > -static void babel_select_route(struct babel_proto *p, struct babel_entry *e, > struct babel_route *mod); > -static inline void babel_announce_retraction(struct babel_proto *p, struct > babel_entry *e); > +static void babel_announce_rte(struct babel_proto *p, struct babel_entry *e, > struct babel_route *r); > +static void babel_rte_update_unreachable(struct babel_proto *p, struct > babel_entry *e, u8 announce); > static void babel_send_route_request(struct babel_proto *p, struct > babel_entry *e, struct babel_neighbor *n); > static void babel_send_seqno_request(struct babel_proto *p, struct > babel_entry *e, struct babel_seqno_request
Re: [PATCH 0/3] babel: Add support for the RTT extension
Hi, On Sun, Feb 26, 2023 at 11:10:03PM +0100, Toke Høiland-Jørgensen via Bird-users wrote: > Note that this series conflicts with Daniel's patches for moving the > route selection into the Bird nest. Figured I'd send them now so this can > be part of the discussion of that other patch (and also Daniel is one of > the current users of this extension so I expect he'd be interested in > having the two be compatible). Indeed I've been using these patches for a while, but I've had to switch to babeld due to lack of proper route filtering ;) Still consider this Tested-By: Daniel Gröber To clarify: it's really only the metric smoothing patch that's in conflict with my patch. I would advocate for merging only the other two patches for now while we figure out how to rework the smoothing on top of my patch. I'm happy to do the rework we just need to come up with a plan for that :) --Daniel
[PATCH v4] Babel: Replace internal route selection by bird's nest
This introduces the ability to filter routes from specific interfaces and neighbours. With the current internal route selection proto babel exports only up to one route and an admin cannot do fine-grained filtering. To fix this we rip out the internal route selection entirely and export evey babel route as a distinct route into the bird's nest instead. In the presence of esoteric route filters this change may be observable by users however we feel the additional filtering power is more important. A release note entry for this change is recommended. --- Changes in v4: - We've seen the light, use FIB_ITERATE as intended, thanks Ondrej. - Reinstate seqno request logic on loss of all feasible routes, thanks Toke. Discussion of behaviour still ongoing, see[1] - Try to prepare for async rt_notify in BIRD 3, thanks Maria. [1]: http://trubka.network.cz/pipermail/bird-users/2023-February/016689.html proto/babel/babel.c | 278 proto/babel/babel.h | 3 + 2 files changed, 128 insertions(+), 153 deletions(-) diff --git a/proto/babel/babel.c b/proto/babel/babel.c index ff8b6b52..eb347a85 100644 --- a/proto/babel/babel.c +++ b/proto/babel/babel.c @@ -29,10 +29,12 @@ * possible routes for the prefix are tracked as &babel_route entries and the * feasibility distance is maintained through &babel_source structures. * - * The main route selection is done in babel_select_route(). This is called when - * an entry is updated by receiving updates from the network or when modified by - * internal timers. The function selects from feasible and reachable routes the - * one with the lowest metric to be announced to the core. + * The main route selection is done by bird's nest (heh). For each prefix we + * export all feasible routes to the core with a distinct source (rte_src) per + * neihbour so bird handles them as distinct routes. Nest will then notify us of + * one best route for each prefix, either our own (internal) or from another + * protocol, by calling babel_rt_notify. For internal best routes we remember + * which babel_route it selected in babel_entry.selected. */ #include @@ -50,8 +52,8 @@ static inline int ge_mod64k(uint a, uint b) { return (u16)(a - b) < 0x8000; } static void babel_expire_requests(struct babel_proto *p, struct babel_entry *e); -static void babel_select_route(struct babel_proto *p, struct babel_entry *e, struct babel_route *mod); -static inline void babel_announce_retraction(struct babel_proto *p, struct babel_entry *e); +static void babel_announce_rte(struct babel_proto *p, struct babel_entry *e, struct babel_route *r); +static void babel_rte_update_unreachable(struct babel_proto *p, struct babel_entry *e, u8 announce); static void babel_send_route_request(struct babel_proto *p, struct babel_entry *e, struct babel_neighbor *n); static void babel_send_seqno_request(struct babel_proto *p, struct babel_entry *e, struct babel_seqno_request *sr, struct babel_neighbor *n); static void babel_update_cost(struct babel_neighbor *n); @@ -168,9 +170,7 @@ static inline void babel_retract_route(struct babel_proto *p, struct babel_route *r) { r->metric = r->advert_metric = BABEL_INFINITY; - - if (r == r->e->selected) -babel_select_route(p, r->e, r); + babel_announce_rte(p, r->e, r); } static void @@ -182,9 +182,6 @@ babel_flush_route(struct babel_proto *p UNUSED, struct babel_route *r) rem_node(NODE r); rem_node(&r->neigh_route); - if (r->e->selected == r) -r->e->selected = NULL; - sl_free(r); } @@ -200,6 +197,7 @@ babel_expire_route(struct babel_proto *p, struct babel_route *r) { r->metric = r->advert_metric = BABEL_INFINITY; r->expires = current_time() + cf->hold_time; +babel_announce_rte(p, r->e, r); } else { @@ -229,8 +227,6 @@ babel_expire_routes_(struct babel_proto *p, struct fib *rtable) loop: FIB_ITERATE_START(rtable, &fit, struct babel_entry, e) { -int changed = 0; - WALK_LIST_DELSAFE(r, rx, e->routes) { if (r->refresh_time && r->refresh_time <= now_) @@ -238,23 +234,12 @@ loop: if (r->expires && r->expires <= now_) { - changed = changed || (r == e->selected); + FIB_ITERATE_PUT(&fit); babel_expire_route(p, r); + goto loop; } } -if (changed) -{ - /* - * We have to restart the iteration because there may be a cascade of - * synchronous events babel_select_route() -> nest table change -> - * babel_rt_notify() -> rtable change, invalidating hidden variables. - */ - FIB_ITERATE_PUT(&fit); - babel_select_route(p, e, NULL); - goto loop; -} - /* Clean up stale entries */ if ((e->valid == BABEL_ENTRY_STALE) && ((e->updated + cf->hold_time) <= now_)) e->valid = BABEL_ENTRY_DUMMY; @@ -263,7 +248,7 @@ loop: if (e->unreachable && (!e->valid || (e->router_id == p->router_id))) { FIB_ITERATE_PUT(&fit);
Re: [Babel-users] Babel: Clarifications on seqno request handling in bird
Hi Maria, On Sun, Feb 26, 2023 at 07:34:06PM +0100, Maria Matejka wrote: > > I don't think RFC8966 is really framed in bird's "multi protocol" mindset > > so it's unclear to me whether this is something we have to fix or > > not. Section 3.8.2.1. says: > > > > > A node that has lost all feasible routes to a given destination but still > > > has unexpired unfeasible routes to that destination MUST send a seqno > > > request; > > > > I could for example read this as the above mentioned static route > > constituting a feasible route received from a bird "internal" babel > > neighbour which would make the behaviour described above perfectly fine, > > no? > > From my point of view, this is perfectly fine. I did just do another reivew on the knock on effects of interpreting it like that and I do tend to agree it all looks fine. The (!best) thing I mentioned actually works in our favor here since getting rt_notifi'ed of another protocols route will then still trigger the 3.8.2.2, as it should, if we receive an unfeasible update. > > @Bird folks: can anyone think of a way to be notified of any and all > > changes in rt_notify? I assume it's not possible from some light reading of > > the nest code but figured I might as well ask. > > You may set channel ra_mode to RA_ANY, getting called rt_notify() for all > route updates. > > Or in BIRD 3, you can override the channel logic at all and if you don't > mind calling your filters yourself, you can just request "on any change, > give me all the routes for one prefix at once" and do whatever you want with > the routes. Yet these changes most probably won't get backported to BIRD 2. Hmm RA_ANY might actually be just fine if it comes to that, I'd still prefer not having to do it tho :) Thanks, --Daniel
Re: [PATCH v3] Babel: Replace internal route selection by bird's nest
Hi Toke, Thanks for the comprehensive review! See below. On Tue, Jan 31, 2023 at 12:38:25PM +0100, Toke Høiland-Jørgensen wrote: > Daniel Gröber writes: > > This appears to not actually be a breaking change as route announcement was > > already based on which routes bird would send back to babel_rt_notify > > filters shouldn't be able to tell the difference between a single and > > multiple routes AFAIK. > > Can we be a bit more sure than "appears"? :) Testers welcome :P > Just to make sure I'm understanding correctly: this relies on the nest > using babel_rte_better() to select the route with the lowest metric from > all the ones we announce, right? And if there's no filter, functionality > will be equivalent to before this patch? Right, I have yet to actually compare the before/after behaviour more comprehensively. Hence the "appears" :) Also note the feasibility condition is checked before announcing to nest so rte_better will never see ones that don't satisfy it. Furthermore metric smoothing is going to be even fun to implement with this ;) If there is no filter everything works as ever, what we still have to worry about is the with filter case. I'm not sure if there are any differences observable from the filter POV. In particular import/export filters that fiddle with the metric need to be tested still. > > +static struct babel_route * > > +babel_get_selected_route(struct babel_proto *p, struct babel_entry *e) > > +{ > > + if (e->selected_nbr) > > +return babel_get_route(p, e, e->selected_nbr); > > + return NULL; > > +} > > Why the switch to storing the neighbour and resolving that to a route > here? Couldn't we just keep the existing e->selected route and populate > that where e->selected_nbr is currently set? Just seemed more natural to me at the time, but looking at it now I wonder if it isn't better for performance to cache this after all, since get_selected_route does get called in handle_update. I think I'll roll that back. > > + for (struct babel_route *r_next, *r = expire_head; r ; r = r_next) { > > +r_next = r->expire_next; > > +babel_expire_route(p, r); > > + } > > + > > + for (struct babel_entry *e = retract_head; e ; e = e->retract_next) { > > +babel_announce_retraction(p, e); > > + } > > + > > + for (struct babel_entry *e_next, *e = delete_head; e ; e = e_next) { > > +e_next = e->delete_next; > > +fib_delete(rtable, e); > > + } > > These also be clearing the ->{expire,delete,rectract}_next pointers? I > know they're not used anywhere else, but they'll be left pointing to > random memory in some cases here... Yeah, I had that for all the loops then removed it because it complicated the expire case as that one might deallocate the route. I'll give it another go. > > +static void > > +babel_announce_rte_unreachable(struct babel_proto *p, struct babel_entry > > *e, u8 unreachable) > > +{ > > + struct channel *c = (e->n.addr->type == NET_IP4) ? p->ip4_channel : > > p->ip6_channel; > > + rte *rte = NULL; > > + > > + if (unreachable) { > > +/* Unreachable */ > > +rta a0 = { > > + .source = RTS_BABEL, > > + .scope = SCOPE_UNIVERSE, > > + .dest = RTD_UNREACHABLE, > > + .pref = 1, > > +}; > > + > > +rta *a = rta_lookup(&a0); > > +rte = rte_get_temp(a, p->p.main_source); > > + } > > + > > + e->unreachable = unreachable; > > + > > + /* Unlike the many neighbour routes we only want one unreachable route, > > for > > + * one it's ugly to have lots of unreachable routes (if we just did this > > right > > + * in babel_announce_rte) but we also loose access to the neibour > > rte_src id > > + * when the neighbour is deallocated. This happens on a different > > schedule > > + * than unreachable route removal. */ > > + rte_update2(c, e->n.addr, rte, p->p.main_source); > > +} > > OK, so this function took me a while to understand: There's a special > "unreachable" route we announce whenever we lose a route, and this > function announces or retract that particular route based on the > "unreachable" argument. Could you please add a comment to the top of the > function explaining all this (the comment at the bottom is only a > partial explanation). > > Also s/loose/lose/ ACK. > > -TRACE(D_EVENTS, "Lost feasible route for prefix %N", e->n.addr); > > -if (e->valid && (e->selected->
[PATCH v3] Babel: Replace internal route selection by bird's nest
This allows for filtering routes from specific interfaces and neighbours. With the current internal route selection proto babel exports only up to one route and an admin cannot do fine-grained filtering. To fix this we rip out the internal route selection entirely and put them all into the bird's nest instead. This appears to not actually be a breaking change as route announcement was already based on which routes bird would send back to babel_rt_notify filters shouldn't be able to tell the difference between a single and multiple routes AFAIK. --- Changes in v3: - Subsume FIB_RESTART v2 patch: instead of restarting FIB iteration we keep lists of actions to perform after FIB iteration is finished. proto/babel/babel.c | 269 +++- proto/babel/babel.h | 8 +- 2 files changed, 120 insertions(+), 157 deletions(-) diff --git a/proto/babel/babel.c b/proto/babel/babel.c index ff8b6b52..087fd95b 100644 --- a/proto/babel/babel.c +++ b/proto/babel/babel.c @@ -29,10 +29,11 @@ * possible routes for the prefix are tracked as &babel_route entries and the * feasibility distance is maintained through &babel_source structures. * - * The main route selection is done in babel_select_route(). This is called when - * an entry is updated by receiving updates from the network or when modified by - * internal timers. The function selects from feasible and reachable routes the - * one with the lowest metric to be announced to the core. + * The main route selection is done by the bird nest (heh). For each prefix we + * export all feasible routes to the core with a distinct source (rte_src) per + * neihbour. Bird will then notify us of the optimal one by calling + * babel_rt_notify and we remember which neighbour it selected in + * babel_entry.selected_nbr. */ #include @@ -50,7 +51,7 @@ static inline int ge_mod64k(uint a, uint b) { return (u16)(a - b) < 0x8000; } static void babel_expire_requests(struct babel_proto *p, struct babel_entry *e); -static void babel_select_route(struct babel_proto *p, struct babel_entry *e, struct babel_route *mod); +static void babel_announce_rte(struct babel_proto *p, struct babel_entry *e, struct babel_route *r); static inline void babel_announce_retraction(struct babel_proto *p, struct babel_entry *e); static void babel_send_route_request(struct babel_proto *p, struct babel_entry *e, struct babel_neighbor *n); static void babel_send_seqno_request(struct babel_proto *p, struct babel_entry *e, struct babel_seqno_request *sr, struct babel_neighbor *n); @@ -164,13 +165,19 @@ babel_get_route(struct babel_proto *p, struct babel_entry *e, struct babel_neigh return r; } +static struct babel_route * +babel_get_selected_route(struct babel_proto *p, struct babel_entry *e) +{ + if (e->selected_nbr) +return babel_get_route(p, e, e->selected_nbr); + return NULL; +} + static inline void babel_retract_route(struct babel_proto *p, struct babel_route *r) { r->metric = r->advert_metric = BABEL_INFINITY; - - if (r == r->e->selected) -babel_select_route(p, r->e, r); + babel_announce_rte(p, r->e, r); } static void @@ -182,9 +189,6 @@ babel_flush_route(struct babel_proto *p UNUSED, struct babel_route *r) rem_node(NODE r); rem_node(&r->neigh_route); - if (r->e->selected == r) -r->e->selected = NULL; - sl_free(r); } @@ -200,6 +204,7 @@ babel_expire_route(struct babel_proto *p, struct babel_route *r) { r->metric = r->advert_metric = BABEL_INFINITY; r->expires = current_time() + cf->hold_time; +babel_announce_rte(p, r->e, r); } else { @@ -210,7 +215,7 @@ babel_expire_route(struct babel_proto *p, struct babel_route *r) static void babel_refresh_route(struct babel_proto *p, struct babel_route *r) { - if (r == r->e->selected) + if (r == babel_get_selected_route(p, r->e)) babel_send_route_request(p, r->e, r->neigh); r->refresh_time = 0; @@ -221,16 +226,15 @@ babel_expire_routes_(struct babel_proto *p, struct fib *rtable) { struct babel_config *cf = (void *) p->p.cf; struct babel_route *r, *rx; + struct babel_entry *retract_head = NULL, *delete_head = NULL; + struct babel_route *expire_head = NULL; struct fib_iterator fit; btime now_ = current_time(); - FIB_ITERATE_INIT(&fit, rtable); -loop: + FIB_ITERATE_INIT(&fit, rtable); FIB_ITERATE_START(rtable, &fit, struct babel_entry, e) { -int changed = 0; - WALK_LIST_DELSAFE(r, rx, e->routes) { if (r->refresh_time && r->refresh_time <= now_) @@ -238,23 +242,11 @@ loop: if (r->expires && r->expires <= now_) { - changed = changed || (r == e->selected); - babel_expire_route(p, r); + r->expire_next = expire_head; + expire_head = r; } } -if (changed) -{ - /* - * We have to restart the iteration because there may be a cascade of - * synchronous events babel_select_route() -> nest table change -> -
[PATCH v2 2/2] [RFC] Babel: Replace internal route selection by bird's nest
This allows for filtering routes from specific interfaces and neighbours. With the current internal route selection proto babel exports only up to one route and an admin cannot do fine-grained filtering. To fix this we rip out the internal route selection entirely and put them all into the bird's nest instead. This appears to not actually be a breaking change as route announcement was already based on which routes bird would send back to babel_rt_notify filters shouldn't be able to tell the difference between a single and multiple routes AFAIK. --- Changes in v2: - Only announce a single unreachable route from the main_source and ensure proper hold timing. Previously the differing lifetime of neighbours vs unreachable hold time were a problem. proto/babel/babel.c | 228 +--- proto/babel/babel.h | 5 +- 2 files changed, 94 insertions(+), 139 deletions(-) diff --git a/proto/babel/babel.c b/proto/babel/babel.c index a52125ff..4bc38540 100644 --- a/proto/babel/babel.c +++ b/proto/babel/babel.c @@ -29,10 +29,11 @@ * possible routes for the prefix are tracked as &babel_route entries and the * feasibility distance is maintained through &babel_source structures. * - * The main route selection is done in babel_select_route(). This is called when - * an entry is updated by receiving updates from the network or when modified by - * internal timers. The function selects from feasible and reachable routes the - * one with the lowest metric to be announced to the core. + * The main route selection is done by the bird nest (heh). For each prefix we + * export all feasible routes to the core with a distinct source (rte_src) per + * neihbour. Bird will then notify us of the optimal one by calling + * babel_rt_notify and we remember which neighbour it selected in + * babel_entry.selected_nbr. */ #include @@ -50,7 +51,7 @@ static inline int ge_mod64k(uint a, uint b) { return (u16)(a - b) < 0x8000; } static void babel_expire_requests(struct babel_proto *p, struct babel_entry *e); -static void babel_select_route(struct babel_proto *p, struct babel_entry *e, struct babel_route *mod); +static void babel_announce_rte(struct babel_proto *p, struct babel_entry *e, struct babel_route *r); static inline void babel_announce_retraction(struct babel_proto *p, struct babel_entry *e); static void babel_send_route_request(struct babel_proto *p, struct babel_entry *e, struct babel_neighbor *n); static void babel_send_seqno_request(struct babel_proto *p, struct babel_entry *e, struct babel_seqno_request *sr, struct babel_neighbor *n); @@ -168,13 +169,19 @@ babel_get_route(struct babel_proto *p, struct babel_entry *e, struct babel_neigh return r; } +static struct babel_route * +babel_get_selected_route(struct babel_proto *p, struct babel_entry *e) +{ + if (e->selected_nbr) +return babel_get_route(p, e, e->selected_nbr); + return NULL; +} + static inline void babel_retract_route(struct babel_proto *p, struct babel_route *r) { r->metric = r->advert_metric = BABEL_INFINITY; - - if (r == r->e->selected) -babel_select_route(p, r->e, r); + babel_announce_rte(p, r->e, r); } static void @@ -186,9 +193,6 @@ babel_flush_route(struct babel_proto *p UNUSED, struct babel_route *r) rem_node(NODE r); rem_node(&r->neigh_route); - if (r->e->selected == r) -r->e->selected = NULL; - sl_free(r); } @@ -204,6 +208,7 @@ babel_expire_route(struct babel_proto *p, struct babel_route *r) { r->metric = r->advert_metric = BABEL_INFINITY; r->expires = current_time() + cf->hold_time; +babel_announce_rte(p, r->e, r); } else { @@ -214,7 +219,7 @@ babel_expire_route(struct babel_proto *p, struct babel_route *r) static void babel_refresh_route(struct babel_proto *p, struct babel_route *r) { - if (r == r->e->selected) + if (r == babel_get_selected_route(p, r->e)) babel_send_route_request(p, r->e, r->neigh); r->refresh_time = 0; @@ -234,8 +239,6 @@ loop: rtable_lock(rtable); FIB_ITERATE_START(&rtable->fib, &fit, struct babel_entry, e) { -int changed = 0; - WALK_LIST_DELSAFE(r, rx, e->routes) { if (r->refresh_time && r->refresh_time <= now_) @@ -243,14 +246,10 @@ loop: if (r->expires && r->expires <= now_) { - changed = changed || (r == e->selected); babel_expire_route(p, r); } } -if (changed) - babel_select_route(p, e, NULL); - /* Clean up stale entries */ if ((e->valid == BABEL_ENTRY_STALE) && ((e->updated + cf->hold_time) <= now_)) e->valid = BABEL_ENTRY_DUMMY; @@ -446,6 +445,7 @@ babel_get_neighbor(struct babel_iface *ifa, ip_addr addr) nbr = mb_allocz(ifa->pool, sizeof(struct babel_neighbor)); nbr->ifa = ifa; + nbr->src = rt_get_source(&p->p, idm_alloc(&p->src_ids)); nbr->addr = addr; nbr->rxcost = BABEL_INFINITY; nbr->txcost = BABEL_INFINITY; @@ -473,6 +473,8 @@ babel_flush_neighbor(stru
[PATCH v2 1/2] Babel: Remove unecessary FIB_ITERATE restart
The route expiration code appears to have been stolen from rip.c, in that code the rt_notify function actually does modify the rtable fib by calling fib_get. The babel code however does no such thing, so this inefficient restart is just entirely uneccesarry. To prove this is true I add a bunch of checking code that can be removed later. --- Changes in v2: - Fix assert due to misplaced lock/unlock proto/babel/babel.c | 64 ++--- proto/babel/babel.h | 8 +++--- 2 files changed, 42 insertions(+), 30 deletions(-) diff --git a/proto/babel/babel.c b/proto/babel/babel.c index ff8b6b52..a52125ff 100644 --- a/proto/babel/babel.c +++ b/proto/babel/babel.c @@ -58,6 +58,9 @@ static void babel_update_cost(struct babel_neighbor *n); static inline void babel_kick_timer(struct babel_proto *p); static inline void babel_iface_kick_timer(struct babel_iface *ifa); +#define rtable_lock(rtable) ASSERT(!(rtable)->lock++) +#define rtable_unlock(rtable) ASSERT(!--(rtable)->lock) + /* * Functions to maintain data structures */ @@ -76,15 +79,16 @@ babel_init_entry(struct fib *f UNUSED, void *E) static inline struct babel_entry * babel_find_entry(struct babel_proto *p, const net_addr *n) { - struct fib *rtable = (n->type == NET_IP4) ? &p->ip4_rtable : &p->ip6_rtable; + struct fib *rtable = (n->type == NET_IP4) ? &p->ip4_rtable.fib : &p->ip6_rtable.fib; return fib_find(rtable, n); } static struct babel_entry * babel_get_entry(struct babel_proto *p, const net_addr *n) { - struct fib *rtable = (n->type == NET_IP4) ? &p->ip4_rtable : &p->ip6_rtable; - struct babel_entry *e = fib_get(rtable, n); + struct babel_rtable *rtable = (n->type == NET_IP4) ? &p->ip4_rtable : &p->ip6_rtable; + ASSERT(!rtable->lock); + struct babel_entry *e = fib_get(&rtable->fib, n); return e; } @@ -217,17 +221,18 @@ babel_refresh_route(struct babel_proto *p, struct babel_route *r) } static void -babel_expire_routes_(struct babel_proto *p, struct fib *rtable) +babel_expire_routes_(struct babel_proto *p, struct babel_rtable *rtable) { struct babel_config *cf = (void *) p->p.cf; struct babel_route *r, *rx; struct fib_iterator fit; btime now_ = current_time(); - FIB_ITERATE_INIT(&fit, rtable); + FIB_ITERATE_INIT(&fit, &rtable->fib); loop: - FIB_ITERATE_START(rtable, &fit, struct babel_entry, e) + rtable_lock(rtable); + FIB_ITERATE_START(&rtable->fib, &fit, struct babel_entry, e) { int changed = 0; @@ -244,16 +249,7 @@ loop: } if (changed) -{ - /* - * We have to restart the iteration because there may be a cascade of - * synchronous events babel_select_route() -> nest table change -> - * babel_rt_notify() -> rtable change, invalidating hidden variables. - */ - FIB_ITERATE_PUT(&fit); babel_select_route(p, e, NULL); - goto loop; -} /* Clean up stale entries */ if ((e->valid == BABEL_ENTRY_STALE) && ((e->updated + cf->hold_time) <= now_)) @@ -263,6 +259,7 @@ loop: if (e->unreachable && (!e->valid || (e->router_id == p->router_id))) { FIB_ITERATE_PUT(&fit); + rtable_unlock(rtable); babel_announce_retraction(p, e); goto loop; } @@ -274,11 +271,13 @@ loop: if (!e->valid && EMPTY_LIST(e->routes) && EMPTY_LIST(e->sources) && EMPTY_LIST(e->requests)) { FIB_ITERATE_PUT(&fit); - fib_delete(rtable, e); + rtable_unlock(rtable); + fib_delete(&rtable->fib, e); goto loop; } } FIB_ITERATE_END; + rtable_unlock(rtable); } static void @@ -959,7 +958,7 @@ babel_send_seqno_request(struct babel_proto *p, struct babel_entry *e, * transmitted entry is updated. */ static void -babel_send_update_(struct babel_iface *ifa, btime changed, struct fib *rtable) +babel_send_update_(struct babel_iface *ifa, btime changed, struct babel_rtable *rtable) { struct babel_proto *p = ifa->proto; @@ -970,7 +969,8 @@ babel_send_update_(struct babel_iface *ifa, btime changed, struct fib *rtable) p->update_seqno_inc = 0; } - FIB_WALK(rtable, struct babel_entry, e) + rtable_lock(rtable); + FIB_WALK(&rtable->fib, struct babel_entry, e) { if (!e->valid) continue; @@ -1022,6 +1022,7 @@ babel_send_update_(struct babel_iface *ifa, btime changed, struct fib *rtable) } } FIB_WALK_END; + rtable_unlock(rtable); } static void @@ -2055,16 +2056,21 @@ babel_dump(struct proto *P) WALK_LIST(ifa, p->interfaces) babel_dump_iface(ifa); - FIB_WALK(&p->ip4_rtable, struct babel_entry, e) + rtable_lock(&p->ip4_rtable); + FIB_WALK(&p->ip4_rtable.fib, struct babel_entry, e) { babel_dump_entry(e); } FIB_WALK_END; - FIB_WALK(&p->ip6_rtable, struct babel_entry, e) + rtable_unlock(&p->ip4_rtable); + + rtable_lock(&p->ip6_rtable); + FIB_WALK(&p->ip6_rtable.fib, struct babel_entry, e) { babel_dump_entry(e); } FIB_WALK_END; + rtable_unlock(&p->ip6_
[PATCH] [RFC] Babel: Replace internal route selection by bird nest
The main motivation for this change is to allow for ingress route filtering. With the current internal route selection proto/babel exports only the one route it selected and an admin cannot decide which neighbours/interfaces to import certain routes from using filters. To fix this we rip out the internal route selection entirely and let bird do it's thing instead. Note that this does very much represent a breaking change as now export filters have to explicitly re-export babel routes which was implicit before, using say `source = RTS_BABEL`. It's unclear to me if we should support the old behaviour and how I would circumvent the export filter for babel routes if so. Comments and suggestions welcome. TODO: - Unreachable routes are still kind of funky since rebasing on master from 2.0.10. I see both a unicast routes and an unreachable route sometimes. --- Note: this depends on "Babel: Remove unecessary FIB_ITERATE restart", I thought it wouldn't or I would have sent it as a series. proto/babel/babel.c | 177 +++- proto/babel/babel.h | 5 +- 2 files changed, 62 insertions(+), 120 deletions(-) diff --git a/proto/babel/babel.c b/proto/babel/babel.c index bed2e1e4..1fe73cfa 100644 --- a/proto/babel/babel.c +++ b/proto/babel/babel.c @@ -29,10 +29,11 @@ * possible routes for the prefix are tracked as &babel_route entries and the * feasibility distance is maintained through &babel_source structures. * - * The main route selection is done in babel_select_route(). This is called when - * an entry is updated by receiving updates from the network or when modified by - * internal timers. The function selects from feasible and reachable routes the - * one with the lowest metric to be announced to the core. + * The main route selection is done by the bird nest (heh). For each prefix we + * export all feasible routes to the core with a distinct source (rte_src) per + * neihbour. Bird will then notify us of the optimal one by calling + * babel_rt_notify and we remember which neighbour it selected in + * babel_entry.selected_nbr. */ #include @@ -50,7 +51,7 @@ static inline int ge_mod64k(uint a, uint b) { return (u16)(a - b) < 0x8000; } static void babel_expire_requests(struct babel_proto *p, struct babel_entry *e); -static void babel_select_route(struct babel_proto *p, struct babel_entry *e, struct babel_route *mod); +static void babel_announce_rte(struct babel_proto *p, struct babel_entry *e, struct babel_route *r); static inline void babel_announce_retraction(struct babel_proto *p, struct babel_entry *e); static void babel_send_route_request(struct babel_proto *p, struct babel_entry *e, struct babel_neighbor *n); static void babel_send_seqno_request(struct babel_proto *p, struct babel_entry *e, struct babel_seqno_request *sr, struct babel_neighbor *n); @@ -168,13 +169,19 @@ babel_get_route(struct babel_proto *p, struct babel_entry *e, struct babel_neigh return r; } +static struct babel_route * +babel_get_selected_route(struct babel_proto *p, struct babel_entry *e) +{ + if (e->selected_nbr) +return babel_get_route(p, e, e->selected_nbr); + return NULL; +} + static inline void babel_retract_route(struct babel_proto *p, struct babel_route *r) { r->metric = r->advert_metric = BABEL_INFINITY; - - if (r == r->e->selected) -babel_select_route(p, r->e, r); + babel_announce_rte(p, r->e, r); } static void @@ -186,9 +193,6 @@ babel_flush_route(struct babel_proto *p UNUSED, struct babel_route *r) rem_node(NODE r); rem_node(&r->neigh_route); - if (r->e->selected == r) -r->e->selected = NULL; - sl_free(r); } @@ -204,9 +208,11 @@ babel_expire_route(struct babel_proto *p, struct babel_route *r) { r->metric = r->advert_metric = BABEL_INFINITY; r->expires = current_time() + cf->hold_time; +babel_announce_rte(p, r->e, r); } else { +babel_announce_rte(p, r->e, r); babel_flush_route(p, r); } } @@ -214,7 +220,7 @@ babel_expire_route(struct babel_proto *p, struct babel_route *r) static void babel_refresh_route(struct babel_proto *p, struct babel_route *r) { - if (r == r->e->selected) + if (r == babel_get_selected_route(p, r->e)) babel_send_route_request(p, r->e, r->neigh); r->refresh_time = 0; @@ -234,8 +240,6 @@ babel_expire_routes_(struct babel_proto *p, struct babel_rtable *rtable) loop: FIB_ITERATE_START(&rtable->fib, &fit, struct babel_entry, e) { -int changed = 0; - WALK_LIST_DELSAFE(r, rx, e->routes) { if (r->refresh_time && r->refresh_time <= now_) @@ -243,14 +247,10 @@ loop: if (r->expires && r->expires <= now_) { - changed = changed || (r == e->selected); babel_expire_route(p, r); } } -if (changed) - babel_select_route(p, e, NULL); - /* Clean up stale entries */ if ((e->valid == BABEL_ENTRY_STALE) && ((e->updated + cf->hold_time) <= now_)) e->valid = BABEL
[PATCH] Babel: Remove unecessary FIB_ITERATE restart
The route expiration code appears to have been stolen from rip.c, in that code the rt_notify function actually does modify the rtable fib by calling fib_get. The babel code however does no such thing, so this inefficient restart is just entirely uneccesarry. To prove this is true I add a bunch of checking code that can be removed later. --- proto/babel/babel.c | 64 ++--- proto/babel/babel.h | 8 +++--- 2 files changed, 42 insertions(+), 30 deletions(-) diff --git a/proto/babel/babel.c b/proto/babel/babel.c index ff8b6b52..bed2e1e4 100644 --- a/proto/babel/babel.c +++ b/proto/babel/babel.c @@ -58,6 +58,9 @@ static void babel_update_cost(struct babel_neighbor *n); static inline void babel_kick_timer(struct babel_proto *p); static inline void babel_iface_kick_timer(struct babel_iface *ifa); +#define rtable_lock(rtable) ASSERT(!(rtable)->lock++) +#define rtable_unlock(rtable) ASSERT(!--(rtable)->lock) + /* * Functions to maintain data structures */ @@ -76,15 +79,16 @@ babel_init_entry(struct fib *f UNUSED, void *E) static inline struct babel_entry * babel_find_entry(struct babel_proto *p, const net_addr *n) { - struct fib *rtable = (n->type == NET_IP4) ? &p->ip4_rtable : &p->ip6_rtable; + struct fib *rtable = (n->type == NET_IP4) ? &p->ip4_rtable.fib : &p->ip6_rtable.fib; return fib_find(rtable, n); } static struct babel_entry * babel_get_entry(struct babel_proto *p, const net_addr *n) { - struct fib *rtable = (n->type == NET_IP4) ? &p->ip4_rtable : &p->ip6_rtable; - struct babel_entry *e = fib_get(rtable, n); + struct babel_rtable *rtable = (n->type == NET_IP4) ? &p->ip4_rtable : &p->ip6_rtable; + ASSERT(!rtable->lock); + struct babel_entry *e = fib_get(&rtable->fib, n); return e; } @@ -217,17 +221,18 @@ babel_refresh_route(struct babel_proto *p, struct babel_route *r) } static void -babel_expire_routes_(struct babel_proto *p, struct fib *rtable) +babel_expire_routes_(struct babel_proto *p, struct babel_rtable *rtable) { struct babel_config *cf = (void *) p->p.cf; struct babel_route *r, *rx; struct fib_iterator fit; btime now_ = current_time(); - FIB_ITERATE_INIT(&fit, rtable); + rtable_lock(rtable); + FIB_ITERATE_INIT(&fit, &rtable->fib); loop: - FIB_ITERATE_START(rtable, &fit, struct babel_entry, e) + FIB_ITERATE_START(&rtable->fib, &fit, struct babel_entry, e) { int changed = 0; @@ -244,16 +249,7 @@ loop: } if (changed) -{ - /* - * We have to restart the iteration because there may be a cascade of - * synchronous events babel_select_route() -> nest table change -> - * babel_rt_notify() -> rtable change, invalidating hidden variables. - */ - FIB_ITERATE_PUT(&fit); babel_select_route(p, e, NULL); - goto loop; -} /* Clean up stale entries */ if ((e->valid == BABEL_ENTRY_STALE) && ((e->updated + cf->hold_time) <= now_)) @@ -263,6 +259,7 @@ loop: if (e->unreachable && (!e->valid || (e->router_id == p->router_id))) { FIB_ITERATE_PUT(&fit); + rtable_unlock(rtable); babel_announce_retraction(p, e); goto loop; } @@ -274,11 +271,13 @@ loop: if (!e->valid && EMPTY_LIST(e->routes) && EMPTY_LIST(e->sources) && EMPTY_LIST(e->requests)) { FIB_ITERATE_PUT(&fit); - fib_delete(rtable, e); + rtable_unlock(rtable); + fib_delete(&rtable->fib, e); goto loop; } } FIB_ITERATE_END; + rtable_unlock(rtable); } static void @@ -959,7 +958,7 @@ babel_send_seqno_request(struct babel_proto *p, struct babel_entry *e, * transmitted entry is updated. */ static void -babel_send_update_(struct babel_iface *ifa, btime changed, struct fib *rtable) +babel_send_update_(struct babel_iface *ifa, btime changed, struct babel_rtable *rtable) { struct babel_proto *p = ifa->proto; @@ -970,7 +969,8 @@ babel_send_update_(struct babel_iface *ifa, btime changed, struct fib *rtable) p->update_seqno_inc = 0; } - FIB_WALK(rtable, struct babel_entry, e) + rtable_lock(rtable); + FIB_WALK(&rtable->fib, struct babel_entry, e) { if (!e->valid) continue; @@ -1022,6 +1022,7 @@ babel_send_update_(struct babel_iface *ifa, btime changed, struct fib *rtable) } } FIB_WALK_END; + rtable_unlock(rtable); } static void @@ -2055,16 +2056,21 @@ babel_dump(struct proto *P) WALK_LIST(ifa, p->interfaces) babel_dump_iface(ifa); - FIB_WALK(&p->ip4_rtable, struct babel_entry, e) + rtable_lock(&p->ip4_rtable); + FIB_WALK(&p->ip4_rtable.fib, struct babel_entry, e) { babel_dump_entry(e); } FIB_WALK_END; - FIB_WALK(&p->ip6_rtable, struct babel_entry, e) + rtable_unlock(&p->ip4_rtable); + + rtable_lock(&p->ip6_rtable); + FIB_WALK(&p->ip6_rtable.fib, struct babel_entry, e) { babel_dump_entry(e); } FIB_WALK_END; + rtable_unlock(&p->ip6_rtable); } static void @@ -2180,11 +2186,12 @@ babel_sh
Babel: Possible segfault in bird unfeasible update handling code
Hi Babelers, I've been working on the babel proto in bird and found some code where I can't convince myself it won't segfault. The problematic bit is, I think, 's' in babel_handle_update can be NULL because nothing ensures the babel_source for a particular neighbour actually exists here: /* Regular update */ [...] s = babel_find_source(e, msg->router_id); /* for feasibility */ [...] /* RFC section 3.8.2.2 - Dealing with unfeasible updates */ if (!feasible && (metric != BABEL_INFINITY) && (!best || (r == best) || (metric < best->metric))) babel_add_seqno_request(p, e, s->router_id, s->seqno + 1, 0, nbr); //^ BUG: Can 's' be NULL here? The only place that allocates sources is babel_send_update_ which just happens at it's own pace and has nothing to do with incoming update handling AFAICT. Am I missing something or is this a bug? Perhaps find should just be replaced by babel_get_source here? --Daniel
[PATCH v3] Babel: Add option to support ecmp routes
We introduce ecmp support for the babel protocol by extending it's definition of a route being selected to mean the route being in the ECMP set. In order to keep code changes minimal we keep the pointer to an arbitrary member of the ECMP set in the FIB entry and add a new flag to babel_route which indicates which routes were actually announced to the core. Since keeping this flag update at all times is a hassle we take a lazy approach and simply check the metric of the selected route against the route in question whenever we want to know for sure if a route is in the ECMP set. --- doc/bird.sgml| 32 ++ proto/babel/babel.c | 103 +++ proto/babel/babel.h | 16 +++ proto/babel/config.Y | 5 +++ 4 files changed, 138 insertions(+), 18 deletions(-) Changes in v3: - Squash with ecmp weigth patch - Add babel_route_is_selected() as replacement for e->selected I'm not totally sure the lazy approach is safe yet. We might need additional bookeeping to reset r->active_nexthop on route retraction/flush instead. diff --git a/doc/bird.sgml b/doc/bird.sgml index 1580facd..8d159b22 100644 --- a/doc/bird.sgml +++ b/doc/bird.sgml @@ -1865,6 +1865,7 @@ protocol babel [] { ipv4 { }; ipv6 [sadr] { }; randomize router id ; + ecmp [limit ]; interface { type ; rxcost ; @@ -1879,6 +1880,7 @@ protocol babel [] { check link ; next hop ipv4 ; next hop ipv6 ; + ecmp weight ; authentication none|mac [permissive]; password ""; password " " { @@ -1909,6 +1911,18 @@ protocol babel [] { router ID every time it starts up, which avoids this problem at the cost of not having stable router IDs in the network. Default: no. + ecmp switch [limit number] + + Determines whether babel will emit ECMP (equal-cost multipath) + routes, allowing to load-balancing traffic across multiple paths. If + enabled the maximum number of next-hops to allow can be specified, + defaulting to 16. + + When neibours are using a dynamic link-quality metric this is + unlikely to be useful. For best results should be used throughout the network to get what + amounts to a hop count metric. + type wired|wireless This option specifies the interface type: Wired or wireless. On wired interfaces a neighbor is considered unreachable after a small number of @@ -1928,6 +1942,18 @@ protocol babel [] { selection and not local route selection. Default: 96 for wired interfaces, 256 for wireless. + ecmp switch [limit number] + + Determines whether babel will emit ECMP (equal-cost multipath) + routes, allowing to load-balancing traffic across multiple paths. If + enabled the maximum number of next-hops to allow can be specified, + defaulting to 16. + + When neibours are using a dynamic link-quality metric this is + unlikely to be useful. For best results should be used throughout the network to get what + amounts to a hop count metric. + limit BIRD keeps track of received Hello messages from each neighbor to establish neighbor reachability. For wired type interfaces, this option @@ -1983,6 +2009,12 @@ protocol babel [] { source for Babel packets will be used. In normal operation, it should not be necessary to set this option. + ecmp weight number + This specifies the relative weight used for nexthops going through + the iface when ECMP is enabled. Larger weight values relative to other + nexthops attract more traffic. Valid values are 1-256. Default value + is 1. + authentication none|mac [permissive] Selects authentication method to be used. e->selected || r->e->selected->metric != r->metric) +r->active_nexthop = 0; + + return r->active_nexthop; +} + static inline void babel_retract_route(struct babel_proto *p, struct babel_route *r) { r->metric = r->advert_metric = BABEL_INFINITY; - if (r == r->e->selected) + if (babel_route_is_selected(r)) babel_select_route(p, r->e, r); } @@ -210,7 +221,7 @@ babel_expire_route(struct babel_proto *p, struct babel_route *r) static void babel_refresh_route(struct babel_proto *p, struct babel_route *r) { - if (r == r->e->selected) + if (babel_route_is_selected(r)) babel_send_route_request(p, r->e, r->neigh); r->refresh_time = 0; @@ -238,7 +249,7 @@ loop: if (r->expires && r->expires <= now_) { - changed = changed || (r == e->selected); + changed = changed || babel_route_is_selected(r); babel_expire_route(p, r); } } @@ -624,7 +635,38 @@ done: } /** - * babel_announce_rte - announce selected route to the core + * babel_nexthop_insert - add next_hop of route to nexthop list
[PATCH v2 2/2] Netlink: Propagate ECMP nexthop weight to kernel for inet6 routes
Previously nl_send_route would use plain nl_add_nexthop for ecmp ipv6 routes instead of adding RTA_MULTIPATH objects via nl_add_multipath. The former lacks support for the rtnh_hops field needed for setting the nexthop weight though. On the kernel side support for nexthop weights was introduced by commit 398958ae48 ("ipv6: Add support for non-equal-cost multipath") which landed in 4.16. Before this the weight will simply be ignored. Support for the ipv6 RTA_MULTIPATH attribute was added in 51ebd31815 ("ipv6: add support of equal cost multipath (ECMP)") which landed in 3.10. If this version bound isn't met installing the route will fail. --- sysdep/linux/netlink.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/sysdep/linux/netlink.c b/sysdep/linux/netlink.c index eca681f9..cf1e18d2 100644 --- a/sysdep/linux/netlink.c +++ b/sysdep/linux/netlink.c @@ -74,7 +74,6 @@ #endif #define krt_ipv4(p) ((p)->af == AF_INET) -#define krt_ecmp6(p) ((p)->af == AF_INET6) const int rt_default_ecmp = 16; @@ -1403,7 +1402,7 @@ dest: { case RTD_UNICAST: r->r.rtm_type = RTN_UNICAST; - if (nh->next && !krt_ecmp6(p)) + if (nh->next) nl_add_multipath(&r->h, rsize, nh, p->af, eattrs); else { -- 2.30.2
[PATCH v2 1/2] Netlink: Drop ECMP route splitting hacks
This removes the hacky route merging/splitting code needed to support older kernel versions. Consequently the required Linux version is raised to 4.11 for reliable operation. --- sysdep/linux/netlink.c | 218 +++-- 1 file changed, 33 insertions(+), 185 deletions(-) diff --git a/sysdep/linux/netlink.c b/sysdep/linux/netlink.c index 29b744cb..eca681f9 100644 --- a/sysdep/linux/netlink.c +++ b/sysdep/linux/netlink.c @@ -78,49 +78,6 @@ const int rt_default_ecmp = 16; -/* - * Structure nl_parse_state keeps state of received route processing. Ideally, - * we could just independently parse received Netlink messages and immediately - * propagate received routes to the rest of BIRD, but older Linux kernel (before - * version 4.11) represents and announces IPv6 ECMP routes not as one route with - * multiple next hops (like RTA_MULTIPATH in IPv4 ECMP), but as a sequence of - * routes with the same prefix. More recent kernels work as with IPv4. - * - * Therefore, BIRD keeps currently processed route in nl_parse_state structure - * and postpones its propagation until we expect it to be final; i.e., when - * non-matching route is received or when the scan ends. When another matching - * route is received, it is merged with the already processed route to form an - * ECMP route. Note that merging is done only for IPv6 (merge == 1), but the - * postponing is done in both cases (for simplicity). All IPv4 routes or IPv6 - * routes with RTA_MULTIPATH set are just considered non-matching. - * - * This is ignored for asynchronous notifications (every notification is handled - * as a separate route). It is not an issue for our routes, as we ignore such - * notifications anyways. But importing alien IPv6 ECMP routes does not work - * properly with older kernels. - * - * Whatever the kernel version is, IPv6 ECMP routes are sent as multiple routes - * for the same prefix. - */ - -struct nl_parse_state -{ - struct linpool *pool; - int scan; - int merge; - - net *net; - rta *attrs; - struct krt_proto *proto; - s8 new; - s8 krt_src; - u8 krt_type; - u8 krt_proto; - u32 krt_metric; - - u32 rta_flow;/* Used during parsing */ -}; - /* * Synchronous Netlink interface */ @@ -761,7 +718,7 @@ nl_add_multipath(struct nlmsghdr *h, uint bufsize, struct nexthop *nh, int af, e } static struct nexthop * -nl_parse_multipath(struct nl_parse_state *s, struct krt_proto *p, const net_addr *n, struct rtattr *ra, int af, int krt_src) +nl_parse_multipath(struct krt_proto *p, const net_addr *n, struct rtattr *ra, int af, int krt_src, u32 *rta_flow) { struct rtattr *a[BIRD_RTA_MAX]; struct rtnexthop *nh = RTA_DATA(ra); @@ -780,7 +737,7 @@ nl_parse_multipath(struct nl_parse_state *s, struct krt_proto *p, const net_addr if ((nh->rtnh_flags & RTNH_F_DEAD) && (krt_src != KRT_SRC_BIRD)) goto next; - *last = rv = lp_allocz(s->pool, NEXTHOP_MAX_SIZE); + *last = rv = lp_allocz(nl_linpool, NEXTHOP_MAX_SIZE); last = &(rv->next); rv->weight = nh->rtnh_hops; @@ -824,7 +781,7 @@ nl_parse_multipath(struct nl_parse_state *s, struct krt_proto *p, const net_addr rv->gw = rta_get_ipa(a[RTA_GATEWAY]); if (a[RTA_FLOW]) - s->rta_flow = rta_get_u32(a[RTA_FLOW]); + *rta_flow = rta_get_u32(a[RTA_FLOW]); #ifdef HAVE_MPLS_KERNEL if (a[RTA_VIA]) @@ -1480,36 +1437,13 @@ static inline int nl_add_rte(struct krt_proto *p, rte *e) { rta *a = e->attrs; - int err = 0; - - if (krt_ecmp6(p) && a->nh.next) - { -struct nexthop *nh = &(a->nh); - -err = nl_send_route(p, e, NL_OP_ADD, RTD_UNICAST, nh); -if (err < 0) - return err; - -for (nh = nh->next; nh; nh = nh->next) - err += nl_send_route(p, e, NL_OP_APPEND, RTD_UNICAST, nh); - -return err; - } - return nl_send_route(p, e, NL_OP_ADD, a->dest, &(a->nh)); } static inline int nl_delete_rte(struct krt_proto *p, rte *e) { - int err = 0; - - /* For IPv6, we just repeatedly request DELETE until we get error */ - do -err = nl_send_route(p, e, NL_OP_DELETE, RTD_NONE, NULL); - while (krt_ecmp6(p) && !err); - - return err; + return nl_send_route(p, e, NL_OP_DELETE, RTD_NONE, NULL); } static inline int @@ -1559,67 +1493,11 @@ krt_replace_rte(struct krt_proto *p, net *n UNUSED, rte *new, rte *old) } } -static int -nl_mergable_route(struct nl_parse_state *s, net *net, struct krt_proto *p, uint priority, uint krt_type, uint rtm_family) -{ - /* Route merging is used for IPv6 scans */ - if (!s->scan || (rtm_family != AF_INET6)) -return 0; - - /* Saved and new route must have same network, proto/table, and priority */ - if ((s->net != net) || (s->proto != p) || (s->krt_metric != priority)) -return 0; - - /* Both must be regular unicast routes */ - if ((s->krt_type != RTN_UNICAST) || (krt_type != RTN_UNICAST)) -return 0; - - return 1; -} - -static void -nl_announce_route(struct nl_parse_state *s)
[PATCH] Netlink: Propagate ecmp nexthop weight to kernel for inet6 routes
Previously nl_send_route would use plain nl_add_nexthop for ecmp ipv6 routes instead of adding RTA_MULTIPATH objects via nl_add_multipath. The former lacks support for the rtnh_hops field needed for setting the nexthop weight though. On the kernel side support for nexthop weights was introduced by commit 398958ae48 ("ipv6: Add support for non-equal-cost multipath") which landed in 4.16. Before this the weight will simply be ignored. Support for the ipv6 RTA_MULTIPATH attribute was added in 51ebd31815 ("ipv6: add support of equal cost multipath (ECMP)") which landed in 3.10. If this version bound isn't met installing the route will fail. --- sysdep/linux/netlink.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sysdep/linux/netlink.c b/sysdep/linux/netlink.c index 29b744cb..5e0b036d 100644 --- a/sysdep/linux/netlink.c +++ b/sysdep/linux/netlink.c @@ -1446,7 +1446,7 @@ dest: { case RTD_UNICAST: r->r.rtm_type = RTN_UNICAST; - if (nh->next && !krt_ecmp6(p)) + if (nh->next) nl_add_multipath(&r->h, rsize, nh, p->af, eattrs); else { -- 2.30.2
[PATCH] Netlink: Propagate ecmp nexthop weight to kernel for inet6 routes
Previously nl_send_route would use plain nl_add_nexthop for ecmp ipv6 routes instead of adding RTA_MULTIPATH objects via nl_add_multipath. The former lacks support for the rtnexthop.rtnh_hops field need for setting the nexthop weight though. On the kernel side support for nexthop weights was introduced by commit 398958ae48 ("ipv6: Add support for non-equal-cost multipath") which landed in 4.16. Before this the weight will simply be ignored. Support for the ipv6 RTA_MULTIPATH attribute was added in 51ebd31815 ("ipv6: add support of equal cost multipath (ECMP)") which landed in 3.10. If this version bound isn't met installing the route will fail. --- sysdep/linux/netlink.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sysdep/linux/netlink.c b/sysdep/linux/netlink.c index 29b744cb..5e0b036d 100644 --- a/sysdep/linux/netlink.c +++ b/sysdep/linux/netlink.c @@ -1446,7 +1446,7 @@ dest: { case RTD_UNICAST: r->r.rtm_type = RTN_UNICAST; - if (nh->next && !krt_ecmp6(p)) + if (nh->next) nl_add_multipath(&r->h, rsize, nh, p->af, eattrs); else { -- 2.30.2
[PATCH v2 2/2] Babel: Add option to control ecmp nexthop weight
--- doc/bird.sgml| 7 +++ proto/babel/babel.c | 1 + proto/babel/babel.h | 2 ++ proto/babel/config.Y | 2 ++ 4 files changed, 12 insertions(+) diff --git a/doc/bird.sgml b/doc/bird.sgml index d1e6376b..8d159b22 100644 --- a/doc/bird.sgml +++ b/doc/bird.sgml @@ -1880,6 +1880,7 @@ protocol babel [] { check link ; next hop ipv4 ; next hop ipv6 ; + ecmp weight ; authentication none|mac [permissive]; password ""; password " " { @@ -2008,6 +2009,12 @@ protocol babel [] { source for Babel packets will be used. In normal operation, it should not be necessary to set this option. + ecmp weight number + This specifies the relative weight used for nexthops going through + the iface when ECMP is enabled. Larger weight values relative to other + nexthops attract more traffic. Valid values are 1-256. Default value + is 1. + authentication none|mac [permissive] Selects authentication method to be used. gw = r->next_hop; nh->iface = r->neigh->ifa->iface; + nh->weight = r->neigh->ifa->cf->ecmp_weight; /* * If we cannot find a reachable neighbour, set the entry to be onlink. This diff --git a/proto/babel/babel.h b/proto/babel/babel.h index 46e268fd..06b82799 100644 --- a/proto/babel/babel.h +++ b/proto/babel/babel.h @@ -147,6 +147,8 @@ struct babel_iface_config { int tx_tos; int tx_priority; + u8 ecmp_weight; + ip_addr next_hop_ip4; ip_addr next_hop_ip6; diff --git a/proto/babel/config.Y b/proto/babel/config.Y index 92e754cd..0da5025d 100644 --- a/proto/babel/config.Y +++ b/proto/babel/config.Y @@ -70,6 +70,7 @@ babel_iface_start: BABEL_IFACE->tx_tos = IP_PREC_INTERNET_CONTROL; BABEL_IFACE->tx_priority = sk_priority_control; BABEL_IFACE->check_link = 1; + BABEL_IFACE->ecmp_weight = 0; }; @@ -149,6 +150,7 @@ babel_iface_item: | AUTHENTICATION NONE { BABEL_IFACE->auth_type = BABEL_AUTH_NONE; } | AUTHENTICATION MAC { BABEL_IFACE->auth_type = BABEL_AUTH_MAC; BABEL_IFACE->auth_permissive = 0; } | AUTHENTICATION MAC PERMISSIVE { BABEL_IFACE->auth_type = BABEL_AUTH_MAC; BABEL_IFACE->auth_permissive = 1; } + | ECMP WEIGHT expr { BABEL_IFACE->ecmp_weight = $3 - 1; if (($3<1) || ($3>256)) cf_error("ECMP weight must be in range 1-256"); } | password_list ; -- 2.30.2
[PATCH v2 1/2] Babel: Add option to support ecmp routes
--- doc/bird.sgml| 25 proto/babel/babel.c | 70 +--- proto/babel/babel.h | 5 proto/babel/config.Y | 3 ++ 4 files changed, 92 insertions(+), 11 deletions(-) diff --git a/doc/bird.sgml b/doc/bird.sgml index 1580facd..d1e6376b 100644 --- a/doc/bird.sgml +++ b/doc/bird.sgml @@ -1865,6 +1865,7 @@ protocol babel [] { ipv4 { }; ipv6 [sadr] { }; randomize router id ; + ecmp [limit ]; interface { type ; rxcost ; @@ -1909,6 +1910,18 @@ protocol babel [] { router ID every time it starts up, which avoids this problem at the cost of not having stable router IDs in the network. Default: no. + ecmp switch [limit number] + + Determines whether babel will emit ECMP (equal-cost multipath) + routes, allowing to load-balancing traffic across multiple paths. If + enabled the maximum number of next-hops to allow can be specified, + defaulting to 16. + + When neibours are using a dynamic link-quality metric this is + unlikely to be useful. For best results should be used throughout the network to get what + amounts to a hop count metric. + type wired|wireless This option specifies the interface type: Wired or wireless. On wired interfaces a neighbor is considered unreachable after a small number of @@ -1928,6 +1941,18 @@ protocol babel [] { selection and not local route selection. Default: 96 for wired interfaces, 256 for wireless. + ecmp switch [limit number] + + Determines whether babel will emit ECMP (equal-cost multipath) + routes, allowing to load-balancing traffic across multiple paths. If + enabled the maximum number of next-hops to allow can be specified, + defaulting to 16. + + When neibours are using a dynamic link-quality metric this is + unlikely to be useful. For best results should be used throughout the network to get what + amounts to a hop count metric. + limit BIRD keeps track of received Hello messages from each neighbor to establish neighbor reachability. For wired type interfaces, this option diff --git a/proto/babel/babel.c b/proto/babel/babel.c index 4a7d550f..01ea3a44 100644 --- a/proto/babel/babel.c +++ b/proto/babel/babel.c @@ -623,6 +623,34 @@ done: } } +/** + * babel_nexthop_insert - add next_hop of route to nexthop list + * @p: Babel protocol instance + * @r: Babel route + * @nhs: nexthop list head to append onto + * @nh: freshly allocated buffer to fill + */ +static void +babel_nexthop_insert( + struct babel_proto *p, + struct babel_route *r, + struct nexthop **nhs, + struct nexthop *nh) +{ + nh->gw = r->next_hop; + nh->iface = r->neigh->ifa->iface; + + /* + * If we cannot find a reachable neighbour, set the entry to be onlink. This + * makes it possible to, e.g., assign /32 addresses on a mesh interface and + * have routing work. + */ + if (!neigh_find(&p->p, r->next_hop, r->neigh->ifa->iface, 0)) +nh->flags = RNF_ONLINK; + + nexthop_insert(nhs, nh); +} + /** * babel_announce_rte - announce selected route to the core * @p: Babel protocol instance @@ -635,6 +663,7 @@ done: static void babel_announce_rte(struct babel_proto *p, struct babel_entry *e) { + struct babel_config *cf = (void *) p->p.cf; struct babel_route *r = e->selected; struct channel *c = (e->n.addr->type == NET_IP4) ? p->ip4_channel : p->ip6_channel; @@ -645,18 +674,24 @@ babel_announce_rte(struct babel_proto *p, struct babel_entry *e) .source = RTS_BABEL, .scope = SCOPE_UNIVERSE, .dest = RTD_UNICAST, - .from = r->neigh->addr, - .nh.gw = r->next_hop, - .nh.iface = r->neigh->ifa->iface, }; -/* - * If we cannot find a reachable neighbour, set the entry to be onlink. This - * makes it possible to, e.g., assign /32 addresses on a mesh interface and - * have routing work. - */ -if (!neigh_find(&p->p, r->next_hop, r->neigh->ifa->iface, 0)) - a0.nh.flags = RNF_ONLINK; +struct nexthop *nhs = NULL; +babel_nexthop_insert(p, r, &nhs, allocz(sizeof(struct nexthop))); +int num_nexthops = 1; + +struct babel_route *cr; +WALK_LIST(cr, e->routes) { + if (cr == r || !cr->feasible || cr->metric != r->metric) + continue; + + if (num_nexthops++ >= cf->max_nexthops) + break; + + babel_nexthop_insert(p, cr, &nhs, allocz(sizeof(struct nexthop))); +} + +a0.nh = *nhs; rta *a = rta_lookup(&a0); rte *rte = rte_get_temp(a); @@ -736,6 +771,7 @@ babel_announce_retraction(struct babel_proto *p, struct babel_entry *e) static void babel_select_route(struct babel_proto *p, struct babel_entry *e, struct babel_route *mod) { + struct babel_config *cf = (void *) p->p.cf; struct babel_route *r, *best = e->selected; /* Shortcut if only non-best was modified */ @@ -744,8
Re: Merging multiple routes into a single multipath route
Hi Johannes, On Tue, May 03, 2022 at 04:07:08PM +0200, Johannes Erwerle wrote: > I would like to use both uplinks via the linux multipath routing and I am > searching for a way to "merge" both default routes that I get into one route > with multiple next hops. > > Is there a way to accomplish that with bird? (or any other tool?) Instead of letting the dhcp client install the routes and then trying to massage them into shape you could do this directly by configuring the dhcp client appropriately. The isc-dhcp client actually and very good support for scripting in dhclient-*-hooks.d that lets you override what the dhclient-script (which handles the OS specific stuff) does by default. I have been working on a dhcp script myself to allow for VRRP redundancy on a DHCP managed interface. You should be able to use the neutralize_ip_change hack I use to override iproute2 commands called by dhclient-script for your purposes too: neutralize_ip_change () { ip () neutered_ip "$@" } neutered_ip () { : do stuff and call real "ip" command; } This works because dhclient-script (a bash script) sources the hooks so we can mess with the shell environment to override the "ip" command with a shell function that matches on the args and does whatever we want. Script attached since I haven't published it anywhere yet. Note this is only lightly tested so far but I'd be happy to help you make this work for your use-case. --Daniel #!/bin/sh # Install into /etc/dhcp and symlink to /etc/dhcp/dhclient-{enter,exit}-hooks.d IFACE=enp9s0f1.4 VRID=4 PRIO=100 DUMMYIP_OLD=192.168.254.254 DUMMYIP_NEW=192.168.254.253 PIDFILE=/var/run/"vrrpd_${IFACE}_${VRID}.pid" HOOKDIR="$(basename "$dir")" ## Debugging #set -x #exec >> /tmp/dhcp.log 2>&1 echo $HOOKDIR $reason $interface >>/tmp/dhcp.log set | grep 'ip_address\|route' | sort >>/tmp/dhcp.log printf '\n\n' >>/tmp/dhcp.log killvrrp () { echo "killvrrp $PIDFILE" >>/tmp/dhcp.log [ -n $old_ip_address ] || return if [ -e "$PIDFILE" ]; then kill "$(cat $PIDFILE)" || true fi } startvrrp () { echo "startvrrp $new_ip_address\n" >>/tmp/dhcp.log [ -n $new_ip_address ] || return echo vrrpd -D -n -i "$interface" -v "$VRID" -p "$PRIO" "$new_ip_address" vrrpd -D -n -i "$interface" -v "$VRID" -p "$PRIO" "$new_ip_address" || exit_status=1 } # dhclient-script insists on fucking with the IP address, we don't want any # of that since vrrpd will do it. neutralize_ip_change () { ip () neutered_ip "$@" } neutered_ip () { [ $# -ge 3 ] || { ip "$@"; return; } if [ "$1" = "-4" -a "$2" = addr -a "$3" = flush ] || [ "$1" = "-4" -a "$2" = addr -a "$3" = add ] || [ "$1" = "-4" -a "$2" = addr -a "$3" = change ] then echo ignored "$@" return else echo =ip "$@" command ip "$@" fi } do_vrrp () { if [ "$HOOKDIR" = "dhclient-enter-hooks.d" ]; then case $reason in BOUND | REBOOT | TIMEOUT) ;; # Just started shouldn't be running RENEW | REBIND) [ $new_ip_address = $old_ip_address] || killvrrp ;; EXPIRE | FAIL | RELEASE | STOP | TIMEOUT) killvrrp ;; esac neutralize_ip_change elif [ "$HOOKDIR" = "dhclient-exit-hooks.d" ]; then case $reason in # Note: TIMEOUT is when we can't reach a dhcp server during # startup but have a valid lease still that we're going to # use. BOUND | REBOOT | TIMEOUT) startvrrp ;; RENEW | REBIND) [ $new_ip_address = $old_ip_address] || startvrrp ;; EXPIRE | FAIL | RELEASE) ;; esac fi } if [ "$interface" = $IFACE ]; then do_vrrp fi
[PATCH 1/2] Babel: Add option to support ecmp routes
--- The FIB walks in babel_reconfigure_iface are a bit ugly, I tried to loop over ifa->neigh_list instead but couldn't get it to work. I'm open to suggestions :) doc/bird.sgml| 8 + proto/babel/babel.c | 70 +--- proto/babel/babel.h | 3 ++ proto/babel/config.Y | 3 ++ 4 files changed, 73 insertions(+), 11 deletions(-) diff --git a/doc/bird.sgml b/doc/bird.sgml index 1580facd..5e85d8ec 100644 --- a/doc/bird.sgml +++ b/doc/bird.sgml @@ -1879,6 +1879,7 @@ protocol babel [] { check link ; next hop ipv4 ; next hop ipv6 ; + ecmp [limit ]; authentication none|mac [permissive]; password ""; password " " { @@ -1983,6 +1984,13 @@ protocol babel [] { source for Babel packets will be used. In normal operation, it should not be necessary to set this option. + ecmp switch [limit number] + + Determines whether babel will emit ECMP (equal-cost multipath) + routes, allowing to load-balancing traffic across multiple paths. If + enabled the maximum number of next-hops to allow can be specified, + defaulting to 16. + authentication none|mac [permissive] Selects authentication method to be used. gw = r->next_hop; + nh->iface = r->neigh->ifa->iface; + + /* + * If we cannot find a reachable neighbour, set the entry to be onlink. This + * makes it possible to, e.g., assign /32 addresses on a mesh interface and + * have routing work. + */ + if (!neigh_find(&p->p, r->next_hop, r->neigh->ifa->iface, 0)) +nh->flags = RNF_ONLINK; + + nexthop_insert(nhs, nh); +} + /** * babel_announce_rte - announce selected route to the core * @p: Babel protocol instance @@ -635,6 +663,7 @@ done: static void babel_announce_rte(struct babel_proto *p, struct babel_entry *e) { + struct babel_config *cf = (void *) p->p.cf; struct babel_route *r = e->selected; struct channel *c = (e->n.addr->type == NET_IP4) ? p->ip4_channel : p->ip6_channel; @@ -645,18 +674,24 @@ babel_announce_rte(struct babel_proto *p, struct babel_entry *e) .source = RTS_BABEL, .scope = SCOPE_UNIVERSE, .dest = RTD_UNICAST, - .from = r->neigh->addr, - .nh.gw = r->next_hop, - .nh.iface = r->neigh->ifa->iface, }; -/* - * If we cannot find a reachable neighbour, set the entry to be onlink. This - * makes it possible to, e.g., assign /32 addresses on a mesh interface and - * have routing work. - */ -if (!neigh_find(&p->p, r->next_hop, r->neigh->ifa->iface, 0)) - a0.nh.flags = RNF_ONLINK; +struct nexthop *nhs = NULL; +babel_nexthop_insert(p, r, &nhs, allocz(sizeof(struct nexthop))); +int num_nexthops = 1; + +struct babel_route *cr; +WALK_LIST(cr, e->routes) { + if (num_nexthops++ >= cf->ecmp) + break; + + if (cr == r || !cr->feasible || cr->metric != r->metric) + continue; + + babel_nexthop_insert(p, cr, &nhs, allocz(sizeof(struct nexthop))); +} + +a0.nh = *nhs; rta *a = rta_lookup(&a0); rte *rte = rte_get_temp(a); @@ -736,6 +771,7 @@ babel_announce_retraction(struct babel_proto *p, struct babel_entry *e) static void babel_select_route(struct babel_proto *p, struct babel_entry *e, struct babel_route *mod) { + struct babel_config *cf = (void *) p->p.cf; struct babel_route *r, *best = e->selected; /* Shortcut if only non-best was modified */ @@ -744,8 +780,10 @@ babel_select_route(struct babel_proto *p, struct babel_entry *e, struct babel_ro /* Either select modified route, or keep old best route */ if ((mod->metric < (best ? best->metric : BABEL_INFINITY)) && mod->feasible) best = mod; -else +else if (cf->ecmp==1) return; +/* With ecmp one of the non-selected but equal metric routes might have + * changed so contine on with the announcement in that case. */ } else { @@ -1879,6 +1917,16 @@ babel_reconfigure_iface(struct babel_proto *p, struct babel_iface *ifa, struct b if (ifa->up) babel_iface_kick_timer(ifa); + /* Update all routes to refresh ecmp settings. */ + FIB_WALK(&p->ip6_rtable, struct babel_entry, e) { +babel_select_route(p, e, NULL); + } + FIB_WALK_END; + FIB_WALK(&p->ip4_rtable, struct babel_entry, e) { +babel_select_route(p, e, NULL); + } + FIB_WALK_END; + return 1; } diff --git a/proto/babel/babel.h b/proto/babel/babel.h index 84feb085..3868dcb2 100644 --- a/proto/babel/babel.h +++ b/proto/babel/babel.h @@ -62,6 +62,8 @@ #define BABEL_OVERHEAD (IP6_HEADER_LENGTH+UDP_HEADER_LENGTH) #define BABEL_MIN_MTU (512 + BABEL_OVERHEAD) +#define BABEL_DEFAULT_ECMP_LIMIT 16 + #define BABEL_AUTH_NONE0 #define BABEL_AUTH_MAC 1 @@ -120,6 +122,7 @@ struct babel_config { list iface_list; /* List of ifac
[PATCH 2/2] Babel: Add option to control ecmp nexthop weight
--- doc/bird.sgml| 21 ++--- proto/babel/babel.c | 1 + proto/babel/babel.h | 2 ++ proto/babel/config.Y | 2 ++ 4 files changed, 19 insertions(+), 7 deletions(-) diff --git a/doc/bird.sgml b/doc/bird.sgml index 5e85d8ec..1fdda7bc 100644 --- a/doc/bird.sgml +++ b/doc/bird.sgml @@ -1865,6 +1865,7 @@ protocol babel [] { ipv4 { }; ipv6 [sadr] { }; randomize router id ; + ecmp [limit ]; interface { type ; rxcost ; @@ -1879,7 +1880,7 @@ protocol babel [] { check link ; next hop ipv4 ; next hop ipv6 ; - ecmp [limit ]; + ecmp weight ; authentication none|mac [permissive]; password ""; password " " { @@ -1910,6 +1911,13 @@ protocol babel [] { router ID every time it starts up, which avoids this problem at the cost of not having stable router IDs in the network. Default: no. + ecmp switch [limit number] + + Determines whether babel will emit ECMP (equal-cost multipath) + routes, allowing to load-balancing traffic across multiple paths. If + enabled the maximum number of next-hops to allow can be specified, + defaulting to 16. + type wired|wireless This option specifies the interface type: Wired or wireless. On wired interfaces a neighbor is considered unreachable after a small number of @@ -1984,12 +1992,11 @@ protocol babel [] { source for Babel packets will be used. In normal operation, it should not be necessary to set this option. - ecmp switch [limit number] - - Determines whether babel will emit ECMP (equal-cost multipath) - routes, allowing to load-balancing traffic across multiple paths. If - enabled the maximum number of next-hops to allow can be specified, - defaulting to 16. + ecmp weight number + This specifies the relative weight used for nexthops going through + the iface when ECMP is enabled. Larger weight values relative to other + nexthops attract more traffic. Valid values are 1-256. Default value + is 1. authentication none|mac [permissive] Selects authentication method to be used. gw = r->next_hop; nh->iface = r->neigh->ifa->iface; + nh->weight = r->neigh->ifa->cf->ecmp_weight; /* * If we cannot find a reachable neighbour, set the entry to be onlink. This diff --git a/proto/babel/babel.h b/proto/babel/babel.h index 3868dcb2..2ad1ded0 100644 --- a/proto/babel/babel.h +++ b/proto/babel/babel.h @@ -145,6 +145,8 @@ struct babel_iface_config { int tx_tos; int tx_priority; + u8 ecmp_weight; + ip_addr next_hop_ip4; ip_addr next_hop_ip6; diff --git a/proto/babel/config.Y b/proto/babel/config.Y index 8f18c790..6f03482a 100644 --- a/proto/babel/config.Y +++ b/proto/babel/config.Y @@ -70,6 +70,7 @@ babel_iface_start: BABEL_IFACE->tx_tos = IP_PREC_INTERNET_CONTROL; BABEL_IFACE->tx_priority = sk_priority_control; BABEL_IFACE->check_link = 1; + BABEL_IFACE->ecmp_weight = 0; }; @@ -149,6 +150,7 @@ babel_iface_item: | AUTHENTICATION NONE { BABEL_IFACE->auth_type = BABEL_AUTH_NONE; } | AUTHENTICATION MAC { BABEL_IFACE->auth_type = BABEL_AUTH_MAC; BABEL_IFACE->auth_permissive = 0; } | AUTHENTICATION MAC PERMISSIVE { BABEL_IFACE->auth_type = BABEL_AUTH_MAC; BABEL_IFACE->auth_permissive = 1; } + | ECMP WEIGHT expr { BABEL_IFACE->ecmp_weight = $3 - 1; if (($3<1) || ($3>256)) cf_error("ECMP weight must be in range 1-256"); } | password_list ; -- 2.30.2
[PATCH] krt: Dump routing tables separetely on linux to avoid congestion
When dumping the routing table bird currently doesn't set the rtm_table netlink field to select any particular one but rather wants to get all at once. This can be problematic when multiple routing daemons are running on a system as the kernel's route modification performance goes down drasticly (by a factor of 20-200ish) when the table is being modified while it's being dumped. To avoid this situation we make bird do dumps on a per-kernel-table basis. This then allows the administrator to have multiple routing daemons use different kernel tables which sidesteps the problem. See also this discussion on the babel-users mailing list: https://alioth-lists.debian.net/pipermail/babel-users/2022-April/003902.html --- sysdep/cf/linux.h | 2 +- sysdep/linux/netlink.c | 7 --- 2 files changed, 5 insertions(+), 4 deletions(-) diff --git a/sysdep/cf/linux.h b/sysdep/cf/linux.h index 047d3764..d6211d16 100644 --- a/sysdep/cf/linux.h +++ b/sysdep/cf/linux.h @@ -9,7 +9,7 @@ #define CONFIG_AUTO_ROUTES #define CONFIG_SELF_CONSCIOUS #define CONFIG_MULTIPLE_TABLES -#define CONFIG_ALL_TABLES_AT_ONCE +#undef CONFIG_ALL_TABLES_AT_ONCE #define CONFIG_IP6_SADR_KERNEL #define CONFIG_MC_PROPER_SRC diff --git a/sysdep/linux/netlink.c b/sysdep/linux/netlink.c index 29b744cb..003b7939 100644 --- a/sysdep/linux/netlink.c +++ b/sysdep/linux/netlink.c @@ -256,7 +256,7 @@ nl_request_dump_addr(int af) } static void -nl_request_dump_route(int af) +nl_request_dump_route(int af, int table_id) { struct { struct nlmsghdr nh; @@ -267,6 +267,7 @@ nl_request_dump_route(int af) .nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP, .nh.nlmsg_seq = ++(nl_scan.seq), .rtm.rtm_family = af, +.rtm.rtm_table = table_id, }; send(nl_scan.fd, &req, sizeof(req), 0); @@ -1976,13 +1977,13 @@ nl_parse_route(struct nl_parse_state *s, struct nlmsghdr *h) } void -krt_do_scan(struct krt_proto *p UNUSED)/* CONFIG_ALL_TABLES_AT_ONCE => p is NULL */ +krt_do_scan(struct krt_proto *p) { struct nlmsghdr *h; struct nl_parse_state s; nl_parse_begin(&s, 1); - nl_request_dump_route(AF_UNSPEC); + nl_request_dump_route(AF_UNSPEC, KRT_CF->sys.table_id); while (h = nl_get_scan()) if (h->nlmsg_type == RTM_NEWROUTE || h->nlmsg_type == RTM_DELROUTE) nl_parse_route(&s, h); -- 2.30.2