Re: [Babel-users] Babel RTT settings for low and high latencies

2024-07-18 Thread Daniel Gröber
On Thu, Jul 18, 2024 at 12:38:20PM +0200, Juliusz Chroboczek wrote:
> > A year ago you said [1] that bird had not yet implemented "hysteresis on
> > metrics" [...]  Has bird implemented this in the meantime?
> 
> I haven't checked the BIRD implementation recently (I know, I should be
> doing more work on Babel, but I'm currently spending all my mana on
> fighting the French administration), you'll need to wait until Toke comes
> back from his Viking raid.

We do not have oscillation prevention measures in BIRD yet.

I was doing work on proto/babel most recently and what's complicating that
right now is BIRD 3 (thread-next branch) development disincentivising the
BIRD team from applying potentially conflicting changes as both branches
will need to be maintained going forward.

Perhaps it's time for me to start releasing a babel-bird fork while that's
going on so we can carry on development in the meantime? It seems I lost
the motivation to work on it since nothing is getting merged/released which
makes testing by the community significantly less likeley and doing it in a
way the team *might* pick up patches (I discussed this with them privately)
is too much work for me to overcome my activation energy requirements ;)

--Daniel


signature.asc
Description: PGP signature


Re: Current bird via bookworm-backports?

2024-04-24 Thread Daniel Gröber
Hi Jakub and Ondřej,

I'm also interested in using more recent bird versions in my deployment,
but I don't use third-party repos on principle.

Since I'm a DD I could just start uploading and maintaining a backports for
bird2, is there I reason to not do that? My guess is you just want to avoid
doing the work twice since you're already maintaining that repo outside of
Debian, right?

--Daniel


Re: BGP on /32 (/128) interfaces

2024-04-15 Thread Daniel Gröber
Hi Arzhel,

On Fri, Apr 12, 2024 at 11:57:38AM +0200, Arzhel Younsi wrote:
> But for IPv6, it's cleaner to only require the router's link local address:
> testvm2006:~$ ip -6 addr
> inet6 2620:0:860:140:10:192:24:4/128 scope global
> testvm2006:~$ ip -6 route
> default via fe80::2022:22ff:fe22:2201 dev ens13 metric 1024 pref medium
> 
> In Bird:
> neighbor fe80::2022:22ff:fe22:2201%ens13 external;
> 
> But then the link local address doesn't work with multihop (for obvious
> reason).
> bird: /etc/bird/bird.conf:22:1 Multihop BGP cannot be used with link-local
> addresses

I use lladdrs for BGP endpoints in my network and that works fine. I think
using `direct` instead of `multihop` in the v6-lladdr case would make it
work for you.

One word of advice: don't use the %scope syntax, use the `interface`
directive instead. I don't recall exactly why but I had some subtle problem
with that.

As for your v4/32 problem, give `multihop 1` a try. That enforces no
routers on the path to the peer like direct but allows off-subnet
endpoints. Do keep in mind the docs recommend setting the source address
explicitly when enabling multihop.

multihop [number]

Configure multihop BGP session to a neighbor that isn't directly
connected. Accurately, this option should be used if the configured
neighbor IP address does not match with any local network subnets. Such
IP address have to be reachable through system routing table. The
alternative is the direct option. For multihop BGP it is recommended to
explicitly configure the source address to have it stable. Optional
number argument can be used to specify the number of hops (used for
TTL). Note that the number of networks (edges) in a path is counted;
i.e., if two BGP speakers are separated by one router, the number of
hops is 2. Default: enabled for iBGP.

The reason why direct isn't working is also clear from the docs:

direct

Specify that the neighbor is directly connected. The IP address of
the neighbor must be from a directly reachable IP range
(i.e. associated with one of your router's interfaces), >>>otherwise
the BGP session wouldn't start but it would wait for
such interface to appear<<<. The alternative is the multihop
option. Default: enabled for eBGP.

Hope that helps,
--Daniel


Re: [Babel-users] Coexistence of multiple babel daemons on the same host

2024-03-31 Thread Daniel Gröber
Hi Juliusz,

On Sun, Mar 31, 2024 at 04:35:28PM +0200, Juliusz Chroboczek wrote:
> > I've just come across a reason I'd want to run babel with both bird and
> > babeld on the same node and have them become neighbours.
> 
> I hope you know what you're doing.

I sure hope so too, haha. No it's really quite benign. Just working around
BIRD still not being quite as featureful as I need to run my network but
babeld's non-atomic route replacement interacting badly with BIRD's
proto/radv. See below.

> The simplest solution would be to run each daemon in a different container.

Yeah, I am always trying to avoid that to keep things simple and manageable
for me.

Keep in mind my reason for even raising this on the MLs and not just
shooting patches is that I think the current UX is broken. I don't want
"accidentally running two babel daemons" to cause a network outage like it
did for me.

> If you don't want to use containers, you'll want to use separate routing
> tables for the two daemons, and then set up the correct routing rules.
> First configure the routing rule:

I just want to import routes into bird for use with radv "trigger". These
aren't going to get exported anywhere at all so no need for multiple
routing tables in the kernel and all that.

I do appreciate you taking the time to write down how to do it however :)

> > From what I gather by a quick skimming of the relevant details in
> > RFC8966 this could work by assigning each daemon a unique LL address so
> > each gets it's own entry in the neighbour table, right?
> 
> No, there's no relation between the link-local address and the routing
> table.

*neighbour table*. I think the crux of the issue I was seeing is that both
daemons will use eg. fe80::1 to source their babel hellos and since the
neighbour table is indexed by this address whoever sees both of these
announcements will experience "fighting" over who gets to own that entry or
something? I'm not actually sure, should have probably kept some pcaps from
when this was going on.

Let me know if this makes no sense to you and I'll try to recreate the
circumstances.
 
After some careful reading of the BIRD code I've actually found I can set
the LL saddr it uses using `protocol device { interface "eth0" { preferred
fe80::1234; }; }` so I think I can actually get this working without code
changes. I'm labbing that now.

Thanks,
--Daniel


Coexistence of multiple babel daemons on the same host

2024-03-31 Thread Daniel Gröber
Hi Babelers,

I've just come across a reason I'd want to run babel with both bird and
babeld on the same node and have them become neighbours. The details are
tedious -- my usual disclamer applies ;)

This got me to think about something I'd observed in the past and been
meaning to write about though: when I accidentally had both babeld and bird
running on a node it would casuse severe route flapping and consequently
packet loss.

I wonder if we can do better here? Maybe by warning about this situation in
the logs, shutting down both sides or just turning this into a supported
use-case? :)

>From what I gather by a quick skimming of the relevant details in RFC8966
this could work by assigning each daemon a unique LL address so each gets
it's own entry in the neighbour table, right?

Thanks,
--Daniel


Re: [Babel-users] [RFC] Replace WireGuard AllowedIPs with IP route attribute

2023-11-21 Thread Daniel Gröber
Hi Alexander,

On Wed, Nov 22, 2023 at 12:17:49AM +0100, Alexander Zubkov wrote:
> > Can you think of a use-case where fpRPF isn't enough?
> 
> Yes. IMHO, the problem with RPF is that routing table doesn't reflect the
> network topology, but only a subset of it.

Right that is the fundamental problem, so my solution to that is: routing
should "just represent the full network topology" :)  As the routing
protocol sees it anyway, since the whole point of RFP is to only allow
paths that the routing system chooses.

Do note that while I implement the topology information using ECMP routes
there's no reason you actually have to use ECMP. You could still have
regular routes in your (main) routing table and use a separate table with
ECMP routes for RPF and this is very much something I want us to support.

> I mean in topologies where multiple pathes are possible, you can choose
> to use or even learn only a subset of those pathes.

If I undestand correctly you're talking about (local) routing daemon policy
here. Yes this is something you can do and my current approach of (abusing)
ECMP only works when your routing policy satisfies some symmetry criteria.

However as Juliusz pointed out integrating this idea into the routing
protocol proper could allow using arbitrary policy without ever breaking
RPF, but figuring out the details is (exciting) future work.

> In that sense might be yes, the original sin is that the routing table
> doesn't reflect all the topology, not only the pathes we choose for egress.
> Not sure though if it is a sin, in that case routing table would be too
> overcomplicated.

Right routing table (modification) performance and clutter is certainly a
reason to forgoe this approach but I find that for the kind of (small)
networks I want to run and that many people might run using wireguard this
is perfectly fine.

> If I understand correctly, such fpRPF approach works only if you both learn
> all possible pathes and use all of them in a multi-nexthop route. But for
> example in the Internet with its advanced BGP announcement policies it is
> not true at all.

Right to deply fpRPF on a large scale you really need some kind of support
from the routing protocol. AFAIK there's nothing like that for BGP yet?

I don't think it's completely inapplicable either though, might still work
for iBGP with appropriately designed routing policy. My interest lies
mostly with doing this using babel though.

> So from my point of view it is good to split the topology definition
> (ingress decapsulation) and the chosen pathes (egress routing). Because it
> is related, but still different processes. So the system can be more
> flexible. Although we need to repeat common things and keep ingress and
> egress consistent/synced.

To me flexibility is only desirable insofar as it doesn't conflict with
system security. Source address authenticity is an important property I
wouldn't want to give up here. If it's easier to ignore source address
filtering than it is to implement it nobody is going to do it (cf. the
internet) and I think that's the crux of the problem with "encap".

Wireguard gifted us this amazing state of source filtering being the easy
default and I want to keep it that way.

> my point is that RPF (with its variations too) has its bounds and cannot
> be a universal solution, there is no silver bullet here.

No, ofc. nothing we do can possibly "fix everything for everyone" but
that's no reason not to try a new approach for a particular problem in a
particular use-case :)

--Daniel


signature.asc
Description: PGP signature


Re: [Babel-users] [RFC] Replace WireGuard AllowedIPs with IP route attribute

2023-11-19 Thread Daniel Gröber
Hi Erin, Juliusz,

On Sat, Nov 18, 2023 at 11:21:57AM +0100, Erin Shepherd wrote:
> On Sat, 18 Nov 2023, at 03:19, Daniel Gröber wrote:
> > That would be a problem as I specifically want to tie the source address
> > filtering to this too. I'll have a look at the internals (if and) when I
> > get around to starting work on this.
> 
> Is tying source address filtering to the routing table the right thing to
> do here? It seems to me that it would cause issues similar to those we
> see more generally with Unicast Reverse Path Filtering

IMO not providing a way to do source address filtering at the routing level
was the original sin :)

There is certianly the multihoming challange to be overcome as traditional
BCP38 style filtering doesn't cut it in the general case. I have some ideas
on how to deal with this.

I've done some experiments and found that in Linux multi-nexthop routes
actually match reverse path lookups (using nftables "rt") for _any_ of the
source interfaces involved. I think this can be used to build RFC 3704
style Feasible Path Reverse Path Forwarding when the routing daemon
involved supports ECMP.

This experiment is what got me interested in having via-wgpeer in the
routing table in the first place, once we have that we can apply the above
idea not just at the interface level but at the wg peer level. Neat.

Can you think of a use-case where fpRPF isn't enough?

It's also noteworthy that once we have this support for via-wgpeer it'd be
possible to apply ip-rule policy to the filtering decision. Perhaps that
gives some additional power for more fun use-cases :)

On Sat, Nov 18, 2023 at 01:22:03PM +0100, Juliusz Chroboczek wrote:
> Issues are caused by the kernel performing filtering that the routing
> protocol is not aware of: it causes the routing daemon's routing table to
> no longer match the effective forwarding table (the kernel's routing
> table).  That's the reason why uRPF breaks most routing protocols, that's
> the reason why we have trouble making Wireguard work with Babel, and also
> the reason behind https://github.com/jech/babeld/issues/111.

Right on the money as always. This idea has been on my mind too.

> Contrariwise, we can teach Babel to explicitly take into account the
> kernel features that we're interested in using.  Thus, Babel could be
> aware of the restrictions placed on a wireguard interface, and collaborate
> with Wireguard so that the routing table and the forwarding table remain
> congruent.  I haven't looked at the issue in detail, but I believe that
> would be an interesting (short-term) research project, one that I would be
> glad to collaborate with (but not necessarily lead, at least not right now).

Sounds interesting do you have a funding source in mind?

> For the specific case of source address filtering, Babel already has an
> (implemented) extension to deal with source addresses, and I encourage you
> to consider whether it can be used to deal with the issue at hand.  Please
> see https://arxiv.org/pdf/1403.0445.pdf and RFC 9079.

I don't think I mentioned this to you yet, but I have another one of my
crazy ideas of doing something vaguely similar to BGP flowspec with
babel. Restricted to IP source/destination address, so no L4 stuff. I just
want to represent firewall policy using ipv6 subtrees and distribute it in
realtime using babel :)

Unfortunately this is currently stalled due to an apparent nft rt match
kernel bug preventing me from representing multiple possible outcomes since
I want to support dropping, accepting but also stateful firewalling of
matching flows.

--Daniel



Re: [Babel-users] [RFC] Replace WireGuard AllowedIPs with IP route attribute

2023-11-17 Thread Daniel Gröber
Hi Alexander,

On Thu, Nov 09, 2023 at 12:57:26PM +0100, Alexander Zubkov wrote:
> I heard recently about the lightweight tunnel infrastructure in Linux
> kernel (ip route ... encap ...). And I think this might be helpful in
> the context of this thread.

I hadn't seen that yet, thanks for pointing it out.

> Linux kernel allows already to add encapsulation parameters to the route
> entry in its table. So you do not need to create tunnel devices for
> that. And wireguard encapsulation and destination might be added there
> too.

Right, I think ultimately it's going to come down to either technical
constraints or in the absence of that, maintainer preference whether
via-wgpeer or "encap wg" is the way. The idea is very similar anyway.

> But as I understood the technology, it works only in one way (for
> outgoing packets) and the decapsulation should be processed separately,
> for example in case of VXLAN and MPLS they have their own tables.

That would be a problem as I specifically want to tie the source address
filtering to this too. I'll have a look at the internals (if and) when I
get around to starting work on this.

Thanks,
--Daniel


signature.asc
Description: PGP signature


Re: babel rtt in bird: How to set the RTT equal to the latency?

2023-10-18 Thread Daniel Gröber
Hi Marek,

On Wed, Oct 18, 2023 at 10:49:34PM +0200, Marek Küthe wrote:
> I have seen that there is both rtt-min and rtt-max setting in bird, but
> no max-rtt-penalty. So my question is how would I represent the above
> expression in bird?

The `rtt cost` option is the equivalent in bird.

--Daniel


Re: Doing something wrong with VRF's

2023-09-28 Thread Daniel Gröber
On Thu, Sep 28, 2023 at 04:03:33PM +, Nigel Kukard via Bird-users wrote:
> Well .. I feel embarrassed now. The issue was a missing "kernel table ;"
> in the kernel protocol.

Don't feel bad. I was going to look into using VRFs with bird and now
there's a nice worked example ready to play with at in my inbox :)

--Daniel


Re: [RFC] Replace WireGuard AllowedIPs with IP route attribute

2023-08-28 Thread Daniel Gröber
Hi Juliusz,

On Mon, Aug 28, 2023 at 07:40:51PM +0200, Juliusz Chroboczek wrote:
> I've read the whole discussion, and I'm still not clear what advantages
> the proposed route attribute has over having one interface per peer.  Is
> it because interfaces are expensive in the Linux kernel?  Or is there some
> other reason why it is better to run all WG tunnels over a single interface?

Off the top of my head UDP port exhaustion is a scalability concern here,
just as an example, not that I'd actually ever need that many peers in my
network :)

One wg-device per-peer means we need one UDP port per-peer and since
currently binding to a specific IP is also not supported by wg (I have a
patch pending for this though) there's no good way to work around this.

Frankly having tons of interfaces is just an operational PITA in all sorts
of ways. Apart from the port exhaustion having more than one wg device also
means I have to _allocate_ a new port for each node in my managment system
somehow instead of just using a static port for the entire network. This
gets dicy fast as I want to move in the direction of dynamic peering as in
tinc.

Other than that my `ip -br a` output is getting unmanagably long and having
more than one device means I have to keep ACL lists in sync all over the
system. This is a problem for daemons that don't support automatic reload
(babeld for example :P). I also have to sync the set of interface to
nftables which is easy to get wrong as it's still manual in my setup.

All of that could be solved, but I would also like to get my wg+babel VPN
setup deployed more widely at some point and all that friction isn't going
to help with that so I'd rather have this supported properly.

--Daniel


signature.asc
Description: PGP signature


Re: [Babel-users] [RFC] Replace WireGuard AllowedIPs with IP route attribute

2023-08-28 Thread Daniel Gröber
Hi Kyle,

On Mon, Aug 28, 2023 at 11:40:48AM -0400, Kyle Rose wrote:
> On Sat, Aug 19, 2023 at 5:25 PM Daniel Gröber  wrote:
> > Having read Kyle's use-case I'm thinking my original plan to extend the wg
> > internal source-address filtering to use a rt lookup with our new attribute
> > would not be maximally useful so now my thinking is we should just have a
> > boolean toggle to disable it explicitly per device.
> 
> If there is interest among the maintainers in eventually merging a
> change with a per-interface knob to turn off the source IP check, I
> will go through the trouble of putting together an initial pass at
> this. I don't want to spend the time if there is firm opposition to
> the idea.

I think just a patch to turn off the wg source IP check is not very useful
at the moment. It would encourage bad source IP filtering practices when
multiple peers are involved as no mechanism for identifying the sending
peer is available at the policy routing or netfilter level currently.

I think such a patch would have to get merged after some kind of mechanism
to identify and filter based on the sending wg peer is available.

So if you want to move this along I would suggest working on this
first. Since I'm also interested in having this feature I'm happy
collaborate.

It's just hard to find the motivation for writing more wg patches when my
pending ones have (mostly) been lying around for a year without a response,
but if you're also keen on this feature I'm sure it's easier to stay
motivated together :)

If my kernel patches go ignored for too long too I'll probably just resort
to getting a forked DKMS wireguard module into Debian with this
work. Perhaps that approach (or a package in a different distro) would work
for your use-case too?

--Daniel


Re: Pending bird patches and discussions for proto/babel

2023-08-25 Thread Daniel Gröber
Hi Maria,

On Tue, Jul 18, 2023 at 12:28:35AM +0200, Daniel Gröber wrote:
> On Thu, Jun 01, 2023 at 05:33:42PM +0200, Maria Matejka via Bird-users wrote:
> > > Lastly my v5 route selection patch should also be ready, all threads of
> > > discussion should be resolved and there are no known bugs. However I've 
> > > not
> > > received any testing reports, hint, hint peeps °>°
> > 
> > OK, will look at it as well. Please send me an owl, a pigeon or another
> > avian carrier compatible with RFC 1149 if we don't react in a reasonable
> > time (est. a fortnight).
> 
> Just a reminder on this :)

I humbly beseech thee to consider deinself reminded again :3

--Daniel


Re: [Babel-users] [RFC] Replace WireGuard AllowedIPs with IP route attribute

2023-08-19 Thread Daniel Gröber
Hi Steffen,

from the questions you ask I fear you've misunderstood my intention to
"replace" AllowedIPs. I'm strictly talking of the _implementation_
(allowedips.c) in Linux particular. The netlink interface would naturally
stay unchanged.

On Sat, Aug 19, 2023 at 10:00:17PM +0200, Steffen Vogel wrote:
> Interesting ideas! I am wondering if this complexity is really necessary?

I think so, right now wg's behaviour just prevents a number of advanced
use-cases which is a pitty.

> My biggest concern about the introduction of a route attribute is that
> this adds complexity for users.  WireGuard's simplicity (and portability)
> have been important factors for its success.

Nothing would change for users that don't use this feature.

> A route attribute would introduce another source for the crypto-routing
> peer selection process.  What happens if the two mechanisms select
> different peers? Which one would have precedence?

If you read my previous mail carefully you'll find I specified how this
would shake out. The rt approach will necessarily have to override
AllowedIPs to be useful.

> Similarly also for incoming packets. WireGuard's current principle is
> really easy to understand. If the source address in in the peers
> AllowedIP list, we will accept the packet. If not its discarded. This is
> a central part of WireGuard's crypto-key routing feature which would
> become more complex.

Having read Kyle's use-case I'm thinking my original plan to extend the wg
internal source-address filtering to use a rt lookup with our new attribute
would not be maximally useful so now my thinking is we should just have a
boolean toggle to disable it explicitly per device.

Then users can do whatever fancy rt (reverse-path) lookups they want with
nft. I'm particularly happy that nft rt lookups will actually do the right
thing with multipath/ECMP routes (any of the involved devices satisfies the
lookup) so this should actually work out fine in my case at least. Mmore
prototyping is required here though.

> A second concern I have with the use of route attributes is limited
> portability. Not all platforms support them.  How do we handle WireGuard
> userspace implementations?

No need. The Linux's rt table is Linux specific I have no plans to
introduce this on other platforms, that's for other intersted souls to
tackle. Again "replace" was referring to implementation concerns.

> - Networks which are found in a Peers AllowedIP list will be installed as a 
> kernel route

That's configurable. I always turn this off when dealing with rt daemons.

> - Kernel routes with the peers unique IP address as next-hop will be added to 
> the Peers AllowedIPs list.
> 
> This rather simple feature allows user to pair cunicu with a software
> routing daemon like Bird2 while using a single WireGuard interface with
> multiple peer-to-peer links.

Sounds like you do what I want to do at kernel level in userspace, then at
least we can agree this is a useful thing :)

> In my setup a periodic synchronization worked fine. But I agree that it
> would be nice if we could have a Netlink multicast group for subscribing
> to changes like we also have for other parts of the Linux network stack
> like routing tables, or link states. This feature was already discussed
> on the WireGuard mailing list [7]. But unfortunately the patch was never
> accepted. Maybe we can revisit this patch?

Sounds like a plan, I'll have a look at it.

> [1] Others planned features are:
> - IP-autoconfiguration by deriving link-local addresses from peers public keys

That's been discussed so many times before on the ML and someone always
realises Jason is right and there's no point to this in the end. Key
distribution is the crux of the problem.

--Daniel


Re: [RFC] Replace WireGuard AllowedIPs with IP route attribute

2023-08-19 Thread Daniel Gröber
Hi Bernd,

On Sat, Aug 19, 2023 at 07:50:38PM +0200, Bernd Naumann wrote:
> Chances are high I do miss something, but I've just set AllowedIPs to
> 0.0.0.0/0 and ::/0 and just used the routing protocol of my choice and
> filters to select which routes got exported and imported... :shrug:

Right, let me expand a bit. You are absolutely right, right now if you want
to use wg with dynamic routing daemons you essentially have to have one wg
tunnel per remote node with AllowedIPs=::/0 and that works just fine at
small scales.

The idea here is that we would like to go back to having just one tunnel
for all nodes involved in this particular network instead, due to general
operations scalability, mine is a mesh network so the number of tunnels
gets rather large quickly :)

Lots of tunnels suck for various reasons, monitoring if they're all up and
configured properly is one example but my understanding from previous
discussions is the performance is probably not ideal either.

--Daniel



[RFC] Replace WireGuard AllowedIPs with IP route attribute

2023-08-19 Thread Daniel Gröber
Hi wireguard, birds, and babelers,

tl;dr I want to add a new Linux route attribute (think "via $wgpeer") to
supplement wireguard's internal AllowedIPs logic for both routing and
source address filtering.

I've been pondering how to better integrate wireguard into dynamic routing
daemons, particularly BIRD and babeld. Essentially we want to be able to
dynamically add/remove AllowedIPs depending on current reachability and/or
link quality stats.

Looking at the wg netlink API I see two major efficiency/scalability
problems: 1) there is no way to be notified of changes in AllowedIPs made
by other processes meaning we have to do periodic scans and 2) a peer's
AllowedIPs set can only be replaced wholesale, not modified
incrementally. This is problematic as "someone" might, in the worst case,
want to install an entire internet routing table's worth of AllowedIPs and
the set will likely change frequently. FYI: The IPv4 table has ~1M entries
at present, yikes.

Assuming external AllowedIPs changes are infrequent occationally dumping
them all to keep a consistent view of the state shouldn't be too much of an
issue as long as the netlink interface is performant enoug, so I'm going to
concentrate on the add/remove API for now.

Instead of doing the obvious thing and adding a more efficient incremental
AllowedIPs netlink interface I figure why not just add a route attribute to
select a target wg peer on a device. That way we could not only save memory
(no separate AllowedIPs trie) but also simplify routing daemon
implementation considerably.

This would mirror how on ethernet you can have `dev eth0 via $router_ip`.
I'm still reviewing the net/ code to find the best way to do this, but I'm
thinking either a new RTA_WGPEER, like: `default dev wg0 via-wgpeer
$peer_pubkey` or perhaps re-using RTA_VIA and keying off a statically
configured AllowedIP addresses.

To start I'd make this an opt-in replacement for our usual AllowedIPs
logic, making sure to only activate it if any via* RTAs are active on a
particular device, but if it proves to work well I don't see why we
couldn't adapt the netlink code to maintain AllowedIPs using this RTA (but
invisible to userspace) to re-use the same code and get rid of allowedips.c
altogether. That's assuming this ends up being less code overall or perhaps
more performant.

Happy to hear your thoughts,
--Daniel


Re: Pending bird patches and discussions for proto/babel

2023-07-17 Thread Daniel Gröber
Hi Maria,

On Thu, Jun 01, 2023 at 05:33:42PM +0200, Maria Matejka via Bird-users wrote:
> > Lastly my v5 route selection patch should also be ready, all threads of
> > discussion should be resolved and there are no known bugs. However I've not
> > received any testing reports, hint, hint peeps °>°
> 
> OK, will look at it as well. Please send me an owl, a pigeon or another
> avian carrier compatible with RFC 1149 if we don't react in a reasonable
> time (est. a fortnight).

Just a reminder on this :)

--Daniel


Re: Recursive nexthop via kernel route in proto static not working

2023-07-04 Thread Daniel Gröber
Hi Maria,

So I just did a bird restart on the router I was testing on after poking
around some more and now the recursive route is suddenly working without a
change in config.

On Tue, Jun 27, 2023 at 09:50:14PM +0200, Maria Matejka wrote:
> this looks fishy. I tried a trivial config like this on my laptop as follows
> and it works. Could you please disclose the full table? 

My master6 just has a full-table plus kernel/device routes. For some reason
before the restart all of them were "unreachable"[1] but now they seem to
resolve via the default route now. I didn't save the dump I did before the
restart unfortunately.

[1]: Not sure why, but it doesn't matter since I just use the BGP feed on
this router as a reflector and BGP routes are not exported to the kernel.

None of the BGP routes were covering :: (I checked with `route show for
::`) so I don't really see why `recursive ::` suddenly works now. I tried
this on another one of my routers that has a similar setup and it's also
working there.

I guess I'll just keep an eye on this for now.

> There are some details in the recursive nexthop resolution algorithm
> (preventing infinite resolution loops) which may apply to your case.

Can you point me to the right bits in the code so I can have a look?

> BTW, is there really no message in the log?

I don't see any log messages, but maybe my verbosity itn's high
enough. What options do I have to twiddle to make sure?

Thanks,
--Daniel


Re: Recursive nexthop via kernel route in proto static not working

2023-06-27 Thread Daniel Gröber
Hi Alexander,

On Tue, Jun 27, 2023 at 04:48:46PM +0200, Alexander Zubkov wrote:
> Not sure, but I would guess it can be related to the local address. It
> might try to pick the first interface with such network.

> Could you try your setup with some route that has the nexthop from a
> unique subnet configured on the interface? At least to check if it will
> become reachable or not.

Ok, if I add a route like `2000::/64 dev enp1s0 via
fe80::fc00:3ff:fec7:cd05` to the kernel and recurse through that it does
work.

protocol static static_export_kernel2 {
ipv6;
igp table master6;
route 2001:db8::/64 recursive 2000::1234;
}

$ ip -6 route add 2000::/64 dev enp1s0 via fe80::fc00:3ff:fec7:cd05
$ birdc show route protocol static_export_kernel2 all
2001:db8::/64unicast [static_export_kernel2 16:01:20.131] * (200)
via fe80::fc00:3ff:fec7:cd05 on enp1s0
Type: static univ

> Or it might be that routes imported from the kernel are marked as
> recursive, so it does not resolve because Bird does not allow double
> recursion.

If I remove the 2000::/64 route the recursive route goes back to
unreachable. So something seems to be preventing recursion through the
default route specifically, but it evidently doesn't have anything to do
with the route coming from the kernel.

$ ip -6 route del 2000::/64 dev enp1s0 via fe80::fc00:3ff:fec7:cd05
$ birdc show route protocol static_export_kernel2 all
2001:db8::/64unreachable [static_export_kernel2 16:01:20.131] * 
(200)
Type: static univ

Weird.

Thanks for your suggestions,
--Daniel



Recursive nexthop via kernel route in proto static not working

2023-06-27 Thread Daniel Gröber
Hi,

I'm trying to configure a static route that uses the system's default
router. The default router is out of my control and is announced via IPv6
RA. Since the MAC might not be stable I'd prefer not to hardcode the
router's link-local address. So I tried:

protocol static static_export_kernel2 {
ipv6;
igp table master6;
route 2001:db8::/48 recursive ::;
}

The thinking being that :: ought to be resolvable through the default route
imported from proto kernel:

birdc show route for :: all
Table master6:
::/0 unicast [kernel1 2023-03-22] * (10)
via fe80::fc00:3ff:fec7:cd05 on enp1s0
Type: inherit univ
Kernel.source: 9
Kernel.metric: 1024
Kernel.hoplimit: 64

However the static route ends up being "unreachable":

2a0d:f302:114::/48   unreachable [static_export_kernel2 13:59:32.819] * 
(200)
Type: static univ

Any ideas why this isn't working?

Thanks,
--Daniel

PS: The reason I have to create this static route when it's already covered
by the default route has to do with an esoteric policy routing setup and
multiple routing daemons ask for details at your own peril :)


Re: Graceful shutdown request signal

2023-06-20 Thread Daniel Gröber
Hi Erin,

On Tue, Jun 20, 2023 at 08:20:50PM +0200, Erin Shepherd wrote:
> I run bird on a system which uses systemd as a service supervisor, and
> would like to implement graceful restart in a way which works well with
> it.

I'm also interested in getting this working. I'm wondering how graceful
restart is supposed to behave to begin with though. Last time I just tried
`birdc graceful restart` I was surprised that this actually makes bird exit
with rv=0 instead of ... well actually restarting.

Is that normal or a bug in the Debian packaging/systemd service?

I suppose if you're supposed to use graceful restart just before rebooting
the system it makes sense but at the time I just wanted to do a full
restart to update the running bird executable without causing traffic
disruption and was rather surprised when it didn't come back up.

> In particular, what I'd like to do is:
>  • If I restart the bird service (e.g. for an upgrade), Bird performs a 
> graceful restart
>  • If I manually stop bird (systemctl stop bird2), shutdown the system, or 
> any other action which is liable to cause a longer period of downtime, 
> perform a graceful restart (for BGP at least)

When you want to tickle special behaviour out of systemd I found it best to
not try and do everything in one unit. Using the dependency system you can
stop another unit before bird.service gets stopped, this can then call
birdc graceful restart, like this:

bird-graceful-restart.service:

[Unit]
After=bird.service
Requires=bird.service
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStop=birdc graceful restart
[Install]
WantedBy=bird.service

That way you can enable/disable system wide graceful bird restart by
`systemctl enable bird-graceful-restart.service`.

The BindsTo+After means stopping bird.service (which is part of restart)
will first stop bird-graceful-restart.service, which can then casually
signal bird to perform graceful restart without a race (I
think). WanteBy=bird.service is needed as ExecStop will only run when a
service has actually been started.

Also some special handling for the case where bird is already gone (crashed
or exited) and birdc would fail might be needed. Excercise for the reader :)

Doing some quick testing this actually seems to do what I want. Somewhat
unexpectedly `systemctl restart bird.servce` with the above enabled does
then actually do what you'd expect since restart is just stop+start. Bird
will exit on getting the graceful restart command so systemd can't try to
stop it some more but the subsequent start will make it come back up. Neat.

On system shutdown only the stop part will happen so bird exiting is
actully useful now and that too should work as expected.

> Ideally systemd would have an ExecStopRestart option which could be used
> to distinguish between stops and restarts, but unfortunately it doesn't
> (I might submit a feature request for this regardless). However, it does
> support `KillSignal and RestartKillSignal options.`

I've been perplexed by this in the past also. Don't even get me started on
the fact that ExecStop= runs even when a service already terminated on it's
own, to like stop it some more... :)

The semantics are really bonkers sometimes but it's actually rather
flexible if you look at the nuaunces of the different dependency types a
bit.

--Daniel


Re: [PATCH] [RFC] Babel: Implement route daming with fixed delay

2023-06-02 Thread Daniel Gröber
Hi Juliusz,

On Tue, Mar 07, 2023 at 01:20:28PM +0100, Juliusz Chroboczek wrote:
> To be honest, we hacked things until we had acceptable worst-case behaviour.
> 
> We had two networks to experiment with: Nexedi's production network
> (hundreds of tunnels over the public IPv6 Internet) and a simulated
> network we built ourselves which we believed represented the worst case (a
> bufferbloated diamond network).  We built a first prototype, which we
> instrumented to log RTT samples and route flaps, and noticed three things:
> 
> 1. in the production network, the RTT signal is noisy (see figures 4 and 6
>of [1]);
> 
> 2. in the bufferbloated diamond network, when we switch away from
>a congested route, we switch back too early, before the buffers have
>had time to drain;
> 
> 3. in the diamond network, we tend to switch routes as we oscillate around
>a common value.

I'm working on comparing the exponential filter behaviour to my damping
approach, I was wondering if you still have any scripts/notes on how the
simulations for Fig 10 and 11 were setup?

> References:
> 
> [1] https://arxiv.org/pdf/1403.3488.pdf


> Hence, the three mechanisms:
> 
> 1. smoothing of the RTT, to makes the signal less noisy; the smoothing is
>exponential just because it's easy to implement;
> 
> 2. saturating map from RTT to metric, so that congested routes all appear
>equally bad, and we don't switch back before the buffers drain; 

Re-reading this, was the time until a route is reconsidered made dependent
on the metric/RTT difference on purpose to get this draining behaviour?

While my timer approach doesn't mirror this currently I'm not opposed to
it, I just wonder if metric difference is the right variable here.

Thanks,
--Daniel


Re: Pending bird patches and discussions for proto/babel

2023-06-01 Thread Daniel Gröber
Hi Maria,

On Thu, Jun 01, 2023 at 05:33:42PM +0200, Maria Matejka via Bird-users wrote:
> > Toke's first two patches are ready for merging, but the third (metric
> > smoothing) is contested since it conflicts with my route selection patch in
> > its current form and if adapted it conflicts with my scalability goals for
> > babel instead. Possibly premature optimization admittedly ;)
> 
> We generally thought that the patches were still kinda work in progress,
> sorry if our non-responsivity could be interpreted as negligence.

No worries, I figured that's what was going on.

> > That being said since RTT metric is a new feature anyway I would prefer we
> > just go with my route damping approach, mark RTT as experimental and get
> > some more wider testing through a released version.
>
> That looks like a feasible way to go. Shall we then just try to merge these
> patches as they have been sent to the list?

I have some more changes to make to the dampening patch, but if we're in
agreement that this direction is the way to go I'll get a v1 out ASAP.

You can merge Toke's patches independently of that though. RTT without
smoothing is still useful :)

> > Lastly my v5 route selection patch should also be ready, all threads of
> > discussion should be resolved and there are no known bugs. However I've not
> > received any testing reports, hint, hint peeps °>°
> 
> OK, will look at it as well. Please send me an owl, a pigeon or another
> avian carrier compatible with RFC 1149 if we don't react in a reasonable
> time (est. a fortnight).

I'm afraid quality-of-life concerns for the carriers involved prevent this
as a method of delivery. Perhaps once RFC 7511 is more widely deployed?

I'd be open to sending a reminder mail over conflcit-free silica
transmission line though :D

Chirp-chirp,
--Daniel


Pending bird patches and discussions for proto/babel

2023-06-01 Thread Daniel Gröber
Hello birds,

I just thought I'd put together an overview of as yet unresolved
proto/babel patches and discussions to stimulate some activity :)

Toke's RTT series:

- lib/timer: Add current_time_now() function for immediate timestamp
- babel: Add support for the RTT extension
- babel: Add route metric smoothing

My route selection revamp:

- [v5] Babel: Replace internal route selection by bird's nest
- [DRAFT] Babel: Implement route daming with fixed delay(s/daming/damping/)

Toke's first two patches are ready for merging, but the third (metric
smoothing) is contested since it conflicts with my route selection patch in
its current form and if adapted it conflicts with my scalability goals for
babel instead. Possibly premature optimization admittedly ;)

Initially Toke was playing with the idea of exposing the smoothed metric to
route filters in a followup patch. Me and Juliusz oppose this, for
different reasons and the consensus is to not do that.

My stance is that we could choose to merge smoothing as long as we don't
expose it's implementation internals (the smoothed metric) to users. This
way we can replace it with a more performant approach later.

That being said since RTT metric is a new feature anyway I would prefer we
just go with my route damping approach, mark RTT as experimental and get
some more wider testing through a released version.

Lastly my v5 route selection patch should also be ready, all threads of
discussion should be resolved and there are no known bugs. However I've not
received any testing reports, hint, hint peeps °>°

--Daniel


Re: Radv proto sending adverts on wrong interface

2023-03-13 Thread Daniel Gröber
Hi Kees,

On Mon, Mar 13, 2023 at 07:23:17AM +0100, Kees Meijs | Nefos wrote:
> About VLAN configuration: I guess one should never ever use VLAN_DEFAULT
> c.q. VLAN 1 at all. Vendors often think differently about this case,
> sometimes allowing to have a .1Q tag, sometimes not. Or sort of "both" as
> well. In combination with protocols such as LACP or (variations on) STP
> resulting in great fun, which of course is very sarcastic.

I changed the default using `default-vlan-id 4095` for that very
reason. Perhaps this is still a default vid related bug. I'll see if I can
reproduce this with some other vid.

> My two cents after opening up a box with a new switch, starting your
> configuration: explicitly disable VLAN1 and continue using others. Same goes
> for VLAN > 4090 by the way. These numbers are sometimes "reserved".

The brocade docs list the reserved vlans (4091, 4092) explicitly and allow
to change them like the default vid. The 4095 default vid I'm using is the
example they use in the docs so I'd be surprised if that's broken.

Using such low vlan ids is a holdover from using some ath9k wifi APs that
only supported 4-bit vids so we had to make every number count ;) I'll
definetly reconsider that in the future haha.

Thanks,
--Daniel


[PATCH] Fix sk sending link-local scope multicast on wrong interface

2023-03-10 Thread Daniel Gröber
When radv sends to the ipv6 all-nodes multicast address (ff02::1) we don't
set sin6_scope_id because ipa_is_link_local isn't satisfied. This will
cause packets to be sent via the wrong interface.

To fix this we use ip6_classify, which returns SCOPE_LINK for unicast
fe80::/64 as well as multicast ff02::/8 destinations.
---
Tested with bird 2.0.10 on Debian 11.

 sysdep/unix/io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sysdep/unix/io.c b/sysdep/unix/io.c
index e131ca41..f77b4a51 100644
--- a/sysdep/unix/io.c
+++ b/sysdep/unix/io.c
@@ -229,7 +229,7 @@ sockaddr_fill6(struct sockaddr_in6 *sa, ip_addr a, struct 
iface *ifa, uint port)
   sa->sin6_flowinfo = 0;
   sa->sin6_addr = ipa_to_in6(a);
 
-  if (ifa && ipa_is_link_local(a))
+  if (ifa && ip6_classify(&a) & SCOPE_LINK)
 sa->sin6_scope_id = ifa->index;
 }
 
-- 
2.30.2



Radv proto sending adverts on wrong interface

2023-03-10 Thread Daniel Gröber
Hi,

I'm using bird as a replacement for radvd since the latter has a
longstanding issue with sending adverts on unconfigured interfaces under
complex conditions.

Turns out bird has a similar issue :)

Looking at the code, when opening the socket for an interface in
radv_sk_open we set sk->iface as you'd expect, which should cause packets
to be sent directly via this interface.

However radv sends packets to the all-nodes multicasts address for periodic
adverts, see radv_send_ra. This then calls sk_send_to which (eventually)
calls sockaddr_fill6. Here rther we find this code:

if (ifa && ipa_is_link_local(a))
  sa->sin6_scope_id = ifa->index;

This would seem to be the problem to me, since a=ff02::1 doesn't pass this
check so the sendmsg call goes out without the interface-index being
communicated to the kernel. It looks like that causes it to just pick a
random interface though I would expect it to just multicast this. Not sure
why that is but it's broken on our side either way.

I'd add a check here for whether the saddr is link-local too. That should
cover this case.

Any comments/objections?

Thanks,
--Daniel


[PATCH] [RFC] Babel: Implement route daming with fixed delay

2023-03-02 Thread Daniel Gröber
In order to prevent RTT based routing from causing persistent traffic
oscillations we delay core rte announcement of each prefix by a
configurable but metric invariant amount of time.

Initial announcements and withdrawals will go through undelayed but a
chnage in route for a prefix will cancel the previous announcement and
reset the delay.
---

This is just quickly thrown together to demonstrate the idea of how to
replace metric smoothing with something more simplistic. Obviously the
performance of using a linked list for keeping the scheduled
announcements is suboptimal, our binary heap could be used instead.

I'm not quite sure all the logic works out but let's discuss the
concept before spending more time on it :)

This could be extended to make the delay depend on metric difference
if it turns out that's necessary I just skipped it for simplicity.

 proto/babel/babel.c | 79 +++--
 proto/babel/babel.h | 11 +--
 2 files changed, 86 insertions(+), 4 deletions(-)

diff --git a/proto/babel/babel.c b/proto/babel/babel.c
index 2243fc89..d6d99a37 100644
--- a/proto/babel/babel.c
+++ b/proto/babel/babel.c
@@ -62,6 +62,7 @@ static inline int gt_mod64k(uint a, uint b)
 
 static void babel_expire_requests(struct babel_proto *p, struct babel_entry 
*e);
 static void babel_announce_rte(struct babel_proto *p, struct babel_entry *e, 
struct babel_route *r);
+static void babel_sched_announce_rte(struct babel_proto *p, struct babel_entry 
*e, struct babel_route *r);
 static void babel_rte_update_unreachable(struct babel_proto *p, struct 
babel_entry *e, u8 announce);
 static void babel_send_route_request(struct babel_proto *p, struct babel_entry 
*e, struct babel_neighbor *n);
 static void babel_send_seqno_request(struct babel_proto *p, struct babel_entry 
*e, struct babel_seqno_request *sr, struct babel_neighbor *n);
@@ -653,7 +654,7 @@ done:
 WALK_LIST2(r, n, nbr->routes, neigh_route)
 {
   r->metric = babel_compute_metric(nbr, r->advert_metric);
-  babel_announce_rte(p, r->e, r);
+  babel_sched_announce_rte(p, r->e, r);
 }
   }
 }
@@ -772,6 +773,50 @@ babel_announce_rte(struct babel_proto *p, struct 
babel_entry *e, struct babel_ro
   }
 }
 
+/*
+ * When route damping is enabled we have to delay updates to the kernel FIB. We
+ * do this by queueing some updates before actually announcing them to core.
+ */
+static void
+babel_sched_announce_rte(struct babel_proto *p, struct babel_entry *e, struct 
babel_route *r)
+{
+  struct babel_config *cf = (void *) p->p.cf;
+
+  if (!cf->route_damping ||
+  !e->selected ||
+  e->selected->metric < r->metric ||
+  /*^ When the best route's metric is worse than the one we're about to
+   * announce it's safe to do so immediately even if route damping is on as
+   * it won't get exported to most other protocols. Note add-path BGP is
+   * still notified but we're not concerned about micromanaging the
+   * stability of exotic setups, only our local kernel FIB's impact. */
+  r->metric == BABEL_INFINITY || !r->feasible)
+  /*^ This checks if announce_rte will do a withdrawal. */
+  {
+if (e->scheduled) {
+  rem_node(&e->sched);
+  e->scheduled = NULL;
+}
+
+babel_announce_rte(p, r->e, r);
+  }
+  else
+  {
+struct babel_iface *ifa = r->neigh->ifa;
+
+if (!e->scheduled)
+  add_tail(&ifa->sched_rtes, &e->sched);
+e->scheduled = r;
+
+r->sched_time = current_time() + cf->route_damping;
+
+if (!ifa->want_sched) {
+  ifa->next_sched = r->sched_time;
+  ifa->want_sched = 1;
+}
+  }
+}
+
 /*
  * Functions to send replies
  */
@@ -1371,7 +1416,7 @@ babel_handle_update(union babel_msg *m, struct 
babel_iface *ifa)
 e->updated = current_time();
   }
 
-  babel_announce_rte(p, e, r);
+  babel_sched_announce_rte(p, e, r);
 }
 
 void
@@ -1683,6 +1728,32 @@ babel_iface_timer(timer *t)
 p->triggered = 0;
   }
 
+  if (ifa->want_sched && now_ >= ifa->next_sched) {
+TRACE(D_EVENTS, "Announcing scheduled rtes on %s", ifa->ifname);
+
+struct babel_entry *e, *ex;
+
+ifa->next_sched = TIME_INFINITY;
+
+u8 all_announced = 1;
+WALK_LIST_DELSAFE(e, ex, ifa->sched_rtes) {
+  struct babel_route *r = e->scheduled;
+  ASSERT(r);
+  if (now_ >= r->sched_time) {
+   rem_node(&e->sched);
+   babel_announce_rte(p, e, r);
+  } else {
+   all_announced = 0;
+   if (r->sched_time < ifa->next_sched)
+ ifa->next_sched = r->sched_time;
+  }
+}
+
+if (all_announced) {
+  ifa->want_sched = 0;
+}
+  }
+
   btime next_event = MIN(ifa->next_hello, ifa->next_regular);
   if (ifa->want_triggered) next_event = MIN(next_event, ifa->next_triggered);
   tm_set(ifa->timer, next_event);
@@ -1862,6 +1933,8 @@ babel_add_iface(struct babel_proto *p, struct iface *new, 
struct babel_iface_con
   init_list(&ifa->msg_queue);
   ifa->send_event = ev_new_init(ifa->pool, babel_send_queue, ifa)

Re: [PATCH 0/3] babel: Add support for the RTT extension

2023-03-02 Thread Daniel Gröber
Hi Maria,
Hi Toke,

On Tue, Feb 28, 2023 at 02:07:06PM +0100, Maria Matejka via Bird-users wrote:
> > > I think it's probably simpler to just re-announce any route that's still
> > > converging every time we go through the routing table.
> > 
> > Simpler, yes, but I do want to be able to maintain internet scale routing
> > tables through babel eventually so slashing every little bit helps :)
> 
> In version 2, update of non-best route is propagated only to some protocols
> like pipes, add-path BGPs and alike.

Ok, that's good for either approach.

> In version 3, this is even more smoothed as all updates of one prefix are
> exported asynchronously to each protocol, being notified after your Babel
> ends the task (event, socket, timer), dampening best route oscillation or
> other flaps.

I don't quite understand why this would damp oscillations? Do you mean
there's explicit route flap damping support in v3 or just that this is a
side-effect of the new async world? I'd like to know more about either :)

If we do have actual BGP style damping in nest in v3 I'm not sure there's
much point in doing essentially the same thing in our proto. At the very
least that would be a good reason to keep the babel specific daming easy to
remove if it's about to become superceeded by direct nest support anyway.

> This way, I'm not so scared about Babel periodically updating many routes.
> BIRD has to withstand it.

I still think doing uneccessary work/computation is just dumb if we can
avoid it :)


On Tue, Feb 28, 2023 at 04:45:35PM +0100, Toke Høiland-Jørgensen wrote:
> Right, sure, that's a nice property, but I'm not actually sure how what
> you sketched out above accomplishes that?

Don't worry I'll send an RFC patch soon if I can make it work out, just got
tied up by some (mild) covid.

> > Simpler, yes, but I do want to be able to maintain internet scale routing
> > tables through babel eventually so slashing every little bit helps :)
> 
> Heh, yeah, I would like to eventually be able to do that as well, but I
> think there are other optimisations we need to do first. For instance,
> walking the entire routing table every second is not going to work in
> the first place in this case :)

True, but might as well throw myself at this RTT stuff while I have the
time and energy. Large scale route table performance testing will have to
wait for another day since there's not much point making it performant if
the features I want/need aren't supported (and performant! haha).

> >> Bear in mind that the currently selected route can also be converging, so
> >> predicting when two routes "cross" gets complicated quickly. Simpler to
> >> just do a periodic update and redo the comparison every time this update
> >> happens.
> >
> > I feel like that's an artifact specific to the "metric smoothing" approach
> > to route dampening not a feature though. The way I see it the behaviour we
> > really want is to delay any change in selected route for a time related to
> > the metric difference.
> >
> > Think back to what the purpose of the metric smoothing is in the first
> > place: to limit oscillations of the selected route, which this will do just
> > as well.
> 
> I'm OK with finding another solution, but I think you're going to have
> to explain in more detail how what you propose actually represents such
> a solution, then :)

Will do, I've been looking throug some network stability under dynamic
routing literature to see if there's any well founded science we can apply
here. Haven't really found anything good yet.

The RTT paper does admit that "we lack an in-depth theoretical
understanding of the performance of our algorithm, in particular of its
stability." ;)

There is one thing I'm unsure about: does the delay before propagating a
route change to the kernel FIB actually have to depend on the metric
difference to provide the network stability properties we're looking for?

I think just strictly for stability a fixed delay should be fine, despite
not being optimal in terms of convergence time.

> > I don't agree with that. It's not as if I want per-hop information. Just a
> > sum of RTTs along the path and a sum of administrative metric along the
> > path rather than have those jumbled together into one number.
> >
> > Since babel is quite flexible in the actual metric math that would allow
> > interesting ways of weighing each metric component rather than just having
> > everything be linear.
> 
> It also introduces dependencies, though. I.e., with the current approach
> you can have a subset of the routers speak the RTT extension, and other
> parts of the network will just have that incorporated into the metric.
> Whereas if it is carried as a separate metric your entire network has to
> know about the extension for it to be useful.

I don't see why you couldn't do both. Incorporate the rtt (or other
measures) into the metric for oblivious nodes and expose optional TLVs for
ones that care about the different components.

> > Fo

DHCPv6 PD server as client route originator (Was: iBGP RR IPv6 link-local next-hop not kept)

2023-02-28 Thread Daniel Gröber
Hi Mirai,

On Tue, Feb 28, 2023 at 02:57:54PM -0500, Mirai Azayaka wrote:
> I am trying to send routes from my DHCPv6 prefix delegation server to
> my router using iBGP. Those delegated prefix routes on the DHCPv6
> server are installed in its kernel table, such as 2001:db8:db8::/56
> via . 

Wow, that's actually a really interesting way to deploy DHCPv6 PD! I've
been researching how to deploy PD outside of the proprietary vendor
sandboxes recently (mainly still focusing on using DHCPv6 relays with route
snooping), but haven't found any good free-software options so far. Perhaps
your approach could be more viable for a small network like mine.

Would you mind elaborating on your setup a bit more?

Thanks,
--Daniel (AS212704)


Re: [PATCH 0/3] babel: Add support for the RTT extension

2023-02-27 Thread Daniel Gröber
Hi Toke,

On Mon, Feb 27, 2023 at 12:16:01PM +0100, Toke Høiland-Jørgensen wrote:
> >> - Add the smoothed metric as a new route attribute (so it's also
> >>   available to filters)
> >
> > I think doing that is a bad idea. If we keep filters from changing this we
> > might just be able to optimize by only announcing smooted metric updates
> > when the resulting route would actually be better than the currently
> > selected one.
> >
> > If we let filters meddle with this however that becomes impossible. Unless
> > you were thinking of a r/o attribute?
> 
> I meant read-only. I agree that the filters shouldn't be able to modify
> the smoothed metric. 

I've thought about this some more, I think we absolutely shouldn't expose
the smooted metric to filters. It's an implementation detail. There's a
bunch of other valid ways to implement this sort of dampening/hysteresis no
reason to expose it to filter users really.

How would you even use this for any practical filtering anyway? Reject
routes based on `smoothed_metric - metric > threshold` or even better,
increase metric based on smoothed_metric difference (oscillations galore
haha). Hardly seems useful to me :)

AFAICT it's perfectly fine to have EAs that filters can't access though,
that's the situation we're currently in for the router-id after all so
that's not really a blocker.

I've been playing with rebasing your smoothing patch on mine now and I see
a couple of options. My favorite so far is strategically replacing relevant
babel_announce_rte calls with this:

static void
babel_sched_announce_rte(struct babel_proto *p, struct babel_route *r)
{
  struct babel_config *cf = (void *) p->p.cf;

  if (!cf->metric_decay || !e->selected ||
  e->selected->metric < r->metric)
  /* When best route's metric is less than the one we're about to 
announce
   * it's safe to do so immediately even if metric smoothing is on. Note
   * e->selected takes export filters into account. */
babel_announce_rte(p, r->e, r);
  else
e->scheduled = r;
}

The clever bit is that announcing a route that's no better than the
selected one will have no impact on route selection. Same applies if
another protocol's route is active (i.e. when e->selected is NULL).

Only think I'm still missing is a way to reframe the metric smoothing math
such that I can install a timer for when the next scheduled route update
should go through, though perhaps polling in the global timer hook is fine?

> Another question is if the smoothing should incorporate any changes to
> the metric that the filters do? It probably should, right?

I don't think that'll give us very useful behaviour TBH. Perhaps the way to
look at it is that we're not trying to smooth the (administrative) metric
but rather only the link cost component which is the bit that is actually
fluctuating? Problem is we only have these as seperate numbers for our
local metrics/costs not the ones from other nodes.

Maybe we need a protocol extension to seperate these two concepts, hmm. I
have been thinking it would be nice to have seperate rtt and administrative
metric numbers just for network debugging, perhaps it would make sense for
general operation too.

> That may be OK, but I'm not actually sure how the nest comparison works
> (when we announce a new route, does the rte_better function get called
> pairwise against every other candidate with the same prefix, or just
> against the current best?).

>From adding some quick tracing code it certainly looks like the comparison
is only against the current best. There's only ever a couple of rte_better
calls per prefix once things are initialized.

Also see above for how "lovely" all the necessary lookups are going to be
if we go down this path, haha.

--Daniel


Re: [PATCH 3/3] babel: Add route metric smoothing

2023-02-27 Thread Daniel Gröber
Hi Toke,

See below.

On Sun, Feb 26, 2023 at 11:10:06PM +0100, Toke Høiland-Jørgensen via Bird-users 
wrote:
> The Babel RTT extension employs metric smoothing to dampen route
> oscillations in the face of varying RTT values between two peers[0].
> 
> This patch implements such dampening in Bird, roughly following the
> implementation in babeld (i.e., using the same exponential function
> definition). The main difference is that we calculate everything in the
> native Bird microsecond time unit (and increase constants accordingly), and
> that we split out the smoothed metric calculation in two function variants,
> one that has no side effects and one that does.
> 
>   [0] https://arxiv.org/pdf/1403.3488.pdf
> 
> Signed-off-by: Toke Høiland-Jørgensen 
> ---
>  doc/bird.sgml|  12 +
>  proto/babel/babel.c  | 121 ---
>  proto/babel/babel.h  |  16 ++
>  proto/babel/config.Y |  10 +++-
>  4 files changed, 150 insertions(+), 9 deletions(-)
> 
> diff --git a/doc/bird.sgml b/doc/bird.sgml
> index 451dff4031d5..82f0ded13133 100644
> --- a/doc/bird.sgml
> +++ b/doc/bird.sgml
> @@ -1915,6 +1915,7 @@ protocol babel [] {
>   ipv4 {  };
>   ipv6 [sadr] {  };
>  randomize router id ;
> +metric decay ;
>   interface  {
>   type ;
>   rxcost ;
> @@ -1965,6 +1966,17 @@ protocol babel [] {
>router ID every time it starts up, which avoids this problem at the 
> cost
>of not having stable router IDs in the network. Default: no.
>  
> +  metric decay 
> +  The metric smoothing decay time. When route metrics vary (because of
> +  varying quality of a wireless link, or varying RTT when timestamps are
> +  enabled), Babel applies an exponential smoothing procedure to the 
> metric
> +  to dampen route oscillations. This parameter specifies the half-life of
> +  the convergence of the smoothed metric to the actual metric of the 
> route.
> +  I.e., the distance between the smoothed and the actual metric will be
> +  halfed for each time period specified here (until they converge). Set 
> to 0
> +  to disable metric smoothing; if set, the value must be in the interval
> +  1-180 s. Default: 4 s
> +
>type wired|wireless|tunnel 
>This option specifies the interface type: Wired, wireless or tunnel. On
>wired interfaces a neighbor is considered unreachable after a small 
> number
> diff --git a/proto/babel/babel.c b/proto/babel/babel.c
> index 04613788303c..a95e37f5c413 100644
> --- a/proto/babel/babel.c
> +++ b/proto/babel/babel.c
> @@ -144,6 +144,103 @@ babel_expire_sources(struct babel_proto *p UNUSED, 
> struct babel_entry *e)
>}
>  }
>  
> +static u16
> +babel_calc_smoothed_metric(struct babel_proto *p, struct babel_route *r, u8 
> update)
> +{
> +  struct babel_config *cf = (void *) p->p.cf;
> +  uint metric = r->metric, smoothed_metric = r->smoothed_metric;
> +  btime smoothed_time = r->smoothed_time, now = current_time();
> +
> +  if (!cf->metric_decay || metric == BABEL_INFINITY ||
> +  metric == smoothed_metric || !smoothed_time)
> +  {
> +smoothed_metric = metric;
> +smoothed_time = now;
> +goto out;
> +  }
> +
> +  int diff = metric - smoothed_metric;
> +
> +  /*
> +   * The decay defines the half-life of the metric convergence, so first 
> iterate
> +   * in halving steps
> +   */
> +  while (smoothed_time < now - cf->metric_decay && diff) {
> +smoothed_metric += diff/2;
> +smoothed_time += cf->metric_decay;
> +diff = metric - smoothed_metric;
> +  }
> +
> +  /*
> +   * Then, update remainder in BABEL_SMOOTHING_STEP intervals using the
> +   * exponential function (approximated via the pre-computed reciprocal).
> +   */
> +  while (smoothed_time < now - BABEL_SMOOTHING_STEP && diff) {
> +smoothed_metric += (BABEL_SMOOTHING_STEP * diff *
> +(cf->smooth_recp - BABEL_SMOOTHING_UNIT) / 
> BABEL_SMOOTHING_UNIT);
> +smoothed_time += BABEL_SMOOTHING_STEP;
> +diff = metric - smoothed_metric;
> +  }
> +
> +  /* Consider the metric converged once we're close enough */
> +  if (ABS(diff) < BABEL_SMOOTHING_MIN_DIFF)
> +smoothed_metric = metric;
> +
> +out:
> +  if (update) {
> +r->smoothed_metric = smoothed_metric;
> +r->smoothed_time = smoothed_time;
> +  }
> +
> +  return smoothed_metric;
> +}
> +
> +static u16
> +babel_update_smoothed_metric(struct babel_proto *p, struct babel_route *r)
> +{
> +  if (!r->metric)
> +return 0;
> +
> +  if (!r->smoothed_metric) {
> +r->smoothed_metric = r->metric;
> +r->smoothed_time = current_time();
> +return r->smoothed_metric;
> +  }
> +
> +  u16 smoothed = babel_calc_smoothed_metric(p, r, 1);
> +  DBG("Updated smoothed metric for prefix %N: router-id %lR metric %d/%d\n",
> +  r->e->n.addr, r->router_id, r->metric, smoothed);
> +
> +  return smoothed;
> +}
> +
> +static u16
> +babel_smoothed_metric(struct babel_prot

Re: Babel: Clarifications on seqno request handling in bird

2023-02-27 Thread Daniel Gröber
On Mon, Feb 27, 2023 at 07:25:13PM +0100, Juliusz Chroboczek wrote:
> > That's a bug in the new RFC text then ;)
> 
> Agreed.  https://www.rfc-editor.org/how-to-report/

Done :)

--Daniel


[PATCH v5] Babel: Replace internal route selection by bird's nest

2023-02-27 Thread Daniel Gröber
This introduces the ability to filter routes from specific interfaces and
neighbours. With the current internal route selection proto babel exports
only up to one route and an admin cannot do fine-grained filtering.

To fix this we rip out the internal route selection entirely and export
evey babel route as a distinct route into the bird's nest instead.

In the presence of esoteric route filters this change may be observable by
users however we feel the additional filtering power is more important. A
release note entry for this change is recommended.
---
Changes in v5:
  - Fix crash due missing rt_lock_source from rte_src pruning
  - Fix broken rt_notify retraction logic (!internal) which would just crash

 proto/babel/babel.c | 285 
 proto/babel/babel.h |   3 +
 2 files changed, 135 insertions(+), 153 deletions(-)

diff --git a/proto/babel/babel.c b/proto/babel/babel.c
index 04613788..c030784a 100644
--- a/proto/babel/babel.c
+++ b/proto/babel/babel.c
@@ -28,10 +28,12 @@
  * possible routes for the prefix are tracked as &babel_route entries and the
  * feasibility distance is maintained through &babel_source structures.
  *
- * The main route selection is done in babel_select_route(). This is called 
when
- * an entry is updated by receiving updates from the network or when modified 
by
- * internal timers. The function selects from feasible and reachable routes the
- * one with the lowest metric to be announced to the core.
+ * The main route selection is done by bird's nest (heh). For each prefix we
+ * export all feasible routes to the core with a distinct source (rte_src) per
+ * neihbour so bird handles them as distinct routes. Nest will then notify us 
of
+ * one best route for each prefix, either our own (internal) or from another
+ * protocol, by calling babel_rt_notify. For internal best routes we remember
+ * which babel_route it selected in babel_entry.selected.
  *
  * Supported standards:
  * RFC 8966 - The Babel Routing Protocol
@@ -59,8 +61,8 @@ static inline int gt_mod64k(uint a, uint b)
 { return ge_mod64k(a, b) && a != b; }
 
 static void babel_expire_requests(struct babel_proto *p, struct babel_entry 
*e);
-static void babel_select_route(struct babel_proto *p, struct babel_entry *e, 
struct babel_route *mod);
-static inline void babel_announce_retraction(struct babel_proto *p, struct 
babel_entry *e);
+static void babel_announce_rte(struct babel_proto *p, struct babel_entry *e, 
struct babel_route *r);
+static void babel_rte_update_unreachable(struct babel_proto *p, struct 
babel_entry *e, u8 announce);
 static void babel_send_route_request(struct babel_proto *p, struct babel_entry 
*e, struct babel_neighbor *n);
 static void babel_send_seqno_request(struct babel_proto *p, struct babel_entry 
*e, struct babel_seqno_request *sr, struct babel_neighbor *n);
 static void babel_update_cost(struct babel_neighbor *n);
@@ -178,9 +180,7 @@ static inline void
 babel_retract_route(struct babel_proto *p, struct babel_route *r)
 {
   r->metric = r->advert_metric = BABEL_INFINITY;
-
-  if (r == r->e->selected)
-babel_select_route(p, r->e, r);
+  babel_announce_rte(p, r->e, r);
 }
 
 static void
@@ -192,9 +192,6 @@ babel_flush_route(struct babel_proto *p UNUSED, struct 
babel_route *r)
   rem_node(NODE r);
   rem_node(&r->neigh_route);
 
-  if (r->e->selected == r)
-r->e->selected = NULL;
-
   sl_free(r);
 }
 
@@ -210,6 +207,7 @@ babel_expire_route(struct babel_proto *p, struct 
babel_route *r)
   {
 r->metric = r->advert_metric = BABEL_INFINITY;
 r->expires = current_time() + cf->hold_time;
+babel_announce_rte(p, r->e, r);
   }
   else
   {
@@ -239,8 +237,6 @@ babel_expire_routes_(struct babel_proto *p, struct fib 
*rtable)
 loop:
   FIB_ITERATE_START(rtable, &fit, struct babel_entry, e)
   {
-int changed = 0;
-
 WALK_LIST_DELSAFE(r, rx, e->routes)
 {
   if (r->refresh_time && r->refresh_time <= now_)
@@ -248,23 +244,12 @@ loop:
 
   if (r->expires && r->expires <= now_)
   {
-   changed = changed || (r == e->selected);
+   FIB_ITERATE_PUT(&fit);
babel_expire_route(p, r);
+   goto loop;
   }
 }
 
-if (changed)
-{
-  /*
-   * We have to restart the iteration because there may be a cascade of
-   * synchronous events babel_select_route() -> nest table change ->
-   * babel_rt_notify() -> rtable change, invalidating hidden variables.
-   */
-  FIB_ITERATE_PUT(&fit);
-  babel_select_route(p, e, NULL);
-  goto loop;
-}
-
 /* Clean up stale entries */
 if ((e->valid == BABEL_ENTRY_STALE) && ((e->updated + cf->hold_time) <= 
now_))
   e->valid = BABEL_ENTRY_DUMMY;
@@ -273,7 +258,7 @@ loop:
 if (e->unreachable && (!e->valid || (e->router_id == p->router_id)))
 {
   FIB_ITERATE_PUT(&fit);
-  babel_announce_retraction(p, e);
+  babel_rte_update_unreachable(p, e, 0);
   goto loop;
 }
 
@@ -457,6 +442,8 @@ babel_get

Re: Babel: Clarifications on seqno request handling in bird

2023-02-27 Thread Daniel Gröber
Hi Juliusz,

On Mon, Feb 27, 2023 at 01:58:21PM +0100, Juliusz Chroboczek wrote:
> > I don't think RFC8966 is really framed in bird's "multi protocol" mindset
> 
> See the beginning of Section 3.7, which describes how a route
> redistributed from another protocol has router-id set to the local
> router's id.  Babel updates for the same prefix are processed as usual,
> with the routes inserted and maintained in Babel's route table but not
> necessarily installed into the kernel's forwarding table.

Right, what was missing from my thinking was the realization that sending
seqno requests for routes originated by us in this way doesn't really make
sense :)

> > so it's unclear to me whether this is something we have to fix or
> > not.
> 
> I'm not familiar with BIRD's internals, so I don't know.  There are two
> possible approaches:
> 
>   - drop all Babel routes for the static prefix ; or
>   - maintain routes as usual, but don't select them.
> 
> The latter is what babeld does (see xroute.c).

I belive my current rt_notify approach also has this behaviour, so that
should be fine then.

> >> A node that has lost all feasible routes to a given destination but still
> >> has unexpired unfeasible routes to that destination MUST send a seqno
> >> request;
> > 
> > I could for example read this as the above mentioned static route
> > constituting a feasible route received from a bird "internal" babel
> > neighbour which would make the behaviour described above perfectly fine,
> > no?
> 
> It is always safe to send a seqno request.  However, unless it is known
> that there is a Babel speaker at that address, it's pointless: a non-Babel
> node will not know how to react to a Babel message.
> 
> However, if there's a static route that's selected, there's not much point
> maintaining the feasibility of the Babel route; just let it become
> unfeasible, and optionally send a request when the static route is
> retracted.

When the static route is retracted we'll only send a seqno request when we
receive an unfeasable update. The problem in the implementation is we don't
know the router-id/seqno for the request anymore at this point.

I think this is fine since we're still sending seqno requests for any
unfeasable updates received while the static route is installed, so sending
another one when that is retracted shouldn't make a difference.

> > Further, looking at the code implementing RFC section 3.8.2.2 in
> > babel_handle_update:
> > 
> > if (!feasible && (metric != BABEL_INFINITY) &&
> > (!best || (r == best) || (metric < best->metric)))
> > /*^ */
> >   babel_generate_seqno_request(p, e, s->router_id, s->seqno + 1, nbr);
> >
> > It's unclear to me what the (!best) condition corresponds to from the RFC
> > text.
> 
> That used to be clear in RFC 6126 (Section 3.8.2.2):
> 
> Additionally, a node SHOULD send a unicast seqno request whenever it
> receives an unfeasible update from a currently unselected neighbour
> that is "good enough", i.e., that would lead to the received route
> becoming selected were it feasible.
> 
> It became less clear in RFC 8966:
> 
> Additionally, since metric computation does not necessarily coincide
> with the delay in propagating updates, a node might receive an
> unfeasible update from a currently unselected neighbour that is
> preferable to the currently selected route (e.g., because it has
> a much smaller metric); in that case, the node SHOULD send a unicast
> seqno request to the neighbour that advertised the preferable update.
>
> It is definitely a good idea to send a request in this case, especially if
> you ignore unfeasible updates for unknown routes (first clause of Section
> 3.5.3).

That's a bug in the new RFC text then ;)

Actually I also missed another reason for the !best condition: to prevent
(metric < best->metric) from segfaulting.

Thanks,
--Daniel


Re: [PATCH 0/3] babel: Add support for the RTT extension

2023-02-26 Thread Daniel Gröber
Hi Toke,

On Mon, Feb 27, 2023 at 12:14:23AM +0100, Toke Høiland-Jørgensen wrote:
> > To clarify: it's really only the metric smoothing patch that's in conflict
> > with my patch. I would advocate for merging only the other two patches for
> > now while we figure out how to rework the smoothing on top of my patch. I'm
> > happy to do the rework we just need to come up with a plan for that :)
> 
> Hmm, I think the way to handle this is basically:
> 
> - Add the smoothed metric as a new route attribute (so it's also
>   available to filters)

I think doing that is a bad idea. If we keep filters from changing this we
might just be able to optimize by only announcing smooted metric updates
when the resulting route would actually be better than the currently
selected one.

If we let filters meddle with this however that becomes impossible. Unless
you were thinking of a r/o attribute?

> - Change babel_rte_better() to incorporate the smoothed metric (from the
>   attribute) in its comparison
> - Change the decay logic to be timer-based instead of calculating the
>   smoothed metric on demand
> 
> That last bit is probably the biggest change. We can't really do the
> cached on-demand calculation of the smoothed metric if we're sticking it
> in an attribute.

Can you elaborate on why you think that's not possible? The way I see it
there isn't much harm in the smooted metric attribute being outdated so
long as we ensure we announce an update if an actually better route comes
along.

Suggesting this kind of seems like we're brining back the select_route
function, but I think we can do it in a way that only looks at the route
that changed without iterating over all of them, the nest will do that bit
for us now.

--Daniel


Re: [PATCH v4] Babel: Replace internal route selection by bird's nest

2023-02-26 Thread Daniel Gröber
Hi,

I've observed the following crash in DEBUGGING mode due to src->proto being
poisoned, so I think the neighbour must have been flushed too soon somehow.

#0  rte_recalculate (c=0x556e1970, net=0x77fc57e0, 
new=0x7703cce0, src=0x77fc8370)
at nest/rt-table.c:1392
#1  0x555dd72f in rte_update2 (c=0x556e1970, n=0x77fc41f0, 
new=0x7703cce0, 
src=0x77fc8370) at nest/rt-table.c:1636
#2  0x555e8e69 in babel_announce_rte (p=0x556e1700, 
e=0x77fc4168, r=0x7736d980)
at proto/babel/babel.c:726
#3  0x555e8a2e in babel_update_cost (nbr=0x556ddea0) at 
proto/babel/babel.c:625
#4  0x555e9f49 in babel_handle_ihu (m=0x759e10a8, 
ifa=0x55705390)
at proto/babel/babel.c:1132
#5  0x555f11b2 in babel_process_packet (ifa=0x55705390, 
pkt=0x5570f700, len=34, 
saddr=..., sport=6696, daddr=..., dport=6696) at 
proto/babel/packets.c:1525
#6  0x555f13fa in babel_rx_hook (sk=0x5570f5b0, len=34) at 
proto/babel/packets.c:1584
#7  0x55647707 in sk_read_noflush (s=0x5570f5b0, revents=1) at 
sysdep/unix/io.c:1914
#8  0x55647732 in sk_read (s=0x5570f5b0, revents=1) at 
sysdep/unix/io.c:1923
#9  0x556485e2 in io_loop () at sysdep/unix/io.c:2368
#10 0x5564dae0 in main (argc=8, argv=0x7fffe568) at 
sysdep/unix/main.c:959

I haven't been able to find a reproducer yet, I think it happened after a
system suspend thogh.

More notes below:

On Sun, Feb 26, 2023 at 09:05:24PM +0100, Daniel Gröber wrote:
> This introduces the ability to filter routes from specific interfaces and
> neighbours. With the current internal route selection proto babel exports
> only up to one route and an admin cannot do fine-grained filtering.
> 
> To fix this we rip out the internal route selection entirely and export
> evey babel route as a distinct route into the bird's nest instead.
> 
> In the presence of esoteric route filters this change may be observable by
> users however we feel the additional filtering power is more important. A
> release note entry for this change is recommended.
> ---
> Changes in v4:
>  - We've seen the light, use FIB_ITERATE as intended, thanks Ondrej.
>  - Reinstate seqno request logic on loss of all feasible routes, thanks
>Toke. Discussion of behaviour still ongoing, see[1]
>  - Try to prepare for async rt_notify in BIRD 3, thanks Maria.
> 
> [1]: http://trubka.network.cz/pipermail/bird-users/2023-February/016689.html
> 
> 
>  proto/babel/babel.c | 278 
>  proto/babel/babel.h |   3 +
>  2 files changed, 128 insertions(+), 153 deletions(-)
> 
> diff --git a/proto/babel/babel.c b/proto/babel/babel.c
> index ff8b6b52..eb347a85 100644
> --- a/proto/babel/babel.c
> +++ b/proto/babel/babel.c
> @@ -29,10 +29,12 @@
>   * possible routes for the prefix are tracked as &babel_route entries and the
>   * feasibility distance is maintained through &babel_source structures.
>   *
> - * The main route selection is done in babel_select_route(). This is called 
> when
> - * an entry is updated by receiving updates from the network or when 
> modified by
> - * internal timers. The function selects from feasible and reachable routes 
> the
> - * one with the lowest metric to be announced to the core.
> + * The main route selection is done by bird's nest (heh). For each prefix we
> + * export all feasible routes to the core with a distinct source (rte_src) 
> per
> + * neihbour so bird handles them as distinct routes. Nest will then notify 
> us of
> + * one best route for each prefix, either our own (internal) or from another
> + * protocol, by calling babel_rt_notify. For internal best routes we remember
> + * which babel_route it selected in babel_entry.selected.
>   */
>  
>  #include 
> @@ -50,8 +52,8 @@ static inline int ge_mod64k(uint a, uint b)
>  { return (u16)(a - b) < 0x8000; }
>  
>  static void babel_expire_requests(struct babel_proto *p, struct babel_entry 
> *e);
> -static void babel_select_route(struct babel_proto *p, struct babel_entry *e, 
> struct babel_route *mod);
> -static inline void babel_announce_retraction(struct babel_proto *p, struct 
> babel_entry *e);
> +static void babel_announce_rte(struct babel_proto *p, struct babel_entry *e, 
> struct babel_route *r);
> +static void babel_rte_update_unreachable(struct babel_proto *p, struct 
> babel_entry *e, u8 announce);
>  static void babel_send_route_request(struct babel_proto *p, struct 
> babel_entry *e, struct babel_neighbor *n);
>  static void babel_send_seqno_request(struct babel_proto *p, struct 
> babel_entry *e, struct babel_seqno_request 

Re: [PATCH 0/3] babel: Add support for the RTT extension

2023-02-26 Thread Daniel Gröber
Hi,

On Sun, Feb 26, 2023 at 11:10:03PM +0100, Toke Høiland-Jørgensen via Bird-users 
wrote:
> Note that this series conflicts with Daniel's patches for moving the
> route selection into the Bird nest. Figured I'd send them now so this can
> be part of the discussion of that other patch (and also Daniel is one of
> the current users of this extension so I expect he'd be interested in
> having the two be compatible).

Indeed I've been using these patches for a while, but I've had to switch to
babeld due to lack of proper route filtering ;) Still consider this

Tested-By: Daniel Gröber 

To clarify: it's really only the metric smoothing patch that's in conflict
with my patch. I would advocate for merging only the other two patches for
now while we figure out how to rework the smoothing on top of my patch. I'm
happy to do the rework we just need to come up with a plan for that :)

--Daniel


[PATCH v4] Babel: Replace internal route selection by bird's nest

2023-02-26 Thread Daniel Gröber
This introduces the ability to filter routes from specific interfaces and
neighbours. With the current internal route selection proto babel exports
only up to one route and an admin cannot do fine-grained filtering.

To fix this we rip out the internal route selection entirely and export
evey babel route as a distinct route into the bird's nest instead.

In the presence of esoteric route filters this change may be observable by
users however we feel the additional filtering power is more important. A
release note entry for this change is recommended.
---
Changes in v4:
 - We've seen the light, use FIB_ITERATE as intended, thanks Ondrej.
 - Reinstate seqno request logic on loss of all feasible routes, thanks
   Toke. Discussion of behaviour still ongoing, see[1]
 - Try to prepare for async rt_notify in BIRD 3, thanks Maria.

[1]: http://trubka.network.cz/pipermail/bird-users/2023-February/016689.html


 proto/babel/babel.c | 278 
 proto/babel/babel.h |   3 +
 2 files changed, 128 insertions(+), 153 deletions(-)

diff --git a/proto/babel/babel.c b/proto/babel/babel.c
index ff8b6b52..eb347a85 100644
--- a/proto/babel/babel.c
+++ b/proto/babel/babel.c
@@ -29,10 +29,12 @@
  * possible routes for the prefix are tracked as &babel_route entries and the
  * feasibility distance is maintained through &babel_source structures.
  *
- * The main route selection is done in babel_select_route(). This is called 
when
- * an entry is updated by receiving updates from the network or when modified 
by
- * internal timers. The function selects from feasible and reachable routes the
- * one with the lowest metric to be announced to the core.
+ * The main route selection is done by bird's nest (heh). For each prefix we
+ * export all feasible routes to the core with a distinct source (rte_src) per
+ * neihbour so bird handles them as distinct routes. Nest will then notify us 
of
+ * one best route for each prefix, either our own (internal) or from another
+ * protocol, by calling babel_rt_notify. For internal best routes we remember
+ * which babel_route it selected in babel_entry.selected.
  */
 
 #include 
@@ -50,8 +52,8 @@ static inline int ge_mod64k(uint a, uint b)
 { return (u16)(a - b) < 0x8000; }
 
 static void babel_expire_requests(struct babel_proto *p, struct babel_entry 
*e);
-static void babel_select_route(struct babel_proto *p, struct babel_entry *e, 
struct babel_route *mod);
-static inline void babel_announce_retraction(struct babel_proto *p, struct 
babel_entry *e);
+static void babel_announce_rte(struct babel_proto *p, struct babel_entry *e, 
struct babel_route *r);
+static void babel_rte_update_unreachable(struct babel_proto *p, struct 
babel_entry *e, u8 announce);
 static void babel_send_route_request(struct babel_proto *p, struct babel_entry 
*e, struct babel_neighbor *n);
 static void babel_send_seqno_request(struct babel_proto *p, struct babel_entry 
*e, struct babel_seqno_request *sr, struct babel_neighbor *n);
 static void babel_update_cost(struct babel_neighbor *n);
@@ -168,9 +170,7 @@ static inline void
 babel_retract_route(struct babel_proto *p, struct babel_route *r)
 {
   r->metric = r->advert_metric = BABEL_INFINITY;
-
-  if (r == r->e->selected)
-babel_select_route(p, r->e, r);
+  babel_announce_rte(p, r->e, r);
 }
 
 static void
@@ -182,9 +182,6 @@ babel_flush_route(struct babel_proto *p UNUSED, struct 
babel_route *r)
   rem_node(NODE r);
   rem_node(&r->neigh_route);
 
-  if (r->e->selected == r)
-r->e->selected = NULL;
-
   sl_free(r);
 }
 
@@ -200,6 +197,7 @@ babel_expire_route(struct babel_proto *p, struct 
babel_route *r)
   {
 r->metric = r->advert_metric = BABEL_INFINITY;
 r->expires = current_time() + cf->hold_time;
+babel_announce_rte(p, r->e, r);
   }
   else
   {
@@ -229,8 +227,6 @@ babel_expire_routes_(struct babel_proto *p, struct fib 
*rtable)
 loop:
   FIB_ITERATE_START(rtable, &fit, struct babel_entry, e)
   {
-int changed = 0;
-
 WALK_LIST_DELSAFE(r, rx, e->routes)
 {
   if (r->refresh_time && r->refresh_time <= now_)
@@ -238,23 +234,12 @@ loop:
 
   if (r->expires && r->expires <= now_)
   {
-   changed = changed || (r == e->selected);
+   FIB_ITERATE_PUT(&fit);
babel_expire_route(p, r);
+   goto loop;
   }
 }
 
-if (changed)
-{
-  /*
-   * We have to restart the iteration because there may be a cascade of
-   * synchronous events babel_select_route() -> nest table change ->
-   * babel_rt_notify() -> rtable change, invalidating hidden variables.
-   */
-  FIB_ITERATE_PUT(&fit);
-  babel_select_route(p, e, NULL);
-  goto loop;
-}
-
 /* Clean up stale entries */
 if ((e->valid == BABEL_ENTRY_STALE) && ((e->updated + cf->hold_time) <= 
now_))
   e->valid = BABEL_ENTRY_DUMMY;
@@ -263,7 +248,7 @@ loop:
 if (e->unreachable && (!e->valid || (e->router_id == p->router_id)))
 {
   FIB_ITERATE_PUT(&fit);

Re: [Babel-users] Babel: Clarifications on seqno request handling in bird

2023-02-26 Thread Daniel Gröber
Hi Maria,

On Sun, Feb 26, 2023 at 07:34:06PM +0100, Maria Matejka wrote:
> > I don't think RFC8966 is really framed in bird's "multi protocol" mindset
> > so it's unclear to me whether this is something we have to fix or
> > not. Section 3.8.2.1. says:
> > 
> > > A node that has lost all feasible routes to a given destination but still
> > > has unexpired unfeasible routes to that destination MUST send a seqno
> > > request;
> > 
> > I could for example read this as the above mentioned static route
> > constituting a feasible route received from a bird "internal" babel
> > neighbour which would make the behaviour described above perfectly fine,
> > no?
> 
> From my point of view, this is perfectly fine.

I did just do another reivew on the knock on effects of interpreting it
like that and I do tend to agree it all looks fine. The (!best) thing I
mentioned actually works in our favor here since getting rt_notifi'ed of
another protocols route will then still trigger the 3.8.2.2, as it should,
if we receive an unfeasible update.

> > @Bird folks: can anyone think of a way to be notified of any and all
> > changes in rt_notify? I assume it's not possible from some light reading of
> > the nest code but figured I might as well ask.
> 
> You may set channel ra_mode to RA_ANY, getting called rt_notify() for all
> route updates.
> 
> Or in BIRD 3, you can override the channel logic at all and if you don't
> mind calling your filters yourself, you can just request "on any change,
> give me all the routes for one prefix at once" and do whatever you want with
> the routes. Yet these changes most probably won't get backported to BIRD 2.

Hmm RA_ANY might actually be just fine if it comes to that, I'd still
prefer not having to do it tho :)

Thanks,
--Daniel


Re: [PATCH v3] Babel: Replace internal route selection by bird's nest

2023-01-31 Thread Daniel Gröber
Hi Toke,

Thanks for the comprehensive review! See below.

On Tue, Jan 31, 2023 at 12:38:25PM +0100, Toke Høiland-Jørgensen wrote:
> Daniel Gröber  writes:
> > This appears to not actually be a breaking change as route announcement was
> > already based on which routes bird would send back to babel_rt_notify
> > filters shouldn't be able to tell the difference between a single and
> > multiple routes AFAIK.
> 
> Can we be a bit more sure than "appears"? :)

Testers welcome :P

> Just to make sure I'm understanding correctly: this relies on the nest
> using babel_rte_better() to select the route with the lowest metric from
> all the ones we announce, right? And if there's no filter, functionality
> will be equivalent to before this patch?

Right, I have yet to actually compare the before/after behaviour more
comprehensively. Hence the "appears" :)

Also note the feasibility condition is checked before announcing to nest so
rte_better will never see ones that don't satisfy it. Furthermore metric
smoothing is going to be even fun to implement with this ;)

If there is no filter everything works as ever, what we still have to worry
about is the with filter case. I'm not sure if there are any differences
observable from the filter POV.

In particular import/export filters that fiddle with the metric need to be
tested still.

> > +static struct babel_route *
> > +babel_get_selected_route(struct babel_proto *p, struct babel_entry *e)
> > +{
> > +  if (e->selected_nbr)
> > +return babel_get_route(p, e, e->selected_nbr);
> > +  return NULL;
> > +}
> 
> Why the switch to storing the neighbour and resolving that to a route
> here? Couldn't we just keep the existing e->selected route and populate
> that where e->selected_nbr is currently set?

Just seemed more natural to me at the time, but looking at it now I wonder
if it isn't better for performance to cache this after all, since
get_selected_route does get called in handle_update. I think I'll roll that
back.

> > +  for (struct babel_route *r_next, *r = expire_head; r ; r = r_next) {
> > +r_next = r->expire_next;
> > +babel_expire_route(p, r);
> > +  }
> > +
> > +  for (struct babel_entry *e = retract_head; e ; e = e->retract_next) {
> > +babel_announce_retraction(p, e);
> > +  }
> > +
> > +  for (struct babel_entry *e_next, *e = delete_head; e ; e = e_next) {
> > +e_next = e->delete_next;
> > +fib_delete(rtable, e);
> > +  }
> 
> These also be clearing the ->{expire,delete,rectract}_next pointers? I
> know they're not used anywhere else, but they'll be left pointing to
> random memory in some cases here...

Yeah, I had that for all the loops then removed it because it complicated
the expire case as that one might deallocate the route. I'll give it
another go.

> > +static void
> > +babel_announce_rte_unreachable(struct babel_proto *p, struct babel_entry 
> > *e, u8 unreachable)
> > +{
> > +  struct channel *c = (e->n.addr->type == NET_IP4) ? p->ip4_channel : 
> > p->ip6_channel;
> > +  rte *rte = NULL;
> > +
> > +  if (unreachable) {
> > +/* Unreachable */
> > +rta a0 = {
> > +  .source = RTS_BABEL,
> > +  .scope = SCOPE_UNIVERSE,
> > +  .dest = RTD_UNREACHABLE,
> > +  .pref = 1,
> > +};
> > +
> > +rta *a = rta_lookup(&a0);
> > +rte = rte_get_temp(a, p->p.main_source);
> > +  }
> > +
> > +  e->unreachable = unreachable;
> > +
> > +  /* Unlike the many neighbour routes we only want one unreachable route, 
> > for
> > +   * one it's ugly to have lots of unreachable routes (if we just did this 
> > right
> > +   * in babel_announce_rte) but we also loose access to the neibour 
> > rte_src id
> > +   * when the neighbour is deallocated. This happens on a different 
> > schedule
> > +   * than unreachable route removal. */
> > +  rte_update2(c, e->n.addr, rte, p->p.main_source);
> > +}
> 
> OK, so this function took me a while to understand: There's a special
> "unreachable" route we announce whenever we lose a route, and this
> function announces or retract that particular route based on the
> "unreachable" argument. Could you please add a comment to the top of the
> function explaining all this (the comment at the bottom is only a
> partial explanation).
> 
> Also s/loose/lose/

ACK.

> > -TRACE(D_EVENTS, "Lost feasible route for prefix %N", e->n.addr);
> > -if (e->valid && (e->selected->

[PATCH v3] Babel: Replace internal route selection by bird's nest

2023-01-30 Thread Daniel Gröber
This allows for filtering routes from specific interfaces and
neighbours. With the current internal route selection proto babel exports
only up to one route and an admin cannot do fine-grained filtering.

To fix this we rip out the internal route selection entirely and put them
all into the bird's nest instead.

This appears to not actually be a breaking change as route announcement was
already based on which routes bird would send back to babel_rt_notify
filters shouldn't be able to tell the difference between a single and
multiple routes AFAIK.
---
Changes in v3:
 - Subsume FIB_RESTART v2 patch: instead of restarting FIB iteration
   we keep lists of actions to perform after FIB iteration is finished.

 proto/babel/babel.c | 269 +++-
 proto/babel/babel.h |   8 +-
 2 files changed, 120 insertions(+), 157 deletions(-)

diff --git a/proto/babel/babel.c b/proto/babel/babel.c
index ff8b6b52..087fd95b 100644
--- a/proto/babel/babel.c
+++ b/proto/babel/babel.c
@@ -29,10 +29,11 @@
  * possible routes for the prefix are tracked as &babel_route entries and the
  * feasibility distance is maintained through &babel_source structures.
  *
- * The main route selection is done in babel_select_route(). This is called 
when
- * an entry is updated by receiving updates from the network or when modified 
by
- * internal timers. The function selects from feasible and reachable routes the
- * one with the lowest metric to be announced to the core.
+ * The main route selection is done by the bird nest (heh). For each prefix we
+ * export all feasible routes to the core with a distinct source (rte_src) per
+ * neihbour. Bird will then notify us of the optimal one by calling
+ * babel_rt_notify and we remember which neighbour it selected in
+ * babel_entry.selected_nbr.
  */
 
 #include 
@@ -50,7 +51,7 @@ static inline int ge_mod64k(uint a, uint b)
 { return (u16)(a - b) < 0x8000; }
 
 static void babel_expire_requests(struct babel_proto *p, struct babel_entry 
*e);
-static void babel_select_route(struct babel_proto *p, struct babel_entry *e, 
struct babel_route *mod);
+static void babel_announce_rte(struct babel_proto *p, struct babel_entry *e, 
struct babel_route *r);
 static inline void babel_announce_retraction(struct babel_proto *p, struct 
babel_entry *e);
 static void babel_send_route_request(struct babel_proto *p, struct babel_entry 
*e, struct babel_neighbor *n);
 static void babel_send_seqno_request(struct babel_proto *p, struct babel_entry 
*e, struct babel_seqno_request *sr, struct babel_neighbor *n);
@@ -164,13 +165,19 @@ babel_get_route(struct babel_proto *p, struct babel_entry 
*e, struct babel_neigh
   return r;
 }
 
+static struct babel_route *
+babel_get_selected_route(struct babel_proto *p, struct babel_entry *e)
+{
+  if (e->selected_nbr)
+return babel_get_route(p, e, e->selected_nbr);
+  return NULL;
+}
+
 static inline void
 babel_retract_route(struct babel_proto *p, struct babel_route *r)
 {
   r->metric = r->advert_metric = BABEL_INFINITY;
-
-  if (r == r->e->selected)
-babel_select_route(p, r->e, r);
+  babel_announce_rte(p, r->e, r);
 }
 
 static void
@@ -182,9 +189,6 @@ babel_flush_route(struct babel_proto *p UNUSED, struct 
babel_route *r)
   rem_node(NODE r);
   rem_node(&r->neigh_route);
 
-  if (r->e->selected == r)
-r->e->selected = NULL;
-
   sl_free(r);
 }
 
@@ -200,6 +204,7 @@ babel_expire_route(struct babel_proto *p, struct 
babel_route *r)
   {
 r->metric = r->advert_metric = BABEL_INFINITY;
 r->expires = current_time() + cf->hold_time;
+babel_announce_rte(p, r->e, r);
   }
   else
   {
@@ -210,7 +215,7 @@ babel_expire_route(struct babel_proto *p, struct 
babel_route *r)
 static void
 babel_refresh_route(struct babel_proto *p, struct babel_route *r)
 {
-  if (r == r->e->selected)
+  if (r == babel_get_selected_route(p, r->e))
 babel_send_route_request(p, r->e, r->neigh);
 
   r->refresh_time = 0;
@@ -221,16 +226,15 @@ babel_expire_routes_(struct babel_proto *p, struct fib 
*rtable)
 {
   struct babel_config *cf = (void *) p->p.cf;
   struct babel_route *r, *rx;
+  struct babel_entry *retract_head = NULL, *delete_head = NULL;
+  struct babel_route *expire_head = NULL;
   struct fib_iterator fit;
   btime now_ = current_time();
 
-  FIB_ITERATE_INIT(&fit, rtable);
 
-loop:
+  FIB_ITERATE_INIT(&fit, rtable);
   FIB_ITERATE_START(rtable, &fit, struct babel_entry, e)
   {
-int changed = 0;
-
 WALK_LIST_DELSAFE(r, rx, e->routes)
 {
   if (r->refresh_time && r->refresh_time <= now_)
@@ -238,23 +242,11 @@ loop:
 
   if (r->expires && r->expires <= now_)
   {
-   changed = changed || (r == e->selected);
-   babel_expire_route(p, r);
+   r->expire_next = expire_head;
+   expire_head = r;
   }
 }
 
-if (changed)
-{
-  /*
-   * We have to restart the iteration because there may be a cascade of
-   * synchronous events babel_select_route() -> nest table change ->
-  

[PATCH v2 2/2] [RFC] Babel: Replace internal route selection by bird's nest

2023-01-30 Thread Daniel Gröber
This allows for filtering routes from specific interfaces and
neighbours. With the current internal route selection proto babel exports
only up to one route and an admin cannot do fine-grained filtering.

To fix this we rip out the internal route selection entirely and put them
all into the bird's nest instead.

This appears to not actually be a breaking change as route announcement was
already based on which routes bird would send back to babel_rt_notify
filters shouldn't be able to tell the difference between a single and
multiple routes AFAIK.
---
Changes in v2:
 - Only announce a single unreachable route from the main_source and ensure
   proper hold timing. Previously the differing lifetime of neighbours vs
   unreachable hold time were a problem.

 proto/babel/babel.c | 228 +---
 proto/babel/babel.h |   5 +-
 2 files changed, 94 insertions(+), 139 deletions(-)

diff --git a/proto/babel/babel.c b/proto/babel/babel.c
index a52125ff..4bc38540 100644
--- a/proto/babel/babel.c
+++ b/proto/babel/babel.c
@@ -29,10 +29,11 @@
  * possible routes for the prefix are tracked as &babel_route entries and the
  * feasibility distance is maintained through &babel_source structures.
  *
- * The main route selection is done in babel_select_route(). This is called 
when
- * an entry is updated by receiving updates from the network or when modified 
by
- * internal timers. The function selects from feasible and reachable routes the
- * one with the lowest metric to be announced to the core.
+ * The main route selection is done by the bird nest (heh). For each prefix we
+ * export all feasible routes to the core with a distinct source (rte_src) per
+ * neihbour. Bird will then notify us of the optimal one by calling
+ * babel_rt_notify and we remember which neighbour it selected in
+ * babel_entry.selected_nbr.
  */
 
 #include 
@@ -50,7 +51,7 @@ static inline int ge_mod64k(uint a, uint b)
 { return (u16)(a - b) < 0x8000; }
 
 static void babel_expire_requests(struct babel_proto *p, struct babel_entry 
*e);
-static void babel_select_route(struct babel_proto *p, struct babel_entry *e, 
struct babel_route *mod);
+static void babel_announce_rte(struct babel_proto *p, struct babel_entry *e, 
struct babel_route *r);
 static inline void babel_announce_retraction(struct babel_proto *p, struct 
babel_entry *e);
 static void babel_send_route_request(struct babel_proto *p, struct babel_entry 
*e, struct babel_neighbor *n);
 static void babel_send_seqno_request(struct babel_proto *p, struct babel_entry 
*e, struct babel_seqno_request *sr, struct babel_neighbor *n);
@@ -168,13 +169,19 @@ babel_get_route(struct babel_proto *p, struct babel_entry 
*e, struct babel_neigh
   return r;
 }
 
+static struct babel_route *
+babel_get_selected_route(struct babel_proto *p, struct babel_entry *e)
+{
+  if (e->selected_nbr)
+return babel_get_route(p, e, e->selected_nbr);
+  return NULL;
+}
+
 static inline void
 babel_retract_route(struct babel_proto *p, struct babel_route *r)
 {
   r->metric = r->advert_metric = BABEL_INFINITY;
-
-  if (r == r->e->selected)
-babel_select_route(p, r->e, r);
+  babel_announce_rte(p, r->e, r);
 }
 
 static void
@@ -186,9 +193,6 @@ babel_flush_route(struct babel_proto *p UNUSED, struct 
babel_route *r)
   rem_node(NODE r);
   rem_node(&r->neigh_route);
 
-  if (r->e->selected == r)
-r->e->selected = NULL;
-
   sl_free(r);
 }
 
@@ -204,6 +208,7 @@ babel_expire_route(struct babel_proto *p, struct 
babel_route *r)
   {
 r->metric = r->advert_metric = BABEL_INFINITY;
 r->expires = current_time() + cf->hold_time;
+babel_announce_rte(p, r->e, r);
   }
   else
   {
@@ -214,7 +219,7 @@ babel_expire_route(struct babel_proto *p, struct 
babel_route *r)
 static void
 babel_refresh_route(struct babel_proto *p, struct babel_route *r)
 {
-  if (r == r->e->selected)
+  if (r == babel_get_selected_route(p, r->e))
 babel_send_route_request(p, r->e, r->neigh);
 
   r->refresh_time = 0;
@@ -234,8 +239,6 @@ loop:
   rtable_lock(rtable);
   FIB_ITERATE_START(&rtable->fib, &fit, struct babel_entry, e)
   {
-int changed = 0;
-
 WALK_LIST_DELSAFE(r, rx, e->routes)
 {
   if (r->refresh_time && r->refresh_time <= now_)
@@ -243,14 +246,10 @@ loop:
 
   if (r->expires && r->expires <= now_)
   {
-   changed = changed || (r == e->selected);
babel_expire_route(p, r);
   }
 }
 
-if (changed)
-  babel_select_route(p, e, NULL);
-
 /* Clean up stale entries */
 if ((e->valid == BABEL_ENTRY_STALE) && ((e->updated + cf->hold_time) <= 
now_))
   e->valid = BABEL_ENTRY_DUMMY;
@@ -446,6 +445,7 @@ babel_get_neighbor(struct babel_iface *ifa, ip_addr addr)
 
   nbr = mb_allocz(ifa->pool, sizeof(struct babel_neighbor));
   nbr->ifa = ifa;
+  nbr->src = rt_get_source(&p->p, idm_alloc(&p->src_ids));
   nbr->addr = addr;
   nbr->rxcost = BABEL_INFINITY;
   nbr->txcost = BABEL_INFINITY;
@@ -473,6 +473,8 @@ babel_flush_neighbor(stru

[PATCH v2 1/2] Babel: Remove unecessary FIB_ITERATE restart

2023-01-30 Thread Daniel Gröber
The route expiration code appears to have been stolen from rip.c, in that
code the rt_notify function actually does modify the rtable fib by calling
fib_get. The babel code however does no such thing, so this inefficient
restart is just entirely uneccesarry.

To prove this is true I add a bunch of checking code that can be removed
later.
---
Changes in v2:
 - Fix assert due to misplaced lock/unlock

 proto/babel/babel.c | 64 ++---
 proto/babel/babel.h |  8 +++---
 2 files changed, 42 insertions(+), 30 deletions(-)

diff --git a/proto/babel/babel.c b/proto/babel/babel.c
index ff8b6b52..a52125ff 100644
--- a/proto/babel/babel.c
+++ b/proto/babel/babel.c
@@ -58,6 +58,9 @@ static void babel_update_cost(struct babel_neighbor *n);
 static inline void babel_kick_timer(struct babel_proto *p);
 static inline void babel_iface_kick_timer(struct babel_iface *ifa);
 
+#define rtable_lock(rtable) ASSERT(!(rtable)->lock++)
+#define rtable_unlock(rtable) ASSERT(!--(rtable)->lock)
+
 /*
  * Functions to maintain data structures
  */
@@ -76,15 +79,16 @@ babel_init_entry(struct fib *f UNUSED, void *E)
 static inline struct babel_entry *
 babel_find_entry(struct babel_proto *p, const net_addr *n)
 {
-  struct fib *rtable = (n->type == NET_IP4) ? &p->ip4_rtable : &p->ip6_rtable;
+  struct fib *rtable = (n->type == NET_IP4) ? &p->ip4_rtable.fib : 
&p->ip6_rtable.fib;
   return fib_find(rtable, n);
 }
 
 static struct babel_entry *
 babel_get_entry(struct babel_proto *p, const net_addr *n)
 {
-  struct fib *rtable = (n->type == NET_IP4) ? &p->ip4_rtable : &p->ip6_rtable;
-  struct babel_entry *e = fib_get(rtable, n);
+  struct babel_rtable *rtable = (n->type == NET_IP4) ? &p->ip4_rtable : 
&p->ip6_rtable;
+  ASSERT(!rtable->lock);
+  struct babel_entry *e = fib_get(&rtable->fib, n);
   return e;
 }
 
@@ -217,17 +221,18 @@ babel_refresh_route(struct babel_proto *p, struct 
babel_route *r)
 }
 
 static void
-babel_expire_routes_(struct babel_proto *p, struct fib *rtable)
+babel_expire_routes_(struct babel_proto *p, struct babel_rtable *rtable)
 {
   struct babel_config *cf = (void *) p->p.cf;
   struct babel_route *r, *rx;
   struct fib_iterator fit;
   btime now_ = current_time();
 
-  FIB_ITERATE_INIT(&fit, rtable);
+  FIB_ITERATE_INIT(&fit, &rtable->fib);
 
 loop:
-  FIB_ITERATE_START(rtable, &fit, struct babel_entry, e)
+  rtable_lock(rtable);
+  FIB_ITERATE_START(&rtable->fib, &fit, struct babel_entry, e)
   {
 int changed = 0;
 
@@ -244,16 +249,7 @@ loop:
 }
 
 if (changed)
-{
-  /*
-   * We have to restart the iteration because there may be a cascade of
-   * synchronous events babel_select_route() -> nest table change ->
-   * babel_rt_notify() -> rtable change, invalidating hidden variables.
-   */
-  FIB_ITERATE_PUT(&fit);
   babel_select_route(p, e, NULL);
-  goto loop;
-}
 
 /* Clean up stale entries */
 if ((e->valid == BABEL_ENTRY_STALE) && ((e->updated + cf->hold_time) <= 
now_))
@@ -263,6 +259,7 @@ loop:
 if (e->unreachable && (!e->valid || (e->router_id == p->router_id)))
 {
   FIB_ITERATE_PUT(&fit);
+  rtable_unlock(rtable);
   babel_announce_retraction(p, e);
   goto loop;
 }
@@ -274,11 +271,13 @@ loop:
 if (!e->valid && EMPTY_LIST(e->routes) && EMPTY_LIST(e->sources) && 
EMPTY_LIST(e->requests))
 {
   FIB_ITERATE_PUT(&fit);
-  fib_delete(rtable, e);
+  rtable_unlock(rtable);
+  fib_delete(&rtable->fib, e);
   goto loop;
 }
   }
   FIB_ITERATE_END;
+  rtable_unlock(rtable);
 }
 
 static void
@@ -959,7 +958,7 @@ babel_send_seqno_request(struct babel_proto *p, struct 
babel_entry *e,
  * transmitted entry is updated.
  */
 static void
-babel_send_update_(struct babel_iface *ifa, btime changed, struct fib *rtable)
+babel_send_update_(struct babel_iface *ifa, btime changed, struct babel_rtable 
*rtable)
 {
   struct babel_proto *p = ifa->proto;
 
@@ -970,7 +969,8 @@ babel_send_update_(struct babel_iface *ifa, btime changed, 
struct fib *rtable)
 p->update_seqno_inc = 0;
   }
 
-  FIB_WALK(rtable, struct babel_entry, e)
+  rtable_lock(rtable);
+  FIB_WALK(&rtable->fib, struct babel_entry, e)
   {
 if (!e->valid)
   continue;
@@ -1022,6 +1022,7 @@ babel_send_update_(struct babel_iface *ifa, btime 
changed, struct fib *rtable)
 }
   }
   FIB_WALK_END;
+  rtable_unlock(rtable);
 }
 
 static void
@@ -2055,16 +2056,21 @@ babel_dump(struct proto *P)
   WALK_LIST(ifa, p->interfaces)
 babel_dump_iface(ifa);
 
-  FIB_WALK(&p->ip4_rtable, struct babel_entry, e)
+  rtable_lock(&p->ip4_rtable);
+  FIB_WALK(&p->ip4_rtable.fib, struct babel_entry, e)
   {
 babel_dump_entry(e);
   }
   FIB_WALK_END;
-  FIB_WALK(&p->ip6_rtable, struct babel_entry, e)
+  rtable_unlock(&p->ip4_rtable);
+
+  rtable_lock(&p->ip6_rtable);
+  FIB_WALK(&p->ip6_rtable.fib, struct babel_entry, e)
   {
 babel_dump_entry(e);
   }
   FIB_WALK_END;
+  rtable_unlock(&p->ip6_

[PATCH] [RFC] Babel: Replace internal route selection by bird nest

2023-01-29 Thread Daniel Gröber
The main motivation for this change is to allow for ingress route
filtering. With the current internal route selection proto/babel exports
only the one route it selected and an admin cannot decide which
neighbours/interfaces to import certain routes from using filters.

To fix this we rip out the internal route selection entirely and let bird
do it's thing instead.

Note that this does very much represent a breaking change as now export
filters have to explicitly re-export babel routes which was implicit
before, using say `source = RTS_BABEL`.

It's unclear to me if we should support the old behaviour and how I would
circumvent the export filter for babel routes if so. Comments and
suggestions welcome.

TODO:
 - Unreachable routes are still kind of funky since rebasing on master from
   2.0.10. I see both a unicast routes and an unreachable route sometimes.
---
Note: this depends on "Babel: Remove unecessary FIB_ITERATE restart",
I thought it wouldn't or I would have sent it as a series.

 proto/babel/babel.c | 177 +++-
 proto/babel/babel.h |   5 +-
 2 files changed, 62 insertions(+), 120 deletions(-)

diff --git a/proto/babel/babel.c b/proto/babel/babel.c
index bed2e1e4..1fe73cfa 100644
--- a/proto/babel/babel.c
+++ b/proto/babel/babel.c
@@ -29,10 +29,11 @@
  * possible routes for the prefix are tracked as &babel_route entries and the
  * feasibility distance is maintained through &babel_source structures.
  *
- * The main route selection is done in babel_select_route(). This is called 
when
- * an entry is updated by receiving updates from the network or when modified 
by
- * internal timers. The function selects from feasible and reachable routes the
- * one with the lowest metric to be announced to the core.
+ * The main route selection is done by the bird nest (heh). For each prefix we
+ * export all feasible routes to the core with a distinct source (rte_src) per
+ * neihbour. Bird will then notify us of the optimal one by calling
+ * babel_rt_notify and we remember which neighbour it selected in
+ * babel_entry.selected_nbr.
  */
 
 #include 
@@ -50,7 +51,7 @@ static inline int ge_mod64k(uint a, uint b)
 { return (u16)(a - b) < 0x8000; }
 
 static void babel_expire_requests(struct babel_proto *p, struct babel_entry 
*e);
-static void babel_select_route(struct babel_proto *p, struct babel_entry *e, 
struct babel_route *mod);
+static void babel_announce_rte(struct babel_proto *p, struct babel_entry *e, 
struct babel_route *r);
 static inline void babel_announce_retraction(struct babel_proto *p, struct 
babel_entry *e);
 static void babel_send_route_request(struct babel_proto *p, struct babel_entry 
*e, struct babel_neighbor *n);
 static void babel_send_seqno_request(struct babel_proto *p, struct babel_entry 
*e, struct babel_seqno_request *sr, struct babel_neighbor *n);
@@ -168,13 +169,19 @@ babel_get_route(struct babel_proto *p, struct babel_entry 
*e, struct babel_neigh
   return r;
 }
 
+static struct babel_route *
+babel_get_selected_route(struct babel_proto *p, struct babel_entry *e)
+{
+  if (e->selected_nbr)
+return babel_get_route(p, e, e->selected_nbr);
+  return NULL;
+}
+
 static inline void
 babel_retract_route(struct babel_proto *p, struct babel_route *r)
 {
   r->metric = r->advert_metric = BABEL_INFINITY;
-
-  if (r == r->e->selected)
-babel_select_route(p, r->e, r);
+  babel_announce_rte(p, r->e, r);
 }
 
 static void
@@ -186,9 +193,6 @@ babel_flush_route(struct babel_proto *p UNUSED, struct 
babel_route *r)
   rem_node(NODE r);
   rem_node(&r->neigh_route);
 
-  if (r->e->selected == r)
-r->e->selected = NULL;
-
   sl_free(r);
 }
 
@@ -204,9 +208,11 @@ babel_expire_route(struct babel_proto *p, struct 
babel_route *r)
   {
 r->metric = r->advert_metric = BABEL_INFINITY;
 r->expires = current_time() + cf->hold_time;
+babel_announce_rte(p, r->e, r);
   }
   else
   {
+babel_announce_rte(p, r->e, r);
 babel_flush_route(p, r);
   }
 }
@@ -214,7 +220,7 @@ babel_expire_route(struct babel_proto *p, struct 
babel_route *r)
 static void
 babel_refresh_route(struct babel_proto *p, struct babel_route *r)
 {
-  if (r == r->e->selected)
+  if (r == babel_get_selected_route(p, r->e))
 babel_send_route_request(p, r->e, r->neigh);
 
   r->refresh_time = 0;
@@ -234,8 +240,6 @@ babel_expire_routes_(struct babel_proto *p, struct 
babel_rtable *rtable)
 loop:
   FIB_ITERATE_START(&rtable->fib, &fit, struct babel_entry, e)
   {
-int changed = 0;
-
 WALK_LIST_DELSAFE(r, rx, e->routes)
 {
   if (r->refresh_time && r->refresh_time <= now_)
@@ -243,14 +247,10 @@ loop:
 
   if (r->expires && r->expires <= now_)
   {
-   changed = changed || (r == e->selected);
babel_expire_route(p, r);
   }
 }
 
-if (changed)
-  babel_select_route(p, e, NULL);
-
 /* Clean up stale entries */
 if ((e->valid == BABEL_ENTRY_STALE) && ((e->updated + cf->hold_time) <= 
now_))
   e->valid = BABEL

[PATCH] Babel: Remove unecessary FIB_ITERATE restart

2023-01-29 Thread Daniel Gröber
The route expiration code appears to have been stolen from rip.c, in that
code the rt_notify function actually does modify the rtable fib by calling
fib_get. The babel code however does no such thing, so this inefficient
restart is just entirely uneccesarry.

To prove this is true I add a bunch of checking code that can be removed
later.
---
 proto/babel/babel.c | 64 ++---
 proto/babel/babel.h |  8 +++---
 2 files changed, 42 insertions(+), 30 deletions(-)

diff --git a/proto/babel/babel.c b/proto/babel/babel.c
index ff8b6b52..bed2e1e4 100644
--- a/proto/babel/babel.c
+++ b/proto/babel/babel.c
@@ -58,6 +58,9 @@ static void babel_update_cost(struct babel_neighbor *n);
 static inline void babel_kick_timer(struct babel_proto *p);
 static inline void babel_iface_kick_timer(struct babel_iface *ifa);
 
+#define rtable_lock(rtable) ASSERT(!(rtable)->lock++)
+#define rtable_unlock(rtable) ASSERT(!--(rtable)->lock)
+
 /*
  * Functions to maintain data structures
  */
@@ -76,15 +79,16 @@ babel_init_entry(struct fib *f UNUSED, void *E)
 static inline struct babel_entry *
 babel_find_entry(struct babel_proto *p, const net_addr *n)
 {
-  struct fib *rtable = (n->type == NET_IP4) ? &p->ip4_rtable : &p->ip6_rtable;
+  struct fib *rtable = (n->type == NET_IP4) ? &p->ip4_rtable.fib : 
&p->ip6_rtable.fib;
   return fib_find(rtable, n);
 }
 
 static struct babel_entry *
 babel_get_entry(struct babel_proto *p, const net_addr *n)
 {
-  struct fib *rtable = (n->type == NET_IP4) ? &p->ip4_rtable : &p->ip6_rtable;
-  struct babel_entry *e = fib_get(rtable, n);
+  struct babel_rtable *rtable = (n->type == NET_IP4) ? &p->ip4_rtable : 
&p->ip6_rtable;
+  ASSERT(!rtable->lock);
+  struct babel_entry *e = fib_get(&rtable->fib, n);
   return e;
 }
 
@@ -217,17 +221,18 @@ babel_refresh_route(struct babel_proto *p, struct 
babel_route *r)
 }
 
 static void
-babel_expire_routes_(struct babel_proto *p, struct fib *rtable)
+babel_expire_routes_(struct babel_proto *p, struct babel_rtable *rtable)
 {
   struct babel_config *cf = (void *) p->p.cf;
   struct babel_route *r, *rx;
   struct fib_iterator fit;
   btime now_ = current_time();
 
-  FIB_ITERATE_INIT(&fit, rtable);
+  rtable_lock(rtable);
+  FIB_ITERATE_INIT(&fit, &rtable->fib);
 
 loop:
-  FIB_ITERATE_START(rtable, &fit, struct babel_entry, e)
+  FIB_ITERATE_START(&rtable->fib, &fit, struct babel_entry, e)
   {
 int changed = 0;
 
@@ -244,16 +249,7 @@ loop:
 }
 
 if (changed)
-{
-  /*
-   * We have to restart the iteration because there may be a cascade of
-   * synchronous events babel_select_route() -> nest table change ->
-   * babel_rt_notify() -> rtable change, invalidating hidden variables.
-   */
-  FIB_ITERATE_PUT(&fit);
   babel_select_route(p, e, NULL);
-  goto loop;
-}
 
 /* Clean up stale entries */
 if ((e->valid == BABEL_ENTRY_STALE) && ((e->updated + cf->hold_time) <= 
now_))
@@ -263,6 +259,7 @@ loop:
 if (e->unreachable && (!e->valid || (e->router_id == p->router_id)))
 {
   FIB_ITERATE_PUT(&fit);
+  rtable_unlock(rtable);
   babel_announce_retraction(p, e);
   goto loop;
 }
@@ -274,11 +271,13 @@ loop:
 if (!e->valid && EMPTY_LIST(e->routes) && EMPTY_LIST(e->sources) && 
EMPTY_LIST(e->requests))
 {
   FIB_ITERATE_PUT(&fit);
-  fib_delete(rtable, e);
+  rtable_unlock(rtable);
+  fib_delete(&rtable->fib, e);
   goto loop;
 }
   }
   FIB_ITERATE_END;
+  rtable_unlock(rtable);
 }
 
 static void
@@ -959,7 +958,7 @@ babel_send_seqno_request(struct babel_proto *p, struct 
babel_entry *e,
  * transmitted entry is updated.
  */
 static void
-babel_send_update_(struct babel_iface *ifa, btime changed, struct fib *rtable)
+babel_send_update_(struct babel_iface *ifa, btime changed, struct babel_rtable 
*rtable)
 {
   struct babel_proto *p = ifa->proto;
 
@@ -970,7 +969,8 @@ babel_send_update_(struct babel_iface *ifa, btime changed, 
struct fib *rtable)
 p->update_seqno_inc = 0;
   }
 
-  FIB_WALK(rtable, struct babel_entry, e)
+  rtable_lock(rtable);
+  FIB_WALK(&rtable->fib, struct babel_entry, e)
   {
 if (!e->valid)
   continue;
@@ -1022,6 +1022,7 @@ babel_send_update_(struct babel_iface *ifa, btime 
changed, struct fib *rtable)
 }
   }
   FIB_WALK_END;
+  rtable_unlock(rtable);
 }
 
 static void
@@ -2055,16 +2056,21 @@ babel_dump(struct proto *P)
   WALK_LIST(ifa, p->interfaces)
 babel_dump_iface(ifa);
 
-  FIB_WALK(&p->ip4_rtable, struct babel_entry, e)
+  rtable_lock(&p->ip4_rtable);
+  FIB_WALK(&p->ip4_rtable.fib, struct babel_entry, e)
   {
 babel_dump_entry(e);
   }
   FIB_WALK_END;
-  FIB_WALK(&p->ip6_rtable, struct babel_entry, e)
+  rtable_unlock(&p->ip4_rtable);
+
+  rtable_lock(&p->ip6_rtable);
+  FIB_WALK(&p->ip6_rtable.fib, struct babel_entry, e)
   {
 babel_dump_entry(e);
   }
   FIB_WALK_END;
+  rtable_unlock(&p->ip6_rtable);
 }
 
 static void
@@ -2180,11 +2186,12 @@ babel_sh

Babel: Possible segfault in bird unfeasible update handling code

2023-01-29 Thread Daniel Gröber
Hi Babelers,

I've been working on the babel proto in bird and found some code where I
can't convince myself it won't segfault. The problematic bit is, I think,
's' in babel_handle_update can be NULL because nothing ensures the
babel_source for a particular neighbour actually exists here:

  /* Regular update */
  [...]
  s = babel_find_source(e, msg->router_id); /* for feasibility */
  [...]

  /* RFC section 3.8.2.2 - Dealing with unfeasible updates */
  if (!feasible && (metric != BABEL_INFINITY) &&
  (!best || (r == best) || (metric < best->metric)))
babel_add_seqno_request(p, e, s->router_id, s->seqno + 1, 0, nbr);
//^ BUG: Can 's' be NULL here?

The only place that allocates sources is babel_send_update_ which just
happens at it's own pace and has nothing to do with incoming update
handling AFAICT.

Am I missing something or is this a bug? Perhaps find should just be
replaced by babel_get_source here?

--Daniel


[PATCH v3] Babel: Add option to support ecmp routes

2022-05-26 Thread Daniel Gröber
We introduce ecmp support for the babel protocol by extending it's
definition of a route being selected to mean the route being in the ECMP
set.

In order to keep code changes minimal we keep the pointer to an arbitrary
member of the ECMP set in the FIB entry and add a new flag to babel_route
which indicates which routes were actually announced to the core.

Since keeping this flag update at all times is a hassle we take a lazy
approach and simply check the metric of the selected route against the
route in question whenever we want to know for sure if a route is in the
ECMP set.
---
 doc/bird.sgml|  32 ++
 proto/babel/babel.c  | 103 +++
 proto/babel/babel.h  |  16 +++
 proto/babel/config.Y |   5 +++
 4 files changed, 138 insertions(+), 18 deletions(-)

Changes in v3:
 - Squash with ecmp weigth patch
 - Add babel_route_is_selected() as replacement for e->selected
 I'm not totally sure the lazy approach is safe yet. We might need
 additional bookeeping to reset r->active_nexthop on route
 retraction/flush instead.

diff --git a/doc/bird.sgml b/doc/bird.sgml
index 1580facd..8d159b22 100644
--- a/doc/bird.sgml
+++ b/doc/bird.sgml
@@ -1865,6 +1865,7 @@ protocol babel [] {
ipv4 {  };
ipv6 [sadr] {  };
 randomize router id ;
+   ecmp  [limit ];
interface  {
type ;
rxcost ;
@@ -1879,6 +1880,7 @@ protocol babel [] {
check link ;
next hop ipv4 ;
next hop ipv6 ;
+   ecmp weight ;
authentication none|mac [permissive];
password "";
password "" {
@@ -1909,6 +1911,18 @@ protocol babel [] {
   router ID every time it starts up, which avoids this problem at the cost
   of not having stable router IDs in the network. Default: no.
 
+  ecmp switch [limit 
number]
+
+  Determines whether babel will emit ECMP (equal-cost multipath)
+  routes, allowing to load-balancing traffic across multiple paths. If
+  enabled the maximum number of next-hops to allow can be specified,
+  defaulting to 16.
+
+  When neibours are using a dynamic link-quality metric this is
+  unlikely to be useful. For best results  should be used throughout the network to get what
+  amounts to a hop count metric.
+
   type wired|wireless 
   This option specifies the interface type: Wired or wireless. On wired
   interfaces a neighbor is considered unreachable after a small number of
@@ -1928,6 +1942,18 @@ protocol babel [] {
   selection and not local route selection. Default: 96 for wired 
interfaces,
   256 for wireless.
 
+  ecmp switch [limit 
number]
+
+  Determines whether babel will emit ECMP (equal-cost multipath)
+  routes, allowing to load-balancing traffic across multiple paths. If
+  enabled the maximum number of next-hops to allow can be specified,
+  defaulting to 16.
+
+  When neibours are using a dynamic link-quality metric this is
+  unlikely to be useful. For best results  should be used throughout the network to get what
+  amounts to a hop count metric.
+
   limit 
   BIRD keeps track of received Hello messages from each neighbor to
   establish neighbor reachability. For wired type interfaces, this option
@@ -1983,6 +2009,12 @@ protocol babel [] {
   source for Babel packets will be used. In normal operation, it should not
   be necessary to set this option.
 
+  ecmp weight number
+  This specifies the relative weight used for nexthops going through
+  the iface when ECMP is enabled. Larger weight values relative to other
+  nexthops attract more traffic. Valid values are 1-256. Default value
+  is 1.
+
   authentication none|mac 
[permissive]
   Selects authentication method to be used. e->selected || r->e->selected->metric != r->metric)
+r->active_nexthop = 0;
+
+  return r->active_nexthop;
+}
+
 static inline void
 babel_retract_route(struct babel_proto *p, struct babel_route *r)
 {
   r->metric = r->advert_metric = BABEL_INFINITY;
 
-  if (r == r->e->selected)
+  if (babel_route_is_selected(r))
 babel_select_route(p, r->e, r);
 }
 
@@ -210,7 +221,7 @@ babel_expire_route(struct babel_proto *p, struct 
babel_route *r)
 static void
 babel_refresh_route(struct babel_proto *p, struct babel_route *r)
 {
-  if (r == r->e->selected)
+  if (babel_route_is_selected(r))
 babel_send_route_request(p, r->e, r->neigh);
 
   r->refresh_time = 0;
@@ -238,7 +249,7 @@ loop:
 
   if (r->expires && r->expires <= now_)
   {
-   changed = changed || (r == e->selected);
+   changed = changed || babel_route_is_selected(r);
babel_expire_route(p, r);
   }
 }
@@ -624,7 +635,38 @@ done:
 }
 
 /**
- * babel_announce_rte - announce selected route to the core
+ * babel_nexthop_insert - add next_hop of route to nexthop list

[PATCH v2 2/2] Netlink: Propagate ECMP nexthop weight to kernel for inet6 routes

2022-05-26 Thread Daniel Gröber
Previously nl_send_route would use plain nl_add_nexthop for ecmp ipv6
routes instead of adding RTA_MULTIPATH objects via nl_add_multipath. The
former lacks support for the rtnh_hops field needed for setting the nexthop
weight though.

On the kernel side support for nexthop weights was introduced by commit
398958ae48 ("ipv6: Add support for non-equal-cost multipath") which landed
in 4.16. Before this the weight will simply be ignored. Support for the
ipv6 RTA_MULTIPATH attribute was added in 51ebd31815 ("ipv6: add support of
equal cost multipath (ECMP)") which landed in 3.10. If this version bound
isn't met installing the route will fail.
---
 sysdep/linux/netlink.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/sysdep/linux/netlink.c b/sysdep/linux/netlink.c
index eca681f9..cf1e18d2 100644
--- a/sysdep/linux/netlink.c
+++ b/sysdep/linux/netlink.c
@@ -74,7 +74,6 @@
 #endif
 
 #define krt_ipv4(p) ((p)->af == AF_INET)
-#define krt_ecmp6(p) ((p)->af == AF_INET6)
 
 const int rt_default_ecmp = 16;
 
@@ -1403,7 +1402,7 @@ dest:
 {
 case RTD_UNICAST:
   r->r.rtm_type = RTN_UNICAST;
-  if (nh->next && !krt_ecmp6(p))
+  if (nh->next)
nl_add_multipath(&r->h, rsize, nh, p->af, eattrs);
   else
   {
-- 
2.30.2



[PATCH v2 1/2] Netlink: Drop ECMP route splitting hacks

2022-05-26 Thread Daniel Gröber
This removes the hacky route merging/splitting code needed to support older
kernel versions. Consequently the required Linux version is raised to 4.11
for reliable operation.
---
 sysdep/linux/netlink.c | 218 +++--
 1 file changed, 33 insertions(+), 185 deletions(-)

diff --git a/sysdep/linux/netlink.c b/sysdep/linux/netlink.c
index 29b744cb..eca681f9 100644
--- a/sysdep/linux/netlink.c
+++ b/sysdep/linux/netlink.c
@@ -78,49 +78,6 @@
 
 const int rt_default_ecmp = 16;
 
-/*
- * Structure nl_parse_state keeps state of received route processing. Ideally,
- * we could just independently parse received Netlink messages and immediately
- * propagate received routes to the rest of BIRD, but older Linux kernel 
(before
- * version 4.11) represents and announces IPv6 ECMP routes not as one route 
with
- * multiple next hops (like RTA_MULTIPATH in IPv4 ECMP), but as a sequence of
- * routes with the same prefix. More recent kernels work as with IPv4.
- *
- * Therefore, BIRD keeps currently processed route in nl_parse_state structure
- * and postpones its propagation until we expect it to be final; i.e., when
- * non-matching route is received or when the scan ends. When another matching
- * route is received, it is merged with the already processed route to form an
- * ECMP route. Note that merging is done only for IPv6 (merge == 1), but the
- * postponing is done in both cases (for simplicity). All IPv4 routes or IPv6
- * routes with RTA_MULTIPATH set are just considered non-matching.
- *
- * This is ignored for asynchronous notifications (every notification is 
handled
- * as a separate route). It is not an issue for our routes, as we ignore such
- * notifications anyways. But importing alien IPv6 ECMP routes does not work
- * properly with older kernels.
- *
- * Whatever the kernel version is, IPv6 ECMP routes are sent as multiple routes
- * for the same prefix.
- */
-
-struct nl_parse_state
-{
-  struct linpool *pool;
-  int scan;
-  int merge;
-
-  net *net;
-  rta *attrs;
-  struct krt_proto *proto;
-  s8 new;
-  s8 krt_src;
-  u8 krt_type;
-  u8 krt_proto;
-  u32 krt_metric;
-
-  u32 rta_flow;/* Used during parsing */
-};
-
 /*
  * Synchronous Netlink interface
  */
@@ -761,7 +718,7 @@ nl_add_multipath(struct nlmsghdr *h, uint bufsize, struct 
nexthop *nh, int af, e
 }
 
 static struct nexthop *
-nl_parse_multipath(struct nl_parse_state *s, struct krt_proto *p, const 
net_addr *n, struct rtattr *ra, int af, int krt_src)
+nl_parse_multipath(struct krt_proto *p, const net_addr *n, struct rtattr *ra, 
int af, int krt_src, u32 *rta_flow)
 {
   struct rtattr *a[BIRD_RTA_MAX];
   struct rtnexthop *nh = RTA_DATA(ra);
@@ -780,7 +737,7 @@ nl_parse_multipath(struct nl_parse_state *s, struct 
krt_proto *p, const net_addr
   if ((nh->rtnh_flags & RTNH_F_DEAD) && (krt_src != KRT_SRC_BIRD))
goto next;
 
-  *last = rv = lp_allocz(s->pool, NEXTHOP_MAX_SIZE);
+  *last = rv = lp_allocz(nl_linpool, NEXTHOP_MAX_SIZE);
   last = &(rv->next);
 
   rv->weight = nh->rtnh_hops;
@@ -824,7 +781,7 @@ nl_parse_multipath(struct nl_parse_state *s, struct 
krt_proto *p, const net_addr
rv->gw = rta_get_ipa(a[RTA_GATEWAY]);
 
   if (a[RTA_FLOW])
-   s->rta_flow = rta_get_u32(a[RTA_FLOW]);
+   *rta_flow = rta_get_u32(a[RTA_FLOW]);
 
 #ifdef HAVE_MPLS_KERNEL
   if (a[RTA_VIA])
@@ -1480,36 +1437,13 @@ static inline int
 nl_add_rte(struct krt_proto *p, rte *e)
 {
   rta *a = e->attrs;
-  int err = 0;
-
-  if (krt_ecmp6(p) && a->nh.next)
-  {
-struct nexthop *nh = &(a->nh);
-
-err = nl_send_route(p, e, NL_OP_ADD, RTD_UNICAST, nh);
-if (err < 0)
-  return err;
-
-for (nh = nh->next; nh; nh = nh->next)
-  err += nl_send_route(p, e, NL_OP_APPEND, RTD_UNICAST, nh);
-
-return err;
-  }
-
   return nl_send_route(p, e, NL_OP_ADD, a->dest, &(a->nh));
 }
 
 static inline int
 nl_delete_rte(struct krt_proto *p, rte *e)
 {
-  int err = 0;
-
-  /* For IPv6, we just repeatedly request DELETE until we get error */
-  do
-err = nl_send_route(p, e, NL_OP_DELETE, RTD_NONE, NULL);
-  while (krt_ecmp6(p) && !err);
-
-  return err;
+  return nl_send_route(p, e, NL_OP_DELETE, RTD_NONE, NULL);
 }
 
 static inline int
@@ -1559,67 +1493,11 @@ krt_replace_rte(struct krt_proto *p, net *n UNUSED, rte 
*new, rte *old)
   }
 }
 
-static int
-nl_mergable_route(struct nl_parse_state *s, net *net, struct krt_proto *p, 
uint priority, uint krt_type, uint rtm_family)
-{
-  /* Route merging is used for IPv6 scans */
-  if (!s->scan || (rtm_family != AF_INET6))
-return 0;
-
-  /* Saved and new route must have same network, proto/table, and priority */
-  if ((s->net != net) || (s->proto != p) || (s->krt_metric != priority))
-return 0;
-
-  /* Both must be regular unicast routes */
-  if ((s->krt_type != RTN_UNICAST) || (krt_type != RTN_UNICAST))
-return 0;
-
-  return 1;
-}
-
-static void
-nl_announce_route(struct nl_parse_state *s)

[PATCH] Netlink: Propagate ecmp nexthop weight to kernel for inet6 routes

2022-05-22 Thread Daniel Gröber
Previously nl_send_route would use plain nl_add_nexthop for ecmp ipv6
routes instead of adding RTA_MULTIPATH objects via nl_add_multipath. The
former lacks support for the rtnh_hops field needed for setting the nexthop
weight though.

On the kernel side support for nexthop weights was introduced by commit
398958ae48 ("ipv6: Add support for non-equal-cost multipath") which landed
in 4.16. Before this the weight will simply be ignored. Support for the
ipv6 RTA_MULTIPATH attribute was added in 51ebd31815 ("ipv6: add support of
equal cost multipath (ECMP)") which landed in 3.10. If this version bound
isn't met installing the route will fail.
---
 sysdep/linux/netlink.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sysdep/linux/netlink.c b/sysdep/linux/netlink.c
index 29b744cb..5e0b036d 100644
--- a/sysdep/linux/netlink.c
+++ b/sysdep/linux/netlink.c
@@ -1446,7 +1446,7 @@ dest:
 {
 case RTD_UNICAST:
   r->r.rtm_type = RTN_UNICAST;
-  if (nh->next && !krt_ecmp6(p))
+  if (nh->next)
nl_add_multipath(&r->h, rsize, nh, p->af, eattrs);
   else
   {
-- 
2.30.2



[PATCH] Netlink: Propagate ecmp nexthop weight to kernel for inet6 routes

2022-05-22 Thread Daniel Gröber
Previously nl_send_route would use plain nl_add_nexthop for ecmp ipv6
routes instead of adding RTA_MULTIPATH objects via nl_add_multipath. The
former lacks support for the rtnexthop.rtnh_hops field need for setting the
nexthop weight though.

On the kernel side support for nexthop weights was introduced by commit
398958ae48 ("ipv6: Add support for non-equal-cost multipath") which landed
in 4.16. Before this the weight will simply be ignored. Support for the
ipv6 RTA_MULTIPATH attribute was added in 51ebd31815 ("ipv6: add support of
equal cost multipath (ECMP)") which landed in 3.10. If this version bound
isn't met installing the route will fail.
---
 sysdep/linux/netlink.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sysdep/linux/netlink.c b/sysdep/linux/netlink.c
index 29b744cb..5e0b036d 100644
--- a/sysdep/linux/netlink.c
+++ b/sysdep/linux/netlink.c
@@ -1446,7 +1446,7 @@ dest:
 {
 case RTD_UNICAST:
   r->r.rtm_type = RTN_UNICAST;
-  if (nh->next && !krt_ecmp6(p))
+  if (nh->next)
nl_add_multipath(&r->h, rsize, nh, p->af, eattrs);
   else
   {
-- 
2.30.2



[PATCH v2 2/2] Babel: Add option to control ecmp nexthop weight

2022-05-09 Thread Daniel Gröber
---
 doc/bird.sgml| 7 +++
 proto/babel/babel.c  | 1 +
 proto/babel/babel.h  | 2 ++
 proto/babel/config.Y | 2 ++
 4 files changed, 12 insertions(+)

diff --git a/doc/bird.sgml b/doc/bird.sgml
index d1e6376b..8d159b22 100644
--- a/doc/bird.sgml
+++ b/doc/bird.sgml
@@ -1880,6 +1880,7 @@ protocol babel [] {
check link ;
next hop ipv4 ;
next hop ipv6 ;
+   ecmp weight ;
authentication none|mac [permissive];
password "";
password "" {
@@ -2008,6 +2009,12 @@ protocol babel [] {
   source for Babel packets will be used. In normal operation, it should not
   be necessary to set this option.
 
+  ecmp weight number
+  This specifies the relative weight used for nexthops going through
+  the iface when ECMP is enabled. Larger weight values relative to other
+  nexthops attract more traffic. Valid values are 1-256. Default value
+  is 1.
+
   authentication none|mac 
[permissive]
   Selects authentication method to be used. gw = r->next_hop;
   nh->iface = r->neigh->ifa->iface;
+  nh->weight = r->neigh->ifa->cf->ecmp_weight;
 
   /*
* If we cannot find a reachable neighbour, set the entry to be onlink. This
diff --git a/proto/babel/babel.h b/proto/babel/babel.h
index 46e268fd..06b82799 100644
--- a/proto/babel/babel.h
+++ b/proto/babel/babel.h
@@ -147,6 +147,8 @@ struct babel_iface_config {
   int tx_tos;
   int tx_priority;
 
+  u8 ecmp_weight;
+
   ip_addr next_hop_ip4;
   ip_addr next_hop_ip6;
 
diff --git a/proto/babel/config.Y b/proto/babel/config.Y
index 92e754cd..0da5025d 100644
--- a/proto/babel/config.Y
+++ b/proto/babel/config.Y
@@ -70,6 +70,7 @@ babel_iface_start:
   BABEL_IFACE->tx_tos = IP_PREC_INTERNET_CONTROL;
   BABEL_IFACE->tx_priority = sk_priority_control;
   BABEL_IFACE->check_link = 1;
+  BABEL_IFACE->ecmp_weight = 0;
 };
 
 
@@ -149,6 +150,7 @@ babel_iface_item:
  | AUTHENTICATION NONE { BABEL_IFACE->auth_type = BABEL_AUTH_NONE; }
  | AUTHENTICATION MAC { BABEL_IFACE->auth_type = BABEL_AUTH_MAC; 
BABEL_IFACE->auth_permissive = 0; }
  | AUTHENTICATION MAC PERMISSIVE { BABEL_IFACE->auth_type = BABEL_AUTH_MAC; 
BABEL_IFACE->auth_permissive = 1; }
+ | ECMP WEIGHT expr { BABEL_IFACE->ecmp_weight = $3 - 1; if (($3<1) || 
($3>256)) cf_error("ECMP weight must be in range 1-256"); }
  | password_list
  ;
 
-- 
2.30.2



[PATCH v2 1/2] Babel: Add option to support ecmp routes

2022-05-09 Thread Daniel Gröber
---
 doc/bird.sgml| 25 
 proto/babel/babel.c  | 70 +---
 proto/babel/babel.h  |  5 
 proto/babel/config.Y |  3 ++
 4 files changed, 92 insertions(+), 11 deletions(-)

diff --git a/doc/bird.sgml b/doc/bird.sgml
index 1580facd..d1e6376b 100644
--- a/doc/bird.sgml
+++ b/doc/bird.sgml
@@ -1865,6 +1865,7 @@ protocol babel [] {
ipv4 {  };
ipv6 [sadr] {  };
 randomize router id ;
+   ecmp  [limit ];
interface  {
type ;
rxcost ;
@@ -1909,6 +1910,18 @@ protocol babel [] {
   router ID every time it starts up, which avoids this problem at the cost
   of not having stable router IDs in the network. Default: no.
 
+  ecmp switch [limit 
number]
+
+  Determines whether babel will emit ECMP (equal-cost multipath)
+  routes, allowing to load-balancing traffic across multiple paths. If
+  enabled the maximum number of next-hops to allow can be specified,
+  defaulting to 16.
+
+  When neibours are using a dynamic link-quality metric this is
+  unlikely to be useful. For best results  should be used throughout the network to get what
+  amounts to a hop count metric.
+
   type wired|wireless 
   This option specifies the interface type: Wired or wireless. On wired
   interfaces a neighbor is considered unreachable after a small number of
@@ -1928,6 +1941,18 @@ protocol babel [] {
   selection and not local route selection. Default: 96 for wired 
interfaces,
   256 for wireless.
 
+  ecmp switch [limit 
number]
+
+  Determines whether babel will emit ECMP (equal-cost multipath)
+  routes, allowing to load-balancing traffic across multiple paths. If
+  enabled the maximum number of next-hops to allow can be specified,
+  defaulting to 16.
+
+  When neibours are using a dynamic link-quality metric this is
+  unlikely to be useful. For best results  should be used throughout the network to get what
+  amounts to a hop count metric.
+
   limit 
   BIRD keeps track of received Hello messages from each neighbor to
   establish neighbor reachability. For wired type interfaces, this option
diff --git a/proto/babel/babel.c b/proto/babel/babel.c
index 4a7d550f..01ea3a44 100644
--- a/proto/babel/babel.c
+++ b/proto/babel/babel.c
@@ -623,6 +623,34 @@ done:
   }
 }
 
+/**
+ * babel_nexthop_insert - add next_hop of route to nexthop list
+ * @p: Babel protocol instance
+ * @r: Babel route
+ * @nhs: nexthop list head to append onto
+ * @nh: freshly allocated buffer to fill
+ */
+static void
+babel_nexthop_insert(
+  struct babel_proto *p,
+  struct babel_route *r,
+  struct nexthop **nhs,
+  struct nexthop *nh)
+{
+  nh->gw = r->next_hop;
+  nh->iface = r->neigh->ifa->iface;
+
+  /*
+   * If we cannot find a reachable neighbour, set the entry to be onlink. This
+   * makes it possible to, e.g., assign /32 addresses on a mesh interface and
+   * have routing work.
+   */
+  if (!neigh_find(&p->p, r->next_hop, r->neigh->ifa->iface, 0))
+nh->flags = RNF_ONLINK;
+
+  nexthop_insert(nhs, nh);
+}
+
 /**
  * babel_announce_rte - announce selected route to the core
  * @p: Babel protocol instance
@@ -635,6 +663,7 @@ done:
 static void
 babel_announce_rte(struct babel_proto *p, struct babel_entry *e)
 {
+  struct babel_config *cf = (void *) p->p.cf;
   struct babel_route *r = e->selected;
   struct channel *c = (e->n.addr->type == NET_IP4) ? p->ip4_channel : 
p->ip6_channel;
 
@@ -645,18 +674,24 @@ babel_announce_rte(struct babel_proto *p, struct 
babel_entry *e)
   .source = RTS_BABEL,
   .scope = SCOPE_UNIVERSE,
   .dest = RTD_UNICAST,
-  .from = r->neigh->addr,
-  .nh.gw = r->next_hop,
-  .nh.iface = r->neigh->ifa->iface,
 };
 
-/*
- * If we cannot find a reachable neighbour, set the entry to be onlink. 
This
- * makes it possible to, e.g., assign /32 addresses on a mesh interface and
- * have routing work.
- */
-if (!neigh_find(&p->p, r->next_hop, r->neigh->ifa->iface, 0))
-  a0.nh.flags = RNF_ONLINK;
+struct nexthop *nhs = NULL;
+babel_nexthop_insert(p, r, &nhs, allocz(sizeof(struct nexthop)));
+int num_nexthops = 1;
+
+struct babel_route *cr;
+WALK_LIST(cr, e->routes) {
+  if (cr == r || !cr->feasible || cr->metric != r->metric)
+   continue;
+
+  if (num_nexthops++ >= cf->max_nexthops)
+   break;
+
+  babel_nexthop_insert(p, cr, &nhs, allocz(sizeof(struct nexthop)));
+}
+
+a0.nh = *nhs;
 
 rta *a = rta_lookup(&a0);
 rte *rte = rte_get_temp(a);
@@ -736,6 +771,7 @@ babel_announce_retraction(struct babel_proto *p, struct 
babel_entry *e)
 static void
 babel_select_route(struct babel_proto *p, struct babel_entry *e, struct 
babel_route *mod)
 {
+  struct babel_config *cf = (void *) p->p.cf;
   struct babel_route *r, *best = e->selected;
 
   /* Shortcut if only non-best was modified */
@@ -744,8 

Re: Merging multiple routes into a single multipath route

2022-05-08 Thread Daniel Gröber
Hi Johannes,

On Tue, May 03, 2022 at 04:07:08PM +0200, Johannes Erwerle wrote:
> I would like to use both uplinks via the linux multipath routing and I am
> searching for a way to "merge" both default routes that I get into one route
> with multiple next hops.
> 
> Is there a way to accomplish that with bird? (or any other tool?)

Instead of letting the dhcp client install the routes and then trying to
massage them into shape you could do this directly by configuring the dhcp
client appropriately. The isc-dhcp client actually and very good support
for scripting in dhclient-*-hooks.d that lets you override what the
dhclient-script (which handles the OS specific stuff) does by default.

I have been working on a dhcp script myself to allow for VRRP redundancy on
a DHCP managed interface. You should be able to use the
neutralize_ip_change hack I use to override iproute2 commands called by
dhclient-script for your purposes too:

neutralize_ip_change () {
ip () neutered_ip "$@"
}
neutered_ip () { : do stuff and call real "ip" command; }

This works because dhclient-script (a bash script) sources the hooks so we
can mess with the shell environment to override the "ip" command with a
shell function that matches on the args and does whatever we want.

Script attached since I haven't published it anywhere yet. Note this is
only lightly tested so far but I'd be happy to help you make this work for
your use-case.

--Daniel
#!/bin/sh
# Install into /etc/dhcp and symlink to /etc/dhcp/dhclient-{enter,exit}-hooks.d

IFACE=enp9s0f1.4
VRID=4
PRIO=100
DUMMYIP_OLD=192.168.254.254
DUMMYIP_NEW=192.168.254.253

PIDFILE=/var/run/"vrrpd_${IFACE}_${VRID}.pid"
HOOKDIR="$(basename "$dir")"

## Debugging
#set -x
#exec >> /tmp/dhcp.log 2>&1
echo $HOOKDIR $reason $interface >>/tmp/dhcp.log
set | grep 'ip_address\|route' | sort >>/tmp/dhcp.log
printf '\n\n' >>/tmp/dhcp.log

killvrrp () {
echo "killvrrp $PIDFILE" >>/tmp/dhcp.log
[ -n $old_ip_address ] || return
if [ -e "$PIDFILE" ]; then
kill "$(cat $PIDFILE)" || true
fi
}

startvrrp () {
echo "startvrrp $new_ip_address\n" >>/tmp/dhcp.log
[ -n $new_ip_address ] || return
echo vrrpd -D -n -i "$interface" -v "$VRID" -p "$PRIO" "$new_ip_address"
vrrpd -D -n -i "$interface" -v "$VRID" -p "$PRIO" "$new_ip_address" ||
exit_status=1
}

# dhclient-script insists on fucking with the IP address, we don't want any
# of that since vrrpd will do it.
neutralize_ip_change () {
ip () neutered_ip "$@"
}

neutered_ip () {
[ $# -ge 3 ] || { ip "$@"; return; }
if
[ "$1" = "-4" -a "$2" = addr -a "$3" = flush ] ||
[ "$1" = "-4" -a "$2" = addr -a "$3" = add ] ||
[ "$1" = "-4" -a "$2" = addr -a "$3" = change ]
then
echo ignored "$@"
return
else
echo =ip "$@"
command ip "$@"
fi
}

do_vrrp () {

if [ "$HOOKDIR" = "dhclient-enter-hooks.d" ]; then
case $reason in
BOUND | REBOOT | TIMEOUT) ;; # Just started shouldn't be running
RENEW | REBIND)
[ $new_ip_address = $old_ip_address] || killvrrp ;;
EXPIRE | FAIL | RELEASE | STOP | TIMEOUT)
killvrrp ;;
esac

neutralize_ip_change
elif [ "$HOOKDIR" = "dhclient-exit-hooks.d" ]; then
case $reason in
# Note: TIMEOUT is when we can't reach a dhcp server during
# startup but have a valid lease still that we're going to
# use.
BOUND | REBOOT | TIMEOUT)
startvrrp ;;
RENEW | REBIND)
[ $new_ip_address = $old_ip_address] || startvrrp ;;
EXPIRE | FAIL | RELEASE) ;;
esac
fi
}

if [ "$interface" = $IFACE ]; then
do_vrrp
fi


[PATCH 1/2] Babel: Add option to support ecmp routes

2022-05-08 Thread Daniel Gröber
---
The FIB walks in babel_reconfigure_iface are a bit ugly, I tried to
loop over ifa->neigh_list instead but couldn't get it to work. I'm
open to suggestions :)

 doc/bird.sgml|  8 +
 proto/babel/babel.c  | 70 +---
 proto/babel/babel.h  |  3 ++
 proto/babel/config.Y |  3 ++
 4 files changed, 73 insertions(+), 11 deletions(-)

diff --git a/doc/bird.sgml b/doc/bird.sgml
index 1580facd..5e85d8ec 100644
--- a/doc/bird.sgml
+++ b/doc/bird.sgml
@@ -1879,6 +1879,7 @@ protocol babel [] {
check link ;
next hop ipv4 ;
next hop ipv6 ;
+   ecmp  [limit ];
authentication none|mac [permissive];
password "";
password "" {
@@ -1983,6 +1984,13 @@ protocol babel [] {
   source for Babel packets will be used. In normal operation, it should not
   be necessary to set this option.
 
+  ecmp switch [limit 
number]
+
+  Determines whether babel will emit ECMP (equal-cost multipath)
+  routes, allowing to load-balancing traffic across multiple paths. If
+  enabled the maximum number of next-hops to allow can be specified,
+  defaulting to 16.
+
   authentication none|mac 
[permissive]
   Selects authentication method to be used. gw = r->next_hop;
+  nh->iface = r->neigh->ifa->iface;
+
+  /*
+   * If we cannot find a reachable neighbour, set the entry to be onlink. This
+   * makes it possible to, e.g., assign /32 addresses on a mesh interface and
+   * have routing work.
+   */
+  if (!neigh_find(&p->p, r->next_hop, r->neigh->ifa->iface, 0))
+nh->flags = RNF_ONLINK;
+
+  nexthop_insert(nhs, nh);
+}
+
 /**
  * babel_announce_rte - announce selected route to the core
  * @p: Babel protocol instance
@@ -635,6 +663,7 @@ done:
 static void
 babel_announce_rte(struct babel_proto *p, struct babel_entry *e)
 {
+  struct babel_config *cf = (void *) p->p.cf;
   struct babel_route *r = e->selected;
   struct channel *c = (e->n.addr->type == NET_IP4) ? p->ip4_channel : 
p->ip6_channel;
 
@@ -645,18 +674,24 @@ babel_announce_rte(struct babel_proto *p, struct 
babel_entry *e)
   .source = RTS_BABEL,
   .scope = SCOPE_UNIVERSE,
   .dest = RTD_UNICAST,
-  .from = r->neigh->addr,
-  .nh.gw = r->next_hop,
-  .nh.iface = r->neigh->ifa->iface,
 };
 
-/*
- * If we cannot find a reachable neighbour, set the entry to be onlink. 
This
- * makes it possible to, e.g., assign /32 addresses on a mesh interface and
- * have routing work.
- */
-if (!neigh_find(&p->p, r->next_hop, r->neigh->ifa->iface, 0))
-  a0.nh.flags = RNF_ONLINK;
+struct nexthop *nhs = NULL;
+babel_nexthop_insert(p, r, &nhs, allocz(sizeof(struct nexthop)));
+int num_nexthops = 1;
+
+struct babel_route *cr;
+WALK_LIST(cr, e->routes) {
+  if (num_nexthops++ >= cf->ecmp)
+   break;
+
+  if (cr == r || !cr->feasible || cr->metric != r->metric)
+   continue;
+
+  babel_nexthop_insert(p, cr, &nhs, allocz(sizeof(struct nexthop)));
+}
+
+a0.nh = *nhs;
 
 rta *a = rta_lookup(&a0);
 rte *rte = rte_get_temp(a);
@@ -736,6 +771,7 @@ babel_announce_retraction(struct babel_proto *p, struct 
babel_entry *e)
 static void
 babel_select_route(struct babel_proto *p, struct babel_entry *e, struct 
babel_route *mod)
 {
+  struct babel_config *cf = (void *) p->p.cf;
   struct babel_route *r, *best = e->selected;
 
   /* Shortcut if only non-best was modified */
@@ -744,8 +780,10 @@ babel_select_route(struct babel_proto *p, struct 
babel_entry *e, struct babel_ro
 /* Either select modified route, or keep old best route */
 if ((mod->metric < (best ? best->metric : BABEL_INFINITY)) && 
mod->feasible)
   best = mod;
-else
+else if (cf->ecmp==1)
   return;
+/* With ecmp one of the non-selected but equal metric routes might have
+ * changed so contine on with the announcement in that case. */
   }
   else
   {
@@ -1879,6 +1917,16 @@ babel_reconfigure_iface(struct babel_proto *p, struct 
babel_iface *ifa, struct b
   if (ifa->up)
 babel_iface_kick_timer(ifa);
 
+  /* Update all routes to refresh ecmp settings. */
+  FIB_WALK(&p->ip6_rtable, struct babel_entry, e) {
+babel_select_route(p, e, NULL);
+  }
+  FIB_WALK_END;
+  FIB_WALK(&p->ip4_rtable, struct babel_entry, e) {
+babel_select_route(p, e, NULL);
+  }
+  FIB_WALK_END;
+
   return 1;
 }
 
diff --git a/proto/babel/babel.h b/proto/babel/babel.h
index 84feb085..3868dcb2 100644
--- a/proto/babel/babel.h
+++ b/proto/babel/babel.h
@@ -62,6 +62,8 @@
 #define BABEL_OVERHEAD (IP6_HEADER_LENGTH+UDP_HEADER_LENGTH)
 #define BABEL_MIN_MTU  (512 + BABEL_OVERHEAD)
 
+#define BABEL_DEFAULT_ECMP_LIMIT   16
+
 #define BABEL_AUTH_NONE0
 #define BABEL_AUTH_MAC 1
 
@@ -120,6 +122,7 @@ struct babel_config {
   list iface_list; /* List of ifac

[PATCH 2/2] Babel: Add option to control ecmp nexthop weight

2022-05-08 Thread Daniel Gröber
---
 doc/bird.sgml| 21 ++---
 proto/babel/babel.c  |  1 +
 proto/babel/babel.h  |  2 ++
 proto/babel/config.Y |  2 ++
 4 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/doc/bird.sgml b/doc/bird.sgml
index 5e85d8ec..1fdda7bc 100644
--- a/doc/bird.sgml
+++ b/doc/bird.sgml
@@ -1865,6 +1865,7 @@ protocol babel [] {
ipv4 {  };
ipv6 [sadr] {  };
 randomize router id ;
+   ecmp  [limit ];
interface  {
type ;
rxcost ;
@@ -1879,7 +1880,7 @@ protocol babel [] {
check link ;
next hop ipv4 ;
next hop ipv6 ;
-   ecmp  [limit ];
+   ecmp weight ;
authentication none|mac [permissive];
password "";
password "" {
@@ -1910,6 +1911,13 @@ protocol babel [] {
   router ID every time it starts up, which avoids this problem at the cost
   of not having stable router IDs in the network. Default: no.
 
+  ecmp switch [limit 
number]
+
+  Determines whether babel will emit ECMP (equal-cost multipath)
+  routes, allowing to load-balancing traffic across multiple paths. If
+  enabled the maximum number of next-hops to allow can be specified,
+  defaulting to 16.
+
   type wired|wireless 
   This option specifies the interface type: Wired or wireless. On wired
   interfaces a neighbor is considered unreachable after a small number of
@@ -1984,12 +1992,11 @@ protocol babel [] {
   source for Babel packets will be used. In normal operation, it should not
   be necessary to set this option.
 
-  ecmp switch [limit 
number]
-
-  Determines whether babel will emit ECMP (equal-cost multipath)
-  routes, allowing to load-balancing traffic across multiple paths. If
-  enabled the maximum number of next-hops to allow can be specified,
-  defaulting to 16.
+  ecmp weight number
+  This specifies the relative weight used for nexthops going through
+  the iface when ECMP is enabled. Larger weight values relative to other
+  nexthops attract more traffic. Valid values are 1-256. Default value
+  is 1.
 
   authentication none|mac 
[permissive]
   Selects authentication method to be used. gw = r->next_hop;
   nh->iface = r->neigh->ifa->iface;
+  nh->weight = r->neigh->ifa->cf->ecmp_weight;
 
   /*
* If we cannot find a reachable neighbour, set the entry to be onlink. This
diff --git a/proto/babel/babel.h b/proto/babel/babel.h
index 3868dcb2..2ad1ded0 100644
--- a/proto/babel/babel.h
+++ b/proto/babel/babel.h
@@ -145,6 +145,8 @@ struct babel_iface_config {
   int tx_tos;
   int tx_priority;
 
+  u8 ecmp_weight;
+
   ip_addr next_hop_ip4;
   ip_addr next_hop_ip6;
 
diff --git a/proto/babel/config.Y b/proto/babel/config.Y
index 8f18c790..6f03482a 100644
--- a/proto/babel/config.Y
+++ b/proto/babel/config.Y
@@ -70,6 +70,7 @@ babel_iface_start:
   BABEL_IFACE->tx_tos = IP_PREC_INTERNET_CONTROL;
   BABEL_IFACE->tx_priority = sk_priority_control;
   BABEL_IFACE->check_link = 1;
+  BABEL_IFACE->ecmp_weight = 0;
 };
 
 
@@ -149,6 +150,7 @@ babel_iface_item:
  | AUTHENTICATION NONE { BABEL_IFACE->auth_type = BABEL_AUTH_NONE; }
  | AUTHENTICATION MAC { BABEL_IFACE->auth_type = BABEL_AUTH_MAC; 
BABEL_IFACE->auth_permissive = 0; }
  | AUTHENTICATION MAC PERMISSIVE { BABEL_IFACE->auth_type = BABEL_AUTH_MAC; 
BABEL_IFACE->auth_permissive = 1; }
+ | ECMP WEIGHT expr { BABEL_IFACE->ecmp_weight = $3 - 1; if (($3<1) || 
($3>256)) cf_error("ECMP weight must be in range 1-256"); }
  | password_list
  ;
 
-- 
2.30.2



[PATCH] krt: Dump routing tables separetely on linux to avoid congestion

2022-04-16 Thread Daniel Gröber
When dumping the routing table bird currently doesn't set the rtm_table
netlink field to select any particular one but rather wants to get all at
once.

This can be problematic when multiple routing daemons are running on a
system as the kernel's route modification performance goes down
drasticly (by a factor of 20-200ish) when the table is being modified while
it's being dumped.

To avoid this situation we make bird do dumps on a per-kernel-table
basis. This then allows the administrator to have multiple routing daemons
use different kernel tables which sidesteps the problem.

See also this discussion on the babel-users mailing list:
  https://alioth-lists.debian.net/pipermail/babel-users/2022-April/003902.html
---
 sysdep/cf/linux.h  | 2 +-
 sysdep/linux/netlink.c | 7 ---
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/sysdep/cf/linux.h b/sysdep/cf/linux.h
index 047d3764..d6211d16 100644
--- a/sysdep/cf/linux.h
+++ b/sysdep/cf/linux.h
@@ -9,7 +9,7 @@
 #define CONFIG_AUTO_ROUTES
 #define CONFIG_SELF_CONSCIOUS
 #define CONFIG_MULTIPLE_TABLES
-#define CONFIG_ALL_TABLES_AT_ONCE
+#undef  CONFIG_ALL_TABLES_AT_ONCE
 #define CONFIG_IP6_SADR_KERNEL
 
 #define CONFIG_MC_PROPER_SRC
diff --git a/sysdep/linux/netlink.c b/sysdep/linux/netlink.c
index 29b744cb..003b7939 100644
--- a/sysdep/linux/netlink.c
+++ b/sysdep/linux/netlink.c
@@ -256,7 +256,7 @@ nl_request_dump_addr(int af)
 }
 
 static void
-nl_request_dump_route(int af)
+nl_request_dump_route(int af, int table_id)
 {
   struct {
 struct nlmsghdr nh;
@@ -267,6 +267,7 @@ nl_request_dump_route(int af)
 .nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP,
 .nh.nlmsg_seq = ++(nl_scan.seq),
 .rtm.rtm_family = af,
+.rtm.rtm_table = table_id,
   };
 
   send(nl_scan.fd, &req, sizeof(req), 0);
@@ -1976,13 +1977,13 @@ nl_parse_route(struct nl_parse_state *s, struct 
nlmsghdr *h)
 }
 
 void
-krt_do_scan(struct krt_proto *p UNUSED)/* CONFIG_ALL_TABLES_AT_ONCE => 
p is NULL */
+krt_do_scan(struct krt_proto *p)
 {
   struct nlmsghdr *h;
   struct nl_parse_state s;
 
   nl_parse_begin(&s, 1);
-  nl_request_dump_route(AF_UNSPEC);
+  nl_request_dump_route(AF_UNSPEC, KRT_CF->sys.table_id);
   while (h = nl_get_scan())
 if (h->nlmsg_type == RTM_NEWROUTE || h->nlmsg_type == RTM_DELROUTE)
   nl_parse_route(&s, h);
-- 
2.30.2