Re: [Babel-users] babeld slashes kernel route manipulation performance by 17000%

2022-05-04 Thread Daniel Gröber
On Thu, Apr 14, 2022 at 08:58:53PM +0200, Toke Høiland-Jørgensen wrote:
> Side note: why is bird replacing all the routes in the first place? :)

FYI: I figured this out in the end. Turns out I had the BGP sessions in
bird configured as `multihop 1`/direct with global scope addresses that
babel would (sometimes) re-route indirectly. Every time that happened the
BGP session would then break :)

I use link-locals now, much better. Doesn't really make sense to re-route
the BGP session endpoints haha.

--Daniel

___
Babel-users mailing list
Babel-users@alioth-lists.debian.net
https://alioth-lists.debian.net/cgi-bin/mailman/listinfo/babel-users


Re: [Babel-users] babeld slashes kernel route manipulation performance by 17000%

2022-04-16 Thread Daniel Gröber
Hi Toke,

So I was debugging why I wasn't seeing a performance improvement from your
patch, see below.

On Fri, Apr 15, 2022 at 12:48:05AM +0200, Toke Høiland-Jørgensen wrote:
> diff --git a/kernel_netlink.c b/kernel_netlink.c
> index efe1243c3b07..36aae29124a5 100644
> --- a/kernel_netlink.c
> +++ b/kernel_netlink.c
> @@ -236,7 +236,7 @@ static int nl_setup = 0;
>  static int
>  netlink_socket(struct netlink *nl, uint32_t groups)
>  {
> -int rc;
> +int rc, strict;
>  int rcvsize = 512 * 1024;
>  
>  nl->sock = socket(PF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
> @@ -271,6 +271,9 @@ netlink_socket(struct netlink *nl, uint32_t groups)
>  perror("setsockopt(SO_RCVBUF)");
>  }
>  }
> +rc = setsockopt(nl->sock, SOL_NETLINK, NETLINK_GET_STRICT_CHK, &strict, sizeof(strict));

You're using `strict` uninitialized here. If I set strict = 1 it works though :D
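
For anyone following along, here is a minimal sketch of the corrected call.
The helper name enable_strict_checking() is made up for illustration and is
not part of babeld; the fallback defines are only there for older headers.
As far as I understand it, the kernel treats the option value as a boolean,
so an uninitialised `strict` that happens to be 0 would leave strict checking
off and the table filter in the dump request would be ignored.

```c
#include <stdio.h>
#include <sys/socket.h>
#include <linux/netlink.h>

/* Fallbacks for older headers; these are the Linux UAPI values. */
#ifndef SOL_NETLINK
#define SOL_NETLINK 270
#endif
#ifndef NETLINK_GET_STRICT_CHK
#define NETLINK_GET_STRICT_CHK 12
#endif

/* Hypothetical helper, not part of babeld: ask the kernel to apply strict
   checking to requests sent on this netlink socket.
   Returns 0 on success, -1 with errno set otherwise. */
static int
enable_strict_checking(int sock)
{
    int strict = 1;     /* must be non-zero to switch the option on */
    int rc;

    rc = setsockopt(sock, SOL_NETLINK, NETLINK_GET_STRICT_CHK,
                    &strict, sizeof(strict));
    if(rc < 0)
        perror("setsockopt(NETLINK_GET_STRICT_CHK)");
    return rc;
}
```

A failure here (for example on a kernel that predates the option) need not
be fatal: the dump would simply be unfiltered, as before.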

--Daniel



Re: [Babel-users] babeld slashes kernel route manipulation performance by 17000%

2022-04-16 Thread Daniel Gröber
Hi Toke,

On Fri, Apr 15, 2022 at 12:48:05AM +0200, Toke Høiland-Jørgensen wrote:
> Poked a bit more into the kernel fib code; the more I'm looking at it,
> the more I'm convinced that the contention between add and dump is a
> fundamental feature of the way the routing table is implemented, so I'm
> not so sure it's simply a "bug" that can be "fixed" :(

Hmm, so do you think we should still send a report to netdev then?

> > That is a good question, bird should really be able to see that the route
> > is already installed and just not bother. I see this del/add behaviour
> > even when bgp is otherwise nice and converged though so I assumed bird is
> > just like this.
> 
> Hmm, that's odd. What's your "background radiation" (i.e. route updates
> per second when Bird is running normally and babeld isn't started)? I
> just checked my own router (which also imports a full v6 table), and
> that churns less than one route per second. So if you're seeing a lot of
> churn, maybe it's something in your config that could be fixed?

No, no, it's also just a couple routes per second here too.

> Alternatively, an option could be to improve Bird's performance when
> replacing routes; for one thing, there's this comment in bird's
> netlink.c:

Right, if they don't quite trust the kernel, that could explain the add/del
behaviour.

> I've been meaning to look into adding nexthop support to Bird anyway, so
> this could be a nice occasion to bump that up my list. Don't take that
> as a promise, though... :P

I'm not sure how nexthop objects would help with this problem specifically?
If bird doesn't trust the kernel, then even if it can update the nexthop
directly it can't necessarily trust that the other route attributes are
right either, and so it would still have to replace the FIB entry.

> > As I said before it always triggers when I (re)start babeld but I can't see
> > anything obvious in the log even with debug on as to why. Particularly I
> > don't see any bgp state events so the sessions should be fine but for some
> > reason it decides to churn everything anyway.
> 
> Well, the trigger when starting babeld would be the initial route dump,
> I suppose: If you have lots of route churn happening in the background,
> the drop in insert performance caused by the dump would be your trigger,
> no?

To clarify: when I start babeld, bird is not yet churning, just doing
background-level updates, but the act of starting babeld seems to somehow
make bird start churning routes soon after. I don't think the route dump
alone should/could make bird do anything to its routes, otherwise an
iproute2 dump would likely also do it.

I can tell bird is starting to churn because its CPU usage goes up to 100%
(mostly in the kernel). It's pretty mystifying how this could be connected,
to be sure; I'll have to do more testing.

> One thing I noticed when playing around with your reproducer example,
> which may be something we could apply to the babeld case: If I run 'ip
> -6 route show table 1337' I get the slowdown, but if I just run a
> regular 'ip -6 route show', I do not. This seems to be because iproute2
> is adding the table to the route dump request, which will make the
> kernel dump only the requested table. And since the lock that's being
> contended is per table, that should nicely get rid of the contention. A
> patch to do this is included below (only compile-tested, so no idea if
> it'll actually work :)).

Ah this is excellent, thanks! I was wondering if the kernel keeps the
tables in separate data structures or not.

Your patch seems to be against a different babeld branch than what I have
(can't see any CHANGE_RULE stuff here) but with that bit removed it applies
fine.

I just tested it and it does indeed seem to work. However, I think we also
need to make bird use table-specific dumps: I'm still seeing the slowdown,
and bird doesn't seem to set rtm_table in nl_request_dump_route either.
I'll get on that.
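
To make the table-specific dump concrete, here is a rough sketch of what
such a request could look like on the wire. build_table_dump_request() is an
invented name and this is neither babeld's nor bird's code. It assumes the
socket has strict checking enabled (NETLINK_GET_STRICT_CHK), and that table
ids above 255 have to travel in an RTA_TABLE attribute because rtm_table is
only 8 bits wide.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Hypothetical helper: fill buf with an RTM_GETROUTE dump request that is
   restricted to a single routing table.
   Returns the message length, or 0 if buf is too small. */
static size_t
build_table_dump_request(void *buf, size_t buflen, int family, uint32_t table)
{
    struct nlmsghdr *nh = buf;
    struct rtmsg *rtm;
    struct rtattr *rta;
    size_t len = NLMSG_SPACE(sizeof(struct rtmsg)) +
                 RTA_SPACE(sizeof(uint32_t));

    if(buflen < len)
        return 0;
    memset(buf, 0, len);

    nh->nlmsg_len = (uint32_t)len;
    nh->nlmsg_type = RTM_GETROUTE;
    nh->nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;

    rtm = NLMSG_DATA(nh);
    rtm->rtm_family = (unsigned char)family;
    /* Small table ids fit in the header; larger ones only in RTA_TABLE. */
    rtm->rtm_table = table < 256 ? (unsigned char)table : RT_TABLE_UNSPEC;

    rta = RTM_RTA(rtm);
    rta->rta_type = RTA_TABLE;
    rta->rta_len = RTA_LENGTH(sizeof(uint32_t));
    memcpy(RTA_DATA(rta), &table, sizeof(table));

    return len;
}
```

The returned buffer would then be handed to send() on the rtnetlink socket;
for the reproducer in this thread that would be family AF_INET6 and
table 1337.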

--Daniel



Re: [Babel-users] babeld slashes kernel route manipulation performance by 17000%

2022-04-14 Thread Daniel Gröber
On Thu, Apr 14, 2022 at 08:58:53PM +0200, Toke Høiland-Jørgensen wrote:
> Yeah, I do. He's also one of the maintainers of the routing code, so
> definitely the right person to Cc on this (explicitly Cc'ing maintainers
> makes sure they see your email as not everyone follows netdev
> rigorously).

Kk :)

> Ah, okay, that's interesting. Playing around with your examples, on my
> laptop the performance goes from ~90k/s to ~1k/s when doing just a
> single 'ip -6 route show table 1337'. The dump itself takes between 5-10
> seconds, so with the 30-sec interval in babeld I guess the periodic dump
> can coincide with the update at random.
> 
> Side note: why is bird replacing all the routes in the first place? :)

That is a good question, bird should really be able to see that the route
is already installed and just not bother. I see this del/add behaviour
even when bgp is otherwise nice and converged though so I assumed bird is
just like this.

As I said before it always triggers when I (re)start babeld but I can't see
anything obvious in the log even with debug on as to why. Particularly I
don't see any bgp state events so the sessions should be fine but for some
reason it decides to churn everything anyway.

> > I'm currently working on babel ECMP support in bird though maybe I'll
> > have a stab at RTT after that.
> 
> On the subject of ECMP and Babel, you may want to read this thread:
> https://mailarchive.ietf.org/arch/msg/babel/i4tqsRIL3DS9e22GJ0QuoMef-P0/ 
> 
> I.e., it's not just a matter of writing the code; we'll also need to
> define the semantics in the spec. Just so you know what you're getting
> yourself into ;)

Interesting thread, thanks.

I think for my use-case the loop avoidance point is moot though since I'm
mainly interested in using this on endpoints, not routers. So perhaps
calling this ECMP is not the right nomenclature?

--Daniel



Re: [Babel-users] babeld slashes kernel route manipulation performance by 17000%

2022-04-14 Thread Daniel Gröber
Hi Toke,

On Thu, Apr 14, 2022 at 12:12:36AM +0200, Toke Høiland-Jørgensen wrote:
> How about submitting this report to netdev and asking for advice there?
> From a quick glance at the kernel fib code, this does not look like it's
> an easy fix (if it can be fixed at all), but we should really get
> someone who is an expert in the kernel routing code (which I'm not,
> sadly) to weigh in. You could add an explicit Cc to David Ahern
> when submitting, and please keep me in Cc as
> well. Or if you'd prefer, I can submit the report on your behalf?

I'll try to get around to that but no promises :)

Do you know David? I don't like just CCing people I don't know at random.

> As for why you're seeing this in particular when Babel is running, now
> that we know the route dump is the culprit, it's quite obvious: While
> Babel listens for new route notifications from the kernel, it doesn't
> actually use those notifications directly; instead, it just sets a flag
> (see kernel_route_notify() in babeld.c), and does a full dump whenever
> it gets a notification. Which obviously interacts really badly with lots
> of routes being inserted at the same time, as that will basically send
> Babel into a loop of doing nothing but route dumps.

I saw that too, and I was poking at the babeld code for a while before
settling on the iproute2 reproducer. I also compared it quite closely with
bird, and I can't say I really see a difference in what they do other than
netlink buffer sizing.

Both will periodically dump the whole table, so if I had two instances of
bird running concurrently I could experience the same problem. It seems to
be the recvmsg call that's blocking forever in the kernel while the table
churn is going on, so it's not even related to babeld doing a quadratic
number of dumps or anything.

What is also interesting is that babeld already seems to correctly filter
the notifications by table id so all my route churn never actually sets the
kernel_routes_changed flag (see parse_kernel_route_rta import_tables check
at the bottom).
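
A toy model of the two behaviours discussed here, i.e. "set a flag, dump
later" plus the table filter, might look like the sketch below. All names
are invented for the sketch (only kernel_routes_changed is borrowed from
babeld, as the thing routes_changed stands in for), and the real code is of
course more involved.

```c
#include <stdint.h>

/* Toy model only; none of these names exist in babeld. */
struct route_watcher {
    uint32_t import_table;  /* the one table this daemon cares about */
    int routes_changed;     /* stand-in for babeld's kernel_routes_changed */
    unsigned full_dumps;    /* number of full dumps "performed" so far */
};

/* Called once per route notification received from the kernel. */
static void
on_route_notification(struct route_watcher *w, uint32_t table)
{
    if(table != w->import_table)
        return;             /* churn in other tables never sets the flag */
    w->routes_changed = 1;  /* only remember that something changed */
}

/* Called once per main-loop iteration. */
static void
check_kernel_routes(struct route_watcher *w)
{
    if(!w->routes_changed)
        return;
    w->routes_changed = 0;
    w->full_dumps++;        /* a real daemon would re-read the table here */
}
```

In this model a thousand notifications for table 1337 cause no dump at all
when import_table is 254 (main), and a thousand notifications for the main
table between two loop iterations cause exactly one, which matches the
observation that the churn itself never sets the flag.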

> Bird does things a bit differently: it will directly update its internal
> routing table from the netlink notification messages, and only does a
> full dump at intervals (by default once every minute, but it can be
> configured to run entirely without dumps).

Right but the important part is that it does very much still do the dumps
:)

Also I wonder how netlink buffer overruns are dealt with when there isn't a
periodic dump? Wouldn't it still have to do a full dump to resync if that
happens?
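
On the overrun question: per netlink(7), when a netlink socket's receive
buffer overflows the kernel drops messages and the reader sees ENOBUFS from
recvmsg, and it is up to the application to resynchronise. Whether bird
handles it exactly this way is not confirmed in this thread; the sketch
below only shows the generic pattern, with invented names.

```c
#include <errno.h>
#include <sys/types.h>

enum nl_action {
    NL_APPLY,   /* parse the messages and update state incrementally */
    NL_RESYNC,  /* notifications were lost: schedule a full dump */
    NL_RETRY,   /* nothing to read right now, or interrupted */
    NL_FAIL     /* unexpected error */
};

/* Hypothetical helper: decide what to do after a recvmsg() on a netlink
   notification socket, given its return value and the saved errno. */
static enum nl_action
classify_recv(ssize_t rc, int err)
{
    if(rc >= 0)
        return NL_APPLY;
    if(err == ENOBUFS)
        return NL_RESYNC;
    if(err == EAGAIN || err == EWOULDBLOCK || err == EINTR)
        return NL_RETRY;
    return NL_FAIL;
}
```

So even a daemon that normally runs without periodic dumps would, in this
pattern, fall back to one full dump whenever NL_RESYNC comes up.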

> AFAICT the babeld code will require quite a bit of surgery to change
> this behaviour; to the point where I think it may be simpler to
> implement the RTT extension in Bird (but I'm obviously biased here)... :)

In order to scale the number of native babel routes further you're probably
right but that's not necessary for my use-case anyway. If this kernel bug
goes away babeld would still work fine IMO.

I'm currently working on babel ECMP support in bird though maybe I'll have
a stab at RTT after that.

--Daniel



Re: [Babel-users] babeld slashes kernel route manipulation performance by 17000%

2022-04-13 Thread Juliusz Chroboczek
> As for why you're seeing this in particular when Babel is running, now
> that we know the route dump is the culprit, it's quite obvious: While
> Babel listens for new route notifications from the kernel, it doesn't
> actually use those notifications directly; instead, it just sets a flag
> (see kernel_route_notify() in babeld.c), and does a full dump whenever
> it gets a notification.

You're right, as usual.

> Bird does things a bit differently: it will directly update its internal
> routing table from the netlink notification messages, and only does a
> full dump at intervals (by default once every minute, but it can be
> configured to run entirely without dumps).

Yeah, that's the right way.  Could you please point me at the place in
BIRD where you parse a netlink notification?

-- Juliusz





Re: [Babel-users] babeld slashes kernel route manipulation performance by 17000%

2022-04-13 Thread dxld
Hi Toke,

So this is definitely a kernel bug. I've managed to reproduce it with only
iproute2 commands. The problem seems to be dumping the whole FIB while lots
of individual route modifications are taking place.

First we have to generate some ip-route(1) -batch commands to use. You can
use a bgp route dump I've uploaded or create some synthetic prefixes if you
like:

$ get_prefixes () { curl https://dxld.net/bgp.prefixes; }
$ get_prefixes | awk '{ print "route add table 1337 unreachable " $1 }' > add-routes
$ get_prefixes | awk '{ print "route del table 1337 unreachable " $1; print "route add table 1337 unreachable " $1 }' > change-routes

To reproduce this I first insert a bunch of routes from that route dump
using ip -batch:

# ip -batch ./add-routes

Then to simulate what bird is doing I use a version of this dump where
every route is removed and re-added in a loop:

# while sleep 0.1; do ip -batch ./change-routes; done

While this is going on monitor route insertion performance using

# while sleep 0.1; do { timeout 1 ip -6 monitor; } | wc -l; done

On my system this shows ~10k routes/s. If we now dump the table while
change-routes is running the performance drops to ~500 routes/s on my
system:

# while sleep 0.1; do ip -6 route show table 1337 >/dev/null; done

FYI: peeking at `perf top` shows fib6_walk_continue and mutex_spin_on_owner
as the main offenders and almost all of the CPU time during this test is
spent in the kernel.

--Daniel

PS: To clean up use `ip -6 route flush table 1337`.

On Fri, Apr 08, 2022 at 02:38:43PM +0200, d...@darkboxed.org wrote:
> On Fri, Apr 08, 2022 at 01:57:01PM +0200, Toke Høiland-Jørgensen wrote:
> > Daniel Gröber writes:
> > > I'll probably try that tomorrow then.
> > 
> > Alright, let me know how it goes; I can go poking at the kernel, but
> > having a reproducer makes that a lot easier :)
> 
> So I tried ip -batch but it seems it's, um, batching the sendmsg calls too
> much :)
> 
> Bird does a separate sendto call for each route but iproute2 batches them
> into only 1k-ish calls for 100k routes, so I can't reproduce the problem
> with that unfortunately.
> 
> I did do some stracing against babeld with `strace -e raw=all | ts -i '%.s'`
> just to see what the timing of recvmsg calls is and how they vary. It seems
> to me the problem only happens when babeld is exclusively calling recvmsg
> (I assume during kernel_dump()); when it's in a steady state and starts
> calling select() between the recvmsg() calls, performance is fine.
> 
> From skimming the code it seems babeld occasionally schedules a full dump
> though, so that might be why the reproducibility is so sporadic.
> 
> During babeld startup seems to be the best chance for repro. For some
> reason bird pretty reliably also starts churning pretty soon after I
> restart babeld; not sure why, but it makes testing easier so I'll debug that
> later :)
> 
> I also tried tweaking the iov_len size for recvmsg() in babeld to match
> that of bird, which is quite large, without much change. Lowering the size
> just gave me "message truncated" errors; not sure what's up with that.
> 
> If you want to play along, `while sleep 0.1; do { timeout 1 ip -6 monitor
> route; } | wc -l; done` is what I'm using to monitor the route insertion
> performance now. The {} is load bearing (for some reason) and it does error
> with "No buffer space available" when lots of churn is going on but it
> works anyway.
> 
> --Daniel



Re: [Babel-users] babeld slashes kernel route manipulation performance by 17000%

2022-04-07 Thread Daniel Gröber
On Thu, Apr 07, 2022 at 11:02:15PM +0200, Daniel Gröber wrote:
> Hi Toke,
> 
> On Thu, Apr 07, 2022 at 12:19:46AM +0200, Toke Høiland-Jørgensen wrote:
> > I doubt that can have fixed this, though? But if it's gone, well, good
> > news? :P
> 
> I just managed to trigger it again so probably no :) This is on
> 5.10.0-13-amd64/5.10.106-1 with babeld 1.9.1 and bird version 2.0.9.

Just got another trigger, this time with babeld 1.11 + bird 2.0.9.

Interesting tidbit: if I SIGSTOP babeld, insertion performance goes right
back up. When I SIGCONT it, it goes back to snailtown.

Note it's probably not babeld starving bird of CPU as this is on a quadcore
AMD GX-412TC and babeld/bird are using at most one CPU each.

--Daniel



Re: [Babel-users] babeld slashes kernel route manipulation performance by 17000%

2022-04-07 Thread Daniel Gröber
Hi Toke,

On Thu, Apr 07, 2022 at 12:19:46AM +0200, Toke Høiland-Jørgensen wrote:
> I doubt that can have fixed this, though? But if it's gone, well, good
> news? :P

I just managed to trigger it again so probably no :) This is on
5.10.0-13-amd64/5.10.106-1 with babeld 1.9.1 and bird version 2.0.9.

While it's happening bird prints something like the following every once in
a while:

   I/O loop cycle took 6793 ms for 6 events

I get 500-2000ish events from `ip -6 monitor` over a 10sec interval as
opposed to the 80-150k I usually see.

> Hmm, I've definitely had issues with dnsmasq not handling lots of route
> updates well. I got rid of it now, but when I was running a full BGP
> table on an oldish openwrt, I basically had to kill dnsmasq every time
> the BGP session went up or down, otherwise it would take several minutes
> to recover :/

I did kill dnsmasq after it had already started happening, but that didn't help.
On the other hand if I kill babeld the insertion speed seems to go back up
again. So that would seem to suggest dnsmasq isn't the problem.

Interestingly I can't seem to trigger it by just (non-gracefully)
restarting bird so there must be a trigger other than just lots of route
insertion activity going on.

For completeness' sake: I added a babel protocol to my bird config before
this triggered again, but it's not involved in the main bgp routing table,
only a tiny isolated one, nor is it handling any of the interfaces babeld is
on. For now I'm assuming this is not contributing to the problem.

--Daniel



Re: [Babel-users] babeld slashes kernel route manipulation performance by 17000%

2022-04-06 Thread Daniel Gröber
Hi Toke,

Looks like I never responded to this :O.

On Thu, Feb 24, 2022 at 11:36:06PM +0100, Toke Høiland-Jørgensen wrote:
> Yeah, I find this a bit surprising as well. What kernel version are you
> seeing this on, and what does the CPU usage show while it's ongoing
> (just starting 'top' and sorting by CPU usage should show you which
> process(es) are using the most CPU time).

The CPU usage is pretty much what you'd expect: babel and bird are the top
offenders, but dnsmasq is also spinning quite heavily. During the original
tests I was monitoring CPU usage with htop to make sure I'm not measuring
idle route insertion activity, FYI.

I just tried to reproduce the problem to make sure dnsmasq isn't
interfering also, but I can't seem to reproduce it now. Perhaps this was
actually a kernel bug that since got fixed by a kernel upgrade.

This is on a Debian 11 (bullseye) system, according to my dpkg.log I likely
had 5.10.0-11-amd64/5.10.92-1 at the time of the original tests whereas I
have 5.10.0-12-amd64/5.10.103-1 now. I tried with the old version too but I
can't seem to get the problem to trigger anymore now. Good I guess but
unsatisfying :/

> > I am aware of the babel support in bird, but in my setup the whole
> > point of using babel is for the RTT metric support which bird doesn't
> > seem to support yet.
> 
> Ah, right, yeah, it doesn't. But good to know there's demand for this,
> that's a motivation for implementing it :)

Definitely :)

--Daniel



Re: [Babel-users] babeld slashes kernel route manipulation performance by 17000%

2022-02-24 Thread Toke Høiland-Jørgensen
Juliusz Chroboczek writes:

>> Probably because babeld subscribes to netlink notifications for all new
>> routes, and only filters them on the table name fairly late,
>> specifically here:
>> 
>> https://github.com/jech/babeld/blob/master/kernel_netlink.c#L1175
>
> Do you see how it can be done better?

Hmm, no, not really :(

Looked at the Bird code, and it seems like it's doing both the subscribe
and parsing quite similar to the way babeld is. So it's actually a bit
puzzling why it's hurting performance that much. The only obvious
difference that I can see from my admittedly cursory glance is that
babeld makes heavy use of indirect calls; but we're not talking millions
of operations per second here, so it really shouldn't be taking such a
heavy toll...

-Toke



Re: [Babel-users] babeld slashes kernel route manipulation performance by 17000%

2022-02-23 Thread Dave Taht
Ages ago I attempted a string of optimizations for babeld which sped
it up by a lot, but introduced new bugs along the way, and Juliusz
preferred the relative cleanliness of the babeld code compared to:

  - using ebpf to filter routes in the kernel (big speedup, but it was buggy)
  - inlining qsort (2x speedup), and leveraging sse/neon for comparisons
  - using uthashes to manage scheduling updates (or one day a timerwheel)
  - switching to a struct for the common params (someone else's branch)
  - threads for managing I/O and bailing out

Some of that work made it back into the mainline. Some things (like
unicast updates) sort of fell out of that.

In actuality I was also experimenting with a custom processor and
trying to find optimizations that could make it into hw of some form
or another, and it was a mess. I am not proud of those few weeks of
hacking/flailing.

I had a goal of 64k routes, and ultimately wanted to be working on
optimizing updates (the protocol supports longer lasting
announcements) in response to not being able to meet compute
deadlines, but I fell short (with still buggy code) at about 30k routes
on the very limited hw I was using.

Anyway, that tree https://github.com/dtaht/rabeld had the semi-broken
ebpf code for filtering kernel updates better. I've also long held out
hope that some new kernel support for switching routes faster could be
leveraged, and I've kind of longed for someone else to stress out
the bird version, not just of babel, but of multiple routing
protocols, using tools like rtod, here:

https://github.com/dtaht/rtod

I like bird's codebase a lot.




-- 
I tried to build a better future, a few times:
https://wayforward.archive.org/?site=https%3A%2F%2Fwww.icei.org

Dave Täht CEO, TekLibre, LLC



Re: [Babel-users] babeld slashes kernel route manipulation performance by 17000%

2022-02-23 Thread Daniel Gröber
Hi Toke and Juliusz,

On Wed, Feb 23, 2022 at 09:43:29PM +0100, Toke Høiland-Jørgensen wrote:
> Probably because babeld subscribes to netlink notifications for all new
> routes, and only filters them on the table name fairly late,
> specifically here:
> 
> https://github.com/jech/babeld/blob/master/kernel_netlink.c#L1175

Thanks for the pointer. I figured it would be something like that, but I'm
still surprised babeld should be able to (seemingly) block the kernel from
processing other netlink messages. I haven't had the time to really
review the code properly yet, though.

I would have expected the kernel to just drop events when babeld falls
behind with processing.

> So babeld will process and parse all route entries even if it won't
> export them.

Right, so I wonder if there is a way to let the kernel do the filtering
before passing events to babeld. Perhaps just making babeld faster at
processing route updates would be a better solution though. Maybe I'll try
my hand at some profiling when I get a chance.

> implementation in Bird as well; that has no issues with running
> concurrently with a full BGP table. It is even possible to run babel and
> BGP in the same Bird instance, but I split mine out to two instances
> (one for BGP, one for Babel) because I had issues with the
> single-threaded nature of Bird causing Babel to miss hello updates while
> processing a large BGP update.

I am aware of the babel support in bird, but in my setup the whole point of
using babel is for the RTT metric support which bird doesn't seem to
support yet.

I had a look at FRR too since it supposedly does support RTT but according
to the babel homepage using it is discouraged. I was wondering if that is
still correct actually?

On Thu, Feb 24, 2022 at 12:55:01AM +0100, Juliusz Chroboczek wrote:
> > I run Bird in a similar setup as yours, BTW, but using the Babel
> > implementation in Bird
> 
> Just to clarify: there are two major implementations of Babel:
> 
>   - babeld, which is a research project, and was written over the years by
> myself and a number of students, most of whom only stayed during an
> internship before moving on;

> While I find babeld more convenient than BIRD, since it requires little
> configuration in many common cases, I recommend that people use BIRD in
> preference to babeld in production deployments.

Yeah like I said above I am using babeld because of the RTT metric support
otherwise I would have preferred bird :)

--Daniel




Re: [Babel-users] babeld slashes kernel route manipulation performance by 17000%

2022-02-23 Thread Juliusz Chroboczek
> Probably because babeld subscribes to netlink notifications for all new
> routes, and only filters them on the table name fairly late,
> specifically here:
> 
> https://github.com/jech/babeld/blob/master/kernel_netlink.c#L1175

Do you see how it can be done better?

> I run Bird in a similar setup as yours, BTW, but using the Babel
> implementation in Bird

Just to clarify: there are two major implementations of Babel:

  - babeld, which is a research project, and was written over the years by
myself and a number of students, most of whom only stayed during an
internship before moving on;
  - the one that's integrated in BIRD, which was written by Toke, one of
the most competent programmers I have had the pleasure to meet.

While I find babeld more convenient than BIRD, since it requires little
configuration in many common cases, I recommend that people use BIRD in
preference to babeld in production deployments.

(Both BIRD and babeld aim to implement Babel and its extensions as
standardised and documented at the IETF, so any failure of the two
implementations to interoperate is considered as a bug.  In other words,
you should be able to mix and match BIRD and babeld in a single network.)

-- Juliusz



[Babel-users] babeld slashes kernel route manipulation performance by 17000%

2022-02-23 Thread Daniel Gröber
Hi,

I'm seeing a rather odd issue in my babeld deployment. I'm using babeld on
one of my linux routers which also has bird running with a v6 BGP full
table session that's being inserted into the kernel FIB.

Whenever bird tries to clear/reinsert all routes in the kernel table I'm
seeing a 17000% reduction[1] in route insertion performance if babeld is
running simultaneously :)

Since I've anticipated babeld not being quite ready to even see/filter a
kernel table with 100k+ routes I've set things up such that bird inserts
its routes into a separate table so babeld has a chance to just completely
ignore them and only see a nice and short main routing table.

Any ideas/pointers how babeld could be slowing this down so much?

--Daniel

[1]: Measured by counting how many lines ip monitor spits out over a 10
second period while bird is having at it with or without babeld running,
thusly: `timeout 10 ip -6 monitor | wc -l`.

With babeld:

root@Debby:~# { timeout 10 ip -6 monitor; } | wc -l
296

Without babeld:

root@Debby:~# { timeout 10 ip -6 monitor; } | wc -l
104809
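
For reference, the two counts can be compared either as a ratio or as a
percentage reduction. A small sketch of the arithmetic, with helper names
made up for illustration:

```c
/* How many times slower route insertion is with babeld running. */
static double
slowdown_factor(double without, double with)
{
    return without / with;
}

/* The same comparison as a percentage reduction (can never exceed 100). */
static double
reduction_percent(double without, double with)
{
    return 100.0 * (without - with) / without;
}
```

With the numbers above, slowdown_factor(104809, 296) comes out at roughly
354, and reduction_percent(104809, 296) at roughly 99.7.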

