Re: [Babel-users] Anybody else seeing disruption when restarting babeld?

2023-02-25 Thread Dave Taht
I have a fantasy concerning all routing daemons that I hope is the
case in bird, but isn't, in babeld.

It's this pattern:

    while(true) {
        do_somework();
    }

where if do_somework() exceeds a deadline (about 4s in the babel
case), bad things start to happen, and cascade.

Modern Linux and Windows, at least, have the concept of an interval
timer (timerfd on Linux), which makes it easy to see when you are
exceeding deadlines and to find a way to shed_somework(). Babel has
within the protocol the ability to start announcing routes at a larger
interval, which would be a way to shed some work. Always ensuring that
at least a default route makes it out, or scrambling the order of the
announcements somewhat so that other limits are not hit, might also help.
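
A minimal sketch of the deadline idea above, in C. To keep it short it
measures each pass with clock_gettime() rather than arming a timerfd, and
the constants and function names (next_interval_ms, run_once) are made up
for illustration; none of this is babeld code.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical tuning values; not taken from babeld. */
#define DEADLINE_MS      4000   /* the "about 4s" deadline mentioned above */
#define MIN_INTERVAL_MS  4000   /* normal announcement interval (assumed) */
#define MAX_INTERVAL_MS 64000   /* ceiling for the stretched interval (assumed) */

/* Monotonic clock in milliseconds. */
int64_t now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* Given how long the last pass took, pick the next announcement interval:
   double it after a missed deadline, halve it back towards normal otherwise. */
int64_t next_interval_ms(int64_t interval_ms, int64_t elapsed_ms)
{
    assert(interval_ms >= MIN_INTERVAL_MS && interval_ms <= MAX_INTERVAL_MS);
    if (elapsed_ms > DEADLINE_MS) {
        interval_ms *= 2;
        if (interval_ms > MAX_INTERVAL_MS)
            interval_ms = MAX_INTERVAL_MS;
    } else {
        interval_ms /= 2;
        if (interval_ms < MIN_INTERVAL_MS)
            interval_ms = MIN_INTERVAL_MS;
    }
    return interval_ms;
}

/* One pass of the main loop: time the work, then adapt the interval. */
int64_t run_once(void (*do_somework)(void), int64_t interval_ms)
{
    int64_t start = now_ms();
    do_somework();
    return next_interval_ms(interval_ms, now_ms() - start);
}
```

The main loop would then call run_once() repeatedly and feed the returned
interval into whatever schedules the next round of announcements.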

Some realworld examples of applications doing this right are in the
top utility, and in how netflix probes for more bandwidth (if a given
10sec segment doesn't load in under 4 seconds, they scale back the
resolution).

Recently we hit this problem hard whilst trying to scale LibreQoS down
to sub-10ms sampling intervals.

The second thing that might help some is good ole-fashioned random
exponential backoff
(https://www3.cs.stonybrook.edu/~bender/newpub/2016-BenderFiGi-SODA-energy-backoff.pdf),
not just in packet access but also in kernel access, where under some
loads I see netlink returning ENOBUFS.
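
To make the backoff idea concrete, here is a small sketch in C of plain
binary exponential backoff with random jitter. It is the textbook scheme,
not the algorithm from the paper linked above, and the function names and
parameters are invented for the example.

```c
#include <assert.h>
#include <stdlib.h>

/* Size of the backoff window for a given retry attempt:
   base_ms * 2^attempt, clamped to cap_ms. */
unsigned backoff_window_ms(unsigned attempt, unsigned base_ms, unsigned cap_ms)
{
    assert(base_ms > 0 && cap_ms >= base_ms);
    unsigned window = base_ms;
    while (attempt-- > 0 && window < cap_ms) {
        /* Compare against cap_ms / 2 so the doubling cannot overflow. */
        window = (window > cap_ms / 2) ? cap_ms : window * 2;
    }
    return window;
}

/* Random delay in [0, window). rand() is used for brevity only;
   a real daemon would want a better-seeded source of randomness. */
unsigned backoff_delay_ms(unsigned attempt, unsigned base_ms, unsigned cap_ms)
{
    unsigned window = backoff_window_ms(attempt, base_ms, cap_ms);
    return (unsigned)rand() % window;
}
```

A caller that gets ENOBUFS back from a netlink operation would sleep for
backoff_delay_ms(attempt, ...) milliseconds, increment attempt, and retry,
resetting attempt to zero after the first success.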

I am very happy to see crates for these concepts appearing in rust,
and do wish more folk fiddling with routing daemons fiddled with my
RTOD tool. It would be a more robust, more smoothly degrading, world.

___
Babel-users mailing list
Babel-users@alioth-lists.debian.net
https://alioth-lists.debian.net/cgi-bin/mailman/listinfo/babel-users


Re: [Babel-users] Anybody else seeing disruption when restarting babeld?

2023-02-24 Thread Daniel Gröber
Hi,

On Sat, Feb 25, 2023 at 01:01:20AM +0100, Daniel Gröber wrote:
> > If you're really keen on avoiding disruptions, you should first increase
> > the metric to something very large (say, 2^15), then wait a couple of
> > seconds, then send a retraction, then wait 200ms.
> 
> Could you go into a bit more detail as to why that would be better? I think
> I get the gist, we want to avoid other nodes installing an unreachable
> route in response to our retraction while they do the seqno request dance,
> right? I just don't see why the high (but non-infinite) metric would
> prevent this?

I think I figured it out. babeld sends seqno requests when a just-received
unfeasible update has a much smaller metric than the currently installed
route (smaller by at least 512). In send_unfeasible_request:

    if(force || !route || route_metric(route) >= metric + 512)
        send_unicast_multihop_request(neigh, src->prefix, src->plen, ...);

I guess that's another thing for my bird TODO list :)

Any reason this (and the route_old() logic) isn't mentioned in the RFC or
did I miss it?
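
For anyone following along, the condition quoted above can be restated as
a standalone predicate. This is only a sketch: it swaps babeld's structs
for plain values, and should_request_seqno, have_route and
installed_metric are names invented here, standing in for `route` and
route_metric(route).

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true when a seqno request would be sent for an unfeasible update,
   following the quoted condition: always if forced or if no route is
   installed, otherwise only if the installed route is worse than the
   unfeasible update by at least 512. */
bool should_request_seqno(bool force, bool have_route,
                          unsigned installed_metric, unsigned update_metric)
{
    assert(installed_metric <= 0xFFFF && update_metric <= 0xFFFF);
    return force || !have_route || installed_metric >= update_metric + 512;
}
```

With the installed route raised to 2^15, should_request_seqno(false, true,
32768, 300) is true, so an unfeasible alternative with metric 300 triggers
a request; with the installed route at 400 it does not.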

--Daniel



Re: [Babel-users] Anybody else seeing disruption when restarting babeld?

2023-02-24 Thread Daniel Gröber
Hi Juliusz,

On Sat, Feb 25, 2023 at 12:10:26AM +0100, Juliusz Chroboczek wrote:
> > I can't say I agree with the "their problem" mentality. The way I see it
> > during graceful shutdown we're still responsible for in-flight traffic
> > anyway.
> 
> What I mean is that after our neighbours receive our retraction, they'll no
> longer be sending traffic to us, whether they have a feasible route or not.
> 
> If you're really keen on avoiding disruptions, you should first increase
> the metric to something very large (say, 2^15), then wait a couple of
> seconds, then send a retraction, then wait 200ms.

Could you go into a bit more detail as to why that would be better? I think
I get the gist, we want to avoid other nodes installing an unreachable
route in response to our retraction while they do the seqno request dance,
right? I just don't see why the high (but non-infinite) metric would
prevent this?

AFAIK the RFC only requires nodes to start sending seqno requests once the
last feasible route is already gone, which is pretty bad from my "no
disruptions" viewpoint now that I think of it.

This also makes me wonder, looking at
https://www.rfc-editor.org/rfc/rfc8966.html#section-3.8.2.1, would it be
permissible for a node to always send seqno requests when any route is
unfeasible, in order to have as many feasible routes as possible?

> But I think that's too much hassle, I like your current approach better.

I hadn't considered this problem. So my current fix really only provides a
fix for inflight packets but not the blackhole that could be created as
soon as we send the retraction to a neighbour without any other feasible
routes.

I would prefer investing more time to fix that too if that's even possible
without a protocol extension?

> > In my mind it doesn't matter if babeld takes 500ms or 15sec to shutdown if
> > that buys me a rock solid network.
> 
> I think the default should be 300ms or so.

Works for me.

Thanks,
--Daniel



Re: [Babel-users] Anybody else seeing disruption when restarting babeld?

2023-02-24 Thread Juliusz Chroboczek
>> Of course, if there are no feasible routes to a given destination, then
>> the neighbours will perform an end-to-end search for a loop-free route,
>> but that's the neighbours' problem, not ours.

> I can't say I agree with the "their problem" mentality. The way I see it
> during graceful shutdown we're still responsible for in-flight traffic
> anyway.

What I mean is that after our neighbours receive our retraction, they'll no
longer be sending traffic to us, whether they have a feasible route or not.

If you're really keen on avoiding disruptions, you should first increase
the metric to something very large (say, 2^15), then wait a couple of
seconds, then send a retraction, then wait 200ms.  But I think that's too
much hassle, I like your current approach better.
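
For concreteness, the sequence described above could be laid out as an
ordered plan like the one below. This is purely illustrative: the step
names, struct and plan_shutdown are made up for the example and are not
babeld's code, and the final route flush is included only because that is
what shutdown ends with.

```c
#include <assert.h>
#include <stddef.h>

enum step_kind { ANNOUNCE_METRIC, WAIT_MS, SEND_RETRACTION, FLUSH_ROUTES };

struct step {
    enum step_kind kind;
    unsigned value;   /* metric for ANNOUNCE_METRIC, milliseconds for WAIT_MS */
};

#define HIGH_METRIC (1u << 15)   /* "something very large (say, 2^15)" */

/* Fill `out` with the shutdown steps in order and return how many were
   written, or 0 if `cap` is too small to hold the whole plan. */
size_t plan_shutdown(struct step *out, size_t cap,
                     unsigned settle_ms, unsigned linger_ms)
{
    const struct step plan[] = {
        { ANNOUNCE_METRIC, HIGH_METRIC },  /* make our routes unattractive */
        { WAIT_MS,         settle_ms   },  /* let neighbours find alternatives */
        { SEND_RETRACTION, 0           },  /* then withdraw the routes */
        { WAIT_MS,         linger_ms   },  /* keep forwarding in-flight packets */
        { FLUSH_ROUTES,    0           },  /* finally remove kernel routes */
    };
    size_t n = sizeof(plan) / sizeof(plan[0]);

    assert(out != NULL);
    if (cap < n)
        return 0;
    for (size_t i = 0; i < n; i++)
        out[i] = plan[i];
    return n;
}
```

Calling plan_shutdown(steps, 8, 2000, 200) gives the "couple of seconds,
then 200ms" variant; the simpler approach amounts to dropping the first
two steps.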

> In my mind it doesn't matter if babeld takes 500ms or 15sec to shutdown if
> that buys me a rock solid network.

I think the default should be 300ms or so.

> The note about the ACKs was simply supposed to be reasoning for why an
> ad-hoc delay rather than having neighbours ACK the retractions.
> 
>>   - should the granularity be lower?  A second for local signalling is
>> a lot, I'd expect 300ms to be enough in most cases;

> I have no problem changing it to millisecond granularity if that suits you?

Please.

-- Juliusz



Re: [Babel-users] Anybody else seeing disruption when restarting babeld?

2023-02-24 Thread Daniel Gröber
On Fri, Feb 24, 2023 at 07:41:03PM +0100, Juliusz Chroboczek wrote:
> > I think I figured out what's going on: babeld immediately flushes the kernel
> > routes it installed when shutting down, without waiting for neighbours to
> > switch to a different path.
> 
> Right.  How long is the disruption?

It's not so much how long it is as it is that it's there at all. I don't
want my network to drop packets on the floor unnecessarily.

> > I figure this has to be configurable option since full propagation of the
> > retractions depends on the network diameter and there's no way in the
> > protocol we can get acknowledgments from the entire network (AFAIK?), not
> > just our immediate neighbours.
> 
> We only signal the neighbours: this is distance-vector, so the neighbours
> will start searching for a different route without any need for end-to-end
> signalling. 

You're right, of course: as soon as we signal a neighbour they will divert
traffic somewhere else, but only if they have a feasible route at hand. I'm
also concerned about blackholing in the "no feasible route" case.

> Of course, if there are no feasible routes to a given destination, then
> the neighbours will perform an end-to-end search for a loop-free route,
> but that's the neighbours' problem, not ours.

I can't say I agree with the "their problem" mentality. The way I see it
during graceful shutdown we're still responsible for in-flight traffic
anyway. We're also in a reasonable position to avoid dropping any traffic
still about to be routed based on the assumption we're still alive and
routing until our retraction propagates, why shouldn't we take advantage of
that?

In my mind it doesn't matter if babeld takes 500ms or 15sec to shut down if
that buys me a rock solid network. So my thinking is that I'd like to know
when everything has converged; since that isn't really a thing in DV, as
you note, an ad-hoc delay is the next best thing I could think of.

The note about the ACKs was simply supposed to be reasoning for why an
ad-hoc delay rather than having neighbours ACK the retractions.

> > https://github.com/jech/babeld/pull/102
> 
> Looks good to me.  Just two comments:
> 
>   - should the granularity be lower?  A second for local signalling is
> a lot, I'd expect 300ms to be enough in most cases;

I have no problem changing it to millisecond granularity if that suits you?

>   - why a goto rather than a loop?

Oh you know how it goes: you make a decision then another, change that one
and then unbeknownst to you the first one doesn't really make sense
anymore ;) Will fix.
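
As a rough idea of what a millisecond-granularity delay written as a loop
could look like (a sketch only, not the code from the pull request;
sleep_ms is a name made up here):

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <time.h>

/* Sleep for delay_ms milliseconds, resuming after signal interruptions.
   Returns 0 on success, -1 on any error other than EINTR. */
int sleep_ms(long delay_ms)
{
    assert(delay_ms >= 0);
    struct timespec req = { delay_ms / 1000, (delay_ms % 1000) * 1000000L };
    struct timespec rem;

    while (nanosleep(&req, &rem) == -1) {
        if (errno != EINTR)
            return -1;
        req = rem;   /* continue with the time that was left */
    }
    return 0;
}
```

The loop replaces a "retry:" label and goto: an interrupted sleep just goes
around again with the remaining time.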

--Daniel
