Rob,

> On Nov 9, 2020, at 9:53 PM, Rob Foehl <r...@loonybin.net> wrote:
> 
>> An immense amount of work in the BGP code is built around the need to not 
>> have to keep full state on EVERYTHING.  We're already one of the most 
>> stateful BGP implementations on the planet.  Many times that helps us, 
>> sometimes it doesn't.
>> 
>> But as a result of such designs, for certain kinds of large work it is 
>> necessary to have a consistent work list and build a simple iterator on 
>> that.  One of the more common patterns that is impacted by this is the walk 
>> of the various routing tables.  As noted, we start roughly at inet.0 and go 
>> forward based on internal table order.
> 
> Makes sense, but also erases the utility of output queue priorities when
> multiple tables are involved.  Is there any feasibility of moving the RIB
> walking in the direction of more parallelism, or at least something like
> round robin between tables, without incurring too much overhead / bug
> surface / et cetera?

Recent RPD work has been moving toward multiple threads of execution in the BGP 
pipeline (sharding).  The output queue work is still applicable here.

The thing to remember is that even though a given afi/safi isn't as 
front-loaded as you might want (at the absolute front of the queue), as soon as 
we have routes queued at that priority they're dispatched accordingly.

Full table walks to populate the queues take anywhere from a few seconds to 
several minutes, depending on the scale of the router.  Without prioritization, 
something like the evpn routes might not go out for the better part of a 
minute; with it, they're only delayed by however many seconds it takes the rib 
walker to reach that table.

Perfect?  No.  Better?  Yes.  It was a tradeoff of complexity vs. perfect 
queueing behavior.
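
If it helps, here's roughly how I'd sketch the model in Python -- illustrative 
only, not rpd code, and the table names, priority numbers, and per-cycle write 
budget are all invented for the example.  The point it shows is that the walker 
still visits tables in a fixed order, but once routes land in a higher-priority 
queue they jump ahead of whatever lower-priority backlog is still waiting to be 
written:

# Illustrative model only -- not rpd code.  Table names, priorities, and
# the per-cycle write budget are invented for this sketch.
from collections import deque

NUM_QUEUES = 16                       # priority 1 = highest, 16 = lowest

# Per-table output-queue priority, the way one operator might want it.
priority_for_table = {
    "inet.0":       8,
    "bgp.l3vpn.0":  4,
    "bgp.evpn.0":   1,                # "my evpn routes should go first"
}

queues = [deque() for _ in range(NUM_QUEUES + 1)]     # index 1..16 used

def dispatch(budget):
    """Send up to `budget` routes, always draining the highest-priority
    non-empty queue first -- this is what the 16 queues buy you."""
    sent = 0
    for prio in range(1, NUM_QUEUES + 1):
        while queues[prio] and sent < budget:
            table, route = queues[prio].popleft()
            print(f"send prio={prio:2d} {table:12s} {route}")
            sent += 1

def rib_walker(tables, budget=2):
    """Walk tables in fixed internal order (inet.0 first), queueing each
    route at its table's priority; the writer drains a little as we go."""
    for table, routes in tables:
        prio = priority_for_table.get(table, NUM_QUEUES)
        for route in routes:
            queues[prio].append((table, route))
        dispatch(budget)
    while any(queues):                # finish draining the backlog
        dispatch(budget)

rib_walker([
    ("inet.0",      ["10.%d.0.0/16" % i for i in range(8)]),
    ("bgp.l3vpn.0", ["65000:1:10.1.1.0/24"]),
    ("bgp.evpn.0",  ["evpn type-2 mac/ip"]),
])

Run it and you'll see the evpn route can't go out before the walker reaches 
bgp.evpn.0, but once it's queued it goes ahead of the inet.0 backlog still 
sitting at priority 8 -- the "better, not perfect" above.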

> 
>> The primary challenge for populating the route queues in user desired orders 
>> is to move that code out of the pattern that is used for quite a few other 
>> things.  While you may want your evpn routes to go first, you likely don't 
>> want route resolution which is using earlier tables to be negatively 
>> impacted.  Decoupling the iterators for the overlapping table impacts is 
>> challenging, at best.  Once we're able to achieve that, the user 
>> configuration becomes a small thing.
> 
> I'm actually worried that if the open ER goes anywhere, it'll result in
> the ability to specify a table order only, and that's an awfully big
> hammer when what's really needed is the equivalent of the output queue
> priorities covering the entire process.  Some of these animals are more
> equal than others.

Which is why there are 16 queues to work with.  In my usual presentation on 
this feature the question comes up: "why 16?".  The answer is "we needed at 
least the usual 3 (low/medium/high), but then users would fight over the 
arbitrary arrangement of those among different tables, and you can't do 
absolute prioritization of those per table because VPN may be lowest priority 
for some people and Internet highest, or vice versa... and also, it should be 
less than 32 based on the available bits in the data structure".

Yes, some more equal than others.  Hence the flexibility.
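
To make that concrete (table names and numbers made up, purely illustrative), 
two operators with opposite ideas of what matters first end up with opposite 
mappings -- which is why it's configuration rather than a fixed order:

# Purely illustrative -- the mapping is the operator's call, not ours.
content_provider = {"bgp.evpn.0": 1, "inet.0": 8,  "bgp.l3vpn.0": 16}
vpn_provider     = {"bgp.l3vpn.0": 1, "inet.0": 16, "bgp.evpn.0": 8}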

Once there's an ability to adjust the walker, it won't impact the use of the 
queues; it will simply make sure that things get prioritized earlier.

And, as noted above, we're continuing threading work.  At some point the tables 
may gain additional levels of independence which would obviate an explicit 
feature... maybe.  The core observation I make for a lot of this stuff is 
"There's always too much work to do".

> 
>> I don't recall seeing the question about the route refreshes, but I can 
>> offer a small bit of commentary: The CLI for our route refresh isn't as 
>> fine-grained as it could be.  The BGP extension for route refresh permits 
>> per afi/safi refreshing and honestly, we should expose that to the user.  I 
>> know I flagged this for PLM at one point in the past.
> 
> The route refresh issue mostly causes trouble when bringing new PEs into
> existing instances, and is presumably a consequence of the same behavior:
> the refresh message includes the correct AFI/SAFI, but the remote winds up
> walking every RIB before it starts emitting routes for the requested
> family (and no others).  The open case for the output queue issue has a
> note from 9/2 wherein TAC was able to reproduce this behavior and collect
> packet captures of both the specific refresh message and the long period
> of silence before any routes were sent.

The intent of the code is to issue the minimum set of refreshes for new 
configuration.  If it's provably not minimum for a given config, there should 
be a PR on that.

The cost of the refresh in getting routes sent to you is another artifact of 
"we don't keep that state" - at least in that configuration.  This is a 
circumstance where family route-target (RT-Constrain) may help.  With that 
feature enabled, you should find that adding a new VRF results in the missing 
routes arriving quite fast - because there we do keep the state.
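
If it helps, my mental model of what RT-Constrain buys you here, as a rough 
Python sketch (not Junos code; the class and field names are invented for the 
example):

# Rough model of RFC 4684 behaviour -- not Junos code.
class Speaker:
    def __init__(self, vpn_routes):
        self.vpn_routes = vpn_routes       # (route, route-target) pairs
        self.peer_interest = set()         # RTs the peer has asked for

    def recv_rt_membership(self, rt):
        """Peer added a VRF and advertised interest in `rt` via
        family route-target.  Because we keep this state, the matching
        routes go out immediately -- no refresh, no full table walk."""
        self.peer_interest.add(rt)
        return [route for route, route_rt in self.vpn_routes if route_rt == rt]

rr = Speaker([("65000:1:10.1.0.0/24", "target:65000:1"),
              ("65000:2:10.2.0.0/24", "target:65000:2")])
print(rr.recv_rt_membership("target:65000:2"))
# -> ['65000:2:10.2.0.0/24'], sent as soon as the membership route arrives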

-- Jeff

> 
> -Rob
> 

