Re: [j-nsp] BGP output queue priorities between RIBs/NLRIs

Rob Foehl Tue, 10 Nov 2020 10:27:15 -0800

On Tue, 10 Nov 2020, Jeffrey Haas wrote:

> The thing to remember is that even though you're not getting a given afi/safi
> as front-loaded as you want (absolute front of queue), as soon as we have
> routes for that priority they're dispatched accordingly.

Right, that turns out to be the essential issue -- the output queues
actually are working as configured, but the AFI/SAFI routes relevant to a
higher priority queue arrive so late in the process that it's basically
irrelevant whether they get to cut in line at that point. Certainly
wasn't observable to human eyes, had to capture the traffic to verify.
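
A toy model of that behavior, purely as a sketch: a walker visits tables in
order and enqueues routes, while a writer drains the queue best-priority
first. The function name, tick rates, and route counts are all made up for
illustration and say nothing about how rpd is actually implemented.

```python
import heapq
from itertools import count

def simulate(tables, priority, drain_per_tick=2, walk_per_tick=4):
    """Walk tables in order, enqueue routes, drain best priority first.

    tables:   list of (name, route_count), walked in list order.
    priority: dict of name -> int, lower number is sent first.
    Returns the table name of each route in the order it was sent.
    """
    queue = []                    # heap of (priority, arrival_seq, name)
    seq = count()                 # tie-breaker keeps FIFO order per priority
    pending = list(tables)
    sent = []
    while pending or queue:
        # Walker step: enqueue up to walk_per_tick routes from the current table.
        budget = walk_per_tick
        while budget and pending:
            name, remaining = pending[0]
            take = min(budget, remaining)
            for _ in range(take):
                heapq.heappush(queue, (priority[name], next(seq), name))
            budget -= take
            if remaining - take:
                pending[0] = (name, remaining - take)
            else:
                pending.pop(0)
        # Writer step: send up to drain_per_tick routes, best priority first.
        for _ in range(min(drain_per_tick, len(queue))):
            sent.append(heapq.heappop(queue)[2])
    return sent

# Hypothetical scale: 40 inet routes walked first, then 4 evpn routes.
sent = simulate([("inet", 40), ("evpn", 4)], {"inet": 2, "evpn": 1})
first_evpn = sent.index("evpn")
inet_after_evpn = sent[first_evpn:].count("inet")
print(first_evpn, inet_after_evpn)   # 20 20
```

The evpn routes do jump ahead of everything still queued, but half of the
inet routes have already been sent by the time the walker reaches the evpn
table, so the priority only helps at the very end.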

> Full table walks to populate the queues take some seconds to several minutes
> depending on the scale of the router.  In the absence of prioritization,
> something like the evpn routes might not go out for most of a minute rather
> than getting delayed some number of seconds until the rib walker has reached
> that table.

Ah, maybe this is the sticking point: on a route reflector with an
RE-S-X6-64 carrying ~10M inet routes and ~10K evpn routes, a new session
toward an RR client PE needing to be sent ~1.6M inet routes (full table,
add-path 2) and maybe ~3K evpn routes takes between 11-17 minutes to get
through the initial batch. The evpn routes only arrive at the tail end of
that, and may only preempt around 1000 inet routes in the output queues,
as confirmed by TAC.
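
For a sense of scale, a back-of-the-envelope calculation from those figures.
It assumes the inet routes go out at a constant average rate over the whole
batch, which is a simplification; the helper name is hypothetical.

```python
def priority_gain(routes_sent, minutes, preempted):
    """Return (average routes per second, seconds saved by preempting)."""
    rate = routes_sent / (minutes * 60)
    return rate, preempted / rate

# Figures from the post: ~1.6M inet routes, 11-17 minutes, ~1000 preempted.
for minutes in (11, 17):
    rate, saved = priority_gain(1_600_000, minutes, 1_000)
    print(f"{minutes} min: ~{rate:,.0f} routes/s, priority saves ~{saved:.1f} s")
```

Under that assumption the average is roughly 1,600 to 2,400 routes per
second, so cutting ahead of ~1000 queued inet routes buys well under a
second out of an 11-17 minute wait.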

I have some RRs that tend toward the low end of that range and some that
tend toward the high end -- and not entirely sure why in either case --
but that timing is pretty consistent overall, and pretty horrifying. I
could almost live with "most of a minute", but this is not that.

This has problems with blackholing traffic for long periods in several
cases, but the consequences for DF elections are particularly disastrous,
given that they make up their own minds based on received state without
any affirmative handshake: the only possible behaviors are discarding or
looping traffic for every ethernet segment involved until the routes
settle, depending on whether the PE involved believes it's going to win
the election and how soon. Setting extremely long 20 minute DF election
hold timers is currently the least worst "solution", as losing traffic for
up to 20 minutes is preferable to flooding a segment into oblivion -- but
only just.
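
To make the "no handshake" point concrete, a minimal sketch assuming the
default modulo-based election from RFC 7432 (each PE sorts the PE addresses
it has learned for the segment and picks index VLAN mod N). The addresses,
VLAN, and function name are hypothetical, and other DF election types would
behave differently in detail.

```python
import ipaddress

def elect_df(candidate_pes, vlan):
    """RFC 7432 default election: sort PE addresses, pick index (vlan mod N)."""
    ordered = sorted(candidate_pes, key=lambda a: int(ipaddress.ip_address(a)))
    return ordered[vlan % len(ordered)]

# Hypothetical PEs attached to the same Ethernet segment.
PE1, PE2 = "192.0.2.1", "192.0.2.2"
vlan = 100

# PE2 has not yet received PE1's Ethernet Segment route (it is stuck behind
# the inet routes), so its candidate list holds only itself. PE1 sees both.
view = {PE1: [PE1, PE2], PE2: [PE2]}

# Each PE runs the election on its own view; nothing reconciles the results.
claims = [pe for pe, seen in view.items() if elect_df(seen, vlan) == pe]
print(claims)   # ['192.0.2.1', '192.0.2.2']
```

Both PEs conclude they are the DF for this VLAN, which is the looping case;
a hold timer long enough to cover the route delay trades that for discarded
traffic instead.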

I wouldn't be nearly as concerned with this if we weren't taking 15-20
minute outages every time anything changes on one of the PEs involved...



[on the topic of route refreshes]

> The intent of the code is to issue the minimum set of refreshes for new
> configuration.  If it's provably not minimum for a given config, there should
> be a PR on that.

I'm pretty sure that much is working as intended, given what is actually
sent -- this issue is the time spent walking other RIBs that have no
bearing on what's being refreshed.

> The cost of the refresh in getting routes sent to you is another artifact of "we
> don't keep that state" - at least in that configuration.  This is a circumstance
> where family route-target (RT-Constrain) may help.  You should find when using that
> feature that adding a new VRF with support for that feature results in the missing routes
> arriving quite fast - we keep the state.

I'd briefly looked at RT-Constrain, but wasn't convinced it'd be useful
here since disinterested PEs only have to discard at most ~10K EVPN routes
at present. Worth revisiting that assessment?


-Rob


_______________________________________________
juniper-nsp mailing list juniper-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/juniper-nsp
