On Tue, 10 Nov 2020, Jeffrey Haas wrote:

The thing to remember is that even though you're not getting a given afi/safi 
as front-loaded as you want (absolute front of queue), as soon as we have 
routes for that priority they're dispatched accordingly.

Right, that turns out to be the essential issue -- the output queues actually are working as configured, but the AFI/SAFI routes relevant to a higher priority queue arrive so late in the process that it's basically irrelevant whether they get to cut in line at that point. It certainly wasn't observable to human eyes; I had to capture the traffic to verify.
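A toy model of why cutting in line late buys so little. This is only a sketch: the function name, the tick-based rates, and the assumption that the walker visits the inet table before the evpn table are all illustrative, not a description of how rpd is implemented.

```python
def first_evpn_send(n_inet, n_evpn, walk_per_tick, send_per_tick, prioritize):
    """Toy model: return (tick of first evpn send, inet routes it jumped past)."""
    if min(n_evpn, walk_per_tick, send_per_tick) <= 0:
        raise ValueError("need at least one evpn route and positive rates")
    total = n_inet + n_evpn
    walked = 0      # routes the walker has enqueued so far
    low = high = 0  # queued inet / evpn routes awaiting transmission
    tick = 0
    while True:
        tick += 1
        # Walker: enqueue the next batch, inet table first, evpn table last
        # (assumed ordering for this sketch).
        batch = min(walk_per_tick, total - walked)
        inet_part = min(batch, max(0, n_inet - walked))
        low += inet_part
        high += batch - inet_part
        walked += batch
        # Sender with prioritization: queued evpn goes out ahead of queued inet.
        if prioritize and high:
            return tick, low
        # Otherwise drain inet in arrival order; evpn goes out only if
        # there is send capacity left once the inet backlog is empty.
        sent = min(low, send_per_tick)
        low -= sent
        if high and sent < send_per_tick:
            return tick, 0


# Walker is the bottleneck: no backlog builds, so priority changes nothing.
walker_bound = (first_evpn_send(1000, 10, 100, 100, True),
                first_evpn_send(1000, 10, 100, 100, False))
# Sender is the bottleneck: a backlog exists, so priority actually helps.
sender_bound = (first_evpn_send(1000, 10, 200, 100, True),
                first_evpn_send(1000, 10, 200, 100, False))
```

In the walker-bound case both variants send the first evpn route on the same tick, which matches the behaviour described above: the queues honour the priority, but there is almost nothing queued to jump past.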

Full table walks to populate the queues take some seconds to several minutes 
depending on the scale of the router.  In the absence of prioritization, 
something like the evpn routes might not go out for most of a minute rather 
than getting delayed some number of seconds until the rib walker has reached 
that table.

Ah, maybe this is the sticking point: on a route reflector with an RE-S-X6-64 carrying ~10M inet routes and ~10K evpn routes, a new session toward an RR client PE needing to be sent ~1.6M inet routes (full table, add-path 2) and maybe ~3K evpn routes takes between 11 and 17 minutes to get through the initial batch. The evpn routes only arrive at the tail end of that, and may only preempt around 1000 inet routes in the output queues, as confirmed by TAC.
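For a sense of scale, a back-of-the-envelope calculation from those figures. It assumes the session averages a constant rate over the whole initial batch and ignores the ~3K evpn routes; the helper name is made up for illustration.

```python
def routes_per_second(routes, minutes):
    """Average send rate, assuming a constant rate over the whole batch."""
    return routes / (minutes * 60)


INET_ROUTES = 1_600_000   # ~1.6M inet routes sent to the client
PREEMPTED = 1_000         # inet routes the evpn routes get to jump past

rate_fast = routes_per_second(INET_ROUTES, 11)   # low end of the range
rate_slow = routes_per_second(INET_ROUTES, 17)   # high end of the range

# Time the evpn routes gain by preempting ~1000 queued inet routes.
gain_fast = PREEMPTED / rate_fast
gain_slow = PREEMPTED / rate_slow

print(f"{rate_slow:.0f}-{rate_fast:.0f} routes/s, "
      f"priority saves roughly {gain_fast:.2f}-{gain_slow:.2f} s")
```

Under those assumptions the priority queue saves well under a second out of an 11 to 17 minute wait.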

I have some RRs that tend toward the low end of that range and some that tend toward the high end -- and I'm not entirely sure why in either case -- but that timing is pretty consistent overall, and pretty horrifying. I could almost live with "most of a minute", but this is not that.

This blackholes traffic for long periods in several cases, but the consequences for DF elections are particularly disastrous, given that each PE makes up its own mind based on received state without any affirmative handshake: the only possible behaviors are discarding or looping traffic for every ethernet segment involved until the routes settle, depending on whether the PE involved believes it's going to win the election and how soon. Setting extremely long 20-minute DF election hold timers is currently the least worst "solution", as losing traffic for up to 20 minutes is preferable to flooding a segment into oblivion -- but only just.
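A minimal sketch of how diverging views produce both failure modes. It assumes the default modulo procedure from RFC 7432 (PEs sorted by IP address, DF for a VLAN is the PE at ordinal VLAN mod N); whether these PEs actually use that algorithm is an assumption, and the addresses, VLAN numbers, and function names are invented for illustration. Hold timers are not modelled.

```python
import ipaddress


def elect_df(known_pes, vlan):
    """DF as computed locally from the ES routes one PE has received so far."""
    ordered = sorted(known_pes, key=ipaddress.ip_address)
    return ordered[vlan % len(ordered)]


def self_elected(views, vlan):
    """PEs that believe they are the DF, given each PE's own (partial) view."""
    return [pe for pe, seen in views.items() if elect_df(seen, vlan) == pe]


A, B, C = "192.0.2.1", "192.0.2.2", "192.0.2.3"

# Converged: every PE has every ES route, exactly one DF.
converged = self_elected({A: {A, B}, B: {A, B}}, vlan=101)

# A has not yet received B's ES route: both forward (loop/duplication).
two_dfs = self_elected({A: {A}, B: {A, B}}, vlan=101)

# C has not yet received B's ES route: C defers to A, while A and B
# defer to C, so nobody forwards (traffic is discarded).
no_df = self_elected({A: {A, B, C}, B: {A, B, C}, C: {A, C}}, vlan=104)
```

Because nothing in the procedure confirms that the peers agree, a PE that is still waiting on ES routes behind 1.6M inet routes has no way to know its view is incomplete.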

I wouldn't be nearly as concerned with this if we weren't taking 15-20 minute outages every time anything changes on one of the PEs involved...


[on the topic of route refreshes]

The intent of the code is to issue the minimum set of refreshes for new 
configuration.  If it's provably not minimum for a given config, there should 
be a PR on that.

I'm pretty sure that much is working as intended, given what is actually sent -- this issue is the time spent walking other RIBs that have no bearing on what's being refreshed.

The cost of the refresh in getting routes sent to you is another artifact of "we 
don't keep that state" - at least in that configuration.  This is a circumstance 
where family route-target (RT-Constrain) may help.  You should find when using that 
feature that adding a new VRF with support for that feature results in the missing routes 
arriving quite fast - we keep the state.

I'd briefly looked at RT-Constrain, but wasn't convinced it'd be useful here since uninterested PEs only have to discard at most ~10K EVPN routes at present. Worth revisiting that assessment?
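For reference, the general idea behind RT-Constrain (RFC 4684) is that the sender filters on the route targets the receiver has advertised interest in, rather than the receiver discarding afterwards. The sketch below is only a conceptual model with made-up route and RT values; it says nothing about how Junos stores that state or how quickly it responds.

```python
def rt_constrained(routes, interested_rts):
    """Keep only routes carrying at least one RT the peer asked for."""
    return [r for r in routes if r["rts"] & interested_rts]


# Hypothetical RR contents: two EVPN routes in different instances.
rr_routes = [
    {"prefix": "evpn-route-1", "rts": {"target:65000:100"}},
    {"prefix": "evpn-route-2", "rts": {"target:65000:200"}},
]

# A PE that imports only target:65000:100 is sent one route, not two.
sent = rt_constrained(rr_routes, {"target:65000:100"})

# Adding a VRF that imports target:65000:200 widens the advertised
# membership, and the RR can send just the newly matching routes.
newly_wanted = rt_constrained(rr_routes, {"target:65000:200"})
```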

-Rob


_______________________________________________
juniper-nsp mailing list juniper-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/juniper-nsp
