Well-informed discussion by now ;-) just one inline comment
> as tony (and others) have pointed out ... it will only work during
> nice-weather conditions ... as soon as you have the perfect storm, you will
> have micro-loops (and all the pain you try to protect yourself from) -
> therefore i'd be much more in favour of 'something else' which gets us to
> zero uloops, and for that i am afraid we need the notion of independent
> forwarding planes (a la make-before-break) -
>
> trying to synchronize RIB/FIB update across different CPUs/routing-
> stacks/vendors/load conditions is a battle that can never be won.
>
> [SLI] This is not a definitive solution for microloops, just a simple quick
> win to remove some of them.
> I'm still dreaming of a simple definitive solution ... but I don't see any
> ... oFIB or synchronized FIBs do not seem to have good support among
> implementations ...
> In the meantime, I do think there are small and fast areas of improvement,
> even if they are not definitive solutions.
[Tony said] The battle of synchronizing the nodes (and the network in a sense)
cannot be won since beside dealing with asynchronous networks (ah, the halcyon
days of TDM ;-) we are dealing here with hugely asynchronous systems within
networks (by now). The days of same single-core processor doing everything on
the side while running fast-path written in assembly by a pizza-fed wizard are
long, long past (but then, 200 msec before flooding something & other guy being
able to look @ it was considered quite normal ;-). Looking @ e.g. our
architecture I am having fun discussions along the lines "I gave you ACK but
you know, it only means I processed some of the stuff you gave me & then I have
to do all those other things until it's in FIB & fwd'ing but you surely don't
want me to wait until it's done before I ACK you. So, what does an 'ACK' really
mean here ?" ;-) We can get 'some' amount of better synchronization (and
again, we better not get too good @ it, a perfectly sync'ed network on LSA
refreshes or HELLOs is _not_ a fun thing to debug ;-) and that's
what this work will need to settle for. The important things in this work are
IMO:
* make sure you don't aim for _perfect_ synchronization but for some small
jitter to avoid network-wide Dirac pulses on your control planes. Even if you
don't, I'm pretty sure the asynchronous nature of today's architectures will
confound you. All kinds of hysteresis, like Hannes said, are built into all the
large systems today (packing, pipelining of async comms, state batching,
reordering of state updates to preserve FIB integrity, chip restrictions & so
on).
* allow the operator a knob for 'how many new LSAs and/or what timebound'
counts as a 'normal failure'. Those numbers may shift dramatically. If I'm
running large numbers of IGP shortcuts over a link, I may end up with tons of
stuff on my plate on a single phy link failure before I want my computation to
jump-start. Yes, 'normal' will probably be one link - jump @ it if you're @
cool temperature - but everybody is doing that today already pretty much, so
it's more 'what's bad enough to start to back off'.
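The two bullets above can be sketched as a jittered back-off timer with an
operator-configured 'normal failure' threshold. A minimal Python sketch -
the class, parameter names, and default values are all hypothetical
illustrations, not taken from any draft:

```python
import random

class SpfBackoff:
    """Hedged sketch of a jittered SPF back-off timer.

    All parameters are illustrative: the operator knob is expressed as
    'more than event_threshold new LSAs inside window_ms means this is
    not a normal failure, so back off to the slow timer'.
    """

    def __init__(self, initial_ms=50, slow_ms=2000, jitter_frac=0.2,
                 event_threshold=3, window_ms=1000):
        self.initial_ms = initial_ms            # delay for a 'normal' failure
        self.slow_ms = slow_ms                  # delay once things look bad
        self.jitter_frac = jitter_frac          # spread timers a little
        self.event_threshold = event_threshold  # how many LSAs = 'not normal'
        self.window_ms = window_ms              # timebound for counting LSAs
        self.events = []                        # timestamps (ms) of recent LSAs

    def _jitter(self, base_ms):
        # small random spread so routers don't all fire in lock-step
        # (avoids the network-wide 'Dirac pulse' on the control plane)
        return base_ms * (1.0 + random.uniform(-self.jitter_frac,
                                               self.jitter_frac))

    def delay_for_event(self, now_ms):
        """Return the delay before (re)starting SPF for an LSA seen at now_ms."""
        # keep only events inside the operator-configured window
        self.events = [t for t in self.events if now_ms - t < self.window_ms]
        self.events.append(now_ms)
        if len(self.events) < self.event_threshold:
            # 'normal failure': jump @ it quickly
            return self._jitter(self.initial_ms)
        # storm: bad enough to start to back off
        return self._jitter(self.slow_ms)
```

Setting jitter_frac to zero makes the sketch deterministic for testing; in
practice a non-zero jitter is the point, per the first bullet.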
Looking fwd to what will emerge as a practical proposed backoff algorithm here
--- tony
_______________________________________________
rtgwg mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/rtgwg