Hi,

Thanks for the feedback.

On 13 Apr 2012, at 21:24, Tony Li wrote:

> This may be a nit, but I think it's important to recall that UPDATE messages
> include withdrawn routes. These MUST NOT be ignored by the receiver. Doing
> so will simply result in forwarding loops or black holes.
>
> Perhaps this should at least be wordsmithed into something like 'ignore any
> reachability information in an UPDATE message, while processing the withdrawn
> routes in that same UPDATE message'.

Agreed - I think that this is the only place in the draft where there is
any reference to ignoring a message, but let me re-review this and take
your suggestion into account.

> Independently of that, I think that trying to maintain a session in the face
> of multiple errors is a clear waste of time and effort on all parties. At
> some point, there is more effort and complexity spent on error recovery than
> on correct transmission, and that's just backwards. I support the suggestion
> of binary exponential back off on session restarts.

This was the original intent of the section: to ensure that at some point
there is a cap on the amount of resource that goes into handling these
errors. The problem with defining an arbitrary cap of "X errors in N time"
(as Robert suggested) is that this trips up on some particular scenarios:

- A remote AS in the Internet DFZ re-announces prefixes due to upgrading
  border routers, with some changed attribute - it's likely that all their
  prefixes are learnt within a short time period, and if all of them
  include some semantic error, then transitioning to a down state will
  impact the rest of the prefixes in the DFZ unnecessarily.

- If a PE in an L3VPN environment experiences some event (be it
  configuration change, software upgrade, etc.) and as a result propagates
  prefixes which are made invalid by an element of the RR infrastructure,
  then all prefixes will be readvertised within a relatively short time
  period.
  Again, if this results in transitioning the session to a down state on
  the receiving PE (receiving from the RRs), then this impacts all
  prefixes from other PEs unnecessarily.

In both cases, semantic errors have a lower impact on the receiving PE,
since the action required is likely to be the generation of an UPDATE
withdrawing these prefixes. Where automatic recovery is put in place by an
implementor, the real requirement is to cap how much of the receiving
speaker's resource is used - hence the suggestion in the original mail in
this thread.

As I see it, critical errors, where a hitless restart is the way of
handling the error, are somewhat different: the impact on the receiving PE
is somewhat heavier, and the error is much more likely to be localised to
the directly peered-with speaker. Hence, in that case, the cap on resource
is to shut the session down at some point and then exponentially back off
the re-opening process.

Please let me know if this doesn't sound reasonable, or whether there's
any assumption I made here that's invalid.

Cheers,
r.

_______________________________________________
GROW mailing list
GROW@ietf.org
https://www.ietf.org/mailman/listinfo/grow