Hi,

Thanks for the feedback.

On 13 Apr 2012, at 21:24, Tony Li wrote:

> This may be a nit, but I think it's important to recall that UPDATE messages 
> include withdrawn routes.  These MUST NOT be ignored by the receiver.  Doing 
> so will simply result in forwarding loops or black holes.
> 
> Perhaps this should at least be wordsmithed into something like 'ignore any 
> reachability information in an UPDATE message, while processing the withdrawn 
> routes in that same UPDATE message'.

Agreed - I think that this is the only place in the draft where there is any 
reference to ignoring a message, but let me re-review this and take your 
suggestion into account.
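
To make sure we mean the same thing, here is a minimal sketch of the behaviour 
Tony describes. It is an illustration only, not text from the draft: the 
Update class, the malformed flag and the dict-based RIB are hypothetical names 
invented for this example, and a real implementation would work on parsed path 
attributes rather than a single boolean.

```python
from dataclasses import dataclass, field


@dataclass
class Update:
    """Hypothetical, simplified view of a BGP UPDATE message."""
    withdrawn: list = field(default_factory=list)   # withdrawn prefixes
    announced: list = field(default_factory=list)   # NLRI (reachability)
    next_hop: str = ""
    malformed: bool = False  # assumption: set by an earlier validation step


def apply_update(rib, update):
    """Apply an UPDATE to a dict-based RIB (prefix -> next hop)."""
    # Withdrawn routes are always processed, even if the message is in error.
    for prefix in update.withdrawn:
        rib.pop(prefix, None)

    # Reachability information from an errored UPDATE is ignored.
    if update.malformed:
        return rib

    for prefix in update.announced:
        rib[prefix] = update.next_hop
    return rib
```

With this shape, an errored UPDATE still removes the prefixes it withdraws, 
and only its announcements are skipped.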

> Independently of that, I think that trying to maintain a session in the face 
> of multiple errors is a clear waste of time and effort on all parties.  At 
> some point, there is more effort and complexity spent on error recovery than 
> on correct transmission, and that's just backwards.  I support the suggestion 
> of binary exponential back off on session restarts.

This was the original intent of the section, to ensure that at some point there 
is a cap on the amount of resource that goes into handling these errors. 

The problem with defining an arbitrary cap of "X errors in N time" (as Robert 
suggested) is that this trips up on some particular scenarios:

- A remote AS in the Internet DFZ re-announces prefixes due to upgrading border 
routers, with some changed attribute - it's likely that all their prefixes are 
learnt within a short time period, and if all of them include some semantic 
error, then transitioning to a down state will impact the rest of the prefixes 
in the DFZ unnecessarily.

- If a PE in an L3VPN environment experiences some event (be it configuration 
change, software upgrade, etc) and as a result propagates prefixes which are 
made invalid by an element of the RR infrastructure, then all prefixes will 
be readvertised within a relatively short time period. Again, if this results 
in transitioning the session to a down state on the receiving PE (receiving 
from the RRs), then this impacts all prefixes from other PEs unnecessarily.
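
To illustrate why a fixed "X errors in N time" cap trips in both scenarios, 
here is a rough sketch of such a counter. The ErrorCap class and the numbers 
used with it are hypothetical and chosen only for illustration; nobody in this 
thread has proposed specific values.

```python
from collections import deque


class ErrorCap:
    """Hypothetical sliding-window cap: more than max_errors in window_seconds."""

    def __init__(self, max_errors, window_seconds):
        self.max_errors = max_errors
        self.window_seconds = window_seconds
        self.timestamps = deque()

    def record(self, now):
        """Record one error at time `now`; return True if the cap is exceeded."""
        self.timestamps.append(now)
        # Drop errors that have aged out of the window.
        while self.timestamps and now - self.timestamps[0] > self.window_seconds:
            self.timestamps.popleft()
        return len(self.timestamps) > self.max_errors
```

With a cap of, say, 10 errors in 60 seconds, a single burst of re-announced 
prefixes that all carry the same semantic error exceeds the cap on the 11th 
UPDATE, even though only one underlying event has occurred.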

In both cases these are lower-impact to the receiving PE if they are semantic 
errors, since the action required is likely to be the generation of an UPDATE 
withdrawing these prefixes. Where automatic recovery is put in place by an 
implementor, really the requirement is to cap how much of the receiving 
speaker's resource is used - hence the suggestion in the original mail in this 
thread.

As I see it, critical errors, where a hitless restart is the way of handling 
the error, are somewhat different: the impact on the receiving PE is heavier, 
and the error is much more likely to be localised to the directly peered-with 
speaker. Hence the cap on resource in that case is to shut the session down at 
some point, and then exponentially back off the re-opening process.
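
As a rough sketch of the back-off on re-opening, the delay could be computed 
along these lines. The function name, the base delay and the ceiling are 
assumptions made for this example and are not values from the draft.

```python
def reopen_delay(restart_count, base=1.0, ceiling=3600.0):
    """Seconds to wait before re-opening the session (hypothetical helper).

    restart_count is the number of restarts already attempted; the delay
    doubles with each one and is clamped to `ceiling`.
    """
    # Clamp the exponent so very large counts cannot overflow a float.
    exponent = min(restart_count, 30)
    return min(base * (2 ** exponent), ceiling)
```

With these defaults the first few delays are 1, 2, 4, 8 and 16 seconds, and 
the delay stops growing once it reaches the one hour ceiling.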

Please let me know if this doesn't sound reasonable, or whether there's any 
assumption I made here that's invalid.

Cheers,
r.
_______________________________________________
GROW mailing list
GROW@ietf.org
https://www.ietf.org/mailman/listinfo/grow
