On Apr 16, 2012, at 7:50 AM, Jakob Heitz wrote:

> This should not overwhelm a router.
> However, a more serious consequence is a lot of flapping routes.
> Serious enough to consider. We could dust off some dampening code for it.
> 
> Note the previous discussion was about repeated "treat as withdraw"
> errors. I support the use of an operational message for these errors,
> but nothing for repetitions of them. Repeated operational messages
> may get annoying, but they are not going to disrupt service.


The problem we have today is twofold:

1) Vendors have done a bad job of reporting the routes related to a BGP session 
reset.
  a - Multi-vendor environments can have somewhat catastrophic failure modes.

2) Stability of the Internet (as defined by "keeping the BGP session up for the 
non-broken routes") is important; overall reachability is as well.  (See Tony 
Li's comment about how an update has both feasible and infeasible NLRI in it.)

A broken implementation is just that, broken.  About half the problems in the 
past 24 months were due to improper parsing of a well-formed update, with the 
balance due to improper propagation of an invalid update.  (I'm willing to be 
corrected on the exact ratio, but I believe these are the two cases seen unless 
my memory fails me.)

I'm hard pressed to imagine a case where the code doesn't quickly get 
unwieldy in this situation.

Part of the requirements of solving this problem will be:

a) Sending an UPDATE message with only new/updated NLRI and zero items in the 
withdrawn routes field.
b) Sending a KEEPALIVE (or other well known message that will be parsed 
properly).
c) Sending an UPDATE message with no new NLRI and only those to be withdrawn.
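
One possible reading of (a) through (c) is that a mixed UPDATE gets split into
an announce-only UPDATE, a KEEPALIVE, and a withdraw-only UPDATE. Below is a
minimal sketch of that reading, using the message layout from RFC 4271. The
function names are hypothetical, path attributes are treated as opaque bytes
supplied by the caller, only IPv4 prefixes are handled, and the 4096-octet
message limit is not enforced.

```python
import ipaddress
import struct

MARKER = b"\xff" * 16      # 16 octets of all ones (RFC 4271 header)
TYPE_UPDATE = 2
TYPE_KEEPALIVE = 4
HEADER_LEN = 19            # marker (16) + length (2) + type (1)


def encode_prefix(prefix: str) -> bytes:
    """Encode an IPv4 prefix as <length in bits><minimum prefix octets>."""
    net = ipaddress.ip_network(prefix)
    nbytes = (net.prefixlen + 7) // 8
    return bytes([net.prefixlen]) + net.network_address.packed[:nbytes]


def bgp_message(msg_type: int, body: bytes = b"") -> bytes:
    """Prepend the fixed BGP header to a message body."""
    return MARKER + struct.pack("!HB", HEADER_LEN + len(body), msg_type) + body


def update(withdrawn=(), path_attrs: bytes = b"", nlri=()) -> bytes:
    """Build an UPDATE; path_attrs is assumed to be already encoded."""
    w = b"".join(encode_prefix(p) for p in withdrawn)
    n = b"".join(encode_prefix(p) for p in nlri)
    body = (struct.pack("!H", len(w)) + w +
            struct.pack("!H", len(path_attrs)) + path_attrs + n)
    return bgp_message(TYPE_UPDATE, body)


def split_update(withdrawn, path_attrs: bytes, nlri) -> list:
    """Return the (a), (b), (c) sequence for one mixed UPDATE."""
    msgs = []
    if nlri:
        # (a) announcements only, withdrawn routes length is zero
        msgs.append(update(path_attrs=path_attrs, nlri=nlri))
    # (b) a KEEPALIVE is header-only, so any peer should parse it
    msgs.append(bgp_message(TYPE_KEEPALIVE))
    if withdrawn:
        # (c) withdrawals only, no path attributes and no NLRI
        msgs.append(update(withdrawn=withdrawn))
    return msgs
```

With this split, a parse failure on one message can at least be attributed to
either the announced or the withdrawn half of the original update.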

Perhaps this mode is only triggered when the session has dropped within the 
previous 3600 seconds.  This means you need to track that state, track a timer, 
and use this new (capability?) mode.
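
The state and timer tracking described above could be as small as the sketch
below. The class and method names are hypothetical, and how the mode would be
negotiated (capability or otherwise) is left open, as it is in the text.

```python
import time

DROP_WINDOW_SECONDS = 3600.0  # the window suggested above


class PeerDropTracker:
    """Remember when each peer's session last dropped."""

    def __init__(self, window=DROP_WINDOW_SECONDS, clock=time.monotonic):
        self.window = window
        self.clock = clock        # injectable so the timer can be tested
        self.last_drop = {}       # peer address -> time of last drop

    def record_drop(self, peer):
        """Call when the session to `peer` is reset."""
        self.last_drop[peer] = self.clock()

    def conservative_mode(self, peer):
        """True if `peer` dropped within the window, so updates get split."""
        dropped_at = self.last_drop.get(peer)
        if dropped_at is None:
            return False
        if self.clock() - dropped_at < self.window:
            return True
        # Window expired: forget the drop and return to normal behaviour.
        del self.last_drop[peer]
        return False
```

Using a monotonic clock avoids the window being stretched or shortened by
wall-clock adjustments on the router.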

We also have a more systemic issue here.  Those participating in the global 
BGP ecosystem need to play an active role in the maintenance of these devices.  
Failure to do so can and will harm the rest of the ecosystem.  I'm not sure of 
the best way to enforce this operationally, but disabling sessions must be an 
operational response.  The reason for the reset is to *cause pain* for the 
poorly behaving devices in question and draw attention to them.  It is the same 
reason a developer would call abort().

The case of dropping the session is the developer equivalent of abort(), meant 
to alert the users to the problem.  Since not everyone running BGP is a 
protocol expert these days, I almost feel we need strict guidance and 
requirements around the logging we need to report these problems to the vendors.
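
As an illustration of what such logging guidance might require, here is a
sketch of a session-reset record. The field names and the JSON format are
assumptions, not taken from any vendor or specification; the idea is only that
the raw offending message and the affected routes get captured in a form that
can be handed to a vendor.

```python
import json
import logging

log = logging.getLogger("bgp.session")


def session_reset_record(peer, peer_as, error_code, error_subcode,
                         raw_message, affected_prefixes):
    """Collect what a vendor would need to reproduce the failure."""
    return {
        "event": "bgp_session_reset",
        "peer": peer,
        "peer_as": peer_as,
        # NOTIFICATION error code and subcode as sent or received
        "notification": {"code": error_code, "subcode": error_subcode},
        "message_length": len(raw_message),
        # Full hex dump of the offending message, not a truncated summary
        "message_hex": raw_message.hex(),
        "affected_prefixes": sorted(affected_prefixes),
    }


def log_session_reset(**fields):
    """Emit the record as one JSON line at ERROR level and return it."""
    record = session_reset_record(**fields)
    log.error(json.dumps(record, sort_keys=True))
    return record
```

For example, a reset caused by an UPDATE Message Error (code 3) with subcode 1
(Malformed Attribute List) would be logged with error_code=3, error_subcode=1,
and the bytes of the UPDATE that triggered it.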

- Jared
_______________________________________________
GROW mailing list
GROW@ietf.org
https://www.ietf.org/mailman/listinfo/grow