On Apr 16, 2012, at 7:50 AM, Jakob Heitz wrote:

> This should not overwhelm a router.
> However, a more serious consequence is a lot of flapping routes.
> Serious enough to consider. We could dust off some dampening code for it.
>
> Note the previous discussion was about repeated "treat as withdraw"
> errors. I support the use of an operational message for these errors,
> but nothing for repetitions of them. Repeated operational messages
> may get annoying, but they are not going to disrupt service.

The problem we have today is twofold:

1) Vendors have done a bad job of reporting the routes related to a BGP session reset
   a - multiple vendor environments can have somewhat catastrophic failure modes

2) Stability of the internet (as defined by "keeping the BGP session up for the non-broken routes") is important; overall reachability is as well. (This is in reference to Tony Li's comment about how an update has both feasible and infeasible NLRI in it.)

A broken implementation is just that, broken. About half the problems in the past 24 months are due to improper parsing of a well-formed update, with the balance due to the improper propagation of an invalid update. (I'm willing to be corrected on the exact ratio, but I believe these are the two cases seen unless my memory fails me.)

I'm hard pressed to imagine a case where the code doesn't quickly get unwieldy in this situation. Part of the requirements of solving this problem will be:

a) Sending an UPDATE message with only new/updated NLRI and zero items in the withdrawal.
b) Sending a KEEPALIVE (or other well-known message that will be parsed properly).
c) Sending an UPDATE message with no new NLRI and only those to be withdrawn.

Perhaps this mode is only triggered after a previous session drop within 3600 seconds. This means you need to track that state, track a timer, and utilize this new (capability?) mode.

We also have a more systemic issue here. Those participating in the global BGP ecosystem need to play an active role in the maintenance of these devices. Failure to do so can and will harm the rest of the ecosystem. I'm not sure of the best way to enforce this operationally, but disabling sessions must be an operational response.

The reason for the reset is to *cause pain* for the poorly behaving devices in question and draw attention to them. It is the same reason a developer would call abort(): dropping the session is the protocol equivalent, meant to alert the users to the problem.
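
To make the state tracking in a) through c) concrete, here is a minimal Python sketch of the "previous session drop within 3600 seconds" idea. It is only an illustration under stated assumptions: DropTracker, cautious, and plan_messages are hypothetical names, the message tuples are stand-ins rather than real BGP encodings, and placing the KEEPALIVE between the two UPDATEs is one possible reading of the list, not something a known implementation does.

```python
import time

RECENT_DROP_WINDOW = 3600  # seconds; the figure floated in the text


class DropTracker:
    """Remembers when each peer's session last dropped (hypothetical helper)."""

    def __init__(self, window=RECENT_DROP_WINDOW, clock=time.monotonic):
        self.window = window
        self.clock = clock      # injectable so the timer can be tested
        self.last_drop = {}     # peer id -> time of most recent session drop

    def record_drop(self, peer):
        self.last_drop[peer] = self.clock()

    def cautious(self, peer):
        """True if this peer's session dropped within the window."""
        dropped = self.last_drop.get(peer)
        return dropped is not None and self.clock() - dropped <= self.window


def plan_messages(tracker, peer, announce, withdraw):
    """Return a list of (kind, payload) tuples to send to the peer.

    Normal mode: announcements and withdrawals share one UPDATE.
    Cautious mode (assumed ordering): announce-only UPDATE, a KEEPALIVE,
    then a withdraw-only UPDATE.
    """
    if not tracker.cautious(peer):
        return [("UPDATE", {"announce": list(announce),
                            "withdraw": list(withdraw)})]
    plan = []
    if announce:
        plan.append(("UPDATE", {"announce": list(announce), "withdraw": []}))
    plan.append(("KEEPALIVE", None))
    if withdraw:
        plan.append(("UPDATE", {"announce": [], "withdraw": list(withdraw)}))
    return plan
```

Even this toy version needs per-peer state, a clock, and two code paths for building messages, which is the kind of growth in complexity the paragraph above is worried about.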

Since not everyone running BGP is a protocol expert these days, I almost feel we need strict guidance and requirements around the logging we need to report these problems to the vendors.

- Jared
_______________________________________________
GROW mailing list
GROW@ietf.org
https://www.ietf.org/mailman/listinfo/grow