Hi Rob, all,
Thanks for the updated document. The new version is definitely an improvement; thanks for the work. Please find some comments below.

1) Critical error (§3)

IMHO, the term "critical error" mixes technical/protocol considerations (e.g. can't read the update) with requirements considerations (the BGP session state is too degraded and I prefer shutting it down rather than running in a degraded mode), which is unfortunate and does not help the discussion. I'd much prefer that we distinguish the two by defining technical levels of errors, then defining the requirements for each, plus the consequences/drawbacks of the decision (whether to keep or shut the session).

From the protocol standpoint, I would propose the following levels of errors, based on the protocol encoding layers: session, update, attribute.
- attribute level error: semantic or syntax error in the attribute value or attribute flags
- session level error: error in the update length / marker, i.e. if, after skipping the update length, I can't find the marker of the next BGP message
- update level error: any other error in the update message
We can further distinguish whether the NLRIs can be parsed or not.

2) Business Requirements

In the current text, I found the requirements a bit too technically oriented. I'd rather add business requirements that are independent of the current solutions. I would propose:

In VPN networks, VPNs are supposed to be isolated from each other and from the other services (most notably the Internet). Hence, an error on routes/BGP messages related to a VPN SHOULD NOT negatively impact other VPNs. Similarly, an error on routes/BGP messages related to a non-VPN service SHOULD NOT negatively impact the VPN service.

In Internet networks, ASes are supposed to be autonomous. Hence an error on routes/BGP messages originated by an AS SHOULD NOT negatively impact destinations originated by other ASes.
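To make the three error levels proposed in 1) more concrete, here is a rough Python sketch. It is not taken from the draft or from any implementation. It assumes the buffer holds whole messages, only analyses UPDATE messages, and only validates the ORIGIN attribute; `ErrorLevel`, `classify` and `build_update` are names made up for this illustration. The header layout (16-octet marker, 2-octet length, 1-octet type) and the ORIGIN rules are those of RFC 4271.

```python
import struct
from enum import Enum

MARKER = b"\xff" * 16   # BGP header marker (RFC 4271)
HEADER_LEN = 19         # marker (16) + length (2) + type (1)
MAX_LEN = 4096          # maximum BGP message size in RFC 4271
TYPE_UPDATE = 2


class ErrorLevel(Enum):
    NONE = "none"
    ATTRIBUTE = "attribute"
    UPDATE = "update"
    SESSION = "session"


def build_update(attrs: bytes, nlri: bytes = b"") -> bytes:
    """Hypothetical helper: build an UPDATE with no withdrawn routes."""
    body = struct.pack("!HH", 0, len(attrs)) + attrs + nlri
    return MARKER + struct.pack("!HB", HEADER_LEN + len(body), TYPE_UPDATE) + body


def classify(stream: bytes) -> ErrorLevel:
    """Classify the first message in `stream` (assumed to hold whole messages)."""
    # Session level: the framing is lost (bad marker or bad length, or the
    # next marker is not where the length field says it should be).
    if len(stream) < HEADER_LEN or stream[:16] != MARKER:
        return ErrorLevel.SESSION
    (length,) = struct.unpack("!H", stream[16:18])
    if not HEADER_LEN <= length <= MAX_LEN or length > len(stream):
        return ErrorLevel.SESSION
    nxt = stream[length:length + 16]
    if nxt and nxt != MARKER[:len(nxt)]:
        return ErrorLevel.SESSION
    if stream[18] != TYPE_UPDATE:
        return ErrorLevel.NONE  # other message types are not analysed here

    # Update level: the sections of the UPDATE cannot be delimited.
    body = stream[HEADER_LEN:length]
    if len(body) < 4:
        return ErrorLevel.UPDATE
    (wlen,) = struct.unpack("!H", body[:2])
    if 4 + wlen > len(body):
        return ErrorLevel.UPDATE
    (alen,) = struct.unpack("!H", body[2 + wlen:4 + wlen])
    if 4 + wlen + alen > len(body):
        return ErrorLevel.UPDATE
    attrs = body[4 + wlen:4 + wlen + alen]

    # Walk the path attributes; only ORIGIN (type 1) is validated here.
    level, pos = ErrorLevel.NONE, 0
    while pos < len(attrs):
        if pos + 3 > len(attrs):
            return ErrorLevel.UPDATE
        flags, code = attrs[pos], attrs[pos + 1]
        if flags & 0x10:  # extended length bit: 2-octet attribute length
            if pos + 4 > len(attrs):
                return ErrorLevel.UPDATE
            (vlen,) = struct.unpack("!H", attrs[pos + 2:pos + 4])
            hdr = 4
        else:
            vlen, hdr = attrs[pos + 2], 3
        if pos + hdr + vlen > len(attrs):
            return ErrorLevel.UPDATE
        value = attrs[pos + hdr:pos + hdr + vlen]
        # ORIGIN is well-known transitive, one octet long, value 0, 1 or 2.
        if code == 1 and ((flags & 0xC0) != 0x40 or vlen != 1 or value[0] > 2):
            level = ErrorLevel.ATTRIBUTE
        pos += hdr + vlen
    return level
```

For instance, `classify(build_update(b"\x40\x01\x01\x07"))` reports an attribute level error because 7 is not a valid ORIGIN value, whereas a message whose marker is damaged is reported as a session level error before its content is even looked at.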
By "negatively impact", we mean losing reachability to a destination (NLRI), typically by losing all the paths in the Loc-RIB to that destination (NLRI). Note that those paths may be learnt through multiple BGP sessions, and hence the requirement spans multiple BGP sessions.

The consequence is that if the BGP error is believed to be limited to a single BGP session (e.g. a session level error), then in a network with redundancy the destination is believed to be still known through another session, and hence one MAY choose to shut the session down and remove all paths learned from it. On the contrary, if the BGP error has a chance of also being met on the redundant paths/sessions, then the BGP session and the routes learned from that session SHOULD be preserved until the negative consequences are considered too important. When evaluating those consequences, the fact that all redundant paths/sessions may suffer from the same error, and hence will inherit the same decision, MUST be considered.

As an illustration, we typically seek to avoid a PE losing both of its redundant iBGP sessions with its BGP RRs because of a single BGP error. And by "a PE" I really mean all PEs experiencing this condition. That could easily be 10s of PEs, even 100s.

3) Technical requirements

For a session level error, the BGP session is dead, so it needs to be shut down/gracefully shut down/gracefully restarted. If the update length is set to the number of octets sent to the peer (or vice versa) rather than computed from the content of the update, there is a chance to 1) limit the number of such session level errors and 2) increase the probability that this error is local to that session and not likely to happen on a redundant/backup session. There is probably a limited part of the BGP code which needs to be hardened to reduce such unrecoverable errors. And if those errors are still frequent, we may further propose technical solutions (e.g.
replacing TCP with SCTP, which can provide message boundaries, among other things (e.g. some benefits of multi-sessions)).

For attribute & update level errors when the NLRI can be parsed, cf. draft-error-handling (treat as withdraw).

Now let the discussion begin :).
For attribute & update level errors when the NLRI cannot be extracted, IMHO there is room for discussion and analysis of the consequences.

"since the NLRI cannot be extracted, error handling mechanisms must be applied at the per-session level" (§5)
Well, IMO, this is a choice to be made rather than a "must".

If we were to skip a BGP update:

For Internet, probably the worst case would be to miss a BGP update with a loop in the AS path and hence create a loop for me and my upstream ASes for the NLRI in the missed update. How probable is this? 0 for iBGP sessions. TBE for eBGP. Then what would be the consequences? Loss of connectivity for the NLRI until the problem is manually solved by an AS between the origin and me, and possible forwarding congestion for others. I'm not sure I care too much about losing reachability to the NLRI in a faulty BGP update: most likely, if only one BGP update (out of millions) is faulty, the cause is the origin AS playing with a specific bit or attribute, and if they chose to play with their update, they should bear the responsibility. This is to be compared with the probability of losing all redundant paths (if the error is seen on the redundant paths) and the consequences (PE, possibly all PEs, down).

For VPN, probably the worst case would be to keep a VPN label previously allocated to VPN 1 and since re-allocated to another VPN (VPN breach, cf. http://tools.ietf.org/html/draft-uttaro-idr-bgp-persistence-01#section-8). Again, the pros and cons could be discussed (e.g. a possible one-way partial VPN breach for some time (that basically no one can exploit) vs all VPNs/PEs being down). IMHO, if we believe such an issue could be corrected in 30-60 minutes, I would probably favor keeping the session up.
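The handling discussed in 3) can be summarised as a small decision function. This is only a sketch of the reasoning above, not text from the draft: `ErrorLevel`, `Action`, `choose_action` and the `keep_session_without_nlri` knob are hypothetical names. The default (`False`) follows the "must" quoted from §5; setting it to `True` corresponds to treating that case as a choice and skipping the faulty update.

```python
from enum import Enum


class ErrorLevel(Enum):
    SESSION = "session"
    UPDATE = "update"
    ATTRIBUTE = "attribute"


class Action(Enum):
    RESET_SESSION = "reset session (shutdown / graceful restart)"
    TREAT_AS_WITHDRAW = "treat the NLRIs in the update as withdrawn"
    SKIP_UPDATE = "skip the update and keep the session up"


def choose_action(level: ErrorLevel,
                  nlri_parsable: bool,
                  keep_session_without_nlri: bool = False) -> Action:
    """Map an error level to a handling action (illustrative policy only)."""
    # Session level: message boundaries are lost, so the session is dead.
    if level is ErrorLevel.SESSION:
        return Action.RESET_SESSION
    # Attribute or update level with readable NLRI: treat as withdraw.
    if nlri_parsable:
        return Action.TREAT_AS_WITHDRAW
    # NLRI cannot be extracted: left to the operator's choice in this sketch.
    if keep_session_without_nlri:
        return Action.SKIP_UPDATE
    return Action.RESET_SESSION
```

An operator who judges that the redundant sessions are likely to hit the same error would call `choose_action(level, False, keep_session_without_nlri=True)` and accept the drawbacks listed above in exchange for keeping the PE up.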
From the lively discussions, it looks like opinions may vary depending on the AS, people and circumstances, e.g. how failure-independent are my redundant BGP paths? (e.g. do they use different BGP implementations?) As such, what about defining severity levels for BGP error handling? Some may wish to accept only low severity errors, while others may be willing to accept high severity errors (including when the NLRI cannot be found), e.g. the network has been down for 30 minutes and, while waiting for the patch, one may want to be able to restore some service at all costs (it can't possibly be worse).

Again, IMHO it would be good to discuss the drawbacks depending on the situation (iBGP, eBGP; hop-by-hop routed, tunneled ...) in this requirements document, to make sure that we are all on the same page, that we have constructive discussions, and that SPs enabling revised error handling are fully aware of the consequences.

4) Security consideration

In §7 "security considerations" I would discuss the fact that current BGP error handling (or a (too) strict one) could be exploited by attackers to create a remote DoS attack. Should we also ask for a review by the SIDR WG, since "The purpose of the SIDR working group is to reduce vulnerabilities in the inter-domain routing system."?

...

Best regards,
Bruno

>-----Original Message-----
>From: idr-boun...@ietf.org [mailto:idr-boun...@ietf.org] On Behalf Of Rob Shakir
>Sent: Thursday, December 27, 2012 7:44 PM
>To: i...@ietf.org
>Subject: [Idr] Fwd: [GROW] I-D Action: draft-ietf-grow-ops-reqs-for-bgp-error-handling-06.txt
>
>Hi IDR!
>
>FYI -- please find an updated relating to a new version of draft-ietf-grow-ops-reqs-for-bgp-error-handling.
>
>Any comments very welcome (to me or grow@).
>
>Seasons greetings!
>r.
>
>Begin forwarded message:
>
>> From: <rob.sha...@bt.com>
>> Subject: Re: [GROW] I-D Action: draft-ietf-grow-ops-reqs-for-bgp-error-handling-06.txt
>> Date: 27 December 2012 18:41:50 GMT
>> To: <internet-dra...@ietf.org>, <i-d-annou...@ietf.org>
>> Cc: grow@ietf.org
>>
>> On 27/12/2012 18:35, "internet-dra...@ietf.org" <internet-dra...@ietf.org> wrote:
>>
>>> A New Internet-Draft is available from the on-line Internet-Drafts directories.
>>> This draft is a work item of the Global Routing Operations Working Group of the IETF.
>>>
>>> Title     : Operational Requirements for Enhanced Error Handling Behaviour in BGP-4
>>> Author(s) : Rob Shakir
>>> Filename  : draft-ietf-grow-ops-reqs-for-bgp-error-handling-06.txt
>>> Pages     : 19
>>> Date      : 2012-12-27
>>
>> Hi GROW!
>>
>> This update is a fairly major re-spin of the BGP Error Handling requirements draft. The technical content should be as per the previous revisions however, following the ietf/RtgDir last call comments, I have made the following changes:
>>
>> * Made the amendments that were discussed and there was no disagreement with from our meeting in Atlanta -- this is essentially renaming the Critical/Semantic error types to Critical/Non-Critical.
>>
>> * Significant de-duplication within the text including merging the operational monitoring/toolset discussions into the error handling sections.
>>
>> * Adoption of rfc2119 language throughout to clarify the requirements.
>>
>> * Removal of some of the discussion around more detailed justifications for why particular decisions were made.
>>   I think this was useful through the discussion phase of this draft, but it seems like GROW/IDR have converged on a relatively stable set of requirements, so I have trimmed back some of this discussion.
>>
>> I'd really welcome any further comments on this before we re-submit for publication. To eke these out - Peter/Chris - can you kick off a WGLC for this draft please? :-)
>>
>> Seasons greetings!
>> r.
>
>_______________________________________________
>Idr mailing list
>i...@ietf.org
>https://www.ietf.org/mailman/listinfo/idr
_______________________________________________
GROW mailing list
GROW@ietf.org
https://www.ietf.org/mailman/listinfo/grow