Re: [GROW] [Idr] I-D Action: draft-ietf-grow-ops-reqs-for-bgp-error-handling-06.txt

bruno.decraene Tue, 29 Jan 2013 03:11:53 -0800

Hi Rob,

Thanks for your reply. More inline.


>From: Rob Shakir [mailto:r...@rob.sh]
>Sent: Monday, January 28, 2013 7:49 PM
>
>Hi Bruno,
>
>Thanks for the review of this version of the draft. I've added some feedback
>in-line as [rjs]. My apologies for the delay in responding.
>
>On 8 Jan 2013, at 10:57, bruno.decra...@orange.com wrote:
>
>> 1)  Critical error (§3)
>>
>> IMHO, the term "critical error" is mixing both technical/protocol
>considerations (e.g. can't read the update) and requirements considerations
>(BGP sessions state is too degraded and I prefer shutting it down rather than
>running on a degraded mode) which IMHO is unfortunate and does not help the
>discussion. I'd much prefer that we distinguish both by defining technical
>levels of errors and then defining the requirements for each plus  the
>consequences/drawbacks of the decision (whether to keep or shut the session).
>> For the protocol standpoint, I would propose the following level of errors,
>based on the protocol encoding layers: session, update, attribute.
>> - attribute level error: semantic or syntax error in the attribute value or
>attribute flags
>> - session level error: error in the update length / marker. i.e. if skipping
>the update length I can't find the marker of the next bgp message.
>> - update level error: any other error in the update message
>>
>> We can further distinguish if the NLRIs can be parsed or not.
>
>[rjs]: I would observe that there are two ways that we can consider how to
>classify errors here -- one based on the definition of the impact on the UPDATE
>of errors, and then one based on the reaction to those errors. 

[Bruno]: Indeed, but the point of the draft is discussing the requirements for 
those reactions, since we would like them to change.
If you define, at the very beginning of the draft, which errors are "critical" 
and hence require a bgp session shut, then IMHO the discussion is closed and 
the draft done after those lines.

>The current
>draft (clearly) takes the latter approach for classification and reaction,
>however, as you say, it could be advantageous to classify the significance of
>the error to determine how "broken" the UPDATE is, and then map this to the
>possible approaches for handling the error.
>
>[rjs]: If we were to take this approach -- then we could end up with a mapping
>of errors of:
>       - For attribute-level errors, if it is not the NLRI-carrying attribute
>affected, then this is NLRI-level error handling, otherwise use session level
>error handling.
>       - For session-level errors, only use session-level error handling.
>       - For UPDATE-level errors, if the NLRI attribute can be parsed, then use
>error handling targeted to the NLRI, else handle it at a session level.
>I am not clear where we would have UPDATE errors that do not fall within either
>the attribute, or session categories - do you have any example to help me
>understand? 

[Bruno]: 
- attribute-level error: the error is confined to a single attribute. 
Example: the low-order (unused) bits of the attribute flags are set.
- update-level error: I can't parse the following attributes in that update. 
Example: the RIPE Labs incident. The error spans more than a single attribute; 
however, the session could still survive, as the session-level length was OK.

>Also, do you envisage cases where there are session-level errors
>that we would map to any NLRI-level error handling?

[Bruno]: No. For session-level errors, I can't parse the current update nor any 
subsequent data, including the next update.
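
As a rough illustration of the three error levels and the mapping discussed above, here is a minimal sketch. The names (ErrorLevel, Handling, choose_handling) are hypothetical and not taken from the draft, and "the NLRI-carrying attribute is affected" is simplified to a single nlri_parseable flag. The final fallback to session-level handling reflects the mapping as summarized in this thread, not a settled requirement.

```python
from enum import Enum


class ErrorLevel(Enum):
    ATTRIBUTE = "attribute"  # error confined to a single attribute
    UPDATE = "update"        # the rest of this UPDATE cannot be parsed
    SESSION = "session"      # message framing (length/marker) is broken


class Handling(Enum):
    NLRI_LEVEL = "nlri-level"        # e.g. treat-as-withdraw for the NLRI
    SESSION_LEVEL = "session-level"  # e.g. shut down or restart the session


def choose_handling(level: ErrorLevel, nlri_parseable: bool) -> Handling:
    """Map an error level to a handling scope (illustrative assumption)."""
    if level is ErrorLevel.SESSION:
        # Nothing after the broken framing can be trusted.
        return Handling.SESSION_LEVEL
    if nlri_parseable:
        # Attribute- or update-level error, but we know which NLRI is affected.
        return Handling.NLRI_LEVEL
    # NLRI unknown: the mapping above falls back to session-level handling.
    return Handling.SESSION_LEVEL
```

For example, choose_handling(ErrorLevel.UPDATE, nlri_parseable=True) selects NLRI-level handling, while any session-level error selects session-level handling regardless of the flag.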

>[rjs]: It does sound advantageous to note the caveats of holding the session up
>for each type of error -- I will work to add a paragraph to § 3 that describes
>the motivation for wanting to hold the session up in some cases, and the
>drawbacks of doing so.

[Bruno]: Thank you.

>>  2)  Business Requirements
>> In the current text, I found the requirements a bit too technically oriented.
>I'd rather add business requirements independent of the current solutions. I
>would propose:
>>
>> In VPN networks, VPNs are supposed to be isolated from each other and from
>other services (most notably the Internet). Hence, an error on routes/BGP
>messages related to a VPN SHOULD NOT negatively impact other VPNs. Similarly,
>an error on routes/BGP messages related to a non-VPN service SHOULD NOT
>negatively impact the VPN service.
>> In Internet networks, ASes are supposed to be Autonomous. Hence an error on
>routes/BGP messages originated by an AS SHOULD NOT negatively impact
>destinations originated from other ASes.
>>
>> By "negatively impact", we mean losing reachability for a destination (NLRI),
>typically by losing all the paths in the Loc-RIB to that destination (NLRI).
>Note that those paths may be learnt through multiple BGP sessions and hence the
>requirement spans multiple BGP sessions. The consequence is that if the BGP
>error is believed to be limited to a single BGP session (e.g. a session level
>error), then in a network with redundancy, the destination is believed to be
>still known through another session and hence the session MAY be shut down
>and all paths learned from that session removed. On the contrary, if
>the BGP error is likely to also occur on the redundant paths/sessions, then
>the BGP session and the routes learned from that session SHOULD be preserved,
>until the negative consequences are considered too severe. When evaluating
>those consequences, the fact that all redundant paths/sessions may suffer from
>the same error and hence will inherit the same decision MUST be considered.
>
>[rjs]: I will go through and review this section to try and align it more with
>the service/business requirements for BGP deployments. It strikes me that the
>suggestion above is more related to an additional point that is not clearly
>included in this section around the different requirements for differing
>networks in which BGP is deployed. I would suggest that this is something that
>is added to the latter part of §2, and the existing text remains. I'm keen that
>we provide some background as to *why* there is motivation for change in terms
>of deployment characteristics, as well as covering the business requirements
>you mention above.

[Bruno]: "Why" is good. But I wish we could also clarify/discuss what our 
(ultimate) goals/requirements are. Indeed, my proposition is service dependent. 
But the reality is that BGP is used for different services/business and hence 
the business requirements are per service.
The current draft is good, but over time, most of its requirements have 
already been covered by some proposed solution. Ideally, I'd like the document to 
also pave the way for future solutions.
IMHO, my above business requirements are reasonable, at least the VPN one. But 
they are still not covered by existing proposed solutions, nor the requirement 
draft.
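
To make the "negatively impact" definition quoted above more concrete, here is a small sketch that computes which destinations would lose all of their paths if a given set of sessions were torn down. The function name and the data layout (a dict from session name to the set of NLRIs learned over it) are assumptions made for illustration only.

```python
def destinations_losing_reachability(paths_by_session, sessions_down):
    """Return, sorted, the NLRIs whose every path was learned over a
    session in sessions_down (hypothetical helper, illustration only)."""
    down = set(sessions_down)
    sessions_per_nlri = {}
    for session, nlris in paths_by_session.items():
        for nlri in nlris:
            sessions_per_nlri.setdefault(nlri, set()).add(session)
    # A destination is impacted only if no surviving session still carries it.
    return sorted(
        nlri for nlri, sessions in sessions_per_nlri.items() if sessions <= down
    )


# A PE with two route reflector sessions, one prefix known via only one of them.
paths = {
    "rr1": {"10.0.0.0/8", "192.0.2.0/24"},
    "rr2": {"10.0.0.0/8"},
}
only_rr1_down = destinations_losing_reachability(paths, {"rr1"})
both_down = destinations_losing_reachability(paths, {"rr1", "rr2"})
```

If the same faulty update is received over both rr1 and rr2 and both sessions are reset, every destination is lost, which is the case the business requirement tries to avoid.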

Regarding the organization of the draft, I would leave it to you.
I am fine if this is added in the latter part of §2.
2 comments:
- IMHO, I would rename "2. Problem Statement" into "2. Introduction" and "2.1. 
Role of BGP-4 in Service Provider Networks" into "3 Problem Statement". And I 
would add the business requirements in last §.
- at the end of current §2.1: "This document defines a set of requirements for 
protocol developments"  I would propose :s/requirements/technical requirements  
(to make the distinction between the business requirements, which 
are in this §, and the technical requirements in the next).

>>
>> As an illustration, we typically seek to avoid a PE losing both of its
>redundant iBGP sessions with its BGP RRs because of a single BGP error. And by
>"a PE" I really mean all PEs experiencing this condition. Could easily be 10s
>of PEs, even 100s.
>>
>> 3)  Technical requirements
>> For session level error, the BGP session is dead so it needs to be
>shutdown/graceful shutdown/graceful restart. If the update length is set to the
>number of octets sent to the peer (or vice versa) rather than computed based on
>the content of the update, there is a chance to 1) limit the number of such
>session level errors and 2) increase the probability that this error is local
>to that session and not likely to happen on a redundant/backup session. There
>is probably a limited part of the BGP code which needs to be hardened to reduce
>such unrecoverable errors. And if those errors are still frequent, we may
>further propose technical solutions (e.g. replacing TCP by SCTP, which can
>provide message boundaries, among other things (e.g. some benefits of
>multi-sessions)).
>>
>> For attribute & update level error when the NLRI can be parsed, cf draft-
>error-handling (treat as withdraw).
>
>[rjs]: AIUI, if we added this requirement, then we could say that the total
>UPDATE length should be trusted as the "real" length of the transmitted UPDATE
>(which would be further validated by the subsequent presence of the marker). 

[Bruno]: correct. With possibly :s/"real"/specified

> In
>this case, (and I expect we are getting towards draft-ietf-idr-error-handling
>here), then do you think that there is a capability required to indicate that
>an implementation has used this method of calculation? Without one, then we
>have the ambiguity of whether an implementation used this "trick" and hence are
>not clear whether we should trust it.

[Bruno]: So far, IMO, I don't see a need for a capability.
For the receiver: the UPDATE message length indicates where the next update 
is located. We could see the marker as a check. Then either I can read the next 
update or not. This seems to be within the existing BGP spec. No trick.
For the sender: IMO clearly specifying that the receiver will use the UPDATE 
message length as the message delimiter will put emphasis on getting this 
field right (which seems doable to me, as it's independent of the attributes 
within the message: it's just the number of bytes that I'm sending); otherwise 
the session is dead.
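
The receiver behaviour described above can be sketched as follows: the length field alone delimits messages, and the marker of the next message serves as a check. This is an illustration under assumptions, not an implementation from any BGP stack. The names split_messages and SessionLevelError are hypothetical; the header layout (16-octet all-ones marker, 2-octet length, 1-octet type, length between 19 and 4096) follows RFC 4271.

```python
import struct

MARKER = b"\xff" * 16   # RFC 4271: marker is 16 octets of all ones
HEADER_LEN = 19         # marker (16) + length (2) + type (1)
MAX_LEN = 4096          # maximum BGP message length in RFC 4271


class SessionLevelError(Exception):
    """Framing is broken: the next message cannot be located."""


def split_messages(buf: bytes):
    """Split buf into complete messages using only the length field.

    Returns (messages, leftover). Message bodies are not inspected, so a
    malformed attribute inside one message does not prevent locating the
    next one.
    """
    messages = []
    offset = 0
    while len(buf) - offset >= HEADER_LEN:
        if buf[offset:offset + 16] != MARKER:
            raise SessionLevelError(f"no marker at offset {offset}")
        (length,) = struct.unpack_from("!H", buf, offset + 16)
        if not HEADER_LEN <= length <= MAX_LEN:
            raise SessionLevelError(f"bad length {length} at offset {offset}")
        if len(buf) - offset < length:
            break  # incomplete message: wait for more data
        messages.append(buf[offset:offset + length])
        offset += length
    return messages, buf[offset:]
```

With a correct length field, an UPDATE whose body is garbage is still skipped cleanly and the following message is found. With a wrong length field, the marker check at the next offset fails, which is the session-level error case.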

>> Now let the discussion begin :-). For attribute & update level error when the
>NLRI cannot be extracted IMHO there is room for discussion and analysis of the
>consequences.
>>
>> "since the NLRI cannot be extracted, error handling mechanisms must be
>applied at the per-session level" (§5)
>> Well, IMO, this is a choice to be made rather than a "must".
>
>[rjs]: Do you envisage that this is a requirement in all scenarios, or a
>special case to be able to hold the session up following repeated errors? If
>during normal operation one tries to apply treat-as-withdraw, then this cannot
>be done (safely) unless we can determine to which NLRI this should be applied
>to. 

[Bruno]:
IMO if in some scenario, this tool is required to fulfill business 
requirements, we should not forbid it. Possibly combined with other tools to 
make it safer.
That being said, I don't believe we will do the full work (analysis & 
solutions) to address those business requirements. So let's start with a first 
phase, and then we'll see if this is enough or not. So I'm fine with keeping 
the session up even if some NLRI cannot be parsed, only for special cases 
manually enabled.

> I'm unclear whether this not being a MUST (although at the moment it's a
>lower-case 'must') really implies that we have a requirement for a solution
>akin to the persistence draft as a "last resort" mechanism?

To clarify, here I was referring to update level errors or MP attribute level 
errors where NLRI cannot be extracted.
The persistence draft would be for session level errors where the session 
cannot be re-established. 

>[rjs]: I think that this in-line with your later discussion -- essentially, the
>different levels as to how conservative one might want to be are very black and
>white at the current time (within the draft), as it's really whether you have
>these mechanisms "on", or "off". Is your suggestion that we evaluate more
>levels of error handling (i.e., include the "ignore all errors and continue
>operating") within this document, 

[Bruno]: yes. My suggestion was that we evaluate, or at least not forbid and 
explicitly leave for the future, other responses to error conditions, 
specifically session "hold up".

>or is it an evaluation between the current
>on/off levels? Extending the draft to cover the "hold up" use case potentially
>expands it outside of BGP error handling that is applicable to most deployments
>of BGP into more special cases in my view. 

[Bruno]: rather than "more special cases", I'd say "type of BGP errors not yet 
widely faced"

>I'd like to understand whether the
>working group feels that this problem space falls within the scope of this
>draft.

[Bruno]: I believe that the error handling requirements should cover all cases, 
otherwise this means that we don't have requirements for the other cases (ok, more 
realistically, we don't want to bother thinking about hypothetical cases). But I'm 
fine with leaving this as a next phase. In which case, IMO the draft should not 
close the door but rather hint at a possible next phase.

Many thanks Rob. You picked a difficult task on a difficult subject.

Bruno

>Thanks,
>r.
>
>> If we were to skip a BGP update:
>> For Internet, probably the worst case would be to miss a BGP update with a
>loop in the AS path and hence create a loop for me and my upstream ASes for the
>NLRI in the missed update. How probable is this? 0 for iBGP sessions. TBE
>for eBGP. Then what would be the consequences? Loss of connectivity for the
>NLRI until the problem is manually solved by an AS between the origin and me,
>possible forwarding congestion for others. I'm not sure I care too much about
>losing reachability to NLRI in a faulty BGP update as most likely, if only one
>BGP update (out of millions) is faulty, the reason may come from the origin AS
>playing with a specific bit or attribute and if they chose to play with their
>update, they should bear the responsibility. To be compared with the probability
>of losing all redundant paths (if the error is seen on redundant paths) and the
>consequences (PE -possibly all PEs- down).
>>
>> For VPN, probably the worst case would be to keep a VPN label previously
>allocated to VPN 1 and re-allocated to another VPN (VPN breach, cf.
>http://tools.ietf.org/html/draft-uttaro-idr-bgp-persistence-01#section-8).
>> Again, the pros and cons could be discussed (e.g. a possible one-way partial VPN
>breach for some time (that basically no one can exploit) vs all VPNs/PEs being
>down). IMHO, if we believe such an issue could be corrected in 30-60 minutes, I
>would probably favor keeping the session up.
>>
>> From the lively discussions, it looks like opinions may vary depending on
>the AS, people and circumstances. E.g. how failure-independent are my redundant
>BGP paths? (e.g. do they use different BGP implementations?)
>> As such, what about defining severity levels for BGP error handling? One
>may wish to accept only low severity errors while others may be willing to
>accept high severity errors (including when the NLRI cannot be found), e.g. the
>network has been down for 30 minutes and, while waiting for the patch, one may want
>to be able to restore some service at all costs (can't possibly be worse).
>>
>> Again, IMHO it would be good to discuss the drawbacks depending on the
>situation (iBGP, eBGP; hop by hop routed, tunneled, ...) in this requirement
>document to make sure we are all on the same page, we have constructive
>discussions and SPs enabling revised error handling are fully aware of the
>consequences.
>>
>> 4)  Security consideration
>> In §7 "security considerations" I would discuss the fact that current BGP
>error handling (or a (too) strict one) could be exploited by attackers to
>create a remote DoS attack.
>> Should we also ask for a review by the SIDR WG since "The purpose of the SIDR
>working group is to reduce vulnerabilities in the inter-domain routing system."
>? ...
>>
>> Best regards,
>> Bruno
>>
>>
>> >-----Original Message-----
>> >From: idr-boun...@ietf.org [mailto:idr-boun...@ietf.org] On Behalf Of Rob
>> >Shakir
>> >Sent: Thursday, December 27, 2012 7:44 PM
>> >To: i...@ietf.org
>> >Subject: [Idr] Fwd: [GROW] I-D Action:
>> >draft-ietf-grow-ops-reqs-for-bgp-error-handling-06.txt
>> >
>> >Hi IDR!
>> >
>> >FYI -- please find an update relating to a new version of
>> >draft-ietf-grow-ops-reqs-for-bgp-error-handling.
>> >
>> >Any comments very welcome (to me or grow@).
>> >
>> >Seasons greetings!
>> >r.
>> >
>> >Begin forwarded message:
>> >
>> >> From: <rob.sha...@bt.com>
>> >> Subject: Re: [GROW] I-D Action:
>> >> draft-ietf-grow-ops-reqs-for-bgp-error-handling-06.txt
>> >> Date: 27 December 2012 18:41:50 GMT
>> >> To: <internet-dra...@ietf.org>, <i-d-annou...@ietf.org>
>> >> Cc: grow@ietf.org
>> >>
>> >> On 27/12/2012 18:35, "internet-dra...@ietf.org" <internet-dra...@ietf.org>
>> >> wrote:
>> >>
>> >>>
>> >>> A New Internet-Draft is available from the on-line Internet-Drafts
>> >>> directories.
>> >>> This draft is a work item of the Global Routing Operations Working Group
>> >>> of the IETF.
>> >>>
>> >>>   Title           : Operational Requirements for Enhanced Error Handling
>> >>> Behaviour in BGP-4
>> >>>   Author(s)       : Rob Shakir
>> >>>   Filename        : draft-ietf-grow-ops-reqs-for-bgp-error-handling-06.txt
>> >>>   Pages           : 19
>> >>>   Date            : 2012-12-27
>> >>
>> >> Hi GROW!
>> >>
>> >> This update is a fairly major re-spin of the BGP Error Handling
>> >> requirements draft. The technical content should be as per the previous
>> >> revisions however, following the ietf/RtgDir last call comments, I have
>> >> made the following changes:
>> >>
>> >> * Made the amendments that were discussed and there was no disagreement
>> >> with from our meeting in Atlanta -- this is essentially renaming the
>> >> Critical/Semantic error types to Critical/Non-Critical.
>> >>
>> >> * Significant de-duplication within the text including merging the
>> >> operational monitoring/toolset discussions into the error handling
>> >> sections.
>> >>
>> >> * Adoption of rfc2119 language throughout to clarify the requirements.
>> >>
>> >> * Removal of some of the discussion around more detailed justifications
>> >> for why particular decisions were made. I think this was useful through
>> >> the discussion phase of this draft, but it seems like GROW/IDR have
>> >> converged on a relatively stable set of requirements, so I have trimmed
>> >> back some of this discussion.
>> >>
>> >> I'd really welcome any further comments on this before we re-submit for
>> >> publication. To eke these out - Peter/Chris - can you kick off a WGLC for
>> >> this draft please? :-)
>> >>
>> >> Seasons greetings!
>> >> r.
>> >>
>> >> _______________________________________________
>> >> GROW mailing list
>> >> GROW@ietf.org
>> >> https://www.ietf.org/mailman/listinfo/grow
>> >
>> >_______________________________________________
>> >Idr mailing list
>> >i...@ietf.org
>> >https://www.ietf.org/mailman/listinfo/idr
>>


_________________________________________________________________________________________________________________________

This message and its attachments may contain confidential or privileged 
information that may be protected by law;
they should not be distributed, used or copied without authorisation.
If you have received this email in error, please notify the sender and delete 
this message and its attachments.
As emails may be altered, France Telecom - Orange is not liable for messages 
that have been modified, changed or falsified.
Thank you.

_______________________________________________
GROW mailing list
GROW@ietf.org
https://www.ietf.org/mailman/listinfo/grow
