Re: [rrg] routing security and scale impacts (was RRG to hibernation)

Russ White Wed, 21 Nov 2012 04:36:28 -0800

+1

I might even have some ideas on where to get some ideas, if we can convince the 
researchers in question to come forward. We could start with a requirements 
doc, which I'd be willing to co-author with someone once I replace the hard drive in 
my computer.


Russ


On Nov 13, 2012, at 9:09 AM, "George, Wes" <[email protected]> wrote:

> Changing subject line to reflect topic
> 
> Shane has articulated a number of concerns that I think would be useful for 
> RRG to spend some time working on, and I tend to agree with Danny that the 
> current BGPSec solution seems to be more about "hacking at the edges" to get 
> something that is marginally better in some ways than the [lack of] security 
> that we have now, potentially ignoring the known scaling problems this group 
> has discussed at length all the while doing several things likely to 
> exacerbate them. It gives me concern about whether it will see significant 
> deployment due to the large amount of required investment vs the potential 
> benefit. I know I have asked more than once about the scaling implications of 
> BGPSec since it potentially makes a large impact in the footprint of the 
> routing data that must be stored and managed, and haven't exactly been 
> pleased with the answers even though some analysis has been done to show that 
> it's not a bad thing.
> 
> If I were to distill things down, today we have a growth curve for both the 
> routing table (both RIB and FIB) and for cost-effective hardware with the 
> horsepower necessary to manage it (CPU, ASIC, memory, etc). SIDR is likely 
> not the one thing that will break the routing system by causing those curves 
> to cross, but it certainly changes the curves' pitch such that it's more 
> likely that the cost of keeping up with the demands of the system starts 
> becoming unmanageable, even if it doesn't actually reach the limits of the 
> technology. The investment in a network for scale and growth is incremental, 
> and SIDR's full justification is that those incremental upgrades will bring 
> hardware that can support its needs organically. However, things like BGPSec 
> or other disturbances that increase the growth curve of the routing table and 
> related scaling vectors mean that as an operator, I have to shorten my 
> upgrade cycle, spend capital earlier than originally projected, possibly even 
> to the point where I can't manage an entire depreciation cycle (5-7 years) before 
> needing to spend additional money on upgrades. In a network that is driven by 
> commoditization of prices, that's not a good position to be in.
> 
> Additionally, as Shane alluded to in another message, this isn't simply about 
> DFZ scale, but also internal scale, where there are commonly a *LOT* more 
> routes being carried by your average router inside an ISP's network. There 
> are also other considerations like the rate of updates due to background 
> churn vs during an event, other things that the control plane must manage 
> simultaneously, etc. Taking a step even further away from where RRG has been 
> previously focused, there is a similar sort of scaling problem within the 
> L3VPN space that is typically self-contained within the SP's network. While I 
> think there are some engineering solutions that may help with the short-term 
> scaling issues, there may also be some meat for research in the area of 
> modeling and instrumentation of the routing system to give SPs better tools 
> to use their available capacity efficiently, and possibly even changes to 
> help the routing control plane degrade more gracefully and deterministically. 
> The L3VPN discussion is detailed in draft-gs-vpn-scaling-01 (an -02 rev is due soon, 
> waiting on co-author review and a few more updates), specifically in section 
> 6 and 6.5 for the modeling/instrumentation, and in sections 4 and 5 for ways 
> that the control plane tends to break down at scale limits.
> 
> Thanks,
> 
> Wes George
> 
> 
> 
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]] On Behalf Of
>> Shane Amante
>> Sent: Saturday, November 10, 2012 8:39 PM
>> To: [email protected]
>> Subject: Re: [rrg] RRG to hibernation
>> 
>> 
>> On Nov 10, 2012, at 10:35 AM, Danny McPherson <[email protected]> wrote:
>>> On Nov 10, 2012, at 12:24 PM, Tony Li wrote:
>> [--snip--]
>>>> I agree that some security needs to be deployed.  I'm not convinced
>> that it needs to be BGPSEC.  We've muddled along for many years and
>> never found the gumption to actually deploy anything.  Must not be
>> important to people.  I don't get it, but that's the observable
>> behavior.
>>>> 
>>>> In any case, this doesn't seem like a research topic.  This is pretty
>> clearly an engineering issue.
>>> 
>>> I don't agree.  The engineering solution that SIDR is actively working
>> (RPKI-enabled BGPSEC) is pumping out standards track RFCs like there's
>> no tomorrow.  The USG has stated intentions of "expediting secure
>> routing work through the Internet standard process" and "fostering
>> adoption through government procurement vehicles".
>>> 
>>> As an operator this scares the hell out of me, especially considering
>> what they've designed is largely a system to control "what's routed on
>> the Internet and by whom".  They can't seem to do anything in BGP(SEC)
>> without introducing the equivalent of "periodic updates", and undoing
>> all the goodness of things like update packing completely.
>>> 
>>> Some serious thinkers working on this problem would be goodness...
>> 
>> Let me add that I share Danny's concerns ...
>> 
>> However, let me try to take a step back and share with everyone a much
>> broader set of, potentially, architectural concerns that I'm not sure
>> this RG considered during the last round.
>> 
>> BGP was originally designed for flooding of reachability information.
>> But, reachability information is the end-result /after/ the application
>> of _routing_policy_, describing "intent", by operators of individual
>> networks based on various contractual agreements they have with parties
>> whom they directly interconnect.  Assuming you agree with this premise,
>> this presents a paradox from a security PoV.  Specifically, if a
>> downstream network does not have visibility into its upstream network's
>> routing policy, is it practical/feasible for the downstream network to
>> understand the _intended_ propagation of reachability information and,
>> ultimately, connectivity?  Furthermore, is it feasible to carry such
>> information within the control plane itself?  Or, should the control
>> plane be relegated to carrying [strictly] reachability information in
>> real-time, while offboard systems carry accompanying routing policy and
>> security information in order to assist in making "optimal" Inter-Domain
>> routing/forwarding decisions?
>> 
>> A second concern is also related to the original design of BGP and what
>> it has organically evolved into, today.  Specifically, BGP is /also/
>> now being tasked as a generic "message bus" and service discovery
>> mechanism.  Not to pick on anyone, in particular, but the following are
>> recent examples that come to my mind wrt this trend:
>> http://tools.ietf.org/html/draft-ietf-idr-ls-distribution-01
>> http://tools.ietf.org/html/draft-ietf-idr-operational-message-00
>> ... and, there may be others.  Although, contrast those proposals with
>> what should be most concerning to people in this RG, and in the IETF:
>> http://tools.ietf.org/html/draft-ietf-grow-ops-reqs-for-bgp-error-handling-05
>> In short, operators (such as myself) are _extremely_ concerned that a
>> single erroneous update results in a complete reset of BGP sessions.
>> Due to the overwhelming success of BGP, it's now (and, has been for a
>> while) a mission-critical protocol, thus such catastrophic session
>> resets -- caused by a single malformed UPDATE -- are widely
>> visible/impactful.  This impact is compounded by the 'cost to recover'.
>> Namely, due to the large and growing amount of information in the RIB
>> (again, not just reachability, but also service-discovery and completely
>> orthogonal information), it takes longer to exchange RIB information
>> and, ultimately, restore services.  Is this really the best we, as an
>> industry, can do?
>> 
>> While the IETF IDR WG has been looking at mechanisms for how BGP may
>> defend against certain types of erroneous BGP UPDATEs for external BGP
>> sessions:
>> http://tools.ietf.org/html/draft-ietf-idr-error-handling-02
>> ... there does not appear to be any [straightforward] answer with
>> respect to internal BGP sessions, given the requirement that BGP
>> speakers internal to an AS must have a globally consistent RIB and FIB,
>> otherwise packet forwarding loops will result.  And, in my personal
>> operational experience it's _rarely_ the case that malformed UPDATEs
>> are detected at the first ASBR (attached to an eBGP neighbor) in my AS,
>> thus I doubt that mechanisms such as draft-ietf-idr-error-handling-02
>> are an adequate solution to the problems we experience.
>> IOW, as an operator I desire "defense in depth" where a heterogeneous
>> mix of vendor equipment (HW + SW), participating as interior BGP
>> speakers, have mechanisms to detect *and* automatically recover from
>> malformed UPDATEs received over iBGP sessions.  This is another area
>> that I would point research colleagues toward.
>> 
>> So, this raises the classic conundrum of: increasing complexity,
>> increasing RIB (and FIB) size information coupled with a contrasting
>> need from operators who are concerned about the robustness of the
>> protocol and the requirement to NOT sustain any failures[1].
>> Something's got to give.
>> 
>> Ultimately, this makes me question whether it's no longer _just_ growth
>> of RIB (and, FIB) size that this RG should be (primarily?) focused on.
>> Rather, will the requirements for:
>> a) operational robustness, in the face of critical messaging errors in
>> an Inter-Domain Routing Protocol, which the IETF may be unable to
>> address on its own;
>> b) designing security as a first-class principle of an Inter-Domain
>> Routing Protocol -- either carried within or outside of control-plane
>> reachability information; and
>> c) increased scalability of RIB (and, other?) information ... lead us
>> down a path of considering we may be approaching the end-of-the-road for
>> BGPv4 and we need something new?
>> 
>> Does anyone on this list share similar concerns wrt operational
>> robustness, time to recovery and (then) scalability of BGPv4?
>> 
>> -shane
>> 
>> [1] It is not cool to suggest that operators should just stop asking for
>> new features and we wouldn't have this problem.  :)
>> _______________________________________________
>> rrg mailing list
>> [email protected]
>> http://www.irtf.org/mailman/listinfo/rrg
> 
