Re: [ncc-services-wg] Draft Cloud Strategy Framework

Hank Nussbacher Sat, 28 Aug 2021 23:18:39 -0700

On 26/08/2021 15:47, Felipe Victolla Silveira wrote:

Felipe,

Thanks for the extremely well thought out and detailed response. I can't argue with most of what you stated.

So I will go back to my original statement that cloud computing is not secure for critical infrastructure. Cloud vendors roll out dozens of new features per year and each vendor probably has tens of millions of lines of code running and controlling their platforms.

Microsoft's IE and Google's Chrome are used by a billion users and have had dozens of security holes found and fixed over the past decade. The cloud platforms are used by perhaps only a million different end users and every week another security hole is found:

https://www.wiz.io/blog/chaosdb-how-we-hacked-thousands-of-azure-customers-databases

I assume there are dozens of 0-days in every cloud platform.

Regards,
Hank

Dear Gert, Hank,

First, our apologies again for the delay in our response. A few of us were 
taking our summer break and our colleagues didn't want to respond without 
checking with us first.

To recap, we’ve outlined our core goals - improve the resilience of our 
services, become more agile and flexible as an organisation, and focus 
engineering expertise on our core business. You correctly point out that we 
haven't really talked about the problems we’re trying to solve.

Fair point - we're not used to talking about the firefighting that's needed 
behind the scenes. We can go over some of this now. We can start by noting that 
if you take the inverse of the benefits we've listed so far, you find most of 
the problems we're trying to solve.

1. Improve resilience and availability

We currently host our infrastructure in two data centres in Amsterdam. While 
they have provided excellent availability so far, users further afield (South 
America, Oceania, Asia) experience high latency when accessing our services. 
Importantly, an outage affecting both of these data centres would take all of 
our services offline.

Public cloud providers have many global regions available, allowing us to 
choose the level of resilience that best fits a particular service - protecting 
us against multiple hardware failures or natural disasters (remember that we 
are below sea level here).

2. Become more agile and flexible

We're proud of the stable and highly-available services we provide. Here we can 
credit the expertise and hard work of our engineering staff, but also a 
continuous investment in our infrastructure over time. This has a big footprint 
- we are currently using almost 50 racks across our two data centres.

Each hardware element has its own lifecycle: procurement, shipping, 
installation, configuration, patching, upgrading and retiring. With hundreds of 
servers and network and storage devices, this is a continuous operation that 
takes a lot of time and effort. Hardware maintenance is not even the biggest 
challenge here: our infrastructure doesn't offer much in the way of flexibility 
and making changes is complex and expensive.

Our infrastructure also lacks elasticity, meaning that we have to estimate 
demand and over-provision our services to cover any peaks. This makes us less 
agile, by forcing us into long-term commitments and requiring us to pay for a 
lot of unused or idle resources.

3. Focus engineering expertise on our core business

For each new application or change to our infrastructure, there are a lot of 
manual steps that require tickets back and forth between separate engineering 
teams. Getting from idea to reality can take many months, and we can see this 
impacting our ability to innovate. This is inevitable when attention turns from 
service excellence to fixing problems and time-consuming, mundane maintenance 
tasks. We especially don't like this because we often need to react quickly as 
an organisation, while also being able to experiment with new services in an 
efficient way.

By moving to the cloud, we can build pipelines to deploy code faster, with 
fewer errors and manual steps, and provide sandbox accounts for engineers to 
quickly and safely test new technologies. We can also automate security 
auditing and reporting as much as possible, at all application and 
infrastructure layers.

There were two good comments on the article recently, from Niall Murphy and Bert Hubert. 
We will respond to these soon, but I would like to reference one point Bert makes there, 
which is essentially "Don't outsource your key capabilities." We completely 
agree with this (many of us have been reading Bert's article on this topic recently). 
This is precisely what we are *not* doing.

While it is important to have in-house expertise on all technical layers, some 
are more important than others. For example, at the physical layer we are 
already using data centre remote hands to replace failed disks, and we 
generally want to eliminate as much of the repetitive work to unpack, rack, and 
cable equipment in the data centre as we can. The resources we save here can be 
used to double down on the capabilities we want to develop further. We will 
continue to write our own software and control our deployment pipelines, and 
configure routers, firewalls, load balancers, and storage devices - whether 
they are physical or virtual, on-premises or in the cloud.

I see Hank's suggestion that we compile a list of outages. I'm reluctant to ask 
our engineers to spend time on this when I think they'll find we have very 
resilient services. But past results are not always the best indicator of 
future performance. And with RPKI especially, I also expect that what we 
consider acceptable resilience might increase as more and more networks come to 
rely on it.

(Also I find "evade the discussion on the list by posting a new lengthy article on 
labs every few months" not really helpful)


I do want to respond to this point. We sometimes miss a comment or take longer 
to respond than is acceptable, and this is not something that we take lightly 
as a company. But I would be disappointed if the community thought we were 
trying to evade discussion. We are here, we are listening, and we will respond.

With that, it's over to you again - let me know if you feel I’ve missed 
anything here.

Regards

Felipe Victolla Silveira
Chief Operations Officer
RIPE NCC
