On 26/08/2021 15:47, Felipe Victolla Silveira wrote:
Felipe,
Thanks for the extremely well thought out and detailed response. I
can't argue with most of what you stated.
So I will go back to my original statement that cloud computing is not
secure for critical infrastructure. Cloud vendors roll out dozens of
new features per year and each vendor probably has tens of millions of
lines of code running and controlling their platforms.
Microsoft's IE and Google's Chrome are used by a billion users and have
had dozens of security holes found and fixed over the past decade. The
cloud platforms are used by perhaps only a million different end users
and every week another security hole is found:
https://www.wiz.io/blog/chaosdb-how-we-hacked-thousands-of-azure-customers-databases
I assume there are dozens of 0-days in every cloud platform.
Regards,
Hank
Dear Gert, Hank,
First, our apologies again for the delay in our response. A few of us were
taking our summer break and our colleagues didn't want to respond without
checking with us first.
To recap, we’ve outlined our core goals - improve the resilience of our
services, become more agile and flexible as an organisation, and focus
engineering expertise on our core business. You correctly point out that we
haven't really talked about the problems we’re trying to solve.
Fair point - we're not used to talking about the firefighting that's needed
behind the scenes. We can go over some of this now. We can start by noting that
if you take the inverse of the benefits we've listed so far, you find most of
the problems we're trying to solve.
1. Improve resilience and availability
We currently host our infrastructure in two data centres in Amsterdam. While
they have provided excellent availability so far, users further afield (South
America, Oceania, Asia) experience high latency when accessing our services.
Importantly, an outage affecting both of these data centres would render all of
our services offline.
Public cloud providers have many global regions available, allowing us to
choose the level of resilience that best fits a particular service - protecting
us against multiple hardware failures or natural disasters (remember that we
are below sea level here).
2. Become more agile and flexible
We're proud of the stable and highly-available services we provide. Here we can
credit the expertise and hard work of our engineering staff, but also a
continuous investment in our infrastructure over time. This has a big
footprint - we are currently using almost 50 racks across our two data centres.
Each hardware element has its own lifecycle: procurement, shipping,
installation, configuration, patching, upgrading and retiring. With hundreds of
servers and network and storage devices, this is a continuous operation that
takes a lot of time and effort. Hardware maintenance is not even the biggest
challenge here: our infrastructure doesn't offer much in the way of flexibility
and making changes is complex and expensive.
Our infrastructure also lacks elasticity, meaning that we have to estimate
demand and over-provision our services to cover any peaks. This makes us less
agile, by forcing us into long-term commitments and requiring us to pay for a
lot of unused or idle resources.
3. Focus engineering expertise on our core business
For each new application or change to our infrastructure, there are a lot of
manual steps that require tickets back and forth between separate engineering
teams. Getting from idea to reality can take many months, and we can see this
impacting our ability to innovate. This is inevitable when attention turns from
service excellence to fixing problems and time-consuming, mundane maintenance
tasks. We especially don't like this because we often need to react quickly as
an organisation, while also being able to experiment with new services in an
efficient way.
By moving to the cloud, we can build pipelines to deploy code faster, with
fewer errors and manual steps, and provide sandbox accounts for engineers to
quickly and safely test new technologies. We can also automate security
auditing and reporting as much as possible, at all application and
infrastructure layers.
There were two good comments on the article recently, from Niall Murphy and
Bert Hubert. We will respond to these soon, but I would like to reference one
point Bert makes there, which is essentially "Don't outsource your key
capabilities." We completely agree with this (many of us have been reading
Bert's article on this topic recently). Outsourcing our key capabilities is
precisely what we are *not* doing.
While it is important to have in-house expertise on all technical layers, some
are more important than others. For example, at the physical layer we are
already using data centre remote hands to replace failed disks, and we
generally want to eliminate as much of the repetitive work to unpack, rack, and
cable equipment in the data centre as we can. The resources we save here can be
used to double down on the capabilities we want to develop further. We will
continue to write our own software and control our deployment pipelines, and
configure routers, firewalls, load balancers, and storage devices - whether
they are physical or virtual, on-premise or in the cloud.
I see Hank's suggestion that we compile a list of outages. I'm reluctant to ask
our engineers to spend time on this when I think they'll find we have very
resilient services. But past results are not always the best indicator of
future performance. And with RPKI especially, I also expect that what we
consider acceptable resilience might increase as more and more networks come to
rely on it.
(Also I find "evade the discussion on the list by posting a new lengthy article on
labs every few months" not really helpful)
I do want to respond to this point. We sometimes miss a comment or take longer
to respond than is acceptable, and this is not something that we take lightly
as a company. But I would be disappointed if the community thought we were
trying to evade discussion. We are here, we are listening, and we will respond.
With that, it's over to you again - let me know if you feel I’ve missed
anything here.
Regards
Felipe Victolla Silveira
Chief Operations Officer
RIPE NCC