I'm sure many of you saw this email, but I'm curious - what does it mean?

I thought GAE was located in some number of geographically distributed
datacenters (more than 3) and a failure of any single datacenter was
immaterial to HRD apps.  This email says "relocated our entire US serving
footprint to a different location within the US" as if there is only a
single datacenter involved.

Can somebody clarify this?

Jeff

---------- Forwarded message ----------
From: Michael Handler <hand...@google.com>
Date: Thu, Aug 22, 2013 at 1:31 PM
Subject: Information on Google App Engine's recent US datacenter relocations
To: google-appengine-downtime-not...@googlegroups.com


Google App Engine recently relocated our entire US serving footprint to a
different location within the US, without requiring any scheduled
maintenance period or reduction in functionality, and while serving
normally throughout the work period. This work required multiple
engineer-months of careful preparation, repeated refinement of our
datacenter setup and data copy automation, and a world-class network
backbone between our locations. Online migration of all stored data
required an initial transfer of multiple petabytes, and online replication
of all changes written to stored data after the initial transfer, until the
relocation process was complete.
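To make the pattern concrete, here is a minimal sketch (in Python) of the
"bulk copy, then replay changes" approach described above. The store and
helper names are hypothetical, and this is a toy illustration of the general
technique, not our actual migration tooling.

    import threading


    class SourceStore(object):
        """Toy key-value store that records every write in a change log."""

        def __init__(self):
            self.data = {}
            self.change_log = []          # ordered list of (key, value) writes
            self.lock = threading.Lock()

        def write(self, key, value):
            with self.lock:
                self.data[key] = value
                self.change_log.append((key, value))


    def migrate(source, destination):
        """Copy a snapshot, then replay writes that arrived during the copy."""
        # Phase 1: bulk transfer of a point-in-time snapshot.
        with source.lock:
            snapshot = dict(source.data)
            log_position = len(source.change_log)
        destination.update(snapshot)

        # Phase 2: replay changes written after the snapshot, until caught up
        # and it is safe to cut traffic over to the destination.
        while True:
            with source.lock:
                new_writes = source.change_log[log_position:]
                log_position = len(source.change_log)
            if not new_writes:
                break
            for key, value in new_writes:
                destination[key] = value


    source = SourceStore()
    source.write('app:123', 'entity-v1')
    destination = {}
    migrate(source, destination)
    print(destination)                    # {'app:123': 'entity-v1'}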

This was the first time the US serving footprint for App Engine had been
relocated en masse since the launch of the High Replication Datastore (HRD)
in January of 2011. Uncommon processes of this complexity and scale are
rarely completed without some small number of bumps and missteps, and this
migration proved to be no exception to that rule. While we worked hard to
make this process as automatic and invisible to our customers as possible,
we fell short in a few places, and we’d like to take this opportunity to
explain what happened and what we’re doing to address the issues we
encountered.

Storage Layer Overload

As we completed the migration of one of our replicas into its new
datacenter, the storage infrastructure in that datacenter began consuming
all resources allocated to it. This demand exceeded all previously observed
levels and our capacity planning, without any corresponding increase in
traffic that would explain such a jump. As with any piece
of infrastructure where demand exceeds supply, the performance of the
portions of App Engine that depend heavily on the storage layer degraded.

We simultaneously began investigating potential causes of the increase in
storage infrastructure resource demand, and allocating more resources to
the storage infrastructure. As the additional resources came online, demand
grew to consume them as well, making it clear that this was not a
simple case of underprovisioning. We directed traffic away from the
affected datacenter, and immediately saw storage infrastructure resource
demand return to normal levels for a drained datacenter.

After further investigation, and detailed consultation with the engineers
from the storage infrastructure teams, we were unable to determine the
origin of the increased resource demand. We returned traffic to this
datacenter during the period of lowest global App Engine load and saw
resource demand increase to expected levels without incident. The
datacenter has performed as expected since that time, without any unusual
behavior or incidents similar to this one.

What you saw: Applications serving from this particular datacenter would
have experienced elevated Datastore latency and errors (for both reads and
writes), and irregular and delayed Task Queue performance. Additionally,
all US applications may have experienced a slightly elevated error rate for
datastore write operations, as a consequence of the localized overload.

When you saw it: Tuesday, 25 June 2013, from 7:30 AM to 3:10 PM US/Pacific

What we’re doing about it: We are chagrined to say that, since this event,
we have been unable to further diagnose the origin of this issue or to
reproduce it in our testing infrastructure or labs. Without a clear root
cause, we cannot commit to specific fixes, other than continuing to monitor
and improve our diagnostic tools, and investigating ways to gradually ramp
up traffic into a datacenter, to avoid any potential complications when
beginning to serve from a datacenter for the first time.
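The gradual ramp-up idea could look roughly like the hypothetical controller
sketched below, which routes an increasing fraction of traffic to a newly
enabled datacenter and drains it again if storage utilization grows out of
proportion to its traffic share. The steps and thresholds are illustrative
assumptions, not our actual traffic-management system.

    RAMP_STEPS = [0.01, 0.05, 0.10, 0.25, 0.50, 1.00]


    def next_traffic_share(current_share, utilization, expected_utilization):
        """Return the traffic fraction to send to the new datacenter next."""
        if utilization > 1.5 * expected_utilization:
            # Demand is growing out of proportion to traffic: drain the
            # datacenter and investigate before ramping further.
            return 0.0
        for step in RAMP_STEPS:
            if step > current_share:
                return step
        return current_share              # already taking full traffic


    # At 10% of traffic the datacenter uses 12% of its storage capacity, which
    # is roughly proportional, so the controller advances to the next step.
    print(next_traffic_share(0.10, utilization=0.12, expected_utilization=0.10))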

Application Server Hotspots

Before beginning this move process, App Engine had recently completed an
internal migration to the next generation of our scheduler system. The
scheduler is responsible for deploying your applications to our serving
environment, assigning more resources to your application as required by
your traffic and directed by your performance and scaling settings, and
rebalancing to ensure a uniform load across our infrastructure.

As the migration to the new scheduler had occurred exclusively within
already-serving datacenters, historical load data about each application
from the old scheduler was already on hand and available for use by the new
scheduler’s algorithm, which provided for a smooth transition. However,
when we tested bringing up a new datacenter for the first time under the
new scheduler, which lacked any historical load data for all applications
assigned to the datacenter, we discovered that the new scheduler’s
algorithm performed poorly in the absence of historical load data, did not
properly distribute load across application servers, and created hotspots.

For the en masse migration to our new datacenters, we prepared for this
situation by replicating the historical load data from each old datacenter
to each new datacenter for use by the new scheduler during startup.
While this process generally functioned sufficiently well, the load
distribution in the new datacenter still resulted in a small percentage of
overloaded application servers, which were prone to serving mostly errors.
The new scheduler’s algorithm would have corrected these hotspots
eventually but at an unacceptably slow rate, and we manually intervened to
rebalance and eliminate the hotspots.
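As a simplified illustration of the load-assignment problem, the sketch below
greedily places applications on the least-loaded application server using
historical per-application load, falls back to a conservative default weight
when no history exists (the situation that produced hotspots in a brand-new
datacenter), and enforces a per-server safety cap. All names and numbers are
hypothetical; the real scheduler is considerably more sophisticated.

    def assign_apps(apps, servers, historical_load, server_capacity,
                    default_load=1.0):
        """Greedily place the heaviest apps first, respecting a safety cap."""
        assignments = {server: [] for server in servers}
        load = {server: 0.0 for server in servers}

        # Use a conservative default when an app has no load history.
        weights = {app: historical_load.get(app, default_load) for app in apps}

        for app in sorted(apps, key=lambda a: weights[a], reverse=True):
            target = min(servers, key=lambda s: load[s])   # least-loaded server
            if load[target] + weights[app] > server_capacity:
                raise RuntimeError('all servers at safety limit; add capacity')
            assignments[target].append(app)
            load[target] += weights[app]
        return assignments


    servers = ['as-1', 'as-2']
    history = {'app-a': 5.0, 'app-b': 3.0}           # 'app-c' has no history
    print(assign_apps(['app-a', 'app-b', 'app-c'], servers, history,
                      server_capacity=8.0))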

What you saw: A small number of application servers in one datacenter were
overloaded and unable to create new instances. Larger applications are
assigned to multiple application servers in their datacenter, and any
traffic arriving at the affected application server would be retried at a
different application server after a small delay, at a cost of some latency.

Small applications are assigned a small number of application servers and
do not automatically retry their traffic on other application servers
unless the first application server is completely down. If the application
server is unable to start instances at all, any request will return a 500
error. (This minimizes creating new instances on other application servers,
which would elevate the instance hour cost of your application, but can
result in substantial disruption in this particular case.)
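If callers of your application can tolerate it, retrying transient 500s with
exponential backoff is a common way to absorb this kind of disruption, since
a later attempt often reaches a healthy application server (though it cannot
help while every application server assigned to a small application is
overloaded). The sketch below is illustrative; the URL and retry limits are
placeholders.

    import time
    import urllib2


    def fetch_with_retries(url, attempts=4, initial_delay=0.5):
        """Fetch a URL, retrying server errors with exponential backoff."""
        delay = initial_delay
        for attempt in range(attempts):
            try:
                return urllib2.urlopen(url, timeout=10).read()
            except urllib2.HTTPError as err:
                if err.code < 500 or attempt == attempts - 1:
                    raise                  # not retryable, or out of attempts
                time.sleep(delay)          # back off before trying again
                delay *= 2


    # Example (hypothetical URL):
    # body = fetch_with_retries('https://your-app-id.appspot.com/endpoint')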

When you saw it: Tuesday, 25 June 2013 through Wednesday, 26 June 2013

What we’re doing about it:

   - Improving the scheduler algorithm to make better load assignment
     choices when limited or no load data is available.

   - Installing safety limits in the scheduler to keep it from assigning too
     much load to a given application server, to guard against unexpected
     edge cases in the scheduler algorithm.

   - Instrumenting the application servers and the schedulers to
     automatically and aggressively detect and disable any application
     server that is consistently unable to start instances, or is serving a
     high percentage of internal errors in response to requests.

Delayed Datastore Eventual Consistency

The High Replication Datastore (HRD) has been designed and documented
(Python <https://developers.google.com/appengine/docs/python/datastore/queries#Python_Data_consistency>,
Java <https://developers.google.com/appengine/docs/java/datastore/queries#Java_Data_consistency>)
from its creation as being eventually consistent, i.e. a non-ancestor query
(a query that can return results from multiple entity groups) may not
immediately return results that reflect recent writes to those entity
groups.

We strive under normal circumstances to keep the eventual consistency delay
(the time before all current writes will be reflected in non-ancestor
queries) as small as possible. However, the process of moving App Engine en
masse to its new datacenters required us to take each replica completely
offline for some amount of time during the final phase of the move. Taking
the replica completely offline prevents it from accepting writes and
participating in Paxos
<http://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf>
majority consensus, and also prevents the background replication system
from keeping the replica up to date. When re-enabled, the backlog of
updates causes the replica to be further behind than what is typically
observed during normal operation, and increases the eventual consistency
delay to levels not seen during routine functioning of App Engine.
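As a toy illustration of why one offline replica does not block writes but
does fall behind, the sketch below commits a write once a majority of
replicas acknowledge it, while the offline replica simply accumulates a
backlog to replay when it is re-enabled. This illustrates majority quorums in
general, not Paxos itself or the Datastore’s actual replication machinery.

    class Replica(object):
        def __init__(self, name):
            self.name = name
            self.online = True
            self.applied = []              # writes this replica has applied
            self.backlog = []              # writes it must replay to catch up

        def accept(self, write):
            if self.online:
                self.applied.append(write)
                return True
            self.backlog.append(write)     # replayed when re-enabled
            return False


    def commit(replicas, write):
        """A write commits once a majority of replicas acknowledge it."""
        acks = sum(1 for replica in replicas if replica.accept(write))
        return acks > len(replicas) // 2


    replicas = [Replica('us-a'), Replica('us-b'), Replica('us-c')]
    replicas[2].online = False             # the replica being moved is offline
    print(commit(replicas, 'w1'))          # True: 2 of 3 replicas acknowledged
    print(len(replicas[2].backlog))        # 1: the offline replica must catch up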

During the design of this migration, alternative processes were
investigated which could have minimized the increase in eventual
consistency delay that was observed, but analysis showed they would have
required periods of elevated datastore and serving latency to complete, and
were judged too risky to pursue.

Given that HRD is designed and documented to be eventually consistent for
non-ancestor queries, we strongly encourage you to review the documentation
linked above about eventual consistency and expected behavior for Datastore
queries, and evaluate whether you may need to modify your application to
better handle any potential issues that would arise during periods of
elevated eventual consistency delay.
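For example, with the documented Python ndb API, an ancestor query is
strongly consistent and will always reflect the write below, while the global
(non-ancestor) query may not during periods of elevated eventual consistency
delay. The Guestbook and Greeting names are illustrative, and the snippet
assumes it is running inside the App Engine runtime or a testbed environment.

    from google.appengine.ext import ndb


    class Greeting(ndb.Model):
        content = ndb.StringProperty()


    guestbook_key = ndb.Key('Guestbook', 'default')
    Greeting(parent=guestbook_key, content='hello').put()

    # Non-ancestor (global) query: eventually consistent, so it may not yet
    # include the write above, especially while consistency delay is elevated.
    recent = Greeting.query().fetch(10)

    # Ancestor query: limited to one entity group, but strongly consistent,
    # so the write above is guaranteed to be visible.
    recent_consistent = Greeting.query(ancestor=guestbook_key).fetch(10)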

What you saw: Non-ancestor queries may have returned results that did not
reflect writes committed a notable amount of time earlier, i.e. an elevated
eventual consistency delay.

When you saw it: Monday, 24 June 2013 through Friday, 28 June 2013

What we’re doing about it: Improving our infrastructure to generate better
per-application views of eventual consistency delay, and using that to
drive improvements that will reduce it systemwide. We are also changing our
local development tools so that eventual consistency is enabled by default,
allowing developers to experience and build for this behavior earlier in
the development cycle.
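In the meantime, the Python SDK’s documented testbed already lets you
simulate maximum eventual consistency locally: with a consistency policy
probability of zero, writes commit but are not applied globally until
something forces them (such as an ancestor query on the same entity group),
so the non-ancestor queries in your tests must cope with stale results.

    from google.appengine.datastore import datastore_stub_util
    from google.appengine.ext import testbed

    tb = testbed.Testbed()
    tb.activate()
    # probability=0: writes commit but are never applied globally on their
    # own, so global (non-ancestor) queries behave as if the eventual
    # consistency delay were very high.
    policy = datastore_stub_util.PseudoRandomHRConsistencyPolicy(probability=0)
    tb.init_datastore_v3_stub(consistency_policy=policy)

    # ... exercise your datastore code and queries here ...

    tb.deactivate()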

Closing Thoughts

If your application experienced any disruption due to the above issues, we
would like to extend sincere apologies to you and your users. We hope that
this document makes it clear what happened, why, and what we’re doing to
keep these issues from recurring and to prevent similar issues from
occurring at all in the future.

That said, despite the disruptions we outlined above, we consider this
migration a success for Google App Engine. Through careful planning and
technical innovation, we were able to relocate the entirety of the US
serving footprint to a new location in the US without a scheduled
maintenance window or a read-only period for the Datastore (as was required
for Master/Slave applications), and without substantial disruption to our
serving traffic. The new datacenters allow Google Cloud Platform to expand
even more rapidly than before to handle increased demand for our services,
and they also place more Google Cloud Platform services in close proximity,
reducing the latency of communication between them and enabling the
development of new, innovative products and integrations.

This level of reliability, and this ongoing innovation and improvement, are
built into the design of Google Cloud Platform products from the start, and
we’re glad to provide them to your applications as standard, at no
additional cost, as part of the services we sell to you. We recognize that
you have built your businesses on the Google Cloud Platform because you
trust Google to handle the difficult and complicated tasks of growing and
reliably maintaining computing resources as a service, while letting you
focus on growing your application and your business. We’re deeply committed
to continuous, thoughtful, strategic, and pre-emptive improvement
in the reliability of all parts of the Google Cloud Platform, no matter the
difficulty or the complexity.

As always, if you believe your paid application experienced an SLA
violation due to any of the issues that we describe above, please fill out
our refund request form
<http://support.google.com/code/bin/request.py?contact_type=cloud_platform_billing>.


Regards,

Michael Handler, on behalf of the Google App Engine Team

