+1

PK
http://www.gae123.com

On Aug 22, 2013, at 9:15 PM, Jeff Schnitzer <j...@infohazard.org> wrote:

> I'm sure many of you saw this email, but I'm curious - what does it mean?
> 
> I thought GAE was located in some number of geographically distributed 
> datacenters (more than 3) and a failure of any single datacenter was 
> immaterial to HRD apps.  This email says "relocated our entire US serving 
> footprint to a different location within the US" as if there is only a single 
> datacenter involved.
> 
> Can somebody clarify this?
> 
> Jeff
> 
> ---------- Forwarded message ----------
> From: Michael Handler <hand...@google.com>
> Date: Thu, Aug 22, 2013 at 1:31 PM
> Subject: Information on Google App Engine's recent US datacenter relocations
> To: google-appengine-downtime-not...@googlegroups.com
> 
> 
> Google App Engine recently relocated our entire US serving footprint to a 
> different location within the US, without requiring any scheduled maintenance 
> period or reduction in functionality, and while serving normally throughout 
> the work period. This work required multiple engineer-months of careful 
> preparation, repeated refinement of our datacenter setup and data copy 
> automation, and a world-class network backbone between our locations. Online 
> migration of all stored data required an initial transfer of multiple 
> petabytes, and online replication of all changes written to stored data after 
> the initial transfer, until the relocation process was complete.
> 
> This was the first time the US serving footprint for App Engine had been 
> relocated en masse since the launch of the High Replication Datastore (HRD) 
> in January of 2011. Processes of this complexity and scale are rarely 
> completed without some bumps and missteps, and this 
> migration proved to be no exception to that rule. While we worked hard to 
> make this process as automatic and invisible to our customers as possible, we 
> fell short in a few places, and we’d like to take this opportunity to explain 
> what happened and what we’re doing to address the issues we encountered.
> 
> Storage Layer Overload
> As we completed the migration of one of our replicas into its new datacenter, 
> the storage infrastructure in that datacenter began consuming all resources 
> allocated to it. This was above all previously observed demand and beyond our capacity 
> planning, without any corresponding increase in traffic that would logically 
> have caused such a jump in demand. As with any piece of infrastructure where 
> demand exceeds supply, the performance of the portions of App Engine that 
> depend heavily on the storage layer degraded.
> 
> We simultaneously began investigating potential causes of the increase in 
> storage infrastructure resource demand, and allocating more resources to the 
> storage infrastructure. As the additional resources came online, demand quickly 
> grew to consume them as well, making clear that this was not a simple case 
> of underprovisioning. We directed traffic away from the affected datacenter, 
> and immediately saw storage infrastructure resource demand return to normal 
> levels for a drained datacenter.
> 
> After further investigation, and detailed consultation with the engineers 
> from the storage infrastructure teams, we were unable to determine the origin 
> of the increased resource demand. We returned traffic to this datacenter 
> during the period of lowest global App Engine load and saw resource demand 
> increase to expected levels without incident. The datacenter has performed as 
> expected since that time, without any unusual behavior or incidents similar 
> to this one.
> 
> What you saw: Applications serving from this particular datacenter would have 
> experienced elevated Datastore latency and errors (for both reads and 
> writes), and irregular and delayed Task Queue performance. Additionally, all 
> US applications may have experienced a slightly elevated error rate for 
> datastore write operations, as a consequence of the localized overload.
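> 
> Applications that wrap datastore calls in retries with exponential backoff 
> would have absorbed much of this transient error rate. A minimal sketch in 
> Python (the helper and its parameters are hypothetical, not part of any SDK):
> 
>     import time
>     from google.appengine.api import datastore_errors
> 
>     def put_with_retry(entity, attempts=4, base_delay=0.1):
>         """Retry a datastore put on transient errors, backing off exponentially."""
>         for attempt in range(attempts):
>             try:
>                 return entity.put()
>             except (datastore_errors.Timeout,
>                     datastore_errors.InternalError):
>                 if attempt == attempts - 1:
>                     raise  # out of retries; surface the error to the caller
>                 time.sleep(base_delay * (2 ** attempt))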
> 
> When you saw it: Tuesday, 25 June 2013, from 7:30 AM to 3:10 PM US/Pacific
> 
> What we’re doing about it: We are chagrined to say that, since this event, we 
> have been unable to further diagnose the origin of this issue or reproduce 
> it in our testing infrastructure or labs. Without a clear root cause, we 
> cannot commit to specific fixes, other than continuing to monitor and improve 
> our diagnostic tools, and investigating ways to gradually ramp up 
> traffic into a datacenter, to avoid any potential complications when 
> beginning to serve from a datacenter for the first time.
> 
> Application Server Hotspots
> Before beginning this move process, App Engine had recently completed an 
> internal migration to the next generation of our scheduler system. The 
> scheduler is responsible for deploying your applications to our serving 
> environment, assigning more resources to your application as required by your 
> traffic and directed by your performance and scaling settings, and 
> rebalancing to ensure a uniform load across our infrastructure.
> 
> As the migration to the new scheduler had occurred exclusively within 
> already-serving datacenters, historical load data about each application from 
> the old scheduler was already on hand and available for use by the new 
> scheduler’s algorithm, which provided for a smooth transition. However, when 
> we tested bringing up a new datacenter for the first time under the new 
> scheduler, which lacked historical load data for the applications 
> assigned to that datacenter, we discovered that the new scheduler’s algorithm 
> performed poorly in the absence of historical load data, did not properly 
> distribute load across application servers, and created hotspots.
> 
> For the en masse migration to our new datacenters, we prepared for this 
> situation by replicating the historical load data from each old datacenter to 
> each new datacenter for use by the new scheduler during startup. While 
> this process generally functioned sufficiently well, the load distribution in 
> the new datacenter still resulted in a small percentage of overloaded 
> application servers, which were prone to serving mostly errors. The new 
> scheduler’s algorithm would have corrected these hotspots eventually but at 
> an unacceptably slow rate, and we manually intervened to rebalance and 
> eliminate the hotspots.
> 
> What you saw: A small number of application servers in one datacenter were 
> overloaded and unable to create new instances. Larger applications are 
> assigned to multiple application servers in their datacenter, and any traffic 
> arriving at the affected application server would be retried at a different 
> application server after a small delay, at a cost of some latency.
> 
> Small applications are assigned a small number of application servers and do 
> not automatically retry their traffic on other application servers unless the 
> first application server is completely down. If the application server is 
> unable to start instances at all, any request will return a 500 error. (This 
> behavior minimizes the creation of new instances on other application 
> servers, which would elevate your application’s instance-hour cost, but it 
> can result in substantial disruption in a case like this one.)
> 
> When you saw it: Tuesday, 25 June 2013 through Wednesday, 26 June 2013
> 
> What we’re doing about it:
> - Improving the scheduler algorithm to make better load assignment choices 
>   when limited or no load data is available.
> - Installing safety limits in the scheduler to keep it from assigning too 
>   much load to a given application server, to guard against unexpected edge 
>   cases in the scheduler algorithm.
> - Instrumenting the application servers and the schedulers to automatically 
>   and aggressively detect and disable any application server that is 
>   consistently unable to start instances, or is serving a high percentage of 
>   internal errors in response to requests.
> 
> Delayed Datastore Eventual Consistency
> The High Replication Datastore (HRD) has been designed and documented 
> (Python, Java) from its creation as being eventually consistent, i.e. a 
> non-ancestor query (a query that can return results from multiple entity 
> groups) may not immediately return results that reflect recent writes to 
> those entity groups.
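> 
> To make the distinction concrete, here is a minimal ndb sketch (the 
> Guestbook/Greeting models are hypothetical, in the style of the standard 
> examples):
> 
>     from google.appengine.ext import ndb
> 
>     class Greeting(ndb.Model):
>         content = ndb.StringProperty()
> 
>     book_key = ndb.Key('Guestbook', 'default')
>     Greeting(parent=book_key, content='hello').put()
> 
>     # Non-ancestor (global) query: eventually consistent, so the greeting
>     # just written may not appear in the results yet.
>     maybe_stale = Greeting.query().fetch()
> 
>     # Ancestor query: strongly consistent within the entity group, so the
>     # greeting is guaranteed to appear.
>     fresh = Greeting.query(ancestor=book_key).fetch()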
> 
> We strive under normal circumstances to keep the eventual consistency delay 
> (the time before all current writes will be reflected in non-ancestor 
> queries) as small as possible. However, the process of moving App Engine en 
> masse to its new datacenters required us to take each replica completely 
> offline for some amount of time during the final phase of the move. Taking 
> the replica completely offline prevents it from accepting writes and 
> participating in Paxos majority consensus, and also prevents the background 
> replication system from keeping the replica up to date. When re-enabled, the 
> backlog of updates causes the replica to be further behind than is 
> typically observed during normal operation, and increases the eventual 
> consistency delay to levels not seen during routine functioning of App Engine.
> 
> During the design of this migration, alternative processes were investigated 
> which could have minimized the increase in eventual consistency delay that 
> was observed, but analysis showed they would have required periods of 
> elevated datastore and serving latency to complete, and were judged too risky 
> to pursue.
> 
> Given that HRD is designed and documented to be eventually consistent with 
> non-ancestor queries, we strongly encourage you to view the documentation 
> linked above about eventual consistency and expected behavior for Datastore 
> queries, and evaluate whether you may need to modify your application to 
> better handle any potential issues that would arise during periods of 
> elevated eventual consistency delay.
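> 
> One common mitigation, sketched below with a hypothetical helper (and 
> reusing the Greeting model from the sketch above), is to merge entities you 
> have just written into the results of a non-ancestor query, so users see 
> their own writes even while the global indexes catch up:
> 
>     def list_greetings(just_written=None):
>         # Global query: may briefly omit very recent writes.
>         results = Greeting.query().fetch(20)
>         if just_written is not None:
>             present = {g.key for g in results}
>             if just_written.key not in present:
>                 results.insert(0, just_written)  # surface the fresh write
>         return results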
> 
> What you saw: Non-ancestor queries may have returned results that did not 
> reflect writes performed a substantial amount of time earlier, i.e. elevated 
> eventual consistency delay.
> 
> When you saw it: Monday, 24 June 2013 through Friday, 28 June 2013
> 
> What we’re doing about it: Improving our infrastructure to generate better 
> per-application views of eventual consistency delay, and using that to drive 
> improvements that will reduce it systemwide. We are also changing our local 
> development tools so that eventual consistency is enabled by default, 
> allowing developers to experience and build for this behavior earlier in the 
> development cycle.
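> 
> Until then, the local testbed can already simulate this behavior explicitly. 
> A sketch using the pseudo-random consistency policy from the local unit 
> testing tools (probability=0 means global queries never immediately reflect 
> new writes):
> 
>     from google.appengine.datastore import datastore_stub_util
>     from google.appengine.ext import testbed
> 
>     tb = testbed.Testbed()
>     tb.activate()
>     # With probability=0, writes are never immediately visible to
>     # non-ancestor queries, forcing tests to handle eventual consistency.
>     policy = datastore_stub_util.PseudoRandomHRConsistencyPolicy(probability=0)
>     tb.init_datastore_v3_stub(consistency_policy=policy)
>     tb.init_memcache_stub()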
> 
> Closing Thoughts
> If your application experienced any disruption due to the above issues, we 
> would like to extend sincere apologies to you and your users. We hope that 
> this document makes it clear what happened, why, and what we’re doing to 
> guard against any of these issues recurring, and to prevent similar issues 
> from occurring at all in the future.
> 
> That said, despite the disruptions we outlined above, we consider this 
> migration a success for Google App Engine. Through careful planning and 
> technical innovation, we were able to relocate the entirety of the US serving 
> footprint to a new location in the US without requiring a scheduled 
> maintenance window, a read-only period of the Datastore (as was required of 
> Master/Slave applications), or causing substantial disruptions to our serving 
> traffic. The new datacenters allow Google Cloud Platform to expand even more 
> rapidly than before to handle increased demand for our services, and they 
> additionally colocate more Google Cloud Platform services in close proximity, 
> guaranteeing reduced latency for communication between them and allowing the 
> development of innovative new products and integrations.
> 
> This level of reliability, and this ongoing innovation and improvement, are 
> built into the design of Google Cloud Platform products from the start, and 
> we’re glad to provide them to your applications as standard, without 
> additional cost, as part of the services we sell to you. We recognize that 
> you have built your businesses on the Google Cloud Platform because you trust 
> Google to handle the difficult and complicated tasks of growing and reliably 
> maintaining computing resources as a service, while letting you focus on 
> growing your application and your business. We remain committed to 
> continuous, thoughtful, strategic, and pre-emptive improvement in the 
> reliability of all parts of the Google Cloud Platform, no matter the 
> difficulty or the complexity.
> 
> As always, if you believe your paid application experienced an SLA violation 
> due to any of the issues that we describe above, please fill out our refund 
> request form. 
> 
> Regards,
> 
> Michael Handler, on behalf of the Google App Engine Team