App Engine Developers,

We wanted to provide you with a fully detailed account of our recent outage. Every billed application will also receive a credit for all paid resource usage for the entire day of the outage; this will appear as a credit balance in your application account within the next week.
We apologize for the downtime, and the App Engine team is continuing to work to improve the availability and power of App Engine. Our full post-mortem analysis is below. Thanks for your patience.

The Google App Engine Team

----------

Summary

On July 2, from 6:45 AM PDT until 12:35 PM PDT, Google App Engine (App Engine) experienced an outage that ranged from partial to complete. Following is a timeline of events, an analysis of the technology and process failures, and a set of steps the team is committed to taking to prevent such an outage from happening again.

The App Engine outage was due to complete unavailability of the datacenter's persistence layer, GFS, for approximately three hours. The GFS failure was abrupt for reasons described below, and as a consequence the data belonging to App Engine applications remained resident on GFS servers but was unreachable during this period. Since needed application data was completely unreachable for longer than expected, we could not follow the usual procedure of serving App Engine applications from an alternate datacenter, because doing so would have resulted in inconsistent or unavailable data for applications.

The root cause of the outage was a bug in the GFS Master server: another client in the datacenter sent it an improperly formed filehandle that had not been safely sanitized on the server side and that caused a stack overflow on the Master when processed.

----------

Timeline (all times below Pacific Daylight Time, GMT -0700)

6:44 AM --- A GFS Site Reliability Engineer (SRE) reports that the GFS Master in App Engine's primary data center is failing and continuously restarting. Since it is failing repeatedly, dependent services cannot communicate reliably with GFS.

7:00 AM --- The monitoring system that watches the health of the App Engine cluster notices that request latency has spiked across many applications and pages the primary on-call engineer for App Engine. The primary on-call engineer begins investigating but quickly receives another page reporting an increased error rate for Datastore RPCs; latency for Datastore operations (reads and writes) has also increased. Datastore reads are succeeding within normal tolerances, but between 5% and 20% of Datastore writes are failing.

8:00 AM --- The cause of the GFS Master failures has not yet been identified. However, a similar-looking issue seen in a different data center the week prior had been resolved by an upgrade to a newer version of the GFS software. This upgrade was already planned for the App Engine primary data center later in the week, so the GFS SRE decides to commence the upgrade immediately in an attempt to alleviate the problem.

8:07 AM --- The App Engine primary on-call engineer attempts to update the System Status site with information describing elevated Datastore latency and error rates. However, the Status Site is only intermittently available and is returning errors on all updates. Investigating the problem, the primary on-call engineer discovers that the isolated servers supporting the Status Site are running in the same data center as the primary App Engine serving cluster; thus, the site ultimately depends on the same GFS instance as App Engine itself. The cause of this error is determined to be a configuration error in the App Engine datacenter failover procedure.

8:35 AM --- Datastore write failure rate rises to 100%. Most of the App Engine engineering team is present and involved in resolving the problem at this point.
Datastore replication delay between the App Engine primary data center and the App Engine alternate datacenter is measured at 30 minutes. In other words, the App Engine Datastore is determined to be 30 minutes behind in replicating application data to the alternate datacenter and would need another 30 minutes of sending data to "catch up." Usual replication delay values are around 1 to 5 minutes. Because not all data had been replicated out of the primary serving datacenter to the alternate datacenter, moving serving traffic to the alternate datacenter would have resulted in a random set of application data being unavailable to App Engine applications. If Datastore writes were enabled in the alternate datacenter, then writes would be based on stale or incomplete data. Thus, at this point we had to choose between failing over immediately, in which case there would have been inconsistent or unavailable data for applications, or waiting 30 minutes in read-only mode for replication to catch up. We decided that inconsistency was not acceptable, even given the serving problems of the past hour and a half, so the App Engine primary on-call engineer began preparing for failover to the alternate datacenter with the understanding that failover could not occur until replication caught up.

9:00 AM --- The GFS upgrade to the new version of the GFS software finishes, but the Master is still failing. The GFS SRE escalates directly to the GFS engineering team, which immediately begins live debugging of the failing software to determine the cause and come up with a fix.

10:00 AM --- The GFS SRE advises that the GFS engineering team has identified the cause of the crashes as a "query-of-death" against the GFS servers. Another user of GFS in the same primary datacenter as App Engine is issuing a request to the GFS servers that reliably causes a crash: the client is sending an improperly formed filehandle that is not safely checked and sanitized by the server, and that causes a stack overflow when processed. Now that the bug is known to be triggered by a malformed query from a client, the GFS SRE identifies the MapReduce process that is triggering the bug, and the process is disabled. The GFS Master is no longer failing, and the GFS Chunkservers, which hold the actual needed data, are starting to come back up by 10:30 AM.

10:40 AM --- Overall Datastore error rate remains at 30% and continues to rise, despite the fixed GFS Master. Replication delay is 2 hours and 45 minutes. (The delay estimates are based on how quickly data can be read and sent to the remote datacenter.)

11:47 AM --- Datastore servers start up properly again, and both reads and writes are succeeding. Datastore replication quickly catches up to the present, since all reads are now going through. Replication delay drops to zero, indicating that the alternate datacenter now has all application data from the primary datacenter. The initial prognosis is that the App Engine cluster is healthy again.

12:00 PM --- The GFS SRE advises that the GFS Master in the primary App Engine data center needs to be restarted one more time later in the day to pick up some configuration changes missed during the emergency upgrade. Based on this fact and the fact that replication delay has dropped to zero, the App Engine primary on-call engineer decides to fail over to the backup data center to avoid any instability introduced by the planned GFS Master restart later in the day. Failover to the alternate datacenter begins.

12:14 PM --- Writes are re-enabled in the backup data center.
The failover is complete, and the alternate datacenter is serving normally.

12:35 PM --- Message posted to the google-appengine-downtime-notify group that all functionality has been restored.

----------

What did we do wrong?

Production --- It is possible, although unlikely, that if we had disabled Datastore writes before 8 AM, when the problem was initially detected, Datastore replication might have caught up before GFS completely failed. If this had happened, we would have been able to move traffic out of our primary data center before all reads and writes became unavailable, and downtime would have been reduced to a partial outage of around 30 minutes. However, between 7 AM and 8 AM there was not yet any evidence that GFS would fail for the entire cluster and that we were heading towards a major outage. Given the information that was available at the time, leaving Datastore writes enabled was a reasonable decision. (A rough sketch of this kind of failover gating decision appears at the end of this section.)

Communication --- One area where it's clear we could have done better was communication with our customers during the outage. We have a well-defined process for keeping our customers informed, but the process assumes that updates can be posted to the System Status site. There are a number of concrete steps we will take to prevent this from happening again. First, we will devise a backup plan for communicating with customers when the System Status site is unavailable. Second, we will update our failover script to verify that the System Status site and App Engine are running in different data centers, and to move the System Status site to a different data center when this is not the case. Third, we will add an automated alert to our monitoring system that will notify the on-call App Engine engineer whenever the System Status site and App Engine are running in the same data center.

Architecture --- Ultimately, this outage was not the result of a single bad decision. GFS and App Engine are distributed systems that have been designed from the ground up with fault-tolerance in mind, and we designed our failover strategy with the understanding that it relied on stored data being available for reading at the time we wanted to fail over. The failover procedure was designed to cope with partial unavailability of GFS for extended periods, and with full unavailability for short periods, but it was not designed to handle failover during full unavailability for a long period (greater than three hours). We have had an engineering effort under way for approximately 10 months to make App Engine less dependent on any single instance of GFS, and more resilient in the face of outages in the primary datacenter. We expect to deploy this system to production within the next two months. This will significantly reduce the likelihood of a complete outage like the one we saw on July 2.

Recovery --- However, even if we could roll these changes out today, it is still extremely important that we be able to get the entire system functioning more quickly when any of the GFS instances we depend on becomes unexpectedly unavailable for an extended period of time. We were surprised by the amount of time it took for us to begin serving normally once the GFS query-of-death was identified and disabled, and this delay is unacceptable to us. Frankly, we did not expect the whole persistence layer to be unavailable for nearly this long for any reason, and therefore had not planned properly for it.
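To make the trade-off described under "Production" (and in the 8:35 AM timeline entry) concrete, here is a minimal, purely illustrative sketch in Python. The names and structure are made up for illustration; this is not App Engine's actual failover tooling. It only captures the reasoning above: when the primary datastore is unhealthy but the alternate datacenter has not yet received all writes, hold in read-only mode rather than fail over onto stale or missing data.

from dataclasses import dataclass


@dataclass
class ClusterState:
    primary_healthy: bool             # can the primary serve reads and writes?
    replication_delay_minutes: float  # how far the alternate datacenter lags behind


def choose_serving_mode(state: ClusterState) -> str:
    """Decide how to serve based on primary health and replication lag."""
    if state.primary_healthy:
        return "serve-from-primary"
    if state.replication_delay_minutes > 0:
        # Failing over now would expose applications to inconsistent or
        # unavailable data, so wait (read-only) until replication catches up.
        return "read-only-until-replication-catches-up"
    return "fail-over-to-alternate-datacenter"


# The 8:35 AM situation: primary writes failing, alternate datacenter 30 minutes behind.
print(choose_serving_mode(ClusterState(primary_healthy=False,
                                        replication_delay_minutes=30)))
# -> read-only-until-replication-catches-up

In the actual incident, replication delay grew to 2 hours and 45 minutes before GFS recovered, which is why failover could not begin until the delay dropped back to zero shortly before 12:00 PM.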
The engineering team is already discussing possible solutions, and we will update our roadmap as soon as we have something concrete we can work towards.

----------

What are we doing to fix it?

1. The underlying bug in GFS has already been addressed, and the fix will be pushed to all datacenters as soon as possible. It has also been determined that the bug has been live for at least a year, so the risk of recurrence should be low. Site reliability engineers are aware of this issue and can quickly fix it if it should recur before then.

2. The App Engine team is accelerating its schedule to release the new clustering system that was already under development. When this system is in place, it will greatly reduce the likelihood of a complete outage like this one.

3. The App Engine team is actively investigating new solutions to cope with long-term unavailability of the primary persistence layer. These solutions will be designed to ensure that applications can cope reasonably with long-term catastrophic outages, no matter how rare.

4. Changes will be made to the Status Site configuration to ensure that the Status Site is properly available during outages.

----------

On Jul 2, 4:47 pm, "Chris Beckmann (App Engine PM)" <beckmann+...@google.com> wrote:
> We wanted to provide you with some additional detail regarding our
> recent outage. On July 2nd, between 6:20 AM PT and 12:30 PM PT, all
> applications experienced increased error rate and latency with
> Datastore and memcache operations, as well as some serving errors.
> Datastore access and serving were fully restored as of 12:25 PM PT.
>
> Problem
>
> There was a serious issue in one of App Engine's datacenters with GFS,
> Google's low-level storage system. GFS underlies Bigtable, which in
> turn underlies App Engine's Datastore. GFS also provides storage for
> our application serving infrastructure, so GFS unavailability caused
> problems for Datastore reads and writes, as well as application
> serving.
>
> Resolution Efforts
>
> Availability and data integrity are both very important to the App
> Engine team. Typically, we would have switched to an alternate
> datacenter immediately. However, due to the specific nature of this
> problem, switching datacenters immediately meant that the most recent
> data written by applications would not have been available, leading to
> consistency problems for many applications.
>
> The team decided to try to stabilize GFS first, then switch
> datacenters. This was accomplished and we avoided any data consistency
> issues.
>
> Prevention
>
> The team has been actively working on a medium-term solution
> that would allow us to switch over datacenters immediately without
> consistency problems.
>
> Communication and Status
>
> Many users noted that the System Status site was also down. The System
> Status site is hosted separately from App Engine applications, and is
> not typically affected by availability problems. However, due to the
> low-level problem with GFS in this case, the System Status site was
> also affected. The team did post the downtime announcement and updates
> on the Downtime Notification group, available here:
> http://groups.google.com/group/google-appengine-downtime-notify
>
> The App Engine team is continuing to work to improve the availability
> and power of App Engine. Thanks for your patience.
>
> Chris Beckmann
> Product Manager, App Engine Team