Wow, I second lennysan. Awesome postmortem! Thank you so much for sharing it with us.
Marc

On Mar 5, 12:25 pm, lennysan <lenny...@gmail.com> wrote:
> I've been working on a Guideline for Postmortem Communication, and ran
> this post through the guideline:
> http://www.transparentuptime.com/2010/03/google-app-engine-downtime-p...
>
> Overall, this may be the single most impressive postmortem I've seen
> yet. The amount of time and thought put into this post is staggering,
> and the takeaways are useful to every organization. I'm especially
> impressed with the proposed new functionality that turns this event
> into a long-term positive, which is really all you can ask for after
> an incident.
>
> On Mar 4, 3:22 pm, App Engine Team <appengine.nore...@gmail.com> wrote:
>
> > Post-Mortem Summary
> >
> > This document details the cause of, and the events occurring
> > immediately after, App Engine's outage on February 24th, 2010, as
> > well as the steps we are taking to mitigate the impact of future
> > outages like this one.
> >
> > On February 24th, 2010, all Google App Engine applications were in
> > varying degraded states of operation for a period of two hours and
> > twenty minutes, from 7:48 AM to 10:09 AM PT | 15:48 to 18:09 GMT. The
> > underlying cause of the outage was a power failure in our primary
> > datacenter. While the Google App Engine infrastructure is designed to
> > quickly recover from these sorts of failures, this type of rare
> > problem, combined with internal procedural issues, extended the time
> > required to restore the service.
> >
> > <<Link to full timeline here, which is attached below.>>
> >
> > What did we do wrong?
> >
> > Though the team had planned for this sort of failure, our response had
> > a few important issues:
> >
> > - Although we had procedures ready for this sort of outage, the oncall
> > staff was unfamiliar with them and had not trained sufficiently with
> > the specific recovery procedure for this type of failure.
> >
> > - Recent work to migrate the datastore for better multihoming changed
> > and significantly improved the procedure for handling these failures.
> > However, some documentation detailing the procedure to support the
> > datastore during failover incorrectly referred to the old
> > configuration. This led to confusion during the event.
> >
> > - The production team had not agreed on a policy that clearly
> > indicates when, and in what situations, our oncall staff should take
> > aggressive user-facing actions, such as an unscheduled failover. This
> > led to a bad call of returning to a partially working datacenter.
> >
> > - We failed to plan for the case of a power outage that might affect
> > some, but not all, of our machines in a datacenter (in this case,
> > about 25%). In particular, this led to incorrect analysis of the
> > serving state of the failed datacenter and when it might recover.
> >
> > - Though we were able to eventually migrate traffic to the backup
> > datacenter, a small number of Datastore entity groups, belonging to
> > approximately 25 applications in total, became stuck in an
> > inconsistent state as a result of the failover procedure. This
> > represented considerably less than 0.00002% of data stored in the
> > Datastore.
> >
> > Ultimately, although significant work had been done over the past year
> > to improve our handling of these types of outages, issues with
> > procedures reduced their impact.
> >
> > What are we doing to fix it?
> >
> > As a result, we have instituted the following procedures going
> > forward:
> >
> > - Introduce regular drills by all oncall staff of all of our
> > production procedures. This will include the rare and complicated
> > procedures, and all members of the team will be required to complete
> > the drills before joining the oncall rotation.
> >
> > - Implement a regular bi-monthly audit of our operations docs to
> > ensure that all needed procedures are properly findable, and all
> > out-of-date docs are properly marked "Deprecated."
> >
> > - Establish a clear policy framework to help oncall staff quickly
> > and decisively make decisions about taking intrusive, user-facing
> > actions during failures. This will allow them to act confidently and
> > without delay in emergency situations.
> >
> > We believe that with these new procedures in place, last week's outage
> > would have been reduced in impact from about 2 hours of total
> > unavailability to about 10 to 20 minutes of partial unavailability.
> >
> > In response to this outage, we have also decided to make a major
> > infrastructural change in App Engine. Currently, App Engine provides a
> > one-size-fits-all Datastore that provides low write latency combined
> > with strong consistency, in exchange for lower availability in
> > situations of unexpected failure in one of our serving datacenters. In
> > response to this outage, and feedback from our users, we have begun
> > work on providing two different Datastore configurations:
> >
> > - The current option of low latency, strong consistency, and lower
> > availability during unexpected failures (like a power outage)
> >
> > - A new option for higher availability using synchronous replication
> > for reads and writes, at the cost of significantly higher latency
> >
> > We believe that providing both of these options to you, our users,
> > will allow you to make your own informed decisions about the tradeoffs
> > you want to make in running your applications.
> >
> > We sincerely apologize for the impact of Feb 24th's service disruption
> > on your applications. We take great pride in the reliability that App
> > Engine offers, but we also recognize that we can do more to improve
> > it. You can be confident that we will continue to work diligently to
> > improve the service and to ensure that low-level outages like this one
> > have the least possible effect on our customers.
> >
> > Timeline
> > -----------
> >
> > 7:48 AM - Internal monitoring graphs first begin to show that traffic
> > in our primary datacenter has problems and is returning an elevated
> > number of errors. Around the same time, posts begin to show up in the
> > google-appengine discussion group from users who are having trouble
> > accessing App Engine.
> >
> > 7:53 AM - Google Site Reliability Engineers send an email to a broad
> > audience notifying oncall staff that there has been a power outage in
> > our primary datacenter. Google's datacenters have backup power
> > generators for these situations, but in this case around 25% of
> > machines in the datacenter did not receive backup power in time and
> > crashed. At this time, our oncall staff was paged.
> >
> > 8:01 AM - By this time, our primary oncall engineer has assessed the
> > extent and the impact of the page, and has determined that App Engine
> > is down. The oncall engineer, according to procedure, pages our
> > product managers and engineering leads to handle communicating about
> > the outage to our users. A few minutes later, the first post from the
> > App Engine team about this outage is made on the external group ("We
> > are investigating this issue.").
> >
> > 8:22 AM - After further analysis, we determine that although power has
> > returned to the datacenter, many machines in the datacenter are
> > missing due to the power outage and are not able to serve traffic. In
> > particular, it is determined that the GFS and Bigtable clusters are
> > not in a functioning state, due to having lost too many machines, and
> > that the Datastore is therefore not usable in the primary datacenter
> > at that time. The oncall engineer discusses performing a failover to
> > our alternate datacenter with the rest of the oncall team. Agreement
> > is reached to pursue our unexpected failover procedure for unplanned
> > datacenter outages.
> >
> > 8:36 AM - Following up on the post on the discussion group outage
> > thread, the App Engine team makes a post about the outage to our
> > appengine-downtime-notify group and to the App Engine Status site.
> >
> > 8:40 AM - The primary oncall engineer discovers two conflicting sets
> > of procedures. This was a result of the operations process changing
> > after our recent migration of the Datastore. After discussion with
> > other oncall engineers, consensus is not reached, and members of the
> > engineering team attempt to contact the specific engineers responsible
> > for the procedure change to resolve the situation.
> >
> > 8:44 AM - While others attempt to determine which is the correct
> > unexpected failover procedure, the oncall engineer attempts to move
> > all traffic into a read-only state in our alternate datacenter.
> > Traffic is moved, but an unexpected configuration problem from this
> > procedure prevents the read-only traffic from working properly.
> >
> > 9:08 AM - Various engineers are diagnosing the problem with read-only
> > traffic in our alternate datacenter. In the meantime, however, the
> > primary oncall engineer sees data that leads them to believe that our
> > primary datacenter has recovered and may be able to serve. Without a
> > clear rubric with which to make this decision, however, the engineer
> > was not aware that, based on historical data, the primary datacenter
> > was unlikely to have recovered to a usable state by this point in
> > time. Traffic is moved back to the original primary datacenter in an
> > attempt to resume serving, while others debug the read-only issue in
> > the alternate datacenter.
> >
> > 9:18 AM - The primary oncall engineer determines that the primary
> > datacenter has not recovered and cannot serve traffic. It is now
> > clear to oncall staff that the call was wrong, the primary will not
> > recover, and we must focus on the alternate datacenter. Traffic is
> > failed back over to the alternate datacenter, and the oncall makes the
> > decision to follow the unplanned failover procedure and begins the
> > process.
> >
> > 9:35 AM - An engineer familiar with the unplanned failover procedure
> > is reached and begins providing guidance. Traffic is moved to our
> > alternate datacenter, initially in read-only mode.
> >
> > 9:48 AM - Serving for App Engine begins externally in read-only mode,
> > from our alternate datacenter. At this point, apps that properly
> > handle read-only periods should be serving correctly, though in a
> > reduced operational state.
> >
> > 9:53 AM - After engineering team
> > ...
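For anyone wondering what "apps that properly handle read-only periods" looks like in practice, here is a rough sketch using the Python SDK's Capabilities API. The Greeting model and GuestbookHandler are made up for illustration; the general pattern is to check whether Datastore writes are enabled and to catch CapabilityDisabledError, so users see a friendly message instead of a server error during read-only windows.

# Sketch only -- Greeting and GuestbookHandler are hypothetical names;
# the Capabilities API check and CapabilityDisabledError are the
# documented way to detect read-only Datastore periods in Python.
from google.appengine.api import capabilities
from google.appengine.ext import db, webapp
from google.appengine.runtime import apiproxy_errors

class Greeting(db.Model):
    content = db.StringProperty()

class GuestbookHandler(webapp.RequestHandler):
    def post(self):
        # Ask whether Datastore writes are currently enabled.
        writable = capabilities.CapabilitySet(
            'datastore_v3', capabilities=['write']).is_enabled()
        if not writable:
            self.response.set_status(503)
            self.response.out.write('Posting is temporarily disabled.')
            return
        try:
            Greeting(content=self.request.get('content')).put()
        except apiproxy_errors.CapabilityDisabledError:
            # Writes can still be disabled between the check and the put().
            self.response.set_status(503)
            self.response.out.write('Posting is temporarily disabled.')

The try/except matters because the capability can change between the check and the write, so both paths degrade to a polite 503 rather than a failed request.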