Wow, I second lennysan. Awesome postmortem! Thank you so much for sharing it with us.
Marc

On Mar 5, 12:25 pm, lennysan <lenny...@gmail.com> wrote:
> I've been working on a Guideline for Postmortem Communication, and ran
> this post through the guideline:
> http://www.transparentuptime.com/2010/03/google-app-engine-downtime-p...
>
> Overall, this may be the single most impressive postmortem I've seen
> yet. The amount of time and thought put into this post is staggering,
> and the takeaways are useful to every organization. I'm especially
> impressed with the proposed new functionality that turns this event
> into a long-term positive, which is really all you can ask for after
> an incident.
>
> On Mar 4, 3:22 pm, App Engine Team <appengine.nore...@gmail.com> wrote:
>
> > Post-Mortem Summary
> >
> > This document details the cause of, and the events occurring
> > immediately after, App Engine's outage on February 24th, 2010, as
> > well as the steps we are taking to mitigate the impact of future
> > outages like this one.
> >
> > On February 24th, 2010, all Google App Engine applications were in
> > varying degraded states of operation for a period of two hours and
> > twenty minutes, from 7:48 AM to 10:09 AM PT | 15:48 to 18:09 GMT. The
> > underlying cause of the outage was a power failure in our primary
> > datacenter. While the Google App Engine infrastructure is designed to
> > quickly recover from these sorts of failures, this type of rare
> > problem, combined with internal procedural issues, extended the time
> > required to restore the service.
> >
> > <<Link to full timeline here, which is attached below.>>
> >
> > What did we do wrong?
> >
> > Though the team had planned for this sort of failure, our response had
> > a few important issues:
> >
> > - Although we had procedures ready for this sort of outage, the oncall
> > staff was unfamiliar with them and had not trained sufficiently with
> > the specific recovery procedure for this type of failure.
> >
> > - Recent work to migrate the datastore for better multihoming changed
> > and significantly improved the procedure for handling these failures.
> > However, some documentation detailing the procedure to support the
> > datastore during failover incorrectly referred to the old
> > configuration. This led to confusion during the event.
> >
> > - The production team had not agreed on a policy that clearly
> > indicates when, and in what situations, our oncall staff should take
> > aggressive user-facing actions, such as an unscheduled failover. This
> > led to a bad call of returning to a partially working datacenter.
> >
> > - We failed to plan for the case of a power outage that might affect
> > some, but not all, of our machines in a datacenter (in this case,
> > about 25%). In particular, this led to incorrect analysis of the
> > serving state of the failed datacenter and when it might recover.
> >
> > - Though we were able to eventually migrate traffic to the backup
> > datacenter, a small number of Datastore entity groups, belonging to
> > approximately 25 applications in total, became stuck in an
> > inconsistent state as a result of the failover procedure. This
> > represented considerably less than 0.00002% of data stored in the
> > Datastore.
> >
> > Ultimately, although significant work had been done over the past year
> > to improve our handling of these types of outages, issues with
> > procedures reduced their impact.
> >
> > What are we doing to fix it?
> >
> > As a result, we have instituted the following procedures going
> > forward:
> >
> > - Introduce regular drills by all oncall staff of all of our
> > production procedures. This will include the rare and complicated
> > procedures, and all members of the team will be required to complete
> > the drills before joining the oncall rotation.
> >
> > - Implement a regular bi-monthly audit of our operations docs to
> > ensure that all needed procedures are properly findable, and all
> > out-of-date docs are properly marked "Deprecated."
> >
> > - Establish a clear policy framework to help oncall staff quickly
> > and decisively make decisions about taking intrusive, user-facing
> > actions during failures. This will allow them to act confidently and
> > without delay in emergency situations.
> >
> > We believe that with these new procedures in place, last week's outage
> > would have been reduced in impact from about 2 hours of total
> > unavailability to about 10 to 20 minutes of partial unavailability.
> >
> > In response to this outage, we have also decided to make a major
> > infrastructural change in App Engine. Currently, App Engine provides a
> > one-size-fits-all Datastore that provides low write latency combined
> > with strong consistency, in exchange for lower availability in
> > situations of unexpected failure in one of our serving datacenters. In
> > response to this outage, and feedback from our users, we have begun
> > work on providing two different Datastore configurations:
> >
> > - The current option of low latency, strong consistency, and lower
> > availability during unexpected failures (like a power outage)
> >
> > - A new option for higher availability using synchronous replication
> > for reads and writes, at the cost of significantly higher latency
> >
> > We believe that providing both of these options to you, our users,
> > will allow you to make your own informed decisions about the tradeoffs
> > you want to make in running your applications.
> >
> > We sincerely apologize for the impact of Feb 24th's service disruption
> > on your applications. We take great pride in the reliability that App
> > Engine offers, but we also recognize that we can do more to improve
> > it. You can be confident that we will continue to work diligently to
> > improve the service and to ensure that low-level outages like this one
> > have the least possible effect on our customers.
> >
> > Timeline
> > -----------
> >
> > 7:48 AM - Internal monitoring graphs first begin to show that traffic
> > in our primary datacenter has problems and is returning an elevated
> > number of errors. Around the same time, posts begin to show up in the
> > google-appengine discussion group from users who are having trouble
> > accessing App Engine.
> >
> > 7:53 AM - Google Site Reliability Engineers send an email to a broad
> > audience notifying oncall staff that there has been a power outage in
> > our primary datacenter. Google's datacenters have backup power
> > generators for these situations, but in this case around 25% of
> > machines in the datacenter did not receive backup power in time and
> > crashed. At this time, our oncall staff was paged.
> >
> > 8:01 AM - By this time, our primary oncall engineer has assessed the
> > extent and the impact of the page, and has determined that App Engine
> > is down. The oncall engineer, according to procedure, pages our
> > product managers and engineering leads to handle communicating about
> > the outage to our users. A few minutes later, the first post from the
> > App Engine team about this outage is made on the external group ("We
> > are investigating this issue.").
> >
> > 8:22 AM - After further analysis, we determine that although power has
> > returned to the datacenter, many machines in the datacenter are
> > missing due to the power outage and are not able to serve traffic. In
> > particular, it is determined that the GFS and Bigtable clusters are
> > not in a functioning state, due to having lost too many machines, and
> > that the Datastore is therefore not usable in the primary datacenter
> > at that time. The oncall engineer discusses performing a failover to
> > our alternate datacenter with the rest of the oncall team. Agreement
> > is reached to pursue our unexpected failover procedure for unplanned
> > datacenter outages.
> >
> > 8:36 AM - Following up on the post on the discussion group outage
> > thread, the App Engine team makes a post about the outage to our
> > appengine-downtime-notify group and to the App Engine Status site.
> >
> > 8:40 AM - The primary oncall engineer discovers two conflicting sets
> > of procedures. This was a result of the operations process changing
> > after our recent migration of the Datastore. After discussion with
> > other oncall engineers, consensus is not reached, and members of the
> > engineering team attempt to contact the specific engineers responsible
> > for the procedure change to resolve the situation.
> >
> > 8:44 AM - While others attempt to determine which is the correct
> > unexpected failover procedure, the oncall engineer attempts to move
> > all traffic into a read-only state in our alternate datacenter.
> > Traffic is moved, but an unexpected configuration problem from this
> > procedure prevents the read-only traffic from working properly.
> >
> > 9:08 AM - Various engineers are diagnosing the problem with read-only
> > traffic in our alternate datacenter. In the meantime, however, the
> > primary oncall engineer sees data that leads them to believe that our
> > primary datacenter has recovered and may be able to serve. Without a
> > clear rubric with which to make this decision, however, the engineer
> > was not aware that, based on historical data, the primary datacenter
> > was unlikely to have recovered to a usable state by this point in
> > time. Traffic is moved back to the original primary datacenter in an
> > attempt to resume serving, while others debug the read-only issue in
> > the alternate datacenter.
> >
> > 9:18 AM - The primary oncall engineer determines that the primary
> > datacenter has not recovered and cannot serve traffic. It is now
> > clear to oncall staff that the call was wrong, the primary will not
> > recover, and we must focus on the alternate datacenter. Traffic is
> > failed back over to the alternate datacenter, and the oncall makes the
> > decision to follow the unplanned failover procedure and begins the
> > process.
> >
> > 9:35 AM - An engineer familiar with the unplanned failover procedure
> > is reached and begins providing guidance. Traffic is moved to our
> > alternate datacenter, initially in read-only mode.
> >
> > 9:48 AM - Serving for App Engine begins externally in read-only mode,
> > from our alternate datacenter. At this point, apps that properly
> > handle read-only periods should be serving correctly, though in a
> > reduced operational state.
> >
> > 9:53 AM - After engineering team
> > ...
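For anyone wondering what "apps that properly handle read-only periods" looks like in practice, here is a rough sketch using the Python SDK's Capabilities API. The Greeting model and GuestbookHandler are made up for illustration; the general pattern is to check whether Datastore writes are enabled and to catch CapabilityDisabledError, so users see a friendly message instead of a server error during read-only windows.

# Sketch only -- Greeting and GuestbookHandler are hypothetical names;
# the Capabilities API check and CapabilityDisabledError are the
# documented way to detect read-only Datastore periods in Python.
from google.appengine.api import capabilities
from google.appengine.ext import db, webapp
from google.appengine.runtime import apiproxy_errors

class Greeting(db.Model):
    content = db.StringProperty()

class GuestbookHandler(webapp.RequestHandler):
    def post(self):
        # Ask whether Datastore writes are currently enabled.
        writable = capabilities.CapabilitySet(
            'datastore_v3', capabilities=['write']).is_enabled()
        if not writable:
            self.response.set_status(503)
            self.response.out.write('Posting is temporarily disabled.')
            return
        try:
            Greeting(content=self.request.get('content')).put()
        except apiproxy_errors.CapabilityDisabledError:
            # Writes can still be disabled between the check and the put().
            self.response.set_status(503)
            self.response.out.write('Posting is temporarily disabled.')

The try/except matters because the capability can change between the check and the write, so both paths degrade to a polite 503 rather than a failed request.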