Thanks for sharing this information. It may very well be irrational,
but it does feel good to get some insight into what happened and how
you guys are responding to it!

One paragraph in particular caught my attention:

"- A new option for higher availability using synchronous replication
for reads and writes, at the cost of significantly higher latency"

I think I understand why you wish to expose this fundamental trade-off
to the app owners, i.e. "Let us make our own decisions and force us to
acknowledge the fundamental forces at play". I'm a bit concerned about
the potential behavioral side-effect of such a "feature" though. Right
now you guys have to make GAE _both_ reliable and fast. That's what we
expect. It may very well be next to impossible to do both, but you have
to keep trying...and having a bunch of hard-core GAE engineers
continuously trying will probably land us all in a pretty happy place a
year or two down the line ;-)

On the other hand...making it a choice for app owners would
effectively be giving up on that fundamental challenge. "Oh..so you
can't live with 2-3 outages of several hours each every year...then
you should enable the super reliable but slooooow option". I'm sure
you guys would still do your best to make the "slow option" super
fast...but at the end of the day resources need to be prioritized.
My fear is that making the reliable-but-slow option fast will tend to
end up at the bottom of the stack, since there is already a
"workaround" for the customers who really need it.


/Chris


N.B.: I'm very happy with the performance of GAE as it is today...or
at least as it was up until a week or so ago ;-)...but the reliability
is a major cause of concern for me.

On Mar 5, 12:22 am, App Engine Team <appengine.nore...@gmail.com>
wrote:
> Post-Mortem Summary
>
> This document details the cause of, and events occurring immediately
> after, App Engine's outage on February 24th, 2010, as well as the
> steps we are taking to mitigate the impact of outages like this one
> in the future.
>
> On February 24th, 2010, all Google App Engine applications were in
> varying degraded states of operation for a period of two hours and
> twenty minutes, from 7:48 AM to 10:09 AM PT | 15:48 to 18:09 GMT. The
> underlying cause of the outage was a power failure in our primary
> datacenter. While the Google App Engine infrastructure is designed to
> quickly recover from this sort of failure, this type of rare problem,
> combined with internal procedural issues, extended the time required
> to restore the service.
>
> <<Link to full timeline here, which is attached below.>>
>
> What did we do wrong?
>
> Though the team had planned for this sort of failure, our response had
> a few important issues:
>
> - Although we had procedures ready for this sort of outage, the oncall
> staff was unfamiliar with them and had not trained sufficiently with
> the specific recovery procedure for this type of failure.
>
> - Recent work to migrate the datastore for better multihoming changed
> and improved the procedure for handling these failures significantly.
> However, some documentation detailing the procedure to support the
> datastore during failover incorrectly referred to the old
> configuration. This led to confusion during the event.
>
> - The production team had not agreed on a policy that clearly
> indicates when, and in what situations, our oncall staff should take
> aggressive user-facing actions, such as an unscheduled failover. This
> led to a bad call to return traffic to a partially working datacenter.
>
> - We failed to plan for the case of a power outage that might affect
> some, but not all, of our machines in a datacenter (in this case,
> about 25%). In particular, this led to incorrect analysis of the
> serving state of the failed datacenter and when it might recover.
>
> - Though we were eventually able to migrate traffic to the backup
> datacenter, a small number of Datastore entity groups, belonging to
> approximately 25 applications in total, became stuck in an
> inconsistent state as a result of the failover procedure. This
> represented considerably less than 0.00002% of data stored in the
> Datastore.
>
> Ultimately, although significant work had been done over the past year
> to improve our handling of these types of outages, issues with our
> procedures reduced the benefit of that work.
>
> What are we doing to fix it?
>
> As a result, we have instituted the following procedures going
> forward:
>
> - Introduce regular drills for all oncall staff covering all of our
> production procedures. These will include the rare and complicated
> procedures, and all members of the team will be required to complete
> the drills before joining the oncall rotation.
>
> - Implement a regular bi-monthly audit of our operations docs to
> ensure that all needed procedures are properly findable, and all out-
> of-date docs are properly marked "Deprecated."
>
> - Establish a clear policy framework to assist oncall staff to quickly
> and decisively make decisions about taking intrusive, user-facing
> actions during failures. This will allow them to act confidently and
> without delay in emergency situations.
>
> We believe that with these new procedures in place, last week's outage
> would have been reduced in impact from about 2 hours of total
> unavailability to about 10 to 20 minutes of partial unavailability.
>
> In response to this outage, we have also decided to make a major
> infrastructural change in App Engine. Currently, App Engine provides a
> one-size-fits-all Datastore that offers low write latency combined
> with strong consistency, in exchange for lower availability during an
> unexpected failure in one of our serving datacenters. In light of this
> outage, and of feedback from our users, we have begun work on
> providing two different Datastore configurations:
>
> - The current option of low-latency, strong consistency, and lower
> availability during unexpected failures (like a power outage)
>
> - A new option for higher availability using synchronous replication
> for reads and writes, at the cost of significantly higher latency
>
> We believe that providing both of these options to you, our users,
> will allow you to make your own informed decisions about the tradeoffs
> you want to make in running your applications.
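
Whichever of these two configurations an app ends up on, a Datastore
write can still slow down or fail outright during an incident like this
one, so it pays to wrap puts defensively. Here's a minimal sketch on the
Python runtime, just to illustrate the idea; the model class, retry
count and backoff are made-up assumptions, while db.Timeout and
db.InternalError are existing SDK exceptions for transient Datastore
problems:

    import time

    from google.appengine.ext import db


    class GuestbookEntry(db.Model):
        """Hypothetical example model."""
        author = db.StringProperty()
        content = db.TextProperty()


    def put_with_retry(entity, attempts=3, backoff_seconds=0.1):
        """Try a Datastore write a few times before giving up.

        db.Timeout and db.InternalError are exceptions the SDK can
        raise when the Datastore is slow or briefly unavailable.
        """
        for attempt in range(attempts):
            try:
                return entity.put()
            except (db.Timeout, db.InternalError):
                if attempt == attempts - 1:
                    raise
                time.sleep(backoff_seconds * (2 ** attempt))

    # Example: put_with_retry(GuestbookEntry(author='chris', content='hi'))

Obviously a retry won't save anyone from a multi-hour outage, but it
does smooth over the brief blips either configuration will still have.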
>
> We sincerely apologize for the impact of Feb 24th's service disruption
> on your applications. We take great pride in the reliability that App
> Engine offers, but we also recognize that we can do more to improve
> it. You can be confident that we will continue to work diligently to
> improve the service and to ensure that low-level outages like this one
> have the least possible effect on our customers.
>
> Timeline
> -----------
>
> 7:48 AM - Internal monitoring graphs first begin to show that traffic
> in our primary datacenter is having problems and returning an elevated
> number of errors. Around the same time, posts begin to show up in the
> google-appengine discussion group from users who are having trouble
> accessing App Engine.
>
> 7:53 AM - Google Site Reliability Engineers send an email to a broad
> audience notifying oncall staff that there has been a power outage in
> our primary datacenter. Google's datacenters have backup power
> generators for these situations, but in this case around 25% of the
> machines in the datacenter did not receive backup power in time and
> crashed. At this time, our oncall staff was paged.
>
> 8:01 AM - By this time, our primary oncall engineer has determined the
> extent and the impact of the problem behind the page, and has
> concluded that App Engine is down. The oncall engineer, according to
> procedure, pages our product managers and engineering leads to handle
> communication about the outage to our users. A few minutes later, the
> first post from the App Engine team about this outage is made on the
> external group ("We are investigating this issue.").
>
> 8:22 AM - After further analysis, we determine that although power has
> returned to the datacenter, many machines in the datacenter are
> missing due to the power outage, and are not able to serve traffic.
> In particular, it is determined that the GFS and Bigtable clusters are
> not in a functioning state, due to having lost too many machines, and
> that the Datastore is thus not usable in the primary datacenter at
> that time. The oncall engineer discusses performing a failover to our
> alternate datacenter with the rest of the oncall team. Agreement is
> reached to pursue our unexpected failover procedure for an unplanned
> datacenter outage.
>
> 8:36 AM - Following up on the post on the discussion group outage
> thread, the App Engine team makes a post about the outage to our
> appengine-downtime-notify group and to the App Engine Status site.
>
> 8:40 AM - The primary oncall engineer discovers two conflicting sets
> of procedures, a result of the operations process changing after our
> recent migration of the Datastore. After discussion with other oncall
> engineers, consensus is not reached, and members of the engineering
> team attempt to contact the specific engineers responsible for the
> procedure change to resolve the situation.
>
> 8:44 AM - While others attempt to determine which is the correct
> unexpected failover procedure, the oncall engineer attempts to move
> all traffic into a read-only state in our alternate datacenter.
> Traffic is moved, but an unexpected configuration problem from this
> procedure prevents the read-only traffic from working properly.
>
> 9:08 AM - Various engineers are diagnosing the problem with read-only
> traffic in our alternate datacenter. In the meantime, however, the
> primary oncall engineer sees data that leads them to believe that our
> primary datacenter has recovered and may be able to serve. Without a
> clear rubric with which to make this decision, the engineer is not
> aware that, based on historical data, the primary datacenter is
> unlikely to have recovered to a usable state by this point in time.
> Traffic is moved back to the original primary datacenter in an attempt
> to resume serving, while others debug the read-only issue in the
> alternate datacenter.
>
> 9:18 AM - The primary oncall engineer determines that the primary
> datacenter has not recovered, and cannot serve traffic. It is now
> clear to oncall staff that the earlier call was wrong, the primary
> will not recover, and we must focus on the alternate datacenter.
> Traffic is failed back over to the alternate datacenter, and the
> oncall engineer makes the decision to follow the unplanned failover
> procedure and begins the process.
>
> 9:35 AM - An engineer familiar with the unplanned failover procedure
> is reached, and begins providing guidance about it. Traffic is moved
> to our alternate datacenter, initially in read-only mode.
>
> 9:48 AM - Serving for App Engine begins externally in read-only mode,
> from our alternate datacenter. At this point, apps that properly
> handle read-only periods should be serving correctly, though in a
> reduced operational state.
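
For anyone wondering what "properly handle read-only periods" means in
practice on the Python runtime: you can ask the Capabilities API whether
Datastore writes are currently enabled, and catch
CapabilityDisabledError on puts. A rough sketch follows; the handler and
model are made up, and exact behaviour during an unplanned incident may
differ from a scheduled maintenance period:

    from google.appengine.api import capabilities
    from google.appengine.ext import db, webapp
    from google.appengine.runtime import apiproxy_errors


    class Comment(db.Model):
        """Hypothetical example model."""
        body = db.TextProperty()


    def datastore_writes_enabled():
        """Ask the Capabilities API whether Datastore writes are allowed."""
        return capabilities.CapabilitySet(
            'datastore_v3', capabilities=['write']).is_enabled()


    class CommentHandler(webapp.RequestHandler):
        def post(self):
            if not datastore_writes_enabled():
                # Read-only period: degrade gracefully instead of failing.
                self.error(503)
                self.response.out.write('Comments are read-only right now.')
                return
            try:
                Comment(body=self.request.get('body')).put()
            except apiproxy_errors.CapabilityDisabledError:
                # Writes were disabled between the check and the put().
                self.error(503)
                self.response.out.write('Comments are read-only right now.')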
>
> 9:53 AM - After the engineering team consults with the relevant
> engineers, who are now online, the correct unplanned failover
> operations document is confirmed and is ready to be used by the
> oncall engineer. The actual unplanned failover procedure for reads and
> writes begins.
>
> 10:09 AM - The unplanned failover procedure completes, without any
> problems. Traffic resumes serving normally, read and write. App Engine
> is considered up at this time.
>
> 10:19 AM - A follow-up post is made to the appengine-downtime-notify
> group, letting people know that App Engine is now serving normally.
