Send Outages-discussion mailing list submissions to
        outages-discussion@outages.org

To subscribe or unsubscribe via the World Wide Web, visit
        https://puck.nether.net/mailman/listinfo/outages-discussion
or, via email, send a message with subject or body 'help' to
        outages-discussion-requ...@outages.org

You can reach the person managing the list at
        outages-discussion-ow...@outages.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Outages-discussion digest..."


Today's Topics:

   1. Re: S3 Outages Postmortem (Michael Christian)
   2. Re: S3 Outages Postmortem (Jim Popovitch)


----------------------------------------------------------------------

Message: 1
Date: Wed, 1 Mar 2017 23:44:37 -0800
From: Michael Christian <mfletcherchrist...@yahoo.com>
To: "Chapman, Brad (NBCUniversal)" <brad.chap...@nbcuni.com>
Cc: Kevin Blackham <black...@gmail.com>, Bob Strecansky
        <b...@mailchimp.com>, "outages-discussion@outages.org"
        <outages-discussion@outages.org>
Subject: Re: [Outages-discussion] S3 Outages Postmortem
Message-ID: <0b57d2b9-cfc3-4e5b-90be-05d56d657...@yahoo.com>
Content-Type: text/plain; charset="utf-8"

The outage was abrupt, but the recovery came in stages: read traffic first,
followed by write traffic ~1.5 hours later. That makes me suspect a power
problem, or automation gone awry. We always blame the network team, but that
rings hollow to me here.

On strategy, I am fully behind prioritization of read traffic recovery over 
write traffic.  That's evolving over time, but is still true for most use cases.

For those saying "who cares," you may not understand the number of blended 
integrated systems out there in this age.  This took down a huge number of 
correlated services, and it shouldn't have.   We need looser coupling.

- Mike Christian


Sent from my iPad

> On Mar 1, 2017, at 11:25 AM, Chapman, Brad (NBCUniversal) 
> <brad.chap...@nbcuni.com> wrote:
> 
> "lots of services affected"
>  
> Well, that was pretty obvious from the dashboard yesterday:
>  
> https://i.imgur.com/xTec0Bn.png
>  
> -Brad
>  
> From: Outages-discussion [mailto:outages-discussion-boun...@outages.org] On 
> Behalf Of Kevin Blackham
> Sent: Wednesday, March 1, 2017 11:17 AM
> To: Bob Strecansky <b...@mailchimp.com>
> Cc: outages-discussion@outages.org
> Subject: Re: [Outages-discussion] S3 Outages Postmortem
>  
> I have some insights, but I'm under NDA. This was big enough I expect some 
> public disclosure (my words).
>  
> I can tell you we observed lots of services affected, not just S3. EBS was 
> jacking up IO all over the place, and many machines didn't even ping. SES was 
> quite broken, as was autoscaling. One might conclude it was a network problem.
>  
> On Mar 1, 2017 12:09, "Bob Strecansky" <b...@mailchimp.com> wrote:
> Has anyone heard anything about why S3 was down for 5 hours yesterday?  
> Usually Amazon doesn't post postmortems, and I'm curious as to what happened.
>  
> Thanks,
>  
> Bob Strecansky
> --
> Thanks,
> 
> -B
> 
> _______________________________________________
> Outages-discussion mailing list
> Outages-discussion@outages.org
> https://puck.nether.net/mailman/listinfo/outages-discussion
> 

------------------------------

Message: 2
Date: Thu, 2 Mar 2017 09:45:20 -0500
From: Jim Popovitch <jim...@gmail.com>
To: "outages-discussion@outages.org" <outages-discussion@outages.org>
Subject: Re: [Outages-discussion] S3 Outages Postmortem
Message-ID:
        <CAGfsgR0nFc+=T6RDd7NBRLaYQU1DrBNMEVx+cEmX9v-e=nz...@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8

On Thu, Mar 2, 2017 at 2:44 AM, Michael Christian
<mfletcherchrist...@yahoo.com> wrote:
> For those saying "who cares," you may not understand the number of blended
> integrated systems out there in this age.

I'm someone who says 'who cares', but not in the context you're
suggesting.   I say:


Who cares to see 30 outages posts for an outage in 1/20th of one
provider's datacenter services?


Who cares to see 30 outages posts about "important" websites that
don't follow decades of best practices on redundancy and resiliency?


Who cares to see 30 outages posts about "me too", "me too", "me too"?


-Jim P.


------------------------------

Subject: Digest Footer

_______________________________________________
Outages-discussion mailing list
Outages-discussion@outages.org
https://puck.nether.net/mailman/listinfo/outages-discussion


------------------------------

End of Outages-discussion Digest, Vol 93, Issue 3
*************************************************
