The incident report does not go far enough back into the history of the
incident.  It does not explain how this code managed to get into the
deployment chain with a fatal error in it. It does not identify ways to
prevent that from happening in the future.

Even the most conscientious and perfectionist developer will make the
occasional error - and the root problem here is not the error itself, but
the fact that anything that can take the entire Wikimedia cluster down for
9 minutes got deployed onto production wikis. Nine minutes of downtime on
one of the world's top-10 websites, caused by an *internal* error rather
than an external attack, is a very, very big deal, but I'm not getting that
impression from anything written here, on Phabricator, or in the report
itself.  That disappoints me far more than the fact that an error was made
in the first place.

Risker/Anne

On 26 October 2015 at 23:04, MZMcBride <z...@mzmcbride.com> wrote:

> Greg Grossmeier wrote:
> >All is better now. Outage lasted about 10 minutes.
> >
> >Full incident report will be written by Erik B today.
>
> https://wikitech.wikimedia.org/wiki/Special:Permalink/197206
>
> MZMcBride
>
>
>
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
