The incident report does not go far enough back into the history of the incident. It does not explain how this code managed to get into the deployment chain with a fatal error in it. It does not identify ways to prevent that from happening in the future.
Even the most conscientious and perfectionist developer will make the occasional error. The root problem here is not the error itself, but the fact that anything that can take the entire Wikimedia cluster down for 9 minutes got deployed onto production wikis. Nine minutes of downtime on one of the world's top-10 websites, caused by an *internal* error rather than an external attack, is a very, very big deal, but I'm not getting that impression from anything written here, on Phabricator, or in the report itself. That disappoints me far more than the fact that an error was made in the first place.

Risker/Anne

On 26 October 2015 at 23:04, MZMcBride <z...@mzmcbride.com> wrote:

> Greg Grossmeier wrote:
> >All is better now. Outage lasted about 10 minutes.
> >
> >Full incident report will be written by Erik B today.
>
> https://wikitech.wikimedia.org/wiki/Special:Permalink/197206
>
> MZMcBride

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l