My phone sent the above writing as the train jostled. I'll resend after finishing at the office.

On Oct 27, 2015 8:04 AM, "Erik Bernhardson" <ebernhard...@wikimedia.org> wrote:
> On Oct 27, 2015 7:29 AM, "Risker" <risker...@gmail.com> wrote:
> >
> > On 27 October 2015 at 09:57, Brad Jorsch (Anomie) <bjor...@wikimedia.org> wrote:
> >
> > > On Tue, Oct 27, 2015 at 8:02 AM, Risker <risker...@gmail.com> wrote:
> > >
> > > > The incident report does not go far enough back into the history of
> > > > the incident. It does not explain how this code managed to get into
> > > > the deployment chain with a fatal error in it.
> > >
> > > Actually, it does. Erik writes "This occurred because the patch for the
> > > CirrusSearch repository that removed the schema should have been
> > > deployed before the change that adds it to the WikimediaEvents
> > > repository."
> > >
> > > In other words, there was nothing wrong with the code itself. The
> > > problem was that the multiple pieces of the change needed to be done in
> > > a particular order during the manual backporting process, but they were
> > > not done in that order.
> > >
> > > If this had waited for the train deployment, both pieces would have
> > > been done simultaneously and it wouldn't have been an issue, just as it
> > > wasn't an issue when these changes were done in master and
> > > automatically deployed to Beta Labs.
> >
> > That's a start, Brad. But even as someone who has limited experience
> > with software deployment, I can think of at least half a dozen questions
> > that I'd be asking here:
> >
> > - Why wasn't it part of the deployment train
>
> This was a fix for something that broke during the previous deployment
> train. Specifically, a hook was changed in core and the breakage was not
> noticed in the extension until the events from JavaScript stopped coming
> into our logging tables.
>
> > - As a higher level question, what are the thresholds for using a SWAT
> > deployment as opposed to the regular deployment train, are these
> > standards being followed, and are they the right standards? (Even I
> > notice that most of the big problems seem to come with deployments
> > outside of the deployment train.)
>
> This is documented at https://wikitech.wikimedia.org/wiki/SWAT_deploys.
> I'm not sure about previous outages but in this case the patch matches
> the documented limits. My intuition is a that a dep
>
> > - How was the code reviewed and tested before deployment
>
> Code was re
>
> > - Why did it appear to work in some contexts (indicated in your response
> > as master and Beta Labs) but not in the production context
>
> Because, as stated in the report and by Brad, the code itself works. The
> code was redeployed after the outage with no errors because the second
> time it was deployed in the correct order. This is why code review didn't
> catch the fatal and the error didn't show up in Beta Labs. This was an
> issue primarily with deployment process. (A toy sketch of the ordering
> issue is below, after the quoted message.)
>
> > - How are we ensuring that deployments that require multiple sequential
> > steps are (a) identified and (b) implemented in a way that those steps
> > are followed in the correct order
> >
> > Notice how none of the questions are "what was wrong with the code" or
> > "who screwed up". They're all systems questions. This is a systems
> > problem. Even in situations where there *is* a problem with the code or
> > someone *did* screw up, the root cause usually comes back to having
> > single points of failure (e.g. one person having the ability to
> > [unintentionally] get problem code deployed, or weaknesses in the code
> > review and testing process).
> >
> > Risker/Anne
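To make the ordering problem concrete, here is a toy sketch of a schema
being moved from one repository to another as two patches. The names and
the registry are hypothetical, and it assumes the fatal came from the
schema being defined twice; it is not the actual CirrusSearch /
WikimediaEvents / EventLogging code.

    # Toy model of "move a schema from repo A to repo B" done as two patches.
    # Hypothetical names; assumes the fatal was a duplicate definition.

    registered = {}  # schema name -> repo that registered it


    def register_schema(name, repo):
        """Registering the same schema twice is treated as a fatal error."""
        if name in registered:
            raise RuntimeError(
                f"fatal: schema {name!r} already registered by {registered[name]}"
            )
        registered[name] = repo


    def remove_schema(name):
        registered.pop(name, None)


    # Starting state: repo A (think CirrusSearch) owns the schema.
    register_schema("ExampleSchema", repo="repo-A")

    # Patch 1: repo A stops registering the schema.
    # Patch 2: repo B (think WikimediaEvents) starts registering it.
    #
    # Correct backport order -- patch 1, then patch 2 -- is harmless:
    #   remove_schema("ExampleSchema")
    #   register_schema("ExampleSchema", repo="repo-B")
    #
    # The order used during the SWAT -- patch 2 before patch 1 -- fatals:
    try:
        register_schema("ExampleSchema", repo="repo-B")
    except RuntimeError as err:
        print(err)

In master and on Beta Labs both patches land together, so the broken
intermediate state never exists; it only shows up when the two pieces are
synced out by hand, one at a time, in the wrong order.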
> At a higher level, this was a 9 minute outage instead of a 2 or 3 minute
> outage due to two mistakes I made while doing the revert. Both of these
> are in the incident report. First, the monitor I was watching from our
> logserver to tell me it needs a rollback did not report this error,
> adding a minute or two before the rollback started. We have other
> monitors that have been added in the past year that I should have been
> looking at as well. Second, I reverted multiple patches from within
> gerrit (our code review tool), which takes too long when the site is
> down. I can only point to inexperience here; others who have previously
> taken our sites down informed me that the proper way is to revert
> directly on the deployment server. I've been deploying patches to wmf
> for a couple of years and have always in the past reverted through
> gerrit, but those didn't need the extra speedy recovery as the site was
> not down, it was only logging errors or some specific piece of
> functionality was not working.
>
> Going up another level comes to our deployment tooling specifically.
> RelEng is working on a project called scap3 which brings our deployment
> process closer to what you should expect from a top 10 website. It
> includes canary deployments (e.g. 1% of servers) along with a single
> command that undoes the entire deployment. Canary deployments allow us
> to see an error before it is deployed everywhere, and a one-command
> rollback operation would have likely brought the site back 3 to 4
> minutes faster than how I reverted the patches. (A rough sketch of the
> canary-and-rollback idea follows below.)
>
> I did not link the scap3 portions as an actionable because, in my mind,
> that's not a single actionable thing. Scap3 is a major overhaul of our
> deploy process. Additionally, this is already a priority in RelEng.
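As promised above, here is a rough sketch of the canary-plus-single-rollback
idea. This is toy code, not scap3: the host list, the health check, and the
threshold are all invented for illustration.

    # Toy sketch of a canary deploy with a single rollback operation.
    # Not scap3; hostnames, the health check, and the threshold are made up.

    servers = [f"mw{n:04d}" for n in range(1, 401)]  # hypothetical app servers


    def sync(hosts, version):
        """Stand-in for pushing one version of the code to a set of hosts."""
        for host in hosts:
            print(f"{host}: now running {version}")


    def fatal_rate(hosts):
        """Stand-in for reading the fatal/error monitors for these hosts."""
        return 0.0  # pretend the monitors report no fatals


    def deploy(new_version, old_version, canary_fraction=0.01, threshold=0.001):
        canaries = servers[: max(1, int(len(servers) * canary_fraction))]

        # Step 1: push to ~1% of servers and watch the error monitors.
        sync(canaries, new_version)
        if fatal_rate(canaries) > threshold:
            sync(canaries, old_version)  # single rollback operation
            return False

        # Step 2: push everywhere, still able to undo the whole thing at once.
        sync(servers, new_version)
        if fatal_rate(servers) > threshold:
            sync(servers, old_version)
            return False
        return True


    deploy("new-version", "old-version")

The point is the shape of the process rather than the details: a bad patch
becomes visible on a handful of hosts before it reaches the whole fleet, and
undoing it is one operation instead of a series of per-patch reverts through
gerrit.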