My phone sent the above message prematurely as the train jostled. I'll resend
it after finishing it at the office.
On Oct 27, 2015 8:04 AM, "Erik Bernhardson" <ebernhard...@wikimedia.org>
wrote:

>
> On Oct 27, 2015 7:29 AM, "Risker" <risker...@gmail.com> wrote:
> >
> > On 27 October 2015 at 09:57, Brad Jorsch (Anomie) <bjor...@wikimedia.org
> >
> > wrote:
> >
> > > On Tue, Oct 27, 2015 at 8:02 AM, Risker <risker...@gmail.com> wrote:
> > >
> > > > The incident report does not go far enough back into the history of
> the
> > > > incident.  It does not explain how this code managed to get into the
> > > > deployment chain with a fatal error in it.
> > >
> > >
> > > Actually, it does. Erik writes "This occurred because the patch for the
> > > CirrusSearch repository that removed the schema should have been
> deployed
> > > before the change that adds it to the WikimediaEvents repository."
> > >
> > > In other words, there was nothing wrong with the code itself. The
> problem
> > > was that the multiple pieces of the change needed to be done in a
> > > particular order during the manual backporting process, but they were
> not
> > > done in that order.
> > >
> > > If this had waited for the train deployment, both pieces would have
> been
> > > done simultaneously and it wouldn't have been an issue, just as it
> wasn't
> > > an issue when these changes were done in master and automatically
> deployed
> > > to Beta Labs.
> > >
> > >
> > That's a start, Brad.  But even as someone who has limited experience
> with
> >  software deployment, I can think of at least half a dozen questions that
> > I'd be asking here:
> >
> >    - Why wasn't it part of the deployment train
> This was a fix for something that broke during the previous deployment
> train. Specifically, a hook was changed in core, and the breakage was not
> noticed in the extension until the events from JavaScript stopped coming
> into our logging tables.
> >    - As a higher level question, what are the thresholds for using a SWAT
> >    deployment as opposed to the regular deployment train, are these
> standards
> >    being followed, and are they the right standards. (Even I notice that
> most
> >    of the big problems seem to come with deployments outside of the
> deployment
> >    train.)
>
> This is documented at https://wikitech.wikimedia.org/wiki/SWAT_deploys.
> I'm not sure about previous outages, but in this case the patch matches the
> documented limits. My intuition is that a dep
> >    - How was the code reviewed and tested before deployment
> Code was re
> >    - Why did it appear to work in some contexts (indicated in your
> response
> >    as master and Beta Labs) but not in the production context
> Because, as stated in the report and by Brad, the code itself works. The
> code was redeployed after the outage with no errors because the second time
> it was deployed in the correct order. This is why code review didn't catch
> the fatal and the error didn't show up in Beta Labs. This was primarily an
> issue with the deployment process.
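>
> To make the ordering issue Brad describes concrete, here is a rough Python
> sketch. It is not the actual MediaWiki/EventLogging code; the registry, the
> schema name, and the exact failure mode are illustrative assumptions only.
>
>     SCHEMA_REGISTRY = {}
>
>     def register_schema(name, owner):
>         # Treat a duplicate registration as fatal, standing in for whatever
>         # duplicate-definition error production actually hit.
>         if name in SCHEMA_REGISTRY:
>             raise RuntimeError(
>                 f"{name!r} already registered by {SCHEMA_REGISTRY[name]!r}; "
>                 f"duplicate registration attempted by {owner!r}")
>         SCHEMA_REGISTRY[name] = owner
>
>     # Correct order: the CirrusSearch removal is deployed first, so only one
>     # component ever defines the schema.
>     register_schema("Search", owner="WikimediaEvents")  # fine
>
>     # Wrong order: WikimediaEvents goes out while CirrusSearch still
>     # registers the schema, so the duplicate definition is fatal.
>     SCHEMA_REGISTRY = {"Search": "CirrusSearch"}  # state before the removal
>     try:
>         register_schema("Search", owner="WikimediaEvents")
>     except RuntimeError as err:
>         print("fatal:", err)
>
> In master and on Beta Labs both patches were always present together, so
> that window never opened; it only existed during the manual, out-of-order
> backport.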
>
> >    - How are we ensuring that deployments that require multiple
> sequential
> >    steps are (a) identified and (b) implemented in a way that those
> steps are
> >    followed in the correct order
> >
> >
> > Notice how none of the questions are "what was wrong with the code" or
> "who
> > screwed up".  They're all systems questions. This is a systems problem.
> > Even in situations where there *is* a problem with the code or someone
> > *did* screw up, the root cause usually comes back to having single points
> > of failure (e.g. one person having the ability to [unintentionally] get
> > problem code deployed, or weaknesses in the code review and testing
> > process).
> >
> > Risker/Anne
>
> At a higher level, this was a 9 minute outage instead of a 2 or 3 minute
> outage due to two mistakes I made while doing the revert. Both of these are
> in the incident report. First, the monitor I was watching on our log server,
> which should have told me a rollback was needed, did not report this error,
> adding a minute or two before the rollback started. We have other monitors,
> added in the past year, that I should have been watching as well. Second, I
> reverted multiple patches from within Gerrit (our code review tool), which
> takes too long when the site is down. I can only point to inexperience here;
> others who have previously taken our sites down informed me that the proper
> way is to revert directly on the deployment server. I've been deploying
> patches to wmf for a couple of years and have always reverted through Gerrit
> in the past, but those reverts didn't need an extra-speedy recovery: the
> site was not down, it was only logging errors or some specific piece of
> functionality was not working.
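>
> For what it's worth, the faster path looks roughly like the sketch below.
> This is a hypothetical helper, not our actual tooling: the deployment
> directory and the final sync step are placeholders; only git revert itself
> is a real command.
>
>     import subprocess
>
>     def emergency_revert(bad_commits, deploy_dir="/srv/deployment/example"):
>         """Revert directly in the checkout on the deployment server,
>         skipping the round trip through code review while the site is down."""
>         for sha in bad_commits:
>             # "git revert --no-edit" creates each revert commit locally in
>             # seconds instead of waiting on review and merge for every patch.
>             subprocess.run(["git", "revert", "--no-edit", sha],
>                            cwd=deploy_dir, check=True)
>         # Placeholder: push the reverted state to the application servers
>         # with whatever sync command the deployment environment provides.
>         print("now sync", deploy_dir, "to the app servers")
>
> Reverting through Gerrit is still the right default when nothing is on
> fire; the difference only matters when every minute of the outage counts.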
>
> Going up another level brings us to our deployment tooling specifically.
> RelEng is working on a project called scap3, which brings our deployment
> process closer to what you should expect from a top 10 website. It includes
> canary deployments (e.g. to 1% of servers first) along with a single command
> that undoes the entire deployment. Canary deployments let us see an error
> before the code reaches every server, and a one-command rollback operation
> would likely have brought the site back 3 to 4 minutes faster than the way I
> reverted the patches.
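>
> Conceptually it looks something like this sketch; it is not scap3's actual
> interface, and every name below is made up for illustration.
>
>     # Stand-ins for "push this version to these servers" and "read the
>     # error rate from monitoring"; in reality these are the deploy tool and
>     # our dashboards.
>     def deploy(servers, version):
>         return {server: version for server in servers}
>
>     def error_rate(servers):
>         return 0.0  # pretend the canaries look healthy
>
>     def canary_deploy(servers, new_version, old_version, threshold=0.01):
>         # 1. Push the new code to roughly 1% of servers first.
>         canary = servers[: max(1, len(servers) // 100)]
>         deploy(canary, new_version)
>
>         # 2. If the canaries show elevated errors, undo in one step and
>         #    stop; the bad code never reaches the rest of the fleet.
>         if error_rate(canary) > threshold:
>             deploy(canary, old_version)
>             return "aborted: canary error rate above threshold"
>
>         # 3. Otherwise continue to every server, with the same one-step
>         #    rollback still available if something shows up later.
>         deploy(servers, new_version)
>         return "deployed everywhere"
>
>     print(canary_deploy([f"mw{i}" for i in range(300)], "new", "old"))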
>
> I did not link the scap3 work as an actionable because, in my mind, it is
> not a single actionable thing; scap3 is a major overhaul of our deploy
> process. Additionally, it is already a priority for RelEng.
>
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
