On 27 October 2015 at 10:56, Brad Jorsch (Anomie) <bjor...@wikimedia.org> wrote:
> On Tue, Oct 27, 2015 at 10:29 AM, Risker <risker...@gmail.com> wrote:
>
>>    - Why wasn't it part of the deployment train
>>
>
> Good question, and one that needs someone involved in this backport to
> answer.

I can partially answer this question (although don't necessarily
consider it the canonical answer; this is just what I know from within
the team and the analytics side of things):

A few weeks ago, changes to the infrastructure around EventLogging
unexpectedly broke our event collection for desktop Search events.
While it was non-functional, we were unable to collect any data for
some of our high-level metrics, including load times and event counts
around user searches. This wasn't just "unable to visualise": the
events weren't being collected at all, so backfilling was impossible.
That data was simply gone, and the longer we went without a fix, the
larger the gap we'd have.

Deploying the fix took a while for a couple of reasons - namely, the
deployment freeze while Operations were out of town, and the
EventLogging rollback and freeze due to issues with other changes to
that extension. This extended the period we were missing data for,
making the fix more and more critical. As of today we're at 1 month
and 3 days of missing data. So that's probably why it was SWATted: the
irreversible impact of waiting, and how long we had /been/ waiting.

>
>
>>    - As a higher level question, what are the thresholds for using a SWAT
>>    deployment as opposed to the regular deployment train, are these
>>    standards being followed, and are they the right standards. (Even I
>>    notice that most of the big problems seem to come with deployments
>>    outside of the deployment train.)
>
> My understanding is that SWAT is supposed to be for WMF configuration
> changes (i.e. the operations/mediawiki-config repo, which this wasn't) and
> for urgent bug fixes that can't wait for the weekly train. But my
> understanding might be too strict, so I'd recommend waiting for a more
> official answer than mine.
>
>
>>    - How was the code reviewed and tested before deployment
>>
>
> First, it was reviewed before being merged into master. Then the SWAT
> deployer is supposed to review the backport for potential issues, although
> they may lack the domain-specific knowledge that the original reviewers
> had, making it harder to spot issues like the one here.
>
>
>>    - Why did it appear to work in some contexts (indicated in your response
>>    as master and Beta Labs) but not in the production context
>>
>
> You're assuming this code wouldn't have worked in the production context if
> deployed correctly. It's like asking "Why does changing a lightbulb work
> normally, but not when the bulb-changer forgets to remove the burned-out
> bulb before putting the new one in?"
>
>
>>    - How are we ensuring that deployments that require multiple sequential
>>    steps are (a) identified and (b) implemented in a way that those
>>    steps are followed in the correct order
>>
>
> It requires that the people proposing/implementing the change identify
> the prerequisites. There's currently no automated way to do this, and
> even if some automated mechanism such as "Depends-On" tags on the git
> commits were implemented, it would require that people use the
> mechanism correctly and that the mechanism can be tracked automatically
> during backports as well as during normal development merges.
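
For anyone following along: the "Depends-On" footer Brad mentions would
be a line in the git commit message, next to the usual Gerrit footers,
naming the change that has to land first. A rough sketch, with the bug
number and change IDs invented purely for illustration:

    Fix desktop search event logging

    Bug: T12345
    Depends-On: I0123456789abcdef0123456789abcdef01234567
    Change-Id: Ifedcba9876543210fedcba9876543210fedcba98

I believe upstream Zuul understands a footer like this for cross-repo
dependencies at gate time, but as Brad says, it only helps if people
use it consistently and the tooling can follow it through backports as
well as normal merges.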
>
> There's also the possibility that unit testing could catch such issues
> when the changes are merged to the deployment branches before being
> deployed, and our Release Engineering team has been working on
> increasing the number of extension unit tests run. But that requires
> unit tests that cover everything, which we don't have, so things can
> still slip through. It also wouldn't handle the case where the files of
> a change are deployed individually and out of order, although at a
> glance it doesn't seem like that was the issue here.
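
To make the unit-testing idea concrete: MediaWiki ships a PHPUnit
wrapper, so running an extension's tests against a deployment branch
checkout would look roughly like this (the test path here is
illustrative, not necessarily how Release Engineering has it wired up):

    # from the root of the deployment branch checkout
    php tests/phpunit/phpunit.php extensions/EventLogging/tests/

Even with that in place, catching the ordering problem Brad mentions
would need the tests to run against the branch as it will actually be
synced, not just against master.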
>
> Taking this further to discuss plans, implementation, and mitigation of the
> remaining process issues is a discussion for the Release Engineering team,
> and may already be happening somewhere. Once people in SF get into work
> they might have further comments along these lines.
>
>
> --
> Brad Jorsch (Anomie)
> Senior Software Engineer
> Wikimedia Foundation



-- 
Oliver Keyes
Count Logula
Wikimedia Foundation

