📘 Read this post on Phabricator at
https://phabricator.wikimedia.org/phame/post/view/130/
<https://phabricator.wikimedia.org/phame/post/view/130/production_excellence_december_2018/>
-------

How’d we do in striving for operational excellence last month? Read on to
find out!

- Month in numbers.
- Lightning round.
- Current problems.

## 📊 *Month in numbers*

* 4 documented incidents. [1]
* 20 Wikimedia-prod-error tasks closed. [2]
* 18 Wikimedia-prod-error tasks created. [3]
* 172 currently open Wikimedia-prod-error tasks (as of 16 January 2019).

Terminology:

* An *Exception* (or fatal) prevents a user action. For example, a page
would display “Exception: Unable to render page” instead of the article
content.
* An *Error* (or non-fatal, warning) can produce pages that give no
indication of a problem, but may show corrupt, incorrect, or incomplete
information. For example, a user may receive a notification that says “You
have (null) new messages”.
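
To make the distinction concrete, here is a minimal Python sketch. It is only an illustration, not MediaWiki code: `RenderError`, `render_page`, and `notification_text` are hypothetical names invented for this example.

```python
class RenderError(Exception):
    """Hypothetical fatal condition: the page cannot be produced at all."""


def render_page(content):
    # Exception (fatal): the user action is blocked entirely.
    if content is None:
        raise RenderError("Unable to render page")
    return f"<article>{content}</article>"


def notification_text(count):
    # Error (non-fatal): the request still succeeds, but a missing
    # value leaks into the output instead of stopping the request.
    shown = "(null)" if count is None else count
    return f"You have {shown} new messages"
```

Calling `render_page(None)` stops the request with an exception, while `notification_text(None)` returns a complete but incorrect message.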

For December, I haven’t prepared any stories or taken interviews. Instead,
I’ve got a lightning round of errors in various areas that were found and
fixed this past month.

## ⚡️ *Contributions view fixed*

MarcoAurelio reported that Special:Contributions failed to load for certain
user names on meta.wikimedia.org (PHP Fatal error, due to a faulty database
record). Brad Jorsch investigated and found a relation to database
maintenance from March 2018. He corrected the faulty records, which
resolved the problem. Thanks!  — https://phabricator.wikimedia.org/T210985

## ⚡️ *Undefined talk space now defined*

The newly created Cantonese Wiktionary (yue.wiktionary.org) was
encountering errors from the Siteinfo API. We found this was due to invalid
site configuration. Urbanecm patched the issue, and also created a new unit
test for wmf-config that will prevent this issue from happening on other
wikis in the future. Thanks!  — https://phabricator.wikimedia.org/T211529

## ⚡️ *The undefined error status... error*

After deploying the 1.33.0-wmf.8 train to all wikis, we found a regression
in the HTTP library for MediaWiki. When MediaWiki requested an HTTP
resource from another service, and this resource was unavailable, then
MediaWiki failed to correctly determine the HTTP status code of that error,
which then caused another error! This happened, for example, when
Special:Collection was unable to reach the PediaPress.com backend in some
cases. Fixed by Bill Pirkle. Thanks!  —
https://phabricator.wikimedia.org/T212005
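
The task above has the details of the actual defect. As a rough illustration of this class of bug only, the Python sketch below assumes the status code is parsed from the first response header line, which is missing when the backend cannot be reached. Both function names and the `0` "no response" marker are hypothetical.

```python
def status_code_buggy(response_headers):
    # Assumes a status line such as "HTTP/1.1 503 Service Unavailable"
    # is always present. With an unreachable backend the list is empty,
    # so this raises IndexError: an error while handling an error.
    return int(response_headers[0].split()[1])


def status_code_safe(response_headers):
    # Return 0 (hypothetical "no response" marker) when nothing usable
    # came back, so the caller can report the original failure.
    if not response_headers:
        return 0
    parts = response_headers[0].split()
    if len(parts) < 2 or not parts[1].isdigit():
        return 0
    return int(parts[1])
```

With an empty header list, `status_code_buggy` fails with a secondary exception, while `status_code_safe` returns the marker value and leaves the original failure visible.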

## ⚡️ *Fatal error: Call to undefined function in Kartographer API*

When the 1.33.0-wmf.9 train reached the canary phase on Tue 18 December
(aka group0 [4]), Željko spotted a new fatal error in the logs. The fatal
originated in the Kartographer extension and would have affected various
users of the MediaWiki API. Patched the same day by Michael Holloway,
reviewed by James Forrester, and deployed by Željko. Thanks!  —
https://phabricator.wikimedia.org/T212218

## 📉 *Current problems*

Take a look at the workboard and look for tasks that might need your help.
The workboard lists known issues, grouped by the week in which they were
first observed.

→  https://phabricator.wikimedia.org/tag/wikimedia-production-error/

December’s theme will continue for now, as I imagine lots of you were on
vacation during that time! I’d like to draw attention to a subset of PHP
fatal errors: specifically, those that are publicly exposed (e.g. they don’t
need elevated user rights) and emit an HTTP 500 error code.

* Wikibase: Clicking “undo” for certain revisions fatals with a
PatcherException. — https://phabricator.wikimedia.org/T97146
* Flow: Unable to view certain talk pages due to workflow
InvalidDataException. — https://phabricator.wikimedia.org/T70526
* Translate: Certain Special:Translate urls fatal. —
https://phabricator.wikimedia.org/T204833
* MediaWiki (Special-pages): SpecialDoubleRedirects unavailable on
tt.wikipedia.org. — https://phabricator.wikimedia.org/T204800
* MediaWiki (Parser): Parse API exposes fatal content model error. —
https://phabricator.wikimedia.org/T206253
* CentralNotice: Certain SpecialCentralNoticeBanners urls fatal. —
https://phabricator.wikimedia.org/T149240
* PageViewInfo: Certain “mostviewed” API queries fail. —
https://phabricator.wikimedia.org/T208691

Public user requests resulting in fatals can cause (and have caused) alerts
to fire that notify SRE of wikis potentially being less available or down.


💡*ProTip*: Use “Report Error” on
https://phabricator.wikimedia.org/tag/wikimedia-production-error/ to create
a task with a helpful template. This template is also available as “Report
Application Error”, from the “Create Task” dropdown menu, on any task
creation form.


## 🎉 *Thanks!*

Thank you to everyone who has helped by reporting, investigating, or
resolving problems in Wikimedia production, including MarcoAurelio, Anomie,
Urbanecm, BPirkle, zeljkofilipin, Mholloway, Esanders, Jdforrester-WMF, and
hashar.

Until next time,

— Timo Tijhof

-------

Footnotes:

[1] Incidents. —
https://wikitech.wikimedia.org/wiki/Special:AllPages?from=Incident+documentation%2F20181200&to=Incident+documentation%2F20190100&namespace=0


[2] Tasks closed. —
https://phabricator.wikimedia.org/maniphest/query/Pe2KaRZhJJ.H/#R

[3] Tasks opened. —
https://phabricator.wikimedia.org/maniphest/query/aqbDey80TU02/#R

[4] What is group0? —
https://wikitech.wikimedia.org/wiki/Deployments/One_week#Three_groups