I just want to say thank you so much for these emails. They're great on their own, but together they paint a clear picture at a level usually inaccessible to those of us outside everyday MediaWiki development. Thank you!
On Sat, Dec 11, 2021 at 20:39 Krinkle <krin...@fastmail.com> wrote:

> How’d we do in our strive for operational excellence last month? Read on
> to find out!
>
> Incidents
>
> 6 documented incidents last month. That's above the two-year and five-year
> median of 4 per month (per Incident graphs
> <https://codepen.io/Krinkle/full/wbYMZK>).
>
> 2021-11-04 large file upload timeouts
> <https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-11-04_large_file_upload_timeouts>;
> Impact: For 9 months, editors were unable to upload large files (e.g. to
> Commons). Editors would receive generic error messages, typically after a
> timeout. In retrospect, a dozen different distinct production errors had
> been reported and regularly observed that were related and provided
> different clues, however most of these remained untriaged and
> uninvestigated for months. This may be related to the affected components
> having no active code steward.
>
> 2021-11-05 TOC language converter
> <https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-11-05_TOC_language_converter>;
> Impact: For 6 hours, wikis experienced a blank or missing table of contents
> on many pages. For up to 3 days prior, wikis that have multiple language
> variants (such as Chinese Wikipedia) displayed the table of contents in an
> incorrect or inconsistent language variant (which are not understandable to
> some readers).
>
> 2021-11-10 cirrussearch commonsfile outage
> <https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-11-10_cirrussearch_commonsfile_outage>;
> Impact: For ~2.5 hours, the Search results page was unavailable on many
> wikis (except English Wikipedia). On Wikimedia Commons the search
> suggestions feature was unresponsive as well.
>
> 2021-11-18 codfw ipv6 network
> <https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-11-18_codfw_ipv6_network>;
> Impact: For 8 minutes, the Codfw cluster experienced partial loss of IPv6
> connectivity for upload.wikimedia.org. This did not affect availability
> of the service because the "Happy Eyeballs
> <https://en.wikipedia.org/wiki/Happy_Eyeballs>" algorithm ensures
> browsers (and other clients) automatically fallback to IPv4. The Codfw
> cluster generally serves Mexico and parts of the US and Canada. The
> upload.wikimedia.org service serves photos and other media/document
> files, such as displayed in Wikipedia articles.
>
> 2021-11-23 core network routing
> <https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-11-23_Core_Network_Routing>;
> Impact: For about 12 minutes, Eqiad was unable to reach hosts in other data
> centers via public IP addresses. This was due to a BGP routing error. There
> was no impact on end-user traffic, and impact on internal traffic was
> limited (only Icinga alerts themselves) because internal traffic generally
> uses local IP subnets which we currently route with OSPF instead of BGP.
>
> 2021-11-25 eventgate-main outage
> <https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-11-25_eventgate-main_outage>;
> Impact: For about 3 minutes, eventgate-main was down. This resulted in
> 25,000 MediaWiki backend errors due to inability to queue new jobs. About
> 1000 user-facing web requests failed (HTTP 500 Error). Event production
> briefly dropped from ~3000 per second to 0 per second.
>
> Incident follow-up
>
> Remember to review and schedule Incident Follow-up work
> <https://phabricator.wikimedia.org/project/view/4758/> in Phabricator,
> which are preventive measures and tech debt mitigations written down after
> an incident is concluded. Read more about past incidents at Incident
> status <https://wikitech.wikimedia.org/wiki/Incident_status> on Wikitech.
>
> Recently resolved incident follow-up:
>
> Disable DPL on wikis that aren't using it
> <https://phabricator.wikimedia.org/T287916>
> Filed after a July 2021 incident, done by Amir (Ladsgroup) and Kunal
> (Legoktm).
>
> Create easy access to MySQL ports for faster incident response and
> maintenance <https://phabricator.wikimedia.org/T291352>
> Filed in Sep 2021, and carried out by Stevie (Kormat).
>
> Create paging alert for primary DB hosts
> <https://phabricator.wikimedia.org/T233684>
> Filed after a Sept 2019 incident, done by Stevie (Kormat).
>
> Trends
>
> November saw 27 new production error reports of which 14 were resolved,
> and 13 remain open and carry over to the next month.
>
> Of the 301 errors still open from previous months, 16 were resolved.
> Together with the 13 carried over from November that brings the workboard
> to 298 unresolved tasks.
> Figure 1: Unresolved error reports by month
> <https://phabricator.wikimedia.org/phame/post/view/261/production_excellence_38_november_2021/#trends>.
>
> Outstanding errors
>
> Take a look at the workboard and look for tasks that could use your help.
> → https://phabricator.wikimedia.org/tag/wikimedia-production-error/
>
> 💡 Did you know:
> *To find your team's error reports, use the appropriate **"Filter" link
> in the sidebar of the workboard**.*
>
> Issues carried over from recent months:
>
> Apr 2021: 9 of 42 issues left.
> May 2021: 16 of 54 issues left.
> Jun 2021: 9 of 26 issues left.
> Jul 2021: 11 of 31 issues left.
> Aug 2021: 10 of 46 issues left.
> Sep 2021: 10 of 24 issues left.
> Oct 2021: 20 of 49 issues left.
> Nov 2021: 13 of 27 new issues
> <https://phabricator.wikimedia.org/maniphest/query/0W0Nuk9umBDc/#R> are
> carried forward.
>
> Thanks!
>
> Thank you to everyone who helped by reporting, investigating, or resolving
> problems in Wikimedia production. Thanks!
>
> Until next time,
>
> – Timo Tijhof
>
> 🔗 Share or read later via
> https://phabricator.wikimedia.org/phame/post/view/261/
>
> _______________________________________________
> Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org
> To unsubscribe send an email to wikitech-l-le...@lists.wikimedia.org
> https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/