BBlack added a comment.
Copying this in from etherpad (this is less awful than 6 hours of raw IRC+SAL logs, but still pretty verbose):
# cache servers work ongoing here, ethtool changes that require short depooled downtimes around short ethernet port outages:
17:49 bblack: ulsfo cp servers: rolling quick depool -> repool around ethtool parameter changes for -lro,-pause
17:57 bblack@neodymium: conftool action : set/pooled=no; selector: name=cp4024.ulsfo.wmnet
17:59 bblack: codfw cp servers: rolling quick depool -> repool around ethtool parameter changes for -lro,-pause
18:00 <+jouncebot> Amir1: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
18:27 bblack: esams cp servers: rolling quick depool -> repool around ethtool parameter changes for -lro,-pause
18:41 bblack: eqiad cp servers: rolling quick depool -> repool around ethtool parameter changes for -lro,-pause
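The per-host cycle above (depool, change ethtool parameters, repool) can be sketched as a generated command sequence. This is a minimal illustration only: the interface name, the `confctl` invocation syntax, and the helper function are assumptions for clarity, not the exact commands used.

```python
def ethtool_maintenance(host, iface="eth0"):
    """Hypothetical helper: build the shell command sequence for one
    cache host -- depool, disable LRO and pause frames, repool.
    Command syntax is illustrative, not verbatim from the incident."""
    return [
        # take the host out of the pool so traffic drains first
        f"confctl select name={host} set/pooled=no",
        # "-lro": turn off large receive offload
        f"ethtool -K {iface} lro off",
        # "-pause": turn off ethernet pause-frame flow control
        f"ethtool -A {iface} rx off tx off",
        # put the host back in the pool
        f"confctl select name={host} set/pooled=yes",
    ]

for cmd in ethtool_maintenance("cp4024.ulsfo.wmnet"):
    print(cmd)
```

Depooling first matters because the ethtool changes briefly drop the ethernet link, which would otherwise surface as user-visible errors.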
# 5xx alerts start appearing. initial assumption is related to ethtool work above
18:44 <+icinga-wm> PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
18:46 <+icinga-wm> PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
18:47 <+icinga-wm> PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
18:48 <+icinga-wm> PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0]
# ...but once the MW exceptions hit, seems less-likely to be related to the ethtool work
# notices hit IRC for these wikidata sorting changes:
18:51 ladsgroup@tin: Synchronized php-1.31.0-wmf.5/extensions/Wikidata/extensions/Constraints/includes/ConstraintCheck/DelegatingConstraintChecker.php: Fix sorting of NullResults (T179038) (duration: 01m 04s)
18:52 ladsgroup@tin: Synchronized php-1.31.0-wmf.5/extensions/Wikidata/extensions/Constraints/tests/phpunit/DelegatingConstraintCheckerTest.php: Fix sorting of NullResults (T179038) (duration: 00m 49s)
19:12 ladsgroup@tin: Synchronized php-1.31.0-wmf.5/extensions/WikibaseQualityConstraints/tests/phpunit/DelegatingConstraintCheckerTest.php: Fix sorting of NullResults (T179038) (duration: 00m 50s)
19:14 ladsgroup@tin: Synchronized php-1.31.0-wmf.5/extensions/WikibaseQualityConstraints/includes/ConstraintCheck/DelegatingConstraintChecker.php: Fix sorting of NullResults (T179038) (duration: 00m 49s)
# Lots of discussion and digging ensues on all sides...
# bblack figures out that while the logs implicate a single eqiad text backend cache, depooling said cache moves the problem to a different cache host (repeatedly), so it doesn't seem to be a faulty cache node.
# One cache just happens to be the unlucky chash destination for more of the problematic traffic than the others at any given time.
# The problematic traffic load/patterns consume all of the 1K connection slots varnish allows to api.svc+appservers.svc, and then this causes many unrelated 503s for lack of available backend connection slots to service requests.
# The fatals logs seem to be related to ORES fetching from Wikidata
# So, a timeout is increased there to cope with slow wikidata responses:
19:33 awight@tin: Started deploy [ores/deploy@0adae70]: Increase extractor wikidata API timeout to 15s, T179107
19:33 awight@tin: Finished deploy [ores/deploy@0adae70]: Increase extractor wikidata API timeout to 15s, T179107 (duration: 00m 10s)
19:34 awight@tin: Started deploy [ores/deploy@0adae70]: Increase extractor wikidata API timeout to 15s, T179107
19:36 aaron@tin: Started restart [jobrunner/jobrunner@a20d043]: (no justification provided)
19:41 awight@tin: Finished deploy [ores/deploy@0adae70]: Increase extractor wikidata API timeout to 15s, T179107 (duration: 07m 25s)
# Still doesn't fix the problem, so the next attempted fix is to disable ores+wikidata entirely:
20:02 ladsgroup@tin: Synchronized wmf-config/InitialiseSettings.php: UBN! disbale ores for wikidata (T179107) (duration: 00m 50s)
20:00 ladsgroup@tin: Synchronized wmf-config/InitialiseSettings.php: UBN! disbale ores for wikidata (T179107) (duration: 00m 50s)
# Things are still borked, try reverting some other recent Wikidata-related changes:
20:59 hoo@tin: Synchronized wmf-config/Wikibase.php: Revert "Add property for RDF mapping of external identifiers for Wikidata" (T178180) (duration: 00m 50s)
21:00 hoo: Fully revert all changes related to T178180
# Still borked.
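The "unlucky chash destination" behavior above can be illustrated with a toy consistent-hash ring. This is a simplified sketch, not Varnish's actual director implementation, and the cache names and URL are made up: every request in the problematic URL family hashes to the same backend cache, so depooling that cache just hands the entire hot bucket to a different single node.

```python
import hashlib

def pick_backend(url, backends):
    """Toy consistent hashing: each backend owns many points on a ring,
    and a request goes to the first backend point at or after hash(url).
    (Illustrative sketch only -- not Varnish's real chash logic.)"""
    ring = sorted(
        (int(hashlib.md5(f"{b}-{i}".encode()).hexdigest(), 16), b)
        for b in backends for i in range(100)
    )
    h = int(hashlib.md5(url.encode()).hexdigest(), 16)
    for point, backend in ring:
        if point >= h:
            return backend
    return ring[0][1]  # wrap around the ring

caches = [f"cp10{n}" for n in (52, 53, 65, 66, 67, 68)]
hot_url = "/w/api.php?hypothetical-problematic-query"

target = pick_backend(hot_url, caches)
# every request for the hot URL lands on the same cache...
assert all(pick_backend(hot_url, caches) == target for _ in range(10))
# ...and depooling that cache moves the whole hot bucket to one other node
remaining = [c for c in caches if c != target]
print(target, "->", pick_backend(hot_url, remaining))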
# Tried reverting something else that looked dangerous in the logstash errors, but it also wasn't the cause:
21:30 hoo@tin: Synchronized wmf-config/InitialiseSettings.php: Temporary disable remex html (T178632) (duration: 00m 50s)
21:32 hoo@tin: Synchronized wmf-config/InitialiseSettings.php: Temporary disable remex html (T178632) (duration: 00m 50s)
# For lack of anything better to try, raise varnish->MW connection limits by an order of magnitude to see what happens:
21:40 bblack: raising backend max_connections for api.svc.eqiad.wmnet + appservers.svc.eqiad.wmnet from 1K to 10K on cp1053.eqiad.wmnet (current funnel for the bulk of the 503s)
# At this point the large rate of unrelated 503s subsides. Socket counts eventually stabilize without reaching the new (10K) limits on this host. There's still clearly a problem, but this is effectively mitigating the worst of the damage for now.
# Eventually puppetize this change out to the other cache_text nodes in eqiad and restart the cp1053 backend to clear mailbox lag from the heavy connection-cycling, which moves the traffic pattern over to cp1052, where the same bad connection pattern appears yet again (but now without unrelated 503s, of course):
22:34 <+wikibugs> (CR) BBlack: [C: 2] cache_text: raise MW connection limits to 10K [puppet] - https://gerrit.wikimedia.org/r/386756 (owner: BBlack)
22:39 bblack: restarting varnish-backend on cp1053 (mailbox lag from ongoing issues elsewhere?)
22:41 < bblack> so the target cache is now cp1052 it seems
# ori chimes in out of nowhere that we should try reverting the earlier "Fix sorting of NullResults" stuff, which then leads to a decision to just roll wikidatawiki back to wmf.4 (which undoes those changes and more):
22:47 demon@tin: rebuilt wikiversions.php and synchronized wikiversions files: wikidata to wmf.4
# In reference to how quickly this might work even if it's the right revert (finally?):
22:49 < ori> it'll probably take a few minutes
# Keep debating and staring at things.
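The slot-exhaustion mechanism behind the unrelated 503s, and why the 1K-to-10K `max_connections` raise mitigated them, can be modeled with a simple counter. The 1000 and 10000 limits match the numbers above; everything else (the class, the request counts) is a toy model, not Varnish internals.

```python
class Backend:
    """Toy model of a per-backend max_connections limit: while all
    slots are held by in-flight requests, any new request -- related
    to the problem traffic or not -- is answered 503."""
    def __init__(self, max_connections):
        self.max_connections = max_connections
        self.in_use = 0

    def request(self):
        if self.in_use >= self.max_connections:
            return 503       # no free backend connection slot
        self.in_use += 1     # slot held; slow responses hold slots longer
        return 200

# With 1K slots, 2000 slow problematic requests pin every slot,
# so a subsequent unrelated request is refused.
appservers = Backend(max_connections=1000)
for _ in range(2000):
    appservers.request()
print("limit 1000, unrelated request:", appservers.request())   # 503

# With 10K slots, the same load leaves free slots for unrelated traffic.
appservers = Backend(max_connections=10000)
for _ in range(2000):
    appservers.request()
print("limit 10000, unrelated request:", appservers.request())  # 200
```

This also matches the observation that socket counts stabilized well below 10K after the raise: the limit was never the root cause, only the point where the underlying connection churn turned into collateral 503s.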
# There have been no unrelated 503s since ~21:40, but the pattern (now on cp1052) of the underlying badness is still present (large counts of sockets cycling to MW). Eventually:
23:22 < bblack> I'm going to restart cp1052 varnish backend again just to clear out other noise that makes it harder to see what's going on
23:24 < bblack> cp1067 is the new target
[...]
23:29 < bblack> well, cp1067 isn't building up the connection stats I'd expect
23:29 < bblack> none of them seem to clearly be the target anymore...
23:30 < bblack> problem mysteriously poofed? probably not right when I moved the traffic above. probably earlier, but varnish was still suffering until I moved it.
23:33 < bblack> yeah, still none of the eqiad text caches are showing signs of connection problems.
23:33 < bblack> the last action I took was restarting cp1052, which would've (at least, the several previous times I did this) shifted the load to a random new cache
23:34 < bblack> it was ~22:40 (nearly an hour ago) when we switched from cp1053 to cp1052
23:35 < bblack> and it was during the earlier part of that nearly-an-hour window that the wmf.4 revert for wikidatawiki happened
23:35 < bblack> I think somewhere in there the underlying problem went away, and cp1052 was just still suffering lingering effects.
23:35 < bblack> that's my best hypothesis right now
[...]
23:36 < bblack> but yeah, I think the revert did it, eventually
23:37 < bblack> [plus kicking the tires of the varnish cluster to shake it out of its bad pattern afterwards]
To: BBlack
Cc: thiemowmde, aude, Marostegui, Lucas_Werkmeister_WMDE, Legoktm, tstarling, awight, Ladsgroup, Lydia_Pintscher, ori, BBlack, demon, greg, Aklapper, hoo, Lahi, GoranSMilovanovic, Th3d3v1ls, Hfbn0, QZanden, Zppix, Mkdw, Liudvikas, srodlund, Luke081515, Wikidata-bugs, ArielGlenn, faidon, zeljkofilipin, Alchimista, He7d3r, Mbch331, Rxy, Jay8g, fgiunchedi, mmodell
_______________________________________________
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs