BBlack added a comment.
Updates from the Varnish side of things today (since I've been bad about getting commits/logs tagged onto this ticket):
- 18:15 - I took over looking at today's outburst on the Varnish side
- The current target at the time was cp1053 (after elukey's earlier restart of cp1055 varnish-be above)
- 18:21 - I manually reduced the backend timeouts for api+appservers from the connect/firstbyte/betweenbytes defaults of 5/180/60 down to 3/20/10
- cp1053 had already been under assault for quite a while though, and this didn't seem to help much.
- 18:39 - restarted varnish-be on cp1053 to clear it of issues and move to a new target
- 18:41 - identified cp1065 as the new target (more on identifying these below!)
- 18:42 - Merged->deployed https://gerrit.wikimedia.org/r/#/c/387024/ to apply the shorter-timeouts workaround above to all text caches
- at this point, cp1065 was showing various signs of the issue (rising connection counts + mailbox lag), but connection counts stabilized much lower than before, ~200-300 instead of rising towards ~3K, an apparent success of the timeout-reduction mitigation.
- 18:56 - Identified the first slow-running requests in cp1065 logs with the reduced timeouts:
18:56 < bblack> - BereqURL /w/api.php?action=""
18:56 < bblack> - BereqHeader Host: www.wikidata.org
18:56 < bblack> - Timestamp Bereq: 1509216884.761646 0.000042 0.000042
18:56 < bblack> - Timestamp Beresp: 1509216965.538549 80.776945 80.776903
18:56 < bblack> - Timestamp Error: 1509216965.538554 80.776950 0.000005
18:56 < bblack> - Timestamp Start: 1509216970.911803 0.000000 0.000000
- after this, identified several other slow requests; all followed the same basic pattern: the flow-parsoid-utils API action against www.wikidata.org
- 19:39 - hoo's parsoid timeout reduction for Flow (above) hits
- 19:39 - restarted varnish-backend on cp1065 due to rising mailbox lag
- 19:41 - the new target briefly seemed to be cp1067, but within a minute or two it recovered to a normal state and mostly stopped exhibiting the symptoms; apparently the problem-causing traffic died off on its own, at least temporarily.
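For context, the shorter-timeouts workaround deployed above amounts to lowering the three per-backend timeout settings in VCL. A minimal sketch of where those knobs live (this is illustrative, not the actual contents of the Gerrit change; the backend name and address here are assumptions):

```vcl
# Hypothetical VCL backend definition showing the three timeouts that were
# reduced (5/180/60 -> 3/20/10); the real change templates these across
# all text caches rather than hardcoding one backend like this.
backend be_api_appservers {
    .host = "api-appservers.svc.eqiad.wmnet";
    .port = "80";
    .connect_timeout = 3s;          # default was 5s
    .first_byte_timeout = 20s;      # default was 180s
    .between_bytes_timeout = 10s;   # default was 60s
}
```

The point of the reduction is to stop slow-to-respond backend requests from pinning varnishd backend connections open for minutes at a time, which is what drives the connection-count buildup described below.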
For future reference by another opsen who might be looking at this: one of the key metrics identifying what we've been calling the "target cache" in eqiad (the one that will eventually have issues due to whatever bad traffic is currently mapped through it) is the connection count to appservers.svc.eqiad.wmnet + api-appservers.svc.eqiad.wmnet on each of the eqiad cache nodes. For this, I've been using:
bblack@neodymium:~$ sudo cumin A:cp-text_eqiad 'netstat -an|egrep "10\.2\.2\.(1|22)"|awk "{print \$5}"|sort|uniq -c|sort -n'
Which during the latter/worst part of cp1053's earlier target-period produced output like:
===== NODE GROUP =====
(1) cp1068.eqiad.wmnet
----- OUTPUT of 'netstat -an|egre...|uniq -c|sort -n' -----
      1 10.2.2.18:8080
     15 10.2.2.17:7231
     79 10.2.2.1:80
    111 10.2.2.22:80
===== NODE GROUP =====
(1) cp1066.eqiad.wmnet
----- OUTPUT of 'netstat -an|egre...|uniq -c|sort -n' -----
      1 10.2.2.18:8080
     14 10.2.2.17:7231
     92 10.2.2.1:80
    111 10.2.2.22:80
===== NODE GROUP =====
(1) cp1052.eqiad.wmnet
----- OUTPUT of 'netstat -an|egre...|uniq -c|sort -n' -----
      1 10.2.2.18:8080
     14 10.2.2.17:7231
    100 10.2.2.1:80
    116 10.2.2.22:80
===== NODE GROUP =====
(1) cp1054.eqiad.wmnet
----- OUTPUT of 'netstat -an|egre...|uniq -c|sort -n' -----
      1 10.2.2.18:8080
     17 10.2.2.17:7231
    125 10.2.2.1:80
    126 10.2.2.22:80
===== NODE GROUP =====
(1) cp1067.eqiad.wmnet
----- OUTPUT of 'netstat -an|egre...|uniq -c|sort -n' -----
      1 10.2.2.18:8080
     16 10.2.2.17:7231
     78 10.2.2.1:80
    140 10.2.2.22:80
===== NODE GROUP =====
(1) cp1065.eqiad.wmnet
----- OUTPUT of 'netstat -an|egre...|uniq -c|sort -n' -----
      1 10.2.2.18:8080
     46 10.2.2.17:7231
     85 10.2.2.1:80
    113 10.2.2.22:80
===== NODE GROUP =====
(1) cp1053.eqiad.wmnet
----- OUTPUT of 'netstat -an|egre...|uniq -c|sort -n' -----
      2 10.2.2.18:8080
   1140 10.2.2.17:7231
   3031 10.2.2.22:80
   3345 10.2.2.1:80
===== NODE GROUP =====
(1) cp1055.eqiad.wmnet
----- OUTPUT of 'netstat -an|egre...|uniq -c|sort -n' -----
      3 10.2.2.18:8080
     27 10.2.2.17:7231
     98 10.2.2.1:80
    123 10.2.2.22:80
================
Above, you can see most of the nodes have ~100 sockets to .1 and .22 (appservers and api-appservers), but cp1053 has ~3K (most of which will be in CLOSE_WAIT). It takes a while to build up to ~3K under the pressure of the problem, but that's about the plateau we've seen. Once a node's mailbox lag gets bad and the connection counts are up in this range, usually the recourse is to restart that varnish backend (run-no-puppet varnish-backend-restart) to move the problem to a new node and restart the clock on the badness impacting a varnishd.
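Spotting the outlier in output like the above can be automated. A hypothetical helper (not something we currently run): filter the `uniq -c` lines for any connection count above a threshold. The 1000-connection threshold here is an assumption based on the ~100-vs-~3K split described above.

```shell
#!/bin/sh
# Flag any "count address:port" line (as emitted by the netstat|uniq -c
# pipeline above) whose connection count exceeds a threshold; on the cumin
# output, the flagged lines fall under the likely "target cache" node.
awk -v thresh=1000 '$1 > thresh { print "possible target:", $0 }'
```

This could be appended to the cumin pipeline, or run per-node against the raw `netstat -an | egrep ... | sort | uniq -c` output.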
_______________________________________________
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs