BBlack added a comment.

Updates from the Varnish side of things today (since I've been bad about getting commits/logs tagged onto this ticket):

  • 18:15 - I took over looking at today's outburst on the Varnish side
  • The current target at the time was cp1053 (after elukey's earlier restart of cp1055 varnish-be above)
  • 18:21 - I manually reduced the backend timeouts for api+appservers to from defaults of connect/firstbyte/betweenbytes of 5/180/60 to 3/20/10
  • cp1053 had already been under assault for quite a while though, and this didn't seem to help much.
  • 18:39 - restarted varnish-be on cp1053 to clear it of issues and move to a new target
  • 18:41 - identified cp1065 as the new target (more on identifying these below!)
  • 18:42 - Merged->deployed https://gerrit.wikimedia.org/r/#/c/387024/ to apply the shorter-timeouts workaround above to all text caches
  • at this point, cp1065 was showing various signs of the issue (rising connection counts + mailbox lag), but connection counts stabilized much lower than before, ~200-300 instead of rising towards ~3K, an apparent success of the timeout-reduction mitigation.
  • 18:56 - Identified the first slow-running requests in cp1065 logs with the reduced timeouts:
18:56 < bblack> -   BereqURL       
/w/api.php?action=""
18:56 < bblack> -   BereqHeader    Host: www.wikidata.org
18:56 < bblack> -   Timestamp      Bereq: 1509216884.761646 0.000042 0.000042
18:56 < bblack> -   Timestamp      Beresp: 1509216965.538549 80.776945 80.776903
18:56 < bblack> -   Timestamp      Error: 1509216965.538554 80.776950 0.000005
18:56 < bblack> -   Timestamp      Start: 1509216970.911803 0.000000 0.000000

  • after this, identified several other slow requests. All were for the same basic flow-parsoid-utils API + www.wikidata.org
  • 19:39 - hoo's parsoid timeout reduction for Flow (above) hits
  • 19:39 - restarted varnish-backend on cp1065 due to rising mailbox lag
  • 19:41 - new target seems to be cp1067, briefly, but within a minute or two it recovers to normal state and stops exhibiting the symptoms much? Apparently the problem-causing traffic may have temporarily died off on its own.

For future reference by another opsen who might be looking at this: one of the key metrics that identifies what we've been calling the "target cache" in eqiad, the one that will (eventually) have issues due to whatever bad traffic is currently mapped through it, is by looking at the connection counts to appservers.svc.eqiad.wmnet + api-appservers.svc.eqiad.wmnet on all the eqiad cache nodes. For this, I've been using:

bblack@neodymium:~$ sudo cumin A:cp-text_eqiad 'netstat -an|egrep "10\.2\.2\.(1|22)"|awk "{print \$5}"|sort|uniq -c|sort -n'

Which during the latter/worst part of cp1053's earlier target-period produced output like:

===== NODE GROUP =====                                                                                                                                                                         
(1) cp1068.eqiad.wmnet                                                                                                                                                                         
----- OUTPUT of 'netstat -an|egre...|uniq -c|sort -n' -----                                                                                                                                    
      1 10.2.2.18:8080                                                                                                                                                                         
     15 10.2.2.17:7231                                                                                                                                                                         
     79 10.2.2.1:80
    111 10.2.2.22:80
===== NODE GROUP =====                                                                                                                                                                         
(1) cp1066.eqiad.wmnet                                                                                                                                                                         
----- OUTPUT of 'netstat -an|egre...|uniq -c|sort -n' -----                                                                                                                                    
      1 10.2.2.18:8080                                                                                                                                                                         
     14 10.2.2.17:7231                                                                                                                                                                         
     92 10.2.2.1:80
    111 10.2.2.22:80
===== NODE GROUP =====                                                                                                                                                                         
(1) cp1052.eqiad.wmnet                                                                                                                                                                         
----- OUTPUT of 'netstat -an|egre...|uniq -c|sort -n' -----                                                                                                                                    
      1 10.2.2.18:8080                                                                                                                                                                         
     14 10.2.2.17:7231                                                                                                                                                                         
    100 10.2.2.1:80
    116 10.2.2.22:80
===== NODE GROUP =====                                                                                                                                                                         
(1) cp1054.eqiad.wmnet                                                                                                                                                                         
----- OUTPUT of 'netstat -an|egre...|uniq -c|sort -n' -----                                                                                                                                    
      1 10.2.2.18:8080                                                                                                                                                                         
     17 10.2.2.17:7231                                                                                                                                                                         
    125 10.2.2.1:80
    126 10.2.2.22:80
===== NODE GROUP =====                                                                                                                                                                         
(1) cp1067.eqiad.wmnet                                                                                                                                                                         
----- OUTPUT of 'netstat -an|egre...|uniq -c|sort -n' -----                                                                                                                                    
      1 10.2.2.18:8080                                                                                                                                                                         
     16 10.2.2.17:7231                                                                                                                                                                         
     78 10.2.2.1:80
    140 10.2.2.22:80
===== NODE GROUP =====                                                                                                                                                                         
(1) cp1065.eqiad.wmnet                                                                                                                                                                         
----- OUTPUT of 'netstat -an|egre...|uniq -c|sort -n' -----                                                                                                                                    
      1 10.2.2.18:8080                                                                                                                                                                         
     46 10.2.2.17:7231                                                                                                                                                                         
     85 10.2.2.1:80
    113 10.2.2.22:80
===== NODE GROUP =====                                                                                                                                                                         
(1) cp1053.eqiad.wmnet                                                                                                                                                                         
----- OUTPUT of 'netstat -an|egre...|uniq -c|sort -n' -----                                                                                                                                    
      2 10.2.2.18:8080                                                                                                                                                                         
   1140 10.2.2.17:7231                                                                                                                                                                         
   3031 10.2.2.22:80
   3345 10.2.2.1:80
===== NODE GROUP =====                                                                                                                                                                         
(1) cp1055.eqiad.wmnet                                                                                                                                                                         
----- OUTPUT of 'netstat -an|egre...|uniq -c|sort -n' -----                                                                                                                                    
      3 10.2.2.18:8080                                                                                                                                                                         
     27 10.2.2.17:7231                                                                                                                                                                         
     98 10.2.2.1:80
    123 10.2.2.22:80
================                                         `

Above, you can see most of the nodes have ~100 sockets to .1 and .22 (appservers and api-appservers), but cp1053 has ~3K (most of which will be in CLOSE_WAIT). It takes a while to build up to ~3K under the pressure of the problem, but that's about the plateau we've seen. Once a node's mailbox lag gets bad and the connection counts are up in this range, usually the recourse is to restart that varnish backend (run-no-puppet varnish-backend-restart) to move the problem to a new node and restart the clock on the badness impacting a varnishd.


TASK DETAIL
https://phabricator.wikimedia.org/T179156

EMAIL PREFERENCES
https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: BBlack
Cc: Smalyshev, TerraCodes, Jay8g, Liuxinyu970226, Paladox, Zppix, Stashbot, gerritbot, thiemowmde, aude, Marostegui, Lucas_Werkmeister_WMDE, Legoktm, tstarling, awight, Ladsgroup, Lydia_Pintscher, ori, BBlack, demon, greg, Aklapper, hoo, Lahi, Lordiis, GoranSMilovanovic, Adik2382, Th3d3v1ls, Hfbn0, Ramalepe, Liugev6, QZanden, Lewizho99, Maathavan, Mkdw, Liudvikas, srodlund, Luke081515, Wikidata-bugs, ArielGlenn, faidon, zeljkofilipin, Alchimista, He7d3r, Mbch331, Rxy, fgiunchedi, mmodell
_______________________________________________
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs

Reply via email to