BBlack added a comment.

  So, as it turns out, this is a general varnishd bug in our specific varnishd 
build.  For purposes of this bug, our varnishd code is essentially 3.0.7 plus a 
bunch of ancient forward-ported 'plus' patches related to streaming.  We're 
missing 
https://github.com/varnishcache/varnish-cache/commit/72981734a141a0a52172b85bae55f8877f69ff42
 (a do_gzip + do_stream content-length bug for HTTP/1.0 requests, which is 
eerily similar to this issue, but not quite the same) because it doesn't apply 
cleanly/sanely to our codebase due to conflicts with those streaming patches.
  
  What I can reliably and predictably observe and control for now is: we have a 
response-length-specific response corruption bug, only when both of these 
conditions are met:
  
  1. do_stream is in effect for this request (for the text cluster, this means 
it's pass or initial miss+chfp (Created-Hit-For-Pass) traffic)
  2. the response has to be gunzipped for the client (client does not advertise 
gzip support, but backend response is gzipped by the applayer, or gzipped by 
varnish due to do_gzip rules).
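
  As a rough illustration only (not varnishd code), the two conditions above 
can be written as a small predicate.  All function and parameter names here are 
hypothetical.  Both conditions are necessary but not sufficient, since the 
corruption also depends on the response length:

```python
def needs_gunzip(client_accepts_gzip: bool, object_is_gzipped: bool) -> bool:
    """Condition 2: the response must be gunzipped for the client."""
    return object_is_gzipped and not client_accepts_gzip


def on_affected_code_path(do_stream: bool,
                          client_accepts_gzip: bool,
                          backend_gzipped: bool,
                          do_gzip: bool) -> bool:
    """True when a request meets both conditions described above.

    Hypothetical model: the object is gzipped either because the applayer
    sent it gzipped or because varnish gzipped it due to do_gzip rules.
    """
    object_is_gzipped = backend_gzipped or do_gzip
    return do_stream and needs_gunzip(client_accepts_gzip, object_is_gzipped)
```

  For example, a streamed pass request from a client without gzip support, 
against a gzipped backend response, returns True; the same request with 
do_stream off returns False.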
  
  In a lot of the test scenarios/requests that others and I were using 
previously, we weren't necessarily controlling for these variables well, which 
led to a lot of inconsistent results (notably, X-Wikimedia-Debug effectively 
turns non-pass traffic into pass traffic when debugging, but the same might not 
be true if testing directly from varnish to mw1017 without X-Wikimedia-Debug).
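
  One way to control for these variables is to enumerate every combination of 
the relevant request headers up front and run the same URL through each.  This 
is a minimal sketch: the helper name is hypothetical, and the X-Wikimedia-Debug 
value is a placeholder, not the real header syntax:

```python
from itertools import product


def request_variants(debug_header_value="PLACEHOLDER"):
    """Return header sets covering gzip on/off and X-Wikimedia-Debug on/off.

    debug_header_value is a placeholder; substitute whatever value the
    debug header actually needs in your environment.
    """
    variants = []
    for accept_gzip, use_debug in product((False, True), repeat=2):
        headers = {}
        if accept_gzip:
            headers["Accept-Encoding"] = "gzip"
        if use_debug:
            headers["X-Wikimedia-Debug"] = debug_header_value
        variants.append(headers)
    return variants
```

  Per the conditions above, the variants without Accept-Encoding are the ones 
that force a gunzip for the client, and the ones with X-Wikimedia-Debug are 
the ones that get treated as pass traffic.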
  
  The do_gzip (and related gunzip) behaviors have been in place for a long 
time.  What's new lately are the do_stream behaviors, which were added to the 
cache_text cluster in the past couple of months for the pass-traffic cases.  
cache_upload has had do_stream for certain requests for a very long time, but 
various constraints there conspire to make it accidentally-unlikely we'll 
observe this bug on cache_upload for legitimate traffic.  cache_misc probably 
suffers from this as well, though the conditions under which it will or won't 
are trickier to pin down there.  Almost surely this is related to 
https://phabricator.wikimedia.org/T133490 as well.
  
  So the basic game plan for this bug is:
  cache_text - revert the relatively recent do_stream-enabling VCL patches.
  cache_misc - will resolve itself with the varnish4 upgrade, which is imminent 
for this cluster.
  cache_upload - keep ignoring what is probably a non-problem in practice there 
for now; it will eventually get fixed by the varnish4 upgrade.
  cache_maps - already on varnish4, so it wouldn't have this issue.

TASK DETAIL
  https://phabricator.wikimedia.org/T133866

To: BBlack
Cc: Ricordisamoa, Trung.anh.dinh, MZMcBride, Anomie, Yurivict, TerraCodes, 
Orlodrim, BBlack, akosiaris, zhuyifei1999, elukey, ema, Aklapper, hoo, 
D3r1ck01, Izno, Wikidata-bugs, aude, Mbch331, Jay8g, jeremyb


