BBlack added a comment.
So, as it turns out, this is a general varnishd bug in our specific varnishd build. For purposes of this bug, our varnishd code is essentially 3.0.7 plus a bunch of ancient forward-ported 'plus' patches related to streaming. We're missing https://github.com/varnishcache/varnish-cache/commit/72981734a141a0a52172b85bae55f8877f69ff42 (do_gzip + do_stream content-length bug for HTTP/1.0 reqs, which is eerily similar to this issue, but not quite the same) because it doesn't apply cleanly/sanely to our codebase due to conflicts with those streaming patches.

What I can reliably and predictably observe and control for at this point is: we have a response-length-specific response corruption bug, which occurs only when both of these conditions are met:

1. do_stream is in effect for this request (for the text cluster, this means it's pass or initial miss+chfp (Created-Hit-For-Pass) traffic).
2. The response has to be gunzipped for the client (the client does not advertise gzip support, but the backend response is gzipped by the applayer, or gzipped by varnish due to do_gzip rules).

In a lot of the test scenarios/requests that others and I were using previously, we weren't controlling for these variables well, which led to a lot of inconsistent results. Notably, X-Wikimedia-Debug effectively turns non-pass traffic into pass traffic when debugging, but the same might not be true when testing directly from varnish to mw1017 without X-Wikimedia-Debug.

The do_gzip (and related gunzip) behaviors have been in place for a long time. What's new lately is the do_stream behaviors, which were added to the cache_text cluster in the past couple of months for the pass-traffic cases. cache_upload has had do_stream for certain requests for a very long time, but various constraints there conspire to make it accidentally unlikely that we'll observe this bug on cache_upload for legitimate traffic.
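
To make the two trigger conditions listed above easier to control for when building test requests, here is a minimal Python sketch of them as a predicate. The Exchange class and its field names are hypothetical and exist only for illustration; they are not part of varnish or of our VCL. Meeting both conditions only means a request is exposed to the bug: whether corruption actually shows up still depends on the response length.

```python
from dataclasses import dataclass


@dataclass
class Exchange:
    """Hypothetical summary of one request/response pair going through varnish."""
    do_stream: bool            # streaming is in effect (e.g. pass traffic on cache_text)
    client_accepts_gzip: bool  # client advertised gzip support in Accept-Encoding
    body_is_gzipped: bool      # body gzipped by the applayer or by do_gzip rules


def needs_gunzip(x: Exchange) -> bool:
    # Varnish has to gunzip for the client only when the body is gzipped
    # and the client did not advertise gzip support.
    return x.body_is_gzipped and not x.client_accepts_gzip


def exposed_to_bug(x: Exchange) -> bool:
    # Both conditions from the comment must hold. This is necessary but
    # not sufficient: the corruption is also response-length-specific.
    return x.do_stream and needs_gunzip(x)
```

For example, exposed_to_bug(Exchange(do_stream=True, client_accepts_gzip=False, body_is_gzipped=True)) is True, while flipping any one of the three fields makes it False. That is the matrix a reproduction attempt needs to pin down explicitly.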

cache_misc probably suffers from this as well, though the conditions under which it will or won't are trickier in this case. Almost surely this is related to https://phabricator.wikimedia.org/T133490 as well.

So the basic game plan for this bug is:

cache_text - revert the relatively-recent do_stream-enabling VCL patches.
cache_misc - will resolve itself with the varnish4 upgrade, which is imminent for this cluster.
cache_upload - keep ignoring what is probably a non-problem in practice there for now; it will eventually get fixed up with the varnish 4 upgrade.
cache_maps - already varnish4, wouldn't have this issue.

TASK DETAIL
https://phabricator.wikimedia.org/T133866