BBlack added a comment.
Just jotting down the things I know so far from investigating this morning. I still don't have a good answer yet. Based on just the test URL, debugging it extensively at various layers:

1. The response size of that URL is in the ballpark of 32KB uncompressed, so this is not a large-objects issue. It also streams out of the backend reliably and quickly; there are no significant timing pauses.

2. Without a doubt, any time the client doesn't use AE:gzip when talking to the public endpoint, the response is corrupt.

3. Slightly deeper, even when testing requests against a single layer of varnish internally, the response to a non-AE:gzip client request is corrupt (rough curl sketches of these comparisons are at the end of this comment).

4. It's definitely happening at the Varnish<->Apache/MediaWiki boundary (disagreement) or within a single varnishd process as it prepares the response. It's not a Varnish<->Varnish, Varnish<->Nginx, or Nginx<->client issue.

5. All of the gzip/chunked stuff looks basically correct in the headers at the varnish/apache boundary and the varnish/client boundary. We do send AE:gzip when expected, we do get CE:gzip when expected (only when asked for), the gzip-ness of MW/Apache's output always correctly follows its CE:gzip header (or lack thereof), etc.

6. Curl has no problem parsing the output of a direct fetch from Apache/MediaWiki, whether using `--compressed` to set AE:gzip or not, and the results hash the same (identical content).

7. Varnish emits the corrupt content for the non-AE:gzip client regardless of whether I tweak the test scenario so that varnish is the one gzipping the content, or so that Apache/MediaWiki is the one gzipping it. So it's not an error in gzip compression of the response by just one party or the other; the error happens when gunzipping the response for the non-AE:gzip client.

8. However, when I run through a similar set of fully-debugged test scenarios for https://en.wikipedia.org/wiki/Barack_Obama , which is ~1MB in uncompressed length and similarly TE:chunked with backend gzip capabilities and do_gzip=true, on the same cluster and VCL (and even the same test machine), I don't get the corruption for a non-AE:gzip client, even though varnish is decompressing that on the fly just as with the bad test URL.

The obvious distinctions here between the Barack article and the failing test URL aren't much: api.svc vs appservers.svc shouldn't matter, right? They're both behind the same apache and hhvm configs. This leaves me guessing that there's something special about the specific output of the test URL that's causing this.

There's almost certainly a varnish bug involved here, but the question is: is this a pure varnish gunzip bug that's sensitive to certain conditions which exist for the test URL but not the Barack one? Is the output of Apache/MW buggy in some way for the test URL such that it's tripping the bug (in which case it's still a varnish bug that it doesn't reject the buggy response and turn it into a 503 or similar)? Or is it non-buggy, but "special" in a way that trips a varnish bug?
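For reference, roughly the kind of comparison I mean in points 3 and 6, sketched with curl. All of the names below are placeholders (the real test URL, Host header, and internal hosts/ports are deliberately elided here), so treat this as the shape of the test, not a copy-paste recipe:

```
# Placeholders only -- substitute the real test URL and internal endpoints.
HOST='www.wikidata.org'          # public Host header for the test URL (assumed)
TEST_PATH='/...'                 # the failing test URL's path (elided)
BACKEND='appservers.svc:80'      # direct Apache/MediaWiki service address
VARNISH='varnish-backend:3128'   # a single internal varnish instance

# Point 6: direct from Apache/MW, with and without AE:gzip -- hashes match.
curl -s -H "Host: $HOST" "http://$BACKEND$TEST_PATH" | sha1sum
curl -s --compressed -H "Host: $HOST" "http://$BACKEND$TEST_PATH" | sha1sum

# Point 3: the same pair against one varnish layer -- the non-AE:gzip request
# is the one whose body comes back corrupt (hashes no longer match).
curl -s -H "Host: $HOST" "http://$VARNISH$TEST_PATH" | sha1sum
curl -s --compressed -H "Host: $HOST" "http://$VARNISH$TEST_PATH" | sha1sum
```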
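And the header checks from point 5, same placeholder variables as above: curl -v prints request headers with "> " and response headers with "< ", so filtering those shows whether AE/CE/TE/CL are coherent at each boundary:

```
# Point 5: CE:gzip should only show up when AE:gzip was sent, and
# Transfer-Encoding/Content-Length should be consistent with it.
curl -sv -o /dev/null -H "Host: $HOST" "http://$BACKEND$TEST_PATH" 2>&1 \
  | grep -iE '^(<|>) (accept-encoding|content-encoding|transfer-encoding|content-length)'
curl -sv -o /dev/null -H "Host: $HOST" -H "Accept-Encoding: gzip" "http://$VARNISH$TEST_PATH" 2>&1 \
  | grep -iE '^(<|>) (accept-encoding|content-encoding|transfer-encoding|content-length)'
```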
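Finally, for the "is the backend's gzip output itself sane?" angle in the closing questions, one way to check independently of both curl's decoder and varnish's (again reusing the placeholder variables): fetch the raw CE:gzip body without letting curl decode it, and let gzip itself verify and decompress it:

```
# Save the raw gzipped body (no --compressed, so curl does not decode it),
# then test and decompress it with gzip and compare against the plain fetch.
curl -s -H "Host: $HOST" -H "Accept-Encoding: gzip" "http://$BACKEND$TEST_PATH" -o /tmp/body.gz
gzip -tv /tmp/body.gz              # integrity check of the gzip stream
gzip -dc /tmp/body.gz | sha1sum    # should match the uncompressed fetch's hash
curl -s -H "Host: $HOST" "http://$BACKEND$TEST_PATH" | sha1sum
```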