BBlack added a comment.

  Just jotting down the things I know so far from investigating this morning.
I still don't have a good answer.
  
  Based on the test URL alone, after debugging it extensively at various layers:
  
  1. The response size of that URL is in the ballpark of 32KB uncompressed, so
this is not a large-objects issue.  It also streams out of the backend reliably
and quickly; there are no significant timing pauses.
  2. Without a doubt, anytime the client doesn't use AE:gzip when talking to 
the public endpoint, the response is corrupt.
  3. Slightly deeper: even when testing requests against a single layer of
varnish internally, the response to a non-AE:gzip client request is corrupt (a
rough sketch of this with/without-AE:gzip comparison follows this list).
  4. It's definitely happening at the Varnish<->Apache/MediaWiki boundary 
(disagreement) or within a single varnishd process as it prepares the response. 
 It's not a Varnish<->Varnish, Varnish<->Nginx, or Nginx<->client issue.
  5. All of the gzip/chunked stuff looks basically correct in the headers at
the varnish/apache boundary and the varnish/client boundary.  We do send
AE:gzip when expected, we do get CE:gzip when expected (only when asked for),
the gzip-ness of MW/Apache's output always correctly follows its CE:gzip header
(or lack thereof), etc.
  6. Curl has no problem parsing the output of a direct fetch from 
Apache/MediaWiki, whether using `--compressed` to set AE:gzip or not, and the 
results hash the same (identical content).
  7. Varnish emits the corrupt content for the non-AE:gzip client regardless of
whether I tweak the test scenario to ensure that varnish is the one gzipping
the content, or that Apache/MediaWiki is the one gzipping the content.  So it's
not an error in gzip compression of the response by just one party or the
other; the error happens when gunzipping the response for the non-AE:gzip
client.
  8. However, when I run through a similar set of fully-debugged test scenarios
for https://en.wikipedia.org/wiki/Barack_Obama, which is ~1MB in uncompressed
length, similarly TE:chunked with backend gzip capabilities and do_gzip=true,
on the same cluster and VCL (and even the same test machine), I don't get the
corruption for a non-AE:gzip client, even though varnish is decompressing it on
the fly just as with the bad test URL.
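
  For reference, the with/without-AE:gzip comparison from points 2, 3 and 6
boils down to something like the sketch below.  The URL is a placeholder, not
the actual test URL, and the same idea applies whether it's pointed at nginx, a
single varnish layer, or Apache/MediaWiki directly (Python, stdlib only):

```python
#!/usr/bin/env python3
# Sketch: fetch the same URL with and without Accept-Encoding: gzip, decode the
# gzipped variant locally, and check whether the two bodies hash the same.
# TEST_URL is a hypothetical stand-in for the actual failing URL.
import gzip
import hashlib
import urllib.request

TEST_URL = "https://example.org/some/test/url"  # placeholder

def fetch(url, ask_gzip):
    req = urllib.request.Request(url)
    if ask_gzip:
        req.add_header("Accept-Encoding", "gzip")
    with urllib.request.urlopen(req) as resp:
        body = resp.read()
        if resp.headers.get("Content-Encoding") == "gzip":
            body = gzip.decompress(body)
    return body

plain = fetch(TEST_URL, ask_gzip=False)
gzipped = fetch(TEST_URL, ask_gzip=True)
print("plain:", len(plain), hashlib.sha1(plain).hexdigest())
print("gzip :", len(gzipped), hashlib.sha1(gzipped).hexdigest())
print("bodies identical:", plain == gzipped)
```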
  
  The obvious distinctions here between the Barack article and the failing test
URL aren't much: api.svc vs appservers.svc shouldn't matter, right?  They're
both behind the same apache and hhvm configs.  This leaves me guessing that
there's something special about the specific output of the test URL that's
causing this (a quick side-by-side probe of the two responses is sketched
below).
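
  One way to chase that "something special" is a side-by-side probe of the two
responses: encoding headers, compressed size, decompressed size.  A minimal
sketch along those lines (FAILING_URL is a placeholder for the actual test
URL):

```python
#!/usr/bin/env python3
# Sketch: fetch both URLs with AE:gzip and print the encoding-related headers
# plus compressed/decompressed sizes, looking for anything that differs between
# the failing URL and the known-good Barack Obama article.
import gzip
import urllib.request

FAILING_URL = "https://www.wikidata.org/some/failing/url"  # placeholder
CONTROL_URL = "https://en.wikipedia.org/wiki/Barack_Obama"

def probe(url):
    req = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(req) as resp:
        raw = resp.read()
        ce = resp.headers.get("Content-Encoding")
        te = resp.headers.get("Transfer-Encoding")
    body = gzip.decompress(raw) if ce == "gzip" else raw
    print(url)
    print("  Content-Encoding:", ce, "| Transfer-Encoding:", te)
    print("  compressed bytes:", len(raw), "| decompressed bytes:", len(body))

probe(FAILING_URL)
probe(CONTROL_URL)
```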
  
  There's almost certainly a varnish bug involved here, but the question is: is
this a pure varnish gunzip bug that's sensitive to certain conditions which
exist for the test URL but not the Barack one?  Or is the output of Apache/MW
buggy in some way for the test URL such that it trips the bug (in which case
it's still a varnish bug that it doesn't reject the buggy response and turn it
into a 503 or similar)?  Or is that output non-buggy, but "special" in a way
that trips a varnish bug?
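
  One check that would help separate those cases: pull the response straight
from Apache/MediaWiki with AE:gzip and run it through a strict gunzip, making
sure the stream terminates cleanly and leaves no trailing bytes.  A minimal
sketch (the backend address and Host header are placeholders, and it assumes
the backend actually returns CE:gzip):

```python
#!/usr/bin/env python3
# Sketch: strict validation of the backend's gzip output.  Decompress with a
# raw zlib gzip-mode decompressor and check that the stream ends cleanly (valid
# trailer) with no unconsumed bytes left over.  A zlib.error here would itself
# point at malformed backend output.
import urllib.request
import zlib

BACKEND_URL = "http://mw-backend.example/some/test/url"  # placeholder
req = urllib.request.Request(
    BACKEND_URL,
    headers={"Accept-Encoding": "gzip", "Host": "www.wikidata.org"},
)
with urllib.request.urlopen(req) as resp:
    raw = resp.read()

d = zlib.decompressobj(wbits=31)  # wbits=31 selects the gzip wrapper
body = d.decompress(raw) + d.flush()
print("decompressed bytes:", len(body))
print("stream ended cleanly:", d.eof)
print("trailing bytes after gzip stream:", len(d.unused_data))
```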

TASK DETAIL
  https://phabricator.wikimedia.org/T133866
