Hi,

tl;dr: snapshots.debian.org reproducibly truncates around 1-2% of the
downloads. While I think that's unusually high, it turns into a real
problem in combination with "lazy" clients (or proxies) that do not
rigorously check their response.


Longer description: I wrote a test script that tries a lot of HTTP GET
requests on random resources from snapshot.debian.org, collecting
statistics on the responses. Independent of the origin used [0], this
reveals the following patterns:

  98% - 99.5%  of all requests return a good status code and the
               downloaded body matches in size and by sha1 hash.

  0.5% - 1.5%  of responses are truncated, i.e. the connection closes
               before all data is received

  < 0.5%       other errors like 403, 503 or hash mismatches

Interestingly, for immediate retries, the distribution changes a lot for
the second attempt:

  50% - 76%    successes

  17.6% - 50%  truncated responses

  up to ca. 5% other errors

Further retries show a roughly similar success rate, so that eventually,
after enough retries, the client gets the expected response.

(With deferred retries, giving varnish a chance to populate the cache,
the second attempt stats tend a lot towards 100% success, therefore I
conclude that the truncation failures are related to cache misses.)


Manual inspection of the HTTP response headers from snapshot.debian.org
reveals, that varnish is in use (and that somebody there is a Terry
Pratchett fan). It seems varnish delivers cached resources with the
usual "Content-Length" field in its headers, but uses
"Transfer-Encoding: chunked" for cache misses. (Which sounds like an
entirely reasonable thing.)

But back to the facts: with an additional apt-cache-ng proxy in between,
the success rate for follow-up requests drops to nearly 0%. The result
distribution for the second attempt then looks more like:

  97.4%        hash mismatch
  1.2%         successes
  1.2%         error 503

For further attempts, these tend towards 100% hash mismatches, because
apt-cache-ng stores an invalid (truncated) copy in its cache.

I tested chunk transfer encoding of apt-cacher-ng a bit, but since
version 0.8.0_pre2-1 of apt-cacher-ng, it seems handle chunking just
fine [1]. And I tried to reproduce the error on the varnish side, but
without success so far.

So I'm not entirely sure where exactly the problem lies, but I am
currently suspecting varnish because of two rather vague observations:

 a) on traffic intercepted between a apt-cacher-ng and varnish, I saw
    an incorrect resource length advertized by varnish (only once, by
    manual interception) [2]. I intend to try to reproduce this
    automatically to test the varnish side.

 b) a varnish issue that sounds related and got fixed just a couple
    days ago:
    https://github.com/varnishcache/varnish-cache/issues/2035

    I very much doubt this fix has already made it to the snapshots
    infrastructure. It certainly didn't arrive in Debian, yet.

I cannot proof either of these is really related or the cause of the
hash mismatches for the combination with varnish with apt-cacher-ng.

Before continuing to investigate, I simply wanted to share my findings.
I'd welcome feedback as well as hints about the infrastructure, that
might be relevant. I intend to continue analyzing varnish and
apt-cacher-ng to be able to eventually solve this issue.

Kind Regards

Markus Wanner



[0]: I tried from 5 different sources with highly varying bandwidth.
Most of those are located in Europe, though. And during European daytime.


[1]: Chunked transfer encoding seems to work since this commit:

commit eebb3c93fd63ceb479b7100f71861069f5b16914
Author: Eduard Bloch <bl...@debian.org>
Date:   Sat Aug 30 20:33:49 2014 -0700

    Fix for premature abort of chunked transfer


[2]: The intercepted request on a file weighting 474 KiB:

GET
/archive/debian/20140818T083621Z/pool/main/libg/libgnomeui/libgnomeui-dev_2.24.5-2_kfreebsd-amd64.deb
HTTP/1.1
User-Agent: Debian Apt-Cacher-NG/0.8.0
Host: snapshot.debian.org
Range: bytes=122879-
Accept: */*
Accept-Encoding: identity
Connection: keep-alive

HTTP/1.1 206 Partial Content
Date: Thu, 08 Dec 2016 14:46:21 GMT
Server: Apache
Expires: Sun, 18 Dec 2016 14:46:21 GMT
Cache-Control: public, max-age=864000
ETag: "1aef9a4d688e1e1a0bd09427b894e47fd37413dc"
Last-Modified: Mon, 05 Sep 2011 15:23:40 GMT
X-Clacks-Overhead: GNU Terry Pratchett
Content-Type: application/x-debian-package
X-Varnish: 102013332
Age: 0
Via: 1.1 varnish-v4
Transfer-Encoding: chunked
Connection: keep-alive
Accept-Ranges: bytes
Content-Range: bytes 122879-122879/122880

001
.
0

i.e. varnish answered with a chunk of one byte (which may or may not be
correct) but a known incorrect value for the complete-range. I certainly
wouldn't blame apt-cacher-ng for truncation of the file in that case.


Attachment: signature.asc
Description: OpenPGP digital signature

Reply via email to