Hi, this plugin for Apache Traffic Server now checks the "Digest:
SHA-256=..." header. It's still a proof of concept, but it's up on
GitHub [1]. Here is a post to the Traffic Server developers list with
details [2].

The plugin computes SHA-256 digests for responses from origin servers.
Then, given a response with a "Location: ..." header and a "Digest:
SHA-256=..." header, if the "Location: ..." URL is not already cached
but the digest matches content in the cache, it rewrites the
"Location: ..." header to the cached URL. This should redirect
clients to mirrors that are already cached.
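The decision the plugin makes can be sketched like this (a toy Python
model for illustration, not the plugin's actual C code against the
Traffic Server API; the DigestCache class is a made-up stand-in for
the cache and its digest-to-URL index):

```python
import base64
import hashlib

class DigestCache:
    """Toy stand-in for the Traffic Server cache. An assumption for
    illustration: cached bodies are indexed both by URL and by their
    base64 SHA-256 digest."""

    def __init__(self):
        self.bodies = {}     # URL -> response body
        self.by_digest = {}  # base64 SHA-256 digest -> cached URL

    def store(self, url, body):
        digest = base64.b64encode(hashlib.sha256(body).digest()).decode()
        self.bodies[url] = body
        self.by_digest[digest] = url

def rewrite_location(location_url, digest_value, cache):
    """Return the URL the client should be redirected to, given a
    response's "Location: ..." URL and its "Digest: SHA-256=..." value."""
    if location_url in cache.bodies:
        return location_url            # target mirror already cached
    cached_url = cache.by_digest.get(digest_value)
    if cached_url is not None:
        return cached_url              # same bytes under a cached URL
    return location_url                # no match: pass through unchanged
```

So if mirror A is cached and a download redirects to mirror B with a
matching digest, the client is sent to mirror A instead.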

I'd love any feedback on this approach.

Next steps are to check whether the code quality is good enough. Does
it tie up the event loop while computing digests? Also, the proof of
concept maps digests to cached URLs by storing the URLs as objects in
the Traffic Server cache. It works, but other alternatives were
discussed, like extending the core with new APIs, KyotoDB, or
Memcached. What is the ideal way to store digests?
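One way to picture the current scheme, storing the mapping in the
ordinary object cache, is to derive a synthetic cache key from the
digest (a hypothetical key scheme for illustration; the "dedup.invalid"
namespace is made up, not necessarily what the plugin uses):

```python
import base64
import hashlib

def digest_cache_key(body):
    """Derive a synthetic, deterministic cache URL from a response
    body's SHA-256 digest, so the digest -> URL mapping can be stored
    as an ordinary cache object under that key. The "dedup.invalid"
    host is a made-up namespace for this sketch."""
    digest = base64.urlsafe_b64encode(hashlib.sha256(body).digest()).decode()
    return "http://dedup.invalid/" + digest
```

Because the key is a function of the content alone, any response with
the same bytes maps to the same cache object, with no extra storage
backend needed.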

This proof of concept handles the case where content is already cached
but the cached URL isn't listed among the "Link: <...>; rel=duplicate"
headers, maybe because it was downloaded from a server not
participating in the CDN, or because there are too many mirrors to
list in "Link: <...>; rel=duplicate" headers. This also potentially
reduces the number of cache reads: the "Link: <...>; rel=duplicate"
headers mean scanning URLs until one is found that's already cached or
the list is exhausted, whereas the "Digest: SHA-256=..." header means
a constant number of lookups.
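The difference in cache reads can be sketched like this (a simplified
lookup model, not Traffic Server's actual cache API):

```python
def find_cached_via_links(link_urls, cached_urls):
    """Scan "Link: <...>; rel=duplicate" URLs in order: worst case,
    one cache lookup per listed mirror. Returns (url, lookup_count)."""
    lookups = 0
    for url in link_urls:
        lookups += 1
        if url in cached_urls:
            return url, lookups
    return None, lookups

def find_cached_via_digest(digest, digest_index):
    """A single lookup in a digest -> URL index, no matter how many
    mirrors exist. Returns (url, lookup_count)."""
    return digest_index.get(digest), 1
```

With five listed mirrors and only the last one cached, the Link scan
costs five lookups while the digest lookup costs one.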

RFC 6249 requires a "Digest: SHA-256=..." header; without one, the
"Link: <...>; rel=duplicate" headers MUST be ignored:

>   If Instance Digests are not provided by the Metalink servers, the
>   Link header fields pertaining to this specification MUST be ignored.

>   Metalinks contain whole file hashes as described in
>   Section 6, and MUST include SHA-256, as specified in [FIPS-180-3].
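That requirement could be enforced with a check like this (a
simplified header model for illustration: a dict of lowercase field
names to lists of raw values, with no full Link or Digest header
parsing):

```python
def usable_duplicate_links(headers):
    """Return the "Link: <...>; rel=duplicate" values that may be
    honored. Per RFC 6249, if no SHA-256 Instance Digest is provided,
    these Link header fields MUST be ignored."""
    digests = headers.get("digest", [])
    if not any(v.strip().startswith("SHA-256=") for v in digests):
        return []
    return [v for v in headers.get("link", []) if "rel=duplicate" in v]
```

A response listing mirrors but carrying no SHA-256 digest yields an
empty list, so its duplicate links are never consulted.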

Other next steps might be to add more Metalink client features, e.g.
downloading segments from multiple origin servers in parallel. A first
step might be, given a "Location: ..." header and "Link: <...>;
rel=duplicate" headers, to transparently request the resource and
replace the whole response instead of rewriting the "Location: ..."
header, then add Metalink features to the transparent request.

I suspect this breaks HTTP in some ways and is too complicated.
Instead of adding Metalink features to the proxy, how well does the
proxy work with Metalink clients? Are there any outstanding work items
or issues related to range requests?

Alex Rousskov pointed out a project for Squid to implement duplicate
transfer detection:

  * http://comments.gmane.org/gmane.comp.web.squid.devel/15803
  * http://comments.gmane.org/gmane.comp.web.squid.devel/16335
  * http://www.hpl.hp.com/techreports/2004/HPL-2004-29.pdf

The goal of this plugin is to address the frustration of users when
they click a download button: sometimes the download completes in
seconds, when they are redirected to a mirror that is already cached,
and other times it takes hours, when they are redirected to a mirror
that isn't.

Per Jessen is working on another project for Squid with a similar goal
[3], but the design is a bit different.

  [1] https://github.com/jablko/dedup
  [2] http://mail-archives.apache.org/mod_mbox/trafficserver-dev/201206.mbox/%3C4FE82F1D.2010906%40nottheoilrig.com%3E
  [3] http://mirrorbrain.org/archive/mirrorbrain/0170.html
