Hi, what do you think about a Google App Engine app that generates 
Metalinks for URLs? Maybe something like this already exists?

The first time you visit, e.g.
http://mintiply.appspot.com/http://apache.osuosl.org/trafficserver/trafficserver-3.2.0.tar.bz2
it downloads the content and computes a digest. App Engine has *lots* of
bandwidth, so this is snappy. Then it sends a response with "Digest:
SHA-256=..." and "Location: ..." headers, similar to MirrorBrain.
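
In outline, the first-visit path looks something like this (a minimal
sketch using urlfetch and hashlib, not the actual code from mintiply.py;
the function name is made up):

    import base64
    import hashlib

    from google.appengine.api import urlfetch

    def first_visit(url):
        # Download the content (ignoring the 32 MB limit discussed
        # below) and compute a whole-file digest
        result = urlfetch.fetch(url, deadline=60)
        digest = hashlib.sha256(result.content).digest()
        # Digest header per RFC 3230; Location like a MirrorBrain
        # redirect
        return {
            'Digest': 'SHA-256=' + base64.b64encode(digest),
            'Location': url,
        }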

It also records the digest in Google's Datastore, so on subsequent
visits, it doesn't download the content or recompute the digest.
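
Something like the following, assuming a simple db.Model keyed on the
URL (again a sketch; the model and property names are made up, not taken
from mintiply.py):

    from google.appengine.ext import db

    class Resource(db.Model):
        # The key name is the URL itself, so lookups are a cheap get
        url = db.StringProperty(required=True)
        digest = db.StringProperty()  # base64-encoded SHA-256

    def remember(url, digest):
        Resource(key_name=url, url=url, digest=digest).put()

    def lookup(url):
        entity = Resource.get_by_key_name(url)
        return entity.digest if entity else None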

Finally, it also checks the Datastore for other URLs with a matching
digest, and sends "Link: <...>; rel=duplicate" headers for each of
these. So if you visit, e.g.
http://mintiply.appspot.com/http://mirror.nexcess.net/apache/trafficserver/trafficserver-3.2.0.tar.bz2
it sends "Link:
<http://apache.osuosl.org/trafficserver/trafficserver-3.2.0.tar.bz2>;
rel=duplicate".
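
Roughly like this, continuing the Resource sketch above (rel=duplicate
as in RFC 6249):

    def duplicate_links(url, digest):
        # Find other URLs whose content hashed to the same digest and
        # format each as a Link: <...>; rel=duplicate header
        links = []
        for resource in Resource.all().filter('digest =', digest):
            if resource.url != url:
                links.append('<%s>; rel=duplicate' % resource.url)
        return links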

The idea is that this could be useful for sites that don't yet generate
Metalinks, like SourceForge. You could always prefix a URL that you pass to
a Metalink client with "http://mintiply.appspot.com/" to get a Metalink.
Alternatively, if a Metalink client noticed that it was downloading a large
file without mirror or hash metadata, it could try to get more mirrors from
this app while it continued downloading the file. As long as someone else
has previously tried the same URL, or App Engine can download the file
faster than the client, it should get more mirrors in time to help
finish the download. Popular downloads should have the most complete list
of mirrors, since these URLs should have been tried the most.
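
For example, a client could probe the app while the real download
continues (a sketch with Python 2 httplib; it assumes the app answers
HEAD as well as GET):

    import httplib

    def more_mirrors(url):
        # Prefix the original URL with the app's hostname and read the
        # Link headers from the response, without fetching a body
        conn = httplib.HTTPConnection('mintiply.appspot.com')
        conn.request('HEAD', '/' + url)
        response = conn.getresponse()
        return response.msg.getheaders('Link')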

Right now it only downloads a URL once and remembers the digest forever,
which assumes that the content at the URL never changes. This is true for
many downloads, but in the future it could respect cache control headers.

Also, right now it only generates HTTP Metalinks with a whole-file digest.
But in the future it could conceivably generate XML Metalinks with partial
digests.

A major limitation of this proof of concept is that I ran into some App
Engine errors with downloads of any significant size, like Ubuntu ISOs. The
App Engine maximum response size is 32 MB. The app works around this by
using byte ranges to download files in 32 MB segments. This works on my
local machine with the App Engine dev server, but in production Google
apparently kills the process after downloading just a few segments, because
it uses too much memory. This seems wrong, since the app throws away each
segment after adding it to the digest. So if it has enough memory to
download one segment, it shouldn't require any more memory for additional
segments. Maybe this could be worked around by manually calling the Python
garbage collector, or by shrinking the segment size...
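
For what it's worth, the segmented download is roughly this (a sketch of
the approach, not the exact code; the del/gc.collect() lines are the
speculative workaround):

    import gc
    import hashlib

    from google.appengine.api import urlfetch

    SEGMENT = 32 * 1024 * 1024  # App Engine's maximum response size

    def whole_file_digest(url):
        sha = hashlib.sha256()
        offset = 0
        while True:
            headers = {'Range': 'bytes=%d-%d'
                       % (offset, offset + SEGMENT - 1)}
            result = urlfetch.fetch(url, headers=headers, deadline=60)
            sha.update(result.content)
            offset += len(result.content)
            # A 200 means the server ignored the Range and sent the
            # whole file; a short segment means we reached the end
            done = (result.status_code == 200
                    or len(result.content) < SEGMENT)
            # The speculative workaround: drop the segment and force a
            # collection before fetching the next one
            del result
            gc.collect()
            if done:
                break
        return sha.digest()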

I also ran into a second bug with App Engine URL Fetch and downloads of
any significant size:
http://code.google.com/p/googleappengine/issues/detail?id=7732#c6

Another thought: do any web crawlers already maintain a database of
digests that an app like this could exploit?

Here is the code:
https://github.com/jablko/mintiply/blob/master/mintiply.py

What are your thoughts? Maybe something like this already exists, or was 
already tried in the past...
