Hi, what do you think about a Google App Engine app that generates 
Metalinks for URLs? Maybe something like this already exists?

The first time you visit, e.g.
http://mintiply.appspot.com/http://apache.osuosl.org/trafficserver/trafficserver-3.2.0.tar.bz2
it downloads the content and computes a digest. App Engine has *lots* of
bandwidth, so this is snappy. Then it sends a response with "Digest:
SHA-256=..." and "Location: ..." headers, similar to MirrorBrain.
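
In outline, the first-visit path looks something like this (a minimal
sketch using urlfetch and hashlib, not the actual code from mintiply.py;
the function name is made up):

    import base64
    import hashlib

    from google.appengine.api import urlfetch

    def first_visit(url):
        # Download the content (ignoring the 32 MB limit discussed
        # below) and compute a whole-file digest
        result = urlfetch.fetch(url, deadline=60)
        digest = hashlib.sha256(result.content).digest()
        # Digest header per RFC 3230; Location like a MirrorBrain
        # redirect
        return {
            'Digest': 'SHA-256=' + base64.b64encode(digest),
            'Location': url,
        }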

It also records the digest in Google's Datastore, so on subsequent
visits, it doesn't download the content or recompute the digest.
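
Something like the following, assuming a simple db.Model keyed on the
URL (again a sketch; the model and property names are made up, not taken
from mintiply.py):

    from google.appengine.ext import db

    class Resource(db.Model):
        # The key name is the URL itself, so lookups are a cheap get
        url = db.StringProperty(required=True)
        digest = db.StringProperty()  # base64-encoded SHA-256

    def remember(url, digest):
        Resource(key_name=url, url=url, digest=digest).put()

    def lookup(url):
        entity = Resource.get_by_key_name(url)
        return entity.digest if entity else None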

Finally, it also checks the Datastore for other URLs with a matching
digest, and sends "Link: <...>; rel=duplicate" headers for each of
these. So if you visit, e.g.
http://mintiply.appspot.com/http://mirror.nexcess.net/apache/trafficserver/trafficserver-3.2.0.tar.bz2
it sends "Link:
<http://apache.osuosl.org/trafficserver/trafficserver-3.2.0.tar.bz2>;
rel=duplicate".
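
Roughly like this, continuing the Resource sketch above (rel=duplicate
as in RFC 6249):

    def duplicate_links(url, digest):
        # Find other URLs whose content hashed to the same digest and
        # format each as a Link: <...>; rel=duplicate header
        links = []
        for resource in Resource.all().filter('digest =', digest):
            if resource.url != url:
                links.append('<%s>; rel=duplicate' % resource.url)
        return links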

The idea is that this could be useful for sites that don't yet generate
Metalinks, like SourceForge. You could always prefix a URL that you pass to
a Metalink client with "http://mintiply.appspot.com/" to get a Metalink.
Alternatively, if a Metalink client noticed that it was downloading a large
file without mirror or hash metadata, it could try to get more mirrors from
this app while it continued downloading the file. As long as someone else
has previously tried the same URL, or App Engine can download the file
faster than the client, it should get more mirrors in time to help
finish the download. Popular downloads should have the most complete list
of mirrors, since these URLs should have been tried the most.
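
For example, a client could probe the app while the real download
continues (a sketch with Python 2 httplib; it assumes the app answers
HEAD as well as GET):

    import httplib

    def more_mirrors(url):
        # Prefix the original URL with the app's hostname and read the
        # Link headers from the response, without fetching a body
        conn = httplib.HTTPConnection('mintiply.appspot.com')
        conn.request('HEAD', '/' + url)
        response = conn.getresponse()
        return response.msg.getheaders('Link')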

Right now it only downloads a URL once and remembers the digest forever,
which assumes that the content at the URL never changes. This is true for
many downloads, but in the future it could respect cache control headers.

Also, right now it only generates HTTP Metalinks with a whole-file digest.
But in the future it could conceivably generate XML Metalinks with partial
digests.

A major limitation of this proof of concept is that I ran into some App
Engine errors with downloads of any significant size, like Ubuntu ISOs. The
App Engine maximum response size is 32 MB. The app works around this by
using byte ranges to download files in 32 MB segments. This works on my
local machine with the App Engine dev server, but in production Google
apparently kills the process after downloading just a few segments, because
it uses too much memory. This seems wrong, since the app throws away each
segment after adding it to the digest. So if it has enough memory to
download one segment, it shouldn't require any more memory for additional
segments. Maybe this could be worked around by manually calling the Python
garbage collector, or by shrinking the segment size...
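
For what it's worth, the segmented download is roughly this (a sketch of
the approach, not the exact code; the del/gc.collect() lines are the
speculative workaround):

    import gc
    import hashlib

    from google.appengine.api import urlfetch

    SEGMENT = 32 * 1024 * 1024  # App Engine's maximum response size

    def whole_file_digest(url):
        sha = hashlib.sha256()
        offset = 0
        while True:
            headers = {'Range': 'bytes=%d-%d'
                       % (offset, offset + SEGMENT - 1)}
            result = urlfetch.fetch(url, headers=headers, deadline=60)
            sha.update(result.content)
            offset += len(result.content)
            # A 200 means the server ignored the Range and sent the
            # whole file; a short segment means we reached the end
            done = (result.status_code == 200
                    or len(result.content) < SEGMENT)
            # The speculative workaround: drop the segment and force a
            # collection before fetching the next one
            del result
            gc.collect()
            if done:
                break
        return sha.digest()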

I also ran into a second bug with App Engine URL Fetch and downloads of
any significant size:
http://code.google.com/p/googleappengine/issues/detail?id=7732#c6

Another thought: do any web crawlers already maintain a database of
digests that an app like this could exploit?

Here is the code:
https://github.com/jablko/mintiply/blob/master/mintiply.py

What are your thoughts? Maybe something like this already exists, or was 
already tried in the past...
