On Wednesday, August 15, 2012 4:50:11 AM UTC-7, Sundaram Ananthanarayanan 
wrote:
>
> Hi Jack!
>
> I really liked your idea. I think it's cool :-)
>

Hi Sundaram and thanks for your encouragement!

1. I tried the following URL on your server and it didn't give the correct 
> SHA-256 hash. Maybe it was my mistake. Anyway, do check it out. URL: 
> http://www.abisource.com/downloads/abiword/2.4.6/Windows/abiword-setup-2.4.6.exe
>  
> The expected SHA-256 hash is 
> 685a82ca2a9c56861e5ca22b38e697791485664c36ad883f410dac9e96d09f62 .
>

I think this is because the "Digest: SHA-256=..." header is Base64 encoded. 
The header I get is "Digest: 
SHA-256=aFqCyiqcVoYeXKIrOOaXeRSFZkw2rYg/QQ2snpbQn2I=" which I think is 
equivalent to the hash you expected:

>>> import base64, binascii
>>> 
binascii.hexlify(base64.b64decode('aFqCyiqcVoYeXKIrOOaXeRSFZkw2rYg/QQ2snpbQn2I='))
'685a82ca2a9c56861e5ca22b38e697791485664c36ad883f410dac9e96d09f62'
>>> 
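For what it's worth, the round trip between the two forms looks like this: per RFC 3230 the "Digest: SHA-256=..." header value is the Base64 encoding of the raw binary digest, not of the hex string. (The sample content below is made up, just to show the relationship.)

```python
import base64
import binascii
import hashlib

# A server builds the "Digest: SHA-256=..." header value by Base64-encoding
# the raw (binary) SHA-256 digest, not the hex string.
data = b"example content"
raw_digest = hashlib.sha256(data).digest()
header_value = "SHA-256=" + base64.b64encode(raw_digest).decode("ascii")

# To compare against an expected hex hash, decode the Base64 part back to
# bytes and hexlify it.
b64_part = header_value.split("=", 1)[1]
hex_hash = binascii.hexlify(base64.b64decode(b64_part)).decode("ascii")
```
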

2. I was just wondering if you send a Cache-Control header when requesting 
> huge files. I didn't find that in your code. It may not apply to your 
> problem, but the cache is probably included in Google's daily usage quota. 
> Most download managers (to my knowledge) switch off caching during file 
> downloads. It's just one line of code, so it wouldn't hurt to try: add a 
> header with name "Cache-Control" and value "no-cache". Please do ignore 
> this if you have already tried it.
>

I haven't tried this, thanks for suggesting it. However, looking at the 
"no-cache" directive, I think it means that intermediate caches must not 
use a cached copy when responding to the request. Since a purpose of the 
app is to speed up downloads, it needs to get the content and compute the 
digest as quickly as possible, and an intermediate cache could actually 
help with that. Assuming an intermediate cache could accelerate the 
download or save some bandwidth, wouldn't it be better *not* to send 
"Cache-Control: no-cache"?

I haven't checked whether App Engine URL Fetch has an associated HTTP 
cache, and if it does, whether it counts against a quota. But that's a good 
idea to check on.

On the subject of "Cache-Control: no-cache", if a client sends this header 
to the app, I wonder if the app should fetch the content again and 
recompute the digest.
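That behavior might look something like the following. This is just a sketch, not the app's actual code: get_digest and the in-memory dict are made-up stand-ins (the real app records digests with Google's Datastore), and fetch is a placeholder for URL Fetch.

```python
import hashlib

# Hypothetical in-memory stand-in for the Datastore digest records.
_digest_cache = {}

def get_digest(url, request_headers, fetch):
    """Return the hex SHA-256 digest for url, refetching and recomputing
    when the client sent "Cache-Control: no-cache" (a sketch)."""
    cache_control = request_headers.get("Cache-Control", "")
    if "no-cache" in cache_control or url not in _digest_cache:
        _digest_cache[url] = hashlib.sha256(fetch(url)).hexdigest()
    return _digest_cache[url]
```
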

Thanks.
>

Thank you for the feedback!

On Tuesday, August 14, 2012 12:00:47 PM UTC+5:30, Jack Bates wrote:
>>
>> Hi, what do you think about a Google App Engine app that generates 
>> Metalinks for URLs? Maybe something like this already exists?
>>
>> The first time you visit, e.g. 
>> http://mintiply.appspot.com/http://apache.osuosl.org/trafficserver/trafficserver-3.2.0.tar.bz2 
>> it downloads the content and computes a digest. App Engine has *lots* of 
>> bandwidth, so this is snappy. Then it sends a response with "Digest: 
>> SHA-256=..." and "Location: ..." headers, similar to MirrorBrain
>>
>> It also records the digest with Google's Datastore, so on subsequent 
>> visits, it doesn't download or recompute the digest
>>
>> Finally, it also checks the Datastore for other URLs with matching 
>> digest, and sends "Link: <...>; rel=duplicate" headers for each of these. 
>> So if you visit, e.g. 
>> http://mintiply.appspot.com/http://mirror.nexcess.net/apache/trafficserver/trafficserver-3.2.0.tar.bz2 
>> it sends "Link: <
>> http://apache.osuosl.org/trafficserver/trafficserver-3.2.0.tar.bz2>; 
>> rel=duplicate"
>>
>> The idea is that this could be useful for sites that don't yet generate 
>> Metalinks, like SourceForge. You could always prefix a URL that you pass to 
>> a Metalink client with "http://mintiply.appspot.com/" to get a Metalink. 
>> Alternatively, if a Metalink client noticed that it was downloading a large 
>> file without mirror or hash metadata, it could try to get more mirrors from 
>> this app, while it continued downloading the file. As long as someone else 
>> had previously tried the same URL, or App Engine can download the file 
>> faster than the client, then it should get more mirrors in time to help 
>> finish the download. Popular downloads should have the most complete list 
>> of mirrors, since these URLs should have been tried the most
>>
>> Right now it only downloads a URL once, and remembers the digest forever, 
>> which assumes that the content at the URL never changes. This is true for 
>> many downloads, but in future it could respect cache control headers
>>
>> Also right now it only generates HTTP Metalinks with a whole file digest. 
>> But in future it could conceivably generate XML Metalinks with partial 
>> digests
>>
>> A major limitation with this proof of concept is that I ran into some App 
>> Engine errors with downloads of any significant size, like Ubuntu ISOs. The 
>> App Engine maximum response size is 32 MB. The app overcomes this with byte 
>> ranges and downloading files in 32 MB segments. This works on my local 
>> machine with the App Engine dev server, but in production Google apparently 
>> kills the process after downloading just a few segments, because it uses 
>> too much memory. This seems wrong, since the app throws away each segment 
>> after adding it to the digest. So if it has enough memory to download one 
>> segment, it shouldn't require any more memory for additional segments. 
>> Maybe this could be worked around by manually calling the Python garbage 
>> collector, or by shrinking the segment size...
>>
>> Also I ran into a second bug with App Engine URL Fetch and downloads of 
>> any significant size: 
>> http://code.google.com/p/googleappengine/issues/detail?id=7732#c6
>>
>> Another thought is whether any web crawlers already maintain a database 
>> of digests that an app like this could exploit?
>>
>> Here is the code: 
>> https://github.com/jablko/mintiply/blob/master/mintiply.py
>>
>> What are your thoughts? Maybe something like this already exists, or was 
>> already tried in the past...
>>
>
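For anyone curious, the segmented approach described in my quoted message above (fetching 32 MB byte ranges, feeding each segment to the hash, then discarding it) can be sketched roughly like this. digest_in_segments and fetch_range are made-up names; a real implementation would send a "Range: bytes=start-end" header via URL Fetch.

```python
import hashlib

SEGMENT_SIZE = 32 * 1024 * 1024  # App Engine's maximum response size

def digest_in_segments(total_size, fetch_range, segment_size=SEGMENT_SIZE):
    """Compute a whole-file SHA-256 by requesting byte ranges one segment
    at a time; each segment is dropped after being fed to the hash, so
    memory use should stay bounded by one segment."""
    digest = hashlib.sha256()
    offset = 0
    while offset < total_size:
        # Byte ranges are inclusive: this segment covers offset..end.
        end = min(offset + segment_size, total_size) - 1
        segment = fetch_range(offset, end)  # stand-in for a Range request
        digest.update(segment)
        offset = end + 1
    return digest.hexdigest()
```

In principle each loop iteration releases the previous segment, which is why the production memory errors seem wrong; shrinking segment_size or forcing garbage collection between iterations are the workarounds I mentioned.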

-- 
You received this message because you are subscribed to the Google Groups 
"Metalink Discussion" group.
To view this discussion on the web visit 
https://groups.google.com/d/msg/metalink-discussion/-/2KL6A_M4BWkJ.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/metalink-discussion?hl=en.
