On Sunday, August 19, 2012 2:15:46 PM UTC-7, Bram Neijt wrote:
>
> A single page export will not work, for sure. Instead, I was 
> thinking about moving data out of dynmirror to mintiply. 
>
> For example, if you don't want to download the complete file before 
> you have a metalink, you could check at 
> http://www.dynmirror.net/metalink/?url=http://example.com 
> to see if dynmirror has any metalink information. You could use 
> dynmirror as a kind of caching backend for downloads. 
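Building that lookup URL could look like the sketch below. The `?url=` endpoint shape comes from the message above; percent-encoding the target, and the function name, are assumptions.

```python
from urllib.parse import quote

DYNMIRROR_LOOKUP = "http://www.dynmirror.net/metalink/?url="

def dynmirror_lookup_url(target_url):
    # Percent-encode the target so it survives as a single query value.
    # The endpoint shape is from the message above; whether dynmirror
    # expects the target encoded is an assumption.
    return DYNMIRROR_LOOKUP + quote(target_url, safe="")
```

A client could fetch this URL before (or while) downloading the file itself, and fall back to a plain download if dynmirror has nothing cached.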
>
> Another thing I could do is have dynmirror redirect to mintiply if 
> there is no hash information available, maybe that would be a good 
> approach... 
>
> I'm not really sure it would add anything, but technically it should 
> be possible and I think it might be good to get some code commits on 
> dynmirror anyway ;) 
>

That sounds like a good idea. Please let me know if there's anything I can 
do to help with this.

Cheers

> Greets, 
>
> Bram 
>
>
> On Sun, Aug 19, 2012 at 9:58 AM, Jack Bates <[email protected]> wrote: 
> > On Thursday, August 16, 2012 10:44:19 PM UTC-7, Jack Bates wrote: 
> >> 
> >> On Tuesday, August 14, 2012 1:58:22 PM UTC-7, Bram Neijt wrote: 
> >>> 
> >>> Hi Jack, 
> >>> 
> >>> I once created a similar thing, but it required the "owner" of the 
> >>> file to host the MD5 he/she thinks it should be. It then generates a 
> >>> metalink based on all the md5/sha1/sha256 hashes in the database. 
> >>> 
> >>> The idea is that anybody can step up and start a mirror by hosting the 
> >>> files and the MD5SUMS and have the service spider the MD5SUMS file. 
> >>> 
> >>> You can find the service at: http://www.dynmirror.net/ 
> >> 
> >> 
> >> Cool! The design of this site is impressive. I like how it shows 
> >> analytics, like recent downloads, on the front page 
> >> 
> >>> It might be a good idea to join up the databases or do some 
> >>> collaboration somewhere. Let's see what we can do. For instance, I 
> >>> could add a mintiply url collection or something like that? Or maybe I 
> >>> could have dynmirror register the hash/link combinations at mintiply? 
> >> 
> >> 
> >> Great idea, thanks for suggesting it. The first thing that comes to 
> >> mind is, how would you like to get data out of Mintiply (and into 
> >> Dynmirror)? Is there an API that Mintiply could provide that would 
> >> make this as easy as possible? 
> > 
> > 
> > Hi Bram and thanks again for inviting me to collaborate, 
> > 
> > As an experiment, I just added a page to export all of the data from 
> > Mintiply, in Metalink format. Let me know what you think. Could this be 
> > useful to a project like Dynmirror? Or would you prefer a different 
> > format, or different data? 
> > 
> > There isn't much data in the app yet, so dumping everything in one 
> > Metalink response works fine. If the amount of data ever gets large, we 
> > may need to rethink this. 
> > 
> > Here is the page: http://mintiply.appspot.com/export 
> > 
> >>> Let me know what you think. Currently, I think I'm the only user of 
> >>> dynmirror.net (at http://www.logfish.net/pr/ccbuild/downloads/ ). 
> >>> 
> >>> I'd also be happy to dig up and publish the code somewhere if I haven't 
> >>> already. 
> >>> 
> >>> Greets, 
> >>> 
> >>> Bram 
> >> 
> >> 
> >> Thanks very much for inviting me to collaborate 
> >> 
> >>> On Tue, Aug 14, 2012 at 8:30 AM, Jack Bates <[email protected]> wrote: 
> >>> > Hi, what do you think about a Google App Engine app that generates 
> >>> > Metalinks for URLs? Maybe something like this already exists? 
> >>> > 
> >>> > The first time you visit, e.g. 
> >>> > 
> >>> > http://mintiply.appspot.com/http://apache.osuosl.org/trafficserver/trafficserver-3.2.0.tar.bz2 
> >>> > it downloads the content and computes a digest. App Engine has 
> >>> > *lots* of bandwidth, so this is snappy. Then it sends a response 
> >>> > with "Digest: SHA-256=..." and "Location: ..." headers, similar to 
> >>> > MirrorBrain 
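For reference, an RFC 3230 instance digest puts the base64 of the raw hash bytes after "SHA-256=", so the header value could be built like the sketch below. The exact encoding mintiply uses isn't shown above; base64 per the RFC is an assumption.

```python
import base64
import hashlib

def digest_header(content):
    # RFC 3230 "Digest" header value: base64 of the raw SHA-256 bytes
    # (not the hex string).
    raw = hashlib.sha256(content).digest()
    return "SHA-256=" + base64.b64encode(raw).decode("ascii")
```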
> >>> > 
> >>> > It also records the digest with Google's Datastore, so on 
> >>> > subsequent visits, it doesn't download or recompute the digest 
> >>> > 
> >>> > Finally, it also checks the Datastore for other URLs with matching 
> >>> > digest, and sends "Link: <...>; rel=duplicate" headers for each of 
> >>> > these. So if you visit, e.g. 
> >>> > 
> >>> > http://mintiply.appspot.com/http://mirror.nexcess.net/apache/trafficserver/trafficserver-3.2.0.tar.bz2 
> >>> > 
> >>> > it sends "Link: 
> >>> > <http://apache.osuosl.org/trafficserver/trafficserver-3.2.0.tar.bz2>; 
> >>> > rel=duplicate" 
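On the client side, pulling the duplicate URLs out of such a header could look like this minimal sketch. It handles only the simple form shown above, not full RFC 5988 Link-header parsing.

```python
import re

def parse_duplicate_links(link_header):
    # Split a combined Link header on commas and keep the <URL> of
    # every entry whose rel is "duplicate". Naive: would break on
    # URLs that themselves contain commas.
    urls = []
    for part in link_header.split(","):
        m = re.match(r'\s*<([^>]+)>\s*;\s*rel="?duplicate"?', part)
        if m:
            urls.append(m.group(1))
    return urls
```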
> >>> > 
> >>> > The idea is that this could be useful for sites that don't yet 
> >>> > generate Metalinks, like SourceForge. You could always prefix a URL 
> >>> > that you pass to a Metalink client with 
> >>> > "http://mintiply.appspot.com/" to get a Metalink. Alternatively, if 
> >>> > a Metalink client noticed that it was downloading a large file 
> >>> > without mirror or hash metadata, it could try to get more mirrors 
> >>> > from this app, while it continued downloading the file. As long as 
> >>> > someone else had previously tried the same URL, or App Engine could 
> >>> > download the file faster than the client, it should get more 
> >>> > mirrors in time to help finish the download. Popular downloads 
> >>> > should have the most complete list of mirrors, since these URLs 
> >>> > should have been tried the most 
> >>> > 
> >>> > Right now it only downloads a URL once, and remembers the digest 
> >>> > forever, which assumes that the content at the URL never changes. 
> >>> > This is true for many downloads, but in future it could respect 
> >>> > cache control headers 
> >>> > 
> >>> > Also right now it only generates HTTP Metalinks with a whole file 
> >>> > digest. But in future it could conceivably generate XML Metalinks 
> >>> > with partial digests 
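An XML Metalink with partial digests along those lines could be generated like the sketch below, following my reading of the RFC 5854 schema (whole-file `hash` plus a `pieces` element with per-piece hashes). The function and its arguments are hypothetical, not part of mintiply.

```python
import xml.etree.ElementTree as ET

NS = "urn:ietf:params:xml:ns:metalink"

def metalink_xml(url, whole_hash, piece_hashes, piece_length):
    # Build a Metalink/XML document: a <file> with a whole-file
    # sha-256 <hash>, a <pieces> element holding one <hash> per
    # segment, and the source <url>.
    ET.register_namespace("", NS)
    root = ET.Element("{%s}metalink" % NS)
    file_ = ET.SubElement(root, "{%s}file" % NS, name=url.rsplit("/", 1)[-1])
    whole = ET.SubElement(file_, "{%s}hash" % NS, type="sha-256")
    whole.text = whole_hash
    pieces = ET.SubElement(file_, "{%s}pieces" % NS,
                           type="sha-256", length=str(piece_length))
    for h in piece_hashes:
        ET.SubElement(pieces, "{%s}hash" % NS).text = h
    ET.SubElement(file_, "{%s}url" % NS).text = url
    return ET.tostring(root, encoding="unicode")
```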
> >>> > 
> >>> > A major limitation with this proof of concept is that I ran into 
> >>> > some App Engine errors with downloads of any significant size, like 
> >>> > Ubuntu ISOs. The App Engine maximum response size is 32 MB. The app 
> >>> > overcomes this with byte ranges, downloading files in 32 MB 
> >>> > segments. This works on my local machine with the App Engine dev 
> >>> > server, but in production Google apparently kills the process after 
> >>> > downloading just a few segments, because it uses too much memory. 
> >>> > This seems wrong, since the app throws away each segment after 
> >>> > adding it to the digest. So if it has enough memory to download one 
> >>> > segment, it shouldn't require any more memory for additional 
> >>> > segments. Maybe this could be worked around by manually calling the 
> >>> > Python garbage collector, or by shrinking the segment size... 
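The bounded-memory argument can be sketched as follows: hash each range as it arrives and drop it, so peak memory stays at one segment regardless of file size. Here `fetch_range` is a stand-in for an HTTP Range request, not mintiply's actual code.

```python
import hashlib

SEGMENT = 32 * 1024 * 1024  # App Engine's response limit, per the message above

def digest_in_segments(fetch_range, total_size, segment=SEGMENT):
    # fetch_range(start, end) is assumed to return the bytes of an
    # inclusive range, e.g. via a "Range: bytes=start-end" request.
    # Each segment is hashed and then discarded, so memory use is
    # bounded by one segment -- which is why the production kills
    # described above seem wrong.
    digest = hashlib.sha256()
    for start in range(0, total_size, segment):
        end = min(start + segment, total_size) - 1
        digest.update(fetch_range(start, end))
    return digest.hexdigest()
```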
> >>> > 
> >>> > Also I ran into a second bug with App Engine URL Fetch and 
> >>> > downloads of any significant size: 
> >>> > http://code.google.com/p/googleappengine/issues/detail?id=7732#c6 
> >>> > 
> >>> > Another thought: do any web crawlers already maintain a database of 
> >>> > digests that an app like this could exploit? 
> >>> > 
> >>> > Here is the code: 
> >>> > https://github.com/jablko/mintiply/blob/master/mintiply.py 
> >>> > 
> >>> > What are your thoughts? Maybe something like this already exists, 
> >>> > or was already tried in the past... 
> >>> > 
> >>> > -- 
> >>> > You received this message because you are subscribed to the Google 
> >>> > Groups "Metalink Discussion" group. 
> >>> > To view this discussion on the web visit 
> >>> > https://groups.google.com/d/msg/metalink-discussion/-/r7cq8sL0LuMJ. 
> >>> > To post to this group, send email to [email protected]. 
> >>> > To unsubscribe from this group, send email to 
> >>> > [email protected]. 
> >>> > For more options, visit this group at 
> >>> > http://groups.google.com/group/metalink-discussion?hl=en. 
> > 
>
