On Wednesday, August 15, 2012 10:07:58 AM UTC-7, Neil M. wrote:
> > Another thought is whether any web crawlers already maintain a database of
> > digests that an app like this could exploit?
> >
> > Here is the code:
> > https://github.com/jablko/mintiply/blob/master/mintiply.py
> >
> > What are your thoughts? Maybe something like this already exists, or was
> > already tried in the past...
>
> I've written a metalink crawler for .metalink files. It's pretty dumb but
> it gets the job done. The code is available here:
>
> http://metalinks.svn.sourceforge.net/viewvc/metalinks/crawler/
>
> You can see the results here:
>
> http://www.nabber.org/projects/metalink/crawler/list.php
>
> I imagine it wouldn't be hard to modify it so that, instead of just grabbing
> the .metalink files, it parses them and dumps them into your database. One
> advantage to this method is that any URLs that are now dead are still
> captured in the .metalink files, so your AppEngine code could detect and
> redirect a "dumb" browser to a working download location instead.
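For what it's worth, here is a rough sketch of what that "parse them and dump them" step might look like. It assumes the crawled files are Metalink 3.0 documents (namespace http://www.metalinker.org/). The function name `parse_metalink`, the record layout, and the `SAMPLE` document with its example.com mirrors are made up for illustration; none of this is taken from the crawler or from mintiply.py.

```python
import xml.etree.ElementTree as ET

# Metalink 3.0 namespace; Metalink 4 (RFC 5854) uses a different one,
# so this sketch would need adjusting for .meta4 files.
NS = {"ml": "http://www.metalinker.org/"}

# Hypothetical sample document, for illustration only.
SAMPLE = """<metalink version="3.0" xmlns="http://www.metalinker.org/">
  <files>
    <file name="example.exe">
      <verification>
        <hash type="sha1">DA39A3EE5E6B4B0D3255BFEF95601890AFD80709</hash>
      </verification>
      <resources>
        <url type="http">http://mirror1.example.com/example.exe</url>
        <url type="ftp">ftp://mirror2.example.com/example.exe</url>
      </resources>
    </file>
  </files>
</metalink>"""


def parse_metalink(xml_text):
    """Return one record per <file>: its name, its hashes, and its URLs."""
    root = ET.fromstring(xml_text)
    records = []
    for file_el in root.findall("ml:files/ml:file", NS):
        # Map hash type (e.g. "sha1") to a lowercase hex digest.
        hashes = {
            h.get("type"): (h.text or "").strip().lower()
            for h in file_el.findall("ml:verification/ml:hash", NS)
        }
        # Keep every listed URL, including ones that may be dead by now.
        urls = [
            (u.text or "").strip()
            for u in file_el.findall("ml:resources/ml:url", NS)
        ]
        records.append(
            {"name": file_el.get("name"), "hashes": hashes, "urls": urls}
        )
    return records


if __name__ == "__main__":
    for record in parse_metalink(SAMPLE):
        print(record["name"], record["hashes"], record["urls"])
```

Each record could then be stored keyed by URL, so that a request for a dead URL can still be matched to the other mirrors listed alongside it.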
Interesting idea, and thanks for writing this Metalink crawler.

> As for a hash database, I've been researching options for my Appupdater
> project. There are some hash search type sites out there, but I don't think
> they will be useful in this case since I haven't seen any that track URLs;
> it's usually just file size, version, product name, etc. There seem to be
> plenty of datasets out there for installers from the various download
> websites, like sourceforge.net, softpedia, oldapps.com, etc. However, from
> what I can tell there is no way to download a database from any of these;
> you'd have to parse the individual web pages. While possible, that doesn't
> seem to be a very efficient way of doing things, since you'd need to
> customize it for each website.
>
> Actually, probably the better and easier way is to build a .exe, .msi, etc.
> crawler, download each file and compute your own hashes. It will take a lot
> of time and bandwidth, but you'd get a really good dataset that way. In
> other words, have a crawler that feeds your AppEngine code URLs to process.

I agree. Thanks a lot for sharing your experience researching options for
Appupdater, Neil.
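To make the "download the file and compute your own hashes" idea a bit more concrete, here is a minimal sketch of the hashing side, assuming SHA-256 and an in-memory index. The names `digest_stream` and `add_to_index` and the example.com URLs are hypothetical. A real crawler would pass in the response object from something like `urllib.request.urlopen(url)` rather than the `io.BytesIO` stand-in used here, and would write to a datastore instead of a dict.

```python
import hashlib
import io
from collections import defaultdict


def digest_stream(stream, algorithm="sha256", chunk_size=64 * 1024):
    """Hash a file-like object in chunks so large installers never
    have to fit in memory all at once."""
    hasher = hashlib.new(algorithm)
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        hasher.update(chunk)
    return hasher.hexdigest()


def add_to_index(index, url, stream):
    """Record that `url` serves content with the computed digest."""
    digest = digest_stream(stream)
    index[digest].append(url)
    return digest


if __name__ == "__main__":
    # digest -> list of URLs that serve identical content
    index = defaultdict(list)
    payload = b"pretend this is an installer"
    add_to_index(index, "http://mirror1.example.com/setup.exe", io.BytesIO(payload))
    add_to_index(index, "http://mirror2.example.com/setup.exe", io.BytesIO(payload))
    for digest, urls in index.items():
        print(digest, urls)
```

URLs that end up under the same digest are mirrors of one another, which is the lookup the redirect logic would need.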
