Hi,

HarvestMan has been under development since Jul 2003, but the last public release was in Sep 2005. Now, after a gap of more than two years, I am announcing the initial (alpha) release of version 2.0 of HarvestMan and its companion program, Hget.
Version 2.0 is still under development and a lot of things will change down the line. I had been thinking of making a single announcement once everything was done; however, it looks like the complete work will take a long time, so I have decided to make intermediate alpha and beta releases until the final version is ready.

There are lots of changes in HarvestMan. The main one is a new plugin feature, which lets you modify the program's behaviour by writing small pieces of Python code as plugins (say, akin to Firefox extensions). As of now, plugins exist for integration with Lucene and Swish-e. (As of this writing, HarvestMan + plugins is being used by students at a university in Europe to write custom web crawling applications.)

The changes are not complete yet. The program is still a single process. As development progresses, I will be changing this first to a client/server split and then to a p2p architecture for better scaling.

The highlight is actually another application named "Hget", which is built on top of HarvestMan as a framework. Hget can be considered wget on steroids, and can be used as a download manager to perform HTTP downloads in pieces from the web. It can perform HTTP multipart downloading, mirror search and download, HTTP resuming and failover, and has built-in support for sourceforge.net mirrors. More features are being added daily.

Hget and HarvestMan are packaged together. The URL is
http://www.harvestmanontheweb.com/packages/2.0/HarvestMan-2.0alpha.tar.gz

The setup.py script can be used to install both programs. I have improved setup.py a lot, and it now does a very good job of pulling in the required dependencies and doing a clean install. HarvestMan depends on pyparsing, so this is pulled in automatically if not found.

The current version of HarvestMan also includes a rudimentary Javascript parser (two, in fact).
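To give a rough idea of what a rudimentary Javascript parser has to do, here is a minimal sketch. It is not HarvestMan's actual code: it uses the standard library's html.parser and re modules instead of pyparsing, it only understands simple quoted string arguments, and the names ScriptExtractor, WRITE_RE, REDIRECT_RE and analyse are made up for this illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical patterns: only simple, single-string forms are recognised.
WRITE_RE = re.compile(r"""document\.write\(\s*(['"])(.*?)\1\s*\)""")
REDIRECT_RE = re.compile(
    r"""(?:window\.|document\.)?location(?:\.href)?\s*=\s*(['"])(.*?)\1"""
)


class ScriptExtractor(HTMLParser):
    """Collect the text of inline <script> elements (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so append to the last entry.
        if self._in_script:
            self.scripts[-1] += data


def analyse(html):
    """Return (document.write strings, redirect URLs) found in inline scripts."""
    parser = ScriptExtractor()
    parser.feed(html)
    parser.close()
    writes, redirects = [], []
    for script in parser.scripts:
        writes += [m.group(2) for m in WRITE_RE.finditer(script)]
        redirects += [m.group(2) for m in REDIRECT_RE.finditer(script)]
    return writes, redirects


if __name__ == "__main__":
    page = (
        "<html><head><script>"
        'window.location.href = "http://example.com/new";'
        "</script></head><body><script>"
        'document.write("Hello from script");'
        "</script></body></html>"
    )
    print(analyse(page))
```

A crawler could add the redirect URLs to its queue and run the written strings through the HTML parser again to pick up links that appear only in generated markup. A real parser needs to handle far more than this regex-based sketch does.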
There is a pure Python parser written using pyparsing, which can extract Javascript from HTML and do basic processing (like document.write and Javascript redirection). Then there is another one, a pure Python re-implementation of RbNarcissus, a pure Ruby parser for Javascript.

Since this is an alpha version, there will be bugs. This release is also meant for a limited audience, so I am announcing it only here. There is no Cheeseshop package yet and no announcement on the larger Python mailing lists (like c.l.py).

If you are interested in the program, or in web crawling in general, do download it and give it a try. Even if you are not interested in web crawling, I think the Hget application would be very useful to you.

Please report bugs, preferably at
http://developer.berlios.de/bugs/?group_id=1873
or email them straight to me.

For anyone interested in development, the project is currently hosted on the server http://svn.eiao.net . The trunk can be checked out at
http://svn.eiao.net/robacc/experimental/HarvestMan-2.0 .
Kindly note that the trunk is under development and may not be stable. There are no nightly drops yet, since this is mostly a single-person project :)

Thanks & regards,

--
-Anand
_______________________________________________
BangPypers mailing list
http://mail.python.org/mailman/listinfo/bangpypers
