On 9/2/05, Oleg Kalnichevski <[EMAIL PROTECTED]> wrote:
> On Thu, Sep 01, 2005 at 10:30:29PM -0400, Henri Yandell wrote:
> > Never got round to adding it to Commons, robots.txt parser:
> >
> > http://www.osjava.org/norbert/ ->
> > http://www.robotstxt.org/wc/norobots-rfc.html
> >
> > Web-spider:
> >
> > http://www.osjava.org/scraping-engine/
> >
> > HTML pseudo-scraper (probably more for Jakarta Silk/Web Components):
> >
> > http://www.osjava.org/genjava/multiprojects/gj-scrape/ (poor site at
> > the moment, it's a substring()/indexOf() parsing system instead of
> > trying to be fancy).
> >
> > Hen
> >
>
> Henri,
>
> I think a web spider and robots.txt parser would be a welcome addition
> to the project. If you are personally interested in porting these
> applications to use HttpClient / Http Components go ahead and add the
> web spider to the project goals and yourself to the list of initial
> committers. In my opinion voting you in to a committer status is a
> matter of formality

The robots.txt parser currently makes a single GET request using
HttpURLConnection, so moving it to HttpClient would be pretty easy (if
it is even thought necessary; adding a dependency for one method call
is usually overkill). I will go ahead and add this to the list, as it
has very little religion.

The web-spider might want a bit more investigation on the community's
part. It had its guts ripped out to form a kind of container project
called oscube, so it has a dependency on that, and it might be scoped a
bit beyond what Http Components would want from a spider: cron via
Quartz, notification, database storage, etc. It already uses HttpClient
for its fetching (along with Commons Net for FTP).

http://svn.osjava.org/cgi-bin/viewcvs.cgi/trunk/scraping-engine/xdocs/manual/images/Scrapers.png?rev=1967&view=auto

So it is a bit more than the simple wget clone that might have been
envisioned. :)

The plan is to add a mini-scraping language to it, support POP, and
possibly end up with some kind of rules engine/job language. That is a
lot of religion for HttpClient to swallow, but it is there if it piques
interest.

Hen
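
P.S. To give a feel for how small the HTTP side of a robots.txt parser is, here is a rough sketch of a single GET through HttpURLConnection followed by a very simplified reading of the "User-agent: *" records. This is not the norbert code; the class name RobotsSketch, its method names, the timeouts, and the "non-200 means no rules" behaviour are all assumptions made up for illustration, and the parsing ignores everything except plain Disallow prefixes.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not the norbert API.
class RobotsSketch {

    /** Fetches http://host/robots.txt with one GET via HttpURLConnection. */
    static String fetch(String host) throws IOException {
        HttpURLConnection conn = (HttpURLConnection)
                URI.create("http://" + host + "/robots.txt").toURL().openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(5000); // arbitrary timeouts (assumption)
        conn.setReadTimeout(5000);
        try {
            if (conn.getResponseCode() != HttpURLConnection.HTTP_OK) {
                return ""; // assumption: anything but 200 is treated as "no rules"
            }
            StringBuilder body = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    body.append(line).append('\n');
                }
            }
            return body.toString();
        } finally {
            conn.disconnect();
        }
    }

    /** Collects the Disallow prefixes of records that name "User-agent: *". */
    static List<String> disallowedForAll(String robotsTxt) {
        List<String> rules = new ArrayList<>();
        boolean applies = false;      // does the current record cover "*"?
        boolean lastWasAgent = false; // consecutive User-agent lines share a record
        for (String raw : robotsTxt.split("\n")) {
            int hash = raw.indexOf('#'); // strip trailing comments
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                if (!lastWasAgent) {
                    applies = false; // a new record starts here
                }
                applies |= value.equals("*");
                lastWasAgent = true;
            } else {
                lastWasAgent = false;
                if (field.equals("disallow") && applies && !value.isEmpty()) {
                    rules.add(value);
                }
            }
        }
        return rules;
    }

    /** A path is allowed unless it starts with one of the Disallow prefixes. */
    static boolean isAllowed(String path, List<String> rules) {
        for (String prefix : rules) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

Something like RobotsSketch.isAllowed("/some/page", RobotsSketch.disallowedForAll(RobotsSketch.fetch("www.example.org"))) would tie the pieces together. Only fetch() touches the network, so that one method is all that would change if the GET moved to HttpClient.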
