Renaud forwarded me the thread and I have only just subscribed, so
apologies for not replying properly in-thread.
Thanks, Renaud, for the heads-up.

> Hi,
>
> On 3/1/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> > Is the Droids lab at all related to that parsing project in Nutch?
>
> Partly, yes. I've been looking at Droids and so far I think its main
> focus has been on the crawling part rather than on the analysis of
> retrieved content.

Yes, Droids should be a generic crawler framework. I took Nutch, ripped
out the plugin/extension point framework and wrote some PoC plugins. I
changed many things to make the code simpler, so not much of the
original Nutch code is left. Further, I am using Ivy for dependency
management for the core and the plugins.

The first crawler is not close to the one from Nutch, but via plugins
one could implement the same functionality (though there is no interest
on the Nutch side ATM). The implemented crawler, x-m02y07, is more wget
style (and very basic for now): request a URL, extract the links and
save the page to disk.

> A generic content analysis toolkit would likely be
> a great companion for Droids.

Yes, indeed. I am ATM playing with
http://simile.mit.edu/repository/crowbar/trunk/

Stefano pointed me to it, and it is very interesting: the idea is to
use a Gecko-based browser as a server that browses a page and to let
the browser analyze the page. That enables a crawler to index Web 2.0
components such as Ajax.
http://mail-archives.apache.org/mod_mbox/labs-labs/200702.mbox/browser

The core of any crawler is link recognition, and there are different
routes we can take. In the short term we can enhance the parse-html
Droids plugin with NekoHTML (a similar route to the one Nutch is
taking), but in the long run we should try to incorporate a virtual
browser, as Stefano pointed out on the labs ml.

> In fact I was earlier contemplating
> about starting a related effort in Apache Labs (see
> http://issues.apache.org/jira/browse/JCR-728),

That seems to aim more at closing the MIME type gap that we have ATM,
and I think labs would be the right place for it.

> but there seems to be
> enough demand for such functionality that a more full-fledged project
> might be better.

Maybe you are interested in starting some plugins in Droids; as soon as
we have some community around the code we can request incubation. Some
Forrest folks have also expressed their interest in Droids. Actually,
Forrest/Cocoon was one of the main reasons I started it. The other was
Solr.

> > There seem to be several efforts that are related here that could
> > probably make for a nice new project under Lucene, IMO. They all
> > seem to have to do with getting and preparing text for processing by
> > some type of consumer of text.
>
> Exactly. It would be great to see some consolidation of efforts.
>

The great advantage of labs is that all Apache committers have write
access, which makes it a perfect place to start cross-project efforts
like this one. If enough people are attracted, the lab gets promoted.
When a lab is promoted, the files are moved over to the incubation
area.
http://labs.apache.org/bylaws.html

Looking forward to seeing you on labs@labs.apache.org

salu2
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java & XML consulting, training and solutions