It would be great to go the Incubator route. I pointed this out to Andrzej way back when this was starting, too, but it would be good to think about things like Spring injection (a DI framework) for config, phase-based crawler processing, etc. Check out some of the work in the OODT catalog crawler [1]; it might help.
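As a rough illustration of the two ideas above (config supplied from outside, and processing split into phases), here is a minimal plain-Java sketch. Every class and method name in it is hypothetical: it is not the OODT crawler's API, not crawler-commons' API, and it uses hand-written constructor injection as a stand-in for what a container such as Spring would wire up.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical config object. With a DI container this would be defined in
// the container's configuration and injected, not built by the crawler.
final class CrawlerConfig {
    final String userAgent;
    final List<String> allowedSchemes;

    CrawlerConfig(String userAgent, List<String> allowedSchemes) {
        this.userAgent = userAgent;
        this.allowedSchemes = List.copyOf(allowedSchemes);
    }
}

// One processing phase (illustrative interface, not an existing API).
// Returning false stops the remaining phases for this URL.
interface CrawlPhase {
    boolean process(String url, CrawlerConfig config, List<String> log);
}

final class PhasedCrawler {
    private final CrawlerConfig config;
    private final List<CrawlPhase> phases;

    // Constructor injection: config and phases are handed in from outside.
    PhasedCrawler(CrawlerConfig config, List<CrawlPhase> phases) {
        this.config = config;
        this.phases = List.copyOf(phases);
    }

    // Runs the phases in order and returns a log of what happened.
    List<String> crawl(String url) {
        List<String> log = new ArrayList<>();
        for (CrawlPhase phase : phases) {
            if (!phase.process(url, config, log)) {
                log.add("stopped early");
                break;
            }
        }
        return log;
    }
}

final class PhasedCrawlerSketch {
    // Manual wiring; a DI framework would do this part declaratively.
    static PhasedCrawler build() {
        CrawlerConfig config =
            new CrawlerConfig("example-bot", List.of("http", "https"));

        // Phase 1: precondition check driven by the injected config.
        CrawlPhase precondition = (url, cfg, log) -> {
            boolean ok = cfg.allowedSchemes.stream()
                .anyMatch(scheme -> url.startsWith(scheme + "://"));
            log.add("precondition:" + (ok ? "pass" : "fail"));
            return ok;
        };
        // Phase 2: placeholder only; a real phase would fetch the URL here.
        CrawlPhase fetch = (url, cfg, log) -> {
            log.add("fetch:" + url + " as " + cfg.userAgent);
            return true;
        };
        // Phase 3: placeholder for post-fetch work (parsing, indexing, ...).
        CrawlPhase postprocess = (url, cfg, log) -> {
            log.add("postprocess");
            return true;
        };
        return new PhasedCrawler(config, List.of(precondition, fetch, postprocess));
    }

    public static void main(String[] args) {
        PhasedCrawler crawler = build();
        System.out.println(crawler.crawl("http://example.com/"));
        System.out.println(crawler.crawl("ftp://example.com/"));
    }
}
```

The point of the split is that phases can be added, removed, or reordered in the wiring code (or container config) without touching `PhasedCrawler` itself.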
I read the Wiki page Ken pointed to, and one of the options proposed (at the time) was a Lucene sub-project. I don't think that'll work anymore since the ASF doesn't want umbrella projects. So, the goal would be Incubator PMC sponsorship with a graduation path towards TLP. But it's up to the guys to do what they want to do, and I'm on the outside looking in on this one. *Except for* Nutch's concern :-) I care very much about the way that Nutch consumes any of this code, so I'm happy to chime in on that.

Cheers,
Chris

[1] http://oodt.apache.org/components/maven/crawler/

On Jul 6, 2011, at 1:15 PM, Markus Jelsma wrote:

> Impressive! Are you guys going for the ASF incubator?
>
>> [Apologies for cross-posting]
>>
>> The initial release of crawler-commons is available from :
>> http://code.google.com/p/crawler-commons/downloads/list
>>
>> The purpose of this project is to develop a set of reusable Java components
>> that implement functionality common to any web crawler. These components
>> would benefit from collaboration among various existing web crawler
>> projects, and reduce duplication of effort.
>> The current version contains resources for :
>> - parsing robots.txt
>> - parsing sitemaps
>> - URL analyzer which returns Top Level Domains
>> - a simple HttpFetcher
>>
>> This release is available on Sonatype's OSS Nexus repository
>> [https://oss.sonatype.org/content/repositories/releases/com/google/code/crawler-commons/]
>> and should be available on Maven Central soon.
>>
>> Please send your questions, comments or suggestions to
>> http://groups.google.com/group/crawler-commons
>>
>> Best regards,
>>
>> Julien
>>
>> --
>>
>> Open Source Solutions for Text Engineering
>>
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++