It would be great to go the Incubator route. I pointed this out to 
Andrzej way back when this was starting, too, but it would be good to 
think about things like Spring injection (a DI framework) for config, 
phase-based crawler processing, etc. Check out some of the work
in the OODT catalog crawler [1]. Might help.

I read the Wiki page Ken pointed to and one of the options proposed
(at the time) was a Lucene sub-project. I don't think that'll work anymore 
since the ASF doesn't want umbrella projects. So, the goal would be
Incubator PMC sponsorship with a graduation path towards TLP. 

But it's up to the guys doing what they want to do, and I'm on the 
outside looking in on this one. *Except for* Nutch's concern :-) I care 
very much about the way that Nutch consumes any of this code 
so I'm happy to chime in on that. 

Cheers,
Chris

[1] http://oodt.apache.org/components/maven/crawler/


On Jul 6, 2011, at 1:15 PM, Markus Jelsma wrote:

> Impressive! Are you guys going for the ASF incubator?
> 
>> [Apologies for cross-posting]
>> 
>> The initial release of crawler-commons is available from :
>> http://code.google.com/p/crawler-commons/downloads/list
>> 
>> The purpose of this project is to develop a set of reusable Java components
>> that implement functionality common to any web crawler. These components
>> would benefit from collaboration among various existing web crawler
>> projects, and reduce duplication of effort.
>> The current version contains resources for :
>> - parsing robots.txt
>> - parsing sitemaps
>> - URL analyzer which returns Top Level Domains
>> - a simple HttpFetcher
>> 
>> This release is available on Sonatype's OSS Nexus repository [
>> https://oss.sonatype.org/content/repositories/releases/com/google/code/crawler-commons/]
>> and should be available on Maven Central soon.
>> 
>> Please send your questions, comments or suggestions to
>> http://groups.google.com/group/crawler-commons
>> 
>> Best regards,
>> 
>> Julien
>> 
>> --
>> 
>> Open Source Solutions for Text Engineering
>> 
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
