Hi,

I'm not sure what is the status of the nutchbase - it's missed a lot of
> fixes and changes in trunk since it's been last touched ...
>

yes, maybe we should start the 2.0 branch from 1.1 instead
Dogacan - what do you think?

BTW I see there is now a 2.0 label under JIRA, thanks to whoever added it


> Also, the goal of the crawler-commons project is to provide APIs and
> implementations of stuff that is needed for every open source crawler
> project, like: robots handling, url filtering and url normalization, URL
> state management, perhaps deduplication. We should coordinate our
> efforts, and share code freely so that other projects (bixo, heritrix,
> droids) may contribute to this shared pool of functionality, much like
> Tika does for the common need of parsing complex formats.
>

definitely

 +1 - we may still keep a thin abstract layer to allow other
> indexing/search backends, but the current mess of indexing/query filters
> and competing indexing frameworks (lucene, fields, solr) should go away.
> We should go directly from DOM to a NutchDocument, and stop there.
>


I think that separating the parsing filters from the indexing filters can
have its merits e.g. combining the metadata generated by 2 or more different
parsing filters into a single field in the NutchDocument, keeping only a
subset of the available information etc...


> >
> > I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an
> > update?
>

Have created a new page to serve as a support for discussion :
http://wiki.apache.org/nutch/Nutch2Roadmap

julien
-- 
DigitalPebble Ltd
http://www.digitalpebble.com

Reply via email to