Dear Wiki user,
You have subscribed to a wiki page or wiki category on Nutch Wiki for change
notification.
The Nutch2Roadmap page has been changed by JulienNioche.
http://wiki.apache.org/nutch/Nutch2Roadmap
--
New page:
= Nutch2Roadmap =
Here is a list of the features and architectural changes that will be
implemented in Nutch 2.0.
* Storage Abstraction
  * initially with back-end implementations for HBase and HDFS
  * extend it to other storage systems later, e.g. MySQL
* Plugin cleanup: Tika only for parsing document formats
  * keep only the HtmlParseFilters (probably with a different API) so that
    we can post-process the DOM created in Tika from whatever original format.
* Externalize functionalities to the crawler-commons project
  [http://code.google.com/p/crawler-commons/]
  * robots handling, URL filtering and URL normalization, URL state
    management, perhaps deduplication. We should coordinate our efforts and
    share code freely so that other projects (bixo, heritrix, droids) may
    contribute to this shared pool of functionality, much like Tika does for
    the common need of parsing complex formats.
* Remove index / search and delegate to Solr
  * we may still keep a thin abstract layer to allow other indexing/search
    backends (ElasticSearch?), but the current mess of indexing/query filters
    and competing indexing frameworks (lucene, fields, solr) should go away.
    We should go directly from DOM to a NutchDocument, and stop there.
* Various new functionalities
  * e.g. sitemap support, canonical tag, better handling of redirects,
    detecting duplicated sites, detection of spam cliques, tools to manage
    the webgraph, etc.
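
As a starting point for discussing the Storage Abstraction item, here is a
minimal Java sketch of what such a layer might look like. The names
`PageRecord`, `PageStore` and `InMemoryPageStore` are hypothetical and invented
for illustration; they are not existing Nutch classes, and the in-memory map
only stands in for a real HBase or HDFS back end.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Sketch only: every type here is hypothetical and is not part of Nutch.
class StorageSketch {
    public static void main(String[] args) {
        // Crawl code depends on the interface, not on a concrete back end.
        PageStore store = new InMemoryPageStore();
        store.put(new PageRecord("http://example.org/", "unfetched"));
        store.put(new PageRecord("http://example.org/", "fetched"));

        String status = store.get("http://example.org/")
                .map(page -> page.status)
                .orElse("unknown");
        System.out.println(status); // prints "fetched"
    }
}

/** Hypothetical record for one URL and its crawl status. */
final class PageRecord {
    final String url;
    final String status;

    PageRecord(String url, String status) {
        this.url = url;
        this.status = status;
    }
}

/** Hypothetical storage contract that each back end would implement. */
interface PageStore {
    void put(PageRecord page);
    Optional<PageRecord> get(String url);
}

/** In-memory stand-in for a real back end such as HBase or HDFS. */
final class InMemoryPageStore implements PageStore {
    private final Map<String, PageRecord> pages = new ConcurrentHashMap<>();

    @Override
    public void put(PageRecord page) {
        pages.put(page.url, page); // a later put for the same URL overwrites
    }

    @Override
    public Optional<PageRecord> get(String url) {
        return Optional.ofNullable(pages.get(url));
    }
}
```

With a contract of this kind, adding another back end later (e.g. MySQL) would
mean writing one more implementation of the interface, leaving the calling code
unchanged.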
This document is meant to serve as a basis for discussion. Feel free to
contribute to it.