Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "FrontPage" page has been changed by LewisJohnMcgibbney:
https://wiki.apache.org/nutch/FrontPage?action=diff&rev1=290&rev2=291

   * Nutch 1.x: A well matured, production ready crawler. 1.x enables fine grained configuration, relying on [[http://hadoop.apache.org/|Apache Hadoop]] data structures, which are great for batch processing.
   * Nutch 2.x: An emerging alternative taking direct inspiration from 1.x, but which differs in one key area; storage is abstracted away from any specific underlying data store by using [[http://gora.apache.org|Apache Gora]] for handling object to persistent mappings. This means we can implement an extremely flexibile model/stack for storing everything (fetch time, status, content, parsed text, outlinks, inlinks, etc.) into a number of NoSQL storage solutions.

- Being pluggable and modular of course has it's benefits, Nutch provides extensible interfaces such as Parse, Index and ScoringFilter's for custom implementations e.g. [[htp://tika.apache.org|Apache Tika]] for parsing. Additionally, pluggable indexing exists for [[http://lucene.apache.org/solr|Apache Solr]], [[http://www.elasticsearch.org|Elastic Search]], etc.
+ Being pluggable and modular of course has it's benefits, Nutch provides extensible interfaces such as Parse, Index and ScoringFilter's for custom implementations e.g. [[http://tika.apache.org|Apache Tika]] for parsing. Additionally, pluggable indexing exists for [[http://lucene.apache.org/solr|Apache Solr]], [[http://www.elasticsearch.org|Elastic Search]], etc.

  Nutch can run on a single machine, but gains a lot of its strength from running in a Hadoop cluster