Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "FrontPage" page has been changed by LewisJohnMcgibbney:
https://wiki.apache.org/nutch/FrontPage?action=diff&rev1=290&rev2=291

   * Nutch 1.x: A well matured, production ready crawler. 1.x enables fine 
grained configuration, relying on [[http://hadoop.apache.org/|Apache Hadoop]] 
data structures, which are great for batch processing.
   * Nutch 2.x: An emerging alternative taking direct inspiration from 1.x, but 
which differs in one key area; storage is abstracted away from any specific 
underlying data store by using [[http://gora.apache.org|Apache Gora]] for 
handling object to persistent mappings. This means we can implement an 
extremely flexibile model/stack for storing everything (fetch time, status, 
content, parsed text, outlinks, inlinks, etc.) into a number of NoSQL storage 
solutions.
  
- Being pluggable and modular of course has it's benefits, Nutch provides 
extensible interfaces such as Parse, Index and ScoringFilter's for custom 
implementations e.g. [[htp://tika.apache.org|Apache Tika]] for parsing. 
Additionally, pluggable indexing exists for 
[[http://lucene.apache.org/solr|Apache Solr]], 
[[http://www.elasticsearch.org|Elastic Search]], etc.
+ Being pluggable and modular of course has it's benefits, Nutch provides 
extensible interfaces such as Parse, Index and ScoringFilter's for custom 
implementations e.g. [[http://tika.apache.org|Apache Tika]] for parsing. 
Additionally, pluggable indexing exists for 
[[http://lucene.apache.org/solr|Apache Solr]], 
[[http://www.elasticsearch.org|Elastic Search]], etc.
  
  Nutch can run on a single machine, but gains a lot of its strength from 
running in a Hadoop cluster
  
