Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "NutchTutorial" page has been changed by SebastianNagel:
https://wiki.apache.org/nutch/NutchTutorial?action=diff&rev1=90&rev2=91

Comment:
Updates for release of Nutch 1.15, fix Deduplication section

  
  {{{
       Usage: Indexer <crawldb> [-linkdb <linkdb>] [-params k1=v1&k2=v2...] 
(<segment> ... | -dir <segments>) [-noCommit] [-deleteGone] [-filter] 
[-normalize] [-addBinaryContent] [-base64]
-      Example: bin/nutch index http://localhost:8983/solr crawl/crawldb/ 
-linkdb crawl/linkdb/ crawl/segments/20131108063838/ -filter -normalize 
-deleteGone
+      Example: bin/nutch index crawl/crawldb/ -linkdb crawl/linkdb/ 
crawl/segments/20131108063838/ -filter -normalize -deleteGone
  }}}
  === Step-by-Step: Deleting Duplicates ===
- Once indexed the entire contents, it must be disposed of duplicate urls in 
this way ensures that the urls are unique.
+ Duplicates (identical content but different URL) are optionally marked in the 
CrawlDb and are deleted later in the Solr index.
  
- MapReduce:
+ MapReduce "dedup" job:
  
-  * Map: Identity map where keys are digests and values are  
[[http://wiki.apache.org/nutch/SolrRecord|SolrRecord]] instances (which contain 
id, boost and timestamp)
-  * Reduce: After map, [[http://wiki.apache.org/nutch/SolrRecord|SolrRecord]]s 
with the same digest will be grouped together. Now, of these documents with the 
same digests, delete all of them except the one with the highest score (boost 
field). If two (or more) documents have the same score, then the document with 
the latest timestamp is kept. Again, every other is deleted from solr index.
+  * Map: Identity map where keys are digests and values are CrawlDatum records
+  * Reduce: All CrawlDatums with the same digest except one are marked as 
duplicates. Multiple heuristics are available to choose the record that is 
kept (not marked as duplicate): the one with the shortest URL, the one 
fetched most recently, or the one with the highest score.
  
  {{{
+      Usage: bin/nutch dedup <crawldb> [-group <none|host|domain>] 
[-compareOrder <score>,<fetchTime>,<urlLength>]
-      Usage: bin/nutch dedup <solr url>
-      Example: /bin/nutch dedup http://localhost:8983/solr
  }}}
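+ 
+ For example, assuming the crawl data lives in `crawl/crawldb/` as in the 
indexing example above, duplicates can be marked grouping by host and keeping 
the highest-scoring record (the options shown are illustrative, both are 
optional):
+ 
+ {{{
+      bin/nutch dedup crawl/crawldb/ -group host -compareOrder score,fetchTime,urlLength
+ }}}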
+ 
+ Deletion in the index is performed by the cleaning job (see below) or by the 
index job when it is called with the command-line flag `-deleteGone`.
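+ 
+ The cleaning job removes documents marked as duplicate (as well as gone 
pages) from the index. Assuming the crawl directory layout from the examples 
above:
+ 
+ {{{
+      bin/nutch clean crawl/crawldb/
+ }}}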
  
  For more information see 
[[https://wiki.apache.org/nutch/bin/nutch%20dedup|dedup documentation]].
  
@@ -310, +311 @@

  
  Every version of Nutch is built against a specific Solr version, but you may 
also try a "close" version.
  || Nutch || Solr   ||
+ || 1.15  || 7.3.1  ||
  || 1.14  || 6.6.0  ||
  || 1.13  || 5.5.0  ||
  || 1.12  || 5.4.1  ||
  
+ To install Solr:
   * download the binary release from 
[[http://www.apache.org/dyn/closer.cgi/lucene/solr/|here]]
   * unzip to `$HOME/apache-solr`; we will refer to this location as 
`${APACHE_SOLR_HOME}`
   * create resources for a new nutch solr core `cp -r 
${APACHE_SOLR_HOME}/server/solr/configsets/basic_configs 
${APACHE_SOLR_HOME}/server/solr/configsets/nutch`
@@ -321, +324 @@

   * make sure that there is no `managed-schema` "in the way": `rm 
${APACHE_SOLR_HOME}/server/solr/configsets/nutch/conf/managed-schema`
   * start the solr server `${APACHE_SOLR_HOME}/bin/solr start`
   * create the nutch core `${APACHE_SOLR_HOME}/bin/solr create -c nutch -d 
server/solr/configsets/nutch/conf/`
+ 
+ After that you need to point Nutch to the Solr instance:
+  * (Nutch 1.15 and later) edit the file `conf/index-writers.xml`, see 
[[IndexWriters]]
-  * add the core name to the Solr server URL: 
`-Dsolr.server.url=http://localhost:8983/solr/nutch`
+  * (until Nutch 1.14) add the core name to the Solr server URL: 
`-Dsolr.server.url=http://localhost:8983/solr/nutch`
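+ 
+ For example (Nutch 1.14 and earlier), the property can be passed directly to 
the index job; the segment name is taken from the indexing example above and 
will differ in your crawl:
+ 
+ {{{
+      bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/nutch 
crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/20131108063838/
+ }}}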
  
  = Verify Solr installation =
  After you have started Solr, you should be able to access the admin console 
at the following links:
