Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "bin/nutch solrdedup" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/bin/nutch%20solrdedup?action=diff&rev1=2&rev2=3

  '''Preparation''':
  Query the Solr server for the number of documents (say, N), then partition N 
among M map tasks. For example, with two map tasks, the first map task handles 
Solr documents 0 to (N / 2 - 1) and the second handles documents (N / 2) to 
(N - 1). This can be thought of as a divide-and-conquer split of the index into 
contiguous ranges, one per map task.
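
  The partitioning step above can be sketched as follows. This is only an 
illustration in Python, not the Nutch implementation: the function name 
`partition` is hypothetical, and spreading a remainder over the first tasks 
when N is not divisible by M is an assumption the page does not spell out.

```python
def partition(num_docs, num_tasks):
    """Split document indices 0..num_docs-1 into contiguous inclusive ranges.

    Hypothetical helper: returns one (start, end) pair per map task that
    receives at least one document.
    """
    if num_tasks <= 0:
        raise ValueError("num_tasks must be positive")
    base, extra = divmod(num_docs, num_tasks)
    ranges = []
    start = 0
    for task in range(num_tasks):
        # Assumption: the first `extra` tasks each take one additional document.
        size = base + (1 if task < extra else 0)
        if size == 0:
            continue  # more tasks than documents, nothing left to assign
        ranges.append((start, start + size - 1))
        start += size
    return ranges


# Two map tasks over N = 10 documents: 0..(N/2 - 1) and (N/2)..(N - 1).
print(partition(10, 2))
```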
  
- '''Map Reduce''':
+ '''MapReduce''':
   * Map: Identity map where keys are digests and values are {@link SolrRecord} 
instances (which contain id, boost and timestamp)
  
   * Reduce: After the map phase, {@link SolrRecord}s with the same digest are 
grouped together. Of the documents sharing a digest, only the one with the 
highest score (boost field) is kept. If two or more documents have the same 
score, the one with the latest timestamp is kept. Every other document in the 
group is deleted from the Solr index.
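
  The reduce rule above can be sketched as follows. This is a minimal Python 
illustration, not Nutch's Java code: the `SolrRecord` tuple here only mirrors 
the three fields named above, the field name `tstamp` and the function 
`ids_to_delete` are hypothetical, and no Solr calls are made.

```python
from collections import defaultdict
from typing import NamedTuple


class SolrRecord(NamedTuple):
    # Stand-in for the record described above: id, boost and timestamp.
    id: str
    boost: float
    tstamp: int


def ids_to_delete(pairs):
    """Given (digest, record) pairs, return the ids of duplicates to delete."""
    groups = defaultdict(list)
    for digest, record in pairs:
        groups[digest].append(record)  # same digest -> same group

    doomed = []
    for records in groups.values():
        # Keep the highest boost; break ties by the latest timestamp.
        keep = max(records, key=lambda r: (r.boost, r.tstamp))
        doomed.extend(r.id for r in records if r is not keep)
    return doomed


pairs = [
    ("d1", SolrRecord("a", 1.0, 100)),
    ("d1", SolrRecord("b", 2.0, 50)),
    ("d1", SolrRecord("c", 2.0, 70)),
    ("d2", SolrRecord("x", 1.0, 1)),
]
print(sorted(ids_to_delete(pairs)))
```

  In this example "c" survives for digest d1 because it ties with "b" on boost 
but has the later timestamp, and "x" survives as the only record for d2.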
