Brian created NUTCH-1608:
----------------------------

             Summary: SolrDeleteDuplicates bug: choosing preferred page when 
duplicates does not work
                 Key: NUTCH-1608
                 URL: https://issues.apache.org/jira/browse/NUTCH-1608
             Project: Nutch
          Issue Type: Bug
          Components: indexer
    Affects Versions: 2.2.1, 2.1
         Environment: all
            Reporter: Brian
            Priority: Minor


The code that decides which version of a page to keep when duplicates exist is 
broken.  The bug is in the reduce function and is a common pitfall when using 
Hadoop MapReduce, as explained here:
   
http://cornercases.wordpress.com/2011/08/18/hadoop-object-reuse-pitfall-all-my-reducer-values-are-the-same/

The issue is that, in the reduce function, advancing the values iterator does 
not return a new object.  Hadoop reuses a single instance: each call to next() 
overwrites its contents and returns the same reference.  Comparing the current 
value with a previously stored reference is therefore incorrect, because both 
point to the same object and will always be equal.  Instead, it is necessary to 
copy the object to preserve it for later comparison.
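
The pitfall described above can be reproduced without Hadoop.  The following is 
a minimal sketch, not the Nutch code: the Doc class, the reusingIterator helper 
and the pickBuggy/pickFixed methods are hypothetical names invented for 
illustration, and the iterator only imitates the way a reducer's values iterator 
refills one shared instance.  It assumes at least one value per key.

```java
import java.util.Iterator;

public class ReuseDemo {
    /** Hypothetical stand-in for a Writable record; not the Nutch class. */
    static final class Doc {
        String id;
        float boost;
        Doc() {}
        Doc(Doc other) { this.id = other.id; this.boost = other.boost; } // copy constructor
        void set(String id, float boost) { this.id = id; this.boost = boost; }
    }

    /** Imitates a reducer values iterator: one instance, overwritten on each next(). */
    static Iterator<Doc> reusingIterator(String[] ids, float[] boosts) {
        final Doc shared = new Doc();
        return new Iterator<Doc>() {
            private int i = 0;
            @Override public boolean hasNext() { return i < ids.length; }
            @Override public Doc next() { shared.set(ids[i], boosts[i]); i++; return shared; }
        };
    }

    /** Buggy: keeps the reference, so "best" always aliases the current value. */
    static String pickBuggy(Iterator<Doc> values) {
        Doc best = null;
        while (values.hasNext()) {
            Doc cur = values.next();
            if (best == null || cur.boost > best.boost) best = cur; // same object as cur
        }
        return best.id;
    }

    /** Fixed: copies the value before storing it for later comparison. */
    static String pickFixed(Iterator<Doc> values) {
        Doc best = null;
        while (values.hasNext()) {
            Doc cur = values.next();
            if (best == null || cur.boost > best.boost) best = new Doc(cur);
        }
        return best.id;
    }

    public static void main(String[] args) {
        String[] ids = {"a", "b", "c"};
        float[] boosts = {1.0f, 3.0f, 2.0f};
        System.out.println(pickBuggy(reusingIterator(ids, boosts))); // prints "c" (last value read)
        System.out.println(pickFixed(reusingIterator(ids, boosts))); // prints "b" (highest boost)
    }
}
```

In the buggy variant the comparison is always between an object and itself, so 
the first stored reference is never replaced and simply ends up holding the last 
value read.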

The attached patch also encodes additional preferences between URLs.  After 
comparing the boost values, it compares the extension (preferring either no 
extension or a .htm or .html extension), then the length (preferring shorter 
URLs), then the timestamp.  This ordering can be changed as desired by editing 
the "isPreferredOver" method.
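
The preference order listed above could be sketched roughly as follows.  This 
is not the code from the patch: the Page class, its fields and the 
hasPreferredExtension helper are hypothetical, and two details are assumptions 
because the description does not state them: that a higher boost wins and that 
a newer timestamp wins.

```java
import java.util.Locale;

public class PreferredPage {
    /** Hypothetical record of the fields being compared; not the Nutch class. */
    static final class Page {
        final String url;
        final float boost;
        final long tstamp;
        Page(String url, float boost, long tstamp) {
            this.url = url; this.boost = boost; this.tstamp = tstamp;
        }
    }

    /** True if the URL path has no extension, or a .htm/.html extension. */
    static boolean hasPreferredExtension(String url) {
        String path = url;
        int cut = path.indexOf('?');            // drop the query string
        if (cut >= 0) path = path.substring(0, cut);
        cut = path.indexOf('#');                // drop the fragment
        if (cut >= 0) path = path.substring(0, cut);
        int schemeEnd = path.indexOf("://");
        int start = schemeEnd >= 0 ? schemeEnd + 3 : 0;
        int slash = path.lastIndexOf('/');
        if (slash < start) return true;         // host only, no path segment
        String last = path.substring(slash + 1);
        int dot = last.lastIndexOf('.');
        if (dot < 0) return true;               // no extension
        String ext = last.substring(dot + 1).toLowerCase(Locale.ROOT);
        return ext.equals("htm") || ext.equals("html");
    }

    /** Returns true if page a should be kept in preference to page b. */
    static boolean isPreferredOver(Page a, Page b) {
        if (a.boost != b.boost) return a.boost > b.boost;      // assumption: higher boost wins
        boolean ea = hasPreferredExtension(a.url);
        boolean eb = hasPreferredExtension(b.url);
        if (ea != eb) return ea;                               // no extension, .htm or .html wins
        if (a.url.length() != b.url.length()) {
            return a.url.length() < b.url.length();            // shorter URL wins
        }
        return a.tstamp > b.tstamp;                            // assumption: newer timestamp wins
    }

    public static void main(String[] args) {
        Page html = new Page("http://example.com/index.html", 1.0f, 10L);
        Page php = new Page("http://example.com/index.php", 1.0f, 20L);
        System.out.println(isPreferredOver(html, php)); // prints "true"
    }
}
```

Each criterion is consulted only when all earlier ones tie, so reordering the 
checks changes which duplicate is kept.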

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
