Brian created NUTCH-1608:
----------------------------
Summary: SolrDeleteDuplicates bug: choosing preferred page when
duplicates does not work
Key: NUTCH-1608
URL: https://issues.apache.org/jira/browse/NUTCH-1608
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 2.2.1, 2.1
Environment: all
Reporter: Brian
Priority: Minor
There is a bug in the code that decides which version of a page to keep when
duplicates are found. The bug is in the reduce function and is a common
pitfall when using Hadoop MapReduce, as explained here:
http://cornercases.wordpress.com/2011/08/18/hadoop-object-reuse-pitfall-all-my-reducer-values-are-the-same/
The issue is that advancing the values iterator in the reduce function does
not return a new object. Hadoop reuses a single instance, overwrites its
contents, and returns the same reference on every call. Comparing the current
value with a previously stored reference is therefore incorrect: both point to
the same object, so they always hold the same data. To preserve a value for
later comparison, the reducer has to make a copy of it.
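The pitfall can be reproduced without a cluster. The following is a minimal sketch, not the Nutch code: `Doc`, `reusingIterator`, `pickBuggy`, and `pickFixed` are hypothetical names, and the iterator only imitates Hadoop's behaviour of refilling one shared instance. It assumes the preferred document is the one with the highest boost.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class ReuseDemo {
    // Hypothetical stand-in for a Writable record; not the Nutch record class.
    static final class Doc {
        String id;
        float boost;
        Doc() {}
        Doc(String id, float boost) { this.id = id; this.boost = boost; }
        Doc(Doc other) { this(other.id, other.boost); } // copy constructor
    }

    // Imitates the reducer's value iterator: one shared instance is
    // overwritten and returned on every call to next().
    static Iterator<Doc> reusingIterator(List<Doc> source) {
        final Doc shared = new Doc();
        final Iterator<Doc> it = source.iterator();
        return new Iterator<Doc>() {
            @Override public boolean hasNext() { return it.hasNext(); }
            @Override public Doc next() {
                Doc d = it.next();
                shared.id = d.id;
                shared.boost = d.boost;
                return shared;
            }
        };
    }

    // Buggy: 'best' stores the shared reference, so it always aliases 'cur'
    // and the comparison never sees two different values.
    // Assumes at least one value, as a reducer always receives.
    static String pickBuggy(Iterator<Doc> values) {
        Doc best = null;
        while (values.hasNext()) {
            Doc cur = values.next();
            if (best == null || cur.boost > best.boost) {
                best = cur;
            }
        }
        return best.id;
    }

    // Fixed: copy the value before keeping it for later comparison.
    static String pickFixed(Iterator<Doc> values) {
        Doc best = null;
        while (values.hasNext()) {
            Doc cur = values.next();
            if (best == null || cur.boost > best.boost) {
                best = new Doc(cur);
            }
        }
        return best.id;
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
            new Doc("a", 1.0f), new Doc("b", 3.0f), new Doc("c", 2.0f));
        System.out.println(pickBuggy(reusingIterator(docs))); // prints "c"
        System.out.println(pickFixed(reusingIterator(docs))); // prints "b"
    }
}
```

In the buggy version the last value always wins regardless of boost, because the stored reference and the current value are the same object.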
The attached patch also encodes additional preferences between URLs. After
comparing the boost values, it compares the extension (preferring no
extension, .htm, or .html), then the length (preferring shorter URLs), then
the timestamp. This order can be changed as desired by editing the
"isPreferredOver" method.
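The preference order described above could look roughly like the sketch below. This is not the patch itself: the `Page` class, its field names, and the `extensionRank` helper are hypothetical, and two details are assumptions not stated in the description, namely that a higher boost wins and that a newer timestamp wins. The sketch also assumes absolute URLs.

```java
import java.util.Locale;

class UrlPreference {
    // Hypothetical record; field names are illustrative only.
    static final class Page {
        final String url;
        final float boost;
        final long tstamp;
        Page(String url, float boost, long tstamp) {
            this.url = url;
            this.boost = boost;
            this.tstamp = tstamp;
        }
    }

    // Returns 0 for preferred URLs (no extension, .htm, .html), 1 otherwise.
    static int extensionRank(String url) {
        String path = url;
        int cut = path.indexOf('?');
        if (cut >= 0) path = path.substring(0, cut);   // drop query string
        cut = path.indexOf('#');
        if (cut >= 0) path = path.substring(0, cut);   // drop fragment
        int scheme = path.indexOf("://");
        int pathStart = scheme >= 0 ? path.indexOf('/', scheme + 3) : 0;
        if (pathStart < 0) return 0;                   // host only: no extension
        String last = path.substring(path.lastIndexOf('/') + 1);
        int dot = last.lastIndexOf('.');
        if (dot < 0) return 0;                         // no extension
        String ext = last.substring(dot + 1).toLowerCase(Locale.ROOT);
        return (ext.equals("htm") || ext.equals("html")) ? 0 : 1;
    }

    // True if 'a' should be kept in favour of 'b'.
    static boolean isPreferredOver(Page a, Page b) {
        if (a.boost != b.boost) {
            return a.boost > b.boost;                  // assumption: higher boost wins
        }
        int ra = extensionRank(a.url);
        int rb = extensionRank(b.url);
        if (ra != rb) {
            return ra < rb;                            // none/.htm/.html preferred
        }
        if (a.url.length() != b.url.length()) {
            return a.url.length() < b.url.length();    // shorter URL preferred
        }
        return a.tstamp > b.tstamp;                    // assumption: newer wins
    }
}
```

Each criterion is consulted only when all earlier ones tie, so reordering the `if` blocks is enough to change the priority.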