[ 
https://issues.apache.org/jira/browse/NUTCH-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1608:
----------------------------------------

    Fix Version/s: 2.3
    
> SolrDeleteDuplicates bug: choosing preferred page when duplicates does not 
> work
> -------------------------------------------------------------------------------
>
>                 Key: NUTCH-1608
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1608
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 2.1, 2.2.1
>         Environment: all
>            Reporter: Brian
>            Priority: Minor
>              Labels: patch
>             Fix For: 2.3
>
>         Attachments: NUTCH-1608.patch
>
>
> There is a bug in the code that decides which version of a page to keep when 
> there are duplicates.  The bug is in the reduce function and is a common 
> pitfall when using Hadoop MapReduce, as explained here:
>    
> http://cornercases.wordpress.com/2011/08/18/hadoop-object-reuse-pitfall-all-my-reducer-values-are-the-same/
> The issue is that, in the reduce function, advancing the values iterator does 
> not return a new object.  Hadoop reuses a single instance, overwrites its 
> contents with the next value, and returns the same reference.  Comparing the 
> current value with a previously stored reference is therefore incorrect: both 
> point to the same object and always compare as equal.  Instead, the stored 
> value must be copied so that it is preserved for later comparison. 
> The attached patch also encodes additional preferences between URLs.  After 
> comparing the boost values, it compares the extension (preferring no 
> extension, or a .htm or .html extension), then the length (preferring shorter 
> URLs), then the timestamp.  This order can be changed as desired by editing 
> the "isPreferredOver" method.
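
The object-reuse pitfall and the preference order described above can be illustrated with a small, self-contained Java sketch. It is not the code from NUTCH-1608.patch and does not use the Hadoop or Nutch APIs: Doc, ReusingIterator, and DedupSketch are hypothetical names, and ReusingIterator only imitates how Hadoop hands the reducer one recycled object. The direction of the boost comparison (higher wins) and of the timestamp comparison (newer wins) are assumptions, since the description does not state them.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for the record the reducer compares. */
final class Doc {
    String url;
    float boost;
    long tstamp;

    Doc(String url, float boost, long tstamp) {
        this.url = url;
        this.boost = boost;
        this.tstamp = tstamp;
    }

    /** Overwrites this object's fields in place, as a recycled value would be. */
    void set(Doc other) {
        url = other.url;
        boost = other.boost;
        tstamp = other.tstamp;
    }

    Doc copy() {
        return new Doc(url, boost, tstamp);
    }

    /** True when the last path segment has no extension, or .htm/.html. */
    static boolean hasPreferredExtension(String url) {
        String path = URI.create(url).getPath(); // simplification: assumes a well-formed URL
        if (path == null) {
            path = "";
        }
        String last = path.substring(path.lastIndexOf('/') + 1);
        int dot = last.lastIndexOf('.');
        if (dot < 0) {
            return true;
        }
        String ext = last.substring(dot + 1).toLowerCase();
        return ext.equals("htm") || ext.equals("html");
    }

    /** Order from the description: boost, extension, URL length, timestamp. */
    boolean isPreferredOver(Doc other) {
        if (boost != other.boost) {
            return boost > other.boost;            // assumption: higher boost wins
        }
        boolean mine = hasPreferredExtension(url);
        boolean theirs = hasPreferredExtension(other.url);
        if (mine != theirs) {
            return mine;
        }
        if (url.length() != other.url.length()) {
            return url.length() < other.url.length();
        }
        return tstamp > other.tstamp;              // assumption: newer wins
    }
}

/** Imitates Hadoop's behavior: next() always returns the same instance. */
final class ReusingIterator implements Iterator<Doc> {
    private final Iterator<Doc> source;
    private final Doc shared = new Doc("", 0f, 0L);

    ReusingIterator(List<Doc> docs) {
        source = docs.iterator();
    }

    public boolean hasNext() {
        return source.hasNext();
    }

    public Doc next() {
        shared.set(source.next());
        return shared;
    }
}

final class DedupSketch {
    /** Buggy: 'best' aliases the recycled object, so it silently changes. */
    static String pickBuggy(Iterator<Doc> values) {
        Doc best = null;
        while (values.hasNext()) {
            Doc current = values.next();
            if (best == null || current.isPreferredOver(best)) {
                best = current;
            }
        }
        return best == null ? null : best.url;
    }

    /** Fixed: a copy is stored, so later values cannot overwrite it. */
    static String pickFixed(Iterator<Doc> values) {
        Doc best = null;
        while (values.hasNext()) {
            Doc current = values.next();
            if (best == null || current.isPreferredOver(best)) {
                best = current.copy();
            }
        }
        return best == null ? null : best.url;
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
            new Doc("http://example.com/page.html", 1.0f, 10L),
            new Doc("http://example.com/page.php", 1.0f, 20L));
        System.out.println("buggy: " + pickBuggy(new ReusingIterator(docs)));
        System.out.println("fixed: " + pickFixed(new ReusingIterator(docs)));
    }
}
```

In the buggy variant the stored reference and the current value are the same object, so the comparison never prefers one over the other and the result is simply whichever value was read last. In real Hadoop code the copy would be made with whatever cloning mechanism the value type offers.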

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
