[ 
https://issues.apache.org/jira/browse/NUTCH-2540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Vachon updated NUTCH-2540:
------------------------------
    Environment: Non-distributed, single node, standalone Nutch jobs run in a 
single JVM with HBase as the data store. 2.3.1

> Support Generic Deduplication in Nutch 2.x
> ------------------------------------------
>
>                 Key: NUTCH-2540
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2540
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 2.3.1
>         Environment: Non-distributed, single node, standalone Nutch jobs run 
> in a single JVM with HBase as the data store. 2.3.1
>            Reporter: Ben Vachon
>            Priority: Major
>              Labels: dedupe
>             Fix For: 2.4
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> Currently, deduplication in 2.x exists only as a utility for the Solr index.
> My use case for Nutch required deduplication, so I wrote custom code that 
> checks for duplicates based on digest and deletes them at index time. I 
> figured I'd port the change so that others could use it as well.
> This is a very simple approach to deduplication, and there's plenty of room 
> to improve it.
> This change adds a new DataStore for Duplicate entries, which are simply 
> lists of URLs keyed by signature.
> A DeduplicatorJob can be run between the DbUpdaterJob and IndexingJob to map 
> WebPages into the Duplicate DataStore.
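The mapping step described above boils down to grouping page URLs by their content signature. The sketch below is hypothetical (it does not use the actual Nutch or Gora classes) and just illustrates the grouping idea with plain collections:

```java
import java.util.*;

// Hypothetical sketch of the DeduplicatorJob's mapping idea: each content
// digest (signature) maps to the list of URLs whose pages share that digest.
public class DuplicateGrouping {
    public static Map<String, List<String>> groupByDigest(Map<String, String> urlToDigest) {
        Map<String, List<String>> duplicates = new HashMap<>();
        for (Map.Entry<String, String> e : urlToDigest.entrySet()) {
            // key = digest, value = list of URLs sharing it
            duplicates.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        return duplicates;
    }

    public static void main(String[] args) {
        Map<String, String> pages = new LinkedHashMap<>();
        pages.put("http://example.com/a", "d1");
        pages.put("http://example.com/a/copy", "d1");
        pages.put("http://example.com/b", "d2");
        Map<String, List<String>> dup = groupByDigest(pages);
        System.out.println(dup.get("d1")); // the two URLs sharing digest d1
    }
}
```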
> Since the key of the Duplicate store is the digest field of the WebPage store 
> entries, duplicate matching can be configured via extension of the Signature 
> abstract class.
> A new "-deduplicate" argument is added to the IndexingJob (false by default). 
> If this flag is set, the IndexingJob will check the Duplicate DataStore for 
> duplicate URLs and run pluggable DuplicateFilters to determine which URL 
> belongs to the original WebPage. It then skips the WebPage if it is not the 
> original, or, if it is the original, deletes the other pages from the index.
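The skip-or-delete decision above can be sketched in isolation. The class and method names here are hypothetical, not the patch's actual API; the sketch only shows the control flow once the filters have chosen the original URL:

```java
import java.util.*;

// Hypothetical sketch of the "-deduplicate" decision: a non-original page is
// skipped; indexing the original triggers deletes of its duplicates.
public class DedupDecision {
    enum Action { INDEX, SKIP }

    // Returns the action for `url`; when `url` is the original, the other
    // duplicate URLs are collected into `toDelete` for removal from the index.
    static Action decide(String url, String original, List<String> duplicates,
                         List<String> toDelete) {
        if (!url.equals(original)) return Action.SKIP;
        for (String d : duplicates) {
            if (!d.equals(original)) toDelete.add(d);
        }
        return Action.INDEX;
    }

    public static void main(String[] args) {
        List<String> dups = Arrays.asList("http://example.com/story",
                                          "http://example.com/story-copy");
        List<String> toDelete = new ArrayList<>();
        Action a = decide("http://example.com/story", "http://example.com/story",
                          dups, toDelete);
        System.out.println(a + " deletes=" + toDelete);
    }
}
```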
> I've also added a BasicDuplicateFilter plugin class that considers the URL 
> with the shortest path to be the original.
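The shortest-path heuristic can be sketched as below. This is an assumed, simplified stand-in for the BasicDuplicateFilter (real URL normalization would need more care), comparing only the length of each URL's path component:

```java
import java.net.URI;
import java.util.*;

// Hypothetical sketch of the BasicDuplicateFilter heuristic: among URLs
// sharing a digest, the one with the shortest path is taken as the original.
public class ShortestPathFilter {
    public static String pickOriginal(List<String> urls) {
        return urls.stream()
            .min(Comparator.comparingInt(u -> URI.create(u).getPath().length()))
            .orElseThrow(() -> new IllegalArgumentException("empty URL list"));
    }

    public static void main(String[] args) {
        List<String> dups = Arrays.asList(
            "http://example.com/news/2018/copy-of-story",
            "http://example.com/story");
        System.out.println(pickOriginal(dups)); // prints http://example.com/story
    }
}
```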
> Eventually, it would be best to consider things like score and fetch time 
> when determining which WebPage to keep and which to remove.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
