[ https://issues.apache.org/jira/browse/NUTCH-2540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ben Vachon updated NUTCH-2540:
------------------------------
    Environment: Non-distributed, single node, standalone Nutch jobs run in a single JVM with HBase as the data store. 2.3.1

> Support Generic Deduplication in Nutch 2.x
> ------------------------------------------
>
>                 Key: NUTCH-2540
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2540
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 2.3.1
>         Environment: Non-distributed, single node, standalone Nutch jobs run
> in a single JVM with HBase as the data store. 2.3.1
>            Reporter: Ben Vachon
>            Priority: Major
>              Labels: dedupe
>             Fix For: 2.4
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> Currently, deduplication in 2.x exists only as a utility for the Solr index.
> My use case for Nutch required deduplication, so I wrote custom code that
> checks for duplicates based on digest and deletes them at index time. I
> figured I'd port the change so that others could use it as well.
> This is a very simple approach to deduplication. There's plenty of room to
> improve it.
> This change adds a new DataStore for Duplicate entries, which are just lists
> of URLs with signatures as keys.
> A DeduplicatorJob can be run between the DbUpdatorJob and IndexingJob to map
> WebPages into the Duplicate DataStore.
> Since the key of the Duplicate store is the digest field of the WebPage store
> entries, duplicate matching can be configured by extending the Signature
> abstract class.
> A new "-deduplicate" argument is added to the IndexingJob (false by default).
> If this flag is used, the IndexingJob will check the Duplicate DataStore
> for duplicate URLs and run pluggable DuplicateFilters to determine which URL
> belongs to the original WebPage. It then skips the WebPage if it is not the
> original, or deletes the other pages from the index if it is the original.
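The flow described above can be sketched in plain Java. This is only an illustration under stated assumptions, not the code from the patch: `DedupSketch`, `groupByDigest`, `decide`, `Action` and `shortestPath` are hypothetical names, an in-memory map stands in for the Duplicate DataStore, and the example chooser (shortest URL path) is just one possible rule for picking the original.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch; none of these names come from the Nutch code base.
final class DedupSketch {

    /** What the indexer would do with one page. */
    enum Action { INDEX, INDEX_AND_DELETE_OTHERS, SKIP }

    /** Stand-in for the Duplicate store: digest -> URLs sharing that digest. */
    static Map<String, List<String>> groupByDigest(Map<String, String> urlToDigest) {
        Map<String, List<String>> duplicates = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : urlToDigest.entrySet()) {
            duplicates.computeIfAbsent(e.getValue(), k -> new ArrayList<>())
                      .add(e.getKey());
        }
        return duplicates;
    }

    /** Index-time decision for one URL, given the URLs that share its digest. */
    static Action decide(String url, List<String> sameDigest,
                         Function<List<String>, String> pickOriginal) {
        if (sameDigest == null || sameDigest.size() < 2) {
            return Action.INDEX; // no duplicates recorded for this digest
        }
        String original = pickOriginal.apply(sameDigest);
        return url.equals(original) ? Action.INDEX_AND_DELETE_OTHERS : Action.SKIP;
    }

    /** Example chooser: the URL whose path component is shortest. */
    static String shortestPath(List<String> urls) {
        return urls.stream()
                   .min(Comparator.comparingInt(DedupSketch::pathLength))
                   .orElseThrow(IllegalArgumentException::new);
    }

    private static int pathLength(String url) {
        String path = URI.create(url).getPath();
        return path == null ? 0 : path.length(); // opaque URIs have no path
    }
}
```

With this sketch, a caller would build the map once with `groupByDigest` and then call `decide(url, duplicates.get(digest), DedupSketch::shortestPath)` for each page as it is indexed. Passing the chooser as a function mirrors the pluggable-filter idea without assuming anything about the real plugin interface.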
> I've also added a BasicDuplicateFilter plugin class that considers the URL
> with the shortest path to be the original.
> Eventually, it would be best to consider things like score and fetch time
> when determining which WebPage to keep and which to remove.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)