[ https://issues.apache.org/jira/browse/NUTCH-95?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrzej Bialecki closed NUTCH-95.
----------------------------------
    Resolution: Duplicate

> DeleteDuplicates depends on the order of input segments
> --------------------------------------------------------
>
>                 Key: NUTCH-95
>                 URL: https://issues.apache.org/jira/browse/NUTCH-95
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 0.6, 0.7, 0.8
>            Reporter: Andrzej Bialecki
>            Assignee: Andrzej Bialecki
>
> DeleteDuplicates depends on the order in which the input segments are processed, which in turn depends on the order of segment dirs returned from NutchFileSystem.listFiles(File). In most cases this is undesired and may lead to deleting the wrong records from indexes. The silent assumption that segments at the end of the listing are more recent is not always true.
>
> Here's the explanation:
>
> * Dedup first deletes URL duplicates by computing an MD5 hash for each URL and sorting all records by (hash, segmentIdx, docIdx). SegmentIdx is just an int index into the array of open IndexReaders, and if segment dirs are moved/copied/renamed, entries in that array may change their order. For all records with the same hash, Dedup then keeps just the first entry in the sort order. Naturally, if segmentIdx changes due to dir renaming, a different record will be kept and different ones will be deleted...
>
> * Then Dedup deletes content duplicates, again by computing a hash for each document's content and sorting records by (hash, segmentIdx, docIdx). However, by now we already have a different set of undeleted docs depending on the order of input segments. On top of that, the same factor acts here too: segmentIdx changes when the input segment dirs are re-shuffled, so when identical entries are compared, the one with the lowest (segmentIdx, docIdx) is picked.
>
> Solution: use the fetch date from the first record in each segment to determine the order of segments. Alternatively, modify DeleteDuplicates to use the newer algorithm from SegmentMergeTool. That algorithm works by sorting records using tuples of (urlHash, contentHash, fetchDate, score, urlLength). Then:
>
> 1. If urlHash is the same, keep the doc with the highest fetchDate (the latest version, as recorded by Fetcher).
> 2. If contentHash is the same, keep the doc with the highest score; if the scores are the same, keep the doc with the shortest URL.
>
> An initial fix will be prepared for trunk/ and then backported to the release branch.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
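For illustration, here is a minimal, self-contained Java sketch of the two selection rules quoted above. The DedupSketch and IndexDoc names and all field names are hypothetical stand-ins, not the actual Nutch types, and the two-pass formulation is a simplification of the single sort over (urlHash, contentHash, fetchDate, score, urlLength) tuples described in the issue:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Function;

// Sketch of the SegmentMergeTool-style dedup rules. All names here are
// illustrative, not the real Nutch classes.
public class DedupSketch {

    static class IndexDoc {
        final String urlHash;     // MD5 hash of the URL
        final String contentHash; // MD5 hash of the page content
        final long fetchDate;     // fetch time recorded by Fetcher
        final float score;
        final int urlLength;
        boolean keep = true;      // false once marked as a duplicate

        IndexDoc(String urlHash, String contentHash, long fetchDate,
                 float score, int urlLength) {
            this.urlHash = urlHash;
            this.contentHash = contentHash;
            this.fetchDate = fetchDate;
            this.score = score;
            this.urlLength = urlLength;
        }
    }

    static void dedup(List<IndexDoc> docs) {
        // Rule 1: among docs sharing a urlHash, keep the one with the
        // highest fetchDate. Sorting puts the doc to keep first in each
        // group; segment listing order plays no role, unlike the
        // (hash, segmentIdx, docIdx) sort criticized in the issue.
        docs.sort(Comparator.comparing((IndexDoc d) -> d.urlHash)
            .thenComparing(Comparator.comparingLong(
                (IndexDoc d) -> d.fetchDate).reversed()));
        markAllButFirst(docs, d -> d.urlHash);

        // Rule 2: among the survivors sharing a contentHash, keep the one
        // with the highest score, breaking ties by the shortest URL.
        List<IndexDoc> survivors = new ArrayList<>();
        for (IndexDoc d : docs) {
            if (d.keep) survivors.add(d);
        }
        survivors.sort(Comparator.comparing((IndexDoc d) -> d.contentHash)
            .thenComparing(Comparator.comparingDouble(
                (IndexDoc d) -> d.score).reversed())
            .thenComparingInt(d -> d.urlLength));
        markAllButFirst(survivors, d -> d.contentHash);
    }

    // Within each group of equal keys, the first doc in sort order is
    // kept; every later doc is marked for deletion from the index.
    static void markAllButFirst(List<IndexDoc> sorted,
                                Function<IndexDoc, String> key) {
        String prev = null;
        for (IndexDoc d : sorted) {
            String k = key.apply(d);
            if (k.equals(prev)) {
                d.keep = false;
            }
            prev = k;
        }
    }
}

Because both passes sort on content-derived keys (hashes, fetch dates, scores, URL lengths) rather than on an index into the open IndexReaders, renaming or re-shuffling segment dirs cannot change which record survives.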