    [ http://issues.apache.org/jira/browse/NUTCH-371?page=comments#action_12439643 ]

Andrzej Bialecki commented on NUTCH-371:
----------------------------------------
I think we need to change DeleteDuplicates to implement the following algorithm:

Step 1: delete URL duplicates, keeping the most recent document

Step 2: delete content duplicates, keeping the one with the highest score (or optionally the one with the shortest URL?)

The order of these steps is important: first we need to ensure that we will keep the most recent versions of the pages - currently dedup removes by content hash first, which may delete newer documents and keep older ones ... oops. Indexer doesn't check this either - see NUTCH-378 for more details. This requires storing fetchTime in the index, which automatically solves NUTCH-95.

The second step would keep the best-scoring pages and discard all others. Or perhaps we should keep the shortest URLs?

Finally, we really, really need a JUnit test for this - I already started writing one, stay tuned.

> DeleteDuplicates should remove documents with duplicate URLs
> -------------------------------------------------------------
>
>                 Key: NUTCH-371
>                 URL: http://issues.apache.org/jira/browse/NUTCH-371
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>            Reporter: Chris Schneider
>
> DeleteDuplicates is supposed to delete documents with duplicate URLs (after deleting documents with identical MD5 hashes), but this part is apparently not yet implemented. Here's the comment from DeleteDuplicates.java:
> // 2. map indexes -> <<url, fetchdate>, <index,doc>>
> //    partition by url
> //    reduce, deleting all but most recent.
> //
> // Part 2 is not yet implemented, but the Indexer currently only indexes one
> // URL per page, so this is not a critical problem.
> It is apparently also known that re-fetching the same URL (e.g., one month later) will result in more than one document with the same URL (this is alluded to in NUTCH-95), but the comment above suggests that the indexer will solve the problem before DeleteDuplicates, because it will only index one document per URL.
> This is not necessarily the case if the segments are to be divided among search servers, as each server will have its own index built from its own portion of the segments. Thus, if the URL in question was fetched in different segments, and these segments end up assigned to different search servers, then the indexer can't be relied on to eliminate the duplicates.
> Thus, it seems that the second part of the DeleteDuplicates algorithm (i.e., deleting documents with duplicate URLs) needs to be implemented. I agree with Byron and Andrzej that the most recently fetched document (rather than the one with the highest score) should be preserved.
> Finally, it's also possible to get duplicate URLs in the segments without re-fetching an expired URL in the crawldb. This can happen if 3 different URLs all redirect to the same target URL. This is yet another consequence of handling redirections immediately, rather than adding the target URL to the crawldb for fetching in some subsequent segment (see NUTCH-273).

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
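
[Editor's note: the following is a minimal, in-memory sketch of the two-step ordering proposed in the comment above. The IndexedDoc class and its field names are hypothetical stand-ins for the dedup-relevant index fields; the real DeleteDuplicates runs as a MapReduce job over the Lucene indexes, so this only illustrates the keep-newest-then-keep-best-score logic, not the actual Nutch API.]

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Illustrative sketch of the proposed dedup order:
 * step 1 removes URL duplicates keeping the most recently fetched document,
 * step 2 removes content duplicates keeping the highest-scoring one.
 */
public class DedupSketch {

  /** Hypothetical holder for one indexed document's dedup-relevant fields. */
  static class IndexedDoc {
    final String url;
    final String contentHash; // e.g. MD5 of the page content
    final long fetchTime;     // must be stored in the index (cf. NUTCH-95)
    final float score;

    IndexedDoc(String url, String contentHash, long fetchTime, float score) {
      this.url = url;
      this.contentHash = contentHash;
      this.fetchTime = fetchTime;
      this.score = score;
    }
  }

  /** Step 1: among documents sharing a URL, keep only the most recently fetched. */
  static List<IndexedDoc> dedupByUrl(List<IndexedDoc> docs) {
    Map<String, IndexedDoc> newestPerUrl = new HashMap<>();
    for (IndexedDoc d : docs) {
      newestPerUrl.merge(d.url, d,
          (a, b) -> a.fetchTime >= b.fetchTime ? a : b);
    }
    return new ArrayList<>(newestPerUrl.values());
  }

  /**
   * Step 2: among documents sharing a content hash, keep only the highest-scoring
   * one (the shortest-URL variant would just swap the comparison).
   */
  static List<IndexedDoc> dedupByContent(List<IndexedDoc> docs) {
    Map<String, IndexedDoc> bestPerHash = new HashMap<>();
    for (IndexedDoc d : docs) {
      bestPerHash.merge(d.contentHash, d,
          (a, b) -> a.score >= b.score ? a : b);
    }
    return new ArrayList<>(bestPerHash.values());
  }

  /**
   * The order matters: URL dedup runs first, so an older copy of a re-fetched
   * page can never survive at the expense of the newer copy during the
   * content-hash pass.
   */
  static List<IndexedDoc> dedup(List<IndexedDoc> docs) {
    return dedupByContent(dedupByUrl(docs));
  }
}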
