DeleteDuplicates should remove documents with duplicate URLs
------------------------------------------------------------

                 Key: NUTCH-371
                 URL: http://issues.apache.org/jira/browse/NUTCH-371
             Project: Nutch
          Issue Type: Bug
          Components: indexer
            Reporter: Chris Schneider


DeleteDuplicates is supposed to delete documents with duplicate URLs (after 
deleting documents with identical MD5 hashes), but this part is apparently not 
yet implemented. Here's the comment from DeleteDuplicates.java:

// 2. map indexes -> <<url, fetchdate>, <index,doc>>
// partition by url
// reduce, deleting all but most recent.
//
// Part 2 is not yet implemented, but the Indexer currently only indexes one
// URL per page, so this is not a critical problem.

It is apparently also known that re-fetching the same URL (e.g., one month 
later) will result in more than one document with the same URL (this is alluded 
to in NUTCH-95), but the comment above suggests that the indexer resolves the 
problem before DeleteDuplicates runs, because it will only index one document 
per URL.

This is not necessarily the case if the segments are to be divided among search 
servers, as each server will have its own index built from its own portion of 
the segments. Thus, if the URL in question was fetched in different segments, 
and these segments end up assigned to different search servers, then the 
indexer can't be relied on to eliminate the duplicates.
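
To illustrate the scenario above, here is a minimal sketch in plain Java. It
does not use any Nutch or Lucene API; the class name, method name, server
names, and example URLs are all hypothetical. Each server's index is modeled
as a Set of URLs, which stands in for the indexer's guarantee of one document
per URL within a single index.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class ShardedIndexSketch {

    /**
     * Returns the URLs that appear in more than one per-server index.
     * Each value in urlsByServer is a Set, so a URL can occur at most
     * once per server (the assumed per-index guarantee).
     */
    static Set<String> urlsOnMultipleServers(Map<String, Set<String>> urlsByServer) {
        Map<String, Integer> serverCount = new HashMap<>();
        for (Set<String> urls : urlsByServer.values()) {
            for (String url : urls) {
                serverCount.merge(url, 1, Integer::sum);
            }
        }
        Set<String> duplicated = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : serverCount.entrySet()) {
            if (entry.getValue() > 1) {
                duplicated.add(entry.getKey());
            }
        }
        return duplicated;
    }

    public static void main(String[] args) {
        // Hypothetical layout: the same page was fetched in two segments,
        // and those segments were assigned to different search servers.
        Map<String, Set<String>> urlsByServer = new HashMap<>();
        Set<String> serverA = new HashSet<>();
        serverA.add("http://example.com/page");
        serverA.add("http://example.com/only-on-a");
        Set<String> serverB = new HashSet<>();
        serverB.add("http://example.com/page");
        urlsByServer.put("serverA", serverA);
        urlsByServer.put("serverB", serverB);

        System.out.println(urlsOnMultipleServers(urlsByServer));
    }
}
```

Even though each per-server index is free of duplicates on its own, a merged
search across both servers would still return the shared URL twice.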

Thus, it seems like the second part of the DeleteDuplicates algorithm (i.e., 
deleting documents with duplicate URLs) needs to be implemented. I agree with 
Byron and Andrzej that the most recently fetched document (rather than the one 
with the highest score) should be preserved.
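
As a rough illustration of the "group by URL, keep the most recent" rule, here
is a minimal in-memory sketch in plain Java. It is not the DeleteDuplicates
implementation and does not use MapReduce or Lucene; the IndexedDoc class, its
fields, and findUrlDuplicates are hypothetical names introduced only for this
example. The HashMap plays the role of the "partition by url" step, and the
comparison on fetch time plays the role of the reduce step.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class UrlDedupSketch {

    /** Hypothetical stand-in for one indexed document; not a Nutch class. */
    static final class IndexedDoc {
        final String url;
        final long fetchTime; // assumed to be epoch milliseconds
        final int indexId;    // which index holds the document
        final int docId;      // document number within that index

        IndexedDoc(String url, long fetchTime, int indexId, int docId) {
            this.url = url;
            this.fetchTime = fetchTime;
            this.indexId = indexId;
            this.docId = docId;
        }
    }

    /**
     * Returns the documents that should be deleted: for every URL,
     * all documents except the most recently fetched one.
     * On equal fetch times the document seen first is kept.
     */
    static List<IndexedDoc> findUrlDuplicates(List<IndexedDoc> docs) {
        Map<String, IndexedDoc> newestByUrl = new HashMap<>();
        List<IndexedDoc> toDelete = new ArrayList<>();
        for (IndexedDoc doc : docs) {
            IndexedDoc newest = newestByUrl.get(doc.url);
            if (newest == null) {
                newestByUrl.put(doc.url, doc);
            } else if (doc.fetchTime > newest.fetchTime) {
                toDelete.add(newest);          // older copy loses
                newestByUrl.put(doc.url, doc);
            } else {
                toDelete.add(doc);             // current doc is not newer
            }
        }
        return toDelete;
    }

    public static void main(String[] args) {
        List<IndexedDoc> docs = new ArrayList<>();
        docs.add(new IndexedDoc("http://example.com/a", 100L, 0, 1));
        docs.add(new IndexedDoc("http://example.com/a", 200L, 1, 7));
        docs.add(new IndexedDoc("http://example.com/b", 150L, 0, 2));

        for (IndexedDoc doc : findUrlDuplicates(docs)) {
            System.out.println("delete index=" + doc.indexId
                    + " doc=" + doc.docId + " url=" + doc.url);
        }
    }
}
```

Choosing fetch time rather than score as the tie-breaker follows the preference
stated above; swapping in a score comparison would only change the condition
in the else-if branch.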

Finally, it's also possible to get duplicate URLs in the segments without 
re-fetching an expired URL in the crawldb. This can happen if, say, 3 different 
URLs all redirect to the same target URL. This is yet another consequence of 
handling redirections immediately, rather than adding the target URL to the 
crawldb for fetching in some subsequent segment (see NUTCH-273).

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
