    [ http://issues.apache.org/jira/browse/NUTCH-371?page=comments#action_12439643 ]

Andrzej Bialecki commented on NUTCH-371:
----------------------------------------
I think we need to change DeleteDuplicates to implement the following algorithm:

Step 1: delete URL duplicates, keeping the most recent document

Step 2: delete content duplicates, keeping the one with the highest score (or optionally the one with the shortest URL?)

The order of these steps is important: first we need to ensure that we will keep the most recent versions of the pages - currently dedup removes by content hash first, which may delete newer documents and keep older ones ... oops. Indexer doesn't check this either - see NUTCH-378 for more details. This requires storing fetchTime in the index, which automatically solves NUTCH-95.

The second step would keep the best-scoring pages and discard all others. Or perhaps we should keep the shortest URLs?

Finally, we really, really need a JUnit test for this - I already started writing one, stay tuned.

> DeleteDuplicates should remove documents with duplicate URLs
> -------------------------------------------------------------
>
>                 Key: NUTCH-371
>                 URL: http://issues.apache.org/jira/browse/NUTCH-371
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>            Reporter: Chris Schneider
>
> DeleteDuplicates is supposed to delete documents with duplicate URLs (after deleting documents with identical MD5 hashes), but this part is apparently not yet implemented. Here's the comment from DeleteDuplicates.java:
> // 2. map indexes -> <<url, fetchdate>, <index,doc>>
> //    partition by url
> //    reduce, deleting all but most recent.
> //
> // Part 2 is not yet implemented, but the Indexer currently only indexes one
> // URL per page, so this is not a critical problem.
> It is apparently also known that re-fetching the same URL (e.g., one month later) will result in more than one document with the same URL (this is alluded to in NUTCH-95), but the comment above suggests that the indexer will solve the problem before DeleteDuplicates, because it will only index one document per URL.
> This is not necessarily the case if the segments are to be divided among search servers, as each server will have its own index built from its own portion of the segments. Thus, if the URL in question was fetched in different segments, and these segments end up assigned to different search servers, then the indexer can't be relied on to eliminate the duplicates.
> Thus, it seems that the second part of the DeleteDuplicates algorithm (i.e., deleting documents with duplicate URLs) needs to be implemented. I agree with Byron and Andrzej that the most recently fetched document (rather than the one with the highest score) should be preserved.
> Finally, it's also possible to get duplicate URLs in the segments without re-fetching an expired URL in the crawldb. This can happen if 3 different URLs all redirect to the same target URL. This is yet another consequence of handling redirections immediately, rather than adding the target URL to the crawldb for fetching in some subsequent segment (see NUTCH-273).

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
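
[Editor's note: the following is a minimal, in-memory sketch of the two-step ordering proposed in the comment above. The IndexedDoc class and its field names are hypothetical stand-ins for the dedup-relevant index fields; the real DeleteDuplicates runs as a MapReduce job over the Lucene indexes, so this only illustrates the keep-newest-then-keep-best-score logic, not the actual Nutch API.]

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Illustrative sketch of the proposed dedup order:
 * step 1 removes URL duplicates keeping the most recently fetched document,
 * step 2 removes content duplicates keeping the highest-scoring one.
 */
public class DedupSketch {

  /** Hypothetical holder for one indexed document's dedup-relevant fields. */
  static class IndexedDoc {
    final String url;
    final String contentHash; // e.g. MD5 of the page content
    final long fetchTime;     // must be stored in the index (cf. NUTCH-95)
    final float score;

    IndexedDoc(String url, String contentHash, long fetchTime, float score) {
      this.url = url;
      this.contentHash = contentHash;
      this.fetchTime = fetchTime;
      this.score = score;
    }
  }

  /** Step 1: among documents sharing a URL, keep only the most recently fetched. */
  static List<IndexedDoc> dedupByUrl(List<IndexedDoc> docs) {
    Map<String, IndexedDoc> newestPerUrl = new HashMap<>();
    for (IndexedDoc d : docs) {
      newestPerUrl.merge(d.url, d,
          (a, b) -> a.fetchTime >= b.fetchTime ? a : b);
    }
    return new ArrayList<>(newestPerUrl.values());
  }

  /**
   * Step 2: among documents sharing a content hash, keep only the highest-scoring
   * one (the shortest-URL variant would just swap the comparison).
   */
  static List<IndexedDoc> dedupByContent(List<IndexedDoc> docs) {
    Map<String, IndexedDoc> bestPerHash = new HashMap<>();
    for (IndexedDoc d : docs) {
      bestPerHash.merge(d.contentHash, d,
          (a, b) -> a.score >= b.score ? a : b);
    }
    return new ArrayList<>(bestPerHash.values());
  }

  /**
   * The order matters: URL dedup runs first, so an older copy of a re-fetched
   * page can never survive at the expense of the newer copy during the
   * content-hash pass.
   */
  static List<IndexedDoc> dedup(List<IndexedDoc> docs) {
    return dedupByContent(dedupByUrl(docs));
  }
}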
