[ https://issues.apache.org/jira/browse/NUTCH-656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Julien Nioche updated NUTCH-656:
--------------------------------

    Attachment: NUTCH-656.v3.patch

Thanks for your comments, Seb. This new patch addresses some of the issues you pointed out:

{quote}
shouldn't this be in package o.a.n.crawl instead of o.a.n.indexer? Only CrawlDb is involved, although it's related to indexing, of course.
{quote}

Done. It does not call the indexers directly and only modifies the crawldb, so this is indeed the right place for it.

{quote}
got a NPE if signature is null (may happen for successfully fetched docs, e.g., if parsing is skipped because of truncated content): we can skip docs without signature, they are not indexed and, consequently, never duplicates.
{quote}

Fixed.

{quote}
Status db_duplicate is used only in CleaningJob. Shouldn't it be used also in IndexerMapReduce? If DeduplicationJob is run before IndexingJob, duplicates are not even indexed. Indexer backends which do not allow removing docs after indexing would also benefit.
{quote}

I added some code in IndexerMapReduce so that entries marked as duplicates are now sent for deletion.

{quote}
In a continuous crawl it may happen that a deduplicated doc loses this status because the doc it is a duplicate of disappears. DeduplicationJob does not reset the duplicate status in this case. The doc does not get indexed until it is re-fetched. Triggering a re-index is hard because we would need the old segment with the fetched content. So we can ignore this problem (for now). Right?
{quote}

Yes, let's keep it simple for now.

Will commit this shortly and remove the indexer.solr subpackage in a separate JIRA.

Thanks for taking the time to review this.

> DeleteDuplicates based on crawlDB only
> ---------------------------------------
>
>         Key: NUTCH-656
>         URL: https://issues.apache.org/jira/browse/NUTCH-656
>     Project: Nutch
>  Issue Type: Wish
>  Components: indexer
>    Reporter: Julien Nioche
>    Assignee: Julien Nioche
> Attachments: NUTCH-656.patch, NUTCH-656.v2.patch, NUTCH-656.v3.patch
>
> The existing dedup functionality relies on Lucene indices and can't be used when the indexing is delegated to SOLR.
> I was wondering whether we could use the information from the crawlDB instead to detect the URLs to delete, then do the deletions in an indexer-neutral way. As far as I understand, the crawlDB contains all the elements we need for dedup, namely:
> * URL
> * signature
> * fetch time
> * score
> In map-reduce terms we would have two different jobs:
> * read the crawlDB and compare on URLs: keep only the most recent element; the older ones are stored in a file and will be deleted later
> * read the crawlDB and have a map function generating signatures as keys and URL + fetch time + score as values
> * the reduce function would depend on which parameter is set (i.e. use signature or score) and would output a list of URLs to delete (see the sketch below)
> This assumes that we can then use the URLs to identify documents in the indices.
> Any thoughts on this? Am I missing something?
> Julien
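To make the second job described above more concrete, here is a minimal sketch of what the signature-keyed reduce step could look like. It is a hypothetical illustration only: the class name, the tab-separated value encoding, and the tie-breaking rules are assumptions and are not taken from the attached patch, which marks duplicate entries in the crawldb rather than writing a delete list.

{code:java}
// Hypothetical, simplified sketch of the signature-based dedup reduce step.
// Values are assumed to arrive as "url \t fetchTime \t score" strings,
// keyed by the page signature produced in the map phase.
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class DedupBySignatureReducer extends Reducer<Text, Text, Text, Text> {

  @Override
  protected void reduce(Text signature, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    String bestUrl = null;
    long bestFetchTime = -1L;
    float bestScore = -1f;

    for (Text value : values) {
      String[] fields = value.toString().split("\t");
      String url = fields[0];
      long fetchTime = Long.parseLong(fields[1]);
      float score = Float.parseFloat(fields[2]);

      // Keep the "best" entry for this signature: most recently fetched,
      // then highest score. Which criterion wins first would be driven by
      // a job parameter, as suggested in the description.
      boolean better = fetchTime > bestFetchTime
          || (fetchTime == bestFetchTime && score > bestScore);

      if (bestUrl == null) {
        bestUrl = url;
        bestFetchTime = fetchTime;
        bestScore = score;
      } else if (better) {
        // The previous best is now a duplicate: emit its URL for deletion.
        context.write(new Text(bestUrl), new Text("duplicate"));
        bestUrl = url;
        bestFetchTime = fetchTime;
        bestScore = score;
      } else {
        // The current entry loses: emit its URL for deletion.
        context.write(new Text(url), new Text("duplicate"));
      }
    }
  }
}
{code}

The emitted URLs would then be used to identify and remove the corresponding documents in the index, or, as in the committed approach, the matching crawldb entries would simply be marked with a duplicate status and handled later by IndexerMapReduce/CleaningJob.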