[jira] [Commented] (NUTCH-2565) MergeDB incorrectly handles unfetched CrawlDatums

Jurian Broertjes (JIRA) Tue, 12 Jun 2018 09:02:20 -0700


    [ 
https://issues.apache.org/jira/browse/NUTCH-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16509801#comment-16509801
 ]


Jurian Broertjes commented on NUTCH-2565:
-----------------------------------------

Maybe it would be sufficient to test only for STATUS_DB_UNFETCHED in 
calculateLastFetchTime(datum), fall back on CrawlDatum.getFetchTime() in the 
merger, and pick the newest datum according to that.

That way we could also just pick the retries value from the newest one and keep 
it simple.
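
A rough sketch of what I have in mind, not the actual Nutch code. Datum is a 
hypothetical stand-in for CrawlDatum, the status constants are illustrative 
values, and returning 0 for unfetched datums is my assumption about how the 
check in calculateLastFetchTime() would behave:

```java
// Self-contained sketch; Datum and pickNewest are hypothetical names,
// not part of the Nutch API.
final class MergeSketch {
    // Illustrative status values, not taken from CrawlDatum.
    static final byte STATUS_DB_UNFETCHED = 1;
    static final byte STATUS_DB_FETCHED = 2;

    // Minimal stand-in for CrawlDatum with only the fields used here.
    static final class Datum {
        final byte status;
        final long fetchTime;     // milliseconds since epoch
        final int fetchInterval;  // seconds
        final int retries;

        Datum(byte status, long fetchTime, int fetchInterval, int retries) {
            this.status = status;
            this.fetchTime = fetchTime;
            this.fetchInterval = fetchInterval;
            this.retries = retries;
        }
    }

    // Assumption: an unfetched datum has no last fetch time, so report 0
    // instead of deriving one from fetchTime and fetchInterval.
    static long calculateLastFetchTime(Datum d) {
        if (d.status == STATUS_DB_UNFETCHED) {
            return 0L;
        }
        return d.fetchTime - d.fetchInterval * 1000L;
    }

    // Merger side: prefer the datum with the later last fetch time; when
    // that does not decide (e.g. both unfetched), fall back on fetchTime.
    // The winner is kept whole, so its retries value comes along with it.
    static Datum pickNewest(Datum a, Datum b) {
        long lastA = calculateLastFetchTime(a);
        long lastB = calculateLastFetchTime(b);
        if (lastA != lastB) {
            return lastA > lastB ? a : b;
        }
        return a.fetchTime >= b.fetchTime ? a : b;
    }

    public static void main(String[] args) {
        Datum fromSitemap = new Datum(STATUS_DB_UNFETCHED, 5_000_000L, 1000, 0);
        Datum fromCrawl = new Datum(STATUS_DB_FETCHED, 2_000_000L, 1000, 2);
        Datum kept = pickNewest(fromSitemap, fromCrawl);
        System.out.println("kept status=" + kept.status + " retries=" + kept.retries);
    }
}
```

With this reading, an unfetched datum from the sitemap crawlDB never wins 
over a fetched one, and two unfetched datums are ordered by fetchTime alone.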

I'll add a PR later for review.

> MergeDB incorrectly handles unfetched CrawlDatums
> -------------------------------------------------
>
>                 Key: NUTCH-2565
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2565
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.14
>            Reporter: Jurian Broertjes
>            Priority: Minor
>
> I ran into this issue when merging a crawlDB originating from sitemaps into 
> our normal crawlDB. CrawlDatums are merged based on the output of 
> AbstractFetchSchedule::calculateLastFetchTime(). When CrawlDatums are 
> unfetched, this can overwrite fetchTime or other fields.
> I assume this is a bug and have a simple fix for it that checks whether the 
> CrawlDatum has status db_unfetched.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
