[ https://issues.apache.org/jira/browse/NUTCH-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16509573#comment-16509573 ]
Jurian Broertjes commented on NUTCH-2565:
-----------------------------------------

One solution would be to sum the retries of both CrawlDatums. We could do this only for db_unfetched, or for other statuses as well. What do you think would be appropriate?

> MergeDB incorrectly handles unfetched CrawlDatums
> -------------------------------------------------
>
>                 Key: NUTCH-2565
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2565
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.14
>            Reporter: Jurian Broertjes
>            Priority: Minor
>
> I ran into this issue when merging a crawlDB originating from sitemaps into
> our normal crawlDB. CrawlDatums are merged based on the output of
> AbstractFetchSchedule::calculateLastFetchTime(). When CrawlDatums are
> unfetched, this can overwrite fetchTime or other fields.
> I assume this is a bug and have a simple fix for it that checks whether the
> CrawlDatum has status db_unfetched.
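
To make the retry-summing suggestion from the comment concrete, here is a minimal sketch. It does not use the real Nutch classes: `Datum`, `merge`, and the status constants are hypothetical stand-ins for CrawlDatum and the merge logic, and the rule that the entry with the later fetchTime otherwise wins is an assumption for illustration, not the actual MergeDB behaviour.

```java
// Sketch only: Datum is a hypothetical stand-in for Nutch's CrawlDatum,
// reduced to the three fields this discussion touches.
class RetryMergeSketch {

    // Illustrative status values, not taken from Nutch.
    static final int STATUS_DB_UNFETCHED = 1;
    static final int STATUS_DB_FETCHED = 2;

    static final class Datum {
        final int status;
        final long fetchTime;
        final int retries;

        Datum(int status, long fetchTime, int retries) {
            this.status = status;
            this.fetchTime = fetchTime;
            this.retries = retries;
        }
    }

    /**
     * Merge two entries for the same URL.
     * Assumption: the entry with the later fetchTime wins (ties keep a).
     * When both entries are unfetched, their retry counters are summed
     * instead of one silently replacing the other.
     */
    static Datum merge(Datum a, Datum b) {
        Datum winner = b.fetchTime > a.fetchTime ? b : a;
        boolean bothUnfetched = a.status == STATUS_DB_UNFETCHED
                && b.status == STATUS_DB_UNFETCHED;
        if (bothUnfetched) {
            return new Datum(winner.status, winner.fetchTime,
                    a.retries + b.retries);
        }
        return winner;
    }
}
```

Restricting the summation to db_unfetched, as done here, is the narrower of the two options raised in the comment; extending it to other statuses would mean relaxing the `bothUnfetched` check.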