[ https://issues.apache.org/jira/browse/NUTCH-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432456#comment-16432456 ]
Sebastian Nagel commented on NUTCH-2565:
----------------------------------------

Ok, good point. I think the CrawlDbMerger was never considered to merge an unfetched CrawlDb.

What about CrawlDatums with status db_unfetched and retries > 0? Shouldn't failed fetches also take precedence over newly injected items?

> MergeDB incorrectly handles unfetched CrawlDatums
> -------------------------------------------------
>
>                 Key: NUTCH-2565
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2565
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.14
>            Reporter: Jurian Broertjes
>            Priority: Minor
>
> I ran into this issue when merging a crawlDB originating from sitemaps into
> our normal crawlDB. CrawlDatums are merged based on output of
> AbstractFetchSchedule::calculateLastFetchTime(). When CrawlDatums are
> unfetched, this can overwrite fetchTime or other stuff.
> I assume this is a bug and have a simple fix for it that checks if CrawlDatum
> has status db_unfetched.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)