[ https://issues.apache.org/jira/browse/NUTCH-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1922:
----------------------------------------
    Fix Version/s:     (was: 2.4)
                   2.3.1

> DbUpdater overwrites fetch status for URLs from previous batches, causes repeated re-fetches
> --------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1922
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1922
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 2.3
>            Reporter: Gerhard Gossen
>             Fix For: 2.3.1
>
>         Attachments: NUTCH-1922.patch
>
>
> When Nutch 2 finds a link to a URL that was crawled in a previous batch, it
> resets the fetch status of that URL to {{unfetched}}. This makes this URL
> available for a re-fetch, even if its crawl interval is not yet over.
> To reproduce, using version 2.3:
> {code}
> # Nutch configuration
> ant runtime
> cd runtime/local
> mkdir seeds
> echo http://www.l3s.de/~gossen/nutch/a.html > seeds/1.txt
> bin/crawl seeds test 2
> {code}
> This uses two files {{a.html}} and {{b.html}} that link to each other.
> In batch 1, Nutch downloads {{a.html}} and discovers the URL of {{b.html}}.
> In batch 2, Nutch downloads {{b.html}} and discovers the link to {{a.html}}.
> This should update the score and link fields of {{a.html}}, but not the fetch
> status. However, when I run
> {{bin/nutch readdb -crawlId test -url http://www.l3s.de/~gossen/nutch/a.html | grep -a status}},
> it returns {{status: 1 (status_unfetched)}}.
> Expected would be {{status: 2 (status_fetched)}}.
> The reason seems to be that DbUpdateReducer assumes that [links to a URL not
> processed in the same batch always belong to new
> pages|https://github.com/apache/nutch/blob/release-2.3/src/java/org/apache/nutch/crawl/DbUpdateReducer.java#L97-L109].
> Before NUTCH-1556, all pages in the crawl DB were processed by the DBUpdate
> job, but that change skipped all pages with a different batch ID, so I assume
> that this introduced this behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
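
To illustrate the behavior the report expects, here is a minimal, self-contained Java sketch. It is not the code from NUTCH-1922.patch and does not use Nutch's real classes: StatusUpdateSketch, Page, and recordInlink are hypothetical names, and the crawl DB is modeled as a plain in-memory map. The only idea shown is that a newly discovered inlink should initialize a row as unfetched only when no row exists yet, and should otherwise leave the stored fetch status alone.

```java
import java.util.HashMap;
import java.util.Map;

public class StatusUpdateSketch {
    // Status codes as printed by readdb in the report above.
    static final int STATUS_UNFETCHED = 1;
    static final int STATUS_FETCHED = 2;

    /** Hypothetical stand-in for a crawl DB row; not Nutch's WebPage class. */
    static final class Page {
        int status;
        int inlinks;

        Page(int status) {
            this.status = status;
        }
    }

    /**
     * Records a newly discovered inlink to url.
     * Only a URL without an existing row is created as unfetched;
     * an existing row keeps its status and only its link data changes.
     */
    static Page recordInlink(Map<String, Page> crawlDb, String url) {
        Page page = crawlDb.get(url);
        if (page == null) {
            page = new Page(STATUS_UNFETCHED);
            crawlDb.put(url, page);
        }
        page.inlinks++;
        return page;
    }

    public static void main(String[] args) {
        Map<String, Page> crawlDb = new HashMap<>();
        String a = "http://www.l3s.de/~gossen/nutch/a.html";
        String b = "http://www.l3s.de/~gossen/nutch/b.html";

        // Batch 1: a.html is fetched and the link to b.html is discovered.
        crawlDb.put(a, new Page(STATUS_FETCHED));
        recordInlink(crawlDb, b);

        // Batch 2: b.html is fetched and the link back to a.html is discovered.
        crawlDb.get(b).status = STATUS_FETCHED;
        recordInlink(crawlDb, a);

        System.out.println("a.html status: " + crawlDb.get(a).status);
        System.out.println("b.html status: " + crawlDb.get(b).status);
    }
}
```

In this model, a.html keeps status 2 after batch 2 while its inlink count still increases, which matches the expectation stated in the report. How the real DbUpdateReducer would tell a new page from one that was skipped because of its batch ID is not covered here.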