[ https://issues.apache.org/jira/browse/NUTCH-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel updated NUTCH-2748:
-----------------------------------
    Attachment: test-NUTCH-2748.zip

> Fetch status gone (redirect exceeded) not to overwrite existing items in
> CrawlDb
> --------------------------------------------------------------------------------
>
>                 Key: NUTCH-2748
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2748
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb, fetcher
>    Affects Versions: 1.16
>            Reporter: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.17
>
>         Attachments: test-NUTCH-2748.zip
>
>
> If the fetcher is following redirects and the maximum number of redirects in a
> redirect chain (http.redirect.max) is reached, the fetcher stores a CrawlDatum
> item with status "fetch_gone" and protocol status "redir_exceeded". During the
> next CrawlDb update this "gone" item sets the status of existing items
> (including "db_fetched") to "db_gone". It shouldn't, as there has been no fetch
> of the final redirect target and nothing is known about its status. A wrong
> db_gone may then cause a page to be deleted from the search index.
>
> There are two possible solutions:
> 1. ignore protocol status "redir_exceeded" during the CrawlDb update
> 2. when http.redirect.max is hit, the fetcher stores nothing or a redirect
>    status instead of a fetch_gone
>
> Solution 2 seems easier to implement, and it would be possible to make the
> behavior configurable:
> - store the redirect target as outlink, i.e. the same behavior as if
>   http.redirect.max == 0
> - store "fetch_gone" (current behavior)
> - store nothing, i.e. ignore those redirects - this should be the default, as
>   it is close to the current behavior without the risk of accidentally setting
>   successful fetches to db_gone

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
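The overwrite described above can be illustrated with a small sketch of the CrawlDb update merge rule. This is a hypothetical Python model, not actual Nutch code (the real logic lives in Nutch's CrawlDb update job); the status names only mirror Nutch's CrawlDatum/ProtocolStatus constants. It shows how skipping fetch_gone items whose protocol status is redir_exceeded (solution 1) preserves an existing db_fetched entry:

```python
# Hypothetical model of the CrawlDb update merge -- NOT actual Nutch code.
# Status names mirror Nutch's CrawlDatum / ProtocolStatus constants.

DB_FETCHED = "db_fetched"
DB_GONE = "db_gone"
FETCH_GONE = "fetch_gone"
REDIR_EXCEEDED = "redir_exceeded"


def update_crawldb_entry(existing, fetched, ignore_redir_exceeded=True):
    """Merge one fetch result into an existing CrawlDb entry.

    existing: dict with key "status" describing the current CrawlDb item
    fetched:  dict with keys "status" and "protocol_status" from the fetcher
    """
    if (ignore_redir_exceeded
            and fetched["status"] == FETCH_GONE
            and fetched["protocol_status"] == REDIR_EXCEEDED):
        # Solution 1: the final redirect target was never actually fetched,
        # so nothing is known about its real status -- keep the existing item.
        return existing
    if fetched["status"] == FETCH_GONE:
        # Current behavior: a "gone" fetch overwrites the existing status.
        return {"status": DB_GONE}
    return existing


# A page that was fetched successfully earlier ...
existing = {"status": DB_FETCHED}
# ... is now reported "gone" only because http.redirect.max was exceeded:
fetched = {"status": FETCH_GONE, "protocol_status": REDIR_EXCEEDED}

print(update_crawldb_entry(existing, fetched)["status"])         # db_fetched
print(update_crawldb_entry(existing, fetched, False)["status"])  # db_gone
```

With the guard enabled the successful fetch survives the update; without it, the entry is demoted to db_gone and may later be deleted from the search index, which is exactly the bug reported here.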