[jira] [Commented] (NUTCH-2748) Fetch status gone (redirect exceeded) not to overwrite existing items in CrawlDb

Sebastian Nagel (Jira) Fri, 08 Nov 2019 05:14:26 -0800


    [ 
https://issues.apache.org/jira/browse/NUTCH-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16970155#comment-16970155
 ]


Sebastian Nagel commented on NUTCH-2748:
----------------------------------------

Hi [~markus17], I'm already working on a patch. Agreed, the current behavior is 
definitely wrong. I'll simplify my first solution. The second would only allow 
treating exceeded redirects as links or skipping them. Skipping may make sense 
when http.redirect.max is already set to a higher number (3 or more) or in a 
large crawl where you cannot trust the sites. 

> Fetch status gone (redirect exceeded) not to overwrite existing items in 
> CrawlDb
> --------------------------------------------------------------------------------
>
>                 Key: NUTCH-2748
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2748
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb, fetcher
>    Affects Versions: 1.16
>            Reporter: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.17
>
>         Attachments: test-NUTCH-2748.zip
>
>
> If the fetcher is following redirects and the max. number of redirects in a 
> redirect chain (http.redirect.max) is reached, the fetcher stores a 
> CrawlDatum item with status "fetch_gone" and protocol status 
> "redir_exceeded". During the next CrawlDb update the "gone" item will set the 
> status of existing items (including "db_fetched") to "db_gone". It shouldn't, 
> as there has been no fetch of the final redirect target and indeed nothing is 
> known about its status. A wrong db_gone may then cause a page to be deleted 
> from the search index.
> There are two possible solutions:
> 1. ignore protocol status "redir_exceeded" during CrawlDb update
> 2. when http.redirect.max is hit the fetcher stores nothing or a redirect 
> status instead of a fetch_gone
> Solution 2. seems easier to implement and it would be possible to make the 
> behavior configurable:
> - store the redirect target as outlink, i.e. same behavior as if 
> http.redirect.max == 0
> - store "fetch_gone" (current behavior)
> - store nothing, i.e. ignore those redirects - this should be the default as 
> it's close to the current behavior without the risk of accidentally setting 
> successful fetches to db_gone
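
The three configurable behaviors listed above, and their effect on an existing CrawlDb item, could be sketched roughly as follows. This is a simplified model, not Nutch code: the class name, the enums (Policy, Stored, DbStatus) and both methods are hypothetical and only illustrate the idea; the assumption that a "linked" item does not overwrite an existing status is part of the model.

```java
import java.util.Optional;

/** Hypothetical sketch, not part of Nutch: all names here are illustrative. */
public class RedirectExceededSketch {

    /** The three behaviors proposed for the case that the redirect limit is hit. */
    public enum Policy { STORE_AS_OUTLINK, STORE_GONE, STORE_NOTHING }

    /** Simplified stand-ins for what the fetcher writes for the redirect target. */
    public enum Stored { LINKED, FETCH_GONE }

    /** Simplified stand-ins for CrawlDb statuses. */
    public enum DbStatus { DB_FETCHED, DB_GONE }

    /** Decide what the fetcher stores once the redirect limit is exceeded. */
    public static Optional<Stored> onRedirectExceeded(Policy policy) {
        switch (policy) {
            case STORE_AS_OUTLINK:
                return Optional.of(Stored.LINKED);
            case STORE_GONE:
                return Optional.of(Stored.FETCH_GONE);
            default:
                // STORE_NOTHING: the redirect is ignored entirely
                return Optional.empty();
        }
    }

    /** Model of a CrawlDb update: only a "gone" item overwrites the existing status. */
    public static DbStatus update(DbStatus existing, Optional<Stored> stored) {
        if (stored.isPresent() && stored.get() == Stored.FETCH_GONE) {
            return DbStatus.DB_GONE;
        }
        // Nothing stored, or only a link: the existing status is kept (assumption).
        return existing;
    }

    public static void main(String[] args) {
        for (Policy p : Policy.values()) {
            DbStatus result = update(DbStatus.DB_FETCHED, onRedirectExceeded(p));
            System.out.println(p + ": db_fetched becomes " + result);
        }
    }
}
```

In this model only the "store fetch_gone" policy turns an existing db_fetched item into db_gone, which is the behavior the issue describes as wrong.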



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
