[ 
https://issues.apache.org/jira/browse/NUTCH-2581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16483910#comment-16483910
 ] 

ASF GitHub Bot commented on NUTCH-2581:
---------------------------------------

sebastian-nagel opened a new pull request #331: NUTCH-2581 Caching of 
redirected robots.txt may overwrite correct robots.txt rules
URL: https://github.com/apache/nutch/pull/331
 
 
   - only cache redirected robots.txt rules if the target URL path and query 
equal /robots.txt

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Caching of redirected robots.txt may overwrite correct robots.txt rules
> -----------------------------------------------------------------------
>
>                 Key: NUTCH-2581
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2581
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher, robots
>    Affects Versions: 2.3.1, 1.14
>            Reporter: Sebastian Nagel
>            Priority: Critical
>             Fix For: 2.4, 1.15
>
>
> Robots.txt rules obtained via a redirect are also cached for the redirect 
> target host. As a result, the correct robots.txt rules of the target host 
> may never be fetched. E.g., 
> http://wyomingtheband.com/robots.txt redirects to 
> https://www.facebook.com/wyomingtheband/robots.txt. Because fetching fails 
> with a 404, bots are allowed to crawl wyomingtheband.com. These rules are 
> erroneously also cached for the redirect target host www.facebook.com, whose 
> [robots.txt|https://www.facebook.com/robots.txt] rules are clear and do not 
> allow crawling.
> Nutch should cache redirected robots.txt rules only if the path part (to be 
> safe, including the query) of the redirect target URL is exactly {{/robots.txt}}.
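
The condition described above can be sketched in a few lines of Java. This is only an illustration of the check, not the code from the pull request: the class name RobotsRedirectCheck and the method isCacheableRedirectTarget are hypothetical. It relies on java.net.URL.getFile(), which returns the path followed by "?query" when a query is present, so a single comparison covers both path and query.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

// Hypothetical helper class, not part of Nutch: illustrates the proposed rule only.
final class RobotsRedirectCheck {

  /**
   * Returns true if rules fetched via a redirect may also be cached for the
   * redirect target host, i.e. the target's path plus query is exactly
   * "/robots.txt".
   */
  static boolean isCacheableRedirectTarget(URL redirectTarget) {
    // URL.getFile() yields the path, plus "?query" if the URL has a query.
    return "/robots.txt".equals(redirectTarget.getFile());
  }

  /** Convenience overload (also hypothetical): unparsable targets are never cached. */
  static boolean isCacheableRedirectTarget(String redirectTarget) {
    try {
      return isCacheableRedirectTarget(new URI(redirectTarget).toURL());
    } catch (URISyntaxException | MalformedURLException | IllegalArgumentException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    String[] targets = {
      "https://www.facebook.com/wyomingtheband/robots.txt", // example from the issue
      "https://www.facebook.com/robots.txt",
      "https://www.example.com/robots.txt?lang=en"          // assumed example with a query
    };
    for (String target : targets) {
      System.out.println(target + " -> cache for target host: "
          + isCacheableRedirectTarget(target));
    }
  }
}
```

With this check, the redirect target from the example in the issue would not be cached for www.facebook.com, because its path is /wyomingtheband/robots.txt rather than /robots.txt.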



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
