[ 
https://issues.apache.org/jira/browse/NUTCH-2808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17461284#comment-17461284
 ] 

ASF GitHub Bot commented on NUTCH-2808:
---------------------------------------

sebastian-nagel merged pull request #711:
URL: https://github.com/apache/nutch/pull/711

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@nutch.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Document side effects of ignoring robots.txt
> --------------------------------------------
>
>                 Key: NUTCH-2808
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2808
>             Project: Nutch
>          Issue Type: Improvement
>          Components: documentation, robots
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Minor
>             Fix For: 1.19
>
>
> (see NUTCH-1927 and NUTCH-2803)
> The aim of NUTCH-1927 was to make it possible to ignore the robots.txt for a 
> defined set of hosts/domains. Ignoring the robots.txt entirely has some side 
> effects which should be documented:
> - undesired content (duplicates, private pages, etc.) may get indexed
> - the Crawl-Delay is ignored
> - no sitemaps are detected (cf. NUTCH-2807)
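
The effects listed above can be illustrated with a small sketch. It is not Nutch code: it uses Python's standard urllib.robotparser, and the robots.txt content, the agent name "mybot" and the example.org URLs are all made up for illustration. It compares what a crawler learns from a parsed robots.txt with what it is left with when the file is ignored.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
Sitemap: https://example.org/sitemap.xml
"""


def robots_info(robots_txt, agent, url):
    """Return what a crawler learns about `url` when robots.txt is honoured."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        "allowed": parser.can_fetch(agent, url),
        "crawl_delay": parser.crawl_delay(agent),
        "sitemaps": parser.site_maps(),  # available since Python 3.8
    }


def ignored_info():
    """What remains when robots.txt is ignored entirely (assumed model):
    every URL is fetched, no Crawl-Delay is known, no sitemap is announced."""
    return {"allowed": True, "crawl_delay": None, "sitemaps": None}


if __name__ == "__main__":
    url = "https://example.org/private/page.html"
    print("honoured:", robots_info(ROBOTS_TXT, "mybot", url))
    print("ignored: ", ignored_info())
```

With the file honoured, the private page is rejected and the delay and sitemap are known. With it ignored, all three pieces of information are lost, which matches the three points in the issue description.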



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
