[ https://issues.apache.org/jira/browse/NUTCH-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma reassigned NUTCH-1257:
------------------------------------

    Assignee: Markus Jelsma

> Support for the x-robots-tag HTTP Header
> ----------------------------------------
>
>                 Key: NUTCH-1257
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1257
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Mike
>            Assignee: Markus Jelsma
>              Labels: http,, privacy,, robots,
>             Fix For: 1.7, 2.2
>
>
> Google and Bing both currently support the x-robots-tag HTTP header. This is
> important, because they have a policy of not *crawling* links that are in a
> robots.txt file, and not *indexing* links that are set to noindex. In the
> case that a page is indexed but not crawled, Google and Bing will show the
> page in their results, but it will lack a snippet (since they didn't crawl
> it, they can't generate one).
> As a result, the only way to block Google and Bing from having a page in
> their index is to use the robots meta tag in HTML pages and the x-robots-tag
> in other mimetypes.
> As a site owner that needs to keep specific pages private, I *cannot* trust
> robots.txt to keep my pages out of Google and Bing, and I have to use the two
> robots standards. Since Nutch doesn't support the HTTP header, I have to
> block it from crawling ALL non-HTML pages on my site.
> This is not an ideal state of affairs, and it would be great if Nutch
> supported the x-robots-tag HTTP header.
> I've done more research on this topic on my blog:
> - http://michaeljaylissner.com/blog/support-for-x-robots-tag-http-header-and-robots-HTML-meta-tag
> - http://michaeljaylissner.com/blog/respecting-privacy-while-providing-hundreds-of-thousands-of-public-documents

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
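
For readers unfamiliar with the header discussed in this issue, here is a minimal Python sketch of how a crawler might decide whether a fetched document may be indexed. It is not Nutch code and not the fix attached to NUTCH-1257. The function names, the "nutch" agent token, and the small list of valued directives are assumptions made for illustration. It also assumes the commonly documented "agent: directives" form of the header, where a leading user-agent name limits the directives to that crawler.

```python
# Illustrative only: names below are hypothetical, not part of Nutch.

# Directives that carry a value after a colon, so a leading
# "name:" in these cases is not a user-agent prefix (assumed list).
VALUED = {"unavailable_after", "max-snippet",
          "max-image-preview", "max-video-preview"}


def robots_directives(header_values, agent="nutch"):
    """Collect X-Robots-Tag directives that apply to `agent`.

    `header_values` is a list with one string per X-Robots-Tag
    header line in the HTTP response.
    """
    found = set()
    for value in header_values:
        head, sep, tail = value.partition(":")
        name = head.strip().lower()
        # "googlebot: noindex" style: scoped to a single crawler.
        if sep and name not in VALUED and "," not in head:
            if name != agent.lower():
                continue  # addressed to some other crawler
            value = tail
        for part in value.split(","):
            # Keep only the directive name, dropping any ": value" part.
            # Limitation: a date value containing commas would leave
            # stray tokens in the result; they are harmless here.
            directive = part.partition(":")[0].strip().lower()
            if directive:
                found.add(directive)
    # "none" is shorthand for "noindex, nofollow".
    if "none" in found:
        found |= {"noindex", "nofollow"}
    return found


def may_index(header_values, agent="nutch"):
    """Return False when the response asks this agent not to index."""
    return "noindex" not in robots_directives(header_values, agent)
```

With something along these lines in place, a non-HTML document such as a PDF served with `X-Robots-Tag: noindex` could still be fetched but would be left out of the index, which is the behavior the reporter is asking for.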