[ https://issues.apache.org/jira/browse/NUTCH-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tejas Patil updated NUTCH-1513:
-------------------------------
    Attachment: NUTCH-1513.2.x.v2.patch
                NUTCH-1513.trunk.v2.patch

Attached the patches for both trunk and 2.x. (As mentioned in the earlier comment above, I am using http.agent.name and http.robots.agents.)

> Support Robots.txt for Ftp urls
> -------------------------------
>
>                 Key: NUTCH-1513
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1513
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.7, 2.2
>            Reporter: Tejas Patil
>            Assignee: Tejas Patil
>            Priority: Minor
>              Labels: robots.txt
>             Fix For: 2.3, 1.8
>
>         Attachments: NUTCH-1513.2.x.v2.patch, NUTCH-1513.trunk.patch, NUTCH-1513.trunk.v2.patch
>
>
> As per [0], an FTP site can have a robots.txt like [1]. In the Nutch code, the Ftp plugin does not parse the robots file and accepts all URLs.
> In "_src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Ftp.java_":
> {noformat}
> public RobotRules getRobotRules(Text url, CrawlDatum datum) {
>   return EmptyRobotRules.RULES;
> }
> {noformat}
> It is not clear whether this was part of the design or a bug.
> [0] : https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
> [1] : ftp://example.com/robots.txt
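For readers skimming this issue, here is a minimal sketch of how a robots.txt fetched over FTP could be evaluated, assuming the crawler-commons SimpleRobotRulesParser that Nutch already uses for HTTP. This is not the content of the attached patches; the class name FtpRobotsSketch and the method parseFtpRobots below are illustrative only, and the agent string stands in for whatever is derived from http.agent.name / http.robots.agents.

{noformat}
import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRules;
import crawlercommons.robots.SimpleRobotRules.RobotRulesMode;
import crawlercommons.robots.SimpleRobotRulesParser;

public class FtpRobotsSketch {

  private static final SimpleRobotRulesParser PARSER = new SimpleRobotRulesParser();

  /**
   * Parse a robots.txt downloaded from an FTP server.
   * If the server has no robots.txt, fall back to allow-all,
   * which mirrors the old EmptyRobotRules behaviour.
   */
  public static BaseRobotRules parseFtpRobots(String robotsUrl, byte[] content,
      String agentNames) {
    if (content == null) {
      return new SimpleRobotRules(RobotRulesMode.ALLOW_ALL);
    }
    return PARSER.parseContent(robotsUrl, content, "text/plain", agentNames);
  }

  public static void main(String[] args) {
    byte[] robots = "User-agent: *\nDisallow: /private/\n".getBytes();
    BaseRobotRules rules =
        parseFtpRobots("ftp://example.com/robots.txt", robots, "nutch-test");
    // /private/ is disallowed, everything else is allowed
    System.out.println(rules.isAllowed("ftp://example.com/private/data.txt"));
    System.out.println(rules.isAllowed("ftp://example.com/pub/data.txt"));
  }
}
{noformat}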