[ https://issues.apache.org/jira/browse/NUTCH-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13543720#comment-13543720 ]
Tejas Patil commented on NUTCH-1513:
------------------------------------

For this to be supported, I have 2 approaches:
# Implement robots handling for Ftp in a similar way as it has been done for the Http protocol, with the parsing performed by Nutch itself.
# Same as #1, but use Crawler-Commons to do the parsing. There is already [NUTCH-1031|https://issues.apache.org/jira/browse/NUTCH-1031] filed for its integration with Nutch. However, the [last release|http://code.google.com/p/crawler-commons/downloads/list] of Crawler-Commons was in July 2011, so it doesn't look to be under active development.

A rough sketch of approach #2 is appended at the end of this message. Please let me know your comments about these approaches.

> Support Robots.txt for Ftp urls
> -------------------------------
>
>                 Key: NUTCH-1513
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1513
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.7, 2.2
>            Reporter: Tejas Patil
>            Priority: Minor
>              Labels: robots.txt
>
> As per [0], an FTP website can have a robots.txt like [1]. In the Nutch code, the Ftp plugin is not parsing the robots file and accepts all urls.
> In "_src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Ftp.java_":
> {noformat}
> public RobotRules getRobotRules(Text url, CrawlDatum datum) {
>   return EmptyRobotRules.RULES;
> }
> {noformat}
> It's not clear if this was part of the design or if it's a bug.
> [0] : https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
> [1] : ftp://example.com/robots.txt
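
To make approach #2 concrete, below is a minimal sketch of what a non-empty {{getRobotRules}} for the Ftp plugin could look like: it fetches {{/robots.txt}} over FTP with Apache Commons Net (already a dependency of protocol-ftp) and parses the bytes with Crawler-Commons' {{SimpleRobotRulesParser}}. The class name, the anonymous login, and the allow-all fallback when no robots.txt exists are illustrative assumptions, not the actual plugin code.

{noformat}
// Sketch only: fetch ftp://host/robots.txt with Apache Commons Net and
// parse it with Crawler-Commons. Names below (FtpRobotRulesSketch, the
// anonymous login) are illustrative, not the actual Nutch plugin code.
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.commons.net.ftp.FTPClient;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class FtpRobotRulesSketch {

  private static final SimpleRobotRulesParser PARSER =
      new SimpleRobotRulesParser();

  /** Fetches and parses /robots.txt for the given FTP host. */
  public static BaseRobotRules getRobotRules(String host, String agentName)
      throws IOException {
    String robotsUrl = "ftp://" + host + "/robots.txt";
    byte[] content = new byte[0];

    FTPClient ftp = new FTPClient();
    try {
      ftp.connect(host);
      ftp.login("anonymous", agentName + "@example.com");
      ftp.enterLocalPassiveMode();

      InputStream in = ftp.retrieveFileStream("/robots.txt");
      if (in != null) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
          buf.write(chunk, 0, n);
        }
        in.close();
        ftp.completePendingCommand();
        content = buf.toByteArray();
      }
      // If the file is missing, content stays empty and the parser
      // produces allow-all rules, mirroring EmptyRobotRules.RULES.
      return PARSER.parseContent(robotsUrl, content, "text/plain", agentName);
    } finally {
      if (ftp.isConnected()) {
        ftp.logout();
        ftp.disconnect();
      }
    }
  }

  public static void main(String[] args) throws IOException {
    BaseRobotRules rules = getRobotRules("example.com", "nutch-test");
    System.out.println(rules.isAllowed("ftp://example.com/some/file.txt"));
  }
}
{noformat}

Parsing empty content yields allow-all rules, so a missing robots.txt would match the current behavior of returning {{EmptyRobotRules.RULES}}.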