Hi,

As per [0], an FTP website can have a robots.txt like [1]. In the Nutch code,
the Ftp plugin does not parse the robots file and simply accepts any URL.

In "src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Ftp.java"

  public RobotRules getRobotRules(Text url, CrawlDatum datum) {
    return EmptyRobotRules.RULES;
  }

Was this done intentionally, or is it a bug?

[0] :
https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
[1] : ftp://example.com/robots.txt
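If it is a bug, a minimal sketch of the kind of check the plugin could perform after fetching robots.txt over FTP might look like the following. The class and method names here are purely illustrative, not existing Nutch APIs, and this only handles `User-agent: *` groups with prefix-based `Disallow` rules:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a minimal robots.txt check that the Ftp plugin
// could perform instead of returning EmptyRobotRules.RULES.
// FtpRobotsSketch is an illustrative name, not part of Nutch.
public class FtpRobotsSketch {

  // Collect Disallow path prefixes from the "User-agent: *" group
  // of a robots.txt file's content.
  public static List<String> disallowedPrefixes(String robotsTxt) {
    List<String> prefixes = new ArrayList<>();
    boolean inWildcardGroup = false;
    for (String line : robotsTxt.split("\n")) {
      String trimmed = line.trim();
      String lower = trimmed.toLowerCase();
      if (lower.startsWith("user-agent:")) {
        String agent = trimmed.substring("user-agent:".length()).trim();
        inWildcardGroup = agent.equals("*");
      } else if (inWildcardGroup && lower.startsWith("disallow:")) {
        String path = trimmed.substring("disallow:".length()).trim();
        if (!path.isEmpty()) {
          prefixes.add(path);
        }
      }
    }
    return prefixes;
  }

  // True if the given path does not match any disallowed prefix.
  public static boolean isAllowed(String path, List<String> disallowed) {
    for (String prefix : disallowed) {
      if (path.startsWith(prefix)) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    String robots = "User-agent: *\nDisallow: /private/\n";
    List<String> rules = disallowedPrefixes(robots);
    System.out.println(isAllowed("/private/data.txt", rules));
    System.out.println(isAllowed("/pub/file.txt", rules));
  }
}
```

A real fix would presumably fetch ftp://host/robots.txt with the plugin's existing FTP client and feed the content into Nutch's robots-rules machinery rather than a hand-rolled parser like this.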

Thanks,
Tejas Patil
