Hi,
As per [0], an FTP site can have a robots.txt file such as [1]. In the Nutch code,
the FTP plugin does not parse the robots file and simply accepts any URL.
In "src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Ftp.java":
    public RobotRules getRobotRules(Text url, CrawlDatum datum) {
      return EmptyRobotRules.RULES;
    }
Was this done intentionally, or is it a bug?
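To illustrate the behaviour I would expect, here is a minimal sketch of the kind of check the plugin could apply once the content of ftp://<host>/robots.txt has been fetched. It is not Nutch code: the class SimpleRobotRules and its methods are hypothetical names made up for this example. It only honours "Disallow:" prefixes under "User-agent: *" and ignores Allow lines, wildcards and grouped User-agent lines, so a real implementation would need more.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/**
 * Hypothetical helper, NOT part of Nutch: a deliberately small robots.txt
 * rule set that only understands "User-agent: *" and "Disallow:" prefixes.
 */
final class SimpleRobotRules {
    private final List<String> disallowedPrefixes = new ArrayList<>();

    /** Parses robots.txt content that has already been fetched (e.g. over FTP). */
    static SimpleRobotRules parse(String robotsTxt) {
        SimpleRobotRules rules = new SimpleRobotRules();
        boolean appliesToUs = false;
        for (String line : robotsTxt.split("\\R")) {
            // Drop trailing comments.
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash);
            }
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                // Simplification: only the wildcard agent is honoured.
                appliesToUs = value.equals("*");
            } else if (field.equals("disallow") && appliesToUs && !value.isEmpty()) {
                rules.disallowedPrefixes.add(value);
            }
        }
        return rules;
    }

    /** Returns false if the path starts with any disallowed prefix. */
    boolean isAllowed(String path) {
        for (String prefix : disallowedPrefixes) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        SimpleRobotRules rules = parse("User-agent: *\nDisallow: /private/\n");
        System.out.println(rules.isAllowed("/pub/file.txt"));  // true
        System.out.println(rules.isAllowed("/private/a.txt")); // false
    }
}
```

With something along these lines, getRobotRules() could return rules built from the site's robots.txt instead of always returning the empty rule set.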
[0] https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
[1] ftp://example.com/robots.txt
Thanks,
Tejas Patil