Hi,

I don't know whether it is a bug, but your suggested improvement would
certainly be welcome.

Could you please log a Jira issue so that we can review it?
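
In case it helps frame the Jira issue, here is a minimal sketch of the kind
of check the Ftp plugin currently skips. It is not Nutch code: the class
FtpRobotsSketch and its methods are hypothetical. It assumes the robots.txt
body has already been fetched, and it only honours plain Disallow prefixes
in "User-agent: *" groups.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch: a tiny robots.txt prefix matcher.
final class FtpRobotsSketch {
    // Disallow prefixes collected from groups that apply to every agent.
    private final List<String> disallowed = new ArrayList<>();

    // Simplified parse: assumes one User-agent line per group and ignores
    // Allow lines, wildcards, and agent-specific groups.
    static FtpRobotsSketch parse(String robotsTxt) {
        FtpRobotsSketch rules = new FtpRobotsSketch();
        boolean appliesToAll = false;
        for (String raw : robotsTxt.split("\\r?\\n")) {
            // Drop trailing comments and surrounding whitespace.
            int hash = raw.indexOf('#');
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            if (field.equalsIgnoreCase("User-agent")) {
                appliesToAll = value.equals("*");
            } else if (field.equalsIgnoreCase("Disallow")
                    && appliesToAll && !value.isEmpty()) {
                rules.disallowed.add(value);
            }
        }
        return rules;
    }

    // A path is allowed unless it starts with a collected Disallow prefix.
    boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

With rules parsed from "User-agent: *" followed by "Disallow: /private/",
isAllowed("/private/a.txt") is false and isAllowed("/pub/a.txt") is true.
A real fix would presumably build on the robots handling Nutch already has
for HTTP rather than on a new parser like this one.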

Best

Lewis

On Fri, Jan 4, 2013 at 3:39 AM, Tejas Patil <[email protected]> wrote:

> Hi,
>
> As per [0], an FTP site can have a robots.txt file like [1]. In the Nutch
> code, the Ftp plugin does not parse the robots file and simply accepts any URL.
>
> In
> "src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Ftp.java"
>
>   public RobotRules getRobotRules(Text url, CrawlDatum datum) {
>     return EmptyRobotRules.RULES;
>   }
>
> Was this done intentionally, or is this a bug?
>
> [0] : https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
> [1] : ftp://example.com/robots.txt
>
> Thanks,
> Tejas Patil
>



-- 
*Lewis*
