Hi, I am unable to get the attached patch via mail. It would be better if you create a JIRA issue and attach the patch there.
Thank you.

On 2/15/07, Doğacan Güney <[EMAIL PROTECTED]> wrote:
> Hi,
>
> There seem to be two small bugs in lib-http's RobotRulesParser.
>
> The first is in the crawl-delay handling. The code doesn't check addRules,
> so the Nutch bot will pick up the crawl-delay value of another robot's
> entry in robots.txt. To be more concrete:
>
> User-agent: foobot
> Crawl-delay: 3600
>
> User-agent: *
> Disallow:
>
> Given such a robots.txt file, the Nutch bot will get 3600 as its
> crawl-delay value, no matter what the Nutch bot's name actually is.
>
> The second is in the main method. RobotRulesParser.main advertises its
> usage as "<robots-file> <url-file> <agent-name>+", but if you give it
> more than one agent name it refuses to run.
>
> Trivial patch attached.
>
> --
> Doğacan Güney
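For anyone following along without the patch, here is a minimal, self-contained sketch of the guarded crawl-delay handling Doğacan describes. This is not the actual Nutch source (class name, matching logic, and blank-line handling are all simplified for illustration); the point is only that a Crawl-delay line should be honored solely while addRules is true, i.e. while we are inside a User-agent block that matches our own bot.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class CrawlDelaySketch {

  // Returns the crawl delay (in ms) that applies to agentName, or -1 if none.
  public static long parseCrawlDelay(String robotsTxt, String agentName)
      throws IOException {
    BufferedReader in = new BufferedReader(new StringReader(robotsTxt));
    boolean addRules = false;  // inside a User-agent block that matches us?
    long crawlDelay = -1;
    String line;
    while ((line = in.readLine()) != null) {
      line = line.trim();
      if (line.regionMatches(true, 0, "User-agent:", 0, 11)) {
        String agent = line.substring(11).trim().toLowerCase();
        // Simplified matching: our own name or the wildcard.
        addRules = agent.equals("*")
            || agentName.toLowerCase().indexOf(agent) != -1;
      } else if (line.regionMatches(true, 0, "Crawl-delay:", 0, 12)) {
        // The fix: only take the value inside a matching block.
        if (addRules) {
          crawlDelay = Long.parseLong(line.substring(12).trim()) * 1000L;
        }
      }
    }
    return crawlDelay;
  }

  public static void main(String[] args) throws IOException {
    String robots = "User-agent: foobot\n"
        + "Crawl-delay: 3600\n"
        + "\n"
        + "User-agent: *\n"
        + "Disallow:\n";
    // Without the addRules check this would print 3600000; with it, -1,
    // since the 3600-second delay belongs to foobot, not to us.
    System.out.println(parseCrawlDelay(robots, "Nutch"));
  }
}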
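For the second bug, the fix is presumably just relaxing the argument-count check in main from an exact comparison to a lower bound, along these lines (again a sketch, since I don't have the patch in front of me):

// The usage string promises one or more agent names, so accept
// three or more arguments rather than exactly three:
if (argv.length < 3) {
  System.err.println(
      "Usage: RobotRulesParser <robots-file> <url-file> <agent-name>+");
  System.exit(-1);
}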
