My sysadmin refuses to change the robots.txt, citing the following reason:

The moment he allows a specific agent, a lot of crawlers start
impersonating that user agent and try to crawl the site.

Are you saying there is no way to configure Nutch to ignore robots.txt?

Thanks,
Zee

On Fri, Sep 11, 2009 at 9:10 PM, David M. Cole <[email protected]> wrote:
> At 3:00 PM +0530 9/11/09, Super Man wrote:
>>
>> Any clues?
>
> Zee:
>
> The robots.txt protocol allows for identifying different user-agents within
> a single file, with each getting its own set of privileges (see
> http://www.robotstxt.org/ for more info).
>
> Ask your sysadmin to add a record for the robot name you choose, granting
> your robot access where other agents are not allowed.
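>
> For example, a pair of records along these lines (a sketch, with "my-robot"
> as a placeholder agent name) would admit only your crawler:
>
>     # allow the agent named "my-robot" everywhere (empty Disallow = no limits)
>     User-agent: my-robot
>     Disallow:
>
>     # block every other agent from the whole site
>     User-agent: *
>     Disallow: /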
>
> You can set the user-agent in the nutch-default.xml file by changing the
> http.robots.agents property accordingly. As Jake Jacobson found out in June,
> you *must* end the series of user-agents in the http.robots.agents value
> with an asterisk (*), e.g.:
>
> <property>
>     <name>http.robots.agents</name>
>     <value>my-robot,*</value>
> </property>
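>
> Note that http.robots.agents only controls which robots.txt records Nutch
> obeys; the name the crawler actually sends in its HTTP requests comes from
> the http.agent.name property, so you will likely want to set that to the
> same value (a sketch, assuming a stock nutch-default.xml that defines
> http.agent.name):
>
> <property>
>     <name>http.agent.name</name>
>     <value>my-robot</value>
> </property>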
>
> Hope this helps.
>
> \dmc
>
> --
> *+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+
>   David M. Cole                                            [email protected]
>   Editor & Publisher, NewsInc. <http://newsinc.net>        V: (650) 557-2993
>   Consultant: The Cole Group <http://colegroup.com/>       F: (650) 475-8479
> *+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+
>
