Hi,

For security research, Nutch has an option to white-list hosts so that robots.txt is skipped for them.
It's not enabled by default and must be explicitly enabled.
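If memory serves, that white-list is set via the http.robot.rules.whitelist property in your nutch-site.xml; the exact property name and behavior may vary by Nutch version, so treat this as a sketch and check nutch-default.xml for your release:

```xml
<!-- nutch-site.xml: hosts listed here skip robots.txt parsing.
     Property name assumed from nutch-default.xml; verify for your version. -->
<property>
  <name>http.robot.rules.whitelist</name>
  <value>example.com,127.0.0.1</value>
  <description>Comma-separated list of hostnames or IP addresses for
  which robots.txt rules are not enforced (intended for security
  research against hosts you control or have permission to crawl).
  Empty by default, i.e. robots.txt is always obeyed.</description>
</property>
```

Again, only use this against hosts you own or have permission to test.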

The solution is that there isn't one. People used to just hack
Nutch to do the same thing by commenting out the line of code
that performed the robots.txt check.

The people who are using Nutch and not obeying robots.txt
are doing just that. Nutch itself obeys robots.txt by default.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

On 5/24/16, 3:17 PM, "BlackIce" <blackice...@gmail.com> wrote:

>Hi,
>
>I've just seen on a website which tracks bots, that "Tarantula" ,  our
>nutch 1.11 based crawler is being classified as not obeying robots.txt.
>
>What's the solution?
