Hi,

For security research there is an option to whitelist hosts so that robots.txt is ignored for them. It is not enabled by default and must be explicitly enabled.
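A minimal sketch of what enabling that whitelist looks like in conf/nutch-site.xml. The property name `http.robot.rules.whitelist` is taken from the Nutch 1.11-era default configuration, and the host values are placeholders, so verify the exact name against your conf/nutch-default.xml:

```xml
<!-- conf/nutch-site.xml: hosts listed here are fetched without
     consulting their robots.txt. Intended for security research /
     hosts you control only; empty (disabled) by default. -->
<property>
  <name>http.robot.rules.whitelist</name>
  <!-- example.org and 127.0.0.1 are placeholders; list your own hosts -->
  <value>example.org,127.0.0.1</value>
</property>
```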
The solution is: there isn't one. People used to just hack Nutch by commenting out the line of code that performs the robots.txt check. Those who are using Nutch and not obeying robots.txt are doing just that. But Nutch itself, by default, does obey it.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

On 5/24/16, 3:17 PM, "BlackIce" <blackice...@gmail.com> wrote:

>Hi,
>
>I've just seen on a website which tracks bots that "Tarantula", our
>Nutch 1.11-based crawler, is being classified as not obeying robots.txt.
>
>What's the solution?