Hi - that is a curious case indeed, as Nutch adheres to robots.txt. Can they
provide you with a reason for marking your Nutch crawler as impolite?
Markus
-----Original message-----
> From:Mattmann, Chris A (3980) <chris.a.mattm...@jpl.nasa.gov>
> Sent: Wednesday 25th May 2016 0:26
> To: user@nutch.apache.org
> Subject: Re: Robots.txt
>
> Hi,
>
> By default, as I mentioned, Nutch does obey robots.txt. There is
> a whitelist property that can be set in nutch-default to selectively
> disable it for certain sites (again for valid security research use
> cases).
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Director, Information Retrieval and Data Science Group (IRDS)
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> WWW: http://irds.usc.edu/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> On 5/24/16, 3:24 PM, "BlackIce" <blackice...@gmail.com> wrote:
>
> >I don't recall messing with anything to do with robots.txt; I want us to
> >be as polite as possible.
> >On May 25, 2016 12:22 AM, "Mattmann, Chris A (3980)" <
> >chris.a.mattm...@jpl.nasa.gov> wrote:
> >
> >> Hi,
> >>
> >> For security research, there is an option to whitelist certain sites so
> >> that robots.txt is not enforced for them. It’s not enabled by default
> >> and must be explicitly enabled.
> >>
> >> The solution is: there isn’t one. People used to just hack
> >> Nutch and achieve the same thing by commenting out the line of code
> >> that performed the robots.txt check.
> >>
> >> Those people that are using Nutch and not obeying robots.txt
> >> are doing just that. But Nutch itself by default does obey it.
> >>
> >> Cheers,
> >> Chris
> >>
> >> On 5/24/16, 3:17 PM, "BlackIce" <blackice...@gmail.com> wrote:
> >>
> >> >Hi,
> >> >
> >> >I've just seen on a website which tracks bots that "Tarantula", our
> >> >Nutch 1.11-based crawler, is being classified as not obeying robots.txt.
> >> >
> >> >What's the solution?
> >>
>