Hello,

   I was wondering if anyone could guide me on how to crawl the web and
ignore the robots.txt since I can not index some big sites. Or if someone
could point how to get around it. I read somewhere about a
protocol.plugin.check.robots
but that was for nutch.

The way I index is
bin/post -c gettingstarted https://en.wikipedia.org/

but I can't index the site I'm guessing because of the robots.txt.
I can index with
bin/post -c gettingstarted http://lucene.apache.org/solr

which I am guessing allows it. I was also wondering how to find the name of
the crawler bin/post uses.

Reply via email to