Hello, I was wondering if anyone could guide me on how to crawl the web while ignoring robots.txt, since I cannot index some big sites. Or, if there is a way around it, could someone point me to it? I read somewhere about a protocol.plugin.check.robots property, but that was for Nutch.
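(For what it's worth, I can see the rules by fetching the robots.txt by hand -- just plain curl, nothing Solr-specific:

  curl -s https://en.wikipedia.org/robots.txt | grep -i disallow | head

so the file is definitely there; I just don't know how to tell the crawler to skip it.)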
The way I index is bin/post -c gettingstarted https://en.wikipedia.org/ but I can't index that site, I'm guessing because of its robots.txt. I can index with bin/post -c gettingstarted http://lucene.apache.org/solr which I'm guessing allows it. I was also wondering how to find the name (the User-Agent) of the crawler bin/post uses.
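I suppose one way would be to point bin/post at a dummy local server and read the raw request headers it sends. I'm not sure of the exact netcat flags on every platform (GNU netcat wants -l -p 8000 rather than -l 8000), but something like this should dump the User-Agent:

  # terminal 1: listen on port 8000 and print whatever raw HTTP request arrives
  nc -l 8000

  # terminal 2: crawl the listener instead of a real site
  bin/post -c gettingstarted http://localhost:8000/

bin/post will presumably error out since nc never answers, but the headers printed in the first terminal are what I'm after. If there's a simpler way to find the crawler name, I'd appreciate a pointer.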