Usually, if a webmaster finds that your crawler has ignored their robots.txt, they will block your machine, or maybe even your entire IP block, from accessing their site.
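For what it's worth, checking robots.txt before each fetch is not much work. Below is a minimal sketch using Python's standard `urllib.robotparser`, not anything from Lucene, Nutch, or ManifoldCF. The robots.txt content, the `ExampleCrawler` user agent, the example.com URLs, and the helper names `build_parser` and `allowed` are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs without
# network access. A real crawler would fetch it from the target site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set (helper name is an assumption)."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True only if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

# "ExampleCrawler" and the URLs are placeholders, not real endpoints.
print(allowed(parser, "ExampleCrawler", "http://example.com/index.html"))
print(allowed(parser, "ExampleCrawler", "http://example.com/private/data.html"))
```

With the rules above, the first URL is permitted and the second is not, so a polite crawler would simply skip the second one.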
Karl

-----Original Message-----
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Monday, July 15, 2013 9:30 AM
To: java-user@lucene.apache.org
Subject: Re: [ANNOUNCE] Web Crawler

Lucene does not provide any capabilities for crawling websites. You would have to contact the Nutch project, the ManifoldCF project, or other web crawling projects.

As far as bypassing robots.txt, that is a very unethical thing to do. It is rather offensive that you seem to be suggesting that anybody on this mailing list would engage in such an unethical or unprofessional activity.

-- Jack Krupansky

-----Original Message-----
From: Ramakrishna
Sent: Monday, July 15, 2013 9:13 AM
To: java-user@lucene.apache.org
Subject: Re: [ANNOUNCE] Web Crawler

Hi,

I'm trying to use Nutch to crawl some web-sites. Unfortunately, they have restricted crawling of their web-site through robots.txt. By using crawl-anywhere, can I crawl any web-site irrespective of that web-site's robots.txt? If yes, please send me the materials/links to study crawl-anywhere, or else please suggest which crawlers I can use to crawl web-sites without bothering about the robots.txt of that particular site. It's urgent, please reply as soon as possible.

Thanks in advance

--
View this message in context: http://lucene.472066.n3.nabble.com/ANNOUNCE-Web-Crawler-tp2607833p4078039.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org