Lucene does not provide any capabilities for crawling websites. You would have to contact the Nutch project, the ManifoldCF project, or other web crawling projects.

As for bypassing robots.txt: that is a very unethical thing to do. It is rather offensive that you seem to be suggesting that anybody on this mailing list would engage in such an unethical or unprofessional activity.
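For context, honoring robots.txt is straightforward, which is why well-behaved crawlers like Nutch do it by default. Below is a minimal, illustrative sketch (a hypothetical helper class, not a Nutch or ManifoldCF API) that collects the Disallow rules from the "User-agent: *" group of a robots.txt file and checks a URL path against them by prefix match. Real crawlers also handle Allow rules, wildcards, Crawl-delay, and per-agent groups; this is only a sketch of the core idea.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: parse the "User-agent: *" group of a
// robots.txt body and answer whether a polite crawler may fetch a path.
public class RobotsCheck {
    private final List<String> disallowed = new ArrayList<>();

    public RobotsCheck(String robotsTxt) {
        boolean inStarGroup = false;
        for (String raw : robotsTxt.split("\n")) {
            String line = raw.trim();
            // Strip comments.
            int hash = line.indexOf('#');
            if (hash >= 0) line = line.substring(0, hash).trim();
            if (line.isEmpty()) continue;
            String lower = line.toLowerCase();
            if (lower.startsWith("user-agent:")) {
                // Track whether we are inside the wildcard agent group.
                inStarGroup = line.substring("user-agent:".length()).trim().equals("*");
            } else if (inStarGroup && lower.startsWith("disallow:")) {
                String path = line.substring("disallow:".length()).trim();
                // An empty Disallow value means "nothing is disallowed".
                if (!path.isEmpty()) disallowed.add(path);
            }
        }
    }

    /** True if the given path is not covered by any Disallow prefix. */
    public boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nDisallow: /private/\nDisallow: /tmp\n";
        RobotsCheck check = new RobotsCheck(robots);
        System.out.println(check.isAllowed("/index.html"));   // true
        System.out.println(check.isAllowed("/private/data")); // false
    }
}
```

The point being: since a compliant check is this cheap, deliberately skipping it is a choice, not a technical limitation.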

-- Jack Krupansky

-----Original Message----- From: Ramakrishna
Sent: Monday, July 15, 2013 9:13 AM
To: java-user@lucene.apache.org
Subject: Re: [ANNOUNCE] Web Crawler

Hi..

I'm trying Nutch to crawl some websites. Unfortunately, they restrict crawling
of their sites via robots.txt. By using Crawl-Anywhere, can I crawl any
website irrespective of that website's robots.txt? If yes, please send me
materials/links for learning about Crawl-Anywhere, or else please suggest
which crawlers I can use to crawl websites without regard for a particular
site's robots.txt. It's urgent; please reply as soon as possible.

Thanks in advance



--
View this message in context: http://lucene.472066.n3.nabble.com/ANNOUNCE-Web-Crawler-tp2607833p4078039.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

