Pushpesh Kr. Rajwanshi wrote:
I want to know if anyone has been able to successfully run a distributed crawl on multiple machines, crawling millions of pages. How hard is that to do? Is it just a matter of configuration and setup, or does it require some implementation work as well?
I recently performed a four-level-deep crawl, starting from the URLs in DMOZ and limiting each level to 16M URLs. This ran on 20 machines, took around 24 hours, used about 100 Mbit/s of bandwidth, and retrieved around 50M pages. I used Nutch unmodified, specifying only a few configuration options. So, yes, it is possible.
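For context, a crawl like this is driven mostly by property overrides in Nutch's conf/nutch-site.xml. The sketch below uses property names that exist in Nutch of that era (http.agent.name, fetcher.threads.fetch, fetcher.server.delay), but the values are illustrative assumptions, not the settings actually used for the crawl described above:

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: site-specific overrides of nutch-default.xml.
     Values below are illustrative guesses, not Doug's actual settings. -->
<nutch-conf>
  <property>
    <name>http.agent.name</name>
    <value>MyCrawler</value>
    <!-- Required: identifies your crawler in the User-Agent header. -->
  </property>
  <property>
    <name>fetcher.threads.fetch</name>
    <value>100</value>
    <!-- Concurrent fetch threads per machine; raise for throughput. -->
  </property>
  <property>
    <name>fetcher.server.delay</name>
    <value>5.0</value>
    <!-- Seconds between successive requests to the same host. -->
  </property>
</nutch-conf>
```

The four-levels-deep, 16M-URLs-per-level shape of the crawl corresponds to the depth and per-level topN limits passed to Nutch's crawl/generate tools; the exact command-line flags varied between Nutch versions, so check the tool's usage output for your release.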
Doug

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
