On Tue, May 10, 2011 at 10:37 AM, webdev1977 <[email protected]> wrote:

> One problem has to do with the fact that I am doing vertical searching. The
> side effect of this is that the crawl process is SO slow. It took about 48
> hours to crawl about 350,000 urls, all from the same website. I am
> crawling a shared file system, and I am sure that constitutes vertical
> crawling. The other web crawling I am doing also only comes from a handful
> of urls. Maybe nutch is not the solution to use based on this?

There are two options; number one is mandatory:

- Don't crawl one web site at a time. Crawl all of your sites at once so the
  threads can be partitioned across all of them. You will be able to crawl
  much faster without overloading any single site. This one is a must.
- Use a Hadoop cluster to crawl, not a single machine.
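As a rough sketch of the first point: put all of your sites in the seed list together, and let the fetcher spread its threads across the per-host queues. The property names below are from the Nutch 1.x `nutch-default.xml` (check the copy shipped with your version before relying on them), and the thread counts are illustrative, not recommendations:

```xml
<!-- nutch-site.xml (overrides nutch-default.xml) -->
<configuration>

  <!-- Total fetcher threads across ALL hosts. With many hosts in the
       seed list, these threads partition across per-host queues, so a
       higher total is safe. Value here is just an example. -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>50</value>
  </property>

  <!-- Threads allowed against any single host's queue. Keeping this
       low is what prevents overloading an individual site while the
       overall crawl still runs fast. -->
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>1</value>
  </property>

</configuration>
```

The seed list itself is just a flat file of URLs; mixing all the sites in one file (rather than crawling them in separate runs) is what gives the fetcher something to partition:

```
# urls/seed.txt -- all sites in one crawl, not one site per run
http://siteA.example.com/
http://siteB.example.com/
http://siteC.example.com/
```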
Dietrich

