Hi,

Based on our experience, I would recommend running Nutch on a Hadoop pseudo-
distributed cluster with a bit more memory and at least 4 CPU cores. Fetching 
and parsing that many URLs won't be a problem, but updating the crawldb and 
generating fetch lists will be.
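Since it is your own mirror, you can also drop the politeness limits. As a starting point, a minimal sketch of conf/nutch-site.xml overrides (the property names are from the Nutch 1.x defaults file; the values are guesses you will want to tune for your hardware):

```xml
<configuration>
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
    <description>Only follow links inside the mirror.</description>
  </property>
  <property>
    <name>fetcher.server.delay</name>
    <value>0.0</value>
    <description>No politeness delay; the target server is your own.</description>
  </property>
  <property>
    <name>fetcher.threads.fetch</name>
    <value>50</value>
    <description>Total fetcher threads; tune to what the mirror can take.</description>
  </property>
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>50</value>
    <description>All 11M URLs are one host, so the per-host limit must be
    raised too, or fetching stays single-threaded.</description>
  </property>
</configuration>
```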

Are you also indexing? If so, that will be a costly step as well.
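In Nutch 1.x the indexing is a separate pass over the crawled segments; roughly something like the following (exact arguments vary a bit between Nutch versions, and the Solr URL and paths here are placeholders for your setup):

```shell
# Push crawled segments into Solr after the fetch/parse/updatedb cycle.
bin/nutch solrindex http://localhost:8983/solr/ \
    crawl/crawldb crawl/linkdb crawl/segments/*
```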

Cheers

On Saturday 08 October 2011 19:29:49 Fred Zimmerman wrote:
> HI,
> 
> I am looking for advice on how to configure Nutch (and Solr) to crawl a
> private Wikipedia mirror.
> 
>    - It is my mirror on an intranet so I do not need to be polite to
>      myself.
>    - I need to complete this 11 million page crawl as fast as I
>      reasonably can.
>    - Both crawler and mirror are 1.7 GB machines dedicated to this task.
>    - I only need to crawl internal links (not external).
>    - Eventually I will need to update the crawl but a monthly update
>      will be sufficient.
> 
> Any advice (and sample config files) would be much appreciated!
> 
> Fred
