OK, that sounds good. Tell me about the indexing. I came across an article
(http://h3x.no/2011/05/10/guide-solr-performance-tuning) where someone had
indexed about 10% of a Wikipedia clone and, with a much bigger machine and a
*lot* of tuning, was able to reduce the time required from 168 minutes to 16
minutes for the 600,000 records.

Fred

On Mon, Oct 10, 2011 at 10:15 AM, Markus Jelsma <[email protected]> wrote:
> Hi,
>
> Based on our experience I would recommend running Nutch on a Hadoop pseudo-
> cluster with a bit more memory and at least 4 CPU cores. Fetch and parse of
> those URLs won't be a problem, but updating the crawldb and generating fetch
> lists is going to be a problem.
>
> Are you also indexing? Then that will also be a very costly process.
>
> Cheers
>
> On Saturday 08 October 2011 19:29:49 Fred Zimmerman wrote:
> > Hi,
> >
> > I am looking for advice on how to configure Nutch (and Solr) to crawl a
> > private Wikipedia mirror.
> >
> > - It is my mirror on an intranet, so I do not need to be polite to
> >   myself.
> > - I need to complete this 11 million page crawl as fast as I reasonably
> >   can.
> > - Both crawler and mirror are 1.7GB machines dedicated to this task.
> > - I only need to crawl internal links (not external).
> > - Eventually I will need to update the crawl, but a monthly update will
> >   be sufficient.
> >
> > Any advice (and sample config files) would be much appreciated!
> >
> > Fred
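Since sample config files were requested: the requirements in the thread (no politeness needed against one's own intranet mirror, internal links only, as much fetch concurrency as the single host can take) map onto a few standard properties in `conf/nutch-site.xml`. This is a minimal sketch for a Nutch 1.x setup of that era; the property names are real Nutch settings, but the specific values are illustrative assumptions, not tuned or tested for this workload:

```xml
<?xml version="1.0"?>
<configuration>

  <!-- Crawling our own mirror: drop the default per-request politeness delay. -->
  <property>
    <name>fetcher.server.delay</name>
    <value>0.0</value>
  </property>

  <!-- Raise overall fetcher concurrency (default is 10). -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>50</value>
  </property>

  <!-- All 11M URLs live on one host, so allow many threads per host queue
       (the default of 1 would serialize the whole crawl). -->
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>50</value>
  </property>

  <!-- Only follow internal links, per the stated requirement. -->
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>

</configuration>
```

For the indexing side, the Solr tuning article linked above centers on the same kind of knobs (e.g. `ramBufferSizeMB` and merge settings in `solrconfig.xml`); suitable values depend heavily on the 1.7GB of RAM available here.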

