OK, that sounds good. Tell me about the indexing. I came across an article
(http://h3x.no/2011/05/10/guide-solr-performance-tuning) where someone had
indexed about 10% of a Wikipedia clone and, with a much bigger machine and a
*lot* of tuning, was able to reduce the time required from 168 minutes to 16
minutes for the 600,000 records.

Fred

On Mon, Oct 10, 2011 at 10:15 AM, Markus Jelsma <[email protected]> wrote:
> Hi,
>
> Based on our experience I would recommend running Nutch on a Hadoop pseudo-
> cluster with a bit more memory and at least 4 CPU cores. Fetch and parse of
> those URLs won't be a problem, but updating the crawldb and generating fetch
> lists is going to be a problem.
>
> Are you also indexing? Then that will also be a very costly process.
>
> Cheers
>
> On Saturday 08 October 2011 19:29:49 Fred Zimmerman wrote:
> > Hi,
> >
> > I am looking for advice on how to configure Nutch (and Solr) to crawl a
> > private Wikipedia mirror.
> >
> > - It is my mirror on an intranet, so I do not need to be polite to
> >   myself.
> > - I need to complete this 11 million page crawl as fast as I reasonably
> >   can.
> > - Both crawler and mirror are 1.7GB machines dedicated to this task.
> > - I only need to crawl internal links (not external).
> > - Eventually I will need to update the crawl, but a monthly update will
> >   be sufficient.
> >
> > Any advice (and sample config files) would be much appreciated!
> >
> > Fred
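Since sample config files were requested: the requirements in the thread (no politeness needed against one's own intranet mirror, internal links only, as much fetch concurrency as the single host can take) map onto a few standard properties in `conf/nutch-site.xml`. This is a minimal sketch for a Nutch 1.x setup of that era; the property names are real Nutch settings, but the specific values are illustrative assumptions, not tuned or tested for this workload:

```xml
<?xml version="1.0"?>
<configuration>

  <!-- Crawling our own mirror: drop the default per-request politeness delay. -->
  <property>
    <name>fetcher.server.delay</name>
    <value>0.0</value>
  </property>

  <!-- Raise overall fetcher concurrency (default is 10). -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>50</value>
  </property>

  <!-- All 11M URLs live on one host, so allow many threads per host queue
       (the default of 1 would serialize the whole crawl). -->
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>50</value>
  </property>

  <!-- Only follow internal links, per the stated requirement. -->
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>

</configuration>
```

For the indexing side, the Solr tuning article linked above centers on the same kind of knobs (e.g. `ramBufferSizeMB` and merge settings in `solrconfig.xml`); suitable values depend heavily on the 1.7GB of RAM available here.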

