I think I'm just hitting the same issue as reported in this other thread, and since I'm new to Nutch I can't really compare: http://www.mail-archive.com/[email protected]/msg13665.html by Doğacan Güne
At the beginning of a fetch cycle everything goes damn fast, and then it keeps slowing down and "spinwaiting"; something looks wrong.

-Ray-

2009/4/28 Raymond Balmès <[email protected]>

> Well, I also monitored the HD bandwidth: pretty normal, nothing really
> special. Only during the crawldb merge phase is there saturation, but it is
> pretty short.
> What do you call lots of threads, more than 100?
>
> -Ray-
>
> 2009/4/28 Alex Basa <[email protected]>
>
>> I've found that the best speed increases I've made are in boosting the IO
>> by upgrading the HD. I use an I-SCSI with a zfs file system. Another
>> improvement is if your hardware handles lots of threads.
>>
>> Alex
>>
>>
>> --- On Tue, 4/28/09, Raymond Balmès <[email protected]> wrote:
>>
>> > From: Raymond Balmès <[email protected]>
>> > Subject: Re: dual core and crawling
>> > To: [email protected]
>> > Date: Tuesday, April 28, 2009, 10:54 AM
>> > the full web.
>> >
>> > 2009/4/28 Dennis Kubes <[email protected]>
>> >
>> > > Are you crawling only certain domains or doing a full web crawl?
>> > >
>> > > Dennis
>> > >
>> > >
>> > > Raymond Balmès wrote:
>> > >
>> > >> Actually even if I put 100 threads it does not go faster. I have a 30Mbit/s
>> > >> fiber internet connection so that shouldn't be the problem.
>> > >>
>> > >> I thought if I would put more threads I could fetch more sites in parallel
>> > >> and so use more of the bandwidth & the CPU... so waiting on DNS should be
>> > >> seen.
>> > >> Or is it that I need to run multiple fetchers in parallel, but I'm not sure
>> > >> how to do that and merge the results back at the end.
>> > >>
>> > >> -Ray-
>> > >>
>> > >> 2009/4/28 Dennis Kubes <[email protected]>
>> > >>
>> > >>> Java Threads do take advantage of multiple cores. The fetcher does use
>> > >>> multiple threads. Also having multiple fetcher tasks on a single machine
>> > >>> will utilize more of the CPU. Even with 50 threads on a single machine,
>> > >>> depending on the websites being crawled the utilization might not get
>> > >>> that much higher. Much of the time spent in fetching is spent waiting on
>> > >>> DNS and the websites being fetched.
>> > >>>
>> > >>> Dennis
>> > >>>
>> > >>>
>> > >>> Raymond Balmès wrote:
>> > >>>
>> > >>>> I use a dual core Intel. I observed the crawl never gets above the 50%
>> > >>>> CPU load mark, despite the fact that I used -threads 50... does nutch
>> > >>>> take advantage of multi-cores?
>> > >>>> Do I miss a setting somewhere?
>> > >>>>
>> > >>>> -Ray-
