I think I'm just hitting the same issue as reported in this other thread
by Doğacan Güne, but since I'm new to Nutch I can't really compare:
http://www.mail-archive.com/[email protected]/msg13665.html

At the beginning of a fetch cycle everything is going damn fast, and then it
keeps slowing down and "spinwaiting". Something looks wrong.
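
To make the symptom concrete, here is a toy model of one way such a slowdown could arise. It is only a sketch under an assumption I have not verified against the Nutch source: that each host is fetched by at most one thread at a time, with a fixed politeness delay between requests to the same host. The names `simulate`, `urls_per_host`, `threads` and `delay` are made up for illustration and are not Nutch settings or APIs.

```python
def simulate(urls_per_host, threads, delay):
    """Discrete-time model of a fetch cycle (hypothetical, not Nutch code).

    Each fetch takes one tick, and a host may be fetched again only
    `delay` ticks after its previous fetch has finished.
    Returns the number of fetches performed at each tick.
    """
    remaining = dict(urls_per_host)        # host -> URLs still queued
    next_ok = {h: 0 for h in remaining}    # host -> earliest tick allowed
    tick = 0
    history = []
    while any(remaining.values()):
        # Hosts that still have URLs and whose politeness delay has expired.
        ready = [h for h, n in remaining.items() if n > 0 and next_ok[h] <= tick]
        busy = ready[:threads]             # at most one thread per host
        for h in busy:
            remaining[h] -= 1
            next_ok[h] = tick + 1 + delay
        history.append(len(busy))          # the other threads just wait
        tick += 1
    return history


# Assumed workload: 20 small hosts with 2 URLs each, one big host with 30 URLs.
hosts = {f"small{i}": 2 for i in range(20)}
hosts["big"] = 30

history = simulate(hosts, threads=10, delay=4)
print("fetches per tick, first 8 ticks:", history[:8])
print("fetches per tick, last 8 ticks: ", history[-8:])
```

In this model all 10 threads are busy while many hosts still have URLs queued, but once only the big host is left, at most one thread can work and the rest sit idle no matter how many threads are configured. If the real fetcher behaves like this, adding threads would not help the tail of the cycle.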

-Ray-




2009/4/28 Raymond Balmès <[email protected]>

> Well, I also monitored the HD bandwidth: pretty normal, nothing really
> special. Only during the crawldb merge phase is there a saturation, but it
> is pretty short.
> What do you call lots of threads? More than 100?
>
> -Ray-
>
> 2009/4/28 Alex Basa <[email protected]>
>
>
>> I've found that the best speed increases I've made came from boosting the
>> IO by upgrading the HD.  I use iSCSI with a ZFS file system.  Another
>> improvement is if your hardware handles lots of threads.
>>
>> Alex
>>
>>
>> --- On Tue, 4/28/09, Raymond Balmès <[email protected]> wrote:
>>
>> > From: Raymond Balmès <[email protected]>
>> > Subject: Re: dual core and crawling
>> > To: [email protected]
>> > Date: Tuesday, April 28, 2009, 10:54 AM
>> > The full web.
>> >
>> > 2009/4/28 Dennis Kubes <[email protected]>
>> >
>> > > Are you crawling only certain domains or doing a full web crawl?
>> > >
>> > > Dennis
>> > >
>> > >
>> > > Raymond Balmès wrote:
>> > >
>> > >> Actually, even if I put 100 threads it does not go faster. I have a
>> > >> 30Mbit/s fiber internet connection, so that shouldn't be the problem.
>> > >>
>> > >> I thought that if I put more threads I could fetch more sites in
>> > >> parallel and so use more of the bandwidth & the CPU... so waiting on
>> > >> DNS should be seen.
>> > >> Or is it that I need to run multiple fetchers in parallel? I'm not
>> > >> sure how to do that and merge the results back at the end.
>> > >>
>> > >> -Ray-
>> > >>
>> > >> 2009/4/28 Dennis Kubes <[email protected]>
>> > >>
>> > >>> Java threads do take advantage of multiple cores.  The fetcher does
>> > >>> use multiple threads.  Also, having multiple fetcher tasks on a single
>> > >>> machine will utilize more of the CPU.  Even with 50 threads on a
>> > >>> single machine, depending on the websites being crawled, the
>> > >>> utilization might not get that much higher.  Much of the time spent in
>> > >>> fetching is spent waiting on DNS and the websites being fetched.
>> > >>>
>> > >>> Dennis
>> > >>>
>> > >>>
>> > >>> Raymond Balmès wrote:
>> > >>>
>> > >>>> I use a dual-core Intel. I observed that the crawl never gets above
>> > >>>> the 50% mark of CPU load, despite the fact that I used -threads 50...
>> > >>>> Does Nutch take advantage of multi-cores?
>> > >>>> Am I missing a setting somewhere?
>> > >>>>
>> > >>>> -Ray-
>> > >>>>
>> > >>>>
>> > >>>>
>> > >>
>>
>>
>>
>>
>
