How are you doing your DNS for fetching? If you have a single server handling DNS requests for you, you may be overloading it and causing an inadvertent DoS attack on it.
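
To illustrate the point about a single DNS server: one common way to take load off it is to cache lookups on the crawling side, so that each host is resolved upstream only once. The following is a minimal Java sketch of that idea and is not taken from Nutch; `CachingResolver`, `systemLookup` and `upstreamCalls` are hypothetical names made up for the example. The JVM's `InetAddress` already keeps its own lookup cache, so this only demonstrates the principle.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical helper: remembers every answer so that repeated lookups
// for the same host never reach the upstream resolver again.
public class CachingResolver {
    private final ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> upstream;
    private final AtomicInteger upstreamCalls = new AtomicInteger();

    public CachingResolver(Function<String, String> upstream) {
        this.upstream = upstream;
    }

    // Returns the cached address, asking the upstream resolver only on a miss.
    public String resolve(String host) {
        return cache.computeIfAbsent(host, h -> {
            upstreamCalls.incrementAndGet();
            return upstream.apply(h);
        });
    }

    // Number of lookups that actually went upstream.
    public int upstreamCalls() {
        return upstreamCalls.get();
    }

    // Real lookup through the JDK resolver; an empty string marks a failure
    // (assumption made for this sketch, so failures are cached too).
    public static String systemLookup(String host) {
        try {
            return InetAddress.getByName(host).getHostAddress();
        } catch (UnknownHostException e) {
            return "";
        }
    }

    public static void main(String[] args) {
        CachingResolver resolver = new CachingResolver(CachingResolver::systemLookup);
        resolver.resolve("localhost");
        resolver.resolve("localhost"); // served from the cache
        System.out.println("upstream lookups: " + resolver.upstreamCalls());
    }
}
```

In practice a local caching DNS server on the crawl machine serves the same purpose without any code changes.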

Dennis

Raymond Balmès wrote:
I think I'm just hitting the same issues as reported in this other thread
and since I'm new to nutch I can't compare
http://www.mail-archive.com/[email protected]/msg13665.html
by Doğacan
Güne

At the beginning of a fetch cycle everything is going damn fast, and then it
keeps slowing down and "spinwaiting"; something looks wrong.

-Ray-




2009/4/28 Raymond Balmès <[email protected]>

Well, I also monitored the HD bandwidth: pretty normal, nothing really
special. Only during the crawldb merge phase is there saturation, but it is
pretty short.
What do you call lots of threads? More than 100?

-Ray-

2009/4/28 Alex Basa <[email protected]>


I've found that the best speed increases I've made came from boosting the IO
by upgrading the HD.  I use iSCSI with a ZFS file system.  Another
improvement is if your hardware handles lots of threads.

Alex


--- On Tue, 4/28/09, Raymond Balmès <[email protected]> wrote:

From: Raymond Balmès <[email protected]>
Subject: Re: dual core and crawling
To: [email protected]
Date: Tuesday, April 28, 2009, 10:54 AM

The full web.

2009/4/28 Dennis Kubes <[email protected]>

Are you crawling only certain domains or doing a full
web crawl?
Dennis


Raymond Balmès wrote:

Actually, even if I put 100 threads it does not go faster. I have a 30 Mbit/s
fiber internet connection, so that shouldn't be the problem.
I thought if I put in more threads I could fetch more sites in parallel
and so use more of the bandwidth & the CPU... so waiting on DNS should be
seen.
Or is it that I need to run multiple fetchers in parallel? But I'm not sure how
to do that and merge the results back at the end.

-Ray-

2009/4/28 Dennis Kubes <[email protected]>

Java threads do take advantage of multiple cores.  The fetcher does use
multiple threads.  Also having multiple fetcher tasks on a single machine
will utilize more of the CPU.  Even with 50 threads on a single machine,
depending on the websites being crawled the utilization might not get that
much higher.  Much of the time spent in fetching is spent waiting on DNS and
the websites being fetched.
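
A rough way to see why CPU load stays low while more threads can still help: in the Java sketch below each "fetch" is simulated with `Thread.sleep`, standing in for time spent waiting on DNS and remote servers. This is not Nutch code; `IoBoundDemo` and `runMillis` are hypothetical names, and the task count and wait time are arbitrary assumptions.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Simulates fetches that spend all their time waiting rather than computing.
public class IoBoundDemo {

    // Runs `tasks` simulated fetches on a pool of `threads` threads and
    // returns the elapsed wall-clock time in milliseconds.
    public static long runMillis(int threads, int tasks, long waitMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long start = System.nanoTime();
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                try {
                    Thread.sleep(waitMillis); // stands in for DNS + HTTP wait
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        System.out.println("1 thread:   " + runMillis(1, 50, 40) + " ms");
        System.out.println("50 threads: " + runMillis(50, 50, 40) + " ms");
    }
}
```

The threads are asleep nearly the whole time, so the many-thread run finishes much sooner while using almost no CPU in either case. A real fetcher also has to respect per-host politeness delays, which this sketch ignores.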

Dennis


Raymond Balmès wrote:

I use a dual-core Intel. I observed that the crawl never gets above the 50% mark
of CPU load, despite the fact that I used -threads 50... does Nutch take
advantage of multi-cores?
Am I missing a setting somewhere?

-Ray-






