As Arkadi said, your HDD is too slow for 2 x quad-core processors. I have
the same problem and am now thinking of using more boxes or very fast
drives (SAS 15k).
Raymond Balmès wrote:
Well, I suspect the sort function is mono-threaded, as they usually are, so
only one core is used; 25% is the max you will get.
I have a dual core and it only goes to 50% CPU in many of the steps ... I
assumed that some phases are mono-threaded.
-Raymond-
2009/6/4 John Martyniak <[email protected]>
Hi Arkadi,
Thanks for the info. That does sound like a good feature to have; a quick
and dirty merge would be good.
Did you ever come across any parameters or settings that can be changed to
make the merge faster? In my case it just keeps chugging along at about
20% utilization; I would like to get that up to 60% or 70%.
Any ideas?
-John
On Jun 3, 2009, at 9:40 PM, <[email protected]> wrote:
Hi John,
This was my experience, too. If I've interpreted the source code
correctly, the time in merging is spent on sorting, which is required
because the segments are assumed to be "random", possibly containing
duplicated URLs. The sort process groups URLs together and allows choosing
the one to include in the merge result.
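
To make the sort-then-choose idea concrete, here is a minimal Python sketch. It is not Nutch's code: the record layout (url, fetch_time, content) and the rule "keep the most recently fetched copy" are assumptions made only for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical record layout: (url, fetch_time, content).
# This is an illustration, not Nutch's actual data model.

def merge_with_sort(segments):
    """Merge segments, keeping exactly one record per URL."""
    all_records = [record for segment in segments for record in segment]
    # The sort is the expensive step: it brings records that share
    # a URL next to each other so they can be compared.
    all_records.sort(key=itemgetter(0))
    merged = []
    for _url, duplicates in groupby(all_records, key=itemgetter(0)):
        # Assumed selection rule: keep the copy with the latest fetch_time.
        merged.append(max(duplicates, key=itemgetter(1)))
    return merged

segment_a = [("http://example.com/a", 1, "old a"),
             ("http://example.com/b", 1, "b")]
segment_b = [("http://example.com/a", 2, "new a"),
             ("http://example.com/c", 2, "c")]

result = merge_with_sort([segment_a, segment_b])
```

With roughly 2 million records the sort no longer fits comfortably in memory and has to spill to disk, which would be consistent with the heavy /tmp usage reported in this thread.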
I think that if there were a simplified version of merge available, one
that assumed that all segments come from the same crawl and are in a "good",
completed state, this merge would be very fast, because no sorting would be
required.
It would be very useful, too, because it seems that this "simple" use is
what people need.
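
As a rough sketch of what such a "simple" merge could look like, assuming no URL appears in more than one segment (the function name and record layout below are hypothetical, not an existing Nutch feature):

```python
from itertools import chain

# Hypothetical record layout: (url, fetch_time, content).
# Assumption: segments come from the same crawl, so no URL repeats.

def merge_without_sort(segments):
    """Concatenate segments in a single pass; no sorting, no grouping."""
    return list(chain.from_iterable(segments))

segment_1 = [("http://example.com/a", 1, "a"),
             ("http://example.com/b", 1, "b")]
segment_2 = [("http://example.com/c", 2, "c")]

simple_result = merge_without_sort([segment_1, segment_2])
```

The work here is linear in the number of records and needs no temporary sort files, but the result is only correct if the no-duplicates assumption really holds.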
Regards,
Arkadi
-----Original Message-----
From: John Martyniak [mailto:[email protected]]
Sent: Thursday, June 04, 2009 10:01 AM
To: [email protected]
Subject: Merge taking forever
I am running into some problems.
I have 8 segments, each with approximately 250K URLs (~2 million in
total). I am trying to merge them into one.
But it takes forever: it had been running for about 3 days before I
stopped it. It had also used 904 GB in the /tmp directory.
The machine that it is running on is a Dual Intel Quad core 2.8 GHz,
with 24 GB of RAM. The CPU stays at about 20% utilization.
Any ideas? I went through the Nutch configs and didn't see anything
that seemed like it would add more memory, workers, etc. to this task.
Any help would be greatly appreciated.
Thank you,
-John
John Martyniak
President/CEO
Before Dawn Solutions, Inc.
9457 S. University Blvd #266
Highlands Ranch, CO 80126
o: 877-499-1562
c: 303-522-1756
e: [email protected]
w: http://www.beforedawnsolutions.com