I suspect the sort function is single-threaded, as they usually are, so only
one core is used and 25% is the max you will get.
I have a dual core and it only goes to 50% CPU in many of the steps, so I
assume that some phases are single-threaded.
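
A rough sketch of that arithmetic, assuming the reported CPU figure is averaged
over all cores and that exactly one core is fully busy (the function name is
made up for illustration):

```python
def single_thread_ceiling(cores: int) -> float:
    """Upper bound on total CPU utilization (percent) when only one core is busy.

    Assumption: the utilization figure is an average across all cores.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return 100.0 / cores


# Hypothetical core counts, chosen only to show the pattern.
for cores in (2, 4, 8):
    print(f"{cores} cores: at most {single_thread_ceiling(cores):.1f}% total CPU")
```

By this estimate a 2-core machine tops out at 50%, a 4-core machine at 25% and
an 8-core machine at 12.5%, so a reading above that would suggest that some
part of the job runs on more than one core.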

-Raymond-

2009/6/4 John Martyniak <[email protected]>

> Hi Arkadi,
>
> Thanks for the info. That does sound like a good feature to have; a quick
> and dirty merge would be good.
>
> Did you ever come across any parameters or settings that can be changed to
> make the merge faster?  In my case it just keeps chugging along at about
> 20% utilization; I would like to get that up to 60% or 70%.
>
> Any ideas?
>
> -John
>
>
> On Jun 3, 2009, at 9:40 PM, <[email protected]> wrote:
>
>> Hi John,
>>
>> This was my experience, too. If I've interpreted the source code
>> correctly, the time in merging is spent on sorting, which is required
>> because the segments are assumed to be "random", possibly containing
>> duplicated URLs. The sort process groups identical URLs together and makes
>> it possible to choose the one to include in the merge result.
>>
>> I think that if there were a simplified version of merge available that
>> assumed that all segments come from the same crawl and are in a "good",
>> completed state, this merge would be very fast, because no sorting would be
>> required. It would be very useful, too, because it seems that this "simple"
>> use is what people need.
>>
>> Regards,
>>
>> Arkadi
>>
>>
>> -----Original Message-----
>>> From: John Martyniak [mailto:[email protected]]
>>> Sent: Thursday, June 04, 2009 10:01 AM
>>> To: [email protected]
>>> Subject: Merge taking forever
>>>
>>> I am running into some problems.
>>>
>>> I have 8 segments, each with approximately 250K URLs (~2 million in
>>> total).  I am trying to merge them into one.
>>>
>>> But it takes forever; it had been running for about 3 days before I
>>> stopped it.  It has also used 904 GB in the /tmp directory.
>>>
>>> The machine that it is running on is a Dual Intel Quad core 2.8 GHz,
>>> with 24 GB of RAM.  The CPU stays at about 20% utilization.
>>>
>>> Any ideas?  I went through the Nutch configs and didn't see anything
>>> that seemed like it would add more memory, workers, etc. to this task.
>>>
>>> Any help would be greatly appreciated.
>>>
>>> Thank you,
>>>
>>> -John
>>>
>>>
>>>
>>>
>>> John Martyniak
>>> President/CEO
>>> Before Dawn Solutions, Inc.
>>> 9457 S. University Blvd #266
>>> Highlands Ranch, CO 80126
>>> o: 877-499-1562
>>> c: 303-522-1756
>>> e: [email protected]
>>> w: http://www.beforedawnsolutions.com
>>>
>>
>>
