I suspect the sort function is single-threaded, as they usually are, so only one core is used: 25% (one core out of four) is the max you will get. I have a dual core and it only goes to 50% CPU in many of the steps, so I assumed that some phases are single-threaded.
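To put numbers on that ceiling, here is a rough sketch of the arithmetic. It assumes the reported CPU figure is averaged across all cores and that the sort really is pinned to one thread, which I have not verified in the code. The function names are made up for illustration; they are not part of Nutch or Hadoop.

```python
def single_thread_ceiling(cores: int) -> float:
    """Highest overall CPU % that one fully busy thread can show on `cores` cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return 100.0 / cores


def busy_cores(utilization_pct: float, cores: int) -> float:
    """Convert an overall CPU % (averaged over all cores) into 'cores kept busy'."""
    return utilization_pct * cores / 100.0


if __name__ == "__main__":
    # Dual core, quad core, and a dual quad-core box (8 cores).
    for n in (2, 4, 8):
        print(f"{n} cores: one thread tops out at {single_thread_ceiling(n):.1f}%")
    # The 20% reported on the 8-core machine, expressed as cores kept busy.
    print(f"20% on 8 cores is about {busy_cores(20, 8):.1f} cores busy")
```

By this reckoning, 20% on a dual quad-core machine is not much more than one core's worth of work, which would fit a mostly single-threaded phase. If the OS counts hyperthreads as cores, the numbers shift accordingly.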
-Raymond-

2009/6/4 John Martyniak <[email protected]>

> Hi Arkadi,
>
> Thanks for the info, that does sound like a good feature to have, a quick
> and dirty merge would be good.
>
> Did you ever come across any parameters or settings that can be changed to
> make the merge faster? It seems that in my case it just keeps chugging
> along at about 20% utilization, I would like to get that up to 60% or 70%.
>
> Any ideas?
>
> -John
>
>
> On Jun 3, 2009, at 9:40 PM, <[email protected]> wrote:
>
>> Hi John,
>>
>> This was my experience, too. If I've interpreted the source code
>> correctly, the time in merging is spent on sorting, which is required
>> because the segments are assumed to be "random", possibly containing
>> duplicated URLs. The sort process groups URLs together and allows to choose
>> the one to include in the merge result.
>>
>> I think that if there were a simplified version of merge available, that
>> assumed that all segments come from same crawl and are in "good" completed
>> state, this merge would be very fast, because no sorting would be required.
>> It would be very useful, too, because it seems that this "simple" use is
>> what people need.
>>
>> Regards,
>>
>> Arkadi
>>
>>
>> -----Original Message-----
>>> From: John Martyniak [mailto:[email protected]]
>>> Sent: Thursday, June 04, 2009 10:01 AM
>>> To: [email protected]
>>> Subject: Merge taking forever
>>>
>>> I am running into some problems.
>>>
>>> I have 8 segments all with approximately 250K (~2 million) URLS. I am
>>> trying to merge that into one.
>>>
>>> But takes forever, it had been running for about 3 days before I
>>> stopped it. It also has used 904 GB in the /tmp directory.
>>>
>>> The machine that it is running on is a Dual Intel Quad core 2.8 GHz,
>>> with 24 GB of RAM. The CPU stays at about 20% utilization.
>>>
>>> Any ideas? I went through the nutch configs and didn't see anything
>>> that seemed like it would add more memory, workers, etc to this task.
>>>
>>> Any help would be greatly appreciated.
>>>
>>> Thank you,
>>>
>>> -John
>>>
>>>
>>> John Martyniak
>>> President/CEO
>>> Before Dawn Solutions, Inc.
>>> 9457 S. University Blvd #266
>>> Highlands Ranch, CO 80126
>>> o: 877-499-1562
>>> c: 303-522-1756
>>> e: [email protected]
>>> w: http://www.beforedawnsolutions.com
>>>
>>
>>
> John Martyniak
> President/CEO
> Before Dawn Solutions, Inc.
> 9457 S. University Blvd #266
> Highlands Ranch, CO 80126
> o: 877-499-1562
> c: 303-522-1756
> e: [email protected]
> w: http://www.beforedawnsolutions.com
>
>
