How long does it take for your 6 million URLs to be crawled/parsed/indexed? I'm curious to know because I'm about to start working in this area, but I have no idea how long it will take.
-Ray-

2009/6/5 John Martyniak <[email protected]>

> Arkady,
>
> I think that is the beauty of Nutch. I have built an index of a little
> more than 6 million URLs with "out of the box" Nutch. I would say that
> is pretty good for most situations before you have to start getting
> into Hadoop and multiple machines.
>
> -John
>
> On Jun 4, 2009, at 5:19 PM, <[email protected]> wrote:
>
>> Hi Andrzej,
>>
>>> -----Original Message-----
>>> From: Andrzej Bialecki [mailto:[email protected]]
>>> Sent: Thursday, June 04, 2009 9:47 PM
>>> To: [email protected]
>>> Subject: Re: Merge taking forever
>>>
>>> Bartosz Gadzimski wrote:
>>>
>>>> As Arkadi said, your HDD is too slow for two quad-core processors. I
>>>> have the same problem and am now thinking of using more boxes or
>>>> very fast drives (15k SAS).
>>>>
>>>> Raymond Balmès writes:
>>>>
>>>>> Well, I suspect the sort function is single-threaded, as they
>>>>> usually are, so only one core is used - 25% is the max you will
>>>>> get. I have a dual core and it only goes to 50% CPU in many of the
>>>>> steps... I assumed that some phases are single-threaded.
>>>
>>> Folks,
>>>
>>> From your conversation I suspect that you are running Hadoop with
>>> LocalJobTracker, i.e. in a single JVM - correct?
>>>
>>> While this works OK for small datasets, you don't really benefit from
>>> map-reduce parallelism (and you still pay the penalty for the
>>> overheads). As your dataset grows, you will quickly reach the
>>> scalability limits - in this case, the limit of I/O throughput of a
>>> single drive during the sort phase of a large dataset. The excessive
>>> I/O demands can be solved by distributing the load (over many drives,
>>> and over many machines), which is what HDFS is designed to do well.
>>>
>>> Hadoop tasks are usually single-threaded, and additionally
>>> LocalJobTracker implements only a primitive non-parallel model of
>>> task execution - i.e. each task is scheduled to run sequentially in
>>> turn. If you run the regular distributed JobTracker, Hadoop splits
>>> the load among many tasks running in parallel.
>>>
>>> So, the solution is this: set up a distributed Hadoop cluster, even
>>> if it's going to consist of a single node - because then the data
>>> will be split and processed in parallel by several JVM instances.
>>> This will also help the operating system schedule these processes
>>> over multiple CPUs. Additionally, if you still experience I/O
>>> contention, consider moving to HDFS as the filesystem, and spread it
>>> over more than one machine and more than one disk in each machine.
>>
>> Thank you for these recommendations.
>>
>> I think there is a large group of users (perhaps limited by budget or
>> by the time they are willing to spend) who will give up on trying to
>> use Nutch unless they can run it on a single box with a simple
>> configuration.
>>
>> Regards,
>>
>> Arkadi
>>
>>> --
>>> Best regards,
>>> Andrzej Bialecki <><
>>>  ___. ___ ___ ___ _ _   __________________________________
>>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>>> ___|||__||  \|  || |   Embedded Unix, System Integration
>>> http://www.sigram.com  Contact: info at sigram dot com
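For anyone who wants to act on Andrzej's advice, here is a minimal
sketch of the single-node "pseudo-distributed" setup he describes,
assuming the Hadoop 0.19/0.20 line that shipped with Nutch at the time
(on 0.20+ these properties are split across core-site.xml and
mapred-site.xml rather than living in one hadoop-site.xml). The host
and ports are the conventional defaults, not values from this thread:

  <!-- conf/hadoop-site.xml -->
  <configuration>
    <!-- Use HDFS rather than the local filesystem -->
    <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
    </property>
    <!-- Any value other than "local" makes Hadoop use the real
         distributed JobTracker instead of running every task
         sequentially inside a single JVM -->
    <property>
      <name>mapred.job.tracker</name>
      <value>localhost:9001</value>
    </property>
    <!-- One datanode, so keep a single copy of each block -->
    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>
  </configuration>

After formatting HDFS (bin/hadoop namenode -format) and starting the
daemons (bin/start-all.sh), each map and reduce task runs in its own
child JVM, so the operating system can schedule work across all cores.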
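And for his last point - spreading HDFS and the intermediate sort data
over more than one disk per machine - the relevant properties take
comma-separated lists of directories. A sketch; the mount points below
are made up for illustration:

  <!-- also in conf/hadoop-site.xml (hdfs-site.xml and mapred-site.xml
       on 0.20+) -->
  <!-- Datanode block storage is spread round-robin over these dirs -->
  <property>
    <name>dfs.data.dir</name>
    <value>/disk1/hdfs/data,/disk2/hdfs/data</value>
  </property>
  <!-- Map output spill files (the sort-phase traffic that was
       saturating a single drive) are likewise spread over these
       local dirs -->
  <property>
    <name>mapred.local.dir</name>
    <value>/disk1/mapred/local,/disk2/mapred/local</value>
  </property>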
