Hi Ken,

I don't know Bixo; how does it compare to Nutch? And what do you mean by
"copy to S3"?

-Raymond-
2009/6/5 Ken Krugler <[email protected]>:

>> How long does it take for your 6 million URLs to be
>> crawled/parsed/indexed? I'm curious to know because I'm about to give it
>> a shot in this area, but I have no idea how long it will take.
>
> Not sure this helps, but below are some general stats from a fresh crawl
> I just did using Bixo, which should be similar to Nutch times.
>
> This is for a 5-server cluster (1 master, 4 slaves) of lowest-end boxes
> at Amazon's EC2.
>
> * 1.2M URLs from 42K domains
> * Fetch took 3.5 hours, mostly due to the long tail
> * Parse took 2.5 hours
> * Index took 25 minutes
> * Copy to S3 took 1 hour
>
> Adding more servers would have made some things faster, but it wouldn't
> have significantly increased the speed of the fetch, due to the limited
> number of domains.
>
> So total time was 7.5 hours. 8 hours * 5 servers = 40 compute hours, or
> $4 using EC2 pricing of $0.10/hr.
>
> Amazon's Elastic MapReduce (EMR) is slightly more expensive, but you
> could avoid dealing with Hadoop config/setup time.
>
> -- Ken
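The "copy to S3" step in Ken's list is the transfer of the finished crawl
output from the cluster's HDFS to an Amazon S3 bucket, presumably so the
results survive shutting the EC2 instances down. Below is a minimal sketch
of such a copy using Hadoop's FileSystem API; the class name, paths, bucket,
and credential values are placeholders rather than details from this thread,
and it assumes the s3n (S3 native) filesystem scheme Hadoop shipped with at
the time.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

// Illustrative sketch only: bucket, paths, and credentials are made up.
public class CopyCrawlToS3 {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Normally these would live in hadoop-site.xml rather than in code.
        conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");
        conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY");

        FileSystem hdfs = FileSystem.get(URI.create("hdfs://master:9000/"), conf);
        FileSystem s3 = FileSystem.get(URI.create("s3n://my-crawl-bucket/"), conf);

        Path src = new Path("/user/crawler/crawl-output");         // segments + index in HDFS
        Path dst = new Path("s3n://my-crawl-bucket/crawl-output"); // durable copy in S3

        // false = keep the HDFS copy after the transfer
        FileUtil.copy(hdfs, src, s3, dst, false, conf);
    }
}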
> 2009/6/5 John Martyniak <[email protected]>:
>
>> Arkady,
>>
>> I think that is the beauty of Nutch. I have built an index of a little
>> more than 6 million URLs with "out of the box" Nutch. I would say that
>> is pretty good for most situations, before you have to start getting
>> into Hadoop and multiple machines.
>>
>> -John
>>
>> On Jun 4, 2009, at 5:19 PM, <[email protected]> wrote:
>>
>>> Hi Andrzej,
>>>
>>>> -----Original Message-----
>>>> From: Andrzej Bialecki [mailto:[email protected]]
>>>> Sent: Thursday, June 04, 2009 9:47 PM
>>>> To: [email protected]
>>>> Subject: Re: Merge taking forever
>>>>
>>>> Bartosz Gadzimski wrote:
>>>>
>>>>> As Arkadi said, your hdd is too slow for 2 x quad-core processors. I
>>>>> have the same problem and am now thinking of using more boxes or very
>>>>> fast drives (SAS 15k).
>>>>>
>>>>> Raymond Balmãs wrote:
>>>>>
>>>>>> Well, I suspect the sort function is mono-threaded, as they usually
>>>>>> are, so only one core is used; 25% is the max you will get. I have a
>>>>>> dual core and it only goes to 50% CPU in many of the steps... I
>>>>>> assumed that some phases are mono-threaded.
>>>>>
>>>> Folks,
>>>>
>>>> From your conversation I suspect that you are running Hadoop with the
>>>> LocalJobTracker, i.e. in a single JVM - correct?
>>>>
>>>> While this works OK for small datasets, you don't really benefit from
>>>> map-reduce parallelism (and you still pay the penalty for the
>>>> overheads). As your dataset grows, you will quickly reach the
>>>> scalability limits - in this case, the limit of IO throughput of a
>>>> single drive during the sort phase of a large dataset. The excessive
>>>> IO demands can be solved by distributing the load (over many drives,
>>>> and over many machines), which is what HDFS is designed to do well.
>>>>
>>>> Hadoop tasks are usually single-threaded, and additionally the
>>>> LocalJobTracker implements only a primitive non-parallel model of task
>>>> execution - i.e. each task is scheduled to run sequentially in turn.
>>>> If you run the regular distributed JobTracker, Hadoop splits the load
>>>> among many tasks running in parallel.
>>>>
>>>> So, the solution is this: set up a distributed Hadoop cluster, even if
>>>> it's going to consist of a single node - because then the data will be
>>>> split and processed in parallel by several JVM instances. This will
>>>> also help the operating system to schedule these processes over
>>>> multiple CPUs. Additionally, if you still experience IO contention,
>>>> consider moving to HDFS as the filesystem, and spread it over more
>>>> than one machine and more than one disk in each machine.
>>>
>>> Thank you for these recommendations.
>>>
>>> I think that there is a large group of users (perhaps limited by
>>> budget, or by the time they are willing to spend) that will give up on
>>> trying to use Nutch unless they can run it on a single box with a
>>> simple configuration.
>>>
>>> Regards,
>>> Arkadi
>
> --
> Ken Krugler
> +1 530-210-6378
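The distinction Andrzej draws, everything in one JVM versus a real (even
single-node) distributed cluster, hinges on one Hadoop setting of that era:
mapred.job.tracker. The value "local" selects the local job runner he
describes, while a host:port value submits work to a distributed JobTracker.
A minimal sketch follows, assuming the Hadoop 0.19/0.20-style JobConf API
that Nutch used at the time; the class name and host are placeholders.

import org.apache.hadoop.mapred.JobConf;

// Minimal sketch: report whether jobs would run locally or on a cluster.
public class JobTrackerMode {
    public static void main(String[] args) {
        JobConf conf = new JobConf();

        // "local" -> LocalJobRunner: one JVM, tasks run one after another.
        // "host:port" (e.g. "master:9001") -> distributed JobTracker, which
        // runs map and reduce tasks in parallel across TaskTracker JVMs.
        String tracker = conf.get("mapred.job.tracker", "local");

        if ("local".equals(tracker)) {
            System.out.println("LocalJobRunner: tasks run sequentially, no parallelism.");
        } else {
            System.out.println("Distributed JobTracker at " + tracker + ": parallel tasks.");
        }
    }
}

For the single-node "pseudo-distributed" setup Andrzej recommends,
mapred.job.tracker would point at a JobTracker on localhost and
fs.default.name at a local HDFS NameNode, so several task JVMs can run in
parallel even on one box. Spreading HDFS over more disks, as he also
suggests, is again a configuration matter: dfs.data.dir takes a
comma-separated list of local directories.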
