Hi Ken,

I don't know Bixo. How does it compare to Nutch?
What do you mean by "copy to S3"?

-Raymond-

2009/6/5 Ken Krugler <[email protected]>

>> How long does it take for your 6 million URLs to be
>> crawled/parsed/indexed? I'm curious because I'm about to get into
>> this area, but I have no idea how long it will take.
>>
>
> Not sure this helps, but below are some general stats from a fresh crawl I
> just did using Bixo, which should be similar to Nutch times.
>
> This is for a 5 server cluster (1 master, 4 slaves) of lowest-end boxes at
> Amazon's EC2.
>
> * 1.2M URLs from 42K domains
> * Fetch took 3.5 hours, mostly due to long tail
> * Parse took 2.5 hours
> * Index took 25 minutes
> * Copy to S3 took 1 hour
>
> Adding more servers would have made some things faster, but it wouldn't
> have significantly increased the speed of the fetch, due to the limited
> number of domains.
>
> So total time was about 7.5 hours. Rounding up, 8 hours * 5 servers = 40
> compute hours, or $4 using EC2 pricing of $0.10/hr.
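
The arithmetic above can be checked with a small sketch. The stage durations, server count and hourly price are taken from Ken's message; the function name and the rounding up to whole hours are assumptions made here to explain how 7.5 hours becomes 8.

```python
import math

def ec2_cost(stage_hours, servers, price_per_hour):
    """Return (elapsed hours, compute hours, cost in dollars).

    Assumption: elapsed time is rounded up to whole hours before
    multiplying by the number of servers.
    """
    elapsed = sum(stage_hours)
    billed_hours = math.ceil(elapsed)
    compute_hours = billed_hours * servers
    return elapsed, compute_hours, compute_hours * price_per_hour

# Fetch, parse, index, copy to S3 (in hours), as quoted in the message.
stages = [3.5, 2.5, 25 / 60, 1.0]
elapsed, compute_hours, cost = ec2_cost(stages, servers=5, price_per_hour=0.10)
print(f"{elapsed:.2f} h elapsed, {compute_hours} compute hours, ${cost:.2f}")
```

The same function can be reused to compare cluster sizes, although it does not model how stage times change when servers are added.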
>
> Amazon's Elastic MapReduce (EMR) is slightly more expensive, but you could
> avoid dealing with Hadoop config/setup time.
>
> -- Ken
>
>  2009/6/5 John Martyniak <[email protected]>
>>
>>   Arkady,
>>>
>>> I think that is the beauty of Nutch. I have built an index of a little more
>>> than 6 million URLs with "out of the box" Nutch. I would say that is pretty
>>> good for most situations before you have to start getting into Hadoop and
>>> multiple machines.
>>>
>>
>>>  -John
>>>
>>>
>>>  On Jun 4, 2009, at 5:19 PM, <[email protected]> wrote:
>>>
>>>  Hi Andrzej,
>>>
>>>>
>>>>  -----Original Message-----
>>>>
>>>>>  From: Andrzej Bialecki [mailto:[email protected]]
>>>>>  Sent: Thursday, June 04, 2009 9:47 PM
>>>>>  To: [email protected]
>>>>>  Subject: Re: Merge taking forever
>>>>>
>>>>>  Bartosz Gadzimski wrote:
>>>>>
>>>>>>  As Arkadi said, your HDD is too slow for a 2 x quad-core processor. I have
>>>>>>  the same problem and am now thinking of using more boxes or very fast
>>>>>>  drives (15k SAS).
>>>>>>
>>>>>>  Raymond Balmès wrote:
>>>>>>
>>>>>>
>>>>>>>  Well, I suspect the sort function is single-threaded, as they usually
>>>>>>>  are, so only one core is used; 25% is the max you will get.
>>>>>>>  I have a dual core and it only goes to 50% CPU in many of the steps ...
>>>>>>>  I assumed that some phases are single-threaded.
>>>>>>>
>>>>>  Folks,
>>>>>
>>>>>  From your conversation I suspect that you are running Hadoop with
>>>>>  LocalJobTracker, i.e. in a single JVM - correct?
>>>>>
>>>>>  While this works ok for small datasets, you don't really benefit from
>>>>>  map-reduce parallelism (and you still pay the penalty for the
>>>>>  overheads). As your dataset grows, you will quickly reach the
>>>>>  scalability limits - in this case, the limit of IO throughput of a
>>>>>  single drive, during the sort phase of a large dataset. The excessive
>>>>>  IO demands can be solved by distributing the load (over many drives,
>>>>>  and over many machines), which is what HDFS is designed to do well.
>>>>>
>>>>>  Hadoop tasks are usually single-threaded, and additionally
>>>>>  LocalJobTracker implements only a primitive non-parallel model of task
>>>>>  execution - i.e. each task is scheduled to run sequentially in turn.
>>>>>  If you run the regular distributed JobTracker, Hadoop splits the load
>>>>>  among many tasks running in parallel.
>>>>>
>>>>>  So, the solution is this: set up a distributed Hadoop cluster, even if
>>>>>  it's going to consist of a single node - because then the data will be
>>>>>  split and processed in parallel by several JVM instances. This will
>>>>>  also help the operating system to schedule these processes over
>>>>>  multiple CPUs. Additionally, if you still experience IO contention,
>>>>>  consider moving to HDFS as the filesystem, and spread it over more than
>>>>>  1 machine and more than 1 disk in each machine.
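
The difference between running tasks one after another and running them in parallel slots can be illustrated with a toy scheduling model. This is only a sketch: the function name, the task durations and the slot counts are hypothetical, it is not Hadoop code, and it ignores the IO contention on a single drive that the message above identifies as the real bottleneck.

```python
import heapq

def wall_time(task_seconds, slots):
    """Estimate wall-clock time when each task goes to the slot that frees up first."""
    finish_times = [0.0] * slots
    heapq.heapify(finish_times)
    for duration in task_seconds:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + duration)
    return max(finish_times)

# Hypothetical workload: eight tasks of one minute each.
tasks = [60.0] * 8
sequential = wall_time(tasks, slots=1)  # tasks run in turn, one at a time
parallel = wall_time(tasks, slots=4)    # four tasks run at once
print(f"sequential: {sequential:.0f} s, parallel: {parallel:.0f} s")
```

With one slot the estimate is simply the sum of all task durations. With several slots it approaches the sum divided by the slot count, provided the tasks do not compete for the same disk.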
>>>>>
>>>>  Thank you for these recommendations.
>>>>
>>>>  I think that there is a large group of users (perhaps limited by budget
>>>>  or the time they are willing to spend) that will give up on trying to use
>>>>  Nutch unless they can run it on a single box with a simple configuration.
>>>>
>>>>  Regards,
>>>>
>>>>  Arkadi
>>
>
>
> --
> Ken Krugler
> +1 530-210-6378
