How long does it take for your 6 million URLs to be
crawled/parsed/indexed? I'm curious to know because I'm about to dive into
this area, but I have no idea how long it will take.

-Ray-

2009/6/5 John Martyniak <[email protected]>

> Arkady,
>
> I think that is the beauty of Nutch: I have built an index of a little more
> than 6 million URLs with "out of the box" Nutch.  I would say that is pretty
> good for most situations before you have to start getting into Hadoop and
> multiple machines.
>
> -John
>
>
> On Jun 4, 2009, at 5:19 PM, <[email protected]> wrote:
>
>> Hi Andrzej,
>>
>> -----Original Message-----
>>> From: Andrzej Bialecki [mailto:[email protected]]
>>> Sent: Thursday, June 04, 2009 9:47 PM
>>> To: [email protected]
>>> Subject: Re: Merge taking forever
>>>
>>> Bartosz Gadzimski wrote:
>>>
>>>> As Arkadi said, your HDD is too slow for 2 x quad-core processors. I have
>>>> the same problem and am now thinking of using more boxes or very fast
>>>> drives (SAS 15k).
>>>>
>>>> Raymond Balmès writes:
>>>>
>>>>> Well, I suspect the sort function is mono-threaded, as they usually are,
>>>>> so only one core is used; 25% is the max you will get. I have a dual
>>>>> core and it only goes to 50% CPU in many of the steps ... I assumed
>>>>> that some phases are mono-threaded.
>>>>>
>>>>
>>> Folks,
>>>
>>> From your conversation I suspect that you are running Hadoop with
>>> LocalJobTracker, i.e. in a single JVM - correct?
>>>
>>> While this works ok for small datasets, you don't really benefit from
>>> map-reduce parallelism (and you still pay the penalty for the
>>> overheads). As your dataset grows, you will quickly reach the
>>> scalability limits - in this case, the limit of IO throughput of a
>>> single drive, during the sort phase of a large dataset. The excessive IO
>>> demands can be solved by distributing the load (over many drives, and
>>> over many machines), which is what HDFS is designed to do well.
>>>
>>> Hadoop tasks are usually single-threaded, and additionally
>>> LocalJobTracker implements only a primitive non-parallel model of task
>>> execution - i.e. each task is scheduled to run sequentially in turn. If
>>> you run the regular distributed JobTracker, Hadoop splits the load among
>>> many tasks running in parallel.
>>>
>>> So, the solution is this: set up a distributed Hadoop cluster, even if
>>> it's going to consist of a single node - because then the data will be
>>> split and processed in parallel by several JVM instances. This will also
>>> help the operating system to schedule these processes over multiple
>>> CPUs. Additionally, if you still experience IO contention, consider
>>> moving to HDFS as the filesystem, and spread it over more than 1
>>> machine and more than 1 disk in each machine.
>>>
>>
>> Thank you for these recommendations.
>>
>> I think there is a large group of users (perhaps limited by the budget or
>> time they are willing to spend) who will give up on trying to use Nutch
>> unless they can run it on a single box with a simple configuration.
>>
>> Regards,
>>
>> Arkadi
>>
>>
>>> --
>>> Best regards,
>>> Andrzej Bialecki     <><
>>>  ___. ___ ___ ___ _ _   __________________________________
>>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>>> http://www.sigram.com  Contact: info at sigram dot com
>>>
>>
>>
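
A minimal sketch of the single-node, pseudo-distributed setup Andrzej
describes above, assuming a Hadoop 0.19/0.20-era hadoop-site.xml (property
and file names vary between Hadoop versions, and the host names, ports and
paths below are only placeholders):

  <!-- hadoop-site.xml: hypothetical single-node example -->
  <configuration>

    <!-- Use a local HDFS instance instead of the default file:/// filesystem -->
    <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
    </property>

    <!-- Any value other than "local" starts the distributed JobTracker, so
         tasks run in parallel child JVMs rather than in the LocalJobTracker -->
    <property>
      <name>mapred.job.tracker</name>
      <value>localhost:9001</value>
    </property>

    <!-- Spread HDFS data over more than one physical disk to ease IO contention -->
    <property>
      <name>dfs.data.dir</name>
      <value>/disk1/hdfs/data,/disk2/hdfs/data</value>
    </property>

    <!-- Allow several map and reduce tasks to run in parallel on the node -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>4</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>4</value>
    </property>

  </configuration>

With a configuration along these lines, formatting the namenode
(bin/hadoop namenode -format) and starting the daemons (bin/start-all.sh)
should give several task JVMs per job even on a single machine, which is
what lets the operating system spread the work over multiple cores.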
