My segments are roughly 1.3 GB, 5 GB, and 4.5 GB.

2009/6/15 Bartosz Gadzimski <[email protected]>

> Hello,
>
> Can you check the size of the merged segments?
>
> If I remember correctly, when I had segment1 = 1 GB and segment2 = 1 GB, the
> new merged segment was something like 5 GB, but I haven't had time to look
> into it.
>
> Thanks,
> Bartosz
>
> czerwionka paul wrote:
>
>> Hi Justin,
>>
>> I am running Hadoop in distributed mode and having the same problem:
>> merging segments just eats up much more temp space than the combined size
>> of the segments themselves.
>>
>> paul.
>>
>> On 14.06.2009, at 18:17, MilleBii wrote:
>>
>>> Same here: merging 3 segments of 100k, 100k, and 300k URLs consumed
>>> 200 GB and filled the partition after 18 hours of processing.
>>>
>>> Something is strange with this segment merge.
>>>
>>> Config: dual-core PC, Vista, Hadoop on a single node.
>>>
>>> Can someone confirm whether installing Hadoop in distributed mode will
>>> fix it? Is there a good config guide for distributed mode?
>>>
>>>
>>> 2009/6/12 Justin Yao <[email protected]>
>>>
>>>> Hi John,
>>>> I have no idea about that either.
>>>> Justin
>>>>
>>>> On Fri, Jun 12, 2009 at 8:05 AM, John Martyniak <[email protected]> wrote:
>>>>
>>>>> Justin,
>>>>>
>>>>> Thanks for the response.
>>>>>
>>>>> I was having a similar issue: I was trying to merge the segments for
>>>>> the crawls during the month of May, probably around 13-15 GB in total,
>>>>> and after everything had run it had used temp space of around 900 GB,
>>>>> which doesn't seem very efficient.
>>>>>
>>>>> I will try this out and see if it changes anything.
>>>>>
>>>>> Do you know if there is any risk in using the following:
>>>>> <property>
>>>>>  <name>mapred.min.split.size</name>
>>>>>  <value>671088640</value>
>>>>> </property>
>>>>>
>>>>> as suggested in the article?
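>>>>>
>>>>> (For reference, 671088640 bytes is 640 * 1024 * 1024, i.e. a 640 MB
>>>>> minimum split size. My understanding is that a larger minimum split
>>>>> size means fewer, larger map tasks, so the merge produces fewer
>>>>> intermediate files; the exact value is just what the article
>>>>> suggested, not something I have verified.)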
>>>>>
>>>>> -John
>>>>>
>>>>> On Jun 11, 2009, at 7:25 PM, Justin Yao wrote:
>>>>>
>>>>>> Hi John,
>>>>>>
>>>>>> I had the same issue before but never found a solution.
>>>>>> Here is a workaround mentioned by someone on this mailing list; you
>>>>>> may give it a try:
>>>>>>
>>>>>> Seemingly abnormal temp space use by segment merger
>>>>>>
>>>>>>
>>>>>> http://www.nabble.com/Content%28source-code%29-of-web-pages-crawled-by-nutch-tt23495506.html#a23522569
>>>>>>
>>>>>> Regards,
>>>>>> Justin
>>>>>>
>>>>>> On Sat, Jun 6, 2009 at 4:09 PM, John Martyniak <[email protected]> wrote:
>>>>>>
>>>>>>> Ok.
>>>>>>>
>>>>>>> So an update on this item.
>>>>>>>
>>>>>>> I did start running Nutch with Hadoop; I am trying a single-node
>>>>>>> config just to test it out.
>>>>>>>
>>>>>>> It took forever to get all of the files into the DFS (it was just
>>>>>>> over 80 GB), but it is in there. So I started the SegmentMerge job,
>>>>>>> and it is working flawlessly, though still a little slow.
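>>>>>>>
>>>>>>> (For the record, the commands I mean are roughly these; the paths
>>>>>>> are just examples, not my actual layout:
>>>>>>>
>>>>>>>    bin/hadoop fs -put local_crawl crawl    # example paths: copy local data into DFS
>>>>>>>    bin/nutch mergesegs crawl/merged -dir crawl/segments    # merge all segments under crawl/segments
>>>>>>> )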
>>>>>>>
>>>>>>> Also, looking at the CPU stats, they sometimes go over 20%, but not
>>>>>>> by much and not often. The disk is very lightly taxed; the peak was
>>>>>>> about 20 MB/sec, and the drives and interface are rated at 3
>>>>>>> Gbit/sec, so no issue there.
>>>>>>>
>>>>>>> I tried to set the map tasks to 7 and the reduce tasks to 3, but
>>>>>>> after I restarted everything it is still only using 2 and 1. Any
>>>>>>> ideas? I made that change in the hadoop-site.xml file, BTW.
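>>>>>>>
>>>>>>> These are the properties that, as I read the 0.19 docs, should
>>>>>>> control this (the values here are just the ones I am aiming for):
>>>>>>>
>>>>>>>   <!-- concurrent tasks per TaskTracker; defaults are 2 maps, 2 reduces -->
>>>>>>>   <property>
>>>>>>>     <name>mapred.tasktracker.map.tasks.maximum</name>
>>>>>>>     <value>7</value>
>>>>>>>   </property>
>>>>>>>   <property>
>>>>>>>     <name>mapred.tasktracker.reduce.tasks.maximum</name>
>>>>>>>     <value>3</value>
>>>>>>>   </property>
>>>>>>>   <!-- default task counts per job; defaults are 2 maps, 1 reduce -->
>>>>>>>   <property>
>>>>>>>     <name>mapred.map.tasks</name>
>>>>>>>     <value>7</value>
>>>>>>>   </property>
>>>>>>>   <property>
>>>>>>>     <name>mapred.reduce.tasks</name>
>>>>>>>     <value>3</value>
>>>>>>>   </property>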
>>>>>>>
>>>>>>> -John
>>>>>>>
>>>>>>>
>>>>>>> On Jun 4, 2009, at 10:00 AM, Andrzej Bialecki wrote:
>>>>>>>
>>>>>>>> John Martyniak wrote:
>>>>>>>>
>>>>>>>>> Andrzej,
>>>>>>>>>
>>>>>>>>> I am a little embarrassed asking, but is there a setup guide for
>>>>>>>>> setting up Hadoop for Nutch 1.0, or is it the same process as
>>>>>>>>> setting up for Nutch 0.17 (which I think is what the existing
>>>>>>>>> guide out there covers)?
>>>>>>>>>
>>>>>>>> Basically, yes - but that guide is primarily about setting up a
>>>>>>>> Hadoop cluster using the Hadoop pieces distributed with Nutch, so
>>>>>>>> those instructions are already slightly outdated. It's best simply
>>>>>>>> to install a clean Hadoop 0.19.1 according to the instructions on
>>>>>>>> the Hadoop wiki, and then build the nutch*.job file separately.
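>>>>>>>>
>>>>>>>> Roughly like this (from memory, so treat the exact paths as
>>>>>>>> approximate):
>>>>>>>>
>>>>>>>>    cd nutch-1.0
>>>>>>>>    ant job     # builds build/nutch-1.0.job
>>>>>>>>    # then copy the .job file to a machine where Hadoop is installed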
>>>>>>>>
>>>>>>>>> Also, I have Hadoop already running for some other applications
>>>>>>>>> not associated with Nutch; can I use the same install? I think it
>>>>>>>>> is the same version that Nutch 1.0 uses. Or is it just easier to
>>>>>>>>> set it up using the Nutch config?
>>>>>>>>>
>>>>>>>> Yes, it's perfectly OK to use Nutch with an existing Hadoop cluster
>>>>>>>> of the same vintage (which is 0.19.1 for Nutch 1.0). In fact, I
>>>>>>>> would strongly recommend this, instead of the usual "dirty" way of
>>>>>>>> setting up Nutch by replicating the local build dir ;)
>>>>>>>>
>>>>>>>> Just specify the nutch*.job file like this:
>>>>>>>>
>>>>>>>>    bin/hadoop jar nutch*.job <className> <args ...>
>>>>>>>>
>>>>>>>> where className is one of the Nutch command-line tools and args are
>>>>>>>> its arguments. You can also modify the bin/nutch script slightly, so
>>>>>>>> that you don't have to specify fully-qualified class names.
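>>>>>>>>
>>>>>>>> For example, to run the injector (the paths here are only
>>>>>>>> placeholders):
>>>>>>>>
>>>>>>>>    # crawl/crawldb and urls are placeholder paths
>>>>>>>>    bin/hadoop jar nutch-1.0.job org.apache.nutch.crawl.Injector crawl/crawldb urls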
>>>>>>>>
>>>>>>>> --
>>>>>>>> Best regards,
>>>>>>>> Andrzej Bialecki     <><
>>>>>>>> ___. ___ ___ ___ _ _   __________________________________
>>>>>>>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>>>>>>>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>>>>>>>> http://www.sigram.com  Contact: info at sigram dot com
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> John Martyniak
>>>>>>> President
>>>>>>> Before Dawn Solutions, Inc.
>>>>>>> 9457 S. University Blvd #266
>>>>>>> Highlands Ranch, CO 80126
>>>>>>> o: 877-499-1562 x707
>>>>>>> f: 877-499-1562
>>>>>>> c: 303-522-1756
>>>>>>> e: [email protected]
>>>>>
>>>>>
>>>>>
>>>>
>>
>>
>
