Hi,

Have you tried setting *mapred.compress.map.output* to true? This should
reduce the amount of temp space required.
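
For example, a minimal sketch of the hadoop-site.xml entries (the codec
property is optional; the value shown is Hadoop's default codec):

<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec</value>
</property>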

Julien
-- 
DigitalPebble Ltd
http://www.digitalpebble.com

2009/6/15 czerwionka paul <[email protected]>

> Hi Justin,
>
> I am running Hadoop in distributed mode and having the same problem.
>
> Merging segments just eats up much more temp space than the segments
> take up combined.
>
> paul.
>
>
> On 14.06.2009, at 18:17, MilleBii wrote:
>
>> Same here: merging 3 segments of 100k, 100k, and 300k URLs consumed
>> 200 GB and filled the partition after 18 hours of processing.
>>
>> Something is strange with this segment merge.
>>
>> Config: dual-core PC, Vista, Hadoop on a single node.
>>
>> Can someone confirm whether installing Hadoop in distributed mode will
>> fix it? Is there a good config guide for distributed mode?
>>
>>
>> 2009/6/12 Justin Yao <[email protected]>
>>
>>> Hi John,
>>> I have no idea about that either.
>>> Justin
>>>
>>> On Fri, Jun 12, 2009 at 8:05 AM, John Martyniak <
>>> [email protected]> wrote:
>>>
>>>> Justin,
>>>>
>>>> Thanks for the response.
>>>>
>>>> I was having a similar issue: I was trying to merge the segments for
>>>> crawls during the month of May, probably around 13-15 GB, and after
>>>> everything had run it had used around 900 GB of tmp space. That
>>>> doesn't seem very efficient.
>>>>
>>>> I will try this out and see if it changes anything.
>>>>
>>>> Do you know if there is any risk in using the following:
>>>> <property>
>>>>  <name>mapred.min.split.size</name>
>>>>  <value>671088640</value>
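>>>>  <!-- 671088640 bytes = 640 MB; a larger minimum split size
>>>>       means fewer map tasks and fewer intermediate files -->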
>>>> </property>
>>>>
>>>> as suggested in the article?
>>>>
>>>> -John
>>>>
>>>> On Jun 11, 2009, at 7:25 PM, Justin Yao wrote:
>>>>
>>>>> Hi John,
>>>>>
>>>>> I had the same issue before but never found a solution.
>>>>> Here is a workaround mentioned by someone on this mailing list; you
>>>>> may give it a try:
>>>>>
>>>>> Seemingly abnormal temp space use by segment merger:
>>>>> http://www.nabble.com/Content%28source-code%29-of-web-pages-crawled-by-nutch-tt23495506.html#a23522569
>>>>>
>>>>> Regards,
>>>>> Justin
>>>>>
>>>>> On Sat, Jun 6, 2009 at 4:09 PM, John Martyniak <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Ok.
>>>>>>
>>>>>> So, an update on this item.
>>>>>>
>>>>>> I did start running Nutch with Hadoop; I am trying a single-node
>>>>>> config just to test it out.
>>>>>>
>>>>>> It took forever to get all of the files into the DFS (it was just
>>>>>> over 80 GB), but it is in there. So I started the SegmentMerge job,
>>>>>> and it is working flawlessly, though still a little slow.
>>>>>>
>>>>>> Also, looking at the CPU stats: they sometimes go over 20%, but not
>>>>>> by much and not often. The disk is very lightly taxed; the peak was
>>>>>> about 20 MB/sec, and the drives and interface are rated at 3 GB/sec,
>>>>>> so no issue there.
>>>>>>
>>>>>> I tried to set the map tasks to 7 and the reduce tasks to 3, but
>>>>>> when I restarted everything it is still only using 2 and 1. Any
>>>>>> ideas? I made that change in the hadoop-site.xml file, BTW.
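>>>>>>
>>>>>> (A sketch of the hadoop-site.xml entries involved, assuming Hadoop
>>>>>> 0.19 property names; note that mapred.map.tasks is only a hint to
>>>>>> the framework, while the tasktracker *.maximum values cap the task
>>>>>> slots per node and only take effect after a tasktracker restart:)
>>>>>>
>>>>>> <property>
>>>>>>   <name>mapred.map.tasks</name>
>>>>>>   <value>7</value>
>>>>>> </property>
>>>>>> <property>
>>>>>>   <name>mapred.reduce.tasks</name>
>>>>>>   <value>3</value>
>>>>>> </property>
>>>>>> <property>
>>>>>>   <name>mapred.tasktracker.map.tasks.maximum</name>
>>>>>>   <value>7</value>
>>>>>> </property>
>>>>>> <property>
>>>>>>   <name>mapred.tasktracker.reduce.tasks.maximum</name>
>>>>>>   <value>3</value>
>>>>>> </property>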
>>>>>>
>>>>>> -John
>>>>>>
>>>>>>
>>>>>> On Jun 4, 2009, at 10:00 AM, Andrzej Bialecki wrote:
>>>>>>
>>>>>>> John Martyniak wrote:
>>>>>>>
>>>>>>>> Andrzej,
>>>>>>>>
>>>>>>>> I am a little embarrassed asking, but is there a setup guide for
>>>>>>>> setting up Hadoop for Nutch 1.0, or is it the same process as
>>>>>>>> setting up for Nutch 0.17 (which I think is the existing guide
>>>>>>>> out there)?
>>>>>>>>
>>>>>>> Basically, yes - but that guide is primarily about setting up a
>>>>>>> Hadoop cluster using the Hadoop pieces distributed with Nutch, and
>>>>>>> as such those instructions are already slightly outdated. So it's
>>>>>>> best simply to install a clean Hadoop 0.19.1 according to the
>>>>>>> instructions on the Hadoop wiki, and then build the nutch*.job
>>>>>>> file separately.
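>>>>>>>
>>>>>>> (As a sketch, assuming a stock Nutch 1.0 source tree: running
>>>>>>>
>>>>>>>    ant job
>>>>>>>
>>>>>>> from the top-level directory should produce the nutch-1.0.job
>>>>>>> file under build/.)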
>>>>>>>
>>>>>>>> Also, I have Hadoop already running for some other applications
>>>>>>>> not associated with Nutch; can I use the same install? I think it
>>>>>>>> is the same version that Nutch 1.0 uses. Or is it just easier to
>>>>>>>> set it up using the Nutch config?
>>>>>>>>
>>>>>>> Yes, it's perfectly OK to use Nutch with an existing Hadoop
>>>>>>> cluster of the same vintage (which is 0.19.1 in Nutch 1.0). In
>>>>>>> fact, I would strongly recommend this way, instead of the usual
>>>>>>> "dirty" way of setting up Nutch by replicating the local build
>>>>>>> dir ;)
>>>>>>>
>>>>>>> Just specify the nutch*.job file like this:
>>>>>>>
>>>>>>>    bin/hadoop jar nutch*.job <className> <args ..>
>>>>>>>
>>>>>>> where className is one of the Nutch command-line tool classes and
>>>>>>> args are its arguments. You can also slightly modify the bin/nutch
>>>>>>> script so that you don't have to specify fully-qualified class
>>>>>>> names.
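>>>>>>>
>>>>>>> (For example, to run the segment merger this way - a sketch,
>>>>>>> assuming the job file is named nutch-1.0.job and that crawl/merged
>>>>>>> and crawl/segments are your own paths:
>>>>>>>
>>>>>>>    bin/hadoop jar nutch-1.0.job \
>>>>>>>        org.apache.nutch.segment.SegmentMerger \
>>>>>>>        crawl/merged -dir crawl/segments
>>>>>>>
>>>>>>> where crawl/merged is the output dir and -dir crawl/segments
>>>>>>> points at the directory containing the segments to merge.)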
>>>>>>>
>>>>>>> --
>>>>>>> Best regards,
>>>>>>> Andrzej Bialecki     <><
>>>>>>> ___. ___ ___ ___ _ _   __________________________________
>>>>>>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>>>>>>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>>>>>>> http://www.sigram.com  Contact: info at sigram dot com
>>>>>>>
>>>>
>>>> John Martyniak
>>>> President
>>>> Before Dawn Solutions, Inc.
>>>> 9457 S. University Blvd #266
>>>> Highlands Ranch, CO 80126
>>>> o: 877-499-1562 x707
>>>> f: 877-499-1562
>>>> c: 303-522-1756
>>>> e: [email protected]
>>>>
>>>>
>>>>
>>>
>
