Hi,

Have you tried setting mapred.compress.map.output to true? This should reduce the amount of temp space required.
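In conf/hadoop-site.xml that would be (a minimal sketch; mapred.compress.map.output is the 0.19-era property name, and if needed the codec can also be chosen via mapred.map.output.compression.codec):

  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>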
Julien

--
DigitalPebble Ltd
http://www.digitalpebble.com

2009/6/15 czerwionka paul <[email protected]>:

> Hi Justin,
>
> I am running Hadoop in distributed mode and having the same problem.
> Merging segments just eats up much more temp space than the segments
> would take combined.
>
> Paul.

On 14.06.2009, at 18:17, MilleBii wrote:

> Same here: merging 3 segments of 100K, 100K, and 300K URLs consumed
> 200 GB and filled the partition after 18 hours of processing.
> Something is strange with this segment merge.
>
> Conf: PC dual core, Vista, Hadoop on a single node.
>
> Can someone confirm whether installing Hadoop in distributed mode will
> fix it? Is there a good config guide for distributed mode?

2009/6/12 Justin Yao <[email protected]>:

> Hi John,
>
> I have no idea about that either.
>
> Justin

On Fri, Jun 12, 2009 at 8:05 AM, John Martyniak <[email protected]> wrote:

> Justin,
>
> Thanks for the response.
>
> I was having a similar issue: I was trying to merge the segments for
> the crawls during the month of May, probably around 13-15 GB, and after
> everything had run it had used around 900 GB of tmp space. Doesn't seem
> very efficient.
>
> I will try this out and see if it changes anything.
>
> Do you know if there is any risk in using the following (i.e. a 640 MB
> minimum split size), as suggested in the article?
>
>   <property>
>     <name>mapred.min.split.size</name>
>     <value>671088640</value>
>   </property>
>
> -John
>
> John Martyniak
> President
> Before Dawn Solutions, Inc.
> 9457 S. University Blvd #266
> Highlands Ranch, CO 80126
> o: 877-499-1562 x707
> f: 877-499-1562
> c: 303-522-1756
> e: [email protected]

On Jun 11, 2009, at 7:25 PM, Justin Yao wrote:

> Hi John,
>
> I had the same issue before but never found a solution. Here is a
> workaround mentioned by someone on this mailing list; you may give it
> a try:
>
> "Seemingly abnormal temp space use by segment merger"
> http://www.nabble.com/Content(source-code)-of-web-pages-crawled-by-nutch-tt23495506.html#a23522569
>
> Regards,
> Justin

On Sat, Jun 6, 2009 at 4:09 PM, John Martyniak <[email protected]> wrote:

> OK, so an update to this item.
>
> I did start running Nutch with Hadoop; I am trying a single-node config
> just to test it out.
>
> It took forever to get all of the files into the DFS (it was just over
> 80 GB), but it is in there. So I started the SegmentMerge job, and it
> is working flawlessly, though still a little slow.
>
> Also, looking at the stats, the CPUs sometimes go over 20%, but not by
> much and not often. The disk is very lightly taxed; the peak was about
> 20 MB/sec, and the drives and interface are rated at 3 Gb/sec, so no
> issue there.
>
> I tried to set the map jobs to 7 and the reduce jobs to 3, but after I
> restarted everything it is still only using 2 and 1. Any ideas? I made
> that change in the hadoop-site.xml file, BTW.
>
> -John
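The change John describes presumably looks something like this in conf/hadoop-site.xml (a sketch, not his actual file; the values are taken from his message). Note that on Hadoop 0.19, mapred.map.tasks and mapred.reduce.tasks are only hints to the framework; the per-tasktracker maximums, which default to 2, cap how many tasks actually run concurrently on a node, which would explain seeing only 2 maps, and a lone reduce suggests the mapred.reduce.tasks setting was not picked up (its default is 1):

  <property>
    <name>mapred.map.tasks</name>
    <value>7</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>3</value>
  </property>
  <!-- per-node concurrency caps; both default to 2 -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>7</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>3</value>
  </property>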
On Jun 4, 2009, at 10:00 AM, Andrzej Bialecki wrote:

> John Martyniak wrote:
>
>> Andrzej,
>>
>> I am a little embarrassed asking, but is there a setup guide for
>> setting up Hadoop for Nutch 1.0, or is it the same process as setting
>> up for Nutch 0.17 (which I think is the existing guide out there)?
>
> Basically, yes, but this guide is primarily about setting up a Hadoop
> cluster using the Hadoop pieces distributed with Nutch. As such, these
> instructions are already slightly outdated. So it's best simply to
> install a clean Hadoop 0.19.1 according to the instructions on the
> Hadoop wiki, and then build the nutch*.job file separately.
>
>> Also, I have Hadoop already running for some other applications not
>> associated with Nutch. Can I use the same install? I think it is the
>> same version that Nutch 1.0 uses. Or is it just easier to set it up
>> using the Nutch config?
>
> Yes, it's perfectly OK to use Nutch with an existing Hadoop cluster of
> the same vintage (which is 0.19.1 in Nutch 1.0). In fact, I would
> strongly recommend this way instead of the usual "dirty" way of setting
> up Nutch by replicating the local build dir ;)
>
> Just specify the nutch*.job file like this:
>
>   bin/hadoop jar nutch*.job <className> <args ..>
>
> where className and args name one of the Nutch command-line tools. You
> can also slightly modify the bin/nutch script so that you don't have to
> specify fully-qualified class names.
>
> --
> Best regards,
> Andrzej Bialecki <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
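For instance, merging segments with the job file on an existing Hadoop install might look like this (a sketch only: the nutch-1.0.job file name and the crawl/ paths are hypothetical, and the argument order shown, output dir first and then -dir pointing at the segments, is SegmentMerger's usage in Nutch 1.0, if memory serves):

  bin/hadoop jar nutch-1.0.job org.apache.nutch.segment.SegmentMerger \
    crawl/merged_segments -dir crawl/segments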
