The same goes for merging 3 segments of 100k, 100k, and 300k URLs: it consumed 200 GB and filled the partition after 18 hours of processing.
Something strange is going on with this segment merge. Config: dual-core PC, Vista, Hadoop on a single node. Can someone confirm whether installing Hadoop in distributed mode will fix it? Is there a good config guide for distributed mode?

2009/6/12 Justin Yao <[email protected]>

> Hi John,
>
> I have no idea about that either.
>
> Justin
>
> On Fri, Jun 12, 2009 at 8:05 AM, John Martyniak <[email protected]> wrote:
>
> > Justin,
> >
> > Thanks for the response.
> >
> > I was having a similar issue: I was trying to merge the segments for
> > crawls during the month of May, probably around 13-15 GB, and after
> > everything was running it had used tmp space of around 900 GB. Doesn't
> > seem very efficient.
> >
> > I will try this out and see if it changes anything.
> >
> > Do you know if there is any risk in using the following:
> >
> > <property>
> >   <name>mapred.min.split.size</name>
> >   <value>671088640</value>
> > </property>
> >
> > as suggested in the article?
> >
> > -John
> >
> > On Jun 11, 2009, at 7:25 PM, Justin Yao wrote:
> >
> >> Hi John,
> >>
> >> I had the same issue before but never found a solution.
> >> Here is a workaround mentioned by someone on this mailing list; you may
> >> want to have a try:
> >>
> >> Seemingly abnormal temp space use by segment merger
> >> http://www.nabble.com/Content(source-code)-of-web-pages-crawled-by-nutch-tt23495506.html#a23522569
> >>
> >> Regards,
> >> Justin
> >>
> >> On Sat, Jun 6, 2009 at 4:09 PM, John Martyniak <[email protected]> wrote:
> >>
> >>> Ok.
> >>>
> >>> So an update to this item.
> >>>
> >>> I did start running Nutch with Hadoop; I am trying a single-node config
> >>> just to test it out.
> >>>
> >>> It took forever to get all of the files into the DFS (it was just over
> >>> 80 GB), but it is in there. So I started the SegmentMerge job, and it
> >>> is working flawlessly, still a little slow though.
> >>>
> >>> Also, looking at the stats for the CPU: it sometimes goes over 20%, but
> >>> not by much and not often. The disk is very lightly taxed; peak was
> >>> about 20 MB/sec, and the drives and interface are rated at 3 GB/sec,
> >>> so no issue there.
> >>>
> >>> I tried to set the map jobs to 7 and the reduce jobs to 3, but when I
> >>> restarted everything it is still only using 2 and 1. Any ideas? I made
> >>> that change in the hadoop-site.xml file, BTW.
> >>>
> >>> -John
> >>>
> >>> On Jun 4, 2009, at 10:00 AM, Andrzej Bialecki wrote:
> >>>
> >>>> John Martyniak wrote:
> >>>>
> >>>>> Andrzej,
> >>>>>
> >>>>> I am a little embarrassed asking, but is there a setup guide for
> >>>>> setting up Hadoop for Nutch 1.0, or is it the same process as setting
> >>>>> up for Nutch 0.17 (which I think is the existing guide out there)?
> >>>>
> >>>> Basically, yes - but this guide is primarily about the setup of a
> >>>> Hadoop cluster using the Hadoop pieces distributed with Nutch. As
> >>>> such, these instructions are already slightly outdated. So it's best
> >>>> simply to install a clean Hadoop 0.19.1 according to the instructions
> >>>> on the Hadoop wiki, and then build the nutch*.job file separately.
> >>>>
> >>>>> Also, I have Hadoop already running for some other applications, not
> >>>>> associated with Nutch. Can I use the same install? I think it is the
> >>>>> same version that Nutch 1.0 uses. Or is it just easier to set it up
> >>>>> using the Nutch config?
> >>>>
> >>>> Yes, it's perfectly OK to use Nutch with an existing Hadoop cluster
> >>>> of the same vintage (which is 0.19.1 in Nutch 1.0).
> >>>> In fact, I would strongly recommend this way, instead of the usual
> >>>> "dirty" way of setting up Nutch by replicating the local build dir ;)
> >>>>
> >>>> Just specify the nutch*.job file like this:
> >>>>
> >>>>   bin/hadoop jar nutch*.job <className> <args ..>
> >>>>
> >>>> where className and args are one of the Nutch command-line tools. You
> >>>> can also modify the bin/nutch script slightly, so that you don't have
> >>>> to specify fully-qualified class names.
> >>>>
> >>>> --
> >>>> Best regards,
> >>>> Andrzej Bialecki <><
> >>>> ___. ___ ___ ___ _ _ __________________________________
> >>>> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> >>>> ___|||__|| \| || | Embedded Unix, System Integration
> >>>> http://www.sigram.com Contact: info at sigram dot com
>
> > John Martyniak
> > President
> > Before Dawn Solutions, Inc.
> > 9457 S. University Blvd #266
> > Highlands Ranch, CO 80126
> > o: 877-499-1562 x707
> > f: 877-499-1562
> > c: 303-522-1756
> > e: [email protected]
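For the merge job discussed in this thread, the `bin/hadoop jar nutch*.job <className> <args ..>` pattern above might be invoked like this. This is only a sketch: `org.apache.nutch.segment.SegmentMerger` is the Nutch 1.0 merge tool class, but the job filename and the segment/output paths are assumptions about a particular setup, not something confirmed in the thread.

```shell
# Sketch: running the segment merge through an existing Hadoop install,
# per Andrzej's suggestion. The job filename (nutch-1.0.job) and the
# paths (crawl/merged, crawl/segments) are illustrative assumptions.
bin/hadoop jar nutch-1.0.job org.apache.nutch.segment.SegmentMerger \
  crawl/merged -dir crawl/segments
```

With an unmodified bin/nutch script, the same tool is normally reachable as `bin/nutch mergesegs`, which is what Andrzej's remark about avoiding fully-qualified class names refers to.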

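Tying together the two tuning questions raised in the thread: in Hadoop 0.19.x, `mapred.map.tasks` and `mapred.reduce.tasks` are only per-job hints, which would explain why setting 7/3 still yielded 2 map and 1 reduce slots; the per-node slot counts come from the tasktracker maxima. A sketch of a hadoop-site.xml fragment combining those slot settings with the split-size workaround quoted above (the property names are standard 0.19.x ones, but the values and the file path here are illustrative, not a tested recommendation):

```shell
# 671088640 bytes is exactly 640 MB; a larger minimum split size means
# fewer map tasks, and so fewer intermediate files during the merge.
echo $((640 * 1024 * 1024))   # 671088640

# Illustrative fragment; in practice these go into conf/hadoop-site.xml.
cat > /tmp/hadoop-site-fragment.xml <<'EOF'
<!-- Workaround from the thread: raise the minimum split size -->
<property>
  <name>mapred.min.split.size</name>
  <value>671088640</value>
</property>
<!-- Per-node task slots; mapred.map.tasks / mapred.reduce.tasks are
     only per-job hints, so set the tasktracker maxima as well -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>7</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>3</value>
</property>
EOF
grep -c '<name>' /tmp/hadoop-site-fragment.xml   # 3
```

The tasktracker maxima take effect only after the tasktracker daemons are restarted, which matches John's observation that restarting alone did not help when only the per-job hints were set.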