Hi John,

I have no idea about that either.

Justin

On Fri, Jun 12, 2009 at 8:05 AM, John Martyniak <[email protected]> wrote:
> Justin,
>
> Thanks for the response.
>
> I was having a similar issue. I was trying to merge the segments for
> the crawls during the month of May, probably around 13-15 GB, and after
> everything had run it had used around 900 GB of tmp space, which
> doesn't seem very efficient.
>
> I will try this out and see if it changes anything.
>
> Do you know if there is any risk in using the following:
>
> <property>
>   <name>mapred.min.split.size</name>
>   <value>671088640</value>
> </property>
>
> as suggested in the article?
>
> -John
>
> On Jun 11, 2009, at 7:25 PM, Justin Yao wrote:
>
>> Hi John,
>>
>> I had the same issue before but never found a solution.
>> Here is a workaround mentioned by someone on this mailing list; you
>> may give it a try:
>>
>> Seemingly abnormal temp space use by segment merger
>> http://www.nabble.com/Content(source-code)-of-web-pages-crawled-by-nutch-tt23495506.html#a23522569
>>
>> Regards,
>> Justin
>>
>> On Sat, Jun 6, 2009 at 4:09 PM, John Martyniak <[email protected]> wrote:
>>
>>> Ok, so an update on this item.
>>>
>>> I did start running Nutch with Hadoop; I am trying a single-node
>>> config just to test it out.
>>>
>>> It took forever to get all of the files into the DFS (it was just
>>> over 80 GB), but it is in there. So I started the SegmentMerge job,
>>> and it is working flawlessly, still a little slow though.
>>>
>>> Also, looking at the stats for the CPU, it sometimes goes over 20%,
>>> but not by much and not often. The disk is very lightly taxed; the
>>> peak was about 20 MB/sec, and the drives and interface are rated at
>>> 3 Gb/sec, so no issue there.
>>>
>>> I tried to set the map jobs to 7 and the reduce jobs to 3, but when
>>> I restarted everything it is still only using 2 and 1. Any ideas? I
>>> made that change in the hadoop-site.xml file, BTW.
>>>
>>> -John
>>>
>>> On Jun 4, 2009, at 10:00 AM, Andrzej Bialecki wrote:
>>>
>>>> John Martyniak wrote:
>>>>
>>>>> Andrzej,
>>>>>
>>>>> I am a little embarrassed asking, but is there a setup guide for
>>>>> setting up Hadoop for Nutch 1.0, or is it the same process as
>>>>> setting up for Nutch 0.17 (which I think is the existing guide
>>>>> out there)?
>>>>
>>>> Basically, yes - but this guide is primarily about the setup of a
>>>> Hadoop cluster using the Hadoop pieces distributed with Nutch. As
>>>> such, these instructions are already slightly outdated. So it's
>>>> best simply to install a clean Hadoop 0.19.1 according to the
>>>> instructions on the Hadoop wiki, and then build the nutch*.job
>>>> file separately.
>>>>
>>>>> Also, I have Hadoop already running for some other applications
>>>>> not associated with Nutch; can I use the same install? I think
>>>>> that it is the same version that Nutch 1.0 uses. Or is it just
>>>>> easier to set it up using the Nutch config?
>>>>
>>>> Yes, it's perfectly OK to use Nutch with an existing Hadoop
>>>> cluster of the same vintage (which is 0.19.1 in Nutch 1.0). In
>>>> fact, I would strongly recommend this way, instead of the usual
>>>> "dirty" way of setting up Nutch by replicating the local build
>>>> dir ;)
>>>>
>>>> Just specify the nutch*.job file like this:
>>>>
>>>>   bin/hadoop jar nutch*.job <className> <args ..>
>>>>
>>>> where className is one of the Nutch command-line tools and args
>>>> are its arguments. You can also modify the bin/nutch script
>>>> slightly, so that you don't have to specify fully-qualified class
>>>> names.
>>>>
>>>> --
>>>> Best regards,
>>>> Andrzej Bialecki <><
>>>>  ___. ___ ___ ___ _ _   __________________________________
>>>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>>>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>>>> http://www.sigram.com  Contact: info at sigram dot com
>
> John Martyniak
> President
> Before Dawn Solutions, Inc.
> 9457 S. University Blvd #266
> Highlands Ranch, CO 80126
> o: 877-499-1562 x707
> f: 877-499-1562
> c: 303-522-1756
> e: [email protected]
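
For reference, the settings discussed in this thread all live in
conf/hadoop-site.xml. Below is a minimal sketch, assuming Hadoop 0.19.x
(the version bundled with Nutch 1.0). The 671088640 value is the 640 MB
split size from the linked workaround; the other values mirror John's
7-map/3-reduce attempt and are purely illustrative. Note that
mapred.map.tasks and mapred.reduce.tasks are only per-job hints; the
per-node concurrency is capped by the tasktracker *.tasks.maximum
settings, which default to 2 in 0.19 - the usual reason a job appears
stuck at 2 concurrent maps regardless of the hint. The tasktracker must
be restarted for the maximum-slot changes to take effect.

  <?xml version="1.0"?>
  <configuration>
    <!-- Larger minimum split size => fewer, bigger map tasks during
         the merge. 671088640 bytes = 640 MB, as suggested in the
         linked workaround. -->
    <property>
      <name>mapred.min.split.size</name>
      <value>671088640</value>
    </property>

    <!-- Per-job hints for the number of map and reduce tasks
         (illustrative values, matching John's attempt). -->
    <property>
      <name>mapred.map.tasks</name>
      <value>7</value>
    </property>
    <property>
      <name>mapred.reduce.tasks</name>
      <value>3</value>
    </property>

    <!-- Per-node concurrency caps; without raising these, a
         tasktracker runs at most 2 maps and 2 reduces at a time
         (the 0.19 defaults). Restart the tasktracker after changing
         them. -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>7</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>3</value>
    </property>
  </configuration>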

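Andrzej's bin/hadoop jar invocation, applied to the segment merging
discussed above, would look roughly like the sketch below. The job file
name and the crawl paths are assumptions - adjust them to your layout.
In Nutch 1.0 the merger tool is org.apache.nutch.segment.SegmentMerger,
which takes an output segment directory followed by the input segments
(or -dir with a directory of segments).

  # Run a Nutch tool on an existing Hadoop cluster via the job file.
  # General form: bin/hadoop jar nutch*.job <className> <args ..>

  # Merge all segments under crawl/segments into one output segment
  # (paths are illustrative):
  bin/hadoop jar nutch-1.0.job org.apache.nutch.segment.SegmentMerger \
      crawl/merged_segment -dir crawl/segments

On a stock checkout the bin/nutch script already aliases this tool as
"mergesegs", which is the kind of shortcut Andrzej mentions adding for
other classes.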