The same happened when merging 3 segments of 100k, 100k, and 300k
URLs: it consumed 200 GB and filled the partition after 18 hours of
processing.

Something is strange with this segment merge.

Config: dual-core PC, Vista, Hadoop on a single node.

Can someone confirm whether installing Hadoop in distributed mode will
fix it? Is there a good configuration guide for distributed mode?
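My understanding is that a distributed (or even pseudo-distributed)
setup needs at least something like the following in hadoop-site.xml;
the host name and ports below are placeholders:

    <configuration>
      <property>
        <!-- HDFS namenode address; "master" is a placeholder host -->
        <name>fs.default.name</name>
        <value>hdfs://master:9000</value>
      </property>
      <property>
        <!-- JobTracker address -->
        <name>mapred.job.tracker</name>
        <value>master:9001</value>
      </property>
      <property>
        <!-- one copy of each block is enough while testing -->
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>

Is that roughly right?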


2009/6/12 Justin Yao <[email protected]>

> Hi John,
> I have no idea about that either.
> Justin
>
> On Fri, Jun 12, 2009 at 8:05 AM, John Martyniak <
> [email protected]> wrote:
>
> > Justin,
> >
> > Thanks for the response.
> >
> > I was having a similar issue: I was trying to merge the segments for
> > crawls during the month of May, probably around 13-15 GB in total,
> > and after everything had run it had used around 900 GB of tmp space,
> > which doesn't seem very efficient.
> >
> > I will try this out and see if it changes anything.
> >
> > Do you know if there is any risk in using the following:
> > <property>
> >   <name>mapred.min.split.size</name>
> >   <value>671088640</value>
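> >   <!-- 671088640 bytes = 640 MB: larger splits, so fewer map tasks -->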
> > </property>
> >
> > as suggested in the article?
> >
> > -John
> >
> > On Jun 11, 2009, at 7:25 PM, Justin Yao wrote:
> >
> >> Hi John,
> >>
> >> I had the same issue before but never found a solution.
> >> Here is a workaround mentioned by someone on this mailing list that
> >> you could try:
> >>
> >> Seemingly abnormal temp space use by segment merger
> >>
> >> http://www.nabble.com/Content(source-code)-of-web-pages-crawled-by-nutch-tt23495506.html#a23522569
> >>
> >> Regards,
> >> Justin
> >>
> >> On Sat, Jun 6, 2009 at 4:09 PM, John Martyniak <[email protected]> wrote:
> >>
> >>> OK.
> >>>
> >>> So, an update on this item.
> >>>
> >>> I did start running Nutch with Hadoop; I am trying a single-node
> >>> config just to test it out.
> >>>
> >>> It took forever to get all of the files into the DFS (just over
> >>> 80 GB), but they are in there. So I started the SegmentMerge job,
> >>> and it is working flawlessly, though still a little slow.
> >>>
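> >>> Roughly, the commands for this looked like the following (paths are
> >>> just examples):
> >>>
> >>>     # copy the local segments into DFS
> >>>     bin/hadoop dfs -put segments segments
> >>>
> >>>     # merge all segments under that directory into one
> >>>     bin/nutch mergesegs segments_merged -dir segments
> >>>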
> >>> Also, looking at the CPU stats, they sometimes go over 20%, but not
> >>> by much and not often. The disk is very lightly taxed; the peak was
> >>> about 20 MB/sec, and the drives and interface are rated at 3 Gb/sec,
> >>> so no issue there.
> >>>
> >>> I tried to set the map tasks to 7 and the reduce tasks to 3, but
> >>> after I restarted everything it is still only using 2 and 1. Any
> >>> ideas? I made that change in the hadoop-site.xml file, BTW.
> >>>
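> >>> For reference, the change was along these lines (a sketch; property
> >>> names from memory):
> >>>
> >>>     <!-- these values are only hints to the JobTracker -->
> >>>     <property>
> >>>       <name>mapred.map.tasks</name>
> >>>       <value>7</value>
> >>>     </property>
> >>>     <property>
> >>>       <name>mapred.reduce.tasks</name>
> >>>       <value>3</value>
> >>>     </property>
> >>>
> >>> Could it be the per-node slot limits (mapred.tasktracker.map.tasks.maximum
> >>> and mapred.tasktracker.reduce.tasks.maximum, which I believe default
> >>> to 2) that cap this on a single node?
> >>>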
> >>> -John
> >>>
> >>>
> >>> On Jun 4, 2009, at 10:00 AM, Andrzej Bialecki wrote:
> >>>
> >>> John Martyniak wrote:
> >>>
> >>>>> Andrzej,
> >>>>>
> >>>>> I am a little embarrassed asking, but is there a setup guide for
> >>>>> setting up Hadoop for Nutch 1.0, or is it the same process as
> >>>>> setting up for Nutch 0.17 (which I think is what the existing
> >>>>> guide covers)?
> >>>>>
> >>>> Basically, yes - but that guide is primarily about setting up a
> >>>> Hadoop cluster using the Hadoop pieces distributed with Nutch, so
> >>>> those instructions are already slightly outdated. It's best to
> >>>> simply install a clean Hadoop 0.19.1 according to the instructions
> >>>> on the Hadoop wiki, and then build the nutch*.job file separately.
> >>>>
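> >>>> (If I remember correctly, building the job file is just a matter
> >>>> of running the "job" target in the Nutch source tree:
> >>>>
> >>>>      ant job
> >>>>
> >>>> which leaves a nutch-1.0.job file under build/.)
> >>>>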
> >>>>> Also, I have Hadoop already running for some other applications
> >>>>> not associated with Nutch; can I use the same install? I think it
> >>>>> is the same version that Nutch 1.0 uses. Or is it just easier to
> >>>>> set it up using the Nutch config?
> >>>>>
> >>>>>
> >>>> Yes, it's perfectly OK to use Nutch with an existing Hadoop cluster
> >>>> of the same vintage (which is 0.19.1 for Nutch 1.0). In fact, I
> >>>> would strongly recommend this over the usual "dirty" way of setting
> >>>> up Nutch by replicating the local build dir ;)
> >>>>
> >>>> Just specify the nutch*.job file like this:
> >>>>
> >>>>      bin/hadoop jar nutch*.job <className> <args ..>
> >>>>
> >>>> where className names one of the Nutch command-line tools and args
> >>>> are its arguments. You can also modify the bin/nutch script
> >>>> slightly, so that you don't have to specify fully-qualified class
> >>>> names.
> >>>>
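> >>>> For example, to run the injector (paths are placeholders):
> >>>>
> >>>>      bin/hadoop jar nutch-1.0.job org.apache.nutch.crawl.Injector crawl/crawldb urls
> >>>>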
> >>>> --
> >>>> Best regards,
> >>>> Andrzej Bialecki     <><
> >>>> ___. ___ ___ ___ _ _   __________________________________
> >>>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> >>>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> >>>> http://www.sigram.com  Contact: info at sigram dot com
> >>>>
> >>>>
> >>>>
> >>>
> > John Martyniak
> > President
> > Before Dawn Solutions, Inc.
> > 9457 S. University Blvd #266
> > Highlands Ranch, CO 80126
> > o: 877-499-1562 x707
> > f: 877-499-1562
> > c: 303-522-1756
> > e: [email protected]
> >
> >
>
