Hi,

> Presumably in hadoop-site.xml as a property/value ?

Indeed.

J.

> On the other hand, I'm asking myself why merging segments... I don't
> fully understand the benefits; perhaps someone can shed some light.
>
> 2009/6/15 Julien Nioche <[email protected]>
>
>> Hi,
>>
>> Have you tried setting *mapred.compress.map.output* to true? This
>> should reduce the amount of temp space required.
>>
>> Julien
>> --
>> DigitalPebble Ltd
>> http://www.digitalpebble.com
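To spell that out in hadoop-site.xml terms, the snippet below is what I
have in mind. It is a minimal sketch, untested here: the property names
are the Hadoop 0.19.x ones, and the codec entry is optional (the value
shown is just the stock default).

  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
    <description>Compress intermediate map output, which shrinks the
    spill files written to local temp space during the merge.</description>
  </property>
  <property>
    <!-- Optional: which codec to compress the map output with. -->
    <name>mapred.map.output.compression.codec</name>
    <value>org.apache.hadoop.io.compress.DefaultCodec</value>
  </property>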
>> 2009/6/15 czerwionka paul <[email protected]>
>>
>>> Hi Justin,
>>>
>>> I am running Hadoop in distributed mode and having the same problem.
>>>
>>> Merging segments just eats up much more temp space than the segments
>>> would have taken combined.
>>>
>>> paul.
>>>
>>> On 14.06.2009, at 18:17, MilleBii wrote:
>>>
>>>> Same here: merging 3 segments of 100K, 100K, and 300K URLs resulted
>>>> in consuming 200 GB, and the partition was full after 18 hours of
>>>> processing.
>>>>
>>>> Something strange with this segment merge.
>>>>
>>>> Config: dual-core PC, Vista, Hadoop on a single node.
>>>>
>>>> Can someone confirm whether installing Hadoop in distributed mode
>>>> will fix it? Is there a good config guide for the distributed mode?
>>>>
>>>> 2009/6/12 Justin Yao <[email protected]>
>>>>
>>>>> Hi John,
>>>>>
>>>>> I have no idea about that either.
>>>>>
>>>>> Justin
>>>>>
>>>>> On Fri, Jun 12, 2009 at 8:05 AM, John Martyniak
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Justin,
>>>>>>
>>>>>> Thanks for the response.
>>>>>>
>>>>>> I was having a similar issue: I was trying to merge the segments
>>>>>> for the crawls from the month of May, probably around 13-15 GB,
>>>>>> and after everything had run it had used around 900 GB of tmp
>>>>>> space. That doesn't seem very efficient.
>>>>>>
>>>>>> I will try this out and see if it changes anything.
>>>>>>
>>>>>> Do you know if there is any risk in using the following, as
>>>>>> suggested in the article?
>>>>>>
>>>>>>   <property>
>>>>>>     <name>mapred.min.split.size</name>
>>>>>>     <value>671088640</value>
>>>>>>   </property>
>>>>>>
>>>>>> -John
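For what it's worth, that value is just 640 MB expressed in bytes, so the
main effect is fewer, larger map tasks and therefore fewer intermediate
spill files; the risk is mostly a loss of parallelism, not correctness.
Rough numbers for an input of John's size, assuming the default 64 MB
block-sized splits:

  671088640 bytes / (1024 * 1024) = 640 MB per split
  ~14 GB of input / 640 MB  =>  ~23 map tasks
  ~14 GB of input /  64 MB  => ~224 map tasks (the default behaviour)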
>>>>>> On Jun 11, 2009, at 7:25 PM, Justin Yao wrote:
>>>>>>
>>>>>>> Hi John,
>>>>>>>
>>>>>>> I had the same issue before but never found a solution. Here is a
>>>>>>> workaround mentioned by someone on this mailing list; you may give
>>>>>>> it a try:
>>>>>>>
>>>>>>> Seemingly abnormal temp space use by segment merger
>>>>>>> http://www.nabble.com/Content%28source-code%29-of-web-pages-crawled-by-nutch-tt23495506.html#a23522569
>>>>>>>
>>>>>>> Regards,
>>>>>>> Justin
>>>>>>>
>>>>>>> On Sat, Jun 6, 2009 at 4:09 PM, John Martyniak
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>> Ok, so an update to this item.
>>>>>>>>
>>>>>>>> I did start running Nutch with Hadoop; I am trying a single-node
>>>>>>>> config just to test it out.
>>>>>>>>
>>>>>>>> It took forever to get all of the files into the DFS (it was just
>>>>>>>> over 80 GB), but it is in there. So I started the SegmentMerger
>>>>>>>> job, and it is working flawlessly, still a little slow though.
>>>>>>>>
>>>>>>>> Also, looking at the stats: the CPUs sometimes go over 20%, but
>>>>>>>> not by much and not often, and the disk is very lightly taxed;
>>>>>>>> the peak was about 20 MB/sec and the drives and interface are
>>>>>>>> rated at 3 Gb/sec, so no issue there.
>>>>>>>>
>>>>>>>> I tried to set the map jobs to 7 and the reduce jobs to 3, but
>>>>>>>> when I restarted everything it is still only using 2 and 1. Any
>>>>>>>> ideas? I made that change in the hadoop-site.xml file, BTW.
>>>>>>>>
>>>>>>>> -John
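On the 2-maps/1-reduce question: those happen to be the stock defaults
for mapred.map.tasks and mapred.reduce.tasks, which makes me suspect the
hadoop-site.xml being edited is not the one on the daemons' classpath.
Note also that mapred.map.tasks is only a hint (the actual number of
maps is driven by the input splits), and that what a node runs
concurrently is capped by the tasktracker slot settings, which are read
at daemon startup. A sketch of what I would try, with 0.19-era property
names that are worth double-checking:

  <property>
    <!-- up to 7 map tasks running at once on this node -->
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>7</value>
  </property>
  <property>
    <!-- up to 3 reduce tasks running at once on this node -->
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>3</value>
  </property>
  <property>
    <!-- default number of reduces per job, unless the job overrides it -->
    <name>mapred.reduce.tasks</name>
    <value>3</value>
  </property>

Then restart the tasktracker so the new slot maxima are picked up.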
>>>>>>>> On Jun 4, 2009, at 10:00 AM, Andrzej Bialecki wrote:
>>>>>>>>
>>>>>>>>> John Martyniak wrote:
>>>>>>>>>
>>>>>>>>>> Andrzej,
>>>>>>>>>>
>>>>>>>>>> I am a little embarrassed asking, but is there a setup guide
>>>>>>>>>> for setting up Hadoop for Nutch 1.0, or is it the same process
>>>>>>>>>> as setting up for Nutch 0.17 (which I think is the existing
>>>>>>>>>> guide out there)?
>>>>>>>>>
>>>>>>>>> Basically, yes - but that guide is primarily about setting up a
>>>>>>>>> Hadoop cluster using the Hadoop pieces distributed with Nutch,
>>>>>>>>> so its instructions are already slightly outdated. It's best to
>>>>>>>>> simply install a clean Hadoop 0.19.1 according to the
>>>>>>>>> instructions on the Hadoop wiki, and then build the nutch*.job
>>>>>>>>> file separately.
>>>>>>>>>
>>>>>>>>>> Also, I have Hadoop already running for some other
>>>>>>>>>> applications, not associated with Nutch; can I use the same
>>>>>>>>>> install? I think it is the same version that Nutch 1.0 uses.
>>>>>>>>>> Or is it just easier to set it up using the Nutch config?
>>>>>>>>>
>>>>>>>>> Yes, it's perfectly ok to use Nutch with an existing Hadoop
>>>>>>>>> cluster of the same vintage (which is 0.19.1 in Nutch 1.0). In
>>>>>>>>> fact, I would strongly recommend this, instead of the usual
>>>>>>>>> "dirty" way of setting up Nutch by replicating the local build
>>>>>>>>> dir ;)
>>>>>>>>>
>>>>>>>>> Just specify the nutch*.job file like this:
>>>>>>>>>
>>>>>>>>>   bin/hadoop jar nutch*.job <className> <args ..>
>>>>>>>>>
>>>>>>>>> where className is one of the Nutch command-line tools and args
>>>>>>>>> are its arguments. You can also modify the bin/nutch script
>>>>>>>>> slightly so that you don't have to specify fully-qualified
>>>>>>>>> class names.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Best regards,
>>>>>>>>> Andrzej Bialecki <><
>>>>>>>>> http://www.sigram.com   Contact: info at sigram dot com
>>>>>>
>>>>>> John Martyniak
>>>>>> President
>>>>>> Before Dawn Solutions, Inc.
>>>>>> 9457 S. University Blvd #266
>>>>>> Highlands Ranch, CO 80126
>>>>>> o: 877-499-1562 x707
>>>>>> f: 877-499-1562
>>>>>> c: 303-522-1756
>>>>>> e: [email protected]
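To tie Andrzej's invocation pattern back to the merge that started this
thread: with a standalone Hadoop cluster, the segment merge would be run
along these lines. The paths here are made up for the example; the class
name is the one shipped in Nutch 1.0.

  bin/hadoop jar nutch-1.0.job org.apache.nutch.segment.SegmentMerger \
      crawl/merged_segment -dir crawl/segments

From a plain Nutch checkout, bin/nutch mergesegs should wrap the same
class, if I remember correctly.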
--
DigitalPebble Ltd
http://www.digitalpebble.com