Can someone point me to a guide for creating a Hadoop DFS on a single machine? I'd like to test it out to see how much it speeds up merging. I'm using a ZFS filesystem and have no I/O waits. It seems that once the index reaches the 4GB range, the merge time goes up drastically. My ZFS filesystem is on a SAN and is not the bottleneck.
Thanks in advance!

--- On Mon, 6/15/09, Julien Nioche <[email protected]> wrote:

> From: Julien Nioche <[email protected]>
> Subject: Re: Merge taking forever
> To: [email protected]
> Date: Monday, June 15, 2009, 12:58 PM
>
> Hi,
>
> > Presumably in hadoop-site.xml as a property/value ?
>
> Indeed.
>
> J.
>
> > On the other hand, I'm asking myself why merging segments... I don't
> > fully understand the benefits, if someone can shed some light.
> >
> > 2009/6/15 Julien Nioche <[email protected]>
> >
> > > Hi,
> > >
> > > Have you tried setting *mapred.compress.map.output* to true? This
> > > should reduce the amount of temp space required.
> > >
> > > Julien
> > > --
> > > DigitalPebble Ltd
> > > http://www.digitalpebble.com
> > >
> > > 2009/6/15 czerwionka paul <[email protected]>
> > >
> > > > hi justin,
> > > >
> > > > i am running hadoop in distributed mode and having the same
> > > > problem.
> > > >
> > > > merging segments just eats up much more temp space than the
> > > > segments would have combined.
> > > >
> > > > paul.
> > > >
> > > > On 14.06.2009, at 18:17, MilleBii wrote:
> > > >
> > > >> Same for merging 3 segments of 100k, 100k, 300k URLs: it
> > > >> resulted in consuming 200GB, and the partition was full after 18
> > > >> hours of processing.
> > > >>
> > > >> Something is strange with this segment merge.
> > > >>
> > > >> Conf: PC Dual Core, Vista, Hadoop on a single node.
> > > >>
> > > >> Can someone confirm if installing Hadoop in distributed mode
> > > >> will fix it? Is there a good config guide for the distributed
> > > >> mode?
> > > >>
> > > >> 2009/6/12 Justin Yao <[email protected]>
> > > >>
> > > >>> Hi John,
> > > >>> I have no idea about that either.
> > > >>> Justin
> > > >>>
> > > >>> On Fri, Jun 12, 2009 at 8:05 AM, John Martyniak
> > > >>> <[email protected]> wrote:
> > > >>>
> > > >>>> Justin,
> > > >>>>
> > > >>>> Thanks for the response.
> > > >>>> I was having a similar issue: I was trying to merge the
> > > >>>> segments for crawls during the month of May, probably around
> > > >>>> 13-15GB, and after everything had run it had used around 900GB
> > > >>>> of temp space. Doesn't seem very efficient.
> > > >>>>
> > > >>>> I will try this out and see if it changes anything.
> > > >>>>
> > > >>>> Do you know if there is any risk in using the following:
> > > >>>>
> > > >>>> <property>
> > > >>>>   <name>mapred.min.split.size</name>
> > > >>>>   <value>671088640</value>
> > > >>>> </property>
> > > >>>>
> > > >>>> as suggested in the article?
> > > >>>>
> > > >>>> -John
> > > >>>>
> > > >>>> On Jun 11, 2009, at 7:25 PM, Justin Yao wrote:
> > > >>>>
> > > >>>>> Hi John,
> > > >>>>>
> > > >>>>> I had the same issue before but never found a solution.
> > > >>>>> Here is a workaround mentioned by someone in this mailing
> > > >>>>> list; you may have a try:
> > > >>>>>
> > > >>>>> Seemingly abnormal temp space use by segment merger
> > > >>>>> http://www.nabble.com/Content%28source-code%29-of-web-pages-crawled-by-nutch-tt23495506.html#a23522569
> > > >>>>>
> > > >>>>> Regards,
> > > >>>>> Justin
> > > >>>>>
> > > >>>>> On Sat, Jun 6, 2009 at 4:09 PM, John Martyniak
> > > >>>>> <[email protected]> wrote:
> > > >>>>>> Ok.
> > > >>>>>>
> > > >>>>>> So an update on this item.
> > > >>>>>>
> > > >>>>>> I did start running Nutch with Hadoop; I am trying a
> > > >>>>>> single-node config just to test it out.
> > > >>>>>>
> > > >>>>>> It took forever to get all of the files into the DFS (it was
> > > >>>>>> just over 80GB), but it is in there. So I started the
> > > >>>>>> SegmentMerge job, and it is working flawlessly, still a
> > > >>>>>> little slow though.
> > > >>>>>>
> > > >>>>>> Also, looking at the stats for the CPUs: they sometimes go
> > > >>>>>> over 20%, but not by much and not often. The disk is very
> > > >>>>>> lightly taxed; the peak was about 20 MB/sec, and the drives
> > > >>>>>> and interface are rated at 3 GB/sec, so no issue there.
> > > >>>>>>
> > > >>>>>> I tried to set the map jobs to 7 and the reduce jobs to 3,
> > > >>>>>> but when I restarted everything it is still only using 2 and
> > > >>>>>> 1. Any ideas? I made that change in the hadoop-site.xml
> > > >>>>>> file, BTW.
> > > >>>>>>
> > > >>>>>> -John
> > > >>>>>>
> > > >>>>>> On Jun 4, 2009, at 10:00 AM, Andrzej Bialecki wrote:
> > > >>>>>>
> > > >>>>>>> John Martyniak wrote:
> > > >>>>>>>
> > > >>>>>>>> Andrzej,
> > > >>>>>>>>
> > > >>>>>>>> I am a little embarrassed asking, but is there a setup
> > > >>>>>>>> guide for setting up Hadoop for Nutch 1.0, or is it the
> > > >>>>>>>> same process as setting up for Nutch 0.17 (which I think
> > > >>>>>>>> is the existing guide out there)?
> > > >>>>>>>
> > > >>>>>>> Basically, yes - but this guide is primarily about the
> > > >>>>>>> setup of a Hadoop cluster using the Hadoop pieces
> > > >>>>>>> distributed with Nutch. As such, these instructions are
> > > >>>>>>> already slightly outdated.
> > > >>>>>>> So it's best simply to install a clean Hadoop 0.19.1
> > > >>>>>>> according to the instructions on the Hadoop wiki, and then
> > > >>>>>>> build the nutch*.job file separately.
> > > >>>>>>>
> > > >>>>>>>> Also, I have Hadoop already running for some other
> > > >>>>>>>> applications not associated with Nutch; can I use the
> > > >>>>>>>> same install? I think it is the same version that Nutch
> > > >>>>>>>> 1.0 uses. Or is it just easier to set it up using the
> > > >>>>>>>> Nutch config?
> > > >>>>>>>
> > > >>>>>>> Yes, it's perfectly OK to use Nutch with an existing
> > > >>>>>>> Hadoop cluster of the same vintage (which is 0.19.1 in
> > > >>>>>>> Nutch 1.0). In fact, I would strongly recommend this way,
> > > >>>>>>> instead of the usual "dirty" way of setting up Nutch by
> > > >>>>>>> replicating the local build dir ;)
> > > >>>>>>>
> > > >>>>>>> Just specify the nutch*.job file like this:
> > > >>>>>>>
> > > >>>>>>>   bin/hadoop jar nutch*.job <className> <args ..>
> > > >>>>>>>
> > > >>>>>>> where className and args are one of the Nutch command-line
> > > >>>>>>> tools and its arguments. You can also modify the bin/nutch
> > > >>>>>>> script slightly, so that you don't have to specify
> > > >>>>>>> fully-qualified class names.
> > > >>>>>>>
> > > >>>>>>> --
> > > >>>>>>> Best regards,
> > > >>>>>>> Andrzej Bialecki <><
> > > >>>>>>>  ___. ___ ___ ___ _ _   __________________________________
> > > >>>>>>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> > > >>>>>>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> > > >>>>>>> http://www.sigram.com  Contact: info at sigram dot com
> > > >>>>
> > > >>>> John Martyniak
> > > >>>> President
> > > >>>> Before Dawn Solutions, Inc.
> > > >>>> 9457 S. University Blvd #266
> > > >>>> Highlands Ranch, CO 80126
> > > >>>> o: 877-499-1562 x707
> > > >>>> f: 877-499-1562
> > > >>>> c: 303-522-1756
> > > >>>> e: [email protected]
>
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
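For reference on the single-machine DFS question that opens the thread: a minimal pseudo-distributed setup for 0.19-era Hadoop needs only a few hadoop-site.xml properties; the hostnames, ports, and values below are illustrative, not from the thread. After editing the file, `bin/hadoop namenode -format` followed by `bin/start-all.sh` brings the DFS up on one machine.

```xml
<configuration>
  <!-- Point the filesystem and jobtracker at the local machine. -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <!-- Single node, so replication would only waste space. -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```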
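Julien's suggestion, written out as the hadoop-site.xml fragment he and John are discussing (the property name is as quoted in the thread; in 0.19-era Hadoop it compresses intermediate map output, which is where much of a merge's temp space goes):

```xml
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
```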
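One possible explanation for John seeing only 2 maps and 1 reduce: in Hadoop 0.19, mapred.map.tasks is only a hint (the real map count follows the input splits), the per-node concurrency caps are separate tasktracker properties with a default of 2, and those caps are read at daemon startup, so they need a full stop-all.sh / start-all.sh to take effect. A hedged sketch of the relevant hadoop-site.xml entries, with John's values:

```xml
<!-- Job-level settings; mapred.map.tasks is a hint only. -->
<property>
  <name>mapred.map.tasks</name>
  <value>7</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>3</value>
</property>

<!-- Per-tasktracker concurrency caps (default 2 each); these
     only apply after the tasktracker daemons are restarted. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>7</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>3</value>
</property>
```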
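Andrzej's `bin/hadoop jar nutch*.job <className> <args ..>` invocation, made concrete for the segment merge under discussion; the class name is the merger tool shipped in Nutch 1.0, and the HDFS paths are illustrative:

```shell
# Merge all segments under crawl/segments into one output segment
bin/hadoop jar nutch-1.0.job org.apache.nutch.segment.SegmentMerger \
    crawl/merged_segment -dir crawl/segments
```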
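On the mapred.min.split.size value John asks about: 671088640 is exactly 640 MB, i.e. ten default-sized 64 MB HDFS blocks, so the setting simply coarsens input splits; the trade-off is fewer, larger map tasks rather than anything risky to the data itself. A quick sanity check of the arithmetic:

```shell
# 671088640 bytes expressed in MB, and in default 64 MB HDFS blocks
expr 671088640 / 1048576    # 640 (MB)
expr 671088640 / 67108864   # 10 (64 MB blocks)
```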
