Re: Merge taking forever

Bartosz Gadzimski Mon, 15 Jun 2009 05:37:30 -0700

Hello,

Can you look at the size of the merged segments?

If I remember correctly, when I had segment1 = 1 GB and segment2 = 1 GB, the new merged segment was about 5 GB, but I haven't had time to look into it.


Thanks,
Bartosz

czerwionka paul wrote:

hi justin,

i am running hadoop in distributed mode and having the same problem.
merging segments just eats up much more temp space than the segments would have combined.
paul.

On 14.06.2009, at 18:17, MilleBii wrote:
Same here: merging 3 segments of 100k, 100k and 300k URLs resulted in consuming
200 GB and a full partition after 18 hours of processing.

Something is strange with this segment merge.

Conf: PC Dual Core, Vista, Hadoop on a single node.
Can someone confirm whether installing Hadoop in distributed mode will fix it? Is
there a good config guide for distributed mode?


2009/6/12 Justin Yao <[email protected]>
Hi John,
I have no idea about that either.
Justin

On Fri, Jun 12, 2009 at 8:05 AM, John Martyniak <[email protected]> wrote:
Justin,

Thanks for the response.

I was having a similar issue. I was trying to merge the segments for crawls
during the month of May, probably around 13-15 GB, and after everything was
running it had used around 900 GB of tmp space, which doesn't seem very
efficient.
I will try this out and see if it changes anything.

Do you know if there is any risk in using the following:
<property>
 <name>mapred.min.split.size</name>
 <value>671088640</value>
</property>

as suggested in the article?

-John
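
For reference, 671088640 bytes is exactly 640 MiB (640 x 1024 x 1024), or ten blocks at the stock 64 MB dfs.block.size. The annotated fragment below is only a sketch of what that setting implies, assuming the default block size and the usual FileInputFormat split sizing; it is not a statement about how risky the value is for any particular cluster.

```xml
<!-- hadoop-site.xml fragment; the property and value are the ones quoted above. -->
<property>
  <name>mapred.min.split.size</name>
  <!-- 671088640 = 640 * 1024 * 1024 bytes (640 MiB).
       Split size is roughly max(minSize, min(goalSize, blockSize)), so with
       64 MB blocks (an assumption) each split covers about ten blocks and a
       large file yields roughly a tenth as many map tasks. -->
  <value>671088640</value>
</property>
```

The likely trade-off is data locality: a split that spans several blocks generally cannot be read entirely from the local disk in a multi-node cluster, so some of each map's input may travel over the network.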

On Jun 11, 2009, at 7:25 PM, Justin Yao wrote:

Hi John,
I had the same issue before but never found a solution.
Here is a workaround mentioned by someone on this mailing list; you may
want to give it a try:

Seemingly abnormal temp space use by segment merger
http://www.nabble.com/Content(source-code)-of-web-pages-crawled-by-nutch-tt23495506.html#a23522569
Regards,
Justin

On Sat, Jun 6, 2009 at 4:09 PM, John Martyniak <[email protected]> wrote:
Ok.
So, an update to this item.
I did start running Nutch with Hadoop; I am trying a single-node config
just to test it out.

It took forever to get all of the files into the DFS (it was just over
80 GB), but it is in there. So I started the SegmentMerge job, and it is
working flawlessly, still a little slow though.
Also, looking at the stats for the CPUs, they sometimes go over 20%, but not
by much and not often. The disk is very lightly taxed: the peak was about
20 MB/sec, and the drives and interface are rated at 3 GB/sec, so no issue
there.
I tried to set the map jobs to 7 and the reduce jobs to 3, but when I
restarted everything it is still only using 2 and 1. Any ideas? I made that
change in the hadoop-site.xml file, BTW.

-John
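
The message does not say which properties were edited, so the following is only a sketch of the stock Hadoop 0.19 property names that relate to task counts, using the 7 and 3 from the message. Whether any of them explains the 2-and-1 behaviour seen here is an assumption, not a diagnosis.

```xml
<!-- hadoop-site.xml fragment (sketch). -->

<!-- Per-job hints. The real number of maps follows the input splits, so
     mapred.map.tasks is only a suggestion; both are ignored when
     mapred.job.tracker is "local". -->
<property>
  <name>mapred.map.tasks</name>
  <value>7</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>3</value>
</property>

<!-- Per-TaskTracker slot limits: how many tasks one node runs at once.
     These are read when the TaskTracker starts. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>7</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>3</value>
</property>
```

A job can also set its own task counts in code, in which case the site-wide per-job hints would not take effect for that job.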


On Jun 4, 2009, at 10:00 AM, Andrzej Bialecki wrote:

John Martyniak wrote:
Andrzej,
I am a little embarrassed asking, but is there a setup guide for
setting up Hadoop for Nutch 1.0, or is it the same process as setting up for
Nutch 0.17 (which I think is the existing guide out there)?

Basically, yes - but this guide is primarily about the setup of a Hadoop
cluster using the Hadoop pieces distributed with Nutch. As such, these
instructions are already slightly outdated. So it's best simply to install a
clean Hadoop 0.19.1 according to the instructions on the Hadoop wiki, and then
build the nutch*.job file separately.

Also, I have Hadoop already running for some other applications, not
associated with Nutch; can I use the same install? I think that it is the
same version that Nutch 1.0 uses. Or is it just easier to set it up using
the nutch config?

Yes, it's perfectly ok to use Nutch with an existing Hadoop cluster of the
same vintage (which is 0.19.1 in Nutch 1.0). In fact, I would strongly
recommend this way, instead of the usual "dirty" way of setting up Nutch by
replicating the local build dir ;)

Just specify the nutch*.job file like this:

    bin/hadoop jar nutch*.job <className> <args ..>

where className and args are those of one of the Nutch command-line tools. You
can also slightly modify the bin/nutch script, so that you don't have to
specify fully-qualified class names.

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

John Martyniak
President
Before Dawn Solutions, Inc.
9457 S. University Blvd #266
Highlands Ranch, CO 80126
o: 877-499-1562 x707
f: 877-499-1562
c: 303-522-1756
e: [email protected]
