Hello,

Can you look at the size of the merged segments?

If I remember correctly, when I had segment1 = 1GB and segment2 = 1GB, the new merged segment was about 5GB, but I haven't had time to look into it.

Thanks,
Bartosz

czerwionka paul wrote:
hi justin,

i am running hadoop in distributed mode and having the same problem.

merging segments just eats up much more temp space than the segments would have combined.

paul.

On 14.06.2009, at 18:17, MilleBii wrote:

Same here: merging 3 segments of 100k, 100k and 300k URLs ended up consuming
200GB, and the partition was full after 18 hours of processing.

Something is strange with this segment merge.

Conf : PC Dual Core, Vista, Hadoop on single node.

Can someone confirm whether installing Hadoop in distributed mode will fix
it? Is there a good config guide for distributed mode?


2009/6/12 Justin Yao <[email protected]>

Hi John,
I have no idea about that either.
Justin

On Fri, Jun 12, 2009 at 8:05 AM, John Martyniak <[email protected]> wrote:

Justin,

Thanks for the response.

I was having a similar issue. I was trying to merge the segments for crawls
during the month of May, probably around 13-15GB, and by the time everything
was running it had used around 900 GB of tmp space, which doesn't seem very
efficient.

I will try this out and see if it changes anything.

Do you know if there is any risk in using the following:
<property>
 <name>mapred.min.split.size</name>
 <value>671088640</value>
</property>

as suggested in the article?

-John
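
A note on the property above: as far as I understand the old mapred
FileInputFormat, the split size is chosen as
max(minSize, min(goalSize, blockSize)), so a large mapred.min.split.size
mainly means fewer, larger map tasks. The sketch below only illustrates that
arithmetic. The 64 MB block size, the requested map count of 2 and the 15 GB
input are assumptions for illustration, and the split counts are rough
estimates that ignore per-file boundaries.

```python
import math

MIB = 1024 * 1024


def split_size(min_size, goal_size, block_size):
    """Split size rule: max(minSize, min(goalSize, blockSize))."""
    return max(min_size, min(goal_size, block_size))


def approx_splits(total_bytes, size):
    """Rough split count; ignores per-file boundaries and rounding slack."""
    return math.ceil(total_bytes / size)


total = 15 * 1024 * MIB      # assumed input size (about 15 GB)
block = 64 * MIB             # assumed HDFS block size
goal = total // 2            # assumed: total size / requested map count of 2

default_size = split_size(1, goal, block)
raised_size = split_size(671088640, goal, block)

# Prints the split size in MiB and the approximate number of splits.
print(default_size // MIB, approx_splits(total, default_size))
print(raised_size // MIB, approx_splits(total, raised_size))
```

Under these assumptions the value 671088640 works out to 640 MiB per split,
so the job would run with far fewer map tasks. Whether that helps with the
temp space usage is not something this arithmetic can show.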

On Jun 11, 2009, at 7:25 PM, Justin Yao wrote:

Hi John,

I had the same issue before but never found a solution.
Here is a workaround mentioned by someone on this mailing list; you may
want to give it a try:

Seemingly abnormal temp space use by segment merger


http://www.nabble.com/Content(source-code)-of-web-pages-crawled-by-nutch-tt23495506.html#a23522569

Regards,
Justin

On Sat, Jun 6, 2009 at 4:09 PM, John Martyniak <[email protected]> wrote:


Ok.

So, an update on this item.

I did start running nutch with hadoop, I am trying a single node config
just to test it out.

It took forever to get all of the files into the DFS (it was just over 80GB),
but it is in there. So I started the SegmentMerge job, and it is working
flawlessly, still a little slow though.

Also, looking at the stats for the CPUs, they sometimes go over 20%, but not
by much and not often. The disk is very lightly taxed: peak was about 20
MB/sec, and the drives and interface are rated at 3 GB/sec, so no issue
there.

I tried to set the map tasks to 7 and the reduce tasks to 3, but when I
restarted everything it is still only using 2 and 1. Any ideas? I made that
change in the hadoop-site.xml file, BTW.

-John
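
One thing that might be worth checking here is which values the
hadoop-site.xml being read actually contains. A minimal sketch, assuming the
usual configuration/property layout of that file. The property names listed
(mapred.map.tasks, mapred.reduce.tasks and the per-tasktracker maximums) are
a guess at the settings involved, since the message does not say which ones
were changed, and the sample XML is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Property names assumed to be relevant; the original message does not
# say which ones were edited.
TASK_PROPERTIES = (
    "mapred.map.tasks",
    "mapred.reduce.tasks",
    "mapred.tasktracker.map.tasks.maximum",
    "mapred.tasktracker.reduce.tasks.maximum",
)


def read_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop config file."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


# Made-up sample; in practice read the real file, e.g. open(path).read().
SAMPLE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.map.tasks</name>
    <value>7</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>3</value>
  </property>
</configuration>
"""

props = read_properties(SAMPLE)
for key in TASK_PROPERTIES:
    print(key, "=", props.get(key, "(not set in this file)"))
```

Running this against each copy of hadoop-site.xml on the machine (the Hadoop
conf dir and the Nutch conf dir, if both exist) would show whether the edited
file is the one that carries the new values.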


On Jun 4, 2009, at 10:00 AM, Andrzej Bialecki wrote:

John Martyniak wrote:


Andrzej,
I am a little embarrassed to ask, but is there a setup guide for setting up
Hadoop for Nutch 1.0, or is it the same process as setting up for
Nutch 0.17 (which I think is the existing guide out there)?


Basically, yes - but this guide is primarily about the setup of a Hadoop
cluster using the Hadoop pieces distributed with Nutch. As such, these
instructions are already slightly outdated. So it's best simply to install a
clean Hadoop 0.19.1 according to the instructions on the Hadoop wiki, and then
build the nutch*.job file separately.

Also, I have Hadoop already running for some other applications not
associated with Nutch. Can I use the same install? I think that it is the
same version that Nutch 1.0 uses. Or is it just easier to set it up using
the Nutch config?


Yes, it's perfectly ok to use Nutch with an existing Hadoop cluster of the
same vintage (which is 0.19.1 in Nutch 1.0). In fact, I would strongly
recommend this way, instead of the usual "dirty" way of setting up Nutch by
replicating the local build dir ;)

Just specify the nutch*.job file like this:

    bin/hadoop jar nutch*.job <className> <args ..>

where className and args are those of one of the Nutch command-line tools.
You can also slightly modify the bin/nutch script, so that you don't have to
specify fully-qualified class names.
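
To make the command form above concrete, here is a small sketch that builds
the argument list for a given tool. The short names and class names are the
ones I believe bin/nutch maps in Nutch 1.0, but treat them as assumptions and
check your own script. The job file name nutch-1.0.job, the paths and the
SegmentMerger arguments are placeholders for illustration.

```python
# Short-name to class mapping, assumed to mirror bin/nutch in Nutch 1.0.
TOOLS = {
    "inject": "org.apache.nutch.crawl.Injector",
    "generate": "org.apache.nutch.crawl.Generator",
    "fetch": "org.apache.nutch.fetcher.Fetcher",
    "updatedb": "org.apache.nutch.crawl.CrawlDb",
    "mergesegs": "org.apache.nutch.segment.SegmentMerger",
}


def hadoop_command(tool, args, job_file="nutch-1.0.job"):
    """Build the argv for: bin/hadoop jar <job file> <className> <args ..>"""
    class_name = TOOLS.get(tool, tool)  # fall back to a full class name
    return ["bin/hadoop", "jar", job_file, class_name] + list(args)


# Placeholder paths, for illustration only.
cmd = hadoop_command("mergesegs", ["crawl/merged", "-dir", "crawl/segments"])
print(" ".join(cmd))
```

The same lookup could live in a shell wrapper instead; the point is only that
the short name is resolved to a class name before bin/hadoop jar is called.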

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




John Martyniak
President
Before Dawn Solutions, Inc.
9457 S. University Blvd #266
Highlands Ranch, CO 80126
o: 877-499-1562 x707
f: 877-499-1562
c: 303-522-1756
e: [email protected]





