Hi Justin,
I am running Hadoop in distributed mode and having the same problem: merging segments eats up far more temp space than the combined size of the segments themselves.
Paul.
On 14.06.2009, at 18:17, MilleBii wrote:
Same here: merging 3 segments of 100k, 100k, and 300k URLs consumed 200 GB and filled the partition after 18 hours of processing.
Something is strange with this segment merge.
Config: dual-core PC, Vista, Hadoop on a single node.
Can someone confirm whether installing Hadoop in distributed mode will fix it? Is there a good configuration guide for distributed mode?
2009/6/12 Justin Yao <[email protected]>
Hi John,
I have no idea about that either.
Justin
On Fri, Jun 12, 2009 at 8:05 AM, John Martyniak <[email protected]> wrote:
Justin,
Thanks for the response.
I was having a similar issue. I was trying to merge the segments from crawls during the month of May, probably around 13-15 GB in total, and after everything had run it had used around 900 GB of temp space, which doesn't seem very efficient.
I will try this out and see if it changes anything.
Do you know if there is any risk in using the following:
<property>
  <name>mapred.min.split.size</name>
  <value>671088640</value>
</property>
as suggested in the article?
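(If my arithmetic is right, 671088640 = 640 x 1024 x 1024 bytes, i.e. a minimum split size of 640 MB.)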
-John
On Jun 11, 2009, at 7:25 PM, Justin Yao wrote:
Hi John,
I had the same issue before but never found a solution.
Here is a workaround mentioned by someone on this mailing list; you may give it a try:
Seemingly abnormal temp space use by segment merger
http://www.nabble.com/Content(source-code)-of-web-pages-crawled-by-nutch-tt23495506.html#a23522569
Regards,
Justin
On Sat, Jun 6, 2009 at 4:09 PM, John Martyniak <[email protected]> wrote:
OK, so an update on this item.
I did start running Nutch with Hadoop; I am trying a single-node config just to test it out.
It took forever to get all of the files into DFS (it was just over 80 GB), but they are in there. I then started the SegmentMerger job, and it is working flawlessly, though still a little slow.
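For reference, the commands I mean are roughly the following (paths are just illustrative, and I'm assuming the stock mergesegs command from the bin/nutch script):

# Copy the local crawl data into DFS (local and DFS paths are illustrative)
bin/hadoop dfs -put crawl/segments crawl/segments

# Merge everything under crawl/segments into a single output segment
bin/nutch mergesegs crawl/merged_segments -dir crawl/segments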
Also, looking at the CPU stats, they sometimes go over 20%, but not by much and not often. The disk is very lightly taxed; the peak was about 20 MB/sec, and the drives and interface are rated at 3 Gb/sec, so no issue there.
I tried to set the map tasks to 7 and the reduce tasks to 3, but after I restarted everything it is still only using 2 and 1. Any ideas? I made that change in the hadoop-site.xml file, BTW.
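For what it's worth, my understanding is that in Hadoop 0.19 the per-node slot counts come from the tasktracker maximum settings rather than from mapred.map.tasks, so I assume something like the following in conf/hadoop-site.xml is what's needed (values only as an example, and the tasktracker has to be restarted):

<!-- assumed per-node task slot settings; tasktracker restart required -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>7</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>3</value>
</property>
<!-- assumed default number of reduce tasks per job -->
<property>
  <name>mapred.reduce.tasks</name>
  <value>3</value>
</property>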
-John
On Jun 4, 2009, at 10:00 AM, Andrzej Bialecki wrote:
John Martyniak wrote:
Andrzej,
I am a little embarrassed asking, but is there a setup guide for setting up Hadoop for Nutch 1.0, or is it the same process as setting up for Nutch 0.17 (which I think is the existing guide out there)?
Basically, yes - but that guide is primarily about setting up a Hadoop cluster using the Hadoop pieces distributed with Nutch, so those instructions are already slightly outdated. It's best to simply install a clean Hadoop 0.19.1 according to the instructions on the Hadoop wiki, and then build the nutch*.job file separately.
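Building the job file separately is roughly this (a sketch only, assuming a Nutch 1.0 source checkout and its standard 'job' ant target):

# In the Nutch 1.0 source tree
cd nutch-1.0
ant job        # should produce build/nutch-1.0.job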
Also, I already have Hadoop running for some other applications not associated with Nutch; can I use the same install? I think it is the same version that Nutch 1.0 uses. Or is it just easier to set it up using the Nutch config?
Yes, it's perfectly OK to use Nutch with an existing Hadoop cluster of the same vintage (which is 0.19.1 for Nutch 1.0). In fact, I would strongly recommend this approach instead of the usual "dirty" way of setting up Nutch by replicating the local build dir ;)
Just specify the nutch*.job file like this:
bin/hadoop jar nutch*.job <className> <args ..>
where className is one of the Nutch command-line tool classes and args are its arguments. You can also slightly modify the bin/nutch script so that you don't have to specify fully-qualified class names.
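For example, the segment merge discussed above would look something like this (the class is org.apache.nutch.segment.SegmentMerger; the job file name and paths are only placeholders):

# Merge every segment under crawl/segments into crawl/merged_segments
bin/hadoop jar nutch-1.0.job org.apache.nutch.segment.SegmentMerger \
    crawl/merged_segments -dir crawl/segments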
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
John Martyniak
President
Before Dawn Solutions, Inc.
9457 S. University Blvd #266
Highlands Ranch, CO 80126
o: 877-499-1562 x707
f: 877-499-1562
c: 303-522-1756
e: [email protected]