What is your input data? Some types of files are not splittable because they use a 
non-splittable compression codec (such as gzip). Could that be the case here?
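
If the input does live in files, a quick way to check is to ask Hadoop which codec it 
associates with each input file; anything matched to GzipCodec comes in as a single split 
and therefore a single mapper. A minimal sketch, not from this thread: the class name 
CheckSplittability and the use of args[0] as the input directory are just illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CheckSplittability {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        CompressionCodecFactory codecs = new CompressionCodecFactory(conf);

        // args[0]: the job's input directory on HDFS (illustrative assumption)
        for (FileStatus status : fs.listStatus(new Path(args[0]))) {
            CompressionCodec codec = codecs.getCodec(status.getPath());
            System.out.println(status.getPath() + " -> "
                    + (codec == null
                       ? "no codec (file can be split normally)"
                       : codec.getClass().getSimpleName()));
        }
    }
}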

Friso

On 25 Oct. 2011, at 21:43, Artem Yankov wrote:

It looks like the input data is not being split correctly. The job always generates only 
one map task and gives it to one of the nodes. I tried passing parameters like -D 
mapred.max.split.size, but it doesn't seem to have any effect.
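
One thing worth checking: generic options like -D mapred.max.split.size=... are only 
applied when the job's entry point goes through ToolRunner/GenericOptionsParser; a plain 
main() that builds its own Configuration never sees them. A minimal sketch of that driver 
pattern (the class name MyJobDriver and the job wiring are only illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJobDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // getConf() already carries any -D key=value pairs parsed by ToolRunner
        Job job = new Job(getConf(), "my job");
        job.setJarByClass(MyJobDriver.class);
        // ... set input format, mapper, reducer, and output here ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner strips the generic options (-D, -files, ...) before
        // handing the remaining arguments to run()
        System.exit(ToolRunner.run(new Configuration(), new MyJobDriver(), args));
    }
}

Note also that mapred.max.split.size is read by the file-based input formats; if the 
splits come from mongo-hadoop's MongoInputFormat, that format computes its own splits, so 
its own split settings (see MongoConfigUtil in the mongo-hadoop sources) are likely what 
determine how many map tasks are produced.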

So the question is: how do I specify the maximum number of input records each mapper 
receives?

On Tue, Oct 25, 2011 at 10:56 AM, Artem Yankov 
<artem.yan...@gmail.com> wrote:
Hey,

I set up a Hadoop cluster on EC2 using this documentation: 
http://wiki.apache.org/hadoop/AmazonEC2

OS: Linux (Fedora 8)
Hadoop version: 0.20.203.0
Java version: 1.7.0_01
Heap size: 1 GB (stats always show that only 4% of it is used)
I use the mongo-hadoop plugin to get data from MongoDB.

Everything seems to work perfectly with small chunks of data: calculations are fast, I get 
the results, and tasks seem to be distributed normally among the slaves.

Then I try to load a huge amount of data (22 million records) and everything hangs. The 
first slave receives a map task, but the other slaves do not. In the logs I constantly see 
this:

INFO org.apache.hadoop.hdfs.StateChange: *BLOCK* NameSystem.processReport: from 
x.x.x.x:50010, blocks: 2, processing time: 0 m

I tried using different numbers of slaves (at most I ran 25 nodes), but it doesn't help, 
because it seems that once the first slave receives a job, it blocks everything else. 
(Again, everything works fine with small chunks of data.)

There is no significant CPU or memory load on the master.

Any ideas on what could be the reason for this?

Artem.


