What Hadoop version?
On a cluster this size there are two things to check right away:
1. In the Hadoop UI, during the job, are the reduce and map slots close to
being filled most of the time, or are tasks completing faster than the
scheduler can keep up, so that there are often many empty slots?
Thanks for your help Jason. I actually did reduce the heap size to 400M
and it sped things up a few percent. From my experience with JVMs, if
your app can get by with a smaller heap it will run faster, because GC
is more efficient for smaller collections (which is also why
using incr
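The thread mentions lowering the task heap to 400M but never shows how it was set. A minimal sketch, assuming the standard Hadoop 0.18/0.20-era property `mapred.child.java.opts` is what was changed; the query and table name are placeholders:

```shell
# Hedged sketch: set the per-task JVM heap from a Hive session.
# mapred.child.java.opts is an assumption; the thread never names the property.
hive -e "
  set mapred.child.java.opts=-Xmx400m;
  SELECT count(1) FROM my_table;   -- placeholder query
"
```

The same property can also be set cluster-wide in the site configuration rather than per session.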
The value really varies by job and by cluster. The larger the split, the
greater the chance that a small number of splits will take much longer to
complete than the rest, resulting in a long job tail where very little of
your cluster is utilized while they finish.
The flip side is with very sma
That definitely helps a lot! I saw a few people talking about it on the
webs, and they say to set the value to Long.MAX_VALUE, but that is not
what I have found to be best. I see about a 25% improvement at 300MB
(3); CPU utilization is up to about 50-70%+, but I am still fine-tuning.
th
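The messages above compare a 300MB setting against the Long.MAX_VALUE advice without naming the knob. A sketch of that tuning, assuming the property being varied is `mapred.min.split.size` (an assumption; the thread never names it):

```shell
# Hedged sketch of the split-size tuning discussed above.
# 314572800 bytes is roughly the 300MB value reported as the sweet spot;
# some posts suggest Long.MAX_VALUE (9223372036854775807) instead.
hive -e "
  set mapred.min.split.size=314572800;
  SELECT count(1) FROM my_table;   -- placeholder query
"
```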
I remember having a problem like this at one point; it was related to the
mean run time of my tasks and the rate at which the jobtracker could start
new tasks.
By increasing the split size until the mean run time of my tasks was in the
minutes, I was able to drive up the utilization.
On Wed, Oct 14
No, there doesn't seem to be all that much network traffic. Most of the
time traffic (measured with nethogs) is about 15-30K/s on the master and
slaves during map; sometimes it bursts up to 5-10 MB/s on a slave for maybe
5-10 seconds on a query that takes 10 minutes, but that is still less
than wh
Are your network interfaces or the namenode/jobtracker/datanodes saturated?
On Tue, Oct 13, 2009 at 9:05 AM, Chris Seline wrote:
> I am using the 0.3 Cloudera scripts to start a Hadoop cluster on EC2 of 11
> c1.xlarge instances (1 master, 10 slaves), that is the biggest instance
> available with
I am using the 0.3 Cloudera scripts to start a Hadoop cluster on EC2 of
11 c1.xlarge instances (1 master, 10 slaves); that is the biggest
instance available, with 20 compute units and 4x 400 GB disks.
I wrote some scripts to test many (100's) of configurations running a
particular Hive query to