Re: Optimization of cpu and i/o usage / other bottlenecks?

2009-10-15 Thread Scott Carey
What Hadoop version? On a cluster this size there are two things to check right away: 1. In the Hadoop UI, during the job, are the reduce and map slots close to being filled up most of the time, or are tasks completing faster than the scheduler can keep up so that there are often many empty sl

Re: Optimization of cpu and i/o usage / other bottlenecks?

2009-10-15 Thread Chris Seline
Thanks for your help, Jason. I actually did reduce the heap size to 400M and it sped things up a few percent. From my experience with JVMs, if you can handle lower amounts of heap your app will run faster because GC is more efficient for smaller garbage collections (which is also why using incr

Re: Optimization of cpu and i/o usage / other bottlenecks?

2009-10-14 Thread Jason Venner
The value really varies by job and by cluster. The larger the split, the more chance there is that a small number of splits will take much longer to complete than the rest, resulting in a long job tail where very little of your cluster is utilized while they complete. The flip side is with very sma
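[Editor's sketch, not part of the original message] The job-tail effect described above can be illustrated with a deliberately simplified model. It assumes every task takes the same time and that tasks run in full "waves" across all slots, so it only captures idle slots in the final wave, not individual slow splits. The function name, the 80-slot figure, and the 100 GB input are hypothetical and do not come from the thread.

```python
import math

def last_wave_utilization(input_bytes, split_bytes, total_slots):
    """Return (num_splits, waves, fraction of slots busy in the final wave).

    Simplifying assumption: all tasks take equal time and are scheduled
    in waves of `total_slots` tasks each.
    """
    num_splits = math.ceil(input_bytes / split_bytes)
    waves = math.ceil(num_splits / total_slots)
    tasks_in_last_wave = num_splits - (waves - 1) * total_slots
    return num_splits, waves, tasks_in_last_wave / total_slots

MB = 1024 ** 2
GB = 1024 ** 3

# Hypothetical cluster: 80 map slots, 100 GB of input.
large = last_wave_utilization(100 * GB, 1 * GB, 80)    # few large splits
small = last_wave_utilization(100 * GB, 128 * MB, 80)  # many small splits
print(large, small)
```

With these made-up numbers, the 1 GB splits leave most slots idle during the last wave, while the 128 MB splits fill it. The model ignores per-task startup overhead, which pushes in the opposite direction.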

Re: Optimization of cpu and i/o usage / other bottlenecks?

2009-10-14 Thread Chris Seline
That definitely helps a lot! I saw a few people talking about it on the webs, and they say to set the value to Long.MAX_VALUE, but that is not what I have found to be best. I see about 25% improvement at 300MB (3), CPU utilization is up to about 50-70%+, but I am still fine tuning. th

Re: Optimization of cpu and i/o usage / other bottlenecks?

2009-10-14 Thread Jason Venner
I remember having a problem like this at one point; it was related to the mean run time of my tasks and the rate at which the jobtracker could start new tasks. By increasing the split size until the mean run time of my tasks was in the minutes, I was able to drive up the utilization. On Wed, Oct 14
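[Editor's sketch, not part of the original message] One rough way to apply the advice above is to estimate mean task run time from split size and a measured per-task throughput, then pick the smallest candidate split whose estimate reaches a target of a minute or more. The function names, the 2 MB/s throughput, and the 120-second target are assumptions for illustration only; real run times should be measured on the cluster.

```python
def mean_task_seconds(split_bytes, bytes_per_second):
    """Estimate task run time, assuming time scales linearly with split size."""
    return split_bytes / bytes_per_second

def smallest_split_for_target(candidate_split_bytes, bytes_per_second, target_seconds):
    """Return the smallest candidate split whose estimated run time meets the
    target, or None if no candidate is large enough."""
    for split in sorted(candidate_split_bytes):
        if mean_task_seconds(split, bytes_per_second) >= target_seconds:
            return split
    return None

MB = 1024 ** 2
candidates = [64 * MB, 128 * MB, 256 * MB, 512 * MB]

# Hypothetical measurement: each map task processes about 2 MB/s.
chosen = smallest_split_for_target(candidates, 2 * MB, target_seconds=120)
print(chosen // MB if chosen is not None else None)
```

The linear estimate ignores fixed task startup cost, so it is only a starting point for the kind of tuning described in the thread.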

Re: Optimization of cpu and i/o usage / other bottlenecks?

2009-10-14 Thread Chris Seline
No, there doesn't seem to be all that much network traffic. Most of the time traffic (measured with nethogs) is about 15-30K/s on the master and slaves during map; sometimes it bursts up to 5-10 MB/s on a slave for maybe 5-10 seconds on a query that takes 10 minutes, but that is still less than wh

Re: Optimization of cpu and i/o usage / other bottlenecks?

2009-10-13 Thread Jason Venner
Are your network interfaces or the namenode/jobtracker/datanodes saturated? On Tue, Oct 13, 2009 at 9:05 AM, Chris Seline wrote: > I am using the 0.3 Cloudera scripts to start a Hadoop cluster on EC2 of 11 > c1.xlarge instances (1 master, 10 slaves), that is the biggest instance > available with

Optimization of cpu and i/o usage / other bottlenecks?

2009-10-13 Thread Chris Seline
I am using the 0.3 Cloudera scripts to start a Hadoop cluster on EC2 of 11 c1.xlarge instances (1 master, 10 slaves); that is the biggest instance available with 20 compute units and 4x 400 GB disks. I wrote some scripts to test many (100s) of configurations running a particular Hive query to