Re: Memory intensive jobs and JVM reuse

2010-04-29 Thread Aleksandar Stupar
Hi, this may sound silly, but what I would try is the following:
- use CombineFileInputFormat with mapred.max.split.size set to 4 x block size (can be more than 4); this will reduce the number of input splits and therefore the number of map tasks, so you can do the following:
- set mapred.tasktracker.m
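A minimal sketch of that first suggestion against the 0.20-era mapred API. The 64 MB block size, the input path, and the MyCombineFileInputFormat subclass are illustrative assumptions, not values from the thread:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.JobConf;

    public class CombinedSplitsConfig {
        public static JobConf configure() {
            JobConf conf = new JobConf(CombinedSplitsConfig.class);
            // Pack several HDFS blocks into each split: 4 x 64 MB = 256 MB per split.
            // Assumes the era's default 64 MB block size; adjust for your cluster.
            conf.setLong("mapred.max.split.size", 4L * 64 * 1024 * 1024);
            // CombineFileInputFormat (org.apache.hadoop.mapred.lib) is abstract, so a
            // concrete subclass supplying a RecordReader would be set here, e.g.:
            // conf.setInputFormat(MyCombineFileInputFormat.class);  // hypothetical subclass
            FileInputFormat.setInputPaths(conf, new Path("/input"));  // illustrative path
            return conf;
        }
    }

Fewer splits means fewer map tasks in flight, which is what makes the per-task memory cost tractable.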

Re: Memory intensive jobs and JVM reuse

2010-04-29 Thread Nick Jones
On 4/29/2010 10:52 AM, Aaron Kimball wrote:
* JVM reuse only applies within the same job. Different jobs are always different JVMs.
* JVM reuse is serial; you'll only get task B in a JVM after task A has already completed -- never both at the same time. If you configure Hadoop to run 4 tasks per TT concurrently, you'll still have 4 JVMs up.

Re: Memory intensive jobs and JVM reuse

2010-04-29 Thread Aaron Kimball
* JVM reuse only applies within the same job. Different jobs are always different JVMs.
* JVM reuse is serial; you'll only get task B in a JVM after task A has already completed -- never both at the same time. If you configure Hadoop to run 4 tasks per TT concurrently, you'll still have 4 JVMs up.
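For context, JVM reuse in this era is switched on with a single job property; a minimal sketch, where -1 means "reuse the JVM for an unlimited number of (serial, same-job) tasks":

    import org.apache.hadoop.mapred.JobConf;

    public class JvmReuseConfig {
        public static JobConf configure() {
            JobConf conf = new JobConf(JvmReuseConfig.class);
            // -1 = reuse each task JVM for an unlimited number of this job's tasks.
            // Reuse is still serial and per-job, exactly as Aaron describes above.
            conf.setNumTasksToExecutePerJvm(-1);
            // Equivalent property form: conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
            return conf;
        }
    }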

Re: Memory intensive jobs and JVM reuse

2010-04-29 Thread David Rosenstrauch
On 04/29/2010 11:08 AM, Danny Leshem wrote:
> David, DistributedCache distributes files across the cluster - it is not a shared memory cache. My problem is not distributing the HashMap across machines, but the fact that it is replicated in memory for each task (or each job, for that matter).
OK,

Re: Memory intensive jobs and JVM reuse

2010-04-29 Thread Danny Leshem
David, DistributedCache distributes files across the cluster - it is not a shared memory cache. My problem is not distributing the HashMap across machines, but the fact that it is replicated in memory for each task (or each job, for that matter). On Thu, Apr 29, 2010 at 4:57 PM, David Rosenstrauch
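The combination the thread keeps circling is JVM reuse plus a lazily initialized static: with mapred.job.reuse.jvm.num.tasks set to -1, the map is loaded once per task JVM rather than once per task. A minimal sketch against the 0.20 mapred API; the class name and the loading step are placeholder assumptions, not code from the thread:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class LookupMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {

        // Static, so it survives across tasks when JVM reuse is enabled:
        // loaded at most once per task JVM, not once per task attempt.
        private static Map<String, String> lookup;

        private static synchronized Map<String, String> getLookup() throws IOException {
            if (lookup == null) {
                Map<String, String> m = new HashMap<String, String>();
                // Placeholder: load the ~4Gb read-only map here,
                // e.g. from a DistributedCache local file.
                lookup = m;
            }
            return lookup;
        }

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            Map<String, String> map = getLookup();
            // ... look up against the shared read-only map and collect output ...
        }
    }

Note the limit Danny is pointing at: this amortizes loading across the serial tasks of one job, but each of the N concurrent task JVMs on a node still holds its own copy.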

Re: Memory intensive jobs and JVM reuse

2010-04-29 Thread Raja Thiruvathuru
Can you show (cut & paste) what your job config looks like? On Thu, Apr 29, 2010 at 8:58 AM, Danny Leshem wrote:
> Hello,
>
> I'm using Hadoop to run a memory-intensive job on different input data.
> The job requires the availability (in memory) of some read-only HashMap,
> about 4Gb in size
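For anyone reproducing the setup, the relevant part of such a config is the per-task JVM heap, which comes from mapred.child.java.opts; a 4Gb map needs generous headroom. The 6g figure below is an illustrative guess, not a value from the thread:

    import org.apache.hadoop.mapred.JobConf;

    public class HeapConfig {
        public static JobConf configure() {
            JobConf conf = new JobConf(HeapConfig.class);
            // Each task JVM needs headroom above the 4Gb map itself;
            // 6g is illustrative, not a recommendation from the thread.
            conf.set("mapred.child.java.opts", "-Xmx6g");
            return conf;
        }
    }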

Re: Memory intensive jobs and JVM reuse

2010-04-29 Thread David Rosenstrauch
On 04/29/2010 08:58 AM, Danny Leshem wrote:
Hello, I'm using Hadoop to run a memory-intensive job on different input data. The job requires the availability (in memory) of some read-only HashMap, about 4Gb in size. The same fixed HashMap is used for all input data. I'm using a cluster of EC2 machines with more than enough memory (around
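David's suggestion, named explicitly in the replies above, was DistributedCache. A minimal sketch of its mechanics; the HDFS path is illustrative:

    import java.net.URI;
    import java.net.URISyntaxException;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.mapred.JobConf;

    public class CacheSetup {
        public static void configure(JobConf conf) throws URISyntaxException {
            // Copies the HDFS file to each node's local disk once per job;
            // the path is illustrative.
            DistributedCache.addCacheFile(new URI("/data/lookup.bin"), conf);
        }
    }

Tasks then read the node-local copy via DistributedCache.getLocalCacheFiles(conf) in configure(). As Danny's reply above notes, this gets the file onto each node's disk once, but does nothing about the in-memory copies held by concurrent tasks.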

Memory intensive jobs and JVM reuse

2010-04-29 Thread Danny Leshem
Hello, I'm using Hadoop to run a memory-intensive job on different input data. The job requires the availability (in memory) of some read-only HashMap, about 4Gb in size. The same fixed HashMap is used for all input data. I'm using a cluster of EC2 machines with more than enough memory (around
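The arithmetic driving the whole thread: each concurrently running map task holds its own ~4Gb copy, so a node running M map slots needs roughly M x 4Gb. The knob that caps M, presumably the property Aleksandar's truncated message refers to, is a TaskTracker-level setting normally placed in mapred-site.xml; shown in code form only for illustration:

    import org.apache.hadoop.mapred.JobConf;

    public class SlotConfig {
        public static JobConf configure() {
            JobConf conf = new JobConf(SlotConfig.class);
            // With M concurrent map slots per node, peak usage is roughly M x 4Gb,
            // since each task JVM holds its own copy of the read-only HashMap.
            // This property is read by the TaskTracker at daemon startup, so it
            // belongs in mapred-site.xml; setting it here is illustrative only.
            conf.setInt("mapred.tasktracker.map.tasks.maximum", 1);
            return conf;
        }
    }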