Hi,
this may sound silly, but what I would try is the following:
- use CombineFileInputFormat with mapred.max.split.size set to 4x the block size
(can be more than 4...)
this will reduce the number of input splits and therefore the number of map tasks,
so you can do the following:
- set mapred.tasktracker.map.tasks.maximum to limit how many map tasks run concurrently on each node (see the sketch below)
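A rough, untested sketch of both settings against the old 0.20 mapred API (MyJob, MyCombineInputFormat and the 64MB block size are placeholders):

    // Driver-side configuration sketch.
    JobConf conf = new JobConf(MyJob.class);

    // CombineFileInputFormat is abstract, so MyCombineInputFormat stands in for
    // a subclass that supplies a RecordReader for your file format.
    conf.setInputFormat(MyCombineInputFormat.class);

    // Pack several blocks into one split: 4 x 64MB = 256MB per split,
    // so far fewer map tasks get launched.
    conf.setLong("mapred.max.split.size", 4L * 64 * 1024 * 1024);

    // Note: mapred.tasktracker.map.tasks.maximum is read by each TaskTracker
    // from mapred-site.xml at daemon startup, not per job, so lowering the
    // number of concurrent map slots means changing it on the nodes and
    // restarting the TaskTrackers.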
On 4/29/2010 10:52 AM, Aaron Kimball wrote:
* JVM reuse only applies within the same job. Different jobs are always
different JVMs
* JVM reuse is serial; you'll only get task B in a JVM after task A has
already completed -- never both at the same time. If you configure Hadoop to
run 4 tasks per TT concurrently, you'll still have 4 JVMs up.
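For reference, this is the knob being described, as a rough 0.20-era sketch (MyJob is a placeholder; -1 just means "reuse the JVM for an unlimited number of this job's tasks"):

    JobConf conf = new JobConf(MyJob.class);
    // Reuse one JVM for any number of this job's tasks, run one after another;
    // it never shares a JVM across jobs or across concurrently running tasks.
    conf.setNumTasksToExecutePerJvm(-1);
    // Equivalent property form:
    // conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);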
On 04/29/2010 11:08 AM, Danny Leshem wrote:
David,
DistributedCache distributes files across the cluster - it is not a shared
memory cache.
My problem is not distributing the HashMap across machines, but the fact
that it is replicated in memory for each task (or each job, for that
matter).
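To make the distinction concrete, the usual DistributedCache pattern looks roughly like this (sketch only; loadMap and the generic types are made up). The cached file is copied once to each node's local disk, but every task JVM still builds its own in-memory copy of the map:

    import java.io.IOException;
    import java.util.HashMap;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class MyMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      // Rebuilt inside every task JVM - this is the ~4GB that gets replicated.
      private HashMap<String, String> lookup;

      public void configure(JobConf job) {
        try {
          // Local copies of the files registered via DistributedCache.addCacheFile().
          Path[] cached = DistributedCache.getLocalCacheFiles(job);
          lookup = loadMap(cached[0]);
        } catch (IOException e) {
          throw new RuntimeException(e);
        }
      }

      // Placeholder: parse the local cache file into the in-memory map.
      private HashMap<String, String> loadMap(Path localFile) throws IOException {
        return new HashMap<String, String>();
      }

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, Text> out, Reporter reporter)
          throws IOException {
        // lookup.get(...) would be used here.
      }
    }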
On Thu, Apr 29, 2010 at 4:57 PM, David Rosenstrauch wrote:
Can you show (cut & paste) what your job config looks like?
On 04/29/2010 08:58 AM, Danny Leshem wrote:
Hello,
I'm using Hadoop to run a memory-intensive job on different input data.
The job requires the availability (in memory) of some read-only HashMap,
about 4GB in size.
The same fixed HashMap is used for all input data.
I'm using a cluster of EC2 machines with more than enough memory (around