Re: Questions on MultithreadedMapper

2010-04-28 Thread Jim Twensky
Thanks Ted. Is it correct to assume that all class members defined
inside my Mapper are visible to all of the threads, so I should pay
careful attention and take synchronization into account when accessing
those objects?

Jim

On Tue, Apr 27, 2010 at 11:50 PM, Ted Yu yuzhih...@gmail.com wrote:
 Looking through MultithreadedMapRunner, map() seems to be the only method
 called by executorService:
        MultithreadedMapRunner.this.mapper.map(key, value, output,
 reporter);


 On Tue, Apr 27, 2010 at 3:46 PM, Jim Twensky jim.twen...@gmail.com wrote:

 Hi,

 I've decided to refactor some of my Hadoop jobs and implement them
 using MultithreadedMapper.class but I got puzzled because of some
 unexpected error messages at run time.
 Here are some relevant settings regarding my Hadoop cluster:

 mapred.tasktracker.map.tasks.maximum = 1
 mapred.tasktracker.reduce.tasks.maximum = 1
 mapred.job.reuse.jvm.num.tasks = -1
 mapred.map.multithreadedrunner.threads = 4

 I'd like to know how threads are used to run the map task in a single
 JVM (Correct me if this is wrong). Suppose I've got a sample Mapper
 class as such:

 class Mapper ... {

 MyObject A;
 static MyObject B;

 setup() {
   Configuration conf = context.getConfiguration();
   A.initialize(c);
   B.initialize(c);
 }

 map() {...}

 cleanup() {...}

 Does each thread run all three of setup(), map(), cleanup() methods ?

 -OR-

 Are setup() and cleanup() run once per task (and thus per JVM
 according to my settings) and so map is the only multithreaded
 function?
 Also, are the objects A and B shared among different threads or does
 each trade have its own copy of them? My initial guess was that each
 thread would have a separate copy of A, and B would be shared among
 the 4 threads running on the same box since it is defined as static,
 but it appears to me that this assumption is not correct and A seems
 to be shared.

 Thanks,
 Jim




Questions on MultithreadedMapper

2010-04-27 Thread Jim Twensky
Hi,

I've decided to refactor some of my Hadoop jobs and implement them
using MultithreadedMapper.class but I got puzzled because of some
unexpected error messages at run time.
Here are some relevant settings regarding my Hadoop cluster:

mapred.tasktracker.map.tasks.maximum = 1
mapred.tasktracker.reduce.tasks.maximum = 1
mapred.job.reuse.jvm.num.tasks = -1
mapred.map.multithreadedrunner.threads = 4

I'd like to know how threads are used to run the map task in a single
JVM (Correct me if this is wrong). Suppose I've got a sample Mapper
class as such:

class Mapper ... {

MyObject A;
static MyObject B;

setup() {
   Configuration conf = context.getConfiguration();
   A.initialize(c);
   B.initialize(c);
}

map() {...}

cleanup() {...}

Does each thread run all three of setup(), map(), cleanup() methods ?

-OR-

Are setup() and cleanup() run once per task (and thus per JVM
according to my settings) and so map is the only multithreaded
function?
Also, are the objects A and B shared among different threads or does
each trade have its own copy of them? My initial guess was that each
thread would have a separate copy of A, and B would be shared among
the 4 threads running on the same box since it is defined as static,
but it appears to me that this assumption is not correct and A seems
to be shared.

Thanks,
Jim