RE: MapReduce jobs with expensive initialization

Scott Carey Sun, 01 Mar 2009 11:21:50 -0800

You could create a singleton class and hold the dictionary data in it.  
You would probably want to keep this separate from your other classes, so that 
you control exactly which data is held on to for a long time and which is not.


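class Singleton {

  // Created once, when the class is first initialized by its classloader.
  private static final Singleton _instance = new Singleton();

  private Singleton() {
    // ... initialize here, only ever called once per classloader or JVM
  }

  // Must be static so callers can reach it without already having an instance.
  public static Singleton getSingleton() {
    return _instance;
  }
}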

in mapper:

Singleton dictionary = Singleton.getSingleton();

This assumes that each mapper doesn't live in its own classloader space (which 
would make even static singletons not shareable). It also has the drawback that, 
once initialized, the memory associated with the singleton won't go away until 
the JVM or classloader that hosts it dies. 

I have not tried this myself, and do not know the exact classloader semantics 
used in the new 'persistent' task JVMs.  They could have a classloader per job 
and dispose of it when the job is complete, in which case data could persist 
within a job but not across jobs.  Or there could be one permanent persisted 
classloader, or one per task.  Each of these will behave differently with 
respect to statics like the above example.
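
To make the idea above concrete, here is a minimal, untested-on-Hadoop sketch 
that can be run on its own. It uses no Hadoop classes: DictionaryHolder, 
FakeMapper, LOAD_COUNT and the placeholder dictionary entry are all hypothetical 
names made up for illustration, and FakeMapper only stands in for a real Mapper 
whose configure() would fetch the shared instance. It also swaps the eager 
static field for the lazy "holder" idiom, so the load happens on first use. 
The same classloader caveat applies: the load runs once per classloader, not 
necessarily once per JVM.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

public class DictionaryHolder {

    // Counts how often the expensive load ran (demonstration only).
    static final AtomicInteger LOAD_COUNT = new AtomicInteger();

    private final Map<String, String> entries = new HashMap<String, String>();

    private DictionaryHolder() {
        // Stand-in for the expensive dictionary load.
        LOAD_COUNT.incrementAndGet();
        entries.put("hadoop", "elephant"); // placeholder data, not a real dictionary
    }

    // Holder idiom: the JVM initializes Lazy, and thus INSTANCE, on first
    // access, and class initialization is thread-safe.
    private static final class Lazy {
        static final DictionaryHolder INSTANCE = new DictionaryHolder();
    }

    public static DictionaryHolder getInstance() {
        return Lazy.INSTANCE;
    }

    public String lookup(String key) {
        return entries.get(key);
    }

    // Hypothetical stand-in for a mapper; a new one is created per task.
    static class FakeMapper {
        private DictionaryHolder dictionary;

        void configure() {
            // Cheap after the first call: just returns the shared instance.
            dictionary = DictionaryHolder.getInstance();
        }

        String map(String key) {
            return dictionary.lookup(key);
        }
    }

    public static void main(String[] args) {
        // Simulate three tasks running one after another in a reused JVM.
        for (int i = 0; i < 3; i++) {
            FakeMapper mapper = new FakeMapper();
            mapper.configure();
            System.out.println(mapper.map("hadoop"));
        }
        System.out.println("loads: " + LOAD_COUNT.get());
    }
}
```

If the tasks really do share a classloader, each new mapper instance pays only 
for the getInstance() call. If they do not, each classloader gets its own copy 
and the load repeats, which a counter like LOAD_COUNT (or a log line in the 
constructor) would make visible.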

________________________________________
From: Stuart White [stuart.whi...@gmail.com]
Sent: Saturday, February 28, 2009 6:06 AM
To: core-user@hadoop.apache.org
Subject: MapReduce jobs with expensive initialization

I have a mapreduce job that requires expensive initialization (loading
of some large dictionaries before processing).

I want to avoid executing this initialization more than necessary.

I understand that I need to call setNumTasksToExecutePerJvm with -1 to
force mapreduce to reuse JVMs when executing tasks.

So far I've been performing my initialization in my mapper: I
override MapReduceBase#configure, read my parameters from the JobConf,
and load my dictionaries.

It appears, from the tests I've run, that even though
NumTasksToExecutePerJvm is set to -1, new instances of my Mapper class
are being created for each task, and therefore I'm still re-running
this expensive initialization for each task.

So, my question is: how can I avoid re-executing this expensive
initialization per-task?  Should I move my initialization code out of
my mapper class and into my "main" class?  If so, how do I pass
references to the loaded dictionaries from my main class to my mapper?

Thanks!
