You could create a singleton class and reference the dictionaries through it. You would probably want to keep this separate from other classes, so that you control exactly which data is held on to for a long time and which is not.

    class Singleton {
        private static final Singleton _instance = new Singleton();

        private Singleton() {
            // ... initialize here; only ever called once per classloader or JVM
        }

        public static Singleton getSingleton() {
            return _instance;
        }
    }

In the mapper:

    Singleton dictionary = Singleton.getSingleton();

This assumes that each mapper doesn't live in its own classloader space (which would make even static singletons not shareable). It also has the drawback that, once initialized, the memory associated with the singleton won't go away until the JVM or classloader that hosts it dies.

I have not tried this myself, and do not know the exact classloader semantics used in the new 'persistent' task JVMs. They could have a classloader per job and dispose of it when the job is complete, though then data could be persisted only within a job, not across jobs. Or there could be one permanent persisted classloader, or one per task. All will behave differently with respect to statics like the above example.

________________________________________
From: Stuart White [stuart.whi...@gmail.com]
Sent: Saturday, February 28, 2009 6:06 AM
To: core-user@hadoop.apache.org
Subject: MapReduce jobs with expensive initialization

I have a mapreduce job that requires expensive initialization (loading of some large dictionaries before processing). I want to avoid executing this initialization more than necessary.

I understand that I need to call setNumTasksToExecutePerJvm with -1 to force mapreduce to reuse JVMs when executing tasks.

How I've been performing my initialization is, in my mapper, I override MapReduceBase#configure, read my params from the JobConf, and load my dictionaries.

It appears, from the tests I've run, that even though NumTasksToExecutePerJvm is set to -1, new instances of my Mapper class are being created for each task, and therefore I'm still re-running this expensive initialization for each task.

So, my question is: how can I avoid re-executing this expensive initialization per-task?
Should I move my initialization code out of my mapper class and into my "main" class? If so, how do I pass references to the loaded dictionaries from my main class to my mapper? Thanks!
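
Since the question reads its parameters from the JobConf inside configure, a lazily initialized variant of the singleton idea may fit better than a static initializer, which runs before any parameters are available. The following is only a minimal sketch under that assumption: DictionaryHolder, loadCount, and the Map<String, String> dictionary type are hypothetical names chosen for illustration, and the load method is a stand-in for the real (expensive) dictionary read. Whether the cached value actually survives from one task to the next still depends on the classloader behaviour discussed above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical holder: loads a dictionary at most once per classloader. */
final class DictionaryHolder {
    // Counts real loads, only so the caching can be observed in this sketch.
    static final AtomicInteger loadCount = new AtomicInteger();

    // Cached dictionary; access is guarded by the class lock (synchronized get).
    private static Map<String, String> dictionary;

    private DictionaryHolder() {}

    /** Returns the cached dictionary, loading it on the first call only. */
    static synchronized Map<String, String> get(String source) {
        if (dictionary == null) {
            dictionary = load(source);
        }
        return dictionary;
    }

    // Stand-in for the expensive load; a real job would read the data named by source.
    private static Map<String, String> load(String source) {
        loadCount.incrementAndGet();
        Map<String, String> m = new HashMap<>();
        m.put("source", source);
        return m;
    }
}
```

A mapper's configure method would then call DictionaryHolder.get(...) with whatever location it reads from the JobConf, instead of loading the dictionaries itself. Note that in this sketch the first caller wins: later calls return the cached dictionary even if they pass a different source.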