If you have to, you can reach through all of the class loaders and find the instance of your singleton class that has the data loaded. It is awkward, and I haven't done this in Java since the late '90s. It did work the last time I did it.
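Roughly, the lookup could look like the sketch below. It is not Hadoop code and I have not run it inside a task JVM: LoaderChainLookup, findSingleton and the nested Dictionary stand-in are all made-up names, and it assumes the loader that holds the loaded singleton is somewhere on the parent chain of the loader you start from. A sibling loader would not be found this way.

```java
import java.lang.reflect.Method;

// Hypothetical helper, not part of Hadoop: walks a class loader's parent chain,
// looks a class up by name in each loader, and calls its static no-arg accessor.
final class LoaderChainLookup {

    // Stand-in for the real singleton; the class and accessor names are illustrative.
    static final class Dictionary {
        private static final Dictionary INSTANCE = new Dictionary();
        private Dictionary() { }
        public static Dictionary getSingleton() { return INSTANCE; }
    }

    private LoaderChainLookup() { }

    // Returns Object on purpose: a class defined by a different loader is a
    // different runtime type, so casting it to the local class would fail.
    static Object findSingleton(String className, String accessor, ClassLoader start) {
        for (ClassLoader cl = start; cl != null; cl = cl.getParent()) {
            try {
                Class<?> c = Class.forName(className, true, cl);
                Method m = c.getDeclaredMethod(accessor);
                m.setAccessible(true); // the class may not be public
                return m.invoke(null); // static method, so no receiver
            } catch (ReflectiveOperationException e) {
                // Not visible from this loader, or no such accessor: try the parent.
            }
        }
        return null; // nothing found anywhere on the chain
    }
}
```

Whatever comes back has to be used through reflection, or through an interface that a shared parent loader defines, which is a large part of why this is awkward.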
On Sun, Mar 1, 2009 at 11:21 AM, Scott Carey <sc...@richrelevance.com> wrote:

> You could create a singleton class and reference the dictionary stuff in
> that. You would probably want this separate from other classes so as to
> control exactly what data is held on to for a long time and what is not.
>
> class Singleton {
>
>   private static final Singleton _instance = new Singleton();
>
>   private Singleton() {
>     // ... initialize here; runs only once per classloader or JVM
>   }
>
>   public static Singleton getSingleton() {
>     return _instance;
>   }
> }
>
> in mapper:
>
> Singleton dictionary = Singleton.getSingleton();
>
> This assumes that each mapper doesn't live in its own classloader space
> (which would make even static singletons not shareable), and has the
> drawback that once initialized, the memory associated with the singleton
> won't go away until the JVM or classloader that hosts it dies.
>
> I have not tried this myself, and do not know the exact classloader
> semantics used in the new 'persistent' task JVMs. They could have a
> classloader per job, and dispose of those when the job is complete -- though
> then it is impossible to persist data across jobs, only within them. Or
> there could be one permanent persisted classloader, or one per task. All
> will behave differently with respect to statics like the above example.
>
> ________________________________________
> From: Stuart White [stuart.whi...@gmail.com]
> Sent: Saturday, February 28, 2009 6:06 AM
> To: core-user@hadoop.apache.org
> Subject: MapReduce jobs with expensive initialization
>
> I have a mapreduce job that requires expensive initialization (loading
> of some large dictionaries before processing).
>
> I want to avoid executing this initialization more than necessary.
>
> I understand that I need to call setNumTasksToExecutePerJvm(-1) to
> force mapreduce to reuse JVMs when executing tasks.
>
> How I've been performing my initialization is, in my mapper, I
> override MapReduceBase#configure, read my params from the JobConf, and
> load my dictionaries.
>
> It appears, from the tests I've run, that even though
> NumTasksToExecutePerJvm is set to -1, new instances of my Mapper class
> are being created for each task, and therefore I'm still re-running
> this expensive initialization for each task.
>
> So, my question is: how can I avoid re-executing this expensive
> initialization per-task? Should I move my initialization code out of
> my mapper class and into my "main" class? If so, how do I pass
> references to the loaded dictionaries from my main class to my mapper?
>
> Thanks!
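For the question quoted above, here is a minimal sketch of the singleton idea with the load deferred until first use, so it can take its input from the job configuration. DictionaryHolder, its LOADS counter and the comma-separated "source" string are placeholders I made up for illustration, and the real load would read your dictionary files instead. The same caveat applies as in the quoted reply: this only saves work if successive tasks in a reused JVM share a class loader.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical holder for the expensive dictionaries; all names are illustrative.
final class DictionaryHolder {

    // Counts how often the expensive load ran (kept only to demonstrate the effect).
    static final AtomicInteger LOADS = new AtomicInteger();

    private static Set<String> words; // guarded by the class lock

    private DictionaryHolder() { }

    // Loads on the first call; later calls in the same class loader reuse the
    // result and ignore the argument.
    static synchronized Set<String> get(String source) {
        if (words == null) {
            words = Collections.unmodifiableSet(load(source));
        }
        return words;
    }

    // Placeholder for the real, expensive load.
    private static Set<String> load(String source) {
        LOADS.incrementAndGet();
        Set<String> result = new HashSet<>();
        for (String w : source.split(",")) {
            result.add(w.trim());
        }
        return result;
    }
}
```

In the mapper, the override of MapReduceBase#configure would then call something like DictionaryHolder.get(job.get("my.dictionary.source")), where the property name is made up, instead of loading the dictionaries itself. Each new Mapper instance still runs configure, but only the first call per class loader pays for the load.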