If you have to, you can reach through the class loaders and find the
instance of your singleton class that has the data loaded. It is awkward,
and I haven't done this in Java since the late 90s, but it did work the
last time I tried it.
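
A rough sketch of what that lookup might look like (not from the original
thread; it assumes the singleton exposes a static getSingleton() accessor as
in the example quoted below, and it can only see loaders on the parent chain
of the current thread's context class loader):

import java.lang.reflect.Method;

public class SingletonFinder {

    // Walk up the classloader chain and ask the first loader that can see
    // the class for its singleton.  The class name and accessor name here
    // are assumptions for illustration.
    public static Object findSingleton(String className) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        while (loader != null) {
            try {
                // initialize=false so forName itself doesn't trigger the
                // expensive static initialization
                Class<?> clazz = Class.forName(className, false, loader);
                Method getter = clazz.getMethod("getSingleton");
                return getter.invoke(null);  // static method, no receiver
            } catch (Exception e) {
                // not visible from this loader; keep climbing
            }
            loader = loader.getParent();
        }
        return null;
    }
}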


On Sun, Mar 1, 2009 at 11:21 AM, Scott Carey <sc...@richrelevance.com> wrote:

> You could create a singleton class and reference the dictionary stuff in
> that.  You would probably want this separate from other classes so as to
> control exactly what data is held on to for a long time and what is not.
>
> class Singleton {
>
>     private static final Singleton _instance = new Singleton();
>
>     private Singleton() {
>         // ... initialize (load dictionaries) here; only ever called once
>         // per classloader or JVM
>     }
>
>     public static Singleton getSingleton() {
>         return _instance;
>     }
> }
>
> in mapper:
>
> Singleton dictionary = Singleton.getSingleton();
>
> This assumes that each mapper doesn't live in its own classloader space
> (which would make even static singletons not shareable), and has the
> drawback that, once initialized, the memory associated with the singleton
> won't go away until the JVM or classloader that hosts it dies.
>
> I have not tried this myself, and do not know the exact classloader
> semantics used in the new 'persistent' task JVMs.  They could have a
> classloader per job and dispose of it when the job is complete -- in which
> case data could persist only within a job, not across jobs.  Or there could
> be one permanent persistent classloader, or one per task.  Each arrangement
> will behave differently with respect to statics like the example above.
>
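
One way to find out which arrangement you actually get (a sketch, not from
the original thread; the mapper's type signature and the logging are
illustrative) is to log the identity of the mapper instance and of its
classloader from configure().  If the classloader hash stays the same across
tasks in one JVM, statics such as the singleton above survive between tasks:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ClassLoaderProbeMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, LongWritable, Text> {

    public void configure(JobConf conf) {
        // Same classloader hash across tasks in one JVM => statics are
        // shared; a different hash per task => each task re-initializes.
        System.err.println("mapper instance "
                + System.identityHashCode(this)
                + " loaded by classloader "
                + System.identityHashCode(getClass().getClassLoader()));
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<LongWritable, Text> output,
                    Reporter reporter) throws IOException {
        output.collect(key, value);  // pass records through unchanged
    }
}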
> ________________________________________
> From: Stuart White [stuart.whi...@gmail.com]
> Sent: Saturday, February 28, 2009 6:06 AM
> To: core-user@hadoop.apache.org
> Subject: MapReduce jobs with expensive initialization
>
> I have a mapreduce job that requires expensive initialization (loading
> of some large dictionaries before processing).
>
> I want to avoid executing this initialization more than necessary.
>
> I understand that I need to call setNumTasksToExecutePerJvm(-1) to
> force mapreduce to reuse JVMs when executing tasks.
>
> The way I've been performing my initialization is: in my mapper, I
> override MapReduceBase#configure, read my params from the JobConf, and
> load my dictionaries.
>
> It appears, from the tests I've run, that even though
> NumTasksToExecutePerJvm is set to -1, new instances of my Mapper class
> are being created for each task, and therefore I'm still re-running
> this expensive initialization for each task.
>
> So, my question is: how can I avoid re-executing this expensive
> initialization per-task?  Should I move my initialization code out of
> my mapper class and into my "main" class?  If so, how do I pass
> references to the loaded dictionaries from my main class to my mapper?
>
> Thanks!
>
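
Not part of the original thread, but here is a minimal sketch of how the two
emails fit together: keep the expensive data in a static field (the singleton
idea above), guard the load inside configure() (the hook already being used),
and rely on JVM reuse so the load happens once per task JVM.  The class name,
the "my.dictionary.path" parameter, and the dictionary type are illustrative
assumptions, not something either poster specified.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class DictionaryMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    // Shared by every mapper instance created in the same JVM/classloader,
    // so the expensive load runs at most once even though configure() is
    // called once per task.
    private static Map<String, String> dictionary;

    public void configure(JobConf conf) {
        synchronized (DictionaryMapper.class) {
            if (dictionary == null) {
                // Hypothetical parameter name; read whatever your job sets.
                String dictPath = conf.get("my.dictionary.path");
                dictionary = loadDictionary(dictPath);  // expensive, done once
            }
        }
    }

    private static Map<String, String> loadDictionary(String path) {
        // Placeholder for the real dictionary-loading code.
        return new HashMap<String, String>();
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // ... look words up in 'dictionary' and emit results here ...
    }
}

With JVM reuse enabled in the driver (conf.setNumTasksToExecutePerJvm(-1)),
new mapper instances are still created for each task, but they all see the
already-populated static field, so the dictionaries are loaded once per task
JVM -- subject to the classloader caveats discussed above.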
