Thanks for that link, Prashant - very useful.
Two brief follow-up questions:
1) Having put data in the cache, I would like to be a good citizen by deleting
the data from the cache once I’ve finished - how do I do that?
2) Would it be simpler to pass the data as a value in the JobConf object?

There is currently no way to delete the data from the cache when you are done.
It is garbage-collected when the cache starts to fill up (in LRU order if you
are on a newer release). DistributedCache.addCacheFile modifies the JobConf
behind the scenes for you. If you want to dig into

I have a series of mappers to which I would like to pass data using the
distributed cache mechanism. At the moment, I am using HDFS to pass the data,
but this seems wasteful to me, since they are all reading the same data.
Is there a piece of example code that shows how data files can be

I believe you want to ship data to each node in your cluster before the MR
job begins, so the mappers can access files local to their machine. The Hadoop
tutorial on YDN has some good info on this.
http://developer.yahoo.com/hadoop/tutorial/module5.html#auxdata
-Prashant Kommireddi
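
For readers following the thread, here is a minimal sketch of the mapper-side half of this pattern: reading a file that is already local to the machine into an in-memory lookup table. It uses only the Java standard library. The class name SideDataLoader and the tab-separated key/value file layout are assumptions made for illustration; they do not come from the thread or from Hadoop. In a real job the driver would typically register the file with DistributedCache.addCacheFile(uri, conf), and the mapper would obtain the local copies from DistributedCache.getLocalCacheFiles(conf), which returns Hadoop Path objects rather than the java.nio.file.Path used here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not part of Hadoop): loads side data from a local file.
class SideDataLoader {

    // Assumed file layout: one "key<TAB>value" record per line.
    // Lines without a tab are skipped; only the first tab splits the line.
    static Map<String, String> load(Path localFile) throws IOException {
        Map<String, String> lookup = new HashMap<>();
        try (BufferedReader reader =
                 Files.newBufferedReader(localFile, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int tab = line.indexOf('\t');
                if (tab < 0) {
                    continue; // malformed record, ignore it
                }
                lookup.put(line.substring(0, tab), line.substring(tab + 1));
            }
        }
        return lookup;
    }

    // Small demo that stands in for a mapper's setup step.
    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("side-data", ".tsv");
        Files.write(tmp, Arrays.asList("a\t1", "b\t2"), StandardCharsets.UTF_8);
        Map<String, String> lookup = load(tmp);
        System.out.println("value for a: " + lookup.get("a"));
        Files.delete(tmp);
    }
}
```

Loading the table once per task, rather than once per record, is the usual reason for doing this in the mapper's setup step; whether the data fits in memory is something to check for your own files.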
On Fri, Nov 25, 2011 at