Arko: If you have keyed both the big blob and the input files similarly, and you can output both streams to HDFS sorted by key, then you can reformulate this whole process as a map-side join. It will be a lot simpler and more efficient than scanning the whole blob for each input.
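For example, with the old mapred API something along these lines should do it (untested sketch; the paths, class names, and the choice of KeyValueTextInputFormat are just placeholders, and it assumes both inputs are already sorted by key and partitioned the same way):

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.join.CompositeInputFormat;
import org.apache.hadoop.mapred.join.TupleWritable;

public class BlobJoin {

  // Emits one record per key that appears in both inputs.
  public static class JoinMapper extends MapReduceBase
      implements Mapper<Text, TupleWritable, Text, Text> {
    public void map(Text key, TupleWritable value,
                    OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      // value.get(0) is the record from the first path (the big blob),
      // value.get(1) is the matching record from the second path.
      output.collect(key, new Text(value.get(0) + "\t" + value.get(1)));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(BlobJoin.class);
    conf.setJobName("blob-join");

    // CompositeInputFormat performs the merge join as the splits are read,
    // provided both inputs are sorted by key and identically partitioned.
    conf.setInputFormat(CompositeInputFormat.class);
    conf.set("mapred.join.expr", CompositeInputFormat.compose(
        "inner", KeyValueTextInputFormat.class,
        new Path("/data/big_blob"), new Path("/data/mapper_input")));

    conf.setMapperClass(JoinMapper.class);
    conf.setNumReduceTasks(0);  // map-only job, no shuffle needed
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(conf, new Path("/data/joined"));

    JobClient.runJob(conf);
  }
}

That way each mapper only ever sees the matching slices of the two inputs instead of rescanning the whole blob.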
Also, do whatever loading you have to do in the constructor or the configure method to save a lot of repetition; a rough sketch follows below the quoted message.

Hope this helps,
Anthony

On Mon, Oct 31, 2011 at 4:45 PM, Arko Provo Mukherjee <arkoprovomukher...@gmail.com> wrote:
> Hello,
> I have a situation where I am reading a big file from HDFS and then
> comparing all the data in that file with each input to the mapper.
> Now since my mapper is trying to read the entire HDFS file for each of its
> inputs, the amount of data it has to read and keep in memory is
> becoming large (file size * number of inputs to the mapper).
> Can we somehow avoid this by loading the file once for each mapper, such that
> the mapper can reuse the loaded file for each of the inputs that it
> receives?
> If this can be done, then for each mapper I can just load the file once and
> then the mapper can use it for the entire slice of data that it receives.
> Thanks a lot in advance!
>
> Warm regards
> Arko
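Here is roughly what the configure() approach could look like with the old mapred API. It is only a sketch: the "compare.blob.path" config key, the class names, and the comparison inside map() are made up, since I don't know what your actual comparison does.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CompareMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private final List<String> blobLines = new ArrayList<String>();

  @Override
  public void configure(JobConf job) {
    // Runs once per map task, before any map() call, so the big file is
    // read once per task instead of once per input record.
    String blobPath = job.get("compare.blob.path");  // made-up config key
    try {
      FileSystem fs = FileSystem.get(job);
      BufferedReader in = new BufferedReader(
          new InputStreamReader(fs.open(new Path(blobPath))));
      String line;
      while ((line = in.readLine()) != null) {
        blobLines.add(line);
      }
      in.close();
    } catch (IOException e) {
      throw new RuntimeException("Failed to load " + blobPath, e);
    }
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // Compare each input record against the blob already held in memory;
    // replace this with whatever comparison you actually need.
    for (String blobLine : blobLines) {
      if (blobLine.contains(value.toString())) {
        output.collect(value, new Text(blobLine));
      }
    }
  }
}

Note this still keeps the whole blob in memory in every task, which is why the map-side join above scales better once the blob gets big.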