Sharing data in a mapper for all values

2011-10-31 Thread Arko Provo Mukherjee
Hello, I have a situation where I am reading a big file from HDFS and then comparing all the data in that file with each input to the mapper. Since my mapper is trying to read the entire HDFS file for each of its inputs, the amount of data it has to read and keep in memory is becoming…

Re: Sharing data in a mapper for all values

2011-10-31 Thread Joey Echeverria
Yes. You can read the file in the configure() (old API) or setup() (new API) method and save the data in an instance variable, which will then be accessible to every call to map(). -Joey
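The pattern Joey describes can be sketched in plain Java as below. The class and method names mirror the Hadoop Mapper API, but this is a self-contained illustration, not a real org.apache.hadoop.mapreduce.Mapper; in a real job, setup(Context) would open the HDFS file (e.g. via FileSystem.open) instead of taking the lines directly.

```java
import java.util.*;

// Load the reference file once in setup() and keep it in a field,
// so every map() call reuses the in-memory table instead of
// re-reading the file from HDFS.
class LookupMapper {
    private final Map<String, String> reference = new HashMap<>();

    // Hypothetical stand-in for setup(Context): parse tab-separated
    // "key<TAB>value" lines into the lookup table.
    void setup(List<String> referenceLines) {
        for (String line : referenceLines) {
            String[] parts = line.split("\t", 2);
            if (parts.length == 2) reference.put(parts[0], parts[1]);
        }
    }

    // Called once per input record; the table is already in memory.
    String map(String key) {
        return reference.getOrDefault(key, "MISSING");
    }
}
```

This works as long as the reference file fits in a single mapper's heap; Arko's original problem is precisely that it may not, which is what the join-based suggestions below address.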

Re: Sharing data in a mapper for all values

2011-10-31 Thread Harsh J
Arko, have you considered using Hive/Pig for this kind of functionality instead? There are also ways to do this with reducers, given proper group/sort comparators in place (we'd need a better understanding of what you're trying to achieve here before we can give out a solution), but you can use the…

Re: Sharing data in a mapper for all values

2011-10-31 Thread Anthony Urso
Arko: If you have keyed both the big blob and the input files similarly, and you can write both streams to HDFS sorted by key, then you can reformulate this whole process as a map-side join. That will be a lot simpler and more efficient than scanning the whole blob for each input record. Also, do…
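The core of Anthony's map-side join is a single linear merge over two key-sorted streams; Hadoop's CompositeInputFormat does the equivalent over sorted, identically partitioned files. A minimal sketch of the merge itself, assuming unique keys per side for brevity:

```java
import java.util.*;

// Merge-join of two streams already sorted by key (element [0]);
// each record is {key, payload}. One pass, no rescanning of the
// big blob per input record.
class MergeJoinSketch {
    static List<String> join(List<String[]> left, List<String[]> right) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.size() && j < right.size()) {
            int cmp = left.get(i)[0].compareTo(right.get(j)[0]);
            if (cmp < 0) {
                i++;                          // left key has no match yet
            } else if (cmp > 0) {
                j++;                          // right key has no match yet
            } else {
                out.add(left.get(i)[0] + ":" + left.get(i)[1]
                        + "|" + right.get(j)[1]);
                i++; j++;                     // assumes unique keys per side
            }
        }
        return out;
    }
}
```

The memory cost is O(1) per mapper regardless of blob size, which is exactly why this beats reloading the whole reference file.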