Hello,
I have a situation where I am reading a big file from HDFS and then
comparing all the data in that file with each input to the mapper.
Now since my mapper tries to read the entire HDFS file for each of its
inputs, the amount of data it has to read and keep in memory is
becoming very large.
Yes, you can read the file in the configure() (old API) or setup()
(new API) method. The data can be saved in a variable that will be
accessible to every call to map().
-Joey
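To make that concrete, here is a minimal sketch of loading the file once in setup() with the new (mapreduce) API. The "reference.file" configuration key, the paths, and the equality check in map() are illustrative assumptions, not part of the original thread:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CompareMapper extends Mapper<LongWritable, Text, Text, Text> {
    // Loaded once per task in setup(), then reused by every map() call.
    private final List<String> reference = new ArrayList<String>();

    @Override
    protected void setup(Context context) throws IOException {
        // Hypothetical job parameter, e.g. set at submit time with
        // conf.set("reference.file", "/data/big_file").
        Path file = new Path(context.getConfiguration().get("reference.file"));
        FileSystem fs = file.getFileSystem(context.getConfiguration());
        BufferedReader in =
            new BufferedReader(new InputStreamReader(fs.open(file)));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                reference.add(line);
            }
        } finally {
            in.close();
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Compare each input record against the preloaded data; the actual
        // comparison in the original question is unknown, so this just
        // emits exact matches.
        for (String ref : reference) {
            if (ref.equals(value.toString())) {
                context.write(value, new Text("match"));
            }
        }
    }
}

With the old (mapred) API, the same load would go in configure(JobConf) on a class implementing the Mapper interface.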
Arko,
Have you considered using Hive/Pig for this kind of functionality instead?
There are also ways to do this with reducers and the proper group/sort
comparators in place (we'd need a better understanding of what you're
trying to achieve here before we can suggest a solution), but you can
use the …
Arko:
If you have keyed both the big blob and the input files similarly, and
you can output both streams to HDFS sorted by key, then you can
reformulate this whole process as a map-side join. It will be a lot
simpler and more efficient than scanning the whole blob for each
input.
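Assuming both datasets are already keyed and sorted as described, a minimal sketch of the driver side using the old (mapred) API's CompositeInputFormat could look like the following; the input paths are hypothetical and the rest of the job setup (mapper class, output path, key types) is omitted:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;

public class MapSideJoinDriver {
    public static void main(String[] args) {
        JobConf conf = new JobConf(MapSideJoinDriver.class);
        // Both inputs must be sorted by key and partitioned identically;
        // "inner" keeps only keys present in both, "outer" keeps all keys.
        conf.set("mapred.join.expr", CompositeInputFormat.compose(
            "inner", KeyValueTextInputFormat.class,
            new Path("/data/big_blob"),       // hypothetical path
            new Path("/data/mapper_input"))); // hypothetical path
        conf.setInputFormat(CompositeInputFormat.class);
        // Each map() call then receives one TupleWritable per key, with
        // element 0 from the first path and element 1 from the second.
    }
}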
Also, do …