I am currently working on a RecordReader for a custom binary file
format that stores time series data, and was wondering how to design
the InputFormat/RecordReader pair as efficiently as possible. Reading
through:
 
http://wiki.apache.org/hadoop/HadoopMapReduce
 
gave me a lot of hints about how the various classes work together in
order to read any type of file. I was looking at how TextInputFormat
uses LineRecordReader to feed individual lines to the mapper. My
question is, what is a good heuristic for choosing how much data to
hand to each record? With the stock LineRecordReader, each call to the
map() function only gets to work with a single line, which leads me to
believe that we want to give each call very little work. Currently I'm
looking at either sending each map() call a single point of data (10
bytes), which seems small, or sending it a whole block of data (around
819 points at 10 bytes each, i.e. 8,190 bytes). I'm leaning towards
sending the block.
 
These numbers come from a legacy file format I have to deal with (for
now), so I'm just trying to make the best tradeoff possible for the
short term until I get some basic stuff rolling, at which point I can
suggest a better storage format, or just start converting the groups of
stored points into something more fitting for the platform. I
understand that the InputFormat is not really trying to make much
meaning out of the data; it mainly helps get the correct bytes out of
the file based on the FileSplit boundaries. Another question I have is:
with a pretty much stock install, how big does each FileSplit generally
end up being?
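
For completeness, here is the companion InputFormat I'm picturing
(again, the class names are mine, not from the Hadoop libraries). My
current understanding, which someone can hopefully confirm or correct,
is that with a stock install FileInputFormat produces roughly one split
per HDFS block (64 MB by default), unless the minimum/maximum split
size settings are changed. For now I've marked the files as
non-splitable, so each split is a whole file:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class PointBlockInputFormat
    extends FileInputFormat<LongWritable, BytesWritable> {

  @Override
  public RecordReader<LongWritable, BytesWritable>
      createRecordReader(InputSplit split, TaskAttemptContext context) {
    return new PointBlockRecordReader();
  }

  // Keep each file in a single split until the reader can realign itself
  // on arbitrary split boundaries; otherwise FileInputFormat would cut
  // splits at roughly the HDFS block size.
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;
  }
}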
 
Josh Patterson
TVA
