Thanks Aaron, that really helps. I probably do need to control the
number of splits. My input 'data' consists of Java objects, and their
size (in bytes) doesn't necessarily reflect the amount of time needed
for each map operation. I need to ensure that I have enough map tasks
so that all cpus are utilized and the job gets done in a reasonable
amount of time. (Currently I'm creating multiple input files and making
them unsplittable, but subclassing SequenceFileInputFormat to explicitly
control the number of splits sounds like a better approach.)
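
As a rough illustration of controlling the split count explicitly, the
sketch below divides a known number of records into a fixed number of
contiguous ranges of nearly equal length, independent of byte size. It is
plain Java, not Hadoop code: RecordRangeSketch and partition are
hypothetical names, and a real SequenceFileInputFormat subclass would still
have to translate each range into byte offsets that line up with the file's
sync points, which is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: splits a record count into
// a fixed number of contiguous, nearly equal ranges.
final class RecordRangeSketch {

    /**
     * Returns exactly numSplits half-open ranges {start, end} that together
     * cover [0, recordCount). The first (recordCount % numSplits) ranges get
     * one extra record. Ranges may be empty if recordCount < numSplits.
     */
    static List<long[]> partition(long recordCount, int numSplits) {
        if (recordCount < 0 || numSplits < 1) {
            throw new IllegalArgumentException("recordCount >= 0 and numSplits >= 1 required");
        }
        List<long[]> ranges = new ArrayList<>();
        long base = recordCount / numSplits;   // minimum records per range
        long extra = recordCount % numSplits;  // ranges that get one more
        long start = 0;
        for (int i = 0; i < numSplits; i++) {
            long length = base + (i < extra ? 1 : 0);
            ranges.add(new long[] {start, start + length});
            start += length;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Example: 10 records spread over 4 map tasks.
        for (long[] r : partition(10, 4)) {
            System.out.println("[" + r[0] + ", " + r[1] + ")");
        }
    }
}
```

Choosing numSplits as a small multiple of the number of available cpus
would keep them all busy even when per-record cost varies widely.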
Barnet
Aaron Kimball wrote:
Yes, there can be more than one InputSplit per SequenceFile. The file will
be split more-or-less along 64 MB boundaries. (The actual "edges" of the
splits will be adjusted to hit the next block of key-value pairs, so it
might be a few kilobytes off.)
The SequenceFileInputFormat regards mapred.map.tasks (conf.setNumMapTasks())
as a hint, not a set-in-stone metric. (The number of reduce tasks, though,
is always 100% user-controlled.) If you need exact control over the number
of map tasks, you'll need to subclass it and modify this behavior. That
having been said -- are you sure you actually need to precisely control this
value? Or is it enough to know how many splits were created?
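
To illustrate why the value is only a hint, here is a minimal sketch of the
split-size arithmetic, assuming the rule used by the old
org.apache.hadoop.mapred FileInputFormat: the requested number of map tasks
sets a goal size, which is capped by the block size and floored by a minimum
split size. SplitSizeSketch and estimateSplits are hypothetical names. The
sketch treats the input as a single file and ignores the small slack allowed
on the last split, so check the source of the Hadoop version in use before
relying on it.

```java
// Hypothetical, Hadoop-free sketch of how a requested map-task count
// turns into an actual number of splits.
final class SplitSizeSketch {

    /** Split size rule assumed here: max(minSize, min(goalSize, blockSize)). */
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    /** Estimated number of splits for one file of fileSize bytes. */
    static long estimateSplits(long fileSize, int requestedMapTasks,
                               long minSize, long blockSize) {
        // The requested task count only determines the goal size.
        long goalSize = fileSize / Math.max(1, requestedMapTasks);
        // Guard against a zero split size so the division below is safe.
        long splitSize = Math.max(1, computeSplitSize(goalSize, minSize, blockSize));
        return (fileSize + splitSize - 1) / splitSize; // ceiling division
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileSize = 640 * mb;
        long blockSize = 64 * mb;
        System.out.println(estimateSplits(fileSize, 2, 1, blockSize));
        System.out.println(estimateSplits(fileSize, 40, 1, blockSize));
    }
}
```

Under this rule, a 640 MB file with 64 MB blocks still yields 10 splits when
2 map tasks are requested, because the block size caps the split size, while
requesting 40 map tasks yields 40 splits of 16 MB each.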
- Aaron
On Sun, Apr 19, 2009 at 7:23 PM, Barnet Wagman <b.wag...@comcast.net> wrote:
Suppose a SequenceFile (containing keys and values that are BytesWritable)
is used as input. Will it be divided into InputSplits? If so, what
criteria are used for splitting?
I'm interested in this because I need to control the number of map tasks
used, which (if I understand it correctly) is equal to the number of
InputSplits.
thanks,
bw