Thanks Aaron, that really helps. I probably do need to control the number of splits. My input 'data' consists of Java objects, and their size (in bytes) doesn't necessarily reflect the amount of time needed for each map operation. I need to ensure that I have enough map tasks so that all CPUs are utilized and the job gets done in a reasonable amount of time. (Currently I'm creating multiple input files and making them unsplittable, but subclassing SequenceFileInputFormat to explicitly control the number of splits sounds like a better approach.)
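As a rough sketch of what such a subclass would have to compute, the plain-Java helper below cuts a file of a given length into exactly n contiguous byte ranges. It uses no Hadoop classes; the class and method names (ExactSplitPlanner, plan) are hypothetical. In a real subclass, each {start, length} pair would presumably be wrapped in a split object inside an overridden getSplits(), which is an assumption rather than tested code.

```java
// Hypothetical helper: divide [0, fileBytes) into exactly n contiguous ranges.
final class ExactSplitPlanner {

    /** Returns n {start, length} pairs covering the file with no gaps or overlaps. */
    static long[][] plan(long fileBytes, int n) {
        if (n <= 0 || fileBytes < 0) {
            throw new IllegalArgumentException("need n > 0 and fileBytes >= 0");
        }
        long[][] ranges = new long[n][2];
        long base = fileBytes / n;   // every range gets at least this many bytes
        long extra = fileBytes % n;  // the first 'extra' ranges get one more byte
        long start = 0;
        for (int i = 0; i < n; i++) {
            long length = base + (i < extra ? 1 : 0);
            ranges[i][0] = start;
            ranges[i][1] = length;
            start += length;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Print the ranges for a 10-byte file cut into 3 pieces.
        for (long[] r : plan(10, 3)) {
            System.out.println("start=" + r[0] + " length=" + r[1]);
        }
    }
}
```

Two caveats: equal byte ranges still do not mean equal map time, since record cost is unrelated to record size here, and a range narrower than the spacing between the SequenceFile's sync markers may end up containing no records at all.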

Barnet

Aaron Kimball wrote:
Yes, there can be more than one InputSplit per SequenceFile. The file will
be split more-or-less along 64 MB boundaries. (The actual "edges" of the
splits will be adjusted to hit the next block of key-value pairs, so they
might be a few kilobytes off.)

The SequenceFileInputFormat regards mapred.map.tasks (conf.setNumMapTasks())
as a hint, not a set-in-stone metric. (The number of reduce tasks, though,
is always 100% user-controlled.) If you need exact control over the number
of map tasks, you'll need to subclass it and modify this behavior. That
having been said -- are you sure you actually need to precisely control this
value? Or is it enough to know how many splits were created?
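To illustrate why the requested map count is only a hint, here is a simplified plain-Java model of the split-size arithmetic that file-based input formats in the old mapred API are understood to use. It is a sketch, not the Hadoop source: the class name SplitSizeModel, the parameter names, and the 10% slack constant are assumptions made for illustration, and no Hadoop classes are involved.

```java
// Simplified model of file split sizing; not actual Hadoop code.
final class SplitSizeModel {

    // Assumed slack: the final split may be up to 10% larger than the others.
    static final double SPLIT_SLOP = 1.1;

    /** The requested map count only sets a goal size, which is capped by the block size. */
    static long splitSize(long totalBytes, int requestedMaps, long minSize, long blockSize) {
        long goalSize = totalBytes / Math.max(requestedMaps, 1);
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    /** Counts how many splits one file of fileBytes yields for a given split size. */
    static int countSplits(long fileBytes, long splitSize) {
        int splits = 0;
        long remaining = fileBytes;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            remaining -= splitSize;
            splits++;
        }
        if (remaining > 0) {
            splits++; // the tail becomes the last split
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileBytes = 1024 * mb; // a 1 GB file with 64 MB blocks
        for (int requested : new int[] {4, 32}) {
            long size = splitSize(fileBytes, requested, 1, 64 * mb);
            System.out.println(requested + " requested -> " + countSplits(fileBytes, size));
        }
    }
}
```

Under this model, asking for 4 maps on a 1 GB file with 64 MB blocks still yields 16 splits, because the goal size is capped at the block size, while asking for 32 yields 32. Raising the minimum split size is the other lever: a 256 MB minimum gives 4 splits for the same file.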

- Aaron

On Sun, Apr 19, 2009 at 7:23 PM, Barnet Wagman <b.wag...@comcast.net> wrote:

Suppose a SequenceFile (containing keys and values that are BytesWritable)
is used as input. Will it be divided into InputSplits? If so, what
criteria are used for splitting?

I'm interested in this because I need to control the number of map tasks
used, which (if I understand it correctly), is equal to the number of
InputSplits.

thanks,

bw


