Aaron Kimball wrote:
Explicitly controlling your splits will be very challenging. Taking the case
where you have expensive (X) and cheap (C) objects to process, you may have
a file where the records are lined up X C X C X C X X X X X C C C. In this
case, you'll need to scan through the whole file and build splits such that
the lengthy run of expensive objects is broken up into separate splits, but
the run of cheap objects is consolidated.
^ I'm not concerned about variation in per-object processing time; there isn't enough of it to worry about. I'm primarily concerned with having enough map tasks to utilize all nodes (and cores).
In general, I would just dodge the problem by making sure your splits are
relatively small compared to the size of your input data.
^ This sounds like the right approach. I'll still need to extend SequenceFileInputFormat, but it should be relatively simple to put a fixed number of objects into each split; a rough sketch follows.
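
Something like this, against the old mapred API (the class name, the records-per-split config key, and the default of 10000 are all placeholders I made up). Since a SequenceFile record reader syncs forward from the split start and keeps reading past the split end until the next sync marker, cutting splits at arbitrary byte positions is safe; the per-split record counts just come out approximate:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.util.ReflectionUtils;

public class FixedCountSequenceFileInputFormat<K, V>
    extends SequenceFileInputFormat<K, V> {

  // Placeholder key; pick whatever name fits your job config.
  public static final String RECORDS_PER_SPLIT = "my.records.per.split";

  @Override
  public InputSplit[] getSplits(JobConf conf, int numSplits) throws IOException {
    int perSplit = conf.getInt(RECORDS_PER_SPLIT, 10000);
    List<InputSplit> splits = new ArrayList<InputSplit>();

    for (FileStatus status : listStatus(conf)) {
      Path path = status.getPath();
      FileSystem fs = path.getFileSystem(conf);
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
      try {
        Writable key = (Writable)
            ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        long splitStart = 0;
        int count = 0;
        // Read keys only; we just need record boundaries, not values.
        while (reader.next(key)) {
          if (++count >= perSplit) {
            long pos = reader.getPosition();
            // No host hints here; you could look up block locations
            // if data locality matters for your cluster.
            splits.add(new FileSplit(path, splitStart, pos - splitStart,
                new String[0]));
            splitStart = pos;
            count = 0;
          }
        }
        if (splitStart < status.getLen()) {
          splits.add(new FileSplit(path, splitStart,
              status.getLen() - splitStart, new String[0]));
        }
      } finally {
        reader.close();
      }
    }
    return splits.toArray(new InputSplit[splits.size()]);
  }
}

The one-time scan of each file on the client is the price for near-exact counts; if that's too slow for large inputs, just lowering the split size (e.g. via mapred.min.split.size or the numSplits hint) gets most of the benefit without scanning.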

thanks
