Aaron Kimball wrote:
Explicitly controlling your splits will be very challenging. Taking the case
where you have expensive (X) and cheap (C) objects to process, you may have
a file where the records are lined up X C X C X C X X X X X C C C. In this
case, you'll need to scan through the whole file and build splits such that
the lengthy run of expensive objects is broken up into separate splits, but
the run of cheap objects is consolidated.
^ I'm not concerned about variation in per-object processing time; there isn't enough of it to worry about. I'm primarily concerned with having enough map tasks to utilize all nodes (and cores).
In general, I would just dodge the problem by making sure your splits are
relatively small compared to the size of your input data.
^ This sounds like the right approach. I'll still need to extend SequenceFileInputFormat, but it should be relatively simple to put a fixed number of objects into each split; a rough sketch follows.
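
Something like this, against the old mapred API (the class name, the records-per-split config key, and the default of 10000 are all placeholders I made up). Since a SequenceFile record reader syncs forward from the split start and keeps reading past the split end until the next sync marker, cutting splits at arbitrary byte positions is safe; the per-split record counts just come out approximate:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.util.ReflectionUtils;

public class FixedCountSequenceFileInputFormat<K, V>
    extends SequenceFileInputFormat<K, V> {

  // Placeholder key; pick whatever name fits your job config.
  public static final String RECORDS_PER_SPLIT = "my.records.per.split";

  @Override
  public InputSplit[] getSplits(JobConf conf, int numSplits) throws IOException {
    int perSplit = conf.getInt(RECORDS_PER_SPLIT, 10000);
    List<InputSplit> splits = new ArrayList<InputSplit>();

    for (FileStatus status : listStatus(conf)) {
      Path path = status.getPath();
      FileSystem fs = path.getFileSystem(conf);
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
      try {
        Writable key = (Writable)
            ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        long splitStart = 0;
        int count = 0;
        // Read keys only; we just need record boundaries, not values.
        while (reader.next(key)) {
          if (++count >= perSplit) {
            long pos = reader.getPosition();
            // No host hints here; you could look up block locations
            // if data locality matters for your cluster.
            splits.add(new FileSplit(path, splitStart, pos - splitStart,
                new String[0]));
            splitStart = pos;
            count = 0;
          }
        }
        if (splitStart < status.getLen()) {
          splits.add(new FileSplit(path, splitStart,
              status.getLen() - splitStart, new String[0]));
        }
      } finally {
        reader.close();
      }
    }
    return splits.toArray(new InputSplit[splits.size()]);
  }
}

The one-time scan of each file on the client is the price for near-exact counts; if that's too slow for large inputs, just lowering the split size (e.g. via mapred.min.split.size or the numSplits hint) gets most of the benefit without scanning.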

thanks
