On Thu, 2009-04-23 at 17:56 +0900, Aaron Kimball wrote:
> Explicitly controlling your splits will be very challenging. Taking the case
> where you have expensive (X) and cheap (C) objects to process, you may have
> a file where the records are lined up X C X C X C X X X X X C C C. In this
> case, you'll need to scan through the whole file and build splits such that
> the lengthy run of expensive objects is broken up into separate splits, but
> the run of cheap objects is consolidated. I'm suspicious that you can do
> this without scanning through the data (which is what often constitutes the
> bulk of the time in a mapreduce program).
I would also like the ability to stream the data and shuffle it into
buckets; when any bucket reaches a fixed cost (currently assessed as byte
size), it would be shipped as a task. In practice, in the Hadoop
architecture, this causes an extra level of I/O, since all the data must be
read into the shuffler and re-sorted. It also breaks the ability to run map
tasks on the systems hosting the data. It is, however, a subject I am still
thinking about.

> But how much data are you using? I would imagine that if you're operating
> at the scale where Hadoop makes sense, then the high- and low-cost objects
> will -- on average -- balance out and tasks will be roughly evenly
> proportioned.

True, dat. But it's still worth thinking about stream splitting, since the
theoretical complexity overhead is only an increased constant on a linear
term. Will get more into the architecture first.

S.
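
For concreteness, here is a rough sketch of the bucket-by-cost pass
described above. It is plain Java rather than the Hadoop
InputFormat/InputSplit machinery; the Record type, the per-record costs
and COST_THRESHOLD are all invented for illustration, and a real version
would still pay the extra read and lose data locality as noted.

    import java.util.ArrayList;
    import java.util.List;

    /**
     * Sketch of cost-based bucketing: records are streamed once and grouped
     * into buckets (candidate tasks). A bucket is closed as soon as its
     * accumulated cost reaches a threshold, so a run of expensive records is
     * broken into several buckets while a run of cheap records consolidates.
     */
    public class CostBucketer {

        /** Hypothetical record: an id plus a per-record processing cost. */
        static final class Record {
            final String id;
            final long cost;
            Record(String id, long cost) { this.id = id; this.cost = cost; }
        }

        /** Close a bucket once its accumulated cost reaches this value. */
        static final long COST_THRESHOLD = 10;

        /** Single pass over the records; O(n) with a small constant. */
        static List<List<Record>> bucketByCost(Iterable<Record> records) {
            List<List<Record>> buckets = new ArrayList<>();
            List<Record> current = new ArrayList<>();
            long currentCost = 0;

            for (Record r : records) {
                current.add(r);
                currentCost += r.cost;
                if (currentCost >= COST_THRESHOLD) {
                    // Bucket has reached the target cost: ship it as a task.
                    buckets.add(current);
                    current = new ArrayList<>();
                    currentCost = 0;
                }
            }
            if (!current.isEmpty()) {
                buckets.add(current);   // remainder becomes the final task
            }
            return buckets;
        }

        public static void main(String[] args) {
            // Aaron's layout X C X C X C X X X X X C C C, with X costing 5
            // and C costing 1 (made-up numbers).
            long[] costs = {5, 1, 5, 1, 5, 1, 5, 5, 5, 5, 5, 1, 1, 1};
            List<Record> records = new ArrayList<>();
            for (int i = 0; i < costs.length; i++) {
                records.add(new Record((costs[i] == 5 ? "X" : "C") + i, costs[i]));
            }

            List<List<Record>> buckets = bucketByCost(records);
            for (int i = 0; i < buckets.size(); i++) {
                long total = buckets.get(i).stream().mapToLong(r -> r.cost).sum();
                System.out.println("task " + i + ": " + buckets.get(i).size()
                        + " records, cost " + total);
            }
        }
    }

On that example the long run of X records is split into small tasks while
the trailing C records collapse into one, which is the desired behaviour,
and the pass stays a single linear scan, i.e. only a larger constant on the
linear term.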