On Thu, 2009-04-23 at 17:56 +0900, Aaron Kimball wrote:
> Explicitly controlling your splits will be very challenging. Taking the case
> where you have expensive (X) and cheap (C) objects to process, you may have
> a file where the records are lined up X C X C X C X X X X X C C C. In this
> case, you'll need to scan through the whole file and build splits such that
> the lengthy run of expensive objects is broken up into separate splits, but
> the run of cheap objects is consolidated. I'm suspicious that you can do
> this without scanning through the data (which is what often constitutes the
> bulk of the time in a mapreduce program).
I would also like the ability to stream the data and shuffle it into
buckets; when any bucket reaches a fixed cost (currently assessed as byte
size), it would be shipped as a task. In practice, in the Hadoop
architecture, this causes an extra level of I/O, since all the data must be
read into the shuffler and re-sorted. It also breaks the ability to run map
tasks on the systems hosting the data. It is, however, a subject I am still
thinking about.

> But how much data are you using? I would imagine that if you're operating
> at the scale where Hadoop makes sense, then the high- and low-cost objects
> will -- on average -- balance out and tasks will be roughly evenly
> proportioned.

True, dat. But it's still worth thinking about stream splitting, since the
theoretical complexity overhead is only an increased constant on a linear
term. Will get more into the architecture first.

S.
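
For concreteness, here is a rough sketch of the bucket-by-cost pass
described above. It is plain Java rather than the Hadoop
InputFormat/InputSplit machinery; the Record type, the per-record costs
and COST_THRESHOLD are all invented for illustration, and a real version
would still pay the extra read and lose data locality as noted.

    import java.util.ArrayList;
    import java.util.List;

    /**
     * Sketch of cost-based bucketing: records are streamed once and grouped
     * into buckets (candidate tasks). A bucket is closed as soon as its
     * accumulated cost reaches a threshold, so a run of expensive records is
     * broken into several buckets while a run of cheap records consolidates.
     */
    public class CostBucketer {

        /** Hypothetical record: an id plus a per-record processing cost. */
        static final class Record {
            final String id;
            final long cost;
            Record(String id, long cost) { this.id = id; this.cost = cost; }
        }

        /** Close a bucket once its accumulated cost reaches this value. */
        static final long COST_THRESHOLD = 10;

        /** Single pass over the records; O(n) with a small constant. */
        static List<List<Record>> bucketByCost(Iterable<Record> records) {
            List<List<Record>> buckets = new ArrayList<>();
            List<Record> current = new ArrayList<>();
            long currentCost = 0;

            for (Record r : records) {
                current.add(r);
                currentCost += r.cost;
                if (currentCost >= COST_THRESHOLD) {
                    // Bucket has reached the target cost: ship it as a task.
                    buckets.add(current);
                    current = new ArrayList<>();
                    currentCost = 0;
                }
            }
            if (!current.isEmpty()) {
                buckets.add(current);   // remainder becomes the final task
            }
            return buckets;
        }

        public static void main(String[] args) {
            // Aaron's layout X C X C X C X X X X X C C C, with X costing 5
            // and C costing 1 (made-up numbers).
            long[] costs = {5, 1, 5, 1, 5, 1, 5, 5, 5, 5, 5, 1, 1, 1};
            List<Record> records = new ArrayList<>();
            for (int i = 0; i < costs.length; i++) {
                records.add(new Record((costs[i] == 5 ? "X" : "C") + i, costs[i]));
            }

            List<List<Record>> buckets = bucketByCost(records);
            for (int i = 0; i < buckets.size(); i++) {
                long total = buckets.get(i).stream().mapToLong(r -> r.cost).sum();
                System.out.println("task " + i + ": " + buckets.get(i).size()
                        + " records, cost " + total);
            }
        }
    }

On that example the long run of X records is split into small tasks while
the trailing C records collapse into one, which is the desired behaviour,
and the pass stays a single linear scan, i.e. only a larger constant on the
linear term.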