That is quite doable. Typically, the way that you do this is to buffer the data either in memory or on local disk. Both work fine. You can munch on the data until the cows come home that way. Hadoop will still schedule your tasks and handle failures for you.
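To make that concrete, here is a minimal sketch of the buffering idea, assuming the new org.apache.hadoop.mapreduce API; processAll() is a made-up placeholder for your sequential logic, not anything from a real project:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BufferingMapper
    extends Mapper<LongWritable, Text, Text, NullWritable> {

  // Buffer the whole split in memory; spill to local disk instead if the
  // split is too large to hold.
  private final List<String> buffer = new ArrayList<String>();

  @Override
  protected void map(LongWritable key, Text value, Context context) {
    buffer.add(value.toString());
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    // All records for this split are now local; run the sequential logic
    // here. Hadoop still does scheduling, locality and restart-on-failure.
    for (String record : processAll(buffer)) {
      context.write(new Text(record), NullWritable.get());
    }
  }

  // Hypothetical stand-in for the sequential computation over the split.
  private List<String> processAll(List<String> records) {
    return records;
  }
}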
The downside is that you lose communication between chunks of your data. Sometimes that is fine. Sometimes it isn't. The specific case where it is just fine is where you have multiple map functions that need to be applied to individual input records. These can trivially be smashed together into a single map pass, and that is just what frameworks like Pig and Cascading do (a rough sketch of that fusion is below the quoted message). This doesn't help you if you want to have lots of communication or global summaries, but I think you know that.

On Thu, Jan 28, 2010 at 11:30 AM, Markus Weimer <[email protected]> wrote:
> In a way, I want a sequential program scheduled through hadoop. I will
> loose the parallelism, but I want to keep data locality, scheduling
> and restart-on-failure.

--
Ted Dunning, CTO
DeepDyve
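For the fusion case, a tiny illustration (not Pig or Cascading code, just a sketch; lowercase() and firstField() are made-up stand-ins for the independent map steps):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FusedMapper
    extends Mapper<LongWritable, Text, Text, NullWritable> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Both per-record transformations run in one map pass over the input
    // instead of two separate jobs.
    String record = firstField(lowercase(value.toString()));
    context.write(new Text(record), NullWritable.get());
  }

  private String lowercase(String s) {
    return s.toLowerCase();
  }

  private String firstField(String s) {
    int tab = s.indexOf('\t');
    return tab < 0 ? s : s.substring(0, tab);
  }
}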
