That is quite doable. Typically, the way that you do this is to buffer the data either in memory or on local disk. Both work fine. You can munch on the data until the cows come home that way. Hadoop will still schedule your tasks and handle failures for you.
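To make that concrete, here is a minimal sketch of the buffering idea, assuming the new org.apache.hadoop.mapreduce API; processAll() is a made-up placeholder for your sequential logic, not anything from a real project:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BufferingMapper
    extends Mapper<LongWritable, Text, Text, NullWritable> {

  // Buffer the whole split in memory; spill to local disk instead if the
  // split is too large to hold.
  private final List<String> buffer = new ArrayList<String>();

  @Override
  protected void map(LongWritable key, Text value, Context context) {
    buffer.add(value.toString());
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    // All records for this split are now local; run the sequential logic
    // here. Hadoop still does scheduling, locality and restart-on-failure.
    for (String record : processAll(buffer)) {
      context.write(new Text(record), NullWritable.get());
    }
  }

  // Hypothetical stand-in for the sequential computation over the split.
  private List<String> processAll(List<String> records) {
    return records;
  }
}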
The downside is that you lose communication between chunks of your data. Sometimes that is fine. Sometimes it isn't. The specific case where it is just fine is where you have multiple map functions that need to be applied to individual input records. These can trivially be smashed together into a single map pass, and that is just what frameworks like Pig and Cascading do (a rough sketch of that fusion is below the quoted message). This doesn't help you if you want to have lots of communication or global summaries, but I think you know that.

On Thu, Jan 28, 2010 at 11:30 AM, Markus Weimer <[email protected]> wrote:
> In a way, I want a sequential program scheduled through hadoop. I will
> loose the parallelism, but I want to keep data locality, scheduling
> and restart-on-failure.

--
Ted Dunning, CTO
DeepDyve
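For the fusion case, a tiny illustration (not Pig or Cascading code, just a sketch; lowercase() and firstField() are made-up stand-ins for the independent map steps):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FusedMapper
    extends Mapper<LongWritable, Text, Text, NullWritable> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Both per-record transformations run in one map pass over the input
    // instead of two separate jobs.
    String record = firstField(lowercase(value.toString()));
    context.write(new Text(record), NullWritable.get());
  }

  private String lowercase(String s) {
    return s.toLowerCase();
  }

  private String firstField(String s) {
    int tab = s.indexOf('\t');
    return tab < 0 ? s : s.substring(0, tab);
  }
}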
