Say I have two MapReduce processes, A and B. The two are algorithmically
dissimilar, so they have to be implemented as separate MapReduce processes.
The output of A is used as the input of B, so A has to run first. However,
B doesn't need to take all of A's output as input, only a partition of it.
So in theory A and B could run at the same time in a producer/consumer
arrangement, where B would start to work as soon as A had produced some
output, before A had completed. Obviously, this could be a big
parallelization win.
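To make the arrangement concrete, here is a toy sketch (plain Python
threads and a queue, not Hadoop code; all names are my own invention) of
what I mean: a stage B that consumes each partition of A's output as soon
as it is produced, rather than waiting for A to finish.

```python
# Toy illustration of the desired producer/consumer arrangement.
# Not Hadoop code: "stage_a" and "stage_b" are hypothetical stand-ins
# for the two MapReduce jobs described above.
import queue
import threading

SENTINEL = None  # marks "A has finished producing output"

def stage_a(out_q):
    # Stand-in for job A: emits its output one partition at a time.
    for part in range(4):
        out_q.put([x * 2 for x in range(part * 3, part * 3 + 3)])
    out_q.put(SENTINEL)

def stage_b(in_q, results):
    # Stand-in for job B: starts consuming each partition as soon as
    # it appears, before A has necessarily produced the rest.
    while (part := in_q.get()) is not SENTINEL:
        results.append(sum(part))

q = queue.Queue()
results = []
a = threading.Thread(target=stage_a, args=(q,))
b = threading.Thread(target=stage_b, args=(q, results))
a.start()
b.start()
a.join()
b.join()
print(results)  # one aggregate per partition of A's output
```

The point is only the shape of the dataflow: B's work overlaps A's,
partition by partition, which is exactly what plain MapReduce chaining
doesn't seem to allow.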

Is this possible in MapReduce? I know at the most basic level it is not
(there is no synchronization mechanism that allows the same HDFS
directory to be used for both input and output), but is there some
abstraction layer on top that allows it? I've been digging around, and I
think the answer is "No", but I want to be sure.

More specifically, the only abstraction layer I'm aware of that chains
together MapReduce processes is Cascading, and I think it requires the
reduce steps to be serialized, but again I'm not sure because I've only
read the documentation and haven't actually played with it.
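For contrast, here is the serialized chaining I understand such layers to
impose, again as a toy Python sketch with hypothetical function names: B
cannot see any input until A has fully finished, even though B only needs
one partition of A's output.

```python
# Toy illustration of serialized chaining (the status quo, as I
# understand it). "run_job_a" and "run_job_b" are hypothetical
# stand-ins, not a real framework API.
def run_job_a(inputs):
    # Stand-in for job A: squares each value and partitions the
    # output by key parity, like a reducer writing per-partition files.
    parts = {0: [], 1: []}
    for x in inputs:
        parts[x % 2].append(x * x)
    return parts

def run_job_b(partition):
    # Stand-in for job B: consumes only a single partition of A's output.
    return max(partition)

# Serialized chaining: A must run to completion before B starts,
# even though B reads just one of A's partitions.
out_a = run_job_a(range(10))
result = run_job_b(out_a[0])  # B uses only the "even" partition
print(result)
```

If some layer could instead hand `out_a[0]` to B as soon as that one
partition was complete, that would be the parallelization win I'm asking
about.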
