>> 2) Assume I am running cascading (chained) MR modules. In this case I feel
>> there is a huge overhead when the output of MR1 is written back to HDFS and
>> then read from there as the input of MR2. Can this be avoided? (Maybe store
>> it in memory without hitting HDFS and the NameNode.) Please let me know if
>> there is some means of doing this, because it would increase the
>> efficiency of chained MR to a great extent.

> Not possible to pipeline in Apache Hadoop. Have a look at HOP (Hadoop
> Online Prototype), which has some of what you seek.

It is possible under some circumstances.  With ChainMapper and ChainReducer,
if the key/value signatures of the inputs and outputs of all mappers and
reducers are the same, then the only disk I/O is at the endpoints.  Note,
however, that there is _no_ buffering at all (just a single-element queue
between each pair), so all maps and reduces in each ChainMapper or
ChainReducer chain have to reside in memory simultaneously.
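
To make the data flow concrete, here is a toy sketch in plain Java.  It is
not Hadoop code and not how ChainMapper is implemented; the names
ChainSketch, Stage, SPLIT and UPPER are made up for illustration, and the
single-element queue is simplified to a direct call.  It only shows the
idea that each record is handed to the next stage as soon as it is
produced, with nothing stored in between.  In Hadoop itself the chain
would be assembled with ChainMapper.addMapper and ChainReducer.setReducer
rather than by hand like this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Consumer;

// Toy model only: NOT Hadoop code. Each stage receives one record and may
// emit zero or more records to the next stage through a callback.
final class ChainSketch {

    /** One link in the chain: consumes a record, emits to `next`. */
    interface Stage<I, O> {
        void process(I record, Consumer<O> next);
    }

    /** Hypothetical first stage: split a line into words. */
    static final Stage<String, String> SPLIT = (line, next) -> {
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                next.accept(word);
            }
        }
    };

    /** Hypothetical second stage: upper-case each word. */
    static final Stage<String, String> UPPER =
        (word, next) -> next.accept(word.toUpperCase(Locale.ROOT));

    /** Runs both stages back to back; only the final output is collected. */
    static List<String> run(List<String> lines) {
        List<String> sink = new ArrayList<>();
        for (String line : lines) {
            // Each word goes straight from SPLIT to UPPER; nothing is
            // written out or buffered between the two stages.
            SPLIT.process(line, word -> UPPER.process(word, sink::add));
        }
        return sink;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a b", "c"))); // prints [A, B, C]
    }
}
```

Both stages are live at the same time for every record, which is the
in-memory cost mentioned above.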

I haven't ever used them, btw, so I don't know how useful or efficient they
are.  I just came across them while working on another feature that turns
out to be fundamentally incompatible with them...

Greg
