I've just seen your email, Vinod. This is the behaviour that I'd expect, and it
is similar to other data integration tools; I will keep an eye out for it as a
long-term option.


On Fri, Sep 13, 2013 at 5:26 AM, Vinod Kumar Vavilapalli <vino...@apache.org
> wrote:

>
> Other than the short-term solutions that others have proposed, Apache Tez
> solves this exact problem. It can run M-M-R-R-R chains, multi-way mappers
> and reducers, and your own custom processors, all without persisting the
> intermediate outputs to HDFS.
>
> It works on top of YARN, though the first release of Tez is yet to happen.
>
> You can learn more about it here: http://tez.incubator.apache.org/
>
> HTH,
> +Vinod
>
> On Sep 12, 2013, at 6:36 AM, Adrian CAPDEFIER wrote:
>
> Howdy,
>
> My application requires 2 distinct processing steps (reducers) to be
> performed on the input data. The first operation changes the key values,
> and records that had different keys in step 1 can end up having the same
> key in step 2.
>
> The heavy lifting of the operation is in step 1, and step 2 only combines
> records where keys were changed.
>
> In short the overview is:
> Sequential file -> Step 1 -> Step 2 -> Output.
>
>
> To implement this in Hadoop, it seems that I need to create a separate job
> for each step.
>
> I assumed there would be some sort of job management under Hadoop to
> link Job 1 and 2, but the only thing I could find was related to job
> scheduling, with nothing on how to synchronize the input/output of the
> linked jobs.
>
>
>
> The only crude solution that I can think of is to use a temporary file
> under HDFS, but even so I'm not sure if this will work.
>
> The overview of the process would be:
> Sequential Input (lines) => Job A[Mapper (key1, value1) => ChainReducer
> (key2, value2)] => Temporary file => Job B[Mapper (key2, value2) => Reducer
> (key2, value3)] => output.
>
> Is there a better way to pass the output from Job A as input to Job B
> (e.g. using network streams or some built-in Java classes that don't do
> disk I/O)?
>
>
>
> The temporary file solution will work in a single node configuration, but
> I'm not sure about an MPP config.
>
> Let's say Job A runs on nodes 0 and 1 and Job B runs on nodes 2 and 3, or
> both jobs run on all 4 nodes. Will HDFS be able to automagically
> redistribute the records between nodes, or does this need to be coded
> somehow?
>
>
>
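
For illustration, the two-step flow from the original question (Step 1 changes
the keys, Step 2 combines records that now share a key) can be sketched in
plain Java. This is only an in-memory simulation, not Hadoop code: the class
name, the method names, the sample records and the re-keying rule (keep the
first character of the key) are all hypothetical, and groupByKey merely stands
in for the shuffle that each MapReduce job performs between its map and reduce
phases.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical in-memory model of "Step 1 -> Step 2"; not a Hadoop API.
final class TwoStepPipeline {

    // Convenience factory for a (key, value) record.
    static Map.Entry<String, Integer> record(String key, int value) {
        return new AbstractMap.SimpleEntry<>(key, value);
    }

    // Stand-in for a shuffle: collect all values that share a key.
    static Map<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> records) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> r : records) {
            groups.computeIfAbsent(r.getKey(), k -> new ArrayList<>()).add(r.getValue());
        }
        return groups;
    }

    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    // Step 1: reduce each key1 group, then emit the result under a new key2.
    static List<Map.Entry<String, Integer>> step1(List<Map.Entry<String, Integer>> input,
                                                  Function<String, String> rekey) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> g : groupByKey(input).entrySet()) {
            out.add(record(rekey.apply(g.getKey()), sum(g.getValue())));
        }
        return out;
    }

    // Step 2: regroup by key2 and combine records whose keys now collide.
    static Map<String, Integer> step2(List<Map.Entry<String, Integer>> intermediate) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : groupByKey(intermediate).entrySet()) {
            result.put(g.getKey(), sum(g.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> input = Arrays.asList(
                record("a1", 1), record("a2", 2), record("a1", 3), record("b1", 4));
        // Hypothetical re-keying rule: "a1" and "a2" both become "a".
        Function<String, String> rekey = k -> k.substring(0, 1);
        System.out.println(step2(step1(input, rekey))); // prints {a=6, b=4}
    }
}
```

The second groupByKey call is the point of the sketch: once Step 1 has changed
the keys, the records have to be regrouped before Step 2 can combine them. In
a two-job Hadoop setup that regrouping is done by the second job's own
shuffle, which routes every record with the same key to the same reducer
regardless of which node produced it.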
