Re: chaining (the output of) jobs/ reducers

2013-09-17 Thread Adrian CAPDEFIER
Thanks Bryan. This is great stuff! On Thu, Sep 12, 2013 at 8:49 PM, Bryan Beaudreault bbeaudrea...@hubspot.com wrote: Hey Adrian, To clarify, the replication happens on *write*. So as you write output from the reducer of Job A, you are writing into hdfs. Part of that write path is

Re: chaining (the output of) jobs/ reducers

2013-09-17 Thread Adrian CAPDEFIER
I've just seen your email, Vinod. This is the behaviour that I'd expect and similar to other data integration tools; I will keep an eye out for it as a long term option. On Fri, Sep 13, 2013 at 5:26 AM, Vinod Kumar Vavilapalli vino...@apache.org wrote: Other than the short term solutions

Re: chaining (the output of) jobs/ reducers

2013-09-13 Thread Bryan Beaudreault
Hey Adrian, To clarify, the replication happens on *write*. So as you write output from the reducer of Job A, you are writing into hdfs. Part of that write path is replicating the data to 2 additional hosts in the cluster (local + 2; this is configured by the dfs.replication configuration value).
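Not part of the original message, but as a minimal sketch: dfs.replication is the standard HDFS property Bryan refers to. It normally lives in hdfs-site.xml, and it can also be overridden per job on the Configuration, e.g.:

    // Minimal sketch, assuming the standard dfs.replication property;
    // normally set cluster-wide in hdfs-site.xml, shown here per job.
    import org.apache.hadoop.conf.Configuration;

    public class ReplicationSetting {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Every block written by Job A's reducers is stored on this many
            // datanodes (the writer's local node plus additional replicas).
            conf.setInt("dfs.replication", 3);
            System.out.println("dfs.replication = " + conf.getInt("dfs.replication", 3));
        }
    }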

Re: chaining (the output of) jobs/ reducers

2013-09-13 Thread Vinod Kumar Vavilapalli
Other than the short term solutions that others have proposed, Apache Tez solves this exact problem. It can run M-M-R-R-R chains, multi-way mappers and reducers, and your own custom processors - all without persisting the intermediate outputs to HDFS. It works on top of YARN, though the first

chaining (the output of) jobs/ reducers

2013-09-12 Thread Adrian CAPDEFIER
Howdy, My application requires 2 distinct processing steps (reducers) to be performed on the input data. The first operation changes the key values, and records that had different keys in step 1 can end up having the same key in step 2. The heavy lifting of the operation is in step 1
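Not from the original post, but a hypothetical sketch of the re-keying this describes: the second job's mapper reads the first job's output and re-emits each record under the step-2 key, so records with different step-1 keys can meet in the same step-2 reducer. The deriveStep2Key() helper, the tab-separated record layout and the Text/Text types are assumptions for illustration only.

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class RekeyMapper extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text step1Key, Text value, Context context)
                throws IOException, InterruptedException {
            // Compute the new grouping key for step 2 from the record itself;
            // the shuffle then repartitions the data across reducers.
            Text step2Key = deriveStep2Key(value);
            context.write(step2Key, value);
        }

        private Text deriveStep2Key(Text value) {
            // Placeholder logic: group on the first field of the record.
            String[] fields = value.toString().split("\t", 2);
            return new Text(fields[0]);
        }
    }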

Re: chaining (the output of) jobs/ reducers

2013-09-12 Thread Chris Curtin
If you want to stay in Java look at Cascading. Pig is also helpful. I think there are others (Spring Integration, maybe?) but I'm not familiar with them enough to make a recommendation. Note that with Cascading and Pig you don't write 'map reduce'; you write logic and they map it to the various

Re: chaining (the output of) jobs/ reducers

2013-09-12 Thread Adrian CAPDEFIER
Thank you, Chris. I will look at Cascading and Pig, but for starters I'd prefer to keep everything as close to the hadoop libraries as possible. I am sure I am overlooking something basic, as repartitioning is a fairly common operation in MPP environments. On Thu, Sep 12, 2013 at 2:39 PM,

Re: chaining (the output of) jobs/ reducers

2013-09-12 Thread Shahab Yunus
The temporary file solution will work in a single-node configuration, but I'm not sure about an MPP config. Let's say Job A runs on nodes 0 and 1 and Job B runs on nodes 2 and 3, or both jobs run on all 4 nodes - will HDFS be able to redistribute the records automagically between nodes, or does

Re: chaining (the output of) jobs/ reducers

2013-09-12 Thread Bryan Beaudreault
It really comes down to the following: In Job A set mapred.output.dir to some directory X. In Job B set mapred.input.dir to the same directory X. For Job A, do context.write() as normal, and each reducer will create an output file in mapred.output.dir. Then in Job B each of those will
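A minimal sketch of that directory hand-off (not from the original message), using the new org.apache.hadoop.mapreduce API. The FileOutputFormat/FileInputFormat calls set the output/input directory properties Bryan mentions (the exact property name depends on the API version); the helper name is hypothetical.

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class DirectoryHandOff {
        /** Wire Job A's output directory to Job B's input directory. */
        public static void chain(Job jobA, Job jobB, Path directoryX) throws IOException {
            // Each reducer of Job A writes a part-r-NNNNN file into directory X.
            FileOutputFormat.setOutputPath(jobA, directoryX);
            // Job B then reads every part file in X as its mapper input.
            FileInputFormat.addInputPath(jobB, directoryX);
        }
    }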

Re: chaining (the output of) jobs/ reducers

2013-09-12 Thread Adrian CAPDEFIER
Thanks Bryan. Yes, I am using hadoop + hdfs. If I understand your point, hadoop tries to start the mapping processes on nodes where the data is local, and if that's not possible, then it is hdfs that replicates the data to the mapper nodes? I expected to have to set this up in the code and I

Re: chaining (the output of) jobs/ reducers

2013-09-12 Thread Venkata K Pisupat
Cascading would be a good option in case you have a complex flow. However, in your case, you are only trying to chain two jobs. I would suggest you follow these steps. 1. The output directory of Job1 would be set as the input directory for Job2. 2. Launch Job1 using the new API. In launcher
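A hedged sketch of those launcher steps, assuming the new (org.apache.hadoop.mapreduce) API on Hadoop 2: Job1's output directory doubles as Job2's input directory, and Job2 only starts after Job1 completes successfully. The identity Mapper/Reducer defaults and the paths are placeholders; a real driver would plug in the actual step-1 and step-2 classes.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class TwoStepLauncher {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path input = new Path(args[0]);
            Path intermediate = new Path(args[1]);  // Job1 output = Job2 input
            Path output = new Path(args[2]);

            // Step 1: the heavy-lifting job. A real driver would set its own
            // Mapper, Reducer and output key/value classes here; the defaults
            // make this an identity (pass-through) job.
            Job job1 = Job.getInstance(conf, "step 1");
            job1.setJarByClass(TwoStepLauncher.class);
            FileInputFormat.addInputPath(job1, input);
            FileOutputFormat.setOutputPath(job1, intermediate);
            if (!job1.waitForCompletion(true)) {
                System.exit(1);   // do not start step 2 if step 1 failed
            }

            // Step 2: reads step 1's output directory and re-keys/aggregates it.
            Job job2 = Job.getInstance(conf, "step 2");
            job2.setJarByClass(TwoStepLauncher.class);
            FileInputFormat.addInputPath(job2, intermediate);
            FileOutputFormat.setOutputPath(job2, output);
            System.exit(job2.waitForCompletion(true) ? 0 : 1);
        }
    }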