Thanks Bryan. This is great stuff!
On Thu, Sep 12, 2013 at 8:49 PM, Bryan Beaudreault bbeaudrea...@hubspot.com
wrote:
Hey Adrian,
To clarify, the replication happens on *write*. So as you write output
from the reducer of Job A, you are writing into HDFS. Part of that write
path is …
I've just seen your email, Vinod. This is the behaviour I'd expect, and it is
similar to other data integration tools; I will keep an eye out for it as a
long-term option.
On Fri, Sep 13, 2013 at 5:26 AM, Vinod Kumar Vavilapalli vino...@apache.org
wrote:
Other than the short-term solutions …
Hey Adrian,
To clarify, the replication happens on *write*. So as you write output
from the reducer of Job A, you are writing into HDFS. Part of that write
path is replicating the data to 2 additional hosts in the cluster (local +
2; this is controlled by the dfs.replication configuration value).
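As a sketch, the replication factor Bryan mentions is set cluster-wide in hdfs-site.xml (the value 3 below is the usual default — "local + 2"; a per-file factor can also be passed when the file is created):

```xml
<!-- hdfs-site.xml: each block written to HDFS is stored on this many DataNodes -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```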
Other than the short-term solutions that others have proposed, Apache Tez
solves this exact problem. It can run M-M-R-R-R chains, and multi-way mappers
and reducers, and your own custom processors - all without persisting the
intermediate outputs to HDFS.
It works on top of YARN, though the first …
Howdy,
My application requires 2 distinct processing steps (reducers) to be
performed on the input data. The first operation changes the key values,
and records that had different keys in step 1 can end up having the same
key in step 2.
The heavy lifting of the operation is in step 1 …
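A toy illustration of the re-keying described above, in plain Java with no Hadoop dependencies (the record shapes and the region lookup are hypothetical, just to show how records with distinct step-1 keys can collapse onto the same step-2 key):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class RekeyDemo {
    public static void main(String[] args) {
        // Step-1 output: (key, value) pairs keyed by, say, a user id.
        List<Map.Entry<String, Integer>> step1 = List.of(
            Map.entry("user-1", 10),
            Map.entry("user-2", 5),
            Map.entry("user-3", 7));

        // Step 2 re-keys each record (here via a hypothetical region lookup),
        // so records with different step-1 keys may share a step-2 key.
        Map<String, String> region = Map.of(
            "user-1", "eu", "user-2", "us", "user-3", "eu");

        // Group by the new key and reduce (sum) the values per group.
        Map<String, Integer> step2 = step1.stream()
            .collect(Collectors.groupingBy(
                e -> region.get(e.getKey()),                   // new key
                TreeMap::new,                                  // stable ordering
                Collectors.summingInt(Map.Entry::getValue)));  // reduce

        System.out.println(step2); // prints {eu=17, us=5}
    }
}
```

In MapReduce terms, the second grouping is exactly why a second shuffle (and hence a second job, or a framework that chains reducers) is needed.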
If you want to stay in Java, look at Cascading. Pig is also helpful. I think
there are others (Spring Integration, maybe?) but I'm not familiar enough with
them to make a recommendation.
Note that with Cascading and Pig you don't write 'map reduce'; you write
logic and they map it to the various …
Thank you, Chris. I will look at Cascading and Pig, but for starters I'd
prefer, if possible, to keep everything as close to the Hadoop libraries as
I can.
I am sure I am overlooking something basic, as repartitioning is a fairly
common operation in MPP environments.
On Thu, Sep 12, 2013 at 2:39 PM, …
The temporary-file solution will work in a single-node configuration, but
I'm not sure about an MPP config.
Let's say Job A runs on nodes 0 and 1 and Job B runs on nodes 2 and 3, or
both jobs run on all 4 nodes - will HDFS be able to redistribute the
records between nodes automagically, or does …
It really comes down to the following:
In Job A set mapred.output.dir to some directory X.
In Job B set mapred.input.dir to the same directory X.
For Job A, do context.write() as normal, and each reducer will create an
output file in mapred.output.dir. Then in Job B, each of those will …
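Bryan's recipe can be sketched as a single driver class (assuming the new org.apache.hadoop.mapreduce API; the intermediate path, job names, and the elided mapper/reducer setup are placeholders — this needs a Hadoop installation to actually run):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path intermediate = new Path("/tmp/jobA-out"); // "directory X"

        Job jobA = Job.getInstance(conf, "job A");
        jobA.setJarByClass(ChainDriver.class);
        // ...set Job A's mapper, reducer, and output key/value classes here...
        FileInputFormat.addInputPath(jobA, new Path(args[0]));
        FileOutputFormat.setOutputPath(jobA, intermediate);   // mapred.output.dir
        if (!jobA.waitForCompletion(true)) System.exit(1);    // run A to completion

        Job jobB = Job.getInstance(conf, "job B");
        jobB.setJarByClass(ChainDriver.class);
        // ...set Job B's mapper, reducer, and output key/value classes here...
        FileInputFormat.addInputPath(jobB, intermediate);     // mapred.input.dir
        FileOutputFormat.setOutputPath(jobB, new Path(args[1]));
        System.exit(jobB.waitForCompletion(true) ? 0 : 1);
    }
}
```

Job B's InputFormat will pick up each part-r-* file that Job A's reducers wrote into the intermediate directory.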
Thanks Bryan.
Yes, I am using Hadoop + HDFS.
If I understand your point, Hadoop tries to start the mapping processes on
nodes where the data is local, and if that's not possible, then it is HDFS
that replicates the data to the mapper nodes?
I expected to have to set this up in the code, and I …
Cascading would be a good option in case you have a complex flow. However, in
your case you are trying to chain two jobs only. I would suggest you follow
these steps.
1. The output directory of Job1 would be set as the input directory for Job2.
2. Launch Job1 using the new API. In the launcher …
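For sequencing the two launches, the new API also ships a JobControl helper (org.apache.hadoop.mapreduce.lib.jobcontrol) that starts Job2 only after Job1 succeeds. A minimal sketch, with the per-job mapper/reducer setup elided (this is one plausible reading of step 2, and it needs a Hadoop installation to run):

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

public class Launcher {
    public static void run(Job job1, Job job2) throws Exception {
        ControlledJob c1 = new ControlledJob(job1.getConfiguration());
        c1.setJob(job1);
        ControlledJob c2 = new ControlledJob(job2.getConfiguration());
        c2.setJob(job2);
        c2.addDependingJob(c1); // Job2 runs only after Job1 succeeds

        JobControl control = new JobControl("two-step-chain");
        control.addJob(c1);
        control.addJob(c2);

        Thread runner = new Thread(control); // JobControl implements Runnable
        runner.start();
        while (!control.allFinished()) {
            Thread.sleep(1000); // poll until both jobs are done
        }
        control.stop();
    }
}
```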