It really comes down to the following: in Job A, set mapred.output.dir to some directory X; in Job B, set mapred.input.dir to the same directory X.
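For example, here is a minimal driver sketch using the new org.apache.hadoop.mapreduce API. The identity Mapper/Reducer and the LongWritable/Text types are just placeholders for your real step 1 and step 2 classes; FileOutputFormat.setOutputPath() and FileInputFormat.addInputPath() are what set mapred.output.dir and mapred.input.dir under the hood:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class TwoStepDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input = new Path(args[0]);
    Path dirX = new Path(args[1]);    // the shared directory X
    Path output = new Path(args[2]);

    // Job A: reads the original input, writes its reducer output into X.
    Job jobA = new Job(conf, "step 1");
    jobA.setJarByClass(TwoStepDriver.class);
    jobA.setMapperClass(Mapper.class);     // placeholder: your step 1 mapper
    jobA.setReducerClass(Reducer.class);   // placeholder: your step 1 reducer
    jobA.setOutputKeyClass(LongWritable.class);
    jobA.setOutputValueClass(Text.class);
    // SequenceFiles preserve the key/value types between the two jobs.
    jobA.setOutputFormatClass(SequenceFileOutputFormat.class);
    FileInputFormat.addInputPath(jobA, input);
    FileOutputFormat.setOutputPath(jobA, dirX);   // mapred.output.dir = X
    if (!jobA.waitForCompletion(true)) System.exit(1);

    // Job B: reads X back with the matching input format.
    Job jobB = new Job(conf, "step 2");
    jobB.setJarByClass(TwoStepDriver.class);
    jobB.setMapperClass(Mapper.class);     // placeholder: your step 2 mapper
    jobB.setReducerClass(Reducer.class);   // placeholder: your step 2 reducer
    jobB.setOutputKeyClass(LongWritable.class);
    jobB.setOutputValueClass(Text.class);
    jobB.setInputFormatClass(SequenceFileInputFormat.class);
    FileInputFormat.addInputPath(jobB, dirX);     // mapred.input.dir = X
    FileOutputFormat.setOutputPath(jobB, output);
    System.exit(jobB.waitForCompletion(true) ? 0 : 1);
  }
}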
For Job A, call context.write() as you normally would, and each reducer will create an output file in mapred.output.dir. Then in Job B each of those files will correspond to a mapper. Of course, you need to make sure your input and output formats, as well as your input and output key/value types, match up between the two jobs.

If you are using HDFS, which it seems you are, the directories specified can be HDFS directories. In that case, with a replication factor of 3, each of these output files will exist on 3 nodes, and Hadoop and HDFS will do the work to make the mappers in the second job as data- or rack-local as possible.

On Thu, Sep 12, 2013 at 12:35 PM, Adrian CAPDEFIER <chivas314...@gmail.com> wrote:

> Thank you, Chris. I will look at Cascading and Pig, but for starters I'd
> prefer to keep everything, if possible, as close to the Hadoop libraries.
>
> I am sure I am overlooking something basic, as repartitioning is a fairly
> common operation in MPP environments.
>
>
> On Thu, Sep 12, 2013 at 2:39 PM, Chris Curtin <curtin.ch...@gmail.com> wrote:
>
>> If you want to stay in Java, look at Cascading. Pig is also helpful. I
>> think there are others (Spring Integration, maybe?), but I'm not familiar
>> enough with them to make a recommendation.
>>
>> Note that with Cascading and Pig you don't write 'map reduce'; you write
>> logic and they map it to the various mapper/reducer steps automatically.
>>
>> Hope this helps,
>>
>> Chris
>>
>>
>> On Thu, Sep 12, 2013 at 9:36 AM, Adrian CAPDEFIER <chivas314...@gmail.com> wrote:
>>
>>> Howdy,
>>>
>>> My application requires 2 distinct processing steps (reducers) to be
>>> performed on the input data. The first operation changes the key values,
>>> and records that had different keys in step 1 can end up having the same
>>> key in step 2.
>>>
>>> The heavy lifting of the operation is in step 1; step 2 only combines
>>> records whose keys were changed.
>>>
>>> In short, the overview is:
>>> Sequential file -> Step 1 -> Step 2 -> Output.
>>>
>>>
>>> To implement this in Hadoop, it seems that I need to create a separate
>>> job for each step.
>>>
>>> Now, I assumed there would be some sort of job management under Hadoop
>>> to link Job 1 and Job 2, but the only thing I could find was related to
>>> job scheduling, and nothing on how to synchronize the input/output of
>>> the linked jobs.
>>>
>>>
>>> The only crude solution that I can think of is to use a temporary file
>>> under HDFS, but even so I'm not sure if this will work.
>>>
>>> The overview of the process would be:
>>> Sequential Input (lines) => Job A[Mapper (key1, value1) => ChainReducer
>>> (key2, value2)] => Temporary file => Job B[Mapper (key2, value2) =>
>>> Reducer (key2, value3)] => Output.
>>>
>>> Is there a better way to pass the output from Job A as input to Job B
>>> (e.g. using network streams or some built-in Java classes that don't do
>>> disk I/O)?
>>>
>>>
>>> The temporary file solution will work in a single-node configuration,
>>> but I'm not sure about an MPP config.
>>>
>>> Let's say Job A runs on nodes 0 and 1 and Job B runs on nodes 2 and 3,
>>> or both jobs run on all 4 nodes - will HDFS automagically redistribute
>>> the records between nodes, or does this need to be coded somehow?
>>>
>>
>>
>
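P.S. On the "job management to link Job 1 and 2" question: if you'd rather express the Job A -> Job B dependency explicitly instead of calling waitForCompletion() yourself, newer Hadoop releases ship a small scheduler for exactly this, JobControl/ControlledJob under org.apache.hadoop.mapreduce.lib.jobcontrol (older releases have an old-API equivalent in org.apache.hadoop.mapred.jobcontrol). A rough fragment, assuming jobA and jobB are configured as in the sketch above:

import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

ControlledJob cJobA = new ControlledJob(jobA, null);
ControlledJob cJobB = new ControlledJob(jobB, null);
cJobB.addDependingJob(cJobA);   // B is held back until A succeeds

JobControl control = new JobControl("step1-step2");
control.addJob(cJobA);
control.addJob(cJobB);

// JobControl is a Runnable that polls job states, so run it in its own thread.
Thread t = new Thread(control);
t.start();
while (!control.allFinished()) {
  Thread.sleep(1000);
}
control.stop();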