Re: Saving Intermediate Results from the Mapper

Amogh Vasekar Tue, 24 Nov 2009 22:50:37 -0800

Hi,
I'm not sure this applies to your case, since I don't know what job2:mapper
and job3:mapper have in common, but I'd like to give it a shot.
The whole process can be combined into a single mapred job. The mapper reads a
record and processes it up to the "saved data" step, then for each input record
emits 2 records, one for the job2 mapper logic and one for the job3 mapper
logic. The key of each record is tagged ( <tag,key> ) according to the reducer
processing you want applied to it. In reduce() you can use this tag to decide
how to process the record. A custom partitioner might be needed, depending on
the key types, to ensure each reducer gets a unique set of keys.
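As a rough illustration of the tagging idea above, here is a minimal plain-Java
sketch that simulates map, partition and reduce without using any Hadoop
classes. The tags ("J2", "J3"), the per-branch processing and every class and
method name are made up for the example; in a real job the same logic would
live in your own Mapper, Partitioner and Reducer implementations.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the tagged-key idea; no Hadoop classes are used.
// All names (TaggedKeySketch, TAG_JOB2, TAG_JOB3, ...) are hypothetical.
class TaggedKeySketch {
    static final String TAG_JOB2 = "J2";
    static final String TAG_JOB3 = "J3";

    // Stand-in for the shared "saved data" step of the original Job1 mapper.
    static String buildSavedData(String rawRecord) {
        return rawRecord.trim().toLowerCase();
    }

    // map(): emit two tagged <key, value> pairs per input record.
    static List<String[]> map(String rawRecord) {
        List<String[]> out = new ArrayList<>();
        String saved = buildSavedData(rawRecord);
        if (saved.isEmpty()) {
            return out; // nothing to emit for blank records
        }
        // Placeholder branch logic: job2 keys by first letter, job3 by length.
        out.add(new String[] {TAG_JOB2 + "\t" + saved.charAt(0), saved});
        out.add(new String[] {TAG_JOB3 + "\t" + saved.length(), saved});
        return out;
    }

    // Partitioner: the first half of the reducers serves job2 keys,
    // the remaining reducers serve job3 keys.
    static int partition(String taggedKey, int numReducers) {
        String[] parts = taggedKey.split("\t", 2);
        int half = Math.max(1, numReducers / 2);
        int hash = parts[1].hashCode() & Integer.MAX_VALUE;
        if (parts[0].equals(TAG_JOB2)) {
            return hash % half;
        }
        int rest = numReducers - half;
        return rest == 0 ? 0 : half + hash % rest;
    }

    // reduce(): choose the processing based on the tag.
    static String reduce(String taggedKey, List<String> values) {
        String[] parts = taggedKey.split("\t", 2);
        if (parts[0].equals(TAG_JOB2)) {
            return parts[1] + "=" + values.size();          // job2 branch: count
        }
        return parts[1] + "=" + String.join(",", values);   // job3 branch: list
    }

    // Group the map output by tagged key (the "shuffle"), then reduce each group.
    static Map<String, String> run(List<String> input) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String record : input) {
            for (String[] kv : map(record)) {
                grouped.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
            }
        }
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        for (String line : run(Arrays.asList("Apple", "avocado", "Kiwi ")).values()) {
            System.out.println(line);
        }
    }
}
```

Splitting the reducers by tag is only one possible partitioning choice; hashing
the whole tagged key would also keep the two branches from sharing a key.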
Ignore if this doesn't fit your bill :)

Amogh

On 11/25/09 9:35 AM, "Gordon Linoff" <glin...@gmail.com> wrote:

Does anyone have a pointer to code that allows the map to save data in
intermediate files, for use in a later map/reduce job?  I have been looking
for an example and cannot find one.

I have investigated MultipleOutputFormat and MultipleOutputs.  Because I am
using version 0.18.3, I don't have MultipleOutputs.  The problem with
MultipleOutputFormat is that the data I want to save is a different format
from the data I want to pass to the Reducer.  I have also tried opening a
sequence file directly from the mapper, but I am concerned that this is not
fault tolerant.

The process currently is:

Job1:  Mapper:  reads complicated data, saves out data structure.
Job2:  Mapper:  reads saved data, processes and sends data to Reducer 2.
Job3:  Mapper:  reads saved data, processes and sends data to Reducer 3.

I would like to combine the first two steps, so the process is:

Job1:  Mapper:  reads complicated data, saves out data structure, and passes
processed data to Reducer 2.
Job2:  Mapper:  reads saved data, processes and sends to Reducer 3.

--gordon

On Sun, Nov 22, 2009 at 9:27 PM, Jason Venner <jason.had...@gmail.com> wrote:

> You can manually write the map output to a new file; there are a number of
> examples of opening a sequence file and writing to it on the web or in the
> example code for various hadoop books.
>
> You can also disable the removal of intermediate data, which will result in
> potentially large amounts of data being left in the mapred.local.dir.
>
>
>
> On Sun, Nov 22, 2009 at 3:56 PM, Gordon Linoff <glin...@gmail.com> wrote:
>
>> I am starting to learn Hadoop, using the Yahoo virtual machine with version
>> 0.18.
>>
>> My question is rather simple.  I would like to execute a map/reduce job.  In
>> addition to getting the results from the reduce, I would also like to save
>> the intermediate results from the map in another HDFS file.  Is this
>> possible?
>>
>> --gordon
>>
>
>
>
> --
> Pro Hadoop, a book to guide you from beginner to hadoop mastery,
> http://www.amazon.com/dp/1430219424?tag=jewlerymall
> www.prohadoopbook.com a community for Hadoop Professionals
>
