Hi, Lars. Thanks for your detailed reply. I am not sure how to store values
in memcached from a map-reduce job. Could you make it clearer or give me
some examples?
As to the number of map tasks: each of my input files should be processed
into a key-value pair. In this situation, should each mapper handle one
file? Or can one mapper handle multiple files, and if so, how?
Thanks.
--------------------------------------------------
From: "Lars George" <[email protected]>
Sent: Friday, November 27, 2009 10:54 PM
To: <[email protected]>
Subject: Re: Store mapreduce output into my own data structures
Hi Liu,
You have a few choices: you either a) use no OutputFormat at all or b)
create your own custom one that handles what you need. I have MapReduce
jobs that scan an HBase table and compute a specific value that I then
store in memcached. For that I do the memcached write directly in a custom
TableMapper and set the output format to
job.setOutputFormatClass(NullOutputFormat.class);
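If you go with option b) instead, a custom OutputFormat is not much code
either. Here is a rough, untested sketch, assuming the spymemcached client
and Text keys and values ("cachehost" is a placeholder; adapt the types to
whatever your job emits):

import java.io.IOException;
import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class MemcachedOutputFormat extends OutputFormat<Text, Text> {

  @Override
  public RecordWriter<Text, Text> getRecordWriter(TaskAttemptContext context)
      throws IOException {
    // one client per task; "cachehost" is a placeholder for your server
    final MemcachedClient client =
        new MemcachedClient(new InetSocketAddress("cachehost", 11211));
    return new RecordWriter<Text, Text>() {
      public void write(Text key, Text value) {
        client.set(key.toString(), 0, value.toString()); // 0 = never expire
      }
      public void close(TaskAttemptContext context) {
        client.shutdown();
      }
    };
  }

  @Override
  public void checkOutputSpecs(JobContext context) {
    // nothing to verify, there is no output directory
  }

  @Override
  public OutputCommitter getOutputCommitter(TaskAttemptContext context)
      throws IOException, InterruptedException {
    // reuse the do-nothing committer that NullOutputFormat provides
    return new NullOutputFormat<Text, Text>().getOutputCommitter(context);
  }
}

You would then use job.setOutputFormatClass(MemcachedOutputFormat.class);
instead of a FileOutputFormat and skip the output path entirely.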
I often also set the number of reducers to 0 as I can do all the work in
the Mapper. This is because row keys are sorted and unique, so there is
no need to have a Reducer as there is nothing to reduce. So I do
job.setNumReduceTasks(0);
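Put together, the driver for my scan-and-store setup looks roughly like
this. Sketch only: "mytable" is a placeholder, and MyTableMapper stands for
the custom TableMapper that writes to memcached inside map():

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class ScanToMemcached {
  public static void main(String[] args) throws Exception {
    // picks up hbase-site.xml so the job can find the cluster
    Configuration conf = new HBaseConfiguration();
    Job job = new Job(conf, "scan-to-memcached");
    job.setJarByClass(ScanToMemcached.class);
    // wire the full table scan to the custom Mapper
    TableMapReduceUtil.initTableMapperJob("mytable", new Scan(),
        MyTableMapper.class, NullWritable.class, NullWritable.class, job);
    job.setOutputFormatClass(NullOutputFormat.class); // the job emits nothing
    job.setNumReduceTasks(0);                         // map-only
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}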
The new Hadoop MapReduce API has removed the ability to set the number
of map tasks. That was always just a hint to the framework anyway, never
a hard limit. The number of Mappers is tied to the InputFormat that is
used, as it is responsible for splitting the input data into equal chunks
for processing. Our TableInputFormat, for example, splits the tables at
region boundaries. A FileInputFormat may split text files into equal
blocks matching the Hadoop block size while specifying one of the data
nodes holding a copy, so that the data can be processed locally. But if
the input file is in a compressed, non-splittable format such as GZip,
then a single Mapper handles the whole file. Even if you had specified
10 map tasks, it would only use one as it has no other choice.
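What you can still do for splittable input files is influence the split
size, which indirectly controls how many Mappers you get. Again only a
small, untested sketch using the new API:

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// cap each split at 64 MB; a 640 MB splittable file then yields ~10 Mappers
FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);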
Lars
Liu Xianglong wrote:
Hi, everyone. Is there someone who uses map-reduce to store the reduce
output in memory? I mean, right now the output path of the job is set and
the reduce outputs are stored into files under that path (see the comments
along with the following code):
job.setOutputFormatClass(MyOutputFormat.class);
// can I implement my own OutputFormat to store these output key-value
// pairs in my data structures, or are there other ways to do it?
job.setOutputKeyClass(ImmutableBytesWritable.class);
job.setOutputValueClass(Result.class);
FileOutputFormat.setOutputPath(job, outputDir);
Is there any way to store them in some variables or data structures
instead? And how would I implement such an OutputFormat? Any suggestions
and code are welcome.
Another question: is there some way to set the number of map tasks? There
seems to be no API for this in the new Hadoop Job API, so I am not sure
how to set this number.
Thanks!
Best Wishes!
_____________________________________________________________
刘祥龙 Liu Xianglong