Hi, Lars. Thanks for your detailed reply. I am not sure how to store values
in memcached from a map-reduce job. Could you make it clearer or give me
some examples?
As to the number of map tasks: each of my input files should be processed
into a key-value pair. In this situation, should each mapper handle one
file? Or can one mapper handle multiple files, and if so, how?
Thanks.
--------------------------------------------------
From: "Lars George" <[email protected]>
Sent: Friday, November 27, 2009 10:54 PM
To: <[email protected]>
Subject: Re: Store mapreduce output into my own data structures
Hi Liu,
You have a few choices: you either a) use no OutputFormat at all or b)
create your own custom one that handles what you need. I have MapReduce
jobs that scan an HBase table and compute a specific value that I then
store in memcached. For that I do the memcached write directly in a custom
TableMapper and set the output format to
job.setOutputFormatClass(NullOutputFormat.class);
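If you go with option b) instead, a custom OutputFormat is not much code
either. Here is a rough, untested sketch, assuming the spymemcached client
and Text keys and values ("cachehost" is a placeholder; adapt the types to
whatever your job emits):

import java.io.IOException;
import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class MemcachedOutputFormat extends OutputFormat<Text, Text> {

  @Override
  public RecordWriter<Text, Text> getRecordWriter(TaskAttemptContext context)
      throws IOException {
    // one client per task; "cachehost" is a placeholder for your server
    final MemcachedClient client =
        new MemcachedClient(new InetSocketAddress("cachehost", 11211));
    return new RecordWriter<Text, Text>() {
      public void write(Text key, Text value) {
        client.set(key.toString(), 0, value.toString()); // 0 = never expire
      }
      public void close(TaskAttemptContext context) {
        client.shutdown();
      }
    };
  }

  @Override
  public void checkOutputSpecs(JobContext context) {
    // nothing to verify, there is no output directory
  }

  @Override
  public OutputCommitter getOutputCommitter(TaskAttemptContext context)
      throws IOException, InterruptedException {
    // reuse the do-nothing committer that NullOutputFormat provides
    return new NullOutputFormat<Text, Text>().getOutputCommitter(context);
  }
}

You would then use job.setOutputFormatClass(MemcachedOutputFormat.class);
instead of a FileOutputFormat and skip the output path entirely.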
I often also set the number of reducers to 0 as I can do all the work in
the Mapper. This is because row keys are sorted and unique, so there is
no need to have a Reducer as there is nothing to reduce. So I do
job.setNumReduceTasks(0);
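Put together, the driver for my scan-and-store setup looks roughly like
this. Sketch only: "mytable" is a placeholder, and MyTableMapper stands for
the custom TableMapper that writes to memcached inside map():

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class ScanToMemcached {
  public static void main(String[] args) throws Exception {
    // picks up hbase-site.xml so the job can find the cluster
    Configuration conf = new HBaseConfiguration();
    Job job = new Job(conf, "scan-to-memcached");
    job.setJarByClass(ScanToMemcached.class);
    // wire the full table scan to the custom Mapper
    TableMapReduceUtil.initTableMapperJob("mytable", new Scan(),
        MyTableMapper.class, NullWritable.class, NullWritable.class, job);
    job.setOutputFormatClass(NullOutputFormat.class); // the job emits nothing
    job.setNumReduceTasks(0);                         // map-only
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}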
The new Hadoop MapReduce API has removed the ability to set the number
of map tasks. That was always just a hint to the framework anyway, never
a hard limit. The number of Mappers is tied to the InputFormat that is
used, as it is responsible for splitting the input data into equal chunks
for processing. Our TableInputFormat, for example, splits the tables at
region boundaries. A FileInputFormat may split text files into equal
blocks matching the Hadoop block size while specifying one of the data
nodes holding a copy, so that the data can be processed locally. But if
the input file is in a compressed, non-splittable format such as GZip,
then a single Mapper handles the whole file. Even if you had specified
10 map tasks, it would only use one as it has no other choice.
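What you can still do for splittable input files is influence the split
size, which indirectly controls how many Mappers you get. Again only a
small, untested sketch using the new API:

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// cap each split at 64 MB; a 640 MB splittable file then yields ~10 Mappers
FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);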
Lars
Liu Xianglong wrote:
Hi, everyone. Is there someone who uses map-reduce to store the reduce
output in memory? I mean, right now the output path of the job is set and
the reduce outputs are stored into files under that path (see the comments
along with the following code):
job.setOutputFormatClass(MyOutputFormat.class);
// can I implement my own OutputFormat to store these output key-value
// pairs in my data structures, or are there other ways to do it?
job.setOutputKeyClass(ImmutableBytesWritable.class);
job.setOutputValueClass(Result.class);
FileOutputFormat.setOutputPath(job, outputDir);
Is there any way to store them in some variables or data structures
instead? And how would I implement such an OutputFormat? Any suggestions
and code are welcome.
Another question: is there some way to set the number of map tasks? There
seems to be no API for this in the new Hadoop Job API, so I am not sure
how to set this number.
Thanks!
Best Wishes!
_____________________________________________________________
刘祥龙 Liu Xianglong