RE: Best way to write multiple files from a MR job?

2009-03-03 Thread Saranath Raghavan
This should help.

String jobId = jobConf.get("mapred.job.id");
String taskId = jobConf.get("mapred.task.partition");
String filename = "file_" + jobId + "_" + taskId;

- Saranath
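
For anyone trying the snippet above outside a running job, here is a minimal sketch of the same naming scheme. It assumes the two property values have already been read from the JobConf (for example in the mapper's configure() method); a plain Map stands in for the JobConf so the example runs without Hadoop on the classpath. The class name, method name, and sample job id are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: a Map stands in for the JobConf lookups shown above.
class UniqueOutputName {

    // Builds "file_<jobId>_<partition>" from the two job properties.
    static String fromProperties(Map<String, String> conf) {
        String jobId = conf.get("mapred.job.id");
        String partition = conf.get("mapred.task.partition");
        if (jobId == null || partition == null) {
            // Outside a running task these properties may simply be absent.
            throw new IllegalStateException("job id or task partition not set");
        }
        return "file_" + jobId + "_" + partition;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("mapred.job.id", "job_200903031850_0001"); // made-up sample value
        conf.put("mapred.task.partition", "7");             // made-up sample value
        System.out.println(fromProperties(conf));
    }
}
```

Because the partition number differs for every task in a job, two mappers in the same job should not end up with the same name. A retried or speculative attempt of the same task would, so this alone does not guard against that case.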

-----Original Message-----
From: Stuart White [mailto:stuart.whi...@gmail.com] 
Sent: Tuesday, March 03, 2009 6:50 PM
To: core-user@hadoop.apache.org
Subject: Best way to write multiple files from a MR job?

I have a large amount of data, from which I'd like to extract multiple
different types of data, writing each type of data to different sets
of output files.  What's the best way to accomplish this?  (I should
mention, I'm only using a mapper.  I have no need for sorting or
reduction.)

Of course, if I only wanted 1 output file, I can just .collect() the
output from my mapper and let mapreduce write the output for me.  But,
to get multiple output files, the only way I can see is to manually
write the files myself from within my mapper.  If that's the correct
way, then how can I get a unique filename for each mapper instance?
Obviously hadoop has solved this problem, because it writes out its
partition files (part-0, etc...) with unique numbers.  Is there a
way for my mappers to get this unique number being used so they can
use it to ensure a unique filename?

Thanks!




Re: Best way to write multiple files from a MR job?

2009-03-03 Thread Nick Cen
Have you tried MultipleOutputFormat and its subclasses?
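
A rough sketch of the idea, for readers who have not used it: as far as I recall, in the old org.apache.hadoop.mapred.lib API a subclass of MultipleTextOutputFormat can override generateFileNameForKeyValue(key, value, name) to choose an output path per record, where name is the default per-task leaf name (such as part-00000). Please check that signature against the javadoc for your Hadoop version. The plain-Java code below shows only the naming logic such an override might contain; the class name, method name, and record-type labels are hypothetical.

```java
// Sketch only: the path logic one might put inside an override of
// generateFileNameForKeyValue(key, value, name). No Hadoop classes are used here.
class TypedOutputName {

    // recordType: hypothetical label derived from the record (e.g. from the key).
    // leafName:   the default per-task file name supplied by the framework.
    static String forRecord(String recordType, String leafName) {
        // One subdirectory per record type; the leaf name keeps tasks from colliding.
        return recordType + "/" + leafName;
    }

    public static void main(String[] args) {
        System.out.println(forRecord("errors", "part-00003"));
        System.out.println(forRecord("clicks", "part-00003"));
    }
}
```

Keeping the framework's leaf name in the path means each task still writes to its own file, so the mapper does not need to invent a unique number itself.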


2009/3/4 Stuart White stuart.whi...@gmail.com

> I have a large amount of data, from which I'd like to extract multiple
> different types of data, writing each type of data to different sets
> of output files.  What's the best way to accomplish this?  (I should
> mention, I'm only using a mapper.  I have no need for sorting or
> reduction.)
>
> Of course, if I only wanted 1 output file, I can just .collect() the
> output from my mapper and let mapreduce write the output for me.  But,
> to get multiple output files, the only way I can see is to manually
> write the files myself from within my mapper.  If that's the correct
> way, then how can I get a unique filename for each mapper instance?
> Obviously hadoop has solved this problem, because it writes out its
> partition files (part-0, etc...) with unique numbers.  Is there a
> way for my mappers to get this unique number being used so they can
> use it to ensure a unique filename?
>
> Thanks!




-- 
http://daily.appspot.com/food/


Re: Best way to write multiple files from a MR job?

2009-03-03 Thread Stuart White
On Tue, Mar 3, 2009 at 9:16 PM, Nick Cen <cenyo...@gmail.com> wrote:
> Have you tried MultipleOutputFormat and its subclasses?

Nope (didn't know it existed).  I'll take a look at it.

Both of these suggestions sound great.  Thanks for the tips!