Re: Best way to write multiple files from a MR job?

2009-03-03 Thread Stuart White
On Tue, Mar 3, 2009 at 9:16 PM, Nick Cen  wrote:
> Have you tried MultipleOutputFormat and its subclasses?

Nope (didn't know it existed).  I'll take a look at it.

Both of these suggestions sound great.  Thanks for the tips!


Re: Best way to write multiple files from a MR job?

2009-03-03 Thread Nick Cen
Have you tried MultipleOutputFormat and its subclasses?
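
For what it's worth, the core of MultipleTextOutputFormat is just a per-record filename function: you subclass it and override generateFileNameForKeyValue(key, value, name). Here is a minimal, Hadoop-free sketch of that routing idea (the class and method names RouteDemo/fileNameFor are made up for illustration; only generateFileNameForKeyValue is the real hook):

```java
public class RouteDemo {
    // Mirrors what a generateFileNameForKeyValue override might compute:
    // put each record type in its own subdirectory, reusing the task's
    // default leaf name (e.g. "part-00000") so names stay unique per task.
    static String fileNameFor(String recordType, String leafName) {
        return recordType + "/" + leafName;
    }

    public static void main(String[] args) {
        System.out.println(fileNameFor("typeA", "part-00000")); // typeA/part-00000
        System.out.println(fileNameFor("typeB", "part-00000")); // typeB/part-00000
    }
}
```

With the real class, Hadoop supplies the key, value, and default leaf name for each record, so records with different keys land in different output files without the mapper managing files itself.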


2009/3/4 Stuart White 

> I have a large amount of data, from which I'd like to extract multiple
> different types of data, writing each type of data to different sets
> of output files.  What's the best way to accomplish this?  (I should
> mention, I'm only using a mapper.  I have no need for sorting or
> reduction.)
>
> Of course, if I only wanted 1 output file, I can just .collect() the
> output from my mapper and let mapreduce write the output for me.  But,
> to get multiple output files, the only way I can see is to manually
> write the files myself from within my mapper.  If that's the correct
> way, then how can I get a unique filename for each mapper instance?
> Obviously hadoop has solved this problem, because it writes out its
> partition files (part-00000, etc...) with unique numbers.  Is there a
> way for my mappers to get this unique number being used so they can
> use it to ensure a unique filename?
>
> Thanks!
>



-- 
http://daily.appspot.com/food/


RE: Best way to write multiple files from a MR job?

2009-03-03 Thread Saranath Raghavan
This should help.

String jobId = jobConf.get("mapred.job.id");            // the job's id
String taskId = jobConf.get("mapred.task.partition");   // this task's partition number
String filename = "file_" + jobId + "_" + taskId;
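
Put together as a small runnable sketch (the job id and partition values below are hypothetical stand-ins for what jobConf.get("mapred.job.id") and jobConf.get("mapred.task.partition") would return at runtime):

```java
public class UniqueName {
    // Build a per-task filename from the job id and this task's partition
    // number. Both arguments come from the JobConf in a real job; here they
    // are passed in so the logic can be shown without a running cluster.
    static String uniqueName(String jobId, String taskPartition) {
        return "file_" + jobId + "_" + taskPartition;
    }

    public static void main(String[] args) {
        String jobId = "job_200903031234_0001"; // placeholder for mapred.job.id
        String partition = "3";                 // placeholder for mapred.task.partition
        System.out.println(uniqueName(jobId, partition));
        // file_job_200903031234_0001_3
    }
}
```

Since the partition number is unique per task within a job, every mapper instance gets a distinct filename.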

- Saranath

-Original Message-
From: Stuart White [mailto:stuart.whi...@gmail.com] 
Sent: Tuesday, March 03, 2009 6:50 PM
To: core-user@hadoop.apache.org
Subject: Best way to write multiple files from a MR job?

I have a large amount of data, from which I'd like to extract multiple
different types of data, writing each type of data to different sets
of output files.  What's the best way to accomplish this?  (I should
mention, I'm only using a mapper.  I have no need for sorting or
reduction.)

Of course, if I only wanted 1 output file, I can just .collect() the
output from my mapper and let mapreduce write the output for me.  But,
to get multiple output files, the only way I can see is to manually
write the files myself from within my mapper.  If that's the correct
way, then how can I get a unique filename for each mapper instance?
Obviously hadoop has solved this problem, because it writes out its
partition files (part-00000, etc...) with unique numbers.  Is there a
way for my mappers to get this unique number being used so they can
use it to ensure a unique filename?

Thanks!




Best way to write multiple files from a MR job?

2009-03-03 Thread Stuart White
I have a large amount of data, from which I'd like to extract multiple
different types of data, writing each type of data to different sets
of output files.  What's the best way to accomplish this?  (I should
mention, I'm only using a mapper.  I have no need for sorting or
reduction.)

Of course, if I only wanted 1 output file, I can just .collect() the
output from my mapper and let mapreduce write the output for me.  But,
to get multiple output files, the only way I can see is to manually
write the files myself from within my mapper.  If that's the correct
way, then how can I get a unique filename for each mapper instance?
Obviously hadoop has solved this problem, because it writes out its
partition files (part-00000, etc...) with unique numbers.  Is there a
way for my mappers to get this unique number being used so they can
use it to ensure a unique filename?

Thanks!