Re: Best way to write multiple files from a MR job?
On Tue, Mar 3, 2009 at 9:16 PM, Nick Cen wrote:
> Have you tried MultipleOutputFormat and its subclasses?

Nope (didn't know it existed). I'll take a look at it. Both of these suggestions sound great. Thanks for the tips!
Re: Best way to write multiple files from a MR job?
Have you tried MultipleOutputFormat and its subclasses?

2009/3/4 Stuart White:
> I have a large amount of data, from which I'd like to extract multiple
> different types of data, writing each type of data to a different set
> of output files. What's the best way to accomplish this? (I should
> mention, I'm only using a mapper; I have no need for sorting or
> reduction.)
>
> Of course, if I only wanted one output file, I could just .collect() the
> output from my mapper and let MapReduce write the output for me. But,
> to get multiple output files, the only way I can see is to manually
> write the files myself from within my mapper. If that's the correct
> way, then how can I get a unique filename for each mapper instance?
> Obviously Hadoop has solved this problem, because it writes out its
> partition files (part-0, etc...) with unique numbers. Is there a
> way for my mappers to get this unique number so they can use it to
> ensure a unique filename?
>
> Thanks!

--
http://daily.appspot.com/food/
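For anyone reading this thread later: the key idea behind MultipleTextOutputFormat (a MultipleOutputFormat subclass) is that it lets you override a method which maps each key/value pair to an output filename, so records of different types land in different files. Here is a Hadoop-free sketch of just that routing logic; the class and method names below are illustrative, not the actual Hadoop API (the real override is generateFileNameForKeyValue on MultipleTextOutputFormat).

```java
import java.util.HashMap;
import java.util.Map;

public class RecordRouter {
    // Sketch of the per-record routing MultipleTextOutputFormat performs:
    // derive an output path from the record's type, keeping the task's
    // leaf name (e.g. "part-00000") so files from different tasks
    // still don't collide.
    static String generateFileName(String recordType, String leafName) {
        return recordType + "/" + leafName;
    }

    public static void main(String[] args) {
        Map<String, String> routed = new HashMap<>();
        // Hypothetical record types for illustration.
        for (String type : new String[] {"clicks", "views"}) {
            routed.put(type, generateFileName(type, "part-00000"));
        }
        System.out.println(routed);
    }
}
```

With this scheme the framework, not the mapper, handles opening and closing the per-type files, which avoids the manual-filename problem discussed below.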
RE: Best way to write multiple files from a MR job?
This should help:

    String jobId = jobConf.get("mapred.job.id");
    String taskId = jobConf.get("mapred.task.partition");
    String filename = "file_" + jobId + "_" + taskId;

- Saranath

-----Original Message-----
From: Stuart White [mailto:stuart.whi...@gmail.com]
Sent: Tuesday, March 03, 2009 6:50 PM
To: core-user@hadoop.apache.org
Subject: Best way to write multiple files from a MR job?

I have a large amount of data, from which I'd like to extract multiple different types of data, writing each type of data to a different set of output files. What's the best way to accomplish this? (I should mention, I'm only using a mapper; I have no need for sorting or reduction.)

Of course, if I only wanted one output file, I could just .collect() the output from my mapper and let MapReduce write the output for me. But, to get multiple output files, the only way I can see is to manually write the files myself from within my mapper. If that's the correct way, then how can I get a unique filename for each mapper instance? Obviously Hadoop has solved this problem, because it writes out its partition files (part-0, etc...) with unique numbers. Is there a way for my mappers to get this unique number so they can use it to ensure a unique filename?

Thanks!
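Put together, the snippet above produces a name that is unique per task because the partition number differs across map tasks. A minimal, Hadoop-free sketch (the job id and partition values here are made up for illustration; in a real job they come from the JobConf lookups shown above):

```java
public class UniqueFileName {
    // Assemble a per-task unique filename from the job id and the
    // task's partition number, as in the snippet above.
    static String fileName(String jobId, String taskPartition) {
        return "file_" + jobId + "_" + taskPartition;
    }

    public static void main(String[] args) {
        // Hypothetical values standing in for
        // jobConf.get("mapred.job.id") and jobConf.get("mapred.task.partition")
        // for the fourth map task of a job:
        System.out.println(fileName("job_200903031820_0001", "3"));
        // -> file_job_200903031820_0001_3
    }
}
```

Each mapper instance sees a different partition number, so files written under these names never collide, mirroring how Hadoop numbers its own part files.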
Best way to write multiple files from a MR job?
I have a large amount of data, from which I'd like to extract multiple different types of data, writing each type of data to a different set of output files. What's the best way to accomplish this? (I should mention, I'm only using a mapper; I have no need for sorting or reduction.)

Of course, if I only wanted one output file, I could just .collect() the output from my mapper and let MapReduce write the output for me. But, to get multiple output files, the only way I can see is to manually write the files myself from within my mapper. If that's the correct way, then how can I get a unique filename for each mapper instance? Obviously Hadoop has solved this problem, because it writes out its partition files (part-0, etc...) with unique numbers. Is there a way for my mappers to get this unique number so they can use it to ensure a unique filename?

Thanks!