Re: Output directory already exists
Thanks, Owen. This fixed my problem!

Shirley

On Sep 2, 2008, at 8:44 PM, Owen O'Malley wrote:

> On Tue, Sep 2, 2008 at 10:24 AM, Shirley Cohen <[EMAIL PROTECTED]> wrote:
>
>> Hi,
>>
>> I'm trying to write the output of two different map-reduce jobs into the
>> same output directory. I'm using MultipleOutputFormats to set the filename
>> dynamically, so there is no filename collision between the two jobs.
>> However, I'm getting the error "output directory already exists".
>
> You just need to define a new OutputFormat that derives from the one that
> you are really using for the second job. For example, if your second job is
> using TextOutputFormat, you could derive a subtype and have it always return
> from checkOutputSpecs, even if the directory already exists. Something like:
>
> {code}
> import java.io.IOException;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.mapred.*;
> import org.apache.hadoop.util.Progressable;
>
> public class NoClobberTextOutputFormat<K, V> extends TextOutputFormat<K, V> {
>   // Suffix each file name so it cannot collide with the first job's files.
>   public RecordWriter<K, V> getRecordWriter(FileSystem ignored, JobConf job,
>       String name, Progressable progress) throws IOException {
>     return super.getRecordWriter(ignored, job, name + "-second", progress);
>   }
>
>   // Do nothing, so an existing output directory is not an error.
>   public void checkOutputSpecs(FileSystem fs, JobConf conf) { }
> }
> {code}
>
> -- Owen
Re: Output directory already exists
On Wed, Sep 3, 2008 at 1:24 AM, Shirley Cohen <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I'm trying to write the output of two different map-reduce jobs into the
> same output directory. I'm using MultipleOutputFormats to set the filename
> dynamically, so there is no filename collision between the two jobs.
> However, I'm getting the error "output directory already exists".
>
> Does the framework support this functionality? It seems silly to have to
> create a temp directory to store the output files from the second job and
> then have to copy them to the first job's output directory after the second
> job completes.

You basically have to work with the framework. So far, when I've had to sort,
split, combine, etc. my data, I put another job in my pipeline to shuffle data
around, then worry about efficiency later. This one could be done with two
input directories and a no-op mapper like IdentityMapper or cat.

Karl Anderson
[EMAIL PROTECTED]
http://monkey.org/~kra
Re: Output directory already exists
On Tue, Sep 2, 2008 at 10:24 AM, Shirley Cohen <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I'm trying to write the output of two different map-reduce jobs into the
> same output directory. I'm using MultipleOutputFormats to set the filename
> dynamically, so there is no filename collision between the two jobs.
> However, I'm getting the error "output directory already exists".

You just need to define a new OutputFormat that derives from the one that you
are really using for the second job. For example, if your second job is using
TextOutputFormat, you could derive a subtype and have it always return from
checkOutputSpecs, even if the directory already exists. Something like:

{code}
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.Progressable;

public class NoClobberTextOutputFormat<K, V> extends TextOutputFormat<K, V> {
  // Suffix each file name so it cannot collide with the first job's files.
  public RecordWriter<K, V> getRecordWriter(FileSystem ignored, JobConf job,
      String name, Progressable progress) throws IOException {
    return super.getRecordWriter(ignored, job, name + "-second", progress);
  }

  // Do nothing, so an existing output directory is not an error.
  public void checkOutputSpecs(FileSystem fs, JobConf conf) { }
}
{code}

-- Owen
Re: Output directory already exists
On Wed, Sep 3, 2008 at 1:24 AM, Shirley Cohen <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I'm trying to write the output of two different map-reduce jobs into the
> same output directory. I'm using MultipleOutputFormats to set the filename
> dynamically, so there is no filename collision between the two jobs.
> However, I'm getting the error "output directory already exists".
>
> Does the framework support this functionality? It seems silly to have to
> create a temp directory to store the output files from the second job and
> then have to copy them to the first job's output directory after the second
> job completes.

Map/reduce creates the output directory each time a job runs, and the job fails
if the directory already exists. As far as I can tell, there is no way to do
what you describe other than modifying the source code.

> Thanks,
>
> Shirley

--
[EMAIL PROTECTED]
Institute of Computing Technology, Chinese Academy of Sciences, Beijing.
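
The behaviour described in this reply, and the reason the subclassing approach suggested elsewhere in the thread gets around it, can be modelled in plain Java. This is only a sketch under stated assumptions: it does not use Hadoop, it checks a local directory through java.nio.file, and the class names StrictOutputCheck, LenientOutputCheck and OutputCheckDemo are hypothetical stand-ins, not Hadoop classes.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical stand-in for an output format that refuses to reuse a directory.
class StrictOutputCheck {
    void checkOutputSpecs(Path outputDir) throws IOException {
        if (Files.exists(outputDir)) {
            throw new FileAlreadyExistsException(
                outputDir.toString(), null, "output directory already exists");
        }
    }
}

// Subclass that replaces the check with a no-op, so an existing directory is accepted.
class LenientOutputCheck extends StrictOutputCheck {
    @Override
    void checkOutputSpecs(Path outputDir) {
        // Intentionally empty.
    }
}

final class OutputCheckDemo {
    // Returns true if the given check accepts the directory, false if it rejects it.
    static boolean accepts(StrictOutputCheck check, Path outputDir) {
        try {
            check.checkOutputSpecs(outputDir);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path existing = Files.createTempDirectory("job-output");
        System.out.println("strict accepts:  " + accepts(new StrictOutputCheck(), existing));
        System.out.println("lenient accepts: " + accepts(new LenientOutputCheck(), existing));
        Files.delete(existing);
    }
}
```

Because the caller only ever invokes checkOutputSpecs, swapping in the subclass changes the outcome without touching the caller, which is the same reason no framework source needs to be modified in the real case.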
Output directory already exists
Hi,

I'm trying to write the output of two different map-reduce jobs into the same output directory. I'm using MultipleOutputFormats to set the filename dynamically, so there is no filename collision between the two jobs. However, I'm getting the error "output directory already exists".

Does the framework support this functionality? It seems silly to have to create a temp directory to store the output files from the second job and then have to copy them to the first job's output directory after the second job completes.

Thanks,

Shirley
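
For reference, the temp-directory workaround mentioned in the question could look roughly like the following. This is a minimal sketch, assuming both directories are on a local filesystem reachable through java.nio.file; on HDFS the same steps would have to go through Hadoop's own FileSystem API instead. The class name MergeOutputDirs and the method moveAll are hypothetical.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

final class MergeOutputDirs {
    // Moves every regular file from tempDir into finalDir and returns how many were moved.
    // Files.move is called without REPLACE_EXISTING, so a name clash between the two jobs'
    // files raises FileAlreadyExistsException instead of silently overwriting a file.
    static int moveAll(Path tempDir, Path finalDir) throws IOException {
        // Collect the file list first so the directory is not modified while it is being read.
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(tempDir)) {
            for (Path entry : stream) {
                if (Files.isRegularFile(entry)) {
                    files.add(entry);
                }
            }
        }
        for (Path file : files) {
            Files.move(file, finalDir.resolve(file.getFileName()));
        }
        return files.size();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical arguments: args[0] is the second job's temp output directory,
        // args[1] is the first job's output directory.
        int moved = moveAll(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("moved " + moved + " file(s)");
    }
}
```

This only works cleanly if the two jobs already produce distinct file names, which is the situation described in the question.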