Re: Output directory already exists

2008-09-04 Thread Shirley Cohen

Thanks, Owen. This fixed my problem!

Shirley

On Sep 2, 2008, at 8:44 PM, Owen O'Malley wrote:

> On Tue, Sep 2, 2008 at 10:24 AM, Shirley Cohen <[EMAIL PROTECTED]> wrote:
>
> > Hi,
> >
> > I'm trying to write the output of two different map-reduce jobs into the
> > same output directory. I'm using MultipleOutputFormats to set the filename
> > dynamically, so there is no filename collision between the two jobs.
> > However, I'm getting the error "output directory already exists".
>
> You just need to define a new OutputFormat that derives from the one that
> you are really using for the second job. For example, if your second job is
> using TextOutputFormat, you could derive a subtype and have it always return
> from checkOutputSpecs, even if the directory already exists. Something like:
>
> {code}
> import java.io.IOException;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.mapred.*;
> import org.apache.hadoop.util.Progressable;
>
> public class NoClobberTextOutputFormat<K, V> extends TextOutputFormat<K, V> {
>   // Suffix the file name so it cannot collide with the first job's files.
>   @Override
>   public RecordWriter<K, V> getRecordWriter(FileSystem ignored, JobConf job,
>       String name, Progressable progress) throws IOException {
>     return super.getRecordWriter(ignored, job, name + "-second", progress);
>   }
>   // Return without checking, even if the output directory already exists.
>   @Override
>   public void checkOutputSpecs(FileSystem fs, JobConf conf) { }
> }
> {code}
>
> -- Owen




Re: Output directory already exists

2008-09-03 Thread Karl Anderson


On Wed, Sep 3, 2008 at 1:24 AM, Shirley Cohen <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I'm trying to write the output of two different map-reduce jobs into the
> same output directory. I'm using MultipleOutputFormats to set the filename
> dynamically, so there is no filename collision between the two jobs.
> However, I'm getting the error "output directory already exists".
>
> Does the framework support this functionality? It seems silly to have to
> create a temp directory to store the output files from the second job and
> then have to copy them to the first job's output directory after the second
> job completes.


You basically have to work with the framework. So far, when I've had
to sort, split, combine, etc. my data, I've put another job in my
pipeline to shuffle the data around and worried about efficiency later.
This one could be done with a job that reads two input directories and
uses a no-op mapper such as IdentityMapper or cat.





Karl Anderson
[EMAIL PROTECTED]
http://monkey.org/~kra
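
Karl's extra merge job might look roughly like the sketch below. It is only an illustration, not something posted in the thread: it assumes the old org.apache.hadoop.mapred API on the classpath, that both earlier jobs wrote tab-separated key/value text, and that the three paths arrive as command-line arguments. The class name MergeOutputsDriver is made up. A map-only job writes its own part-NNNNN files, so the file names chosen by the earlier jobs are not preserved.

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;

// Hypothetical driver: copies the records of two existing output
// directories into a third, new directory with a no-op mapper.
public class MergeOutputsDriver {
  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(MergeOutputsDriver.class);
    conf.setJobName("merge-outputs");

    // Assumption: inputs are "key<TAB>value" text lines.
    conf.setInputFormat(KeyValueTextInputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]), new Path(args[1]));

    // Pass every record through unchanged; no reduce phase is needed.
    conf.setMapperClass(IdentityMapper.class);
    conf.setNumReduceTasks(0);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    // The merged directory must not exist yet, as for any job.
    FileOutputFormat.setOutputPath(conf, new Path(args[2]));

    JobClient.runJob(conf);
  }
}
```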





Re: Output directory already exists

2008-09-02 Thread Owen O'Malley
On Tue, Sep 2, 2008 at 10:24 AM, Shirley Cohen <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I'm trying to write the output of two different map-reduce jobs into the
> same output directory. I'm using MultipleOutputFormats to set the filename
> dynamically, so there is no filename collision between the two jobs.
> However, I'm getting the error "output directory already exists".


You just need to define a new OutputFormat that derives from the one that
you are really using for the second job. For example, if your second job is
using TextOutputFormat, you could derive a subtype and have it always return
from checkOutputSpecs, even if the directory already exists. Something like:

{code}
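import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.Progressable;

public class NoClobberTextOutputFormat<K, V> extends TextOutputFormat<K, V> {
  // Suffix the file name so it cannot collide with the first job's files.
  @Override
  public RecordWriter<K, V> getRecordWriter(FileSystem ignored, JobConf job,
      String name, Progressable progress) throws IOException {
    return super.getRecordWriter(ignored, job, name + "-second", progress);
  }
  // Return without checking, even if the output directory already exists.
  @Override
  public void checkOutputSpecs(FileSystem fs, JobConf conf) { }
}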
{code}

-- Owen
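
For completeness, here is a rough sketch of how the second job could be pointed at such an output format. It is an illustration under assumptions rather than part of Owen's reply: it assumes the old org.apache.hadoop.mapred API, that NoClobberTextOutputFormat from the message above is compiled into the same package, and that the input path and the first job's output directory are passed as arguments. SecondJobDriver and the identity mapper and reducer are placeholders for the real job.

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

// Hypothetical driver for the second job.
public class SecondJobDriver {
  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(SecondJobDriver.class);
    conf.setJobName("second-job");

    // Placeholders: substitute the job's real input format, mapper and reducer.
    conf.setInputFormat(KeyValueTextInputFormat.class);
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    // The subclass skips the "output directory already exists" check.
    conf.setOutputFormat(NoClobberTextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    // args[1] may be the directory the first job already wrote to.
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}
```

Because the check is skipped entirely, nothing stops the second job from overwriting files if the names do collide, so the "-second" suffix (or distinct names from MultipleOutputFormat) is what keeps the two outputs apart.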


Re: Output directory already exists

2008-09-02 Thread Mafish Liu
On Wed, Sep 3, 2008 at 1:24 AM, Shirley Cohen <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I'm trying to write the output of two different map-reduce jobs into the
> same output directory. I'm using MultipleOutputFormats to set the filename
> dynamically, so there is no filename collision between the two jobs.
> However, I'm getting the error "output directory already exists".
>
> Does the framework support this functionality? It seems silly to have to
> create a temp directory to store the output files from the second job and
> then have to copy them to the first job's output directory after the second
> job completes.


Map/Reduce creates the output directory each time a job runs and fails if
the directory already exists. It seems there is no way to do what you
describe other than modifying the source code.

>
>
> Thanks,
>
> Shirley
>
>


-- 
[EMAIL PROTECTED]
Institute of Computing Technology, Chinese Academy of Sciences, Beijing.


Output directory already exists

2008-09-02 Thread Shirley Cohen

Hi,

I'm trying to write the output of two different map-reduce jobs into  
the same output directory. I'm using MultipleOutputFormats to set the  
filename dynamically, so there is no filename collision between the  
two jobs. However, I'm getting the error "output directory already  
exists".


Does the framework support this functionality? It seems silly to have  
to create a temp directory to store the output files from the second  
job and then have to copy them to the first job's output directory  
after the second job completes.


Thanks,

Shirley
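
The temp-directory workaround described in this message can be sketched as below. This is a local-filesystem illustration using only java.nio.file, not Hadoop code: on HDFS the move would presumably go through org.apache.hadoop.fs.FileSystem.rename instead. The class name MergeOutputDirs, the method moveAll and the file names in main are made up for the example.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class MergeOutputDirs {

    /**
     * Moves every regular file from stagingDir into finalDir and returns
     * how many files were moved. A name collision raises
     * FileAlreadyExistsException instead of overwriting the existing file.
     */
    public static int moveAll(Path stagingDir, Path finalDir) throws IOException {
        // List first, then move, so the directory is not modified mid-iteration.
        List<Path> sources = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(stagingDir)) {
            for (Path entry : entries) {
                if (Files.isRegularFile(entry)) {
                    sources.add(entry);
                }
            }
        }
        for (Path src : sources) {
            Files.move(src, finalDir.resolve(src.getFileName()));
        }
        return sources.size();
    }

    public static void main(String[] args) throws IOException {
        // Stand-ins for the first job's output and the second job's temp output.
        Path finalDir = Files.createTempDirectory("job1-out");
        Path stagingDir = Files.createTempDirectory("job2-out");
        Files.write(finalDir.resolve("first-part-00000"), "a\n".getBytes());
        Files.write(stagingDir.resolve("second-part-00000"), "b\n".getBytes());

        System.out.println(moveAll(stagingDir, finalDir) + " file(s) moved");
    }
}
```

A move within one filesystem is a rename rather than a copy, so the extra step costs less than the word "copy" suggests.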