Re: Incrementally adding to existing output directory

Max Lebedev Wed, 17 Jul 2013 12:33:55 -0700

Hi Devaraj,

Thank you very much for your help. I've created a CustomOutputFormat which
is almost identical to FileOutputFormat as seen
here<http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-320/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.java>
except I've removed line 125 which throws the FileAlreadyExistsException.
However, when I try to run my code, I get this error:

Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory outDir already exists
    at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:137)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:887)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:500)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:530)
    ...
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

In my source code, I've changed "FileOutputFormat.setOutputPath" to
"CustomOutputFormat.setOutputPath".

Is it the case that FileOutputFormat.checkOutputSpecs is happening
somewhere else, or have I done something wrong?
I also don't quite understand your suggestion about MultipleOutputs. Would
you mind elaborating?

Thanks,
Max Lebedev

On Tue, Jul 16, 2013 at 9:42 PM, Devaraj k <devara...@huawei.com> wrote:

> Hi Max,
>
> It can be done by customizing the output format class for your job
> according to your needs. You could refer to the
> OutputFormat.checkOutputSpecs(JobContext context) method, which checks the
> output specification. You can override this in your custom OutputFormat. You
> can also see the MultipleOutputs class for implementation details on how it
> could be done.
>
> Thanks
>
> Devaraj k
>
>
> *From:* Max Lebedev [mailto:ma...@actionx.com]
> *Sent:* 16 July 2013 23:33
> *To:* user@hadoop.apache.org
> *Subject:* Incrementally adding to existing output directory
>
> Hi,
>
> I'm trying to figure out how to incrementally add to an existing output
> directory using MapReduce.
>
> I cannot specify the exact output path, as data in the input is sorted
> into categories and then written to different directories based on the
> contents (in the examples below, token=AAAA or token=BBBB).
>
> As an example:
>
> When using MultipleOutputs, and provided that outDir does not exist yet, the
> following will work:
>
> hadoop jar myMR.jar
> --input-path=inputDir/dt=2013-05-03/* --output-path=outDir
>
> The result will be:
>
> outDir/token=AAAA/dt=2013-05-03/
>
> outDir/token=BBBB/dt=2013-05-03/
>
> However, the following will fail because outDir already exists, even
> though I am copying new inputs:
>
> hadoop jar myMR.jar  --input-path=inputDir/dt=2013-05-04/*
> --output-path=outDir
>
> It will throw a FileAlreadyExistsException.
>
> What I would expect is that it adds:
>
> outDir/token=AAAA/dt=2013-05-04/
>
> outDir/token=BBBB/dt=2013-05-04/
>
> Another possibility would be the following hack, but it does not seem
> very elegant:
>
> hadoop jar myMR.jar --input-path=inputDir/2013-05-04/*
> --output-path=tempOutDir
>
> and then copying from tempOutDir to outDir.
>
> Is there a better way to incrementally add to an existing
> Hadoop output directory?
>
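
The temp-directory workaround quoted above (write to tempOutDir, then copy
into outDir) could be sketched roughly as follows. This is only an
illustration on the local filesystem using java.nio.file; on HDFS the
equivalent moves would go through Hadoop's FileSystem API instead. The class
name OutputMerger and the method merge are hypothetical and do not come from
the thread.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: merges a finished temporary output tree into an existing one. */
final class OutputMerger {

    private OutputMerger() {}

    /**
     * Moves every regular file under tempOut to the same relative path under out,
     * creating missing directories (e.g. token=AAAA/dt=2013-05-04) on the way.
     * If any target file already exists, throws before moving anything.
     * Returns the number of files moved.
     */
    static int merge(Path tempOut, Path out) throws IOException {
        // Collect all regular files in the temporary output tree.
        List<Path> files;
        try (Stream<Path> walk = Files.walk(tempOut)) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }

        // First pass: refuse to overwrite existing output.
        for (Path src : files) {
            Path dst = out.resolve(tempOut.relativize(src));
            if (Files.exists(dst)) {
                throw new FileAlreadyExistsException(dst.toString());
            }
        }

        // Second pass: create target directories and move the files.
        for (Path src : files) {
            Path dst = out.resolve(tempOut.relativize(src));
            Files.createDirectories(dst.getParent());
            Files.move(src, dst);
        }
        return files.size();
    }
}
```

Calling OutputMerger.merge(tempOutDir, outDir) after the job has finished
would leave the existing dt=2013-05-03 directories untouched and add the new
dt=2013-05-04 directories next to them. Checking for collisions before moving
anything is a design choice of this sketch, so that a failed merge does not
leave outDir half-updated.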
