Hi Devaraj,

Thank you very much for your help.

I've created a CustomOutputFormat which is almost identical to FileOutputFormat, as seen here<http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-320/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.java>, except that I've removed line 125, which throws the FileAlreadyExistsException. However, when I try to run my code, I get this error:

Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory outDir already exists
    at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:137)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:887)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:500)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:530)
    ...
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

In my source code, I've changed "FileOutputFormat.setOutputPath" to "CustomOutputFormat.setOutputPath".

Is FileOutputFormat.checkOutputSpecs being called from somewhere else, or have I done something wrong?

I also don't quite understand your suggestion about MultipleOutputs. Would you mind elaborating?

Thanks,
Max Lebedev

On Tue, Jul 16, 2013 at 9:42 PM, Devaraj k <devara...@huawei.com> wrote:

> Hi Max,
>
> It can be done by customizing the output format class for your Job
> according to your expectations. You could refer to the
> OutputFormat.checkOutputSpecs(JobContext context) method, which checks the
> output specification. You can override this in your custom OutputFormat.
> You can also see the MultipleOutputs class for implementation details on
> how it could be done.
>
> Thanks
> Devaraj k
>
> *From:* Max Lebedev [mailto:ma...@actionx.com]
> *Sent:* 16 July 2013 23:33
> *To:* user@hadoop.apache.org
> *Subject:* Incrementally adding to existing output directory
>
> Hi
>
> I'm trying to figure out how to incrementally add to an existing output
> directory using MapReduce.
>
> I cannot specify the exact output path, as data in the input is sorted
> into categories and then written to different directories based on the
> contents (in the examples below, token=AAAA or token=BBBB).
>
> As an example:
>
> When using MultipleOutputs, and provided that outDir does not exist yet,
> the following will work:
>
> hadoop jar myMR.jar --input-path=inputDir/dt=2013-05-03/* --output-path=outDir
>
> The result will be:
>
> outDir/token=AAAA/dt=2013-05-03/
> outDir/token=BBBB/dt=2013-05-03/
>
> However, the following will fail because outDir already exists, even
> though I am copying new inputs:
>
> hadoop jar myMR.jar --input-path=inputDir/dt=2013-05-04/* --output-path=outDir
>
> It will throw FileAlreadyExistsException.
>
> What I would expect is that it adds
>
> outDir/token=AAAA/dt=2013-05-04/
> outDir/token=BBBB/dt=2013-05-04/
>
> Another possibility would be the following hack, but it does not seem to
> be very elegant:
>
> hadoop jar myMR.jar --input-path=inputDir/2013-05-04/* --output-path=tempOutDir
>
> then copy from tempOutDir to outDir
>
> Is there a better way to address incrementally adding to an existing
> Hadoop output directory?
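
For reference, the "write to tempOutDir, then copy into outDir" workaround from the original message could be sketched roughly as follows. This is only a local-filesystem illustration using java.nio.file, not HDFS code; on a cluster the same idea would presumably go through Hadoop's FileSystem API instead. The class name MergeOutputDirs, the method mergeInto, and the part-r-00000 file names are made up for the example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: merges a finished temporary output directory into an
// existing output directory, keeping the relative layout
// (e.g. token=AAAA/dt=2013-05-04/...). Local filesystem only, not HDFS.
public class MergeOutputDirs {

    // Moves every regular file under tempOutDir to the same relative path
    // under outDir and returns the new locations. Files.move is called
    // without REPLACE_EXISTING, so an existing target file is never
    // overwritten; a FileAlreadyExistsException is thrown instead.
    public static List<Path> mergeInto(Path tempOutDir, Path outDir) throws IOException {
        List<Path> sources;
        try (Stream<Path> walk = Files.walk(tempOutDir)) {
            sources = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        List<Path> moved = new ArrayList<>();
        for (Path src : sources) {
            Path target = outDir.resolve(tempOutDir.relativize(src));
            Files.createDirectories(target.getParent());
            Files.move(src, target);
            moved.add(target);
        }
        // The now-empty directories under tempOutDir are left in place.
        return moved;
    }

    // Small helper for the demo: create parent directories and write a file.
    private static void write(Path file, String content) throws IOException {
        Files.createDirectories(file.getParent());
        Files.write(file, content.getBytes("UTF-8"));
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("merge-demo");
        Path outDir = base.resolve("outDir");
        Path tempOutDir = base.resolve("tempOutDir");

        // Output of an earlier run (file names are assumptions).
        write(outDir.resolve("token=AAAA/dt=2013-05-03/part-r-00000"), "old");
        // Output of the new run, written to the temporary directory.
        write(tempOutDir.resolve("token=AAAA/dt=2013-05-04/part-r-00000"), "new");
        write(tempOutDir.resolve("token=BBBB/dt=2013-05-04/part-r-00000"), "new");

        for (Path p : mergeInto(tempOutDir, outDir)) {
            System.out.println("moved to " + p);
        }
    }
}
```

Because each run writes into its own dt=... subdirectory, the moves should not collide with files from earlier runs; if they ever did, the sketch fails loudly rather than overwriting anything.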