tim robertson wrote:
Hi all,

I am running an MR job that scans 130M records and then tries to
group them into around 64,000 files.

The map does the grouping by determining a key for each record, and
then I use a MultipleTextOutputFormat to pick the output file based
on that key:
        @Override
        protected String generateFileNameForKeyValue(WritableComparable key,
                        Writable value, String name) {
                return "cell_" + key.toString();
        }
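
For reference, the override lives in a small subclass; here is a
self-contained sketch of it (the class name is just illustrative, only
generateFileNameForKeyValue is my real code):

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Routes each record to a file named after its key, e.g. "cell_1234".
public class CellOutputFormat
        extends MultipleTextOutputFormat<WritableComparable, Writable> {

        @Override
        protected String generateFileNameForKeyValue(WritableComparable key,
                        Writable value, String name) {
                // "name" is the default leaf name (e.g. "part-00000");
                // it is ignored here, so the key alone picks the file.
                return "cell_" + key.toString();
        }
}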

This approach works for small input files, but for the 130M records it fails with:

org.apache.hadoop.mapred.Merger$MergeQueue Down to the last merge-pass, with 10 segments left of total size: 12291866391 bytes
org.apache.hadoop.mapred.LocalJobRunner$Job reduce > reduce
org.apache.hadoop.mapred.JobClient  map 100% reduce 66%
org.apache.hadoop.mapred.LocalJobRunner$Job reduce > reduce
...
org.apache.hadoop.mapred.LocalJobRunner$Job reduce > reduce
org.apache.hadoop.mapred.LocalJobRunner$Job job_local_0001
java.io.IOException: Cannot run program "chmod": error=24, Too many open files
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
        at org.apache.hadoop.util.Shell.run(Shell.java:134)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:317)
        at org.apache.hadoop.fs.RawLocalFileSystem.execCommand(RawLocalFileSystem.java:540)
        at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:532)
        at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:284)
        at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:364)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:503)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:403)
        at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:117)
        at org.apache.hadoop.mapred.lib.MultipleTextOutputFormat.getBaseRecordWriter(MultipleTextOutputFormat.java:44)
        at org.apache.hadoop.mapred.lib.MultipleOutputFormat$1.write(MultipleOutputFormat.java:99)
        at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:300)
        at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:39)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:201)
Caused by: java.io.IOException: error=24, Too many open files
        at java.lang.UNIXProcess.forkAndExec(Native Method)
        at java.lang.UNIXProcess.<init>(UNIXProcess.java:53)
        at java.lang.ProcessImpl.start(ProcessImpl.java:91)
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
        ... 17 more
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1113)
        at com.ibiodiversity.index.mapreduce.occurrence.geometry.OccurrenceByPolygonIntersection.splitOccurrenceDataIntoCells(OccurrenceByPolygonIntersection.java:95)
        at com.ibiodiversity.index.mapreduce.occurrence.geometry.OccurrenceByPolygonIntersection.run(OccurrenceByPolygonIntersection.java:54)
        at com.ibiodiversity.index.mapreduce.occurrence.geometry.OccurrenceByPolygonIntersection.main(OccurrenceByPolygonIntersection.java:190)


Is this a problem because I am working on my single machine at the
moment, and will it go away when I run on the cluster of 25?

Yes. The problem is likely because of the single machine and the LocalJobRunner; I think it should go away on a cluster.
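
If you want to confirm it before moving to the cluster, on a Sun JVM
you can watch the process's file-descriptor usage from inside the
task. A diagnostic sketch (this is plain JMX, not a Hadoop API):

import java.lang.management.ManagementFactory;
import com.sun.management.UnixOperatingSystemMXBean;

// Prints how many file descriptors this JVM currently holds and the
// per-process limit it runs against ("error=24" means that limit was hit).
public class FdUsage {
        public static void main(String[] args) {
                UnixOperatingSystemMXBean os = (UnixOperatingSystemMXBean)
                                ManagementFactory.getOperatingSystemMXBean();
                System.out.println("open fds: " + os.getOpenFileDescriptorCount());
                System.out.println("fd limit: " + os.getMaxFileDescriptorCount());
        }
}

-Amareshwari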
I am configuring the job:
  conf.setNumMapTasks(10);
  conf.setNumReduceTasks(5);

Are there perhaps better parameters, so that it does not try to manage
all the temp files in one go?
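
For what it's worth, here is the kind of thing I am wondering about (a
sketch only; the numbers are guesses, and CellOutputFormat is the
illustrative subclass from above). My understanding is that
MultipleOutputFormat keeps one RecordWriter, i.e. one open file, per
generated filename until the task finishes, so spreading the 64,000
cells over more reduce tasks should cut the open files per task:

import org.apache.hadoop.mapred.JobConf;

JobConf conf = new JobConf(OccurrenceByPolygonIntersection.class);
conf.setOutputFormat(CellOutputFormat.class);

// More reduce tasks -> fewer distinct cells, and so fewer simultaneously
// open output files, per task (~1,300 cells each with 50 reducers).
conf.setNumReduceTasks(50);

// io.sort.factor caps how many spill segments are merged at once (the
// "10 segments" in the log above); lowering it keeps fewer temp files
// open at the cost of extra merge passes.
conf.setInt("io.sort.factor", 5);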

Thanks for helping!

Tim
