I'm running several Hadoop jobs sequentially on one cluster. Later jobs are dying with "too many open files" errors, and earlier runs tend to cause later runs to die - in other words, file descriptors aren't being released somewhere.

By running a job over and over again, I can cause all subsequent jobs to die, even jobs that had successfully run earlier.

I'm using streaming on a hadoop-ec2 cluster (Hadoop 18.0). All of my inputs and outputs live in HDFS and are handled by streaming through stdin and stdout; the jobs never read or write files as a side effect. Each job uses the HDFS output of a previous job as its input, but the jobs are all separate Hadoop processes, and only one runs at a time.
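In case it helps, the driver is basically just a loop like the sketch below. The script names, HDFS paths, and jar location are placeholders, not my actual job:

    import os
    import subprocess

    # Hypothetical driver: each streaming step runs as its own Hadoop process,
    # and the HDFS output of one step becomes the input of the next.
    streaming_jar = os.path.join(
        os.environ["HADOOP_HOME"], "contrib", "streaming",
        "hadoop-0.18.0-streaming.jar")  # adjust to wherever your streaming jar lives

    steps = [
        ("step1_mapper.py", "step1_reducer.py", "/data/raw",   "/data/step1"),
        ("step2_mapper.py", "step2_reducer.py", "/data/step1", "/data/step2"),
    ]

    for mapper, reducer, hdfs_in, hdfs_out in steps:
        cmd = [
            "hadoop", "jar", streaming_jar,
            "-input", hdfs_in,
            "-output", hdfs_out,
            "-mapper", mapper,
            "-reducer", reducer,
            "-file", mapper,
            "-file", reducer,
        ]
        # Only one job at a time: wait for each to finish before starting the next.
        subprocess.check_call(cmd)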

I have increased the open file limit for root to 65536 in limits.conf on my EC2 image, but that hasn't helped.
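One thing I realize I should double-check: limits.conf is applied through PAM at login, so daemons started at boot might never inherit the raised limit. A rough sketch of how I'd verify what a process actually gets (run it from the same environment the Hadoop daemons start in):

    import resource

    # Print the file-descriptor limits this process actually inherited.
    # If the daemons are started outside a PAM login session, they may
    # still be running with the old default (often 1024).
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("soft nofile limit:", soft)
    print("hard nofile limit:", hard)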

Is there any solution other than firing up a new cluster for each job?

I could file a bug, but I'm not sure what's actually consuming the descriptors. On a random task box, counting the entries under /proc/<pid>/fd shows only 359 open fds for the entire box, and the most held by any single process is 174.
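For what it's worth, those numbers came from a quick script along these lines, which just counts the entries under each /proc/<pid>/fd (run as root so every process's fd directory is readable):

    import os

    # Count open file descriptors per process by listing /proc/<pid>/fd.
    counts = {}
    for pid in os.listdir("/proc"):
        if not pid.isdigit():
            continue
        try:
            counts[pid] = len(os.listdir(os.path.join("/proc", pid, "fd")))
        except OSError:
            # The process exited, or its fd directory isn't readable; skip it.
            continue

    print("total open fds on the box:", sum(counts.values()))
    if counts:
        busiest = max(counts, key=counts.get)
        print("busiest pid:", busiest, "holding", counts[busiest], "fds")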
