On 26-Sep-08, at 3:09 PM, Eric Zhang wrote:
Hi,
I encountered the following FileNotFoundException, caused by a "too
many open files" error, when I tried to run a job. The job had run
several times before without problems. The exception confuses me
because my code closes all of its files, and even if it didn't, the
job has only 10-20 small input/output files. The open-file limit
on my box is 1024. Besides, the error seemed to happen even before
the task was executed. I am using version 0.17.
I'd appreciate it if somebody could shed some light on this issue.
BTW, the job ran OK after I restarted Hadoop. Yes, the
hadoop-site.xml did exist in that directory.
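When chasing a "too many open files" error like this, it helps to compare the per-process limit against the number of descriptors a given process actually holds. A minimal sketch on a Linux box (assuming /proc is available; $$ here is just the current shell, substitute the pid of the TaskTracker or DataNode you suspect):

```shell
# Show the per-process open-file limit (the 1024 mentioned above)
ulimit -n

# Count file descriptors currently open by a process; $$ (this shell)
# is a stand-in for the daemon pid you actually want to inspect
ls /proc/$$/fd | wc -l
```

If the count creeps up toward the limit across job runs, something is leaking descriptors rather than your job legitimately needing that many files.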
I had the same errors, including the bash one. Running one particular
job would cause all subsequent jobs of any kind to fail, even after
all running jobs had completed or failed out. This was confusing
because the failing jobs themselves often had no relationship to the
cause; they were just running in a bad environment.
If you can't successfully run a dummy job (with the identity mapper
and reducer, or a streaming job with cat) once you start getting
failures, then you are probably in the same situation.
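A dummy job of the kind described above might look like this with Hadoop streaming, using cat as both mapper and reducer. The jar path and the input/output paths are assumptions for illustration; adjust them for your install (in 0.17/0.18 the streaming jar lived under contrib/streaming):

```shell
# Canary job: if even this fails, the cluster itself is in a bad state,
# not your job. Paths below are placeholders, not real install paths.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
    -input  /tmp/canary-input \
    -output /tmp/canary-output \
    -mapper /bin/cat \
    -reducer /bin/cat
```

This has to run against a live cluster, so it is a command sketch rather than something you can execute standalone.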
I believe the problem was caused by increasing the timeout, but I
never pinned it down enough to submit a Jira issue. It might have
been the XML reader or something else. I was using streaming,
hadoop-ec2, and either 0.17.0 or 0.18.0. It would happen just as
rapidly after I made an EC2 image with a higher open-file limit.
Eventually I figured it out by running each job in my pipeline 5 or so
times before trying the next one, which let me see which job was
causing the problem (because it would eventually fail itself, rather
than hosing a later job).
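The isolation strategy above can be sketched as a small loop. `run_job.sh` below is a hypothetical wrapper around whichever pipeline stage is being tested:

```shell
# Run one pipeline stage several times in a row. If the stage leaks
# resources (file descriptors, etc.), it will eventually fail on its
# own instead of hosing an unrelated later job.
run_stage_repeatedly() {
    stage_cmd=$1
    runs=${2:-5}
    i=1
    while [ "$i" -le "$runs" ]; do
        if ! $stage_cmd; then
            echo "stage failed on run $i" >&2
            return 1
        fi
        i=$((i + 1))
    done
    echo "stage survived $runs runs"
}

# Hypothetical usage:
#   run_stage_repeatedly ./run_job.sh 5
```

The first stage that fails under repetition is the one poisoning the environment for everything after it.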
Karl Anderson
[EMAIL PROTECTED]
http://monkey.org/~kra