Just an FYI, found the solution to this problem.

Apparently, it's a filesystem limit on the number of sub-directories that can be 
created within a single directory.  In this case, we had 31998 sub-directories under 
hadoop/userlogs/, so any new task would fail during Job Setup.

From the Unix command line, mkdir fails as well:
  $ mkdir hadoop/userlogs/testdir
  mkdir: cannot create directory `hadoop/userlogs/testdir': Too many links

This was difficult to track down because the Hadoop error message gives no hint 
whatsoever.  Normally, you'd look in the userlog itself for more info, but in 
this case the userlog couldn't even be created.
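
For anyone who hits this later, here's a quick way to confirm you're up against 
the limit (a rough sketch; the ~32000-link ceiling is the ext3 default, and 
other filesystems differ):

  # each sub-directory adds a hard link to its parent, and ext3 allows roughly
  # 32000 links per directory, so ~31998 sub-directories is the ceiling
  $ find hadoop/userlogs/ -mindepth 1 -maxdepth 1 -type d | wc -l
  31998

  # the parent directory's link count tells the same story (2 + sub-directories)
  $ stat -c %h hadoop/userlogs/
  32000

Clearing out old task log directories under userlogs/ (or lowering 
mapred.userlog.retain.hours, if your version actually prunes on it) keeps the 
count below the limit.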

Marc


-----Original Message-----
From: Marc Limotte
Sent: Wednesday, September 23, 2009 11:06 AM
To: 'core-u...@hadoop.apache.org'
Subject: Task process exit with nonzero status of 1

I'm seeing this error when I try to run my job.

java.io.IOException: Task process exit with nonzero status of 1.
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)

From what I can find by doing some Google searches, this means the mapred task 
JVM has crashed.  There aren't many suggestions about what to do about it, beyond 
increasing the max heap.  I tried that, although I don't think that's the issue: 
it's not a particularly memory-intensive process, and I've even tried it with a 
very small input data set of only a few records.  Still the same issue.
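
For what it's worth, the heap experiment amounted to raising 
mapred.child.java.opts above the 0.20 default (-Xmx200m), something along these 
lines (the jar and class names are placeholders, and the -D only takes effect 
if the job's driver goes through ToolRunner/GenericOptionsParser):

  # bump the per-task child JVM heap for this run only
  $ hadoop jar our-job.jar com.example.OurJob \
      -D mapred.child.java.opts=-Xmx1024m \
      /input/path /output/path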

I can't find anything else in the logs.  I don't think my task even started, 
because no user logs were created at all.  It seems to fail during Job Setup.
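
The only other place I know to look is the TaskTracker daemon log on the node 
that ran the setup task; roughly this (log location shown is the default for a 
tarball install, adjust for a packaged one):

  # daemon log is named hadoop-<user>-tasktracker-<host>.log
  $ grep -i "nonzero status" $HADOOP_HOME/logs/hadoop-*-tasktracker-*.log | tail
  $ tail -200 $HADOOP_HOME/logs/hadoop-*-tasktracker-*.log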

A little more background.  This job was working fine for weeks, running hourly, 
and then failed on Saturday morning and hasn't worked since.  Obviously, I 
looked for something that changed at that point, but no one was working at that 
time... I can't find anything that changed.  I tried the job with different input 
data sets and it doesn't seem to matter, unless I run it with no data at all.  The 
job does run with no input data, but if I have even a few input records it 
fails, and it doesn't seem to matter which records.  I suspected some corruption 
in HDFS, but I was able to extract the data from HDFS (hadoop dfs -get ...) and 
the data looks ok.  I also copied this data set to our TEST cluster and ran the 
job there... and it WORKED!

I ran one of our other jobs and it failed as well, so it doesn't seem to be 
job-specific either; it looks like every job fails the same way.

I did a complete reboot of the cluster, with no impact.

We're using Hadoop 0.20.0 and Java 1.6 update 16 on CentOS 5.2 (64-bit).

Any suggestions on what could be wrong or where to look for more information 
would be appreciated.



Marc Limotte
Feeva Technology

