Task process exit with nonzero status of 1 - deleting userlogs helps

2010-06-14 Thread Johannes Zillmann
Hi,

I am running a 4-node cluster with hadoop-0.20.2. I have suddenly run into a 
situation where every task scheduled on 2 of the 4 nodes fails. 
It seems the child JVM crashes. There are no child logs under logs/userlogs. 
The tasktracker log shows this:

2010-06-14 09:34:12,714 INFO org.apache.hadoop.mapred.JvmManager: In JvmRunner constructed JVM ID: jvm_201006091425_0049_m_-946174604
2010-06-14 09:34:12,714 INFO org.apache.hadoop.mapred.JvmManager: JVM Runner jvm_201006091425_0049_m_-946174604 spawned.
2010-06-14 09:34:12,727 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201006091425_0049_m_-946174604 exited. Number of tasks it ran: 0
2010-06-14 09:34:12,727 WARN org.apache.hadoop.mapred.TaskRunner: attempt_201006091425_0049_m_003179_0 Child Error
java.io.IOException: Task process exit with nonzero status of 1.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)


At some point I simply renamed logs/userlogs to logs/userlogsOLD. A new job 
created logs/userlogs again, and no further errors occurred on this host.
The permissions of userlogs and userlogsOLD are exactly the same. userlogsOLD 
contains about 378M in 132747 files. When I copy the contents of userlogsOLD 
back into userlogs, the tasks on that node start failing again.

Some questions:
- This looks to me like a problem with too many files in one folder. Any 
thoughts on this?
- Is the content of logs/userlogs cleaned up by Hadoop regularly?
- The logs/stdout files of the tasks do not exist, and the logs/out files of the 
tasktracker contain no specific message (other than the one posted above). Is 
there any other log file where an error message could be found?


best regards
Johannes

Re: Task process exit with nonzero status of 1 - deleting userlogs helps

2010-06-14 Thread Edward Capriolo
Most file systems have an upper limit on the number of files or subdirectories
in a single directory. You have probably hit the ext3 limit, which is roughly
32,000 subdirectories per directory. If you launch lots and lots of jobs you
can hit the limit before any cleanup happens.
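
A quick way to test this theory is to count the immediate subdirectories of
logs/userlogs and compare the result with the ext3 ceiling. The sketch below
assumes ext3's limit of 32000 hard links per directory inode, which leaves room
for 31998 subdirectories. The function names are only illustrative.

```python
import os

# Assumption: ext3 allows 32000 hard links per inode. A directory uses two
# links for itself ("." and its entry in the parent), and each subdirectory
# adds one more (its ".." entry).
EXT3_MAX_LINKS = 32000
EXT3_MAX_SUBDIRS = EXT3_MAX_LINKS - 2


def count_subdirs(path):
    """Return the number of immediate subdirectories of path.

    Symbolic links are not followed, so a link to a directory is not counted.
    """
    with os.scandir(path) as entries:
        return sum(1 for e in entries if e.is_dir(follow_symlinks=False))


def subdir_headroom(path, limit=EXT3_MAX_SUBDIRS):
    """Return how many more subdirectories fit under path before the limit."""
    return limit - count_subdirs(path)
```

Calling subdir_headroom("logs/userlogs") on an affected node would show how
close the directory is to the limit. A result of zero means no further
subdirectory can be created there.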

You can experiment with cleanup and other filesystems. The following
log-related issue might be relevant.

https://issues.apache.org/jira/browse/MAPREDUCE-323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877614#action_12877614

Regards,
Edward


Re: Task process exit with nonzero status of 1 - deleting userlogs helps

2010-06-16 Thread Johannes Zillmann
Hi Edward,

I copied the userlogs folder which caused the error. 
Two things speak against the too-many-files theory:
a) I can add new files to this folder (touch userlogsOLD/a, etc.) 
b) sysctl fs.file-max shows 817874, whereas the file count on the first 
level of userlogsOLD is 31999 and the recursive file count is 107400.
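
A caveat on both checks, offered as a hedged aside: touch creates a regular
file, while the per-directory limit on ext3 applies to subdirectories, and
fs.file-max caps open file handles system-wide rather than entries in a
directory. Since each task attempt gets its own directory under userlogs, a
closer probe is to try creating a subdirectory. A minimal sketch, where the
function name and the probe name are hypothetical:

```python
import errno
import os


def can_create_subdir(parent, probe_name="limit_probe"):
    """Try to create and then remove a subdirectory of parent.

    Returns True on success and False when mkdir fails with EMLINK
    ("Too many links"), the error reported when the parent directory's
    link count is exhausted. Any other error is re-raised.
    """
    probe = os.path.join(parent, probe_name)
    try:
        os.mkdir(probe)
    except OSError as e:
        if e.errno == errno.EMLINK:
            return False
        raise
    # Clean up so the probe does not consume a link itself.
    os.rmdir(probe)
    return True
```

If can_create_subdir("logs/userlogsOLD") returns False while touch still
succeeds, that would point at the subdirectory limit rather than away from it.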

Any thoughts ?
Johannes


Re: Task process exit with nonzero status of 1 - deleting userlogs helps

2010-06-16 Thread Amareshwari Sri Ramadasu
The issue is fixed in branch 0.21 by MAPREDUCE-927 
(http://issues.apache.org/jira/browse/MAPREDUCE-927).
The attempt directories are now placed inside a per-job directory, so the 
userlogs directory contains only job directories.
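
As an illustration of the new layout, the job directory for an attempt can be
derived from the attempt ID. This sketch assumes the usual ID formats,
attempt_<jobtracker start>_<job number>_<m|r>_<task>_<attempt> and
job_<jobtracker start>_<job number>. The helper names are made up for this
example and are not part of Hadoop.

```python
import os
import re

# Matches IDs such as attempt_201006091425_0049_m_003179_0.
ATTEMPT_RE = re.compile(r"^attempt_(\d+)_(\d+)_[mr]_\d+_\d+$")


def job_id_for_attempt(attempt_id):
    """Derive the job ID from a task attempt ID."""
    m = ATTEMPT_RE.match(attempt_id)
    if m is None:
        raise ValueError("not a task attempt ID: %r" % attempt_id)
    return "job_%s_%s" % (m.group(1), m.group(2))


def attempt_log_dir(userlogs, attempt_id):
    """Return the attempt's log directory, nested under its job directory."""
    return os.path.join(userlogs, job_id_for_attempt(attempt_id), attempt_id)
```

With this nesting, the number of first-level entries in userlogs grows with the
number of jobs rather than with the number of task attempts.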

Thanks
Amareshwari