Hi Lance,

Two thoughts here that might be the culprit:

1) Is it possible that the partition that your mapred.local.dir is on is out
of space on that task tracker?

2) Is it possible that you're using a directory under /tmp for
mapred.local.dir and some system cron script cleared out /tmp?

-Todd

On Sat, May 2, 2009 at 9:01 AM, Lance Riedel <la...@dotspots.com> wrote:

> Hi Todd,
> Not sure if this is related, but our hadoop cluster in general is getting
> more and more unstable.  the logs are full of this error message (but having
> trouble tracking down the root problem):
>
> 2009-05-02 11:30:39,294 INFO org.apache.hadoop.mapred.TaskTracker:
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
> taskTracker/jobcache/job_200904301103_1111/attempt_200904301103_1111_m_000001_1/output/file.out
> in any of the configured local directories
> 2009-05-02 11:30:39,294 INFO org.apache.hadoop.mapred.TaskTracker:
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
> taskTracker/jobcache/job_200904301103_1675/attempt_200904301103_1675_r_000012_1/output/file.out
> in any of the configured local directories
> 2009-05-02 11:30:44,295 INFO org.apache.hadoop.mapred.TaskTracker:
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
> taskTracker/jobcache/job_200904301103_0944/attempt_200904301103_0944_r_000015_0/output/file.out
> in any of the configured local directories
> 2009-05-02 11:30:44,295 INFO org.apache.hadoop.mapred.TaskTracker:
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
> taskTracker/jobcache/job_200904301103_1111/attempt_200904301103_1111_m_000001_1/output/file.out
> in any of the configured local directories
> 2009-05-02 11:30:44,295 INFO org.apache.hadoop.mapred.TaskTracker:
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
> taskTracker/jobcache/job_200904301103_1675/attempt_200904301103_1675_r_000012_1/output/file.out
> in any of the configured local directories
> 2009-05-02 11:30:49,296 INFO org.apache.hadoop.mapred.TaskTracker:
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
> taskTracker/jobcache/job_200904301103_0944/attempt_200904301103_0944_r_000015_0/output/file.out
> in any of the configured local directories
> 2009-05-02 11:30:49,296 INFO org.apache.hadoop.mapred.TaskTracker:
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
> taskTracker/jobcache/job_200904301103_1111/attempt_200904301103_1111_m_000001_1/output/file.out
> in any of the configured local directories
> 2009-05-02 11:30:49,297 INFO org.apache.hadoop.mapred.TaskTracker:
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
> taskTracker/jobcache/job_200904301103_1675/attempt_200904301103_1675_r_000012_1/output/file.out
> in any of the configured local directories
> 2009-05-02 11:30:54,298 INFO org.apache.hadoop.mapred.TaskTracker:
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
> taskTracker/jobcache/job_200904301103_0944/attempt_200904301103_0944_r_000015_0/output/file.out
> in any of the configured local directories
>
>
> Lance
>
>
> On Apr 30, 2009, at 12:04 PM, Todd Lipcon wrote:
>
>  Hey Lance,
>>
>> Thanks for the logs. They definitely confirmed my suspicion. There are two
>> problems here:
>>
>> 1) If the JobTracker throws an exception during processing of a heartbeat,
>> the tasktracker retries with no delay, since lastHeartbeat isn't updated
>> in
>> TaskTracker.offerService. This is related to HADOOP-3987
>>
>> 2) If the TaskTracker sends a task in COMMIT_PENDING state with an invalid
>> task id, the jobtracker will trigger a NullPointerException in
>> JobTracker.getTasksToSave. Instead it should probably create a
>> KillTaskAction. I just filed a JIRA to track this issue:
>>
>> https://issues.apache.org/jira/browse/HADOOP-5761
>>
>> 3) The TaskTracker somehow had a task attempt in COMMIT_PENDING state that
>> the JobTracker didn't know about. How it got there is a separate problem
>> that's a bit harder to track down.
>>
>> Thanks
>> -Todd
>>
>> On Thu, Apr 30, 2009 at 11:17 AM, Lance Riedel <la...@dotspots.com>
>> wrote:
>>
>>  Here are the job tracker logs from the same time (and yes.. there is
>>> something there!!):
>>>
>>>
>>> 2009-04-30 02:34:28,484 INFO org.apache.hadoop.mapred.JobTracker: Serious
>>> problem.  While updating status, cannot find taskid
>>> attempt_200904291917_0252_r_000003_0
>>>
>>>
>>> 2009-04-30 02:34:40,215 INFO org.apache.hadoop.ipc.Server: IPC Server
>>> handler 2 on 54311, call
>>> heartbeat(org.apache.hadoop.mapred.tasktrackersta...@1a93388, false,
>>> true,
>>> 5341) from 10.253.134.191:42688: error: java.io.IOException:
>>> java.lang.NullPointerException
>>> java.io.IOException: java.lang.NullPointerException
>>>      at
>>> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130)
>>>      at
>>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
>>>      at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
>>>      at
>>>
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>      at java.lang.reflect.Method.invoke(Method.java:597)
>>>      at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>>>      at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
>>> 2009-04-30 02:34:40,215 INFO org.apache.hadoop.mapred.JobTracker: Serious
>>> problem.  While updating status, cannot find taskid
>>> attempt_200904291917_0296_r_000014_1
>>> 2009-04-30 02:34:40,217 INFO org.apache.hadoop.mapred.JobTracker: Adding
>>> task 'attempt_200904291917_0352_r_000013_0' to tip
>>> task_200904291917_0352_r_000013, for tracker
>>> 'tracker_domU-12-31-38-00-F0-41.compute-1.internal:localhost.localdomain/
>>> 127.0.0.1:42479'
>>> 2009-04-30 02:34:40,217 INFO org.apache.hadoop.mapred.JobTracker: Removed
>>> completed task 'attempt_200904291917_0343_m_000003_0' from
>>> 'tracker_domU-12-31-38-00-F0-41.compute-1.internal:localhost.localdomain/
>>> 127.0.0.1:42479'
>>> 2009-04-30 02:34:40,217 INFO org.apache.hadoop.mapred.JobTracker: Removed
>>> completed task 'attempt_200904291917_0343_m_000007_0' from
>>> 'tracker_domU-12-31-38-00-F0-41.compute-1.internal:localhost.localdomain/
>>> 127.0.0.1:42479'
>>>
>>>
>>> And then.. a LOT more
>>>
>>>
>>> 2009-04-30 02:34:40,433 INFO org.apache.hadoop.mapred.JobTracker: Serious
>>> problem.  While updating status, cannot find taskid
>>> attempt_200904291917_0252_r_000003_0
>>> 2009-04-30 02:34:40,433 WARN org.apache.hadoop.mapred.TaskInProgress:
>>> Recieved duplicate status update of 'KILLED' for
>>> 'attempt_200904291917_0352_m_000010_1' of TIP
>>> 'task_200904291917_0352_m_000010'
>>> 2009-04-30 02:34:40,433 INFO org.apache.hadoop.ipc.Server: IPC Server
>>> handler 1 on 54311, call
>>> heartbeat(org.apache.hadoop.mapred.tasktrackersta...@1b7b4c1, false,
>>> true,
>>> 5341) from 10.253.134.191:42688: error: java.io.IOException:
>>> java.lang.NullPointerException
>>> java.io.IOException: java.lang.NullPointerException
>>>      at
>>> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130)
>>>      at
>>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
>>>      at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
>>>      at
>>>
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>      at java.lang.reflect.Method.invoke(Method.java:597)
>>>      at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>>>      at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
>>> 2009-04-30 02:34:40,441 INFO org.apache.hadoop.mapred.JobTracker: Serious
>>> problem.  While updating status, cannot find taskid
>>> attempt_200904291917_0252_r_000003_0
>>> 2009-04-30 02:34:40,441 WARN org.apache.hadoop.mapred.TaskInProgress:
>>> Recieved duplicate status update of 'KILLED' for
>>> 'attempt_200904291917_0352_m_000010_1' of TIP
>>> 'task_200904291917_0352_m_000010'
>>> 2009-04-30 02:34:40,442 INFO org.apache.hadoop.ipc.Server: IPC Server
>>> handler 1 on 54311, call
>>> heartbeat(org.apache.hadoop.mapred.tasktrackersta...@1598c57, false,
>>> true,
>>> 5341) from 10.253.134.191:42688: error: java.io.IOException:
>>> java.lang.NullPointerException
>>> java.io.IOException: java.lang.NullPointerException
>>>      at
>>> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130)
>>>      at
>>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
>>>      at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
>>>      at
>>>
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>      at java.lang.reflect.Method.invoke(Method.java:597)
>>>      at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>>>      at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
>>> 2009-04-30 02:34:40,444 INFO org.apache.hadoop.mapred.JobTracker: Serious
>>> problem.  While updating status, cannot find taskid
>>> attempt_200904291917_0252_r_000003_0
>>> 2009-04-30 02:34:40,444 WARN org.apache.hadoop.mapred.TaskInProgress:
>>> Recieved duplicate status update of 'KILLED' for
>>> 'attempt_200904291917_0352_m_000010_1' of TIP
>>> 'task_200904291917_0352_m_000010'
>>> 2009-04-30 02:34:40,444 INFO org.apache.hadoop.ipc.Server: IPC Server
>>> handler 7 on 54311, call
>>> heartbeat(org.apache.hadoop.mapred.tasktrackersta...@cb3d5, false, true,
>>> 5341) from 10.253.134.191:42688: error: java.io.IOException:
>>> java.lang.NullPointerException
>>> java.io.IOException: java.lang.NullPointerException
>>>      at
>>> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130)
>>>      at
>>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
>>>      at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
>>>      at
>>>
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>      at java.lang.reflect.Method.invoke(Method.java:597)
>>>      at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>>>      at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
>>> 2009-04-30 02:34:40,447 INFO org.apache.hadoop.mapred.JobTracker: Serious
>>> problem.  While updating status, cannot find taskid
>>> attempt_200904291917_0252_r_000003_0
>>> 2009-04-30 02:34:40,447 WARN org.apache.hadoop.mapred.TaskInProgress:
>>> Recieved duplicate status update of 'KILLED' for
>>> 'attempt_200904291917_0352_m_000010_1' of TIP
>>> 'task_200904291917_0352_m_000010'
>>> 2009-04-30 02:34:40,447 INFO org.apache.hadoop.ipc.Server: IPC Server
>>> handler 1 on 54311, call
>>> heartbeat(org.apache.hadoop.mapred.tasktrackersta...@1f27d98, false,
>>> true,
>>> 5341) from 10.253.134.191:42688: error: java.io.IOException:
>>> java.lang.NullPointerException
>>> java.io.IOException: java.lang.NullPointerException
>>>      at
>>> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130)
>>>      at
>>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
>>>      at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
>>>      at
>>>
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>      at java.lang.reflect.Method.invoke(Method.java:597)
>>>      at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>>>      at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Apr 30, 2009, at 11:04 AM, Todd Lipcon wrote:
>>>
>>> Hey Lance,
>>>
>>>>
>>>> Did you see any error messages in the JobTracker logs around the time
>>>> this
>>>> started? I think I understand how this might happen.
>>>>
>>>> Thanks,
>>>> -Todd
>>>>
>>>> On Thu, Apr 30, 2009 at 10:45 AM, Lance Riedel <la...@dotspots.com>
>>>> wrote:
>>>>
>>>> I have not been able to reproduce.  We are using version 19.1 with the
>>>>
>>>>> following patches:
>>>>> 4780-2v19.patch (Jira  4780)
>>>>> closeAll3.patch (Jira 3998)
>>>>>
>>>>> Thanks,
>>>>> Lance
>>>>>
>>>>>
>>>>> On Apr 30, 2009, at 10:40 AM, Todd Lipcon wrote:
>>>>>
>>>>> Hi Lance,
>>>>>
>>>>>
>>>>>> Can I ask what version you were running when you saw this? Is it
>>>>>> reproducible? I'm trying to look at the code path that might produce
>>>>>> such
>>>>>> a
>>>>>> behavior and want to make sure I'm looking at the right version.
>>>>>>
>>>>>> Thanks
>>>>>> -Todd
>>>>>>
>>>>>> On Thu, Apr 30, 2009 at 9:33 AM, Lance Riedel <la...@dotspots.com>
>>>>>> wrote:
>>>>>>
>>>>>> Has anyone seen this before? Our task tracker produced a 2.7 gig log
>>>>>> file
>>>>>>
>>>>>>  in a few hours. The entry is all the same (every 2 ms):
>>>>>>>
>>>>>>> 2009-04-30 02:34:40,207 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>>>>>> reponseId '5341
>>>>>>> 2009-04-30 02:34:40,398 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>>>>>> reponseId '5341
>>>>>>> 2009-04-30 02:34:40,403 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>>>>>> reponseId '5341
>>>>>>> 2009-04-30 02:34:40,411 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>>>>>> reponseId '5341
>>>>>>> 2009-04-30 02:34:40,414 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>>>>>> reponseId '5341
>>>>>>> 2009-04-30 02:34:40,417 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>>>>>> reponseId '5341
>>>>>>> 2009-04-30 02:34:40,420 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>>>>>> reponseId '5341
>>>>>>> ... (And on and on and on...)
>>>>>>>
>>>>>>>
>>>>>>> These are the few lines before it started:
>>>>>>>
>>>>>>> 2009-04-30 02:34:29,780 INFO
>>>>>>> org.apache.hadoop.mapred.TaskTracker.clienttrace: src:
>>>>>>> xxx.xxx.xxx.xxx:50060, dest: 10.253.178.95:40268, bytes: 3341324,
>>>>>>> op:
>>>>>>> MAPRED_SHUFFLE, cliID: attempt_200904291917_0352_m_000006_0
>>>>>>> 2009-04-30 02:34:31,522 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>> Sent
>>>>>>> out
>>>>>>> 418891 bytes for reduce: 12 from map:
>>>>>>> attempt_200904291917_0352_m_000007_0
>>>>>>> given 418891/418887 from 4301462 with (22, 171)
>>>>>>> 2009-04-30 02:34:31,522 INFO
>>>>>>> org.apache.hadoop.mapred.TaskTracker.clienttrace: src:
>>>>>>> xxx.xxx.xxx.xxx:50060, dest: xxx.xxx.xxx.xxx:40268, bytes: 418891,
>>>>>>> op:
>>>>>>> MAPRED_SHUFFLE, cliID: attempt_200904291917_0352_m_000007_0
>>>>>>> 2009-04-30 02:34:35,382 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>> attempt_200904291917_0352_r_000003_0 0.3030303% reduce > copy (10 of
>>>>>>> 11
>>>>>>> at
>>>>>>> 0.32 MB/s) >
>>>>>>> 2009-04-30 02:34:38,385 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>> attempt_200904291917_0352_r_000003_0 0.3030303% reduce > copy (10 of
>>>>>>> 11
>>>>>>> at
>>>>>>> 0.32 MB/s) >
>>>>>>> 2009-04-30 02:34:40,207 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>>>>>> reponseId '5341
>>>>>>> 2009-04-30 02:34:40,398 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>>>>>> reponseId '5341
>>>>>>> 2009-04-30 02:34:40,403 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>>>>>> reponseId '5341
>>>>>>> 2009-04-30 02:34:40,411 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>>>>>> reponseId '5341
>>>>>>> 2009-04-30 02:34:40,414 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>>>>>> reponseId '5341
>>>>>>> 2009-04-30 02:34:40,417 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>>>>>> reponseId '5341
>>>>>>> 2009-04-30 02:34:40,420 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
>>>>>>> reponseId '5341
>>>>>>>
>>>>>>> --And on for    2+ gigs
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>
>

Reply via email to