Re: Infinite Loop Resending status from task tracker

Lance Riedel Thu, 14 May 2009 10:01:59 -0700

Sorry, had missed that Todd had created Jira -
HADOOP-5761<https://issues.apache.org/jira/browse/HADOOP-5761>


Any progress there?

Thanks,
Lance

On Thu, May 14, 2009 at 8:52 AM, Lance Riedel <la...@dotspots.com> wrote:

> Here is the point in the logs where the infinite loop begins - see time
> stamp 2009-05-14 04:03:56,348 : (JobTracker)
>
> 2009-05-14 04:03:56,324 INFO org.apache.hadoop.mapred.JobTracker: Removed
> completed task 'attempt_200905122015_1168_m_000029_0' from
> 'tracker_domU-12-31-38-01-74-F1.compute-1.internal:localhost.localdomain/
> 127.0.0.1:35214'
> 2009-05-14 04:03:56,326 INFO org.apache.hadoop.mapred.JobTracker: Adding
> task 'attempt_200905122015_1183_r_000007_0' to tip
> task_200905122015_1183_r_000007, for tracker
> 'tracker_domU-12-31-38-00-F0-41.compute-1.internal:localhost.localdomain/
> 127.0.0.1:58504'
> 2009-05-14 04:03:56,327 INFO org.apache.hadoop.mapred.JobTracker: Adding
> task (cleanup)'attempt_200905122015_1184_m_000000_1' to tip
> task_200905122015_1184_m_000000, for tracker
> 'tracker_domU-12-31-38-00-80-21.compute-1.internal:localhost.localdomain/
> 127.0.0.1
> :57741'
> 2009-05-14 04:03:56,330 INFO org.apache.hadoop.mapred.JobInProgress: Task
> 'attempt_200905122015_1182_r_000011_0' has completed
> task_200905122015_1182_r_000011 successfully.
> 2009-05-14 04:03:56,330 INFO org.apache.hadoop.mapred.JobTracker: Adding
> task (cleanup)'attempt_200905122015_1182_r_000010_1' to tip
> task_200905122015_1182_r_000010, for tracker
> 'tracker_domU-12-31-38-01-5C-41.compute-1.internal:localhost.localdomain/
> 127.0.0.1
> :46248'
> 2009-05-14 04:03:56,331 INFO org.apache.hadoop.mapred.JobTracker: Serious
> problem.  While updating status, cannot find taskid
> attempt_200905122015_0499_r_000004_1
> 2009-05-14 04:03:56,331 INFO org.apache.hadoop.mapred.JobInProgress: Task
> 'attempt_200905122015_1184_m_000004_1' has completed
> task_200905122015_1184_m_000004 successfully.
> 2009-05-14 04:03:56,331 INFO org.apache.hadoop.mapred.ResourceEstimator:
> measured blowup on task_200905122015_1184_m_000004 was 20150008/21581175 =
> 0.93368447269437372009-05-14 04:03:56,331 INFO
> org.apache.hadoop.mapred.ResourceEstimator: new estimate is blowup =
> 0.9152383292400812
> 2009-05-14 04:03:56,335 INFO org.apache.hadoop.mapred.JobTracker: Adding
> task 'attempt_200905122015_1183_r_000008_0' to tip
> task_200905122015_1183_r_000008, for tracker
> 'tracker_domU-12-31-38-00-80-21.compute-1.internal:localhost.localdomain/
> 127.0.0.1:57741'
> 2009-05-14 04:03:56,336 INFO org.apache.hadoop.mapred.JobTracker: Adding
> task 'attempt_200905122015_1183_r_000009_0' to tip
> task_200905122015_1183_r_000009, for tracker
> 'tracker_domU-12-31-38-01-74-F1.compute-1.internal:localhost.localdomain/
> 127.0.0.1:35214'
> 2009-05-14 04:03:56,336 INFO org.apache.hadoop.mapred.JobTracker: Removed
> completed task 'attempt_200905122015_1181_r_000009_0' from
> 'tracker_domU-12-31-38-01-74-F1.compute-1.internal:localhost.localdomain/
> 127.0.0.1:35214'
> 2009-05-14 04:03:56,336 INFO org.apache.hadoop.mapred.JobTracker: Serious
> problem.  While updating status, cannot find taskid
> attempt_200905122015_0499_r_000004_1
> 2009-05-14 04:03:56,337 INFO org.apache.hadoop.mapred.JobTracker: Adding
> task 'attempt_200905122015_1183_r_000010_0' to tip
> task_200905122015_1183_r_000010, for tracker
> 'tracker_domU-12-31-38-01-81-31.compute-1.internal:localhost.localdomain/
> 127.0.0.1:46518'
> 2009-05-14 04:03:56,343 INFO org.apache.hadoop.mapred.JobTracker: Serious
> problem.  While updating status, cannot find taskid
> attempt_200905122015_1070_r_000014_1
> 2009-05-14 04:03:56,344 INFO org.apache.hadoop.mapred.JobTracker: Adding
> task 'attempt_200905122015_1183_r_000011_0' to tip
> task_200905122015_1183_r_000011, for tracker
> 'tracker_domU-12-31-38-01-AD-91.compute-1.internal:localhost.localdomain/
> 127.0.0.1:33929'
> 2009-05-14 04:03:56,348 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 9 on 54311, call
> heartbeat(org.apache.hadoop.mapred.tasktrackersta...@a4fe4, false, true,
> 21461) from 10.253.178.95:50709: error: java.io.IOException:
> java.lang.NullPointerException
> java.io.IOException: java.lang.NullPointerException
>         at
> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130)
>         at
> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
>         at sun.reflect.GeneratedMethodAccessor117.invoke(Unknown
> Source)        at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
> 2009-05-14 04:03:56,351 INFO org.apache.hadoop.mapred.JobTracker:
> attempt_200905122015_1183_r_000007_0 is 24 ms debug.
> 2009-05-14 04:03:56,419 INFO org.apache.hadoop.mapred.JobTracker: Adding
> task 'attempt_200905122015_1183_r_000012_0' to tip
> task_200905122015_1183_r_000012, for tracker
> 'tracker_domU-12-31-38-00-F0-41.compute-1.internal:localhost.localdomain/
> 127.0.0.1:58504'
> 2009-05-14 04:03:56,459 INFO org.apache.hadoop.mapred.JobTracker: Serious
> problem.  While updating status, cannot find taskid
> attempt_200905122015_1070_r_000014_1
> 2009-05-14 04:03:56,459 WARN org.apache.hadoop.mapred.TaskInProgress:
> Recieved duplicate status update of 'KILLED' for
> 'attempt_200905122015_1182_r_000009_1' of TIP
> 'task_200905122015_1182_r_000009'
> 2009-05-14 04:03:56,460 INFO org.apache.hadoop.mapred.JobTracker: Adding
> task 'attempt_200905122015_1183_r_000013_0' to tip
> task_200905122015_1183_r_000013, for tracker
> 'tracker_domU-12-31-38-01-AD-91.compute-1.internal:localhost.localdomain/
> 127.0.0.1:33929'
> 2009-05-14 04:03:56,460 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 5 on 54311, call
> heartbeat(org.apache.hadoop.mapred.tasktrackersta...@1f1c507, false, true,
> 21461) from 10.253.178.95:50709: error: java.io.IOException:
> java.lang.NullPointerExceptio
> n
> java.io.IOException: java.lang.NullPointerException
>         at
> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130)
> at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
>         at sun.reflect.GeneratedMethodAccessor117.invoke(Unknown Source)
>         at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
> 2009-05-14 04:03:56,464 INFO org.apache.hadoop.mapred.JobTracker: Serious
> problem.  While updating status, cannot find taskid
> attempt_200905122015_1070_r_000014_1
> 2009-05-14 04:03:56,464 WARN org.apache.hadoop.mapred.TaskInProgress:
> Recieved duplicate status update of 'KILLED' for
> 'attempt_200905122015_1182_r_000009_1' of TIP
> 'task_200905122015_1182_r_000009'
> 2009-05-14 04:03:56,465 INFO org.apache.hadoop.mapred.JobTracker: Adding
> task 'attempt_200905122015_1183_r_000014_0' to tip
> task_200905122015_1183_r_000014, for tracker
> 'tracker_domU-12-31-38-01-AD-91.compute-1.internal:localhost.localdomain/
> 127.0.0.1:33929'
> 2009-05-14 04:03:56,465 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 2 on 54311, call
> heartbeat(org.apache.hadoop.mapred.tasktrackersta...@1d54fc9, false, true,
> 21461) from 10.253.178.95:50709: error: java.io.IOException:
> java.lang.NullPointerExceptio
> n
>
>
> <and on and on an on for 10 gigs worth in a few hours>
>
>
>
>
> On Thu, May 14, 2009 at 8:37 AM, Lance Riedel <la...@dotspots.com> wrote:
>
>> Just had another cluster crash with the same issue. This is still a huge
>> issue for us- still crashing our cluster every other night (actually almost
>> every night now).
>>
>> Should we move to .20? Is there more information i can provide?  Is this
>> related to my other email "Constantly getting DiskErrorExceptions - but
>> logged as INFO"? I haven't seen responses on that.
>>
>> Thanks!
>> Lance
>>
>> On Thu, May 14, 2009 at 7:48 AM, Lance Riedel <la...@dotspots.com> wrote:
>>
>>> Here is the latest here.. Haven't heard any more, but every other night
>>> we get 10 gigs logs and tons of failed tasks and have to restart the cluster
>>>
>>> ---------- Forwarded message ----------
>>> From: Lance Riedel <la...@dotspots.com>
>>> Date: Fri, May 8, 2009 at 10:49 AM
>>> Subject: Re: Infinite Loop Resending status from task tracker
>>> To: core-user@hadoop.apache.org
>>> Cc: Brian Long <br...@dotspots.com>
>>>
>>>
>>> Hi Todd,
>>> Sorry, my response got hung up in my outbox for a couple of days.. arghh
>>>
>>>
>>>
>>> Confirmed that 1) we are not running out of space and 2) that our
>>> mapred.local.dir directory is not in /tmp
>>>
>>> Not sure if this an ec2 problem with a mounted drive?
>>>
>>> We had the same thing happen again, exact same logs and symptoms
>>> (simultaneous in jobtracker and tasktracker)
>>>
>>> Thinking about moving to .20 because of this, any thoughts on that?
>>>
>>> Thanks,
>>> Lance
>>>
>>>
>>> On May 4, 2009, at 4:18 PM, Todd Lipcon wrote:
>>>
>>>  Hi Lance,
>>>>
>>>> Two thoughts here that might be the culprit:
>>>>
>>>> 1) Is it possible that the partition that your mapred.local.dir is on is
>>>> out
>>>> of space on that task tracker?
>>>>
>>>> 2) Is it possible that you're using a directory under /tmp for
>>>> mapred.local.dir and some system cron script cleared out /tmp?
>>>>
>>>> -Todd
>>>>
>>>> On Sat, May 2, 2009 at 9:01 AM, Lance Riedel <la...@dotspots.com>
>>>> wrote:
>>>>
>>>>  Hi Todd,
>>>>> Not sure if this is related, but our hadoop cluster in general is
>>>>> getting
>>>>> more and more unstable.  the logs are full of this error message (but
>>>>> having
>>>>> trouble tracking down the root problem):
>>>>>
>>>>> 2009-05-02 11:30:39,294 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>>>>
>>>>> taskTracker/jobcache/job_200904301103_1111/attempt_200904301103_1111_m_000001_1/output/file.out
>>>>> in any of the configured local directories
>>>>> 2009-05-02 11:30:39,294 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>>>>
>>>>> taskTracker/jobcache/job_200904301103_1675/attempt_200904301103_1675_r_000012_1/output/file.out
>>>>> in any of the configured local directories
>>>>> 2009-05-02 11:30:44,295 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>>>>
>>>>> taskTracker/jobcache/job_200904301103_0944/attempt_200904301103_0944_r_000015_0/output/file.out
>>>>> in any of the configured local directories
>>>>> 2009-05-02 11:30:44,295 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>>>>
>>>>> taskTracker/jobcache/job_200904301103_1111/attempt_200904301103_1111_m_000001_1/output/file.out
>>>>> in any of the configured local directories
>>>>> 2009-05-02 11:30:44,295 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>>>>
>>>>> taskTracker/jobcache/job_200904301103_1675/attempt_200904301103_1675_r_000012_1/output/file.out
>>>>> in any of the configured local directories
>>>>> 2009-05-02 11:30:49,296 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>>>>
>>>>> taskTracker/jobcache/job_200904301103_0944/attempt_200904301103_0944_r_000015_0/output/file.out
>>>>> in any of the configured local directories
>>>>> 2009-05-02 11:30:49,296 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>>>>
>>>>> taskTracker/jobcache/job_200904301103_1111/attempt_200904301103_1111_m_000001_1/output/file.out
>>>>> in any of the configured local directories
>>>>> 2009-05-02 11:30:49,297 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>>>>
>>>>> taskTracker/jobcache/job_200904301103_1675/attempt_200904301103_1675_r_000012_1/output/file.out
>>>>> in any of the configured local directories
>>>>> 2009-05-02 11:30:54,298 INFO org.apache.hadoop.mapred.TaskTracker:
>>>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>>>>
>>>>> taskTracker/jobcache/job_200904301103_0944/attempt_200904301103_0944_r_000015_0/output/file.out
>>>>> in any of the configured local directories
>>>>>
>>>>>
>>>>> Lance
>>>>>
>>>>>
>>>>> On Apr 30, 2009, at 12:04 PM, Todd Lipcon wrote:
>>>>>
>>>>> Hey Lance,
>>>>>
>>>>>>
>>>>>> Thanks for the logs. They definitely confirmed my suspicion. There are
>>>>>> two
>>>>>> problems here:
>>>>>>
>>>>>> 1) If the JobTracker throws an exception during processing of a
>>>>>> heartbeat,
>>>>>> the tasktracker retries with no delay, since lastHeartbeat isn't
>>>>>> updated
>>>>>> in
>>>>>> TaskTracker.offerService. This is related to HADOOP-3987
>>>>>>
>>>>>> 2) If the TaskTracker sends a task in COMMIT_PENDING state with an
>>>>>> invalid
>>>>>> task id, the jobtracker will trigger a NullPointerException in
>>>>>> JobTracker.getTasksToSave. Instead it should probably create a
>>>>>> KillTaskAction. I just filed a JIRA to track this issue:
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/HADOOP-5761
>>>>>>
>>>>>> 3) The TaskTracker somehow had a task attempt in COMMIT_PENDING state
>>>>>> that
>>>>>> the JobTracker didn't know about. How it got there is a separate
>>>>>> problem
>>>>>> that's a bit harder to track down.
>>>>>>
>>>>>> Thanks
>>>>>> -Todd
>>>>>>
>>>>>> On Thu, Apr 30, 2009 at 11:17 AM, Lance Riedel <la...@dotspots.com>
>>>>>> wrote:
>>>>>>
>>>>>> Here are the job tracker logs from the same time (and yes.. there is
>>>>>>
>>>>>>> something there!!):
>>>>>>>
>>>>>>>
>>>>>>> 2009-04-30 02:34:28,484 INFO org.apache.hadoop.mapred.JobTracker:
>>>>>>> Serious
>>>>>>> problem.  While updating status, cannot find taskid
>>>>>>> attempt_200904291917_0252_r_000003_0
>>>>>>>
>>>>>>>
>>>>>>> 2009-04-30 02:34:40,215 INFO org.apache.hadoop.ipc.Server: IPC Server
>>>>>>> handler 2 on 54311, call
>>>>>>> heartbeat(org.apache.hadoop.mapred.tasktrackersta...@1a93388, false,
>>>>>>> true,
>>>>>>> 5341) from 10.253.134.191:42688: error: java.io.IOException:
>>>>>>> java.lang.NullPointerException
>>>>>>> java.io.IOException: java.lang.NullPointerException
>>>>>>>    at
>>>>>>>
>>>>>>> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130)
>>>>>>>    at
>>>>>>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
>>>>>>>    at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
>>>>>>>    at
>>>>>>>
>>>>>>>
>>>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>>    at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>>    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>>>>>>>    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
>>>>>>> 2009-04-30 02:34:40,215 INFO org.apache.hadoop.mapred.JobTracker:
>>>>>>> Serious
>>>>>>> problem.  While updating status, cannot find taskid
>>>>>>> attempt_200904291917_0296_r_000014_1
>>>>>>> 2009-04-30 02:34:40,217 INFO org.apache.hadoop.mapred.JobTracker:
>>>>>>> Adding
>>>>>>> task 'attempt_200904291917_0352_r_000013_0' to tip
>>>>>>> task_200904291917_0352_r_000013, for tracker
>>>>>>>
>>>>>>> 'tracker_domU-12-31-38-00-F0-41.compute-1.internal:localhost.localdomain/
>>>>>>> 127.0.0.1:42479'
>>>>>>> 2009-04-30 02:34:40,217 INFO org.apache.hadoop.mapred.JobTracker:
>>>>>>> Removed
>>>>>>> completed task 'attempt_200904291917_0343_m_000003_0' from
>>>>>>>
>>>>>>> 'tracker_domU-12-31-38-00-F0-41.compute-1.internal:localhost.localdomain/
>>>>>>> 127.0.0.1:42479'
>>>>>>> 2009-04-30 02:34:40,217 INFO org.apache.hadoop.mapred.JobTracker:
>>>>>>> Removed
>>>>>>> completed task 'attempt_200904291917_0343_m_000007_0' from
>>>>>>>
>>>>>>> 'tracker_domU-12-31-38-00-F0-41.compute-1.internal:localhost.localdomain/
>>>>>>> 127.0.0.1:42479'
>>>>>>>
>>>>>>>
>>>>>>> And then.. a LOT more
>>>>>>>
>>>>>>>
>>>>>>> 2009-04-30 02:34:40,433 INFO org.apache.hadoop.mapred.JobTracker:
>>>>>>> Serious
>>>>>>> problem.  While updating status, cannot find taskid
>>>>>>> attempt_200904291917_0252_r_000003_0
>>>>>>> 2009-04-30 02:34:40,433 WARN org.apache.hadoop.mapred.TaskInProgress:
>>>>>>> Recieved duplicate status update of 'KILLED' for
>>>>>>> 'attempt_200904291917_0352_m_000010_1' of TIP
>>>>>>> 'task_200904291917_0352_m_000010'
>>>>>>> 2009-04-30 02:34:40,433 INFO org.apache.hadoop.ipc.Server: IPC Server
>>>>>>> handler 1 on 54311, call
>>>>>>> heartbeat(org.apache.hadoop.mapred.tasktrackersta...@1b7b4c1, false,
>>>>>>> true,
>>>>>>> 5341) from 10.253.134.191:42688: error: java.io.IOException:
>>>>>>> java.lang.NullPointerException
>>>>>>> java.io.IOException: java.lang.NullPointerException
>>>>>>>    at
>>>>>>>
>>>>>>> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130)
>>>>>>>    at
>>>>>>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
>>>>>>>    at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
>>>>>>>    at
>>>>>>>
>>>>>>>
>>>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>>    at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>>    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>>>>>>>    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
>>>>>>> 2009-04-30 02:34:40,441 INFO org.apache.hadoop.mapred.JobTracker:
>>>>>>> Serious
>>>>>>> problem.  While updating status, cannot find taskid
>>>>>>> attempt_200904291917_0252_r_000003_0
>>>>>>> 2009-04-30 02:34:40,441 WARN org.apache.hadoop.mapred.TaskInProgress:
>>>>>>> Recieved duplicate status update of 'KILLED' for
>>>>>>> 'attempt_200904291917_0352_m_000010_1' of TIP
>>>>>>> 'task_200904291917_0352_m_000010'
>>>>>>> 2009-04-30 02:34:40,442 INFO org.apache.hadoop.ipc.Server: IPC Server
>>>>>>> handler 1 on 54311, call
>>>>>>> heartbeat(org.apache.hadoop.mapred.tasktrackersta...@1598c57, false,
>>>>>>> true,
>>>>>>> 5341) from 10.253.134.191:42688: error: java.io.IOException:
>>>>>>> java.lang.NullPointerException
>>>>>>> java.io.IOException: java.lang.NullPointerException
>>>>>>>    at
>>>>>>>
>>>>>>> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130)
>>>>>>>    at
>>>>>>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
>>>>>>>    at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
>>>>>>>    at
>>>>>>>
>>>>>>>
>>>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>>    at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>>    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>>>>>>>    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
>>>>>>> 2009-04-30 02:34:40,444 INFO org.apache.hadoop.mapred.JobTracker:
>>>>>>> Serious
>>>>>>> problem.  While updating status, cannot find taskid
>>>>>>> attempt_200904291917_0252_r_000003_0
>>>>>>> 2009-04-30 02:34:40,444 WARN org.apache.hadoop.mapred.TaskInProgress:
>>>>>>> Recieved duplicate status update of 'KILLED' for
>>>>>>> 'attempt_200904291917_0352_m_000010_1' of TIP
>>>>>>> 'task_200904291917_0352_m_000010'
>>>>>>> 2009-04-30 02:34:40,444 INFO org.apache.hadoop.ipc.Server: IPC Server
>>>>>>> handler 7 on 54311, call
>>>>>>> heartbeat(org.apache.hadoop.mapred.tasktrackersta...@cb3d5, false,
>>>>>>> true,
>>>>>>> 5341) from 10.253.134.191:42688: error: java.io.IOException:
>>>>>>> java.lang.NullPointerException
>>>>>>> java.io.IOException: java.lang.NullPointerException
>>>>>>>    at
>>>>>>>
>>>>>>> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130)
>>>>>>>    at
>>>>>>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
>>>>>>>    at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
>>>>>>>    at
>>>>>>>
>>>>>>>
>>>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>>    at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>>    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>>>>>>>    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
>>>>>>> 2009-04-30 02:34:40,447 INFO org.apache.hadoop.mapred.JobTracker:
>>>>>>> Serious
>>>>>>> problem.  While updating status, cannot find taskid
>>>>>>> attempt_200904291917_0252_r_000003_0
>>>>>>> 2009-04-30 02:34:40,447 WARN org.apache.hadoop.mapred.TaskInProgress:
>>>>>>> Recieved duplicate status update of 'KILLED' for
>>>>>>> 'attempt_200904291917_0352_m_000010_1' of TIP
>>>>>>> 'task_200904291917_0352_m_000010'
>>>>>>> 2009-04-30 02:34:40,447 INFO org.apache.hadoop.ipc.Server: IPC Server
>>>>>>> handler 1 on 54311, call
>>>>>>> heartbeat(org.apache.hadoop.mapred.tasktrackersta...@1f27d98, false,
>>>>>>> true,
>>>>>>> 5341) from 10.253.134.191:42688: error: java.io.IOException:
>>>>>>> java.lang.NullPointerException
>>>>>>> java.io.IOException: java.lang.NullPointerException
>>>>>>>    at
>>>>>>>
>>>>>>> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130)
>>>>>>>    at
>>>>>>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
>>>>>>>    at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
>>>>>>>    at
>>>>>>>
>>>>>>>
>>>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>>    at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>>    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
>>>>>>>    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Apr 30, 2009, at 11:04 AM, Todd Lipcon wrote:
>>>>>>>
>>>>>>> Hey Lance,
>>>>>>>
>>>>>>>
>>>>>>>> Did you see any error messages in the JobTracker logs around the
>>>>>>>> time
>>>>>>>> this
>>>>>>>> started? I think I understand how this might happen.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> -Todd
>>>>>>>>
>>>>>>>> On Thu, Apr 30, 2009 at 10:45 AM, Lance Riedel <la...@dotspots.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> I have not been able to reproduce.  We are using version 19.1 with
>>>>>>>> the
>>>>>>>>
>>>>>>>>  following patches:
>>>>>>>>> 4780-2v19.patch (Jira  4780)
>>>>>>>>> closeAll3.patch (Jira 3998)
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Lance
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Apr 30, 2009, at 10:40 AM, Todd Lipcon wrote:
>>>>>>>>>
>>>>>>>>> Hi Lance,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>  Can I ask what version you were running when you saw this? Is it
>>>>>>>>>> reproducible? I'm trying to look at the code path that might
>>>>>>>>>> produce
>>>>>>>>>> such
>>>>>>>>>> a
>>>>>>>>>> behavior and want to make sure I'm looking at the right version.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> -Todd
>>>>>>>>>>
>>>>>>>>>> On Thu, Apr 30, 2009 at 9:33 AM, Lance Riedel <la...@dotspots.com
>>>>>>>>>> >
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Has anyone seen this before? Our task tracker produced a 2.7 gig
>>>>>>>>>> log
>>>>>>>>>> file
>>>>>>>>>>
>>>>>>>>>> in a few hours. The entry is all the same (every 2 ms):
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> 2009-04-30 02:34:40,207 INFO
>>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>>>> with
>>>>>>>>>>> reponseId '5341
>>>>>>>>>>> 2009-04-30 02:34:40,398 INFO
>>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>>>> with
>>>>>>>>>>> reponseId '5341
>>>>>>>>>>> 2009-04-30 02:34:40,403 INFO
>>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>>>> with
>>>>>>>>>>> reponseId '5341
>>>>>>>>>>> 2009-04-30 02:34:40,411 INFO
>>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>>>> with
>>>>>>>>>>> reponseId '5341
>>>>>>>>>>> 2009-04-30 02:34:40,414 INFO
>>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>>>> with
>>>>>>>>>>> reponseId '5341
>>>>>>>>>>> 2009-04-30 02:34:40,417 INFO
>>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>>>> with
>>>>>>>>>>> reponseId '5341
>>>>>>>>>>> 2009-04-30 02:34:40,420 INFO
>>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>>>> with
>>>>>>>>>>> reponseId '5341
>>>>>>>>>>> ... (And on and on and on...)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> These are the few lines before it started:
>>>>>>>>>>>
>>>>>>>>>>> 2009-04-30 02:34:29,780 INFO
>>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker.clienttrace: src:
>>>>>>>>>>> xxx.xxx.xxx.xxx:50060, dest: 10.253.178.95:40268, bytes:
>>>>>>>>>>> 3341324,
>>>>>>>>>>> op:
>>>>>>>>>>> MAPRED_SHUFFLE, cliID: attempt_200904291917_0352_m_000006_0
>>>>>>>>>>> 2009-04-30 02:34:31,522 INFO
>>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>>> Sent
>>>>>>>>>>> out
>>>>>>>>>>> 418891 bytes for reduce: 12 from map:
>>>>>>>>>>> attempt_200904291917_0352_m_000007_0
>>>>>>>>>>> given 418891/418887 from 4301462 with (22, 171)
>>>>>>>>>>> 2009-04-30 02:34:31,522 INFO
>>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker.clienttrace: src:
>>>>>>>>>>> xxx.xxx.xxx.xxx:50060, dest: xxx.xxx.xxx.xxx:40268, bytes:
>>>>>>>>>>> 418891,
>>>>>>>>>>> op:
>>>>>>>>>>> MAPRED_SHUFFLE, cliID: attempt_200904291917_0352_m_000007_0
>>>>>>>>>>> 2009-04-30 02:34:35,382 INFO
>>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>>> attempt_200904291917_0352_r_000003_0 0.3030303% reduce > copy (10
>>>>>>>>>>> of
>>>>>>>>>>> 11
>>>>>>>>>>> at
>>>>>>>>>>> 0.32 MB/s) >
>>>>>>>>>>> 2009-04-30 02:34:38,385 INFO
>>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>>> attempt_200904291917_0352_r_000003_0 0.3030303% reduce > copy (10
>>>>>>>>>>> of
>>>>>>>>>>> 11
>>>>>>>>>>> at
>>>>>>>>>>> 0.32 MB/s) >
>>>>>>>>>>> 2009-04-30 02:34:40,207 INFO
>>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>>>> with
>>>>>>>>>>> reponseId '5341
>>>>>>>>>>> 2009-04-30 02:34:40,398 INFO
>>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>>>> with
>>>>>>>>>>> reponseId '5341
>>>>>>>>>>> 2009-04-30 02:34:40,403 INFO
>>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>>>> with
>>>>>>>>>>> reponseId '5341
>>>>>>>>>>> 2009-04-30 02:34:40,411 INFO
>>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>>>> with
>>>>>>>>>>> reponseId '5341
>>>>>>>>>>> 2009-04-30 02:34:40,414 INFO
>>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>>>> with
>>>>>>>>>>> reponseId '5341
>>>>>>>>>>> 2009-04-30 02:34:40,417 INFO
>>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>>>> with
>>>>>>>>>>> reponseId '5341
>>>>>>>>>>> 2009-04-30 02:34:40,420 INFO
>>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker:
>>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com'
>>>>>>>>>>> with
>>>>>>>>>>> reponseId '5341
>>>>>>>>>>>
>>>>>>>>>>> --And on for    2+ gigs
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>
>>>
>>>
>>
>

Re: Infinite Loop Resending status from task tracker

Reply via email to