Sorry, had missed that Todd had created Jira - HADOOP-5761<https://issues.apache.org/jira/browse/HADOOP-5761>
Any progress there? Thanks, Lance On Thu, May 14, 2009 at 8:52 AM, Lance Riedel <la...@dotspots.com> wrote: > Here is the point in the logs where the infinite loop begins - see time > stamp 2009-05-14 04:03:56,348 : (JobTracker) > > 2009-05-14 04:03:56,324 INFO org.apache.hadoop.mapred.JobTracker: Removed > completed task 'attempt_200905122015_1168_m_000029_0' from > 'tracker_domU-12-31-38-01-74-F1.compute-1.internal:localhost.localdomain/ > 127.0.0.1:35214' > 2009-05-14 04:03:56,326 INFO org.apache.hadoop.mapred.JobTracker: Adding > task 'attempt_200905122015_1183_r_000007_0' to tip > task_200905122015_1183_r_000007, for tracker > 'tracker_domU-12-31-38-00-F0-41.compute-1.internal:localhost.localdomain/ > 127.0.0.1:58504' > 2009-05-14 04:03:56,327 INFO org.apache.hadoop.mapred.JobTracker: Adding > task (cleanup)'attempt_200905122015_1184_m_000000_1' to tip > task_200905122015_1184_m_000000, for tracker > 'tracker_domU-12-31-38-00-80-21.compute-1.internal:localhost.localdomain/ > 127.0.0.1 > :57741' > 2009-05-14 04:03:56,330 INFO org.apache.hadoop.mapred.JobInProgress: Task > 'attempt_200905122015_1182_r_000011_0' has completed > task_200905122015_1182_r_000011 successfully. > 2009-05-14 04:03:56,330 INFO org.apache.hadoop.mapred.JobTracker: Adding > task (cleanup)'attempt_200905122015_1182_r_000010_1' to tip > task_200905122015_1182_r_000010, for tracker > 'tracker_domU-12-31-38-01-5C-41.compute-1.internal:localhost.localdomain/ > 127.0.0.1 > :46248' > 2009-05-14 04:03:56,331 INFO org.apache.hadoop.mapred.JobTracker: Serious > problem. While updating status, cannot find taskid > attempt_200905122015_0499_r_000004_1 > 2009-05-14 04:03:56,331 INFO org.apache.hadoop.mapred.JobInProgress: Task > 'attempt_200905122015_1184_m_000004_1' has completed > task_200905122015_1184_m_000004 successfully. > 2009-05-14 04:03:56,331 INFO org.apache.hadoop.mapred.ResourceEstimator: > measured blowup on task_200905122015_1184_m_000004 was 20150008/21581175 = > 0.93368447269437372009-05-14 04:03:56,331 INFO > org.apache.hadoop.mapred.ResourceEstimator: new estimate is blowup = > 0.9152383292400812 > 2009-05-14 04:03:56,335 INFO org.apache.hadoop.mapred.JobTracker: Adding > task 'attempt_200905122015_1183_r_000008_0' to tip > task_200905122015_1183_r_000008, for tracker > 'tracker_domU-12-31-38-00-80-21.compute-1.internal:localhost.localdomain/ > 127.0.0.1:57741' > 2009-05-14 04:03:56,336 INFO org.apache.hadoop.mapred.JobTracker: Adding > task 'attempt_200905122015_1183_r_000009_0' to tip > task_200905122015_1183_r_000009, for tracker > 'tracker_domU-12-31-38-01-74-F1.compute-1.internal:localhost.localdomain/ > 127.0.0.1:35214' > 2009-05-14 04:03:56,336 INFO org.apache.hadoop.mapred.JobTracker: Removed > completed task 'attempt_200905122015_1181_r_000009_0' from > 'tracker_domU-12-31-38-01-74-F1.compute-1.internal:localhost.localdomain/ > 127.0.0.1:35214' > 2009-05-14 04:03:56,336 INFO org.apache.hadoop.mapred.JobTracker: Serious > problem. While updating status, cannot find taskid > attempt_200905122015_0499_r_000004_1 > 2009-05-14 04:03:56,337 INFO org.apache.hadoop.mapred.JobTracker: Adding > task 'attempt_200905122015_1183_r_000010_0' to tip > task_200905122015_1183_r_000010, for tracker > 'tracker_domU-12-31-38-01-81-31.compute-1.internal:localhost.localdomain/ > 127.0.0.1:46518' > 2009-05-14 04:03:56,343 INFO org.apache.hadoop.mapred.JobTracker: Serious > problem. While updating status, cannot find taskid > attempt_200905122015_1070_r_000014_1 > 2009-05-14 04:03:56,344 INFO org.apache.hadoop.mapred.JobTracker: Adding > task 'attempt_200905122015_1183_r_000011_0' to tip > task_200905122015_1183_r_000011, for tracker > 'tracker_domU-12-31-38-01-AD-91.compute-1.internal:localhost.localdomain/ > 127.0.0.1:33929' > 2009-05-14 04:03:56,348 INFO org.apache.hadoop.ipc.Server: IPC Server > handler 9 on 54311, call > heartbeat(org.apache.hadoop.mapred.tasktrackersta...@a4fe4, false, true, > 21461) from 10.253.178.95:50709: error: java.io.IOException: > java.lang.NullPointerException > java.io.IOException: java.lang.NullPointerException > at > org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130) > at > org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923) > at sun.reflect.GeneratedMethodAccessor117.invoke(Unknown > Source) at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894) > 2009-05-14 04:03:56,351 INFO org.apache.hadoop.mapred.JobTracker: > attempt_200905122015_1183_r_000007_0 is 24 ms debug. > 2009-05-14 04:03:56,419 INFO org.apache.hadoop.mapred.JobTracker: Adding > task 'attempt_200905122015_1183_r_000012_0' to tip > task_200905122015_1183_r_000012, for tracker > 'tracker_domU-12-31-38-00-F0-41.compute-1.internal:localhost.localdomain/ > 127.0.0.1:58504' > 2009-05-14 04:03:56,459 INFO org.apache.hadoop.mapred.JobTracker: Serious > problem. While updating status, cannot find taskid > attempt_200905122015_1070_r_000014_1 > 2009-05-14 04:03:56,459 WARN org.apache.hadoop.mapred.TaskInProgress: > Recieved duplicate status update of 'KILLED' for > 'attempt_200905122015_1182_r_000009_1' of TIP > 'task_200905122015_1182_r_000009' > 2009-05-14 04:03:56,460 INFO org.apache.hadoop.mapred.JobTracker: Adding > task 'attempt_200905122015_1183_r_000013_0' to tip > task_200905122015_1183_r_000013, for tracker > 'tracker_domU-12-31-38-01-AD-91.compute-1.internal:localhost.localdomain/ > 127.0.0.1:33929' > 2009-05-14 04:03:56,460 INFO org.apache.hadoop.ipc.Server: IPC Server > handler 5 on 54311, call > heartbeat(org.apache.hadoop.mapred.tasktrackersta...@1f1c507, false, true, > 21461) from 10.253.178.95:50709: error: java.io.IOException: > java.lang.NullPointerExceptio > n > java.io.IOException: java.lang.NullPointerException > at > org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130) > at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923) > at sun.reflect.GeneratedMethodAccessor117.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894) > 2009-05-14 04:03:56,464 INFO org.apache.hadoop.mapred.JobTracker: Serious > problem. While updating status, cannot find taskid > attempt_200905122015_1070_r_000014_1 > 2009-05-14 04:03:56,464 WARN org.apache.hadoop.mapred.TaskInProgress: > Recieved duplicate status update of 'KILLED' for > 'attempt_200905122015_1182_r_000009_1' of TIP > 'task_200905122015_1182_r_000009' > 2009-05-14 04:03:56,465 INFO org.apache.hadoop.mapred.JobTracker: Adding > task 'attempt_200905122015_1183_r_000014_0' to tip > task_200905122015_1183_r_000014, for tracker > 'tracker_domU-12-31-38-01-AD-91.compute-1.internal:localhost.localdomain/ > 127.0.0.1:33929' > 2009-05-14 04:03:56,465 INFO org.apache.hadoop.ipc.Server: IPC Server > handler 2 on 54311, call > heartbeat(org.apache.hadoop.mapred.tasktrackersta...@1d54fc9, false, true, > 21461) from 10.253.178.95:50709: error: java.io.IOException: > java.lang.NullPointerExceptio > n > > > <and on and on an on for 10 gigs worth in a few hours> > > > > > On Thu, May 14, 2009 at 8:37 AM, Lance Riedel <la...@dotspots.com> wrote: > >> Just had another cluster crash with the same issue. This is still a huge >> issue for us- still crashing our cluster every other night (actually almost >> every night now). >> >> Should we move to .20? Is there more information i can provide? Is this >> related to my other email "Constantly getting DiskErrorExceptions - but >> logged as INFO"? I haven't seen responses on that. >> >> Thanks! >> Lance >> >> On Thu, May 14, 2009 at 7:48 AM, Lance Riedel <la...@dotspots.com> wrote: >> >>> Here is the latest here.. Haven't heard any more, but every other night >>> we get 10 gigs logs and tons of failed tasks and have to restart the cluster >>> >>> ---------- Forwarded message ---------- >>> From: Lance Riedel <la...@dotspots.com> >>> Date: Fri, May 8, 2009 at 10:49 AM >>> Subject: Re: Infinite Loop Resending status from task tracker >>> To: core-user@hadoop.apache.org >>> Cc: Brian Long <br...@dotspots.com> >>> >>> >>> Hi Todd, >>> Sorry, my response got hung up in my outbox for a couple of days.. arghh >>> >>> >>> >>> Confirmed that 1) we are not running out of space and 2) that our >>> mapred.local.dir directory is not in /tmp >>> >>> Not sure if this an ec2 problem with a mounted drive? >>> >>> We had the same thing happen again, exact same logs and symptoms >>> (simultaneous in jobtracker and tasktracker) >>> >>> Thinking about moving to .20 because of this, any thoughts on that? >>> >>> Thanks, >>> Lance >>> >>> >>> On May 4, 2009, at 4:18 PM, Todd Lipcon wrote: >>> >>> Hi Lance, >>>> >>>> Two thoughts here that might be the culprit: >>>> >>>> 1) Is it possible that the partition that your mapred.local.dir is on is >>>> out >>>> of space on that task tracker? >>>> >>>> 2) Is it possible that you're using a directory under /tmp for >>>> mapred.local.dir and some system cron script cleared out /tmp? >>>> >>>> -Todd >>>> >>>> On Sat, May 2, 2009 at 9:01 AM, Lance Riedel <la...@dotspots.com> >>>> wrote: >>>> >>>> Hi Todd, >>>>> Not sure if this is related, but our hadoop cluster in general is >>>>> getting >>>>> more and more unstable. the logs are full of this error message (but >>>>> having >>>>> trouble tracking down the root problem): >>>>> >>>>> 2009-05-02 11:30:39,294 INFO org.apache.hadoop.mapred.TaskTracker: >>>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find >>>>> >>>>> taskTracker/jobcache/job_200904301103_1111/attempt_200904301103_1111_m_000001_1/output/file.out >>>>> in any of the configured local directories >>>>> 2009-05-02 11:30:39,294 INFO org.apache.hadoop.mapred.TaskTracker: >>>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find >>>>> >>>>> taskTracker/jobcache/job_200904301103_1675/attempt_200904301103_1675_r_000012_1/output/file.out >>>>> in any of the configured local directories >>>>> 2009-05-02 11:30:44,295 INFO org.apache.hadoop.mapred.TaskTracker: >>>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find >>>>> >>>>> taskTracker/jobcache/job_200904301103_0944/attempt_200904301103_0944_r_000015_0/output/file.out >>>>> in any of the configured local directories >>>>> 2009-05-02 11:30:44,295 INFO org.apache.hadoop.mapred.TaskTracker: >>>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find >>>>> >>>>> taskTracker/jobcache/job_200904301103_1111/attempt_200904301103_1111_m_000001_1/output/file.out >>>>> in any of the configured local directories >>>>> 2009-05-02 11:30:44,295 INFO org.apache.hadoop.mapred.TaskTracker: >>>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find >>>>> >>>>> taskTracker/jobcache/job_200904301103_1675/attempt_200904301103_1675_r_000012_1/output/file.out >>>>> in any of the configured local directories >>>>> 2009-05-02 11:30:49,296 INFO org.apache.hadoop.mapred.TaskTracker: >>>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find >>>>> >>>>> taskTracker/jobcache/job_200904301103_0944/attempt_200904301103_0944_r_000015_0/output/file.out >>>>> in any of the configured local directories >>>>> 2009-05-02 11:30:49,296 INFO org.apache.hadoop.mapred.TaskTracker: >>>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find >>>>> >>>>> taskTracker/jobcache/job_200904301103_1111/attempt_200904301103_1111_m_000001_1/output/file.out >>>>> in any of the configured local directories >>>>> 2009-05-02 11:30:49,297 INFO org.apache.hadoop.mapred.TaskTracker: >>>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find >>>>> >>>>> taskTracker/jobcache/job_200904301103_1675/attempt_200904301103_1675_r_000012_1/output/file.out >>>>> in any of the configured local directories >>>>> 2009-05-02 11:30:54,298 INFO org.apache.hadoop.mapred.TaskTracker: >>>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find >>>>> >>>>> taskTracker/jobcache/job_200904301103_0944/attempt_200904301103_0944_r_000015_0/output/file.out >>>>> in any of the configured local directories >>>>> >>>>> >>>>> Lance >>>>> >>>>> >>>>> On Apr 30, 2009, at 12:04 PM, Todd Lipcon wrote: >>>>> >>>>> Hey Lance, >>>>> >>>>>> >>>>>> Thanks for the logs. They definitely confirmed my suspicion. There are >>>>>> two >>>>>> problems here: >>>>>> >>>>>> 1) If the JobTracker throws an exception during processing of a >>>>>> heartbeat, >>>>>> the tasktracker retries with no delay, since lastHeartbeat isn't >>>>>> updated >>>>>> in >>>>>> TaskTracker.offerService. This is related to HADOOP-3987 >>>>>> >>>>>> 2) If the TaskTracker sends a task in COMMIT_PENDING state with an >>>>>> invalid >>>>>> task id, the jobtracker will trigger a NullPointerException in >>>>>> JobTracker.getTasksToSave. Instead it should probably create a >>>>>> KillTaskAction. I just filed a JIRA to track this issue: >>>>>> >>>>>> https://issues.apache.org/jira/browse/HADOOP-5761 >>>>>> >>>>>> 3) The TaskTracker somehow had a task attempt in COMMIT_PENDING state >>>>>> that >>>>>> the JobTracker didn't know about. How it got there is a separate >>>>>> problem >>>>>> that's a bit harder to track down. >>>>>> >>>>>> Thanks >>>>>> -Todd >>>>>> >>>>>> On Thu, Apr 30, 2009 at 11:17 AM, Lance Riedel <la...@dotspots.com> >>>>>> wrote: >>>>>> >>>>>> Here are the job tracker logs from the same time (and yes.. there is >>>>>> >>>>>>> something there!!): >>>>>>> >>>>>>> >>>>>>> 2009-04-30 02:34:28,484 INFO org.apache.hadoop.mapred.JobTracker: >>>>>>> Serious >>>>>>> problem. While updating status, cannot find taskid >>>>>>> attempt_200904291917_0252_r_000003_0 >>>>>>> >>>>>>> >>>>>>> 2009-04-30 02:34:40,215 INFO org.apache.hadoop.ipc.Server: IPC Server >>>>>>> handler 2 on 54311, call >>>>>>> heartbeat(org.apache.hadoop.mapred.tasktrackersta...@1a93388, false, >>>>>>> true, >>>>>>> 5341) from 10.253.134.191:42688: error: java.io.IOException: >>>>>>> java.lang.NullPointerException >>>>>>> java.io.IOException: java.lang.NullPointerException >>>>>>> at >>>>>>> >>>>>>> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130) >>>>>>> at >>>>>>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923) >>>>>>> at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source) >>>>>>> at >>>>>>> >>>>>>> >>>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) >>>>>>> at java.lang.reflect.Method.invoke(Method.java:597) >>>>>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481) >>>>>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894) >>>>>>> 2009-04-30 02:34:40,215 INFO org.apache.hadoop.mapred.JobTracker: >>>>>>> Serious >>>>>>> problem. While updating status, cannot find taskid >>>>>>> attempt_200904291917_0296_r_000014_1 >>>>>>> 2009-04-30 02:34:40,217 INFO org.apache.hadoop.mapred.JobTracker: >>>>>>> Adding >>>>>>> task 'attempt_200904291917_0352_r_000013_0' to tip >>>>>>> task_200904291917_0352_r_000013, for tracker >>>>>>> >>>>>>> 'tracker_domU-12-31-38-00-F0-41.compute-1.internal:localhost.localdomain/ >>>>>>> 127.0.0.1:42479' >>>>>>> 2009-04-30 02:34:40,217 INFO org.apache.hadoop.mapred.JobTracker: >>>>>>> Removed >>>>>>> completed task 'attempt_200904291917_0343_m_000003_0' from >>>>>>> >>>>>>> 'tracker_domU-12-31-38-00-F0-41.compute-1.internal:localhost.localdomain/ >>>>>>> 127.0.0.1:42479' >>>>>>> 2009-04-30 02:34:40,217 INFO org.apache.hadoop.mapred.JobTracker: >>>>>>> Removed >>>>>>> completed task 'attempt_200904291917_0343_m_000007_0' from >>>>>>> >>>>>>> 'tracker_domU-12-31-38-00-F0-41.compute-1.internal:localhost.localdomain/ >>>>>>> 127.0.0.1:42479' >>>>>>> >>>>>>> >>>>>>> And then.. a LOT more >>>>>>> >>>>>>> >>>>>>> 2009-04-30 02:34:40,433 INFO org.apache.hadoop.mapred.JobTracker: >>>>>>> Serious >>>>>>> problem. While updating status, cannot find taskid >>>>>>> attempt_200904291917_0252_r_000003_0 >>>>>>> 2009-04-30 02:34:40,433 WARN org.apache.hadoop.mapred.TaskInProgress: >>>>>>> Recieved duplicate status update of 'KILLED' for >>>>>>> 'attempt_200904291917_0352_m_000010_1' of TIP >>>>>>> 'task_200904291917_0352_m_000010' >>>>>>> 2009-04-30 02:34:40,433 INFO org.apache.hadoop.ipc.Server: IPC Server >>>>>>> handler 1 on 54311, call >>>>>>> heartbeat(org.apache.hadoop.mapred.tasktrackersta...@1b7b4c1, false, >>>>>>> true, >>>>>>> 5341) from 10.253.134.191:42688: error: java.io.IOException: >>>>>>> java.lang.NullPointerException >>>>>>> java.io.IOException: java.lang.NullPointerException >>>>>>> at >>>>>>> >>>>>>> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130) >>>>>>> at >>>>>>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923) >>>>>>> at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source) >>>>>>> at >>>>>>> >>>>>>> >>>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) >>>>>>> at java.lang.reflect.Method.invoke(Method.java:597) >>>>>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481) >>>>>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894) >>>>>>> 2009-04-30 02:34:40,441 INFO org.apache.hadoop.mapred.JobTracker: >>>>>>> Serious >>>>>>> problem. While updating status, cannot find taskid >>>>>>> attempt_200904291917_0252_r_000003_0 >>>>>>> 2009-04-30 02:34:40,441 WARN org.apache.hadoop.mapred.TaskInProgress: >>>>>>> Recieved duplicate status update of 'KILLED' for >>>>>>> 'attempt_200904291917_0352_m_000010_1' of TIP >>>>>>> 'task_200904291917_0352_m_000010' >>>>>>> 2009-04-30 02:34:40,442 INFO org.apache.hadoop.ipc.Server: IPC Server >>>>>>> handler 1 on 54311, call >>>>>>> heartbeat(org.apache.hadoop.mapred.tasktrackersta...@1598c57, false, >>>>>>> true, >>>>>>> 5341) from 10.253.134.191:42688: error: java.io.IOException: >>>>>>> java.lang.NullPointerException >>>>>>> java.io.IOException: java.lang.NullPointerException >>>>>>> at >>>>>>> >>>>>>> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130) >>>>>>> at >>>>>>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923) >>>>>>> at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source) >>>>>>> at >>>>>>> >>>>>>> >>>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) >>>>>>> at java.lang.reflect.Method.invoke(Method.java:597) >>>>>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481) >>>>>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894) >>>>>>> 2009-04-30 02:34:40,444 INFO org.apache.hadoop.mapred.JobTracker: >>>>>>> Serious >>>>>>> problem. While updating status, cannot find taskid >>>>>>> attempt_200904291917_0252_r_000003_0 >>>>>>> 2009-04-30 02:34:40,444 WARN org.apache.hadoop.mapred.TaskInProgress: >>>>>>> Recieved duplicate status update of 'KILLED' for >>>>>>> 'attempt_200904291917_0352_m_000010_1' of TIP >>>>>>> 'task_200904291917_0352_m_000010' >>>>>>> 2009-04-30 02:34:40,444 INFO org.apache.hadoop.ipc.Server: IPC Server >>>>>>> handler 7 on 54311, call >>>>>>> heartbeat(org.apache.hadoop.mapred.tasktrackersta...@cb3d5, false, >>>>>>> true, >>>>>>> 5341) from 10.253.134.191:42688: error: java.io.IOException: >>>>>>> java.lang.NullPointerException >>>>>>> java.io.IOException: java.lang.NullPointerException >>>>>>> at >>>>>>> >>>>>>> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130) >>>>>>> at >>>>>>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923) >>>>>>> at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source) >>>>>>> at >>>>>>> >>>>>>> >>>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) >>>>>>> at java.lang.reflect.Method.invoke(Method.java:597) >>>>>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481) >>>>>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894) >>>>>>> 2009-04-30 02:34:40,447 INFO org.apache.hadoop.mapred.JobTracker: >>>>>>> Serious >>>>>>> problem. While updating status, cannot find taskid >>>>>>> attempt_200904291917_0252_r_000003_0 >>>>>>> 2009-04-30 02:34:40,447 WARN org.apache.hadoop.mapred.TaskInProgress: >>>>>>> Recieved duplicate status update of 'KILLED' for >>>>>>> 'attempt_200904291917_0352_m_000010_1' of TIP >>>>>>> 'task_200904291917_0352_m_000010' >>>>>>> 2009-04-30 02:34:40,447 INFO org.apache.hadoop.ipc.Server: IPC Server >>>>>>> handler 1 on 54311, call >>>>>>> heartbeat(org.apache.hadoop.mapred.tasktrackersta...@1f27d98, false, >>>>>>> true, >>>>>>> 5341) from 10.253.134.191:42688: error: java.io.IOException: >>>>>>> java.lang.NullPointerException >>>>>>> java.io.IOException: java.lang.NullPointerException >>>>>>> at >>>>>>> >>>>>>> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130) >>>>>>> at >>>>>>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923) >>>>>>> at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source) >>>>>>> at >>>>>>> >>>>>>> >>>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) >>>>>>> at java.lang.reflect.Method.invoke(Method.java:597) >>>>>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481) >>>>>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894) >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Apr 30, 2009, at 11:04 AM, Todd Lipcon wrote: >>>>>>> >>>>>>> Hey Lance, >>>>>>> >>>>>>> >>>>>>>> Did you see any error messages in the JobTracker logs around the >>>>>>>> time >>>>>>>> this >>>>>>>> started? I think I understand how this might happen. >>>>>>>> >>>>>>>> Thanks, >>>>>>>> -Todd >>>>>>>> >>>>>>>> On Thu, Apr 30, 2009 at 10:45 AM, Lance Riedel <la...@dotspots.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>> I have not been able to reproduce. We are using version 19.1 with >>>>>>>> the >>>>>>>> >>>>>>>> following patches: >>>>>>>>> 4780-2v19.patch (Jira 4780) >>>>>>>>> closeAll3.patch (Jira 3998) >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> Lance >>>>>>>>> >>>>>>>>> >>>>>>>>> On Apr 30, 2009, at 10:40 AM, Todd Lipcon wrote: >>>>>>>>> >>>>>>>>> Hi Lance, >>>>>>>>> >>>>>>>>> >>>>>>>>> Can I ask what version you were running when you saw this? Is it >>>>>>>>>> reproducible? I'm trying to look at the code path that might >>>>>>>>>> produce >>>>>>>>>> such >>>>>>>>>> a >>>>>>>>>> behavior and want to make sure I'm looking at the right version. >>>>>>>>>> >>>>>>>>>> Thanks >>>>>>>>>> -Todd >>>>>>>>>> >>>>>>>>>> On Thu, Apr 30, 2009 at 9:33 AM, Lance Riedel <la...@dotspots.com >>>>>>>>>> > >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>> Has anyone seen this before? Our task tracker produced a 2.7 gig >>>>>>>>>> log >>>>>>>>>> file >>>>>>>>>> >>>>>>>>>> in a few hours. The entry is all the same (every 2 ms): >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> 2009-04-30 02:34:40,207 INFO >>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker: >>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' >>>>>>>>>>> with >>>>>>>>>>> reponseId '5341 >>>>>>>>>>> 2009-04-30 02:34:40,398 INFO >>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker: >>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' >>>>>>>>>>> with >>>>>>>>>>> reponseId '5341 >>>>>>>>>>> 2009-04-30 02:34:40,403 INFO >>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker: >>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' >>>>>>>>>>> with >>>>>>>>>>> reponseId '5341 >>>>>>>>>>> 2009-04-30 02:34:40,411 INFO >>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker: >>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' >>>>>>>>>>> with >>>>>>>>>>> reponseId '5341 >>>>>>>>>>> 2009-04-30 02:34:40,414 INFO >>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker: >>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' >>>>>>>>>>> with >>>>>>>>>>> reponseId '5341 >>>>>>>>>>> 2009-04-30 02:34:40,417 INFO >>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker: >>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' >>>>>>>>>>> with >>>>>>>>>>> reponseId '5341 >>>>>>>>>>> 2009-04-30 02:34:40,420 INFO >>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker: >>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' >>>>>>>>>>> with >>>>>>>>>>> reponseId '5341 >>>>>>>>>>> ... (And on and on and on...) >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> These are the few lines before it started: >>>>>>>>>>> >>>>>>>>>>> 2009-04-30 02:34:29,780 INFO >>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker.clienttrace: src: >>>>>>>>>>> xxx.xxx.xxx.xxx:50060, dest: 10.253.178.95:40268, bytes: >>>>>>>>>>> 3341324, >>>>>>>>>>> op: >>>>>>>>>>> MAPRED_SHUFFLE, cliID: attempt_200904291917_0352_m_000006_0 >>>>>>>>>>> 2009-04-30 02:34:31,522 INFO >>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker: >>>>>>>>>>> Sent >>>>>>>>>>> out >>>>>>>>>>> 418891 bytes for reduce: 12 from map: >>>>>>>>>>> attempt_200904291917_0352_m_000007_0 >>>>>>>>>>> given 418891/418887 from 4301462 with (22, 171) >>>>>>>>>>> 2009-04-30 02:34:31,522 INFO >>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker.clienttrace: src: >>>>>>>>>>> xxx.xxx.xxx.xxx:50060, dest: xxx.xxx.xxx.xxx:40268, bytes: >>>>>>>>>>> 418891, >>>>>>>>>>> op: >>>>>>>>>>> MAPRED_SHUFFLE, cliID: attempt_200904291917_0352_m_000007_0 >>>>>>>>>>> 2009-04-30 02:34:35,382 INFO >>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker: >>>>>>>>>>> attempt_200904291917_0352_r_000003_0 0.3030303% reduce > copy (10 >>>>>>>>>>> of >>>>>>>>>>> 11 >>>>>>>>>>> at >>>>>>>>>>> 0.32 MB/s) > >>>>>>>>>>> 2009-04-30 02:34:38,385 INFO >>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker: >>>>>>>>>>> attempt_200904291917_0352_r_000003_0 0.3030303% reduce > copy (10 >>>>>>>>>>> of >>>>>>>>>>> 11 >>>>>>>>>>> at >>>>>>>>>>> 0.32 MB/s) > >>>>>>>>>>> 2009-04-30 02:34:40,207 INFO >>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker: >>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' >>>>>>>>>>> with >>>>>>>>>>> reponseId '5341 >>>>>>>>>>> 2009-04-30 02:34:40,398 INFO >>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker: >>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' >>>>>>>>>>> with >>>>>>>>>>> reponseId '5341 >>>>>>>>>>> 2009-04-30 02:34:40,403 INFO >>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker: >>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' >>>>>>>>>>> with >>>>>>>>>>> reponseId '5341 >>>>>>>>>>> 2009-04-30 02:34:40,411 INFO >>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker: >>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' >>>>>>>>>>> with >>>>>>>>>>> reponseId '5341 >>>>>>>>>>> 2009-04-30 02:34:40,414 INFO >>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker: >>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' >>>>>>>>>>> with >>>>>>>>>>> reponseId '5341 >>>>>>>>>>> 2009-04-30 02:34:40,417 INFO >>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker: >>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' >>>>>>>>>>> with >>>>>>>>>>> reponseId '5341 >>>>>>>>>>> 2009-04-30 02:34:40,420 INFO >>>>>>>>>>> org.apache.hadoop.mapred.TaskTracker: >>>>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' >>>>>>>>>>> with >>>>>>>>>>> reponseId '5341 >>>>>>>>>>> >>>>>>>>>>> --And on for 2+ gigs >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>> >>>>>>> >>>>> >>> >>> >> >