Hey Lance, Thanks for the logs. They definitely confirmed my suspicion. There are two problems here:
1) If the JobTracker throws an exception during processing of a heartbeat, the tasktracker retries with no delay, since lastHeartbeat isn't updated in TaskTracker.offerService. This is related to HADOOP-3987 2) If the TaskTracker sends a task in COMMIT_PENDING state with an invalid task id, the jobtracker will trigger a NullPointerException in JobTracker.getTasksToSave. Instead it should probably create a KillTaskAction. I just filed a JIRA to track this issue: https://issues.apache.org/jira/browse/HADOOP-5761 3) The TaskTracker somehow had a task attempt in COMMIT_PENDING state that the JobTracker didn't know about. How it got there is a separate problem that's a bit harder to track down. Thanks -Todd On Thu, Apr 30, 2009 at 11:17 AM, Lance Riedel <la...@dotspots.com> wrote: > Here are the job tracker logs from the same time (and yes.. there is > something there!!): > > > 2009-04-30 02:34:28,484 INFO org.apache.hadoop.mapred.JobTracker: Serious > problem. While updating status, cannot find taskid > attempt_200904291917_0252_r_000003_0 > > > 2009-04-30 02:34:40,215 INFO org.apache.hadoop.ipc.Server: IPC Server > handler 2 on 54311, call > heartbeat(org.apache.hadoop.mapred.tasktrackersta...@1a93388, false, true, > 5341) from 10.253.134.191:42688: error: java.io.IOException: > java.lang.NullPointerException > java.io.IOException: java.lang.NullPointerException > at > org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130) > at > org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923) > at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894) > 2009-04-30 02:34:40,215 INFO org.apache.hadoop.mapred.JobTracker: Serious > problem. While updating status, cannot find taskid > attempt_200904291917_0296_r_000014_1 > 2009-04-30 02:34:40,217 INFO org.apache.hadoop.mapred.JobTracker: Adding > task 'attempt_200904291917_0352_r_000013_0' to tip > task_200904291917_0352_r_000013, for tracker > 'tracker_domU-12-31-38-00-F0-41.compute-1.internal:localhost.localdomain/ > 127.0.0.1:42479' > 2009-04-30 02:34:40,217 INFO org.apache.hadoop.mapred.JobTracker: Removed > completed task 'attempt_200904291917_0343_m_000003_0' from > 'tracker_domU-12-31-38-00-F0-41.compute-1.internal:localhost.localdomain/ > 127.0.0.1:42479' > 2009-04-30 02:34:40,217 INFO org.apache.hadoop.mapred.JobTracker: Removed > completed task 'attempt_200904291917_0343_m_000007_0' from > 'tracker_domU-12-31-38-00-F0-41.compute-1.internal:localhost.localdomain/ > 127.0.0.1:42479' > > > And then.. a LOT more > > > 2009-04-30 02:34:40,433 INFO org.apache.hadoop.mapred.JobTracker: Serious > problem. While updating status, cannot find taskid > attempt_200904291917_0252_r_000003_0 > 2009-04-30 02:34:40,433 WARN org.apache.hadoop.mapred.TaskInProgress: > Recieved duplicate status update of 'KILLED' for > 'attempt_200904291917_0352_m_000010_1' of TIP > 'task_200904291917_0352_m_000010' > 2009-04-30 02:34:40,433 INFO org.apache.hadoop.ipc.Server: IPC Server > handler 1 on 54311, call > heartbeat(org.apache.hadoop.mapred.tasktrackersta...@1b7b4c1, false, true, > 5341) from 10.253.134.191:42688: error: java.io.IOException: > java.lang.NullPointerException > java.io.IOException: java.lang.NullPointerException > at > org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130) > at > org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923) > at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894) > 2009-04-30 02:34:40,441 INFO org.apache.hadoop.mapred.JobTracker: Serious > problem. While updating status, cannot find taskid > attempt_200904291917_0252_r_000003_0 > 2009-04-30 02:34:40,441 WARN org.apache.hadoop.mapred.TaskInProgress: > Recieved duplicate status update of 'KILLED' for > 'attempt_200904291917_0352_m_000010_1' of TIP > 'task_200904291917_0352_m_000010' > 2009-04-30 02:34:40,442 INFO org.apache.hadoop.ipc.Server: IPC Server > handler 1 on 54311, call > heartbeat(org.apache.hadoop.mapred.tasktrackersta...@1598c57, false, true, > 5341) from 10.253.134.191:42688: error: java.io.IOException: > java.lang.NullPointerException > java.io.IOException: java.lang.NullPointerException > at > org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130) > at > org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923) > at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894) > 2009-04-30 02:34:40,444 INFO org.apache.hadoop.mapred.JobTracker: Serious > problem. While updating status, cannot find taskid > attempt_200904291917_0252_r_000003_0 > 2009-04-30 02:34:40,444 WARN org.apache.hadoop.mapred.TaskInProgress: > Recieved duplicate status update of 'KILLED' for > 'attempt_200904291917_0352_m_000010_1' of TIP > 'task_200904291917_0352_m_000010' > 2009-04-30 02:34:40,444 INFO org.apache.hadoop.ipc.Server: IPC Server > handler 7 on 54311, call > heartbeat(org.apache.hadoop.mapred.tasktrackersta...@cb3d5, false, true, > 5341) from 10.253.134.191:42688: error: java.io.IOException: > java.lang.NullPointerException > java.io.IOException: java.lang.NullPointerException > at > org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130) > at > org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923) > at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894) > 2009-04-30 02:34:40,447 INFO org.apache.hadoop.mapred.JobTracker: Serious > problem. While updating status, cannot find taskid > attempt_200904291917_0252_r_000003_0 > 2009-04-30 02:34:40,447 WARN org.apache.hadoop.mapred.TaskInProgress: > Recieved duplicate status update of 'KILLED' for > 'attempt_200904291917_0352_m_000010_1' of TIP > 'task_200904291917_0352_m_000010' > 2009-04-30 02:34:40,447 INFO org.apache.hadoop.ipc.Server: IPC Server > handler 1 on 54311, call > heartbeat(org.apache.hadoop.mapred.tasktrackersta...@1f27d98, false, true, > 5341) from 10.253.134.191:42688: error: java.io.IOException: > java.lang.NullPointerException > java.io.IOException: java.lang.NullPointerException > at > org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130) > at > org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923) > at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894) > > > > > > > > > > > On Apr 30, 2009, at 11:04 AM, Todd Lipcon wrote: > > Hey Lance, >> >> Did you see any error messages in the JobTracker logs around the time this >> started? I think I understand how this might happen. >> >> Thanks, >> -Todd >> >> On Thu, Apr 30, 2009 at 10:45 AM, Lance Riedel <la...@dotspots.com> >> wrote: >> >> I have not been able to reproduce. We are using version 19.1 with the >>> following patches: >>> 4780-2v19.patch (Jira 4780) >>> closeAll3.patch (Jira 3998) >>> >>> Thanks, >>> Lance >>> >>> >>> On Apr 30, 2009, at 10:40 AM, Todd Lipcon wrote: >>> >>> Hi Lance, >>> >>>> >>>> Can I ask what version you were running when you saw this? Is it >>>> reproducible? I'm trying to look at the code path that might produce >>>> such >>>> a >>>> behavior and want to make sure I'm looking at the right version. >>>> >>>> Thanks >>>> -Todd >>>> >>>> On Thu, Apr 30, 2009 at 9:33 AM, Lance Riedel <la...@dotspots.com> >>>> wrote: >>>> >>>> Has anyone seen this before? Our task tracker produced a 2.7 gig log >>>> file >>>> >>>>> in a few hours. The entry is all the same (every 2 ms): >>>>> >>>>> 2009-04-30 02:34:40,207 INFO org.apache.hadoop.mapred.TaskTracker: >>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with >>>>> reponseId '5341 >>>>> 2009-04-30 02:34:40,398 INFO org.apache.hadoop.mapred.TaskTracker: >>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with >>>>> reponseId '5341 >>>>> 2009-04-30 02:34:40,403 INFO org.apache.hadoop.mapred.TaskTracker: >>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with >>>>> reponseId '5341 >>>>> 2009-04-30 02:34:40,411 INFO org.apache.hadoop.mapred.TaskTracker: >>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with >>>>> reponseId '5341 >>>>> 2009-04-30 02:34:40,414 INFO org.apache.hadoop.mapred.TaskTracker: >>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with >>>>> reponseId '5341 >>>>> 2009-04-30 02:34:40,417 INFO org.apache.hadoop.mapred.TaskTracker: >>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with >>>>> reponseId '5341 >>>>> 2009-04-30 02:34:40,420 INFO org.apache.hadoop.mapred.TaskTracker: >>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with >>>>> reponseId '5341 >>>>> ... (And on and on and on...) >>>>> >>>>> >>>>> These are the few lines before it started: >>>>> >>>>> 2009-04-30 02:34:29,780 INFO >>>>> org.apache.hadoop.mapred.TaskTracker.clienttrace: src: >>>>> xxx.xxx.xxx.xxx:50060, dest: 10.253.178.95:40268, bytes: 3341324, op: >>>>> MAPRED_SHUFFLE, cliID: attempt_200904291917_0352_m_000006_0 >>>>> 2009-04-30 02:34:31,522 INFO org.apache.hadoop.mapred.TaskTracker: Sent >>>>> out >>>>> 418891 bytes for reduce: 12 from map: >>>>> attempt_200904291917_0352_m_000007_0 >>>>> given 418891/418887 from 4301462 with (22, 171) >>>>> 2009-04-30 02:34:31,522 INFO >>>>> org.apache.hadoop.mapred.TaskTracker.clienttrace: src: >>>>> xxx.xxx.xxx.xxx:50060, dest: xxx.xxx.xxx.xxx:40268, bytes: 418891, op: >>>>> MAPRED_SHUFFLE, cliID: attempt_200904291917_0352_m_000007_0 >>>>> 2009-04-30 02:34:35,382 INFO org.apache.hadoop.mapred.TaskTracker: >>>>> attempt_200904291917_0352_r_000003_0 0.3030303% reduce > copy (10 of 11 >>>>> at >>>>> 0.32 MB/s) > >>>>> 2009-04-30 02:34:38,385 INFO org.apache.hadoop.mapred.TaskTracker: >>>>> attempt_200904291917_0352_r_000003_0 0.3030303% reduce > copy (10 of 11 >>>>> at >>>>> 0.32 MB/s) > >>>>> 2009-04-30 02:34:40,207 INFO org.apache.hadoop.mapred.TaskTracker: >>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with >>>>> reponseId '5341 >>>>> 2009-04-30 02:34:40,398 INFO org.apache.hadoop.mapred.TaskTracker: >>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with >>>>> reponseId '5341 >>>>> 2009-04-30 02:34:40,403 INFO org.apache.hadoop.mapred.TaskTracker: >>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with >>>>> reponseId '5341 >>>>> 2009-04-30 02:34:40,411 INFO org.apache.hadoop.mapred.TaskTracker: >>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with >>>>> reponseId '5341 >>>>> 2009-04-30 02:34:40,414 INFO org.apache.hadoop.mapred.TaskTracker: >>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with >>>>> reponseId '5341 >>>>> 2009-04-30 02:34:40,417 INFO org.apache.hadoop.mapred.TaskTracker: >>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with >>>>> reponseId '5341 >>>>> 2009-04-30 02:34:40,420 INFO org.apache.hadoop.mapred.TaskTracker: >>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with >>>>> reponseId '5341 >>>>> >>>>> --And on for 2+ gigs >>>>> >>>>> >>>>> >>> >