Just had another cluster crash with the same issue. This is still a huge issue for us- still crashing our cluster every other night (actually almost every night now).
Should we move to .20? Is there more information i can provide? Is this related to my other email "Constantly getting DiskErrorExceptions - but logged as INFO"? I haven't seen responses on that. Thanks! Lance On Thu, May 14, 2009 at 7:48 AM, Lance Riedel <la...@dotspots.com> wrote: > Here is the latest here.. Haven't heard any more, but every other night we > get 10 gigs logs and tons of failed tasks and have to restart the cluster > > ---------- Forwarded message ---------- > From: Lance Riedel <la...@dotspots.com> > Date: Fri, May 8, 2009 at 10:49 AM > Subject: Re: Infinite Loop Resending status from task tracker > To: core-user@hadoop.apache.org > Cc: Brian Long <br...@dotspots.com> > > > Hi Todd, > Sorry, my response got hung up in my outbox for a couple of days.. arghh > > > > Confirmed that 1) we are not running out of space and 2) that our > mapred.local.dir directory is not in /tmp > > Not sure if this an ec2 problem with a mounted drive? > > We had the same thing happen again, exact same logs and symptoms > (simultaneous in jobtracker and tasktracker) > > Thinking about moving to .20 because of this, any thoughts on that? > > Thanks, > Lance > > > On May 4, 2009, at 4:18 PM, Todd Lipcon wrote: > > Hi Lance, >> >> Two thoughts here that might be the culprit: >> >> 1) Is it possible that the partition that your mapred.local.dir is on is >> out >> of space on that task tracker? >> >> 2) Is it possible that you're using a directory under /tmp for >> mapred.local.dir and some system cron script cleared out /tmp? >> >> -Todd >> >> On Sat, May 2, 2009 at 9:01 AM, Lance Riedel <la...@dotspots.com> wrote: >> >> Hi Todd, >>> Not sure if this is related, but our hadoop cluster in general is getting >>> more and more unstable. the logs are full of this error message (but >>> having >>> trouble tracking down the root problem): >>> >>> 2009-05-02 11:30:39,294 INFO org.apache.hadoop.mapred.TaskTracker: >>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find >>> >>> taskTracker/jobcache/job_200904301103_1111/attempt_200904301103_1111_m_000001_1/output/file.out >>> in any of the configured local directories >>> 2009-05-02 11:30:39,294 INFO org.apache.hadoop.mapred.TaskTracker: >>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find >>> >>> taskTracker/jobcache/job_200904301103_1675/attempt_200904301103_1675_r_000012_1/output/file.out >>> in any of the configured local directories >>> 2009-05-02 11:30:44,295 INFO org.apache.hadoop.mapred.TaskTracker: >>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find >>> >>> taskTracker/jobcache/job_200904301103_0944/attempt_200904301103_0944_r_000015_0/output/file.out >>> in any of the configured local directories >>> 2009-05-02 11:30:44,295 INFO org.apache.hadoop.mapred.TaskTracker: >>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find >>> >>> taskTracker/jobcache/job_200904301103_1111/attempt_200904301103_1111_m_000001_1/output/file.out >>> in any of the configured local directories >>> 2009-05-02 11:30:44,295 INFO org.apache.hadoop.mapred.TaskTracker: >>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find >>> >>> taskTracker/jobcache/job_200904301103_1675/attempt_200904301103_1675_r_000012_1/output/file.out >>> in any of the configured local directories >>> 2009-05-02 11:30:49,296 INFO org.apache.hadoop.mapred.TaskTracker: >>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find >>> >>> taskTracker/jobcache/job_200904301103_0944/attempt_200904301103_0944_r_000015_0/output/file.out >>> in any of the configured local directories >>> 2009-05-02 11:30:49,296 INFO org.apache.hadoop.mapred.TaskTracker: >>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find >>> >>> taskTracker/jobcache/job_200904301103_1111/attempt_200904301103_1111_m_000001_1/output/file.out >>> in any of the configured local directories >>> 2009-05-02 11:30:49,297 INFO org.apache.hadoop.mapred.TaskTracker: >>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find >>> >>> taskTracker/jobcache/job_200904301103_1675/attempt_200904301103_1675_r_000012_1/output/file.out >>> in any of the configured local directories >>> 2009-05-02 11:30:54,298 INFO org.apache.hadoop.mapred.TaskTracker: >>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find >>> >>> taskTracker/jobcache/job_200904301103_0944/attempt_200904301103_0944_r_000015_0/output/file.out >>> in any of the configured local directories >>> >>> >>> Lance >>> >>> >>> On Apr 30, 2009, at 12:04 PM, Todd Lipcon wrote: >>> >>> Hey Lance, >>> >>>> >>>> Thanks for the logs. They definitely confirmed my suspicion. There are >>>> two >>>> problems here: >>>> >>>> 1) If the JobTracker throws an exception during processing of a >>>> heartbeat, >>>> the tasktracker retries with no delay, since lastHeartbeat isn't updated >>>> in >>>> TaskTracker.offerService. This is related to HADOOP-3987 >>>> >>>> 2) If the TaskTracker sends a task in COMMIT_PENDING state with an >>>> invalid >>>> task id, the jobtracker will trigger a NullPointerException in >>>> JobTracker.getTasksToSave. Instead it should probably create a >>>> KillTaskAction. I just filed a JIRA to track this issue: >>>> >>>> https://issues.apache.org/jira/browse/HADOOP-5761 >>>> >>>> 3) The TaskTracker somehow had a task attempt in COMMIT_PENDING state >>>> that >>>> the JobTracker didn't know about. How it got there is a separate problem >>>> that's a bit harder to track down. >>>> >>>> Thanks >>>> -Todd >>>> >>>> On Thu, Apr 30, 2009 at 11:17 AM, Lance Riedel <la...@dotspots.com> >>>> wrote: >>>> >>>> Here are the job tracker logs from the same time (and yes.. there is >>>> >>>>> something there!!): >>>>> >>>>> >>>>> 2009-04-30 02:34:28,484 INFO org.apache.hadoop.mapred.JobTracker: >>>>> Serious >>>>> problem. While updating status, cannot find taskid >>>>> attempt_200904291917_0252_r_000003_0 >>>>> >>>>> >>>>> 2009-04-30 02:34:40,215 INFO org.apache.hadoop.ipc.Server: IPC Server >>>>> handler 2 on 54311, call >>>>> heartbeat(org.apache.hadoop.mapred.tasktrackersta...@1a93388, false, >>>>> true, >>>>> 5341) from 10.253.134.191:42688: error: java.io.IOException: >>>>> java.lang.NullPointerException >>>>> java.io.IOException: java.lang.NullPointerException >>>>> at >>>>> >>>>> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130) >>>>> at >>>>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923) >>>>> at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source) >>>>> at >>>>> >>>>> >>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) >>>>> at java.lang.reflect.Method.invoke(Method.java:597) >>>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481) >>>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894) >>>>> 2009-04-30 02:34:40,215 INFO org.apache.hadoop.mapred.JobTracker: >>>>> Serious >>>>> problem. While updating status, cannot find taskid >>>>> attempt_200904291917_0296_r_000014_1 >>>>> 2009-04-30 02:34:40,217 INFO org.apache.hadoop.mapred.JobTracker: >>>>> Adding >>>>> task 'attempt_200904291917_0352_r_000013_0' to tip >>>>> task_200904291917_0352_r_000013, for tracker >>>>> >>>>> 'tracker_domU-12-31-38-00-F0-41.compute-1.internal:localhost.localdomain/ >>>>> 127.0.0.1:42479' >>>>> 2009-04-30 02:34:40,217 INFO org.apache.hadoop.mapred.JobTracker: >>>>> Removed >>>>> completed task 'attempt_200904291917_0343_m_000003_0' from >>>>> >>>>> 'tracker_domU-12-31-38-00-F0-41.compute-1.internal:localhost.localdomain/ >>>>> 127.0.0.1:42479' >>>>> 2009-04-30 02:34:40,217 INFO org.apache.hadoop.mapred.JobTracker: >>>>> Removed >>>>> completed task 'attempt_200904291917_0343_m_000007_0' from >>>>> >>>>> 'tracker_domU-12-31-38-00-F0-41.compute-1.internal:localhost.localdomain/ >>>>> 127.0.0.1:42479' >>>>> >>>>> >>>>> And then.. a LOT more >>>>> >>>>> >>>>> 2009-04-30 02:34:40,433 INFO org.apache.hadoop.mapred.JobTracker: >>>>> Serious >>>>> problem. While updating status, cannot find taskid >>>>> attempt_200904291917_0252_r_000003_0 >>>>> 2009-04-30 02:34:40,433 WARN org.apache.hadoop.mapred.TaskInProgress: >>>>> Recieved duplicate status update of 'KILLED' for >>>>> 'attempt_200904291917_0352_m_000010_1' of TIP >>>>> 'task_200904291917_0352_m_000010' >>>>> 2009-04-30 02:34:40,433 INFO org.apache.hadoop.ipc.Server: IPC Server >>>>> handler 1 on 54311, call >>>>> heartbeat(org.apache.hadoop.mapred.tasktrackersta...@1b7b4c1, false, >>>>> true, >>>>> 5341) from 10.253.134.191:42688: error: java.io.IOException: >>>>> java.lang.NullPointerException >>>>> java.io.IOException: java.lang.NullPointerException >>>>> at >>>>> >>>>> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130) >>>>> at >>>>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923) >>>>> at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source) >>>>> at >>>>> >>>>> >>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) >>>>> at java.lang.reflect.Method.invoke(Method.java:597) >>>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481) >>>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894) >>>>> 2009-04-30 02:34:40,441 INFO org.apache.hadoop.mapred.JobTracker: >>>>> Serious >>>>> problem. While updating status, cannot find taskid >>>>> attempt_200904291917_0252_r_000003_0 >>>>> 2009-04-30 02:34:40,441 WARN org.apache.hadoop.mapred.TaskInProgress: >>>>> Recieved duplicate status update of 'KILLED' for >>>>> 'attempt_200904291917_0352_m_000010_1' of TIP >>>>> 'task_200904291917_0352_m_000010' >>>>> 2009-04-30 02:34:40,442 INFO org.apache.hadoop.ipc.Server: IPC Server >>>>> handler 1 on 54311, call >>>>> heartbeat(org.apache.hadoop.mapred.tasktrackersta...@1598c57, false, >>>>> true, >>>>> 5341) from 10.253.134.191:42688: error: java.io.IOException: >>>>> java.lang.NullPointerException >>>>> java.io.IOException: java.lang.NullPointerException >>>>> at >>>>> >>>>> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130) >>>>> at >>>>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923) >>>>> at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source) >>>>> at >>>>> >>>>> >>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) >>>>> at java.lang.reflect.Method.invoke(Method.java:597) >>>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481) >>>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894) >>>>> 2009-04-30 02:34:40,444 INFO org.apache.hadoop.mapred.JobTracker: >>>>> Serious >>>>> problem. While updating status, cannot find taskid >>>>> attempt_200904291917_0252_r_000003_0 >>>>> 2009-04-30 02:34:40,444 WARN org.apache.hadoop.mapred.TaskInProgress: >>>>> Recieved duplicate status update of 'KILLED' for >>>>> 'attempt_200904291917_0352_m_000010_1' of TIP >>>>> 'task_200904291917_0352_m_000010' >>>>> 2009-04-30 02:34:40,444 INFO org.apache.hadoop.ipc.Server: IPC Server >>>>> handler 7 on 54311, call >>>>> heartbeat(org.apache.hadoop.mapred.tasktrackersta...@cb3d5, false, >>>>> true, >>>>> 5341) from 10.253.134.191:42688: error: java.io.IOException: >>>>> java.lang.NullPointerException >>>>> java.io.IOException: java.lang.NullPointerException >>>>> at >>>>> >>>>> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130) >>>>> at >>>>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923) >>>>> at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source) >>>>> at >>>>> >>>>> >>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) >>>>> at java.lang.reflect.Method.invoke(Method.java:597) >>>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481) >>>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894) >>>>> 2009-04-30 02:34:40,447 INFO org.apache.hadoop.mapred.JobTracker: >>>>> Serious >>>>> problem. While updating status, cannot find taskid >>>>> attempt_200904291917_0252_r_000003_0 >>>>> 2009-04-30 02:34:40,447 WARN org.apache.hadoop.mapred.TaskInProgress: >>>>> Recieved duplicate status update of 'KILLED' for >>>>> 'attempt_200904291917_0352_m_000010_1' of TIP >>>>> 'task_200904291917_0352_m_000010' >>>>> 2009-04-30 02:34:40,447 INFO org.apache.hadoop.ipc.Server: IPC Server >>>>> handler 1 on 54311, call >>>>> heartbeat(org.apache.hadoop.mapred.tasktrackersta...@1f27d98, false, >>>>> true, >>>>> 5341) from 10.253.134.191:42688: error: java.io.IOException: >>>>> java.lang.NullPointerException >>>>> java.io.IOException: java.lang.NullPointerException >>>>> at >>>>> >>>>> org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java:2130) >>>>> at >>>>> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923) >>>>> at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source) >>>>> at >>>>> >>>>> >>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) >>>>> at java.lang.reflect.Method.invoke(Method.java:597) >>>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481) >>>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894) >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> On Apr 30, 2009, at 11:04 AM, Todd Lipcon wrote: >>>>> >>>>> Hey Lance, >>>>> >>>>> >>>>>> Did you see any error messages in the JobTracker logs around the time >>>>>> this >>>>>> started? I think I understand how this might happen. >>>>>> >>>>>> Thanks, >>>>>> -Todd >>>>>> >>>>>> On Thu, Apr 30, 2009 at 10:45 AM, Lance Riedel <la...@dotspots.com> >>>>>> wrote: >>>>>> >>>>>> I have not been able to reproduce. We are using version 19.1 with the >>>>>> >>>>>> following patches: >>>>>>> 4780-2v19.patch (Jira 4780) >>>>>>> closeAll3.patch (Jira 3998) >>>>>>> >>>>>>> Thanks, >>>>>>> Lance >>>>>>> >>>>>>> >>>>>>> On Apr 30, 2009, at 10:40 AM, Todd Lipcon wrote: >>>>>>> >>>>>>> Hi Lance, >>>>>>> >>>>>>> >>>>>>> Can I ask what version you were running when you saw this? Is it >>>>>>>> reproducible? I'm trying to look at the code path that might produce >>>>>>>> such >>>>>>>> a >>>>>>>> behavior and want to make sure I'm looking at the right version. >>>>>>>> >>>>>>>> Thanks >>>>>>>> -Todd >>>>>>>> >>>>>>>> On Thu, Apr 30, 2009 at 9:33 AM, Lance Riedel <la...@dotspots.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>> Has anyone seen this before? Our task tracker produced a 2.7 gig log >>>>>>>> file >>>>>>>> >>>>>>>> in a few hours. The entry is all the same (every 2 ms): >>>>>>>> >>>>>>>>> >>>>>>>>> 2009-04-30 02:34:40,207 INFO org.apache.hadoop.mapred.TaskTracker: >>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' >>>>>>>>> with >>>>>>>>> reponseId '5341 >>>>>>>>> 2009-04-30 02:34:40,398 INFO org.apache.hadoop.mapred.TaskTracker: >>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' >>>>>>>>> with >>>>>>>>> reponseId '5341 >>>>>>>>> 2009-04-30 02:34:40,403 INFO org.apache.hadoop.mapred.TaskTracker: >>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' >>>>>>>>> with >>>>>>>>> reponseId '5341 >>>>>>>>> 2009-04-30 02:34:40,411 INFO org.apache.hadoop.mapred.TaskTracker: >>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' >>>>>>>>> with >>>>>>>>> reponseId '5341 >>>>>>>>> 2009-04-30 02:34:40,414 INFO org.apache.hadoop.mapred.TaskTracker: >>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' >>>>>>>>> with >>>>>>>>> reponseId '5341 >>>>>>>>> 2009-04-30 02:34:40,417 INFO org.apache.hadoop.mapred.TaskTracker: >>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' >>>>>>>>> with >>>>>>>>> reponseId '5341 >>>>>>>>> 2009-04-30 02:34:40,420 INFO org.apache.hadoop.mapred.TaskTracker: >>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' >>>>>>>>> with >>>>>>>>> reponseId '5341 >>>>>>>>> ... (And on and on and on...) >>>>>>>>> >>>>>>>>> >>>>>>>>> These are the few lines before it started: >>>>>>>>> >>>>>>>>> 2009-04-30 02:34:29,780 INFO >>>>>>>>> org.apache.hadoop.mapred.TaskTracker.clienttrace: src: >>>>>>>>> xxx.xxx.xxx.xxx:50060, dest: 10.253.178.95:40268, bytes: 3341324, >>>>>>>>> op: >>>>>>>>> MAPRED_SHUFFLE, cliID: attempt_200904291917_0352_m_000006_0 >>>>>>>>> 2009-04-30 02:34:31,522 INFO org.apache.hadoop.mapred.TaskTracker: >>>>>>>>> Sent >>>>>>>>> out >>>>>>>>> 418891 bytes for reduce: 12 from map: >>>>>>>>> attempt_200904291917_0352_m_000007_0 >>>>>>>>> given 418891/418887 from 4301462 with (22, 171) >>>>>>>>> 2009-04-30 02:34:31,522 INFO >>>>>>>>> org.apache.hadoop.mapred.TaskTracker.clienttrace: src: >>>>>>>>> xxx.xxx.xxx.xxx:50060, dest: xxx.xxx.xxx.xxx:40268, bytes: 418891, >>>>>>>>> op: >>>>>>>>> MAPRED_SHUFFLE, cliID: attempt_200904291917_0352_m_000007_0 >>>>>>>>> 2009-04-30 02:34:35,382 INFO org.apache.hadoop.mapred.TaskTracker: >>>>>>>>> attempt_200904291917_0352_r_000003_0 0.3030303% reduce > copy (10 >>>>>>>>> of >>>>>>>>> 11 >>>>>>>>> at >>>>>>>>> 0.32 MB/s) > >>>>>>>>> 2009-04-30 02:34:38,385 INFO org.apache.hadoop.mapred.TaskTracker: >>>>>>>>> attempt_200904291917_0352_r_000003_0 0.3030303% reduce > copy (10 >>>>>>>>> of >>>>>>>>> 11 >>>>>>>>> at >>>>>>>>> 0.32 MB/s) > >>>>>>>>> 2009-04-30 02:34:40,207 INFO org.apache.hadoop.mapred.TaskTracker: >>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' >>>>>>>>> with >>>>>>>>> reponseId '5341 >>>>>>>>> 2009-04-30 02:34:40,398 INFO org.apache.hadoop.mapred.TaskTracker: >>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' >>>>>>>>> with >>>>>>>>> reponseId '5341 >>>>>>>>> 2009-04-30 02:34:40,403 INFO org.apache.hadoop.mapred.TaskTracker: >>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' >>>>>>>>> with >>>>>>>>> reponseId '5341 >>>>>>>>> 2009-04-30 02:34:40,411 INFO org.apache.hadoop.mapred.TaskTracker: >>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' >>>>>>>>> with >>>>>>>>> reponseId '5341 >>>>>>>>> 2009-04-30 02:34:40,414 INFO org.apache.hadoop.mapred.TaskTracker: >>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' >>>>>>>>> with >>>>>>>>> reponseId '5341 >>>>>>>>> 2009-04-30 02:34:40,417 INFO org.apache.hadoop.mapred.TaskTracker: >>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' >>>>>>>>> with >>>>>>>>> reponseId '5341 >>>>>>>>> 2009-04-30 02:34:40,420 INFO org.apache.hadoop.mapred.TaskTracker: >>>>>>>>> Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' >>>>>>>>> with >>>>>>>>> reponseId '5341 >>>>>>>>> >>>>>>>>> --And on for 2+ gigs >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>> >>>>> >>> > >