Hi Todd,
Not sure if this is related, but our hadoop cluster in general is getting more and more unstable. the logs are full of this error message (but having trouble tracking down the root problem):

2009-05-02 11:30:39,294 INFO org.apache.hadoop.mapred.TaskTracker: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_200904301103_1111/ attempt_200904301103_1111_m_000001_1/output/file.out in any of the configured local directories 2009-05-02 11:30:39,294 INFO org.apache.hadoop.mapred.TaskTracker: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_200904301103_1675/ attempt_200904301103_1675_r_000012_1/output/file.out in any of the configured local directories 2009-05-02 11:30:44,295 INFO org.apache.hadoop.mapred.TaskTracker: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_200904301103_0944/ attempt_200904301103_0944_r_000015_0/output/file.out in any of the configured local directories 2009-05-02 11:30:44,295 INFO org.apache.hadoop.mapred.TaskTracker: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_200904301103_1111/ attempt_200904301103_1111_m_000001_1/output/file.out in any of the configured local directories 2009-05-02 11:30:44,295 INFO org.apache.hadoop.mapred.TaskTracker: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_200904301103_1675/ attempt_200904301103_1675_r_000012_1/output/file.out in any of the configured local directories 2009-05-02 11:30:49,296 INFO org.apache.hadoop.mapred.TaskTracker: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_200904301103_0944/ attempt_200904301103_0944_r_000015_0/output/file.out in any of the configured local directories 2009-05-02 11:30:49,296 INFO org.apache.hadoop.mapred.TaskTracker: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_200904301103_1111/ attempt_200904301103_1111_m_000001_1/output/file.out in any of the configured local directories 2009-05-02 11:30:49,297 INFO org.apache.hadoop.mapred.TaskTracker: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_200904301103_1675/ attempt_200904301103_1675_r_000012_1/output/file.out in any of the configured local directories 2009-05-02 11:30:54,298 INFO org.apache.hadoop.mapred.TaskTracker: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_200904301103_0944/ attempt_200904301103_0944_r_000015_0/output/file.out in any of the configured local directories


Lance

On Apr 30, 2009, at 12:04 PM, Todd Lipcon wrote:

Hey Lance,

Thanks for the logs. They definitely confirmed my suspicion. There are two
problems here:

1) If the JobTracker throws an exception during processing of a heartbeat, the tasktracker retries with no delay, since lastHeartbeat isn't updated in
TaskTracker.offerService. This is related to HADOOP-3987

2) If the TaskTracker sends a task in COMMIT_PENDING state with an invalid
task id, the jobtracker will trigger a NullPointerException in
JobTracker.getTasksToSave. Instead it should probably create a
KillTaskAction. I just filed a JIRA to track this issue:

https://issues.apache.org/jira/browse/HADOOP-5761

3) The TaskTracker somehow had a task attempt in COMMIT_PENDING state that the JobTracker didn't know about. How it got there is a separate problem
that's a bit harder to track down.

Thanks
-Todd

On Thu, Apr 30, 2009 at 11:17 AM, Lance Riedel <la...@dotspots.com> wrote:

Here are the job tracker logs from the same time (and yes.. there is
something there!!):


2009-04-30 02:34:28,484 INFO org.apache.hadoop.mapred.JobTracker: Serious
problem.  While updating status, cannot find taskid
attempt_200904291917_0252_r_000003_0


2009-04-30 02:34:40,215 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 2 on 54311, call
heartbeat(org.apache.hadoop.mapred.tasktrackersta...@1a93388, false, true,
5341) from 10.253.134.191:42688: error: java.io.IOException:
java.lang.NullPointerException
java.io.IOException: java.lang.NullPointerException
      at
org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java: 2130)
      at
org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
      at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
      at
sun .reflect .DelegatingMethodAccessorImpl .invoke(DelegatingMethodAccessorImpl.java:25)
      at java.lang.reflect.Method.invoke(Method.java:597)
      at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
      at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
2009-04-30 02:34:40,215 INFO org.apache.hadoop.mapred.JobTracker: Serious
problem.  While updating status, cannot find taskid
attempt_200904291917_0296_r_000014_1
2009-04-30 02:34:40,217 INFO org.apache.hadoop.mapred.JobTracker: Adding
task 'attempt_200904291917_0352_r_000013_0' to tip
task_200904291917_0352_r_000013, for tracker
'tracker_domU-12-31-38-00- F0-41.compute-1.internal:localhost.localdomain/
127.0.0.1:42479'
2009-04-30 02:34:40,217 INFO org.apache.hadoop.mapred.JobTracker: Removed
completed task 'attempt_200904291917_0343_m_000003_0' from
'tracker_domU-12-31-38-00- F0-41.compute-1.internal:localhost.localdomain/
127.0.0.1:42479'
2009-04-30 02:34:40,217 INFO org.apache.hadoop.mapred.JobTracker: Removed
completed task 'attempt_200904291917_0343_m_000007_0' from
'tracker_domU-12-31-38-00- F0-41.compute-1.internal:localhost.localdomain/
127.0.0.1:42479'


And then.. a LOT more


2009-04-30 02:34:40,433 INFO org.apache.hadoop.mapred.JobTracker: Serious
problem.  While updating status, cannot find taskid
attempt_200904291917_0252_r_000003_0
2009-04-30 02:34:40,433 WARN org.apache.hadoop.mapred.TaskInProgress:
Recieved duplicate status update of 'KILLED' for
'attempt_200904291917_0352_m_000010_1' of TIP
'task_200904291917_0352_m_000010'
2009-04-30 02:34:40,433 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 1 on 54311, call
heartbeat(org.apache.hadoop.mapred.tasktrackersta...@1b7b4c1, false, true,
5341) from 10.253.134.191:42688: error: java.io.IOException:
java.lang.NullPointerException
java.io.IOException: java.lang.NullPointerException
      at
org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java: 2130)
      at
org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
      at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
      at
sun .reflect .DelegatingMethodAccessorImpl .invoke(DelegatingMethodAccessorImpl.java:25)
      at java.lang.reflect.Method.invoke(Method.java:597)
      at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
      at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
2009-04-30 02:34:40,441 INFO org.apache.hadoop.mapred.JobTracker: Serious
problem.  While updating status, cannot find taskid
attempt_200904291917_0252_r_000003_0
2009-04-30 02:34:40,441 WARN org.apache.hadoop.mapred.TaskInProgress:
Recieved duplicate status update of 'KILLED' for
'attempt_200904291917_0352_m_000010_1' of TIP
'task_200904291917_0352_m_000010'
2009-04-30 02:34:40,442 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 1 on 54311, call
heartbeat(org.apache.hadoop.mapred.tasktrackersta...@1598c57, false, true,
5341) from 10.253.134.191:42688: error: java.io.IOException:
java.lang.NullPointerException
java.io.IOException: java.lang.NullPointerException
      at
org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java: 2130)
      at
org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
      at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
      at
sun .reflect .DelegatingMethodAccessorImpl .invoke(DelegatingMethodAccessorImpl.java:25)
      at java.lang.reflect.Method.invoke(Method.java:597)
      at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
      at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
2009-04-30 02:34:40,444 INFO org.apache.hadoop.mapred.JobTracker: Serious
problem.  While updating status, cannot find taskid
attempt_200904291917_0252_r_000003_0
2009-04-30 02:34:40,444 WARN org.apache.hadoop.mapred.TaskInProgress:
Recieved duplicate status update of 'KILLED' for
'attempt_200904291917_0352_m_000010_1' of TIP
'task_200904291917_0352_m_000010'
2009-04-30 02:34:40,444 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 7 on 54311, call
heartbeat(org.apache.hadoop.mapred.tasktrackersta...@cb3d5, false, true,
5341) from 10.253.134.191:42688: error: java.io.IOException:
java.lang.NullPointerException
java.io.IOException: java.lang.NullPointerException
      at
org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java: 2130)
      at
org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
      at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
      at
sun .reflect .DelegatingMethodAccessorImpl .invoke(DelegatingMethodAccessorImpl.java:25)
      at java.lang.reflect.Method.invoke(Method.java:597)
      at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
      at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)
2009-04-30 02:34:40,447 INFO org.apache.hadoop.mapred.JobTracker: Serious
problem.  While updating status, cannot find taskid
attempt_200904291917_0252_r_000003_0
2009-04-30 02:34:40,447 WARN org.apache.hadoop.mapred.TaskInProgress:
Recieved duplicate status update of 'KILLED' for
'attempt_200904291917_0352_m_000010_1' of TIP
'task_200904291917_0352_m_000010'
2009-04-30 02:34:40,447 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 1 on 54311, call
heartbeat(org.apache.hadoop.mapred.tasktrackersta...@1f27d98, false, true,
5341) from 10.253.134.191:42688: error: java.io.IOException:
java.lang.NullPointerException
java.io.IOException: java.lang.NullPointerException
      at
org.apache.hadoop.mapred.JobTracker.getTasksToSave(JobTracker.java: 2130)
      at
org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:1923)
      at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
      at
sun .reflect .DelegatingMethodAccessorImpl .invoke(DelegatingMethodAccessorImpl.java:25)
      at java.lang.reflect.Method.invoke(Method.java:597)
      at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
      at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)










On Apr 30, 2009, at 11:04 AM, Todd Lipcon wrote:

Hey Lance,

Did you see any error messages in the JobTracker logs around the time this
started? I think I understand how this might happen.

Thanks,
-Todd

On Thu, Apr 30, 2009 at 10:45 AM, Lance Riedel <la...@dotspots.com>
wrote:

I have not been able to reproduce. We are using version 19.1 with the
following patches:
4780-2v19.patch (Jira  4780)
closeAll3.patch (Jira 3998)

Thanks,
Lance


On Apr 30, 2009, at 10:40 AM, Todd Lipcon wrote:

Hi Lance,


Can I ask what version you were running when you saw this? Is it
reproducible? I'm trying to look at the code path that might produce
such
a
behavior and want to make sure I'm looking at the right version.

Thanks
-Todd

On Thu, Apr 30, 2009 at 9:33 AM, Lance Riedel <la...@dotspots.com>
wrote:

Has anyone seen this before? Our task tracker produced a 2.7 gig log
file

in a few hours. The entry is all the same (every 2 ms):

2009-04-30 02:34:40,207 INFO org.apache.hadoop.mapred.TaskTracker: Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
reponseId '5341
2009-04-30 02:34:40,398 INFO org.apache.hadoop.mapred.TaskTracker: Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
reponseId '5341
2009-04-30 02:34:40,403 INFO org.apache.hadoop.mapred.TaskTracker: Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
reponseId '5341
2009-04-30 02:34:40,411 INFO org.apache.hadoop.mapred.TaskTracker: Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
reponseId '5341
2009-04-30 02:34:40,414 INFO org.apache.hadoop.mapred.TaskTracker: Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
reponseId '5341
2009-04-30 02:34:40,417 INFO org.apache.hadoop.mapred.TaskTracker: Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
reponseId '5341
2009-04-30 02:34:40,420 INFO org.apache.hadoop.mapred.TaskTracker: Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
reponseId '5341
... (And on and on and on...)


These are the few lines before it started:

2009-04-30 02:34:29,780 INFO
org.apache.hadoop.mapred.TaskTracker.clienttrace: src:
xxx.xxx.xxx.xxx:50060, dest: 10.253.178.95:40268, bytes: 3341324, op:
MAPRED_SHUFFLE, cliID: attempt_200904291917_0352_m_000006_0
2009-04-30 02:34:31,522 INFO org.apache.hadoop.mapred.TaskTracker: Sent
out
418891 bytes for reduce: 12 from map:
attempt_200904291917_0352_m_000007_0
given 418891/418887 from 4301462 with (22, 171)
2009-04-30 02:34:31,522 INFO
org.apache.hadoop.mapred.TaskTracker.clienttrace: src:
xxx.xxx.xxx.xxx:50060, dest: xxx.xxx.xxx.xxx:40268, bytes: 418891, op:
MAPRED_SHUFFLE, cliID: attempt_200904291917_0352_m_000007_0
2009-04-30 02:34:35,382 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200904291917_0352_r_000003_0 0.3030303% reduce > copy (10 of 11
at
0.32 MB/s) >
2009-04-30 02:34:38,385 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200904291917_0352_r_000003_0 0.3030303% reduce > copy (10 of 11
at
0.32 MB/s) >
2009-04-30 02:34:40,207 INFO org.apache.hadoop.mapred.TaskTracker: Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
reponseId '5341
2009-04-30 02:34:40,398 INFO org.apache.hadoop.mapred.TaskTracker: Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
reponseId '5341
2009-04-30 02:34:40,403 INFO org.apache.hadoop.mapred.TaskTracker: Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
reponseId '5341
2009-04-30 02:34:40,411 INFO org.apache.hadoop.mapred.TaskTracker: Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
reponseId '5341
2009-04-30 02:34:40,414 INFO org.apache.hadoop.mapred.TaskTracker: Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
reponseId '5341
2009-04-30 02:34:40,417 INFO org.apache.hadoop.mapred.TaskTracker: Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
reponseId '5341
2009-04-30 02:34:40,420 INFO org.apache.hadoop.mapred.TaskTracker: Resending 'status' to 'ec2-xx-xx-xx-xx.compute-1.amazonaws.com' with
reponseId '5341

--And on for    2+ gigs






Reply via email to