Reduce tips complete 100%, but job does not complete saying reduces still running.
----------------------------------------------------------------------------------
Key: HADOOP-2167
URL: https://issues.apache.org/jira/browse/HADOOP-2167
Project: Hadoop
Issue Type: Bug
Reporter: Amareshwari Sri Ramadasu
Assignee: Arun C Murthy
Priority: Critical

The job's reduces are stuck at 99.43% progress with 2 reduces in the running state, and the job does not complete. However, the reduce task list on the job tracker shows them as 100% complete and marked SUCCEEDED, and a finish time is available in jobtasks.jsp and in the job history as well.

With ipc.client.timeout = 600000, the TTs running the reduces log the following exceptions.

On one of the TTs, the logs show the following:

2007-11-07 08:34:16,092 INFO org.apache.hadoop.mapred.TaskTracker: Task task_200711070637_0001_r_000150_0 is done.
2007-11-07 08:35:34,013 INFO org.apache.hadoop.mapred.TaskTracker: Task task_200711070637_0001_r_000156_0 is done.
2007-11-07 08:42:44,751 ERROR org.apache.hadoop.mapred.TaskTracker: Caught exception: java.net.SocketTimeoutException: timedout waiting for rpc response
    at org.apache.hadoop.ipc.Client.call(Client.java:484)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)
    at org.apache.hadoop.mapred.$Proxy0.heartbeat(Unknown Source)
    at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:897)
    at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:799)
    at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1193)
    at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2055)
2007-11-07 08:42:44,767 INFO org.apache.hadoop.mapred.TaskTracker: Resending 'status' to .................

On the other TT,

2007-11-07 08:40:30,484 INFO org.apache.hadoop.mapred.TaskTracker: Task task_200711070637_0001_r_000160_0 is done.
2007-11-07 08:42:45,508 ERROR org.apache.hadoop.mapred.TaskTracker: Caught exception: java.net.SocketTimeoutException: timedout waiting for rpc response
    at org.apache.hadoop.ipc.Client.call(Client.java:484)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)
    at org.apache.hadoop.mapred.$Proxy0.heartbeat(Unknown Source)
    at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:897)
    at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:799)
    at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1193)
    at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2055)
2007-11-07 08:42:45,508 INFO org.apache.hadoop.mapred.TaskTracker: Resending 'status' to ..........

In the JT logs, the reduce tasks complete successfully:

2007-11-07 06:39:09,151 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200711070637_0001_r_000160_0' to tip tip_200711070637_0001_r_000160, for tracker 'x'
2007-11-07 08:42:45,708 INFO org.apache.hadoop.mapred.TaskRunner: Saved output of task 'task_200711070637_0001_r_000160_0' to 'y'
2007-11-07 08:42:45,708 INFO org.apache.hadoop.mapred.JobInProgress: Task 'task_200711070637_0001_r_000160_0' has completed tip_200711070637_0001_r_000160 successfully.

This would suggest that if tasks finish before the heartbeat timeout fires, the problem occurs in the progress update. This is not consistent either, since other reduce tasks in the same situation complete successfully.
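
To make the timing in the TT logs easier to follow, here is a small sketch that computes how long each task had been reported done before the heartbeat timeout was logged, and roughly when the timed-out heartbeat call may have started. The class and method names (HeartbeatTimeline, millisBetween, estimatedCallStart) are hypothetical helpers, not part of Hadoop. It assumes that ipc.client.timeout is in milliseconds and that the failed call blocked for the full timeout before the exception was logged; neither is confirmed by the logs themselves.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for reading the log excerpts in this report; not Hadoop code.
class HeartbeatTimeline {
    // Timestamp layout used by the TaskTracker/JobTracker log lines quoted above.
    private static final DateTimeFormatter LOG_TIME =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    static LocalDateTime parse(String timestamp) {
        return LocalDateTime.parse(timestamp, LOG_TIME);
    }

    /** Milliseconds elapsed from one log timestamp to another. */
    public static long millisBetween(String from, String to) {
        return Duration.between(parse(from), parse(to)).toMillis();
    }

    /**
     * Estimated start of the timed-out call.
     * Assumption: the call blocked for the whole timeout before the error was logged.
     */
    public static LocalDateTime estimatedCallStart(String timeoutLoggedAt, long timeoutMillis) {
        return parse(timeoutLoggedAt).minus(Duration.ofMillis(timeoutMillis));
    }

    public static void main(String[] args) {
        long ipcClientTimeout = 600_000L; // value from the report; unit assumed to be ms

        // First TT: two "is done" lines, then the timeout.
        String firstTimeout = "2007-11-07 08:42:44,751";
        System.out.println("r_000150 done -> timeout (ms): "
                + millisBetween("2007-11-07 08:34:16,092", firstTimeout));
        System.out.println("r_000156 done -> timeout (ms): "
                + millisBetween("2007-11-07 08:35:34,013", firstTimeout));
        System.out.println("estimated heartbeat start on first TT: "
                + estimatedCallStart(firstTimeout, ipcClientTimeout));

        // Second TT: one "is done" line, then the timeout.
        String secondTimeout = "2007-11-07 08:42:45,508";
        System.out.println("r_000160 done -> timeout (ms): "
                + millisBetween("2007-11-07 08:40:30,484", secondTimeout));
        System.out.println("estimated heartbeat start on second TT: "
                + estimatedCallStart(secondTimeout, ipcClientTimeout));
    }
}
```

If the assumptions hold, all three tasks were reported done on the TT while a heartbeat call was still outstanding. That fits the observation that the tasks finished before the timeout, but it does not by itself explain why other reduces in the same situation completed.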