[jira] [Commented] (TEZ-2781) Fallback to send only TaskAttemptFailedEvent if taskFailed heartbeat fails
[ https://issues.apache.org/jira/browse/TEZ-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14950102#comment-14950102 ]

Jeff Zhang commented on TEZ-2781:
---------------------------------

Committed to 0.5/0.6/0.7/master

> Fallback to send only TaskAttemptFailedEvent if taskFailed heartbeat fails
> --------------------------------------------------------------------------
>
>                 Key: TEZ-2781
>                 URL: https://issues.apache.org/jira/browse/TEZ-2781
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.5.4
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>         Attachments: TEZ-2781-1.patch, TEZ-2781-2.patch, TEZ-2781-3.patch
>
>
> It is possible that the taskFailed heartbeat fails to reach the AM (because the
> counter limit is exceeded). In that case the client cannot get the right
> diagnostic info.
> {code}
> hive> select gencounter(2500) from (select count(*) from abc) a;
> Query ID = hrt_qa_2015083122_1956a7d6-1d41-406b-9266-af56ed21883c
> Total jobs = 1
> Launching Job 1 out of 1
> Status: Running (Executing on YARN cluster with App id application_1440915851419_0007)
>
> VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
> Map 1      SUCCEEDED      0          0        0        0       0       0
> Reducer 2     FAILED      1          0        0        1       4       0
>
> VERTICES: 01/02  [>>--] 0%  ELAPSED TIME: 25.44 s
>
> Status: Failed
> Vertex failed, vertexName=Reducer 2, vertexId=vertex_1440915851419_0007_2_01,
> diagnostics=[Task failed, taskId=task_1440915851419_0007_2_01_00,
> diagnostics=[TaskAttempt 0 failed, info=[Container
> container_e02_1440915851419_0007_01_02 finished with diagnostics set to
> [Container failed. ]], TaskAttempt 1 failed, info=[Container
> container_e02_1440915851419_0007_01_03 finished with diagnostics set to
> [Container failed. ]], TaskAttempt 2 failed, info=[Container
> container_e02_1440915851419_0007_01_04 finished with diagnostics set to
> [Container failed. ]], TaskAttempt 3 failed, info=[Container
> container_e02_1440915851419_0007_01_05 finished with diagnostics set to
> [Container failed. ]]], Vertex failed as one or more tasks failed.
> failedTasks:1, Vertex vertex_1440915851419_0007_2_01 [Reducer 2]
> killed/failed due to:null]
> DAG failed due to vertex failure. failedVertices:1 killedVertices:0
> FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask
> {code}
> {code}
> 2015-08-31 22:00:27,528 WARN [TezChild] task.TezTaskRunner: Heartbeat failure caused by communication failure
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RpcServerException): IPC server unable to read call parameters: Too many counters: 2001 max=2000
>   at org.apache.hadoop.ipc.Client.call(Client.java:1469)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1400)
>   at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:244)
>   at com.sun.proxy.$Proxy9.heartbeat(Unknown Source)
>   at org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.heartbeat(TaskReporter.java:249)
>   at org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.taskFailed(TaskReporter.java:344)
>   at org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.access$300(TaskReporter.java:119)
>   at org.apache.tez.runtime.task.TaskReporter.taskFailed(TaskReporter.java:381)
>   at org.apache.tez.runtime.task.TezTaskRunner.sendFailure(TezTaskRunner.java:257)
>   at org.apache.tez.runtime.task.TezTaskRunner.access$600(TezTaskRunner.java:51)
>   at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:224)
>   at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:168)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
>   at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:168)
>   at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:163)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.la
[jira] [Commented] (TEZ-2781) Fallback to send only TaskAttemptFailedEvent if taskFailed heartbeat fails
[ https://issues.apache.org/jira/browse/TEZ-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14939162#comment-14939162 ]

Siddharth Seth commented on TEZ-2781:
-------------------------------------

bq. It should OK as long as TaskCommunicatorManager handle the event ordering correctly.
It works for now, but has the potential to break in the future if additional task events are handled directly instead of sending them to the Vertex. If not changing this, can you please add a comment in TaskCommunicatorManager indicating the importance of the ordering of events.

bq. Catch LimitExceededException would involve lots of code changes.
Not sure I understand this, but looking at this again, catch Exception seems better - to ensure the failure message is sent to the task.

Rest looks good. +1
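As an illustration of the "catch Exception seems better" point: the traces quoted in this thread show the counter limit surfacing as two different exception types, a local LimitExceededException and an org.apache.hadoop.ipc.RemoteException from the heartbeat call. The sketch below is a hedged toy model, not Tez code. `LocalLimitException`, `RemoteCallException`, `narrowCatch` and `broadCatch` are hypothetical names introduced only for this example.

```java
// Toy model only: none of these types are Tez or Hadoop classes.
class CatchScopeSketch {

    /** Hypothetical stand-in for a limit error raised inside the task process. */
    static final class LocalLimitException extends RuntimeException {
        LocalLimitException(String message) { super(message); }
    }

    /** Hypothetical stand-in for the same limit error reported by the remote side. */
    static final class RemoteCallException extends RuntimeException {
        RemoteCallException(String message) { super(message); }
    }

    /** Falls back only for the local limit error; anything else propagates. */
    static String narrowCatch(Runnable fullReport) {
        try {
            fullReport.run();
            return "full";
        } catch (LocalLimitException e) {
            return "fallback";
        }
    }

    /** Falls back for any exception thrown while sending the full report. */
    static String broadCatch(Runnable fullReport) {
        try {
            fullReport.run();
            return "full";
        } catch (Exception e) {
            return "fallback";
        }
    }

    public static void main(String[] args) {
        Runnable local = () -> { throw new LocalLimitException("Too many counters: 1201 max=1200"); };
        Runnable remote = () -> { throw new RemoteCallException("Too many counters: 2001 max=2000"); };

        System.out.println("broad/local:  " + broadCatch(local));
        System.out.println("broad/remote: " + broadCatch(remote));
        System.out.println("narrow/local: " + narrowCatch(local));
        try {
            narrowCatch(remote);
        } catch (RemoteCallException e) {
            // The narrow handler never ran its fallback for this failure.
            System.out.println("narrow/remote escaped: " + e.getMessage());
        }
    }
}
```

In this model the narrow handler leaves the remote failure unhandled, so no fallback report would be sent for it, while the broad handler covers both cases.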
[jira] [Commented] (TEZ-2781) Fallback to send only TaskAttemptFailedEvent if taskFailed heartbeat fails
[ https://issues.apache.org/jira/browse/TEZ-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14936726#comment-14936726 ]

TezQA commented on TEZ-2781:
----------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12764398/TEZ-2781-3.patch
against master revision b153035.

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.

{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 javadoc{color}. There were no new javadoc warning messages.

{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 3.0.1) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:red}-1 core tests{color}. The following test timeouts occurred in :
org.apache.tez.dag.history.logging.ats.TestATSHistoryWithMiniCluster

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/1187//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/1187//console

This message is automatically generated.
[jira] [Commented] (TEZ-2781) Fallback to send only TaskAttemptFailedEvent if taskFailed heartbeat fails
[ https://issues.apache.org/jira/browse/TEZ-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14936523#comment-14936523 ]

Jeff Zhang commented on TEZ-2781:
---------------------------------

Minor update on the patch (use TezConfiguration.TEZ_COUNTERS_MAX_DEFAULT instead of 2000 in the test).

bq. The ordering of the events should be StatusUpdate followed by TaskAttemptFailedEvent. TaskAttemptFailedEvent will cause the attempt to move into a Failed state, at which status updates are ignored.
It should be OK as long as TaskCommunicatorManager handles the event ordering correctly.

bq. Instead of catch (Exception) - is it possible to catch (LimitExceededException) - will this harm anything ?
Catching LimitExceededException would involve lots of code changes. I think we should do that in the long term. Actually, this ticket only partially resolves the counter-limit issue. The limit can still be exceeded on the AM side; I will handle that in another ticket.

bq. in the absence of this patch, what behaviour are you seeing ? The processor reports failure because of the LimitExceededException. The TaskReporter then fails while trying to report the error to the AM - and the AM waits for the timeout to kill the task ?
TezChild exits due to the LimitExceededException, and the task fails because the container exits, so the task diagnostics cannot be retrieved.

{noformat}
2015-09-30 15:39:27,084 [ERROR] [main] |yarn.YarnUncaughtExceptionHandler|: Thread Thread[main,5,main] threw an Exception.
org.apache.tez.common.counters.LimitExceededException: Too many counters: 1201 max=1200
  at org.apache.tez.common.counters.Limits.checkCounters(Limits.java:88)
  at org.apache.tez.common.counters.Limits.incrCounters(Limits.java:95)
  at org.apache.tez.common.counters.AbstractCounterGroup.addCounter(AbstractCounterGroup.java:76)
  at org.apache.tez.common.counters.AbstractCounterGroup.addCounterImpl(AbstractCounterGroup.java:93)
  at org.apache.tez.common.counters.AbstractCounterGroup.findCounter(AbstractCounterGroup.java:104)
  at org.apache.tez.common.counters.AbstractCounterGroup.incrAllCounters(AbstractCounterGroup.java:199)
  at org.apache.tez.common.counters.AbstractCounters.incrAllCounters(AbstractCounters.java:363)
  at org.apache.tez.runtime.RuntimeTask.getCounters(RuntimeTask.java:127)
  at org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.getStatusUpdateEvent(TaskReporter.java:354)
  at org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.taskFailed(TaskReporter.java:377)
  at org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.access$300(TaskReporter.java:128)
  at org.apache.tez.runtime.task.TaskReporter.taskFailed(TaskReporter.java:428)
  at org.apache.tez.runtime.task.TezTaskRunner2.run(TezTaskRunner2.java:205)
  at org.apache.tez.runtime.task.TezChild.run(TezChild.java:256)
  at org.apache.tez.runtime.task.TezChild.main(TezChild.java:488)
{noformat}
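The fallback named in the issue title can be sketched roughly as follows. This is a minimal toy model under stated assumptions, not the committed patch: `Heartbeat`, `StatusSource` and `reportFailure` are hypothetical names, and events are plain strings rather than Tez event objects. It only shows the shape of the idea: try to send the status update together with the failure event, and if anything goes wrong, send the failure event alone so the diagnostics still reach the AM.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: none of these types are Tez classes.
class FallbackReportSketch {

    /** Hypothetical transport that delivers a batch of events to the AM. */
    interface Heartbeat {
        void send(List<String> events);
    }

    /** Hypothetical source of the status update (the part that carries counters). */
    interface StatusSource {
        String statusUpdate();
    }

    /**
     * Sends [statusUpdate, failedEvent] in that order. If building or sending
     * the full report throws, falls back to sending only the failed event.
     * Returns true when the full report was sent, false when the fallback was used.
     */
    static boolean reportFailure(StatusSource source, Heartbeat heartbeat, String failedEvent) {
        try {
            String status = source.statusUpdate();
            heartbeat.send(Arrays.asList(status, failedEvent));
            return true;
        } catch (Exception e) {
            // Counters are lost in this path, but the failure diagnostics are not.
            heartbeat.send(Arrays.asList(failedEvent));
            return false;
        }
    }

    public static void main(String[] args) {
        List<List<String>> receivedByAm = new ArrayList<>();
        Heartbeat am = receivedByAm::add;

        // Simulate the counter limit being hit while the status update is built.
        StatusSource overLimit = () -> {
            throw new IllegalStateException("Too many counters: 1201 max=1200");
        };

        boolean full = reportFailure(overLimit, am, "TaskAttemptFailedEvent");
        System.out.println("full report sent: " + full + ", AM received: " + receivedByAm);
    }
}
```

The trade-off in this sketch is deliberate: the fallback path drops the counters of the failed attempt in exchange for a failure report that the AM can actually receive.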
[jira] [Commented] (TEZ-2781) Fallback to send only TaskAttemptFailedEvent if taskFailed heartbeat fails
[ https://issues.apache.org/jira/browse/TEZ-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14936404#comment-14936404 ]

Siddharth Seth commented on TEZ-2781:
-------------------------------------

A couple of things.
- The ordering of the events should be StatusUpdate followed by TaskAttemptFailedEvent. TaskAttemptFailedEvent will cause the attempt to move into a Failed state, at which point status updates are ignored. We'll end up dropping counters for failed tasks. (It looks like there are no tests which cover this; I will create a separate jira to add such a test.)
- Instead of catch (Exception), is it possible to catch (LimitExceededException)? Will this harm anything?
- The test would be more robust if the number of counters generated were based on values from TezConfiguration, instead of 2000.

[~zjffdu] - in the absence of this patch, what behaviour are you seeing? The processor reports failure because of the LimitExceededException. The TaskReporter then fails while trying to report the error to the AM, and the AM waits for the timeout to kill the task?
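The ordering concern in the first bullet can be illustrated with a small state-machine sketch. It is a simplified model under assumptions, not the Tez attempt state machine: `Attempt`, `State` and the string-encoded events are hypothetical. The only behaviour modelled is the one described in the comment, namely that status updates arriving after the attempt has moved to a failed state are ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: none of these types are Tez classes.
class EventOrderingSketch {

    enum State { RUNNING, FAILED }

    /** Hypothetical attempt that records counters only while it is RUNNING. */
    static final class Attempt {
        State state = State.RUNNING;
        final List<String> recordedCounters = new ArrayList<>();

        void handle(String event) {
            if (event.startsWith("StatusUpdate:")) {
                if (state == State.RUNNING) {
                    recordedCounters.add(event.substring("StatusUpdate:".length()));
                }
                // In FAILED state the status update is silently ignored.
            } else if (event.equals("TaskAttemptFailed")) {
                state = State.FAILED;
            }
        }
    }

    /** Feeds the events to a fresh attempt and returns the counters it kept. */
    static List<String> countersAfter(List<String> events) {
        Attempt attempt = new Attempt();
        for (String event : events) {
            attempt.handle(event);
        }
        return attempt.recordedCounters;
    }

    public static void main(String[] args) {
        List<String> statusFirst = Arrays.asList("StatusUpdate:records=42", "TaskAttemptFailed");
        List<String> failedFirst = Arrays.asList("TaskAttemptFailed", "StatusUpdate:records=42");

        System.out.println("status first -> " + countersAfter(statusFirst));
        System.out.println("failed first -> " + countersAfter(failedFirst));
    }
}
```

With the status update first, the counters are kept; with the failure event first, the same update is dropped, which is the loss of counters for failed tasks that the comment warns about.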
[jira] [Commented] (TEZ-2781) Fallback to send only TaskAttemptFailedEvent if taskFailed heartbeat fails
[ https://issues.apache.org/jira/browse/TEZ-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14936017#comment-14936017 ]

Hitesh Shah commented on TEZ-2781:
----------------------------------

[~sseth] Mind taking a look? This will help improve diagnostics where counters exceed limits.
[jira] [Commented] (TEZ-2781) Fallback to send only TaskAttemptFailedEvent if taskFailed heartbeat fails
[ https://issues.apache.org/jira/browse/TEZ-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14732226#comment-14732226 ]

TezQA commented on TEZ-2781:
----------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12754365/TEZ-2781-2.patch
against master revision e0523eb.

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.

{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 javadoc{color}. There were no new javadoc warning messages.

{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 3.0.1) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:red}-1 core tests{color}. The patch failed these unit tests in :
org.apache.tez.dag.app.TestSpeculation

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/1082//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/1082//console

This message is automatically generated.
[jira] [Commented] (TEZ-2781) Fallback to send only TaskAttemptFailedEvent if taskFailed heartbeat fails
[ https://issues.apache.org/jira/browse/TEZ-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14732187#comment-14732187 ]

TezQA commented on TEZ-2781:
----------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12754359/TEZ-2781-1.patch
against master revision e0523eb.

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.

{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 javadoc{color}. There were no new javadoc warning messages.

{color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 3.0.1) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:green}+1 core tests{color}. The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/1081//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-TEZ-Build/1081//artifact/patchprocess/newPatchFindbugsWarningstez-runtime-internals.html
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/1081//console

This message is automatically generated.