[jira] [Commented] (TEZ-2781) Fallback to send only TaskAttemptFailedEvent if taskFailed heartbeat fails

2015-10-09 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14950102#comment-14950102
 ] 

Jeff Zhang commented on TEZ-2781:
-

Committed to 0.5/0.6/0.7/master

> Fallback to send only TaskAttemptFailedEvent if taskFailed heartbeat fails
> --
>
> Key: TEZ-2781
> URL: https://issues.apache.org/jira/browse/TEZ-2781
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.5.4
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
> Attachments: TEZ-2781-1.patch, TEZ-2781-2.patch, TEZ-2781-3.patch
>
>
> It is possible for the taskFailed heartbeat to fail to reach the AM (for 
> example, because the counter limit is exceeded). In that case the client 
> cannot get the right diagnostic info. 
> {code}
> hive> select gencounter(2500) from (select count(*) from abc) a;
> Query ID = hrt_qa_2015083122_1956a7d6-1d41-406b-9266-af56ed21883c
> Total jobs = 1
> Launching Job 1 out of 1
> Status: Running (Executing on YARN cluster with App id 
> application_1440915851419_0007)
> 
> VERTICES   STATUS     TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
> 
> Map 1      SUCCEEDED      0          0        0        0       0       0
> Reducer 2  FAILED         1          0        0        1       4       0
> 
> VERTICES: 01/02  [>>--] 0%  ELAPSED TIME: 25.44 s
> 
> Status: Failed
> Vertex failed, vertexName=Reducer 2, vertexId=vertex_1440915851419_0007_2_01, 
> diagnostics=[Task failed, taskId=task_1440915851419_0007_2_01_00, 
> diagnostics=[TaskAttempt 0 failed, info=[Container 
> container_e02_1440915851419_0007_01_02 finished with diagnostics set to 
> [Container failed. ]], TaskAttempt 1 failed, info=[Container 
> container_e02_1440915851419_0007_01_03 finished with diagnostics set to 
> [Container failed. ]], TaskAttempt 2 failed, info=[Container 
> container_e02_1440915851419_0007_01_04 finished with diagnostics set to 
> [Container failed. ]], TaskAttempt 3 failed, info=[Container 
> container_e02_1440915851419_0007_01_05 finished with diagnostics set to 
> [Container failed. ]]], Vertex failed as one or more tasks failed. 
> failedTasks:1, Vertex vertex_1440915851419_0007_2_01 [Reducer 2] 
> killed/failed due to:null]
> DAG failed due to vertex failure. failedVertices:1 killedVertices:0
> FAILED: Execution Error, return code 2 from 
> org.apache.hadoop.hive.ql.exec.tez.TezTask
> {code}
> {code}
> 2015-08-31 22:00:27,528 WARN [TezChild] task.TezTaskRunner: Heartbeat failure caused by communication failure
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RpcServerException): IPC server unable to read call parameters: Too many counters: 2001 max=2000
>     at org.apache.hadoop.ipc.Client.call(Client.java:1469)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1400)
>     at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:244)
>     at com.sun.proxy.$Proxy9.heartbeat(Unknown Source)
>     at org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.heartbeat(TaskReporter.java:249)
>     at org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.taskFailed(TaskReporter.java:344)
>     at org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.access$300(TaskReporter.java:119)
>     at org.apache.tez.runtime.task.TaskReporter.taskFailed(TaskReporter.java:381)
>     at org.apache.tez.runtime.task.TezTaskRunner.sendFailure(TezTaskRunner.java:257)
>     at org.apache.tez.runtime.task.TezTaskRunner.access$600(TezTaskRunner.java:51)
>     at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:224)
>     at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:168)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
>     at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:168)
>     at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:163)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.la

[jira] [Commented] (TEZ-2781) Fallback to send only TaskAttemptFailedEvent if taskFailed heartbeat fails

2015-09-30 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14939162#comment-14939162
 ] 

Siddharth Seth commented on TEZ-2781:
-

bq. It should be OK as long as TaskCommunicatorManager handles the event ordering 
correctly.
It works for now, but it could break in the future if additional 
task events are handled directly instead of being sent to the Vertex. If you are 
not changing this, could you please add a comment in TaskCommunicatorManager 
indicating the importance of the ordering of events?

bq. Catching LimitExceededException would involve lots of code changes. 
Not sure I understand this, but looking at this again, catch Exception seems 
better, to ensure the failure message is sent to the task.

Rest looks good. +1
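
For readers following the thread, the fallback under discussion can be sketched roughly as below. This is an illustration only and is not taken from the attached patches: `FailureReportSketch`, `Heartbeat`, and the string event names are hypothetical stand-ins, not the real Tez classes. It assumes the failure is first reported together with a status update (which carries counters), and that any exception from that attempt triggers a second heartbeat carrying only the failure event.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins; none of these types are the real Tez classes.
class FailureReportSketch {

    /** Hypothetical transport to the AM; the real umbilical interface is different. */
    interface Heartbeat {
        void send(List<String> events);
    }

    /**
     * Tries to report the failure together with a final status update (which
     * carries counters). If that attempt fails for any reason, retries with only
     * the failure event so the diagnostics can still reach the AM.
     * Returns the list of events that was actually delivered.
     */
    static List<String> reportFailure(Heartbeat heartbeat, String diagnostics) {
        String failed = "TaskAttemptFailedEvent[" + diagnostics + "]";
        // Status update first, failure event second (the ordering discussed in this thread).
        List<String> full = Arrays.asList("StatusUpdate[counters]", failed);
        try {
            heartbeat.send(full);
            return full;
        } catch (Exception e) {
            // Broad catch, as preferred in the review: whatever went wrong,
            // the failure message should still be sent.
            List<String> fallback = Arrays.asList(failed);
            heartbeat.send(fallback);
            return fallback;
        }
    }

    public static void main(String[] args) {
        List<List<String>> received = new ArrayList<>();
        // Simulated AM side that rejects any heartbeat carrying a status update.
        Heartbeat rejectsCounters = events -> {
            if (events.size() > 1) {
                throw new RuntimeException("Too many counters: 2001 max=2000");
            }
            received.add(events);
        };
        System.out.println(reportFailure(rejectsCounters, "processor failed"));
    }
}
```

Catching the broad `Exception` here keeps the fallback independent of which layer rejected the counters, at the cost of also retrying on unrelated errors.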

[jira] [Commented] (TEZ-2781) Fallback to send only TaskAttemptFailedEvent if taskFailed heartbeat fails

2015-09-30 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14936726#comment-14936726
 ] 

TezQA commented on TEZ-2781:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12764398/TEZ-2781-3.patch
  against master revision b153035.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 3.0.1) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The following test timeouts occurred in :
 org.apache.tez.dag.history.logging.ats.TestATSHistoryWithMiniCluster

Test results: 
https://builds.apache.org/job/PreCommit-TEZ-Build/1187//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/1187//console

This message is automatically generated.

[jira] [Commented] (TEZ-2781) Fallback to send only TaskAttemptFailedEvent if taskFailed heartbeat fails

2015-09-30 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14936523#comment-14936523
 ] 

Jeff Zhang commented on TEZ-2781:
-

Minor update on the patch (Use TezConfiguration.TEZ_COUNTERS_MAX_DEFAULT 
instead of 2000 in test)

bq. The ordering of the events should be StatusUpdate followed by 
TaskAttemptFailedEvent. TaskAttemptFailedEvent will cause the attempt to move 
into a Failed state, at which point status updates are ignored.
It should be OK as long as TaskCommunicatorManager handles the event ordering 
correctly. 

bq. Instead of catch (Exception) - is it possible to catch 
(LimitExceededException) - will this harm anything ?
Catching LimitExceededException would involve a lot of code changes. I think we 
should do that in the long term. Actually, this ticket only partially resolves 
the counter limit issue. The limit can still be exceeded on the AM side; I will 
handle that in another ticket. 

bq.  in the absence of this patch, what behaviour are you seeing ? The 
processor reports failure because of the LimitExceededException. The 
TaskReporter then fails while trying to report the error to the AM - and the AM 
waits for the timeout to kill the task ?
TezChild exits due to the LimitExceededException, and the task fails because the 
container exits, so the diagnostics of the task cannot be retrieved. 
{noformat}
2015-09-30 15:39:27,084 [ERROR] [main] |yarn.YarnUncaughtExceptionHandler|: Thread Thread[main,5,main] threw an Exception.
org.apache.tez.common.counters.LimitExceededException: Too many counters: 1201 max=1200
    at org.apache.tez.common.counters.Limits.checkCounters(Limits.java:88)
    at org.apache.tez.common.counters.Limits.incrCounters(Limits.java:95)
    at org.apache.tez.common.counters.AbstractCounterGroup.addCounter(AbstractCounterGroup.java:76)
    at org.apache.tez.common.counters.AbstractCounterGroup.addCounterImpl(AbstractCounterGroup.java:93)
    at org.apache.tez.common.counters.AbstractCounterGroup.findCounter(AbstractCounterGroup.java:104)
    at org.apache.tez.common.counters.AbstractCounterGroup.incrAllCounters(AbstractCounterGroup.java:199)
    at org.apache.tez.common.counters.AbstractCounters.incrAllCounters(AbstractCounters.java:363)
    at org.apache.tez.runtime.RuntimeTask.getCounters(RuntimeTask.java:127)
    at org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.getStatusUpdateEvent(TaskReporter.java:354)
    at org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.taskFailed(TaskReporter.java:377)
    at org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.access$300(TaskReporter.java:128)
    at org.apache.tez.runtime.task.TaskReporter.taskFailed(TaskReporter.java:428)
    at org.apache.tez.runtime.task.TezTaskRunner2.run(TezTaskRunner2.java:205)
    at org.apache.tez.runtime.task.TezChild.run(TezChild.java:256)
    at org.apache.tez.runtime.task.TezChild.main(TezChild.java:488)
{noformat}
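
As a rough illustration of why the test should derive its counter count from the configured limit rather than a hard-coded 2000, here is a minimal sketch of a limit check. Everything in it is hypothetical (`CounterLimitSketch`, `Counters`, `LimitExceeded`); it only mimics the message format seen in the trace above and is not the Tez `Limits` implementation. The placeholder limit in `main` stands in for whatever default the configuration defines.

```java
// Hypothetical sketch; these classes only imitate the behaviour seen in the trace.
class CounterLimitSketch {

    /** Stand-in for the exception type named in the stack trace. */
    static class LimitExceeded extends RuntimeException {
        LimitExceeded(String message) {
            super(message);
        }
    }

    /** Minimal counter registry that enforces a maximum number of counters. */
    static class Counters {
        private final int max;
        private int count = 0;

        Counters(int max) {
            this.max = max;
        }

        void addCounter(String name) {
            int next = count + 1;
            if (next > max) {
                throw new LimitExceeded("Too many counters: " + next + " max=" + max);
            }
            count = next;
        }
    }

    /** Registers limit + 1 counters; returns the failure message, or null if none. */
    static String overflow(int limit) {
        Counters counters = new Counters(limit);
        try {
            for (int i = 0; i <= limit; i++) {
                counters.addCounter("counter_" + i);
            }
            return null;
        } catch (LimitExceeded e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        int limit = 1200; // placeholder; a real test would read the configured default
        System.out.println(overflow(limit));
    }
}
```

Because the number of generated counters is computed as `limit + 1`, the sketch keeps triggering the overflow even if the limit value changes.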

[jira] [Commented] (TEZ-2781) Fallback to send only TaskAttemptFailedEvent if taskFailed heartbeat fails

2015-09-29 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14936404#comment-14936404
 ] 

Siddharth Seth commented on TEZ-2781:
-

Couple of things.
- The ordering of the events should be StatusUpdate followed by 
TaskAttemptFailedEvent. TaskAttemptFailedEvent will cause the attempt to move 
into a Failed state, at which point status updates are ignored. We'll end up 
dropping counters for failed tasks. (It looks like there are no tests which 
cover this - will create a separate jira to create such a test).
- Instead of catch (Exception) - is it possible to catch 
(LimitExceededException) - will this harm anything ?
- The test would be more robust if the number of counters generated were to be 
based on values from TezConfiguration, instead of 2000.

[~zjffdu] - in the absence of this patch, what behaviour are you seeing ? The 
processor reports failure because of the LimitExceededException. The 
TaskReporter then fails while trying to report the error to the AM - and the AM 
waits for the timeout to kill the task ?
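
The ordering concern in the first point can be illustrated with a toy state machine. This is a sketch under simplifying assumptions, not the real Tez task attempt state machine: `AttemptOrderingSketch`, `Attempt`, and the string event names are hypothetical. It assumes only that an attempt in the failed state ignores all later events.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, much simplified model of a task attempt; not the real Tez classes.
class AttemptOrderingSketch {

    enum State { RUNNING, FAILED }

    static class Attempt {
        State state = State.RUNNING;
        boolean countersRecorded = false;

        void handle(String event) {
            if (state == State.FAILED) {
                return; // events arriving after the terminal state are ignored
            }
            if (event.equals("STATUS_UPDATE")) {
                countersRecorded = true; // counters arrive with the status update
            } else if (event.equals("ATTEMPT_FAILED")) {
                state = State.FAILED;
            }
        }
    }

    /** Feeds the events in order and reports whether the counters were kept. */
    static boolean countersSurvive(List<String> events) {
        Attempt attempt = new Attempt();
        for (String event : events) {
            attempt.handle(event);
        }
        return attempt.countersRecorded;
    }

    public static void main(String[] args) {
        System.out.println(countersSurvive(Arrays.asList("STATUS_UPDATE", "ATTEMPT_FAILED")));
        System.out.println(countersSurvive(Arrays.asList("ATTEMPT_FAILED", "STATUS_UPDATE")));
    }
}
```

In this model the counters are kept only when the status update is handled before the failure event, which is the property the review asks the patch to preserve.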

[jira] [Commented] (TEZ-2781) Fallback to send only TaskAttemptFailedEvent if taskFailed heartbeat fails

2015-09-29 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14936017#comment-14936017
 ] 

Hitesh Shah commented on TEZ-2781:
--

[~sseth] Mind taking a look? This will help improve diagnostics where counters 
exceed limits. 

[jira] [Commented] (TEZ-2781) Fallback to send only TaskAttemptFailedEvent if taskFailed heartbeat fails

2015-09-05 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14732226#comment-14732226
 ] 

TezQA commented on TEZ-2781:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12754365/TEZ-2781-2.patch
  against master revision e0523eb.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 3.0.1) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in :
   org.apache.tez.dag.app.TestSpeculation

Test results: 
https://builds.apache.org/job/PreCommit-TEZ-Build/1082//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/1082//console

This message is automatically generated.

[jira] [Commented] (TEZ-2781) Fallback to send only TaskAttemptFailedEvent if taskFailed heartbeat fails

2015-09-05 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14732187#comment-14732187
 ] 

TezQA commented on TEZ-2781:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12754359/TEZ-2781-1.patch
  against master revision e0523eb.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 3.0.1) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-TEZ-Build/1081//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-TEZ-Build/1081//artifact/patchprocess/newPatchFindbugsWarningstez-runtime-internals.html
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/1081//console

This message is automatically generated.
