[jira] [Commented] (TEZ-4543) Throw a special exception to DagClient when there is no current DAG

2024-05-03 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/TEZ-4543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17843166#comment-17843166
 ] 

Ayush Saxena commented on TEZ-4543:
---

This is leading to some test failures:
TestAMRecovery, TestDAGRecovery, TestRecovery
ref: https://ci-hadoop.apache.org/job/Tez-qbt-0.10-Build/183/testReport/

I have created TEZ-4559 to track this; the change may be breaking the recovery code.

> Throw a special exception to DagClient when there is no current DAG
> ---
>
> Key: TEZ-4543
> URL: https://issues.apache.org/jira/browse/TEZ-4543
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: László Bodor
> Assignee: László Bodor
> Priority: Major
> Fix For: 0.10.4
>
> Time Spent: 1.5h
> Remaining Estimate: 0h
>
> Given the following scenario:
> 1. A DAG is assigned to an AM.
> 2. The AM is killed (e.g. OOMKilled by k8s); HS2 keeps asking for the status 
> and hits network errors:
> {code}
> hiveserver2 <14>1 2024-02-26T15:59:56.538Z hiveserver2-0 hiveserver2 1 
> dedef3f4-339f-4ba3-a6ae-300751d3561d [mdc@18060 class="client.DAGClientImpl" 
> dagId="dag_1708961199044_0003_1" level="INFO" operationLogLevel="EXECUTION" 
> queryId="hive_20240226155836_6b1e9eb9-efd7-42fd-8872-f4189c5dda3a" 
> sessionId="9e4cb344-ad7f-4344-9b24-aedaf0e73bf4" 
> thread="HiveServer2-Background-Pool: Thread-129"] Cannot retrieve DAG Status 
> due to IOException: DestHost:destPort 
> query-coordinator-0-0.query-coordinator-0-service.compute-1708603165-qlg5.svc.cluster.local:2
>  , LocalHost:localPort hiveserver2-0/100.100.83.80:0. Failed on local 
> exception: java.io.IOException: java.io.IOException: Connection reset by peer
> {code}
> At this point, HS2 cannot tell whether the AM is lost forever or there is a 
> recoverable, intermittent network issue.
> 3. The AM restarts quite quickly and the DagClient in HS2 tries to fetch the DAG 
> status (getDAGStatus call) from the restarted coordinator. HS2 cannot even 
> tell that it is talking to a new AM, and keeps asking for the DAG status.
> 4. In the AM, the exception below keeps being thrown, and it is not handled by 
> the DagClient:
> {code}
>  <14>1 2024-02-05T02:06:58.065Z query-coordinator-0-4 query-coordinator 1 
> 10757dcc-1e4c-4dd2-ba76-8a2411ab1bdf [mdc@18060 class="ipc.Server" 
> level="INFO" thread="IPC Server handler 0 on 2"] IPC Server handler 0 on 
> 2, call Call#15312255 Retry#0 
> org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPB.getDAGStatus 
> from 127.0.0.6:56221
> org.apache.tez.dag.api.TezException: No running dag at present
> at org.apache.tez.dag.api.client.DAGClientHandler.getDAG(DAGClientHandler.java:99)
> at org.apache.tez.dag.api.client.DAGClientHandler.getACLManager(DAGClientHandler.java:181)
> at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.getDAGStatus(DAGClientAMProtocolBlockingPBServerImpl.java:102)
> at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolRPC$DAGClientAMProtocol$2.callBlockingMethod(DAGClientAMProtocolRPC.java:8513)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:989)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:917)
> at java.base/java.security.AccessController.doPrivileged(Native Method)
> at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2894)
> {code}
> The AM should be able to return a specialized exception that the client can 
> handle.
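
The request in the description above can be illustrated with a minimal sketch. All names here (`AmException`, `NoCurrentDagException`, `StatusRpc`, `poll`, `Outcome`) are hypothetical stand-ins chosen for illustration and are not taken from the Tez codebase or from the committed patch. The sketch only shows why a dedicated exception type helps: the polling client can treat "the AM is reachable but has no current DAG" as a terminal signal, while still retrying on network errors and other failures.

```java
import java.io.IOException;

/** Hypothetical stand-in for a generic AM-side failure (not the real TezException). */
class AmException extends Exception {
    AmException(String message) { super(message); }
}

/** Hypothetical specialized exception: the AM answered, but it has no current DAG. */
class NoCurrentDagException extends AmException {
    NoCurrentDagException(String message) { super(message); }
}

/** Hypothetical, minimal view of the status RPC used by the polling client. */
interface StatusRpc {
    String getDagStatus(String dagId) throws IOException, AmException;
}

final class DagStatusPollingSketch {
    enum Outcome { STATUS_RECEIVED, DAG_LOST, GAVE_UP }

    /** Polls up to maxAttempts times; stops early once the AM reports no current DAG. */
    static Outcome poll(StatusRpc rpc, String dagId, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                rpc.getDagStatus(dagId);
                return Outcome.STATUS_RECEIVED;
            } catch (NoCurrentDagException e) {
                // The AM is up but does not know the DAG (e.g. it was restarted):
                // retrying the same call cannot succeed, so stop polling.
                return Outcome.DAG_LOST;
            } catch (IOException | AmException e) {
                // Possibly transient (network error or another AM-side failure): retry.
            }
        }
        return Outcome.GAVE_UP;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated scenario: two connection resets while the AM is down,
        // then the restarted AM reports that it has no running DAG.
        StatusRpc restartedAm = id -> {
            if (++calls[0] <= 2) {
                throw new IOException("Connection reset by peer");
            }
            throw new NoCurrentDagException("No running dag at present");
        };
        System.out.println(poll(restartedAm, "dag_1708961199044_0003_1", 10)); // prints DAG_LOST
    }
}
```

Without the subtype, the second `catch` would be the only one available and the client would keep retrying until it ran out of attempts, which mirrors the behaviour described in steps 3 and 4.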



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TEZ-4543) Throw a special exception to DagClient when there is no current DAG

2024-02-27 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/TEZ-4543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17821228#comment-17821228
 ] 

Ayush Saxena commented on TEZ-4543:
---

Committed to master.

Thanks [~abstractdog] for the contribution!



