[ https://issues.apache.org/jira/browse/TEZ-4543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

László Bodor updated TEZ-4543:
------------------------------
    Description: 
Given the following scenario:

1. A DAG is assigned to an AM.
2. The AM is killed (e.g. OOMKilled by k8s). HS2 keeps polling the DAG status 
and runs into network errors:
{code}
hiveserver2 <14>1 2024-02-26T15:59:56.538Z hiveserver2-0 hiveserver2 1 dedef3f4-339f-4ba3-a6ae-300751d3561d [mdc@18060 class="client.DAGClientImpl" dagId="dag_1708961199044_0003_1" level="INFO" operationLogLevel="EXECUTION" queryId="hive_20240226155836_6b1e9eb9-efd7-42fd-8872-f4189c5dda3a" sessionId="9e4cb344-ad7f-4344-9b24-aedaf0e73bf4" thread="HiveServer2-Background-Pool: Thread-129"] Cannot retrieve DAG Status due to IOException: DestHost:destPort query-coordinator-0-0.query-coordinator-0-service.compute-1708603165-qlg5.svc.cluster.local:22222 , LocalHost:localPort hiveserver2-0/100.100.83.80:0. Failed on local exception: java.io.IOException: java.io.IOException: Connection reset by peer
{code}
At this point, HS2 cannot tell whether the AM is lost for good or whether this 
is a recoverable, intermittent network issue.

3. The AM restarts quite quickly, and the DagClient in HS2 tries to fetch the 
DAG status (getDAGStatus call) from the restarted coordinator. HS2 cannot even 
tell that it is now talking to a new AM, so it keeps asking for the DAG status.
4. In the AM, the exception below is thrown repeatedly, and the DagClient does 
not handle it:

{code}
<14>1 2024-02-05T02:06:58.065Z query-coordinator-0-4 query-coordinator 1 10757dcc-1e4c-4dd2-ba76-8a2411ab1bdf [mdc@18060 class="ipc.Server" level="INFO" thread="IPC Server handler 0 on 22222"] IPC Server handler 0 on 22222, call Call#15312255 Retry#0 org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPB.getDAGStatus from 127.0.0.6:56221
org.apache.tez.dag.api.TezException: No running dag at present
    at org.apache.tez.dag.api.client.DAGClientHandler.getDAG(DAGClientHandler.java:99)
    at org.apache.tez.dag.api.client.DAGClientHandler.getACLManager(DAGClientHandler.java:181)
    at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.getDAGStatus(DAGClientAMProtocolBlockingPBServerImpl.java:102)
    at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolRPC$DAGClientAMProtocol$2.callBlockingMethod(DAGClientAMProtocolRPC.java:8513)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:989)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:917)
    at java.base/java.security.AccessController.doPrivileged(Native Method)
    at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2894)
{code}
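
The improvement named in the issue title could look roughly like the sketch below. It is only an illustration under stated assumptions, not the actual patch: `NoCurrentDagException`, `AmHandler`, `Outcome` and `pollOnce` are hypothetical names, the local `TezException` class is a stand-in for `org.apache.tez.dag.api.TezException` so that the example compiles without Tez on the classpath, and RPC transport is left out entirely.

```java
import java.util.Optional;

public class NoCurrentDagSketch {

    /** Local stand-in for org.apache.tez.dag.api.TezException (assumption: keeps the sketch self-contained). */
    static class TezException extends Exception {
        TezException(String message) { super(message); }
    }

    /** Hypothetical dedicated subtype: the AM is reachable but has no DAG running. */
    static class NoCurrentDagException extends TezException {
        NoCurrentDagException(String dagId) {
            super("No running DAG at present, requested: " + dagId);
        }
    }

    /** Hypothetical AM-side handler; currentDagId is empty right after an AM restart. */
    static class AmHandler {
        private final Optional<String> currentDagId;

        AmHandler(Optional<String> currentDagId) { this.currentDagId = currentDagId; }

        String getDagStatus(String dagId) throws TezException {
            if (!currentDagId.isPresent()) {
                // Distinct type instead of a generic TezException with a message string.
                throw new NoCurrentDagException(dagId);
            }
            if (!currentDagId.get().equals(dagId)) {
                throw new TezException("Unknown DAG: " + dagId);
            }
            return "RUNNING";
        }
    }

    /** What the client decides after one status poll. */
    enum Outcome { STATUS_OK, DAG_LOST, OTHER_ERROR }

    /** Client-side polling step: the dedicated exception type lets the caller stop polling. */
    static Outcome pollOnce(AmHandler am, String dagId) {
        try {
            am.getDagStatus(dagId);
            return Outcome.STATUS_OK;
        } catch (NoCurrentDagException e) {
            // The AM answered, but the DAG is gone: give up instead of retrying forever.
            return Outcome.DAG_LOST;
        } catch (TezException e) {
            return Outcome.OTHER_ERROR;
        }
    }

    public static void main(String[] args) {
        String dagId = "dag_1708961199044_0003_1";
        // AM still running the DAG.
        System.out.println(pollOnce(new AmHandler(Optional.of(dagId)), dagId));
        // Freshly restarted AM with no current DAG.
        System.out.println(pollOnce(new AmHandler(Optional.empty()), dagId));
    }
}
```

With a dedicated type, the client can tell "the AM is up but no longer knows this DAG" apart from the connection errors in step 2, which may still be worth retrying. In a real setup the exception would also have to survive the Hadoop RPC layer, which this sketch does not model.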

  was:
given the following scenario:

1. DAG is assigned to an AM
2. AM is killed (e.g. OOMKilled by k8s), HS2 retries DAG
3. AM restarts quite quickly and the DagClient in HS2 tries to fetch the DAG 
status (getDagStatus call) from the restarted coordinator
(most probably because of the host match), HS2 isn't even able to realize it 
was talking to a new AM, keep asking for DAG status
4. in coordinator, the below exception is kept thrown and it's not handled by 
the DagClient

{code}
<14>1 2024-02-05T02:06:58.065Z query-coordinator-0-4 query-coordinator 1 10757dcc-1e4c-4dd2-ba76-8a2411ab1bdf [mdc@18060 class="ipc.Server" level="INFO" thread="IPC Server handler 0 on 22222"] IPC Server handler 0 on 22222, call Call#15312255 Retry#0 org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPB.getDAGStatus from 127.0.0.6:56221
org.apache.tez.dag.api.TezException: No running dag at present
    at org.apache.tez.dag.api.client.DAGClientHandler.getDAG(DAGClientHandler.java:99)
    at org.apache.tez.dag.api.client.DAGClientHandler.getACLManager(DAGClientHandler.java:181)
    at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.getDAGStatus(DAGClientAMProtocolBlockingPBServerImpl.java:102)
    at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolRPC$DAGClientAMProtocol$2.callBlockingMethod(DAGClientAMProtocolRPC.java:8513)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:989)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:917)
    at java.base/java.security.AccessController.doPrivileged(Native Method)
    at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2894)
{code}


> Throw a special exception to DagClient when there is no current DAG
> -------------------------------------------------------------------
>
>                 Key: TEZ-4543
>                 URL: https://issues.apache.org/jira/browse/TEZ-4543
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>             Fix For: 0.10.4



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
