[ https://issues.apache.org/jira/browse/TEZ-4543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

László Bodor updated TEZ-4543:
------------------------------
    Description: 
Given the following scenario:

1. A DAG is assigned to an AM.
2. The AM is killed (e.g. OOMKilled by k8s), and HS2 retries the DAG.
3. The AM restarts quite quickly, and the DagClient in HS2 tries to fetch the
DAG status (getDagStatus call) from the restarted coordinator (most probably
because the host matches). HS2 cannot even tell that it is talking to a new AM,
so it keeps asking for the DAG status.
4. In the coordinator, the exception below is thrown repeatedly, and it is not
handled by the DagClient:

{code}
<14>1 2024-02-05T02:06:58.065Z query-coordinator-0-4 query-coordinator 1 10757dcc-1e4c-4dd2-ba76-8a2411ab1bdf [mdc@18060 class="ipc.Server" level="INFO" thread="IPC Server handler 0 on 22222"] IPC Server handler 0 on 22222, call Call#15312255 Retry#0 org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPB.getDAGStatus from 127.0.0.6:56221
org.apache.tez.dag.api.TezException: No running dag at present
    at org.apache.tez.dag.api.client.DAGClientHandler.getDAG(DAGClientHandler.java:99)
    at org.apache.tez.dag.api.client.DAGClientHandler.getACLManager(DAGClientHandler.java:181)
    at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.getDAGStatus(DAGClientAMProtocolBlockingPBServerImpl.java:102)
    at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolRPC$DAGClientAMProtocol$2.callBlockingMethod(DAGClientAMProtocolRPC.java:8513)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:989)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:917)
    at java.base/java.security.AccessController.doPrivileged(Native Method)
    at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2894)
{code}
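
The idea in the ticket title could look roughly like the sketch below. It is only an illustration under stated assumptions, not the actual patch: NoCurrentDagException, DagStatusSource, RestartedAm, the "DAG_LOST" state and pollStatus are all hypothetical names that do not come from the Tez codebase. The point is that a dedicated exception type lets the polling client stop as soon as the AM reports that it has no current DAG, instead of retrying a generic TezException forever.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical marker exception (not an existing Tez class): lets the client
// tell "this AM has no DAG" apart from other failures.
class NoCurrentDagException extends Exception {
    NoCurrentDagException(String message) {
        super(message);
    }
}

// Hypothetical stand-in for the AM-side status endpoint.
interface DagStatusSource {
    String getDagStatus(String dagId) throws NoCurrentDagException;
}

// Simulates a restarted AM that knows nothing about the old DAG.
class RestartedAm implements DagStatusSource {
    final AtomicInteger calls = new AtomicInteger();

    @Override
    public String getDagStatus(String dagId) throws NoCurrentDagException {
        calls.incrementAndGet();
        throw new NoCurrentDagException("No running dag at present");
    }
}

class DagStatusPollingSketch {
    // Polls while the DAG reports RUNNING, up to maxAttempts calls.
    // Gives up immediately when the AM reports that it has no current DAG.
    static String pollStatus(DagStatusSource am, String dagId, int maxAttempts) {
        String status = "UNKNOWN";
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                status = am.getDagStatus(dagId);
            } catch (NoCurrentDagException e) {
                return "DAG_LOST"; // hypothetical client-side state
            }
            if (!"RUNNING".equals(status)) {
                break;
            }
        }
        return status;
    }

    public static void main(String[] args) {
        RestartedAm am = new RestartedAm();
        String result = pollStatus(am, "dag_1", 10);
        System.out.println(result + " after " + am.calls.get() + " call(s)");
    }
}
```

Against the simulated restarted AM, the loop ends after a single call rather than exhausting all ten attempts, which is the behaviour HS2 would need in order to notice that the DAG is gone.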

> Throw a special exception to DagClient when there is no current DAG
> -------------------------------------------------------------------
>
>                 Key: TEZ-4543
>                 URL: https://issues.apache.org/jira/browse/TEZ-4543
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
