[ https://issues.apache.org/jira/browse/TEZ-4543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
László Bodor updated TEZ-4543:
------------------------------
Fix Version/s: 0.10.4
> Throw a special exception to DagClient when there is no current DAG
> -------------------------------------------------------------------
>
> Key: TEZ-4543
> URL: https://issues.apache.org/jira/browse/TEZ-4543
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: László Bodor
> Assignee: László Bodor
> Priority: Major
> Fix For: 0.10.4
>
>
> Given the following scenario:
> 1. A DAG is assigned to an AM.
> 2. The AM is killed (e.g. OOMKilled by k8s), and HS2 retries the DAG.
> 3. The AM restarts quite quickly, and the DagClient in HS2 tries to fetch the
> DAG status (getDagStatus call) from the restarted coordinator (most probably
> because of the host match). HS2 isn't even able to realize that it is talking
> to a new AM, so it keeps asking for the DAG status.
> 4. In the coordinator, the exception below is thrown repeatedly, and it is not
> handled by the DagClient:
> {code}
> <14>1 2024-02-05T02:06:58.065Z query-coordinator-0-4 query-coordinator 1 10757dcc-1e4c-4dd2-ba76-8a2411ab1bdf [mdc@18060 class="ipc.Server" level="INFO" thread="IPC Server handler 0 on 22222"] IPC Server handler 0 on 22222, call Call#15312255 Retry#0 org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPB.getDAGStatus from 127.0.0.6:56221
> org.apache.tez.dag.api.TezException: No running dag at present
> at org.apache.tez.dag.api.client.DAGClientHandler.getDAG(DAGClientHandler.java:99)
> at org.apache.tez.dag.api.client.DAGClientHandler.getACLManager(DAGClientHandler.java:181)
> at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.getDAGStatus(DAGClientAMProtocolBlockingPBServerImpl.java:102)
> at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolRPC$DAGClientAMProtocol$2.callBlockingMethod(DAGClientAMProtocolRPC.java:8513)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:989)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:917)
> at java.base/java.security.AccessController.doPrivileged(Native Method)
> at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2894)
> {code}
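The scenario above could be sketched roughly as follows. This is only an illustration of the idea in the issue title, not the actual patch: it assumes the AM throws a dedicated exception subtype when it has no current DAG, so that the client can tell this case apart from other failures and stop polling. All names here (NoCurrentDagException, StatusSource, RestartedAm, pollStatus, and the "DAG_LOST" and "UNKNOWN" results) are hypothetical, and TezException is a local stand-in so that the sketch compiles without Tez on the classpath.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class NoCurrentDagSketch {

    /** Local stand-in for org.apache.tez.dag.api.TezException (assumption: keeps the sketch self-contained). */
    static class TezException extends Exception {
        TezException(String message) { super(message); }
    }

    /** Hypothetical dedicated subtype thrown when the AM has no current DAG. */
    static class NoCurrentDagException extends TezException {
        NoCurrentDagException(String message) { super(message); }
    }

    /** Hypothetical, minimal view of the AM-side status call. */
    interface StatusSource {
        String getDagStatus(String dagId) throws TezException;
    }

    /** Simulates a restarted AM: it knows no DAG, so every status request fails. */
    static class RestartedAm implements StatusSource {
        final AtomicInteger calls = new AtomicInteger();

        @Override
        public String getDagStatus(String dagId) throws TezException {
            calls.incrementAndGet();
            throw new NoCurrentDagException("No running dag at present");
        }
    }

    /**
     * Polls the AM up to maxAttempts times. A generic TezException is treated
     * as transient and retried; the dedicated subtype ends polling immediately.
     */
    static String pollStatus(StatusSource am, String dagId, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return am.getDagStatus(dagId);
            } catch (NoCurrentDagException e) {
                // The AM has no DAG at all (for example after a restart): give up.
                return "DAG_LOST";
            } catch (TezException e) {
                // Any other failure: try again on the next iteration.
            }
        }
        return "UNKNOWN";
    }

    public static void main(String[] args) {
        RestartedAm am = new RestartedAm();
        String result = pollStatus(am, "dag_1", 10);
        System.out.println(result + " after " + am.calls.get() + " call(s)");
    }
}
```

With only the generic exception, the loop in this sketch would use up all of its attempts against the restarted AM. With the dedicated subtype it stops after the first call.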
--
This message was sent by Atlassian Jira
(v8.20.10#820010)