[ https://issues.apache.org/jira/browse/TEZ-4543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
László Bodor updated TEZ-4543:
------------------------------
Description:
Given the following scenario:
1. A DAG is assigned to an AM.
2. The AM is killed (e.g. OOMKilled by k8s). HS2 keeps asking for the DAG
status and runs into network errors:
{code}
hiveserver2 <14>1 2024-02-26T15:59:56.538Z hiveserver2-0 hiveserver2 1 dedef3f4-339f-4ba3-a6ae-300751d3561d [mdc@18060 class="client.DAGClientImpl" dagId="dag_1708961199044_0003_1" level="INFO" operationLogLevel="EXECUTION" queryId="hive_20240226155836_6b1e9eb9-efd7-42fd-8872-f4189c5dda3a" sessionId="9e4cb344-ad7f-4344-9b24-aedaf0e73bf4" thread="HiveServer2-Background-Pool: Thread-129"] Cannot retrieve DAG Status due to IOException: DestHost:destPort query-coordinator-0-0.query-coordinator-0-service.compute-1708603165-qlg5.svc.cluster.local:22222 , LocalHost:localPort hiveserver2-0/100.100.83.80:0. Failed on local exception: java.io.IOException: java.io.IOException: Connection reset by peer
{code}
At this point, HS2 cannot tell whether the AM is lost for good or there is a
recoverable, intermittent network issue.
3. The AM restarts quickly and the DagClient in HS2 tries to fetch the DAG
status (getDAGStatus call) from the restarted coordinator. HS2 cannot tell that
it is now talking to a new AM, so it keeps asking for the DAG status.
4. In the AM, the exception below is thrown on every such call, and the
DagClient does not handle it:
{code}
<14>1 2024-02-05T02:06:58.065Z query-coordinator-0-4 query-coordinator 1 10757dcc-1e4c-4dd2-ba76-8a2411ab1bdf [mdc@18060 class="ipc.Server" level="INFO" thread="IPC Server handler 0 on 22222"] IPC Server handler 0 on 22222, call Call#15312255 Retry#0 org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPB.getDAGStatus from 127.0.0.6:56221
org.apache.tez.dag.api.TezException: No running dag at present
    at org.apache.tez.dag.api.client.DAGClientHandler.getDAG(DAGClientHandler.java:99)
    at org.apache.tez.dag.api.client.DAGClientHandler.getACLManager(DAGClientHandler.java:181)
    at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.getDAGStatus(DAGClientAMProtocolBlockingPBServerImpl.java:102)
    at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolRPC$DAGClientAMProtocol$2.callBlockingMethod(DAGClientAMProtocolRPC.java:8513)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:989)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:917)
    at java.base/java.security.AccessController.doPrivileged(Native Method)
    at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2894)
{code}
The AM should return a specialized exception for this case (no current DAG)
that the client can recognize and handle.
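A minimal sketch of the idea, not the actual patch: the AM side throws a
dedicated subclass of TezException when it has no running DAG, and the client
treats that subclass differently from a plain network error. Everything below
is an assumption made for illustration: the names NoCurrentDagException,
AmClient, PollResult and pollOnce are hypothetical, TezException is a local
stand-in so the example compiles on its own, and the Hadoop RPC layer (which
wraps server-side exceptions before they reach the client) is left out.

```java
import java.io.IOException;

public class NoCurrentDagSketch {

  // Local stand-in for org.apache.tez.dag.api.TezException (assumption: keeps the sketch self-contained).
  static class TezException extends Exception {
    TezException(String message) { super(message); }
  }

  // Hypothetical specialized exception: the AM is up but has no DAG running.
  static class NoCurrentDagException extends TezException {
    NoCurrentDagException(String dagId) {
      super("No running dag at present: " + dagId);
    }
  }

  // Hypothetical, simplified view of the status call the client makes to the AM.
  interface AmClient {
    String getDagStatus(String dagId) throws IOException, TezException;
  }

  enum PollResult { STATUS_RECEIVED, RETRY_LATER, DAG_LOST }

  // Simulates a freshly restarted AM that knows nothing about the old DAG.
  static class RestartedAm implements AmClient {
    @Override
    public String getDagStatus(String dagId) throws TezException {
      throw new NoCurrentDagException(dagId);
    }
  }

  // One polling step on the client side.
  static PollResult pollOnce(AmClient am, String dagId) {
    try {
      am.getDagStatus(dagId);
      return PollResult.STATUS_RECEIVED;
    } catch (NoCurrentDagException e) {
      // The AM answered, but the DAG is gone: stop polling and fail the query.
      return PollResult.DAG_LOST;
    } catch (IOException | TezException e) {
      // Network errors (and, in this sketch, any other TezException) may be
      // transient, so the caller can try again later.
      return PollResult.RETRY_LATER;
    }
  }

  public static void main(String[] args) {
    AmClient killedAm = dagId -> { throw new IOException("Connection reset by peer"); };
    AmClient healthyAm = dagId -> "RUNNING";
    String dagId = "dag_1708961199044_0003_1";

    System.out.println("killed AM:    " + pollOnce(killedAm, dagId));
    System.out.println("restarted AM: " + pollOnce(new RestartedAm(), dagId));
    System.out.println("healthy AM:   " + pollOnce(healthyAm, dagId));
  }
}
```

With a distinct exception type the client no longer has to parse the message
"No running dag at present" to tell scenario step 2 (AM unreachable) apart from
step 4 (AM reachable, DAG gone).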
> Throw a special exception to DagClient when there is no current DAG
> -------------------------------------------------------------------
>
> Key: TEZ-4543
> URL: https://issues.apache.org/jira/browse/TEZ-4543
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: László Bodor
> Assignee: László Bodor
> Priority: Major
> Fix For: 0.10.4