[jira] [Updated] (TEZ-4543) Throw a special exception to DagClient when there is no current DAG

Jira Mon, 26 Feb 2024 08:25:08 -0800


     [ 
https://issues.apache.org/jira/browse/TEZ-4543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


László Bodor updated TEZ-4543:
------------------------------
    Description: 
Given the following scenario:

1. A DAG is assigned to an AM.
2. The AM is killed (e.g. OOMKilled by k8s). HS2 keeps polling for the DAG 
status and hits network errors:
{code}
hiveserver2 <14>1 2024-02-26T15:59:56.538Z hiveserver2-0 hiveserver2 1 dedef3f4-339f-4ba3-a6ae-300751d3561d [mdc@18060 class="client.DAGClientImpl" dagId="dag_1708961199044_0003_1" level="INFO" operationLogLevel="EXECUTION" queryId="hive_20240226155836_6b1e9eb9-efd7-42fd-8872-f4189c5dda3a" sessionId="9e4cb344-ad7f-4344-9b24-aedaf0e73bf4" thread="HiveServer2-Background-Pool: Thread-129"] Cannot retrieve DAG Status due to IOException: DestHost:destPort query-coordinator-0-0.query-coordinator-0-service.compute-1708603165-qlg5.svc.cluster.local:22222 , LocalHost:localPort hiveserver2-0/100.100.83.80:0. Failed on local exception: java.io.IOException: java.io.IOException: Connection reset by peer
{code}
At this point, HS2 cannot tell whether the AM is lost for good or whether 
this is a recoverable, intermittent network issue.

3. The AM restarts quickly and the DagClient in HS2 tries to fetch the DAG 
status (getDAGStatus call) from the restarted coordinator. HS2 cannot tell 
that it is now talking to a new AM, so it keeps asking for the DAG status.
4. In the AM, the exception below is thrown on every such call, and the 
DagClient does not handle it:

{code}
<14>1 2024-02-05T02:06:58.065Z query-coordinator-0-4 query-coordinator 1 10757dcc-1e4c-4dd2-ba76-8a2411ab1bdf [mdc@18060 class="ipc.Server" level="INFO" thread="IPC Server handler 0 on 22222"] IPC Server handler 0 on 22222, call Call#15312255 Retry#0 org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPB.getDAGStatus from 127.0.0.6:56221
org.apache.tez.dag.api.TezException: No running dag at present
    at org.apache.tez.dag.api.client.DAGClientHandler.getDAG(DAGClientHandler.java:99)
    at org.apache.tez.dag.api.client.DAGClientHandler.getACLManager(DAGClientHandler.java:181)
    at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.getDAGStatus(DAGClientAMProtocolBlockingPBServerImpl.java:102)
    at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolRPC$DAGClientAMProtocol$2.callBlockingMethod(DAGClientAMProtocolRPC.java:8513)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:533)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:989)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:917)
    at java.base/java.security.AccessController.doPrivileged(Native Method)
    at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2894)
{code}


The AM should be able to return a specialized exception that the client can 
recognize and handle.
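
To illustrate the idea, here is a minimal, self-contained sketch of how a 
client could tell "the AM has no such DAG" apart from a transient network 
error. It is only an assumption about the shape of a fix, not the actual Tez 
change: the exception name NoCurrentDAGException, the StatusSource interface, 
the Outcome enum and the poll helper are all hypothetical, and TezException 
is redefined locally as a stand-in so the example compiles without Tez on the 
classpath.

```java
import java.io.IOException;

public class NoCurrentDagSketch {

    // Stand-in for org.apache.tez.dag.api.TezException, so the sketch compiles without Tez.
    static class TezException extends Exception {
        TezException(String message) { super(message); }
    }

    // Hypothetical specialized exception (name assumed, not taken from the issue).
    static class NoCurrentDAGException extends TezException {
        NoCurrentDAGException(String dagId) {
            super("No running DAG at present: " + dagId);
        }
    }

    // Hypothetical, minimal view of the AM status RPC used by the client.
    interface StatusSource {
        String getDAGStatus(String dagId) throws IOException, TezException;
    }

    enum Outcome { STATUS_RECEIVED, DAG_LOST, GAVE_UP }

    // Polls the AM at most maxAttempts times and classifies the result.
    static Outcome poll(StatusSource am, String dagId, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                am.getDAGStatus(dagId);
                return Outcome.STATUS_RECEIVED;
            } catch (NoCurrentDAGException e) {
                // The AM answered but does not know the DAG (e.g. it restarted):
                // retrying cannot help, so stop immediately.
                return Outcome.DAG_LOST;
            } catch (IOException | TezException e) {
                // Possibly transient (e.g. connection reset): try again.
            }
        }
        return Outcome.GAVE_UP;
    }

    public static void main(String[] args) {
        // Simulates a restarted AM that has no record of the DAG.
        StatusSource restartedAm = dagId -> { throw new NoCurrentDAGException(dagId); };
        System.out.println(poll(restartedAm, "dag_1708961199044_0003_1", 5));
    }
}
```

Running main prints DAG_LOST after a single call, whereas a source that only 
throws IOException would be retried until the attempt limit is reached. In 
this sketch the specialized exception subclasses the existing base exception, 
so callers that only catch the base type would keep working unchanged.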



> Throw a special exception to DagClient when there is no current DAG
> -------------------------------------------------------------------
>
>                 Key: TEZ-4543
>                 URL: https://issues.apache.org/jira/browse/TEZ-4543
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>             Fix For: 0.10.4
>
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
