dzcxzl created SPARK-28305:
------------------------------

             Summary: When the AM cannot obtain the container loss reason, the GetExecutorLossReason ask times out
                 Key: SPARK-28305
                 URL: https://issues.apache.org/jira/browse/SPARK-28305
             Project: Spark
          Issue Type: Bug
          Components: YARN
    Affects Versions: 2.4.0
            Reporter: dzcxzl


In some cases, such as when the NM machine crashes or shuts down, the driver asks
the AM for GetExecutorLossReason, but the AM's getCompletedContainersStatuses call
cannot obtain the container's failure information.
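
A minimal Scala sketch of the pattern described above, for illustration only: askAm
is a hypothetical stand-in for the driver's GetExecutorLossReason RPC ask, and the
120-second bound mirrors spark.rpc.askTimeout. This is not Spark's actual
YarnSchedulerBackend code.

import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

// Illustrative only: the driver asks the AM for an executor's loss reason and waits a
// bounded time for the reply. If the AM has no container status to report (as in the
// NM-crash case above), no reply arrives and the wait ends in a TimeoutException.
def awaitExecutorLossReason(
    executorId: String,
    askAm: String => Future[String]): String = {
  val reply = askAm(executorId)       // hypothetical GetExecutorLossReason ask
  Await.result(reply, 120.seconds)    // mirrors spark.rpc.askTimeout (120 s)
}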

The YARN NM detection timeout is 10 minutes, controlled by the parameter
yarn.resourcemanager.rm.container-allocation.expiry-interval-ms, so the AM has to
wait up to 10 minutes to learn the cause of the container failure.
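
For reference, a sketch (assuming this is the relevant property) of reading that
expiry interval from the YARN configuration; the 600000 ms fallback is the
10-minute default mentioned above.

import org.apache.hadoop.yarn.conf.YarnConfiguration

// Sketch: read the RM container-allocation expiry interval; 600000 ms (10 minutes)
// is the default, which is why the AM waits that long for the container loss reason.
val yarnConf = new YarnConfiguration()
val expiryMs = yarnConf.getLong(
  "yarn.resourcemanager.rm.container-allocation.expiry-interval-ms", 600000L)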

Although the driver's ask fails and it then calls recover, the 2-minute timeout
(spark.network.timeout) configured on the IdleStateHandler closes the connection
between the driver and the AM; the AM exits, the application finishes, and the
driver exits, causing the job to fail.
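
As an illustration of the Spark-side timeouts involved (not a fix for the underlying
issue), a hypothetical sketch of widening them so the driver-AM connection can
outlive YARN's ~10-minute expiry; the values are examples only.

import org.apache.spark.SparkConf

// Hypothetical mitigation sketch only: widen the idle and ask timeouts seen in the
// logs below so the driver<->AM connection is not torn down after 2 minutes.
val sparkConf = new SparkConf()
  .set("spark.network.timeout", "660s")  // IdleStateHandler idle timeout (default 120s)
  .set("spark.rpc.askTimeout", "660s")   // ask timeout reported in the driver log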


AM LOG:

19/07/08 16:56:48 [dispatcher-event-loop-0] INFO YarnAllocator: add executor 
951 to pendingLossReasonRequests for get the loss reason
19/07/08 16:58:48 [dispatcher-event-loop-26] INFO ApplicationMaster$AMEndpoint: 
Driver terminated or disconnected! Shutting down.
19/07/08 16:58:48 [dispatcher-event-loop-26] INFO ApplicationMaster: Final app 
status: SUCCEEDED, exitCode: 0


Driver LOG:

19/07/08 16:58:48,476 [rpc-server-3-3] ERROR TransportChannelHandler: 
Connection to /xx.xx.xx.xx:19398 has been quiet for 120000 ms while there are 
outstanding requests. Assuming connection is dead; please adjust 
spark.network.timeout if this is wrong.
19/07/08 16:58:48,476 [rpc-server-3-3] ERROR TransportResponseHandler: Still 
have 1 requests outstanding when connection from /xx.xx.xx.xx:19398 is closed
19/07/08 16:58:48,510 [rpc-server-3-3] WARN NettyRpcEnv: Ignored failure: 
java.io.IOException: Connection from /xx.xx.xx.xx:19398 closed
19/07/08 16:58:48,516 [netty-rpc-env-timeout] WARN 
YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to get executor loss 
reason for executor id 951 at RPC address xx.xx.xx.xx:49175, but got no 
response. Marking as slave lost.
org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply from null in 
120 seconds. This timeout is controlled by spark.rpc.askTimeout


