[ https://issues.apache.org/jira/browse/SPARK-27228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16798207#comment-16798207 ]
Lukas Waldmann commented on SPARK-27228: ---------------------------------------- log file added > Spark long delay on close, possible problem with killing executors > ------------------------------------------------------------------ > > Key: SPARK-27228 > URL: https://issues.apache.org/jira/browse/SPARK-27228 > Project: Spark > Issue Type: Bug > Components: Block Manager > Affects Versions: 2.3.0 > Reporter: Lukas Waldmann > Priority: Major > Attachments: log.html > > > When using dynamic allocations after all jobs finishes spark delays for > several minutes before finally finishes. Log suggest that executors are not > cleared up properly. > {quote}{{19/03/21 09:51:38 INFO SparkSession: PROCESSING FINISHED 19/03/21 > 09:51:38 INFO ExecutorAllocationManager: Request to remove executorIds: 355 > 19/03/21 09:51:38 INFO YarnClusterSchedulerBackend: Requesting to kill > executor(s) 355 19/03/21 09:51:38 INFO YarnClusterSchedulerBackend: Actual > list of executor(s) to be killed is 355 19/03/21 09:51:38 INFO > ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 355. > 19/03/21 09:51:38 INFO ExecutorAllocationManager: Removing executor 355 > because it has been idle for 60 seconds (new desired total will be 65) > 19/03/21 09:51:38 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling > executor 228. 19/03/21 09:51:38 INFO DAGScheduler: Executor lost: 228 (epoch > 446) 19/03/21 09:51:38 INFO BlockManagerMasterEndpoint: Trying to remove > executor 228 from BlockManagerMaster. 19/03/21 09:51:38 INFO > BlockManagerMasterEndpoint: Removing block manager BlockManagerId(228, > data-15.bdp.gin.merck.com, 45882, None) 19/03/21 09:51:38 INFO > BlockManagerMaster: Removed 228 successfully in removeExecutor 19/03/21 > 09:51:38 INFO SparkUI: Stopped Spark web UI at > [http://data-04.bdp.gin.merck.com:44304|http://data-04.bdp.gin.merck.com:44304/] > 19/03/21 09:51:38 INFO YarnClusterScheduler: Executor 228 on > data-15.bdp.gin.merck.com killed by driver. 19/03/21 09:51:38 INFO > YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 346. 19/03/21 > 09:51:38 INFO DAGScheduler: Executor lost: 346 (epoch 446) 19/03/21 09:51:38 > INFO BlockManagerMasterEndpoint: Trying to remove executor 346 from > BlockManagerMaster. 19/03/21 09:51:38 INFO BlockManagerMasterEndpoint: > Removing block manager BlockManagerId(346, datanode-02.bdp.gin.merck.com, > 41186, None) 19/03/21 09:51:38 INFO BlockManagerMaster: Removed 346 > successfully in removeExecutor 19/03/21 09:51:38 INFO YarnClusterScheduler: > Executor 346 on datanode-02.bdp.gin.merck.com killed by driver. 19/03/21 > 09:51:39 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor > 332. 19/03/21 09:51:39 INFO DAGScheduler: Executor lost: 332 (epoch 446) > 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Trying to remove executor > 332 from BlockManagerMaster. 19/03/21 09:51:39 INFO > BlockManagerMasterEndpoint: Removing block manager BlockManagerId(332, > data-10.bdp.gin.merck.com, 38713, None) 19/03/21 09:51:39 INFO > BlockManagerMaster: Removed 332 successfully in removeExecutor 19/03/21 > 09:51:39 INFO YarnClusterScheduler: Executor 332 on data-10.bdp.gin.merck.com > killed by driver. 19/03/21 09:51:39 INFO > YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 240. 19/03/21 > 09:51:39 INFO YarnClusterScheduler: Executor 240 on data-22.bdp.gin.merck.com > killed by driver. 19/03/21 09:51:39 INFO DAGScheduler: Executor lost: 240 > (epoch 446) 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Trying to > remove executor 240 from BlockManagerMaster. 19/03/21 09:51:39 INFO > BlockManagerMasterEndpoint: Removing block manager BlockManagerId(240, > data-22.bdp.gin.merck.com, 43344, None) 19/03/21 09:51:39 INFO > BlockManagerMaster: Removed 240 successfully in removeExecutor 19/03/21 > 09:51:39 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor > 327. 19/03/21 09:51:39 INFO DAGScheduler: Executor lost: 327 (epoch 446) > 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Trying to remove executor > 327 from BlockManagerMaster. 19/03/21 09:51:39 INFO > BlockManagerMasterEndpoint: Removing block manager BlockManagerId(327, > data-20.bdp.gin.merck.com, 34235, None) 19/03/21 09:51:39 INFO > YarnClusterScheduler: Executor 327 on data-20.bdp.gin.merck.com killed by > driver. 19/03/21 09:51:39 INFO BlockManagerMaster: Removed 327 successfully > in removeExecutor 19/03/21 09:51:39 INFO > YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 355. 19/03/21 > 09:51:39 INFO YarnClusterScheduler: Executor 355 on data-20.bdp.gin.merck.com > killed by driver. 19/03/21 09:51:39 INFO DAGScheduler: Executor lost: 355 > (epoch 446) 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: Trying to > remove executor 355 from BlockManagerMaster. 19/03/21 09:51:39 INFO > BlockManagerMasterEndpoint: Removing block manager BlockManagerId(355, > data-20.bdp.gin.merck.com, 43141, None) 19/03/21 09:51:39 INFO > BlockManagerMaster: Removed 355 successfully in removeExecutor 19/03/21 > 09:51:39 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor > 168. 19/03/21 09:51:39 INFO DAGScheduler: Executor lost: 168 (epoch 446) > 19/03/21 09:51:39 INFO YarnClusterScheduler: Executor 168 on > data-07.bdp.gin.merck.com killed by driver. 19/03/21 09:51:39 INFO > BlockManagerMasterEndpoint: Trying to remove executor 168 from > BlockManagerMaster. 19/03/21 09:51:39 INFO BlockManagerMasterEndpoint: > Removing block manager BlockManagerId(168, data-07.bdp.gin.merck.com, 44833, > None) 19/03/21 09:51:39 INFO BlockManagerMaster: Removed 168 successfully in > removeExecutor 19/03/21 09:54:26 WARN HeartbeatReceiver: Removing executor > 332 with no recent heartbeats: 173942 ms exceeds timeout 120000 ms 19/03/21 > 09:54:26 ERROR YarnClusterScheduler: Lost an executor 332 (already removed): > Executor heartbeat timed out after 173942 ms 19/03/21 09:54:26 WARN > HeartbeatReceiver: Removing executor 346 with no recent heartbeats: 172853 ms > exceeds timeout 120000 ms 19/03/21 09:54:26 ERROR YarnClusterScheduler: Lost > an executor 346 (already removed): Executor heartbeat timed out after 172853 > ms 19/03/21 09:54:26 WARN HeartbeatReceiver: Removing executor 355 with no > recent heartbeats: 173169 ms exceeds timeout 120000 ms 19/03/21 09:54:26 > ERROR YarnClusterScheduler: Lost an executor 355 (already removed): Executor > heartbeat timed out after 173169 ms 19/03/21 09:54:26 WARN HeartbeatReceiver: > Removing executor 168 with no recent heartbeats: 174129 ms exceeds timeout > 120000 ms 19/03/21 09:54:26 ERROR YarnClusterScheduler: Lost an executor 168 > (already removed): Executor heartbeat timed out after 174129 ms 19/03/21 > 09:54:26 WARN HeartbeatReceiver: Removing executor 327 with no recent > heartbeats: 169555 ms exceeds timeout 120000 ms 19/03/21 09:54:26 ERROR > YarnClusterScheduler: Lost an executor 327 (already removed): Executor > heartbeat timed out after 169555 ms 19/03/21 09:54:26 WARN HeartbeatReceiver: > Removing executor 240 with no recent heartbeats: 177937 ms exceeds timeout > 120000 ms 19/03/21 09:54:26 ERROR YarnClusterScheduler: Lost an executor 240 > (already removed): Executor heartbeat timed out after 177937 ms 19/03/21 > 09:54:26 WARN HeartbeatReceiver: Removing executor 228 with no recent > heartbeats: 178171 ms exceeds timeout 120000 ms 19/03/21 09:54:26 ERROR > YarnClusterScheduler: Lost an executor 228 (already removed): Executor > heartbeat timed out after 178171 ms 19/03/21 09:54:26 INFO > YarnClusterSchedulerBackend: Requesting to kill executor(s) 332 19/03/21 > 09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 332 does not > exist! 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of > executor(s) to be killed is 19/03/21 09:54:26 INFO > YarnClusterSchedulerBackend: Requesting to kill executor(s) 346 19/03/21 > 09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 346 does not > exist! 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of > executor(s) to be killed is 19/03/21 09:54:26 INFO > YarnClusterSchedulerBackend: Requesting to kill executor(s) 355 19/03/21 > 09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 355 does not > exist! 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of > executor(s) to be killed is 19/03/21 09:54:26 INFO > YarnClusterSchedulerBackend: Requesting to kill executor(s) 168 19/03/21 > 09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 168 does not > exist! 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of > executor(s) to be killed is 19/03/21 09:54:26 INFO > YarnClusterSchedulerBackend: Requesting to kill executor(s) 327 19/03/21 > 09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 327 does not > exist! 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of > executor(s) to be killed is 19/03/21 09:54:26 INFO > YarnClusterSchedulerBackend: Requesting to kill executor(s) 240 19/03/21 > 09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 240 does not > exist! 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of > executor(s) to be killed is 19/03/21 09:54:26 INFO > YarnClusterSchedulerBackend: Requesting to kill executor(s) 228 19/03/21 > 09:54:26 WARN YarnClusterSchedulerBackend: Executor to kill 228 does not > exist! 19/03/21 09:54:26 INFO YarnClusterSchedulerBackend: Actual list of > executor(s) to be killed is 19/03/21 09:56:43 INFO > YarnClusterSchedulerBackend: Shutting down all executors 19/03/21 09:56:43 > INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut > down 19/03/21 09:56:43 INFO SchedulerExtensionServices: Stopping > SchedulerExtensionServices (serviceOption=None, services=List(), > started=false) 19/03/21 09:56:43 INFO MapOutputTrackerMasterEndpoint: > MapOutputTrackerMasterEndpoint stopped! 19/03/21 09:56:43 INFO MemoryStore: > MemoryStore cleared 19/03/21 09:56:43 INFO BlockManager: BlockManager stopped > 19/03/21 09:56:43 INFO BlockManagerMaster: BlockManagerMaster stopped > 19/03/21 09:56:43 INFO > OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: > OutputCommitCoordinator stopped! 19/03/21 09:56:43 INFO SparkContext: > Successfully stopped SparkContext 19/03/21 09:56:43 INFO ApplicationMaster: > Final app status: SUCCEEDED, exitCode: 0 19/03/21 }} > {quote} -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org