[ https://issues.apache.org/jira/browse/SPARK-38969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530931#comment-17530931 ]
Yeachan Park commented on SPARK-38969:
--------------------------------------

Hi Holden, thanks for responding. Triggering the decom script manually from within a pod still made it exit with code 137, even though the whole execution took much less than 60 seconds.

> Graceful decommissioning on Kubernetes fails / decom script error
> -----------------------------------------------------------------
>
>                 Key: SPARK-38969
>                 URL: https://issues.apache.org/jira/browse/SPARK-38969
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.2.0
>         Environment: Running spark-thriftserver (3.2.0) on Kubernetes (GKE 1.20.15-gke.2500).
>            Reporter: Yeachan Park
>            Priority: Minor
>
> Hello, we are running into some issues while attempting graceful decommissioning of executors. We enabled:
> * spark.decommission.enabled
> * spark.storage.decommission.rddBlocks.enabled
> * spark.storage.decommission.shuffleBlocks.enabled
> * spark.storage.decommission.enabled
>
> and set spark.storage.decommission.fallbackStorage.path to a path in our bucket.
>
> The logs from the driver seem to suggest the decommissioning process started but then unexpectedly exited and failed:
>
> ```
> 22/04/20 15:09:09 WARN KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Received executor 3 decommissioned message
> 22/04/20 15:09:09 INFO KubernetesClusterSchedulerBackend: Decommission executors: 3
> 22/04/20 15:09:09 INFO BlockManagerMasterEndpoint: Mark BlockManagers (BlockManagerId(3, 100.96.1.130, 44789, None)) as being decommissioning.
> 22/04/20 15:09:10 ERROR TaskSchedulerImpl: Lost executor 3 on 100.96.1.130: Executor decommission.
> 22/04/20 15:09:10 INFO DAGScheduler: Executor lost: 3 (epoch 2)
> 22/04/20 15:09:10 INFO ExecutorMonitor: Executor 3 is removed. Remove reason statistics: (gracefully decommissioned: 0, decommision unfinished: 0, driver killed: 0, unexpectedly exited: 3).
> 22/04/20 15:09:10 INFO BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster.
> 22/04/20 15:09:10 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(3, 100.96.1.130, 44789, None)
> 22/04/20 15:09:10 INFO BlockManagerMaster: Removed 3 successfully in removeExecutor
> 22/04/20 15:09:10 INFO DAGScheduler: Shuffle files lost for executor: 3 (epoch 2)
> ```
>
> However, the executor logs seem to suggest that decommissioning was successful:
>
> ```
> 22/04/20 15:09:09 INFO CoarseGrainedExecutorBackend: Decommission executor 3.
> 22/04/20 15:09:09 INFO CoarseGrainedExecutorBackend: Will exit when finished decommissioning
> 22/04/20 15:09:09 INFO BlockManager: Starting block manager decommissioning process...
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Starting block migration
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Attempting to migrate all RDD blocks
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Attempting to migrate all shuffle blocks
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Start refreshing migratable shuffle blocks
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: 0 of 0 local shuffles are added. In total, 0 shuffles are remained.
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Attempting to migrate all cached RDD blocks
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Starting shuffle block migration thread for BlockManagerId(4, 100.96.1.131, 35607, None)
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Starting shuffle block migration thread for BlockManagerId(fallback, remote, 7337, None)
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Finished current round refreshing migratable shuffle blocks, waiting for 30000ms before the next round refreshing.
> 22/04/20 15:09:10 WARN BlockManagerDecommissioner: Asked to decommission RDD cache blocks, but no blocks to migrate
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Finished current round RDD blocks migration, waiting for 30000ms before the next round migration.
> 22/04/20 15:09:10 INFO CoarseGrainedExecutorBackend: Checking to see if we can shutdown.
> 22/04/20 15:09:10 INFO CoarseGrainedExecutorBackend: No running tasks, checking migrations
> 22/04/20 15:09:10 INFO CoarseGrainedExecutorBackend: No running tasks, all blocks migrated, stopping.
> 22/04/20 15:09:10 ERROR CoarseGrainedExecutorBackend: Executor self-exiting due to : Finished decommissioning
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Stop RDD blocks migration().
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Stop refreshing migratable shuffle blocks.
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Stopping migrating shuffle blocks.
> 22/04/20 15:09:10 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Stopped block migration
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Stop shuffle block migration().
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Stop shuffle block migration().
> 22/04/20 15:09:10 INFO MemoryStore: MemoryStore cleared
> 22/04/20 15:09:10 INFO BlockManager: BlockManager stopped
> 22/04/20 15:09:10 INFO ShutdownHookManager: Shutdown hook called
> ```
>
> The decommissioning script `/opt/decom.sh` also always terminates with exit code 137; not really sure why that is.

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
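For anyone trying to reproduce, the settings listed in the description would look roughly like this in `spark-defaults.conf`. The fallback path below is a placeholder, since the actual bucket path was not given in the report:

```properties
spark.decommission.enabled                        true
spark.storage.decommission.enabled                true
spark.storage.decommission.rddBlocks.enabled      true
spark.storage.decommission.shuffleBlocks.enabled  true
# Placeholder value; the report only says "a path in our bucket".
spark.storage.decommission.fallbackStorage.path   gs://<your-bucket>/spark-fallback/
```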
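One observation on the last point: exit code 137 conventionally means the process was terminated by SIGKILL (137 = 128 + 9), so it may indicate an external force-kill (for example, Kubernetes reaping the container after its termination grace period) rather than a failure inside the decom script itself. This is an inference, not something stated in the report; the convention itself is easy to demonstrate:

```shell
# A POSIX shell reports a child killed by signal N with exit status 128+N.
# 137 = 128 + 9 (SIGKILL), i.e. something killed the process from outside;
# the script never got to set its own exit code.
sh -c 'kill -9 $$'
echo "exit code: $?"    # prints: exit code: 137
```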