[ https://issues.apache.org/jira/browse/SPARK-38969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530931#comment-17530931 ]

Yeachan Park commented on SPARK-38969:
--------------------------------------

Hi Holden, thanks for responding. Triggering the decom script manually from 
within a pod still made it exit with code 137, even though the whole execution 
took much less than 60 seconds.
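
For reference, a minimal way to reproduce that manual check, assuming kubectl 
access; the namespace and pod name below are placeholders, not values from this 
report:

```
# Run the decommission script inside a running executor pod and print its exit code.
# <namespace> and <executor-pod> are placeholders.
kubectl exec -n <namespace> <executor-pod> -- /opt/decom.sh
echo "decom.sh exit code: $?"
```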

> Graceful decommissioning on Kubernetes fails / decom script error
> -----------------------------------------------------------------
>
>                 Key: SPARK-38969
>                 URL: https://issues.apache.org/jira/browse/SPARK-38969
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.2.0
>         Environment: Running spark-thriftserver (3.2.0) on Kubernetes (GKE 
> 1.20.15-gke.2500). 
>  
>            Reporter: Yeachan Park
>            Priority: Minor
>
> Hello, we are running into an issue while attempting graceful 
> decommissioning of executors. We enabled:
>  * spark.decommission.enabled 
>  * spark.storage.decommission.rddBlocks.enabled
>  * spark.storage.decommission.shuffleBlocks.enabled
>  * spark.storage.decommission.enabled
> and set spark.storage.decommission.fallbackStorage.path to a path in our 
> bucket (an example of these settings is sketched below).
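>  
> For reference, a minimal sketch of how these settings might be passed when 
> launching the Thrift server; the bucket and path below are placeholders, not 
> our actual values:
> ```
> # Illustrative only: substitute a real fallback path for the placeholder below.
> $SPARK_HOME/sbin/start-thriftserver.sh \
>   --conf spark.decommission.enabled=true \
>   --conf spark.storage.decommission.enabled=true \
>   --conf spark.storage.decommission.rddBlocks.enabled=true \
>   --conf spark.storage.decommission.shuffleBlocks.enabled=true \
>   --conf spark.storage.decommission.fallbackStorage.path=gs://<bucket>/spark-fallback/
> ```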
>  
> The logs from the driver seem to suggest that the decommissioning process 
> started, but that the executor then unexpectedly exited and failed:
>  
> ```
> 22/04/20 15:09:09 WARN 
> KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Received executor 
> 3 decommissioned message
> 22/04/20 15:09:09 INFO KubernetesClusterSchedulerBackend: Decommission 
> executors: 3
> 22/04/20 15:09:09 INFO BlockManagerMasterEndpoint: Mark BlockManagers 
> (BlockManagerId(3, 100.96.1.130, 44789, None)) as being decommissioning.
> 22/04/20 15:09:10 ERROR TaskSchedulerImpl: Lost executor 3 on 100.96.1.130: 
> Executor decommission.
> 22/04/20 15:09:10 INFO DAGScheduler: Executor lost: 3 (epoch 2)
> 22/04/20 15:09:10 INFO ExecutorMonitor: Executor 3 is removed. Remove reason 
> statistics: (gracefully decommissioned: 0, decommision unfinished: 0, driver 
> killed: 0, unexpectedly exited: 3).
> 22/04/20 15:09:10 INFO BlockManagerMasterEndpoint: Trying to remove executor 
> 3 from BlockManagerMaster.
> 22/04/20 15:09:10 INFO BlockManagerMasterEndpoint: Removing block manager 
> BlockManagerId(3, 100.96.1.130, 44789, None)
> 22/04/20 15:09:10 INFO BlockManagerMaster: Removed 3 successfully in 
> removeExecutor
> 22/04/20 15:09:10 INFO DAGScheduler: Shuffle files lost for executor: 3 
> (epoch 2)
> ```
>  
> However, the executor logs seem to suggest that decommissioning was 
> successful:
>  
> ```
> 22/04/20 15:09:09 INFO CoarseGrainedExecutorBackend: Decommission executor 3.
> 22/04/20 15:09:09 INFO CoarseGrainedExecutorBackend: Will exit when finished 
> decommissioning
> 22/04/20 15:09:09 INFO BlockManager: Starting block manager decommissioning 
> process...
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Starting block migration
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Attempting to migrate all 
> RDD blocks
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Attempting to migrate all 
> shuffle blocks
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Start refreshing 
> migratable shuffle blocks
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: 0 of 0 local shuffles are 
> added. In total, 0 shuffles are remained.
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Attempting to migrate all 
> cached RDD blocks
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Starting shuffle block 
> migration thread for BlockManagerId(4, 100.96.1.131, 35607, None)
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Starting shuffle block 
> migration thread for BlockManagerId(fallback, remote, 7337, None)
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Finished current round 
> refreshing migratable shuffle blocks, waiting for 30000ms before the next 
> round refreshing.
> 22/04/20 15:09:10 WARN BlockManagerDecommissioner: Asked to decommission RDD 
> cache blocks, but no blocks to migrate
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Finished current round RDD 
> blocks migration, waiting for 30000ms before the next round migration.
> 22/04/20 15:09:10 INFO CoarseGrainedExecutorBackend: Checking to see if we 
> can shutdown.
> 22/04/20 15:09:10 INFO CoarseGrainedExecutorBackend: No running tasks, 
> checking migrations
> 22/04/20 15:09:10 INFO CoarseGrainedExecutorBackend: No running tasks, all 
> blocks migrated, stopping.
> 22/04/20 15:09:10 ERROR CoarseGrainedExecutorBackend: Executor self-exiting 
> due to : Finished decommissioning
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Stop RDD blocks 
> migration().
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Stop refreshing migratable 
> shuffle blocks.
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Stopping migrating shuffle 
> blocks.
> 22/04/20 15:09:10 INFO CoarseGrainedExecutorBackend: Driver commanded a 
> shutdown
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Stopped block migration
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Stop shuffle block 
> migration().
> 22/04/20 15:09:10 INFO BlockManagerDecommissioner: Stop shuffle block 
> migration().
> 22/04/20 15:09:10 INFO MemoryStore: MemoryStore cleared
> 22/04/20 15:09:10 INFO BlockManager: BlockManager stopped
> 22/04/20 15:09:10 INFO ShutdownHookManager: Shutdown hook called
> ```
>  
> The decommissioning script `/opt/decom.sh` also always terminates with exit 
> code 137; we're not really sure why that is.
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
