Yeachan Park created SPARK-38969:
------------------------------------

             Summary: Graceful decommissioning on Kubernetes fails / decom script error
                 Key: SPARK-38969
                 URL: https://issues.apache.org/jira/browse/SPARK-38969
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.2.0
         Environment: Running spark-thriftserver (3.2.0) on Kubernetes (GKE 1.20.15-gke.2500).

 
            Reporter: Yeachan Park


Hello, we are running into an issue while attempting graceful decommissioning of executors. We enabled the following options:
 * spark.decommission.enabled 
 * spark.storage.decommission.rddBlocks.enabled
 * spark.storage.decommission.shuffleBlocks.enabled
 * spark.storage.decommission.enabled

and set spark.storage.decommission.fallbackStorage.path to a path in our bucket (see the configuration sketch below).
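
For reference, a minimal sketch of the equivalent configuration. In our deployment these options are supplied through the Thrift Server's Spark conf; the SparkSession.builder form and the bucket path below are purely illustrative:

```
import org.apache.spark.sql.SparkSession

// Illustrative only: in practice these options are passed via spark-defaults.conf
// or --conf flags. The fallback storage path is a placeholder for our bucket.
val spark = SparkSession.builder()
  .appName("graceful-decommission-example")
  .config("spark.decommission.enabled", "true")
  .config("spark.storage.decommission.enabled", "true")
  .config("spark.storage.decommission.rddBlocks.enabled", "true")
  .config("spark.storage.decommission.shuffleBlocks.enabled", "true")
  .config("spark.storage.decommission.fallbackStorage.path", "gs://<our-bucket>/spark-decom-fallback/")
  .getOrCreate()
```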
 
The driver logs seem to suggest that the decommissioning process started but then unexpectedly exited and failed:
 
```
22/04/20 15:09:09 WARN 
KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Received executor 3 
decommissioned message
22/04/20 15:09:09 INFO KubernetesClusterSchedulerBackend: Decommission 
executors: 3
22/04/20 15:09:09 INFO BlockManagerMasterEndpoint: Mark BlockManagers 
(BlockManagerId(3, 100.96.1.130, 44789, None)) as being decommissioning.
22/04/20 15:09:10 ERROR TaskSchedulerImpl: Lost executor 3 on 100.96.1.130: 
Executor decommission.
22/04/20 15:09:10 INFO DAGScheduler: Executor lost: 3 (epoch 2)
22/04/20 15:09:10 INFO ExecutorMonitor: Executor 3 is removed. Remove reason 
statistics: (gracefully decommissioned: 0, decommision unfinished: 0, driver 
killed: 0, unexpectedly exited: 3).
22/04/20 15:09:10 INFO BlockManagerMasterEndpoint: Trying to remove executor 3 
from BlockManagerMaster.
22/04/20 15:09:10 INFO BlockManagerMasterEndpoint: Removing block manager 
BlockManagerId(3, 100.96.1.130, 44789, None)
22/04/20 15:09:10 INFO BlockManagerMaster: Removed 3 successfully in 
removeExecutor
22/04/20 15:09:10 INFO DAGScheduler: Shuffle files lost for executor: 3 (epoch 
2)
```
 
However, the executor logs seem to suggest that decommissioning was successful:
 
```
22/04/20 15:09:09 INFO CoarseGrainedExecutorBackend: Decommission executor 3.
22/04/20 15:09:09 INFO CoarseGrainedExecutorBackend: Will exit when finished 
decommissioning
22/04/20 15:09:09 INFO BlockManager: Starting block manager decommissioning 
process...
22/04/20 15:09:10 INFO BlockManagerDecommissioner: Starting block migration
22/04/20 15:09:10 INFO BlockManagerDecommissioner: Attempting to migrate all 
RDD blocks
22/04/20 15:09:10 INFO BlockManagerDecommissioner: Attempting to migrate all 
shuffle blocks
22/04/20 15:09:10 INFO BlockManagerDecommissioner: Start refreshing migratable 
shuffle blocks
22/04/20 15:09:10 INFO BlockManagerDecommissioner: 0 of 0 local shuffles are 
added. In total, 0 shuffles are remained.
22/04/20 15:09:10 INFO BlockManagerDecommissioner: Attempting to migrate all 
cached RDD blocks
22/04/20 15:09:10 INFO BlockManagerDecommissioner: Starting shuffle block 
migration thread for BlockManagerId(4, 100.96.1.131, 35607, None)
22/04/20 15:09:10 INFO BlockManagerDecommissioner: Starting shuffle block 
migration thread for BlockManagerId(fallback, remote, 7337, None)
22/04/20 15:09:10 INFO BlockManagerDecommissioner: Finished current round 
refreshing migratable shuffle blocks, waiting for 30000ms before the next round 
refreshing.
22/04/20 15:09:10 WARN BlockManagerDecommissioner: Asked to decommission RDD 
cache blocks, but no blocks to migrate
22/04/20 15:09:10 INFO BlockManagerDecommissioner: Finished current round RDD 
blocks migration, waiting for 30000ms before the next round migration.
22/04/20 15:09:10 INFO CoarseGrainedExecutorBackend: Checking to see if we can 
shutdown.
22/04/20 15:09:10 INFO CoarseGrainedExecutorBackend: No running tasks, checking 
migrations
22/04/20 15:09:10 INFO CoarseGrainedExecutorBackend: No running tasks, all 
blocks migrated, stopping.
22/04/20 15:09:10 ERROR CoarseGrainedExecutorBackend: Executor self-exiting due 
to : Finished decommissioning
22/04/20 15:09:10 INFO BlockManagerDecommissioner: Stop RDD blocks migration().
22/04/20 15:09:10 INFO BlockManagerDecommissioner: Stop refreshing migratable 
shuffle blocks.
22/04/20 15:09:10 INFO BlockManagerDecommissioner: Stopping migrating shuffle 
blocks.
22/04/20 15:09:10 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
22/04/20 15:09:10 INFO BlockManagerDecommissioner: Stopped block migration
22/04/20 15:09:10 INFO BlockManagerDecommissioner: Stop shuffle block 
migration().
22/04/20 15:09:10 INFO BlockManagerDecommissioner: Stop shuffle block 
migration().
22/04/20 15:09:10 INFO MemoryStore: MemoryStore cleared
22/04/20 15:09:10 INFO BlockManager: BlockManager stopped
22/04/20 15:09:10 INFO ShutdownHookManager: Shutdown hook called
```
 
The decommissioning script `/opt/decom.sh` also always terminates with exit code 137 (i.e. 128 + 9, which usually indicates the script was killed with SIGKILL); we are not sure why that is.
 
 
 


