Yeachan Park created SPARK-42737:
------------------------------------

             Summary: Shuffle files lost with graceful decommission fallback 
storage enabled
                 Key: SPARK-42737
                 URL: https://issues.apache.org/jira/browse/SPARK-42737
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.3.2
            Reporter: Yeachan Park


During testing of graceful decommissioning, the driver logs indicate that 
shuffle files were lost:

{code:bash}
23/03/09 15:22:42 INFO KubernetesClusterSchedulerBackend: Decommission 
executors: 3
23/03/09 15:22:42 INFO BlockManagerMasterEndpoint: Mark BlockManagers 
(BlockManagerId(3, 100.96.5.11, 44707, None)) as being decommissioning.
23/03/09 15:22:42 WARN 
KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Received executor 1 
decommissioned message
23/03/09 15:22:42 INFO KubernetesClusterSchedulerBackend: Decommission 
executors: 1
23/03/09 15:22:42 INFO BlockManagerMasterEndpoint: Mark BlockManagers 
(BlockManagerId(1, 100.96.5.9, 44491, None)) as being decommissioning.
23/03/09 15:22:42 WARN 
KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Received executor 2 
decommissioned message
23/03/09 15:22:42 INFO KubernetesClusterSchedulerBackend: Decommission 
executors: 2
23/03/09 15:22:42 INFO BlockManagerMasterEndpoint: Mark BlockManagers 
(BlockManagerId(2, 100.96.5.10, 39011, None)) as being decommissioning.
23/03/09 15:22:44 ERROR TaskSchedulerImpl: Lost executor 3 on 100.96.5.11: 
Executor decommission.
23/03/09 15:22:44 INFO ExecutorMonitor: Executor 3 is removed. Remove reason 
statistics: (gracefully decommissioned: 1, decommision unfinished: 0, driver 
killed: 0, unexpectedly exited: 0).
23/03/09 15:22:44 INFO DAGScheduler: Executor lost: 3 (epoch 0)
23/03/09 15:22:44 ERROR TaskSchedulerImpl: Lost executor 1 on 100.96.5.9: 
Executor decommission.
23/03/09 15:22:44 INFO ExecutorMonitor: Executor 1 is removed. Remove reason 
statistics: (gracefully decommissioned: 2, decommision unfinished: 0, driver 
killed: 0, unexpectedly exited: 0).
23/03/09 15:22:44 ERROR TaskSchedulerImpl: Lost executor 2 on 100.96.5.10: 
Executor decommission.
23/03/09 15:22:44 INFO BlockManagerMasterEndpoint: Trying to remove executor 3 
from BlockManagerMaster.
23/03/09 15:22:44 INFO ExecutorMonitor: Executor 2 is removed. Remove reason 
statistics: (gracefully decommissioned: 3, decommision unfinished: 0, driver 
killed: 0, unexpectedly exited: 0).
23/03/09 15:22:44 INFO BlockManagerMasterEndpoint: Removing block manager 
BlockManagerId(3, 100.96.5.11, 44707, None)
23/03/09 15:22:44 INFO BlockManagerMaster: Removed 3 successfully in 
removeExecutor
23/03/09 15:22:44 INFO DAGScheduler: Shuffle files lost for executor: 3 (epoch 
0)
23/03/09 15:22:44 INFO DAGScheduler: Executor lost: 1 (epoch 1)
23/03/09 15:22:45 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 
from BlockManagerMaster.
23/03/09 15:22:45 INFO BlockManagerMasterEndpoint: Removing block manager 
BlockManagerId(1, 100.96.5.9, 44491, None)
23/03/09 15:22:45 INFO BlockManagerMaster: Removed 1 successfully in 
removeExecutor
23/03/09 15:22:45 INFO DAGScheduler: Shuffle files lost for executor: 1 (epoch 
1)
23/03/09 15:22:45 INFO DAGScheduler: Executor lost: 2 (epoch 2)
23/03/09 15:22:45 INFO BlockManagerMasterEndpoint: Trying to remove executor 2 
from BlockManagerMaster.
23/03/09 15:22:45 INFO BlockManagerMasterEndpoint: Removing block manager 
BlockManagerId(2, 100.96.5.10, 39011, None)
23/03/09 15:22:45 INFO BlockManagerMaster: Removed 2 successfully in 
removeExecutor
23/03/09 15:22:45 INFO DAGScheduler: Shuffle files lost for executor: 2 (epoch 
2)
23/03/09 15:22:52 INFO BlockManagerMaster: Removal of executor 1 requested
23/03/09 15:22:52 INFO 
KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove 
non-existent executor 1
{code}
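
To make the sequence above easier to check, here is a minimal sketch (not part of Spark) that scans a driver log for executors that were removed with the reason "Executor decommission" and were later reported by the DAGScheduler as having lost shuffle files. The function names are hypothetical, and the regular expressions assume the exact message wording shown in the excerpt above; continuation lines produced by line wrapping are joined back onto their record first.

```python
import re

# Start of a log record, e.g. "23/03/09 15:22:44 INFO ..." (format assumed from the excerpt).
RECORD_START = re.compile(r"^\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} ")
DECOMMISSIONED = re.compile(
    r"TaskSchedulerImpl: Lost executor (\d+) on \S+ Executor decommission"
)
SHUFFLE_LOST = re.compile(r"DAGScheduler: Shuffle files lost for executor: (\d+)")


def unwrap(lines):
    """Join wrapped continuation lines back onto the log record they belong to."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if RECORD_START.match(line) or not records:
            records.append(line)
        else:
            records[-1] += " " + line
    return records


def lost_after_decommission(lines):
    """Return ids of executors lost to decommission whose shuffle files were then reported lost."""
    decommissioned = set()
    lost = []
    for record in unwrap(lines):
        match = DECOMMISSIONED.search(record)
        if match:
            decommissioned.add(match.group(1))
        match = SHUFFLE_LOST.search(record)
        if match and match.group(1) in decommissioned:
            lost.append(match.group(1))
    return lost


# Two wrapped records copied from the driver log above.
sample = [
    "23/03/09 15:22:44 ERROR TaskSchedulerImpl: Lost executor 3 on 100.96.5.11: ",
    "Executor decommission.",
    "23/03/09 15:22:44 INFO DAGScheduler: Shuffle files lost for executor: 3 (epoch ",
    "0)",
]
print(lost_after_decommission(sample))  # prints ['3']
```

Run against the full driver log, this would flag every executor that shows both messages, which is the pattern this report is about.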

The decommission logs from the executor also seem to indicate that no shuffle 
data needed to be migrated:

{code:java}
23/03/09 15:22:42 INFO CoarseGrainedExecutorBackend: Decommission executor 1.
23/03/09 15:22:42 INFO CoarseGrainedExecutorBackend: Will exit when finished 
decommissioning
23/03/09 15:22:42 INFO BlockManager: Starting block manager decommissioning 
process...
23/03/09 15:22:43 INFO CoarseGrainedExecutorBackend: Checking to see if we can 
shutdown.
23/03/09 15:22:43 INFO CoarseGrainedExecutorBackend: No running tasks, checking 
migrations
23/03/09 15:22:43 INFO CoarseGrainedExecutorBackend: All blocks not yet 
migrated.
23/03/09 15:22:43 INFO BlockManagerDecommissioner: Starting block migration
23/03/09 15:22:43 INFO BlockManagerDecommissioner: Attempting to migrate all 
RDD blocks
23/03/09 15:22:43 INFO BlockManagerDecommissioner: Attempting to migrate all 
shuffle blocks
23/03/09 15:22:43 INFO BlockManagerDecommissioner: Start refreshing migratable 
shuffle blocks
23/03/09 15:22:44 INFO BlockManagerDecommissioner: Attempting to migrate all 
cached RDD blocks
23/03/09 15:22:44 INFO BlockManagerDecommissioner: 0 of 0 local shuffles are 
added. In total, 0 shuffles are remained.
23/03/09 15:22:44 INFO BlockManagerDecommissioner: Starting shuffle block 
migration thread for BlockManagerId(fallback, remote, 7337, None)
23/03/09 15:22:44 INFO BlockManagerDecommissioner: Finished current round 
refreshing migratable shuffle blocks, waiting for 30000ms before the next round 
refreshing.
23/03/09 15:22:44 WARN BlockManagerDecommissioner: Asked to decommission RDD 
cache blocks, but no blocks to migrate
23/03/09 15:22:44 INFO BlockManagerDecommissioner: Finished current round RDD 
blocks migration, waiting for 30000ms before the next round migration.
23/03/09 15:22:44 INFO CoarseGrainedExecutorBackend: Checking to see if we can 
shutdown.
23/03/09 15:22:44 INFO CoarseGrainedExecutorBackend: No running tasks, checking 
migrations
23/03/09 15:22:44 INFO CoarseGrainedExecutorBackend: No running tasks, all 
blocks migrated, stopping.
23/03/09 15:22:44 INFO CoarseGrainedExecutorBackend: Executor self-exiting due 
to : Finished decommissioning
23/03/09 15:22:44 INFO BlockManagerDecommissioner: Stop RDD blocks migration().
23/03/09 15:22:44 INFO BlockManagerDecommissioner: Stop refreshing migratable 
shuffle blocks.
23/03/09 15:22:44 INFO BlockManagerDecommissioner: Stopping migrating shuffle 
blocks.
23/03/09 15:22:44 INFO BlockManagerDecommissioner: Stopped block migration
23/03/09 15:22:44 INFO BlockManagerDecommissioner: Stop shuffle block 
migration().
{code}
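
The line that matters in the executor log is the "N of M local shuffles are added" message. As a small sketch (a hypothetical helper, not a Spark API), the counts can be pulled out of an executor log like this, assuming the message wording stays exactly as in the excerpt above:

```python
import re

# Message wording assumed from the executor log excerpt above.
SHUFFLE_REFRESH = re.compile(
    r"BlockManagerDecommissioner: (\d+) of (\d+) local shuffles are added\. "
    r"In total, (\d+) shuffles are remained\."
)


def shuffle_refresh_counts(log_text):
    """Return (added, local, remaining) tuples, one per refresh round found in the log."""
    # Collapse all whitespace so a record wrapped across lines still matches.
    flat = " ".join(log_text.split())
    return [
        tuple(int(group) for group in match.groups())
        for match in SHUFFLE_REFRESH.finditer(flat)
    ]


excerpt = (
    "23/03/09 15:22:44 INFO BlockManagerDecommissioner: 0 of 0 local shuffles are \n"
    "added. In total, 0 shuffles are remained."
)
print(shuffle_refresh_counts(excerpt))  # prints [(0, 0, 0)]
```

A result of (0, 0, 0) means the decommissioner found no local shuffle data to migrate in that round.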


This seems incorrect: the driver reports shuffle files as lost even though 
there were no shuffle files to migrate to begin with.

We enabled:

- spark.decommission.enabled
- spark.storage.decommission.rddBlocks.enabled
- spark.storage.decommission.shuffleBlocks.enabled
- spark.storage.decommission.enabled

and set spark.storage.decommission.fallbackStorage.path to a path in our bucket.
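
For anyone trying to reproduce this, the settings above can be written out as spark-submit arguments. The sketch below is only an illustration: the helper name is made up, the value "true" is assumed for every setting described as enabled, and the fallback path (including the s3a scheme and bucket name) is a hypothetical placeholder, since the actual path is not given in this report.

```python
# Settings named in this report. The fallback path value is a hypothetical
# placeholder; the real bucket path is not stated.
DECOMMISSION_CONF = {
    "spark.decommission.enabled": "true",
    "spark.storage.decommission.rddBlocks.enabled": "true",
    "spark.storage.decommission.shuffleBlocks.enabled": "true",
    "spark.storage.decommission.enabled": "true",
    "spark.storage.decommission.fallbackStorage.path": "s3a://example-bucket/spark-fallback/",
}


def to_submit_args(conf):
    """Turn a config mapping into a flat list of spark-submit '--conf key=value' arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(to_submit_args(DECOMMISSION_CONF)))
```

The resulting arguments would be appended to the usual spark-submit command line for the job.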




