[ https://issues.apache.org/jira/browse/SPARK-42737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yeachan Park updated SPARK-42737:
---------------------------------
    Description: 
During testing of graceful decommissioning, the driver logs indicate that shuffle files were lost - `DAGScheduler: Shuffle files lost for executor`:
{code:bash}
23/03/09 15:22:42 INFO KubernetesClusterSchedulerBackend: Decommission executors: 3
23/03/09 15:22:42 INFO BlockManagerMasterEndpoint: Mark BlockManagers (BlockManagerId(3, 100.96.5.11, 44707, None)) as being decommissioning.
23/03/09 15:22:42 WARN KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Received executor 1 decommissioned message
23/03/09 15:22:42 INFO KubernetesClusterSchedulerBackend: Decommission executors: 1
23/03/09 15:22:42 INFO BlockManagerMasterEndpoint: Mark BlockManagers (BlockManagerId(1, 100.96.5.9, 44491, None)) as being decommissioning.
23/03/09 15:22:42 WARN KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Received executor 2 decommissioned message
23/03/09 15:22:42 INFO KubernetesClusterSchedulerBackend: Decommission executors: 2
23/03/09 15:22:42 INFO BlockManagerMasterEndpoint: Mark BlockManagers (BlockManagerId(2, 100.96.5.10, 39011, None)) as being decommissioning.
23/03/09 15:22:44 ERROR TaskSchedulerImpl: Lost executor 3 on 100.96.5.11: Executor decommission.
23/03/09 15:22:44 INFO ExecutorMonitor: Executor 3 is removed. Remove reason statistics: (gracefully decommissioned: 1, decommision unfinished: 0, driver killed: 0, unexpectedly exited: 0).
23/03/09 15:22:44 INFO DAGScheduler: Executor lost: 3 (epoch 0)
23/03/09 15:22:44 ERROR TaskSchedulerImpl: Lost executor 1 on 100.96.5.9: Executor decommission.
23/03/09 15:22:44 INFO ExecutorMonitor: Executor 1 is removed. Remove reason statistics: (gracefully decommissioned: 2, decommision unfinished: 0, driver killed: 0, unexpectedly exited: 0).
23/03/09 15:22:44 ERROR TaskSchedulerImpl: Lost executor 2 on 100.96.5.10: Executor decommission.
23/03/09 15:22:44 INFO BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster.
23/03/09 15:22:44 INFO ExecutorMonitor: Executor 2 is removed. Remove reason statistics: (gracefully decommissioned: 3, decommision unfinished: 0, driver killed: 0, unexpectedly exited: 0).
23/03/09 15:22:44 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(3, 100.96.5.11, 44707, None)
23/03/09 15:22:44 INFO BlockManagerMaster: Removed 3 successfully in removeExecutor
23/03/09 15:22:44 INFO DAGScheduler: Shuffle files lost for executor: 3 (epoch 0)
23/03/09 15:22:44 INFO DAGScheduler: Executor lost: 1 (epoch 1)
23/03/09 15:22:45 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
23/03/09 15:22:45 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(1, 100.96.5.9, 44491, None)
23/03/09 15:22:45 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor
23/03/09 15:22:45 INFO DAGScheduler: Shuffle files lost for executor: 1 (epoch 1)
23/03/09 15:22:45 INFO DAGScheduler: Executor lost: 2 (epoch 2)
23/03/09 15:22:45 INFO BlockManagerMasterEndpoint: Trying to remove executor 2 from BlockManagerMaster.
23/03/09 15:22:45 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(2, 100.96.5.10, 39011, None)
23/03/09 15:22:45 INFO BlockManagerMaster: Removed 2 successfully in removeExecutor
23/03/09 15:22:45 INFO DAGScheduler: Shuffle files lost for executor: 2 (epoch 2)
23/03/09 15:22:52 INFO BlockManagerMaster: Removal of executor 1 requested
23/03/09 15:22:52 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove non-existent executor 1
{code}
The decommission logs from the executor also seem to indicate that no shuffle data needed to be migrated:
{code:java}
23/03/09 15:22:42 INFO CoarseGrainedExecutorBackend: Decommission executor 1.
23/03/09 15:22:42 INFO CoarseGrainedExecutorBackend: Will exit when finished decommissioning
23/03/09 15:22:42 INFO BlockManager: Starting block manager decommissioning process...
23/03/09 15:22:43 INFO CoarseGrainedExecutorBackend: Checking to see if we can shutdown.
23/03/09 15:22:43 INFO CoarseGrainedExecutorBackend: No running tasks, checking migrations
23/03/09 15:22:43 INFO CoarseGrainedExecutorBackend: All blocks not yet migrated.
23/03/09 15:22:43 INFO BlockManagerDecommissioner: Starting block migration
23/03/09 15:22:43 INFO BlockManagerDecommissioner: Attempting to migrate all RDD blocks
23/03/09 15:22:43 INFO BlockManagerDecommissioner: Attempting to migrate all shuffle blocks
23/03/09 15:22:43 INFO BlockManagerDecommissioner: Start refreshing migratable shuffle blocks
23/03/09 15:22:44 INFO BlockManagerDecommissioner: Attempting to migrate all cached RDD blocks
23/03/09 15:22:44 INFO BlockManagerDecommissioner: 0 of 0 local shuffles are added. In total, 0 shuffles are remained.
23/03/09 15:22:44 INFO BlockManagerDecommissioner: Starting shuffle block migration thread for BlockManagerId(fallback, remote, 7337, None)
23/03/09 15:22:44 INFO BlockManagerDecommissioner: Finished current round refreshing migratable shuffle blocks, waiting for 30000ms before the next round refreshing.
23/03/09 15:22:44 WARN BlockManagerDecommissioner: Asked to decommission RDD cache blocks, but no blocks to migrate
23/03/09 15:22:44 INFO BlockManagerDecommissioner: Finished current round RDD blocks migration, waiting for 30000ms before the next round migration.
23/03/09 15:22:44 INFO CoarseGrainedExecutorBackend: Checking to see if we can shutdown.
23/03/09 15:22:44 INFO CoarseGrainedExecutorBackend: No running tasks, checking migrations
23/03/09 15:22:44 INFO CoarseGrainedExecutorBackend: No running tasks, all blocks migrated, stopping.
23/03/09 15:22:44 INFO CoarseGrainedExecutorBackend: Executor self-exiting due to : Finished decommissioning
23/03/09 15:22:44 INFO BlockManagerDecommissioner: Stop RDD blocks migration().
23/03/09 15:22:44 INFO BlockManagerDecommissioner: Stop refreshing migratable shuffle blocks.
23/03/09 15:22:44 INFO BlockManagerDecommissioner: Stopping migrating shuffle blocks.
23/03/09 15:22:44 INFO BlockManagerDecommissioner: Stopped block migration
23/03/09 15:22:44 INFO BlockManagerDecommissioner: Stop shuffle block migration().
{code}
This seems incorrect as there were no shuffle files to migrate to begin with.

We enabled:
- spark.decommission.enabled
- spark.storage.decommission.rddBlocks.enabled
- spark.storage.decommission.shuffleBlocks.enabled
- spark.storage.decommission.enabled

and set spark.storage.decommission.fallbackStorage.path to a path in our bucket.

The same also happened when there were actual shuffle files that were stored in the bucket.

> Shuffle files lost with graceful decommission fallback storage enabled
> ----------------------------------------------------------------------
>
>                 Key: SPARK-42737
>                 URL: https://issues.apache.org/jira/browse/SPARK-42737
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.3.2
>            Reporter: Yeachan Park
>            Priority: Minor
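
For anyone trying to reproduce this, the configuration listed in the description could be passed to spark-submit roughly as follows. This is only a sketch: the five property names come from the report, while the bucket URI `s3a://example-bucket/spark-fallback/` and the helper `to_submit_args` are hypothetical placeholders, not the reporter's actual setup.

```python
# The property names are taken from the report; the values for the four
# flags are assumed to be "true" because the report says they were enabled.
# The bucket URI is a hypothetical placeholder, not the reporter's real path.
DECOMMISSION_CONF = {
    "spark.decommission.enabled": "true",
    "spark.storage.decommission.enabled": "true",
    "spark.storage.decommission.rddBlocks.enabled": "true",
    "spark.storage.decommission.shuffleBlocks.enabled": "true",
    "spark.storage.decommission.fallbackStorage.path": "s3a://example-bucket/spark-fallback/",
}


def to_submit_args(conf):
    """Flatten a config mapping into repeated `--conf key=value` arguments."""
    args = []
    for key, value in conf.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


if __name__ == "__main__":
    # Print the arguments so they can be pasted after `spark-submit`.
    print(" ".join(to_submit_args(DECOMMISSION_CONF)))
```

The same key/value pairs could equally be placed in `spark-defaults.conf`; the helper only avoids repeating `--conf` by hand.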
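
To check whether a given driver log shows the same pattern, one could scan it for executors that were lost with the reason "Executor decommission" and for which the DAGScheduler also reported lost shuffle files. The sketch below assumes the message formats shown in the driver log of this report; the function name and regular expressions are illustrative and may need adjusting for other Spark versions.

```python
import re

# Patterns are modelled on the driver log lines quoted in this report.
DECOMMISSIONED = re.compile(r"Lost executor (\d+) on [^:]+: Executor decommission")
SHUFFLE_LOST = re.compile(r"DAGScheduler: Shuffle files lost for executor: (\d+)")


def decommissioned_with_shuffle_loss(lines):
    """Return ids of executors lost to decommission that also lost shuffle files."""
    decommissioned, shuffle_lost = set(), set()
    for line in lines:
        match = DECOMMISSIONED.search(line)
        if match:
            decommissioned.add(match.group(1))
        match = SHUFFLE_LOST.search(line)
        if match:
            shuffle_lost.add(match.group(1))
    # Sort numerically so the result is stable regardless of log order.
    return sorted(decommissioned & shuffle_lost, key=int)


# A few lines copied from the driver log in the report.
SAMPLE = [
    "23/03/09 15:22:44 ERROR TaskSchedulerImpl: Lost executor 3 on 100.96.5.11: Executor decommission.",
    "23/03/09 15:22:44 ERROR TaskSchedulerImpl: Lost executor 1 on 100.96.5.9: Executor decommission.",
    "23/03/09 15:22:44 ERROR TaskSchedulerImpl: Lost executor 2 on 100.96.5.10: Executor decommission.",
    "23/03/09 15:22:44 INFO DAGScheduler: Shuffle files lost for executor: 3 (epoch 0)",
    "23/03/09 15:22:45 INFO DAGScheduler: Shuffle files lost for executor: 1 (epoch 1)",
    "23/03/09 15:22:45 INFO DAGScheduler: Shuffle files lost for executor: 2 (epoch 2)",
]

if __name__ == "__main__":
    print(decommissioned_with_shuffle_loss(SAMPLE))
```

On the lines quoted in the report, all three executors are flagged, which matches the behaviour described above.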