[
https://issues.apache.org/jira/browse/SPARK-42737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yeachan Park updated SPARK-42737:
---------------------------------
Description:
During testing of graceful decommissioning, the driver logs indicate that
shuffle files were lost - `DAGScheduler: Shuffle files lost for executor`:
{code:bash}
23/03/09 15:22:42 INFO KubernetesClusterSchedulerBackend: Decommission
executors: 3
23/03/09 15:22:42 INFO BlockManagerMasterEndpoint: Mark BlockManagers
(BlockManagerId(3, 100.96.5.11, 44707, None)) as being decommissioning.
23/03/09 15:22:42 WARN
KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Received executor 1
decommissioned message
23/03/09 15:22:42 INFO KubernetesClusterSchedulerBackend: Decommission
executors: 1
23/03/09 15:22:42 INFO BlockManagerMasterEndpoint: Mark BlockManagers
(BlockManagerId(1, 100.96.5.9, 44491, None)) as being decommissioning.
23/03/09 15:22:42 WARN
KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Received executor 2
decommissioned message
23/03/09 15:22:42 INFO KubernetesClusterSchedulerBackend: Decommission
executors: 2
23/03/09 15:22:42 INFO BlockManagerMasterEndpoint: Mark BlockManagers
(BlockManagerId(2, 100.96.5.10, 39011, None)) as being decommissioning.
23/03/09 15:22:44 ERROR TaskSchedulerImpl: Lost executor 3 on 100.96.5.11:
Executor decommission.
23/03/09 15:22:44 INFO ExecutorMonitor: Executor 3 is removed. Remove reason
statistics: (gracefully decommissioned: 1, decommision unfinished: 0, driver
killed: 0, unexpectedly exited: 0).
23/03/09 15:22:44 INFO DAGScheduler: Executor lost: 3 (epoch 0)
23/03/09 15:22:44 ERROR TaskSchedulerImpl: Lost executor 1 on 100.96.5.9:
Executor decommission.
23/03/09 15:22:44 INFO ExecutorMonitor: Executor 1 is removed. Remove reason
statistics: (gracefully decommissioned: 2, decommision unfinished: 0, driver
killed: 0, unexpectedly exited: 0).
23/03/09 15:22:44 ERROR TaskSchedulerImpl: Lost executor 2 on 100.96.5.10:
Executor decommission.
23/03/09 15:22:44 INFO BlockManagerMasterEndpoint: Trying to remove executor 3
from BlockManagerMaster.
23/03/09 15:22:44 INFO ExecutorMonitor: Executor 2 is removed. Remove reason
statistics: (gracefully decommissioned: 3, decommision unfinished: 0, driver
killed: 0, unexpectedly exited: 0).
23/03/09 15:22:44 INFO BlockManagerMasterEndpoint: Removing block manager
BlockManagerId(3, 100.96.5.11, 44707, None)
23/03/09 15:22:44 INFO BlockManagerMaster: Removed 3 successfully in
removeExecutor
23/03/09 15:22:44 INFO DAGScheduler: Shuffle files lost for executor: 3 (epoch
0)
23/03/09 15:22:44 INFO DAGScheduler: Executor lost: 1 (epoch 1)
23/03/09 15:22:45 INFO BlockManagerMasterEndpoint: Trying to remove executor 1
from BlockManagerMaster.
23/03/09 15:22:45 INFO BlockManagerMasterEndpoint: Removing block manager
BlockManagerId(1, 100.96.5.9, 44491, None)
23/03/09 15:22:45 INFO BlockManagerMaster: Removed 1 successfully in
removeExecutor
23/03/09 15:22:45 INFO DAGScheduler: Shuffle files lost for executor: 1 (epoch
1)
23/03/09 15:22:45 INFO DAGScheduler: Executor lost: 2 (epoch 2)
23/03/09 15:22:45 INFO BlockManagerMasterEndpoint: Trying to remove executor 2
from BlockManagerMaster.
23/03/09 15:22:45 INFO BlockManagerMasterEndpoint: Removing block manager
BlockManagerId(2, 100.96.5.10, 39011, None)
23/03/09 15:22:45 INFO BlockManagerMaster: Removed 2 successfully in
removeExecutor
23/03/09 15:22:45 INFO DAGScheduler: Shuffle files lost for executor: 2 (epoch
2)
23/03/09 15:22:52 INFO BlockManagerMaster: Removal of executor 1 requested
23/03/09 15:22:52 INFO
KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove
non-existent executor 1
{code}
The decommission logs from the executor also seem to indicate that there was
no shuffle data to migrate:
{code:java}
23/03/09 15:22:42 INFO CoarseGrainedExecutorBackend: Decommission executor 1.
23/03/09 15:22:42 INFO CoarseGrainedExecutorBackend: Will exit when finished
decommissioning
23/03/09 15:22:42 INFO BlockManager: Starting block manager decommissioning
process...
23/03/09 15:22:43 INFO CoarseGrainedExecutorBackend: Checking to see if we can
shutdown.
23/03/09 15:22:43 INFO CoarseGrainedExecutorBackend: No running tasks, checking
migrations
23/03/09 15:22:43 INFO CoarseGrainedExecutorBackend: All blocks not yet
migrated.
23/03/09 15:22:43 INFO BlockManagerDecommissioner: Starting block migration
23/03/09 15:22:43 INFO BlockManagerDecommissioner: Attempting to migrate all
RDD blocks
23/03/09 15:22:43 INFO BlockManagerDecommissioner: Attempting to migrate all
shuffle blocks
23/03/09 15:22:43 INFO BlockManagerDecommissioner: Start refreshing migratable
shuffle blocks
23/03/09 15:22:44 INFO BlockManagerDecommissioner: Attempting to migrate all
cached RDD blocks
23/03/09 15:22:44 INFO BlockManagerDecommissioner: 0 of 0 local shuffles are
added. In total, 0 shuffles are remained.
23/03/09 15:22:44 INFO BlockManagerDecommissioner: Starting shuffle block
migration thread for BlockManagerId(fallback, remote, 7337, None)
23/03/09 15:22:44 INFO BlockManagerDecommissioner: Finished current round
refreshing migratable shuffle blocks, waiting for 30000ms before the next round
refreshing.
23/03/09 15:22:44 WARN BlockManagerDecommissioner: Asked to decommission RDD
cache blocks, but no blocks to migrate
23/03/09 15:22:44 INFO BlockManagerDecommissioner: Finished current round RDD
blocks migration, waiting for 30000ms before the next round migration.
23/03/09 15:22:44 INFO CoarseGrainedExecutorBackend: Checking to see if we can
shutdown.
23/03/09 15:22:44 INFO CoarseGrainedExecutorBackend: No running tasks, checking
migrations
23/03/09 15:22:44 INFO CoarseGrainedExecutorBackend: No running tasks, all
blocks migrated, stopping.
23/03/09 15:22:44 INFO CoarseGrainedExecutorBackend: Executor self-exiting due
to : Finished decommissioning
23/03/09 15:22:44 INFO BlockManagerDecommissioner: Stop RDD blocks migration().
23/03/09 15:22:44 INFO BlockManagerDecommissioner: Stop refreshing migratable
shuffle blocks.
23/03/09 15:22:44 INFO BlockManagerDecommissioner: Stopping migrating shuffle
blocks.
23/03/09 15:22:44 INFO BlockManagerDecommissioner: Stopped block migration
23/03/09 15:22:44 INFO BlockManagerDecommissioner: Stop shuffle block
migration().
{code}
The `Shuffle files lost` message seems incorrect, as there were no shuffle
files to migrate to begin with.
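The mismatch between the two logs can be checked mechanically. The following is a minimal sketch, not part of Spark: it takes a few lines copied from the logs above (re-joined where the mail wrapped them) and compares what the driver reports with what executor 1 reports. The helper names are made up for illustration, and it assumes the second number in "N of M local shuffles are added" is the number of local shuffles the executor found.

```python
import re

# Lines copied from the driver log above (re-joined where the mail wrapped them).
DRIVER_LOG = """\
23/03/09 15:22:44 INFO DAGScheduler: Shuffle files lost for executor: 3 (epoch 0)
23/03/09 15:22:45 INFO DAGScheduler: Shuffle files lost for executor: 1 (epoch 1)
23/03/09 15:22:45 INFO DAGScheduler: Shuffle files lost for executor: 2 (epoch 2)
"""

# Line copied from the executor 1 log above.
EXECUTOR_1_LOG = """\
23/03/09 15:22:44 INFO BlockManagerDecommissioner: 0 of 0 local shuffles are added. In total, 0 shuffles are remained.
"""

LOST_RE = re.compile(r"Shuffle files lost for executor: (\d+)")
SHUFFLES_RE = re.compile(r"(\d+) of (\d+) local shuffles are added")


def executors_reported_lost(driver_log):
    """Return executor IDs for which the driver logged lost shuffle files."""
    return [int(m.group(1)) for m in LOST_RE.finditer(driver_log)]


def local_shuffle_count(executor_log):
    """Return M from 'N of M local shuffles are added', or None if absent.

    Assumption: M is the number of local shuffles the executor found.
    """
    m = SHUFFLES_RE.search(executor_log)
    return int(m.group(2)) if m else None


lost = executors_reported_lost(DRIVER_LOG)
count = local_shuffle_count(EXECUTOR_1_LOG)
if 1 in lost and count == 0:
    print("executor 1: driver reports lost shuffle files, "
          "but the executor found 0 local shuffles")
```

Run against full log files, the same two regular expressions would flag every executor for which the driver reports lost shuffle files even though the executor had nothing to migrate.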
We enabled:
- spark.decommission.enabled
- spark.storage.decommission.rddBlocks.enabled
- spark.storage.decommission.shuffleBlocks.enabled
- spark.storage.decommission.enabled
and set spark.storage.decommission.fallbackStorage.path to a path in our bucket.
The same message also appeared when there were actual shuffle files stored in
the bucket.
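For anyone trying to reproduce this, the settings listed above could be passed to spark-submit roughly as follows. This is only a sketch: the bucket path is a placeholder (the real one is not given in this report), the "true" values are our reading of "enabled", and `to_submit_args` is a hypothetical helper, not a Spark API.

```python
# Settings named in this report; "true" reflects "we enabled".
DECOMMISSION_CONF = {
    "spark.decommission.enabled": "true",
    "spark.storage.decommission.enabled": "true",
    "spark.storage.decommission.rddBlocks.enabled": "true",
    "spark.storage.decommission.shuffleBlocks.enabled": "true",
    # Placeholder value: the actual bucket path is not stated in the report.
    "spark.storage.decommission.fallbackStorage.path":
        "s3a://example-bucket/spark-fallback/",
}


def to_submit_args(conf):
    """Flatten a conf dict into spark-submit `--conf key=value` arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


args = to_submit_args(DECOMMISSION_CONF)
print(" ".join(args))
```

The printed flags can be appended to an existing spark-submit command line; the rest of that command (master URL, image, application jar) is outside the scope of this report.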
> Shuffle files lost with graceful decommission fallback storage enabled
> ----------------------------------------------------------------------
>
> Key: SPARK-42737
> URL: https://issues.apache.org/jira/browse/SPARK-42737
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.3.2
> Reporter: Yeachan Park
> Priority: Minor
>