prakharjain09 commented on issue #27864: [SPARK-20732][CORE] Decommission cache blocks to other executors when an executor is decommissioned
URL: https://github.com/apache/spark/pull/27864#issuecomment-598769144

@holdenk @cloud-fan This PR currently does not handle running tasks: what should we do with tasks that are already running on a decommissioned executor? If we start BlockManager decommissioning and stop accepting new cached RDD blocks, those already-running tasks may fail when they try to cache something. One possible approach is to kill the already-running tasks after a configurable timeout.

Executor decommissioning can be broken into two steps:

1. Compute decommissioning: stop assigning new tasks, and kill existing running tasks, possibly after a timeout.
2. Storage decommissioning: the BlockManager stops accepting new cache blocks and offloads existing cache blocks.

First do compute decommissioning, i.e. stop assigning new tasks (already done as part of SPARK-20628) and kill running tasks after some timeout (TODO). Then do storage decommissioning, i.e. stop accepting new RDD cache blocks (done as part of this PR), move existing RDD cache blocks to other possible locations (done as part of this PR), and perhaps drop any blocks the executor fails to offload after some timeout, say 5-10 cycles/retries.

Any thoughts/feedback?
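To make the proposal concrete, here is a minimal, hypothetical sketch (not this PR's actual code) of the two-phase flow described above: compute decommissioning stops new work, then storage decommissioning offloads cached blocks with a bounded retry budget before dropping them. All names here (`DecommissionSim`, `MAX_OFFLOAD_RETRIES`, the `offloadSucceeds` callback) are illustrative assumptions, not real Spark APIs.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

// Toy model of the proposed two-phase executor decommissioning.
public class DecommissionSim {
    enum Phase { RUNNING, COMPUTE_DECOMMISSIONED, STORAGE_DECOMMISSIONED }

    // Assumed retry budget; the comment above suggests 5-10 cycles.
    static final int MAX_OFFLOAD_RETRIES = 5;

    Phase phase = Phase.RUNNING;
    final Set<String> cachedBlocks = new HashSet<>();
    final Map<String, Integer> offloadAttempts = new HashMap<>(); // block -> failed tries

    // Step 1: compute decommissioning. In the real proposal this would also
    // stop task assignment and kill running tasks after a timeout.
    void decommissionCompute() {
        phase = Phase.COMPUTE_DECOMMISSIONED;
    }

    // Step 2: one storage-decommissioning cycle. `offloadSucceeds` stands in
    // for a real replicate-block-to-peer call; a block whose retry budget is
    // exhausted is dropped rather than retried forever.
    void storageDecommissionStep(Predicate<String> offloadSucceeds) {
        for (Iterator<String> it = cachedBlocks.iterator(); it.hasNext(); ) {
            String block = it.next();
            if (offloadSucceeds.test(block)) {
                it.remove(); // moved to another executor
                offloadAttempts.remove(block);
            } else {
                int tries = offloadAttempts.merge(block, 1, Integer::sum);
                if (tries >= MAX_OFFLOAD_RETRIES) {
                    it.remove(); // give up and drop the block
                }
            }
        }
        if (cachedBlocks.isEmpty()) {
            phase = Phase.STORAGE_DECOMMISSIONED;
        }
    }
}
```

The point of the sketch is the ordering and the bounded retries: storage decommissioning only begins after compute decommissioning, and an unreachable peer cannot keep the executor alive indefinitely.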