Github user brad-kaiser commented on the issue:

https://github.com/apache/spark/pull/19041

Hi @squito,

The back-and-forth communication between CacheRecoveryManager and BlockManagerMasterEndpoint exists so that we always have an up-to-date view of which executors are undergoing cache recovery, and so that we don't replicate blocks to those executors. If you look at recoverLatestBlock, we include the contents of the recoveringExecutors cache.

We could conceivably move that cache into BlockManagerMasterEndpoint, but I think that would end up being messier. I wanted to keep all the cache recovery code localized and not clutter up BlockManagerMasterEndpoint. CacheRecoveryManager and BlockManagerMasterEndpoint will also be local to the same process, so RPC calls between them should be cheap, especially compared to the time it will take to copy blocks around.

I will look into the race between removing the block and replicating the next block.

Thanks
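
To make the exclusion idea concrete, here is a rough sketch of the pattern described above, not the code in the PR. It assumes the manager keeps a set of recovering executors and passes that set along with each recovery request so the endpoint can skip those executors when picking a replication target. Every name here (CacheRecoverySketch, RecoverLatestBlock as a case class, MasterEndpointSketch, RecoveryManagerSketch, chooseTarget) is hypothetical and chosen only for illustration; the real classes are more involved and communicate over RPC.

```scala
// Hypothetical, simplified model of the exclusion logic; not the actual Spark classes.
object CacheRecoverySketch {

  // Request carrying the manager's current view of recovering executors (assumed shape).
  final case class RecoverLatestBlock(executorId: String, excluded: Set[String])

  // Stand-in for the master endpoint: picks a replication target for a block.
  final class MasterEndpointSketch(allExecutors: Set[String]) {
    def chooseTarget(msg: RecoverLatestBlock): Option[String] = {
      // Never replicate to an executor that is itself being recovered,
      // nor back to the executor the block is coming from.
      val candidates = allExecutors -- msg.excluded - msg.executorId
      // Deterministic pick for the sketch; a real policy would be smarter.
      candidates.toSeq.sorted.headOption
    }
  }

  // Stand-in for the recovery manager: owns the recoveringExecutors state.
  final class RecoveryManagerSketch(endpoint: MasterEndpointSketch) {
    private var recoveringExecutors = Set.empty[String]

    def start(executorIds: Seq[String]): Unit =
      recoveringExecutors ++= executorIds

    // Each request includes the latest set, so the endpoint's view stays current.
    def recoverNext(executorId: String): Option[String] =
      endpoint.chooseTarget(RecoverLatestBlock(executorId, recoveringExecutors))

    def finish(executorId: String): Unit =
      recoveringExecutors -= executorId
  }
}
```

With executors e1, e2 and e3, and e1 and e2 both recovering, a request for e1 can only target e3. Once e2 finishes recovery it becomes eligible again. Keeping the set in the manager means the endpoint needs no recovery-specific state of its own, which is the localization argument made above.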