ivoson commented on code in PR #39459: URL: https://github.com/apache/spark/pull/39459#discussion_r1096520993
##########
core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala:
##########
@@ -77,6 +77,11 @@ class BlockManagerMasterEndpoint(
   // Mapping from block id to the set of block managers that have the block.
   private val blockLocations = new JHashMap[BlockId, mutable.HashSet[BlockManagerId]]
+  // Mapping from task id to the set of rdd blocks which are generated from the task.
+  private val tidToRddBlockIds = new mutable.HashMap[Long, mutable.HashSet[RDDBlockId]]
+  // Record the visible RDD blocks which have been generated at least from one successful task.
+  private val visibleRDDBlocks = new mutable.HashSet[RDDBlockId]

Review Comment:
   Just found one problem with tracking the invisible RDD blocks instead. If we track invisible RDD blocks, we would treat an RDD block as visible (i.e. the cache can be used) only when it exists in `blockLocations` and does not exist in `invisibleRDDBlocks`. When the block is removed from `blockLocations` (for example, because an executor is lost), we lose the information that the block was visible. The newly cached data then can't be used as early as it could be, which is right after the cache is generated and reported to the master.
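
   To make the scenario concrete, here is a minimal, self-contained Scala sketch. It is not the PR's code. `Tracker`, `VisibleTracker`, `InvisibleTracker`, `VisibilityDemo.replay`, the simplified `RDDBlockId` and the string executor ids are all hypothetical names for illustration. The rule that the invisible variant marks a block invisible whenever it has no known location is my assumption about how that design would have to behave.

```scala
import scala.collection.mutable

// Hypothetical, simplified stand-in; not the real Spark class.
final case class RDDBlockId(rddId: Int, splitIndex: Int)

// Shared location bookkeeping, loosely modelled on `blockLocations`.
abstract class Tracker {
  protected val blockLocations =
    mutable.HashMap.empty[RDDBlockId, mutable.HashSet[String]]

  def reportBlock(id: RDDBlockId, executor: String): Unit = {
    blockLocations.getOrElseUpdate(id, mutable.HashSet.empty[String]) += executor
  }

  def taskSucceeded(id: RDDBlockId): Unit

  // Drop the executor everywhere, then forget blocks with no location left.
  def executorLost(executor: String): Unit = {
    blockLocations.values.foreach(_ -= executor)
    val gone = blockLocations.collect { case (id, execs) if execs.isEmpty => id }.toList
    gone.foreach(id => blockLocations.remove(id))
  }

  def isUsable(id: RDDBlockId): Boolean
}

// Option A: remember which blocks are visible. This set survives executor loss.
class VisibleTracker extends Tracker {
  private val visibleRDDBlocks = mutable.HashSet.empty[RDDBlockId]

  def taskSucceeded(id: RDDBlockId): Unit = { visibleRDDBlocks += id }

  def isUsable(id: RDDBlockId): Boolean =
    blockLocations.contains(id) && visibleRDDBlocks.contains(id)
}

// Option B: remember which blocks are invisible.
class InvisibleTracker extends Tracker {
  private val invisibleRDDBlocks = mutable.HashSet.empty[RDDBlockId]

  override def reportBlock(id: RDDBlockId, executor: String): Unit = {
    // Assumption: a block with no known location looks brand new,
    // so it is treated as invisible until some task succeeds.
    if (!blockLocations.contains(id)) invisibleRDDBlocks += id
    super.reportBlock(id, executor)
  }

  def taskSucceeded(id: RDDBlockId): Unit = { invisibleRDDBlocks -= id }

  def isUsable(id: RDDBlockId): Boolean =
    blockLocations.contains(id) && !invisibleRDDBlocks.contains(id)
}

object VisibilityDemo {
  // Replays: cache on exec-1, task succeeds, exec-1 is lost,
  // then a still-running task re-caches the block on exec-2.
  def replay(t: Tracker): Boolean = {
    val block = RDDBlockId(0, 0)
    t.reportBlock(block, "exec-1")
    t.taskSucceeded(block)
    t.executorLost("exec-1")
    t.reportBlock(block, "exec-2")
    t.isUsable(block)
  }

  def main(args: Array[String]): Unit = {
    println(s"visible tracking usable:   ${replay(new VisibleTracker)}")
    println(s"invisible tracking usable: ${replay(new InvisibleTracker)}")
  }
}
```

   Under these assumptions, the visible-set variant can serve the re-cached block as soon as it is reported, while the invisible-set variant has to wait for another successful task, because it can no longer tell that the block had already been made visible.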