Github user tdas commented on the pull request: https://github.com/apache/spark/pull/2366#issuecomment-56293506 @mridulm Very good thoughts! I totally agree that replication is not only for streaming, and the implications of this patch in other scenarios is important to understand. Let me address them one by one. *1. Over-replication: * I can see that there is a chance that some sequence of events can cause a block to replicated to another node even if the node. But, I am trying to construct a scenario which would lead to over-replication with this patch BUT does not lead to over replication in current Spark. That is, I want to make sure that there is no regression in the behavior. Note that choosing nodes randomly can lead to more uniform memory usage in the cluster. So there is definitely an advantage to this patch. However to preserve previous behavior we can make the randomization deterministic to the block id. So if a block is recomputed on the same node where it had existed, it will get replicated to the same replication target as before. For each block, this is no different from the current behavior, hence should not create a regression in terms of the chances of over-replication. How does that sound? *2. Non-idempotent computations: * In case of such operations, there are other issues in the existing Spark as well. In the current block manager's put behavior, if a block exists locally, it is not readd as it is assumed that block contents are identical. As a result, with current Spark you can potentially have two nodes with two different versions of the same block. One can argue that this patch can increase the chances. The fix I proposed above should take care of it; the probability of that happening wont be different. *3. Under-replication:* Yes, node loss was not and is still not handled and is outside the scope of this patch. Sure would be nice to add it some day. But that definitely leads to further complexity in the BlockManager.
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org