Github user tdas commented on the pull request:

    https://github.com/apache/spark/pull/2366#issuecomment-56293506
  
    @mridulm Very good thoughts! I totally agree that replication is not only 
for streaming, and the implications of this patch in other scenarios is 
important to understand. Let me address them one by one.
    
    *1. Over-replication: *
    I can see that there is a chance that some sequence of events can cause a 
block to replicated to another node even if the node. But, I am trying to 
construct a scenario which would lead to over-replication with this patch BUT 
does not lead to over replication in current Spark. That is, I want to make 
sure that there is no regression in the behavior. 
    Note that choosing nodes randomly can lead to more uniform memory usage in 
the cluster. So there is definitely an advantage to this patch. However to 
preserve previous behavior we can make the randomization deterministic to the 
block id. So if a block is recomputed on the same node where it had existed, it 
will get replicated to the same replication target as before. For each block, 
this is no different from the current behavior, hence should not create a 
regression in terms of the chances of over-replication. How does that sound?
    
    *2. Non-idempotent computations: *
    In case of such operations, there are other issues in the existing Spark as 
well. In the current block manager's put behavior, if a block exists locally, 
it is not readd as it is assumed that block contents are identical. As a 
result, with current Spark you can potentially have two nodes with two 
different versions of the same block. One can argue that this patch can 
increase the chances. The fix I proposed above should take care of it; the 
probability of that happening wont be different.
    
    *3. Under-replication:*
    Yes, node loss was not and is still not handled and is outside the scope of 
this patch. Sure would be nice to add it some day. But that definitely leads to 
further complexity in the BlockManager.
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to