GitHub user squito commented on the pull request:

    https://github.com/apache/spark/pull/11105#issuecomment-198444401
  
    @rxin totally understand that this might have an unacceptable impact on 
performance.  That part remains to be explored.  For now, the focus has mostly 
been on trying to attain the desired semantics.
    
    I sort of disagree with the other points (though perhaps it depends a bit 
on interpretation).  I think holden _is_ working through this in a very 
detailed fashion.  This is labeled "rfc / wip"; it's not meant to be merged 
today.  That process is just happening in the open, so more community members 
can be involved.  For now, the focus is on getting the semantics right; you can 
see the effort going into coming up with test cases for all these different 
scenarios and making sure things make sense.  If the performance turns out to 
be unacceptable, well, then that gives us a place to work from -- we could then 
consider other solutions that perhaps compromise a bit on semantics but do not 
negatively impact performance.
    
    Can you point to the semantics you disagree with?  There is only one case I 
see from the document you mentioned -- this covers that case and plenty more as 
well:
    
    ```scala
    rdd.map { i => acc += 1; i }
    rdd.count()
    rdd.count()

    // vs.

    val rdd2 = rdd.map { i => acc += 1; i }
    rdd.count()
    rdd2.count()
    ```
    
    I agree that it's not obvious what the semantics should be here, but this 
actually addresses it in a few ways: (a) most importantly, it just chooses 
*some* semantics which are clearly defined -- the updates from each RDD are 
counted exactly once, so in the first case the value is N, and in the second 
case it's 2N (I assume in that example, `rdd` still has an accumulator 
increment in its definition?).  This is enough info for the user to decide what 
to do (they can always create a second accumulator, now that the semantics are 
understood).  And (b), it actually lets the user choose.  As holden has pointed 
out, with this approach you could also keep the accumulator value per-RDD, 
e.g. `acc.rddValue(rdd)` and `acc.rddValue(rdd2)`.  That's not currently 
exposed (just to limit the API changes), but it could be added.  This gives 
well-defined semantics and a lot of flexibility.
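
    To make (b) concrete, here's a minimal sketch of what this could look like 
from user code.  It's only an illustration: `rddValue` is the hypothetical 
accessor mentioned above (not exposed by this PR), and the app name, master, 
and input range are placeholders.

    ```scala
    import org.apache.spark.{SparkConf, SparkContext}

    object AccumulatorSemanticsDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("acc-demo").setMaster("local[2]"))
        val rdd = sc.parallelize(1 to 100)  // N = 100 elements
        val acc = sc.accumulator(0)

        val rdd2 = rdd.map { i => acc += 1; i }
        rdd2.count()
        rdd2.count()

        // Under the proposed exactly-once-per-RDD semantics, rdd2's updates
        // are applied once even though two actions ran over it, so this
        // prints 100; stock Spark would print 200.
        println(acc.value)

        // Hypothetical per-RDD view (not part of this change):
        // println(acc.rddValue(rdd2))  // updates attributed to rdd2 only

        sc.stop()
      }
    }
    ```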

