Github user squito commented on the pull request: https://github.com/apache/spark/pull/11105#issuecomment-198444401

@rxin totally understand that this might have an unacceptable impact on performance. That part remains to be explored; for now the focus has mostly been on trying to attain the desired semantics.

I sort-of disagree with the other points (perhaps it depends a bit on interpretation). I think holden _is_ working through this in a very detailed fashion. This is labeled "rfc / wip"; it's not meant to be merged today. That process is just happening in the open, so more community members can be involved. For now, the focus is on getting the semantics right, and as you can see, the effort is on coming up with test cases for all these different scenarios and making sure things make sense. If the performance turns out to be unacceptable, well then that gives us a place to work from -- we could then consider other solutions that perhaps need to compromise a bit on semantics but do not negatively impact performance.

Can you point to the semantics you disagree with? There is only one case I see from the document you mentioned -- this covers that case and plenty more as well:

```scala
rdd.map { i => acc += 1; i }
rdd.count()
rdd.count()

// vs.

rdd2 = rdd.map { i => acc += 1; i }
rdd.count()
rdd2.count()
```

I agree that it's not obvious what the semantics should be here, but this actually addresses it in a few ways: (a) most importantly, it just chooses *some* semantics which are clearly defined -- the updates from each RDD are counted exactly once, so in the first case the value is N, and in the second case it's 2N (I assume in that example, `rdd` still has an accumulator increment in its definition?). This is enough info for the user to decide what to do (they can always create a second accumulator, now that the semantics are understood). And (b), it actually lets the user choose.
As holden has pointed out, with this approach you could also keep the accumulator value per-RDD, e.g. `acc.rddValue(rdd)` and `acc.rddValue(rdd2)`. That's not currently exposed (just to limit the API changes), but it could be added. This gives well-defined semantics and a lot of flexibility.
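To make the "counted exactly once per RDD" semantics and the hypothetical `rddValue` accessor concrete, here is a toy, Spark-free Scala sketch. The `ToyAccumulator` class, the string RDD ids, and the `record` method are all made-up stand-ins for illustration -- this is not the actual implementation in the PR, just a model of the semantics described above:

```scala
import scala.collection.mutable

// Toy model of the proposed semantics: each RDD's accumulator updates are
// tracked per RDD and recorded idempotently, so re-running an action on the
// same RDD does not double-count its updates.
class ToyAccumulator {
  private val perRdd = mutable.Map[String, Long]()

  // Record the total updates produced by one full computation of an RDD.
  // Idempotent: recomputing the same RDD overwrites rather than adds.
  def record(rddId: String, total: Long): Unit = perRdd(rddId) = total

  // Global value: each RDD's contribution is counted exactly once.
  def value: Long = perRdd.values.sum

  // Hypothetical per-RDD accessor, analogous to acc.rddValue(rdd).
  def rddValue(rddId: String): Long = perRdd.getOrElse(rddId, 0L)
}

val n = 5L
val acc = new ToyAccumulator

// Case 1: rdd.count(); rdd.count() -- two actions on the same RDD.
acc.record("rdd", n)
acc.record("rdd", n)              // second count() re-records, no double-count
assert(acc.value == n)            // value is N, not 2N

// Case 2: rdd2 = rdd.map { ... }; rdd.count(); rdd2.count()
// rdd2's lineage still contains rdd's increment, so both RDDs contribute.
acc.record("rdd2", n)
assert(acc.value == 2 * n)        // value is 2N
assert(acc.rddValue("rdd") == n)  // per-RDD view stays at N
```

The point of the sketch is only that once updates are attributed per RDD, both the global value and the per-RDD values fall out with well-defined answers.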