Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/2524#issuecomment-63862394

@CodingCat I think we discussed in https://issues.apache.org/jira/browse/SPARK-3628 that it would be best to do this only for result stages first. Can you do that? The reason is that we can't fully guarantee these semantics for transformations, for two reasons:

* A shuffle stage may be resubmitted once the old one is garbage-collected (if periodic cleanup is on).
* If you use an accumulator in a pipelined transformation like a map(), and then you build a new RDD on top of that (e.g. apply another map() to it), it won't count as the same stage, so you'll still get the updates twice.

I think we can clarify our documentation to say that accumulators offer this guarantee only in actions, and should be used more as counters in other settings. It would also lead to a *much* simpler patch, which is highly preferred for a bug fix.
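As a hedged sketch of the second point above (the object name is hypothetical, and this assumes a local SparkContext and the pre-2.0 `sc.accumulator` API that was current when this PR was filed):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AccumulatorDoubleCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local").setAppName("acc-demo"))
    val acc = sc.accumulator(0)

    // The accumulator is updated inside a pipelined map().
    val mapped = sc.parallelize(1 to 10).map { x => acc += 1; x * 2 }

    // Running an action on the mapped RDD executes the map() once.
    mapped.count()

    // Building a new RDD on top and running another action is not the
    // same stage: with nothing cached, the first map() is recomputed,
    // so its accumulator updates are applied a second time.
    mapped.map(_ + 1).count()

    // With no caching, acc.value would typically be 20 here (10 per
    // execution of the map closure), not 10.
    println(acc.value)
    sc.stop()
  }
}
```

Caching `mapped` (e.g. `mapped.cache()`) would normally avoid the recomputation, but it is not a guarantee: evicted or cleaned-up partitions can still be recomputed, which is why counting semantics can only be promised for result stages.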