GitHub user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/2524#issuecomment-63862394
  
    @CodingCat I think we discussed in
    https://issues.apache.org/jira/browse/SPARK-3628 that it would be best to
    do this only for result stages first. Can you do that? We can't fully
    guarantee these semantics for transformations, for two reasons:
    * A shuffle stage may be resubmitted once the old one's output is
      garbage-collected (if periodic cleanup is on)
    * If you use an accumulator in a pipelined transformation like a map(), and
      then build a new RDD on top of that (e.g. apply another map() to it), the
      recomputation won't count as the same stage, so you'll still get the
      updates twice (see the sketch after this list)
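
    To make the second point concrete, here is a minimal sketch using the
    pre-2.0 sc.accumulator API; it assumes a local master, no caching, and
    illustrative names throughout:

        import org.apache.spark.{SparkConf, SparkContext}

        object AccumulatorDoubleCount {
          def main(args: Array[String]): Unit = {
            val sc = new SparkContext(
              new SparkConf().setMaster("local").setAppName("acc-double-count"))
            val acc = sc.accumulator(0, "mapped records")

            // The accumulator update is pipelined into the map() task.
            val mapped = sc.parallelize(1 to 100).map { x => acc += 1; x }

            mapped.count()            // first job runs the map(): acc.value == 100
            mapped.map(_ + 1).count() // second job re-runs the same map() inside
                                      // a different stage: acc.value == 200

            println(acc.value)        // 200, not 100 -- updates applied twice
            sc.stop()
          }
        }

    Deduplicating updates per stage wouldn't help here, because each job
    submits its own stage over the same pipelined map().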
    
    I think we can clarify our documentation to say that accumulators offer
    this guarantee only in actions, and that elsewhere they should be treated
    more as counters. It would also lead to a *much* simpler patch, which is
    strongly preferable for a bug fix.
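
    For contrast, here is a sketch of the action-only pattern such
    documentation would point to (same assumptions and illustrative names as
    above). foreach() is an action, so its updates run in a result stage,
    which is exactly the case the proposed fix covers:

        val sum = sc.accumulator(0, "sum")
        // foreach() runs as a result stage; under the result-stage guarantee
        // discussed above, each task's updates are applied at most once.
        sc.parallelize(1 to 100).foreach(x => sum += x)
        println(sum.value) // 5050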

