Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11105#discussion_r86487454

    --- Diff: core/src/main/scala/org/apache/spark/TaskContext.scala ---
    @@ -69,6 +69,20 @@ object TaskContext {
         }
       }

    +/**
    + * Identifies where an update for a data property accumulator came from. This is important
    + * to ensure that updates are not double-counted when RDDs get recomputed. When the executors
    + * send back data property accumulator updates, they separate out the updates per RDD or
    + * shuffle output generated by the task. That gives the driver sufficient information to
    + * ensure that each update is counted exactly once. The shuffleId is important since two
    + * separate shuffle actions could happen on the same RDD, sharing the RDD ID and partition ID
    + * but with different accumulators.
    + * Any accumulators used inside of `runJob` directly are always counted because there is no
    + * resubmission of `runJob`/`foreach`.
    + */
    +private[spark] sealed trait TaskOutputId
    +private[spark] case class RDDOutputId(rddId: Int, partition: Int) extends TaskOutputId
    +private[spark] case class ShuffleMapOutputId(shuffleId: Int, partition: Int) extends TaskOutputId
    +private[spark] case class ForeachOutputId() extends TaskOutputId

    --- End diff --

    can be a `case object`
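A minimal sketch of the reviewer's suggestion, using the names from the diff (this is the suggested shape, not the code as merged): since `ForeachOutputId` carries no fields, a singleton `case object` avoids allocating a fresh instance for every update while still working in pattern matches alongside the parameterized cases.

```scala
// Sketch of the hierarchy from the diff, with the reviewer's suggested
// change applied: ForeachOutputId as a case object instead of an empty
// case class.
sealed trait TaskOutputId
case class RDDOutputId(rddId: Int, partition: Int) extends TaskOutputId
case class ShuffleMapOutputId(shuffleId: Int, partition: Int) extends TaskOutputId
// No fields, so a singleton suffices; equality and matching still work.
case object ForeachOutputId extends TaskOutputId

object TaskOutputIdDemo extends App {
  def describe(id: TaskOutputId): String = id match {
    case RDDOutputId(rddId, part)        => s"rdd $rddId, partition $part"
    case ShuffleMapOutputId(shId, part)  => s"shuffle $shId, partition $part"
    case ForeachOutputId                 => "foreach output"
  }

  println(describe(RDDOutputId(3, 0)))
  println(describe(ForeachOutputId))
  // A case object is a single shared instance:
  println(ForeachOutputId eq ForeachOutputId)
}
```

With `case class ForeachOutputId()`, every call site constructs a new (structurally equal) object; the `case object` makes the singleton nature explicit and slightly cheaper.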