Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22112#discussion_r213319641

--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -1865,6 +1871,62 @@ abstract class RDD[T: ClassTag](
   // RDD chain.
   @transient protected lazy val isBarrier_ : Boolean =
     dependencies.filter(!_.isInstanceOf[ShuffleDependency[_, _, _]]).exists(_.rdd.isBarrier())
+
+  /**
+   * Returns the random level of this RDD's output. Please refer to [[RandomLevel]] for the
+   * definition.
+   *
+   * By default, a reliably checkpointed RDD, or an RDD without parents (a root RDD), is
+   * IDEMPOTENT. For RDDs with parents, we generate a random level candidate per parent
+   * according to the dependency. The random level of the current RDD is the most random of
+   * those candidates. Please override [[getOutputRandomLevel]] to provide custom logic for
+   * calculating the output random level.
+   */
+  // TODO: make it public so users can set the random level of their custom RDDs.
+  // TODO: this can be per-partition, e.g. UnionRDD can have a different random level for
+  // different partitions.
+  private[spark] final lazy val outputRandomLevel: RandomLevel.Value = {
+    if (checkpointData.exists(_.isInstanceOf[ReliableRDDCheckpointData[_]])) {
--- End diff --

yeah, though I guess I'm also saying: this problem is worse than we expected, since checkpointing is not a good way to cope with the added cost.
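For context, a minimal standalone sketch of the "most random candidate wins" rule the Scaladoc describes. This is not the PR's actual code: only `RandomLevel` and `IDEMPOTENT` appear in the diff, so the other level names (`UNORDERED`, `INDETERMINATE`) and the helper `mostRandom` are assumptions for illustration, with levels ordered from least to most random.

```scala
// Hypothetical stand-in for the PR's RandomLevel enumeration.
// Declaration order encodes "randomness": IDEMPOTENT < UNORDERED < INDETERMINATE.
object RandomLevel extends Enumeration {
  val IDEMPOTENT, UNORDERED, INDETERMINATE = Value
}

object RandomLevelDemo {
  // An RDD with no parents (a root RDD) defaults to IDEMPOTENT; otherwise
  // the output level is the most random of the per-parent candidates.
  def mostRandom(candidates: Seq[RandomLevel.Value]): RandomLevel.Value =
    if (candidates.isEmpty) RandomLevel.IDEMPOTENT
    else candidates.maxBy(_.id)

  def main(args: Array[String]): Unit = {
    println(mostRandom(Nil))                                                  // IDEMPOTENT
    println(mostRandom(Seq(RandomLevel.IDEMPOTENT, RandomLevel.UNORDERED)))   // UNORDERED
  }
}
```

This keeps the property the Scaladoc relies on: one non-deterministic parent is enough to make the whole output non-deterministic, regardless of how well-behaved the other parents are.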