Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22112#discussion_r213319641

--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -1865,6 +1871,62 @@ abstract class RDD[T: ClassTag](
   // RDD chain.
   @transient protected lazy val isBarrier_ : Boolean =
     dependencies.filter(!_.isInstanceOf[ShuffleDependency[_, _, _]]).exists(_.rdd.isBarrier())
+
+  /**
+   * Returns the random level of this RDD's output. Please refer to [[RandomLevel]] for the
+   * definition.
+   *
+   * By default, a reliably checkpointed RDD, or an RDD without parents (a root RDD), is
+   * IDEMPOTENT. For RDDs with parents, we generate a random level candidate per parent
+   * according to the dependency. The random level of the current RDD is the most random of
+   * those candidates. Please override [[getOutputRandomLevel]] to provide custom logic for
+   * calculating the output random level.
+   */
+  // TODO: make it public so users can set the random level of their custom RDDs.
+  // TODO: this can be per-partition, e.g. UnionRDD can have a different random level for
+  // different partitions.
+  private[spark] final lazy val outputRandomLevel: RandomLevel.Value = {
+    if (checkpointData.exists(_.isInstanceOf[ReliableRDDCheckpointData[_]])) {
--- End diff --

yeah, though I guess I'm also saying: this problem is worse than we expected, since checkpointing is not a good way to cope with the added cost.
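For context, a minimal standalone sketch of the "most random candidate wins" rule the Scaladoc describes. This is not the PR's actual code: only `RandomLevel` and `IDEMPOTENT` appear in the diff, so the other level names (`UNORDERED`, `INDETERMINATE`) and the helper `mostRandom` are assumptions for illustration, with levels ordered from least to most random.

```scala
// Hypothetical stand-in for the PR's RandomLevel enumeration.
// Declaration order encodes "randomness": IDEMPOTENT < UNORDERED < INDETERMINATE.
object RandomLevel extends Enumeration {
  val IDEMPOTENT, UNORDERED, INDETERMINATE = Value
}

object RandomLevelDemo {
  // An RDD with no parents (a root RDD) defaults to IDEMPOTENT; otherwise
  // the output level is the most random of the per-parent candidates.
  def mostRandom(candidates: Seq[RandomLevel.Value]): RandomLevel.Value =
    if (candidates.isEmpty) RandomLevel.IDEMPOTENT
    else candidates.maxBy(_.id)

  def main(args: Array[String]): Unit = {
    println(mostRandom(Nil))                                                  // IDEMPOTENT
    println(mostRandom(Seq(RandomLevel.IDEMPOTENT, RandomLevel.UNORDERED)))   // UNORDERED
  }
}
```

This keeps the property the Scaladoc relies on: one non-deterministic parent is enough to make the whole output non-deterministic, regardless of how well-behaved the other parents are.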