Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22112#discussion_r210756079

    --- Diff: core/src/main/scala/org/apache/spark/Dependency.scala ---
    @@ -94,6 +94,16 @@ class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](
         shuffleId, _rdd.partitions.length, this)

       _rdd.sparkContext.cleaner.foreach(_.registerShuffleForCleanup(this))
    +
    +  /**
    +   * whether the partitioner is order sensitive to the input data and will partition shuffle output
    +   * differently if the input data order changes. For example, hash and range partitioners are
    +   * order insensitive, round-robin partitioner is order sensitive. This is a property of
    +   * `ShuffleDependency` instead of `Partitioner`, because it's common that a map task partitions
    +   * its output by itself and use a dummy partitioner later.
    +   */
    +  // This is defined as a `var` here instead of the constructor, to pass the mima check.
    +  private[spark] var orderSensitivePartitioner: Boolean = false
    --- End diff --

Output order is orthogonal to what the partitioner is - it is enforced by whether `keyOrdering` is defined in the shuffle dependency. We should not associate order sensitivity with the partitioner (which has no influence on order).

> "For example, hash and range partitioners are order insensitive, round-robin partitioner is order sensitive."

There is no round-robin partitioner in spark core. Similarly:

> "... because it's common that a map task partitions its output by itself and use a dummy partitioner later."

Tasks do not partition - that is the responsibility of the partitioner. There can be implementations where tasks and the partitioner work in tandem - but that is an impl detail of user code (`distributePartition` is essentially user code which happens to be bundled as part of Spark).

In general, something like this would suffice here (whether the rdd is ordered or not):

`def isOrdered = keyOrdering.isDefined`
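To make the distinction concrete, here is a minimal standalone sketch - plain Scala, not Spark internals; `hashPartition` and `roundRobinPartition` are made-up helpers for illustration. A hash-style assignment is a pure function of each element, so reordering the input leaves the element-to-partition mapping unchanged; a round-robin style assignment (the kind of thing user-level code like `distributePartition` does) depends on each element's position, so reordering the input changes which partition an element lands in:

```scala
// Minimal sketch, not Spark code: `hashPartition` and `roundRobinPartition`
// are illustrative stand-ins for the two partitioning styles being discussed.
object OrderSensitivitySketch {
  val numPartitions = 3

  // Hash-style assignment: a pure function of the element, so the
  // element-to-partition mapping is identical for any input order.
  def hashPartition[T](elem: T): Int =
    ((elem.hashCode % numPartitions) + numPartitions) % numPartitions

  // Round-robin style assignment: a function of the element's *position*,
  // so reordering the input changes where elements land.
  def roundRobinPartition[T](elems: Seq[T]): Seq[(Int, T)] =
    elems.zipWithIndex.map { case (e, i) => (i % numPartitions, e) }

  def main(args: Array[String]): Unit = {
    val data      = Seq("a", "b", "c", "d")
    val reordered = data.reverse

    // Order insensitive: same (element, partition) pairs either way.
    assert(
      data.map(e => (e, hashPartition(e))).toSet ==
      reordered.map(e => (e, hashPartition(e))).toSet)

    // Order sensitive: "b" and "c" swap partitions after the reorder.
    println(roundRobinPartition(data))      // List((0,a), (1,b), (2,c), (0,d))
    println(roundRobinPartition(reordered)) // List((0,d), (1,c), (2,b), (0,a))
  }
}
```

Note that neither style says anything about the order of records *within* a partition - that guarantee comes from `keyOrdering` on the `ShuffleDependency`, which is why a check along the lines of `keyOrdering.isDefined` would capture it.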