Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22112#discussion_r210756079

    --- Diff: core/src/main/scala/org/apache/spark/Dependency.scala ---
    @@ -94,6 +94,16 @@ class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](
         shuffleId, _rdd.partitions.length, this)

       _rdd.sparkContext.cleaner.foreach(_.registerShuffleForCleanup(this))
    +
    +  /**
    +   * whether the partitioner is order sensitive to the input data and will partition shuffle output
    +   * differently if the input data order changes. For example, hash and range partitioners are
    +   * order insensitive, round-robin partitioner is order sensitive. This is a property of
    +   * `ShuffleDependency` instead of `Partitioner`, because it's common that a map task partitions
    +   * its output by itself and use a dummy partitioner later.
    +   */
    +  // This is defined as a `var` here instead of the constructor, to pass the mima check.
    +  private[spark] var orderSensitivePartitioner: Boolean = false
    --- End diff --

Output order is orthogonal to what the partitioner is - it is enforced by whether `keyOrdering` is defined in the shuffle dependency. We should not associate order sensitivity with the partitioner (which has no influence on order).

> "For example, hash and range partitioners are order insensitive, round-robin partitioner is order sensitive."

There is no round-robin partitioner in spark core. Similarly:

> "... because it's common that a map task partitions its output by itself and use a dummy partitioner later."

Tasks do not partition - that is the responsibility of the partitioner. There can be implementations where tasks and the partitioner work in tandem - but that is an impl detail of user code (`distributePartition` is essentially user code which happens to be bundled as part of Spark).

In general, something like this would suffice here (whether the rdd is ordered or not):

`def isOrdered = keyOrdering.isDefined`
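To make the distinction concrete, here is a minimal standalone sketch - plain Scala, not Spark internals; `hashPartition` and `roundRobinPartition` are made-up helpers for illustration. A hash-style assignment is a pure function of each element, so reordering the input leaves the element-to-partition mapping unchanged; a round-robin style assignment (the kind of thing user-level code like `distributePartition` does) depends on each element's position, so reordering the input changes which partition an element lands in:

```scala
// Minimal sketch, not Spark code: `hashPartition` and `roundRobinPartition`
// are illustrative stand-ins for the two partitioning styles being discussed.
object OrderSensitivitySketch {
  val numPartitions = 3

  // Hash-style assignment: a pure function of the element, so the
  // element-to-partition mapping is identical for any input order.
  def hashPartition[T](elem: T): Int =
    ((elem.hashCode % numPartitions) + numPartitions) % numPartitions

  // Round-robin style assignment: a function of the element's *position*,
  // so reordering the input changes where elements land.
  def roundRobinPartition[T](elems: Seq[T]): Seq[(Int, T)] =
    elems.zipWithIndex.map { case (e, i) => (i % numPartitions, e) }

  def main(args: Array[String]): Unit = {
    val data      = Seq("a", "b", "c", "d")
    val reordered = data.reverse

    // Order insensitive: same (element, partition) pairs either way.
    assert(
      data.map(e => (e, hashPartition(e))).toSet ==
      reordered.map(e => (e, hashPartition(e))).toSet)

    // Order sensitive: "b" and "c" swap partitions after the reorder.
    println(roundRobinPartition(data))      // List((0,a), (1,b), (2,c), (0,d))
    println(roundRobinPartition(reordered)) // List((0,d), (1,c), (2,b), (0,a))
  }
}
```

Note that neither style says anything about the order of records *within* a partition - that guarantee comes from `keyOrdering` on the `ShuffleDependency`, which is why a check along the lines of `keyOrdering.isDefined` would capture it.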