Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23249#discussion_r239508437

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala ---
    @@ -118,10 +116,13 @@ case class HashClusteredDistribution(

     /**
      * Represents data where tuples have been ordered according to the `ordering`
    - * [[Expression Expressions]]. This is a strictly stronger guarantee than
    - * [[ClusteredDistribution]] as an ordering will ensure that tuples that share the
    - * same value for the ordering expressions are contiguous and will never be split across
    - * partitions.
    + * [[Expression Expressions]].
    + *
    + * Tuples that share the same values for the ordering expressions must be contiguous within a
    + * partition. They can also across partitions, but these partitions must be contiguous. For example,
    + * if value `v` is the biggest values in partition 3, it can also be in partition 4 as the smallest
    + * value. If all the values in partition 4 are `v`, it can also be in partition 5 as the smallest
    + * value.
      */
     case class OrderedDistribution(ordering: Seq[SortOrder]) extends Distribution {
    --- End diff --

    This is only used by sort, and sort doesn't require rows with the same value to be colocated in the same partition. Actually, we already use this knowledge to optimize `RangePartitioning.satisfy`.
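
    To make the distinction concrete, here is a minimal sketch in plain Scala, assuming integer keys and modelling partitions as `Seq[Seq[Int]]`. The object and both helper names are hypothetical and are not part of Spark's API. It mirrors the example in the new doc comment: a value may end one partition, fill the next one entirely, and start the one after that, and the data is still globally ordered even though it is not clustered.

```scala
// Hypothetical illustration only; none of these names exist in Spark.
object OrderedDistributionSketch {
  type Partition = Seq[Int]

  // Globally ordered: reading the partitions in order yields a sorted sequence.
  // Equal values may therefore sit in adjacent partitions.
  def isGloballyOrdered(partitions: Seq[Partition]): Boolean = {
    val flat = partitions.flatten
    flat.zip(flat.drop(1)).forall { case (a, b) => a <= b }
  }

  // Clustered: every distinct value lives in exactly one partition.
  def isClustered(partitions: Seq[Partition]): Boolean = {
    val owners = partitions.zipWithIndex.flatMap { case (p, i) =>
      p.distinct.map(v => v -> i)
    }
    owners.groupBy(_._1).values.forall(_.size == 1)
  }

  def main(args: Array[String]): Unit = {
    // 7 is the largest value of the first partition, the only value of the
    // second, and the smallest value of the third.
    val partitions = Seq(Seq(1, 2, 7), Seq(7, 7), Seq(7, 9))
    println(isGloballyOrdered(partitions)) // sufficient for a global sort
    println(isClustered(partitions))       // not a clustered layout
  }
}
```

    A layout like this is enough for a global sort, which is why the ordered requirement should not be documented as strictly stronger than the clustered one.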