[GitHub] spark pull request #23249: [SPARK-26297][SQL] improve the doc of Distributio...

cloud-fan Thu, 06 Dec 2018 07:59:14 -0800

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23249#discussion_r239508437
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala ---
    @@ -118,10 +116,13 @@ case class HashClusteredDistribution(
     
     /**
      * Represents data where tuples have been ordered according to the `ordering`
    - * [[Expression Expressions]].  This is a strictly stronger guarantee than
    - * [[ClusteredDistribution]] as an ordering will ensure that tuples that share the
    - * same value for the ordering expressions are contiguous and will never be split across
    - * partitions.
    + * [[Expression Expressions]].
    + *
    + * Tuples that share the same values for the ordering expressions must be contiguous within a
    + * partition. They can also across partitions, but these partitions must be contiguous. For example,
    + * if value `v` is the biggest values in partition 3, it can also be in partition 4 as the smallest
    + * value. If all the values in partition 4 are `v`, it can also be in partition 5 as the smallest
    + * value.
      */
     case class OrderedDistribution(ordering: Seq[SortOrder]) extends Distribution {
    --- End diff --
    
    `OrderedDistribution` is only used by sort, and sort doesn't require rows with the same value to be co-located in the same partition.
    
    Actually, we already use this knowledge to optimize `RangePartitioning.satisfy`.
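
To make the distinction concrete, here is a minimal sketch in plain Scala with no Spark dependency. The names `OrderedDistributionSketch`, `isGloballyOrdered`, `isClustered`, and `parts` are hypothetical and are not part of Spark's API. It models partitions as plain sequences and shows data that is valid for a global sort even though one value spans several adjacent partitions, which a clustered requirement would reject.

```scala
// Hypothetical stand-ins for partitioned data; none of these names are Spark APIs.
object OrderedDistributionSketch {
  type Partition = Seq[Int]

  // Ordered: reading the partitions one after another yields a sorted sequence.
  // Equal values on both sides of a partition boundary are allowed.
  def isGloballyOrdered(parts: Seq[Partition]): Boolean = {
    val flat = parts.flatten
    flat.zip(flat.drop(1)).forall { case (a, b) => a <= b }
  }

  // Clustered: every distinct value appears in exactly one partition.
  def isClustered(parts: Seq[Partition]): Boolean = {
    val homes = for {
      (p, i) <- parts.zipWithIndex
      v <- p.distinct
    } yield (v, i)
    homes.groupBy(_._1).values.forall(_.size == 1)
  }

  // The value 7 is the largest value in partition 0, fills partition 1,
  // and is the smallest value in partition 2.
  val parts: Seq[Partition] = Seq(Seq(1, 3, 7), Seq(7, 7), Seq(7, 9))

  def main(args: Array[String]): Unit = {
    println(s"ordered   = ${isGloballyOrdered(parts)}") // ordered   = true
    println(s"clustered = ${isClustered(parts)}")       // clustered = false
  }
}
```

Under these assumptions the sample data satisfies the ordering check but not the clustering check, which is the case the revised doc comment describes.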


