Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19080#discussion_r160079142

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala ---
    @@ -51,12 +76,41 @@ case object AllTuples extends Distribution
      */
     case class ClusteredDistribution(
         clustering: Seq[Expression],
    -    numPartitions: Option[Int] = None) extends Distribution {
    +    requiredNumPartitions: Option[Int] = None) extends Distribution {
       require(
         clustering != Nil,
         "The clustering expressions of a ClusteredDistribution should not be Nil. " +
           "An AllTuples should be used to represent a distribution that only has " +
           "a single partition.")
    +
    +  override def createPartitioning(numPartitions: Int): Partitioning = {
    +    assert(requiredNumPartitions.isEmpty || requiredNumPartitions.get == numPartitions,
    +      s"This ClusteredDistribution requires ${requiredNumPartitions.get} partitions, but " +
    +        s"the actual number of partitions is $numPartitions.")
    +    HashPartitioning(clustering, numPartitions)
    +  }
    +}
    +
    +/**
    + * Represents data where tuples have been partitioned according to the hash of the given
    + * `expressions`. The hash function is defined as `HashPartitioning.partitionIdExpression`, so only
    + * [[HashPartitioning]] can satisfy this distribution.
    + *
    + * This is a strictly stronger guarantee than [[ClusteredDistribution]]. Given a tuple and the
    + * number of partitions, this distribution strictly requires which partition the tuple should be in.
    + */
    +case class HashPartitionedDistribution(expressions: Seq[Expression]) extends Distribution {
    --- End diff --

    good idea, I'll rename it to `HashClusteredDistribution`. But I'd prefer not to extend `ClusteredDistribution`: a partitioning that satisfies `ClusteredDistribution` may not satisfy `HashClusteredDistribution`, so the child could not safely replace the parent, which would violate the Liskov Substitution Principle.
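    To make the substitution concern concrete, here is a minimal sketch of the argument. It uses simplified stand-ins (String keys instead of Catalyst Expressions, a toy `satisfies` reduced to equality checks, and the hypothetical demo object `LspDemo`), not the real Spark classes:

    // Minimal sketch with simplified stand-ins -- not the real Catalyst classes.
    sealed trait Distribution
    case class ClusteredDistribution(clustering: Seq[String]) extends Distribution
    case class HashClusteredDistribution(expressions: Seq[String]) extends Distribution

    trait Partitioning {
      def satisfies(required: Distribution): Boolean
    }

    // Range partitioning co-locates equal keys, which is enough for
    // ClusteredDistribution, but it does not place rows by the specific
    // hash function that HashClusteredDistribution demands.
    case class RangePartitioning(ordering: Seq[String]) extends Partitioning {
      override def satisfies(required: Distribution): Boolean = required match {
        case ClusteredDistribution(clustering) => ordering == clustering
        case _                                 => false
      }
    }

    object LspDemo extends App {
      val p = RangePartitioning(Seq("a"))
      assert(p.satisfies(ClusteredDistribution(Seq("a"))))      // satisfied
      assert(!p.satisfies(HashClusteredDistribution(Seq("a")))) // not satisfied
      // If HashClusteredDistribution were a subclass of ClusteredDistribution,
      // the first match arm above would accept p for the stricter requirement
      // too: the child could not stand in for the parent without changing
      // behavior, i.e. the LSP violation described in the comment.
    }

    Keeping `HashClusteredDistribution` as a direct `Distribution` subclass avoids this trap, arguably at the cost of a little duplication between the two case classes.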