[GitHub] [spark] cloud-fan commented on a change in pull request #32875: [SPARK-35703][SQL] Relax constraint for bucket join and remove HashClusteredDistribution

GitBox Wed, 10 Nov 2021 05:04:58 -0800


cloud-fan commented on a change in pull request #32875:
URL: https://github.com/apache/spark/pull/32875#discussion_r746566259




##########
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala
##########
@@ -352,3 +380,142 @@ case class BroadcastPartitioning(mode: BroadcastMode) 
extends Partitioning {
     case _ => false
   }
 }
+
+/**
+ * This is used in the scenario where an operator has multiple children (e.g., 
join) and one or more
+ * of which have their own requirement regarding whether its data can be 
considered as
+ * co-partitioned from others. This offers APIs for:
+ *
+ *   1. Comparing with specs from other children of the operator and check if 
they are compatible.
+ *      When two specs are compatible, we can say their data are 
co-partitioned, and Spark will
+ *      potentially able to eliminate shuffle if necessary.
+ *   1. Creating a partitioning that can be used to re-partition another 
child, so that to make it
+ *      having a compatible partitioning as this node.
+ */
+trait ShuffleSpec {
+  /**
+   * Returns the number of partitions of this shuffle spec
+   */
+  def numPartitions: Int
+
+  /**
+   * Returns true iff this spec is compatible with the other [[Partitioning]] 
and
+   * clustering expressions (e.g., from [[ClusteredDistribution]]).
+   *
+   * A true return value means that the data partitioning from this spec can 
be seen as
+   * co-partitioned with the `otherPartitioning`, and therefore no shuffle is 
required when
+   * joining the two sides.
+   */
+  def isCompatibleWith(other: ShuffleSpec): Boolean
+
+  /**
+   * Creates a partitioning that can be used to re-partitioned the other side 
with the given
+   * required distribution.
+   *
+   * Note: this will only be called after `isCompatibleWith` returns true on 
the side where the
+   * `clustering` is returned from.
+   */
+  final def createPartitioning(distribution: Distribution): Partitioning = 
distribution match {
+    case AllTuples =>
+      SinglePartition
+    case ClusteredDistribution(clustering, _) =>
+      createPartitioning0(clustering)
+    case _ =>
+      throw new IllegalStateException("unexpected distribution: " +
+          s"${distribution.getClass.getSimpleName}")
+  }
+
+  def createPartitioning0(clustering: Seq[Expression]): Partitioning
+}
+
+case object SinglePartitionShuffleSpec extends ShuffleSpec {
+  override def isCompatibleWith(other: ShuffleSpec): Boolean = {

Review comment:
       My understanding is, if two partitionings are compatible, their 
corresponding distribution must be compatible as well. If a binary plan 
requires its left child to be `AllTuples` and the right child to be 
`ClusteredDistribution`, I don't think it makes sense to require the children 
to be co-partitioned. Today, the only case we have is a binary node that 
requires its children to be both `ClusteredDistribution`.
   
   That said, I think `SinglePartitionShuffleSpec` should only be compatible 
with `SinglePartitionShuffleSpec`, and `createPartitioning` should take 
numPartitions and keyIndexes (not an arbitrary `Distribution`) and create 
partitioning w.r.t. its own `Distribution`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] cloud-fan commented on a change in pull request #32875: [SPARK-35703][SQL] Relax constraint for bucket join and remove HashClusteredDistribution

Reply via email to