[GitHub] spark issue #19080: [SPARK-21865][SQL] simplify the distribution semantic of...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19080 thanks, merging to master/2.3! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19080: [SPARK-21865][SQL] simplify the distribution semantic of...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19080 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19080: [SPARK-21865][SQL] simplify the distribution semantic of...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19080 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85791/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19080: [SPARK-21865][SQL] simplify the distribution semantic of...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19080 **[Test build #85791 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85791/testReport)** for PR 19080 at commit [`a2f1bc1`](https://github.com/apache/spark/commit/a2f1bc1a077929e9d6bdb7e855debfdbac3ca03e). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class HashClusteredDistribution(expressions: Seq[Expression]) extends Distribution ` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19080: [SPARK-21865][SQL] simplify the distribution semantic of...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19080 **[Test build #85791 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85791/testReport)** for PR 19080 at commit [`a2f1bc1`](https://github.com/apache/spark/commit/a2f1bc1a077929e9d6bdb7e855debfdbac3ca03e). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19080: [SPARK-21865][SQL] simplify the distribution semantic of...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19080 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19080: [SPARK-21865][SQL] simplify the distribution semantic of...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19080 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19080: [SPARK-21865][SQL] simplify the distribution semantic of...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19080 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85786/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19080: [SPARK-21865][SQL] simplify the distribution semantic of...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19080 **[Test build #85786 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85786/testReport)** for PR 19080 at commit [`a2f1bc1`](https://github.com/apache/spark/commit/a2f1bc1a077929e9d6bdb7e855debfdbac3ca03e). * This patch **fails due to an unknown error code, -9**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class HashClusteredDistribution(expressions: Seq[Expression]) extends Distribution ` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19080: [SPARK-21865][SQL] simplify the distribution semantic of...
Github user sameeragarwal commented on the issue: https://github.com/apache/spark/pull/19080 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19080: [SPARK-21865][SQL] simplify the distribution semantic of...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19080 **[Test build #85786 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85786/testReport)** for PR 19080 at commit [`a2f1bc1`](https://github.com/apache/spark/commit/a2f1bc1a077929e9d6bdb7e855debfdbac3ca03e). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19080: [SPARK-21865][SQL] simplify the distribution semantic of...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19080 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85715/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19080: [SPARK-21865][SQL] simplify the distribution semantic of...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19080 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19080: [SPARK-21865][SQL] simplify the distribution semantic of...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19080 **[Test build #85715 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85715/testReport)** for PR 19080 at commit [`639e9b6`](https://github.com/apache/spark/commit/639e9b67ed55ea109d6a89a0319d83124bbd7d30). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `sealed trait Distribution ` * `case class HashPartitionedDistribution(expressions: Seq[Expression]) extends Distribution ` * `case class BroadcastDistribution(mode: BroadcastMode) extends Distribution ` * `case class UnknownPartitioning(numPartitions: Int) extends Partitioning` * `case class RoundRobinPartitioning(numPartitions: Int) extends Partitioning` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19080: [SPARK-21865][SQL] simplify the distribution semantic of...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19080 **[Test build #85715 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85715/testReport)** for PR 19080 at commit [`639e9b6`](https://github.com/apache/spark/commit/639e9b67ed55ea109d6a89a0319d83124bbd7d30). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19080: [SPARK-21865][SQL] simplify the distribution semantic of...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19080 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19080: [SPARK-21865][SQL] simplify the distribution semantic of...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19080 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84325/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19080: [SPARK-21865][SQL] simplify the distribution semantic of...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19080 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19080: [SPARK-21865][SQL] simplify the distribution semantic of...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19080 **[Test build #84325 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84325/testReport)** for PR 19080 at commit [`639e9b6`](https://github.com/apache/spark/commit/639e9b67ed55ea109d6a89a0319d83124bbd7d30). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `sealed trait Distribution ` * `case class HashPartitionedDistribution(expressions: Seq[Expression]) extends Distribution ` * `case class BroadcastDistribution(mode: BroadcastMode) extends Distribution ` * `case class UnknownPartitioning(numPartitions: Int) extends Partitioning` * `case class RoundRobinPartitioning(numPartitions: Int) extends Partitioning` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19080: [SPARK-21865][SQL] simplify the distribution semantic of...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19080 **[Test build #84325 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84325/testReport)** for PR 19080 at commit [`639e9b6`](https://github.com/apache/spark/commit/639e9b67ed55ea109d6a89a0319d83124bbd7d30). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19080: [SPARK-21865][SQL] simplify the distribution semantic of...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19080 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19080: [SPARK-21865][SQL] simplify the distribution semantic of...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19080 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82802/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19080: [SPARK-21865][SQL] simplify the distribution semantic of...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19080 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19080: [SPARK-21865][SQL] simplify the distribution semantic of...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19080 **[Test build #82802 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82802/testReport)** for PR 19080 at commit [`639e9b6`](https://github.com/apache/spark/commit/639e9b67ed55ea109d6a89a0319d83124bbd7d30). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `sealed trait Distribution ` * `case class HashPartitionedDistribution(expressions: Seq[Expression]) extends Distribution ` * `case class BroadcastDistribution(mode: BroadcastMode) extends Distribution ` * `case class UnknownPartitioning(numPartitions: Int) extends Partitioning` * `case class RoundRobinPartitioning(numPartitions: Int) extends Partitioning` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19080: [SPARK-21865][SQL] simplify the distribution semantic of...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19080 **[Test build #82802 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82802/testReport)** for PR 19080 at commit [`639e9b6`](https://github.com/apache/spark/commit/639e9b67ed55ea109d6a89a0319d83124bbd7d30). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19080: [SPARK-21865][SQL] simplify the distribution semantic of...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19080 cc @rxin @JoshRosen @liancheng @sameeragarwal @gatorsmile @brkyvz any more comments? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19080: [SPARK-21865][SQL] simplify the distribution semantic of...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19080 also cc @rxin , to support the "pre-shuffle" feature for data source v2, I need to create similar `Distribution` and `Partitioning` interfaces in the data source package. However, the current model is too complex, I have no idea how to ask the users to implement `Partitioning.compatibleWith` and `Partitioning.guarantees` properly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19080: [SPARK-21865][SQL] simplify the distribution semantic of...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19080 so my whole point of view is, co-partition is a really tricky requirement, and it's really hard to implicitly guarantee it during shuffle planning. We should have a weaker guarantee(same number of partitions), and let the operator itself achieve the co-partition requirement by this guarantee and special distribution requirement(`HashPartitionedDistribution`). Also in the future we may have operators that have distribution requirement for multiple children, but they don't need them to be co-partitioned. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19080: [SPARK-21865][SQL] simplify the distribution semantic of...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19080 > Both sides will satisfy the required distribution of the join This is not true now. After this PR, join has a stricter distribution requirement called `HashPartitionedDistribution`, so range partitioning doesn't satisfy it and Spark will add shuffle. I think this is reasonable, `ClusteredDistribution` can not represent the co-partition requirement of join. > Can you give a concrete example? let's take join as a example, `t1 join t2 on t1.a = t2.x and t1.b = t2.y`. Then the join requires `HashPartitionedDistribution(a, b)` for `t1`, and requires `HashPartitionedDistribution(x, y)` for `t2`. According to the definition of `HashPartitionedDistribution`, if `t1` has a tuple `(a=1,b=2)`, it will be in a certain partition, let's say the second partition of `t1`. If `t2` has a tuple `(x=1,y=2)`, it will also be in the second partition of `t2`, because Spark guarantees `t1` and `t2` have the same number of partitions, and `HashPartitionedDistribution` determines the partition given a tuple and numParttions. So we can safely zip partitions of `t1` and `t2` and do the join. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19080: [SPARK-21865][SQL] simplify the distribution semantic of...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/19080 Have a question after reading the new approach. Let's say that we have a join like `T1 JOIN T2 on T1.a = T2.a`. Also `T1` is hash partitioned by the value of `T1.a` and it has 10 partitions, and `T2` is range partitioned by the value of `T2.a` and it has 10 partitions. Both sides will satisfy the required distribution of the join. However, we need to add an exchange at either side in order to produce the correct result. How will we handle this case with this change? Also, regarding > For multiple children, Spark only guarantees they have the same number of partitions, and it's the operator's responsibility to leverage this guarantee to achieve more complicated requirements. Can you give a concrete example? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19080: [SPARK-21865][SQL] simplify the distribution semantic of...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19080 cc @yhuai --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19080: [SPARK-21865][SQL] simplify the distribution semantic of...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19080 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81264/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19080: [SPARK-21865][SQL] simplify the distribution semantic of...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19080 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19080: [SPARK-21865][SQL] simplify the distribution semantic of...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19080 **[Test build #81264 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81264/testReport)** for PR 19080 at commit [`9c870eb`](https://github.com/apache/spark/commit/9c870ebeee438b61f07105c49037a60a2ea23875). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `sealed trait Distribution ` * `case class HashPartitionedDistribution(expressions: Seq[Expression]) extends Distribution ` * `case class BroadcastDistribution(mode: BroadcastMode) extends Distribution ` * `case class UnknownPartitioning(numPartitions: Int) extends Partitioning` * `case class RoundRobinPartitioning(numPartitions: Int) extends Partitioning` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19080: [SPARK-21865][SQL] simplify the distribution semantic of...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19080 also cc @marmbrus --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19080: [SPARK-21865][SQL] simplify the distribution semantic of...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19080 **[Test build #81264 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81264/testReport)** for PR 19080 at commit [`9c870eb`](https://github.com/apache/spark/commit/9c870ebeee438b61f07105c49037a60a2ea23875). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org