[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user jaceklaskowski commented on the issue: https://github.com/apache/spark/pull/15575 I'm late with this, but just leaving it for future code reviewers... I think the change took the most extreme path where even such simple `outputPartitioning` as the one in `WindowExec` (see below) is currently part of the extension of `UnaryExecNode` while I think it should stay with `UnaryExecNode` and be overridden when the output partitioning is different. It's a kind of code duplication. ``` override def outputPartitioning: Partitioning = child.outputPartitioning ``` The above serves nothing but says _"I simply don't care what happens upstream"_. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/15575 LGTM - merging to master. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15575 yeah, LGTM, it doesn't change current outputPartitioning of operators. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15575 LGTM too. Unforunately my internet sucks (on a plane) and I can't merge this right now. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15575 LGTM, since the scope of this PR is just refactoring. Let me first post the existing code for `outputPartitioning` in `ExpandExec`: ```Scala // The GroupExpressions can output data with arbitrary partitioning, so set it // as UNKNOWN partitioning override def outputPartitioning: Partitioning = UnknownPartitioning(0) ``` It makes sense to set it either `UnknownPartitioning` or `child.outputPartitioning`. However, the above code is setting to a wrong number of partitions. We need to correct it no matter whether we are using the number or not. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15575 In practice, setting the `outputPartitioning` of a physical plan like `ExpandExec` to `child.outputPartitioning` doesn't cause any real problem, even this physical plan doesn't keep the same row distribution of its child. That is because if the physical plan changes output, it will have different output attributes, e.g., `col` to `col'` as @tejasapatil pointed out. If its parent plan requires a distribution, says `HashPartition`, this distribution will bound to the physical plan's output `col'`, instead of its child plan's `col`. So even the physical plan uses `child.outputPartitioning`, `EnsureRequirements ` will step in and inject an extra shuffle exchange of `HashPartition(col')` to satisfy the requirement. It works like that as per my understanding. However it doesn't mean the physical plan's output partitioning is exactly as same as its child's, i.e., `HashPartition(col)`, because it doesn't have the output `col`. This part might be confusing to some people, so I think it might be better to explain it more. That is what I understood about this, if I am wrong please kindly point out. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15575 @tejasapatil yeah, that is correct. however, I am wondering if we can say this `ExpandExec` have the same distribution of rows as its child...because it even doesn't have the `col`... --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user tejasapatil commented on the issue: https://github.com/apache/spark/pull/15575 @viirya : As per my understanding, if the child operator emits `col`, after applying `ExpandExec`, the output is `col'`. The original child partitioning being over `col`, `ExpandExec` does not seem to alter that. The table above was to summarise the state of things before this PR. I did not change any semantics in this PR as its pure refactoring. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15575 @rxin yeah, I am curious why `ExpandExec` and `GenerateExec` have different `outputPartitioning`... --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15575 @tejasapatil I see there is 1:1 mapping among output partition of child operator and output partition of `ExpandExec`. For example we have an Expand applying on a data set like col: [1, 2, 3]. If the projections are col, col + 1, col + 2. Assume the partition of the data set is HashPartition(col). We have three partitions: p1: [1] p2: [2] p3: [3] After the Expand, the data set becomes: p1: [1, 2, 3] p2: [2, 3, 4] p3: [3, 4, 5] Is it still valid for HashPartition(col)? Looks like it doesn't. I think It is why there is a comment on ExpandExec in the code position you links to. BTW, in your table `ExpandExec`'s `outputPartitioning` is `UnknownPartitioning`, right? If it doesn't change child's partition, why we don't set it to child's outputPartitioning? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user tejasapatil commented on the issue: https://github.com/apache/spark/pull/15575 @viirya >> However, if its child has certain partition such as HashPartition, after ExpandExec it becomes a UnknownPartitioning The notion of `Partitioning` in Spark is the distribution of rows across tasks. Even if the child's output has `HashPartitioning`, there is a 1:1 mapping among output partition of child operator and output partition of `ExpandExec`. So, applying `ExpandExec` does not alter the partitioning of child outputs. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15575 The current thing LGTM. cc @yhuai do you have any other feedback? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user tejasapatil commented on the issue: https://github.com/apache/spark/pull/15575 @viirya >> In the table in the description, CoalesceExec output UnknownPartitioning Yes. Since partitions == 1 is a corner case, I did not put that in the table. If you look at the code, its doing the right thing: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala#L490 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15575 @viirya of course if you say coalesce(1) it is a single partition -- any operator that changes partition to 1 partition is single partition. For Expand isn't it just the same as Generate? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15575 @rxin In the table in the description, `CoalesceExec` output `UnknownPartitioning`, actually it can be `SinglePartition` if what you do is `coalesce(1)`. `ExpandExec` doesn't actually move rows across partitions as @tejasapatil pointed out. However, if its child has certain partition such as `HashPartition`, after `ExpandExec` it becomes a `UnknownPartitioning`. I am not sure if it does change the partitioning or not. From the view of output partition of the physical plan, it is changed indeed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15575 @tejasapatil you were linking to a fb internal branch :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15575 CoalesceExec is not SinglePartition. Coalesce can lead to an arbitrary number of partitions. ExpandExec doesn't really change the partitioning either. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15575 But `ExpandExec`'s `GroupExpressions` could generate data with arbitrary partitioning, so even it doesn't move rows across input partitions, doesn't it still can change child' output partition? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15575 `CoalesceExec` can also have `SinglePartition` as `outputPartitioning` too. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user tejasapatil commented on the issue: https://github.com/apache/spark/pull/15575 @rxin : Yes. There was a comment over there [0] but based on code it does not look like its moving rows across input partitions (cc @chenghao-intel who added that). [0] : https://github.com/facebook/FB-Spark/blob/d26c42982c18da8fb1b21c9eb75aef9364d1b992/sql/core/src/main/scala/org/apache/spark/sql/execution/Expand.scala#L45 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15575 ExpandExec should follow child's output partitioning and ordering, shouldn't it? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15575 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15575 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67320/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15575 **[Test build #67320 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67320/consoleFull)** for PR 15575 at commit [`cfcd2a7`](https://github.com/apache/spark/commit/cfcd2a79231ed16c2a3e82551e87a6de888c2af6). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15575 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15575 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67319/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15575 **[Test build #67319 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67319/consoleFull)** for PR 15575 at commit [`2eee2ee`](https://github.com/apache/spark/commit/2eee2eea81d665706a76a5614d26b4401a07f127). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15575 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15575 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67318/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15575 **[Test build #67318 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67318/consoleFull)** for PR 15575 at commit [`c5b6626`](https://github.com/apache/spark/commit/c5b6626c144a97054f74adc3158ceea6de3cea58). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user tejasapatil commented on the issue: https://github.com/apache/spark/pull/15575 @yhuai : please see the table in this PRs description. I have added a `comment` (last column) for each entry to point out those cases. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/15575 > I felt that there are numerous places where child's output ordering could be used but the operators don't set it Can you list them at here? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15575 **[Test build #67320 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67320/consoleFull)** for PR 15575 at commit [`cfcd2a7`](https://github.com/apache/spark/commit/cfcd2a79231ed16c2a3e82551e87a6de888c2af6). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user tejasapatil commented on the issue: https://github.com/apache/spark/pull/15575 Agree with the planner behavior described in the last few comments (relevant code : https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala#L165) I have updated the description of the PR with a table which shows the output partitioning and ordering for each implementation of `UnaryExecNode`. It will help with the review in case we are missing something / setting something wrong. `BroadcastExchangeExec `, `CoalesceExec `, `CollectLimitExec `, `ExpandExec`, `TakeOrderedAndProjectExec` and `ShuffleExchange` do not use child's output partitioning. I felt that there are numerous places where child's output ordering could be used but the operators don't set it (thus using default `Nil`). I won't made those modifications in this PR as I want to only do the refac in this diff. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15575 **[Test build #67319 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67319/consoleFull)** for PR 15575 at commit [`2eee2ee`](https://github.com/apache/spark/commit/2eee2eea81d665706a76a5614d26b4401a07f127). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15575 Yeah. So I don't see any exception that an unary node, if it is not `ShuffleExchange`, can have an `outputPartitioning` other than `child.outputPartitioning`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/15575 Our planner decides if to add an `ShuffleExchange` by consider `outputPartitioning` and `requiredDistribution` together. If the `outputPartitioning` of the child does not satisfy the `requiredDistribution` of the parent, we will add a `ShuffleExchange` operator. When we say `child`, it is literally the child of a node. It does not matter if the child is a `ShuffleExchange` or not. For a node, when its `outputPartitioning` is `child. outputPartitioning`, this node does not shuffle data. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15575 **[Test build #67311 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67311/consoleFull)** for PR 15575 at commit [`f9a4420`](https://github.com/apache/spark/commit/f9a442080d73dd42682a20f0a1ee6e5844abb938). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15575 The query planner will inject exchange based on the required distribution for the child of `UnaryNodeExec`. Looks like all unary nodes have `child.outputPartitioning` as its output partition. I can't see an exception based on the fact that only `ShuffleExchange` will change `outputPartitioning`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15575 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67311/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15575 @yhuai let me be more clear. It is actually fairly confusing to say "this only changes if the operator doesn't shuffle data". The thing is that we are relying on each operator's output partitioning to determine injecting exchanges. So is it safe to depend on "child", which is ambiguous in this context? Child could mean the current child before injecting exchange, or it could mean the injected exchange operator. Can the query planner incorrectly not injecting exchange because a plan node is deemed already having the correct output partitioning? Or is this always guaranteed to not happen due to certain ordering we enforce in planning. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15575 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15575 **[Test build #67318 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67318/consoleFull)** for PR 15575 at commit [`c5b6626`](https://github.com/apache/spark/commit/c5b6626c144a97054f74adc3158ceea6de3cea58). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/15575 `outputPartitioning` of a node will only be changed if this node shuffles data. Right now, only `ShuffleExchange` shuffles data. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15575 **[Test build #67311 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67311/consoleFull)** for PR 15575 at commit [`f9a4420`](https://github.com/apache/spark/commit/f9a442080d73dd42682a20f0a1ee6e5844abb938). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15575 **[Test build #67309 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67309/consoleFull)** for PR 15575 at commit [`85fb132`](https://github.com/apache/spark/commit/85fb132196a700072e5a8e70ebf17ae24d7958e4). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15575 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67309/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15575 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15575 **[Test build #67309 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67309/consoleFull)** for PR 15575 at commit [`85fb132`](https://github.com/apache/spark/commit/85fb132196a700072e5a8e70ebf17ae24d7958e4). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user tejasapatil commented on the issue: https://github.com/apache/spark/pull/15575 Jenkins test this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org