[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/10577#discussion_r48820132 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -595,6 +598,22 @@ object BooleanSimplification extends Rule[LogicalPlan] with PredicateHelper { } /** + * Combines all adjacent [[Union]] and [[Unions]] operators into a single [[Unions]]. + */ +object CombineUnions extends Rule[LogicalPlan] { + private def collectUnionChildren(plan: LogicalPlan): Seq[LogicalPlan] = plan match { +case Union(l, r) => collectUnionChildren(l) ++ collectUnionChildren(r) --- End diff -- Another option would just be to do this at construction time, that way we can avoid paying the cost in the analyzer. This would still limit the cases we could cache (i.e. we'd miss cached data unioned with other data), but that doesn't seem like a huge deal. I'd leave this rule here either way. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/10577#discussion_r48820999 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -595,6 +598,22 @@ object BooleanSimplification extends Rule[LogicalPlan] with PredicateHelper { } /** + * Combines all adjacent [[Union]] and [[Unions]] operators into a single [[Unions]]. + */ +object CombineUnions extends Rule[LogicalPlan] { + private def collectUnionChildren(plan: LogicalPlan): Seq[LogicalPlan] = plan match { +case Union(l, r) => collectUnionChildren(l) ++ collectUnionChildren(r) --- End diff -- To do this at construction time, we need to introduce a new Dataframe API `unionAll` that can combine more than two Dataframes? @marmbrus @rxin Is my understanding correct? Thank you! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/10577#discussion_r48820341 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -595,6 +598,22 @@ object BooleanSimplification extends Rule[LogicalPlan] with PredicateHelper { } /** + * Combines all adjacent [[Union]] and [[Unions]] operators into a single [[Unions]]. + */ +object CombineUnions extends Rule[LogicalPlan] { + private def collectUnionChildren(plan: LogicalPlan): Seq[LogicalPlan] = plan match { +case Union(l, r) => collectUnionChildren(l) ++ collectUnionChildren(r) --- End diff -- +1 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10577#issuecomment-168943487 Todo: - Will add the new `Dataframe` and `Dataset` APIs for `unionAll`, if my understanding is correct. - Will add another rule for pushing `Filter` and `Project` through `Unions`. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10577#issuecomment-168943889 **[Test build #48756 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48756/consoleFull)** for PR 10577 at commit [`c1f66f7`](https://github.com/apache/spark/commit/c1f66f744fce35eb657f9ec8a971dbd5449d0985). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/10577#issuecomment-169077567 > Will add the new Dataframe and Dataset APIs for unionAll, if my understanding is correct. You don't need to add any new APIs, just call the optimizer rule directly on any existing API that adds a `Union`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/10577#discussion_r48872458 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala --- @@ -169,18 +169,3 @@ object ExtractFiltersAndInnerJoins extends PredicateHelper { case _ => None } } - -/** - * A pattern that collects all adjacent unions and returns their children as a Seq. - */ -object Unions { --- End diff -- I'm not sure I would get rid of this, just use it in your optimization rule. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10577#issuecomment-169216470 Major changes in this commit: - Combine the Union during the construction (i.e., in the DataFrame and Dataset APIs) - Use tail recursive with extractor object to reimplement the rule `CombineUnions` - Rename the operator `Unions` to `Union` - Fix bugs Todo: - Change the optimizer rule for pushing Filter and Project through Unions. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10577#issuecomment-168971536 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10577#issuecomment-168971537 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48756/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10577#issuecomment-168971317 **[Test build #48756 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48756/consoleFull)** for PR 10577 at commit [`c1f66f7`](https://github.com/apache/spark/commit/c1f66f744fce35eb657f9ec8a971dbd5449d0985). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10577#issuecomment-169135841 Understood it. Thank you! Will not introduce new APIs. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/10577#discussion_r48897039 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala --- @@ -169,18 +169,3 @@ object ExtractFiltersAndInnerJoins extends PredicateHelper { case _ => None } } - -/** - * A pattern that collects all adjacent unions and returns their children as a Seq. - */ -object Unions { --- End diff -- Sure, will reimplement it using this way. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/10577#discussion_r48897156 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala --- @@ -212,22 +212,47 @@ object HiveTypeCoercion { case other => None } - def castOutput(plan: LogicalPlan): LogicalPlan = { -val casted = plan.output.zip(castedTypes).map { - case (e, Some(dt)) if e.dataType != dt => -Alias(Cast(e, dt), e.name)() - case (e, _) => e -} -Project(casted, plan) + if (castedTypes.exists(_.isDefined)) { +(castOutput(left, castedTypes), castOutput(right, castedTypes)) + } else { +(left, right) } +} + +private[this] def widenOutputTypes( +planName: String, +children: Seq[LogicalPlan]): Seq[LogicalPlan] = { + require(children.forall(_.output.length == children.head.output.length)) + + val castedTypes: Seq[Option[DataType]] = +children.tail.foldLeft(children.head.output.map(a => Option(a.dataType))) { --- End diff -- There is a bug in this function. Will fix it tonight. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10577#issuecomment-168802650 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48685/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10577#issuecomment-168795397 @rxin Could you check if this implementation is what you expects? Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/10577#issuecomment-168795624 Maybe we should just remove the old Union and call the new one Union? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10577#issuecomment-168802645 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/10577 [SPARK-12616] [SQL] Adding a New Logical Operator Unions `Union` logical operator only supports two children. Thus, adding a new logical operator `Unions` which can have arbitrary number of children. `Union` logical plan is a binary node. However, a typical use case for union is to union a very large number of input sources (DataFrames, RDDs, or files). It is not uncommon to union hundreds of thousands of files. In this case, our optimizer can become very slow due to the large number of logical unions. We should change the Union logical plan to support an arbitrary number of children, and add a single rule in the optimizer to collapse all adjacent `Union`s into a single `Unions`. Note that this problem doesn't exist in physical plan, because the physical Union already supports arbitrary number of children. After this is merged, will submit a separate PR for adding a new optimizer rule: Push `Unions` through `Filter` and `Project` You can merge this pull request into a Git repository by running: $ git pull https://github.com/gatorsmile/spark unionAllMultiChildren Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10577.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10577 commit 73270c8aa7b7e387e7b0e75369dfcbf8c554aa5e Author: gatorsmileDate: 2016-01-04T20:09:50Z added a new logical operator UNIONS commit d9811c7bb3f2c15ef9ba6fe95ec0b09f8f66b973 Author: gatorsmile Date: 2016-01-04T20:21:36Z Merge remote-tracking branch 'upstream/master' into unionAllMultiChildren --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10577#issuecomment-168795851 Yeah, it will be better. Will do the change tonight. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/10577#issuecomment-168795855 +1, I'd prefer if there is only one operator that performs unions --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/10577#discussion_r48793856 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -595,6 +598,22 @@ object BooleanSimplification extends Rule[LogicalPlan] with PredicateHelper { } /** + * Combines all adjacent [[Union]] and [[Unions]] operators into a single [[Unions]]. + */ +object CombineUnions extends Rule[LogicalPlan] { + private def collectUnionChildren(plan: LogicalPlan): Seq[LogicalPlan] = plan match { +case Union(l, r) => collectUnionChildren(l) ++ collectUnionChildren(r) --- End diff -- you should write this without using recursion to avoid stack overflow. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/10577#discussion_r48819727 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -595,6 +598,22 @@ object BooleanSimplification extends Rule[LogicalPlan] with PredicateHelper { } /** + * Combines all adjacent [[Union]] and [[Unions]] operators into a single [[Unions]]. + */ +object CombineUnions extends Rule[LogicalPlan] { + private def collectUnionChildren(plan: LogicalPlan): Seq[LogicalPlan] = plan match { +case Union(l, r) => collectUnionChildren(l) ++ collectUnionChildren(r) --- End diff -- I see. Removing `Union` introduces a lot of work, but almost done. Will submit a commit tomorrow. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org