[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...

2016-01-05 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/10577#discussion_r48820132
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -595,6 +598,22 @@ object BooleanSimplification extends Rule[LogicalPlan] 
with PredicateHelper {
 }
 
 /**
+ * Combines all adjacent [[Union]] and [[Unions]] operators into a single 
[[Unions]].
+ */
+object CombineUnions extends Rule[LogicalPlan] {
+  private def collectUnionChildren(plan: LogicalPlan): Seq[LogicalPlan] = 
plan match {
+case Union(l, r) => collectUnionChildren(l) ++ collectUnionChildren(r)
--- End diff --

Another option would just be to do this at construction time, that way we 
can avoid paying the cost in the analyzer.  This would still limit the cases we 
could cache (i.e. we'd miss cached data unioned with other data), but that 
doesn't seem like a huge deal.

I'd leave this rule here either way.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...

2016-01-05 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/10577#discussion_r48820999
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -595,6 +598,22 @@ object BooleanSimplification extends Rule[LogicalPlan] 
with PredicateHelper {
 }
 
 /**
+ * Combines all adjacent [[Union]] and [[Unions]] operators into a single 
[[Unions]].
+ */
+object CombineUnions extends Rule[LogicalPlan] {
+  private def collectUnionChildren(plan: LogicalPlan): Seq[LogicalPlan] = 
plan match {
+case Union(l, r) => collectUnionChildren(l) ++ collectUnionChildren(r)
--- End diff --

To do this at construction time, we need to introduce a new Dataframe API 
`unionAll` that can combine more than two Dataframes? @marmbrus @rxin 

Is my understanding correct? Thank you!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...

2016-01-05 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/10577#discussion_r48820341
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -595,6 +598,22 @@ object BooleanSimplification extends Rule[LogicalPlan] 
with PredicateHelper {
 }
 
 /**
+ * Combines all adjacent [[Union]] and [[Unions]] operators into a single 
[[Unions]].
+ */
+object CombineUnions extends Rule[LogicalPlan] {
+  private def collectUnionChildren(plan: LogicalPlan): Seq[LogicalPlan] = 
plan match {
+case Union(l, r) => collectUnionChildren(l) ++ collectUnionChildren(r)
--- End diff --

+1


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...

2016-01-05 Thread gatorsmile
Github user gatorsmile commented on the pull request:

https://github.com/apache/spark/pull/10577#issuecomment-168943487
  
Todo: 
  - Will add the new `Dataframe` and `Dataset` APIs for `unionAll`, if my 
understanding is correct.
  - Will add another rule for pushing `Filter` and `Project` through 
`Unions`.

Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...

2016-01-05 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10577#issuecomment-168943889
  
**[Test build #48756 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48756/consoleFull)**
 for PR 10577 at commit 
[`c1f66f7`](https://github.com/apache/spark/commit/c1f66f744fce35eb657f9ec8a971dbd5449d0985).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...

2016-01-05 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/10577#issuecomment-169077567
  
> Will add the new Dataframe and Dataset APIs for unionAll, if my 
understanding is correct.

You don't need to add any new APIs, just call the optimizer rule directly 
on any existing API that adds a `Union`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...

2016-01-05 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/10577#discussion_r48872458
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala
 ---
@@ -169,18 +169,3 @@ object ExtractFiltersAndInnerJoins extends 
PredicateHelper {
 case _ => None
   }
 }
-
-/**
- * A pattern that collects all adjacent unions and returns their children 
as a Seq.
- */
-object Unions {
--- End diff --

I'm not sure I would get rid of this, just use it in your optimization rule.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...

2016-01-05 Thread gatorsmile
Github user gatorsmile commented on the pull request:

https://github.com/apache/spark/pull/10577#issuecomment-169216470
  
Major changes in this commit:
   - Combine the Union during the construction (i.e., in the DataFrame and 
Dataset APIs)
   - Use tail recursive with extractor object to reimplement the rule 
`CombineUnions`
   - Rename the operator `Unions` to `Union`
   - Fix bugs

Todo: 
   - Change the optimizer rule for pushing Filter and Project through 
Unions.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...

2016-01-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10577#issuecomment-168971536
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...

2016-01-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10577#issuecomment-168971537
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48756/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...

2016-01-05 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10577#issuecomment-168971317
  
**[Test build #48756 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48756/consoleFull)**
 for PR 10577 at commit 
[`c1f66f7`](https://github.com/apache/spark/commit/c1f66f744fce35eb657f9ec8a971dbd5449d0985).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...

2016-01-05 Thread gatorsmile
Github user gatorsmile commented on the pull request:

https://github.com/apache/spark/pull/10577#issuecomment-169135841
  
Understood it. Thank you! Will not introduce new APIs. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...

2016-01-05 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/10577#discussion_r48897039
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala
 ---
@@ -169,18 +169,3 @@ object ExtractFiltersAndInnerJoins extends 
PredicateHelper {
 case _ => None
   }
 }
-
-/**
- * A pattern that collects all adjacent unions and returns their children 
as a Seq.
- */
-object Unions {
--- End diff --

Sure, will reimplement it using this way. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...

2016-01-05 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/10577#discussion_r48897156
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala
 ---
@@ -212,22 +212,47 @@ object HiveTypeCoercion {
 case other => None
   }
 
-  def castOutput(plan: LogicalPlan): LogicalPlan = {
-val casted = plan.output.zip(castedTypes).map {
-  case (e, Some(dt)) if e.dataType != dt =>
-Alias(Cast(e, dt), e.name)()
-  case (e, _) => e
-}
-Project(casted, plan)
+  if (castedTypes.exists(_.isDefined)) {
+(castOutput(left, castedTypes), castOutput(right, castedTypes))
+  } else {
+(left, right)
   }
+}
+
+private[this] def widenOutputTypes(
+planName: String,
+children: Seq[LogicalPlan]): Seq[LogicalPlan] = {
+  require(children.forall(_.output.length == 
children.head.output.length))
+
+  val castedTypes: Seq[Option[DataType]] =
+children.tail.foldLeft(children.head.output.map(a => 
Option(a.dataType))) {
--- End diff --

There is a bug in this function. Will fix it tonight. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...

2016-01-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10577#issuecomment-168802650
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48685/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...

2016-01-04 Thread gatorsmile
Github user gatorsmile commented on the pull request:

https://github.com/apache/spark/pull/10577#issuecomment-168795397
  
@rxin Could you check if this implementation is what you expects? Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...

2016-01-04 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/10577#issuecomment-168795624
  
Maybe we should just remove the old Union and call the new one Union?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...

2016-01-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10577#issuecomment-168802645
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...

2016-01-04 Thread gatorsmile
GitHub user gatorsmile opened a pull request:

https://github.com/apache/spark/pull/10577

[SPARK-12616] [SQL] Adding a New Logical Operator Unions

`Union` logical operator only supports two children. Thus, adding a new 
logical operator `Unions` which can have arbitrary number of children.

`Union` logical plan is a binary node. However, a typical use case for 
union is to union a very large number of input sources (DataFrames, RDDs, or 
files). It is not uncommon to union hundreds of thousands of files. In this 
case, our optimizer can become very slow due to the large number of logical 
unions. We should change the Union logical plan to support an arbitrary number 
of children, and add a single rule in the optimizer to collapse all adjacent 
`Union`s into a single `Unions`. Note that this problem doesn't exist in 
physical plan, because the physical Union already supports arbitrary number of 
children.

After this is merged, will submit a separate PR for adding a new optimizer 
rule:  Push `Unions` through `Filter` and `Project`

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark unionAllMultiChildren

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/10577.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #10577


commit 73270c8aa7b7e387e7b0e75369dfcbf8c554aa5e
Author: gatorsmile 
Date:   2016-01-04T20:09:50Z

added a new logical operator UNIONS

commit d9811c7bb3f2c15ef9ba6fe95ec0b09f8f66b973
Author: gatorsmile 
Date:   2016-01-04T20:21:36Z

Merge remote-tracking branch 'upstream/master' into unionAllMultiChildren




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...

2016-01-04 Thread gatorsmile
Github user gatorsmile commented on the pull request:

https://github.com/apache/spark/pull/10577#issuecomment-168795851
  
Yeah, it will be better. Will do the change tonight. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...

2016-01-04 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/10577#issuecomment-168795855
  
+1, I'd prefer if there is only one operator that performs unions


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...

2016-01-04 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/10577#discussion_r48793856
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -595,6 +598,22 @@ object BooleanSimplification extends Rule[LogicalPlan] 
with PredicateHelper {
 }
 
 /**
+ * Combines all adjacent [[Union]] and [[Unions]] operators into a single 
[[Unions]].
+ */
+object CombineUnions extends Rule[LogicalPlan] {
+  private def collectUnionChildren(plan: LogicalPlan): Seq[LogicalPlan] = 
plan match {
+case Union(l, r) => collectUnionChildren(l) ++ collectUnionChildren(r)
--- End diff --

you should write this without using recursion to avoid stack overflow.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12616] [SQL] Adding a New Logical Opera...

2016-01-04 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/10577#discussion_r48819727
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -595,6 +598,22 @@ object BooleanSimplification extends Rule[LogicalPlan] 
with PredicateHelper {
 }
 
 /**
+ * Combines all adjacent [[Union]] and [[Unions]] operators into a single 
[[Unions]].
+ */
+object CombineUnions extends Rule[LogicalPlan] {
+  private def collectUnionChildren(plan: LogicalPlan): Seq[LogicalPlan] = 
plan match {
+case Union(l, r) => collectUnionChildren(l) ++ collectUnionChildren(r)
--- End diff --

I see. Removing `Union` introduces a lot of work, but almost done. Will 
submit a commit tomorrow. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org