[GitHub] spark pull request #17770: [SPARK-20392][SQL][WIP] Set barrier to prevent re...

2017-04-27 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/17770#discussion_r113641661
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -702,7 +705,7 @@ class Dataset[T] private[sql](
* @since 2.0.0
*/
   def join(right: Dataset[_]): DataFrame = withPlan {
-    Join(logicalPlan, right.logicalPlan, joinType = Inner, None)
+    Join(AnalysisBarrier(logicalPlan), right.logicalPlan, joinType = Inner, None)
--- End diff --

I am wondering whether we should check for duplication between the right and
left sides, and decide whether to use a barrier on the right side as well.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17770: [SPARK-20392][SQL][WIP] Set barrier to prevent re...

2017-04-27 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/17770#discussion_r113641438
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -702,7 +705,7 @@ class Dataset[T] private[sql](
* @since 2.0.0
*/
   def join(right: Dataset[_]): DataFrame = withPlan {
-    Join(logicalPlan, right.logicalPlan, joinType = Inner, None)
+    Join(AnalysisBarrier(logicalPlan), right.logicalPlan, joinType = Inner, None)
--- End diff --

For self-join de-duplication, we only set the barrier on the left side.





[GitHub] spark pull request #17770: [SPARK-20392][SQL][WIP] Set barrier to prevent re...

2017-04-27 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/17770#discussion_r113633693
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -203,7 +204,7 @@ class Dataset[T] private[sql](
* custom objects, e.g. collect.  Here we resolve and bind the encoder 
so that we can call its
* `fromRow` method later.
*/
-  private val boundEnc =
+  private lazy val boundEnc =
--- End diff --

We can't make `boundEnc` a lazy val, because we need an early exception when
the encoder can't be resolved.
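To illustrate the point, here is a minimal standalone sketch (hypothetical `EagerHolder`/`LazyHolder` classes, not Spark's actual encoder code): a plain `val` is evaluated at construction time, so an unresolvable "encoder" fails early, whereas a `lazy val` defers the failure until the field is first accessed.

```scala
// Hypothetical sketch: failure timing of an eager val vs. a lazy val.
class EagerHolder {
  // Evaluated at construction: an unresolvable "encoder" fails immediately.
  val bound: String = sys.error("encoder cannot be resolved")
}

class LazyHolder {
  // Evaluated on first access: construction succeeds even though
  // resolution would fail.
  lazy val bound: String = sys.error("encoder cannot be resolved")
}

object LazyValDemo {
  def main(args: Array[String]): Unit = {
    val lh = new LazyHolder // no error here; failure is deferred
    println("LazyHolder constructed")
    try new EagerHolder     // throws during construction
    catch {
      case e: RuntimeException =>
        println(s"EagerHolder failed early: ${e.getMessage}")
    }
  }
}
```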





[GitHub] spark pull request #17770: [SPARK-20392][SQL][WIP] Set barrier to prevent re...

2017-04-26 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/17770#discussion_r113624301
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala 
---
@@ -72,6 +72,34 @@ object CurrentOrigin {
   }
 }
 
+case class Barrier(node: Option[TreeNode[_]] = None)
--- End diff --

My original thought was: if we use a barrier node, we need to modify many
places where we create a new logical plan in order to wrap it with the barrier node.

I will revamp this with a barrier node.






[GitHub] spark pull request #17770: [SPARK-20392][SQL][WIP] Set barrier to prevent re...

2017-04-26 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/17770#discussion_r113474943
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala 
---
@@ -72,6 +72,34 @@ object CurrentOrigin {
   }
 }
 
+case class Barrier(node: Option[TreeNode[_]] = None)
--- End diff --

Why not just create a logical plan node and override the 
`transformUp`/`transformDown` functions?
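A simplified standalone sketch of that suggestion (hypothetical `Node`/`Leaf`/`Branch` types, not Spark's actual `TreeNode` API): a barrier wrapper whose `transformDown` does not recurse into its child, so transformation rules skip the already-analyzed subtree.

```scala
// Hypothetical mini tree: Barrier overrides traversal so rules never
// descend into the wrapped subtree.
sealed trait Node {
  def mapChildren(f: Node => Node): Node
  def transformDown(rule: PartialFunction[Node, Node]): Node = {
    val applied = rule.applyOrElse(this, identity[Node])
    applied.mapChildren(_.transformDown(rule))
  }
}

case class Leaf(name: String) extends Node {
  def mapChildren(f: Node => Node): Node = this
}

case class Branch(left: Node, right: Node) extends Node {
  def mapChildren(f: Node => Node): Node = Branch(f(left), f(right))
}

// The barrier: the rule may still match the barrier node itself, but the
// child is never visited (the same override would apply to transformUp).
case class Barrier(child: Node) extends Node {
  def mapChildren(f: Node => Node): Node = this
  override def transformDown(rule: PartialFunction[Node, Node]): Node =
    rule.applyOrElse(this, identity[Node])
}

object BarrierDemo {
  def main(args: Array[String]): Unit = {
    val plan = Branch(Barrier(Leaf("analyzed")), Leaf("new"))
    val upper: PartialFunction[Node, Node] = { case Leaf(n) => Leaf(n.toUpperCase) }
    // The barrier-wrapped leaf stays "analyzed"; the other becomes "NEW".
    println(plan.transformDown(upper))
  }
}
```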





[GitHub] spark pull request #17770: [SPARK-20392][SQL][WIP] Set barrier to prevent re...

2017-04-26 Thread viirya
GitHub user viirya opened a pull request:

https://github.com/apache/spark/pull/17770

[SPARK-20392][SQL][WIP] Set barrier to prevent re-entering a tree

## What changes were proposed in this pull request?

It is reported that there is a performance regression when applying an ML
pipeline to a dataset with many columns but few rows.

Currently I think the performance regression is caused by the cost of converting
between the DataFrame/Dataset abstraction and logical plans. Some operations
(e.g., `def select`) on DataFrames convert between the DataFrame abstraction
and logical plans. This cost is negligible in typical SQL usage.

However, it is not rare to chain dozens of pipeline stages in ML. As the query
plan grows incrementally while running those stages, the cost spent on the
conversion grows too. In particular, the `Analyzer` will traverse the whole big
query plan even when most of it is already analyzed.

By eliminating part of this cost, the time to run the example code locally is
reduced from about 1 min to about 30 secs.

In particular, the time spent applying the pipeline locally is mostly spent
calling `transform` on the 137 `Bucketizer`s. Before the change, each call to a
`Bucketizer`'s `transform` can cost about 0.4 sec, so the total time spent on
all `Bucketizer`s' `transform` is about 50 secs. After the change, each call
costs only about 0.1 sec.
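The growth described above can be illustrated with a toy cost model (a hedged, numbers-only sketch, not Spark code; the linear-vs-quadratic shape is the point, not the exact figures): without a barrier, stage i re-traverses all i plan nodes built so far, while with a barrier over the already-analyzed part each stage visits only its one new node.

```scala
// Toy cost model: node visits during analysis across chained pipeline stages.
object AnalysisCostDemo {
  // Stage i re-analyzes the whole plan of size i: 1 + 2 + ... + n visits.
  def visitsWithoutBarrier(stages: Int): Int = (1 to stages).sum

  // With a barrier over already-analyzed subtrees, each stage visits 1 node.
  def visitsWithBarrier(stages: Int): Int = stages

  def main(args: Array[String]): Unit = {
    println(AnalysisCostDemo.visitsWithoutBarrier(137)) // 9453 visits
    println(AnalysisCostDemo.visitsWithBarrier(137))    // 137 visits
  }
}
```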

We also make `boundEnc` a lazy variable to reduce unnecessary running time.

Note: the code and datasets provided by Barry Becker to reproduce this issue
can be found on the JIRA.

## How was this patch tested?

Existing tests.

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/viirya/spark-1 SPARK-20392

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17770.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17770


commit fe4483240d209fa6b7267e521fd81231462475de
Author: Liang-Chi Hsieh 
Date:   2017-04-26T08:53:51Z

Set barrier to prevent re-analysis of analyzed plan.



