[GitHub] spark pull request #17770: [SPARK-20392][SQL][WIP] Set barrier to prevent re...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/17770#discussion_r113641661

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -702,7 +705,7 @@ class Dataset[T] private[sql](
    * @since 2.0.0
    */
   def join(right: Dataset[_]): DataFrame = withPlan {
-    Join(logicalPlan, right.logicalPlan, joinType = Inner, None)
+    Join(AnalysisBarrier(logicalPlan), right.logicalPlan, joinType = Inner, None)
--- End diff ---

I am wondering whether we should check for duplication between the right and left sides, and use that to decide whether to set the barrier on the right side as well.

---

If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17770: [SPARK-20392][SQL][WIP] Set barrier to prevent re...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/17770#discussion_r113641438

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -702,7 +705,7 @@ class Dataset[T] private[sql](
    * @since 2.0.0
    */
   def join(right: Dataset[_]): DataFrame = withPlan {
-    Join(logicalPlan, right.logicalPlan, joinType = Inner, None)
+    Join(AnalysisBarrier(logicalPlan), right.logicalPlan, joinType = Inner, None)
--- End diff ---

For self-join de-duplication, we only set the barrier on the left side.
[GitHub] spark pull request #17770: [SPARK-20392][SQL][WIP] Set barrier to prevent re...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/17770#discussion_r113633693

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -203,7 +204,7 @@ class Dataset[T] private[sql](
    * custom objects, e.g. collect. Here we resolve and bind the encoder so that we can call its
    * `fromRow` method later.
    */
-  private val boundEnc =
+  private lazy val boundEnc =
--- End diff ---

We can't make `boundEnc` a lazy val, because we need an early exception when the encoder can't be resolved.
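To illustrate the trade-off being discussed here, a minimal Scala sketch (the `EagerHolder`/`LazyHolder` classes are hypothetical stand-ins, not Spark code): with a plain `val`, a resolution failure surfaces at construction time, while a `lazy val` defers the same failure to the first access.

```scala
// Hypothetical stand-ins for a Dataset that binds its encoder in a field.
class EagerHolder {
  // A plain val runs at construction time, so the failure surfaces immediately.
  val bound: Int = throw new IllegalStateException("cannot resolve encoder")
}

class LazyHolder {
  // A lazy val defers the body, so construction succeeds and the failure
  // only surfaces when `bound` is first accessed.
  lazy val bound: Int = throw new IllegalStateException("cannot resolve encoder")
}

val eagerFailsAtConstruction =
  try { new EagerHolder(); false } catch { case _: IllegalStateException => true }

val holder = new LazyHolder()   // no exception here
val lazyFailsAtAccess =
  try { holder.bound; false } catch { case _: IllegalStateException => true }
```

This is the "early exception" point above: making `boundEnc` lazy would silently move an unresolvable-encoder error from Dataset creation to the first action that needs the encoder.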
[GitHub] spark pull request #17770: [SPARK-20392][SQL][WIP] Set barrier to prevent re...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/17770#discussion_r113624301

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala ---
@@ -72,6 +72,34 @@ object CurrentOrigin {
   }
 }

+case class Barrier(node: Option[TreeNode[_]] = None)
--- End diff ---

My original thought was that if we used a barrier node, we would need to modify the many places where we create a new logical plan in order to wrap it with the barrier node. I will revamp this with a barrier node.
[GitHub] spark pull request #17770: [SPARK-20392][SQL][WIP] Set barrier to prevent re...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/17770#discussion_r113474943

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala ---
@@ -72,6 +72,34 @@ object CurrentOrigin {
   }
 }

+case class Barrier(node: Option[TreeNode[_]] = None)
--- End diff ---

Why not just create a logical plan node and override the `transformUp`/`transformDown` functions?
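A minimal sketch of the suggestion above, on a simplified tree (the `Node`/`Op`/`Barrier` types here are hypothetical stand-ins, not Spark's actual `TreeNode` or `AnalysisBarrier`): a barrier node overrides `transformDown` to return itself, so rules never re-enter the already-analyzed subtree underneath it.

```scala
// Simplified stand-in for a plan tree.
sealed trait Node {
  def children: Seq[Node]
  def withChildren(newChildren: Seq[Node]): Node
  // Default behavior: apply the rule here, then recurse into children.
  def transformDown(rule: PartialFunction[Node, Node]): Node = {
    val applied = rule.applyOrElse(this, identity[Node])
    applied.withChildren(applied.children.map(_.transformDown(rule)))
  }
}

case class Op(name: String, children: Seq[Node]) extends Node {
  def withChildren(c: Seq[Node]): Node = copy(children = c)
}

// The barrier: overriding transformDown stops the traversal, so rules
// never reach the subtree below it.
case class Barrier(child: Node) extends Node {
  def children: Seq[Node] = Seq(child)
  def withChildren(c: Seq[Node]): Node = copy(child = c.head)
  override def transformDown(rule: PartialFunction[Node, Node]): Node = this
}

// Count how many nodes a rule touches, with and without the barrier.
var visits = 0
val countingRule: PartialFunction[Node, Node] = { case n => visits += 1; n }

val analyzedSubtree = Op("filter", Seq(Op("scan", Nil)))

val open = Op("project", Seq(analyzedSubtree))
open.transformDown(countingRule)
val openVisits = visits        // project + filter + scan

visits = 0
val guarded = Op("project", Seq(Barrier(analyzedSubtree)))
guarded.transformDown(countingRule)
val guardedVisits = visits     // only project; traversal stops at the barrier
```

The same override would apply to `transformUp`, so analyzer rules see the barrier as an opaque leaf while the analyzed plan below it is left untouched.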
[GitHub] spark pull request #17770: [SPARK-20392][SQL][WIP] Set barrier to prevent re...
GitHub user viirya opened a pull request: https://github.com/apache/spark/pull/17770

[SPARK-20392][SQL][WIP] Set barrier to prevent re-entering a tree

## What changes were proposed in this pull request?

It has been reported that there is a performance regression when applying an ML pipeline to a dataset with many columns but few rows. I currently think the regression is caused by the cost of converting between the DataFrame/Dataset abstraction and logical plans. Some operations (e.g., `def select`) on DataFrames convert between the DataFrame abstraction and logical plans. This cost is negligible in typical SQL usage, but it is not rare to chain dozens of pipeline stages in ML. As the query plan grows incrementally while those stages run, the cost of the conversion grows too. In particular, the `Analyzer` traverses the whole query plan even when most of it is already analyzed.

By eliminating part of this cost, the time to run the example code locally drops from about 1 min to about 30 secs. Most of the time spent applying the pipeline locally goes to calling transform on the 137 `Bucketizer`s. Before the change, each call to a `Bucketizer`'s transform costs about 0.4 sec, so the total across all `Bucketizer`s' transform is about 50 secs. After the change, each call costs only about 0.1 sec.

We also make `boundEnc` a lazy variable to avoid unnecessary work.

Note: the code and datasets provided by Barry Becker to reproduce this issue can be found on the JIRA.

## How was this patch tested?

Existing tests.
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 SPARK-20392

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17770.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #17770

commit fe4483240d209fa6b7267e521fd81231462475de
Author: Liang-Chi Hsieh
Date: 2017-04-26T08:53:51Z

    Set barrier to prevent re-analysis of analyzed plan.
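A back-of-envelope model of the slowdown described in the PR body, under the assumption (mine, not stated in the PR) that analysis cost is proportional to plan size: without a barrier, stage k re-analyzes a plan of roughly k nodes, so total work grows quadratically in the number of stages; with a barrier over the already-analyzed part, each stage only visits the newly added node.

```scala
// Toy cost model (an assumption, not Spark's actual analyzer cost):
// stage k walks a plan of size k, so N stages cost 1 + 2 + ... + N visits.
def visitsWithoutBarrier(stages: Int): Long =
  (1L to stages).sum            // quadratic: N * (N + 1) / 2

// With a barrier, the analyzer stops at the already-analyzed subtree,
// so each stage visits only the one node it added.
def visitsWithBarrier(stages: Int): Long =
  stages.toLong                 // linear: one new node per stage

val n = 137                     // the Bucketizer stage count from the PR body
val ratio = visitsWithoutBarrier(n) / visitsWithBarrier(n)
```

This model is only meant to show the shape of the growth; the reported 0.4 sec vs 0.1 sec per-transform numbers come from the PR description, not from this sketch.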