[GitHub] spark issue #14452: [SPARK-16849][SQL] Improve subquery execution by dedupli...

viirya Thu, 18 Aug 2016 19:01:06 -0700

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/14452
  
    @davies Thanks for the comment.
    
    This PR proposes reusing the results of common subqueries within a
    query plan. For example,
    
    WITH cte as (SELECT * FROM src) SELECT * FROM cte a JOIN cte b
    
    In the above query, the subquery cte is materialized twice. With this
    PR, cte is executed only once. In a benchmark of TPC-DS query64, this
    PR cut the running time in half.
    
    In short, we find the common subqueries in the query plan. For each
    distinct subquery, we keep only one SparkPlan and attach it to all the
    subqueries that produce the same result. Once the first of these
    subqueries is executed, its RDD is kept and reused by the others.
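    
    The idea above can be modeled roughly as follows. This is only a toy
    sketch in plain Python, not the code in this PR and not Spark's API:
    SubqueryCache, canonical, and fake_run are hypothetical names, and
    treating two subqueries as equal when their whitespace-normalized SQL
    text matches is an assumption made to keep the example small.

```python
from typing import Callable, Dict, List, Tuple

Row = Tuple[object, ...]


def canonical(sql: str) -> str:
    # Assumption: subqueries are "the same" when their SQL text matches
    # after collapsing runs of whitespace. A real planner would compare plans.
    return " ".join(sql.split())


class SubqueryCache:
    """Hypothetical helper: runs each distinct subquery once and reuses its result."""

    def __init__(self, run: Callable[[str], List[Row]]) -> None:
        self._run = run
        self._results: Dict[str, List[Row]] = {}
        self.executions = 0  # how many times a subquery was actually run

    def execute(self, sql: str) -> List[Row]:
        key = canonical(sql)
        if key not in self._results:
            # First occurrence of this subquery: run it and keep the result.
            self.executions += 1
            self._results[key] = self._run(sql)
        # Later occurrences reuse the stored result.
        return self._results[key]


def fake_run(sql: str) -> List[Row]:
    # Stand-in for real execution; returns fixed rows for illustration.
    return [(1, "a"), (2, "b")]


cache = SubqueryCache(fake_run)
rows_a = cache.execute("SELECT * FROM src")    # cte referenced as a
rows_b = cache.execute("SELECT *   FROM src")  # cte referenced as b
print(cache.executions)  # 1
```

    Both references to cte get the same result object, and the underlying
    query runs once, which mirrors the reuse of one kept RDD by the other
    subqueries.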
    
    We will continue benchmarking TPC-DS queries with this PR. Once the
    results are ready, I will post them here.
    
    Please let me know if I need to explain anything in more detail.



