[ 
https://issues.apache.org/jira/browse/SPARK-12114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Min Qiu updated SPARK-12114:
----------------------------
    External issue URL: https://github.com/apache/spark/pull/10087

> ColumnPruning rule fails in case of "Project <- Filter <- Join"
> ---------------------------------------------------------------
>
>                 Key: SPARK-12114
>                 URL: https://issues.apache.org/jira/browse/SPARK-12114
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Min Qiu
>
> For the query
> {quote}
> SELECT c_name, c_custkey, o_orderkey, o_orderdate, 
>        o_totalprice, sum(l_quantity) 
> FROM customer join orders join lineitem 
>       on c_custkey = o_custkey AND o_orderkey = l_orderkey 
>      left outer join (SELECT l_orderkey tmp_orderkey 
>                       FROM lineitem 
>                       GROUP BY l_orderkey 
>                       HAVING sum(l_quantity) > 300) tmp 
>       on o_orderkey = tmp_orderkey 
> WHERE tmp_orderkey IS NOT NULL 
> GROUP BY c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice 
> ORDER BY o_totalprice DESC, o_orderdate
> {quote}
> The optimizedPlan is 
> {quote}
> Sort \[o_totalprice#48 DESC,o_orderdate#49 ASC]
>  
>  Aggregate 
> \[c_name#38,c_custkey#37,o_orderkey#45,o_orderdate#49,o_totalprice#48], 
> \[c_name#38,c_custkey#37,o_orderkey#45,                
> o_orderdate#49,o_totalprice#48,SUM(l_quantity#58) AS _c5#36]
>   {color: green}Project 
> \[c_name#38,o_orderdate#49,c_custkey#37,o_orderkey#45,o_totalprice#48,l_quantity#58]
>    Filter IS NOT NULL tmp_orderkey#35
>     Join LeftOuter, Some((o_orderkey#45 = tmp_orderkey#35)){color}
>      Join Inner, Some((c_custkey#37 = o_custkey#46))
>       MetastoreRelation default, customer, None
>       Join Inner, Some((o_orderkey#45 = l_orderkey#54))
>        MetastoreRelation default, orders, None
>        MetastoreRelation default, lineitem, None
>      Project \[tmp_orderkey#35]
>       Filter havingCondition#86
>        Aggregate \[l_orderkey#70], \[(SUM(l_quantity#74) > 300.0) AS 
> havingCondition#86,l_orderkey#70 AS tmp_orderkey#35]
>         Project \[l_orderkey#70,l_quantity#74]
>          MetastoreRelation default, lineitem, None
> {quote}
> Due to the pattern highlighted in green that the ColumnPruning rule fails to 
> deal with,  all columns of lineitem and orders tables are scanned. The 
> unneeded columns are also involved in the data Shuffling. The performance is 
> extremely bad if any one of the two tables is big.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to