Min Qiu created SPARK-12114:
-------------------------------

             Summary: ColumnPruning rule fails in case of "Project <- Filter <- Join"
                 Key: SPARK-12114
                 URL: https://issues.apache.org/jira/browse/SPARK-12114
             Project: Spark
          Issue Type: Bug
          Components: SQL
            Reporter: Min Qiu

For the query
{quote}
SELECT c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice, sum(l_quantity)
FROM customer
  join orders
  join lineitem
    on c_custkey = o_custkey AND o_orderkey = l_orderkey
  left outer join (
    SELECT l_orderkey tmp_orderkey
    FROM lineitem
    GROUP BY l_orderkey
    HAVING sum(l_quantity) > 300
  ) tmp
    on o_orderkey = tmp_orderkey
WHERE tmp_orderkey IS NOT NULL
GROUP BY c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice
ORDER BY o_totalprice DESC, o_orderdate
{quote}

The optimizedPlan is
{quote}
Sort \[o_totalprice#48 DESC,o_orderdate#49 ASC]
 Aggregate \[c_name#38,c_custkey#37,o_orderkey#45,o_orderdate#49,o_totalprice#48], \[c_name#38,c_custkey#37,o_orderkey#45,o_orderdate#49,o_totalprice#48,SUM(l_quantity#58) AS _c5#36]
{color: green}
  Project \[c_name#38,o_orderdate#49,c_custkey#37,o_orderkey#45,o_totalprice#48,l_quantity#58]
   Filter IS NOT NULL tmp_orderkey#35
    Join LeftOuter, Some((o_orderkey#45 = tmp_orderkey#35))
{color}
     Join Inner, Some((c_custkey#37 = o_custkey#46))
      MetastoreRelation default, customer, None
      Join Inner, Some((o_orderkey#45 = l_orderkey#54))
       MetastoreRelation default, orders, None
       MetastoreRelation default, lineitem, None
     Project \[tmp_orderkey#35]
      Filter havingCondition#86
       Aggregate \[l_orderkey#70], \[(SUM(l_quantity#74) > 300.0) AS havingCondition#86,l_orderkey#70 AS tmp_orderkey#35]
        Project \[l_orderkey#70,l_quantity#74]
         MetastoreRelation default, lineitem, None
{quote}

The ColumnPruning rule fails to handle the pattern highlighted in green (Project <- Filter <- Join), so no pruning Project is inserted below the join and all columns of the lineitem and orders tables are scanned. The unneeded columns are also carried through the data shuffle. Performance is extremely bad if either of the two tables is large.
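
To make the expected behaviour concrete, here is a minimal Python sketch of column pruning pushed through a Project <- Filter <- Join chain. It is a toy model written for illustration, not Catalyst's ColumnPruning rule: the classes `Relation`, `Join`, `Filter`, `Project` and the helpers `output`, `prune` and `scanned_columns` are hypothetical, the subquery is stood in for by a one-column relation named `tmp`, and the columns `o_comment` and `l_comment` are added only as examples of columns the query never references.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Relation:
    name: str
    columns: FrozenSet[str]


@dataclass(frozen=True)
class Join:
    left: object
    right: object
    condition_refs: FrozenSet[str]  # columns referenced by the join condition


@dataclass(frozen=True)
class Filter:
    condition_refs: FrozenSet[str]  # columns referenced by the predicate
    child: object


@dataclass(frozen=True)
class Project:
    columns: FrozenSet[str]
    child: object


def output(plan):
    """Columns produced by a plan node."""
    if isinstance(plan, (Relation, Project)):
        return plan.columns
    if isinstance(plan, Filter):
        return output(plan.child)
    if isinstance(plan, Join):
        return output(plan.left) | output(plan.right)
    raise TypeError(f"unknown plan node: {plan!r}")


def prune(plan, required):
    """Push the set of required columns down to the leaf relations."""
    if isinstance(plan, Relation):
        keep = plan.columns & required
        return plan if keep == plan.columns else Project(keep, plan)
    if isinstance(plan, Project):
        keep = plan.columns & required
        return Project(keep, prune(plan.child, keep))
    if isinstance(plan, Filter):
        # The predicate's columns must survive below the Filter.
        return Filter(plan.condition_refs,
                      prune(plan.child, required | plan.condition_refs))
    if isinstance(plan, Join):
        # Each side keeps what the parent needs plus the join keys.
        needed = required | plan.condition_refs
        return Join(prune(plan.left, needed & output(plan.left)),
                    prune(plan.right, needed & output(plan.right)),
                    plan.condition_refs)
    raise TypeError(f"unknown plan node: {plan!r}")


def scanned_columns(plan):
    """Map each relation name to the columns that would be read from it."""
    if isinstance(plan, Relation):
        return {plan.name: set(plan.columns)}
    if isinstance(plan, Project) and isinstance(plan.child, Relation):
        return {plan.child.name: set(plan.columns)}
    if isinstance(plan, Join):
        return {**scanned_columns(plan.left), **scanned_columns(plan.right)}
    return scanned_columns(plan.child)


# Simplified stand-in for the plan in this issue (customer is omitted).
orders = Relation("orders", frozenset(
    {"o_orderkey", "o_custkey", "o_orderdate", "o_totalprice", "o_comment"}))
lineitem = Relation("lineitem", frozenset(
    {"l_orderkey", "l_quantity", "l_comment"}))
tmp = Relation("tmp", frozenset({"tmp_orderkey"}))

inner = Join(orders, lineitem, frozenset({"o_orderkey", "l_orderkey"}))
outer = Join(inner, tmp, frozenset({"o_orderkey", "tmp_orderkey"}))
plan = Project(
    frozenset({"o_orderkey", "o_orderdate", "o_totalprice", "l_quantity"}),
    Filter(frozenset({"tmp_orderkey"}), outer))

pruned = prune(plan, output(plan))

for name, cols in sorted(scanned_columns(plan).items()):
    print("before:", name, sorted(cols))
for name, cols in sorted(scanned_columns(pruned).items()):
    print("after: ", name, sorted(cols))
```

In this sketch the unpruned plan reads every column of `orders` and `lineitem`, which mirrors the behaviour reported here, while the pruned plan reads only the projected columns plus the join and filter keys.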