[ https://issues.apache.org/jira/browse/SPARK-12114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Min Qiu updated SPARK-12114: ---------------------------- External issue URL: https://github.com/apache/spark/pull/10087 > ColumnPruning rule fails in case of "Project <- Filter <- Join" > --------------------------------------------------------------- > > Key: SPARK-12114 > URL: https://issues.apache.org/jira/browse/SPARK-12114 > Project: Spark > Issue Type: Bug > Components: SQL > Reporter: Min Qiu > > For the query > {quote} > SELECT c_name, c_custkey, o_orderkey, o_orderdate, > o_totalprice, sum(l_quantity) > FROM customer join orders join lineitem > on c_custkey = o_custkey AND o_orderkey = l_orderkey > left outer join (SELECT l_orderkey tmp_orderkey > FROM lineitem > GROUP BY l_orderkey > HAVING sum(l_quantity) > 300) tmp > on o_orderkey = tmp_orderkey > WHERE tmp_orderkey IS NOT NULL > GROUP BY c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice > ORDER BY o_totalprice DESC, o_orderdate > {quote} > The optimizedPlan is > {quote} > Sort \[o_totalprice#48 DESC,o_orderdate#49 ASC] > > Aggregate > \[c_name#38,c_custkey#37,o_orderkey#45,o_orderdate#49,o_totalprice#48], > \[c_name#38,c_custkey#37,o_orderkey#45, > o_orderdate#49,o_totalprice#48,SUM(l_quantity#58) AS _c5#36] > {color: green}Project > \[c_name#38,o_orderdate#49,c_custkey#37,o_orderkey#45,o_totalprice#48,l_quantity#58] > Filter IS NOT NULL tmp_orderkey#35 > Join LeftOuter, Some((o_orderkey#45 = tmp_orderkey#35)){color} > Join Inner, Some((c_custkey#37 = o_custkey#46)) > MetastoreRelation default, customer, None > Join Inner, Some((o_orderkey#45 = l_orderkey#54)) > MetastoreRelation default, orders, None > MetastoreRelation default, lineitem, None > Project \[tmp_orderkey#35] > Filter havingCondition#86 > Aggregate \[l_orderkey#70], \[(SUM(l_quantity#74) > 300.0) AS > havingCondition#86,l_orderkey#70 AS tmp_orderkey#35] > Project \[l_orderkey#70,l_quantity#74] > MetastoreRelation default, lineitem, None > {quote} > Due to the pattern highlighted in green that the ColumnPruning rule fails to > deal with, all columns of lineitem and orders tables are scanned. The > unneeded columns are also involved in the data Shuffling. The performance is > extremely bad if any one of the two tables is big. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org