[ https://issues.apache.org/jira/browse/DRILL-7566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034788#comment-17034788 ]
Aditya Allamraju commented on DRILL-7566: ----------------------------------------- [~volodymyr] [~ihuzenko] The materialized behavior exists in almost all the commercial and open source database vendors like Oracle, Teradata and recently with Mysql since ver 8.x expect for SQL Server. I am not very familiar with Postgresql. But a quick look at their documentation states that they materialize CTE as well and made an exception for single CTE usage. What I am trying to say is that most Customers/users expect "WITH" clause materialize the computation. Few users may not have a control on modifying the SQL via BI tools. Regarding hive views, I used the query above to just show you the behavior. But the same thing can be observed for other sources(Parquet, text files, Mapr-db JSON etc.). The real query looks different and uses multiple CTE in the main query. > Performance of Common Table Expression query-15 > ----------------------------------------------- > > Key: DRILL-7566 > URL: https://issues.apache.org/jira/browse/DRILL-7566 > Project: Apache Drill > Issue Type: Bug > Components: Query Planning & Optimization > Affects Versions: 1.16.0 > Reporter: Aditya Allamraju > Priority: Major > > Observed that the WITH clause is materializing the computation it did as many > times as it is being referred in the main query. The purpose of CTE is > defeated here. > For instance, Query-15 of TPC: > {code:java} > . . . . . . . . . . . . . . )> WITH revenue0(supplier_no , total_revenue) AS ( > . . . . . . . . . . . . . . )> SELECT l_suppkey, SUM(l_extendedprice * (1 > - l_discount)) > . . . . . . . . . . . . . . )> FROM lineitem > . . . . . . . . . . . . . . )> WHERE l_shipdate >= '1996-07-01' > . . . . . . . . . . . . . . )> AND l_shipdate < DATE_ADD('1996-07-01', > INTERVAL '90' DAY) > . . . . . . . . . . . . . . )> GROUP BY l_suppkey ) > . . . . . . . . . . semicolon> SELECT s_suppkey, s_name, s_address, s_phone, > total_revenue > . . . . . . . . . . semicolon> FROM supplier, revenue0 > . . . . . . . . . . semicolon> WHERE s_suppkey = supplier_no > . . . . . . . . . . semicolon> AND total_revenue = (SELECT > MAX(total_revenue) FROM revenue0) > . . . . . . . . . . semicolon> ORDER BY s_suppkey; > +-----------+--------------------+-------------------------------+-----------------+---------------+ > | s_suppkey | s_name | s_address | > s_phone | total_revenue | > +-----------+--------------------+-------------------------------+-----------------+---------------+ > | 493 | Supplier#000000493 | 7tdI3AtlDll57sj5K48WLX j5RDbc | > 21-252-702-2543 | 1779637.1723 | > +-----------+--------------------+-------------------------------+-----------------+---------------+ > 1 row selected (9.093 seconds) > apache drill (hive.tpch_text)> > {code} > I just performed the above on a small subset. You can see the behavior from > the explain plan and query profile. But usually, CTE based queries are run on > huge tables like in users case(each table running into few TB's!). > Explain plan for above query: > {code:java} > apache drill (hive.tpch_text)> explain plan for > . . . . . . . . . . semicolon> WITH revenue0(supplier_no , total_revenue) AS ( > . . . . . . . . . . . . . . )> SELECT l_suppkey, SUM(l_extendedprice * (1 > - l_discount)) > . . . . . . . . . . . . . . )> FROM lineitem > . . . . . . . . . . . . . . )> WHERE l_shipdate >= '1996-07-01' > . . . . . . . . . . . . . . )> AND l_shipdate < DATE_ADD('1996-07-01', > INTERVAL '90' DAY) > . . . . . . . . . . . . . . )> GROUP BY l_suppkey ) > . . . . . . . . . . semicolon> SELECT s_suppkey, s_name, s_address, s_phone, > total_revenue > . . . . . . . . . . semicolon> FROM supplier, revenue0 > . . . . . . . . . . semicolon> WHERE s_suppkey = supplier_no > . . . . . . . . . . semicolon> AND total_revenue = (SELECT > MAX(total_revenue) FROM revenue0) > . . . . . . . . . . semicolon> ORDER BY s_suppkey; > +----------------------------------------------------------------------------------+----------------------------------------------------------------------------------+ > | text > | json > | > +----------------------------------------------------------------------------------+----------------------------------------------------------------------------------+ > | 00-00 Screen > 00-01 Project(s_suppkey=[$0], s_name=[$1], s_address=[$2], s_phone=[$3], > total_revenue=[$4]) > 00-02 SelectionVectorRemover > 00-03 Sort(sort0=[$0], dir0=[ASC]) > 00-04 Project(s_suppkey=[$0], s_name=[$1], s_address=[$2], > s_phone=[$3], total_revenue=[$5]) > 00-05 Project(s_suppkey=[$3], s_name=[$4], s_address=[$5], > s_phone=[$6], l_suppkey=[$0], EXPR$1=[$1], EXPR$0=[$2]) > 00-06 HashJoin(condition=[=($3, $0)], joinType=[inner], > semi-join: =[false]) > 00-08 HashJoin(condition=[=($1, $2)], joinType=[inner], > semi-join: =[false]) > 00-10 HashAgg(group=[{0}], EXPR$1=[SUM($1)]) > 00-12 Project(l_suppkey=[$0], EXPR$1=[$1]) > 00-14 HashToRandomExchange(dist0=[[$0]]) > 01-01 UnorderedMuxExchange > 03-01 Project(l_suppkey=[$0], EXPR$1=[$1], > E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($0, 1301011)]) > 03-02 HashAgg(group=[{0}], EXPR$1=[SUM($1)]) > 03-03 Project(l_suppkey=[$0], $f1=[*($1, -(1, > $2))]) > 03-04 SelectionVectorRemover > 03-05 Filter(condition=[AND(>=($3, > '1996-07-01'), <($3, 1996-09-29 00:00:00))]) > 03-06 Scan(table=[[hive, tpch_text, > lineitem]], groupscan=[HiveScan [table=Table(dbName:tpch_text, > tableName:lineitem), columns=[`l_suppkey`, `l_extendedprice`, `l_discount`, > `l_shipdate`], numPartitions=0, partitions= null, > inputDirectories=[maprfs:/tpch/lineitem], confProperties={}]]) > 00-09 StreamAgg(group=[{}], EXPR$0=[MAX($0)]) > 00-11 Project(EXPR$1=[$1]) > 00-13 HashAgg(group=[{0}], EXPR$1=[SUM($1)]) > 00-15 Project(l_suppkey=[$0], EXPR$1=[$1]) > 00-16 HashToRandomExchange(dist0=[[$0]]) > 02-01 UnorderedMuxExchange > 04-01 Project(l_suppkey=[$0], EXPR$1=[$1], > E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($0, 1301011)]) > 04-02 HashAgg(group=[{0}], EXPR$1=[SUM($1)]) > 04-03 Project(l_suppkey=[$0], $f1=[*($1, > -(1, $2))]) > 04-04 SelectionVectorRemover > 04-05 Filter(condition=[AND(>=($3, > '1996-07-01'), <($3, 1996-09-29 00:00:00))]) > 04-06 Scan(table=[[hive, tpch_text, > lineitem]], groupscan=[HiveScan [table=Table(dbName:tpch_text, > tableName:lineitem), columns=[`l_suppkey`, `l_extendedprice`, `l_discount`, > `l_shipdate`], numPartitions=0, partitions= null, > inputDirectories=[maprfs:/tpch/lineitem], confProperties={}]]) > 00-07 Scan(table=[[hive, tpch_text, supplier]], > groupscan=[HiveScan [table=Table(dbName:tpch_text, tableName:supplier), > columns=[`s_suppkey`, `s_name`, `s_address`, `s_phone`], numPartitions=0, > partitions= null, inputDirectories=[maprfs:/tpch/supplier], > confProperties={}]]) > {code} > > Observe Step 03-06 and 04-06. One of them could be avoided. > -- This message was sent by Atlassian Jira (v8.3.4#803005)