[ https://issues.apache.org/jira/browse/DRILL-7566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034788#comment-17034788 ]

Aditya Allamraju commented on DRILL-7566:
-----------------------------------------

[~volodymyr] [~ihuzenko] 

The materializing behavior exists in almost all commercial and open-source 
databases, such as Oracle, Teradata, and more recently MySQL (since version 
8.x), except for SQL Server. I am not very familiar with PostgreSQL, but a 
quick look at their documentation states that they materialize CTEs as well, 
making an exception only for single-reference CTE usage. What I am trying to 
say is that most customers/users expect the WITH clause to materialize its 
computation. Some users may not have control over modifying the SQL generated 
by their BI tools.
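
For reference, recent PostgreSQL versions even expose this choice explicitly 
in the syntax. A minimal sketch (PostgreSQL 12+ syntax, shown only for 
comparison, not valid in Drill):

{code:sql}
-- PostgreSQL 12+: AS MATERIALIZED forces the CTE to be computed once and
-- reused; AS NOT MATERIALIZED lets the planner inline it into each reference.
WITH revenue0 AS MATERIALIZED (
    SELECT l_suppkey AS supplier_no,
           SUM(l_extendedprice * (1 - l_discount)) AS total_revenue
    FROM lineitem
    GROUP BY l_suppkey
)
SELECT supplier_no, total_revenue
FROM revenue0
WHERE total_revenue = (SELECT MAX(total_revenue) FROM revenue0);
{code}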

Regarding Hive views: I used the query above just to show the behavior. The 
same thing can be observed with other sources (Parquet, text files, MapR-DB 
JSON, etc.).

The real query looks different and uses multiple CTEs in the main query.
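
Until the planner can reuse a CTE result, one possible workaround is to 
materialize it manually. A rough sketch, assuming Drill's CREATE TEMPORARY 
TABLE AS support (available since Drill 1.10) and a writable temporary 
workspace; the table and column names just mirror the Query-15 example below:

{code:sql}
-- Compute revenue0 once into a session-scoped temporary table,
-- then reference that table instead of repeating the CTE.
CREATE TEMPORARY TABLE revenue0 AS
SELECT l_suppkey AS supplier_no,
       SUM(l_extendedprice * (1 - l_discount)) AS total_revenue
FROM lineitem
WHERE l_shipdate >= '1996-07-01'
  AND l_shipdate < DATE_ADD('1996-07-01', INTERVAL '90' DAY)
GROUP BY l_suppkey;

SELECT s_suppkey, s_name, s_address, s_phone, total_revenue
FROM supplier, revenue0
WHERE s_suppkey = supplier_no
  AND total_revenue = (SELECT MAX(total_revenue) FROM revenue0)
ORDER BY s_suppkey;
{code}

This scans lineitem once when the temporary table is created, at the cost of 
persisting the (small) aggregate for the session.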

> Performance of Common Table Expression query-15
> -----------------------------------------------
>
>                 Key: DRILL-7566
>                 URL: https://issues.apache.org/jira/browse/DRILL-7566
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization
>    Affects Versions: 1.16.0
>            Reporter: Aditya Allamraju
>            Priority: Major
>
> Observed that the WITH clause re-computes its result as many times as it is 
> referenced in the main query. This defeats the purpose of a CTE.
> For instance, Query-15 of TPC-H:
> {code:java}
> . . . . . . . . . . . . . . )> WITH revenue0(supplier_no, total_revenue) AS (
> . . . . . . . . . . . . . . )>     SELECT l_suppkey, SUM(l_extendedprice * (1 - l_discount))
> . . . . . . . . . . . . . . )>     FROM lineitem
> . . . . . . . . . . . . . . )>     WHERE l_shipdate >= '1996-07-01'
> . . . . . . . . . . . . . . )>       AND l_shipdate < DATE_ADD('1996-07-01', INTERVAL '90' DAY)
> . . . . . . . . . . . . . . )>     GROUP BY l_suppkey )
> . . . . . . . . . . semicolon> SELECT s_suppkey, s_name, s_address, s_phone, total_revenue
> . . . . . . . . . . semicolon> FROM supplier, revenue0
> . . . . . . . . . . semicolon> WHERE s_suppkey = supplier_no
> . . . . . . . . . . semicolon>   AND total_revenue = (SELECT MAX(total_revenue) FROM revenue0)
> . . . . . . . . . . semicolon> ORDER BY s_suppkey;
> +-----------+--------------------+-------------------------------+-----------------+---------------+
> | s_suppkey |       s_name       |           s_address           |     s_phone     | total_revenue |
> +-----------+--------------------+-------------------------------+-----------------+---------------+
> | 493       | Supplier#000000493 | 7tdI3AtlDll57sj5K48WLX j5RDbc | 21-252-702-2543 | 1779637.1723  |
> +-----------+--------------------+-------------------------------+-----------------+---------------+
> 1 row selected (9.093 seconds)
> apache drill (hive.tpch_text)>
> {code}
> I just ran the above on a small subset; the behavior is visible in the 
> explain plan and query profile. In practice, though, CTE-based queries run 
> against huge tables, as in the user's case (each table running into a few 
> TBs!).
> Explain plan for the above query:
> {code:java}
> apache drill (hive.tpch_text)> explain plan for 
> . . . . . . . . . . semicolon> WITH revenue0(supplier_no, total_revenue) AS (
> . . . . . . . . . . . . . . )>     SELECT l_suppkey, SUM(l_extendedprice * (1 - l_discount))
> . . . . . . . . . . . . . . )>     FROM lineitem
> . . . . . . . . . . . . . . )>     WHERE l_shipdate >= '1996-07-01'
> . . . . . . . . . . . . . . )>       AND l_shipdate < DATE_ADD('1996-07-01', INTERVAL '90' DAY)
> . . . . . . . . . . . . . . )>     GROUP BY l_suppkey )
> . . . . . . . . . . semicolon> SELECT s_suppkey, s_name, s_address, s_phone, total_revenue
> . . . . . . . . . . semicolon> FROM supplier, revenue0
> . . . . . . . . . . semicolon> WHERE s_suppkey = supplier_no
> . . . . . . . . . . semicolon>   AND total_revenue = (SELECT MAX(total_revenue) FROM revenue0)
> . . . . . . . . . . semicolon> ORDER BY s_suppkey;
> +----------------------------------------------------------------------------------+----------------------------------------------------------------------------------+
> |                                       text                                       |                                       json                                       |
> +----------------------------------------------------------------------------------+----------------------------------------------------------------------------------+
> | 00-00    Screen
> 00-01      Project(s_suppkey=[$0], s_name=[$1], s_address=[$2], s_phone=[$3], total_revenue=[$4])
> 00-02        SelectionVectorRemover
> 00-03          Sort(sort0=[$0], dir0=[ASC])
> 00-04            Project(s_suppkey=[$0], s_name=[$1], s_address=[$2], s_phone=[$3], total_revenue=[$5])
> 00-05              Project(s_suppkey=[$3], s_name=[$4], s_address=[$5], s_phone=[$6], l_suppkey=[$0], EXPR$1=[$1], EXPR$0=[$2])
> 00-06                HashJoin(condition=[=($3, $0)], joinType=[inner], semi-join: =[false])
> 00-08                  HashJoin(condition=[=($1, $2)], joinType=[inner], semi-join: =[false])
> 00-10                    HashAgg(group=[{0}], EXPR$1=[SUM($1)])
> 00-12                      Project(l_suppkey=[$0], EXPR$1=[$1])
> 00-14                        HashToRandomExchange(dist0=[[$0]])
> 01-01                          UnorderedMuxExchange
> 03-01                            Project(l_suppkey=[$0], EXPR$1=[$1], E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($0, 1301011)])
> 03-02                              HashAgg(group=[{0}], EXPR$1=[SUM($1)])
> 03-03                                Project(l_suppkey=[$0], $f1=[*($1, -(1, $2))])
> 03-04                                  SelectionVectorRemover
> 03-05                                    Filter(condition=[AND(>=($3, '1996-07-01'), <($3, 1996-09-29 00:00:00))])
> 03-06                                      Scan(table=[[hive, tpch_text, lineitem]], groupscan=[HiveScan [table=Table(dbName:tpch_text, tableName:lineitem), columns=[`l_suppkey`, `l_extendedprice`, `l_discount`, `l_shipdate`], numPartitions=0, partitions= null, inputDirectories=[maprfs:/tpch/lineitem], confProperties={}]])
> 00-09                    StreamAgg(group=[{}], EXPR$0=[MAX($0)])
> 00-11                      Project(EXPR$1=[$1])
> 00-13                        HashAgg(group=[{0}], EXPR$1=[SUM($1)])
> 00-15                          Project(l_suppkey=[$0], EXPR$1=[$1])
> 00-16                            HashToRandomExchange(dist0=[[$0]])
> 02-01                              UnorderedMuxExchange
> 04-01                                Project(l_suppkey=[$0], EXPR$1=[$1], E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($0, 1301011)])
> 04-02                                  HashAgg(group=[{0}], EXPR$1=[SUM($1)])
> 04-03                                    Project(l_suppkey=[$0], $f1=[*($1, -(1, $2))])
> 04-04                                      SelectionVectorRemover
> 04-05                                        Filter(condition=[AND(>=($3, '1996-07-01'), <($3, 1996-09-29 00:00:00))])
> 04-06                                          Scan(table=[[hive, tpch_text, lineitem]], groupscan=[HiveScan [table=Table(dbName:tpch_text, tableName:lineitem), columns=[`l_suppkey`, `l_extendedprice`, `l_discount`, `l_shipdate`], numPartitions=0, partitions= null, inputDirectories=[maprfs:/tpch/lineitem], confProperties={}]])
> 00-07                  Scan(table=[[hive, tpch_text, supplier]], groupscan=[HiveScan [table=Table(dbName:tpch_text, tableName:supplier), columns=[`s_suppkey`, `s_name`, `s_address`, `s_phone`], numPartitions=0, partitions= null, inputDirectories=[maprfs:/tpch/supplier], confProperties={}]])
> {code}
>  
> Observe steps 03-06 and 04-06: the same lineitem scan (and the aggregation 
> above it) is planned twice, once per reference to revenue0. One of them 
> could be avoided.
>  


