[jira] [Work logged] (HIVE-24081) Enable pre-materializing CTEs referenced in scalar subqueries

ASF GitHub Bot (Jira) Fri, 28 Aug 2020 12:48:32 -0700


     [ 
https://issues.apache.org/jira/browse/HIVE-24081?focusedWorklogId=475962&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-475962
 ]


ASF GitHub Bot logged work on HIVE-24081:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 28/Aug/20 19:47
            Start Date: 28/Aug/20 19:47
    Worklog Time Spent: 10m 
      Work Description: jcamachor commented on a change in pull request #1437:
URL: https://github.com/apache/hive/pull/1437#discussion_r479498296



##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -2591,6 +2591,10 @@ private static void 
populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
     HIVE_CTE_MATERIALIZE_THRESHOLD("hive.optimize.cte.materialize.threshold", 
-1,

Review comment:
       Should we set the default to `2` now that it is only triggered in very 
specific cases?

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/parse/QBParseInfo.java
##########
@@ -677,6 +685,33 @@ public void setNoScanAnalyzeCommand(boolean 
isNoScanAnalyzeCommand) {
   public boolean hasInsertTables() {
     return this.insertIntoTables.size() > 0 || 
this.insertOverwriteTables.size() > 0;
   }
+
+  public boolean isFullyAggregate() throws SemanticException {

Review comment:
       Although the method is evident, could we add a comment?
   
   Should this be static since it is a utility method that can be used beyond 
this scope?

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/parse/QB.java
##########
@@ -457,4 +469,17 @@ public boolean hasTableDefined() {
     return !(aliases.size() == 1 && 
aliases.get(0).equals(SemanticAnalyzer.DUMMY_TABLE));
   }
 
+  public void addSubqExprAlias(ASTNode expressionTree, SemanticAnalyzer 
semanticAnalyzer) throws SemanticException {
+    String alias = "__subexpr" + subQueryExpressionAliasCounter++;
+
+    // Recursively do the first phase of semantic analysis for the subquery
+    QBExpr qbexpr = new QBExpr(alias);
+
+    ASTNode subqref = (ASTNode) expressionTree.getChild(1);
+    semanticAnalyzer.doPhase1QBExpr(subqref, qbexpr, getId(), alias, 
isInsideView());

Review comment:
       Trying to understand this step. Does this lead to parsing the same 
subquery multiple times?

##########
File path: ql/src/test/queries/clientpositive/cte_mat_6.q
##########
@@ -0,0 +1,81 @@
+set hive.optimize.cte.materialize.threshold=1;
+
+create table t0(col0 int);
+
+insert into t0(col0) values
+(1),(2),
+(100),(100),(100),
+(200),(200);
+
+-- CTE is referenced from scalar subquery in the select clause
+explain
+with cte as (select count(*) as small_count from t0 where col0 < 10)
+select t0.col0, (select small_count from cte)
+from t0
+order by t0.col0;
+
+with cte as (select count(*) as small_count from t0 where col0 < 10)
+select t0.col0, (select small_count from cte)
+from t0
+order by t0.col0;
+
+-- disable cte materialization
+set hive.optimize.cte.materialize.threshold=-1;
+
+explain
+with cte as (select count(*) as small_count from t0 where col0 < 10)
+select t0.col0, (select small_count from cte)
+from t0
+order by t0.col0;
+
+
+with cte as (select count(*) as small_count from t0 where col0 < 10)
+select t0.col0, (select small_count from cte)
+from t0
+order by t0.col0;
+
+
+-- enable cte materialization
+set hive.optimize.cte.materialize.threshold=1;
+
+-- CTE is referenced from scalar subquery in the where clause
+explain
+with cte as (select count(*) as small_count from t0 where col0 < 10)
+select t0.col0
+from t0
+where t0.col0 > (select small_count from cte)
+order by t0.col0;
+
+with cte as (select count(*) as small_count from t0 where col0 < 10)
+select t0.col0
+from t0
+where t0.col0 > (select small_count from cte)
+order by t0.col0;
+
+-- CTE is referenced from scalar subquery in the having clause
+explain
+with cte as (select count(*) as small_count from t0 where col0 < 10)
+select t0.col0, count(*)
+from t0
+group by col0
+having count(*) > (select small_count from cte)
+order by t0.col0;
+
+with cte as (select count(*) as small_count from t0 where col0 < 10)
+select t0.col0, count(*)
+from t0
+group by col0
+having count(*) > (select small_count from cte)
+order by t0.col0;
+
+-- mix full aggregate and non-full aggregate ctes
+explain
+with cte1 as (select col0 as k1 from t0 where col0 = '5'),
+     cte2 as (select count(*) as all_count from t0),
+     cte3 as (select col0 as k3, col0 + col0 as k3_2x, count(*) as key_count 
from t0 group by col0)
+select t0.col0, count(*)
+from t0
+join cte1 on t0.col0 = cte1.k1
+join cte3 on t0.col0 = cte3.k3
+group by col0
+having count(*) > (select all_count from cte2)

Review comment:
       Could we add tests to make sure the optimization is only triggered for 
SELECT queries? For instance, I am thinking about CTAS and CMV statements, the 
optimization should not triggered in those cases (I guess it could lead some 
kind of side effect, at least for CMV).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 475962)
    Time Spent: 20m  (was: 10m)

> Enable pre-materializing CTEs referenced in scalar subqueries
> -------------------------------------------------------------
>
>                 Key: HIVE-24081
>                 URL: https://issues.apache.org/jira/browse/HIVE-24081
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Krisztian Kasa
>            Assignee: Krisztian Kasa
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> HIVE-11752 introduces materializing CTE based on config
> {code}
> hive.optimize.cte.materialize.threshold
> {code}
> Goal of this jira is
> * extending the implementation to support materializing CTE's referenced in 
> scalar subqueries
> * add a config to materialize CTEs with aggregate output only



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Work logged] (HIVE-24081) Enable pre-materializing CTEs referenced in scalar subqueries

Reply via email to