[jira] [Commented] (CALCITE-6188) Multi-query optimization

Jay Narale (Jira) Thu, 04 Jan 2024 16:52:04 -0800


    [ 
https://issues.apache.org/jira/browse/CALCITE-6188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17803368#comment-17803368
 ]


Jay Narale commented on CALCITE-6188:
-------------------------------------

We in Presto have recently supported CTE materialization 
[https://github.com/prestodb/presto/pull/20887. 
|https://github.com/prestodb/presto/pull/20887]The implementation heavily uses 
a new type of logical planNode called a sequence node. Though our current 
implementation is not 100% optimized, we have a similar goal of optimizing all 
DAGs together.

> Multi-query optimization
> ------------------------
>
>                 Key: CALCITE-6188
>                 URL: https://issues.apache.org/jira/browse/CALCITE-6188
>             Project: Calcite
>          Issue Type: Improvement
>            Reporter: Julian Hyde
>            Priority: Major
>
> Devise extensions to SQL so that queries with multiple input tables, multiple 
> intermediate tables, and multiple output data sets (result sets and DML 
> operations) can be defined in one SQL statement, optimized, and executed 
> atomically.
> There are many flavors of multi-query optimization, depending on whether each 
> occurrence of "multiple" in the previous paragraph is replaced with 0, 1 or 
> "several". Our goal, here, is to allow them all to be expressed. We can then 
> devise planning strategies that work for particular flavors.
> Examples of multi-queries:
>  * {*}Multiple DML outputs{*}. An INSERT statement that writes into a table 
> but also updates an index,
>  * {*}Multiple DML outputs, complex intermediate tables{*}. A DAG that 
> represents an ETL/ELT job;
>  * {*}Multiple query outputs{*}. A query that produces several data sets (say 
> a list of invoices for
> orders and a list of products that need to be restocked);
>  * {*}DAG query{*}. A query that uses intermediate results more than once.
> See discussion in the [Multi-query optimization email 
> thread|https://lists.apache.org/thread/mcdqwrtpx0os54t2nn9vtk17spkp5o5k].
> Here are some SQL examples.
> We add a new keyword {{MULTI}} that represents a statement whose output 
> contains multiple data sets and DML operations, each with a unique name. For 
> intermediate results, we use the existing {{WITH}} clause.
> h3. 1. Multi DML
> Read from one or more tables, write to one or more tables.
> An example is inserting into a table and also an index on that table 
> (represented as a sorted table).
> {code:sql}
> WITH
>   empDelta AS (
>       SELECT * FROM emp2
>       EXCEPT
>       SELECT * FROM emp)
> MULTI
>   insertEmp AS (
>     INSERT INTO emp
>     TABLE empDelta),
>   insertEmpDeptno AS (
>     MERGE empDeptno AS e
>     USING TABLE empDelta AS d
>     ON e.deptno = d.deptno
>     WHEN NOT MATCHED THEN INSERT VALUES (deptno));
> {code}
> h3. 2. Query that creates temporary table and uses it more than once
> {code:sql}
> WITH
>   temp AS (
>     SELECT *
>     FROM emp AS e
>     JOIN dept USING (deptno)
>     WHERE e.job = 'MANAGER'
>     OR d.location = 'CHICAGO')
> SELECT deptno,
>   (SELECT AVG(sal)
>     FROM temp AS t
>     WHERE t.deptno = e.deptno) AS deptAvgSal,
>   (SELECT AVG(sal)
>     FROM temp AS t
>     WHERE t.job = e.job) AS jobAvgSal
> FROM e
> WHERE e.deptno IN (10, 20);
> {code}
> h3. 3. Query that should create temporary table
> This query produces the same result as the previous query. There is a common 
> relational expression, so the optimizer should consider a DAG plan.
> {code:sql}
> SELECT deptno,
>   (SELECT AVG(e2.sal)
>     FROM emp AS e2
>     JOIN dept AS d USING (deptno)
>     WHERE (e2.job = 'MANAGER'
>       OR d.location = 'CHICAGO')
>     AND e2.deptno = e.deptno) AS deptAvgSal,
>   (SELECT AVG(e3.sal)
>     FROM emp AS e3
>     JOIN dept AS d USING (deptno)
>     WHERE (e3.job = 'MANAGER'
>       OR d.location = 'CHICAGO')
>     AND e3.job = e.job) AS jobAvgSal
> FROM e
> WHERE e.deptno IN (10, 20);
> {code}
> h3. 4. Query that produces several data sets
> {code:sql}
> WITH
>   newOrders AS (
>     SELECT *
>     FROM orders
>     WHERE orderDate > DATE '2023-01-25')
> MULTI
>   invoices AS (
>     SELECT customerName, SUM(amount)
>     FROM newOrders
>     GROUP BY customerName),
>   restock AS (
>     SELECT productId
>     FROM inventory
>     WHERE productId IN (
>       SELECT productId FROM newOrders)
>     AND itemsOnHand < 10);
> {code}
> h3. 5. Query with a complex DAG, multiple output data sets and one DML
> {code:sql}
> WITH
>   t0 AS (
>     SELECT * FROM t WHERE x > 5),
>   t00 AS (
>     SELECT * FROM t0 WHERE y < 10),
>   t000 AS (
>     SELECT x, MIN(y) FROM t00 GROUP BY x),
>   t1 AS (
>     SELECT * FROM t WHERE y > 3),
>   t10 AS (
>     SELECT * FROM t1 WHERE x < 8),
>   t2 AS (
>     SELECT * FROM t000
>     INTERSECT
>     SELECT * FROM t10),
>   t3 AS (
>     SELECT * FROM u
>     INTERSECT
>     SELECT * FROM t00)
> MULTI (
>   q0 AS (SELECT * FROM t2),
>   q1 AS (SELECT * FROM t3),
>   d0 AS (DELETE FROM v WHERE v.x IN (SELECT x FROM t1))
> {code}
> The data flow is the following DAG:
> {noformat}
> u --------------------+
>                       +--------------> t3 (q0)
>     +--> t0 --> t00 --+
>     |                 +--> t000 --+
> t --+                             +--> t2 (q1)
>     |         +--> t10 -----------+
>     +--> t1 --+
>               +----------------------> d0
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (CALCITE-6188) Multi-query optimization

Reply via email to