[jira] [Comment Edited] (CALCITE-6188) Multi-query optimization
[ https://issues.apache.org/jira/browse/CALCITE-6188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17810169#comment-17810169 ] Julian Hyde edited comment on CALCITE-6188 at 1/24/24 2:41 AM: --- Thank you, [~zabetak]! I knew Hive had such a syntax, but I wasn't able to find it. Your example would translate to {code:java} WITH empDelta AS ( SELECT * FROM emp2 EXCEPT SELECT * FROM emp) MULTI insert1 AS ( INSERT INTO TABLE tbl1 SELECT * FROM empDelta WHERE empno > 100), insert2 AS ( INSERT INTO TABLE tbl2 SELECT * FROM empDelta WHERE empno < 50); {code} Mutations occurring inside the {{MULTI}} clause would cause a single row to be returned with a row count and status. Thus the whole statement might return {noformat} +---+---+ | insert1 | insert2 | ++--++--+ | status | rowCount | status | rowCount | ++--++--+ | ok | 10 | ok | 25 | ++--++--+ 1 row returned. {noformat} was (Author: julianhyde): Thank you, [~zabetak]! I knew Hive had such a syntax, but I wasn't able to find it. Your example would translate to {code:java} WITH empDelta AS ( SELECT * FROM emp2 EXCEPT SELECT * FROM emp) MULTI insert1 AS ( INSERT INTO TABLE tbl1 SELECT * WHERE empno > 100), insert2 AS ( INSERT INTO TABLE tbl2 SELECT * WHERE empno < 50); {code} Mutations occurring inside the {{MULTI}} clause would cause a single row to be returned with a row count and status. Thus the whole statement might return {noformat} +---+---+ | insert1 | insert 2 | ++--++--+ | status | rowCount | status | rowCount | ++--++--+ | ok | 10 | ok | 25 | ++--++--+ 1 row returned. {noformat} > Multi-query optimization > > > Key: CALCITE-6188 > URL: https://issues.apache.org/jira/browse/CALCITE-6188 > Project: Calcite > Issue Type: Improvement >Reporter: Julian Hyde >Priority: Major > > Devise extensions to SQL for multi-query optimization (MQO). Queries with > multiple input tables, multiple intermediate tables, and multiple output data > sets (result sets and DML operations) can be defined in one SQL statement, > optimized, and executed atomically. > There are many flavors of multi-query optimization, depending on whether each > occurrence of "multiple" in the previous paragraph is replaced with 0, 1 or > "several". Our goal, here, is to allow them all to be expressed. We can then > devise planning strategies that work for particular flavors. > Examples of multi-queries: > * {*}Multiple DML outputs{*}. An INSERT statement that writes into a table > but also updates an index, > * {*}Multiple DML outputs, complex intermediate tables{*}. A DAG that > represents an ETL/ELT job; > * {*}Multiple query outputs{*}. A query that produces several data sets (say > a list of invoices for > orders and a list of products that need to be restocked); > * {*}DAG query{*}. A query that uses intermediate results more than once. > See discussion in the [Multi-query optimization email > thread|https://lists.apache.org/thread/mcdqwrtpx0os54t2nn9vtk17spkp5o5k]. > Here are some SQL examples. > We add a new keyword {{MULTI}} that represents a statement whose output > contains multiple data sets and DML operations, each with a unique name. For > intermediate results, we use the existing {{WITH}} clause. > h3. 1. Multi DML > Read from one or more tables, write to one or more tables. > An example is inserting into a table and also an index on that table > (represented as a sorted table). > {code:sql} > WITH > empDelta AS ( > SELECT * FROM emp2 > EXCEPT > SELECT * FROM emp) > MULTI > insertEmp AS ( > INSERT INTO emp > TABLE empDelta), > insertEmpDeptno AS ( > MERGE empDeptno AS e > USING TABLE empDelta AS d > ON e.deptno = d.deptno > WHEN NOT MATCHED THEN INSERT VALUES (deptno)); > {code} > h3. 2. Query that creates temporary table and uses it more than once > {code:sql} > WITH > temp AS ( > SELECT * > FROM emp AS e > JOIN dept USING (deptno) > WHERE e.job = 'MANAGER' > OR d.location = 'CHICAGO') > SELECT deptno, > (SELECT AVG(sal) > FROM temp AS t > WHERE t.deptno = e.deptno) AS deptAvgSal, > (SELECT AVG(sal) > FROM temp AS t > WHERE t.job = e.job) AS jobAvgSal > FROM e > WHERE e.deptno IN (10, 20); > {code} > h3. 3. Query whose optimal plan might use a temporary table > This query produces the same result as the previous query. There is a common > relational expres
[jira] [Comment Edited] (CALCITE-6188) Multi-query optimization
[ https://issues.apache.org/jira/browse/CALCITE-6188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17803732#comment-17803732 ] Julian Hyde edited comment on CALCITE-6188 at 1/9/24 7:17 PM: -- I had an idea for DML. The idea is that each DML operation should return a CTE name that represents the table after the mutation. New names are required for each version of a mutated table. This approach is consistent with most functional programming languages' approach to mutation. For example, h3. 6. Use of tables after they have been mutated {code:java} WITH oddEmp = ( DELETE emp WHERE empno % 2 = 0), oddSalesEmp = ( SELECT * FROM oddEmp WHERE deptno = ( SELECT deptno FROM dept WHERE dname = 'SALES')), largeDept = ( SELECT deptno FROM emp GROUP BY deptno HAVING COUNT(*) > 100) SELECT max(sal) FROM oddSalesEmp WHERE deptno in (SELECT deptno FROM largeDept); {code} The first statement changes the value of "emp" but the result is not seen until after the whole statement completes. "largeDept" will be defined in terms of the original employees (both odd and even). "oddSalesEmp" is defined in terms of "oddEmp", the after image of the DELETE. The example does not show it, but any of those mutation steps could be a multi-query. If that multi-query returns several results then it could assign multiple names, {code:java} WITH ..., (newEmp, newDept, managers) = ( MULTI newEmp = (UPDATE emp SET sal = sal * 2), newDept = (UPDATE dept SET location = to_lower(location), managers = (SELECT * FROM emp WHERE job = 'MANAGER')) {code} Thus multi-queries can be nested inside multi-queries (subgraphs in graphs). was (Author: julianhyde): I had an idea for DML. The idea is that each DML operation should return a CTE name that represents the table after the mutation. New names are required for each version of a mutated table. This approach is consistent with most functional programming languages' approach to mutation. For example, {code:java} WITH oddEmp = ( DELETE emp WHERE empno % 2 = 0), oddSalesEmp = ( SELECT * FROM oddEmp WHERE deptno = ( SELECT deptno FROM dept WHERE dname = 'SALES')), largeDept = ( SELECT deptno FROM emp GROUP BY deptno HAVING COUNT(*) > 100) SELECT max(sal) FROM oddSalesEmp WHERE deptno in (SELECT deptno FROM largeDept); {code} The first statement changes the value of "emp" but the result is not seen until after the whole statement completes. "largeDept" will be defined in terms of the original employees (both odd and even). "oddSalesEmp" is defined in terms of "oddEmp", the after image of the DELETE. The example does not show it, but any of those mutation steps could be a multi-query. If that multi-query returns several results then it could assign multiple names, {code:java} WITH ..., (newEmp, newDept, managers) = ( MULTI newEmp = (UPDATE emp SET sal = sal * 2), newDept = (UPDATE dept SET location = to_lower(location), managers = (SELECT * FROM emp WHERE job = 'MANAGER')) {code} Thus multi-queries can be nested inside multi-queries (subgraphs in graphs). > Multi-query optimization > > > Key: CALCITE-6188 > URL: https://issues.apache.org/jira/browse/CALCITE-6188 > Project: Calcite > Issue Type: Improvement >Reporter: Julian Hyde >Priority: Major > > Devise extensions to SQL for multi-query optimization (MQO). Queries with > multiple input tables, multiple intermediate tables, and multiple output data > sets (result sets and DML operations) can be defined in one SQL statement, > optimized, and executed atomically. > There are many flavors of multi-query optimization, depending on whether each > occurrence of "multiple" in the previous paragraph is replaced with 0, 1 or > "several". Our goal, here, is to allow them all to be expressed. We can then > devise planning strategies that work for particular flavors. > Examples of multi-queries: > * {*}Multiple DML outputs{*}. An INSERT statement that writes into a table > but also updates an index, > * {*}Multiple DML outputs, complex intermediate tables{*}. A DAG that > represents an ETL/ELT job; > * {*}Multiple query outputs{*}. A query that produces several data sets (say > a list of invoices for > orders and a list of products that need to be restocked); > * {*}DAG query{*}. A query that uses intermediate results more than once. > See discussion in the [Multi-query optimization email > thread|https://lists.apache.org/thread/mcdqwrtpx0os54t2nn9vtk17spkp5o5k]. > Here are some SQL examples. > We add a new keyword {{MULTI}} that represents a statement whose output > contains multiple data sets and DML operations, each with a unique name. For > intermediate res
[jira] [Comment Edited] (CALCITE-6188) Multi-query optimization
[ https://issues.apache.org/jira/browse/CALCITE-6188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17803691#comment-17803691 ] Jay Narale edited comment on CALCITE-6188 at 1/5/24 8:01 PM: - In our case for CTE materialization, the series and the parallel case is pretty clear. (dependent ctes are in series , independent in parallel and all ctes are in series with the main query) also, DML was not needed, since the materialized CTE is only used once the insert is completed. Multi SQL might need that or other operations if streaming is planned to be supported. was (Author: jaynarale): In our case for CTE materialization, the series and the parallel case is pretty clear. (dependent ctes are in series , independent in parallel) also, DML was not needed, since the materialized CTE is only used once the insert is completed. Multi SQL might need that or other operations if streaming is planned to be supported. > Multi-query optimization > > > Key: CALCITE-6188 > URL: https://issues.apache.org/jira/browse/CALCITE-6188 > Project: Calcite > Issue Type: Improvement >Reporter: Julian Hyde >Priority: Major > > Devise extensions to SQL so that queries with multiple input tables, multiple > intermediate tables, and multiple output data sets (result sets and DML > operations) can be defined in one SQL statement, optimized, and executed > atomically. > There are many flavors of multi-query optimization, depending on whether each > occurrence of "multiple" in the previous paragraph is replaced with 0, 1 or > "several". Our goal, here, is to allow them all to be expressed. We can then > devise planning strategies that work for particular flavors. > Examples of multi-queries: > * {*}Multiple DML outputs{*}. An INSERT statement that writes into a table > but also updates an index, > * {*}Multiple DML outputs, complex intermediate tables{*}. A DAG that > represents an ETL/ELT job; > * {*}Multiple query outputs{*}. A query that produces several data sets (say > a list of invoices for > orders and a list of products that need to be restocked); > * {*}DAG query{*}. A query that uses intermediate results more than once. > See discussion in the [Multi-query optimization email > thread|https://lists.apache.org/thread/mcdqwrtpx0os54t2nn9vtk17spkp5o5k]. > Here are some SQL examples. > We add a new keyword {{MULTI}} that represents a statement whose output > contains multiple data sets and DML operations, each with a unique name. For > intermediate results, we use the existing {{WITH}} clause. > h3. 1. Multi DML > Read from one or more tables, write to one or more tables. > An example is inserting into a table and also an index on that table > (represented as a sorted table). > {code:sql} > WITH > empDelta AS ( > SELECT * FROM emp2 > EXCEPT > SELECT * FROM emp) > MULTI > insertEmp AS ( > INSERT INTO emp > TABLE empDelta), > insertEmpDeptno AS ( > MERGE empDeptno AS e > USING TABLE empDelta AS d > ON e.deptno = d.deptno > WHEN NOT MATCHED THEN INSERT VALUES (deptno)); > {code} > h3. 2. Query that creates temporary table and uses it more than once > {code:sql} > WITH > temp AS ( > SELECT * > FROM emp AS e > JOIN dept USING (deptno) > WHERE e.job = 'MANAGER' > OR d.location = 'CHICAGO') > SELECT deptno, > (SELECT AVG(sal) > FROM temp AS t > WHERE t.deptno = e.deptno) AS deptAvgSal, > (SELECT AVG(sal) > FROM temp AS t > WHERE t.job = e.job) AS jobAvgSal > FROM e > WHERE e.deptno IN (10, 20); > {code} > h3. 3. Query whose optimal plan might use a temporary table > This query produces the same result as the previous query. There is a common > relational expression, so the optimizer should consider a DAG plan with a > reified intermediate result. > {code:sql} > SELECT deptno, > (SELECT AVG(e2.sal) > FROM emp AS e2 > JOIN dept AS d USING (deptno) > WHERE (e2.job = 'MANAGER' > OR d.location = 'CHICAGO') > AND e2.deptno = e.deptno) AS deptAvgSal, > (SELECT AVG(e3.sal) > FROM emp AS e3 > JOIN dept AS d USING (deptno) > WHERE (e3.job = 'MANAGER' > OR d.location = 'CHICAGO') > AND e3.job = e.job) AS jobAvgSal > FROM e > WHERE e.deptno IN (10, 20); > {code} > h3. 4. Query that produces several data sets > {code:sql} > WITH > newOrders AS ( > SELECT * > FROM orders > WHERE orderDate > DATE '2023-01-25') > MULTI > invoices AS ( > SELECT customerName, SUM(amount) > FROM newOrders > GROUP BY customerName), > restock AS ( > SELECT productId > FROM inventory > WHERE productId IN ( > SELECT productId FROM newOrders) > AND itemsOnHand < 10); > {code} > h3. 5. Query with a complex DAG, multiple output data sets and one DML > {code:sql
[jira] [Comment Edited] (CALCITE-6188) Multi-query optimization
[ https://issues.apache.org/jira/browse/CALCITE-6188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17803376#comment-17803376 ] Jay Narale edited comment on CALCITE-6188 at 1/5/24 1:18 AM: - Not sure if it is the same but currently we treat the Sequence node strictly as a logical entity, which has the information of the DAG. We then manage the flow of the DAG through our scheduler. This approach eliminates the necessity for a physical operator to handle or maintain consistency and logical transformations can be applied independently. So our sequence node makes sure that all CTEs are materialized in the topological order of dependencies and then it executes the primary tree was (Author: jaynarale): Not sure if it is the same but currently we treat the Sequence node strictly as a logical entity, which has the information of the DAG. We then manage the flow of the DAG through our scheduler. This approach eliminates the necessity for a physical operator to handle or maintain consistency and logical transformations can be applied independently > Multi-query optimization > > > Key: CALCITE-6188 > URL: https://issues.apache.org/jira/browse/CALCITE-6188 > Project: Calcite > Issue Type: Improvement >Reporter: Julian Hyde >Priority: Major > > Devise extensions to SQL so that queries with multiple input tables, multiple > intermediate tables, and multiple output data sets (result sets and DML > operations) can be defined in one SQL statement, optimized, and executed > atomically. > There are many flavors of multi-query optimization, depending on whether each > occurrence of "multiple" in the previous paragraph is replaced with 0, 1 or > "several". Our goal, here, is to allow them all to be expressed. We can then > devise planning strategies that work for particular flavors. > Examples of multi-queries: > * {*}Multiple DML outputs{*}. An INSERT statement that writes into a table > but also updates an index, > * {*}Multiple DML outputs, complex intermediate tables{*}. A DAG that > represents an ETL/ELT job; > * {*}Multiple query outputs{*}. A query that produces several data sets (say > a list of invoices for > orders and a list of products that need to be restocked); > * {*}DAG query{*}. A query that uses intermediate results more than once. > See discussion in the [Multi-query optimization email > thread|https://lists.apache.org/thread/mcdqwrtpx0os54t2nn9vtk17spkp5o5k]. > Here are some SQL examples. > We add a new keyword {{MULTI}} that represents a statement whose output > contains multiple data sets and DML operations, each with a unique name. For > intermediate results, we use the existing {{WITH}} clause. > h3. 1. Multi DML > Read from one or more tables, write to one or more tables. > An example is inserting into a table and also an index on that table > (represented as a sorted table). > {code:sql} > WITH > empDelta AS ( > SELECT * FROM emp2 > EXCEPT > SELECT * FROM emp) > MULTI > insertEmp AS ( > INSERT INTO emp > TABLE empDelta), > insertEmpDeptno AS ( > MERGE empDeptno AS e > USING TABLE empDelta AS d > ON e.deptno = d.deptno > WHEN NOT MATCHED THEN INSERT VALUES (deptno)); > {code} > h3. 2. Query that creates temporary table and uses it more than once > {code:sql} > WITH > temp AS ( > SELECT * > FROM emp AS e > JOIN dept USING (deptno) > WHERE e.job = 'MANAGER' > OR d.location = 'CHICAGO') > SELECT deptno, > (SELECT AVG(sal) > FROM temp AS t > WHERE t.deptno = e.deptno) AS deptAvgSal, > (SELECT AVG(sal) > FROM temp AS t > WHERE t.job = e.job) AS jobAvgSal > FROM e > WHERE e.deptno IN (10, 20); > {code} > h3. 3. Query whose optimal plan might use a temporary table > This query produces the same result as the previous query. There is a common > relational expression, so the optimizer should consider a DAG plan with a > reified intermediate result. > {code:sql} > SELECT deptno, > (SELECT AVG(e2.sal) > FROM emp AS e2 > JOIN dept AS d USING (deptno) > WHERE (e2.job = 'MANAGER' > OR d.location = 'CHICAGO') > AND e2.deptno = e.deptno) AS deptAvgSal, > (SELECT AVG(e3.sal) > FROM emp AS e3 > JOIN dept AS d USING (deptno) > WHERE (e3.job = 'MANAGER' > OR d.location = 'CHICAGO') > AND e3.job = e.job) AS jobAvgSal > FROM e > WHERE e.deptno IN (10, 20); > {code} > h3. 4. Query that produces several data sets > {code:sql} > WITH > newOrders AS ( > SELECT * > FROM orders > WHERE orderDate > DATE '2023-01-25') > MULTI > invoices AS ( > SELECT customerName, SUM(amount) > FROM newOrders > GROUP BY customerName), > restock AS ( > SELECT productId > FROM inventory > WHERE productId IN ( >
[jira] [Comment Edited] (CALCITE-6188) Multi-query optimization
[ https://issues.apache.org/jira/browse/CALCITE-6188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17803376#comment-17803376 ] Jay Narale edited comment on CALCITE-6188 at 1/5/24 1:15 AM: - Not sure if it is the same but currently we treat the Sequence node strictly as a logical entity, which has the information of the DAG. We then manage the flow of the DAG through our scheduler. This approach eliminates the necessity for a physical operator to handle or maintain consistency and logical transformations can be applied independently was (Author: jaynarale): Not sure if it is the same but currently we treat the Sequence node strictly as a logical entity, which has the information of the DAG. We then manage the flow of the Directed Acyclic Graph (DAG) through our scheduler. This approach eliminates the necessity for a physical operator to handle or maintain consistency and logical transformations can be applied independently > Multi-query optimization > > > Key: CALCITE-6188 > URL: https://issues.apache.org/jira/browse/CALCITE-6188 > Project: Calcite > Issue Type: Improvement >Reporter: Julian Hyde >Priority: Major > > Devise extensions to SQL so that queries with multiple input tables, multiple > intermediate tables, and multiple output data sets (result sets and DML > operations) can be defined in one SQL statement, optimized, and executed > atomically. > There are many flavors of multi-query optimization, depending on whether each > occurrence of "multiple" in the previous paragraph is replaced with 0, 1 or > "several". Our goal, here, is to allow them all to be expressed. We can then > devise planning strategies that work for particular flavors. > Examples of multi-queries: > * {*}Multiple DML outputs{*}. An INSERT statement that writes into a table > but also updates an index, > * {*}Multiple DML outputs, complex intermediate tables{*}. A DAG that > represents an ETL/ELT job; > * {*}Multiple query outputs{*}. A query that produces several data sets (say > a list of invoices for > orders and a list of products that need to be restocked); > * {*}DAG query{*}. A query that uses intermediate results more than once. > See discussion in the [Multi-query optimization email > thread|https://lists.apache.org/thread/mcdqwrtpx0os54t2nn9vtk17spkp5o5k]. > Here are some SQL examples. > We add a new keyword {{MULTI}} that represents a statement whose output > contains multiple data sets and DML operations, each with a unique name. For > intermediate results, we use the existing {{WITH}} clause. > h3. 1. Multi DML > Read from one or more tables, write to one or more tables. > An example is inserting into a table and also an index on that table > (represented as a sorted table). > {code:sql} > WITH > empDelta AS ( > SELECT * FROM emp2 > EXCEPT > SELECT * FROM emp) > MULTI > insertEmp AS ( > INSERT INTO emp > TABLE empDelta), > insertEmpDeptno AS ( > MERGE empDeptno AS e > USING TABLE empDelta AS d > ON e.deptno = d.deptno > WHEN NOT MATCHED THEN INSERT VALUES (deptno)); > {code} > h3. 2. Query that creates temporary table and uses it more than once > {code:sql} > WITH > temp AS ( > SELECT * > FROM emp AS e > JOIN dept USING (deptno) > WHERE e.job = 'MANAGER' > OR d.location = 'CHICAGO') > SELECT deptno, > (SELECT AVG(sal) > FROM temp AS t > WHERE t.deptno = e.deptno) AS deptAvgSal, > (SELECT AVG(sal) > FROM temp AS t > WHERE t.job = e.job) AS jobAvgSal > FROM e > WHERE e.deptno IN (10, 20); > {code} > h3. 3. Query whose optimal plan might use a temporary table > This query produces the same result as the previous query. There is a common > relational expression, so the optimizer should consider a DAG plan with a > reified intermediate result. > {code:sql} > SELECT deptno, > (SELECT AVG(e2.sal) > FROM emp AS e2 > JOIN dept AS d USING (deptno) > WHERE (e2.job = 'MANAGER' > OR d.location = 'CHICAGO') > AND e2.deptno = e.deptno) AS deptAvgSal, > (SELECT AVG(e3.sal) > FROM emp AS e3 > JOIN dept AS d USING (deptno) > WHERE (e3.job = 'MANAGER' > OR d.location = 'CHICAGO') > AND e3.job = e.job) AS jobAvgSal > FROM e > WHERE e.deptno IN (10, 20); > {code} > h3. 4. Query that produces several data sets > {code:sql} > WITH > newOrders AS ( > SELECT * > FROM orders > WHERE orderDate > DATE '2023-01-25') > MULTI > invoices AS ( > SELECT customerName, SUM(amount) > FROM newOrders > GROUP BY customerName), > restock AS ( > SELECT productId > FROM inventory > WHERE productId IN ( > SELECT productId FROM newOrders) > AND itemsOnHand < 10); > {code} > h3. 5. Query with a complex DAG, multiple ou
[jira] [Comment Edited] (CALCITE-6188) Multi-query optimization
[ https://issues.apache.org/jira/browse/CALCITE-6188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17803376#comment-17803376 ] Jay Narale edited comment on CALCITE-6188 at 1/5/24 1:15 AM: - Not sure if it is the same but currently we treat the Sequence node strictly as a logical entity, which has the information of the DAG. We then manage the flow of the Directed Acyclic Graph (DAG) through our scheduler. This approach eliminates the necessity for a physical operator to handle or maintain consistency and logical transformations can be applied independently was (Author: jaynarale): Not sure if it is the same but currently we treat the Sequence node strictly as a logical entity, which has the information of the DAG. We then manage the flow of the Directed Acyclic Graph (DAG) through our scheduler. This approach eliminates the necessity for a physical operator to handle or maintain consistency. > Multi-query optimization > > > Key: CALCITE-6188 > URL: https://issues.apache.org/jira/browse/CALCITE-6188 > Project: Calcite > Issue Type: Improvement >Reporter: Julian Hyde >Priority: Major > > Devise extensions to SQL so that queries with multiple input tables, multiple > intermediate tables, and multiple output data sets (result sets and DML > operations) can be defined in one SQL statement, optimized, and executed > atomically. > There are many flavors of multi-query optimization, depending on whether each > occurrence of "multiple" in the previous paragraph is replaced with 0, 1 or > "several". Our goal, here, is to allow them all to be expressed. We can then > devise planning strategies that work for particular flavors. > Examples of multi-queries: > * {*}Multiple DML outputs{*}. An INSERT statement that writes into a table > but also updates an index, > * {*}Multiple DML outputs, complex intermediate tables{*}. A DAG that > represents an ETL/ELT job; > * {*}Multiple query outputs{*}. A query that produces several data sets (say > a list of invoices for > orders and a list of products that need to be restocked); > * {*}DAG query{*}. A query that uses intermediate results more than once. > See discussion in the [Multi-query optimization email > thread|https://lists.apache.org/thread/mcdqwrtpx0os54t2nn9vtk17spkp5o5k]. > Here are some SQL examples. > We add a new keyword {{MULTI}} that represents a statement whose output > contains multiple data sets and DML operations, each with a unique name. For > intermediate results, we use the existing {{WITH}} clause. > h3. 1. Multi DML > Read from one or more tables, write to one or more tables. > An example is inserting into a table and also an index on that table > (represented as a sorted table). > {code:sql} > WITH > empDelta AS ( > SELECT * FROM emp2 > EXCEPT > SELECT * FROM emp) > MULTI > insertEmp AS ( > INSERT INTO emp > TABLE empDelta), > insertEmpDeptno AS ( > MERGE empDeptno AS e > USING TABLE empDelta AS d > ON e.deptno = d.deptno > WHEN NOT MATCHED THEN INSERT VALUES (deptno)); > {code} > h3. 2. Query that creates temporary table and uses it more than once > {code:sql} > WITH > temp AS ( > SELECT * > FROM emp AS e > JOIN dept USING (deptno) > WHERE e.job = 'MANAGER' > OR d.location = 'CHICAGO') > SELECT deptno, > (SELECT AVG(sal) > FROM temp AS t > WHERE t.deptno = e.deptno) AS deptAvgSal, > (SELECT AVG(sal) > FROM temp AS t > WHERE t.job = e.job) AS jobAvgSal > FROM e > WHERE e.deptno IN (10, 20); > {code} > h3. 3. Query whose optimal plan might use a temporary table > This query produces the same result as the previous query. There is a common > relational expression, so the optimizer should consider a DAG plan with a > reified intermediate result. > {code:sql} > SELECT deptno, > (SELECT AVG(e2.sal) > FROM emp AS e2 > JOIN dept AS d USING (deptno) > WHERE (e2.job = 'MANAGER' > OR d.location = 'CHICAGO') > AND e2.deptno = e.deptno) AS deptAvgSal, > (SELECT AVG(e3.sal) > FROM emp AS e3 > JOIN dept AS d USING (deptno) > WHERE (e3.job = 'MANAGER' > OR d.location = 'CHICAGO') > AND e3.job = e.job) AS jobAvgSal > FROM e > WHERE e.deptno IN (10, 20); > {code} > h3. 4. Query that produces several data sets > {code:sql} > WITH > newOrders AS ( > SELECT * > FROM orders > WHERE orderDate > DATE '2023-01-25') > MULTI > invoices AS ( > SELECT customerName, SUM(amount) > FROM newOrders > GROUP BY customerName), > restock AS ( > SELECT productId > FROM inventory > WHERE productId IN ( > SELECT productId FROM newOrders) > AND itemsOnHand < 10); > {code} > h3. 5. Query with a complex DAG, multiple output data sets and one DML > {c
[jira] [Comment Edited] (CALCITE-6188) Multi-query optimization
[ https://issues.apache.org/jira/browse/CALCITE-6188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17803368#comment-17803368 ] Jay Narale edited comment on CALCITE-6188 at 1/5/24 12:55 AM: -- We in Presto have recently supported CTE materialization [https://github.com/prestodb/presto/pull/20887. |https://github.com/prestodb/presto/pull/20887]The implementation heavily uses a new type of logical planNode called a sequence node which makes sure that deadlocks don't exist. Though our current implementation is not 100% optimized, we have a similar goal of optimizing all DAGs together. was (Author: jaynarale): We in Presto have recently supported CTE materialization [https://github.com/prestodb/presto/pull/20887. |https://github.com/prestodb/presto/pull/20887]The implementation heavily uses a new type of logical planNode called a sequence node. Though our current implementation is not 100% optimized, we have a similar goal of optimizing all DAGs together. > Multi-query optimization > > > Key: CALCITE-6188 > URL: https://issues.apache.org/jira/browse/CALCITE-6188 > Project: Calcite > Issue Type: Improvement >Reporter: Julian Hyde >Priority: Major > > Devise extensions to SQL so that queries with multiple input tables, multiple > intermediate tables, and multiple output data sets (result sets and DML > operations) can be defined in one SQL statement, optimized, and executed > atomically. > There are many flavors of multi-query optimization, depending on whether each > occurrence of "multiple" in the previous paragraph is replaced with 0, 1 or > "several". Our goal, here, is to allow them all to be expressed. We can then > devise planning strategies that work for particular flavors. > Examples of multi-queries: > * {*}Multiple DML outputs{*}. An INSERT statement that writes into a table > but also updates an index, > * {*}Multiple DML outputs, complex intermediate tables{*}. A DAG that > represents an ETL/ELT job; > * {*}Multiple query outputs{*}. A query that produces several data sets (say > a list of invoices for > orders and a list of products that need to be restocked); > * {*}DAG query{*}. A query that uses intermediate results more than once. > See discussion in the [Multi-query optimization email > thread|https://lists.apache.org/thread/mcdqwrtpx0os54t2nn9vtk17spkp5o5k]. > Here are some SQL examples. > We add a new keyword {{MULTI}} that represents a statement whose output > contains multiple data sets and DML operations, each with a unique name. For > intermediate results, we use the existing {{WITH}} clause. > h3. 1. Multi DML > Read from one or more tables, write to one or more tables. > An example is inserting into a table and also an index on that table > (represented as a sorted table). > {code:sql} > WITH > empDelta AS ( > SELECT * FROM emp2 > EXCEPT > SELECT * FROM emp) > MULTI > insertEmp AS ( > INSERT INTO emp > TABLE empDelta), > insertEmpDeptno AS ( > MERGE empDeptno AS e > USING TABLE empDelta AS d > ON e.deptno = d.deptno > WHEN NOT MATCHED THEN INSERT VALUES (deptno)); > {code} > h3. 2. Query that creates temporary table and uses it more than once > {code:sql} > WITH > temp AS ( > SELECT * > FROM emp AS e > JOIN dept USING (deptno) > WHERE e.job = 'MANAGER' > OR d.location = 'CHICAGO') > SELECT deptno, > (SELECT AVG(sal) > FROM temp AS t > WHERE t.deptno = e.deptno) AS deptAvgSal, > (SELECT AVG(sal) > FROM temp AS t > WHERE t.job = e.job) AS jobAvgSal > FROM e > WHERE e.deptno IN (10, 20); > {code} > h3. 3. Query whose optimal plan might use a temporary table > This query produces the same result as the previous query. There is a common > relational expression, so the optimizer should consider a DAG plan with a > reified intermediate result. > {code:sql} > SELECT deptno, > (SELECT AVG(e2.sal) > FROM emp AS e2 > JOIN dept AS d USING (deptno) > WHERE (e2.job = 'MANAGER' > OR d.location = 'CHICAGO') > AND e2.deptno = e.deptno) AS deptAvgSal, > (SELECT AVG(e3.sal) > FROM emp AS e3 > JOIN dept AS d USING (deptno) > WHERE (e3.job = 'MANAGER' > OR d.location = 'CHICAGO') > AND e3.job = e.job) AS jobAvgSal > FROM e > WHERE e.deptno IN (10, 20); > {code} > h3. 4. Query that produces several data sets > {code:sql} > WITH > newOrders AS ( > SELECT * > FROM orders > WHERE orderDate > DATE '2023-01-25') > MULTI > invoices AS ( > SELECT customerName, SUM(amount) > FROM newOrders > GROUP BY customerName), > restock AS ( > SELECT productId > FROM inventory > WHERE productId IN ( > SELECT productId FROM newOrders) > AND itemsOnHand < 10); > {code} > h3. 5. Query with a complex DAG