[jira] [Commented] (CALCITE-6188) Multi-query optimization

2024-01-23 Thread Julian Hyde (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-6188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17810169#comment-17810169
 ] 

Julian Hyde commented on CALCITE-6188:
--

Thank you, [~zabetak]! I knew Hive had such a syntax, but I wasn't able to find 
it. Your example would translate to
{code:java}
WITH
  empDelta AS (
SELECT * FROM emp2
EXCEPT
SELECT * FROM emp)
MULTI
  insert1 AS (
INSERT INTO TABLE tbl1 SELECT * WHERE empno > 100),
  insert2 AS (
INSERT INTO TABLE tbl2 SELECT * WHERE empno < 50);
{code}
Mutations occurring inside the {{MULTI}} clause would cause a single row to be 
returned with a row count and status. Thus the whole statement might return
{noformat}
+---+---+
| insert1   | insert 2  |
++--++--+ 
| status | rowCount | status | rowCount |
++--++--+
| ok |   10 | ok |   25 |
++--++--+

1 row returned.
{noformat}

> Multi-query optimization
> 
>
> Key: CALCITE-6188
> URL: https://issues.apache.org/jira/browse/CALCITE-6188
> Project: Calcite
>  Issue Type: Improvement
>Reporter: Julian Hyde
>Priority: Major
>
> Devise extensions to SQL for multi-query optimization (MQO). Queries with 
> multiple input tables, multiple intermediate tables, and multiple output data 
> sets (result sets and DML operations) can be defined in one SQL statement, 
> optimized, and executed atomically.
> There are many flavors of multi-query optimization, depending on whether each 
> occurrence of "multiple" in the previous paragraph is replaced with 0, 1 or 
> "several". Our goal, here, is to allow them all to be expressed. We can then 
> devise planning strategies that work for particular flavors.
> Examples of multi-queries:
>  * {*}Multiple DML outputs{*}. An INSERT statement that writes into a table 
> but also updates an index,
>  * {*}Multiple DML outputs, complex intermediate tables{*}. A DAG that 
> represents an ETL/ELT job;
>  * {*}Multiple query outputs{*}. A query that produces several data sets (say 
> a list of invoices for
> orders and a list of products that need to be restocked);
>  * {*}DAG query{*}. A query that uses intermediate results more than once.
> See discussion in the [Multi-query optimization email 
> thread|https://lists.apache.org/thread/mcdqwrtpx0os54t2nn9vtk17spkp5o5k].
> Here are some SQL examples.
> We add a new keyword {{MULTI}} that represents a statement whose output 
> contains multiple data sets and DML operations, each with a unique name. For 
> intermediate results, we use the existing {{WITH}} clause.
> h3. 1. Multi DML
> Read from one or more tables, write to one or more tables.
> An example is inserting into a table and also an index on that table 
> (represented as a sorted table).
> {code:sql}
> WITH
>   empDelta AS (
>   SELECT * FROM emp2
>   EXCEPT
>   SELECT * FROM emp)
> MULTI
>   insertEmp AS (
> INSERT INTO emp
> TABLE empDelta),
>   insertEmpDeptno AS (
> MERGE empDeptno AS e
> USING TABLE empDelta AS d
> ON e.deptno = d.deptno
> WHEN NOT MATCHED THEN INSERT VALUES (deptno));
> {code}
> h3. 2. Query that creates temporary table and uses it more than once
> {code:sql}
> WITH
>   temp AS (
> SELECT *
> FROM emp AS e
> JOIN dept USING (deptno)
> WHERE e.job = 'MANAGER'
> OR d.location = 'CHICAGO')
> SELECT deptno,
>   (SELECT AVG(sal)
> FROM temp AS t
> WHERE t.deptno = e.deptno) AS deptAvgSal,
>   (SELECT AVG(sal)
> FROM temp AS t
> WHERE t.job = e.job) AS jobAvgSal
> FROM e
> WHERE e.deptno IN (10, 20);
> {code}
> h3. 3. Query whose optimal plan might use a temporary table
> This query produces the same result as the previous query. There is a common 
> relational expression, so the optimizer should consider a DAG plan with a 
> reified intermediate result.
> {code:sql}
> SELECT deptno,
>   (SELECT AVG(e2.sal)
> FROM emp AS e2
> JOIN dept AS d USING (deptno)
> WHERE (e2.job = 'MANAGER'
>   OR d.location = 'CHICAGO')
> AND e2.deptno = e.deptno) AS deptAvgSal,
>   (SELECT AVG(e3.sal)
> FROM emp AS e3
> JOIN dept AS d USING (deptno)
> WHERE (e3.job = 'MANAGER'
>   OR d.location = 'CHICAGO')
> AND e3.job = e.job) AS jobAvgSal
> FROM e
> WHERE e.deptno IN (10, 20);
> {code}
> h3. 4. Query that produces several data sets
> {code:sql}
> WITH
>   newOrders AS (
> SELECT *
> FROM orders
> WHERE orderDate > DATE '2023-01-25')
> MULTI
>   invoices AS (
> SELECT customerName, SUM(amount)
> FROM newOrders
> GROUP BY customerName),
>   restock AS (
> SELECT productId
> FROM inventory
> WHERE productId IN (
>   SELECT productId FROM newOrders)
> AND itemsOnHand < 10);
> {code}
> 

[jira] [Commented] (CALCITE-6188) Multi-query optimization

2024-01-22 Thread Stamatis Zampetakis (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-6188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17809478#comment-17809478
 ] 

Stamatis Zampetakis commented on CALCITE-6188:
--

Regarding the Multi DML use case, some systems (e.g., Hive, Snowflake) provide 
a SQL syntax for inserting data into multiple tables, usually know as MULTI 
TABLE INSERT statements.

The Hive syntax can be found 
[here|https://cwiki.apache.org/confluence/display/hive/languagemanual+dml#LanguageManualDML-InsertingdataintoHiveTablesfromqueries]
 and a basic example is outlined below. 
{code:sql}
FROM (SELECT * FROM emp2 EXCEPT SELECT * FROM emp) empDelta
INSERT INTO TABLE tbl1 SELECT * WHERE empno > 100
INSERT INTO TABLE tbl2 SELECT * WHERE empno < 50;
{code}
The Snowflake syntax along with examples can be found 
[here|https://docs.snowflake.com/en/sql-reference/sql/insert-multi-table].

> Multi-query optimization
> 
>
> Key: CALCITE-6188
> URL: https://issues.apache.org/jira/browse/CALCITE-6188
> Project: Calcite
>  Issue Type: Improvement
>Reporter: Julian Hyde
>Priority: Major
>
> Devise extensions to SQL for multi-query optimization (MQO). Queries with 
> multiple input tables, multiple intermediate tables, and multiple output data 
> sets (result sets and DML operations) can be defined in one SQL statement, 
> optimized, and executed atomically.
> There are many flavors of multi-query optimization, depending on whether each 
> occurrence of "multiple" in the previous paragraph is replaced with 0, 1 or 
> "several". Our goal, here, is to allow them all to be expressed. We can then 
> devise planning strategies that work for particular flavors.
> Examples of multi-queries:
>  * {*}Multiple DML outputs{*}. An INSERT statement that writes into a table 
> but also updates an index,
>  * {*}Multiple DML outputs, complex intermediate tables{*}. A DAG that 
> represents an ETL/ELT job;
>  * {*}Multiple query outputs{*}. A query that produces several data sets (say 
> a list of invoices for
> orders and a list of products that need to be restocked);
>  * {*}DAG query{*}. A query that uses intermediate results more than once.
> See discussion in the [Multi-query optimization email 
> thread|https://lists.apache.org/thread/mcdqwrtpx0os54t2nn9vtk17spkp5o5k].
> Here are some SQL examples.
> We add a new keyword {{MULTI}} that represents a statement whose output 
> contains multiple data sets and DML operations, each with a unique name. For 
> intermediate results, we use the existing {{WITH}} clause.
> h3. 1. Multi DML
> Read from one or more tables, write to one or more tables.
> An example is inserting into a table and also an index on that table 
> (represented as a sorted table).
> {code:sql}
> WITH
>   empDelta AS (
>   SELECT * FROM emp2
>   EXCEPT
>   SELECT * FROM emp)
> MULTI
>   insertEmp AS (
> INSERT INTO emp
> TABLE empDelta),
>   insertEmpDeptno AS (
> MERGE empDeptno AS e
> USING TABLE empDelta AS d
> ON e.deptno = d.deptno
> WHEN NOT MATCHED THEN INSERT VALUES (deptno));
> {code}
> h3. 2. Query that creates temporary table and uses it more than once
> {code:sql}
> WITH
>   temp AS (
> SELECT *
> FROM emp AS e
> JOIN dept USING (deptno)
> WHERE e.job = 'MANAGER'
> OR d.location = 'CHICAGO')
> SELECT deptno,
>   (SELECT AVG(sal)
> FROM temp AS t
> WHERE t.deptno = e.deptno) AS deptAvgSal,
>   (SELECT AVG(sal)
> FROM temp AS t
> WHERE t.job = e.job) AS jobAvgSal
> FROM e
> WHERE e.deptno IN (10, 20);
> {code}
> h3. 3. Query whose optimal plan might use a temporary table
> This query produces the same result as the previous query. There is a common 
> relational expression, so the optimizer should consider a DAG plan with a 
> reified intermediate result.
> {code:sql}
> SELECT deptno,
>   (SELECT AVG(e2.sal)
> FROM emp AS e2
> JOIN dept AS d USING (deptno)
> WHERE (e2.job = 'MANAGER'
>   OR d.location = 'CHICAGO')
> AND e2.deptno = e.deptno) AS deptAvgSal,
>   (SELECT AVG(e3.sal)
> FROM emp AS e3
> JOIN dept AS d USING (deptno)
> WHERE (e3.job = 'MANAGER'
>   OR d.location = 'CHICAGO')
> AND e3.job = e.job) AS jobAvgSal
> FROM e
> WHERE e.deptno IN (10, 20);
> {code}
> h3. 4. Query that produces several data sets
> {code:sql}
> WITH
>   newOrders AS (
> SELECT *
> FROM orders
> WHERE orderDate > DATE '2023-01-25')
> MULTI
>   invoices AS (
> SELECT customerName, SUM(amount)
> FROM newOrders
> GROUP BY customerName),
>   restock AS (
> SELECT productId
> FROM inventory
> WHERE productId IN (
>   SELECT productId FROM newOrders)
> AND itemsOnHand < 10);
> {code}
> h3. 5. Query with a complex DAG, multiple output data sets and one DML
> {code:sql}
> WITH
>   t0 AS (
> SELECT * FROM t WHERE x > 5),
>  

[jira] [Commented] (CALCITE-6188) Multi-query optimization

2024-01-09 Thread Julian Hyde (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-6188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17804855#comment-17804855
 ] 

Julian Hyde commented on CALCITE-6188:
--

I forgot an important case. Two or more queries that can benefit from a common 
intermediate table (effectively a temporary materialized view). The optimizer 
should infer the intermediate table automatically.

h3. 7. Common intermediate table

{code}
MULTI
  empByJob AS (
SELECT job, COUNT(*) AS c
FROM emp
WHERE deptno < 30
GROUP BY job),
  empByDeptJob AS (
SELECT DISTINCT deptno, job
FROM emp);
{code}

Under a typical cost model, the following would be useful materialized view. 
The optimizer should infer it automatically.

{code}
empByDeptJob AS (
  SELECT deptno, job, COUNT(*) AS c, SUM(sal) AS sumSal
  FROM emp
  GROUP BY deptno, job)
{code}


> Multi-query optimization
> 
>
> Key: CALCITE-6188
> URL: https://issues.apache.org/jira/browse/CALCITE-6188
> Project: Calcite
>  Issue Type: Improvement
>Reporter: Julian Hyde
>Priority: Major
>
> Devise extensions to SQL for multi-query optimization (MQO). Queries with 
> multiple input tables, multiple intermediate tables, and multiple output data 
> sets (result sets and DML operations) can be defined in one SQL statement, 
> optimized, and executed atomically.
> There are many flavors of multi-query optimization, depending on whether each 
> occurrence of "multiple" in the previous paragraph is replaced with 0, 1 or 
> "several". Our goal, here, is to allow them all to be expressed. We can then 
> devise planning strategies that work for particular flavors.
> Examples of multi-queries:
>  * {*}Multiple DML outputs{*}. An INSERT statement that writes into a table 
> but also updates an index,
>  * {*}Multiple DML outputs, complex intermediate tables{*}. A DAG that 
> represents an ETL/ELT job;
>  * {*}Multiple query outputs{*}. A query that produces several data sets (say 
> a list of invoices for
> orders and a list of products that need to be restocked);
>  * {*}DAG query{*}. A query that uses intermediate results more than once.
> See discussion in the [Multi-query optimization email 
> thread|https://lists.apache.org/thread/mcdqwrtpx0os54t2nn9vtk17spkp5o5k].
> Here are some SQL examples.
> We add a new keyword {{MULTI}} that represents a statement whose output 
> contains multiple data sets and DML operations, each with a unique name. For 
> intermediate results, we use the existing {{WITH}} clause.
> h3. 1. Multi DML
> Read from one or more tables, write to one or more tables.
> An example is inserting into a table and also an index on that table 
> (represented as a sorted table).
> {code:sql}
> WITH
>   empDelta AS (
>   SELECT * FROM emp2
>   EXCEPT
>   SELECT * FROM emp)
> MULTI
>   insertEmp AS (
> INSERT INTO emp
> TABLE empDelta),
>   insertEmpDeptno AS (
> MERGE empDeptno AS e
> USING TABLE empDelta AS d
> ON e.deptno = d.deptno
> WHEN NOT MATCHED THEN INSERT VALUES (deptno));
> {code}
> h3. 2. Query that creates temporary table and uses it more than once
> {code:sql}
> WITH
>   temp AS (
> SELECT *
> FROM emp AS e
> JOIN dept USING (deptno)
> WHERE e.job = 'MANAGER'
> OR d.location = 'CHICAGO')
> SELECT deptno,
>   (SELECT AVG(sal)
> FROM temp AS t
> WHERE t.deptno = e.deptno) AS deptAvgSal,
>   (SELECT AVG(sal)
> FROM temp AS t
> WHERE t.job = e.job) AS jobAvgSal
> FROM e
> WHERE e.deptno IN (10, 20);
> {code}
> h3. 3. Query whose optimal plan might use a temporary table
> This query produces the same result as the previous query. There is a common 
> relational expression, so the optimizer should consider a DAG plan with a 
> reified intermediate result.
> {code:sql}
> SELECT deptno,
>   (SELECT AVG(e2.sal)
> FROM emp AS e2
> JOIN dept AS d USING (deptno)
> WHERE (e2.job = 'MANAGER'
>   OR d.location = 'CHICAGO')
> AND e2.deptno = e.deptno) AS deptAvgSal,
>   (SELECT AVG(e3.sal)
> FROM emp AS e3
> JOIN dept AS d USING (deptno)
> WHERE (e3.job = 'MANAGER'
>   OR d.location = 'CHICAGO')
> AND e3.job = e.job) AS jobAvgSal
> FROM e
> WHERE e.deptno IN (10, 20);
> {code}
> h3. 4. Query that produces several data sets
> {code:sql}
> WITH
>   newOrders AS (
> SELECT *
> FROM orders
> WHERE orderDate > DATE '2023-01-25')
> MULTI
>   invoices AS (
> SELECT customerName, SUM(amount)
> FROM newOrders
> GROUP BY customerName),
>   restock AS (
> SELECT productId
> FROM inventory
> WHERE productId IN (
>   SELECT productId FROM newOrders)
> AND itemsOnHand < 10);
> {code}
> h3. 5. Query with a complex DAG, multiple output data sets and one DML
> {code:sql}
> WITH
>   t0 AS (
> SELECT * FROM t WHERE x > 5),
>   t00 AS (
> SELECT * 

[jira] [Commented] (CALCITE-6188) Multi-query optimization

2024-01-08 Thread Jay Narale (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-6188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17804548#comment-17804548
 ] 

Jay Narale commented on CALCITE-6188:
-

Makes sense, and I think we could reuse our scheduling dag logic in cases where 
dependencies exist or if the user specifies any dependencies.

> Multi-query optimization
> 
>
> Key: CALCITE-6188
> URL: https://issues.apache.org/jira/browse/CALCITE-6188
> Project: Calcite
>  Issue Type: Improvement
>Reporter: Julian Hyde
>Priority: Major
>
> Devise extensions to SQL so that queries with multiple input tables, multiple 
> intermediate tables, and multiple output data sets (result sets and DML 
> operations) can be defined in one SQL statement, optimized, and executed 
> atomically.
> There are many flavors of multi-query optimization, depending on whether each 
> occurrence of "multiple" in the previous paragraph is replaced with 0, 1 or 
> "several". Our goal, here, is to allow them all to be expressed. We can then 
> devise planning strategies that work for particular flavors.
> Examples of multi-queries:
>  * {*}Multiple DML outputs{*}. An INSERT statement that writes into a table 
> but also updates an index,
>  * {*}Multiple DML outputs, complex intermediate tables{*}. A DAG that 
> represents an ETL/ELT job;
>  * {*}Multiple query outputs{*}. A query that produces several data sets (say 
> a list of invoices for
> orders and a list of products that need to be restocked);
>  * {*}DAG query{*}. A query that uses intermediate results more than once.
> See discussion in the [Multi-query optimization email 
> thread|https://lists.apache.org/thread/mcdqwrtpx0os54t2nn9vtk17spkp5o5k].
> Here are some SQL examples.
> We add a new keyword {{MULTI}} that represents a statement whose output 
> contains multiple data sets and DML operations, each with a unique name. For 
> intermediate results, we use the existing {{WITH}} clause.
> h3. 1. Multi DML
> Read from one or more tables, write to one or more tables.
> An example is inserting into a table and also an index on that table 
> (represented as a sorted table).
> {code:sql}
> WITH
>   empDelta AS (
>   SELECT * FROM emp2
>   EXCEPT
>   SELECT * FROM emp)
> MULTI
>   insertEmp AS (
> INSERT INTO emp
> TABLE empDelta),
>   insertEmpDeptno AS (
> MERGE empDeptno AS e
> USING TABLE empDelta AS d
> ON e.deptno = d.deptno
> WHEN NOT MATCHED THEN INSERT VALUES (deptno));
> {code}
> h3. 2. Query that creates temporary table and uses it more than once
> {code:sql}
> WITH
>   temp AS (
> SELECT *
> FROM emp AS e
> JOIN dept USING (deptno)
> WHERE e.job = 'MANAGER'
> OR d.location = 'CHICAGO')
> SELECT deptno,
>   (SELECT AVG(sal)
> FROM temp AS t
> WHERE t.deptno = e.deptno) AS deptAvgSal,
>   (SELECT AVG(sal)
> FROM temp AS t
> WHERE t.job = e.job) AS jobAvgSal
> FROM e
> WHERE e.deptno IN (10, 20);
> {code}
> h3. 3. Query whose optimal plan might use a temporary table
> This query produces the same result as the previous query. There is a common 
> relational expression, so the optimizer should consider a DAG plan with a 
> reified intermediate result.
> {code:sql}
> SELECT deptno,
>   (SELECT AVG(e2.sal)
> FROM emp AS e2
> JOIN dept AS d USING (deptno)
> WHERE (e2.job = 'MANAGER'
>   OR d.location = 'CHICAGO')
> AND e2.deptno = e.deptno) AS deptAvgSal,
>   (SELECT AVG(e3.sal)
> FROM emp AS e3
> JOIN dept AS d USING (deptno)
> WHERE (e3.job = 'MANAGER'
>   OR d.location = 'CHICAGO')
> AND e3.job = e.job) AS jobAvgSal
> FROM e
> WHERE e.deptno IN (10, 20);
> {code}
> h3. 4. Query that produces several data sets
> {code:sql}
> WITH
>   newOrders AS (
> SELECT *
> FROM orders
> WHERE orderDate > DATE '2023-01-25')
> MULTI
>   invoices AS (
> SELECT customerName, SUM(amount)
> FROM newOrders
> GROUP BY customerName),
>   restock AS (
> SELECT productId
> FROM inventory
> WHERE productId IN (
>   SELECT productId FROM newOrders)
> AND itemsOnHand < 10);
> {code}
> h3. 5. Query with a complex DAG, multiple output data sets and one DML
> {code:sql}
> WITH
>   t0 AS (
> SELECT * FROM t WHERE x > 5),
>   t00 AS (
> SELECT * FROM t0 WHERE y < 10),
>   t000 AS (
> SELECT x, MIN(y) FROM t00 GROUP BY x),
>   t1 AS (
> SELECT * FROM t WHERE y > 3),
>   t10 AS (
> SELECT * FROM t1 WHERE x < 8),
>   t2 AS (
> SELECT * FROM t000
> INTERSECT
> SELECT * FROM t10),
>   t3 AS (
> SELECT * FROM u
> INTERSECT
> SELECT * FROM t00)
> MULTI
>   q0 AS (SELECT * FROM t3),
>   q1 AS (SELECT * FROM t2),
>   d0 AS (DELETE FROM v WHERE v.x IN (SELECT x FROM t1));
> {code}
> The data flow is the following DAG:
> {noformat}
> u +
>   

[jira] [Commented] (CALCITE-6188) Multi-query optimization

2024-01-07 Thread Julian Hyde (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-6188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17804010#comment-17804010
 ] 

Julian Hyde commented on CALCITE-6188:
--

The DML options are probably syntactic sugar. Note that
{code:java}
oddEmp = (
  DELETE emp
  WHERE empno % 2 = 0),
newDept = (
  UPDATE dept
  SET location = to_lower(location)
  WHERE deptno < 20),
newLocation = (
  INSERT INTO location
  VALUES ('Austin', 'TX', 'HQ')){code}
are equivalent to
{code:java}
oddEmp = (
  SELECT *
  FROM emp
  EXCEPT
  SELECT *
  FROM emp
  WHERE empno % 2 = 0),
newDept = (
  SELECT deptno, dname,
  CASE
  WHEN deptno < 20 THEN to_lower(location)
  ELSE location
  END AS location
  FROM dept),
newLocation = (
  SELECT *
  FROM location
  UNION ALL
  VALUES ('Austin', 'TX', 'HQ')){code}
Except that the cost model is different (writing 999 rows may be more expensive 
than deleting 1 row from a 1,000 row table), and that  we will still need to do 
the 'real' DML at the end if the tables are global, not temporary, tables.

Converting the DML operations into relational operators gives the optimizer 
more opportunities.

> Multi-query optimization
> 
>
> Key: CALCITE-6188
> URL: https://issues.apache.org/jira/browse/CALCITE-6188
> Project: Calcite
>  Issue Type: Improvement
>Reporter: Julian Hyde
>Priority: Major
>
> Devise extensions to SQL so that queries with multiple input tables, multiple 
> intermediate tables, and multiple output data sets (result sets and DML 
> operations) can be defined in one SQL statement, optimized, and executed 
> atomically.
> There are many flavors of multi-query optimization, depending on whether each 
> occurrence of "multiple" in the previous paragraph is replaced with 0, 1 or 
> "several". Our goal, here, is to allow them all to be expressed. We can then 
> devise planning strategies that work for particular flavors.
> Examples of multi-queries:
>  * {*}Multiple DML outputs{*}. An INSERT statement that writes into a table 
> but also updates an index,
>  * {*}Multiple DML outputs, complex intermediate tables{*}. A DAG that 
> represents an ETL/ELT job;
>  * {*}Multiple query outputs{*}. A query that produces several data sets (say 
> a list of invoices for
> orders and a list of products that need to be restocked);
>  * {*}DAG query{*}. A query that uses intermediate results more than once.
> See discussion in the [Multi-query optimization email 
> thread|https://lists.apache.org/thread/mcdqwrtpx0os54t2nn9vtk17spkp5o5k].
> Here are some SQL examples.
> We add a new keyword {{MULTI}} that represents a statement whose output 
> contains multiple data sets and DML operations, each with a unique name. For 
> intermediate results, we use the existing {{WITH}} clause.
> h3. 1. Multi DML
> Read from one or more tables, write to one or more tables.
> An example is inserting into a table and also an index on that table 
> (represented as a sorted table).
> {code:sql}
> WITH
>   empDelta AS (
>   SELECT * FROM emp2
>   EXCEPT
>   SELECT * FROM emp)
> MULTI
>   insertEmp AS (
> INSERT INTO emp
> TABLE empDelta),
>   insertEmpDeptno AS (
> MERGE empDeptno AS e
> USING TABLE empDelta AS d
> ON e.deptno = d.deptno
> WHEN NOT MATCHED THEN INSERT VALUES (deptno));
> {code}
> h3. 2. Query that creates temporary table and uses it more than once
> {code:sql}
> WITH
>   temp AS (
> SELECT *
> FROM emp AS e
> JOIN dept USING (deptno)
> WHERE e.job = 'MANAGER'
> OR d.location = 'CHICAGO')
> SELECT deptno,
>   (SELECT AVG(sal)
> FROM temp AS t
> WHERE t.deptno = e.deptno) AS deptAvgSal,
>   (SELECT AVG(sal)
> FROM temp AS t
> WHERE t.job = e.job) AS jobAvgSal
> FROM e
> WHERE e.deptno IN (10, 20);
> {code}
> h3. 3. Query whose optimal plan might use a temporary table
> This query produces the same result as the previous query. There is a common 
> relational expression, so the optimizer should consider a DAG plan with a 
> reified intermediate result.
> {code:sql}
> SELECT deptno,
>   (SELECT AVG(e2.sal)
> FROM emp AS e2
> JOIN dept AS d USING (deptno)
> WHERE (e2.job = 'MANAGER'
>   OR d.location = 'CHICAGO')
> AND e2.deptno = e.deptno) AS deptAvgSal,
>   (SELECT AVG(e3.sal)
> FROM emp AS e3
> JOIN dept AS d USING (deptno)
> WHERE (e3.job = 'MANAGER'
>   OR d.location = 'CHICAGO')
> AND e3.job = e.job) AS jobAvgSal
> FROM e
> WHERE e.deptno IN (10, 20);
> {code}
> h3. 4. Query that produces several data sets
> {code:sql}
> WITH
>   newOrders AS (
> SELECT *
> FROM orders
> WHERE orderDate > DATE '2023-01-25')
> MULTI
>   invoices AS (
> SELECT customerName, SUM(amount)
> FROM newOrders
> GROUP BY customerName),
>   restock AS (
> SELECT productId
> FROM inventory
> WHERE 

[jira] [Commented] (CALCITE-6188) Multi-query optimization

2024-01-05 Thread Jay Narale (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-6188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17803759#comment-17803759
 ] 

Jay Narale commented on CALCITE-6188:
-

I think the second syntax will work well! This is awesome! 

> Multi-query optimization
> 
>
> Key: CALCITE-6188
> URL: https://issues.apache.org/jira/browse/CALCITE-6188
> Project: Calcite
>  Issue Type: Improvement
>Reporter: Julian Hyde
>Priority: Major
>
> Devise extensions to SQL so that queries with multiple input tables, multiple 
> intermediate tables, and multiple output data sets (result sets and DML 
> operations) can be defined in one SQL statement, optimized, and executed 
> atomically.
> There are many flavors of multi-query optimization, depending on whether each 
> occurrence of "multiple" in the previous paragraph is replaced with 0, 1 or 
> "several". Our goal, here, is to allow them all to be expressed. We can then 
> devise planning strategies that work for particular flavors.
> Examples of multi-queries:
>  * {*}Multiple DML outputs{*}. An INSERT statement that writes into a table 
> but also updates an index,
>  * {*}Multiple DML outputs, complex intermediate tables{*}. A DAG that 
> represents an ETL/ELT job;
>  * {*}Multiple query outputs{*}. A query that produces several data sets (say 
> a list of invoices for
> orders and a list of products that need to be restocked);
>  * {*}DAG query{*}. A query that uses intermediate results more than once.
> See discussion in the [Multi-query optimization email 
> thread|https://lists.apache.org/thread/mcdqwrtpx0os54t2nn9vtk17spkp5o5k].
> Here are some SQL examples.
> We add a new keyword {{MULTI}} that represents a statement whose output 
> contains multiple data sets and DML operations, each with a unique name. For 
> intermediate results, we use the existing {{WITH}} clause.
> h3. 1. Multi DML
> Read from one or more tables, write to one or more tables.
> An example is inserting into a table and also an index on that table 
> (represented as a sorted table).
> {code:sql}
> WITH
>   empDelta AS (
>   SELECT * FROM emp2
>   EXCEPT
>   SELECT * FROM emp)
> MULTI
>   insertEmp AS (
> INSERT INTO emp
> TABLE empDelta),
>   insertEmpDeptno AS (
> MERGE empDeptno AS e
> USING TABLE empDelta AS d
> ON e.deptno = d.deptno
> WHEN NOT MATCHED THEN INSERT VALUES (deptno));
> {code}
> h3. 2. Query that creates temporary table and uses it more than once
> {code:sql}
> WITH
>   temp AS (
> SELECT *
> FROM emp AS e
> JOIN dept USING (deptno)
> WHERE e.job = 'MANAGER'
> OR d.location = 'CHICAGO')
> SELECT deptno,
>   (SELECT AVG(sal)
> FROM temp AS t
> WHERE t.deptno = e.deptno) AS deptAvgSal,
>   (SELECT AVG(sal)
> FROM temp AS t
> WHERE t.job = e.job) AS jobAvgSal
> FROM e
> WHERE e.deptno IN (10, 20);
> {code}
> h3. 3. Query whose optimal plan might use a temporary table
> This query produces the same result as the previous query. There is a common 
> relational expression, so the optimizer should consider a DAG plan with a 
> reified intermediate result.
> {code:sql}
> SELECT deptno,
>   (SELECT AVG(e2.sal)
> FROM emp AS e2
> JOIN dept AS d USING (deptno)
> WHERE (e2.job = 'MANAGER'
>   OR d.location = 'CHICAGO')
> AND e2.deptno = e.deptno) AS deptAvgSal,
>   (SELECT AVG(e3.sal)
> FROM emp AS e3
> JOIN dept AS d USING (deptno)
> WHERE (e3.job = 'MANAGER'
>   OR d.location = 'CHICAGO')
> AND e3.job = e.job) AS jobAvgSal
> FROM e
> WHERE e.deptno IN (10, 20);
> {code}
> h3. 4. Query that produces several data sets
> {code:sql}
> WITH
>   newOrders AS (
> SELECT *
> FROM orders
> WHERE orderDate > DATE '2023-01-25')
> MULTI
>   invoices AS (
> SELECT customerName, SUM(amount)
> FROM newOrders
> GROUP BY customerName),
>   restock AS (
> SELECT productId
> FROM inventory
> WHERE productId IN (
>   SELECT productId FROM newOrders)
> AND itemsOnHand < 10);
> {code}
> h3. 5. Query with a complex DAG, multiple output data sets and one DML
> {code:sql}
> WITH
>   t0 AS (
> SELECT * FROM t WHERE x > 5),
>   t00 AS (
> SELECT * FROM t0 WHERE y < 10),
>   t000 AS (
> SELECT x, MIN(y) FROM t00 GROUP BY x),
>   t1 AS (
> SELECT * FROM t WHERE y > 3),
>   t10 AS (
> SELECT * FROM t1 WHERE x < 8),
>   t2 AS (
> SELECT * FROM t000
> INTERSECT
> SELECT * FROM t10),
>   t3 AS (
> SELECT * FROM u
> INTERSECT
> SELECT * FROM t00)
> MULTI
>   q0 AS (SELECT * FROM t3),
>   q1 AS (SELECT * FROM t2),
>   d0 AS (DELETE FROM v WHERE v.x IN (SELECT x FROM t1));
> {code}
> The data flow is the following DAG:
> {noformat}
> u +
>   +--> t3 (q0)
> +--> t0 --> t00 --+
> | +--> 

[jira] [Commented] (CALCITE-6188) Multi-query optimization

2024-01-05 Thread Julian Hyde (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-6188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17803732#comment-17803732
 ] 

Julian Hyde commented on CALCITE-6188:
--

I had an idea for DML. The idea is that each DML operation should return a CTE 
name that represents the table after the mutation. 
New names are required for each version of a mutated table. This approach is 
consistent with most functional programming languages' approach to mutation. 
For example,
{code:java}
WITH
  oddEmp = (
DELETE emp WHERE empno % 2 = 0),
  oddSalesEmp = (
SELECT *
FROM oddEmp
WHERE deptno = (
  SELECT deptno
  FROM dept
  WHERE dname = 'SALES')),
  largeDept = (
SELECT deptno
FROM emp
GROUP BY deptno
HAVING COUNT(*) > 100)
SELECT max(sal)
FROM oddSalesEmp
WHERE deptno in (SELECT deptno FROM largeDept);
{code}
The first statement changes the value of "emp" but the result is not seen until 
after the whole statement completes. "largeDept" will be defined in terms of 
the original employees (both odd and even). "oddSalesEmp" is defined in terms 
of "oddEmp", the after image of the DELETE.

The example does not show it, but any of those mutation steps could be a 
multi-query. If that multi-query returns several results then it could assign 
multiple names,
{code:java}
WITH
  ...,
  (newEmp, newDept, managers) = (
MULTI
  newEmp = (UPDATE emp SET sal = sal * 2),
  newDept = (UPDATE dept SET location = to_lower(location),
  managers = (SELECT * FROM emp WHERE job = 'MANAGER')) {code}
Thus multi-queries can be nested inside multi-queries (subgraphs in graphs).

> Multi-query optimization
> 
>
> Key: CALCITE-6188
> URL: https://issues.apache.org/jira/browse/CALCITE-6188
> Project: Calcite
>  Issue Type: Improvement
>Reporter: Julian Hyde
>Priority: Major
>
> Devise extensions to SQL so that queries with multiple input tables, multiple 
> intermediate tables, and multiple output data sets (result sets and DML 
> operations) can be defined in one SQL statement, optimized, and executed 
> atomically.
> There are many flavors of multi-query optimization, depending on whether each 
> occurrence of "multiple" in the previous paragraph is replaced with 0, 1 or 
> "several". Our goal, here, is to allow them all to be expressed. We can then 
> devise planning strategies that work for particular flavors.
> Examples of multi-queries:
>  * {*}Multiple DML outputs{*}. An INSERT statement that writes into a table 
> but also updates an index,
>  * {*}Multiple DML outputs, complex intermediate tables{*}. A DAG that 
> represents an ETL/ELT job;
>  * {*}Multiple query outputs{*}. A query that produces several data sets (say 
> a list of invoices for
> orders and a list of products that need to be restocked);
>  * {*}DAG query{*}. A query that uses intermediate results more than once.
> See discussion in the [Multi-query optimization email 
> thread|https://lists.apache.org/thread/mcdqwrtpx0os54t2nn9vtk17spkp5o5k].
> Here are some SQL examples.
> We add a new keyword {{MULTI}} that represents a statement whose output 
> contains multiple data sets and DML operations, each with a unique name. For 
> intermediate results, we use the existing {{WITH}} clause.
> h3. 1. Multi DML
> Read from one or more tables, write to one or more tables.
> An example is inserting into a table and also an index on that table 
> (represented as a sorted table).
> {code:sql}
> WITH
>   empDelta AS (
>   SELECT * FROM emp2
>   EXCEPT
>   SELECT * FROM emp)
> MULTI
>   insertEmp AS (
> INSERT INTO emp
> TABLE empDelta),
>   insertEmpDeptno AS (
> MERGE empDeptno AS e
> USING TABLE empDelta AS d
> ON e.deptno = d.deptno
> WHEN NOT MATCHED THEN INSERT VALUES (deptno));
> {code}
> h3. 2. Query that creates temporary table and uses it more than once
> {code:sql}
> WITH
>   temp AS (
> SELECT *
> FROM emp AS e
> JOIN dept USING (deptno)
> WHERE e.job = 'MANAGER'
> OR d.location = 'CHICAGO')
> SELECT deptno,
>   (SELECT AVG(sal)
> FROM temp AS t
> WHERE t.deptno = e.deptno) AS deptAvgSal,
>   (SELECT AVG(sal)
> FROM temp AS t
> WHERE t.job = e.job) AS jobAvgSal
> FROM e
> WHERE e.deptno IN (10, 20);
> {code}
> h3. 3. Query whose optimal plan might use a temporary table
> This query produces the same result as the previous query. There is a common 
> relational expression, so the optimizer should consider a DAG plan with a 
> reified intermediate result.
> {code:sql}
> SELECT deptno,
>   (SELECT AVG(e2.sal)
> FROM emp AS e2
> JOIN dept AS d USING (deptno)
> WHERE (e2.job = 'MANAGER'
>   OR d.location = 'CHICAGO')
> AND e2.deptno = e.deptno) AS deptAvgSal,
>   (SELECT AVG(e3.sal)
> FROM emp AS e3
> JOIN dept AS d USING (deptno)
> WHERE (e3.job 

[jira] [Commented] (CALCITE-6188) Multi-query optimization

2024-01-05 Thread Jay Narale (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-6188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17803691#comment-17803691
 ] 

Jay Narale commented on CALCITE-6188:
-

In our case for CTE materialization, the series and the parallel case is pretty 
clear. (dependent ctes are in series , independent in parallel) also, DML was 
not needed, since the materialized CTE is only used once the insert is 
completed. Multi SQL might need that or other operations if streaming is 
planned to be supported.

> Multi-query optimization
> 
>
> Key: CALCITE-6188
> URL: https://issues.apache.org/jira/browse/CALCITE-6188
> Project: Calcite
>  Issue Type: Improvement
>Reporter: Julian Hyde
>Priority: Major
>
> Devise extensions to SQL so that queries with multiple input tables, multiple 
> intermediate tables, and multiple output data sets (result sets and DML 
> operations) can be defined in one SQL statement, optimized, and executed 
> atomically.
> There are many flavors of multi-query optimization, depending on whether each 
> occurrence of "multiple" in the previous paragraph is replaced with 0, 1 or 
> "several". Our goal, here, is to allow them all to be expressed. We can then 
> devise planning strategies that work for particular flavors.
> Examples of multi-queries:
>  * {*}Multiple DML outputs{*}. An INSERT statement that writes into a table 
> but also updates an index,
>  * {*}Multiple DML outputs, complex intermediate tables{*}. A DAG that 
> represents an ETL/ELT job;
>  * {*}Multiple query outputs{*}. A query that produces several data sets (say 
> a list of invoices for
> orders and a list of products that need to be restocked);
>  * {*}DAG query{*}. A query that uses intermediate results more than once.
> See discussion in the [Multi-query optimization email 
> thread|https://lists.apache.org/thread/mcdqwrtpx0os54t2nn9vtk17spkp5o5k].
> Here are some SQL examples.
> We add a new keyword {{MULTI}} that represents a statement whose output 
> contains multiple data sets and DML operations, each with a unique name. For 
> intermediate results, we use the existing {{WITH}} clause.
> h3. 1. Multi DML
> Read from one or more tables, write to one or more tables.
> An example is inserting into a table and also an index on that table 
> (represented as a sorted table).
> {code:sql}
> WITH
>   empDelta AS (
>   SELECT * FROM emp2
>   EXCEPT
>   SELECT * FROM emp)
> MULTI
>   insertEmp AS (
> INSERT INTO emp
> TABLE empDelta),
>   insertEmpDeptno AS (
> MERGE empDeptno AS e
> USING TABLE empDelta AS d
> ON e.deptno = d.deptno
> WHEN NOT MATCHED THEN INSERT VALUES (deptno));
> {code}
> h3. 2. Query that creates temporary table and uses it more than once
> {code:sql}
> WITH
>   temp AS (
> SELECT *
> FROM emp AS e
> JOIN dept USING (deptno)
> WHERE e.job = 'MANAGER'
> OR d.location = 'CHICAGO')
> SELECT deptno,
>   (SELECT AVG(sal)
> FROM temp AS t
> WHERE t.deptno = e.deptno) AS deptAvgSal,
>   (SELECT AVG(sal)
> FROM temp AS t
> WHERE t.job = e.job) AS jobAvgSal
> FROM e
> WHERE e.deptno IN (10, 20);
> {code}
> h3. 3. Query whose optimal plan might use a temporary table
> This query produces the same result as the previous query. There is a common 
> relational expression, so the optimizer should consider a DAG plan with a 
> reified intermediate result.
> {code:sql}
> SELECT deptno,
>   (SELECT AVG(e2.sal)
> FROM emp AS e2
> JOIN dept AS d USING (deptno)
> WHERE (e2.job = 'MANAGER'
>   OR d.location = 'CHICAGO')
> AND e2.deptno = e.deptno) AS deptAvgSal,
>   (SELECT AVG(e3.sal)
> FROM emp AS e3
> JOIN dept AS d USING (deptno)
> WHERE (e3.job = 'MANAGER'
>   OR d.location = 'CHICAGO')
> AND e3.job = e.job) AS jobAvgSal
> FROM e
> WHERE e.deptno IN (10, 20);
> {code}
> h3. 4. Query that produces several data sets
> {code:sql}
> WITH
>   newOrders AS (
> SELECT *
> FROM orders
> WHERE orderDate > DATE '2023-01-25')
> MULTI
>   invoices AS (
> SELECT customerName, SUM(amount)
> FROM newOrders
> GROUP BY customerName),
>   restock AS (
> SELECT productId
> FROM inventory
> WHERE productId IN (
>   SELECT productId FROM newOrders)
> AND itemsOnHand < 10);
> {code}
> h3. 5. Query with a complex DAG, multiple output data sets and one DML
> {code:sql}
> WITH
>   t0 AS (
> SELECT * FROM t WHERE x > 5),
>   t00 AS (
> SELECT * FROM t0 WHERE y < 10),
>   t000 AS (
> SELECT x, MIN(y) FROM t00 GROUP BY x),
>   t1 AS (
> SELECT * FROM t WHERE y > 3),
>   t10 AS (
> SELECT * FROM t1 WHERE x < 8),
>   t2 AS (
> SELECT * FROM t000
> INTERSECT
> SELECT * FROM t10),
>   t3 AS (
> SELECT * FROM u
> INTERSECT
> SELECT * FROM t00)
> MULTI
>   q0 AS (SELECT * FROM t3),
>   q1 AS 

[jira] [Commented] (CALCITE-6188) Multi-query optimization

2024-01-05 Thread Julian Hyde (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-6188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17803650#comment-17803650
 ] 

Julian Hyde commented on CALCITE-6188:
--

I got some feedback on twitter that we need to be able to specify whether 
operations happen in series or parallel. If the items in the WITH clause are 
nodes, do we also need arcs in some form?

Also, do we need DML nodes? Every “WITH t” item at present is implicitly a 
“CREATE TEMPORARY TABLE t AS …” command. Do we need to support INSERT, UPDATE, 
DELETE, MERGE on intermediate results?

> Multi-query optimization
> 
>
> Key: CALCITE-6188
> URL: https://issues.apache.org/jira/browse/CALCITE-6188
> Project: Calcite
>  Issue Type: Improvement
>Reporter: Julian Hyde
>Priority: Major
>
> Devise extensions to SQL so that queries with multiple input tables, multiple 
> intermediate tables, and multiple output data sets (result sets and DML 
> operations) can be defined in one SQL statement, optimized, and executed 
> atomically.
> There are many flavors of multi-query optimization, depending on whether each 
> occurrence of "multiple" in the previous paragraph is replaced with 0, 1 or 
> "several". Our goal, here, is to allow them all to be expressed. We can then 
> devise planning strategies that work for particular flavors.
> Examples of multi-queries:
>  * {*}Multiple DML outputs{*}. An INSERT statement that writes into a table 
> but also updates an index,
>  * {*}Multiple DML outputs, complex intermediate tables{*}. A DAG that 
> represents an ETL/ELT job;
>  * {*}Multiple query outputs{*}. A query that produces several data sets (say 
> a list of invoices for
> orders and a list of products that need to be restocked);
>  * {*}DAG query{*}. A query that uses intermediate results more than once.
> See discussion in the [Multi-query optimization email 
> thread|https://lists.apache.org/thread/mcdqwrtpx0os54t2nn9vtk17spkp5o5k].
> Here are some SQL examples.
> We add a new keyword {{MULTI}} that represents a statement whose output 
> contains multiple data sets and DML operations, each with a unique name. For 
> intermediate results, we use the existing {{WITH}} clause.
> h3. 1. Multi DML
> Read from one or more tables, write to one or more tables.
> An example is inserting into a table and also an index on that table 
> (represented as a sorted table).
> {code:sql}
> WITH
>   empDelta AS (
>   SELECT * FROM emp2
>   EXCEPT
>   SELECT * FROM emp)
> MULTI
>   insertEmp AS (
> INSERT INTO emp
> TABLE empDelta),
>   insertEmpDeptno AS (
> MERGE empDeptno AS e
> USING TABLE empDelta AS d
> ON e.deptno = d.deptno
> WHEN NOT MATCHED THEN INSERT VALUES (deptno));
> {code}
> h3. 2. Query that creates temporary table and uses it more than once
> {code:sql}
> WITH
>   temp AS (
> SELECT *
> FROM emp AS e
> JOIN dept USING (deptno)
> WHERE e.job = 'MANAGER'
> OR d.location = 'CHICAGO')
> SELECT deptno,
>   (SELECT AVG(sal)
> FROM temp AS t
> WHERE t.deptno = e.deptno) AS deptAvgSal,
>   (SELECT AVG(sal)
> FROM temp AS t
> WHERE t.job = e.job) AS jobAvgSal
> FROM e
> WHERE e.deptno IN (10, 20);
> {code}
> h3. 3. Query whose optimal plan might use a temporary table
> This query produces the same result as the previous query. There is a common 
> relational expression, so the optimizer should consider a DAG plan with a 
> reified intermediate result.
> {code:sql}
> SELECT deptno,
>   (SELECT AVG(e2.sal)
> FROM emp AS e2
> JOIN dept AS d USING (deptno)
> WHERE (e2.job = 'MANAGER'
>   OR d.location = 'CHICAGO')
> AND e2.deptno = e.deptno) AS deptAvgSal,
>   (SELECT AVG(e3.sal)
> FROM emp AS e3
> JOIN dept AS d USING (deptno)
> WHERE (e3.job = 'MANAGER'
>   OR d.location = 'CHICAGO')
> AND e3.job = e.job) AS jobAvgSal
> FROM e
> WHERE e.deptno IN (10, 20);
> {code}
> h3. 4. Query that produces several data sets
> {code:sql}
> WITH
>   newOrders AS (
> SELECT *
> FROM orders
> WHERE orderDate > DATE '2023-01-25')
> MULTI
>   invoices AS (
> SELECT customerName, SUM(amount)
> FROM newOrders
> GROUP BY customerName),
>   restock AS (
> SELECT productId
> FROM inventory
> WHERE productId IN (
>   SELECT productId FROM newOrders)
> AND itemsOnHand < 10);
> {code}
> h3. 5. Query with a complex DAG, multiple output data sets and one DML
> {code:sql}
> WITH
>   t0 AS (
> SELECT * FROM t WHERE x > 5),
>   t00 AS (
> SELECT * FROM t0 WHERE y < 10),
>   t000 AS (
> SELECT x, MIN(y) FROM t00 GROUP BY x),
>   t1 AS (
> SELECT * FROM t WHERE y > 3),
>   t10 AS (
> SELECT * FROM t1 WHERE x < 8),
>   t2 AS (
> SELECT * FROM t000
> INTERSECT
> SELECT * FROM t10),
>   t3 AS (
> SELECT * FROM u
> INTERSECT
> SELECT 

[jira] [Commented] (CALCITE-6188) Multi-query optimization

2024-01-04 Thread Jay Narale (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-6188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17803376#comment-17803376
 ] 

Jay Narale commented on CALCITE-6188:
-

Not sure if it is the same but currently we treat the Sequence node strictly as 
a logical entity, which has the information of the DAG. We then manage the flow 
of the Directed Acyclic Graph (DAG) through our scheduler. This approach 
eliminates the necessity for a physical operator to handle or maintain 
consistency.

 

 

 

> Multi-query optimization
> 
>
> Key: CALCITE-6188
> URL: https://issues.apache.org/jira/browse/CALCITE-6188
> Project: Calcite
>  Issue Type: Improvement
>Reporter: Julian Hyde
>Priority: Major
>
> Devise extensions to SQL so that queries with multiple input tables, multiple 
> intermediate tables, and multiple output data sets (result sets and DML 
> operations) can be defined in one SQL statement, optimized, and executed 
> atomically.
> There are many flavors of multi-query optimization, depending on whether each 
> occurrence of "multiple" in the previous paragraph is replaced with 0, 1 or 
> "several". Our goal, here, is to allow them all to be expressed. We can then 
> devise planning strategies that work for particular flavors.
> Examples of multi-queries:
>  * {*}Multiple DML outputs{*}. An INSERT statement that writes into a table 
> but also updates an index,
>  * {*}Multiple DML outputs, complex intermediate tables{*}. A DAG that 
> represents an ETL/ELT job;
>  * {*}Multiple query outputs{*}. A query that produces several data sets (say 
> a list of invoices for
> orders and a list of products that need to be restocked);
>  * {*}DAG query{*}. A query that uses intermediate results more than once.
> See discussion in the [Multi-query optimization email 
> thread|https://lists.apache.org/thread/mcdqwrtpx0os54t2nn9vtk17spkp5o5k].
> Here are some SQL examples.
> We add a new keyword {{MULTI}} that represents a statement whose output 
> contains multiple data sets and DML operations, each with a unique name. For 
> intermediate results, we use the existing {{WITH}} clause.
> h3. 1. Multi DML
> Read from one or more tables, write to one or more tables.
> An example is inserting into a table and also an index on that table 
> (represented as a sorted table).
> {code:sql}
> WITH
>   empDelta AS (
>   SELECT * FROM emp2
>   EXCEPT
>   SELECT * FROM emp)
> MULTI
>   insertEmp AS (
> INSERT INTO emp
> TABLE empDelta),
>   insertEmpDeptno AS (
> MERGE empDeptno AS e
> USING TABLE empDelta AS d
> ON e.deptno = d.deptno
> WHEN NOT MATCHED THEN INSERT VALUES (deptno));
> {code}
> h3. 2. Query that creates temporary table and uses it more than once
> {code:sql}
> WITH
>   temp AS (
> SELECT *
> FROM emp AS e
> JOIN dept USING (deptno)
> WHERE e.job = 'MANAGER'
> OR d.location = 'CHICAGO')
> SELECT deptno,
>   (SELECT AVG(sal)
> FROM temp AS t
> WHERE t.deptno = e.deptno) AS deptAvgSal,
>   (SELECT AVG(sal)
> FROM temp AS t
> WHERE t.job = e.job) AS jobAvgSal
> FROM e
> WHERE e.deptno IN (10, 20);
> {code}
> h3. 3. Query whose optimal plan might use a temporary table
> This query produces the same result as the previous query. There is a common 
> relational expression, so the optimizer should consider a DAG plan with a 
> reified intermediate result.
> {code:sql}
> SELECT deptno,
>   (SELECT AVG(e2.sal)
> FROM emp AS e2
> JOIN dept AS d USING (deptno)
> WHERE (e2.job = 'MANAGER'
>   OR d.location = 'CHICAGO')
> AND e2.deptno = e.deptno) AS deptAvgSal,
>   (SELECT AVG(e3.sal)
> FROM emp AS e3
> JOIN dept AS d USING (deptno)
> WHERE (e3.job = 'MANAGER'
>   OR d.location = 'CHICAGO')
> AND e3.job = e.job) AS jobAvgSal
> FROM e
> WHERE e.deptno IN (10, 20);
> {code}
> h3. 4. Query that produces several data sets
> {code:sql}
> WITH
>   newOrders AS (
> SELECT *
> FROM orders
> WHERE orderDate > DATE '2023-01-25')
> MULTI
>   invoices AS (
> SELECT customerName, SUM(amount)
> FROM newOrders
> GROUP BY customerName),
>   restock AS (
> SELECT productId
> FROM inventory
> WHERE productId IN (
>   SELECT productId FROM newOrders)
> AND itemsOnHand < 10);
> {code}
> h3. 5. Query with a complex DAG, multiple output data sets and one DML
> {code:sql}
> WITH
>   t0 AS (
> SELECT * FROM t WHERE x > 5),
>   t00 AS (
> SELECT * FROM t0 WHERE y < 10),
>   t000 AS (
> SELECT x, MIN(y) FROM t00 GROUP BY x),
>   t1 AS (
> SELECT * FROM t WHERE y > 3),
>   t10 AS (
> SELECT * FROM t1 WHERE x < 8),
>   t2 AS (
> SELECT * FROM t000
> INTERSECT
> SELECT * FROM t10),
>   t3 AS (
> SELECT * FROM u
> INTERSECT
> SELECT * FROM t00)
> MULTI
>   q0 AS (SELECT * FROM t3),
>   q1 AS 

[jira] [Commented] (CALCITE-6188) Multi-query optimization

2024-01-04 Thread Julian Hyde (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-6188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17803373#comment-17803373
 ] 

Julian Hyde commented on CALCITE-6188:
--

Thank you, [~jaynarale]! The sequence node is not something I had considered.

I have only come across the case where one DAG node sends a boolean value 
(empty result set) to its successor, saying 'I've finished'. Whereas - if I 
understand correctly - your sequence says 'I've finished and I have released 
all resources'. 

> Multi-query optimization
> 
>
> Key: CALCITE-6188
> URL: https://issues.apache.org/jira/browse/CALCITE-6188
> Project: Calcite
>  Issue Type: Improvement
>Reporter: Julian Hyde
>Priority: Major
>
> Devise extensions to SQL so that queries with multiple input tables, multiple 
> intermediate tables, and multiple output data sets (result sets and DML 
> operations) can be defined in one SQL statement, optimized, and executed 
> atomically.
> There are many flavors of multi-query optimization, depending on whether each 
> occurrence of "multiple" in the previous paragraph is replaced with 0, 1 or 
> "several". Our goal, here, is to allow them all to be expressed. We can then 
> devise planning strategies that work for particular flavors.
> Examples of multi-queries:
>  * {*}Multiple DML outputs{*}. An INSERT statement that writes into a table 
> but also updates an index,
>  * {*}Multiple DML outputs, complex intermediate tables{*}. A DAG that 
> represents an ETL/ELT job;
>  * {*}Multiple query outputs{*}. A query that produces several data sets (say 
> a list of invoices for
> orders and a list of products that need to be restocked);
>  * {*}DAG query{*}. A query that uses intermediate results more than once.
> See discussion in the [Multi-query optimization email 
> thread|https://lists.apache.org/thread/mcdqwrtpx0os54t2nn9vtk17spkp5o5k].
> Here are some SQL examples.
> We add a new keyword {{MULTI}} that represents a statement whose output 
> contains multiple data sets and DML operations, each with a unique name. For 
> intermediate results, we use the existing {{WITH}} clause.
> h3. 1. Multi DML
> Read from one or more tables, write to one or more tables.
> An example is inserting into a table and also an index on that table 
> (represented as a sorted table).
> {code:sql}
> WITH
>   empDelta AS (
>   SELECT * FROM emp2
>   EXCEPT
>   SELECT * FROM emp)
> MULTI
>   insertEmp AS (
> INSERT INTO emp
> TABLE empDelta),
>   insertEmpDeptno AS (
> MERGE empDeptno AS e
> USING TABLE empDelta AS d
> ON e.deptno = d.deptno
> WHEN NOT MATCHED THEN INSERT VALUES (deptno));
> {code}
> h3. 2. Query that creates temporary table and uses it more than once
> {code:sql}
> WITH
>   temp AS (
> SELECT *
> FROM emp AS e
> JOIN dept USING (deptno)
> WHERE e.job = 'MANAGER'
> OR d.location = 'CHICAGO')
> SELECT deptno,
>   (SELECT AVG(sal)
> FROM temp AS t
> WHERE t.deptno = e.deptno) AS deptAvgSal,
>   (SELECT AVG(sal)
> FROM temp AS t
> WHERE t.job = e.job) AS jobAvgSal
> FROM e
> WHERE e.deptno IN (10, 20);
> {code}
> h3. 3. Query whose optimal plan might use a temporary table
> This query produces the same result as the previous query. There is a common 
> relational expression, so the optimizer should consider a DAG plan with a 
> reified intermediate result.
> {code:sql}
> SELECT deptno,
>   (SELECT AVG(e2.sal)
> FROM emp AS e2
> JOIN dept AS d USING (deptno)
> WHERE (e2.job = 'MANAGER'
>   OR d.location = 'CHICAGO')
> AND e2.deptno = e.deptno) AS deptAvgSal,
>   (SELECT AVG(e3.sal)
> FROM emp AS e3
> JOIN dept AS d USING (deptno)
> WHERE (e3.job = 'MANAGER'
>   OR d.location = 'CHICAGO')
> AND e3.job = e.job) AS jobAvgSal
> FROM e
> WHERE e.deptno IN (10, 20);
> {code}
> h3. 4. Query that produces several data sets
> {code:sql}
> WITH
>   newOrders AS (
> SELECT *
> FROM orders
> WHERE orderDate > DATE '2023-01-25')
> MULTI
>   invoices AS (
> SELECT customerName, SUM(amount)
> FROM newOrders
> GROUP BY customerName),
>   restock AS (
> SELECT productId
> FROM inventory
> WHERE productId IN (
>   SELECT productId FROM newOrders)
> AND itemsOnHand < 10);
> {code}
> h3. 5. Query with a complex DAG, multiple output data sets and one DML
> {code:sql}
> WITH
>   t0 AS (
> SELECT * FROM t WHERE x > 5),
>   t00 AS (
> SELECT * FROM t0 WHERE y < 10),
>   t000 AS (
> SELECT x, MIN(y) FROM t00 GROUP BY x),
>   t1 AS (
> SELECT * FROM t WHERE y > 3),
>   t10 AS (
> SELECT * FROM t1 WHERE x < 8),
>   t2 AS (
> SELECT * FROM t000
> INTERSECT
> SELECT * FROM t10),
>   t3 AS (
> SELECT * FROM u
> INTERSECT
> SELECT * FROM t00)
> MULTI
>   q0 AS (SELECT * FROM t3),
>   q1 AS 

[jira] [Commented] (CALCITE-6188) Multi-query optimization

2024-01-04 Thread Jay Narale (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-6188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17803368#comment-17803368
 ] 

Jay Narale commented on CALCITE-6188:
-

We in Presto have recently supported CTE materialization 
[https://github.com/prestodb/presto/pull/20887. 
|https://github.com/prestodb/presto/pull/20887]The implementation heavily uses 
a new type of logical planNode called a sequence node. Though our current 
implementation is not 100% optimized, we have a similar goal of optimizing all 
DAGs together.

> Multi-query optimization
> 
>
> Key: CALCITE-6188
> URL: https://issues.apache.org/jira/browse/CALCITE-6188
> Project: Calcite
>  Issue Type: Improvement
>Reporter: Julian Hyde
>Priority: Major
>
> Devise extensions to SQL so that queries with multiple input tables, multiple 
> intermediate tables, and multiple output data sets (result sets and DML 
> operations) can be defined in one SQL statement, optimized, and executed 
> atomically.
> There are many flavors of multi-query optimization, depending on whether each 
> occurrence of "multiple" in the previous paragraph is replaced with 0, 1 or 
> "several". Our goal, here, is to allow them all to be expressed. We can then 
> devise planning strategies that work for particular flavors.
> Examples of multi-queries:
>  * {*}Multiple DML outputs{*}. An INSERT statement that writes into a table 
> but also updates an index,
>  * {*}Multiple DML outputs, complex intermediate tables{*}. A DAG that 
> represents an ETL/ELT job;
>  * {*}Multiple query outputs{*}. A query that produces several data sets (say 
> a list of invoices for
> orders and a list of products that need to be restocked);
>  * {*}DAG query{*}. A query that uses intermediate results more than once.
> See discussion in the [Multi-query optimization email 
> thread|https://lists.apache.org/thread/mcdqwrtpx0os54t2nn9vtk17spkp5o5k].
> Here are some SQL examples.
> We add a new keyword {{MULTI}} that represents a statement whose output 
> contains multiple data sets and DML operations, each with a unique name. For 
> intermediate results, we use the existing {{WITH}} clause.
> h3. 1. Multi DML
> Read from one or more tables, write to one or more tables.
> An example is inserting into a table and also an index on that table 
> (represented as a sorted table).
> {code:sql}
> WITH
>   empDelta AS (
>   SELECT * FROM emp2
>   EXCEPT
>   SELECT * FROM emp)
> MULTI
>   insertEmp AS (
> INSERT INTO emp
> TABLE empDelta),
>   insertEmpDeptno AS (
> MERGE empDeptno AS e
> USING TABLE empDelta AS d
> ON e.deptno = d.deptno
> WHEN NOT MATCHED THEN INSERT VALUES (deptno));
> {code}
> h3. 2. Query that creates temporary table and uses it more than once
> {code:sql}
> WITH
>   temp AS (
> SELECT *
> FROM emp AS e
> JOIN dept USING (deptno)
> WHERE e.job = 'MANAGER'
> OR d.location = 'CHICAGO')
> SELECT deptno,
>   (SELECT AVG(sal)
> FROM temp AS t
> WHERE t.deptno = e.deptno) AS deptAvgSal,
>   (SELECT AVG(sal)
> FROM temp AS t
> WHERE t.job = e.job) AS jobAvgSal
> FROM e
> WHERE e.deptno IN (10, 20);
> {code}
> h3. 3. Query that should create temporary table
> This query produces the same result as the previous query. There is a common 
> relational expression, so the optimizer should consider a DAG plan.
> {code:sql}
> SELECT deptno,
>   (SELECT AVG(e2.sal)
> FROM emp AS e2
> JOIN dept AS d USING (deptno)
> WHERE (e2.job = 'MANAGER'
>   OR d.location = 'CHICAGO')
> AND e2.deptno = e.deptno) AS deptAvgSal,
>   (SELECT AVG(e3.sal)
> FROM emp AS e3
> JOIN dept AS d USING (deptno)
> WHERE (e3.job = 'MANAGER'
>   OR d.location = 'CHICAGO')
> AND e3.job = e.job) AS jobAvgSal
> FROM e
> WHERE e.deptno IN (10, 20);
> {code}
> h3. 4. Query that produces several data sets
> {code:sql}
> WITH
>   newOrders AS (
> SELECT *
> FROM orders
> WHERE orderDate > DATE '2023-01-25')
> MULTI
>   invoices AS (
> SELECT customerName, SUM(amount)
> FROM newOrders
> GROUP BY customerName),
>   restock AS (
> SELECT productId
> FROM inventory
> WHERE productId IN (
>   SELECT productId FROM newOrders)
> AND itemsOnHand < 10);
> {code}
> h3. 5. Query with a complex DAG, multiple output data sets and one DML
> {code:sql}
> WITH
>   t0 AS (
> SELECT * FROM t WHERE x > 5),
>   t00 AS (
> SELECT * FROM t0 WHERE y < 10),
>   t000 AS (
> SELECT x, MIN(y) FROM t00 GROUP BY x),
>   t1 AS (
> SELECT * FROM t WHERE y > 3),
>   t10 AS (
> SELECT * FROM t1 WHERE x < 8),
>   t2 AS (
> SELECT * FROM t000
> INTERSECT
> SELECT * FROM t10),
>   t3 AS (
> SELECT * FROM u
> INTERSECT
> SELECT * FROM t00)
> MULTI (
>   q0 AS (SELECT * FROM t2),
>   q1 AS (SELECT * FROM t3),
>