Julian Hyde created CALCITE-6716:
------------------------------------
Summary: WITH REPEATABLE (deterministic CTEs)
Key: CALCITE-6716
URL: https://issues.apache.org/jira/browse/CALCITE-6716
Project: Calcite
Issue Type: Bug
Reporter: Julian Hyde
Add a REPEATABLE keyword to the WITH clause (that defines common-table
expressions, CTEs) to ensure that if the CTE is used more than once in the
query it will return the same set of rows each time (modulo sorting).
In SQL, most functions and relational operators are deterministic (return the
same results given the same inputs/arguments). Base tables (due to isolation
semantics) will return the same result if used twice in a query. Views and CTEs
will return the same results if all of their constituent tables, functions and
operators are deterministic.
A CTE is repeatable if it is based on a deterministic query. If the query is
nondeterministic, we can make the CTE deterministic by materializing it
(storing the results in memory or temp disk the first it is used). In many
systems CTEs are always materialized or it is an option see
([Postgres|https://dba.stackexchange.com/questions/257014/are-there-side-effects-to-postgres-12s-not-materialized-directive]).
But what if you want repeatable results without paying the cost of
materialization? Whether to materialize is a physical decision ('how') that can
be left to the optimizer, whereas repeatability is a logical requirement
('what'). SQL works best when the syntax allows people to ask for the 'what'
without specifying the 'how'.
Here is an example with proposed syntax:
{code}
WITH topManagers AS REPEATABLE (
SELECT *
FROM employees
WHERE job = 'MANAGER'
ORDER BY sal
LIMIT 5)
SELECT *
FROM topManagers
MINUS
SELECT *
FROM topManagers
{code}
Although topManagers is nondeterministic -- there could be two managers with
the same salary -- this query will always return 0 rows.
The implementation requires two tools:
* Determine whether a query is deterministic
* Modify a nondeterministic query so that it is deterministic (within the same
statement execution) and returns one of the valid results. Operation may fail,
and say 'I cannot make this deterministic'.
h2. Nondeterministic features and how to fix them
Here are some nondeterministic features, and ways to make them deterministic.
h3. LIMIT and OFFSET
A query with LIMIT or OFFSET is nondeterministic when there is ORDER BY or with
the ORDER BY is not exhaustive (may have ties). Example:
{code}
SELECT * FROM employees ORDER BY sal LIMIT 5
{code}
May be fixed by making the ORDER BY deterministic, e.g. by adding primary key
to break ties:
{code}
SELECT * FROM employees ORDER BY sal, empno LIMIT 5
{code}
(Not possible if there is no primary key.)
h3. ROWS inside window aggregate
ROWS inside a window aggregate has a similar problem to LIMIT and OFFSET, and
similar solution:
{code}
SELECT avg(sal) OVER (ORDER BY hiredate ROWS 3 PRECEDING)
FROM employees
{code}
Fixed:
{code}
SELECT avg(sal) OVER (ORDER BY hiredate, empno ROWS 3 PRECEDING)
FROM employees
{code}
h3. RANK and DENSE_RANK
TODO
h3. RANDOM
Not easily fixed.
h3. GROUP BY strings with case-insensitive collation
Suppose job is a VARCHAR column with case-insensitive collation. Suppose that
the values 'Sales' and 'SALES' both exist, and they collate the same. If we
execute the query
{code}
SELECT DISTINCT job
FROM employees
{code}
the result will either contain 'Sales' or 'SALES', but which occurs depends on
which value lands in the hash table first. We can fix as follows:
{code}
SELECT MIN(job COLLATE rawBits)
FROM employees
GROUP BY job
{code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)