Re: [I] Avoid inlining non deterministic CTE [datafusion]

2024-05-07 Thread via GitHub


tgujar commented on issue #10337:
URL: https://github.com/apache/datafusion/issues/10337#issuecomment-2098697931

   I think this would push the responsibility to the user to figure out what 
may be non-deterministic. I am not sure if this would be a good approach


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Avoid inlining non deterministic CTE [datafusion]

2024-05-06 Thread via GitHub


jonahgao commented on issue #10337:
URL: https://github.com/apache/datafusion/issues/10337#issuecomment-2096302497

   I worry that judging whether a query is non-deterministic may be not easy. 
Perhaps we can first leave this judgment to the user, only do this when the 
user specifies 
[Materialized](https://github.com/sqlparser-rs/sqlparser-rs/blob/a86c58b1c93960d50476713ff56e02bf94d41436/src/ast/query.rs#L452)
 and there are multiple references.
   
   DuckDB may have taken a similar approach.
   ```sh
   v0.10.3-dev779 d26007417b
   
   D with cte as (select random()) select * from cte union select * from cte;
   ┌┐
   │  random()  │
   │   double   │
   ├┤
   │ 0.9430218460038304 │
   │ 0.3114725165069103 │
   └┘
   D with cte as materialized (select random()) select * from cte union select 
* from cte;
   ┌─┐
   │  random()   │
   │   double│
   ├─┤
   │ 0.13616445031948388 │
   └─┘
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Avoid inlining non deterministic CTE [datafusion]

2024-05-03 Thread via GitHub


tgujar commented on issue #10337:
URL: https://github.com/apache/datafusion/issues/10337#issuecomment-2093034062

   I can work on this if we can confirm in this is indeed a correctly reported 
bug. Let me know what you think, thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[I] Avoid inlining non deterministic CTE [datafusion]

2024-05-01 Thread via GitHub


tgujar opened a new issue, #10337:
URL: https://github.com/apache/datafusion/issues/10337

   ### Describe the bug
   
   Currently Datafusion will inline all CTE, a non-deterministic expression can 
be executed multiple times producing different results
   
   
   ### To Reproduce
   
   Consider the following query which uses the `aggregate_test_100` data from 
datafusion-examples. Here, column c11 is a Float64 
   ```
   WITH cte as (
   SELECT sum(c4 * c11) as total 
   FROM aggregate_test_100 
   GROUP BY c1) 
   SELECT total 
   FROM cte 
   WHERE total = (select max(total) from cte)
   ```
   
   The optimized plan generated will inline the CTE and thus execute it twice
   ```
   Projection: cte.total
 Inner Join: cte.total = __scalar_sq_1.MAX(cte.total)
   SubqueryAlias: cte
 Projection: SUM(aggregate_test_100.c4 * aggregate_test_100.c11) AS 
total
   Aggregate: groupBy=[[aggregate_test_100.c1]], 
aggr=[[SUM(CAST(aggregate_test_100.c4 AS Float64) * aggregate_test_100.c11)]]
 TableScan: aggregate_test_100 projection=[c1, c4, c11]
   SubqueryAlias: __scalar_sq_1
 Aggregate: groupBy=[[]], aggr=[[MAX(cte.total)]]
   SubqueryAlias: cte
 Projection: SUM(aggregate_test_100.c4 * aggregate_test_100.c11) AS 
total
   Aggregate: groupBy=[[aggregate_test_100.c1]], 
aggr=[[SUM(CAST(aggregate_test_100.c4 AS Float64) * aggregate_test_100.c11)]]
 TableScan: aggregate_test_100 projection=[c1, c4, c11]
   ```
   
   ### Expected behavior
   
   Since summation here is dependent on ordering, I believe it is incorrect to 
inline the CTE here and execute it more than once.
   
   ### Additional context
   
   Related issue, which talks about possible advantages on not inlining CTE in 
some cases: https://github.com/apache/datafusion/issues/8777
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org