Re: [I] Avoid inlining non deterministic CTE [datafusion]
tgujar commented on issue #10337: URL: https://github.com/apache/datafusion/issues/10337#issuecomment-2098697931 I think this would push the responsibility to the user to figure out what may be non-deterministic. I am not sure if this would be a good approach -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Avoid inlining non deterministic CTE [datafusion]
jonahgao commented on issue #10337: URL: https://github.com/apache/datafusion/issues/10337#issuecomment-2096302497 I worry that judging whether a query is non-deterministic may be not easy. Perhaps we can first leave this judgment to the user, only do this when the user specifies [Materialized](https://github.com/sqlparser-rs/sqlparser-rs/blob/a86c58b1c93960d50476713ff56e02bf94d41436/src/ast/query.rs#L452) and there are multiple references. DuckDB may have taken a similar approach. ```sh v0.10.3-dev779 d26007417b D with cte as (select random()) select * from cte union select * from cte; ┌┐ │ random() │ │ double │ ├┤ │ 0.9430218460038304 │ │ 0.3114725165069103 │ └┘ D with cte as materialized (select random()) select * from cte union select * from cte; ┌─┐ │ random() │ │ double│ ├─┤ │ 0.13616445031948388 │ └─┘ ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Avoid inlining non deterministic CTE [datafusion]
tgujar commented on issue #10337: URL: https://github.com/apache/datafusion/issues/10337#issuecomment-2093034062 I can work on this if we can confirm in this is indeed a correctly reported bug. Let me know what you think, thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] Avoid inlining non deterministic CTE [datafusion]
tgujar opened a new issue, #10337: URL: https://github.com/apache/datafusion/issues/10337 ### Describe the bug Currently Datafusion will inline all CTE, a non-deterministic expression can be executed multiple times producing different results ### To Reproduce Consider the following query which uses the `aggregate_test_100` data from datafusion-examples. Here, column c11 is a Float64 ``` WITH cte as ( SELECT sum(c4 * c11) as total FROM aggregate_test_100 GROUP BY c1) SELECT total FROM cte WHERE total = (select max(total) from cte) ``` The optimized plan generated will inline the CTE and thus execute it twice ``` Projection: cte.total Inner Join: cte.total = __scalar_sq_1.MAX(cte.total) SubqueryAlias: cte Projection: SUM(aggregate_test_100.c4 * aggregate_test_100.c11) AS total Aggregate: groupBy=[[aggregate_test_100.c1]], aggr=[[SUM(CAST(aggregate_test_100.c4 AS Float64) * aggregate_test_100.c11)]] TableScan: aggregate_test_100 projection=[c1, c4, c11] SubqueryAlias: __scalar_sq_1 Aggregate: groupBy=[[]], aggr=[[MAX(cte.total)]] SubqueryAlias: cte Projection: SUM(aggregate_test_100.c4 * aggregate_test_100.c11) AS total Aggregate: groupBy=[[aggregate_test_100.c1]], aggr=[[SUM(CAST(aggregate_test_100.c4 AS Float64) * aggregate_test_100.c11)]] TableScan: aggregate_test_100 projection=[c1, c4, c11] ``` ### Expected behavior Since summation here is dependent on ordering, I believe it is incorrect to inline the CTE here and execute it more than once. ### Additional context Related issue, which talks about possible advantages on not inlining CTE in some cases: https://github.com/apache/datafusion/issues/8777 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org