deepyaman opened a new issue, #23220:
URL: https://github.com/apache/datafusion/issues/23220

   ### Describe the bug
   
   When a non-deterministic/volatile function (e.g. `random()`, `uuid()`) is 
computed once in a subquery and then referenced multiple times in the outer 
projection, DataFusion >= 52.0.0 pushes the outer projection **into the 
file-scan `DataSourceExec`** and **inlines the subquery alias**, turning the 
single call into N independent calls.
   
   Two references to what should be the same "locked-in" value then diverge. 
This worked correctly in 51.0.0 and regressed in 52.0.0 (both 52.0.0 and 53.0.0 
are affected).
   
   It only reproduces with a **file scan** (Parquet/CSV); an in-memory 
`MemTable` is not affected, which points at projection pushdown into the file 
source.
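
The semantic difference described above can be modelled in a few lines of plain Python. This is only a conceptual sketch, not DataFusion code: `plan_reuse` and `plan_inlined` are hypothetical names standing in for the 51.0.0-style and 52.0.0+-style plans, and `random.Random` stands in for SQL `random()`.

```python
import random


def plan_reuse(rows, rng):
    """Model of the nested plan: evaluate the volatile call once per row, then reuse it."""
    out = []
    for _ in rows:
        r = rng.random()    # inner projection: random() AS r
        out.append((r, r))  # outer projection: r AS x, r AS y
    return out


def plan_inlined(rows, rng):
    """Model of the merged plan: the alias is inlined, so each reference is its own call."""
    return [(rng.random(), rng.random()) for _ in rows]


rows = [1, 2, 3]
reused = plan_reuse(rows, random.Random(0))
inlined = plan_inlined(rows, random.Random(0))

print(all(x == y for x, y in reused))   # every row has x == y
print(all(x == y for x, y in inlined))  # x and y are drawn independently
```

With the fixed seed used here, the first check prints `True` and the second prints `False`, which mirrors the `x == y` versus `x != y` behaviour reported for the two versions.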
   
   ### To Reproduce
   
   datafusion-cli:
   
   ```sql
   COPY (SELECT 1 AS id UNION ALL SELECT 2 UNION ALL SELECT 3) TO 't.parquet';
   CREATE EXTERNAL TABLE t STORED AS PARQUET LOCATION 't.parquet';
   
   EXPLAIN
   SELECT s.r AS x, s.r AS y
   FROM (SELECT random() AS r FROM t) AS s;
   ```
   
   **51.0.0 — correct** (`random()` evaluated once, then reused):
   ```
   ProjectionExec: expr=[r@0 as x, r@0 as y]
     ProjectionExec: expr=[random() as r]
       DataSourceExec: file_groups={...t.parquet}, file_type=parquet
   ```
   
   **52.0.0 / 53.0.0 — incorrect** (`random()` inlined and duplicated):
   ```
   DataSourceExec: file_groups={...t.parquet}, projection=[random() as x, random() as y], file_type=parquet
   ```
   
   Executing the query (without `EXPLAIN`) confirms `x != y` on 53.0.0, whereas `x == y` on 51.0.0.
   
   ### Expected behavior
   
   A volatile/non-deterministic expression aliased in a subquery should be 
evaluated **once** and reused by later references, as in 51.0.0. The optimizer 
should not inline/duplicate a volatile expression when pushing a projection 
into a scan (cf. #10337 for the CTE analogue).
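
One way to picture the expected guard is a check that refuses to merge two projections when a volatile definition is referenced more than once. The sketch below is purely illustrative: `Column`, `Call`, `VOLATILE_FUNCS`, and `can_inline` are hypothetical names for a toy expression model and do not correspond to DataFusion's actual types or optimizer rules.

```python
from dataclasses import dataclass


# Hypothetical, simplified expression model (not DataFusion's types).
@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class Call:
    func: str
    args: tuple = ()


# Assumption: an illustrative list of volatile functions, not an exhaustive one.
VOLATILE_FUNCS = {"random", "uuid"}


def is_volatile(expr):
    """True if the expression contains a call to a volatile function."""
    if isinstance(expr, Call):
        return expr.func in VOLATILE_FUNCS or any(is_volatile(a) for a in expr.args)
    return False


def count_refs(expr, name):
    """Count how many times column `name` is referenced inside `expr`."""
    if isinstance(expr, Column):
        return 1 if expr.name == name else 0
    if isinstance(expr, Call):
        return sum(count_refs(a, name) for a in expr.args)
    return 0


def can_inline(inner, outer):
    """inner: alias -> defining expr; outer: list of exprs over those aliases.

    Merging is rejected when a volatile definition would be duplicated.
    """
    for alias, definition in inner.items():
        refs = sum(count_refs(e, alias) for e in outer)
        if refs > 1 and is_volatile(definition):
            return False
    return True


# The query from the report: r is volatile and referenced twice.
print(can_inline({"r": Call("random")}, [Column("r"), Column("r")]))
```

For the query in this report the check returns `False`, so the two projections would stay separate. A single reference to `r`, or a deterministic definition referenced twice, would still be eligible for merging under this toy rule.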
   
   ### Additional context
   
   - Regression introduced in **52.0.0** (51.0.0 correct; 52.0.0 and 53.0.0 
affected).
   - Reproduces with Parquet and CSV file scans; not with in-memory tables.
   - Surfaced downstream in [Ibis](https://github.com/ibis-project/ibis), which 
relies on the subquery-aliasing pattern to "lock in" `random()`/`uuid()` values 
(`ibis/backends/tests/test_impure.py::test_impure_correlated` and 
`::test_chained_selections`). Equivalent Ibis reproducer:
   
   ```python
   import ibis
   from ibis import _
   
   con = ibis.datafusion.connect()
   # File-backed table: the bug needs a file scan.
   ibis.memtable({"id": [1, 2, 3]}).to_parquet("t.parquet")
   t = con.read_parquet("t.parquet")
   
   expr = t.select(common=ibis.random()).select(x=_.common, y=_.common)
   df = expr.execute()
   print((df.x == df.y).all())   # True on 51.0.0, False on >= 52.0.0
   ```
   
   ---
   *Generated-by: Claude Opus 4.8 <[email protected]>*


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

