Fly-a-Kite opened a new issue, #22190:
URL: https://github.com/apache/datafusion/issues/22190

   ### Describe the bug
   
   ## Version
   
   - Python: `3.12.3`
   - Python package: `datafusion==53.0.0`
   - Arrow input via `pyarrow==24.0.0`
   - Platform used for reproduction: Ubuntu 24.04 x86_64, Linux 
`6.17.0-23-generic`
   
   ## Minimal Reproducer
   
   ```python
   import pyarrow as pa
   from datafusion import SessionContext
   
   ctx = SessionContext()
   batch = pa.RecordBatch.from_arrays(
       [
           pa.array(["a"], type=pa.string()),
           pa.array([None], type=pa.int64()),
       ],
       schema=pa.schema(
           [
               pa.field("g", pa.string(), nullable=True),
               pa.field("x", pa.int64(), nullable=True),
           ]
       ),
   )
   ctx.register_record_batches("t0", [[batch]])
   
   base = "SELECT g, MIN(x) AS min_x FROM t0 GROUP BY g"
   print(ctx.sql(f"SELECT min_x FROM ({base}) q LIMIT 20").to_pandas())
   print(ctx.sql(f"SELECT min_x FROM ({base}) q ORDER BY min_x ASC NULLS LAST 
LIMIT 20").to_pandas())
   ```
   
   ## Actual Result
   
   Standalone reproducer output:
   
   ```text
   datafusion=53.0.0
   pyarrow=24.0.0
   control:
      min_x
   0    NaN
   top-k:
   Empty DataFrame
   Columns: [min_x]
   Index: []
   top-k record-batch rows=0
   AssertionError: DataFusion dropped the group whose aggregate sort key is NULL
   ```
   
   The control query returns the grouped row with a NULL aggregate result:
   
   ```text
      min_x
   0    NaN
   ```
   
   The query with `ORDER BY min_x ASC NULLS LAST LIMIT 20` returns no rows:
   
   ```text
   Empty DataFrame
   Columns: [min_x]
   Index: []
   ```
   
   With two groups, one having `MIN(x) = NULL` and one having `MIN(x) = 5`, the 
`ORDER BY ... LIMIT 20` query returns only the non-NULL sort-key group. The 
limit is larger than the number of groups, so no grouped row should be removed.
   
   
   
   ## Expected Result
   
   Both single-group queries should return one row. `GROUP BY g` forms a group 
for `g = 'a'`; `MIN(x)` is NULL because all values in that group are NULL. 
Ordering with `NULLS LAST` and a limit larger than the result cardinality 
should preserve the row.
   
   
   
   ## Additional Context
   
    I found this with a cross-backend differential test. In the generated case, 
pandas, Polars eager/lazy, DuckDB, SQLite, and an independent DSL reference all 
preserved the grouped row whose aggregate sort key is NULL. DataFusion was the 
only backend that returned an empty result.
   
    I also ran targeted follow-up checks for the same pattern. The issue 
consistently reproduced for NULL `MIN(x)` ordered ascending and NULL `MAX(x)` 
ordered descending, while nearby controls such as plain NULL column top-k, NULL 
group key with non-NULL aggregate, and NULL `SUM(x)` / `AVG(x)` aggregate top-k 
preserved the row.
   
    I can share the standalone reproducer, boundary-check script, and full 
generated artifacts if useful.
   
   ### To Reproduce
   
   _No response_
   
   ### Expected behavior
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to