kosiew opened a new pull request, #19252:
URL: https://github.com/apache/datafusion/pull/19252

   
   ## Which issue does this PR close?
   
   * Closes #18974.
   
   ## Rationale for this change
   
   The DataFusion CLI currently panics with an "index out of bounds" error when 
executing queries that use `GROUP BY GROUPING SETS(())`, such as:
   
   ```sql
   SELECT SUM(v1) FROM generate_series(10) AS t1(v1) GROUP BY GROUPING SETS(())
   ```
   
   This panic originates in the physical aggregation code, which assumes that 
an empty list of grouping expressions always corresponds to "no grouping". That 
assumption breaks down in the presence of `GROUPING SETS`, where an empty set 
is a valid grouping set that should still produce a result row (and 
`__grouping_id`) rather than crashing.
   
   This PR fixes the panic by explicitly distinguishing:
   
   * true "no GROUP BY" aggregations, and
   * `GROUPING SETS`/`CUBE`/`ROLLUP` plans that may have empty grouping 
expressions but still require grouping-set semantics and a valid 
`__grouping_id`.
   
   The change restores robustness of the CLI and ensures standards-compliant 
behavior for grouping sets with empty sets.
   
   ## What changes are included in this PR?
   
   Summary of the main changes:
   
   * **Track grouping-set usage explicitly in `PhysicalGroupBy`:**
   
     * Add a `has_grouping_set: bool` field to `PhysicalGroupBy`.
     * Extend `PhysicalGroupBy::new` to accept the `has_grouping_set` flag.
     * Add helper methods:
   
       * `has_grouping_set(&self) -> bool` to expose the flag, and
       * `is_true_no_grouping(&self) -> bool` to represent the case of 
genuinely no grouping (no GROUP BY and no grouping sets).
   
   * **Correct group state construction for empty grouping with grouping sets:**
   
     * Update `PhysicalGroupBy::from_pre_group` so that it only treats 
`expr.is_empty()` as "no groups" when `has_grouping_set` is `false`.
     * For `GROUPING SETS(())`, we now build at least one group, avoiding the 
previous out-of-bounds access on `groups[0]`.
   
   * **Clarify when `__grouping_id` should be present:**
   
     * Replace the previous `is_single` logic with a clearer distinction based 
on `has_grouping_set`.
     * `num_output_exprs`, `output_exprs`, `num_group_exprs`, and 
`group_schema` now add the `__grouping_id` column only when `has_grouping_set` 
is `true`.
     * `is_single` is redefined as "simple GROUP BY" (no grouping sets), i.e. 
`!self.has_grouping_set`.
   
   * **Integrate the new semantics into `AggregateExec`:**
   
     * Use `group_by.is_true_no_grouping()` instead of 
`group_by.expr.is_empty()` when choosing between the specialized no-grouping 
aggregation path and grouped aggregation.
     * Ensure that `is_unordered_unfiltered_group_by_distinct` only treats 
plans as grouped when there are grouping expressions **and** no grouping sets 
(`!has_grouping_set`).
     * Preserve existing behavior for regular `GROUP BY` while correctly 
handling `GROUPING SETS` and related constructs.
   
   * **Support `__grouping_id` with the no-grouping aggregation stream:**
   
     * Extend `AggregateStreamInner` with an optional `grouping_id: 
Option<ScalarValue>` field.
     * Change `AggregateStream::new` to accept a `grouping_id` argument.
     * Introduce `prepend_grouping_id_column` to prepend a `__grouping_id` 
column to the finalized accumulator output when needed.
     * Wire this up so that no-grouping aggregations can still match a schema 
that includes `__grouping_id` in grouping-set scenarios.
   
   * **Planner and execution wiring updates:**
   
     * Update all `PhysicalGroupBy::new` call sites to pass the correct 
`has_grouping_set` value:
   
       * `false` for:
   
         * ordinary `GROUP BY` or truly no-grouping aggregates.
       * `true` for:
   
         * `GROUPING SETS`,
         * `CUBE`, and
         * `ROLLUP` physical planning paths.
     * Ensure `merge_grouping_set_physical_expr`, `create_cube_physical_expr`, 
and `create_rollup_physical_expr` correctly mark grouping-set plans.
   
   * **Protobuf / physical plan round-trip support:**
   
     * Extend `AggregateExecNode` in `datafusion.proto` with a new `bool 
has_grouping_set = 12;` field.
     * Update the generated `pbjson` and `prost` code to serialize and 
deserialize the new field.
     * When constructing `AggregateExec` from protobuf, pass the decoded 
`has_grouping_set` into `PhysicalGroupBy::new`.
     * When serializing an `AggregateExec` back to protobuf, set 
`has_grouping_set` based on `exec.group_expr().has_grouping_set()`.
     * Update round-trip physical plan tests to include the new field in their 
expectations.
   
   * **Tests and SQL logic coverage:**
   
     * Add sqllogictests for the previously failing cases in `grouping.slt`:
   
       * `SELECT COUNT(*) FROM test GROUP BY GROUPING SETS (());`
       * `SELECT SUM(v1) FROM generate_series(10) AS t1(v1) GROUP BY GROUPING 
SETS(())` (the original panic case).
     * Extend or adjust unit tests in `aggregates`, `physical_planner`, 
`filter_pushdown`, and `coop` modules to account for the `has_grouping_set` 
flag in `PhysicalGroupBy` and expected debug output.
     * Update proto round-trip tests to validate `has_grouping_set` is 
preserved.
   
   ## Are these changes tested?
   
   Yes.
   
   * New sqllogictests covering `GROUPING SETS(())` for both a regular table 
and `generate_series(10)`:
   
     * `grouping.slt` now asserts the expected scalar results (e.g. `2` and 
`55`), preventing regressions on this edge case.
   * Updated and existing Rust unit tests:
   
     * `physical-plan/src/aggregates` tests updated to include 
`has_grouping_set` in `PhysicalGroupBy` expectations.
     * Planner and optimizer tests (e.g. `physical_planner.rs`, 
`filter_pushdown`) updated to construct `PhysicalGroupBy` with the new flag.
     * Execution tests in `core/tests/execution/coop.rs` updated to reflect the 
new constructor and continue to exercise the no-grouping aggregation path.
     * Protobuf round-trip tests extended to verify that `has_grouping_set` is 
correctly serialized and deserialized.
   
   These tests collectively ensure that:
   
   * the panic is fixed,
   * the aggregation semantics for `GROUPING SETS(())` are correct, and
   * existing aggregate behavior remains unchanged for non-grouping-set queries.
   
   ## Are there any user-facing changes?
   
   Yes, but they are bug fixes and behavior clarifications rather than breaking 
changes:
   
   * Queries using `GROUP BY GROUPING SETS(())` no longer cause a runtime panic 
in the DataFusion CLI.
   
     * Instead, they return the expected single aggregate row (e.g. `COUNT(*)` 
or `SUM(v1)`), consistent with SQL semantics.
   * For plans using `GROUPING SETS`, `CUBE`, or `ROLLUP`, the internal 
`__grouping_id` column is now present consistently whenever grouping sets are 
in use, even when the grouping expressions are empty.
   * For ordinary `GROUP BY` queries that do not use grouping sets, behavior is 
unchanged: no unexpected `__grouping_id` column is added.
   
   No API signatures were changed in a breaking way for downstream users; the 
additions are internal flags and protobuf fields to accurately represent the 
physical plan.
   
   ## LLM-generated code disclosure
   
   This PR includes LLM-generated code and comments. All LLM-generated content 
has been manually reviewed and tested.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to