kosiew opened a new pull request, #22161:
URL: https://github.com/apache/datafusion/pull/22161
## Which issue does this PR close?
* Part of #20118
## Rationale for this change
This change implements a conservative optimization for removing unused
`UNNEST` operators in cases where the parent operator is duplicate-insensitive
and the unnested output is not referenced.
Previously, a `GROUP BY` or `DISTINCT` over columns that are not unnested
would still retain `UNNEST`, even when it only introduced duplicate rows
and had no effect on the final grouped result. However, removing `UNNEST` is
safe only in narrow cases: an empty or NULL list can make `UNNEST` drop its
input row entirely, so removing the operator could bring back rows that the
original query would not return.
This PR adds a targeted optimization that only removes `UNNEST` when it is
provably semantics-preserving.
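The cardinality argument above can be illustrated with a small, self-contained model. This is only a sketch and does not use DataFusion: `Row`, `unnest`, and the two `distinct_keys_*` helpers are hypothetical names, and the model assumes, as the rationale describes, that an empty or NULL list yields no output rows.

```rust
use std::collections::BTreeSet;

/// Toy row: a grouping key plus a list column. `None` models a NULL list.
/// (Hypothetical type for illustration; not part of DataFusion.)
type Row = (&'static str, Option<Vec<i32>>);

/// Unnest the list column: one output row per list element.
/// In this model, rows whose list is empty or NULL produce no output rows.
fn unnest(rows: &[Row]) -> Vec<(&'static str, i32)> {
    rows.iter()
        .flat_map(|(key, list)| list.iter().flatten().map(move |v| (*key, *v)))
        .collect()
}

/// Models `SELECT DISTINCT key` evaluated on top of the unnested rows.
fn distinct_keys_with_unnest(rows: &[Row]) -> BTreeSet<&'static str> {
    unnest(rows).into_iter().map(|(key, _)| key).collect()
}

/// Models the same query after the `UNNEST` has been removed.
fn distinct_keys_without_unnest(rows: &[Row]) -> BTreeSet<&'static str> {
    rows.iter().map(|(key, _)| *key).collect()
}
```

When every list is non-empty, both functions return the same set of keys, which is the case this PR targets. As soon as one list is empty or NULL, the version without `UNNEST` returns a key that the original query would have dropped.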
## What changes are included in this PR?
* Extend projection optimization for `Aggregate` plans to detect removable
  `UNNEST` inputs.
* Add logic to eliminate `LogicalPlan::Unnest` when:
  * the unnested columns are not referenced by required expressions,
  * the parent aggregate is duplicate-insensitive (`GROUP BY` with no
    aggregate expressions, including `DISTINCT`),
  * the `UNNEST` input is provably guaranteed to preserve at least one row
    per input row.
* Support pruning through an intermediate `Projection`.
* Add safety checks to avoid removing `UNNEST` when:
  * grouped expressions reference unnested columns,
  * the input list may be empty or NULL.
* Add helper logic for detecting non-empty literal list inputs across
  supported list scalar types.
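The three removal conditions listed above can be sketched as a single predicate. This is a simplified illustration, not the code in this PR: `ListInput`, `UnnestUnderAggregate`, and `can_remove_unnest` are hypothetical names, the real optimizer works on DataFusion's `LogicalPlan` and expression types, and the "required expressions" check is reduced here to the group-by columns.

```rust
/// Hypothetical summary of what the optimizer knows about the list being
/// unnested. Illustrative only; not a DataFusion type.
#[derive(Debug, Clone, PartialEq)]
enum ListInput {
    /// A literal list with a known number of elements.
    Literal(usize),
    /// A literal NULL list.
    Null,
    /// A column or expression whose length is unknown at plan time.
    Unknown,
}

/// Hypothetical, flattened view of an `Aggregate` whose input is an `UNNEST`.
struct UnnestUnderAggregate<'a> {
    /// Columns referenced by the group-by expressions.
    group_by_columns: &'a [&'a str],
    /// Number of aggregate expressions (0 for plain `GROUP BY` / `DISTINCT`).
    aggregate_expr_count: usize,
    /// Columns produced by the `UNNEST`.
    unnested_columns: &'a [&'a str],
    /// What is known about the list fed into the `UNNEST`.
    list_input: ListInput,
}

/// Returns true only when all three conditions hold.
fn can_remove_unnest(plan: &UnnestUnderAggregate<'_>) -> bool {
    // 1. The parent is duplicate-insensitive: grouping only, no aggregates.
    let duplicate_insensitive = plan.aggregate_expr_count == 0;
    // 2. No grouped expression references an unnested column.
    let unnest_output_unused = plan
        .group_by_columns
        .iter()
        .all(|col| !plan.unnested_columns.contains(col));
    // 3. The list is a non-empty literal, so every input row survives.
    let keeps_every_row = matches!(plan.list_input, ListInput::Literal(n) if n > 0);

    duplicate_insensitive && unnest_output_unused && keeps_every_row
}
```

The predicate is deliberately conservative: anything it cannot prove, such as a list column of unknown length, keeps the `UNNEST` in place.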
## Are these changes tested?
Yes.
Added targeted optimizer unit tests covering:
* removal of unused non-empty literal `UNNEST` under `GROUP BY`,
* removal through an intermediate projection,
* preservation when unnested columns are referenced,
* preservation for empty list inputs.
Added sqllogictest coverage for:
* `GROUP BY` pruning,
* `DISTINCT` pruning,
* unsafe cases where removing `UNNEST` would change cardinality.
## Are there any user-facing changes?
This change improves logical and physical plan optimization for certain
queries involving unused `UNNEST` expressions under `GROUP BY` or `DISTINCT`,
but does not introduce user-facing API changes.
## LLM-generated code disclosure
This PR includes LLM-generated code and comments. All LLM-generated content
has been manually reviewed and tested.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]