Tamar-Posen opened a new issue, #18850:
URL: https://github.com/apache/datafusion/issues/18850
### Describe the bug
In DataFusion 51.1.0, joins involving a `GROUP BY` may choose the wrong
build/probe side.
This happens because `AggregateExec` discards `total_byte_size` statistics,
leaving only row-count stats.
The join-selection rule then falls back to comparing `num_rows`, which
produces incorrect decisions for tables with uneven byte-size distributions.
As a result, dynamic filters may be:
- Built on the large table, or
- Not generated at all.
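
To make the fallback concrete, here is a minimal, self-contained sketch of the kind of comparison described above. It is a simplified model, not DataFusion's actual code: the `Stats` struct and the `should_swap` function are hypothetical names introduced for illustration, and `None` stands in for an unknown statistic.

```rust
/// Hypothetical, simplified stand-in for plan statistics.
/// `None` means the value is unknown.
#[derive(Debug, Clone, Copy)]
struct Stats {
    num_rows: Option<usize>,
    total_byte_size: Option<usize>,
}

/// Returns true when the left (build) input looks larger than the right
/// input, meaning the join sides should be swapped.
///
/// Byte sizes are compared only when BOTH sides report them; otherwise
/// the comparison falls back to row counts, and to "no swap" if those
/// are unknown as well.
fn should_swap(left: Stats, right: Stats) -> bool {
    match (left.total_byte_size, right.total_byte_size) {
        (Some(l), Some(r)) => l > r,
        _ => match (left.num_rows, right.num_rows) {
            (Some(l), Some(r)) => l > r,
            _ => false,
        },
    }
}
```

With the figures from the reproduction case in this report (50 rows at about 1 GB on the left, 1,000,000 rows with an unknown byte size on the right), this model compares 50 against 1,000,000 rows and keeps the 1 GB input as the build side. If the right side still reported roughly 50 KB, the byte-size branch would be taken and the sides would be swapped.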
### To Reproduce
This occurs when joining tables with extreme size asymmetry:
- large_bytes: ~50 rows, ~1 GB total
- many_rows: ~1,000,000 rows, ~50 KB total
Run:
```sql
SELECT *
FROM large_bytes
JOIN (
    SELECT id, join_key
    FROM many_rows
    GROUP BY ALL
) AS many_rows
ON large_bytes.id = many_rows.join_key;
```
Inspect the optimized plan (`EXPLAIN` or `df.explain(false, true)`).
The optimizer may select `large_bytes` as the build side because the
aggregated `many_rows` loses its byte-size stats.
Minimal Rust test reproducing the issue:
[repro_aggregate_bytes.rs](https://gist.github.com/Tamar-Posen/758bca567c9a591e0b9e5d0a1fa53633)
### Expected behavior
- `AggregateExec` should preserve or approximate `total_byte_size`.
- Join selection should use byte-size when available.
- Dynamic filters should consistently build on the smaller table.
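
One possible way to approximate the byte size after aggregation is sketched below, purely as a starting point for discussion. It assumes the output byte size scales with the fraction of input rows that survive the aggregation, which ignores any change in schema between input and output. The function `estimate_output_bytes` is a hypothetical helper, not an existing DataFusion API.

```rust
/// Hypothetical estimate of an aggregate's output byte size.
///
/// Scales the input byte size by `output_rows / input_rows`. Returns
/// `None` if any of the three inputs is unknown, so callers can still
/// fall back to other statistics.
fn estimate_output_bytes(
    input_rows: Option<usize>,
    input_bytes: Option<usize>,
    output_rows: Option<usize>,
) -> Option<usize> {
    let (in_rows, in_bytes, out_rows) = (input_rows?, input_bytes?, output_rows?);
    if in_rows == 0 {
        // No input rows: treat the output as empty in this rough model.
        return Some(0);
    }
    // Use u128 for the intermediate product to avoid overflow.
    let scaled = (in_bytes as u128 * out_rows as u128) / in_rows as u128;
    Some(scaled.min(usize::MAX as u128) as usize)
}
```

For the `many_rows` table in the reproduction (about 50 KB over 1,000,000 rows), an aggregation that keeps every row would retain an estimate of about 50 KB, which is enough for a byte-size comparison to prefer it as the build side.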
### Additional context
Relevant areas:
- physical-plan/aggregates/mod.rs – statistics propagation
- physical-optimizer/join_selection.rs
- physical-optimizer/filter_pushdown.rs
- physical-plan/joins/hash_join/exec.rs
This behavior appears to be a limitation (design choice) rather than an
incidental bug. Preserving or estimating byte-size stats after aggregation
would improve join selection for skewed row-size distributions.
I’m happy to submit a PR after discussing the desired approach for
propagating or estimating byte-size stats through AggregateExec.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]