timsaucer opened a new pull request, #22146:
URL: https://github.com/apache/datafusion/pull/22146

   ## Which issue does this PR close?
   
   - Closes #.
   
   (Filing this directly; no issue yet. Happy to open one if preferred.)
   
   ## Rationale for this change
   
   `Statistics::with_fetch` unconditionally returned `Precision::Exact(0)` when
   the input had `nr <= skip`, even when the input was `Inexact(nr)`, promoting
   an estimated upper bound into an exact zero. The exactness flag then misleads
   downstream consumers (notably `AggregateStatistics` via
   `Count::value_from_stats`) into trusting a derived "0" and folding the count
   subtree to a literal.
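   
   As a rough illustration of why the exactness flag matters, the sketch below
   gates count folding on exact statistics. `Precision` here is a simplified
   stand-in for DataFusion's type, and `fold_count` is a hypothetical helper,
   not the actual `Count::value_from_stats` code; the only assumption is that
   folding happens when the row count is reported as exact.
   
   ```rust
   /// Simplified stand-in for DataFusion's `Precision<usize>` (assumption).
   #[derive(Debug, Clone, Copy, PartialEq, Eq)]
   enum Precision {
       Exact(usize),
       Inexact(usize),
       Absent,
   }
   
   /// Hypothetical helper: fold `count(*)` to a literal only when the
   /// row count is known exactly; otherwise the aggregate must be executed.
   fn fold_count(num_rows: Precision) -> Option<usize> {
       match num_rows {
           Precision::Exact(n) => Some(n),
           Precision::Inexact(_) | Precision::Absent => None,
       }
   }
   ```
   
   Under this model, `fold_count(Precision::Inexact(0))` returns `None` and the
   count is computed normally, while `fold_count(Precision::Exact(0))` returns
   `Some(0)`. That is why promoting an estimate to `Exact(0)` changes the query
   result rather than just the plan cost.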
   
   Concrete user-visible symptom reported on TPC-H Q22:
   
   ```rust
   let df = ctx.sql(q22_sql).await?;
   df.clone().show().await?;   // prints 7 rows (correct)
   df.count().await?;          // returns 0 (wrong)
   ```
   
   `EXPLAIN` for the count plan shows the outer count aggregate collapsed to
   `ProjectionExec([lit(0)]) -> PlaceholderRowExec`.
   
   After PR #21240 left uncorrelated scalar subqueries in the filter rather than
   rewriting them to joins, `FilterExec` can't use interval analysis on
   `ScalarSubqueryExpr`, falls back to the 20% default selectivity, and produces
   a small `Inexact` row estimate. A `LeftAnti` join whose estimated semi-overlap
   covers the outer estimate then yields `Inexact(0)`. That zero propagates
   through grouped aggregates whose `estimate_num_rows` returns the child stats
   unchanged when `value == 0`. The pre-existing `with_fetch` bug on a downstream
   `SortExec` finally promotes it to `Exact(0)`, which `AggregateStatistics`
   trusts.
   
   The root cause is the precision promotion in `with_fetch`. The PR fixes that;
   the surrounding plan-shape changes after #21240 just made it reachable.
   
   ## What changes are included in this PR?
   
   - `Statistics::with_fetch`: when `nr <= skip`, preserve the exactness of the
     input via `check_num_rows(Some(0), self.num_rows.is_exact().unwrap())`
     instead of always returning `Exact(0)`.
   - `datafusion-common`: new unit test `test_with_fetch_skip_all_rows_inexact`
     pinning the new behaviour.
   - `datafusion-physical-plan`: update the expectation in the existing
     `test_row_number_statistics_for_global_limit`, which encoded the old
     (incorrect) promotion, so that it now expects `Inexact(0)`.
   - `datafusion/sqllogictest/test_files/subquery.slt`: SLT regression test
     reproducing the user-visible `count(*)` symptom over a query that contains
     a scalar subquery, `not exists`, and a group-by on a derived column, backed
     by parquet sources so the data sources report Exact statistics.
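   
   The behavioural change in the first bullet can be sketched as follows. This
   is a simplified model, not the DataFusion implementation: `Precision` is a
   local stand-in, the hypothetical functions `skip_rows_old` and
   `skip_rows_fixed` model only the `skip` part of `with_fetch`, and the `fetch`
   limit and all other statistics are left out.
   
   ```rust
   /// Simplified stand-in for DataFusion's `Precision<usize>` (assumption).
   #[derive(Debug, Clone, Copy, PartialEq, Eq)]
   enum Precision {
       Exact(usize),
       Inexact(usize),
       Absent,
   }
   
   /// Old behaviour (modelled): skipping at least `nr` rows always yields
   /// `Exact(0)`, even when `nr` was only an estimate.
   fn skip_rows_old(num_rows: Precision, skip: usize) -> Precision {
       match num_rows {
           Precision::Exact(nr) | Precision::Inexact(nr) if nr <= skip => Precision::Exact(0),
           // Here nr > skip, so the subtraction cannot underflow.
           Precision::Exact(nr) => Precision::Exact(nr - skip),
           Precision::Inexact(nr) => Precision::Inexact(nr - skip),
           Precision::Absent => Precision::Absent,
       }
   }
   
   /// Fixed behaviour (modelled): the zero keeps the exactness of the input.
   fn skip_rows_fixed(num_rows: Precision, skip: usize) -> Precision {
       match num_rows {
           Precision::Exact(nr) if nr <= skip => Precision::Exact(0),
           Precision::Inexact(nr) if nr <= skip => Precision::Inexact(0),
           // Here nr > skip, so the subtraction cannot underflow.
           Precision::Exact(nr) => Precision::Exact(nr - skip),
           Precision::Inexact(nr) => Precision::Inexact(nr - skip),
           Precision::Absent => Precision::Absent,
       }
   }
   ```
   
   In this model, `skip_rows_old(Precision::Inexact(0), 0)` gives `Exact(0)`,
   which matches the promotion described above, while
   `skip_rows_fixed(Precision::Inexact(0), 0)` gives `Inexact(0)`. Exact inputs
   behave the same in both versions.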
   
   ## Are these changes tested?
   
   Yes:
   - Unit test in `datafusion-common` for the precision-preserving behaviour.
   - Updated unit test in `datafusion-physical-plan/limit.rs`.
   - SLT regression test in `subquery.slt` that fails without the fix
     (`count(*)` returns `0` instead of `2`) and passes with it.
   
   ## Are there any user-facing changes?
   
   Only the bug fix. No public API changes.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

