etiennepelissier opened a new pull request, #21112:
URL: https://github.com/apache/datafusion/pull/21112

   ## Which issue does this PR close?
   
   Closes #8698
   
   ## Rationale for this change
   
   Substrait's `RelCommon.hint.stats.row_count` carries row count statistics as 
an advisory hint. DataFusion was not reading or writing this field, meaning 
statistics were silently dropped when round-tripping logical plans through 
Substrait. Preserving the hint matters for downstream optimizer rules that rely 
on row count estimates.
   
   ## What changes are included in this PR?
   
   **Producer** (`producer/rel/read_rel.rs`): when serializing a `TableScan`, 
attempt to downcast the `TableSource` to a `TableProvider` and read its 
`statistics()`. If `num_rows` is `Exact(n)` or `Inexact(n)`, populate 
`RelCommon.hint.stats.row_count` with `n as f64`.
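
As a rough illustration of the producer-side mapping described above, here is a minimal sketch. It is not the code from this PR: `Precision` is a local stand-in that only mirrors the variant names of DataFusion's type so the snippet compiles on its own, and `row_count_hint` is a hypothetical helper name.

```rust
/// Local stand-in mirroring the variants of DataFusion's `Precision` type,
/// so that this sketch compiles without the `datafusion` crate.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

/// Hypothetical helper: map a provider's `num_rows` to the value that would be
/// stored in `RelCommon.hint.stats.row_count`. `None` means "leave the hint unset".
pub fn row_count_hint(num_rows: Precision<usize>) -> Option<f64> {
    match num_rows {
        // Both variants yield the same hint: the Substrait field is advisory,
        // so the Exact/Inexact distinction is not carried across.
        Precision::Exact(n) | Precision::Inexact(n) => Some(n as f64),
        // No known row count: emit no hint at all.
        Precision::Absent => None,
    }
}
```

Under these assumptions, `row_count_hint(Precision::Exact(100))` yields `Some(100.0)` and `row_count_hint(Precision::Absent)` yields `None`. Note that `n as f64` is only lossless for counts up to 2^53, which is acceptable for an estimate.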
   
   **Consumer** (`consumer/rel/read_rel.rs`): extract `row_count` from 
`RelCommon.hint.stats` on any `ReadRel`. When the resolved `TableProvider` has 
no statistics of its own, wrap it with a new private 
`StatisticsOverrideTableProvider` that returns the Substrait hint as 
`Precision::Inexact(n)`, making it available to DataFusion's optimizer and 
physical planner. Local provider statistics always take precedence over the 
hint.
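
The precedence rule on the consumer side can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `Precision` is again a local stand-in, `effective_num_rows` is a hypothetical function name, and both the rounding of the `f64` hint and the rejection of negative or non-finite values are assumptions that the description above does not specify.

```rust
/// Local stand-in mirroring the variants of DataFusion's `Precision` type,
/// so that this sketch compiles without the `datafusion` crate.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

/// Hypothetical helper: decide which row count the consumer should expose.
/// `local` is what the resolved provider reports; `hint` is the Substrait
/// `row_count`, if the plan carried one.
pub fn effective_num_rows(local: Precision<usize>, hint: Option<f64>) -> Precision<usize> {
    match (local, hint) {
        // Only fall back to the hint when the provider has no row count.
        // Assumption: ignore hints that are negative, NaN, or infinite.
        (Precision::Absent, Some(n)) if n.is_finite() && n >= 0.0 => {
            // Assumption: round to the nearest whole row. The hint is always
            // surfaced as Inexact because it is advisory.
            Precision::Inexact(n.round() as usize)
        }
        // Local statistics (or the absence of any usable hint) win.
        (local, _) => local,
    }
}
```

With these assumptions, a provider without statistics and a hint of `42.0` yields `Precision::Inexact(42)`, while a provider reporting `Precision::Exact(100)` keeps that value regardless of the hint.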
   
   ## Are these changes tested?
   
   Two new integration tests in `roundtrip_logical_plan.rs`:
   
   - `producer_sets_row_count_hint`: registers a `TableWithStatistics` (exact 
row count = 100), converts the plan to Substrait, and asserts 
`ReadRel.common.hint.stats.row_count == 100.0`.
   - `consumer_injects_row_count_hint`: produces a Substrait plan from a 
provider with row count 42, consumes it against a `MemTable` (no statistics), 
and asserts the resulting provider exposes `Precision::Inexact(42)`.
   
   ## Are there any user-facing changes?
   
   No breaking API changes. The behavior is additive: Substrait plans produced 
by DataFusion now carry row count hints, and plans consumed by DataFusion now 
surface those hints through `TableProvider::statistics()` when no local 
statistics are present.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

