asolimando commented on PR #21996:
URL: https://github.com/apache/datafusion/pull/21996#issuecomment-4367087797

   Really interesting work @adriangb, addressing an important gap, this has 
potential to unlock a lot of further development on the statistics front.
   
   I've been working on several related pieces (ExpressionAnalyzer #21122, 
StatisticsRegistry #21483, StatisticsContext #21815) and wanted to offer some 
perspective on how they might connect.
   
   ## On the ExpressionAnalyzer (#21122) relationship
   
   Your framing of "orthogonal with overlap" is very convincing to me, but I 
wonder if the split might be even cleaner than presented. Keeping 
`StatisticsRequest` column-based (or extending to nested paths like `get_field` 
where the data source genuinely has the stat) seems like the natural boundary. 
For the `JOIN ON coalesce(t1.col, '') = t2.col` example you cite, one option 
would be to request stats on `t1.col` and let `ExpressionAnalyzer` derive how 
`coalesce` transforms them, rather than pushing expression semantics into the 
provider, which might have their own very different notion (classic impedance 
mismatch in DB systems). Your `RequestStatistics` rule already seems to be 
doing this naturally, since `add_column_refs` extracts base columns, ignoring 
expression structure, this feels like the right balance.
   
   ## Sketches, advanced stats, and the broader statistics infrastructure
   
   You mention "room to grow toward sketches, histograms, and selectivity stats 
later", which reminded me of the `ExtendedStatistics` mechanism introduced in 
the `StatisticsRegistry` (#21483), with exactly that use-case in mind. The 
`StatisticsRegistry` provides a way to override/enrich statistics for any 
operator via customizable `StatisticsProvider` (with existing built-in versions 
as base). `ExtendedStatistics` is its type-erased extension map that custom 
providers can populate and other custom ones can consume, without modifying 
core DataFusion types. It would be nice to think about how the request/response 
API from this PR could feed custom stats into `ExtendedStatistics` (e.g., a 
table provider that answers a `StatisticsRequest` with a sketch or histogram 
could surface that data where `StatisticsProvider` implementations can pick it 
up during propagation).
   
   If `StatisticsContext` (#21815, in review) lands, this could find its 
natural home there. The longer-term idea is for `StatisticsContext` to also 
carry the `ExpressionAnalyzer` and `StatisticsRegistry` themselves, so that all 
statistics infrastructure (collection, expression-level derivation, 
operator-level propagation, and custom extensions) is accessible from a single 
place, opening the door for richer stats flowing end-to-end.
   
   Happy to discuss in more detail if that would be useful!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to