asolimando commented on issue #21120:
URL: https://github.com/apache/datafusion/issues/21120#issuecomment-4138342060

   > > This links to paleolimbot's interest in stats propagation for a specific 
type of stats
   > 
   > I am new to this part of DataFusion but I took a look at the draft PR...we 
probably would be able to use this (but would need `Statistics` to be able to 
represent something pluggable, and would need the analyzer trait to be able to 
calculate something fancier than min/max and NDV). No pressure to cater to that 
here, but my personal motivating example is to ask for GeoStatistics (we have 
our own definition) for a specific column output of a `dyn ExecutionPlan` when 
planning a join.
   
   Thanks for your feedback, @paleolimbot. This issue covers the expression 
level only, as I tried to keep the scope minimal so as not to overload 
reviewers.
   
   But I am also working on a POC for a similar chain-of-responsibility 
registry at the operator level, which would make it possible to override the 
statistics propagation mechanism for individual operators beyond what's 
implemented in `partition_statistics` today. The idea is to get the same 
freedom we have today for connectors/data sources and physical rules, but for 
statistics.
   
   Being able to override operators' behavior for statistics propagation would 
also allow supporting custom statistics, which matches your use case exactly 
(provided there is also a mechanism for storing them, of course).
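
   To make the operator-level idea more concrete, here is a minimal, self-contained sketch of a chain-of-responsibility registry. All names in it (`Stats`, `Operator`, `StatsProvider`, `Registry`, `GeoJoinProvider`, the `"geo_bbox"` key) are hypothetical stand-ins invented for illustration; none of them are DataFusion APIs, and the actual POC may look quite different.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for the statistics of an operator's output.
#[derive(Debug, Clone, PartialEq)]
struct Stats {
    num_rows: Option<usize>,
    /// Slot for custom statistics (e.g. a serialized sketch), keyed by name.
    custom: HashMap<String, Vec<u8>>,
}

/// Hypothetical, simplified stand-in for a physical operator.
struct Operator {
    name: &'static str,
    input: Stats,
}

/// One link in the chain: return `Some` to handle the operator,
/// or `None` to pass it on to the next provider.
trait StatsProvider {
    fn compute(&self, op: &Operator) -> Option<Stats>;
}

/// Default behavior used when no provider claims the operator:
/// here it simply passes the input statistics through.
fn default_stats(op: &Operator) -> Stats {
    op.input.clone()
}

/// Ordered list of providers; the first one returning `Some` wins.
struct Registry {
    providers: Vec<Box<dyn StatsProvider>>,
}

impl Registry {
    fn new() -> Self {
        Self { providers: Vec::new() }
    }

    fn register(&mut self, provider: Box<dyn StatsProvider>) {
        self.providers.push(provider);
    }

    fn statistics(&self, op: &Operator) -> Stats {
        self.providers
            .iter()
            .find_map(|p| p.compute(op))
            .unwrap_or_else(|| default_stats(op))
    }
}

/// Example override: attaches a custom statistic to one operator kind only.
struct GeoJoinProvider;

impl StatsProvider for GeoJoinProvider {
    fn compute(&self, op: &Operator) -> Option<Stats> {
        if op.name != "SpatialJoin" {
            return None; // not ours, let the chain continue
        }
        let mut out = op.input.clone();
        // Placeholder payload; a real provider would store meaningful bytes.
        out.custom.insert("geo_bbox".to_string(), vec![0u8; 4]);
        Some(out)
    }
}
```

   With `GeoJoinProvider` registered, an operator named `"SpatialJoin"` gets the extra `"geo_bbox"` entry, while any other operator falls through to `default_stats`. Registration order decides precedence, which is what lets a user-supplied provider take priority over built-in behavior.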
   
   Concretely speaking, I had in mind 
[DataSketches](https://datasketches.apache.org/) as custom stats, similar to 
what I implemented in 
[HIVE-26221](https://issues.apache.org/jira/browse/HIVE-26221) to add support 
for histogram-like statistics for range filters based on [KLL 
sketches](https://datasketches.apache.org/docs/KLL/KLLSketch.html), but the 
same mechanism could just as well be used for any kind of "custom" stats.
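
   As an illustration of why histogram-like statistics help with range filters, the sketch below estimates the selectivity of `lo < x <= hi` from an equi-depth histogram. It is not a KLL implementation: the bucket boundaries are computed here by sorting the values exactly, whereas a KLL sketch would supply approximate quantiles in bounded memory. The type and method names (`EquiDepthHistogram`, `rank`, `selectivity`) are assumptions made for this example.

```rust
/// Equi-depth histogram: `bounds` holds k + 1 ascending boundaries, and each
/// of the k buckets is assumed to contain roughly 1/k of the rows.
struct EquiDepthHistogram {
    bounds: Vec<f64>,
}

impl EquiDepthHistogram {
    /// Builds the boundaries from exact sorted values. A quantile sketch
    /// such as KLL would provide approximate boundaries instead.
    fn from_values(mut values: Vec<f64>, buckets: usize) -> Self {
        assert!(!values.is_empty() && buckets > 0);
        values.sort_by(|a, b| a.total_cmp(b));
        let n = values.len();
        let bounds = (0..=buckets)
            .map(|i| values[(i * (n - 1)) / buckets])
            .collect();
        Self { bounds }
    }

    /// Estimated fraction of rows with value <= v, using linear
    /// interpolation inside the bucket that contains v.
    fn rank(&self, v: f64) -> f64 {
        let k = (self.bounds.len() - 1) as f64;
        if v < self.bounds[0] {
            return 0.0;
        }
        if v >= *self.bounds.last().unwrap() {
            return 1.0;
        }
        for (i, w) in self.bounds.windows(2).enumerate() {
            let (lo, hi) = (w[0], w[1]);
            // Zero-width buckets (lo == hi) never match, so no division by zero.
            if v >= lo && v < hi {
                let frac = (v - lo) / (hi - lo);
                return (i as f64 + frac) / k;
            }
        }
        1.0
    }

    /// Estimated selectivity of the range filter `lo < x <= hi`.
    fn selectivity(&self, lo: f64, hi: f64) -> f64 {
        (self.rank(hi) - self.rank(lo)).max(0.0)
    }
}
```

   With only min/max available, a planner has to assume a uniform distribution over the whole value range; with bucket boundaries the uniformity assumption is confined to a single bucket, which is usually much closer to the truth for skewed data.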
   
   I will hopefully have a draft PR/POC ready by the time I open a new issue 
for it, but I'd like to keep the two discussions separate for now so as not to 
broaden the scope too much, as you mentioned.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

