adriangb commented on issue #19487: URL: https://github.com/apache/datafusion/issues/19487#issuecomment-3691561747
This is an incredible writeup! My understanding of the prior art on this is that we at one point added `PhysicalExpr::evaluate_bounds` (which is exactly what you are proposing in `propagate_range_stats`) and `Distribution` but these never made it to be widely used. I do not know exactly why this is in general, but I think in the case of Parquet row group / page stats evaluation it was mainly a performance concern: the current approach builds a modified expression tree and a RecordBatch then evaluates it, which in theory can be vectorized, etc. This has also been the blocker to adding tri-state logic in Parquet at least: because we use expressions like `BinaryExpr` the output is a `BooleanArray`, we can't distinguish between `KeepAll` and `Unknown`. Some thoughts based on this: - Could we add `Distribution` to `ColumnStats`? It could be useful to track alongside. - Should this be done at the via RecordBatches instead of custom Rust structs? Could we have something like `PhysicalExpr::prunable() -> Option<PrunablePhysicalExpr>` Where PrunablePhysicalExpr::evaluate(RecordBatch)` and the `RecordBatch` has a `Union(IntermediateStatsArray, IntermediateResultArray)` or something. I feel that working with `RecordBatch`es is a worse / more annoying API but if it has a significant performance impact maybe it's what we need to do. Overall though I 100% stand behind this effort 🚀 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
