alamb commented on issue #19487:
URL: https://github.com/apache/datafusion/issues/19487#issuecomment-3693960061

   > My understanding of the prior art on this is that we at one point added 
[PhysicalExpr::evaluate_bounds](https://docs.rs/datafusion/latest/datafusion/physical_expr/trait.PhysicalExpr.html#method.evaluate_bounds)
 (which is exactly what you are proposing in propagate_range_stats) and 
Distribution but these never made it to be widely used. I do not know exactly 
why this is in general, but I think in the case of Parquet row group / page 
stats evaluation it was mainly a performance concern:
   
   Yes, this is my recollection too. Specifically, imagine trying to prune 
thousands of files -- with 
[PhysicalExpr::evaluate_bounds](https://docs.rs/datafusion/latest/datafusion/physical_expr/trait.PhysicalExpr.html#method.evaluate_bounds)
 you have to call it once per file, so thousands of times, which will be 
really slow.
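
   To illustrate the cost, here is a minimal sketch of per-container bounds evaluation. It is not DataFusion's API: `Interval`, `Expr`, `eval_bounds`, `may_match`, and `prune_one_by_one` are hypothetical names, and the expression language is reduced to a single `i64` column, literals, and addition. The point is only that the expression tree is walked again for every file.

```rust
// Hypothetical types for illustration only; not DataFusion's actual API.

/// Closed interval [lo, hi] of possible values of the column in one container.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Interval {
    pub lo: i64,
    pub hi: i64,
}

/// A tiny expression language over a single column.
pub enum Expr {
    Column,
    Literal(i64),
    Add(Box<Expr>, Box<Expr>),
}

/// Recursively compute the range of values `expr` can take in one container.
pub fn eval_bounds(expr: &Expr, col: Interval) -> Interval {
    match expr {
        Expr::Column => col,
        Expr::Literal(v) => Interval { lo: *v, hi: *v },
        Expr::Add(a, b) => {
            let (a, b) = (eval_bounds(a, col), eval_bounds(b, col));
            Interval {
                lo: a.lo.saturating_add(b.lo),
                hi: a.hi.saturating_add(b.hi),
            }
        }
    }
}

/// `left > right` can be true only if max(left) > min(right).
pub fn may_match(left: &Expr, right: &Expr, col: Interval) -> bool {
    eval_bounds(left, col).hi > eval_bounds(right, col).lo
}

/// Pruning N containers this way walks the expression tree N times.
pub fn prune_one_by_one(left: &Expr, right: &Expr, stats: &[Interval]) -> Vec<bool> {
    stats.iter().map(|s| may_match(left, right, *s)).collect()
}
```

   With thousands of files, the `match` dispatch and recursion in `eval_bounds` are repeated for each one, which is the overhead described above.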
   
   
   >  the current approach builds a modified expression tree and a RecordBatch 
then evaluates it, which in theory can be vectorized, etc. 
   
   To be clear, this is how PruningPredicate works, which is the key to making 
the evaluation fast -- it reuses all the optimized expression evaluation 
machinery.
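
   As a rough sketch of the idea (not the actual PruningPredicate implementation): the predicate is rewritten once into an expression over min/max statistics columns, and that rewritten expression is then evaluated over all containers in a single pass. `StatsBatch`, `prune_x_gt`, and `prune_x_eq` are hypothetical names, plain `Vec<i64>` stands in for Arrow arrays, and the rewrites are hard-coded rather than derived from an expression tree.

```rust
// Hypothetical stand-in for a RecordBatch of statistics; not DataFusion's API.

/// Per-container statistics stored column-wise: entry i describes container i.
pub struct StatsBatch {
    pub x_min: Vec<i64>,
    pub x_max: Vec<i64>,
}

/// Rewritten form of `x > literal`: a container may match iff `x_max > literal`.
/// The rewrite happens once; evaluation is one tight loop over the `x_max` column.
pub fn prune_x_gt(stats: &StatsBatch, literal: i64) -> Vec<bool> {
    stats.x_max.iter().map(|&max| max > literal).collect()
}

/// Rewritten form of `x = literal`: `x_min <= literal AND literal <= x_max`.
pub fn prune_x_eq(stats: &StatsBatch, literal: i64) -> Vec<bool> {
    stats
        .x_min
        .iter()
        .zip(&stats.x_max)
        .map(|(&min, &max)| min <= literal && literal <= max)
        .collect()
}
```

   Each result entry says whether the corresponding container might contain matching rows; `false` means it can be skipped. The per-container work is a plain comparison over contiguous data, with no expression-tree traversal per file.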
   
   
   > IIRC, the paper also mentioned some balancing about compile time for 
pruning. Would we also have some basic heuristic approach to do the tradeoff?
   
   I think this is related to the performance concern above -- if we had a 
vectorized evaluator, we might not need such a heuristic.
   
   
   

