2010YOUY01 commented on issue #19487:
URL: https://github.com/apache/datafusion/issues/19487#issuecomment-3705492170

   I have opened a review-ready PR covering the major APIs for this feature. 
The issue has also been updated to summarize the later discussion.
   
   @adriangb Regarding distribution statistics: they are not included in this 
PR, but I have thought about how they could be extended in the future with the 
current design. Conceptually, the changes would be:
   
   - Add a new distribution stats type to `ColumnStats` in
     https://github.com/apache/datafusion/pull/19609/changes#diff-7ef7398c3050edc8e11cd985fd57aeb21b2139794d0007b7fb5bb04865c8173dR250
   - Add some control logic inside `evaluate_pruning()` in
     https://github.com/apache/datafusion/pull/19609/changes#diff-2b3f5563d9441d3303b57e58e804ab07a10d198973eed20e7751b5a20b955e42R464,
     and adjust the evaluation steps to skip certain containers. This
     optimization is, I think, also part of the “short-circuit” family
     described in the issue.
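
To make the two changes above more concrete, here is a rough sketch of the idea. It is not the code in the PR: the `Histogram` type, the `distribution` field, the `i64`-only columns, and this simplified `evaluate_pruning` signature are all assumptions made for illustration. It evaluates an equality predicate `col = v` per container, runs the cheap min/max check first, and consults the distribution stats only for containers that survived that step.

```rust
/// Hypothetical distribution statistics: an equi-range histogram.
/// Bucket `i` covers the half-open range [bounds[i], bounds[i + 1]).
#[derive(Debug, Clone)]
struct Histogram {
    bounds: Vec<i64>,
    /// Row count per bucket; `counts.len() == bounds.len() - 1`.
    counts: Vec<u64>,
}

impl Histogram {
    /// Returns false only when `v` falls into a bucket known to be empty.
    fn may_contain(&self, v: i64) -> bool {
        for (w, &count) in self.bounds.windows(2).zip(&self.counts) {
            if w[0] <= v && v < w[1] {
                return count > 0;
            }
        }
        // `v` is not covered by any bucket: stay conservative and keep.
        true
    }
}

/// Simplified stand-in for per-container column statistics (assumption).
#[derive(Debug, Clone)]
struct ColumnStats {
    min: i64,
    max: i64,
    /// Assumed extension point: optional distribution statistics.
    distribution: Option<Histogram>,
}

/// Evaluates `col = v` for each container.
/// `true` means the container may match and must be kept;
/// `false` means it can be pruned.
fn evaluate_pruning(containers: &[ColumnStats], v: i64) -> Vec<bool> {
    containers
        .iter()
        .map(|stats| {
            // Step 1: cheap min/max check.
            if v < stats.min || v > stats.max {
                // Short-circuit: already pruned, skip the later step.
                return false;
            }
            // Step 2: distribution check, only for surviving containers.
            match &stats.distribution {
                Some(histogram) => histogram.may_contain(v),
                None => true,
            }
        })
        .collect()
}
```

In this sketch, a container with `min = 0`, `max = 100` and an empty histogram bucket covering `[10, 90)` is kept for `col = 5` but pruned for `col = 50`, even though min/max alone could not rule it out.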
   
   Regarding pruning on struct columns, I haven’t looked into it in detail yet. 
I’ll try to study the related implementations later. In the meantime, could you 
provide some example workloads you’d like this to support (e.g. what the 
queries with struct predicates look like, and how the corresponding Parquet 
statistics are populated)? We can then figure out how to support them in 
follow-up work.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]