2010YOUY01 commented on issue #19487:
URL: https://github.com/apache/datafusion/issues/19487#issuecomment-3705492170
I have opened a review-ready PR covering the major APIs for this feature.
The issue has also been updated to summarize the later discussion.
@adriangb Regarding distribution statistics: they are not included in this
PR, but I have thought about how they could be extended in the future with the
current design. Conceptually, the changes would be:
- Add a new distribution stats type to `ColumnStats` in
https://github.com/apache/datafusion/pull/19609/changes#diff-7ef7398c3050edc8e11cd985fd57aeb21b2139794d0007b7fb5bb04865c8173dR250
- Add some control logic inside `evaluate_pruning()` in
https://github.com/apache/datafusion/pull/19609/changes#diff-2b3f5563d9441d3303b57e58e804ab07a10d198973eed20e7751b5a20b955e42R464,
and adjust the evaluation steps to skip certain containers. This optimization
is, I think, also part of the “short-circuit” family described in the issue.
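
To make the two bullets above concrete, here is a minimal, self-contained sketch of the idea. It does not reproduce the PR's code: the `ColumnStats` and `evaluate_pruning` below are simplified stand-ins that only borrow the names, and the `Histogram` type, its fields, and the `col = value` predicate are assumptions made for illustration.

```rust
/// Hypothetical equal-width histogram (not part of the PR).
/// Bucket `i` covers `[lower + i * width, lower + (i + 1) * width)`.
#[derive(Debug, Clone)]
struct Histogram {
    lower: i64,
    width: i64,
    counts: Vec<u64>,
}

impl Histogram {
    /// Returns `false` only when the bucket that would hold `value` is known
    /// to be empty. Anything the histogram cannot answer is treated as
    /// "may contain", so pruning stays conservative.
    fn may_contain(&self, value: i64) -> bool {
        if self.width <= 0 {
            return true;
        }
        match value.checked_sub(self.lower) {
            Some(offset) if offset >= 0 => {
                let idx = (offset / self.width) as usize;
                // Values past the last bucket are not covered: keep the container.
                self.counts.get(idx).map_or(true, |&count| count > 0)
            }
            // Below the covered range, or arithmetic overflow: keep the container.
            _ => true,
        }
    }
}

/// Simplified stand-in for per-container column statistics.
/// `distribution` is the hypothetical new stats type.
#[derive(Debug, Clone)]
struct ColumnStats {
    min: Option<i64>,
    max: Option<i64>,
    distribution: Option<Histogram>,
}

/// Simplified stand-in for the pruning entry point, evaluating `col = value`.
/// Returns one flag per container: `true` means the container must be read.
fn evaluate_pruning(containers: &[ColumnStats], value: i64) -> Vec<bool> {
    containers
        .iter()
        .map(|stats| {
            // Step 1: min/max check. A miss here short-circuits, so the
            // distribution stats are never consulted for this container.
            if let Some(min) = stats.min {
                if value < min {
                    return false;
                }
            }
            if let Some(max) = stats.max {
                if value > max {
                    return false;
                }
            }
            // Step 2: only containers that survive step 1 consult the
            // optional distribution stats.
            match &stats.distribution {
                Some(hist) => hist.may_contain(value),
                None => true,
            }
        })
        .collect()
}
```

In this sketch a container whose min/max range is `[0, 29]` but whose histogram has an empty `[10, 20)` bucket would be skipped for `col = 15`, which min/max statistics alone cannot decide.
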
Regarding pruning on struct columns, I haven’t looked into it in detail yet.
I’ll try to study the related implementations later. In the meantime, could you
provide some example workloads you’d like this to support (e.g. what the
queries with struct predicates look like, and how the corresponding Parquet
statistics are populated)? We can then figure out how to make this work in
follow-up work.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]