srilman opened a new issue, #491: URL: https://github.com/apache/iceberg-python/issues/491
### Feature Request / Improvement

I noticed that in `DataScan.plan_files`, when we apply filters at the partition and file-metadata level, all we try to determine is whether a file contains only rows that can never match the filter or rows that might match. However, we can also easily determine whether all rows in the file match the filter. This can occur when the filter can be fully resolved from partition values or file metadata.

This could enable some additional optimizations in file scan planning (in order of complexity):

- When the filter is always true for all output files, we can skip the row-level filter in `DataScan.to_arrow`
- When the filter is determined to be always true at the partition filter level, we can skip filtering at the file level
- We can split files between `ROWS_MIGHT_MATCH` and `ROWS_ALWAYS_MATCH` and handle each group separately (both at the partition-to-file stage and on the output side)
- We can go even further and partially evaluate / simplify filters based on metadata before passing them to later steps. The evaluation would work like this:
  - If an expression is determined to always be false on all rows in a file/partition, replace it with `AlwaysFalse`
  - Likewise, if an expression would always be true on all rows, replace it with `AlwaysTrue`
  - Simplify the expression while walking back up the tree
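To make the last idea concrete, here is a minimal, self-contained sketch of partial evaluation against file-level min/max bounds. It does not use the PyIceberg API: the expression classes, the `Match` enum, the `Bounds` mapping, and the `simplify` / `classify` helpers are all hypothetical names chosen for illustration. It also assumes integer columns with inclusive lower/upper bounds and no nulls or NaNs; with nulls present, an "always true" conclusion would additionally need a null-count check.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Tuple, Union


class Match(Enum):
    # Hypothetical three-way result; only the names echo the issue text.
    NEVER = "rows cannot match"
    MIGHT = "rows might match"
    ALWAYS = "rows always match"


@dataclass(frozen=True)
class AlwaysTrue:
    pass


@dataclass(frozen=True)
class AlwaysFalse:
    pass


@dataclass(frozen=True)
class GreaterThan:
    column: str
    value: int


@dataclass(frozen=True)
class LessThan:
    column: str
    value: int


@dataclass(frozen=True)
class And:
    left: "Expr"
    right: "Expr"


@dataclass(frozen=True)
class Or:
    left: "Expr"
    right: "Expr"


Expr = Union[AlwaysTrue, AlwaysFalse, GreaterThan, LessThan, And, Or]
# Assumed metadata shape: column name -> (lower bound, upper bound), inclusive.
Bounds = Dict[str, Tuple[int, int]]


def simplify(expr: Expr, bounds: Bounds) -> Expr:
    """Replace sub-expressions decided by the bounds, then fold upward."""
    if isinstance(expr, (GreaterThan, LessThan)):
        if expr.column not in bounds:
            return expr  # no metadata for this column: leave it undecided
        lower, upper = bounds[expr.column]
        if isinstance(expr, GreaterThan):
            if lower > expr.value:
                return AlwaysTrue()   # every value exceeds the literal
            if upper <= expr.value:
                return AlwaysFalse()  # no value exceeds the literal
        else:
            if upper < expr.value:
                return AlwaysTrue()
            if lower >= expr.value:
                return AlwaysFalse()
        return expr
    if isinstance(expr, (And, Or)):
        left = simplify(expr.left, bounds)
        right = simplify(expr.right, bounds)
        # For And, AlwaysFalse absorbs and AlwaysTrue is the identity;
        # for Or the roles are swapped.
        absorbing, identity = (
            (AlwaysFalse(), AlwaysTrue()) if isinstance(expr, And)
            else (AlwaysTrue(), AlwaysFalse())
        )
        if left == absorbing or right == absorbing:
            return absorbing
        if left == identity:
            return right
        if right == identity:
            return left
        return type(expr)(left, right)
    return expr  # AlwaysTrue / AlwaysFalse are already simplified


def classify(expr: Expr, bounds: Bounds) -> Match:
    """Map the simplified (residual) filter to a three-way result."""
    residual = simplify(expr, bounds)
    if residual == AlwaysTrue():
        return Match.ALWAYS
    if residual == AlwaysFalse():
        return Match.NEVER
    return Match.MIGHT
```

In this sketch, a file classified as `Match.ALWAYS` would need no row-level filter at all, and a file classified as `Match.MIGHT` would only need the residual returned by `simplify`, which may be smaller than the original filter.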