srilman opened a new issue, #491:
URL: https://github.com/apache/iceberg-python/issues/491

   ### Feature Request / Improvement
   
   I noticed that in `DataScan.plan_files`, when we apply filters at the 
partition and file-metadata level, we only determine whether a file has no rows 
that can match the filter or has some rows that might match. However, we can 
also easily determine whether all rows in the file match the filter. This can 
occur when the filter can be fully evaluated from partition values or file 
metadata.
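
   To make the three possible outcomes concrete, here is a minimal sketch for a 
single predicate `col > literal` evaluated against per-file min/max statistics. 
`Match`, `ColumnStats` and `classify_greater_than` are hypothetical names used 
only for illustration, not PyIceberg APIs, and the sketch assumes the bounds are 
exact integer bounds:

```python
from dataclasses import dataclass
from enum import Enum


class Match(Enum):
    CANNOT = "ROWS_CANNOT_MATCH"
    MIGHT = "ROWS_MIGHT_MATCH"
    ALWAYS = "ROWS_ALWAYS_MATCH"  # the extra outcome proposed in this issue


@dataclass(frozen=True)
class ColumnStats:
    """Hypothetical per-file column statistics (not a PyIceberg class)."""

    lower: int
    upper: int
    null_count: int = 0


def classify_greater_than(stats: ColumnStats, literal: int) -> Match:
    """Classify the predicate `col > literal` against one file's statistics."""
    if stats.upper <= literal:
        # Even the largest value in the file fails the predicate.
        return Match.CANNOT
    if stats.lower > literal and stats.null_count == 0:
        # Even the smallest value passes, and no null rows can fail it.
        return Match.ALWAYS
    return Match.MIGHT
```

   Only files classified as `MIGHT` would still need a row-level filter.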
   
   This could enable some additional optimizations in file scan planning (in 
order of complexity):
   - When the filter is always true for all output files, we can skip the 
row-level filter in `DataScan.to_arrow`
   - When the filter is determined to be always true at the partition filter 
level, we can skip filtering on the file level
   - We can split files between `ROWS_MIGHT_MATCH` and `ROWS_ALWAYS_MATCH` and 
apply the remaining filter only to the former (both when going from partition 
to file-level filtering and on the output side)
   - We can go even more extreme and partially evaluate / simplify filters 
based on metadata before passing to later steps. The evaluation would work like:
       - If an expression is determined to always be false on all rows in a 
file/partition, then replace it with AlwaysFalse
       - Likewise, if an expression would always be true on all rows, replace 
with AlwaysTrue
       - Simplify the expression while walking back up the tree
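
   The partial evaluation described in the last bullet could look roughly like 
the sketch below. It is self-contained and does not use PyIceberg's expression 
classes; `Pred`, `simplify` and the `decide` callback are hypothetical stand-ins, 
where `decide` returns `True`/`False` when metadata settles a predicate for every 
row and `None` otherwise. `Not` is omitted to keep the sketch short:

```python
from dataclasses import dataclass
from typing import Callable, Optional, Union


@dataclass(frozen=True)
class AlwaysTrue:
    pass


@dataclass(frozen=True)
class AlwaysFalse:
    pass


@dataclass(frozen=True)
class Pred:
    """Hypothetical leaf predicate, identified only by a label."""

    name: str


@dataclass(frozen=True)
class And:
    left: "Expr"
    right: "Expr"


@dataclass(frozen=True)
class Or:
    left: "Expr"
    right: "Expr"


Expr = Union[AlwaysTrue, AlwaysFalse, Pred, And, Or]


def simplify(expr: Expr, decide: Callable[[Pred], Optional[bool]]) -> Expr:
    """Replace metadata-decided predicates, then fold constants bottom-up."""
    if isinstance(expr, Pred):
        verdict = decide(expr)
        if verdict is True:
            return AlwaysTrue()
        if verdict is False:
            return AlwaysFalse()
        return expr  # undecided: keep for later row-level filtering
    if isinstance(expr, And):
        left, right = simplify(expr.left, decide), simplify(expr.right, decide)
        if isinstance(left, AlwaysFalse) or isinstance(right, AlwaysFalse):
            return AlwaysFalse()
        if isinstance(left, AlwaysTrue):
            return right
        if isinstance(right, AlwaysTrue):
            return left
        return And(left, right)
    if isinstance(expr, Or):
        left, right = simplify(expr.left, decide), simplify(expr.right, decide)
        if isinstance(left, AlwaysTrue) or isinstance(right, AlwaysTrue):
            return AlwaysTrue()
        if isinstance(left, AlwaysFalse):
            return right
        if isinstance(right, AlwaysFalse):
            return left
        return Or(left, right)
    return expr  # AlwaysTrue / AlwaysFalse pass through unchanged
```

   If the result is `AlwaysTrue`, the row-level filter can be skipped for that 
file; if it is `AlwaysFalse`, the file can be dropped from the scan.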
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

