huaxingao commented on issue #10029: URL: https://github.com/apache/iceberg/issues/10029#issuecomment-2016304838
@cccs-jc Thanks a lot for your thorough investigation and analysis! The problem you described also occurs without a bloom filter.

Take the where clause `col1=1 OR col2=1`. Assume col1 has a minimum of 0 and a maximum of 5, while col2 has a minimum of 2 and a maximum of 5. Also assume there is no bloom filter, col1 is dictionary encoded, and col2 is not.

The statsFilter determines that the value 1 is within col1's range of 0 to 5, so it returns `shouldRead = True` for `col1=1`. For `col2=1` it returns `shouldRead = False`, because 1 is outside col2's range of 2 to 5. An OR needs only one branch to be true, so the statsFilter returns `shouldRead = True` for `col1=1 OR col2=1`.

The dictFilter determines that the value 1 is not in the dictionary of col1, so it returns `shouldRead = False` for `col1=1`. For `col2=1` it returns `shouldRead = True`, because col2 has no dictionary and the value can't be ruled out. So the dictFilter also returns `shouldRead = True` for `col1=1 OR col2=1`.

Since both the statsFilter and the dictFilter return `shouldRead = True` for `col1=1 OR col2=1`, we can't skip reading the row group. It would be ideal if we could combine the `shouldRead = False` for `col2=1` from the statsFilter with the `shouldRead = False` for `col1=1` from the dictFilter, but there doesn't seem to be an easy way to do so.
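To make the example above concrete, here is a minimal Python sketch of the two evaluation strategies. It is a toy model, not Iceberg's actual code: `ColumnChunk`, `stats_should_read`, `dict_should_read`, `should_read_separate`, and `should_read_combined` are hypothetical names, and the dictionary contents `{0, 2, 5}` for col1 are an assumed value chosen only to be consistent with min 0, max 5, and "1 is not in the dictionary".

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class ColumnChunk:
    """Toy stand-in for one column's metadata within a row group."""
    min_value: int
    max_value: int
    # None means the column is not dictionary encoded.
    dictionary: Optional[FrozenSet[int]] = None


Predicate = Tuple[str, int]  # (column name, value) meaning column = value


def stats_should_read(col: ColumnChunk, value: int) -> bool:
    # Min/max stats can only rule a value out when it is outside the range.
    return col.min_value <= value <= col.max_value


def dict_should_read(col: ColumnChunk, value: int) -> bool:
    # Without a dictionary nothing can be ruled out, so we must read.
    if col.dictionary is None:
        return True
    return value in col.dictionary


def eval_or(check: Callable[[ColumnChunk, int], bool],
            row_group: Dict[str, ColumnChunk],
            predicates: List[Predicate]) -> bool:
    # One filter evaluating the whole OR expression on its own.
    return any(check(row_group[name], value) for name, value in predicates)


def should_read_separate(row_group: Dict[str, ColumnChunk],
                         predicates: List[Predicate]) -> bool:
    # Behavior described in the comment: each filter evaluates the full
    # OR expression independently, and the row group is skipped only if
    # some filter returns False for the whole expression.
    return (eval_or(stats_should_read, row_group, predicates)
            and eval_or(dict_should_read, row_group, predicates))


def should_read_combined(row_group: Dict[str, ColumnChunk],
                         predicates: List[Predicate]) -> bool:
    # Hypothetical alternative: combine both filters per predicate first,
    # then OR the per-predicate results.
    return any(
        stats_should_read(row_group[name], value)
        and dict_should_read(row_group[name], value)
        for name, value in predicates
    )


row_group = {
    "col1": ColumnChunk(0, 5, frozenset({0, 2, 5})),  # dictionary encoded
    "col2": ColumnChunk(2, 5, None),                  # no dictionary
}
predicates = [("col1", 1), ("col2", 1)]  # col1=1 OR col2=1

print(should_read_separate(row_group, predicates))  # True: row group is read
print(should_read_combined(row_group, predicates))  # False: could be skipped
```

In this sketch the per-predicate combination skips the row group because each branch of the OR is ruled out by at least one filter. Whether something similar can be done in Iceberg depends on how the separate row-group filters are structured, which the sketch does not attempt to model.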