huaxingao commented on issue #10029:
URL: https://github.com/apache/iceberg/issues/10029#issuecomment-2016304838

   @cccs-jc Thanks a lot for your thorough investigation and analysis!
   
   The problem you described will also occur without a bloom filter. Let's use 
the where clause `col1=1 OR col2=1`. Assume the minimum for col1 is 0 and the 
maximum is 5, while the minimum for col2 is 2 and the maximum is 5. Let's also 
assume we do not have a bloom filter, col1 is dictionary encoded, and col2 is 
not.
   
   The statsFilter first evaluates `col1=1`: the value 1 is within col1's 
range of 0 to 5, so it returns `shouldRead = True`. For `col2=1` it returns 
`shouldRead = False`, because 1 is outside col2's range of 2 to 5. Since the 
two predicates are joined by OR, the statsFilter returns `shouldRead = True` 
for `col1=1 OR col2=1`.
   
   Assuming the value 1 does not appear in col1's dictionary, the dictFilter 
returns `shouldRead=False` for `col1=1`. When the dictFilter evaluates 
`col2=1`, it returns `shouldRead=True`, because col2 has no dictionary and the 
value can't be ruled out. So the dictFilter also returns `shouldRead = True` 
for `col1=1 OR col2=1`.
   
   Since both the statsFilter and the dictFilter return True for `col1=1 OR 
col2=1`, we can't skip reading the row group.
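
   The walkthrough above can be reproduced with a small toy model. This is 
only a sketch, not Iceberg's actual filter classes: the names `stats_leaf`, 
`dict_leaf` and `eval_or` are made up for illustration, and the dictionary 
contents for col1 are an assumption (any set without the value 1 would do).

```python
# Toy model of the example; NOT Iceberg's real statsFilter/dictFilter code.
STATS = {"col1": (0, 5), "col2": (2, 5)}   # (min, max) per column
DICTS = {"col1": {0, 3, 5}}                # assumed contents; col2 has no dictionary

def stats_leaf(col, value):
    """shouldRead for `col = value` using only min/max statistics."""
    lo, hi = STATS[col]
    return lo <= value <= hi

def dict_leaf(col, value):
    """shouldRead for `col = value` using only the dictionary, if there is one."""
    dictionary = DICTS.get(col)
    if dictionary is None:
        return True  # no dictionary, so the value can't be ruled out
    return value in dictionary

def eval_or(leaf, predicates):
    """Evaluate an OR of equality predicates with a single filter."""
    return any(leaf(col, value) for col, value in predicates)

PREDICATES = [("col1", 1), ("col2", 1)]    # col1=1 OR col2=1

stats_result = eval_or(stats_leaf, PREDICATES)  # True OR False
dict_result = eval_or(dict_leaf, PREDICATES)    # False OR True

# The row group is skipped only if some filter returns False for the
# whole expression, which neither filter does here.
should_read = stats_result and dict_result
print(stats_result, dict_result, should_read)
```

Each filter rules out a different side of the OR, but neither one rules out 
the whole expression on its own.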
   
   It would be ideal if we could combine the `shouldRead=False` for `col2=1` 
from the statsFilter with the `shouldRead=False` for `col1=1` from the 
dictFilter, but there doesn't seem to be an easy way to do so.
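
   To make the "combine" idea concrete, here is a hypothetical sketch in which 
each leaf predicate is checked against every filter before the OR is applied. 
It only illustrates the desired outcome for this example; it says nothing 
about how this could be wired into the existing filters, and all names and 
the dictionary contents are assumptions.

```python
# Hypothetical per-predicate combination; NOT how Iceberg evaluates filters.
STATS = {"col1": (0, 5), "col2": (2, 5)}   # (min, max) per column
DICTS = {"col1": {0, 3, 5}}                # assumed contents; col2 has no dictionary

def stats_leaf(col, value):
    lo, hi = STATS[col]
    return lo <= value <= hi

def dict_leaf(col, value):
    dictionary = DICTS.get(col)
    return True if dictionary is None else value in dictionary

def combined_leaf(col, value):
    """A leaf may match only if no filter can rule it out."""
    return stats_leaf(col, value) and dict_leaf(col, value)

PREDICATES = [("col1", 1), ("col2", 1)]    # col1=1 OR col2=1

# col1=1 is ruled out by the dictionary, col2=1 by the statistics,
# so the OR over the combined leaves is False and the row group is skipped.
should_read_combined = any(combined_leaf(c, v) for c, v in PREDICATES)
print(should_read_combined)
```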
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

