[GitHub] [parquet-mr] gszadovszky commented on pull request #1023: PARQUET-2237 Improve performance when filters in RowGroupFilter can match exactly

via GitHub Tue, 07 Mar 2023 00:04:04 -0800


gszadovszky commented on PR #1023:
URL: https://github.com/apache/parquet-mr/pull/1023#issuecomment-1457722446


   Thanks @yabola for coming up with this idea. Let's continue the discussion 
about the BloomFilter building idea in the Jira.
   
   Meanwhile, I've been thinking about the actual problem as well. Currently, 
for row group filtering we check the min/max values first, which is 
correct since this is the fastest thing to do. Then we check the dictionary, 
and then the bloom filter. The ordering of the latter two is not obvious to me 
in every scenario. At the time of filtering we have not yet started reading the 
actual row group, so there is no I/O advantage in reading the dictionary first. 
Furthermore, looking something up in the bloom filter is much faster than in 
the dictionary, and the bloom filter is probably smaller than the dictionary. 
However, it would require some measurements to prove whether it is a good idea 
to check the bloom filter before the dictionary. What do you think?
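
   To make the ordering question concrete, here is a minimal sketch of the 
two candidate orders. It is not parquet-mr code: `FilterOrderSketch`, 
`RowGroupSummary` and `firstLevelThatDrops` are hypothetical names, the column 
is assumed to hold `long` values, and the bloom filter is modelled as a plain 
set that may contain extra entries to imitate false positives.

```java
import java.util.Set;

// Illustrative only: none of these types exist in parquet-mr.
final class FilterOrderSketch {

    // Hypothetical per-row-group metadata for a single long column.
    static final class RowGroupSummary {
        final long min;
        final long max;
        final Set<Long> dictionary;   // exact set of distinct values
        final Set<Long> bloomMembers; // stand-in for a bloom filter; extra entries act as false positives

        RowGroupSummary(long min, long max, Set<Long> dictionary, Set<Long> bloomMembers) {
            this.min = min;
            this.max = max;
            this.dictionary = dictionary;
            this.bloomMembers = bloomMembers;
        }
    }

    // Returns the first level that proves no row equals `value`, or "none"
    // if the row group has to be read. Levels are evaluated lazily, in order.
    static String firstLevelThatDrops(RowGroupSummary rg, long value, boolean bloomBeforeDictionary) {
        // Min/max statistics are always checked first.
        if (value < rg.min || value > rg.max) {
            return "statistics";
        }
        String[] order = bloomBeforeDictionary
                ? new String[] {"bloomfilter", "dictionary"}
                : new String[] {"dictionary", "bloomfilter"};
        for (String level : order) {
            boolean drops = level.equals("dictionary")
                    ? !rg.dictionary.contains(value)
                    : !rg.bloomMembers.contains(value);
            if (drops) {
                return level;
            }
        }
        return "none";
    }
}
```

   In this model, a value that is absent from both structures is rejected by 
whichever level runs first, so the order only changes which structure has to be 
fetched. A bloom false positive still falls through to the dictionary, which is 
why measurements would be needed to tell which order is cheaper in practice.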


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
