[ https://issues.apache.org/jira/browse/ARROW-15239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Weston Pace resolved ARROW-15239. --------------------------------- Resolution: Fixed Issue resolved by pull request 12067 [https://github.com/apache/arrow/pull/12067] > [C++][Compute] Introduce Bloom filters to hash join > --------------------------------------------------- > > Key: ARROW-15239 > URL: https://issues.apache.org/jira/browse/ARROW-15239 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ > Affects Versions: 6.0.0 > Reporter: Michal Nowakiewicz > Assignee: Michal Nowakiewicz > Priority: Major > Labels: pull-request-available, query-engine > Fix For: 8.0.0 > > Time Spent: 9h 40m > Remaining Estimate: 0h > > Bloom filters are a common way to improve performance of hash joins where > many rows on the probe side of the hash join do not have matches on the build > side. Bloom filters are often able to reduce the cost of eliminating such > rows early in the processing pipeline, since they are cheaper to probe than > the hash join hash table, but they can return false positives for a > reasonably small percentage of inputs. > This task is about introducing a data structure of register blocked Bloom > filter implementation (a practical modification of Bloom filter concept that > is specifically tuned for use in query processing related to hash joins and > both more space efficient and less costly than using hash table for > filtering). The data structure should provide functionality for parallel > construction from a vector of exec batches accumulated in memory and > vectorized lookup and filtering for a single exec batch. It should not have a > limit on the size of the Bloom filter (the number of inserted hashes), which > requires using 64-bit hashes for larger inputs. It should be verified that > build and probe costs are reasonable low and false positives rate is at most > few percent (which should be acceptable in use for query processing). -- This message was sent by Atlassian Jira (v8.20.1#820001)