clintropolis opened a new pull request #6584: fix druid-bloom-filter thread-safety URL: https://github.com/apache/incubator-druid/pull/6584 Alternate approach to what was taken in #6547, instead embedding the `BloomKFilter` implementation in `druid-bloom-filters` extension, and wrapping variables which are re-used instead of re-allocated per function call with `ThreadLocal` to fix the thread safety issues. Fixes #6546 The copied `BloomKFilter` is from `hive-storage-api` as of verison 2.7.0 instead of master, since [this commit](https://github.com/apache/hive/commit/87ce36b458350db141c4cb4b6336a9a01796370f#diff-e65fc506757ee058dc951d15a9a526c3L238) seems to have broken backwards compatibility by using 8 bytes instead of 4 to store and test ints/floats. I went with this version specifically to maintain compatibility with the hive version used in the original PR, #6222, despite I think the latest implementation being a bit better since it doesn't use the static variable. The dependency on `hive-storage-api` still exists for use of it's `Murmur3` implementation, though if we embedded that as well we could drop the dependency entirely. Longer term I think this could prove advantageous having control over the implementation, since it will allow me to implement `ByteBuffer` versions of methods to make #6397 more efficient. I collected benchmarks of this implementation of `BloomKFilter` and compared it to `BloomFilter` and they are fairly close, with a slight edge to `BloomKFilter` with larger filters: ``` Benchmark (capacity) (rows) Mode Cnt Score Error Units BloomKFilterBenchmark.testStringBloomFilter 1000000 3000000 avgt 20 419.952 ± 14.495 ms/op BloomKFilterBenchmark.testStringBloomFilter 10000000 3000000 avgt 20 632.029 ± 91.746 ms/op BloomKFilterBenchmark.testStringBloomFilter 100000000 3000000 avgt 20 829.343 ± 214.949 ms/op BloomKFilterBenchmark.testStringBloomKFilter 1000000 3000000 avgt 20 430.807 ± 17.243 ms/op BloomKFilterBenchmark.testStringBloomKFilter 10000000 3000000 avgt 20 617.019 ± 171.415 ms/op BloomKFilterBenchmark.testStringBloomKFilter 100000000 3000000 avgt 20 862.188 ± 270.329 ms/op ``` ``` Benchmark (capacity) (rows) Mode Cnt Score Error Units BloomKFilterBenchmark.testStringBloomFilter 1000000 10000000 avgt 20 1577.867 ± 35.568 ms/op BloomKFilterBenchmark.testStringBloomFilter 10000000 10000000 avgt 20 2220.278 ± 71.004 ms/op BloomKFilterBenchmark.testStringBloomFilter 100000000 10000000 avgt 20 2766.451 ± 496.881 ms/op BloomKFilterBenchmark.testStringBloomKFilter 1000000 10000000 avgt 20 1680.831 ± 39.377 ms/op BloomKFilterBenchmark.testStringBloomKFilter 10000000 10000000 avgt 20 1983.740 ± 70.657 ms/op BloomKFilterBenchmark.testStringBloomKFilter 100000000 10000000 avgt 20 2428.820 ± 480.301 ms/op ``` and for reference here is the unmodified `BloomKFilter`, which has similar test speeds as well ``` Benchmark (capacity) (rows) Mode Cnt Score Error Units BloomKFilterBenchmark.testStringBloomFilter 1000000 10000000 avgt 20 1599.278 ± 70.931 ms/op BloomKFilterBenchmark.testStringBloomFilter 10000000 10000000 avgt 20 2375.007 ± 149.834 ms/op BloomKFilterBenchmark.testStringBloomFilter 100000000 10000000 avgt 20 2639.661 ± 84.199 ms/op BloomKFilterBenchmark.testStringBloomKFilter 1000000 10000000 avgt 20 1673.166 ± 48.810 ms/op BloomKFilterBenchmark.testStringBloomKFilter 10000000 10000000 avgt 20 2045.913 ± 33.302 ms/op BloomKFilterBenchmark.testStringBloomKFilter 100000000 10000000 avgt 20 2289.513 ± 44.742 ms/op ``` I am going to test these changes on a test cluster, to ensure they function as expected and correctly resolve #6546, and will add a comment with the results here before we should consider merging.
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For additional commands, e-mail: commits-h...@druid.apache.org