clintropolis opened a new pull request #6584: fix druid-bloom-filter 
thread-safety
URL: https://github.com/apache/incubator-druid/pull/6584
 
 
   Alternate approach to what was taken in #6547, instead embedding the 
`BloomKFilter` implementation in `druid-bloom-filters` extension, and wrapping 
variables which are re-used instead of re-allocated per function call with 
`ThreadLocal` to fix the thread safety issues. 
   
   Fixes #6546
   
   The copied `BloomKFilter` is from `hive-storage-api` as of verison 2.7.0 
instead of master, since [this 
commit](https://github.com/apache/hive/commit/87ce36b458350db141c4cb4b6336a9a01796370f#diff-e65fc506757ee058dc951d15a9a526c3L238)
 seems to have broken backwards compatibility by using 8 bytes instead of 4 to 
store and test ints/floats. I went with this version specifically to maintain 
compatibility with the hive version used in the original PR, #6222, despite I 
think the latest implementation being a bit better since it doesn't use the 
static variable. The dependency on `hive-storage-api` still exists for use of 
it's `Murmur3` implementation, though if we embedded that as well we could drop 
the dependency entirely.
   
   Longer term I think this could prove advantageous having control over the 
implementation, since it will allow me to implement `ByteBuffer` versions of 
methods to make #6397 more efficient. 
   
   I collected benchmarks of this implementation of `BloomKFilter` and compared 
it to `BloomFilter` and they are fairly close, with a slight edge to 
`BloomKFilter` with larger filters:
   ```
   Benchmark                                     (capacity)   (rows)  Mode  Cnt 
   Score     Error  Units
   BloomKFilterBenchmark.testStringBloomFilter      1000000  3000000  avgt   20 
 419.952 ±  14.495  ms/op
   BloomKFilterBenchmark.testStringBloomFilter     10000000  3000000  avgt   20 
 632.029 ±  91.746  ms/op
   BloomKFilterBenchmark.testStringBloomFilter    100000000  3000000  avgt   20 
 829.343 ± 214.949  ms/op
   BloomKFilterBenchmark.testStringBloomKFilter     1000000  3000000  avgt   20 
 430.807 ±  17.243  ms/op
   BloomKFilterBenchmark.testStringBloomKFilter    10000000  3000000  avgt   20 
 617.019 ± 171.415  ms/op
   BloomKFilterBenchmark.testStringBloomKFilter   100000000  3000000  avgt   20 
 862.188 ± 270.329  ms/op
   ```
   
   ```
   Benchmark                                     (capacity)    (rows)  Mode  
Cnt     Score     Error  Units
   BloomKFilterBenchmark.testStringBloomFilter      1000000  10000000  avgt   
20  1577.867 ±  35.568  ms/op
   BloomKFilterBenchmark.testStringBloomFilter     10000000  10000000  avgt   
20  2220.278 ±  71.004  ms/op
   BloomKFilterBenchmark.testStringBloomFilter    100000000  10000000  avgt   
20  2766.451 ± 496.881  ms/op
   BloomKFilterBenchmark.testStringBloomKFilter     1000000  10000000  avgt   
20  1680.831 ±  39.377  ms/op
   BloomKFilterBenchmark.testStringBloomKFilter    10000000  10000000  avgt   
20  1983.740 ±  70.657  ms/op
   BloomKFilterBenchmark.testStringBloomKFilter   100000000  10000000  avgt   
20  2428.820 ± 480.301  ms/op
   ```
   
   and for reference here is the unmodified `BloomKFilter`, which has similar 
test speeds as well
   ```
   Benchmark                                     (capacity)    (rows)  Mode  
Cnt     Score     Error  Units
   BloomKFilterBenchmark.testStringBloomFilter      1000000  10000000  avgt   
20  1599.278 ±  70.931  ms/op
   BloomKFilterBenchmark.testStringBloomFilter     10000000  10000000  avgt   
20  2375.007 ± 149.834  ms/op
   BloomKFilterBenchmark.testStringBloomFilter    100000000  10000000  avgt   
20  2639.661 ±  84.199  ms/op
   BloomKFilterBenchmark.testStringBloomKFilter     1000000  10000000  avgt   
20  1673.166 ±  48.810  ms/op
   BloomKFilterBenchmark.testStringBloomKFilter    10000000  10000000  avgt   
20  2045.913 ±  33.302  ms/op
   BloomKFilterBenchmark.testStringBloomKFilter   100000000  10000000  avgt   
20  2289.513 ±  44.742  ms/op
   ```
   
   I am going to test these changes on a test cluster, to ensure they function 
as expected and correctly resolve #6546, and will add a comment with the 
results here before we should consider merging.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org

Reply via email to