[ https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15676899#comment-15676899 ]
Aleksey Ponkin commented on SPARK-18252:
----------------------------------------

I ran benchmarks (available [here|https://github.com/ponkin/bloo-filter-perf]) against a dataset of 1 million random strings.

1. The RoaringBitmap-backed Bloom filter is 20-25% slower than the good old BitArray implementation for both put and mightContain:

   new put          - elapsed time: 2308348596 ns
   old put          - elapsed time: 1700994048 ns
   new mightContain - elapsed time: 2191514527 ns
   old mightContain - elapsed time: 1637640048 ns

2. A full RoaringBitmap-backed Bloom filter (all expected elements inserted) is the same size as or larger than the BitArray version (as I said earlier, RoaringBitmap compresses a half-filled random bit vector poorly).

Conclusion: the benefit of using RoaringBitmap in Bloom filters is not clear, so I think we can close the ticket. Thanks to everyone, and sorry for wasting your time.

> Improve serialized BloomFilter size
> -----------------------------------
>
>                 Key: SPARK-18252
>                 URL: https://issues.apache.org/jira/browse/SPARK-18252
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.0.1
>            Reporter: Aleksey Ponkin
>            Priority: Minor
>
> Since version 2.0, Spark has a BloomFilter implementation,
> org.apache.spark.util.sketch.BloomFilterImpl. I have noticed that the current
> implementation uses a custom class, org.apache.spark.util.sketch.BitArray,
> for the bit vector, which allocates memory for the whole filter no matter how
> many elements are set. Since a BloomFilter can be serialized and sent over
> the network during the distribution stage, maybe we should use some kind of
> compressed Bloom filter, for example
> [RoaringBitmap|https://github.com/RoaringBitmap/RoaringBitmap] or
> [javaewah|https://github.com/lemire/javaewah].
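The size result in point 2 follows from standard Bloom filter math: an optimally sized filter ends up with roughly 50% of its bits set, and a half-dense random bit vector is essentially incompressible, so bitmap compression schemes like RoaringBitmap cannot win. A minimal sketch using only the JDK (no Spark dependency; the 3% false-positive probability used below is an assumption matching Spark's commonly documented default, and the class and constants are illustrative, not from Spark's source):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Random;
import java.util.zip.GZIPOutputStream;

public class BloomFillDemo {

    // Expected fraction of set bits after inserting n items into m bits
    // with k hash functions: 1 - e^(-kn/m).
    static double fillFraction(long n, long m, int k) {
        return 1.0 - Math.exp(-(double) k * n / m);
    }

    // Gzipped size of `data`, used here as a rough proxy for how well any
    // general-purpose bitmap compression could do on the bit vector.
    static int gzipSize(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.size();
    }

    public static void main(String[] args) throws IOException {
        // Optimal sizing for n = 1M items at p = 3% FPP (assumed default):
        //   m = -n ln(p) / (ln 2)^2,  k = (m/n) ln 2
        long n = 1_000_000;
        double p = 0.03;
        long m = (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
        int k = (int) Math.max(1, Math.round((double) m / n * Math.log(2)));
        System.out.printf("m = %d bits, k = %d, expected fill = %.3f%n",
                m, k, fillFraction(n, m, k)); // fill lands near 0.5

        // A ~50% dense bit vector looks like random noise and barely
        // compresses; a sparse vector (few insertions) compresses well.
        int nBytes = 1 << 20;
        Random rnd = new Random(42);
        byte[] dense = new byte[nBytes];
        rnd.nextBytes(dense); // models a full, optimally sized filter
        byte[] sparse = new byte[nBytes]; // models a nearly empty filter
        for (int i = 0; i < nBytes / 100; i++) {
            sparse[rnd.nextInt(nBytes)] |= (byte) (1 << rnd.nextInt(8));
        }
        System.out.println("dense gzip bytes:  " + gzipSize(dense));
        System.out.println("sparse gzip bytes: " + gzipSize(sparse));
    }
}
```

This is consistent with the benchmark: RoaringBitmap only helps while the filter is sparse (few insertions), but a filter sized for its expected element count converges on ~50% density exactly when it is full, which is when the serialized size matters.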
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org