[ https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15677521#comment-15677521 ]
Reynold Xin commented on SPARK-18252:
-------------------------------------

Thanks - going to close this.

> Improve serialized BloomFilter size
> -----------------------------------
>
>                 Key: SPARK-18252
>                 URL: https://issues.apache.org/jira/browse/SPARK-18252
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.0.1
>            Reporter: Aleksey Ponkin
>            Priority: Minor
>
> Since version 2.0, Spark has had a BloomFilter implementation,
> org.apache.spark.util.sketch.BloomFilterImpl. I have noticed that the current
> implementation uses the custom class org.apache.spark.util.sketch.BitArray
> for its bit vector, which allocates memory for the whole filter no matter how
> many elements are set. Since a BloomFilter can be serialized and sent over the
> network during the distribution stage, maybe we should use some kind of
> compressed bloom filter? For example, RoaringBitmap
> (https://github.com/RoaringBitmap/RoaringBitmap) or javaewah
> (https://github.com/lemire/javaewah).
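
The concern in the issue description, that a dense bit vector serializes at full size no matter how many bits are set, can be illustrated with a minimal sketch. This is not Spark's BitArray: the class DenseBitsSketch and its methods are hypothetical stand-ins, and the JDK's general-purpose DeflaterOutputStream is swapped in for the compressed bitmap libraries the reporter mentions, only to show how much redundancy a sparsely populated vector carries.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.zip.DeflaterOutputStream;

// Hypothetical stand-in for a fixed-size bit vector backed by long[].
// It is NOT org.apache.spark.util.sketch.BitArray.
public class DenseBitsSketch {

    // Allocates every word up front, regardless of how many bits get set.
    static long[] newBits(int numBits) {
        return new long[(numBits + 63) / 64];
    }

    static void set(long[] words, int index) {
        words[index >>> 6] |= 1L << (index & 63);
    }

    // Writes the word count followed by every word. With deflate == true the
    // same bytes pass through DEFLATE (an assumption for illustration; the
    // issue suggests RoaringBitmap or javaewah instead).
    static byte[] serialize(long[] words, boolean deflate) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        OutputStream sink = deflate ? new DeflaterOutputStream(bytes) : bytes;
        try (DataOutputStream out = new DataOutputStream(sink)) {
            out.writeInt(words.length);
            for (long w : words) {
                out.writeLong(w);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        int numBits = 1 << 20;              // 1 Mi bits = 16384 longs
        long[] bits = newBits(numBits);
        for (int i = 0; i < 100; i++) {     // set only 100 bits
            set(bits, (i * 7919) % numBits);
        }
        System.out.println("dense bytes:    " + serialize(bits, false).length);
        System.out.println("deflated bytes: " + serialize(bits, true).length);
    }
}
```

The dense form is always 4 + 8 * words.length bytes, whether the vector is empty or full. How much a compressed representation saves depends on how sparse the filter is; a bloom filter filled to its design capacity has roughly half its bits set and would compress poorly.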