[ 
https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674558#comment-15674558
 ] 

Aleksey Ponkin commented on SPARK-18252:
----------------------------------------

Hi, good points.
1. Problem with current implementation that bloom filter acquire memory for 
full bloom filter even if there are no items inside. RoaringBitmap is not 
compressing a bit vector, it is compact representation of BitSet. Also 
compressing random bit vector with usual compressions will not gain any effect, 
since it is random and half filled. To compress bloom filter we need to create 
"sparse" bitset 
([http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/cbf2.pdf]) and than 
apply something like arithmetic encoding.
2. This is true, we need to increment version inside bloom filter
3. RoaringBitmap is already used in 
Spark([https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala#L135]).
 What is the problem with dependencies? 
4. Please can you tell me more about vectorized probing and why RoaringBitmap 
is not suitable for this?
Thanks in advance.


> Improve serialized BloomFilter size
> -----------------------------------
>
>                 Key: SPARK-18252
>                 URL: https://issues.apache.org/jira/browse/SPARK-18252
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.0.1
>            Reporter: Aleksey Ponkin
>            Priority: Minor
>
> Since version 2.0 Spark has BloomFilter implementation - 
> org.apache.spark.util.sketch.BloomFilterImpl. I have noticed that current 
> implementation is using custom class org.apache.spark.util.sketch.BitArray 
> for bit vector, which is allocating memory for the whole filter no matter how 
> many elements are set. Since BloomFilter can be serialized and sent over 
> network in distribution stage may be we need to use some kind of compressed 
> bloom filters? For example 
> [https://github.com/RoaringBitmap/RoaringBitmap][RoaringBitmap] or 
> [javaewah][https://github.com/lemire/javaewah]. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to