[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size
[ https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15677521#comment-15677521 ] Reynold Xin commented on SPARK-18252:
Thanks - going to close this.
> Improve serialized BloomFilter size
> -----------------------------------
>
> Key: SPARK-18252
> URL: https://issues.apache.org/jira/browse/SPARK-18252
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.0.1
> Reporter: Aleksey Ponkin
> Priority: Minor
>
> Since version 2.0, Spark has had a BloomFilter implementation:
> org.apache.spark.util.sketch.BloomFilterImpl. I have noticed that the current
> implementation uses a custom class, org.apache.spark.util.sketch.BitArray,
> for the bit vector, which allocates memory for the whole filter no matter how
> many elements are set. Since a BloomFilter can be serialized and sent over the
> network during the distribution stage, maybe we should use some kind of
> compressed Bloom filter? For example,
> [RoaringBitmap|https://github.com/RoaringBitmap/RoaringBitmap] or
> [javaewah|https://github.com/lemire/javaewah].
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
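The description above notes that BitArray allocates the whole bit vector up front. A minimal, dependency-free sketch of that behavior (the class below is illustrative, not Spark's actual BitArray, though it mirrors the usual one-long-per-64-bits layout):

```java
// Illustrative stand-in for a BitArray-style bit vector: the backing long[]
// is sized from numBits alone, so the memory cost is fixed at construction,
// before a single element is ever inserted.
public class BitArraySketch {
    private final long[] data;

    BitArraySketch(long numBits) {
        // one 64-bit word per 64 bits of capacity, whether any bit is set or not
        this.data = new long[(int) Math.ceil(numBits / 64.0)];
    }

    void set(long index) {
        data[(int) (index >>> 6)] |= (1L << index); // shift uses the low 6 bits
    }

    boolean get(long index) {
        return (data[(int) (index >>> 6)] & (1L << index)) != 0;
    }

    long sizeInBytes() {
        return 8L * data.length;
    }

    public static void main(String[] args) {
        BitArraySketch empty = new BitArraySketch(1_000_000);
        // ~125 KB is already allocated even though nothing has been inserted
        System.out.println("bytes=" + empty.sizeInBytes());
        empty.set(42);
        System.out.println("get(42)=" + empty.get(42) + " get(43)=" + empty.get(43));
    }
}
```

This fixed up-front cost is exactly what a compressed representation such as RoaringBitmap would avoid for a sparse filter.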
[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size
[ https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15676986#comment-15676986 ] Gregory SSI-YAN-KAI commented on SPARK-18252:
Some months ago I worked on a custom implementation of Bloom filters compressed with Roaring bitmaps. The first results I got showed that my implementation consumed more memory and was somewhat slower than other implementations such as Guava's. But my implementation was very basic, and I believe it could be improved to perform better in most use cases.
[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size
[ https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15676908#comment-15676908 ] Aleksey Ponkin commented on SPARK-18252:
The only thing that can be improved, IMHO, is using the right compression for BitArray during serialization; I think we can achieve 20-30% data compression using the right encoder. This feature, of course, must be optional.
[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size
[ https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15676899#comment-15676899 ] Aleksey Ponkin commented on SPARK-18252:
I did benchmarks (you can find them [here|https://github.com/ponkin/bloo-filter-perf]) - tests were executed with a dataset of 1 million random strings.
1. A Bloom filter backed by RoaringBitmap is 20-25% slower than the good old BitArray implementation in the put and mightContain operations.
   new put - elapsed time: 2308348596 ns
   old put - elapsed time: 1700994048 ns
   new mightContain - elapsed time: 2191514527 ns
   old mightContain - elapsed time: 1637640048 ns
2. A full Bloom filter backed by RoaringBitmap (all expected elements inserted) has the same or a bigger size than the BitArray version (as I said earlier, RoaringBitmap compresses half-filled random bit vectors poorly).
Conclusion: the benefits of using RoaringBitmap in Bloom filters are not clear. I think we can close the ticket. Thanks to everyone, sorry for wasting your time)
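The timing numbers above come from the linked benchmark repo. Purely as a hedged illustration of the methodology (class name, sizes, and the BitSet backend are stand-ins, not the actual benchmark code), a System.nanoTime micro-benchmark of a put-style loop might look like this:

```java
import java.util.BitSet;
import java.util.Random;

// Hedged sketch of a put-loop micro-benchmark in the style described above.
// A real comparison would time both backends (BitArray vs. a RoaringBitmap-
// backed filter) on identical inputs; here a BitSet stands in for the backend.
public class PutBenchSketch {
    public static void main(String[] args) {
        Random rnd = new Random(42);   // fixed seed for repeatability
        int n = 1_000_000;             // 1 million simulated insertions
        BitSet bits = new BitSet(1 << 24);
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            // stand-in for BloomFilter.put(item): set a pseudo-random bit
            bits.set(rnd.nextInt(1 << 24));
        }
        long elapsed = System.nanoTime() - start;
        System.out.println("put loop elapsed: " + elapsed + " ns");
    }
}
```

Note that a single wall-clock loop like this is only indicative; a harness such as JMH would be needed for trustworthy numbers.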
[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size
[ https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15674801#comment-15674801 ] Reynold Xin commented on SPARK-18252:
Those two methods are pretty inefficient. When we use this in SQL for joins, we'd want hyper-efficient probing without caring too much about the false positive rate (i.e. the size will be small), and being able to have all the internal details exposed in a simple way would be critical there. It might be a good idea to create a version of the bloom filter that can be compressed, but I definitely don't want that to be the only implementation.
[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size
[ https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15674821#comment-15674821 ] Aleksey Ponkin commented on SPARK-18252:
I do not have any benchmarks, but I believe that iterating over the true positions in a RoaringBitmap must be pretty fast, since it is a sort of run-length encoding, where only random-access operations like get and set are pretty painful. We definitely need some benchmarks for this.
[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size
[ https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15674818#comment-15674818 ] Reynold Xin commented on SPARK-18252:
Regarding this - can you find some performance data on how the build time compares between the roaring one and the normal bitset? I'm also thinking maybe we can have the default build return a roaring one, and then have a way to return a simple bloom filter based on normal bitsets, which SQL can use.
[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size
[ https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15674791#comment-15674791 ] Aleksey Ponkin commented on SPARK-18252:
Well, I do not know anything about vectorized probing, but you can freely iterate over all true positions in a RoaringBitmap with this method [http://static.javadoc.io/org.roaringbitmap/RoaringBitmap/0.6.27/org/roaringbitmap/RoaringBitmap.html#iterator--], and you can get an int[] of the true positions with this method [http://static.javadoc.io/org.roaringbitmap/RoaringBitmap/0.6.27/org/roaringbitmap/RoaringBitmap.html#toArray--]. Or did I miss something?
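The iterate-only-the-true-positions access pattern mentioned above can be shown without pulling in the RoaringBitmap dependency; java.util.BitSet's nextSetBit offers the same kind of traversal (RoaringBitmap's iterator()/toArray() are the analogous, and on sparse data far more compact, APIs):

```java
import java.util.BitSet;

// Iterating only over the true positions of a bit vector - the access pattern
// that RoaringBitmap.iterator() / toArray() expose. BitSet is used here purely
// as a dependency-free stand-in.
public class SetBitIteration {
    public static void main(String[] args) {
        BitSet bits = new BitSet();
        bits.set(3);
        bits.set(64);
        bits.set(1000);
        StringBuilder positions = new StringBuilder();
        // standard idiom: jump from one set bit to the next, skipping zeros
        for (int i = bits.nextSetBit(0); i >= 0; i = bits.nextSetBit(i + 1)) {
            positions.append(i).append(' ');
        }
        System.out.println(positions.toString().trim());
    }
}
```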
[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size
[ https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15674767#comment-15674767 ] Reynold Xin commented on SPARK-18252:
For 3, the sketch package has no external dependency, and was created explicitly this way so the bloom filter built in Spark can be used in other applications without having to worry about dependency conflicts. For 4, it is much easier to just create a vectorized version of the probing code when all we are dealing with is a simple for loop.
[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size
[ https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15674558#comment-15674558 ] Aleksey Ponkin commented on SPARK-18252:
Hi, good points.
1. The problem with the current implementation is that the Bloom filter acquires memory for the full filter even if there are no items inside. RoaringBitmap is not compressing a bit vector; it is a compact representation of a BitSet. Also, compressing a random bit vector with the usual compression algorithms will not gain any effect, since it is random and half filled. To compress a Bloom filter we need to create a "sparse" bitset ([http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/cbf2.pdf]) and then apply something like arithmetic encoding.
2. This is true; we would need to increment the version inside the Bloom filter.
3. RoaringBitmap is already used in Spark ([https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala#L135]). What is the problem with dependencies?
4. Can you please tell me more about vectorized probing and why RoaringBitmap is not suitable for it? Thanks in advance.
[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size
[ https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15674532#comment-15674532 ] Reynold Xin commented on SPARK-18252:
I'm not sure it is worth fixing this:
1. We already compress data before we send it across the network.
2. This is also not a backward-compatible change and would require different versioning.
3. This brings in an extra dependency for a package that has zero external dependencies.
4. We very likely will implement vectorized probing for the bloom filter used in Spark SQL joins, and using a roaring bitmap would make that a lot harder to do.
[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size
[ https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15673202#comment-15673202 ] Aleksey Ponkin commented on SPARK-18252:
[~srowen] I've created a [PR|https://github.com/apache/spark/pull/15917] for this JIRA. Please review.
[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size
[ https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15673199#comment-15673199 ] Apache Spark commented on SPARK-18252:
User 'ponkin' has created a pull request for this issue: https://github.com/apache/spark/pull/15917
[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size
[ https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15650676#comment-15650676 ] Aleksey Ponkin commented on SPARK-18252:
[~cloud_fan] Can you please clarify some things for me? In [BloomFilterImpl.java|https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/common/sketch/src/main/java/org/apache/spark/util/sketch/BloomFilterImpl.java#L93-L98] you use an int combinedHash, but we can create a BloomFilter with expected insertions much larger than Integer.MAX_VALUE. It seems to me that for such filters all changed bits inside BitArray will be in the range [0..Integer.MAX_VALUE]. That means we cannot guarantee the expected false positive probability for filters larger than Integer.MAX_VALUE bits. Is that correct, or may it be a bug? I've just checked the Guava library's BloomFilter, and they use different strategies to translate element instances to bit indexes: [BloomFilterStrategies|https://github.com/google/guava/blob/6ded67ff124f1be4f318dbbeae136d0c995faf37/guava/src/com/google/common/hash/BloomFilterStrategies.java]. If it is a bug, do we need another JIRA for that, or can I fix it along with this JIRA ticket? Thanks in advance!
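The range limitation described above can be demonstrated in isolation. Assuming a double-hashing scheme of the form the linked lines use (an int combinedHash = h1 + i * h2, flipped if negative, then taken modulo numBits; the hash values below are made up for illustration), every probed index stays within the int range no matter how large numBits is:

```java
// Sketch of the int-combinedHash range issue discussed above. Because the
// combined hash is a non-negative int before the modulo, no probed index can
// ever exceed Integer.MAX_VALUE, however many bits the filter actually has.
public class IntHashRange {
    public static void main(String[] args) {
        long numBits = 8L * Integer.MAX_VALUE; // a filter far larger than 2^31 bits
        int h1 = 0x7ead_beef;                  // made-up stand-ins for the two
        int h2 = 0x1234_5678;                  // 32-bit halves of the item hash
        long maxIndex = 0;
        for (int i = 1; i <= 10; i++) {        // k = 10 probes per item
            int combinedHash = h1 + i * h2;    // int arithmetic, may overflow
            if (combinedHash < 0) {
                combinedHash = ~combinedHash;  // flip to non-negative
            }
            long index = combinedHash % numBits;
            maxIndex = Math.max(maxIndex, index);
        }
        if (maxIndex > Integer.MAX_VALUE) {
            throw new AssertionError("index escaped the int range");
        }
        System.out.println("max probed index " + maxIndex + " of " + numBits + " bits");
    }
}
```

So for any filter with more than 2^31 bits, the upper portion of the bit array can never be touched, which is what makes the advertised false positive rate unattainable.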
[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size
[ https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15641881#comment-15641881 ] Aleksey Ponkin commented on SPARK-18252:
Sure, with great pleasure)
[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size
[ https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15641874#comment-15641874 ] Sean Owen commented on SPARK-18252:
[~ponkin] do you want to try replacing the custom BitArray class with RoaringBitmap?
[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size
[ https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15636492#comment-15636492 ] Wenchen Fan commented on SPARK-18252:
Using RoaringBitmap LGTM. I didn't know about RoaringBitmap when I wrote BitArray.
[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size
[ https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15636061#comment-15636061 ] Sean Owen commented on SPARK-18252:
Spark already uses RoaringBitmap elsewhere, so this should probably be standardized. [~cloud_fan], you created the BitArray class - what's your view on using one or the other? We probably shouldn't have two.