[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size

2016-11-18 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15677521#comment-15677521
 ] 

Reynold Xin commented on SPARK-18252:
-

Thanks - going to close this.


> Improve serialized BloomFilter size
> ---
>
> Key: SPARK-18252
> URL: https://issues.apache.org/jira/browse/SPARK-18252
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.1
>Reporter: Aleksey Ponkin
>Priority: Minor
>
> Since version 2.0, Spark has had a BloomFilter implementation: 
> org.apache.spark.util.sketch.BloomFilterImpl. I have noticed that the current 
> implementation uses the custom class org.apache.spark.util.sketch.BitArray 
> for its bit vector, which allocates memory for the whole filter no matter how 
> many elements are set. Since a BloomFilter can be serialized and sent over the 
> network during the distribution stage, maybe we should use some kind of 
> compressed Bloom filter? For example, 
> [RoaringBitmap|https://github.com/RoaringBitmap/RoaringBitmap] or 
> [javaewah|https://github.com/lemire/javaewah].
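
For illustration, a minimal sketch of what the proposal amounts to: a Bloom 
filter whose bit storage is a RoaringBitmap instead of a preallocated long[] 
word array, so the serialized size grows with the number of set bits rather 
than with the configured capacity. The class name, hashing, and sizing below 
are simplified placeholders, not the Spark sketch API.

{code:java}
import org.roaringbitmap.RoaringBitmap;

// Hypothetical sketch of a RoaringBitmap-backed Bloom filter.
public class RoaringBloomFilter {
  private final RoaringBitmap bits = new RoaringBitmap();
  private final int numBits;           // logical filter width (m)
  private final int numHashFunctions;  // number of hash functions (k)

  public RoaringBloomFilter(int numBits, int numHashFunctions) {
    this.numBits = numBits;
    this.numHashFunctions = numHashFunctions;
  }

  // Simplified double hashing, for illustration only.
  private int index(String item, int i) {
    int h1 = item.hashCode();
    int h2 = Integer.rotateLeft(h1, 16) ^ 0x9E3779B9;
    int combined = h1 + i * h2;
    if (combined < 0) {
      combined = ~combined;
    }
    return combined % numBits;
  }

  public void put(String item) {
    for (int i = 1; i <= numHashFunctions; i++) {
      bits.add(index(item, i));   // only touched positions consume memory
    }
  }

  public boolean mightContain(String item) {
    for (int i = 1; i <= numHashFunctions; i++) {
      if (!bits.contains(index(item, i))) {
        return false;
      }
    }
    return true;
  }

  public long serializedSizeInBytes() {
    bits.runOptimize();                  // try run-length compression
    return bits.serializedSizeInBytes(); // grows with set bits, not with numBits
  }
}
{code}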






[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size

2016-11-18 Thread Gregory SSI-YAN-KAI (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15676986#comment-15676986
 ] 

Gregory SSI-YAN-KAI commented on SPARK-18252:
-

I worked on a custom implementation of Bloom filters compressed with Roaring 
bitmaps some months ago. The first results I got showed that my implementation 
consumed more memory and was somewhat slower than other implementations such as 
Guava's. But my implementation was very basic, and I believe it should be 
possible to improve it so that it performs better in most use cases.




[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size

2016-11-18 Thread Aleksey Ponkin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15676908#comment-15676908
 ] 

Aleksey Ponkin commented on SPARK-18252:


The only thing that can be improved, IMHO, is using the right compression for 
the BitArray during serialization; I think we can achieve 20-30% data 
compression with the right encoder. This feature, of course, must be optional.
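
As a rough sketch of the "optional compression" idea, the hypothetical helper 
below wraps the serialized bytes in a general-purpose compressor behind an 
opt-in flag. It only assumes the existing BloomFilter.writeTo(OutputStream) 
method; the helper itself is not part of the Spark API.

{code:java}
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;

import org.apache.spark.util.sketch.BloomFilter;

// Hypothetical helper: serialize a BloomFilter, optionally compressing the bytes.
public final class BloomFilterCodec {

  public static byte[] serialize(BloomFilter filter, boolean compress) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    OutputStream out = compress ? new GZIPOutputStream(buffer) : buffer;
    filter.writeTo(out);  // existing serialization path
    out.close();          // flushes the GZIP trailer when compression is enabled
    return buffer.toByteArray();
  }

  private BloomFilterCodec() {}
}
{code}

Whether this actually reaches 20-30% depends on how densely the filter is filled.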




[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size

2016-11-18 Thread Aleksey Ponkin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15676899#comment-15676899
 ] 

Aleksey Ponkin commented on SPARK-18252:


I did benchmarks (you can find them 
[here|https://github.com/ponkin/bloo-filter-perf]); the tests were executed on 
a dataset of 1 million random strings.
1. The Bloom filter backed by RoaringBitmap is 20-25% slower than the good old 
BitArray implementation in the put and mightContain operations:
new put - elapsed time: 2308348596 ns
old put - elapsed time: 1700994048 ns
new mightContain - elapsed time: 2191514527 ns
old mightContain - elapsed time: 1637640048 ns
2. A full Bloom filter backed by RoaringBitmap (all expected elements inserted) 
has the same or a bigger size than the BitArray version (as I said earlier, 
RoaringBitmap compresses a half-filled random bit vector poorly).
Conclusion: the benefits of using RoaringBitmap in Bloom filters are not clear.
I think we can close the ticket. Thanks to everyone, and sorry for wasting your 
time.






[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size

2016-11-17 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15674801#comment-15674801
 ] 

Reynold Xin commented on SPARK-18252:
-

Those two methods are pretty inefficient.

When we use this in SQL for joins, we'd want hyper-efficient probing without 
caring too much about the false positive rate (i.e., the size will be small), 
and being able to have all the internal details exposed in a simple way would 
be critical there.

It might be a good idea to create a version of the bloom filter that can be 
compressed, but I definitely don't want that to be the only implementation.




[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size

2016-11-17 Thread Aleksey Ponkin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15674821#comment-15674821
 ] 

Aleksey Ponkin commented on SPARK-18252:


I do not have any benchmarks, but I believe that iteration over the true 
positions in a RoaringBitmap must be pretty fast, since it is a sort of 
run-length encoding, where only random-access operations like get and set are 
painful. We definitely need some benchmarks for this.




[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size

2016-11-17 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15674818#comment-15674818
 ] 

Reynold Xin commented on SPARK-18252:
-

Regarding this - can you find some performance data on how the build time 
compares between the roaring one and the normal bitset?

I'm also thinking maybe we can have the default build return a roaring one, and 
then have a way to return a simple bloom filter based on normal bitsets, which 
SQL can use.





[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size

2016-11-17 Thread Aleksey Ponkin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15674791#comment-15674791
 ] 

Aleksey Ponkin commented on SPARK-18252:


Well, I do not know anything about vectorized probing, but you can freely 
iterate over all true positions in a RoaringBitmap with 
[iterator()|http://static.javadoc.io/org.roaringbitmap/RoaringBitmap/0.6.27/org/roaringbitmap/RoaringBitmap.html#iterator--],
and you can get an int[] of the true positions with 
[toArray()|http://static.javadoc.io/org.roaringbitmap/RoaringBitmap/0.6.27/org/roaringbitmap/RoaringBitmap.html#toArray--].
Or have I missed something?
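
A tiny standalone example of those two accessors, just to show the call 
patterns (the bitmap contents are arbitrary):

{code:java}
import org.roaringbitmap.RoaringBitmap;

public class RoaringIterationDemo {
  public static void main(String[] args) {
    RoaringBitmap bits = RoaringBitmap.bitmapOf(3, 17, 4096, 1 << 20);

    // iterator(): RoaringBitmap is Iterable<Integer>, so a for-each loop
    // visits every true position in ascending order.
    for (int pos : bits) {
      System.out.println("set bit at " + pos);
    }

    // toArray(): materialize all true positions into an int[].
    int[] positions = bits.toArray();
    System.out.println("total set bits: " + positions.length);
  }
}
{code}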




[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size

2016-11-17 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15674767#comment-15674767
 ] 

Reynold Xin commented on SPARK-18252:
-

For 3, the sketch package has no external dependencies, and was created 
explicitly this way so the bloom filter built into Spark can be used in other 
applications without having to worry about dependency conflicts.

For 4, it is much easier to just create a vectorized version of the probing 
code when all we are dealing with is a simple for loop.
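
For context, a minimal sketch (not the actual Spark code) of why probing a 
flat long[] word array is "just a simple for loop": each probe is one shift, 
one mask, and one array load, which is easy to specialize or vectorize.

{code:java}
// Illustrative flat-word bit array; indexes are assumed to be in [0, numBits).
final class FlatBitArray {
  private final long[] words;   // allocated up front: ceil(numBits / 64) longs

  FlatBitArray(long numBits) {
    this.words = new long[(int) ((numBits + 63) >>> 6)];
  }

  void set(long index) {
    words[(int) (index >>> 6)] |= (1L << index);   // shift uses the low 6 bits
  }

  boolean get(long index) {
    return (words[(int) (index >>> 6)] & (1L << index)) != 0;
  }

  // Probing k candidate bit positions for one key is just k iterations of get().
  boolean mightContain(long[] candidateBitIndexes) {
    for (long idx : candidateBitIndexes) {
      if (!get(idx)) {
        return false;
      }
    }
    return true;
  }
}
{code}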





[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size

2016-11-17 Thread Aleksey Ponkin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15674558#comment-15674558
 ] 

Aleksey Ponkin commented on SPARK-18252:


Hi, good points.
1. The problem with the current implementation is that the bloom filter 
acquires memory for the full filter even if there are no items inside. 
RoaringBitmap does not compress a bit vector; it is a compact representation of 
a BitSet. Also, compressing a random bit vector with a general-purpose 
compressor will not gain anything, since it is random and half filled (see the 
small demonstration below). To compress a bloom filter we would need to create 
a "sparse" bitset 
([http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/cbf2.pdf]) and then 
apply something like arithmetic encoding.
2. This is true; we would need to increment the version inside the bloom filter.
3. RoaringBitmap is already used in Spark 
([https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala#L135]).
What is the problem with dependencies?
4. Can you please tell me more about vectorized probing and why RoaringBitmap 
is not suitable for it?
Thanks in advance.
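
To illustrate point 1, here is a small self-contained experiment (assumed 
setup, not from this ticket): DEFLATE barely shrinks a half-filled random bit 
vector, while a sparse one compresses well.

{code:java}
import java.util.Random;
import java.util.zip.Deflater;

public class CompressibilityDemo {

  // Returns the DEFLATE-compressed size of the input in bytes.
  static int deflatedSize(byte[] input) {
    Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
    deflater.setInput(input);
    deflater.finish();
    byte[] chunk = new byte[8192];
    int total = 0;
    while (!deflater.finished()) {
      total += deflater.deflate(chunk);
    }
    deflater.end();
    return total;
  }

  public static void main(String[] args) {
    Random rnd = new Random(42);
    int numBytes = 1 << 20;                  // a 2^23-bit vector (1 MiB)

    byte[] dense = new byte[numBytes];
    rnd.nextBytes(dense);                    // roughly 50% of bits set at random

    byte[] sparse = new byte[numBytes];      // roughly 1% of bits set at random
    int numSparseBits = numBytes * 8 / 100;
    for (int i = 0; i < numSparseBits; i++) {
      int bit = rnd.nextInt(numBytes * 8);
      sparse[bit >>> 3] |= (byte) (1 << (bit & 7));
    }

    System.out.println("dense : raw=" + numBytes + " deflated=" + deflatedSize(dense));
    System.out.println("sparse: raw=" + numBytes + " deflated=" + deflatedSize(sparse));
  }
}
{code}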








[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size

2016-11-17 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15674532#comment-15674532
 ] 

Reynold Xin commented on SPARK-18252:
-

I'm not sure if it is worth fixing this:

1. We already compress data before we send it across the network.
2. This is also not a backward-compatible change and would require different 
versioning.
3. This brings in an extra dependency for a package that has zero external 
dependencies.
4. We will very likely implement vectorized probing for the bloom filter used 
in Spark SQL joins, and using a roaring bitmap would make that a lot harder to 
do.





[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size

2016-11-17 Thread Aleksey Ponkin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15673202#comment-15673202
 ] 

Aleksey Ponkin commented on SPARK-18252:


[~srowen] I've created a [PR|https://github.com/apache/spark/pull/15917] for 
this JIRA.
Please review.




[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size

2016-11-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15673199#comment-15673199
 ] 

Apache Spark commented on SPARK-18252:
--

User 'ponkin' has created a pull request for this issue:
https://github.com/apache/spark/pull/15917




[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size

2016-11-09 Thread Aleksey Ponkin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15650676#comment-15650676
 ] 

Aleksey Ponkin commented on SPARK-18252:


[~cloud_fan] Can you please clarify a few things for me:
[BloomFilterImpl.java|https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/common/sketch/src/main/java/org/apache/spark/util/sketch/BloomFilterImpl.java#L93-L98]
- here you are using an int combinedHash, but we can create a BloomFilter with 
expected insertions much larger than Integer.MAX_VALUE. It seems to me that for 
such filters all changed bits inside the BitArray will be in the range 
[0..Integer.MAX_VALUE], which means we cannot guarantee the expected false 
positive probability for filters larger than Integer.MAX_VALUE bits. Is that 
correct? Or maybe it is a bug? I've just checked the Guava BloomFilter, and it 
uses different strategies to translate element instances to bit indexes: 
[BloomFilterStrategies|https://github.com/google/guava/blob/6ded67ff124f1be4f318dbbeae136d0c995faf37/guava/src/com/google/common/hash/BloomFilterStrategies.java].
If it is a bug, do we need another JIRA for that? Or can I fix it along with 
this JIRA ticket?
Thanks in advance!
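
For reference, a paraphrased sketch (not the exact Spark source) of the 
double-hashing pattern those lines use, showing why an int combinedHash caps 
the reachable bit index at Integer.MAX_VALUE even for a wider filter:

{code:java}
public class CombinedHashRange {

  static long bitIndex(int h1, int h2, int i, long bitSize) {
    int combinedHash = h1 + (i * h2);
    if (combinedHash < 0) {
      combinedHash = ~combinedHash;          // force a non-negative value
    }
    // combinedHash <= Integer.MAX_VALUE, so for bitSize > 2^31 - 1 the result
    // below can only ever land in [0 .. Integer.MAX_VALUE].
    return combinedHash % bitSize;
  }

  public static void main(String[] args) {
    long bitSize = 16L * 1024 * 1024 * 1024;  // a filter wider than 2^31 bits
    long maxSeen = 0;
    for (int i = 1; i <= 5; i++) {
      maxSeen = Math.max(maxSeen, bitIndex(0x7FFFFFFF, 0x6789ABCD, i, bitSize));
    }
    System.out.println(maxSeen + " never exceeds " + Integer.MAX_VALUE);
  }
}
{code}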




[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size

2016-11-06 Thread Aleksey Ponkin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15641881#comment-15641881
 ] 

Aleksey Ponkin commented on SPARK-18252:


Sure, with great pleasure)




[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size

2016-11-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15641874#comment-15641874
 ] 

Sean Owen commented on SPARK-18252:
---

[~ponkin] do you want to try replacing the custom BitArray class with 
RoaringBitmap?




[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size

2016-11-04 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15636492#comment-15636492
 ] 

Wenchen Fan commented on SPARK-18252:
-

Using RoaringBitmap LGTM. I didn't know about RoaringBitmap when I wrote 
BitArray.




[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size

2016-11-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15636061#comment-15636061
 ] 

Sean Owen commented on SPARK-18252:
---

Spark already uses RoaringBitmap elsewhere, so this should probably be 
standardized. [~cloud_fan], you created the BitArray class -- what's your view 
on using one or the other? We probably shouldn't have two.
