[GitHub] [spark] kudhru commented on pull request #32907: [SPARK-35757][CORE] Add bitwise AND operation and functionality for intersecting bloom filters

2021-06-15 Thread GitBox


kudhru commented on pull request #32907:
URL: https://github.com/apache/spark/pull/32907#issuecomment-861695720


   > The implementation here seems to be obviously looking correct, but could 
we add some unit test for it? (e.g. in `BloomFilterSuite.scala`)
   > 
   
   Added a test as suggested.
   
   > In addition, it looks useful, but just wondering what's the motivation to 
add this? Is there any future change in Spark depending on this? I just checked 
[Guava's BloomFilter does not support this bitwise add as 
well](https://guava.dev/releases/20.0/api/docs/com/google/common/hash/BloomFilter.html).
   
   I was implementing [this 
paper](https://dl.acm.org/doi/10.1145/3267809.3267834) on filtering 
non-overlapping keys out of the tables participating in a SQL join, but I 
could not find the AND function needed to combine the bloom filters 
built for each joining table. I also looked at how the authors of the paper 
[implemented this 
filtering](https://github.com/approxjoin/benchmarks/blob/master/micro-benchs/src/main/scala/ApproxJoinFlitering.scala)
 and found that they used a [third-party bloom filter 
library](https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/util/BloomFilter.scala).
 Hence, I thought it would be worthwhile to have this functionality in the 
Spark repo itself. Let me know if this sounds OK or if you have any doubts or 
questions.
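
To make the idea concrete, here is a minimal sketch of intersecting two bloom filters with a bitwise AND. It is a toy, self-contained class and not the code in this PR or Spark's `BloomFilter`: the class name `ToyBloomFilter`, the hashing scheme, and the sizes are all assumptions for illustration. Only the method name `intersectInPlace` mirrors the PR.

```java
import java.util.BitSet;

/** Toy bloom filter for illustration only; this is NOT Spark's BloomFilter. */
final class ToyBloomFilter {
    private final int numBits;
    private final int numHashes;
    private final BitSet bits;

    ToyBloomFilter(int numBits, int numHashes) {
        if (numBits <= 0 || numHashes <= 0) {
            throw new IllegalArgumentException("numBits and numHashes must be positive");
        }
        this.numBits = numBits;
        this.numHashes = numHashes;
        this.bits = new BitSet(numBits);
    }

    // Hypothetical hashing: derive the i-th bit index from two base hashes.
    private int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1 * 0x9E3779B9, 15) | 1;
        return Math.floorMod(h1 + i * h2, numBits);
    }

    void put(String key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(key, i));
        }
    }

    boolean mightContain(String key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }

    /** Keeps only the bits set in both filters (bitwise AND). */
    ToyBloomFilter intersectInPlace(ToyBloomFilter other) {
        // AND is only meaningful when both filters map keys to the same bit positions.
        if (other.numBits != numBits || other.numHashes != numHashes) {
            throw new IllegalArgumentException("filters must share size and hash count");
        }
        bits.and(other.bits);
        return this;
    }

    public static void main(String[] args) {
        ToyBloomFilter left = new ToyBloomFilter(1024, 3);
        ToyBloomFilter right = new ToyBloomFilter(1024, 3);
        for (String k : new String[] {"a", "b", "c"}) left.put(k);
        for (String k : new String[] {"b", "c", "d"}) right.put(k);

        left.intersectInPlace(right);
        // Keys present in both inputs always survive the AND.
        System.out.println(left.mightContain("b")); // true
        System.out.println(left.mightContain("c")); // true
        // "a" and "d" are usually rejected, but false positives remain possible.
    }
}
```

Every key put into both inputs keeps all of its bits, so the intersected filter has no false negatives for the common keys. Its false positive rate can be higher than that of a filter built directly from the true intersection, which is acceptable for pre-filtering join inputs.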
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] kudhru commented on pull request #32907: [SPARK-35757][CORE] Add bitwise AND operation and functionality for intersecting bloom filters

2021-06-15 Thread GitBox


kudhru commented on pull request #32907:
URL: https://github.com/apache/spark/pull/32907#issuecomment-862048748


   I am a bit confused about how and when this PR will be merged into the 
master branch. Could someone please clarify?





[GitHub] [spark] kudhru commented on pull request #32907: [SPARK-35757][CORE] Add bitwise AND operation and functionality for intersecting bloom filters

2021-06-16 Thread GitBox


kudhru commented on pull request #32907:
URL: https://github.com/apache/spark/pull/32907#issuecomment-862144672


   Could anyone tell me how to fix the failing **MiMa** tests?
   ```
   [error] spark-sketch: Failed binary compatibility check against org.apache.spark:spark-sketch_2.12:3.0.0! Found 1 potential problems (filtered 1)
   [error]  * abstract method intersectInPlace(org.apache.spark.util.sketch.BloomFilter)org.apache.spark.util.sketch.BloomFilter in class org.apache.spark.util.sketch.BloomFilter is present only in current version
   [error]    filter with: ProblemFilters.exclude[ReversedMissingMethodProblem]
   ```
   
https://github.com/kudhru/spark/runs/2836577393?check_suite_focus=true#step:9:3306
   
   
   

