jmalkin opened a new pull request, #513: URL: https://github.com/apache/datasketches-java/pull/513
A Bloom filter isn't quite one of our normal sketches, but we've had requests for one over the years. Why do we need yet another implementation? Comparing against Spark's implementation (itself based somewhat on Guava's), I noticed a couple of things:

1. The theoretical size can exceed 2^31-1 bits, but only 31 bits of the hash function are ever used. The index is always a positive 32-bit int, so bits beyond that range can never be set.

2. Our library specializes in simple cross-language portability. While it may be a good idea to look at alternatives at some point, seamless data movement between languages is a known quantity. When we port this to C++/Python we'll have that portability, which would be at least somewhat more complicated with other versions.

API change suggestions are quite welcome. I'm wondering if I should move the public constructors entirely to the builder class, for instance.

Since we have a couple of newly donated membership filters, once we have the API nomenclature down I plan to make a MembershipFilter abstract class or interface in the directory above so that we can have a common usage API. That should make the filters particularly useful in distributed systems: deserialize any of them and use it blindly, regardless of the specific underlying implementation. That will make it easier for people to experiment with and, with luck, adopt the newer filters once they're production ready.
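
To illustrate the index-width point in item 1, here is a minimal sketch. It is not code from this PR or from Spark or Guava. The class name `IndexWidthSketch` and both methods are hypothetical. `index31` assumes the common pattern of combining two 32-bit hashes and forcing the result non-negative, while `index64` shows one possible alternative that derives the index from 64-bit hashes.

```java
// Hypothetical illustration only; names and structure are assumptions, not the PR's API.
final class IndexWidthSketch {

  // 32-bit style: combine two int hashes, flip negatives, then reduce modulo numBits.
  // The result can never exceed Integer.MAX_VALUE, however large numBits is.
  static long index31(int h1, int h2, int i, long numBits) {
    int combined = h1 + i * h2;      // int arithmetic, may overflow and wrap
    if (combined < 0) {
      combined = ~combined;          // maps negatives into [0, Integer.MAX_VALUE]
    }
    return combined % numBits;       // int promoted to long for the modulo
  }

  // 64-bit style: combine two long hashes and drop the sign bit before reducing.
  // Indices at or above 2^31 are reachable when numBits is large enough.
  static long index64(long h1, long h2, int i, long numBits) {
    long combined = h1 + i * h2;     // long arithmetic, may overflow and wrap
    return (combined >>> 1) % numBits; // unsigned shift makes the value non-negative
  }

  public static void main(String[] args) {
    long numBits = 1L << 33;         // a filter larger than 2^31-1 bits
    System.out.println("31-bit style, largest possible index: "
        + index31(Integer.MAX_VALUE, 0, 0, numBits));
    System.out.println("64-bit style, example index: "
        + index64(1L << 33, 0, 0, numBits));
  }
}
```

With a filter of 2^33 bits, the 32-bit variant can only ever touch the first 2^31 bit positions, so most of the allocated space would go unused.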
