[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17003765#comment-17003765 ] Claude Warren commented on COLLECTIONS-728: --- There may be some disconnect here. The HashFunction interface is not intended to generically describe hash functions, but is intended to provide the information necessary for Bloom filter usage of a Hash function. I am assuming that in your example code the HashFunction does not extend HashFunctionIdentity and that, in fact, HashFunctionIdenty is not available. The issue in this case is that without being able to check how the underlying hash function operates all you can check is that the equals() operator works for the instances of the class. Currently the `check` method you list is equivalent to f.getShape().equals( g.getShape() ). (e.g if the Shapes are the same the filters are comparable). Note: the Shape.equals() method currently uses the HashFunctionIdentity to verify the functions are the same. The `isUsing` method is effectively HashFunctionIdentity.COMMON_COMPARATOR.compare( f.getShape().getHashFunctionIdentity(), h ) == 0. Note: the current HashFunction extends HashFunctionIdentity. > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, > Usage.md > > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17003600#comment-17003600 ] Gilles Sadowski commented on COLLECTIONS-728: - {quote}we may have to agree to disagree on this one. {quote} Maybe, maybe not. I'm not even there yet: My main problem is that (IIUC) this discussion (about how to distinguish hash functions and how to check for Bloom filters "compatibility"[1]) should be be postponed to _after_ committing the "main" (as I see it, from my incomplete POV) functionality. Maybe we don't agree about what should be "main" (or core). If that's the case, please post to the "dev" ML; explain why we are stuck, and that additional reviewers are needed. [1] One could construe that such a check could be provided "externally", e.g. (random thought): {code:java} public class BloomCompatibilityChecker { /** Checks whethe the filters are comparable. */ public static boolean check(BloomFilter f, BloomFilter g) { return ...; } /** Checks whether the filter is using some specific hash function. */ public static boolean isUsing(BloomFilter f, HashFunction h) { return ...; } } {code} > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, > Usage.md > > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17003589#comment-17003589 ] Claude Warren commented on COLLECTIONS-728: --- I think we may have to agree to disagree on this one. I can think of no way to compare hash function properties and only a same/not-same comparison possible using the equals() method. Given that there might be 2 implementations that yield identical results this is not an optimal solution. The decomposition of the properties into an interface makes it possible to identify such. Caching hasher is not part of this code base yet it builds upon it. It was built on the opportunity created by this code base as a natural upshot of being able to identify the properties of the hash function. If you wish I will contribute CachingHasher as part of this code. {quote}Would those methods that describe hash function's properties be called from inside this code? {quote} They already are using inside this code. It is part of determining if a Shape is the same for different filters. > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, > Usage.md > > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002067#comment-17002067 ] Gilles Sadowski commented on COLLECTIONS-728: - bq. how do you propose to move the hash function [...] In the code which you propose for inclusion in [Collections], nothing requires serialization (hence the reluctance to impose {{Serializable}}). If some application requires serialization, it must create an instance of {{HashFunction}} that can be serialized. bq. What do you mean by "external code" You referred to "CachingHasher (from another project)": Another project != this project. "CachingHasher" is not part of this code base, yet it dictates an API. I'd be more comfortable starting with an internally consistent and minimal API. Then if a use case comes up (like "CachingHasher"), requests for enhancement can be considered. bq. The isIterative() question is information about how the hash is constructed In my (possibly short-sighted) view, [Collections] is the not the place to define an API for describing hash functions. Would those methods that describe hash function's properties be called from inside this code? > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, > Usage.md > > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17001911#comment-17001911 ] Claude Warren commented on COLLECTIONS-728: --- I think I am confused. Without resorting to Serialized interface how do you propose to move the hash function from one system to another? What do you mean by: "So if in practice, there is a problem, my opinion is that the design must be modified without resorting to external code."? What do you mean by "external code" in this statement? What do you mean here: "I've asked the same question about methods like {{isIterative()}}, that also obviously only make sense in a Java environment." The isIterative() question is information about how the hash is constructed. That construction is not Java only. The example provided was from Cassandra and thus in Java, but the Kirsh and Mitzenmacher paper describes the process in mathematical terms and thus is applicable to any programming language. Similarly the Signed/Unsigned interpretation of the buffer makes a difference and is not a Java only implementation detail. > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, > Usage.md > > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17001889#comment-17001889 ] Gilles Sadowski commented on COLLECTIONS-728: - bq. CachingHasher (from another project) is not possible. Potentially irrelevant. Furthermore, I don't get how in principle providing _more_ information (i.e. the {{HashFunction}} instance itself rather than some of its "properties") can be an issue. So if in practice, there is a problem, my opinion is that the design must be modified without resorting to external code. bq. communicating with systems that are not java based Potentially irrelevant. I've asked the same question about methods like {{isIterative()}}, that also obviously only make sense in a Java environment. We had already noticed those non-converging opinions; it seems that we are repeating arguments. It could be my fault, in case I've missed key points in your descriptions. All I think I understand is that application developers need a check that different parts of a distributed system (both Java and non-Java) use the same hash function. IIUC, a crude check is OK for you (like {{String}} equality). But without indications on the capabilities of the non-Java parts of such a system (e.g. how do they represent {{isIterative()}}?), I'm wary to clutter the Java API, where the check could actually be made robust through standard Java practice (e.g. using {{equals(Object)}}). The simplest way out is Alex's proposal of a "registry", as a plain list of numbers (and a textual description of the associated "properties"). Even so, I'm maintaining that this is a second-order discussion, as the main Bloom filter functionality could be fully coded without this fool-proof checking of the hash function: Indeed, if one of the requirements of the distributed system is that the hash function is the same, it could simply be assumed. Sure, it's not robust; but neither is any of the proposed alternatives. I understand that you don't agree with this incremental approach; hence the only way out is to post to the "dev" ML, asking for more knowledgeable (or less picky ;)) people to have a look at your PR. I really hope that some solution can be found because it would be nice that Commons can host this Bloom filter implementation. > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, > Usage.md > > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17001847#comment-17001847 ] Claude Warren commented on COLLECTIONS-728: --- I don't think that isCompataible() as a single function works. 1) Without access to the cyclic/iterative setting the CachingHasher (from another project) is not possible. 2) When communicating with systems that are not java based how does the isCompatible() and function serialization work? I truly believe the best mechanism is to identify the differences between implementations and then provide properties for those differences. This will allow the identity of the implementation to be serialized to other systems as needed, without the need for Java Serializable interface or in fact Java at all on the remote system. > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, > Usage.md > > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17001053#comment-17001053 ] Gilles Sadowski commented on COLLECTIONS-728: - {quote}When Shapes or BloomFilters that were created with MyHash are serialized by the application (application responsibility) the HashFunctionIdentity will be serialized. The application deserializing the HashFunctionIdentity will either need to have the MyHash class or another implementation of the same function. {quote} Then, the question is: What does the Bloom filter functionality (that constitutes your code contribution to [Collections]) need beyond a {{HashFunction}} interface that defines a method {{isCompatible(HashFunction other)}}? In effect, my issue is still the same (I think) as several comments ago: How to ensure that some instance of {{HashFunctionIdentity}} will *really* be compatible with another (whether within the same JVM or after serialization)? It seems to me that whatever check is done in a "composed" way with {{isIterative()}}, {{isUnsigned()}}, {{getName()}}, ... can be achieved in a simple (and more robust, at the implementor's discretion) way with {{isCompatible(HashFunction)}}. Furthermore, in order to call {{isIterative()}}, etc. you obviously need a JVM at the other end of the serialization/deserialization pipeline. Now, if distributed processing is an integral feature (?) of using the {{BloomFilter}} class, perhaps that we must define as {code:java} interface HashFunction extends Serializable { // ... } {code} Note that what Sebb and others pointed out is that o.a. things, "it's difficult to code Serializable classes properly" but the rest of the quote ("We should not promise what we likely cannot continue to deliver") does not apply here, because the burden will be on implementors. And _they_ will choose how robust and future-proof their handling of {{Serializable}} needs to be (for their purpose). However, is it actually true that {{BloomFilter}} usage only makes sense in distributed applications? If not, I'd be wary of the above definition (since it will impose an unnecessary burden, or could provide a false sense of security if {{Serializable}} is not properly implemented). > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, > Usage.md > > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17001015#comment-17001015 ] Claude Warren commented on COLLECTIONS-728: --- The hash computation is used in the Hasher class. It is generally only used when creating Bloom filters. In some cases it is used when determining if an object is in the filter. I originally implemented Serializable in several objects but that was removed early on. As Sebb pointed out there are security issues with classes that implement Serializable[1]. Your description of how to add a hash function is basically correct. However, I think there is a misunderstanding. The `HashFunctionIdentity` provides the information necessary it identify the function. it is used in the `Shape` class which in turn is used in the `BloomFilter` class. The `HashFunction` interface is much as you suggest except that `HashFunction` extends `HashFunctionIdentity`. So a user wanting to add a new hash function would create a HashFunction instance (MyHash) that identifies the function and contains a call to that function via the `compute()` method. When Shapes or BloomFilters that were created with MyHash are serialized by the application (application responsibility) the HashFunctionIdentity will be serialized. The application deserializing the HashFunctionIdentity will either need to have the MyHash class or another implementation of the same function. I hope this helps. [1] [https://lists.apache.org/thread.html/7a0bb63540e52b97f08c608f11c2fd9dd5b770c6f21cdc04a8c29e6d%40%3Cdev.commons.apache.org%3E] > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, > Usage.md > > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17000951#comment-17000951 ] Gilles Sadowski commented on COLLECTIONS-728: - {quote}properties are now tracked in the HashFunctionIdentity. {quote} Perhaps {{HashFunctionProperties}} or {{HashFunctionSpecification}} would be a better name (?). {quote}An equals method on the HashFunction would require the HashFunction object be in some sense Serializable. {quote} IIUC correctly (?), a hash function can be implemented by a third party (not abiding by any of APIs to be defined here). Then to be usable within the {{BloomFilter}} framework defined here, an application developer will need * to wrap the function's "properties" in a class (say, {{MyHashFunctionIdentity}}) that implements {{HashFunctionIdentity}} * to define a serialization scheme for {{MyHashFunctionIndentity}} (if the application is distributed). If so, why not have a {{MyHashFunction}} class that wraps the full third-party hash function rather then just its "properties"? We could have, in [Collections] {code:java} /** * Interface for implementations used within this framework. */ public interface HashFunction { /** * Returns {@code true} when from a given input, this instance * computes the same output as the {@code other} instance. */ boolean isCompatible(HashFunction other); /** * Computes the hash value. * * @param input Input. * @param seed Seed. * @return the hash. */ long compute(byte[] input, int seed); } {code} As simple as can be (?). And, the application developer would responsible for implementing the notion of "compatibility" (in the same way that he is responsible for computing the hash value, including issues arising from using a buggy function): {code:java} import java.io.Serializable; import com.thirdparty.hash.NiceFunction; import org.apache.commons.collections.bloomfilter.HashFunction; public class MyHash implements HashFunction, Serializable { private static final byte[] COMP_TEST_A = new byte[] {- 19, 45, -34, 65, 1, 22, 17, 74}; private static final int COMP_TEST_B = -1395561; private static final long serialVersionUID = 123456789L; private NiceFunction f; // Assuming that "NiceFunction" is "Serializable". public MyHash(NiceFunction f) { this.f = f; } @Override public boolean isCompatible(HashFunction other) { if (other instanceof MyHash) { return true; } else { return Long.compare(compute(COMP_TEST_A, COMP_TEST_B), other.compute(COMP_TEST_A, COMP_TEST_B)) == 0; } } @Override public long compute(byte[] input, int seed) { // ... } } {code} Then, [Collections] could provide wrappers for the functions implemented in [Codec]. Casual users will get new functions at release time, while not preventing power users to define their own wrappers around experimental and non-standard functions. Or am I completely off base? bq. I would not expect the actual code for for the function to be sent. This would probably require that the listener be a Java based application so that it could have the HashFunction implementation. I'm confused. Where is the hash computation used? > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, > Usage.md > > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1748#comment-1748 ] Claude Warren commented on COLLECTIONS-728: --- In all the work I have done with Bloom filters over the past 8 years, I have come to believe that there are several pieces of information that are necessary to determine if the hashing is the same for two filters. 1) they hashing algorithm must be the same. Generally the algorithms are specified by name and this seems standard across the industry where hashing is used. 2) The results of the hash have to be converted into integer type values and the modulus taken. So the question becomes were the bits in the result buffer assumed to be signed or unsigned, little endian or bit endian. 3) How was the hashing used. In construction the Bloom filter multiple hashes are used. In traditional iterative methods this is done by calling the selected hash function with a different seed for each hash required. The second method described by Adam Kirsch and Micheal Mitzenmacher[1] has become more common and is used in applications like Cassandra[2]. There are other considerations, like is the hashing algorithm implemented correctly, but I believe they fall outside the normal bounds of equivalence checking during normal operation. I originally encoded these 3 properties into the name. While this made for simple checks it posed other problems and was, rightly so, objected to. Instead the 3 properties are now tracked in the HashFunctionIdentity. The identity is separated from the hash function implementation because there are times when the hash function implementation can not be preserved; specifically when sending the BloomFilter across a the wire. All we really want to send is the filter and the shape. The shape preserves the standard configuration of the Bloom filter (number of items, number of bits, number of hash functions) as well as the HashFunctionIdentity. I believe this to be the minimal set necessary to determine whether it is valid to compare two Bloom filters. There are other cases where specific properties of the hash function identity are required. For example, I have an implementation of Hasher that uses the first 2 results returned from a CYCLIC hash function to allow it to build Bloom filters with any number of items, number of bits, or number of hash functions. This is similar to the StaticHasher but useful where different sized filters are required to be built from the same object. An equals method on the HashFunction would require the HashFunction object be in some sense Serializable. That is, the object would have to be decomposable so that its identity could be sent across the wire. I would not expect the actual code for for the function to be sent. This would probably require that the listener be a Java based application so that it could have the HashFunction implementation. The HashFunctionIdentity object makes this much easier as the component values are easily serialized. [1] [https://www.eecs.harvard.edu/~michaelm/postscripts/tr-02-05.pdf] [2] [https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/utils/BloomFilter.java#L60] > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, > Usage.md > > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16996909#comment-16996909 ] Gilles Sadowski commented on COLLECTIONS-728: - {quote}I would like to avoid that discussion until the core components are in place. {quote} >From the outset, I have been trying to minimize discussion by suggesting just >that: Limiting the initial commit to the core functionality. If I'm not >mistaken, this is what lead to moving hash implementations from here to >[Codec], among other simplifications of the code. {quote}should not try to implement all hash functions [...] {quote} OK; let's assume that. {quote}provide a framework that identifies when the hash functions are not the same {quote} OK, but what is the simplest way to do that? IMHO, as we cannot hope that the {{HashFunction}} becomes a "standard", we should keep it as simple as possible for application developers. I.e. maybe just make it a "marker" interface (?). Would it be sufficient to have the Javadoc saying that applications using this (Bloom filter) functionality must define an {{equals(Object)}} method whose purpose is to identify when two {{HashFunction}} instances can be considered the same or not? > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, > Usage.md > > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16996742#comment-16996742 ] Claude Warren commented on COLLECTIONS-728: --- I think there may be a nomenclature issue here: Does "same function" mean the identical implementation (same provider) or just the same results (same hash function, Signedness and Process type)? In my mind it is the later. The HashFunction is separate from the HashFunctionIdentity because there are cases where we don't need/want the hashing function implementation we just need to know what it is (e.g. stored in a database). Commons Collections can implement whatever hashes are deemed appropriate. It can implement them with 3rd-party hash algorithms, java core algorithms (e.g MD5), or Collections provided hash algorithms. I attempted the later early in this contribution but it was easier to move that to commons-codec and fix the implementation there than it was to keep that code in the contribution. There is nothing in the contribution that prohibits the development of a registry, it just does not provide such. Adding such would increase the size and would lead us into discussions about Factories, singletons, and registering new producers again. I would like to avoid that discussion until the core components are in place. Again, we can not and should not try to implement all hash functions, nor should we restrict others from using new/experimental hash function with this library. What we should do is provide a framework that identifies when the hash functions are not the same so that appropriate actions can be taken. I believe that is what this contribution does. I have used this contribution is 3 projects with 3 different development teams. We have managed to fit this contributions into the projects even though they have different requirements on the Bloom filters. I have easily retrofitted this contribution into packages that used their own Bloom filter implementations with minimal impact. Overall this contribution works. Am I missing something? > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, > Usage.md > > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16996728#comment-16996728 ] Gilles Sadowski commented on COLLECTIONS-728: - {quote}{{HashFunction}} {quote} I still have the impression that we are carried away from the main contribution by an interface whose contents seems (?) to be defined by "out-of-bands" requirements. It seems (IIUC) that [Collections] will have to create the class that "implements" the interface, as a wrapper to a third-party hash function implementation (because the third party will most probably *not* depend on [Collections]); hence I don't see why not use an enum (or a "registry", as Alex suggested) in order to make the check obvious and simple: The _same_ function must be used whatever its specific "properties" (cyclic, or whatever). > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, > Usage.md > > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16996674#comment-16996674 ] Claude Warren commented on COLLECTIONS-728: --- I have added HashFunctionIdentity to replace the HashFunction name. The identity contains the function name, Signedness, and ProcessType as well as the signature value. It also contains comparator implementations, and a couple of static methods to simplify generating the signature value. > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, > Usage.md > > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16996508#comment-16996508 ] Claude Warren commented on COLLECTIONS-728: --- Yes, I have hashers that depend on the function being cyclic. They currently check the name but this solution would be cleaner. > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, > Usage.md > > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16993591#comment-16993591 ] Gilles Sadowski commented on COLLECTIONS-728: - Would methods {{getName()}}, {{isIterative()}} and {{isUnsigned()}} be used in the code anywhere else other than within {{getSignature()}}? > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, > Usage.md > > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16990888#comment-16990888 ] Claude Warren commented on COLLECTIONS-728: --- The idea of using ID's for each provider/hash function is interesting, but it will require that a group (Apache Commons in the proposal) take the responsibility for registering them. I don't know that Apache wants to take that on. Perhaps there is an interesting middle ground between the proposals from the 3 of us. Change the HashFunction interface shown above to not extend ToLongBiFunction<> but to contain a method `apply( byte[] buffer, int seed)` Also add a method `long getSignature()` that will be implemented as: {noformat} public long getSignature() { byte[] b = new byte[ getName().getBytes("UTF-8") ]; byte[] b2 = new byte[ b.length ]; b2[0] = (byte)((isIterative()?0x1:0) & (isUnsigned()?0x2:0)); System.arrayCopy( b1, 0, b2, 1, b1.length ); return apply( b, 0 ); } {noformat} Now, the value for the function is the function information hashed by the function. A list of function names and flags with correct signature can be provided and easily checked by implementers. The interface could be changed to an abstract class and the getSignature() listed above could be made final and implemented to store the signature in a Long the first time it is calculated. Now the getSignature() can be used to verify that the methods are the same, and the name is still available for user display. The implementation checks can be quickly added to the code base as a property file. Might this be a compromise that works for all of us? I think this means we still need to implement a JCA style factory and register an Apache commons provider as default. > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, > Usage.md > > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16990850#comment-16990850 ] Alex Herbert commented on COLLECTIONS-728: -- {quote}it would make more sense to follow the JCA pattern and have a factory that allowed users to register provider libraries and then request by name, flags, and optional provider. {quote} That would work with the same JVM. How do you register the same provider across VMs? An alternative is to maintain a register in Commons. Each provider is assigned a code (e.g. a 16-bit integer). Anyone who wishes to provide reference implementations can apply for a code from Commons. This then becomes their code that no-one else can use. A set number of code can be reserved for use by Commons to allocate to algorithms that the library (or core Java) provides. The register can be in a properties file included in the jar if necessary but I don't see the need for that. It should be public somewhere, e.g. a page on the website which is updated each release. This idea is akin to the use of fields in a TIFF file. Someone can register a field with Adobe for use in their TIFF reader/writer implementation. They can then put whatever they want in that field. A set number of fields are reserved by the TIFF standard and are not available. This does add a maintenance burden on Commons to respond to people who wish to register hash functions. But how often will that occur? Maybe never. If a personal user wishes to use the library with their own hash function they do not have to register it. Just use an arbitrary code and they are responsible to ensure compatibility across their own code. The Bloom filter library then only need verify two filters are created using the same hash function with a single integer (i.e. the code) comparison. No name checks, no hashing of function names, etc. You could enforce the properties isIterative and isUnsigned are for the first two bits of the code if you so desire. On another note, I think the performance of autoboxing requires that this interface is specialised to a primitive: {code:java} ToLongBiFunction {code} > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, > Usage.md > > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16990840#comment-16990840 ] Claude Warren commented on COLLECTIONS-728: --- The issue of broken hashes is a problem and I admit that the name does not solve that, it does stop an MD5 hash being used with a Murmur128 hash, which was its original purpose. Stronger checks could be enforced, however when working with Bloom filters one is typically doing so in hopes of achieving a performance improvement (e.g. a quick check to filter out unnecessary longer lookups) in these cases performing a hash on every check/insertion is extra overhead that may not be tolerable to the application. As an indication of how sensitive this can be, I have been working on a multidimentional Bloom filter problem. The solution is so time constrained that looking up items in a Trie adds enough overhead that in many cases the multidimentional Bloom filter is no more efficient than a linear search of the equivalent list of Bloom filters. The overhead of an in-depth check (like the code necessary to navigate a Trie) will significantly impact the performance of the Bloom filter. I like the idea. Perhaps there is an additional utility class to be added to do such checks, but I don't think the check belongs in the mainline code. What we need is a quick, easily extensible, mechanism to determine if two hash functions are expected to produce the same result. I proposed a string (name), but am open to other ideas. I am thinking of a more binary encoding comprising a byte of flags and a string hash name and a string for the provider name. Only 2 flags defined (Signed/Unsigned and Cyclic/Iterative). The verification check would ensure that the flags and name are equal. A more in depth check could be implemented by anybody who wanted to restrict to a single implementation, etc. Perhaps it would make more sense to follow the JCA pattern and have a factory that allowed users to register provider libraries and then request by name, flags, and optional provider. The resulting implementation would therefore have to have methods to return the provider, and flag values. Something like: {noformat} public interface HashFunction extends ToLongBiFunction { String getName(); String getProvider(); boolean isIterative(); boolean isUnsigned(); } {noformat} The factory would work like the JCA factories in that the order of the provider determines which implementation of a HashFunction is returned when multiple providers provide the "same function". > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, > Usage.md > > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16990829#comment-16990829 ] Gilles Sadowski commented on COLLECTIONS-728: - {quote}The naming is intended to provide a proxy for the implementation details {quote} That's what I had understood, but I find it a fragile way. Who gives the name? What if two developers implement the same hash algorithm? How can a third party developer be sure that identical names will provide the same behaviour? [ Case in point: the original name of a "broken" implementation (of some hash function) will *not* be "MyBrokenFunction" ;) ]. {quote}so that the user may be warned {quote} Could it be possible to enable a stronger check? E.g. compare the output of applying the hash computation to some known/reference input. {quote}to ensure that we have the same "vision" of this contribution. {quote} Good point. :) I don't know. I'd guess that most (?) codes in "Commons" are building blocks (for application developers) that provide a _complete_ solution to often encountered self-contained issues; broadly speaking, they would be readily usable, by _composition_, rather than requiring an extension to be fully functional. > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, > Usage.md > > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16990798#comment-16990798 ] Claude Warren commented on COLLECTIONS-728: --- Hasher names: The Shape of the Bloom filter is dependent several things. One is that the hashing function is consistent; that is it uses the same techniques to generate the hashed values. Two different implementations of a hash function are are the "same" as long as they generate the same values for the same input. (referred to below as the "same function"). Comparing bloom filters that do not use the same function does not make sense. Thus a mechanism to distinguish between filters with different hashing is desired so that the user may be warned when such an attempt is made. This is where the naming arises. The naming is intended to provide a proxy for the implementation details as well as provide the user with some idea of what hashes were used in order to assist in the resolution of the conflict. You will note that the Shape class equality check verifies that the name is the same for the same reason. Similar to the Java Cryptography Architecture (JCA) two providers may provide implementations of the same hash function. Users assume that the JCA implementation has been vetted and is correctly implemented. In the Bloom filter case an improperly implemented hash is only a serious issue in cross application communication. Note that the code in this contribution does not require the name format be followed, only that different implementations be named differently. I do have a Caching Hasher implementation in another application that requires a cyclic hash function and will fail if the hash name does not match that presented here. Perhaps it makes more sense to have a name an the booleans for cyclinc/iterated and signed/unsigned. But I don't see a way to provide users with the ability to use new, old or broken hash functions and be able to evaluate if the same function is being used without using a name. In my mind Enums are appropriate in two basic conditions: 1) you know all the possible values; or 2) you want to tightly control the acceptable values. Neither of these conditions apply in the case of hash function identification. addendum: In reading back over the previous post I was struck the the use of the term "library". Perhaps this is just a "nomenclature" thing or perhaps it is a "conceptual" thing. I see the Bloom filter contribution as a "framework", a scaffolding on which other developers may hang new implementations and tweeks. In my mind a "library" is generally something that is used without modification or extension. I just wanted to ensure that we have the same "vision" of this contribution. > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, > Usage.md > > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16990667#comment-16990667 ] Gilles Sadowski commented on COLLECTIONS-728: - bq. you know all the possible values. We don't, we can't, and we never will. Isn't your code assuming that every implementation of some algorithm will name it the same way (and that the name will comply with the Javadoc of {{HashFunction}} defined here? Conversely, one can imagine that two implementations could bear the same name, even if one is broken and actually not compatible with the other. bq. we would have to include enums for patently "broken" hashes Yes, if this library purports to support them. However given that in both cases (name or enum), we don't have control on the source code (of the hash algorithm implementation), I wonder whether there should really be a {{verifyHasher}} (i.e. it could simply be the application developer's responsibility to ensure consistency of his use of the {{BloomFilter}} functionality). > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, > Usage.md > > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16990597#comment-16990597 ] Claude Warren commented on COLLECTIONS-728: --- Mutability: I think all issues are fixed with the exception of HasherBloomFilter. HasherBloomFilter extends BloomFilter. In the general case BloomFilter needs to have a merge operation. I have updated the javadoc for the HasherBloomFilter class to indicate that if merges are expected another Bloom filter implementation may be more appropriate. HashFunction: HashFunction extends ToLongBiFunction so it is more than just a name. We could make it more complex with boolean values for unsigned/signed and cyclic/iterative but to what end? I think this class is a simple as it can be but no simpler. An Enum assumes you know all the possible values. We don't, we can't, and we never will. In recent work with the Murmur128 hash in commons-codec it became apparent that there are several implementations of the hash that are "broken" with respect to the original "C" code. However, each of those is in use. If this library were to be used with any of those implementations we would have to include enums for patently "broken" hashes, or they have to have a way to extend the implementation in a way that fits into this system. Adding a name to the hash and not limiting it to an enum works. This is more akin to the Java Cryptography Architecture with out the necessity of registering providers. Though perhaps that is future work. MD5: fixed. > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, > Usage.md > > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16990552#comment-16990552 ] Gilles Sadowski commented on COLLECTIONS-728: - About mutability: * Couldn't {{shape}} be _final_ in {{BloomFilter}}? * Does {{HasherBloomFilter}} really gain anything from being mutable? IOW, what would be lost if the {{merge}} operation would return a new instance? * In class\{{MD5}}, why not make the fields _final_? Even though the instance is not immutable, it simplifies testing and reasoning about the code. Ditto for other similar classes. * In class {{BitSetBloomFilter}}, can't {{bitSet}} be _final_? Can't {{merge}} create a new instance? A lot of methods do indeed {{clone}} that instance field in order to compute a result. About interface {{HashFunction}}: * It is weird (IMHO) to have an interface with such a name where the API just contains a {{getName()}} method. * I reiterate my suggestion of using an enum. Did you have a look at {{RandomSource}} in "Commons RNG"? About class {{MD5}}: * It is awkard that users have to guard against a checked exception because of an *internal* error. > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, > Usage.md > > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16990527#comment-16990527 ] Claude Warren commented on COLLECTIONS-728: --- *API Issue* The Hash function takes a buffer and a seed and returns an integer. In general this is a simple operation. You might use MD5, SHA256, Murmur128_x64 or Murmur128_x86. So the name must specify what hash is used -- I think we can agree on this. The second the calculation may be iterative or cyclic. Iterative is probably the one you are familiar with where the hashing function is called with a different seed each time and that result used. cyclic is when two values are created: call them partA and partB. The first time the function is called partA is returned. for all subsequent calls partB is added to the previous results and returned. This method is documented in the Cassandra codebase and their experimentation has shown that it does not impact the randomness of the subsequent Bloom filter but does significantly speed up processing. Generally this method is used with a 128 bit hash as the result can be considers as 2 longs. Finally, the result of the function will differ depending on whether the numeric values were calculated using unsigned or signed arithmetic. This information needs to be passed down to components that do not keep a reference to the hasher or may not even have the hasher implementation available. When comparing 2 filters you have to know if the "shape" is the same. but the "shape" does not maintain a reference to he hash function. Classes like the static hasher which maintains a list of the bits that were enabled needs to know the shape but does not have a reference to the hash function. These classes need to be able to verify that the function that generated the values was the same. *Implementation Issues* Are you saying that all classes in Collections have to be immutable? Or that the classes noted are not mutable? They are mutable and we had a discussion about this wherein the standard uses of Bloom filters requires that they be mutable or significant overhead is imposed to create new Bloom filters whenever filters are merged (a common operation). > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, > Usage.md > > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16990468#comment-16990468 ] Gilles Sadowski commented on COLLECTIONS-728: - Thanks; we are getting closer... Nit-picks: * Convention for opening braces (on the same line) * Class and field names (should be be complete names – e.g. {{Iter}}, {{md}}) * Javadoc: avoid use of HTML entities. API issue: * Interface {{HashFunction}}: I don't think that [Collections] is going to impose a convention on the naming of hash functions (or I'm missing something about the purpose of the {{getName()}} method). Its use to ensure compatibility between functions seems fragile; the comparison logic should rather be defined by the standard {{equals(Object)}} method, or some other {{Comparator}}. Implementation issues: * Classes are not mutable (e.g. {{BloomFilter}}, {{BitSetBloomFilter}}) * Nested {{Iter}} class (in {{StaticHasher}}) wraps an {{Iterator}} and using the {{PrimitiveIterator.OfInt}} would probably loose a lot of performance despite the impression of the contrary from a user's POV. > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, > Usage.md > > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16985222#comment-16985222 ] Claude Warren commented on COLLECTIONS-728: --- The usage.md was for you to understand the usage of bloom filters. I was not intending to submit it. I will remove it from the submission. as to the Dynamic.Factory, I have another hasher implementation (CachingHasher) that I have not included here (as I am trying to keep this to a minimum) but that Hasher also uses the factory model. CachingHasher has different characteristics from the Dynamic hasher and thus different set of acceptable hashing functions. I don't see a way to create a single Hasher.Factory implementation that will handle both. I think it is best for Dynamic.Factory to be a factory that returns DynamicHashers, and CachingHahser.Factory to be a factory that returns CachingHashers. It is a clear separation. Perhaps it makes sense to remove the Hasher DEFAULT from the Hasher.Factory interface to avoid confusion. I will raise the External dependency on the ML. > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, > Usage.md > > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16985141#comment-16985141 ] Gilles Sadowski commented on COLLECTIONS-728: - I don't think that it is Commons practice to have documentation files under the "src/main" directory. They go somewhere under "src/site" where they are collected to create the web site. It seems that the [Collections] Checkstyle rules are more lenient than for other Commons components... I'm still not convinced about the "dynamic" {{Factory}}. I again suggest that the first PR should only contain the core functionality. The new external dependency is likely to be frowned upon: Issue should be raised on the ML. > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, > Usage.md > > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16970260#comment-16970260 ] Claude Warren commented on COLLECTIONS-728: --- Any more comments on this contribution? > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, > Usage.md > > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16965626#comment-16965626 ] Claude Warren commented on COLLECTIONS-728: --- Updated PR with following notes: I added MurmurHash3 to the RAT exclusion in the POM as It is public domain and I could not figure out how to get RAT to accept it. This is a temporary work-around until a final solution is discovered. I restructured the Hasher / Factory layout and made Hasher the top level interface. It now contains two (2) interfaces Hasher.Factory - defines only listFunctionNames() and useFunction(String) [ the last one probably needs a rename ] Hasher.Builder - defines three (3) with(X) methods and a build() method. Hasher.Factory.useFunction(String) takes the function name and returns a Builder for the Hasher. Hasher.Factory.DEFAULT is a static that contains the default Hasher.Factory implementation. Currently it contains the DynamicHasher.Factory as there is no other. I removed all ByteBuffer and LongBuffers in the interfaces as they cause problems with the Animal Sniffer tool and it just made life easier. [~erans] You were correct the SpanBuffer.contains( Span, Hasher ) and SpanBuffer.merge( Span, Hasher ) did not smell right. I removed the Span arguments from the calls as they are not necessary. The calls are now SpanBuffer.contains( Hasher ) and SpanBuffer.merge( Hasher ) Added a DynamicHasher.Builder that builds the DynamicHasher. Removed the "locked" internal state from DynamicHasher as it is immutable. Added a DynamicHasher.Factory that will create the DynamicHasher.Builder from the well known names. Compromised on the ability to new Hash functions to the DynamicHasher.Factory by implementing a protected register( String, Class ) method to allow developers to create a child class that adds new hashes. I hope that we can remove the MurmurHash3 code and switch to commons-codec hasher code, but the codec murmer3 hash implementation has a bug. I am working on a pull request to fix that. > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, > Usage.md > > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16965284#comment-16965284 ] Claude Warren commented on COLLECTIONS-728: --- I will provide update the PR with the stripped-down version. > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, > Usage.md > > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964906#comment-16964906 ] Gilles Sadowski commented on COLLECTIONS-728: - {quote}The API always requires both arguments. {quote} OK. Sorry I missed that. So, would you provide a stripped-down PR with _only_ the core API ({{BloomFilter}}, {{Hasher}}, {{Shape}}, {{default}} methods for the "proven math"), and a _single_ variant for each concrete functionality (e.g. {{BitsetBloomFilter}} and the "Murmur" algorithm provided by an immutable {{Hasher.Factory}})? Thanks. > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, > Usage.md > > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964893#comment-16964893 ] Claude Warren commented on COLLECTIONS-728: --- Quick note: I somewhat get the rationale until here: === begin quote === True the Hasher could be limited to a StaticHasher and the Shape would not be needed as the Static Hahser contains the Shape it was built with, but in the general case a Hasher does not have a Shape. Where it seems that the API could be improved so as to avoid having to pass null when the "shape" argument is unused. === end quote === The API _always_ requires both arguments. For the static hasher the call would look like `filter.merge( staticHasher.getShape(), staticHasher )` I will comment on other bits later when I have more time. > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, > Usage.md > > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964861#comment-16964861 ] Gilles Sadowski commented on COLLECTIONS-728: - Thanks for explaining again. I somewhat get the rationale until here: {quote}True the Hasher could be limited to a StaticHasher and the Shape would not be needed as the Static Hahser contains the Shape it was built with, but in the general case a Hasher does not have a Shape. {quote} Where it seems that the API could (?) be improved so as to avoid having to pass {{null}} when the "shape" argument is unused. {quote}We will always be behind when it comes to implementing new algorithms using new hash functions. {quote} Just _one_ release behind. People who want new features should be around to provide a patch and help with the release. {quote}It is better to allow the users in the field to add algorithms as they see fit. {quote} I don't think so; it leads to unnecessary duplication (with a possible consequence being the "broken" algorithm case which you mention). {quote}not allowing for addition of others is very limiting. {quote} Not really if there is an "active community" (i.e. people willing to work together towards releasing new features). Once the design is in place, adding variants of existing features should be straightforward; a new hash function could be added at time "t0" and a release candidate could be prepared at "t0" + 72 hours (to allow for review), then published another 3 days later. {quote}they would be unable to add their "broken" hashing to the factory would be an inhibitor. {quote} "Commons" would provide a library of algorithms through its implementation of the {{Hasher.Factory}} interface, but application developers could define their own for a particular purpose (such as working around issues of their own ;)). {quote} why not provide a factory that [...] has the convenience methods to add hash functions? {quote} Because it puts the burden on us to ensure thread-safety (while it is trivially accomplished with immutable instances). > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, > Usage.md > > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964753#comment-16964753 ] Claude Warren commented on COLLECTIONS-728: --- === From the conversation above === Why doesn't Shape keep a reference to Hasher? That would avoid having to pass both to e.g. method contains in BloomFilter (and would make is unnecessary to check their consistency). (see the response to how are StaticHasher and DynamicHasher different for some extra background). The contains method with the Hasher was requested because one of the other Apache projects was interested in checking the presence of the Bloom filter without building a complete Bloom filter. They have the hasher and they know the shape it should be. So the method actually needs the Hasher for the data. Now, they can build the DynamicHasher and pass that, or they could build a StaticHasher (which does know it's shape), or they could implement another Hasher type. Placing the Hasher in the Shape would mean that the shape will always return the same data when what we are doing is trying to send different data with the same shape. Still it doesn't look right, API-wise; but again I'm lacking a clear view of all the requirements to provide an alternative. === end of quote === The methods that have both the Hasher and the Shape are really passing decomposed Bloom filters. The Hasher contains the hash values and the Shape identifies shape of the decomposed Bloom filter. True the Hasher could be limited to a StaticHasher and the Shape would not be needed as the Static Hahser contains the Shape it was built with, but in the general case a Hasher does not have a Shape. The reason for the separation is the Hasher applies a hash function to data and returns int values. The shape indicates to the Hasher how many times it should apply the hashing algorithm (k in the maths) to each data item being hashes (item = the X in DynamicHasher.with(X)). The Shape also give the upper limit for the values the hasher returns on the particular getBits() call. The upper limit is "m" in the maths. So Hasher + Shape = an iteration of integers that are the bits turned on the Bloom filter. A single hasher can generated Bloom filters of an almost infinite number of Shapes. --- snip -- === start of quote === >From KISS and DRY POVs, I disagree with both statements: The factory could be >immutable and readily provide all the useful algorithms (see e.g. >RandomSource). === end of quote === There are two issues with the idea that we can provide all the usefull algorithms. 1) We will always be behind when it comes to implementing new algorithms using new hash functions. It is better to allow the users in the field to add algorithms as they see fit. I remember an instructor one said the the only valid limits in computer science are zero, one and many. Providing an implementation with some but not allowing for addition of others is very limiting. 2) There are known cases (Cassandra comes to mind) where an implementation of a Hashing algorithm was incorrectly executed and released into the wild. If such an application wanted to use this library they would be unable to add their "broken" hashing to the factory would be an inhibitor. So why not provide a factory that " provide all the useful algorithms" and has the convenience methods to add hash functions? Once we add hash functions it seems a short step to resetting the factory to default. At a minimum this would be needed for testing. > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, > Usage.md > > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964093#comment-16964093 ] Gilles Sadowski commented on COLLECTIONS-728: - bq. Yep, this was just a new look at a possible solution. I expect there to be a fair amount of cleanup to be done. The problem (for me at least) is that discussing about the design is tedious when having to read through fairly long files, and especially on GitHub. I can't comment on GitHub at all, and long comments here are not practical either (having to manually "quote", whereas it is automatic in a mail client). That's the reason, I guess, that prior to agreement on a design (for a new functionality), discussions were always held on the "dev" ML. bq. Interface does not support protected methods. In addition using default, which is described in the literature as a mechanism for extending existing interfaces, It isn't solely a "workaround" mechanism; they can be useful by themselves. Quoting from [this presentation|https://www.baeldung.com/java-static-default-methods]: {noformat} The most typical use of default methods in interfaces is to incrementally provide additional functionality to a given type without breaking down the implementing classes. In addition, they can be used to provide additional functionality around an existing abstract method {noformat} bq. to build a new interface doesn't seem right. Also, are there any cases of Bloom filters being anything more than a Bloom filter? For that reason, the default methods would be particularly apt to provide a functionality which all implementations are supposed to need and use. And the added advantage of a {{BloomFilter}} interface is that, in the (even unlikely) case where someone wants to code it from scratch, it is possible that the new class {{implements}} the interface. We could also provide "adapter" code that would delegate to the external implementations which you mentioned. bq. Passing the hasher [...] I feel there is something wrong there, API-wise, but I may be missing something so won't further comment for now. bq. The others are final as they are based on definitions and mathematical proofs from the literature. The final could be removed if desired. Great use for default methods IMHO (cf. above). Yet some hypothetical implementation could want to override them (e.g. if it caches part of the computation). bq. So the method actually needs the Hasher for the data. Still it doesn't look right, API-wise; but again I'm lacking a clear view of all the requirements to provide an alternative. :( bq. Because hashers can be added to the Factory, there needs to be a way to reset them. >From KISS and DRY POVs, I disagree with both statements: The factory could be >immutable and readily provide all the useful algorithms (see e.g. >[RandomSource|http://commons.apache.org/proper/commons-rng/commons-rng-simple/apidocs/org/apache/commons/rng/simple/RandomSource.html]). bq. Makes no difference to me, Hasher and HasherFactory could be defined on their own as well. The difference is in the clarity (and self-documenting nature) of the code. The {{Hasher}} interface is not tied to the particular {{HasherFactory}} concrete class; so the former should not be declared as a nested interface of the latter. Instead, we could have something like: {code} public interface Hasher { PrimitiveIterator.OfInt getBits(BloomFilter.Shape shape); interface Factory { T create(String name); } } {code} > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, > Usage.md > > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16962970#comment-16962970 ] Claude Warren commented on COLLECTIONS-728: --- I originally answered this in an email, but it didn't get posted back here so here are my responses: * Nit: There are formatting issues which I mentioned in other reviews. Yep, this was just a new look at a possible solution. I expect there to be a fair amount of cleanup to be done. * Why an _abstract_ class instead of an _interface_ (and _default_ methods)? Interface does not support protected methods. In addition using _default_, which is described in the literature as a mechanism for extending existing interfaces, to build a new interface doesn't seem right. Also, are there any cases of Bloom filters being anything more than a Bloom filter? * Why pass a {{Hasher}} to the {{Shape}} constructor when only its "name" is used? Passing the hasher ensures that the name is actually bound. It could be just the name. By passing the Hasher there is some check on the validity of the name, but that is not necessary. * Why are methods _final_? It's redundant for _private_ ones, and strange for those name {{estimate...}} (one could imagine that subclasses could have their own way). I missed the private final, you are correct the final should be removed. The others are final as they are based on definitions and mathematical proofs from the literature. The final could be removed if desired. * Why doesn't {{Shape}} keep a reference to {{Hasher}}? That would avoid having to pass both to e.g. method {{contains}} in {{BloomFilter}} (and would make is unnecessary to check their consistency). (see the response to how are StaticHasher and DynamicHasher different for some extra background). The contains method with the Hasher was requested because one of the other Apache projects was interested in checking the presence of the Bloom filter without building a complete Bloom filter. They have the hasher and they know the shape it should be. So the method actually needs the Hasher for the data. Now, they can build the DynamicHasher and pass that, or they could build a StaticHasher (which does know it's shape), or they could implement another Hasher type. Placing the Hasher in the Shape would mean that the shape will always return the same data when what we are doing is trying to send different data with the same shape. * I don't get the {{reset()}} method in {{HasherFactory}}. At first sight, this class should be immutable. Because hashers can be added to the Factory, there needs to be a way to reset them. This is the similar to the problem in Apache Jena with the datatypes TypeMapper where the mapping between external objects and their datatype names is maintained. Items can be added but experience shows they also need to be reset. Particularly when testing. * In {{DynamicHasher}}, it seems that the "lock" construct could be advantageously replaced with a builder (i.e. the returned {{Hasher}} instance would only offer the {{getBits()}} API. Yes, that is a valid and reasonable construct. * It seems strange to define the {{Hasher}} interface inside {{HasherFactory}} (IMO that should be the other way around). Makes no difference to me, Hasher and HasherFactory could be defined on their own as well. * What is the difference between {{StaticHasher}} and {{DynamicHasher}}? First the similarity: Both return integers that are the indexes of bits to be turned on based on the shape argument to getBits( Shape ). DynamicHasher uses a list of ByteBuffers to generate the indexes on the fly. Can generate for any Shape. StaticHasher stores a list of indexes that were generated for a specific Shape. Can only provide indexes for the original Shape and does it by returning an iterator over a stored list of integers. (Much faster) * I'm afraid that the review is not exhaustive; again, I have a hard time figuring out the fundamental (design) structure from convenience features. The fundamental methods that the BloomFilter must have (using the new contains() rather than matches() method) are contains() and merge(). However for disparate implementations to be able to execute the contains() and merge() methods some common representation of the underlying bit storage must be made available. This implementation has 2: getBits() [ long buffer encoded bit pattern ], and getHasher() [ iterator over t the indexes of the bits that are turned on ]. These two solutions allow for performance gains to be made depending on the size of the Bloom filter (number of bits possible - m) and the number of bits that are turned on in the filter (hammingValue or cardinality). Fundamental checking methods are the methods that ensure the shapes of the f
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16962154#comment-16962154 ] Gilles Sadowski commented on COLLECTIONS-728: - bq. implementation (https://github.com/Claudenw/BloomFilter/tree/Hasher) Nit: There are formatting issues which I mentioned in other reviews. * Why an _abstract_ class instead of an _interface_ (and _default_ methods)? * Why pass a {{Hasher}} to the {{Shape}} constructor when only its "name" is used? * Why are methods _final_? It's redundant for _private_ ones, and strange for those name {{estimate...}} (one could imagine that subclasses could have their own way). * Why doesn't {{Shape}} keep a reference to {{Hasher}}? That would avoid having to pass both to e.g. method {{contains}} in {{BloomFilter}} (and would make is unnecessary to check their consistency). * I don't get the {{reset()}} method in {{HasherFactory}}. At first sight, this class should be immutable. * In {{DynamicHasher}}, it seems that the "lock" construct could be advantageously replaced with a builder (i.e. the returned {{Hasher}} instance would only offer the {{getBits()}} API. * It seems strange to define the {{Hasher}} interface inside {{HasherFactory}} (IMO that should be the other way around). * What is the difference between {{StaticHasher}} and {{DynamicHasher}}? I'm afraid that the review is not exhaustive; again, I have a hard time figuring out the fundamental (design) structure from convenience features. > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, > Usage.md > > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16961433#comment-16961433 ] Alex Herbert commented on COLLECTIONS-728: -- {quote}your second match is incorrect. Match is (this & that = this) {quote} OK. I was thinking that this.match(that) == big.match(small). So do you propose to have effectively the following: {code:java} // a bloom filter with fewer bits can match a filter with more but not visa versa. public boolean match(BloomFilter other) { return (this & other) == this; } // a bloom filter with more bits can contain a filter with less but not visa versa. public boolean contains(BloomFilter other) { return (this & other) == other; } {code} This does seem redundant. I would favour dropping one; I'd keep contains() as this is more self documenting. > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, > Usage.md > > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16961283#comment-16961283 ] Claude Warren commented on COLLECTIONS-728: --- Based on the feedback and questions I have created a different implementation (https://github.com/Claudenw/BloomFilter/tree/Hasher) -- note this is the Hasher branch. I posted a description of the implementation on the "[collections] BloomFilter package architecture discussion" topic on the dev mailing list. [~aherbert] your second match is incorrect. Match is (this & that = this) So a bloom filter with fewer bits can match a filter with more but not visa versa. To address this I added a contains() method. Which is what you would want to call with a "decomposed" bloom filter. Perhaps "decomposed" is not a good phrase because in the case at hand it has never been "composed". Claude > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, > Usage.md > > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16960025#comment-16960025 ] Alex Herbert commented on COLLECTIONS-728: -- Following on from a conversation had on GitHub for the initial BloomFilter code contribution PR the example there was a Bloom filter created for 10,000 objects with a false positive probability of 1/5. This requires a Bloom filter of 225,200 bits with 16 bits set per object hash. Thus even though the Bloom filter may require a lot of storage to represent the combination of thousands of objects each object only requires a very small amount of storage. In addition you only need to find a single bit index in the object hash that is not in the Bloom filter to determine that there is no match. Thus the common use case of checking if an object matches the filter is to compute a relatively small set of indices and check each one is in the Bloom filter. This would suggest the following: {code:java} public interface BloomFilter { public interface BitIterator { /** * Get the next set bit, or -1 if no more bits are set. * * @return the bit index (or -1) */ int nextSetBit(); static BitIterator of(int... bits) { return new BitIterator() { int count; @Override public int nextSetBit() { if (count < bits.length) { return bits[count++]; } return -1; } } } } default boolean match(BitIterator bits) { int index = bits.nextSetBit(); if (index == -1) { // Cannot match no indices return false; } while (index != -1) { if (!get(index)) { return false; } } return true; } boolean get(int bitIndex); }{code} Thus the Bloom filter satisfies the following: # A BloomFilter is a representation of the logical OR (combination) of many hashes. # A hash is a specification of indices that are set to 1. # A BloomFilter can answer the question: is bit index _n_ set? # A BloomFilter can answer the question: are all bit indices _N_ set? # A "hasher" is able to convert an object to a set of indices. # A comparison of each index in turn opens the possibility of dynamic computation of each hash index with fail fast matching. To use a BloomFilter would require an implementation of hashing that generates a number of indices for your objects and a concrete representation of the BloomFilter composite of many hashes. This is just an idea and I can see that if constant time access to the method {{get(int)}} is not available then the filter would be slow. It is an alternative and can be combined with the current idea to build an entire BloomFilter for each new object and perform a match of two BloomFilters: {code:java} public interface BloomFilter { public interface BitIterator { /** * Get the next set bit, or -1 if no more bits are set. * * @return the bit index (or -1) */ int nextSetBit(); } boolean match(BloomFilter other); boolean match(BitIterator bits); } {code} > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, > Usage.md > > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16960002#comment-16960002 ] Gilles Sadowski commented on COLLECTIONS-728: - bq. there is nothing in it that allows you to add another filter. The file is not assumed to be complete; the point was to show that all the functionality, except state and states-specific implementations, could be in one (Java 8) interface. bq. I think this can be part of the interface but is not required. It could be moved to a separate class as it just acts as a calculator. Nested class {{Shape}} provides the "boiler-plate" code needed by all implementations. {quote} # Is a BloomFilter defined by its size and set bits? {quote} Probably, but a user should not care much, as long as the API provides instance, and they are used consistently (e.g. merging should only be done among instances produced from the same "builder"). {quote} # ... # Can we match BloomFilters of different lengths? {quote} I'd say "no" (?). At least, it seems like looking for trouble. bq. size and access to set bits should be available within the interface. Maybe, but I wouldn't make it part of the API before a use-case shows that it's absolutely necessary. bq. pad the smaller filter with nothing ? bq. core functionality IIUC, this should only refer to the methods in the {{BloomFilter}} interface. The {{Skeleton}} (a.k.a. {{ProtoBloomFilter}} in Claude's PR) contains the configuration information to be used by implementations of the interface in order to create compatible instances (that can be "merged" and "matched"). > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, > Usage.md > > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16959847#comment-16959847 ] Alex Herbert commented on COLLECTIONS-728: -- Looking at BloomFilter.java class there is nothing in it that allows you to add another filter. This is a major use case for filters that guard collections, if an object is not present then add it to the existing Bloom filter then do something with the object. {code:java} boolean match(BloomFilter other); // Possible for a MutableBloomFilter interface? // Could return true if the filter was changed by the operation. // Or return void. boolean add(BloomFilter other); {code} The Shape class is a strong statement about the classic definition of a Bloom filter. It implements the Bloom equations for m, n, k, p: {noformat} "computing an array of m bits, and for up to n different elements, test k bits using positions chosen using hash functions. If all bits are set, the element probably already exists, with a false positive rate of p." {noformat} I think this can be part of the interface but is not required. It could be moved to a separate class as it just acts as a calculator. It does open the following questions: # Is a BloomFilter defined by its size and set bits? # Can we match BloomFilters of different lengths? I would assume the answer to 1 is perhaps (if a filter is built using hashing functions to choose bits). If there is no other way to build a Bloom filter then the size and access to set bits should be available within the interface: {code:java} int getNumberOfBits(); boolean get(int bitIndex); {code} I assume the answer to 2 is yes. The default would be to pad the smaller filter with nothing (i.e. zeros). I see the core functionality as: # Matching another filter # Adding another filter # Optionally defining key properties of the filter (the size and the set bits) Separation of filter functionality and implementation would apply to the Hash and Skeleton classes since the way to build a Bloom filter is separate from the filter functionality. > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, > Usage.md > > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955217#comment-16955217 ] Gilles Sadowski commented on COLLECTIONS-728: - Assuming {{BloomFilterI2}}, what would be the next step if we want (?) a class where the state is provided by a {{BitSet}}? The following I suppose: {code:java} public BitsetBloomFilter implements BloomFilterI2 { private final BitSet state; // Constructor(s), builder, and so on. // Define methods for which an implementation is required, but // nothing else, assuming that reuse is good. } {code} Now, making a wild ;) guess, I'd have a hard time to convince myself that this implementation will be efficient given the generic passing around of data, with all the boxing and unboxing, insead of operating directly on the {{state}}. And, if obviously (?), we'd override the {{default}} methods, why the bloat? Moreover, as said in the previous comment, the interface forces all implementations to provide a method ({{stream()}} with a semantic that might not fit its underlying structure, or even that they would not need at all (say, in order to provide the {{hammingDistance}} property). This is not to say that the code in {{BloomFilterI2}} is never useful (I don't know) but its place would seem to be in an abstract class, that would e.g. ease prototyping of alternative implementations and test their correctness. In any case, even if it were actually heading in the right and I fail to see it for some reason, it would still be a refinement that could be left for once we agree on the "bare-bone" functionality. > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, > Usage.md > > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955205#comment-16955205 ] Gilles Sadowski commented on COLLECTIONS-728: - {quote}IntStream stream(); // a stream of integers that are the indexes of the enabled bits. {quote} As already noted, I don't have the practical knowledge to know whether this is required for _all_ use-cases of the Bloom filter functionality. However, from a design POV, this is a kludge (IMHO) because, as I've tried to convey on the ML, it conflates the concept ("determines, with some probability of false positive, whether something exists") with how it is implemented (a specific, and potentially inefficient, representation of the underlying structure that provides one way to obtain the requested result). {quote}I have added BloomFilterI2 {quote} A particular example strikes me as the very illustration of what I think is the confusion between "BloomFilter" and "set of bits": {quote} {code:java} default int hammingValue() { return cardinality(); } default int hammingDistance( BloomFilterI2 other ) { return xorCardinality( other ); } {code} {quote} IIUC, {{hammingDistance}} is a core property of the Bloom filters; whereas {{xorCardinality}} is how it can be provided, given some assumption (i.e. "BitSetI") of the underlying representation. The latter is, in OO terminology, an implementation detail that should not percolate to the public API. Moreover, in this case, the methods {{hammingValue}} and {{hammingDistance}} are clearly redundant, and one would readily ask: Why don't we just drop them? > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, > Usage.md > > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955201#comment-16955201 ] Claude Warren commented on COLLECTIONS-728: --- BF_Func.md is a list of bloom filter functions that should be supported and what calculations they require. I have added BloomFilterI2 (to distinguish it from BloomFilter) that is an interface that implements the methods via default implementations but leaves 2 methods to be implemented. IntStream stream(); // a stream of integers that are the indexes of the enabled bits. BloomFilterI2 merge (BloomFilterI2 other ); // a Bloom filter constructed by ORing the other bloom filter with this. > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, > Usage.md > > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16946214#comment-16946214 ] Claude Warren commented on COLLECTIONS-728: --- Looking at the Log, distance, and some of the other functions perhaps it makes more sense for the CountingBloomFilter not to extend StandardBloomFilter and to move the statistical methods into a class of static methods. This would remove some overhead from the CountingBloomFilter. > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16946183#comment-16946183 ] Claude Warren commented on COLLECTIONS-728: --- BloomCollectionStatistics should be called BloomFilterGatedStatistics as that is where the requirement for the method getStats() comes from. I will change that. I have changed the BloomCollectionConfiguration from using a StandardBloomFilter for the gate to returning a BloomFilter for the gate. (Good catch, thank you) BloomCollectionConfiguration should be called BloomFilterGatedConfiguration as it applies to the BloomFilterGated interface, not specifically to the collection. I will change that. > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16945857#comment-16945857 ] Gilles Sadowski commented on COLLECTIONS-728: - bq. other implementations of BloomFilter. Would those reuse none of the current implementation ("StandardBloomFilter")? Is "BloomFilter" an abstract _concept_ or is it tied to a reference implementation? IOW, would all the expected contributions be a subclass (with overridden methods)? E.g. if a "BloomFilter" always comes with a notion of "getLog", then a _protected_ "getApproximateLog" would allow customizing it. You should really post (one post per such issue) to the "dev" ML to collect more opinions on whether an _interface_ is the preferred approach (based on the answers to the above questions). bq. build() methods: They are not "necessary". My point. ;-) Ditto: preferences should be asked on the "dev" ML. IMHO, the only (?) advantage (saving 2 keystrokes) is not worth almost doubling that class' API. bq. BloomCollection is not just an example. OK. Still, it is not clear to me whether it's worth creating a "collection" subpackage (again, a question for the "dev" ML). Can the classes in that package be used independently of {{BloomCollection}}? E.g. just by their name * BloomCollectionConfiguration * BloomCollectionStatistics it seems that the above classes could (?) be static nested classes. Also I suspect there is a relationship with the "interface" issue: In {{BloomCollectionConfiguration}} has a {{gate}} initialized with a {{StandardBloomFilter}} instance; thus, this higher-level utility assumes more than the {{BloomFilter}} API. At first sight, this doesn't look right... > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16945717#comment-16945717 ] Claude Warren commented on COLLECTIONS-728: --- I am working to clean up the issues that [~erans] raised with the following notes: The BloomCollection can be used. It works as described but it is limited to wrapping Java Collection implementations. I expect that this will be sufficient for many implementations but for very large scale implementations it may be desirable to have the BloomFilterGated implement on top of a larger datastore. So the BloomCollection is not _just_ an example. I can for see other implementations of BloomFilter. This contribution contains 2 (though Counting is derived from Standard) there are a number of other implementations identified in https://en.wikipedia.org/wiki/Bloom_filter . I would expect to see some of those contributed in the future. With respect to duplicate build() methods: They are not "necessary". They are convenience methods akin to the Java MessageDigest update() and digest() methods. No, all the digest() methods in MessageDigest are not necessary but they are convenient. Yes, MessageDigest could have been implemented with only the no argument digest() method. The same can be said for the build() methods of the ProtoBloomFilter.Builder. The use cases for both are much the same: Having a single method to build a digest/ProtoBloomFilter from a single data object (e.g. messageDigest..digest( "foo" ) or builder.build("foo")) > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16942393#comment-16942393 ] Gilles Sadowski commented on COLLECTIONS-728: - Thanks for the removal of {{Serializable}}. :) Class {{BloomCollectionStatistics}} * enum {{CHANGE}} should be {{Change}} (and the _elements_ must be in uppercase). * (nit) Javadoc: ** Line lengths in Javadoc seem a bit random. ** Ditto for alignment. ** Inconsistent use of "{@code ...}" (and I would avoid adding an "s" after the closing brace). * Avoid use of "l" as a variable name. * Method {{asInt(long)}} would be much clearer using if/else. Also, might need a better name ({{clampToInteger}} (?)) Class {{BloomFilterConfiguration}}: * Avoid use of primitive wrapper classes that entail unnecessary boxing/unboxing. * Remove the multiplication by "1.0" (and unnecessary parentheses) * (nit) Wrap long lines. Some of the issues mentioned earlier: * Add "final" for constants. * Remove duplicate "build" methods (or please give an example of why they are necessary). * (nit) Typos in Javadoc and trailing spaces (cf "git diff"). Is {{BloomCollection}} just an example (as stated in the Javadoc) or is it for actual use? Do you foresee other implementations of the {{BloomFilter}} interface? If not, we could rename {{StandardBloomFilter}} to {{BloomFilter}}... > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution
[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16942269#comment-16942269 ] Claude Warren commented on COLLECTIONS-728: --- Updated pull request to include latest common-collections4 code. e.g. did a git pull from apache repository to sync with latest changes. > BloomFilter contribution > > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task >Reporter: Claude Warren >Priority: Minor > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)