[ https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16962970#comment-16962970 ]
Claude Warren commented on COLLECTIONS-728: ------------------------------------------- I originally answered this in an email, but it didn't get posted back here so here are my responses: * Nit: There are formatting issues which I mentioned in other reviews. Yep, this was just a new look at a possible solution. I expect there to be a fair amount of cleanup to be done. * Why an _abstract_ class instead of an _interface_ (and _default_ methods)? Interface does not support protected methods. In addition using _default_, which is described in the literature as a mechanism for extending existing interfaces, to build a new interface doesn't seem right. Also, are there any cases of Bloom filters being anything more than a Bloom filter? * Why pass a {{Hasher}} to the {{Shape}} constructor when only its "name" is used? Passing the hasher ensures that the name is actually bound. It could be just the name. By passing the Hasher there is some check on the validity of the name, but that is not necessary. * Why are methods _final_? It's redundant for _private_ ones, and strange for those name {{estimate...}} (one could imagine that subclasses could have their own way). I missed the private final, you are correct the final should be removed. The others are final as they are based on definitions and mathematical proofs from the literature. The final could be removed if desired. * Why doesn't {{Shape}} keep a reference to {{Hasher}}? That would avoid having to pass both to e.g. method {{contains}} in {{BloomFilter}} (and would make is unnecessary to check their consistency). (see the response to how are StaticHasher and DynamicHasher different for some extra background). The contains method with the Hasher was requested because one of the other Apache projects was interested in checking the presence of the Bloom filter without building a complete Bloom filter. They have the hasher and they know the shape it should be. So the method actually needs the Hasher for the data. Now, they can build the DynamicHasher and pass that, or they could build a StaticHasher (which does know it's shape), or they could implement another Hasher type. Placing the Hasher in the Shape would mean that the shape will always return the same data when what we are doing is trying to send different data with the same shape. * I don't get the {{reset()}} method in {{HasherFactory}}. At first sight, this class should be immutable. Because hashers can be added to the Factory, there needs to be a way to reset them. This is the similar to the problem in Apache Jena with the datatypes TypeMapper where the mapping between external objects and their datatype names is maintained. Items can be added but experience shows they also need to be reset. Particularly when testing. * In {{DynamicHasher}}, it seems that the "lock" construct could be advantageously replaced with a builder (i.e. the returned {{Hasher}} instance would only offer the {{getBits()}} API. Yes, that is a valid and reasonable construct. * It seems strange to define the {{Hasher}} interface inside {{HasherFactory}} (IMO that should be the other way around). Makes no difference to me, Hasher and HasherFactory could be defined on their own as well. * What is the difference between {{StaticHasher}} and {{DynamicHasher}}? First the similarity: Both return integers that are the indexes of bits to be turned on based on the shape argument to getBits( Shape ). DynamicHasher uses a list of ByteBuffers to generate the indexes on the fly. Can generate for any Shape. StaticHasher stores a list of indexes that were generated for a specific Shape. Can only provide indexes for the original Shape and does it by returning an iterator over a stored list of integers. (Much faster) * I'm afraid that the review is not exhaustive; again, I have a hard time figuring out the fundamental (design) structure from convenience features. The fundamental methods that the BloomFilter must have (using the new contains() rather than matches() method) are contains() and merge(). However for disparate implementations to be able to execute the contains() and merge() methods some common representation of the underlying bit storage must be made available. This implementation has 2: getBits() [ long buffer encoded bit pattern ], and getHasher() [ iterator over t the indexes of the bits that are turned on ]. These two solutions allow for performance gains to be made depending on the size of the Bloom filter (number of bits possible - m) and the number of bits that are turned on in the filter (hammingValue or cardinality). Fundamental checking methods are the methods that ensure the shapes of the filter being compared or merged are the same. Thus the getShape() and the protected verifyX() methods. All other methods are "convenience" methods, though are generally expected in any developed Bloom filter implementation. Implementations are provided based on the fundamental methods getBits() and/or getHasher(). (new comment -- not in response to the above.) Another reviewer suggested getting rid of matches() and only have contains() as it is much more descriptive and when you have both Bloom filters (as required for matches) it is possible to call contains() instead (e.g. A.matches(B) is the same as B.contains( A )). This sounds reasonable to me. > BloomFilter contribution > ------------------------ > > Key: COLLECTIONS-728 > URL: https://issues.apache.org/jira/browse/COLLECTIONS-728 > Project: Commons Collections > Issue Type: Task > Reporter: Claude Warren > Priority: Minor > Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, > Usage.md > > > Contribution of BloomFilter library comprising base implementation and gated > collections. -- This message was sent by Atlassian Jira (v8.3.4#803005)