[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

Claude Warren (Jira) Wed, 30 Oct 2019 05:10:56 -0700


    [ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16962970#comment-16962970
 ]


Claude Warren commented on COLLECTIONS-728:
-------------------------------------------

I originally answered this in an email, but it didn't get posted back here so 
here are my responses:
    

  * Nit: There are formatting issues which I mentioned in other reviews.

 
Yep, this was just a new look at a possible solution.  I expect there to be a 
fair amount of cleanup to be done.


    * Why an _abstract_ class instead of an _interface_ (and _default_ methods)?


Interface does not support protected methods.  In addition using _default_, 
which is described in the literature as a mechanism for extending existing 
interfaces, to build a new interface doesn't seem right.  Also, are there any 
cases of Bloom filters being anything more than a Bloom filter?
 

    * Why pass a {{Hasher}} to the {{Shape}} constructor when only its "name" 
is used?


Passing the hasher ensures that the name is actually bound.  It could be just 
the name.  By passing the Hasher there is some check on the validity of the 
name, but that is not necessary.
 

    * Why are methods _final_?  It's redundant for _private_ ones, and strange 
for those name {{estimate...}} (one could imagine that subclasses could have 
their own way).


I missed the private final, you are correct the final should be removed.

The others are final as they are based on definitions and mathematical proofs 
from the literature.  The final could be removed if desired. 

 

    * Why doesn't {{Shape}} keep a reference to {{Hasher}}?  That would avoid 
having to pass both to e.g. method {{contains}} in {{BloomFilter}} (and would 
make is unnecessary to check their consistency).


(see the response to how are StaticHasher and DynamicHasher different for some 
extra background).
The contains method with the Hasher was requested because one of the other 
Apache projects was interested in checking the presence of the Bloom filter 
without building a complete Bloom filter.  They have the hasher and they know 
the shape it should be.  So the method actually needs the Hasher for the data.

Now, they can build the DynamicHasher and pass that, or they could build a 
StaticHasher (which does know it's shape), or they could implement another 
Hasher type. 

Placing the Hasher in the Shape would mean that the shape will always return 
the same data when what we are doing is trying to send different data with the 
same shape.

 

    * I don't get the {{reset()}} method in {{HasherFactory}}.  At first sight, 
this class should be immutable.


Because hashers can be added to the Factory, there needs to be a way to reset 
them.  This is the similar to the problem in Apache Jena with  the datatypes 
TypeMapper where the mapping between external objects and their datatype names 
is maintained.  Items can be added but experience shows they also need to be 
reset.  Particularly when testing.

    * In {{DynamicHasher}}, it seems that the "lock" construct could be 
advantageously replaced with a builder (i.e. the returned {{Hasher}} instance 
would only offer the {{getBits()}} API.


Yes, that is a valid and reasonable construct.
 

    * It seems strange to define the {{Hasher}} interface inside 
{{HasherFactory}} (IMO that should be the other way around).


Makes no difference to me, Hasher and HasherFactory could be defined on their 
own as well.
 

    * What is the difference between {{StaticHasher}} and {{DynamicHasher}}?


First the similarity:  Both return integers that are the indexes of bits to be 
turned on based on the shape argument to getBits( Shape ).

DynamicHasher uses a list of ByteBuffers to generate the indexes on the fly.  
Can generate for any Shape.

StaticHasher stores a list of indexes that were generated for a specific Shape. 
 Can only provide indexes for the original Shape and does it by returning an 
iterator over a stored list of integers.  (Much faster)


    *  I'm afraid that the review is not exhaustive; again, I have a hard time 
figuring out the fundamental (design) structure from convenience features. 

The fundamental methods that the BloomFilter must have (using the new 
contains() rather than matches() method) are contains() and merge().
However for disparate implementations to be able to execute the contains() and 
merge() methods some common representation of the underlying bit storage must 
be made available.  This implementation has 2: getBits() [ long buffer encoded 
bit pattern ], and getHasher() [ iterator over t the indexes of the bits that 
are turned on ].  These two solutions allow for performance gains to be made 
depending on the size of the Bloom filter (number of bits possible - m) and the 
number of bits that are turned on in the filter (hammingValue or cardinality).

Fundamental checking methods are the methods that ensure the shapes of the 
filter being compared or merged  are the same.  Thus the getShape() and the 
protected verifyX() methods.

All other methods are "convenience" methods, though are generally expected in 
any developed Bloom filter implementation.  Implementations are provided based 
on the fundamental methods getBits() and/or getHasher().


 (new comment -- not in response to the above.)

Another reviewer suggested getting rid of matches() and only have contains() as 
it is much more descriptive and when you have both Bloom filters (as required 
for matches)  it is possible to call contains() instead (e.g. A.matches(B) is 
the same as B.contains( A )).  This sounds reasonable to me.



> BloomFilter contribution
> ------------------------
>
>                 Key: COLLECTIONS-728
>                 URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
>             Project: Commons Collections
>          Issue Type: Task
>            Reporter: Claude Warren
>            Priority: Minor
>         Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

Reply via email to