[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964753#comment-16964753
 ] 

Claude Warren commented on COLLECTIONS-728:
-------------------------------------------

=== From the conversation above ===

 Why doesn't Shape keep a reference to Hasher? That would avoid having to pass 
both to e.g. method contains in BloomFilter (and would make is unnecessary to 
check their consistency).

 (see the response to how are StaticHasher and DynamicHasher different for some 
extra background).
 The contains method with the Hasher was requested because one of the other 
Apache projects was interested in checking the presence of the Bloom filter 
without building a complete Bloom filter. They have the hasher and they know 
the shape it should be. So the method actually needs the Hasher for the data.

Now, they can build the DynamicHasher and pass that, or they could build a 
StaticHasher (which does know it's shape), or they could implement another 
Hasher type.

Placing the Hasher in the Shape would mean that the shape will always return 
the same data when what we are doing is trying to send different data with the 
same shape.

Still it doesn't look right, API-wise; but again I'm lacking a clear view of 
all the requirements to provide an alternative. 

=== end of quote ===

The methods that have both the Hasher and the Shape are really passing 
decomposed Bloom filters.  The Hasher contains the hash values and the Shape 
identifies shape of the decomposed Bloom filter.   True the Hasher could be 
limited to a StaticHasher and the Shape would not be needed as the Static 
Hahser contains the Shape it was built with, but in the general case a Hasher 
does not have a Shape. 

The reason for the separation is the Hasher applies a hash function to data and 
returns int values.
The shape indicates to the Hasher how many times it should apply the hashing 
algorithm (k in the maths) to each data item being hashes (item = the X in 
DynamicHasher.with(X)).  The Shape also give the upper limit for the values the 
hasher returns on the particular getBits() call.  The upper limit is "m" in the 
maths.

So Hasher + Shape = an iteration of integers that are the bits turned on the 
Bloom filter.

A single hasher can generated Bloom filters of an almost infinite number of 
Shapes.

--- snip --

=== start of quote === 

>From KISS and DRY POVs, I disagree with both statements: The factory could be 
>immutable and readily provide all the useful algorithms (see e.g. 
>RandomSource).

=== end of quote ===

There are two issues with the idea that we can provide all the usefull 
algorithms.

1) We will always be behind when it comes to implementing new algorithms using 
new hash functions.  It is better to allow the users in the field to add 
algorithms as they see fit.  I remember an instructor one said the the only 
valid limits in computer science are zero, one and many.  Providing an 
implementation with some but not allowing for addition of others is very 
limiting.

2) There are known cases (Cassandra comes to mind) where an implementation of a 
Hashing algorithm was incorrectly executed and released into the wild.  If such 
an application wanted to use this library they would be unable to add their 
"broken" hashing to the factory would be an inhibitor.

So why not provide a factory that " provide all the useful algorithms" and has 
the convenience methods to add hash functions?

Once we add hash functions it seems a short step to resetting the factory to 
default.  At a minimum this would be needed for testing.









> BloomFilter contribution
> ------------------------
>
>                 Key: COLLECTIONS-728
>                 URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
>             Project: Commons Collections
>          Issue Type: Task
>            Reporter: Claude Warren
>            Priority: Minor
>         Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to