subject:"\"\\\\\\\[jira\\\\\\\] \\\\\\\[Commented\\\\\\\] \\\\\\\(COLLECTIONS\\\\\\\-728\\\\\\\) BloomFilter contribution\""

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-12-26 Thread Claude Warren (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17003765#comment-17003765
 ] 

Claude Warren commented on COLLECTIONS-728:
---

There may be some disconnect here.  The HashFunction interface is not intended 
to generically describe hash functions, but is intended to provide the 
information necessary for Bloom filter usage of a Hash function.

I am assuming that in your example code  the HashFunction does not extend 
HashFunctionIdentity and that, in fact, HashFunctionIdenty is not available.  
The issue in this case is that without being able to check how the underlying 
hash function operates all you can check is that the equals() operator works 
for the instances of the class.  Currently the `check` method you list is 
equivalent to f.getShape().equals( g.getShape() ). (e.g if the Shapes are the 
same the filters are comparable). Note: the Shape.equals() method currently 
uses the HashFunctionIdentity to  verify the functions are the same.  The 
`isUsing` method is effectively HashFunctionIdentity.COMMON_COMPARATOR.compare( 
f.getShape().getHashFunctionIdentity(), h ) == 0.  Note: the current 
HashFunction extends HashFunctionIdentity.

 

 

> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-12-26 Thread Gilles Sadowski (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17003600#comment-17003600
 ] 

Gilles Sadowski commented on COLLECTIONS-728:
-

{quote}we may have to agree to disagree on this one.
{quote}
Maybe, maybe not.
 I'm not even there yet: My main problem is that (IIUC) this discussion (about 
how to distinguish hash functions and how to check for Bloom filters 
"compatibility"[1]) should be be postponed to _after_ committing the "main" (as 
I see it, from my incomplete POV) functionality. Maybe we don't agree about 
what should be "main" (or core).
 If that's the case, please post to the "dev" ML; explain why we are stuck, and 
that additional reviewers are needed.

[1] One could construe that such a check could be provided "externally", e.g. 
(random thought):
{code:java}
public class BloomCompatibilityChecker {
/** Checks whethe the filters are comparable. */
public static boolean check(BloomFilter f, BloomFilter g) {
return ...;
}
/** Checks whether the filter is using some specific hash function. */
public static boolean isUsing(BloomFilter f, HashFunction h) {
 return ...;
}
}
{code}

> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-12-26 Thread Claude Warren (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17003589#comment-17003589
 ] 

Claude Warren commented on COLLECTIONS-728:
---

I  think we may have to agree to disagree on this one.

I can think of no way to compare hash function properties and only a 
same/not-same comparison possible using the equals() method.  Given that there 
might be 2 implementations that yield identical results this is not an optimal 
solution.  The decomposition of the properties into an interface makes it 
possible to identify such.

Caching hasher is not part of this code base yet it builds upon it.  It was 
built on the opportunity created by this code base as a natural upshot of being 
able to identify the properties of the hash function.  If you wish I will 
contribute CachingHasher as part of this code.
{quote}Would those methods that describe hash function's properties be called 
from inside this code?
{quote}
They already are using inside this code.  It is part of determining if a Shape 
is the same for different filters.

 

 

> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-12-22 Thread Gilles Sadowski (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002067#comment-17002067
 ] 

Gilles Sadowski commented on COLLECTIONS-728:
-

bq. how do you propose to move the hash function [...]

In the code which you propose for inclusion in [Collections], nothing requires 
serialization (hence the reluctance to impose {{Serializable}}).  If some 
application requires serialization, it must create an instance of 
{{HashFunction}} that can be serialized.

bq.  What do you mean by "external code"

You referred to "CachingHasher (from another project)": Another project != this 
project.
"CachingHasher" is not part of this code base, yet it dictates an API.  I'd be 
more comfortable starting with an internally consistent and minimal API.  Then 
if a use case comes up (like "CachingHasher"), requests for enhancement can be 
considered.

bq. The isIterative() question is information about how the hash is constructed

In my (possibly short-sighted) view, [Collections] is the not the place to 
define an API for describing hash functions.

Would those methods that describe hash function's properties be called from 
inside this code?

> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-12-22 Thread Claude Warren (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17001911#comment-17001911
 ] 

Claude Warren commented on COLLECTIONS-728:
---

I think I am confused.  Without resorting to Serialized interface how do you 
propose to move the hash function from one system to another?

What do you mean by: "So if in practice, there is a problem, my opinion is that 
the design must be modified without resorting to external code."?  What do you 
mean by "external code" in this statement?

What do you mean here: "I've asked the same question about methods like 
{{isIterative()}}, that also obviously only make sense in a Java environment."  
The isIterative() question is information about how the hash is constructed.  
That construction is not Java only.  The example provided was from Cassandra 
and thus in Java, but the Kirsh and Mitzenmacher paper describes the process in 
mathematical terms and thus is applicable to any programming language.

Similarly the Signed/Unsigned interpretation of the buffer makes a difference 
and is not a Java only implementation detail.

 

 

> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-12-22 Thread Gilles Sadowski (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17001889#comment-17001889
 ] 

Gilles Sadowski commented on COLLECTIONS-728:
-

bq. CachingHasher (from another project) is not possible.

Potentially irrelevant.
Furthermore, I don't get how in principle providing _more_ information (i.e. 
the {{HashFunction}} instance itself rather than some of its "properties") can 
be an issue.
So if in practice, there is a problem, my opinion is that the design must be 
modified without resorting to external code.

bq. communicating with systems that are not java based

Potentially irrelevant.
I've asked the same question about methods like {{isIterative()}}, that also 
obviously only make sense in a Java environment.

We had already noticed those non-converging opinions; it seems that we are 
repeating arguments.  It could be my fault, in case I've missed key points in 
your descriptions.

All I think I understand is that application developers need a check that 
different parts of a distributed system (both Java and non-Java) use the same 
hash function.  IIUC, a crude check is OK for you (like {{String}} equality).  
But without indications on the capabilities of the non-Java parts of such a 
system (e.g. how do they represent {{isIterative()}}?), I'm wary to clutter the 
Java API, where the check could actually be made robust through standard Java 
practice (e.g. using {{equals(Object)}}).

The simplest way out is Alex's proposal of a "registry", as a plain list of 
numbers (and a textual description of the associated "properties").

Even so, I'm maintaining that this is a second-order discussion, as the main 
Bloom filter functionality could be fully coded without this fool-proof 
checking of the hash function: Indeed, if one of the requirements of the 
distributed system is that the hash function is the same, it could simply be 
assumed.  Sure, it's not robust; but neither is any of the proposed 
alternatives.

I understand that you don't agree with this incremental approach; hence the 
only way out is to post to the "dev" ML, asking for more knowledgeable (or less 
picky ;)) people to have a look at your PR.
I really hope that some solution can be found because it would be nice that 
Commons can host this Bloom filter implementation.

> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-12-22 Thread Claude Warren (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17001847#comment-17001847
 ] 

Claude Warren commented on COLLECTIONS-728:
---

I don't think that isCompataible() as a single function works. 

1) Without access to the cyclic/iterative setting the CachingHasher (from 
another project) is not possible. 

2) When communicating with systems that are not java based how does the 
isCompatible() and function serialization work?

I truly believe the best mechanism is to identify the differences between 
implementations and then provide properties for those differences.  This will 
allow the identity of the implementation to be serialized to other systems as 
needed, without the need for Java Serializable interface or in fact Java at all 
on the remote system.

 

> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-12-20 Thread Gilles Sadowski (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17001053#comment-17001053
 ] 

Gilles Sadowski commented on COLLECTIONS-728:
-

{quote}When Shapes or BloomFilters that were created with MyHash are serialized 
by the application (application responsibility) the HashFunctionIdentity will 
be serialized. The application deserializing the HashFunctionIdentity will 
either need to have the MyHash class or another implementation of the same 
function.
{quote}
Then, the question is: What does the Bloom filter functionality (that 
constitutes your code contribution to [Collections]) need beyond a 
{{HashFunction}} interface that defines a method {{isCompatible(HashFunction 
other)}}?
 In effect, my issue is still the same (I think) as several comments ago: How 
to ensure that some instance of {{HashFunctionIdentity}} will *really* be 
compatible with another (whether within the same JVM or after serialization)? 
It seems to me that whatever check is done in a "composed" way with 
{{isIterative()}}, {{isUnsigned()}}, {{getName()}}, ... can be achieved in a 
simple (and more robust, at the implementor's discretion) way with 
{{isCompatible(HashFunction)}}.
 Furthermore, in order to call {{isIterative()}}, etc. you obviously need a JVM 
at the other end of the serialization/deserialization pipeline.
 Now, if distributed processing is an integral feature (?) of using the 
{{BloomFilter}} class, perhaps that we must define as
{code:java}
interface HashFunction extends Serializable {
// ...
}
{code}
Note that what Sebb and others pointed out is that o.a. things, "it's difficult 
to code Serializable classes properly" but the rest of the quote ("We should 
not promise what we likely cannot continue to deliver") does not apply here, 
because the burden will be on implementors. And _they_ will choose how robust 
and future-proof their handling of {{Serializable}} needs to be (for their 
purpose).
 However, is it actually true that {{BloomFilter}} usage only makes sense in 
distributed applications? If not, I'd be wary of the above definition (since it 
will impose an unnecessary burden, or could provide a false sense of security 
if {{Serializable}} is not properly implemented).

> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-12-20 Thread Claude Warren (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17001015#comment-17001015
 ] 

Claude Warren commented on COLLECTIONS-728:
---

The hash computation is used in the Hasher class.  It is generally only used 
when creating Bloom filters.  In some cases it is used when determining if an 
object is in the filter.

I originally implemented Serializable in several objects but that was removed 
early on.  As Sebb pointed out there are security issues with classes that 
implement Serializable[1].

Your description of how to add a hash function is basically correct.  However, 
I think there is a misunderstanding.

The `HashFunctionIdentity` provides the information necessary it identify the 
function.  it is used in the `Shape` class which in turn is used in the 
`BloomFilter` class.

The `HashFunction` interface is much as you suggest except that `HashFunction` 
extends `HashFunctionIdentity`.  So a user wanting to add a new hash function 
would create a HashFunction instance (MyHash) that identifies the function and 
contains a call to that function via the `compute()` method.

When Shapes or BloomFilters that were created with MyHash are serialized by the 
application (application responsibility) the HashFunctionIdentity will be 
serialized.  The application deserializing the HashFunctionIdentity will either 
need to have the MyHash class or another implementation of the same function.

I hope this helps.

 

[1] 
[https://lists.apache.org/thread.html/7a0bb63540e52b97f08c608f11c2fd9dd5b770c6f21cdc04a8c29e6d%40%3Cdev.commons.apache.org%3E]

> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-12-20 Thread Gilles Sadowski (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17000951#comment-17000951
 ] 

Gilles Sadowski commented on COLLECTIONS-728:
-

{quote}properties are now tracked in the HashFunctionIdentity.
{quote}
Perhaps {{HashFunctionProperties}} or {{HashFunctionSpecification}} would be a 
better name (?).
{quote}An equals method on the HashFunction would require the HashFunction 
object be in some sense Serializable.
{quote}
IIUC correctly (?), a hash function can be implemented by a third party (not 
abiding by any of APIs to be defined here).
 Then to be usable within the {{BloomFilter}} framework defined here, an 
application developer will need
 * to wrap the function's "properties" in a class (say, 
{{MyHashFunctionIdentity}}) that implements {{HashFunctionIdentity}}
 * to define a serialization scheme for {{MyHashFunctionIndentity}} (if the 
application is distributed).

If so, why not have a {{MyHashFunction}} class that wraps the full third-party 
hash function rather then just its "properties"?
 We could have, in [Collections]
{code:java}
/**
 * Interface for implementations used within this framework.
 */
public interface HashFunction {
/**
 * Returns {@code true} when from a given input, this instance
 * computes the same output as the {@code other} instance.
 */
boolean isCompatible(HashFunction other);

/**
 * Computes the hash value.
 *
 * @param input Input.
 * @param seed Seed.
 * @return the hash.
 */
long compute(byte[] input, int seed);
}
{code}
As simple as can be (?).

And, the application developer would responsible for implementing the notion of 
"compatibility" (in the same way that he is responsible for computing the hash 
value, including issues arising from using a buggy function):
{code:java}
import java.io.Serializable;
import com.thirdparty.hash.NiceFunction;
import org.apache.commons.collections.bloomfilter.HashFunction;

public class MyHash implements HashFunction, Serializable {
private static final byte[] COMP_TEST_A = new byte[] {- 19, 45, -34, 65, 1, 
22, 17, 74};
private static final int COMP_TEST_B = -1395561;
private static final long serialVersionUID = 123456789L;
private NiceFunction f; // Assuming that "NiceFunction" is "Serializable".

public MyHash(NiceFunction f) {
this.f = f;
}

@Override
public boolean isCompatible(HashFunction other) {
if (other instanceof MyHash) {
return true;
} else {
return Long.compare(compute(COMP_TEST_A, COMP_TEST_B),
other.compute(COMP_TEST_A, COMP_TEST_B)) == 0;
}
}

@Override
public long compute(byte[] input, int seed) {
// ...
}
}
{code}
Then, [Collections] could provide wrappers for the functions implemented in 
[Codec]. Casual users will get new functions at release time, while not 
preventing power users to define their own wrappers around experimental and 
non-standard functions.

Or am I completely off base?

bq. I would not expect the actual code for for the function to be sent.  This 
would probably require that the listener be a Java based application so that it 
could have the HashFunction implementation.

I'm confused.  Where is the hash computation used?

> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-12-19 Thread Claude Warren (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1748#comment-1748
 ] 

Claude Warren commented on COLLECTIONS-728:
---

In all the work I have done with Bloom filters over the past 8 years, I have 
come to believe that there are several pieces of information that are necessary 
to determine if the hashing is the same for two filters.

1) they hashing algorithm must be the same.  Generally the algorithms are 
specified by name and this seems standard across the industry where hashing is 
used.

2) The results of the hash have to be converted into integer type values and 
the modulus taken.  So the question becomes were the bits in the result buffer 
assumed to be signed or unsigned, little endian or bit endian.

3) How was the hashing used.  In construction the Bloom filter multiple hashes 
are used.  In traditional iterative methods this is done by calling the 
selected hash function with a different seed for each hash required.  The 
second method described by Adam Kirsch and Micheal Mitzenmacher[1] has become 
more common and is used in applications like Cassandra[2].

There are other considerations, like is the hashing algorithm implemented 
correctly, but I believe they fall outside the normal bounds of equivalence 
checking during normal operation.

I originally encoded these 3 properties into the name.  While this made for 
simple checks it posed other problems and was, rightly so, objected to.  
Instead the 3 properties are now tracked in the HashFunctionIdentity.

The identity is separated from the hash function implementation because there 
are times when the hash function implementation can not be preserved; 
specifically when sending the BloomFilter across a the wire.  All we really 
want to send is the filter and the shape.

The shape preserves the standard configuration of the Bloom filter (number of 
items, number of bits, number of hash functions) as well as the 
HashFunctionIdentity.  I believe this to be the minimal set necessary to 
determine whether it is valid to compare two Bloom filters.

There are other cases where specific properties of the hash function identity 
are required.  For example, I have an implementation of Hasher that uses the 
first 2 results returned from a CYCLIC hash function to allow it to build Bloom 
filters with any number of items, number of bits, or number of hash functions.  
This is similar to the StaticHasher but useful where different sized filters 
are required to be built from the same object.

An equals method on  the HashFunction would require the HashFunction object be 
in some sense Serializable.  That is, the object would have to be decomposable 
so that its identity could be sent across the wire.  I would not expect the 
actual code for for the function to be sent.  This would probably require that 
the listener be a Java based application so that it could have the HashFunction 
implementation.  The HashFunctionIdentity object makes this much easier as the 
component values are easily serialized.

 

[1] [https://www.eecs.harvard.edu/~michaelm/postscripts/tr-02-05.pdf]

[2] 
[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/utils/BloomFilter.java#L60]

 

> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-12-15 Thread Gilles Sadowski (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16996909#comment-16996909
 ] 

Gilles Sadowski commented on COLLECTIONS-728:
-

{quote}I would like to avoid that discussion until the core components are in 
place.
{quote}
>From the outset, I have been trying to minimize discussion by suggesting just 
>that: Limiting the initial commit to the core functionality. If I'm not 
>mistaken, this is what lead to moving hash implementations from here to 
>[Codec], among other simplifications of the code.
{quote}should not try to implement all hash functions [...]
{quote}
OK; let's assume that.
{quote}provide a framework that identifies when the hash functions are not the 
same
{quote}
OK, but what is the simplest way to do that?
 IMHO, as we cannot hope that the {{HashFunction}} becomes a "standard", we 
should keep it as simple as possible for application developers. I.e. maybe 
just make it a "marker" interface (?). Would it be sufficient to have the 
Javadoc saying that applications using this (Bloom filter) functionality must 
define an {{equals(Object)}} method whose purpose is to identify when two 
{{HashFunction}} instances can be considered the same or not?

> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-12-15 Thread Claude Warren (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16996742#comment-16996742
 ] 

Claude Warren commented on COLLECTIONS-728:
---

I  think there may be a nomenclature issue here:  Does "same function" mean the 
identical implementation (same provider) or just the same results (same hash 
function, Signedness and Process type)?  In my mind it is the later.

The HashFunction is separate from the HashFunctionIdentity because there are 
cases where we don't need/want the hashing function implementation we just need 
to know what it is (e.g. stored in a database).

Commons Collections can implement whatever hashes are deemed appropriate.  It 
can implement them with 3rd-party hash algorithms, java core algorithms (e.g 
MD5), or Collections provided hash algorithms.   I attempted the later early in 
this contribution but it was easier to move that to commons-codec and fix the 
implementation there than it was to keep that code in the contribution.

There is nothing in the contribution that prohibits the development of a 
registry, it just does not provide such.  Adding such would increase the size 
and would lead us into discussions about Factories, singletons, and registering 
new producers again.  I would like to avoid that discussion until the core 
components are in place.

 Again, we can not and should not try to implement all hash functions, nor 
should we restrict others from using new/experimental hash function with this 
library. 

What we should do is provide a framework that identifies when the hash 
functions are not the same so that appropriate actions can be taken.  I believe 
that is what this contribution does.

I have used this contribution is 3 projects with 3 different development teams. 
 We have managed to fit this contributions into the projects even though they 
have different requirements on the Bloom filters.  I have easily retrofitted 
this contribution into packages that used their own Bloom filter 
implementations with minimal impact.  Overall this contribution works. 

Am I missing something?

 

> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-12-15 Thread Gilles Sadowski (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16996728#comment-16996728
 ] 

Gilles Sadowski commented on COLLECTIONS-728:
-

{quote}{{HashFunction}}
{quote}
I still have the impression that we are carried away from the main contribution 
by an interface whose contents seems (?) to be defined by "out-of-bands" 
requirements.
 It seems (IIUC) that [Collections] will have to create the class that 
"implements" the interface, as a wrapper to a third-party hash function 
implementation (because the third party will most probably *not* depend on 
[Collections]); hence I don't see why not use an enum (or a "registry", as Alex 
suggested) in order to make the check obvious and simple: The _same_ function 
must be used whatever its specific "properties" (cyclic, or whatever).

> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-12-15 Thread Claude Warren (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16996674#comment-16996674
 ] 

Claude Warren commented on COLLECTIONS-728:
---

I have added HashFunctionIdentity  to replace the HashFunction name. The 
identity contains the function name, Signedness, and ProcessType as well as the 
signature value.

It also contains comparator implementations, and a couple of static methods to 
simplify generating the signature value.

> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-12-14 Thread Claude Warren (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16996508#comment-16996508
 ] 

Claude Warren commented on COLLECTIONS-728:
---

Yes, I have hashers that depend on the function being cyclic.  They currently 
check the name but this solution would be cleaner.

> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-12-11 Thread Gilles Sadowski (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16993591#comment-16993591
 ] 

Gilles Sadowski commented on COLLECTIONS-728:
-

Would methods {{getName()}}, {{isIterative()}} and {{isUnsigned()}} be used in 
the code anywhere else other than within {{getSignature()}}?

> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-12-08 Thread Claude Warren (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16990888#comment-16990888
 ] 

Claude Warren commented on COLLECTIONS-728:
---

The idea of using ID's for each provider/hash function is interesting, but it 
will require that a group (Apache Commons in the proposal) take the 
responsibility for registering them. I don't know that Apache wants to take 
that on.

Perhaps there is an interesting middle ground between the proposals from the 3 
of us.

Change the HashFunction interface shown above to not extend ToLongBiFunction<> 
but to contain a method `apply( byte[] buffer, int seed)`

Also add a method `long getSignature()` that will be implemented as:
{noformat}
public long getSignature() {
byte[] b = new byte[ getName().getBytes("UTF-8") ];
byte[] b2 = new byte[ b.length ];
b2[0] = (byte)((isIterative()?0x1:0) & (isUnsigned()?0x2:0));
System.arrayCopy( b1, 0, b2, 1, b1.length );
return apply( b, 0 );
}
{noformat}
Now, the value for the function is the function information hashed by the 
function. A list of function names and flags with correct signature can be 
provided and easily checked by implementers.

The interface could be changed to an abstract class and the getSignature() 
listed above could be made final and implemented to store the signature in a 
Long the first time it is calculated.

Now the getSignature() can be used to verify that the methods are the same, and 
the name is still available for user display. The implementation checks can be 
quickly added to the code base as a property file.

Might this be a compromise that works for all of us?

I  think this means we still need to implement a JCA style factory and register 
an Apache commons provider as default.

> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-12-08 Thread Alex Herbert (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16990850#comment-16990850
 ] 

Alex Herbert commented on COLLECTIONS-728:
--

{quote}it would make more sense to follow the JCA pattern and have a factory 
that allowed users to register provider libraries and then request by name, 
flags, and optional provider.
{quote}
That would work with the same JVM. How do you register the same provider across 
VMs?

An alternative is to maintain a register in Commons. Each provider is assigned 
a code (e.g. a 16-bit integer). Anyone who wishes to provide reference 
implementations can apply for a code from Commons. This then becomes their code 
that no-one else can use. A set number of code can be reserved for use by 
Commons to allocate to algorithms that the library (or core Java) provides. The 
register can be in a properties file included in the jar if necessary but I 
don't see the need for that. It should be public somewhere, e.g. a page on the 
website which is updated each release.

This idea is akin to the use of fields in a TIFF file. Someone can register a 
field with Adobe for use in their TIFF reader/writer implementation. They can 
then put whatever they want in that field. A set number of fields are reserved 
by the TIFF standard and are not available.

This does add a maintenance burden on Commons to respond to people who wish to 
register hash functions. But how often will that occur? Maybe never. If a 
personal user wishes to use the library with their own hash function they do 
not have to register it. Just use an arbitrary code and they are responsible to 
ensure compatibility across their own code.

The Bloom filter library then only need verify two filters are created using 
the same hash function with a single integer (i.e. the code) comparison. No 
name checks, no hashing of function names, etc.

 

You could enforce the properties isIterative and isUnsigned are for the first 
two bits of the code if you so desire.

 

On another note, I think the performance of autoboxing requires that this 
interface is specialised to a primitive:
{code:java}
ToLongBiFunction
{code}

> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-12-08 Thread Claude Warren (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16990840#comment-16990840
 ] 

Claude Warren commented on COLLECTIONS-728:
---

The issue of broken hashes is a problem and I admit that the name does not 
solve that, it does stop an MD5 hash being used with a Murmur128 hash, which 
was its original purpose.

Stronger checks could be enforced, however when working with Bloom filters one 
is typically doing so in hopes of achieving a performance improvement (e.g. a 
quick check to filter out unnecessary longer lookups) in  these cases 
performing a hash on every check/insertion is extra overhead that may not be 
tolerable to the application.  As an indication of how sensitive this can be, I 
have been working on a multidimentional Bloom filter problem.  The solution is 
so time constrained that looking up items in a Trie adds enough overhead that 
in many cases the multidimentional Bloom filter is no more efficient than a 
linear search of the equivalent list of Bloom filters.  The overhead of an 
in-depth check (like the code necessary to navigate a Trie) will significantly 
impact the performance of the Bloom filter.

I like the idea.  Perhaps there is an additional utility class to be added to 
do such checks, but I don't think the check belongs in the mainline code. 

What we need is a quick, easily extensible, mechanism to determine if two hash 
functions are expected to produce the same result.  I proposed a string (name), 
but am open to other ideas.  I am  thinking of a more binary encoding 
comprising a byte of flags and a string hash name and a string for the provider 
name.  Only 2 flags defined (Signed/Unsigned and Cyclic/Iterative).  The 
verification check would ensure that the flags and name are equal.  A more in 
depth check could be implemented by anybody who wanted to restrict to a single 
implementation, etc.

Perhaps it would make more sense to follow the JCA pattern and have a factory 
that allowed users to register provider libraries and then request by name, 
flags, and optional provider.

The resulting implementation would therefore have to have methods to return the 
provider, and flag values.

Something like:
{noformat}
public interface HashFunction extends ToLongBiFunction {
String getName();
String getProvider();
boolean isIterative();
boolean isUnsigned();
}
{noformat}

The factory would work like the JCA factories in that the order of the provider 
determines which implementation of a HashFunction is returned when multiple 
providers provide the "same function".





> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-12-08 Thread Gilles Sadowski (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16990829#comment-16990829
 ] 

Gilles Sadowski commented on COLLECTIONS-728:
-

{quote}The naming is intended to provide a proxy for the implementation details
{quote}
That's what I had understood, but I find it a fragile way. Who gives the name? 
What if two developers implement the same hash algorithm? How can a third party 
developer be sure that identical names will provide the same behaviour?  [ Case 
in point: the original name of a "broken" implementation (of some hash 
function) will *not* be "MyBrokenFunction" ;) ].
{quote}so that the user may be warned
{quote}
Could it be possible to enable a stronger check? E.g. compare the output of 
applying the hash computation to some known/reference input.
{quote}to ensure that we have the same "vision" of this contribution.
{quote}
Good point. :)
 I don't know. I'd guess that most (?) codes in "Commons" are building blocks 
(for application developers) that provide a _complete_ solution to often 
encountered self-contained issues; broadly speaking, they would be readily 
usable, by _composition_, rather than requiring an extension to be fully 
functional.

> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-12-08 Thread Claude Warren (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16990798#comment-16990798
 ] 

Claude Warren commented on COLLECTIONS-728:
---

Hasher names:

The Shape of the Bloom filter is dependent several things.  One is that the 
hashing function is consistent; that is it uses the same techniques to generate 
the hashed values.  Two different implementations of a hash function are are 
the "same" as long as they generate the same values for the same input. 
(referred to below as the "same function"). 

Comparing bloom filters that do not use the same function does not make sense.  
Thus a mechanism to distinguish between filters with different hashing is 
desired so that the user may be warned when such an attempt is made.  This is 
where the naming arises.  The naming is intended to provide a proxy for the 
implementation details as well as provide the user with some idea of what 
hashes were used in order to assist in the resolution of the conflict.

You will note that the Shape class equality check verifies that the name is the 
same for the same reason.

Similar to the Java Cryptography Architecture (JCA) two providers may provide 
implementations of the same hash function.  Users assume that the JCA 
implementation has been vetted and is correctly implemented.  In the Bloom 
filter case an improperly implemented hash is only a serious issue in cross 
application communication. 

Note that the code in this contribution does not require the name format be 
followed, only that different implementations be named differently.  I do have 
a Caching Hasher implementation in another application that requires a cyclic 
hash function and will fail if the hash name does not match that presented here.

Perhaps it makes more sense to have a name an the booleans for cyclinc/iterated 
and signed/unsigned.  But I don't see a way to provide users with the ability 
to use new, old or broken hash functions and be able to evaluate if the same 
function is being used without using a name. 

In my mind Enums are appropriate in two basic conditions: 1) you know all the 
possible values; or 2) you want to tightly control the acceptable values.  
Neither of  these conditions apply in the case of hash function identification.

addendum:

In reading back over the previous post I was struck the the use of the term 
"library".  Perhaps this is just a "nomenclature" thing or perhaps it is a 
"conceptual" thing.  I see the Bloom filter contribution as a "framework", a 
scaffolding on which other developers may hang new implementations and tweeks.  
In my mind a "library" is generally something that is used without modification 
or extension.  I just wanted to ensure that we have the same "vision" of this 
contribution.

> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-12-07 Thread Gilles Sadowski (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16990667#comment-16990667
 ] 

Gilles Sadowski commented on COLLECTIONS-728:
-

bq. you know all the possible values. We don't, we can't, and we never will. 

Isn't your code assuming that every implementation of some algorithm will name 
it the same way (and that the name will comply with the Javadoc of 
{{HashFunction}} defined here?
Conversely, one can imagine that two implementations could bear the same name, 
even if one is broken and actually not compatible with the other.

bq. we would have to include enums for patently "broken" hashes

Yes, if this library purports to support them.

However given that in both cases (name or enum), we don't have control on the 
source code (of the hash algorithm implementation), I wonder whether there 
should really be a {{verifyHasher}} (i.e. it could simply be the application 
developer's responsibility to ensure consistency of his use of the 
{{BloomFilter}} functionality).

> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-12-07 Thread Claude Warren (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16990597#comment-16990597
 ] 

Claude Warren commented on COLLECTIONS-728:
---

Mutability:  I think all issues are fixed with the exception of 
HasherBloomFilter.

HasherBloomFilter extends BloomFilter.  In the general case BloomFilter needs 
to have a merge operation.  I have updated the javadoc for the 
HasherBloomFilter class to indicate that if merges are expected another Bloom 
filter implementation may be more 
appropriate.

HashFunction:

HashFunction extends ToLongBiFunction so it is more than just 
a name.   We could make it more complex with boolean values for unsigned/signed 
and cyclic/iterative but to what end?  I think this class is a simple as it can 
be but no simpler.

An Enum assumes you know all the possible values.  We don't, we can't, and we 
never will.  In recent work with the Murmur128 hash in commons-codec it became 
apparent that there are several implementations of the hash that are "broken" 
with respect to the original "C" code.  However, each of those is in use.  If 
this library were to be used with any of those implementations we would have to 
include enums for patently "broken" hashes, or they have to have a way to 
extend the implementation in a way that fits into this system.  Adding a name 
to the hash and not limiting it to an enum works.  This is more akin to the 
Java Cryptography Architecture with out the necessity of registering providers. 
 Though perhaps that is future work.

MD5: fixed. 


> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-12-07 Thread Gilles Sadowski (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16990552#comment-16990552
 ] 

Gilles Sadowski commented on COLLECTIONS-728:
-

About mutability:
 * Couldn't {{shape}} be _final_ in {{BloomFilter}}?
 * Does {{HasherBloomFilter}} really gain anything from being mutable? IOW, 
what would be lost if the {{merge}} operation would return a new instance?
 * In class\{{MD5}}, why not make the fields _final_? Even though the instance 
is not immutable, it simplifies testing and reasoning about the code. Ditto for 
other similar classes.
 * In class {{BitSetBloomFilter}}, can't {{bitSet}} be _final_? Can't {{merge}} 
create a new instance? A lot of methods do indeed {{clone}} that instance field 
in order to compute a result.

About interface {{HashFunction}}:
 * It is weird (IMHO) to have an interface with such a name where the API just 
contains a {{getName()}} method.
 * I reiterate my suggestion of using an enum. Did you have a look at 
{{RandomSource}} in "Commons RNG"?

About class {{MD5}}:
 * It is awkard that users have to guard against a checked exception because of 
an *internal* error.

> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-12-07 Thread Claude Warren (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16990527#comment-16990527
 ] 

Claude Warren commented on COLLECTIONS-728:
---

*API Issue*

The Hash function takes a buffer and a seed and returns an integer.  In general 
this is a simple operation.  You might use MD5, SHA256, Murmur128_x64 or 
Murmur128_x86.  So the name must specify what hash is used -- I think we can 
agree on this.

The second the calculation may be iterative or cyclic.  Iterative is probably 
the one you are familiar with where the hashing function is called with a 
different seed each time and that result used.  cyclic is when two values are 
created: call them partA and partB.  The first time the function is called 
partA is returned.  for all subsequent calls partB is added to the previous 
results and returned.  This method is documented in the Cassandra codebase and 
their experimentation has shown that it does not impact the randomness of the 
subsequent Bloom filter but does significantly speed up processing.  Generally 
this method is used with a 128 bit hash as the result can be considers as 2 
longs.

Finally, the result of the function will differ depending on whether the 
numeric values were calculated using unsigned or signed arithmetic.

This information needs to be passed down to components that do not keep a 
reference to the hasher or may not even have the hasher implementation 
available.  When comparing 2 filters you have to know if the "shape" is the 
same.  but the "shape" does not maintain a reference to he hash function.

Classes like the static hasher which maintains a list of the bits that were 
enabled needs to know the shape but does not have a reference to the hash 
function.

These classes need to be able to verify that the function that generated the 
values was the same.

*Implementation Issues*

Are you saying that all classes in Collections have to be immutable?  Or that 
the classes noted are not mutable?  They are mutable and we had a discussion 
about this wherein the standard uses of Bloom filters requires  that they be 
mutable or significant overhead is imposed to create new Bloom filters whenever 
filters are merged (a common operation).




> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-12-07 Thread Gilles Sadowski (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16990468#comment-16990468
 ] 

Gilles Sadowski commented on COLLECTIONS-728:
-

Thanks; we are getting closer...

Nit-picks:
 * Convention for opening braces (on the same line)
 * Class and field names (should be be complete names – e.g. {{Iter}}, {{md}})
 * Javadoc: avoid use of HTML entities.

API issue:
 * Interface {{HashFunction}}: I don't think that [Collections] is going to 
impose a convention on the naming of hash functions (or I'm missing something 
about the purpose of the {{getName()}} method). Its use to ensure compatibility 
between functions seems fragile; the comparison logic should rather be defined 
by the standard {{equals(Object)}} method, or some other {{Comparator}}.

Implementation issues:
 * Classes are not mutable (e.g. {{BloomFilter}}, {{BitSetBloomFilter}})
 * Nested {{Iter}} class (in {{StaticHasher}}) wraps an {{Iterator}} 
and using the {{PrimitiveIterator.OfInt}} would probably loose a lot of 
performance despite the impression of the contrary from a user's POV.

> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-11-29 Thread Claude Warren (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16985222#comment-16985222
 ] 

Claude Warren commented on COLLECTIONS-728:
---

The usage.md was for you to understand the usage of bloom filters.  I was not 
intending to submit it.  I will remove it from the submission.

as to the Dynamic.Factory,  I have another hasher implementation 
(CachingHasher) that I have not included here (as I am trying to keep this to a 
minimum) but that Hasher also uses the factory model.  CachingHasher has 
different characteristics from the Dynamic hasher and thus different set of 
acceptable hashing functions.  I don't see a way to create a single 
Hasher.Factory implementation that will handle both.  I think it is best for 
Dynamic.Factory to be a factory that returns DynamicHashers, and 
CachingHahser.Factory to be a factory that returns CachingHashers.  It is a 
clear separation.  Perhaps it makes sense to remove the Hasher DEFAULT from the 
Hasher.Factory interface to avoid confusion.

I will raise the External dependency on the ML.



> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-11-29 Thread Gilles Sadowski (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16985141#comment-16985141
 ] 

Gilles Sadowski commented on COLLECTIONS-728:
-

I don't think that it is Commons practice to have documentation files under the 
"src/main" directory.  They go somewhere under "src/site" where they are 
collected to create the web site.

It seems that the [Collections] Checkstyle rules are more lenient than for 
other Commons components...

I'm still not convinced about the "dynamic" {{Factory}}.
I again suggest that the first PR should only contain the core functionality.

The new external dependency is likely to be frowned upon: Issue should be 
raised on the ML.

> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-11-08 Thread Claude Warren (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16970260#comment-16970260
 ] 

Claude Warren commented on COLLECTIONS-728:
---

Any more comments on this contribution?

> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-11-03 Thread Claude Warren (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16965626#comment-16965626
 ] 

Claude Warren commented on COLLECTIONS-728:
---

Updated PR with following notes:

I added MurmurHash3 to the RAT exclusion in the POM as It is public domain and 
I could not figure out how to get RAT to accept it. This is a temporary 
work-around until a final solution is discovered.

I restructured the Hasher / Factory layout and made Hasher the top level 
interface. It now contains two (2) interfaces
 Hasher.Factory - defines only listFunctionNames()  and useFunction(String) 
[ the last one probably needs a rename ]
 Hasher.Builder - defines three (3) with(X) methods and a build() method.

Hasher.Factory.useFunction(String) takes the function name and returns a 
Builder for the Hasher.
Hasher.Factory.DEFAULT is a static that contains the default Hasher.Factory 
implementation.  Currently it contains the DynamicHasher.Factory as there is no 
other.

I removed all ByteBuffer and LongBuffers in the interfaces as they cause 
problems with the Animal Sniffer tool and it just made life easier.

[~erans] You were correct the SpanBuffer.contains( Span, Hasher ) and 
SpanBuffer.merge( Span, Hasher ) did not smell right.  I removed the Span 
arguments from the calls as  they are not necessary.  The calls are now 
SpanBuffer.contains( Hasher ) and SpanBuffer.merge( Hasher )

Added a DynamicHasher.Builder that builds the DynamicHasher.  
Removed the "locked" internal state from DynamicHasher as it is immutable.
Added a DynamicHasher.Factory that will create the DynamicHasher.Builder from 
the well known names.
Compromised on the ability to new Hash functions to the DynamicHasher.Factory 
by implementing a protected register( String, Class ) method to allow 
developers to create a child class that adds new hashes.

I hope that we can remove the MurmurHash3 code and switch to commons-codec 
hasher code, but the codec murmer3 hash implementation has a bug.  I am working 
on a pull request to fix that.





> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-11-02 Thread Claude Warren (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16965284#comment-16965284
 ] 

Claude Warren commented on COLLECTIONS-728:
---

I will provide update the PR with the stripped-down version.



> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-11-01 Thread Gilles Sadowski (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964906#comment-16964906
 ] 

Gilles Sadowski commented on COLLECTIONS-728:
-

{quote}The API always requires both arguments.
{quote}
OK. Sorry I missed that.

So, would you provide a stripped-down PR with _only_ the core API 
({{BloomFilter}}, {{Hasher}}, {{Shape}}, {{default}} methods for the "proven 
math"), and a _single_ variant for each concrete functionality (e.g. 
{{BitsetBloomFilter}} and the "Murmur" algorithm provided by an immutable 
{{Hasher.Factory}})?  Thanks.

> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-11-01 Thread Claude Warren (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964893#comment-16964893
 ] 

Claude Warren commented on COLLECTIONS-728:
---

Quick note:

I somewhat get the rationale until here:
 === begin quote ===
True the Hasher could be limited to a StaticHasher and the Shape would not 
be needed as the Static Hahser contains the Shape it was built with, but in the 
general case a Hasher does not have a Shape.

Where it seems that the API could be improved so as to avoid having to pass 
null when the "shape" argument is unused.

=== end quote ===

The API _always_ requires both arguments.  For the static hasher the call would 
look like `filter.merge( staticHasher.getShape(), staticHasher )`

I will comment on other bits later when I have more time.


> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-11-01 Thread Gilles Sadowski (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964861#comment-16964861
 ] 

Gilles Sadowski commented on COLLECTIONS-728:
-

Thanks for explaining again.
 I somewhat get the rationale until here:
{quote}True the Hasher could be limited to a StaticHasher and the Shape would 
not be needed as the Static Hahser contains the Shape it was built with, but in 
the general case a Hasher does not have a Shape.
{quote}
Where it seems that the API could (?) be improved so as to avoid having to pass 
{{null}} when the "shape" argument is unused.
{quote}We will always be behind when it comes to implementing new algorithms 
using new hash functions.
{quote}
Just _one_ release behind. People who want new features should be around to 
provide a patch and help with the release.
{quote}It is better to allow the users in the field to add algorithms as they 
see fit.
{quote}
I don't think so; it leads to unnecessary duplication (with a possible 
consequence being the "broken" algorithm case which you mention).
{quote}not allowing for addition of others is very limiting.
{quote}
Not really if there is an "active community" (i.e. people willing to work 
together towards releasing new features). Once the design is in place, adding 
variants of existing features should be straightforward; a new hash function 
could be added at time "t0" and a release candidate could be prepared at "t0" + 
72 hours (to allow for review), then published another 3 days later.
{quote}they would be unable to add their "broken" hashing to the factory would 
be an inhibitor.
{quote}
"Commons" would provide a library of algorithms through its implementation of 
the {{Hasher.Factory}} interface, but application developers could define their 
own for a particular purpose (such as working around issues of their own ;)).
{quote} why not provide a factory that [...] has the convenience methods to add 
hash functions?
{quote}
Because it puts the burden on us to ensure thread-safety (while it is trivially 
accomplished with immutable instances).

> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-11-01 Thread Claude Warren (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964753#comment-16964753
 ] 

Claude Warren commented on COLLECTIONS-728:
---

=== From the conversation above ===

 Why doesn't Shape keep a reference to Hasher? That would avoid having to pass 
both to e.g. method contains in BloomFilter (and would make is unnecessary to 
check their consistency).

 (see the response to how are StaticHasher and DynamicHasher different for some 
extra background).
 The contains method with the Hasher was requested because one of the other 
Apache projects was interested in checking the presence of the Bloom filter 
without building a complete Bloom filter. They have the hasher and they know 
the shape it should be. So the method actually needs the Hasher for the data.

Now, they can build the DynamicHasher and pass that, or they could build a 
StaticHasher (which does know it's shape), or they could implement another 
Hasher type.

Placing the Hasher in the Shape would mean that the shape will always return 
the same data when what we are doing is trying to send different data with the 
same shape.

Still it doesn't look right, API-wise; but again I'm lacking a clear view of 
all the requirements to provide an alternative. 

=== end of quote ===

The methods that have both the Hasher and the Shape are really passing 
decomposed Bloom filters.  The Hasher contains the hash values and the Shape 
identifies shape of the decomposed Bloom filter.   True the Hasher could be 
limited to a StaticHasher and the Shape would not be needed as the Static 
Hahser contains the Shape it was built with, but in the general case a Hasher 
does not have a Shape. 

The reason for the separation is the Hasher applies a hash function to data and 
returns int values.
The shape indicates to the Hasher how many times it should apply the hashing 
algorithm (k in the maths) to each data item being hashes (item = the X in 
DynamicHasher.with(X)).  The Shape also give the upper limit for the values the 
hasher returns on the particular getBits() call.  The upper limit is "m" in the 
maths.

So Hasher + Shape = an iteration of integers that are the bits turned on the 
Bloom filter.

A single hasher can generated Bloom filters of an almost infinite number of 
Shapes.

--- snip --

=== start of quote === 

>From KISS and DRY POVs, I disagree with both statements: The factory could be 
>immutable and readily provide all the useful algorithms (see e.g. 
>RandomSource).

=== end of quote ===

There are two issues with the idea that we can provide all the usefull 
algorithms.

1) We will always be behind when it comes to implementing new algorithms using 
new hash functions.  It is better to allow the users in the field to add 
algorithms as they see fit.  I remember an instructor one said the the only 
valid limits in computer science are zero, one and many.  Providing an 
implementation with some but not allowing for addition of others is very 
limiting.

2) There are known cases (Cassandra comes to mind) where an implementation of a 
Hashing algorithm was incorrectly executed and released into the wild.  If such 
an application wanted to use this library they would be unable to add their 
"broken" hashing to the factory would be an inhibitor.

So why not provide a factory that " provide all the useful algorithms" and has 
the convenience methods to add hash functions?

Once we add hash functions it seems a short step to resetting the factory to 
default.  At a minimum this would be needed for testing.









> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-10-31 Thread Gilles Sadowski (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964093#comment-16964093
 ] 

Gilles Sadowski commented on COLLECTIONS-728:
-

bq. Yep, this was just a new look at a possible solution. I expect there to be 
a fair amount of cleanup to be done.

The problem (for me at least) is that discussing about the design is tedious 
when having to read through fairly long files, and especially on GitHub.  I 
can't comment on GitHub at all, and long comments here are not practical either 
(having to manually "quote", whereas it is automatic in a mail client).  That's 
the reason, I guess, that prior to agreement on a design (for a new 
functionality), discussions were always held on the "dev" ML.

bq. Interface does not support protected methods. In addition using default, 
which is described in the literature as a mechanism for extending existing 
interfaces,

It isn't solely a "workaround" mechanism; they can be useful by themselves.  
Quoting from [this 
presentation|https://www.baeldung.com/java-static-default-methods]:
{noformat}
The most typical use of default methods in interfaces is to incrementally 
provide additional functionality to a given type without breaking down the 
implementing classes.

In addition, they can be used to provide additional functionality around an 
existing abstract method
{noformat}

bq. to build a new interface doesn't seem right. Also, are there any cases of 
Bloom filters being anything more than a Bloom filter?

For that reason, the default methods would be particularly apt to provide a 
functionality which all implementations are supposed to need and use.
And the added advantage of a {{BloomFilter}} interface is that, in the (even 
unlikely) case where someone wants to code it from scratch, it is possible that 
the new class {{implements}} the interface.  We could also provide "adapter" 
code that would delegate to the external implementations which you mentioned.

bq. Passing the hasher [...]

I feel there is something wrong there, API-wise, but I may be missing something 
so won't further comment for now.

bq. The others are final as they are based on definitions and mathematical 
proofs from the literature. The final could be removed if desired. 

Great use for default methods IMHO (cf. above).  Yet some hypothetical 
implementation could want to override them (e.g. if it caches part of the 
computation).

bq. So the method actually needs the Hasher for the data.

Still it doesn't look right, API-wise; but again I'm lacking a clear view of 
all the requirements to provide an alternative. :(

bq. Because hashers can be added to the Factory, there needs to be a way to 
reset them.

>From KISS and DRY POVs, I disagree with both statements: The factory could be 
>immutable and readily provide all the useful algorithms (see e.g. 
>[RandomSource|http://commons.apache.org/proper/commons-rng/commons-rng-simple/apidocs/org/apache/commons/rng/simple/RandomSource.html]).

bq. Makes no difference to me, Hasher and HasherFactory could be defined on 
their own as well.

The difference is in the clarity (and self-documenting nature) of the code.  
The {{Hasher}} interface is not tied to the particular {{HasherFactory}} 
concrete class; so the former should not be declared as a nested interface of 
the latter.
Instead, we could have something like:
{code}
public interface Hasher {
PrimitiveIterator.OfInt getBits(BloomFilter.Shape shape);

interface Factory {
T create(String name);
}
}
{code}


> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-10-30 Thread Claude Warren (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16962970#comment-16962970
 ] 

Claude Warren commented on COLLECTIONS-728:
---

I originally answered this in an email, but it didn't get posted back here so 
here are my responses:


  * Nit: There are formatting issues which I mentioned in other reviews.

 
Yep, this was just a new look at a possible solution.  I expect there to be a 
fair amount of cleanup to be done.


* Why an _abstract_ class instead of an _interface_ (and _default_ methods)?


Interface does not support protected methods.  In addition using _default_, 
which is described in the literature as a mechanism for extending existing 
interfaces, to build a new interface doesn't seem right.  Also, are there any 
cases of Bloom filters being anything more than a Bloom filter?
 

* Why pass a {{Hasher}} to the {{Shape}} constructor when only its "name" 
is used?


Passing the hasher ensures that the name is actually bound.  It could be just 
the name.  By passing the Hasher there is some check on the validity of the 
name, but that is not necessary.
 

* Why are methods _final_?  It's redundant for _private_ ones, and strange 
for those name {{estimate...}} (one could imagine that subclasses could have 
their own way).


I missed the private final, you are correct the final should be removed.

The others are final as they are based on definitions and mathematical proofs 
from the literature.  The final could be removed if desired. 

 

* Why doesn't {{Shape}} keep a reference to {{Hasher}}?  That would avoid 
having to pass both to e.g. method {{contains}} in {{BloomFilter}} (and would 
make is unnecessary to check their consistency).


(see the response to how are StaticHasher and DynamicHasher different for some 
extra background).
The contains method with the Hasher was requested because one of the other 
Apache projects was interested in checking the presence of the Bloom filter 
without building a complete Bloom filter.  They have the hasher and they know 
the shape it should be.  So the method actually needs the Hasher for the data.

Now, they can build the DynamicHasher and pass that, or they could build a 
StaticHasher (which does know it's shape), or they could implement another 
Hasher type. 

Placing the Hasher in the Shape would mean that the shape will always return 
the same data when what we are doing is trying to send different data with the 
same shape.

 

* I don't get the {{reset()}} method in {{HasherFactory}}.  At first sight, 
this class should be immutable.


Because hashers can be added to the Factory, there needs to be a way to reset 
them.  This is the similar to the problem in Apache Jena with  the datatypes 
TypeMapper where the mapping between external objects and their datatype names 
is maintained.  Items can be added but experience shows they also need to be 
reset.  Particularly when testing.

* In {{DynamicHasher}}, it seems that the "lock" construct could be 
advantageously replaced with a builder (i.e. the returned {{Hasher}} instance 
would only offer the {{getBits()}} API.


Yes, that is a valid and reasonable construct.
 

* It seems strange to define the {{Hasher}} interface inside 
{{HasherFactory}} (IMO that should be the other way around).


Makes no difference to me, Hasher and HasherFactory could be defined on their 
own as well.
 

* What is the difference between {{StaticHasher}} and {{DynamicHasher}}?


First the similarity:  Both return integers that are the indexes of bits to be 
turned on based on the shape argument to getBits( Shape ).

DynamicHasher uses a list of ByteBuffers to generate the indexes on the fly.  
Can generate for any Shape.

StaticHasher stores a list of indexes that were generated for a specific Shape. 
 Can only provide indexes for the original Shape and does it by returning an 
iterator over a stored list of integers.  (Much faster)


*  I'm afraid that the review is not exhaustive; again, I have a hard time 
figuring out the fundamental (design) structure from convenience features. 

The fundamental methods that the BloomFilter must have (using the new 
contains() rather than matches() method) are contains() and merge().
However for disparate implementations to be able to execute the contains() and 
merge() methods some common representation of the underlying bit storage must 
be made available.  This implementation has 2: getBits() [ long buffer encoded 
bit pattern ], and getHasher() [ iterator over t the indexes of the bits that 
are turned on ].  These two solutions allow for performance gains to be made 
depending on the size of the Bloom filter (number of bits possible - m) and the 
number of bits that are turned on in the filter (hammingValue or cardinality).

Fundamental checking methods are the methods that ensure the shapes of the 
f

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-10-29 Thread Gilles Sadowski (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16962154#comment-16962154
 ] 

Gilles Sadowski commented on COLLECTIONS-728:
-

bq. implementation (https://github.com/Claudenw/BloomFilter/tree/Hasher)

Nit: There are formatting issues which I mentioned in other reviews.

* Why an _abstract_ class instead of an _interface_ (and _default_ methods)?
* Why pass a {{Hasher}} to the {{Shape}} constructor when only its "name" is 
used?
* Why are methods _final_?  It's redundant for _private_ ones, and strange for 
those name {{estimate...}} (one could imagine that subclasses could have their 
own way).
* Why doesn't {{Shape}} keep a reference to {{Hasher}}?  That would avoid 
having to pass both to e.g. method {{contains}} in {{BloomFilter}} (and would 
make is unnecessary to check their consistency).
* I don't get the {{reset()}} method in {{HasherFactory}}.  At first sight, 
this class should be immutable.
* In {{DynamicHasher}}, it seems that the "lock" construct could be 
advantageously replaced with a builder (i.e. the returned {{Hasher}} instance 
would only offer the {{getBits()}} API.
* It seems strange to define the {{Hasher}} interface inside {{HasherFactory}} 
(IMO that should be the other way around).
* What is the difference between {{StaticHasher}} and {{DynamicHasher}}?

I'm afraid that the review is not exhaustive; again, I have a hard time 
figuring out the fundamental (design) structure from convenience features.  

> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-10-28 Thread Alex Herbert (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16961433#comment-16961433
 ] 

Alex Herbert commented on COLLECTIONS-728:
--

{quote}your second match is incorrect. Match is (this & that = this)
{quote}
OK. I was thinking that this.match(that) == big.match(small). So do you propose 
to have effectively the following:
{code:java}
// a bloom filter with fewer bits can match a filter with more but not visa 
versa. 
public boolean match(BloomFilter other) {
return (this & other) == this;
}

// a bloom filter with more bits can contain a filter with less but not visa 
versa. 
public boolean contains(BloomFilter other) {
return (this & other) == other;
}
{code}

This does seem redundant. I would favour dropping one; I'd keep contains() as 
this is more self documenting.

> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-10-28 Thread Claude Warren (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16961283#comment-16961283
 ] 

Claude Warren commented on COLLECTIONS-728:
---

Based on the feedback and questions I have created a different implementation 
(https://github.com/Claudenw/BloomFilter/tree/Hasher) -- note this is the 
Hasher branch.

I posted a description of the implementation on the "[collections] BloomFilter 
package architecture discussion" topic on the dev mailing list.

[~aherbert] your second match is incorrect.  Match is (this & that = this)  So 
a bloom filter with fewer bits can match a filter with more but not visa versa. 
 To address this I added a contains() method.  Which is what you would want to 
call with a "decomposed" bloom filter.  Perhaps "decomposed" is not a good 
phrase because in the case at hand it has never been "composed".

Claude

> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-10-25 Thread Alex Herbert (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16960025#comment-16960025
 ] 

Alex Herbert commented on COLLECTIONS-728:
--

Following on from a conversation had on GitHub for the initial BloomFilter code 
contribution PR the example there was a Bloom filter created for 10,000 objects 
with a false positive probability of 1/5. This requires a Bloom filter of 
225,200 bits with 16 bits set per object hash. Thus even though the Bloom 
filter may require a lot of storage to represent the combination of thousands 
of objects each object only requires a very small amount of storage. In 
addition you only need to find a single bit index in the object hash that is 
not in the Bloom filter to determine that there is no match.

Thus the common use case of checking if an object matches the filter is to 
compute a relatively small set of indices and check each one is in the Bloom 
filter. This would suggest the following:
{code:java}
public interface BloomFilter {
    public interface BitIterator {
        /**
         * Get the next set bit, or -1 if no more bits are set.
         *
         * @return the bit index (or -1)
         */
        int nextSetBit();

        static BitIterator of(int... bits) {
            return new BitIterator() {
                int count;

                @Override
                public int nextSetBit() {
                    if (count < bits.length) {
                        return bits[count++];
                    }
                    return -1;
                }
            }
        }
    }

    default boolean match(BitIterator bits) {
        int index = bits.nextSetBit();
        if (index == -1) {
            // Cannot match no indices
            return false;
        }
        while (index != -1) {
            if (!get(index)) {
                return false;
            }
        }
        return true;
    }

    boolean get(int bitIndex);
}{code}
Thus the Bloom filter satisfies the following:
 # A BloomFilter is a representation of the logical OR (combination) of many 
hashes.
 # A hash is a specification of indices that are set to 1.
 # A BloomFilter can answer the question: is bit index _n_ set?
 # A BloomFilter can answer the question: are all bit indices _N_ set?
 # A "hasher" is able to convert an object to a set of indices.
 # A comparison of each index in turn opens the possibility of dynamic 
computation of each hash index with fail fast matching.

To use a BloomFilter would require an implementation of hashing that generates 
a number of indices for your objects and a concrete representation of the 
BloomFilter composite of many hashes.

This is just an idea and I can see that if constant time access to the method 
{{get(int)}} is not available then the filter would be slow. It is an 
alternative and can be combined with the current idea to build an entire 
BloomFilter for each new object and perform a match of two BloomFilters:
{code:java}
public interface BloomFilter {
    public interface BitIterator {
        /**
         * Get the next set bit, or -1 if no more bits are set.
         *
         * @return the bit index (or -1)
         */
        int nextSetBit();
    }

    boolean match(BloomFilter other);

    boolean match(BitIterator bits);
}
{code}
 

> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-10-25 Thread Gilles Sadowski (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16960002#comment-16960002
 ] 

Gilles Sadowski commented on COLLECTIONS-728:
-

bq. there is nothing in it that allows you to add another filter.

The file is not assumed to be complete; the point was to show that all the 
functionality, except state and states-specific implementations, could be in 
one (Java 8) interface.

bq. I think this can be part of the interface but is not required. It could be 
moved to a separate class as it just acts as a calculator.

Nested class {{Shape}} provides the "boiler-plate" code needed by all 
implementations.

{quote}
# Is a BloomFilter defined by its size and set bits?
{quote}

Probably, but a user should not care much, as long as the API provides 
instance, and they are used consistently (e.g. merging should only be done 
among instances produced from the same "builder").

{quote}
# ...
# Can we match BloomFilters of different lengths?
{quote}

I'd say "no" (?).  At least, it seems like looking for trouble.

bq. size and access to set bits should be available within the interface.

Maybe, but I wouldn't make it part of the API before a use-case shows that it's 
absolutely necessary.

bq. pad the smaller filter with nothing

?

bq. core functionality

IIUC, this should only refer to the methods in the {{BloomFilter}} interface.

The {{Skeleton}} (a.k.a. {{ProtoBloomFilter}} in Claude's PR) contains the 
configuration information to be used by implementations of the interface in 
order to create compatible instances (that can be "merged" and "matched").

> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-10-25 Thread Alex Herbert (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16959847#comment-16959847
 ] 

Alex Herbert commented on COLLECTIONS-728:
--

Looking at BloomFilter.java class there is nothing in it that allows you to add 
another filter.

This is a major use case for filters that guard collections, if an object is 
not present then add it to the existing Bloom filter then do something with the 
object.
{code:java}
boolean match(BloomFilter other);

// Possible for a MutableBloomFilter interface?
// Could return true if the filter was changed by the operation.
// Or return void.
boolean add(BloomFilter other);
{code}
The Shape class is a strong statement about the classic definition of a Bloom 
filter. It implements the Bloom equations for m, n, k, p:

 
{noformat}
"computing an array of m bits, and for up to n different elements, test k bits 
using positions chosen using hash functions. If all bits are set, the element 
probably already exists, with a false positive rate of p." 
{noformat}
I think this can be part of the interface but is not required. It could be 
moved to a separate class as it just acts as a calculator.

It does open the following questions:
 # Is a BloomFilter defined by its size and set bits?
 # Can we match BloomFilters of different lengths?

I would assume the answer to 1 is perhaps (if a filter is built using hashing 
functions to choose bits). If there is no other way to build a Bloom filter 
then the size and access to set bits should be available within the interface:
{code:java}
int getNumberOfBits();
boolean get(int bitIndex);
{code}
I assume the answer to 2 is yes. The default would be to pad the smaller filter 
with nothing (i.e. zeros).

I see the core functionality as:
 # Matching another filter
 # Adding another filter
 # Optionally defining key properties of the filter (the size and the set bits)

Separation of filter functionality and implementation would apply to the Hash 
and Skeleton classes since the way to build a Bloom filter is separate from the 
filter functionality.

> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-10-19 Thread Gilles Sadowski (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955217#comment-16955217
 ] 

Gilles Sadowski commented on COLLECTIONS-728:
-

Assuming {{BloomFilterI2}}, what would be the next step if we want (?) a class 
where the state is provided by a {{BitSet}}?  The following I suppose:
{code:java}
public BitsetBloomFilter implements BloomFilterI2 {
private final BitSet state;

// Constructor(s), builder, and so on.

// Define methods for which an implementation is required, but
// nothing else, assuming that reuse is good.
}
{code}
Now, making a wild ;) guess, I'd have a hard time to convince myself that this 
implementation will be efficient given the generic passing around of data, with 
all the boxing and unboxing, insead of operating directly on the {{state}}.
 And, if obviously (?), we'd override the {{default}} methods, why the bloat?

Moreover, as said in the previous comment, the interface forces all 
implementations to provide a method ({{stream()}} with a semantic that might 
not fit its underlying structure, or even that they would not need at all (say, 
in order to provide the {{hammingDistance}} property).

This is not to say that the code in {{BloomFilterI2}} is never useful (I don't 
know) but its place would seem to be in an abstract class, that would e.g. ease 
prototyping of alternative implementations and test their correctness.  In any 
case, even if it were actually heading in the right and I fail to see it for 
some reason, it would still be a refinement that could be left for once we 
agree on the "bare-bone" functionality.

> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-10-19 Thread Gilles Sadowski (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955205#comment-16955205
 ] 

Gilles Sadowski commented on COLLECTIONS-728:
-

{quote}IntStream stream(); // a stream of integers that are the indexes of the 
enabled bits.
{quote}
As already noted, I don't have the practical knowledge to know whether this is 
required for _all_ use-cases of the Bloom filter functionality. However, from a 
design POV, this is a kludge (IMHO) because, as I've tried to convey on the ML, 
it conflates the concept ("determines, with some probability of false positive, 
whether something exists") with how it is implemented (a specific, and 
potentially inefficient, representation of the underlying structure that 
provides one way to obtain the requested result).
{quote}I have added BloomFilterI2
{quote}
A particular example strikes me as the very illustration of what I think is the 
confusion between "BloomFilter" and "set of bits":
{quote}
{code:java}
default int hammingValue() {
return cardinality();
}

default int hammingDistance( BloomFilterI2 other ) {
return xorCardinality( other );
}
{code}
{quote}
IIUC, {{hammingDistance}} is a core property of the Bloom filters; whereas 
{{xorCardinality}} is how it can be provided, given some assumption (i.e. 
"BitSetI") of the underlying representation. The latter is, in OO terminology, 
an implementation detail that should not percolate to the public API. Moreover, 
in this case, the methods {{hammingValue}} and {{hammingDistance}} are clearly 
redundant, and one would readily ask: Why don't we just drop them?

> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-10-19 Thread Claude Warren (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955201#comment-16955201
 ] 

Claude Warren commented on COLLECTIONS-728:
---

BF_Func.md is a list of bloom filter functions that should be supported and 
what calculations they require.

I have added BloomFilterI2 (to distinguish it from BloomFilter) that is an 
interface that implements the methods via default implementations but leaves 2 
methods to be implemented.

IntStream stream(); // a stream of integers that are the indexes of the 
enabled bits.
BloomFilterI2 merge (BloomFilterI2 other );  // a Bloom filter  constructed 
by ORing the other bloom filter with this.



> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
> Attachments: BF_Func.md, BloomFilter.java, BloomFilterI2.java, 
> Usage.md
>
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-10-07 Thread Claude Warren (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16946214#comment-16946214
 ] 

Claude Warren commented on COLLECTIONS-728:
---

Looking at the Log, distance, and some of the other functions perhaps it makes 
more sense for the CountingBloomFilter not to extend StandardBloomFilter and to 
move the statistical methods into a class of static methods.  This would remove 
some overhead from the CountingBloomFilter.

> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-10-07 Thread Claude Warren (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16946183#comment-16946183
 ] 

Claude Warren commented on COLLECTIONS-728:
---

BloomCollectionStatistics should  be called BloomFilterGatedStatistics as that 
is where the requirement for the method getStats() comes from.  I will change 
that.

I have changed the BloomCollectionConfiguration from using a 
StandardBloomFilter for the gate to returning a BloomFilter for the gate.  
(Good catch, thank you) 

BloomCollectionConfiguration should be called BloomFilterGatedConfiguration as 
it applies to the BloomFilterGated interface, not specifically to the 
collection.  I will change that.





> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-10-07 Thread Gilles Sadowski (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16945857#comment-16945857
 ] 

Gilles Sadowski commented on COLLECTIONS-728:
-

bq. other implementations of BloomFilter.

Would those reuse none of the current implementation ("StandardBloomFilter")?
Is "BloomFilter" an abstract _concept_ or is it tied to a reference 
implementation?
IOW, would all the expected contributions be a subclass (with overridden 
methods)? E.g. if a "BloomFilter" always comes with a notion of "getLog", then 
a _protected_ "getApproximateLog" would allow customizing it.
You should really post (one post per such issue) to the "dev" ML to collect 
more opinions on whether an _interface_ is the preferred approach (based on the 
answers to the above questions).

bq. build() methods: They are not "necessary".

My point. ;-)
Ditto: preferences should be asked on the "dev" ML.
IMHO, the only (?) advantage (saving 2 keystrokes) is not worth almost doubling 
that class' API.

bq. BloomCollection is not just an example.

OK.
Still, it is not clear to me whether it's worth creating a "collection" 
subpackage (again, a question for the "dev" ML).  Can the classes in that 
package be used independently of {{BloomCollection}}? E.g. just by their name
* BloomCollectionConfiguration
* BloomCollectionStatistics

it seems that the above classes could (?) be static nested classes.

Also I suspect there is a relationship with the "interface" issue: In 
{{BloomCollectionConfiguration}} has a {{gate}} initialized with a 
{{StandardBloomFilter}} instance; thus, this higher-level utility assumes more 
than the {{BloomFilter}} API.  At first sight, this doesn't look right...


> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-10-07 Thread Claude Warren (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16945717#comment-16945717
 ] 

Claude Warren commented on COLLECTIONS-728:
---

I am working to clean up the issues that [~erans] raised with the following 
notes:

The BloomCollection can be used.  It works as described but it is limited  to 
wrapping Java Collection implementations.  I expect that this will be 
sufficient for many implementations but for very large scale implementations it 
may be desirable to have the BloomFilterGated implement on top of a larger 
datastore.  So the BloomCollection is not _just_ an example.

I can for see other implementations of BloomFilter.  This contribution contains 
2 (though Counting is derived from Standard) there are a number of other 
implementations identified in https://en.wikipedia.org/wiki/Bloom_filter .  I 
would expect to see some of those contributed in the future.

With respect to duplicate build() methods:  They are not "necessary".  They are 
convenience methods akin to the Java MessageDigest update() and digest() 
methods.  No, all the digest() methods in MessageDigest are not necessary but 
they are convenient.   Yes, MessageDigest could have been implemented with only 
the no argument digest() method.  The same can be said for the build() methods 
of the ProtoBloomFilter.Builder.  The use cases for both are much the same:  
Having a single method to build a digest/ProtoBloomFilter from a single data 
object (e.g. messageDigest..digest( "foo" ) or builder.build("foo"))

> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-10-01 Thread Gilles Sadowski (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16942393#comment-16942393
 ] 

Gilles Sadowski commented on COLLECTIONS-728:
-

Thanks for the removal of {{Serializable}}. :)

Class {{BloomCollectionStatistics}}
* enum {{CHANGE}} should be {{Change}} (and the _elements_ must be in 
uppercase).
* (nit) Javadoc:
** Line lengths in Javadoc seem a bit random.
** Ditto for alignment.
** Inconsistent use of "{@code ...}" (and I would avoid adding an "s" after the 
closing brace).
* Avoid use of "l" as a variable name.
* Method {{asInt(long)}} would be much clearer using if/else.  Also, might need 
a better name ({{clampToInteger}} (?))

Class {{BloomFilterConfiguration}}:
* Avoid use of primitive wrapper classes that entail unnecessary 
boxing/unboxing.
* Remove the multiplication by "1.0" (and unnecessary parentheses)
* (nit) Wrap long lines.

Some of the issues mentioned earlier:
* Add "final" for constants.
* Remove duplicate "build" methods (or please give an example of why they are 
necessary).
* (nit) Typos in Javadoc and trailing spaces (cf "git diff").

Is {{BloomCollection}} just an example (as stated in the Javadoc) or is it 
for actual use?
Do you foresee other implementations of the {{BloomFilter}} interface?  If not, 
we could rename {{StandardBloomFilter}} to {{BloomFilter}}...

> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (COLLECTIONS-728) BloomFilter contribution

2019-10-01 Thread Claude Warren (Jira)



[ 
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16942269#comment-16942269
 ] 

Claude Warren commented on COLLECTIONS-728:
---

Updated pull request to include latest common-collections4 code.  e.g. did a 
git pull from apache repository to sync with latest changes.

> BloomFilter contribution
> 
>
> Key: COLLECTIONS-728
> URL: https://issues.apache.org/jira/browse/COLLECTIONS-728
> Project: Commons Collections
>  Issue Type: Task
>Reporter: Claude Warren
>Priority: Minor
>
> Contribution of BloomFilter library comprising base implementation and gated 
> collections.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

53 matches

Mail list logo