> There is no primitive specialisation of Iterator for long https://docs.oracle.com/javase/8/docs/api/java/util/PrimitiveIterator.OfLong.html
On Sun, Apr 28, 2024 at 10:00 AM Alex Herbert <alex.d.herb...@gmail.com> wrote: > Hi Gary, > > I am in favour of using nomenclature and patterns that will be familiar to > a Java developer. But only if they match the familiar JDK use patterns. The > Bloom filter package has some atypical use patterns that have driven the > current API to where it is. I'll try and describe these below. > > On Sun, 28 Apr 2024 at 14:16, Gary Gregory <garydgreg...@gmail.com> wrote: > > > Hi Clause, Albert, and all, > > > > Since the introduction of lambdas in Java 8, Java has a well-defined > > terminology around the classic producer-consumer paradigm but (for > > reasons unknown to me) realized in the functional interfaces *Supplier > > and *Consumer. In addition, as of Java 5, we have the Iterable > > interface. > > > > In our new Bloom filter package we have new interfaces called > > *Producer (as opposed to *Supplier), where some of these new > > interfaces are formally annotated with @FunctionalInterface and some > > not (for example, BloomFilterProducer). > > > > My question is: Why call these "Producers" instead of "Suppliers"? Is > > the formal Bloom filter literature tied to the "Producer" terminology > > in a way that would make adapting to the Java term confusing? I know I > > brought up a similar topic recently, but I would like to revisit it > > now that I've started to read Claude's blog drafts. Even without > > making the current "Producers" formal suppliers by extending Supplier, > > would it be worth using the Java terminology? > > > > Claude is familiar with the literature and can comment on that. I would > defer to the literature if it is a common term. > > There is one notable distinction to JDK suppliers. Suppliers only supply 1 > element and must be repeatedly called to generate more. The Producers in > the BloomFilter package will supply multiple values. They are invoked using > a forEach pattern with the intention of supplying all the elements to a > predicate, not a consumer. If any of those elements is rejected by the > predicate then the rest of the elements are not supplied. So this is a > fail-fast bulk supplier. > > > > > > My second observation is that some might neither be "Producers" or > > "Suppliers" but instead be extensions of Iterable. For example, > > BitMapProducer is not a factory for instances of BitMap; the BitMap > > does not appear in the signatures of BitMapProducer methods. From a > > strict Java POV, this is (slightly) perplexing. > > > > Iterable was suggested in an earlier API, particular for the IndexProducer. > IIRC it was rejected on the basis of simplifying the code for the caller in > the fail-fast case. Otherwise every user of the iterator must implement > fail-fast loops over the elements. There may have been other reasons so it > could be worth a check in the mailing list archives. It would require going > back a few years but it was discussed on the dev list. > > The term BitMap refers to a long that holds 64-consecutive indices as > either present or absent. You can consider the sequential bitmaps > containing all indices from [0, n) as the serialized state of a Bloom > filter with n bits. This is essentially a BitSet as you can see from the > SimpleBloomFilter implementation. This originally wrapped a BitSet; it was > converted to directly implement the required read/write bit functionality > on the grounds of performance (no memory reallocation; no index checks). > > We do not have a BitMap class since we use a long primitive. A rename would > be to LongProducer causing a name clash with the JDK. Renaming to something > else is possible but I believe BitMap is a term from the literature. > > > > > > Instead (forgetting the class name issue for now), we could have: > > > > @FunctionalInterface > > public interface BitMapProducer extends Iterable<LongPredicate> {...} > > > > Which would let implementations define: > > > > Iterator<LongPredicate> iterator(); > > > > Instead of: > > > > boolean forEachBitMap(LongPredicate predicate); > > > > The BitMapProducer is not iterating LongPredicates. It is iterating longs > to be accepted by a single LongPredicate. The boolean return allows > signalling to stop the forEach loop. There is no primitive specialisation > of Iterator for long. There is a Spliterator.OfLong but that bundles some > other API that we do not wish to support, namely parallel streaming via > split and the ability to advance element by element (tryAdvance). Currently > we only implement the equivalent of the forEachRemaining pattern from > Spliterator. That accepts a consumer and so fail-fast would be done via > raising a runtime exception. Given that fail-fast is a key feature of a > Bloom filter then we do not want this to be implemented via exceptions. > > The primary use case for fail-fast is to stop as soon as a bit index is > found, or not found (case dependent). Consider a Bloom filter that has 20 > indices per hashed item. You have populated the filter with items, each has > 20 random indices. You then check if a new item is not contained in the > filter by creating indices for the new item with your hash function and > checking each index against those already in the filter. If your new > element has an index not in the filter, then you have not seen this element > before. This process can be done by creating all the indices by hashing as > the first step, then comparing each to the filter. Or you can generate > indices lazily, only creating the next index if the current one is present > in the filter. This requires fail-fast index iteration and dynamic index > generation by the hasher. You can walk through this looking at > the EnhancedDoubleHasher. This is what takes a byte[] (capturing > information on the item) and creates indices. But it only creates indices > until it is triggered to stop. > > Note that the filter can only tell you with 100% confidence that an item > has not been observed. It cannot tell you with 100% confidence it has been > observed, as the random indices created by hashing the item may all be > present in the collective indices from all items added to the filter. The > Shape class has the standard computations to create a filter with enough > bits to control the false-positive rate for the expected number of items > you wish to store. > > > > > > Same comment for IndexProducer. > > Same comment for BloomFilterProducer. > > Is this too much Java-ness? > > > > CellConsumer looks like a Predicate, not a traditional Java *Consumer. > > We have a specialization called LongBiPredicate so I propose we rename > > and extract CellConsumer as IntBiPredicate. > > > > Cell is a term from the literature. A cell is an index and a count. So > although you are accepting two int values, they are actually only one cell. > A rename to CellPredicate would be more appropriate to imitate > Predicate<Cell> where the Cell is a pair of int primitives. > > In contrast the LongBiPredicate is used to accept two BitMaps that have > been paired by iterating over two filters with the same shape. This is not > an extensible design as you can only iterate over a pair of filters, and > not n filters of the same shape. The use case is in standard set operations > such as union and intersect which are used to generate metrics used in the > literature (see the SetOperations class). This could be renamed to > BitMapBiPredicate. However I think the current name was chosen to be in > keeping with the JDK j.u.function package. > > Hope that helps set some background. > > Alex > > > > > > > TY! > > Gary > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org > > For additional commands, e-mail: dev-h...@commons.apache.org > > > > >