Re: [Collections] Suppliers, Iterables, and Producers

Brett Okken Sun, 28 Apr 2024 10:37:09 -0700

> There is no primitive specialisation
of Iterator for long

https://docs.oracle.com/javase/8/docs/api/java/util/PrimitiveIterator.OfLong.html




On Sun, Apr 28, 2024 at 10:00 AM Alex Herbert <alex.d.herb...@gmail.com>
wrote:

> Hi Gary,
>
> I am in favour of using nomenclature and patterns that will be familiar to
> a Java developer. But only if they match the familiar JDK use patterns. The
> Bloom filter package has some atypical use patterns that have driven the
> current API to where it is. I'll try and describe these below.
>
> On Sun, 28 Apr 2024 at 14:16, Gary Gregory <garydgreg...@gmail.com> wrote:
>
> > Hi Clause, Albert, and all,
> >
> > Since the introduction of lambdas in Java 8, Java has a well-defined
> > terminology around the classic producer-consumer paradigm but (for
> > reasons unknown to me) realized in the functional interfaces *Supplier
> > and *Consumer. In addition, as of Java 5, we have the Iterable
> > interface.
> >
> > In our new Bloom filter package we have new interfaces called
> > *Producer (as opposed to *Supplier), where some of these new
> > interfaces are formally annotated with @FunctionalInterface and some
> > not (for example, BloomFilterProducer).
> >
> > My question is: Why call these "Producers" instead of "Suppliers"? Is
> > the formal Bloom filter literature tied to the "Producer" terminology
> > in a way that would make adapting to the Java term confusing? I know I
> > brought up a similar topic recently, but I would like to revisit it
> > now that I've started to read Claude's blog drafts. Even without
> > making the current "Producers" formal suppliers by extending Supplier,
> > would it be worth using the Java terminology?
> >
>
> Claude is familiar with the literature and can comment on that. I would
> defer to the literature if it is a common term.
>
> There is one notable distinction to JDK suppliers. Suppliers only supply 1
> element and must be repeatedly called to generate more. The Producers in
> the BloomFilter package will supply multiple values. They are invoked using
> a forEach pattern with the intention of supplying all the elements to a
> predicate, not a consumer. If any of those elements is rejected by the
> predicate then the rest of the elements are not supplied. So this is a
> fail-fast bulk supplier.
>
>
> >
> > My second observation is that some might neither be "Producers" or
> > "Suppliers" but instead be extensions of Iterable. For example,
> > BitMapProducer is not a factory for instances of BitMap; the BitMap
> > does not appear in the signatures of BitMapProducer methods. From a
> > strict Java POV, this is (slightly) perplexing.
> >
>
> Iterable was suggested in an earlier API, particular for the IndexProducer.
> IIRC it was rejected on the basis of simplifying the code for the caller in
> the fail-fast case. Otherwise every user of the iterator must implement
> fail-fast loops over the elements. There may have been other reasons so it
> could be worth a check in the mailing list archives. It would require going
> back a few years but it was discussed on the dev list.
>
> The term BitMap refers to a long that holds 64-consecutive indices as
> either present or absent. You can consider the sequential bitmaps
> containing all indices from [0, n) as the serialized state of a Bloom
> filter with n bits. This is essentially a BitSet as you can see from the
> SimpleBloomFilter implementation. This originally wrapped a BitSet; it was
> converted to directly implement the required read/write bit functionality
> on the grounds of performance (no memory reallocation; no index checks).
>
> We do not have a BitMap class since we use a long primitive. A rename would
> be to LongProducer causing a name clash with the JDK. Renaming to something
> else is possible but I believe BitMap is a term from the literature.
>
>
> >
> > Instead (forgetting the class name issue for now), we could have:
> >
> > @FunctionalInterface
> > public interface BitMapProducer extends Iterable<LongPredicate> {...}
> >
> > Which would let implementations define:
> >
> > Iterator<LongPredicate> iterator();
> >
> > Instead of:
> >
> > boolean forEachBitMap(LongPredicate predicate);
> >
>
> The BitMapProducer is not iterating LongPredicates. It is iterating longs
> to be accepted by a single LongPredicate. The boolean return allows
> signalling to stop the forEach loop. There is no primitive specialisation
> of Iterator for long. There is a Spliterator.OfLong but that bundles some
> other API that we do not wish to support, namely parallel streaming via
> split and the ability to advance element by element (tryAdvance). Currently
> we only implement the equivalent of the forEachRemaining pattern from
> Spliterator. That accepts a consumer and so fail-fast would be done via
> raising a runtime exception. Given that fail-fast is a key feature of a
> Bloom filter then we do not want this to be implemented via exceptions.
>
> The primary use case for fail-fast is to stop as soon as a bit index is
> found, or not found (case dependent). Consider a Bloom filter that has 20
> indices per hashed item. You have populated the filter with items, each has
> 20 random indices. You then check if a new item is not contained in the
> filter by creating indices for the new item with your hash function and
> checking each index against those already in the filter. If your new
> element has an index not in the filter, then you have not seen this element
> before. This process can be done by creating all the indices by hashing as
> the first step, then comparing each to the filter. Or you can generate
> indices lazily, only creating the next index if the current one is present
> in the filter. This requires fail-fast index iteration and dynamic index
> generation by the hasher. You can walk through this looking at
> the EnhancedDoubleHasher. This is what takes a byte[] (capturing
> information on the item) and creates indices. But it only creates indices
> until it is triggered to stop.
>
> Note that the filter can only tell you with 100% confidence that an item
> has not been observed. It cannot tell you with 100% confidence it has been
> observed, as the random indices created by hashing the item may all be
> present in the collective indices from all items added to the filter. The
> Shape class has the standard computations to create a filter with enough
> bits to control the false-positive rate for the expected number of items
> you wish to store.
>
>
> >
> > Same comment for IndexProducer.
> > Same comment for BloomFilterProducer.
> > Is this too much Java-ness?
> >
> > CellConsumer looks like a Predicate, not a traditional Java *Consumer.
> > We have a specialization called LongBiPredicate so I propose we rename
> > and extract CellConsumer as IntBiPredicate.
> >
>
> Cell is a term from the literature. A cell is an index and a count. So
> although you are accepting two int values, they are actually only one cell.
> A rename to CellPredicate would be more appropriate to imitate
> Predicate<Cell> where the Cell is a pair of int primitives.
>
> In contrast the LongBiPredicate is used to accept two BitMaps that have
> been paired by iterating over two filters with the same shape. This is not
> an extensible design as you can only iterate over a pair of filters, and
> not n filters of the same shape. The use case is in standard set operations
> such as union and intersect which are used to generate metrics used in the
> literature (see the SetOperations class). This could be renamed to
> BitMapBiPredicate. However I think the current name was chosen to be in
> keeping with the JDK j.u.function package.
>
> Hope that helps set some background.
>
> Alex
>
>
>
> >
> > TY!
> > Gary
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> > For additional commands, e-mail: dev-h...@commons.apache.org
> >
> >
>

Re: [Collections] Suppliers, Iterables, and Producers

Reply via email to