LGTM. Maybe the current PR (LGTM) should be merged first, Alex, how does
that PR look to you?

Gary

On Fri, May 3, 2024, 11:44 AM Claude Warren <cla...@xenei.com> wrote:

> Gary and Alex,
>
> Any thoughts on this?
>
> Claude
>
> On Wed, May 1, 2024 at 7:55 AM Claude Warren <cla...@xenei.com> wrote:
>
>> Good suggestions.
>>
>> short-circuit. We could make this distinction by including it in the name:
>>> forEachUntil(Predicate ...), forEachUnless, ...
>>
>>
>> We need the unit name in the method name.  All Bloom filters implement
>> IndexProducer and BitmapProducer and since they use Predicate method
>> parameters they will conflict.
>>
>>
>> I have opened a ticket [1] with the list of tasks, which I think is now:
>>
>>    - Be clear that producers are like interruptible iterators with
>>    predicate tests acting as a switch to short-circuit the iteration.
>>    - Rename classes:
>>       - CellConsumer to CellPredicate (?)
>>       - Rename BitMap to BitMaps.
>>    - Rename methods:
>>       - Producer forEachX() to forEachUntil()
>>       - The semantic nomenclature:
>>       - Bitmaps are arrays of bits not a BitMaps object.
>>       - Indexes are ints and not an instance of a Collection object.
>>       - Cells are pairs of ints representing an index and a value.  They
>>       are not Pair<> objects.
>>       - Producers iterate over collections of the object (Bitmap, Index,
>>       Cell) applying a predicate to do work and stop the iteration early if
>>       necessary.  They are carriers/transporters of Bloom filter enabled 
>> bits.
>>       They allow us to query the contents of the Bloom filter in an
>>       implementation agnostic way.
>>
>>
>> In thinking about the term Producer, other terms could be used
>> Interrogator (sounds like you can add a query), Extractor might work.  But
>> it has also come to mind that there is a "compute" series of methods in the
>> ConcurrentMap class.  Perhaps the term we want is not "forEach", but
>> "process".  The current form of usage is something like:
>>
>> IndexProducer ip = ....
>> ip.forEachIndex(idx -> someIntPredicate)
>>
>> We could change the name from XProducer to XProcessor, or XExtractor; and
>> the method to processXs.  So the above code would look like:
>>
>> IndexExtractor ix = ....
>> ix.processIndexs(idx -> someIntPredicate)
>>
>> another example
>>
>> BitMapExtractor bx = .....
>> bx.processBitMaps(bitmap -> someBitMapPredicate)
>>
>> Claude
>>
>> [1] https://issues.apache.org/jira/browse/COLLECTIONS-854
>>
>>
>> On Tue, Apr 30, 2024 at 4:51 PM Gary D. Gregory <ggreg...@apache.org>
>> wrote:
>>
>>>
>>>
>>> On 2024/04/30 14:33:47 Alex Herbert wrote:
>>> > On Tue, 30 Apr 2024 at 14:45, Gary D. Gregory <ggreg...@apache.org>
>>> wrote:
>>> >
>>> > > Hi Claude,
>>> > >
>>> > > Thank you for the detailed reply :-) A few comments below.
>>> > >
>>> > > On 2024/04/30 06:29:38 Claude Warren wrote:
>>> > > > I will see if I can clarify the javadocs and make things clearer.
>>> > > >
>>> > > > What I think I specifically heard is:
>>> > > >
>>> > > >    - Be clear that producers are fast fail iterators with predicate
>>> > > tests.
>>> > > >    - Rename CellConsumer to CellPredicate (?)
>>> > >
>>> > > Agreed (as suggested by Albert)
>>> > >
>>> > > >    - The semantic nomenclature:
>>> > > >       - Bitmaps are arrays of bits not a BitMap object.
>>> > > >       - Indexes are ints and not an instance of a Collection
>>> object.
>>> > > >       - Cells are pairs of ints representing an index and a
>>> value.  They
>>> > > >       are not Pair<> objects.
>>> > > >       - Producers iterate over collections of the object (Bitmap,
>>> Index,
>>> > > >       Cell) applying a predicate to do work and stop the iteration
>>> early
>>> > > if
>>> > > >       necessary.  They are carriers/transporters of Bloom filter
>>> enabled
>>> > > bits.
>>> > > >       They allow us to query the contents of the Bloom filter in an
>>> > > >       implementation agnostic way.
>>> > >
>>> > > As you say naming is hard. The above is a great example and a good
>>> > > exercise I've gone through at work and in other FOSS projects:
>>> "Producers
>>> > > iterate over collections of the object...". In general when I see or
>>> write
>>> > > a Javadoc of the form "Foo bars" or "Runners walk" or "Walkers run",
>>> you
>>> > > get the idea ;-) I know that either the class (or method) name is
>>> bad or
>>> > > the Javadoc/documentation is bad; not _wrong_, just bad in the sense
>>> that
>>> > > it's confusing (to me).
>>> > >
>>> > > I am not advocating for a specific change ATM but I want to discuss
>>> the
>>> > > option because it is possible the current name is not as good as it
>>> could
>>> > > be. It could end up as an acceptable compromise if we cannot use
>>> more Java
>>> > > friendly terms though.
>>> > >
>>> > > Whenever I see a class that implements a "forEach"-kind of method, I
>>> think
>>> > > "Iterable".
>>> > >
>>> >
>>> > Here we should think "Collection", or generally more than 1. In the
>>> Java
>>> > sense an Iterable is something you can walk through to the
>>> > end, possibly removing elements as you go using the Iterator
>>> interface. We
>>> > would not require supporting removal, and we want to control a
>>> > short-circuit. We could make this distinction by including it in the
>>> name:
>>> > forEachUntil(Predicate ...), forEachUnless, ...
>>>
>>> I really like the idea of have the method name reflect the short-circuit
>>> aspect of the operation!
>>>
>>> We do not have to invent a new method name though, Stream uses
>>> "takeWhile(Predicate)" in Java 9:
>>> https://docs.oracle.com/javase/9/docs/api/java/util/stream/Stream.html#takeWhile-java.util.function.Predicate-
>>>
>>> The question becomes whether this name make the class too close to a
>>> Stream and introduces some kind of confusion. If that's the case, then
>>> forEachUntil() is good.
>>>
>>> >
>>> >
>>> > >
>>> > > Note the difference with "Iterator", and I had to lookup the
>>> difference
>>> > > since the former implements "forEach" and the  later
>>> "forEachRemaining"!
>>> > > "Iterable" is also a factory of "Iterator"s.
>>> > >
>>> > > Should the Producers ever be implementations of Iterable or Iterator?
>>> > > Right now, the answer is no because of the short-circuit aspect of
>>> using a
>>> > > predicate. I'm not using the term fail-fast here because I don't
>>> think of
>>> > > the iteration being in error (please tell me if I'm wrong).
>>> > >
>>> > > If not iterable, then we should not use that name as part of the
>>> class
>>> > > name. Generally, the short-circuit aspect of Producers do not make a
>>> bad
>>> > > candidates for implementations of Iterable since it can throw
>>> (unchecked)
>>> > > exceptions. Different for call sites granted, but I'm just
>>> mentioning it
>>> > > for fun.
>>> > >
>>> >
>>> > I already mentioned throwing runtime exceptions for the short-circuit
>>> > functionality, and that it was ruled out on the basis of performance
>>> (given
>>> > a lot of short-circuiting is expected) and convenience for the caller.
>>> I
>>> > don't think we should go there. Design the API for the intended
>>> purpose,
>>> > and not push it into a box that is easily recognisable.
>>>
>>> Sounds good.
>>>
>>> >
>>> >
>>> > >
>>> > > So maybe there's nothing to do. I just want to be clear about it. For
>>> > > example, I think of "factory" and "producer" as synonyms but in this
>>> case,
>>> > > this is not a traditional application of the factory pattern.
>>> > >
>>> > > As an aside I can see that Producers would not be Streams out of the
>>> box
>>> > > because Stream#filter(Predicate) filters but does not short-circuit
>>> > > iteration. While Stream#findAny() and #findFirst() don't fit the
>>> > > short-circuit bill, we could implement a #findLast() in a Stream
>>> > > implementation. What I do not know is if Streams otherwise fit the
>>> bill of
>>> > > Bloom filters.
>>> > >
>>> >
>>> > In the case of small loops with few instructions per loop, the
>>> overhead of
>>> > Streams is significant. Unfortunately we don't have any performance
>>> tests
>>> > for this library, but I wouldn't change to Streams without knowing it
>>> does
>>> > not impact performance. Performance is a key feature of Bloom filters.
>>> > Otherwise you can achieve some of their functionality with conventional
>>> > collections.
>>>
>>> OK.
>>>
>>> >
>>> >
>>> > >
>>> > > >
>>> > > > Does that basically cover the confusion?   If there are better
>>> terms,
>>> > > let's
>>> > > > hash them out now before I update the javadocs.
>>> > > >
>>> > > > As an aside, Cells and Bitmaps are referenced in the literature.
>>> For the
>>> > > > most part the rest is made up out of whole cloth.  So we could
>>> change
>>> > > > "Producer" to something else but we would need a good name.
>>> > >
>>> > > We have a class called BitMap and methods that use "BitMap" in the
>>> same
>>> > > but I think I am more comfortable with the term reuse now.
>>> > >
>>> > > The question that remains is must it be public? Since the Javadoc
>>> mentions
>>> > > it is about indices and bit positions, could all these methods be
>>> moved to
>>> > > the package-private IndexUtils? My concern is to reduce the public
>>> and
>>> > > protected API surface we will have to support and keep for the
>>> lifetime of
>>> > > the 4.x code base.
>>> > >
>>> >
>>> > It was originally package-private and Claude requested it be made
>>> public
>>> > for general re-use.
>>> >
>>> > We could use the name BitMaps, in keeping with the pluralisation of
>>> classes
>>> > seen in the JDK when working with elements of a given type: Executors,
>>> > Collections, etc.
>>>
>>> Agreed! I really like the plural convention for these static utility
>>> classes.
>>>
>>> Gary
>>>
>>> >
>>> > Alex
>>> >
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
>>> For additional commands, e-mail: dev-h...@commons.apache.org
>>>
>>>
>>
>> --
>> LinkedIn: http://www.linkedin.com/in/claudewarren
>>
>
>
> --
> LinkedIn: http://www.linkedin.com/in/claudewarren
>

Reply via email to