I agree, we should have a SinglePassGroupingCollector in Elasticsearch and
reduce the visibility of these expert classes in Lucene.
As it stands today, the FirstPassGroupingCollector could be a final class
imo.


Le jeu. 14 oct. 2021 à 18:42, Adrien Grand <jpou...@gmail.com> a écrit :

> I feel sorry for increasing the scope of all these requests for changes
> that you make, but the way Elasticsearch overrides this collector feels
> wrong to me as any change in the implementation details of this collector
> would probably break Elasticsearch's collector too. In my opinion,
> CollectedSearchGroup should not even be public. My preference would be to
> copy this collector to the Elasticsearch code base and fold the changes
> from Elasticsearch's CollapsingTopDocsCollector into it. I'm not super
> familiar with this code, so I might be missing something. Maybe Jim or Alan
> have an opinion.
>
> On Thu, Oct 14, 2021 at 1:48 PM Chris Hegarty
> <christopher.hega...@elastic.co.invalid> wrote:
>
>> In an effort to prepare Elasticsearch for modularization, we are
>> investigating and eliminating split packages. The situation has improved
>> through recent refactoring in Lucene 9.0 [1], but a number of split
>> packages still remain. This message identifies one such so that it can
>> be discussed in isolation, with a view to a potential solution either in
>> Lucene or possibly within Elasticsearch itself.
>>
>> Elasticsearch has a collapsing search collector that groups documents
>> based on field values and collapses based on the top sorted documents,
>> `CollapsingTopDocsCollector` [2]. The CTDC is a subclass of lucene's
>> `FirstPassGroupingCollector` [3], and extends its functionality to
>> get the top docs in just a single pass. As a subclass, the CTDC
>> leverages the sorted top N unique groups by means of the protected
>> `FPGC.orderedGroups` field (of type `TreeSet<CollectedSearchGroup<T>`),
>> when performing the collapsing. Specifically, the
>> `CollectedSearchGroup.topDoc` field is of interest in order to retrieve
>> the number of the top document. The `topDoc` field is package-private
>> and therefore not normally accessible to the CTDC (without resorting to
>> nasty tricks!).
>>
>> Given that lucene's publicly extensible FPGC exposes the `orderedGroups`
>> as a set of `CollectedSearchGroup`, and that CSG is a public class, it
>> would appear that the lack of public access to its state is likely an
>> oversight, rather than a deliberate design decision (otherwise, from
>> outside the package CSG adds no apparent value and appears as if a
>> marker interface, which is not useful to any subclasses).
>>
>> Minimally, the elasticsearch collector requires read access to the
>> `CollectedSearchGroup.topDoc` field. This could be achieved by adding
>> an accessor method to CSG, that returns the primitive int doc value.  But
>> this whole API seems fairly low-level and powerful (you should know what
>> you're doing - experts only!). Also, CSG's superclass, `SearchGroup`,
>> makes its state available through public fields, so maybe we could just
>> make CSG's state public too?
>>
>>
>> lucene/grouping/src/java/org/apache/lucene/search/grouping/CollectedSearchGroup.java
>>
>>  public class CollectedSearchGroup<T> extends SearchGroup<T> {
>> -  int topDoc;
>> -  int comparatorSlot;
>> +
>> +  /** The number of the top document. */
>> +  public int topDoc;
>> +
>> +  /** The field comparator slot. */
>> +  public int comparatorSlot;
>>  }
>>
>> -Chris.
>>
>> [1] https://issues.apache.org/jira/browse/LUCENE-9319
>> [2]
>> https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/apache/lucene/search/grouping/CollapseTopFieldDocs.java
>> [3]
>> https://github.com/apache/lucene/blob/main/lucene/grouping/src/java/org/apache/lucene/search/grouping/FirstPassGroupingCollector.java
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>>
>
> --
> Adrien
>

Reply via email to