I feel sorry for increasing the scope of all these requests for changes
that you make, but the way Elasticsearch overrides this collector feels
wrong to me as any change in the implementation details of this collector
would probably break Elasticsearch's collector too. In my opinion,
CollectedSearchGroup should not even be public. My preference would be to
copy this collector to the Elasticsearch code base and fold the changes
from Elasticsearch's CollapsingTopDocsCollector into it. I'm not super
familiar with this code, so I might be missing something. Maybe Jim or Alan
have an opinion.

On Thu, Oct 14, 2021 at 1:48 PM Chris Hegarty
<christopher.hega...@elastic.co.invalid> wrote:

> In an effort to prepare Elasticsearch for modularization, we are
> investigating and eliminating split packages. The situation has improved
> through recent refactoring in Lucene 9.0 [1], but a number of split
> packages still remain. This message identifies one such so that it can
> be discussed in isolation, with a view to a potential solution either in
> Lucene or possibly within Elasticsearch itself.
>
> Elasticsearch has a collapsing search collector that groups documents
> based on field values and collapses based on the top sorted documents,
> `CollapsingTopDocsCollector` [2]. The CTDC is a subclass of lucene's
> `FirstPassGroupingCollector` [3], and extends its functionality to
> get the top docs in just a single pass. As a subclass, the CTDC
> leverages the sorted top N unique groups by means of the protected
> `FPGC.orderedGroups` field (of type `TreeSet<CollectedSearchGroup<T>`),
> when performing the collapsing. Specifically, the
> `CollectedSearchGroup.topDoc` field is of interest in order to retrieve
> the number of the top document. The `topDoc` field is package-private
> and therefore not normally accessible to the CTDC (without resorting to
> nasty tricks!).
>
> Given that lucene's publicly extensible FPGC exposes the `orderedGroups`
> as a set of `CollectedSearchGroup`, and that CSG is a public class, it
> would appear that the lack of public access to its state is likely an
> oversight, rather than a deliberate design decision (otherwise, from
> outside the package CSG adds no apparent value and appears as if a
> marker interface, which is not useful to any subclasses).
>
> Minimally, the elasticsearch collector requires read access to the
> `CollectedSearchGroup.topDoc` field. This could be achieved by adding
> an accessor method to CSG, that returns the primitive int doc value.  But
> this whole API seems fairly low-level and powerful (you should know what
> you're doing - experts only!). Also, CSG's superclass, `SearchGroup`,
> makes its state available through public fields, so maybe we could just
> make CSG's state public too?
>
>
> lucene/grouping/src/java/org/apache/lucene/search/grouping/CollectedSearchGroup.java
>
>  public class CollectedSearchGroup<T> extends SearchGroup<T> {
> -  int topDoc;
> -  int comparatorSlot;
> +
> +  /** The number of the top document. */
> +  public int topDoc;
> +
> +  /** The field comparator slot. */
> +  public int comparatorSlot;
>  }
>
> -Chris.
>
> [1] https://issues.apache.org/jira/browse/LUCENE-9319
> [2]
> https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/apache/lucene/search/grouping/CollapseTopFieldDocs.java
> [3]
> https://github.com/apache/lucene/blob/main/lucene/grouping/src/java/org/apache/lucene/search/grouping/FirstPassGroupingCollector.java
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>

-- 
Adrien

Reply via email to