I agree, we should have a SinglePassGroupingCollector in Elasticsearch and reduce the visibility of these expert classes in Lucene. As it stands today, the FirstPassGroupingCollector could be a final class imo.
Le jeu. 14 oct. 2021 à 18:42, Adrien Grand <jpou...@gmail.com> a écrit : > I feel sorry for increasing the scope of all these requests for changes > that you make, but the way Elasticsearch overrides this collector feels > wrong to me as any change in the implementation details of this collector > would probably break Elasticsearch's collector too. In my opinion, > CollectedSearchGroup should not even be public. My preference would be to > copy this collector to the Elasticsearch code base and fold the changes > from Elasticsearch's CollapsingTopDocsCollector into it. I'm not super > familiar with this code, so I might be missing something. Maybe Jim or Alan > have an opinion. > > On Thu, Oct 14, 2021 at 1:48 PM Chris Hegarty > <christopher.hega...@elastic.co.invalid> wrote: > >> In an effort to prepare Elasticsearch for modularization, we are >> investigating and eliminating split packages. The situation has improved >> through recent refactoring in Lucene 9.0 [1], but a number of split >> packages still remain. This message identifies one such so that it can >> be discussed in isolation, with a view to a potential solution either in >> Lucene or possibly within Elasticsearch itself. >> >> Elasticsearch has a collapsing search collector that groups documents >> based on field values and collapses based on the top sorted documents, >> `CollapsingTopDocsCollector` [2]. The CTDC is a subclass of lucene's >> `FirstPassGroupingCollector` [3], and extends its functionality to >> get the top docs in just a single pass. As a subclass, the CTDC >> leverages the sorted top N unique groups by means of the protected >> `FPGC.orderedGroups` field (of type `TreeSet<CollectedSearchGroup<T>`), >> when performing the collapsing. Specifically, the >> `CollectedSearchGroup.topDoc` field is of interest in order to retrieve >> the number of the top document. The `topDoc` field is package-private >> and therefore not normally accessible to the CTDC (without resorting to >> nasty tricks!). >> >> Given that lucene's publicly extensible FPGC exposes the `orderedGroups` >> as a set of `CollectedSearchGroup`, and that CSG is a public class, it >> would appear that the lack of public access to its state is likely an >> oversight, rather than a deliberate design decision (otherwise, from >> outside the package CSG adds no apparent value and appears as if a >> marker interface, which is not useful to any subclasses). >> >> Minimally, the elasticsearch collector requires read access to the >> `CollectedSearchGroup.topDoc` field. This could be achieved by adding >> an accessor method to CSG, that returns the primitive int doc value. But >> this whole API seems fairly low-level and powerful (you should know what >> you're doing - experts only!). Also, CSG's superclass, `SearchGroup`, >> makes its state available through public fields, so maybe we could just >> make CSG's state public too? >> >> >> lucene/grouping/src/java/org/apache/lucene/search/grouping/CollectedSearchGroup.java >> >> public class CollectedSearchGroup<T> extends SearchGroup<T> { >> - int topDoc; >> - int comparatorSlot; >> + >> + /** The number of the top document. */ >> + public int topDoc; >> + >> + /** The field comparator slot. */ >> + public int comparatorSlot; >> } >> >> -Chris. >> >> [1] https://issues.apache.org/jira/browse/LUCENE-9319 >> [2] >> https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/apache/lucene/search/grouping/CollapseTopFieldDocs.java >> [3] >> https://github.com/apache/lucene/blob/main/lucene/grouping/src/java/org/apache/lucene/search/grouping/FirstPassGroupingCollector.java >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org >> For additional commands, e-mail: dev-h...@lucene.apache.org >> >> > > -- > Adrien >