jpountz opened a new issue, #12375:
URL: https://github.com/apache/lucene/issues/12375
### Description
It is a common need to run some logic after a segment has been collected.
Even though I can't find previous instances of this discussion, I'm pretty sure
that this has been raised several times in the past, and the answer was
essentially that this logic can easily be implemented on top of Lucene. One
good example of this is our own `FacetsCollector`, which collects the set of
matching docs per segment: `getLeafCollector` appends the set of doc IDs that
were collected on the previous segment to the set, and `getMatchingDocs` takes
care of the last segment, since `getLeafCollector` doesn't get called anymore
after the last segment has been collected.
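To make the pattern above concrete, here is a minimal sketch using only the JDK. It is not the actual `FacetsCollector` code: the class `DeferredFlushSketch`, its `SegmentDocs` holder and the `int`-based `getLeafCollector(int segment)` signature are hypothetical simplifications, and `Arrays.sort` merely stands in for the real per-segment finalization.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for the "flush the previous segment lazily" pattern.
// All names here are hypothetical; this is not Lucene's FacetsCollector.
class DeferredFlushSketch {
    // Hypothetical holder for the finalized doc IDs of one segment.
    static final class SegmentDocs {
        final int segment;
        final int[] sortedDocs;

        SegmentDocs(int segment, int[] sortedDocs) {
            this.segment = segment;
            this.sortedDocs = sortedDocs;
        }
    }

    private final List<SegmentDocs> matchingDocs = new ArrayList<>();
    private int currentSegment = -1; // -1 means "nothing pending"
    private int[] buffer = new int[4];
    private int size;

    // Moving to a new segment first finalizes the previous one.
    void getLeafCollector(int segment) {
        flushPending();
        currentSegment = segment;
        size = 0;
    }

    void collect(int doc) {
        if (size == buffer.length) {
            buffer = Arrays.copyOf(buffer, size * 2);
        }
        buffer[size++] = doc;
    }

    // Takes care of the last segment, since getLeafCollector is not
    // called again after it. Runs on whichever thread calls this method.
    List<SegmentDocs> getMatchingDocs() {
        flushPending();
        return matchingDocs;
    }

    private void flushPending() {
        if (currentSegment >= 0) {
            int[] docs = Arrays.copyOf(buffer, size);
            Arrays.sort(docs); // stands in for the potentially costly finalization
            matchingDocs.add(new SegmentDocs(currentSegment, docs));
            currentSegment = -1;
        }
    }
}
```

Note that in this sketch the last segment is only finalized when `getMatchingDocs()` is called, by whichever thread makes that call.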
However, this approach is not perfect. If you are leveraging Lucene's
concurrent search capabilities, this forces the post-collection logic to run in
the current thread for at least one segment per slice, instead of using the
executor. This is a missed opportunity for search concurrency, since
post-collection logic is not always cheap. For instance, in the case of
`FacetsCollector` it needs to run `DocIdSetBuilder.build()` which may need to
sort a large array of doc IDs. Having a `LeafCollector.postCollect()` API or
something along these lines would help address this issue, as `postCollect()`
would get called on the `IndexSearcher`'s `executor`.
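As a rough illustration of the idea, the sketch below models such a hook with plain JDK classes. Everything in it is an assumption made for illustration: `SketchLeafCollector`, `SortingLeafCollector` and `search` are hypothetical names, each segment is treated as its own slice, and `postCollect()` is the proposed hook rather than an existing Lucene method.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

class PostCollectSketch {
    // Hypothetical interface; postCollect() is the hook proposed here.
    interface SketchLeafCollector {
        void collect(int doc);

        void postCollect();
    }

    static final class SortingLeafCollector implements SketchLeafCollector {
        private int[] docs = new int[4];
        private int size;
        int[] sorted;      // filled in by postCollect()
        String finishedOn; // name of the thread that ran postCollect()

        @Override
        public void collect(int doc) {
            if (size == docs.length) {
                docs = Arrays.copyOf(docs, size * 2);
            }
            docs[size++] = doc;
        }

        @Override
        public void postCollect() {
            sorted = Arrays.copyOf(docs, size);
            Arrays.sort(sorted); // the costly step now runs inside the task
            finishedOn = Thread.currentThread().getName();
        }
    }

    // Each inner array holds the matching docs of one segment.
    static List<SortingLeafCollector> search(int[][] segments, ExecutorService executor) {
        List<Future<SortingLeafCollector>> futures = new ArrayList<>();
        for (int[] segment : segments) {
            futures.add(executor.submit(() -> {
                SortingLeafCollector leaf = new SortingLeafCollector();
                for (int doc : segment) {
                    leaf.collect(doc);
                }
                leaf.postCollect(); // called right after the segment, on the executor thread
                return leaf;
            }));
        }
        List<SortingLeafCollector> results = new ArrayList<>();
        try {
            for (Future<SortingLeafCollector> future : futures) {
                results.add(future.get());
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return results;
    }
}
```

In this sketch the sort for every segment, including the last one, happens inside the submitted task, so the calling thread only waits on the futures.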
I looked at our collectors to get a sense of how many of them
could take advantage of a `postCollect()` hook and found the following ones:
- `org.apache.lucene.facet.FacetsCollector`
- `org.apache.lucene.search.grouping.BlockGroupingCollector`
- `org.apache.lucene.search.grouping.TermGroupFacetCollector`
- `org.apache.lucene.search.suggest.document.TopSuggestDocsCollector`
- `org.apache.lucene.search.CachingCollector`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]