[ 
https://issues.apache.org/jira/browse/LUCENE-8542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790576#comment-16790576
 ] 

Michael McCandless commented on LUCENE-8542:
--------------------------------------------

I think the core API change is quite minor and reasonable: it lets 
{{CollectorManager.newCollector}} know which segments (the slice) it will 
collect. E.g. we already pass the {{LeafReaderContext}} to 
{{Collector.getLeafCollector}}, so it is informed about the details of the 
segment it's about to collect.
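
To make the shape of that change concrete, here is a minimal sketch. It does not 
use Lucene classes: {{Slice}}, {{TopHits}}, {{Manager}} and {{cappedManager}} are 
hypothetical stand-ins invented for illustration, and the real patch may look 
different. It only shows how a slice-aware overload with a default 
implementation could stay backwards compatible while letting an 
implementation cap its queue size.

```java
// Hypothetical stand-ins for illustration only; these are NOT Lucene classes.
class SliceAwareSketch {

    /** Stand-in for a slice of segments: only the per-segment document counts. */
    static final class Slice {
        final int[] segmentDocCounts;

        Slice(int... segmentDocCounts) {
            this.segmentDocCounts = segmentDocCounts;
        }

        long totalDocs() {
            long sum = 0;
            for (int count : segmentDocCounts) {
                sum += count;
            }
            return sum;
        }
    }

    /** Stand-in for a top-hits collector; records only the queue size it would allocate. */
    static final class TopHits {
        final int queueSize;

        TopHits(int queueSize) {
            this.queueSize = queueSize;
        }
    }

    /** Stand-in for the manager interface, with an assumed slice-aware overload. */
    interface Manager<C> {
        C newCollector();

        // The default delegates to the old method, so existing implementations keep compiling.
        default C newCollector(Slice slice) {
            return newCollector();
        }
    }

    /** A manager that caps the queue size at the number of documents in the slice. */
    static Manager<TopHits> cappedManager(int numHits) {
        return new Manager<TopHits>() {
            @Override
            public TopHits newCollector() {
                return new TopHits(numHits);
            }

            @Override
            public TopHits newCollector(Slice slice) {
                return new TopHits((int) Math.min(numHits, slice.totalDocs()));
            }
        };
    }

    public static void main(String[] args) {
        Manager<TopHits> manager = cappedManager(5_000_000);
        // Small slice: the queue is capped at the slice's document count.
        System.out.println(manager.newCollector(new Slice(3_000, 4_500)).queueSize);
        // Large slice: the requested numHits is the limit.
        System.out.println(manager.newCollector(new Slice(20_000_000)).queueSize);
    }
}
```

A manager that does not override the slice-aware method behaves exactly as 
before, which is the backwards-compatibility property the issue asks about.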

I agree the motivating use case here is somewhat abusive, and a custom 
Collector is probably needed anyway, but I think this API change could help 
non-abusive cases too.

Alternatively, we could explore fixing our default top hits collectors so 
they do not pre-allocate the full topN for every slice. That is really 
unexpected behavior; users have tripped over it multiple times in the past, 
which led us to make some partial fixes for it.
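
As a rough illustration of that alternative, the sketch below keeps the top 
scores in a heap that grows only as hits arrive, instead of being filled up 
front. It is built on {{java.util.PriorityQueue}} rather than Lucene's own 
priority queue, and {{LazyTopScores}} is a hypothetical name; it is not how the 
default collectors are implemented, only one way the idea could look.

```java
import java.util.PriorityQueue;

// Hypothetical illustration; not a Lucene class.
class LazyTopScores {
    private final int numHits;
    // Min-heap of scores: the weakest retained hit sits at the head.
    // It starts small and grows on demand, so memory tracks the hits seen, not numHits.
    private final PriorityQueue<Float> heap = new PriorityQueue<>();

    LazyTopScores(int numHits) {
        if (numHits <= 0) {
            throw new IllegalArgumentException("numHits must be positive");
        }
        this.numHits = numHits;
    }

    /** Offers a score; keeps it only if it belongs to the current top numHits. */
    void collect(float score) {
        if (heap.size() < numHits) {
            heap.add(score);
        } else if (score > heap.peek()) {
            heap.poll();
            heap.add(score);
        }
    }

    /** Number of scores currently retained (never more than numHits). */
    int retained() {
        return heap.size();
    }

    /** Lowest score currently retained. */
    float weakestRetained() {
        if (heap.isEmpty()) {
            throw new IllegalStateException("no hits collected yet");
        }
        return heap.peek();
    }

    public static void main(String[] args) {
        // A huge numHits costs nothing until hits actually arrive.
        LazyTopScores small = new LazyTopScores(5_000_000);
        small.collect(1.0f);
        small.collect(3.0f);
        small.collect(2.0f);
        System.out.println(small.retained());
    }
}
```

The trade-off is that a queue growing on demand pays for resizing as it 
fills, whereas a pre-filled queue pays its full memory cost immediately even 
for a slice with only a few thousand documents.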

> Provide the LeafSlice to CollectorManager.newCollector to save memory on 
> small index slices
> -------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8542
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8542
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>            Reporter: Christoph Kaser
>            Priority: Minor
>         Attachments: LUCENE-8542.patch
>
>
> I have an index consisting of 44 million documents spread across 60 segments. 
> When I run a query against this index with a huge number of results requested 
> (e.g. 5 million), this query uses more than 5 GB of heap if the IndexSearcher 
> was configured to use an ExecutorService.
> (I know this kind of query is fairly unusual and it would be better to use 
> paging and searchAfter, but our architecture does not allow this at the 
> moment.)
> The reason for the huge memory requirement is that the search [will create a 
> TopScoreDocCollector for each 
> segment|https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java#L404],
>  each one with numHits = 5 million. This is fine for the large segments, but 
> many of those segments are fairly small and only contain several thousand 
> documents. This wastes a huge amount of memory for queries with large values 
> of numHits on indices with many segments.
> Therefore, I propose to change the CollectorManager interface in the 
> following way:
>  * change the method newCollector to accept a parameter LeafSlice that can be 
> used to determine the total count of documents in the LeafSlice
>  * Maybe, in order to remain backwards compatible, it would be possible to 
> introduce this as a new method with a default implementation that calls the 
> old method - otherwise, it probably has to wait for Lucene 8?
>  * This can then be used to cap numHits for each TopScoreDocCollector at the 
> number of documents in the LeafSlice.
> If this is something that would make sense for you, I can try to provide a 
> patch.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
