[
https://issues.apache.org/jira/browse/LUCENE-8542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790882#comment-16790882
]
Adrien Grand commented on LUCENE-8542:
--------------------------------------
Right, I get how it can help with small slices, but at the same time I see
small slices as something to avoid in order to limit context switching, so I
don't think we should design for them.
I think the JDK's PriorityQueue would be fine most of the time, except in the
worst case where every new hit is competitive. That happens when more recently
indexed documents are more likely to be competitive (e.g. when sorting by
decreasing timestamp, which I see often), because a single reordering via
updateTop would have to be replaced with one pop and one push. Could we make
our own PriorityQueue growable instead? I'm not too concerned about
performance, given that only add() would be affected, not top() and
updateTop(), which are the ones that matter for performance.
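
To make the growable-queue idea concrete, here is a minimal sketch of a
min-heap whose backing array doubles inside add(), while top() and updateTop()
stay untouched by the growth logic. GrowableLongHeap and all of its members are
hypothetical names for illustration only; this is not Lucene's PriorityQueue,
and it stores plain longs rather than hits.

```java
import java.util.Arrays;

/** Hypothetical growable min-heap of longs; an illustration, not Lucene's PriorityQueue. */
final class GrowableLongHeap {
    private long[] heap; // 1-based layout: heap[1] is the smallest element
    private int size;

    GrowableLongHeap(int initialCapacity) {
        heap = new long[Math.max(1, initialCapacity) + 1];
    }

    int size() {
        return size;
    }

    /** Smallest element; only meaningful when size() > 0. */
    long top() {
        return heap[1];
    }

    /** The only method that pays for growth: doubles the array when it is full. */
    void add(long value) {
        if (size + 1 == heap.length) {
            heap = Arrays.copyOf(heap, heap.length * 2);
        }
        heap[++size] = value;
        siftUp(size);
    }

    /** Replaces the smallest element in place with a single sift-down (no pop + push). */
    long updateTop(long newValue) {
        heap[1] = newValue;
        siftDown(1);
        return heap[1];
    }

    private void siftUp(int i) {
        long v = heap[i];
        while (i > 1 && v < heap[i >>> 1]) {
            heap[i] = heap[i >>> 1]; // move the parent down
            i >>>= 1;
        }
        heap[i] = v;
    }

    private void siftDown(int i) {
        long v = heap[i];
        while (true) {
            int child = i << 1;
            if (child > size) break;
            if (child < size && heap[child + 1] < heap[child]) child++; // pick the smaller child
            if (heap[child] >= v) break;
            heap[i] = heap[child]; // move the smaller child up
            i = child;
        }
        heap[i] = v;
    }

    public static void main(String[] args) {
        GrowableLongHeap h = new GrowableLongHeap(2);
        for (long v : new long[] {5, 3, 8, 1}) h.add(v); // grows past the initial capacity
        System.out.println(h.top());
        System.out.println(h.updateTop(10));
    }
}
```

With java.util.PriorityQueue the equivalent of updateTop(10) would be a poll()
followed by an add(10), which is the extra cost described above for the case
where every new hit is competitive.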
> Provide the LeafSlice to CollectorManager.newCollector to save memory on
> small index slices
> -------------------------------------------------------------------------------------------
>
> Key: LUCENE-8542
> URL: https://issues.apache.org/jira/browse/LUCENE-8542
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/search
> Reporter: Christoph Kaser
> Priority: Minor
> Attachments: LUCENE-8542.patch
>
>
> I have an index consisting of 44 million documents spread across 60 segments.
> When I run a query against this index with a huge number of results requested
> (e.g. 5 million), this query uses more than 5 GB of heap if the IndexSearcher
> was configured to use an ExecutorService.
> (I know this kind of query is fairly unusual and it would be better to use
> paging and searchAfter, but our architecture does not allow this at the
> moment.)
> The reason for the huge memory requirement is that the search [will create a
> TopScoreDocCollector for each
> segment|https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java#L404],
> each one with numHits = 5 million. This is fine for the large segments, but
> many of those segments are fairly small and only contain several thousand
> documents. This wastes a huge amount of memory for queries with large values
> of numHits on indices with many segments.
> Therefore, I propose to change the CollectorManager interface in the
> following way:
> * change the method newCollector to accept a parameter LeafSlice that can be
> used to determine the total count of documents in the LeafSlice
> * Maybe, in order to remain backwards compatible, it would be possible to
> introduce this as a new method with a default implementation that calls the
> old method - otherwise, it probably has to wait for Lucene 8?
> * This can then be used to cap numHits for each TopScoreDocCollector to the
> number of documents in its LeafSlice.
> If this is something that would make sense for you, I can try to provide a
> patch.
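
The proposal quoted above could look roughly like the following sketch. It
assumes a new overload with a default implementation that falls back to the
existing no-argument method, and a manager that overrides it to cap the queue
size. SliceInfo, SliceAwareManager, TopHitsCollector and CappedTopHitsManager
are hypothetical stand-ins invented for this example; they are not Lucene's
LeafSlice, CollectorManager or TopScoreDocCollector, and this is not the
attached patch.

```java
/** Hypothetical stand-in for a slice of segments; not Lucene's LeafSlice. */
final class SliceInfo {
    final int[] leafMaxDocs; // document count of each segment in the slice

    SliceInfo(int... leafMaxDocs) {
        this.leafMaxDocs = leafMaxDocs;
    }

    long totalDocs() {
        long sum = 0; // long, so that many large segments cannot overflow
        for (int maxDoc : leafMaxDocs) sum += maxDoc;
        return sum;
    }
}

/** Hypothetical manager interface showing the backwards-compatible default method. */
interface SliceAwareManager<C> {
    C newCollector(); // the existing method

    /** The proposed addition: existing implementations keep working unchanged. */
    default C newCollector(SliceInfo slice) {
        return newCollector();
    }
}

/** Hypothetical collector that only records how many queue entries it would allocate. */
final class TopHitsCollector {
    final int capacity;

    TopHitsCollector(int capacity) {
        this.capacity = capacity;
    }
}

final class CappedTopHitsManager implements SliceAwareManager<TopHitsCollector> {
    private final int numHits;

    CappedTopHitsManager(int numHits) {
        this.numHits = numHits;
    }

    @Override
    public TopHitsCollector newCollector() {
        return new TopHitsCollector(numHits);
    }

    @Override
    public TopHitsCollector newCollector(SliceInfo slice) {
        // A slice can never return more hits than it has documents.
        // This sketch keeps at least one slot so the queue is never zero-sized.
        int capped = (int) Math.max(1, Math.min(numHits, slice.totalDocs()));
        return new TopHitsCollector(capped);
    }

    public static void main(String[] args) {
        CappedTopHitsManager manager = new CappedTopHitsManager(5_000_000);
        System.out.println(manager.newCollector(new SliceInfo(3_000, 2_000)).capacity);
        System.out.println(manager.newCollector(new SliceInfo(40_000_000)).capacity);
    }
}
```

In this sketch a slice holding 5,000 documents gets a 5,000-entry queue instead
of a 5-million-entry one, while a slice larger than numHits is unaffected.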