[jira] [Commented] (LUCENE-8542) Provide the LeafSlice to CollectorManager.newCollector to save memory on small index slices

Adrien Grand (JIRA) Tue, 12 Mar 2019 11:56:18 -0700


    [ https://issues.apache.org/jira/browse/LUCENE-8542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790882#comment-16790882 ]


Adrien Grand commented on LUCENE-8542:
--------------------------------------

Right, I get how it can help with small slices, but at the same time I see 
small slices as something that should be avoided in order to limit context 
switching, so I don't think we should design for small slices?

I think the JDK's PriorityQueue would be fine most of the time, except in the 
worst case where every new hit is competitive. This happens when more recently 
indexed documents are more likely to be competitive (e.g. when sorting by 
decreasing timestamp, which I see often). In that case, each single reordering 
via updateTop would need to be replaced with one pop and one push. Can we make 
our own PriorityQueue growable, maybe? I'm not too concerned about performance 
given that only add() would be affected, not top() and updateTop(), which are 
the ones that matter for performance.
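
To make the trade-off above concrete, here is a minimal sketch, not Lucene's 
actual org.apache.lucene.util.PriorityQueue. GrowableHeap and TopKDemo are 
hypothetical names invented for illustration, and scores are plain floats 
rather than ScoreDoc objects. It assumes a min-heap whose backing array starts 
small and doubles inside add(), so top() and updateTop() are unaffected by 
growth, and it contrasts one updateTop() per competitive hit with the poll() 
plus offer() pair that java.util.PriorityQueue would need.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

/** Hypothetical growable min-heap of scores; an illustration only, not Lucene code. */
final class GrowableHeap {
    private float[] heap;       // 1-based binary heap; heap[1] is the least element
    private int size;
    private final int maxSize;  // hard upper bound, e.g. the requested numHits

    GrowableHeap(int initialCapacity, int maxSize) {
        if (maxSize < 1) throw new IllegalArgumentException("maxSize must be >= 1");
        this.maxSize = maxSize;
        int capacity = Math.min(Math.max(initialCapacity, 1), maxSize);
        this.heap = new float[capacity + 1];
    }

    /** Adds a value; only this method pays for growing the backing array. */
    void add(float v) {
        if (size == maxSize) throw new IllegalStateException("heap is full");
        if (size + 1 == heap.length) {
            int newCapacity = (int) Math.min((long) maxSize, 2L * (heap.length - 1));
            heap = Arrays.copyOf(heap, newCapacity + 1);
        }
        heap[++size] = v;
        upHeap(size);
    }

    /** Least element; only meaningful when size() > 0. */
    float top() { return heap[1]; }

    /** Replaces the least element and restores heap order with a single sift-down. */
    float updateTop(float v) {
        heap[1] = v;
        downHeap(1);
        return heap[1];
    }

    int size() { return size; }
    int capacity() { return heap.length - 1; }

    private void upHeap(int i) {
        float v = heap[i];
        while (i > 1 && v < heap[i >>> 1]) {
            heap[i] = heap[i >>> 1];
            i >>>= 1;
        }
        heap[i] = v;
    }

    private void downHeap(int i) {
        float v = heap[i];
        while (true) {
            int child = i << 1;
            if (child > size) break;
            if (child < size && heap[child + 1] < heap[child]) child++;
            if (heap[child] >= v) break;
            heap[i] = heap[child];
            i = child;
        }
        heap[i] = v;
    }
}

/** Hypothetical driver; both methods assume scores is non-empty and k >= 1. */
final class TopKDemo {
    /** Lowest score among the k best, using one updateTop() per competitive hit. */
    static float kthBestWithUpdateTop(float[] scores, int k) {
        GrowableHeap heap = new GrowableHeap(16, k);
        for (float s : scores) {
            if (heap.size() < k) heap.add(s);
            else if (s > heap.top()) heap.updateTop(s);
        }
        return heap.top();
    }

    /** Same result with the JDK queue: one poll() plus one offer() per competitive hit. */
    static float kthBestWithJdkQueue(float[] scores, int k) {
        PriorityQueue<Float> pq = new PriorityQueue<>();
        for (float s : scores) {
            if (pq.size() < k) pq.offer(s);
            else if (s > pq.peek()) { pq.poll(); pq.offer(s); }
        }
        return pq.peek();
    }
}
```

With ascending input (the "every new hit is competitive" case), both methods 
return the same value, but the JDK variant performs two heap operations per 
hit where the sketch performs one.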

> Provide the LeafSlice to CollectorManager.newCollector to save memory on 
> small index slices
> -------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8542
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8542
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>            Reporter: Christoph Kaser
>            Priority: Minor
>         Attachments: LUCENE-8542.patch
>
>
> I have an index consisting of 44 million documents spread across 60 segments. 
> When I run a query against this index with a huge number of results requested 
> (e.g. 5 million), this query uses more than 5 GB of heap if the IndexSearcher 
> was configured to use an ExecutorService.
> (I know this kind of query is fairly unusual and it would be better to use 
> paging and searchAfter, but our architecture does not allow this at the 
> moment.)
> The reason for the huge memory requirement is that the search [will create a 
> TopScoreDocCollector for each 
> segment|https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java#L404],
>  each one with numHits = 5 million. This is fine for the large segments, but 
> many of those segments are fairly small and only contain several thousand 
> documents. This wastes a huge amount of memory for queries with large values 
> of numHits on indices with many segments.
> Therefore, I propose to change the CollectorManager interface in the 
> following way:
>  * change the method newCollector to accept a parameter LeafSlice that can be 
> used to determine the total count of documents in the LeafSlice
>  * Maybe, in order to remain backwards compatible, it would be possible to 
> introduce this as a new method with a default implementation that calls the 
> old method - otherwise, it probably has to wait for Lucene 8?
>  * This can then be used to cap numHits for each TopScoreDocCollector to the 
> number of documents in the LeafSlice.
> If this is something that would make sense for you, I can try to provide a 
> patch.
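
The proposal quoted above could look roughly like the following sketch. It is 
not the attached patch and does not use Lucene's real types: Slice, 
SketchCollectorManager, SketchTopDocsCollector and CappedManager are 
hypothetical stand-ins. It assumes the new overload is added as a default 
method that delegates to the old newCollector(), so existing implementations 
keep compiling, and that numHits is capped to the number of documents in the 
slice.

```java
/** Hypothetical stand-in for a slice of segments; NOT Lucene's LeafSlice. */
final class Slice {
    final int[] segmentDocCounts;  // number of documents in each segment of the slice

    Slice(int... segmentDocCounts) { this.segmentDocCounts = segmentDocCounts; }

    /** Total number of documents across all segments in this slice. */
    long docCount() {
        long sum = 0;
        for (int n : segmentDocCounts) sum += n;
        return sum;
    }
}

/** Hypothetical stand-in for CollectorManager, reduced to collector creation. */
interface SketchCollectorManager<C> {
    /** The existing method: no information about the slice is available. */
    C newCollector();

    /** The proposed overload; the default keeps old implementations working. */
    default C newCollector(Slice slice) {
        return newCollector();
    }
}

/** Hypothetical collector that only records how many hits it was sized for. */
final class SketchTopDocsCollector {
    final int numHits;

    SketchTopDocsCollector(int numHits) { this.numHits = numHits; }
}

/** Manager that caps the queue size to what the slice can actually produce. */
final class CappedManager implements SketchCollectorManager<SketchTopDocsCollector> {
    private final int numHits;

    CappedManager(int numHits) { this.numHits = numHits; }

    @Override
    public SketchTopDocsCollector newCollector() {
        return new SketchTopDocsCollector(numHits);
    }

    @Override
    public SketchTopDocsCollector newCollector(Slice slice) {
        // A slice cannot return more hits than it has documents.
        // Keep at least 1 so the sketch never asks for an empty queue.
        int capped = (int) Math.min(numHits, Math.max(1L, slice.docCount()));
        return new SketchTopDocsCollector(capped);
    }
}
```

For example, with numHits = 5 million, a slice holding 5,000 documents would 
get a collector sized for 5,000 hits, while a slice holding 44 million 
documents would still get the full 5 million.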



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

