[ https://issues.apache.org/jira/browse/LUCENE-6828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14946317#comment-14946317 ]
Toke Eskildsen commented on LUCENE-6828:
----------------------------------------

Shai: That all sounds right. Great idea with the custom collector test. I was aware of the sentinels being re-used within a search, but thank you for making sure. The garbage collection I am talking about happens between searches, as the HitQueues themselves are not re-used.

Regarding sentinels: they do seem to give a boost, compared to no-sentinels-but-still-objects, when the queue size is small and the number of hits is large. I have not investigated this in much detail, and I suspect it would help to visualize the performance of the different implementations with some graphs. As queue size, hits, threads and implementation are all relevant knobs to tweak, that task will have to wait a bit.

Ramkumar: Large result sets with grouping are very relevant for us. However, the current packed queue implementation only handles floats+docIDs. If the comparator key can be expressed as a numeric, it should be possible to get fast heap ordering (a numeric array to hold the keys and a parallel object array for the values, where the values themselves are only accessed upon export).

> Speed up requests for many rows
> -------------------------------
>
>                 Key: LUCENE-6828
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6828
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>    Affects Versions: 4.10.4, 5.3
>            Reporter: Toke Eskildsen
>            Priority: Minor
>              Labels: memory, performance
>
> Standard relevance-ranked searches for top-X results use the HitQueue class
> to keep track of the highest-scoring documents. The HitQueue is a binary heap
> of ScoreDocs and is pre-filled with sentinel objects upon creation.
> Binary heaps of Objects in Java do not scale well: the HitQueue uses 28
> bytes/element and memory access is scattered due to the binary heap algorithm
> and the use of Objects.
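As an aside, the floats+docIDs packing mentioned in the comment could be sketched roughly as below. This is a hypothetical illustration, not the actual patch: since Lucene scores are non-negative floats, the raw bit pattern from Float.floatToIntBits is order-preserving, so score and docID can share one long that is ordered by plain long comparison (higher score wins, ties broken by the lower docID, matching HitQueue's lessThan).

```java
// Hypothetical sketch of packing (score, docID) into a single long so
// that natural long ordering matches HitQueue's ordering. Names are
// illustrative and not part of the Lucene API.
public class PackedHit {
    // Non-negative float bit patterns compare like the floats themselves,
    // so the score goes in the high 32 bits. The docID is complemented so
    // that a LOWER docID yields a LARGER packed value (tie-breaking rule).
    static long pack(float score, int docId) {
        return ((long) Float.floatToIntBits(score) << 32)
                | ((~docId) & 0xFFFFFFFFL);
    }

    static float unpackScore(long packed) {
        return Float.intBitsToFloat((int) (packed >>> 32));
    }

    static int unpackDocId(long packed) {
        return ~((int) packed);
    }

    public static void main(String[] args) {
        long a = pack(0.5f, 42);
        long b = pack(0.7f, 7);   // higher score => larger packed value
        long c = pack(0.5f, 10);  // same score, lower docID => larger value
        System.out.println(b > a);                  // true
        System.out.println(c > a);                  // true
        System.out.println(unpackDocId(a) == 42);   // true
    }
}
```

A min-heap over such longs keeps the worst hit at the top, exactly as the ScoreDoc-based HitQueue does, but at 8 bytes/element with no per-element Objects.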
> To make matters worse, the use of sentinel objects
> means that even if only a tiny number of documents match, the full amount
> of Objects is still allocated.
> As long as the HitQueue is small (< 1000 entries), it performs very well. If top-1M
> results are requested, it performs poorly and leaves 1M ScoreDocs to be
> garbage collected.
> An alternative is to replace the ScoreDocs with a single array of packed
> longs, each long holding the score and the document ID. This strategy
> requires only 8 bytes/element and is a lot lighter on the GC.
> Some preliminary tests have been done and published at
> https://sbdevel.wordpress.com/2015/10/05/speeding-up-core-search/
> These indicate that a long[]-backed implementation is at least 3x faster than
> the vanilla HitQueue for top-1M requests.
> For smaller requests, such as top-10, the packed version also seems
> competitive when the number of matched documents exceeds 1M. This needs to
> be investigated further.
> Going forward with this idea requires some refactoring, as Lucene is currently
> hardwired to the abstract PriorityQueue. Before attempting this, it seems
> prudent to discuss whether speeding up large top-X requests has any value.
> Paging seems an obvious contender for requesting large result sets, but I
> guess the two could work in tandem, opening up for efficient large pages.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
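Postscript: the generalization suggested in the comment above (a numeric key array plus a parallel object array, where values are only touched on export) could look something like the following minimal sketch. The class name, methods and capacity-bounded insertWithOverflow semantics are assumptions for illustration, loosely modeled on Lucene's PriorityQueue, not actual Lucene code.

```java
// Hypothetical sketch: a bounded binary min-heap ordered by a primitive
// long[] key array, with a parallel Object[] for the values. Only the
// long[] is compared during sift-up/sift-down; values are carried along
// untouched until results are exported. Not part of the Lucene API.
public class NumericKeyHeap<V> {
    private final long[] keys;
    private final Object[] values;
    private int size;

    public NumericKeyHeap(int capacity) {
        keys = new long[capacity];
        values = new Object[capacity];
    }

    // Insert; once full, replace the current minimum if the key beats it.
    public void insertWithOverflow(long key, V value) {
        if (size < keys.length) {
            keys[size] = key;
            values[size] = value;
            siftUp(size++);
        } else if (key > keys[0]) {   // beats the current worst entry
            keys[0] = key;
            values[0] = value;
            siftDown(0);
        }
    }

    @SuppressWarnings("unchecked")
    public V popMin() {               // removes and returns the worst value
        V top = (V) values[0];
        size--;
        keys[0] = keys[size];
        values[0] = values[size];
        values[size] = null;
        siftDown(0);
        return top;
    }

    public int size() { return size; }

    private void siftUp(int i) {
        while (i > 0) {
            int parent = (i - 1) >>> 1;
            if (keys[i] >= keys[parent]) break;
            swap(i, parent);
            i = parent;
        }
    }

    private void siftDown(int i) {
        while (true) {
            int left = 2 * i + 1, smallest = i;
            if (left < size && keys[left] < keys[smallest]) smallest = left;
            int right = left + 1;
            if (right < size && keys[right] < keys[smallest]) smallest = right;
            if (smallest == i) break;
            swap(i, smallest);
            i = smallest;
        }
    }

    private void swap(int a, int b) {
        long k = keys[a]; keys[a] = keys[b]; keys[b] = k;
        Object v = values[a]; values[a] = values[b]; values[b] = v;
    }
}
```

Ordering on the primitive key array keeps the hot comparisons in a contiguous long[], while the parallel Object[] preserves access to the full values at export time.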