[ https://issues.apache.org/jira/browse/LUCENE-6828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14947390#comment-14947390 ]
Erick Erickson commented on LUCENE-6828: ---------------------------------------- Tom: This sounds like it can currently be handled by the export functionality (starting in 4.10), see: https://cwiki.apache.org/confluence/display/solr/Exporting+Result+Sets. Note two restrictions: 1> the fl list must all be docValues 2> the /export handler doesn't support distributed OOB. Some of the newer (5.2?) Streaming operations/classes remove having to deal with the second restriction yourself. And note one very important bit: By restricting to docValues, the export functionality does _not_ have to go to disk, uncompress docs etc., since all the returned fields can be read from memory. So the export target rate is 400K rows/second..... > Speed up requests for many rows > ------------------------------- > > Key: LUCENE-6828 > URL: https://issues.apache.org/jira/browse/LUCENE-6828 > Project: Lucene - Core > Issue Type: Improvement > Components: core/search > Affects Versions: 4.10.4, 5.3 > Reporter: Toke Eskildsen > Priority: Minor > Labels: memory, performance > > Standard relevance ranked searches for top-X results uses the HitQueue class > to keep track of the highest scoring documents. The HitQueue is a binary heap > of ScoreDocs and is pre-filled with sentinel objects upon creation. > Binary heaps of Objects in Java does not scale well: The HitQueue uses 28 > bytes/element and memory access is scattered due to the binary heap algorithm > and the use of Objects. To make matters worse, the use of sentinel objects > means that even if only a tiny number of documents matches, the full amount > of Objects is still allocated. > As long as the HitQueue is small (< 1000), it performs very well. If top-1M > results are requested, it performs poorly and leaves 1M ScoreDocs to be > garbage collected. > An alternative is to replace the ScoreDocs with a single array of packed > longs, each long holding the score and the document ID. This strategy > requires only 8 bytes/element and is a lot lighter on the GC. > Some preliminary tests has been done and published at > https://sbdevel.wordpress.com/2015/10/05/speeding-up-core-search/ > These indicate that a long[]-backed implementation is at least 3x faster than > vanilla HitDocs for top-1M requests. > For smaller requests, such as top-10, the packed version also seems > competitive, when the amount of matched documents exceeds 1M. This needs to > be investigated further. > Going forward with this idea requires some refactoring as Lucene is currently > hardwired to the abstract PriorityQueue. Before attempting this, it seems > prudent to discuss whether speeding up large top-X requests has any value? > Paging seems an obvious contender for requesting large result sets, but I > guess the two could work in tandem, opening up for efficient large pages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org