[ https://issues.apache.org/jira/browse/SOLR-14608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158674#comment-17158674 ]
ASF subversion and git services commented on SOLR-14608: -------------------------------------------------------- Commit ee9c3d083c60850c3e48d301356329cf0f017c86 in lucene-solr's branch refs/heads/jira/SOLR-14608-export from Joel Bernstein [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=ee9c3d0 ] SOLR-14608: Works with one string value sort field. > Faster sorting for the /export handler > -------------------------------------- > > Key: SOLR-14608 > URL: https://issues.apache.org/jira/browse/SOLR-14608 > Project: Solr > Issue Type: New Feature > Reporter: Joel Bernstein > Assignee: Andrzej Bialecki > Priority: Major > > The largest cost of the export handler is the sorting. This ticket will > implement an improved algorithm for sorting that should greatly increase > overall throughput for the export handler. > *The current algorithm is as follows:* > Collect a bitset of matching docs. Iterate over that bitset and materialize > the top level oridinals for the sort fields in the document and add them to > priority queue of size 30000. Then export the top 30000 docs, turn off the > bits in the bit set and iterate again until all docs are sorted and sent. > There are two performance bottlenecks with this approach: > 1) Materializing the top level ordinals adds a huge amount of overhead to the > sorting process. > 2) The size of priority queue, 30,000, adds significant overhead to sorting > operations. > *The new algorithm:* > Has a top level *merge sort iterator* that wraps segment level iterators that > perform segment level priority queue sorts. > *Segment level:* > The segment level docset will be iterated and the segment level ordinals for > the sort fields will be materialized and added to a segment level priority > queue. As the segment level iterator pops docs from the priority queue the > top level ordinals for the sort fields are materialized. Because the top > level ordinals are materialized AFTER the sort, they only need to be looked > up when the segment level ordinal changes. This takes advantage of the sort > to limit the lookups into the top level ordinal structures. This also > eliminates redundant lookups of top level ordinals that occur during the > multiple passes over the matching docset. > The segment level priority queues can be kept smaller than 30,000 to improve > performance of the sorting operations because the overall batch size will > still be 30,000 or greater when all the segment priority queue sizes are > added up. This allows for batch sizes much larger then 30,000 without using a > single large priority queue. The increased batch size means fewer iterations > over the matching docset and the decreased priority queue size means faster > sorting operations. > *Top level:* > A top level iterator does a merge sort over the segment level iterators by > comparing the top level ordinals materialized when the segment level docs are > popped from the segment level priority queues. This requires no extra memory > and will be very performant. > -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org