[jira] [Commented] (SOLR-5595) Distributed Sort: potential performance improvements & code readabiliity

Hoss Man (JIRA) Tue, 31 Dec 2013 10:40:23 -0800

    [ https://issues.apache.org/jira/browse/SOLR-5595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13859610#comment-13859610 ]


Hoss Man commented on SOLR-5595:
--------------------------------

Based on my understanding of the code, there are 3 major (overlapping) changes 
that could be made to improve and clarify Solr's distributed sorting code:

----

1) Leverage "fillFields"

The basic premise behind all of the work done in QueryComponent's 
"doFieldSortValues" method is summarized in this comment at the top of the 
method...

{noformat}
// The query cache doesn't currently store sort field values, and SolrIndexSearcher doesn't
// currently have an option to return sort field values.  Because of this, we
// take the documents given and re-derive the sort values.
{noformat}

While the query cache limitation certainly still holds, improvements at the 
IndexSearcher level now make it possible to request that the TopDocCollector 
also record the sort values for each doc it collects -- these are available in 
the FieldDoc objects returned.

SOLR-5463 is already taking advantage of this feature for cursor-based searching 
-- but that also bypasses the cache (for a variety of reasons).  If we enhance 
the query result cache to also preserve the sort values for each doc in the 
DocList, then the same "fillFields" feature could be used to pull back all of 
the sort values.

This would eliminate the need for most (roughly 90%) of the work currently done 
in doFieldSortValues.  It should also be much faster: we would be re-using the 
sort values already generated during the actual sorting, so we would not need 
to hit the index again to re-derive them.

----

2) Let "fillFields" provide the score if needed for sorting

Assuming we start using IndexSearcher's "fillFields" option, then we could 
probably simplify some of the logic in QueryComponent regarding sorting by 
score.  doFieldSortValues currently can't generate the score, so the 
coordinator has to ask for it explicitly in the fl so it can be used when 
merging.  These special edge cases could probably be removed, and the scores 
would come back along with the other sort values.
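
A small plain-JDK sketch of what "the score comes back with the other sort 
values" could look like.  Everything here is hypothetical (the class 
ScoreSortValueSketch, the SortType enum and the sortValues method are invented 
for illustration and are not Solr or Lucene API); it assumes a sort made of one 
long-valued field clause and one score clause, and shows that the score just 
occupies a slot in the per-doc row like any other clause, so no separate fl 
handling is needed.

```java
import java.util.Arrays;

// Hypothetical illustration only; not Solr/Lucene API.
class ScoreSortValueSketch {

    /** The kinds of sort clause this sketch knows about. */
    enum SortType { SCORE, LONG_FIELD }

    /** Builds one doc's row of sort values: one entry per sort clause, score included. */
    static Object[] sortValues(SortType[] sort, float score, long fieldValue) {
        Object[] row = new Object[sort.length];
        for (int i = 0; i < sort.length; i++) {
            if (sort[i] == SortType.SCORE) {
                row[i] = Float.valueOf(score);     // score is just another sort value
            } else {
                row[i] = Long.valueOf(fieldValue); // value of the sorted field
            }
        }
        return row;
    }

    public static void main(String[] args) {
        SortType[] sort = { SortType.LONG_FIELD, SortType.SCORE };
        System.out.println(Arrays.toString(sortValues(sort, 1.5f, 42L)));
    }
}
```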

----

3) Eliminate ShardDoc.sortFieldValues and use FieldDoc.fields

When a node is coordinating a distributed request, QueryComponent.mergeIds 
collects the docs returned by each shard into "ShardDoc" objects.  Each of 
these has a sortFieldValues property containing the full list of all sort 
values (of all docs returned by that shard) tacked on to it in a convoluted 
nested structure that makes very little sense when looking at the code.  But 
ShardDoc already extends FieldDoc, which has a "fields" array designed to store 
the sort fields.  If mergeIds just populated the "fields" of each ShardDoc 
based on the sort_values returned from the shard, then the mergeIds method 
could be a lot simpler and the code would be a lot clearer to read.  It should 
also be possible to eliminate most/all of ShardFieldSortedHitQueue and instead 
leverage the logic in FieldValueHitQueue directly.
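
As a rough sketch of the merge described above, assuming each shard doc simply 
carries its own row of sort values: the code below is plain JDK and does not 
use the real ShardDoc, FieldDoc or FieldValueHitQueue classes.  The names 
MergeIdsSketch, ShardDocSketch, compareFields and mergeIds are invented for 
illustration, and breaking ties on the doc id is an assumption of this sketch 
rather than what Solr does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical model of a coordinator merge; not the actual QueryComponent code.
class MergeIdsSketch {

    /** One doc from one shard, carrying only its own sort values. */
    static final class ShardDocSketch {
        final String id;
        final Object[] fields; // one entry per sort clause, like FieldDoc.fields
        ShardDocSketch(String id, Object... fields) { this.id = id; this.fields = fields; }
    }

    /** Compares two rows clause by clause; reverse[i] flips the order of clause i. */
    @SuppressWarnings({"unchecked", "rawtypes"})
    static int compareFields(Object[] a, Object[] b, boolean[] reverse) {
        for (int i = 0; i < reverse.length; i++) {
            int c = ((Comparable) a[i]).compareTo(b[i]);
            if (c != 0) return reverse[i] ? -c : c;
        }
        return 0;
    }

    /** Merges all shard responses and returns the ids of the first `rows` docs. */
    static List<String> mergeIds(List<List<ShardDocSketch>> shardResponses,
                                 boolean[] reverse, int rows) {
        PriorityQueue<ShardDocSketch> queue = new PriorityQueue<ShardDocSketch>(11, (x, y) -> {
            int c = compareFields(x.fields, y.fields, reverse);
            return c != 0 ? c : x.id.compareTo(y.id); // tie-break on id (sketch assumption)
        });
        for (List<ShardDocSketch> response : shardResponses) {
            queue.addAll(response);
        }
        List<String> ids = new ArrayList<String>();
        while (ids.size() < rows && !queue.isEmpty()) {
            ids.add(queue.poll().id);
        }
        return ids;
    }

    public static void main(String[] args) {
        // Sort: a long field ascending, then score descending.
        List<ShardDocSketch> shard1 = Arrays.asList(
            new ShardDocSketch("a", 10L, 2.0f), new ShardDocSketch("b", 30L, 1.0f));
        List<ShardDocSketch> shard2 = Arrays.asList(
            new ShardDocSketch("c", 10L, 3.0f), new ShardDocSketch("d", 20L, 0.5f));
        boolean[] reverse = { false, true };
        System.out.println(mergeIds(Arrays.asList(shard1, shard2), reverse, 3));
    }
}
```

Because every doc owns its row of values, the comparison needs no lookup into a 
per-shard nested list, which is the readability gain argued for above.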


> Distributed Sort: potential performance improvements & code readabiliity
> ------------------------------------------------------------------------
>
>                 Key: SOLR-5595
>                 URL: https://issues.apache.org/jira/browse/SOLR-5595
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Hoss Man
>
> A lot of the work Solr currently does for dealing with distributed sorting 
> was built based on older limitations in Lucene that no longer exist.  There 
> are opportunities to simplify the code significantly, which should result in 
> a speedup -- the biggest blocker at this point is some caching-related 
> questions.
> I'll post my specific thoughts in a comment.
> (This is inspired by some things I noticed working on SOLR-5463 - I didn't 
> want to convolute that issue with these performance improvement ideas, which 
> could be dealt with separately.)



