[ https://issues.apache.org/jira/browse/SOLR-6810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258710#comment-14258710 ]
Shalin Shekhar Mangar commented on SOLR-6810:
---------------------------------------------

{quote}
Something to keep in mind for future optimizations: If we can use searcher leases, we know exactly which documents we need to retrieve from step 2 and can pass their ordinals in step 3. That would appear to represent another very large speedup... if you need doc 42 and 77 from a shard, you can get just those two docs instead of docs 1 through 77.

edit: either ordinals (positions in the ranked doc list) or internal lucene docids would work if we're using searcher leases.
{quote}

Maybe I missed something, but if we make sure that step 2 is executed on the same replicas as step 1 (which we would have to do for searcher leases anyway), then the query results should already be in the cache, and the ordinals in the ranked doc list are just the top N?

bq. Which begs the question: what are the downsides of using docValues for the ID field by default, and are those downsides enough to implement this alternate merge implementation? I'm not saying otherwise... just throwing it out there.

I don't know. I'll create a benchmark to experiment with these ideas. In any case, existing indexes whose ID field does not use docValues will also get a speedup with this new algorithm.

> Faster searching limited but high rows across many shards all with many hits
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-6810
>                 URL: https://issues.apache.org/jira/browse/SOLR-6810
>             Project: Solr
>          Issue Type: Improvement
>          Components: search
>            Reporter: Per Steffensen
>            Assignee: Shalin Shekhar Mangar
>              Labels: distributed_search, performance
>         Attachments: branch_5x_rev1642874.patch, branch_5x_rev1642874.patch,
> branch_5x_rev1645549.patch
>
>
> Searching "limited but high rows across many shards all with many hits" is
> slow.
> E.g.
> * Query from outside client: q=something&rows=1000
> * Resulting in sub-requests to each shard, something like this:
> ** 1) q=something&rows=1000&fl=id,score
> ** 2) Request the full documents with ids in the global-top-1000 found among
> the top-1000 from each shard
> What does the subject mean
> * "limited but high rows" means 1000 in the example above
> * "many shards" means 200-1000 in our case
> * "all with many hits" means that each of the shards has a significant
> number of hits on the query
> The problem grows with all three factors above.
> Doing such a query on our system takes between 5 minutes and 1 hour, depending
> on a lot of things. It ought to be much faster, so let's make it.
> Profiling shows that the problem is that it takes lots of time to access the
> store to get ids for (up to) 1000 docs (value of the rows parameter) per shard.
> With 1000 shards, that is up to 1 million ids that have to be fetched. There is
> really no good reason to ever read information from the store for more than the
> overall top-1000 documents that have to be returned to the client.
> For further detail see mail-thread "Slow searching limited but high rows
> across many shards all with high hits" started 13/11-2014 on
> dev@lucene.apache.org
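
To make the idea in the description concrete, here is a minimal Python sketch, not the attached patch and not Solr code. It assumes a hypothetical first phase in which each shard returns only the scores of its top hits (no stored ids), so the coordinator can pick the global top rows and then ask each shard for ids at just the winning ordinals. The names merge_top_scores and ordinals_per_shard, the 0-based ordinals, and the sample scores are all illustrative assumptions.

```python
import heapq
from collections import defaultdict


def merge_top_scores(shard_scores, rows):
    """Pick the global top `rows` hits from per-shard score lists.

    shard_scores maps a (hypothetical) shard name to its scores, sorted in
    descending order. Returns (score, shard, ordinal) tuples, best first,
    where ordinal is the 0-based position in that shard's ranked list.
    """
    candidates = (
        (score, shard, ordinal)
        for shard, scores in shard_scores.items()
        for ordinal, score in enumerate(scores)
    )
    return heapq.nlargest(rows, candidates, key=lambda c: c[0])


def ordinals_per_shard(winners):
    """Group the winning ordinals by shard, so that each shard only has to
    read stored ids for the documents that made the global top list."""
    wanted = defaultdict(list)
    for _score, shard, ordinal in winners:
        wanted[shard].append(ordinal)
    return {shard: sorted(ords) for shard, ords in wanted.items()}


# Illustrative data: three shards, rows=3.
shard_scores = {
    "shard1": [9.0, 4.0, 1.0],
    "shard2": [8.0, 7.0, 2.0],
    "shard3": [3.0, 2.5, 0.5],
}
winners = merge_top_scores(shard_scores, rows=3)
wanted = ordinals_per_shard(winners)
print(wanted)
```

In this toy run only three ids would be read from the store in total (one from shard1, two from shard2, none from shard3), instead of three per shard. Whether the ordinals stay valid between the two phases depends on both phases hitting the same searcher on the same replica, which is the point raised in the comment above.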