[ https://issues.apache.org/jira/browse/SOLR-6810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258369#comment-14258369 ]
Shalin Shekhar Mangar commented on SOLR-6810:
---------------------------------------------

Thanks Per. This is great. I'm still going through the patch in detail but I have a few questions and comments.

{code}
 * Algorithm
 * - Shard-queries 1) Ask, by forwarding the outer query, each shard for relevance of the (up to) #rows most relevant matching documents
 * - Find among those relevances the #rows highest global relevances
 * Note for each shard (S) how many entries (docs_among_most_relevant(S)) it has among the #rows globally highest relevances
 * - Shard-queries 2) Ask, by forwarding the outer query, each shard S for id and relevance of the (up to) #docs_among_most_relevant(S) most relevant matching documents
 * - Find among those id/relevances the #rows id's with the highest global relevances (lets call this set of id's X)
 * - Shard-queries 3) Ask, by sending id's, each shard to return the documents from set X that it holds
 * - Return the fetched documents to the client
{code}

Since dqa.forceSkipGetIds is always true for this new algorithm then computing the set X is not necessary and we can just directly fetch all return fields from individual shards and return the response to the user. Is that correct?

I think the DefaultProvider and DefaultDefaultProvider aren't necessary? We can just keep a single static ShardParams.getDQA(SolrParams params) method and modify it if we ever need to change the default. If a user wants to change the default, the dqa can be set in the "defaults" section of the search handler.

Why do we need the switchToTestDQADefaultProvider() and switchToOriginalDQADefaultProvider() methods? You are already applying the DQA for each request so why is the switch necessary?

Did you benchmark it against the current algorithm for other kinds of use-cases as well (3-5 shards, small number of rows)? Not asking for id can speed up responses there too I think.
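
For readers following along, the three rounds of shard queries quoted above can be sketched roughly as below. This is a toy in-memory model and not code from the patch: SketchDoc, SketchShard and DqaSketch are hypothetical names invented for illustration, a "shard" is just a sorted list, and no Solr API is used.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** One document as seen by the sketch: id, relevance for the query, stored fields. */
final class SketchDoc {
    final String id;
    final double score;
    final String fields;

    SketchDoc(String id, double score, String fields) {
        this.id = id;
        this.score = score;
        this.fields = fields;
    }
}

/** Hypothetical in-memory stand-in for a shard; not a Solr class. */
final class SketchShard {
    private final List<SketchDoc> best = new ArrayList<>(); // matching docs, best first

    SketchShard(List<SketchDoc> matching) {
        best.addAll(matching);
        best.sort(Comparator.comparingDouble((SketchDoc d) -> d.score).reversed());
    }

    /** Shard-queries 1: relevances only for the (up to) n best matches. */
    List<Double> topScores(int n) {
        List<Double> out = new ArrayList<>();
        for (SketchDoc d : best.subList(0, Math.min(n, best.size()))) {
            out.add(d.score);
        }
        return out;
    }

    /** Shard-queries 2: id and relevance for the (up to) n best matches, fields left out. */
    List<SketchDoc> topIdsAndScores(int n) {
        List<SketchDoc> out = new ArrayList<>();
        for (SketchDoc d : best.subList(0, Math.min(n, best.size()))) {
            out.add(new SketchDoc(d.id, d.score, null));
        }
        return out;
    }

    /** Shard-queries 3: the full document for an id, or null if this shard does not hold it. */
    SketchDoc fetch(String id) {
        for (SketchDoc d : best) {
            if (d.id.equals(id)) return d;
        }
        return null;
    }
}

final class DqaSketch {
    /** Returns the full documents for the rows globally most relevant matches, best first. */
    static List<SketchDoc> search(List<SketchShard> shards, int rows) {
        // Shard-queries 1: collect {score, shardIndex} pairs from every shard.
        List<double[]> seen = new ArrayList<>();
        for (int s = 0; s < shards.size(); s++) {
            for (double score : shards.get(s).topScores(rows)) {
                seen.add(new double[] {score, s});
            }
        }
        seen.sort((a, b) -> Double.compare(b[0], a[0]));

        // docs_among_most_relevant(S): entries per shard among the rows global best.
        int[] wanted = new int[shards.size()];
        for (double[] e : seen.subList(0, Math.min(rows, seen.size()))) {
            wanted[(int) e[1]]++;
        }

        // Shard-queries 2: ask each shard only for as many ids as it contributes.
        List<SketchDoc> candidates = new ArrayList<>();
        for (int s = 0; s < shards.size(); s++) {
            candidates.addAll(shards.get(s).topIdsAndScores(wanted[s]));
        }
        candidates.sort(Comparator.comparingDouble((SketchDoc d) -> d.score).reversed());
        List<SketchDoc> x = candidates.subList(0, Math.min(rows, candidates.size()));

        // Shard-queries 3: fetch the documents in X from whichever shard holds them.
        List<SketchDoc> result = new ArrayList<>();
        for (SketchDoc hit : x) {
            for (SketchShard shard : shards) {
                SketchDoc full = shard.fetch(hit.id);
                if (full != null) {
                    result.add(full);
                    break;
                }
            }
        }
        return result;
    }
}
```

In this toy model the per-shard counts from round 1 add up to at most rows, so round 2 never returns more than rows candidates in total. Whether the same holds for the patch (for example with tied scores or changing indexes) is not something this sketch can show.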
{quote}
"all with many hits" means that each of the shards have a significant number of hits on the query
{quote}

Unless I missed something, the algorithm has no effect with respect to how many docs are hit by query on each shard?

> Faster searching limited but high rows across many shards all with many hits
> ----------------------------------------------------------------------------
>
> Key: SOLR-6810
> URL: https://issues.apache.org/jira/browse/SOLR-6810
> Project: Solr
> Issue Type: Improvement
> Components: search
> Reporter: Per Steffensen
> Assignee: Shalin Shekhar Mangar
> Labels: distributed_search, performance
> Attachments: branch_5x_rev1642874.patch, branch_5x_rev1642874.patch,
> branch_5x_rev1645549.patch
>
>
> Searching "limited but high rows across many shards all with many hits" is
> slow
> E.g.
> * Query from outside client: q=something&rows=1000
> * Resulting in sub-requests to each shard something a-la this
> ** 1) q=something&rows=1000&fl=id,score
> ** 2) Request the full documents with ids in the global-top-1000 found among
> the top-1000 from each shard
> What does the subject mean
> * "limited but high rows" means 1000 in the example above
> * "many shards" means 200-1000 in our case
> * "all with many hits" means that each of the shards have a significant
> number of hits on the query
> The problem grows on all three factors above
> Doing such a query on our system takes between 5 min to 1 hour - depending on
> a lot of things. It ought to be much faster, so lets make it.
> Profiling show that the problem is that it takes lots of time to access the
> store to get id’s for (up to) 1000 docs (value of rows parameter) per shard.
> Having 1000 shards its up to 1 mio ids that has to be fetched. There is
> really no good reason to ever read information from store for more than the
> overall top-1000 documents, that has to be returned to the client.
> For further detail see mail-thread "Slow searching limited but high rows
> across many shards all with high hits" started 13/11-2014 on
> dev@lucene.apache.org

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)