[jira] [Commented] (SOLR-6810) Faster searching limited but high rows across many shards all with many hits

Shalin Shekhar Mangar (JIRA) Wed, 24 Dec 2014 08:48:03 -0800

    [ https://issues.apache.org/jira/browse/SOLR-6810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258369#comment-14258369 ]


Shalin Shekhar Mangar commented on SOLR-6810:
---------------------------------------------

Thanks Per. This is great. I'm still going through the patch in detail but I 
have a few questions and comments.

{code}
     * Algorithm
     * - Shard-queries 1) Ask, by forwarding the outer query, each shard for relevance of the (up to) #rows most relevant matching documents
     * - Find among those relevances the #rows highest global relevances
     * Note for each shard (S) how many entries (docs_among_most_relevant(S)) it has among the #rows globally highest relevances
     * - Shard-queries 2) Ask, by forwarding the outer query, each shard S for id and relevance of the (up to) #docs_among_most_relevant(S) most relevant matching documents
     * - Find among those id/relevances the #rows id's with the highest global relevances (lets call this set of id's X)
     * - Shard-queries 3) Ask, by sending id's, each shard to return the documents from set X that it holds
     * - Return the fetched documents to the client
{code}
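
To make the steps above concrete, here is a minimal Python simulation of the 
three rounds. It is only a sketch: the shards are plain in-memory dicts, and 
the function name, the data layout and the returned counter are all invented 
for illustration. None of it is taken from the patch.

```python
import heapq
from collections import Counter


def three_phase_top_ids(shards, rows):
    """Return (top_ids, ids_read) for a toy cluster.

    `shards` maps a shard name to {doc_id: score}. This layout is an
    assumption made for the sketch; it stands in for real shard responses.
    """
    # Shard-queries 1: every shard reports only the scores of its best hits.
    reported = []
    for name, hits in shards.items():
        for score in heapq.nlargest(rows, hits.values()):
            reported.append((score, name))

    # Keep the `rows` best scores globally and count them per shard
    # (docs_among_most_relevant(S) in the algorithm description).
    per_shard = Counter(name for _, name in heapq.nlargest(rows, reported))

    # Shard-queries 2: shard S returns id and score for its best per_shard[S] hits.
    candidates = []
    for name, k in per_shard.items():
        best = heapq.nlargest(k, shards[name].items(), key=lambda hit: hit[1])
        candidates.extend(best)

    # The set X: ids of the `rows` best candidates, best first.
    ranked = heapq.nlargest(rows, candidates, key=lambda hit: hit[1])
    top_ids = [doc_id for doc_id, _ in ranked]

    # Shard-queries 3 (fetching the stored documents for X) is omitted here.
    return top_ids, len(candidates)
```

In this toy model the number of ids read in round 2 never exceeds `rows`, 
whereas a flow that asks every shard for `fl=id,score` with the full `rows` 
reads up to `rows` ids per shard.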

Since dqa.forceSkipGetIds is always true for this new algorithm, computing the 
set X is not necessary; we can directly fetch all return fields from the 
individual shards and return the response to the user. Is that correct?
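
The shortcut I am asking about can be sketched as follows. This is a toy 
Python model of the idea, not a description of what the patch does: the 
function name and the in-memory shard layout are hypothetical. After round 1 
we know how many of the global top #rows each shard holds, so round 2 could 
ask each shard for the full documents of that many top hits and skip the 
separate id round.

```python
import heapq
from collections import Counter


def fetch_without_id_round(shards, rows):
    """Return the global top `rows` documents in two rounds.

    `shards` maps a shard name to a list of dicts, each with "id", "score"
    and any stored fields. The layout is an assumption for this sketch.
    """
    by_score = lambda doc: doc["score"]

    # Round 1: each shard reports the scores of its (up to) `rows` best hits.
    reported = [
        (doc["score"], name)
        for name, docs in shards.items()
        for doc in heapq.nlargest(rows, docs, key=by_score)
    ]

    # Count how many of the `rows` globally best scores each shard holds.
    per_shard = Counter(name for _, name in heapq.nlargest(rows, reported))

    # Round 2: each shard returns the full documents for that many top hits.
    fetched = []
    for name, k in per_shard.items():
        fetched.extend(heapq.nlargest(k, shards[name], key=by_score))

    # Merge by score; no set X of ids is ever computed.
    return sorted(fetched, key=by_score, reverse=True)
```

The sketch assumes a shard returns the same ranking in both rounds; score ties 
or index changes between rounds would need extra care.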

I think the DefaultProvider and DefaultDefaultProvider aren't necessary? We can 
just keep a single static ShardParams.getDQA(SolrParams params) method and 
modify it if we ever need to change the default. If a user wants to change the 
default, the dqa can be set in the "defaults" section of the search handler.

Why do we need the switchToTestDQADefaultProvider() and 
switchToOriginalDQADefaultProvider() methods? You are already applying the DQA 
for each request so why is the switch necessary?

Did you benchmark it against the current algorithm for other kinds of use-cases 
as well (3-5 shards, small number of rows)? Not asking for ids could speed up 
responses there too, I think.

{quote}
"all with many hits" means that each of the shards have a significant number of 
hits on the query
{quote}

Unless I missed something, the algorithm is unaffected by how many docs the 
query hits on each shard?

> Faster searching limited but high rows across many shards all with many hits
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-6810
>                 URL: https://issues.apache.org/jira/browse/SOLR-6810
>             Project: Solr
>          Issue Type: Improvement
>          Components: search
>            Reporter: Per Steffensen
>            Assignee: Shalin Shekhar Mangar
>              Labels: distributed_search, performance
>         Attachments: branch_5x_rev1642874.patch, branch_5x_rev1642874.patch, 
> branch_5x_rev1645549.patch
>
>
> Searching "limited but high rows across many shards all with many hits" is 
> slow
> E.g.
> * Query from outside client: q=something&rows=1000
> * Resulting in sub-requests to each shard something a-la this
> ** 1) q=something&rows=1000&fl=id,score
> ** 2) Request the full documents with ids in the global-top-1000 found among 
> the top-1000 from each shard
> What does the subject mean
> * "limited but high rows" means 1000 in the example above
> * "many shards" means 200-1000 in our case
> * "all with many hits" means that each of the shards have a significant 
> number of hits on the query
> The problem grows on all three factors above
> Doing such a query on our system takes between 5 min to 1 hour - depending on 
> a lot of things. It ought to be much faster, so lets make it.
> Profiling show that the problem is that it takes lots of time to access the 
> store to get id’s for (up to) 1000 docs (value of rows parameter) per shard. 
> Having 1000 shards its up to 1 mio ids that has to be fetched. There is 
> really no good reason to ever read information from store for more than the 
> overall top-1000 documents, that has to be returned to the client.
> For further detail see mail-thread "Slow searching limited but high rows 
> across many shards all with high hits" started 13/11-2014 on 
> dev@lucene.apache.org



