[ 
https://issues.apache.org/jira/browse/SOLR-6810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259130#comment-14259130
 ] 

Yonik Seeley edited comment on SOLR-6810 at 12/26/14 4:33 PM:
--------------------------------------------------------------

bq. Maybe I missed something but if we make sure that step 2 is executed on the 
same replicas as step 1 (which we would have to do for searcher leases anyway) 
then the query results should already be in the cache and the ordinals in the 
ranked doc list are just the top N?

When a different searcher is used (because of a commit) the ordinals could 
refer to different docs.
But this seems to lead to acceptable behavior (unlike using internal docids, 
which leads to catastrophic kinds of failures).
You may get a different doc than expected in the second phase, but it will 
still be highly ranked.
The failure modes (if you can call them that) when the index changes seem to be 
relatively equivalent to using external IDs.
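A toy sketch (hypothetical data, not Solr code) of why ordinal drift is benign 
for score sorting: if a commit slightly reorders the ranked list between phases, 
fetching by ordinal returns a different doc, but still one taken from the top of 
the new ranking.

```python
# Toy model (not Solr code): each "searcher" exposes a ranked doc list.
# Phase 1 records ordinals (positions) into the old searcher's ranking;
# phase 2 may run against a new searcher opened by a commit.

ranked_before_commit = ["doc7", "doc3", "doc9", "doc1", "doc5"]  # by score
ranked_after_commit  = ["doc7", "doc9", "doc3", "doc8", "doc1"]  # reordered

wanted_ordinals = [0, 1, 2]  # global top-3 happened to live on this shard

# Phase 2 against the *same* searcher: exactly the docs from phase 1.
same = [ranked_before_commit[i] for i in wanted_ordinals]

# Phase 2 against a *new* searcher: possibly different docs, but each is
# still drawn from the top of the new ranking - highly ranked either way.
drifted = [ranked_after_commit[i] for i in wanted_ordinals]

print(same)     # ['doc7', 'doc3', 'doc9']
print(drifted)  # ['doc7', 'doc9', 'doc3']
```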

For straight sorting, the ordinals being requested would always be contiguous 
(as you say, top N when offset=0).

So if it's actually true that the behavior is comparable to the current 
strategy when the index changes between phases, we should consider changing 
implementations (as opposed to keeping the old implementation and making it 
configurable).

Benefits of new strategy:
- No need to retrieve external IDs on first phase (this is slow for stored 
fields, fast for docvalues)
- No need to return external IDs to the top-level searcher (reduced network 
traffic)
- Saves external ID -> internal docid lookup at the shard level on the last 
phase

Disadvantages of new strategy:
- Doesn't work well with non-standard sorts? (like a diversifying sort?)
- More sensitive to query-cache size (should be minor)
- Query re-execution when index changes (minor impact except for very high 
frequency commits?)
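The tradeoff above can be sketched as follows (a hypothetical coordinator, not 
Solr's actual shard protocol): under the old strategy phase 1 must ship external 
IDs and phase 2 sends IDs back for an ID -> docid lookup; under the new strategy 
phase 2 sends only ordinals into each shard's cached ranking.

```python
# Sketch (hypothetical shard API, not Solr's): contrast the two
# second-phase strategies for a distributed top-N query.

# Per-shard ranked results from phase 1, highest score first.
shard_results = {
    "shard1": [("idA", 9.1), ("idB", 7.2), ("idC", 5.0)],
    "shard2": [("idX", 8.5), ("idY", 6.9), ("idZ", 4.4)],
}
N = 3

# Coordinator merges by score; external IDs are only needed for the
# OLD strategy, ordinals suffice for the NEW one.
merged = sorted(
    ((score, shard, ordinal)
     for shard, docs in shard_results.items()
     for ordinal, (_id, score) in enumerate(docs)),
    reverse=True,
)[:N]

# OLD strategy: phase 2 sends each shard a list of external IDs, which
# the shard then resolves back to internal docids before fetching docs.
old_phase2 = {}
for score, shard, ordinal in merged:
    old_phase2.setdefault(shard, []).append(shard_results[shard][ordinal][0])

# NEW strategy: phase 2 sends only ordinals; the shard slices its cached
# (or re-executed) ranked list by position -- no ID lookup at all.
new_phase2 = {}
for score, shard, ordinal in merged:
    new_phase2.setdefault(shard, []).append(ordinal)

print(old_phase2)  # {'shard1': ['idA', 'idB'], 'shard2': ['idX']}
print(new_phase2)  # {'shard1': [0, 1], 'shard2': [0]}
```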

Everything I'm thinking of so far leads me to believe the new strategy should 
be the default.



> Faster searching limited but high rows across many shards all with many hits
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-6810
>                 URL: https://issues.apache.org/jira/browse/SOLR-6810
>             Project: Solr
>          Issue Type: Improvement
>          Components: search
>            Reporter: Per Steffensen
>            Assignee: Shalin Shekhar Mangar
>              Labels: distributed_search, performance
>         Attachments: branch_5x_rev1642874.patch, branch_5x_rev1642874.patch, 
> branch_5x_rev1645549.patch
>
>
> Searching "limited but high rows across many shards all with many hits" is 
> slow
> E.g.
> * Query from outside client: q=something&rows=1000
> * Resulting in sub-requests to each shard along these lines:
> ** 1) q=something&rows=1000&fl=id,score
> ** 2) Request the full documents for the ids in the global top-1000, found 
> among the top-1000 from each shard
> What the subject means:
> * "limited but high rows" means 1000 in the example above
> * "many shards" means 200-1000 in our case
> * "all with many hits" means that each of the shards has a significant 
> number of hits on the query
> The problem grows with all three factors above.
> Doing such a query on our system takes between 5 minutes and 1 hour, 
> depending on a lot of things. It ought to be much faster, so let's make it 
> faster.
> Profiling shows that the problem is that it takes a lot of time to access the 
> store to get IDs for (up to) 1000 docs (the value of the rows parameter) per 
> shard. With 1000 shards, that is up to 1 million IDs that have to be fetched. 
> There is really no good reason to ever read information from the store for 
> more than the overall top-1000 documents that have to be returned to the 
> client.
> For further detail see mail-thread "Slow searching limited but high rows 
> across many shards all with high hits" started 13/11-2014 on 
> dev@lucene.apache.org
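A back-of-envelope check of the numbers in the description above (values taken 
straight from the issue, not measured):

```python
# Arithmetic from the issue description: stored-field lookups needed just
# to get IDs in phase 1, versus documents actually returned to the client.
shards = 1000  # "many shards": 200-1000 in the reporter's case
rows = 1000    # rows=1000 from the outside client

id_fetches_today = shards * rows  # every shard reads IDs for its top-1000
docs_actually_needed = rows       # only the global top-1000 reach the client

print(id_fetches_today)      # 1000000
print(docs_actually_needed)  # 1000
```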


