Thanks Daniel. 

That's exactly what I thought as well. I did try passing the distrib=false 
parameter and specifying the shards local to the initial server being invoked, 
and yes, it did localize the search to that server. Unfortunately, I didn't 
see any marked improvement in performance, as we have a very fast network 
(8 Gbit host bus adapters over Fibre Channel, etc.) and are backed by SSDs on 
a SAN. The only (and most painful) part of our scenario is that we fetch a lot 
of documents at once... upwards of 50,000... yes, not the ideal use case for 
Solr (to put it mildly).
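
For reference, the localized request I tested looked roughly like this (host, 
port and collection name here are illustrative, not our actual setup):

http://server1:8983/solr/mycollection/select?q=*:*&rows=50000&distrib=false&shards=server1:8983/solr

That kept the query on the first server, but as I said, the overall response 
time was essentially unchanged.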

The performance, as expected, is quite impressive even for such a large 
request when the inbound request is initially routed to the exact shard 
containing the documents (again, we use a shard key, so the compositeId router 
will be invoked). In that case I've noticed that no distributed search is 
performed... at least from my limited observation.
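
To be concrete, by "routed to the exact shard" I mean a request along these 
lines (values illustrative), where the query carries the same shard key prefix 
that was used at index time:

http://server1:8983/solr/mycollection/select?q=*:*&shard.keys=myshardkey!&rows=50000

When the receiving node hosts a replica of that target shard, I don't see the 
usual fan-out in its logs.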

Thanks again for your response.

Cheers.



>________________________________
> From: Daniel Collins <danwcoll...@gmail.com>
>To: Solr User <solr-user@lucene.apache.org> 
>Sent: Saturday, June 1, 2013 4:09 AM
>Subject: Re: Shard Keys and Distributed Search
> 
>
>Yes, it is doing a distributed search; SolrCloud will do that by default 
>unless you say distrib=false.
>
>My understanding of Solr's load balancer is that it picks a random instance 
>from the list of available instances serving each shard.
>So in your example:
>
>1. A query comes in to Server 1; Server 1 deconstructs it and works out which 
>shards it needs to query. It then gets a list (from ZK) of all the instances 
>in that collection which can service each shard, and the LB in Solr just 
>picks one (at random).
>2. It has picked Server 3 in your case, so the request goes there.
>3. The request is still a 2-stage process (in terms of what you see in the 
>logs): one query to get the docIds (using your query data), and then a second 
>"query" to get the stored fields once it has the correct list of docs (see 
>the example requests after this list). This is necessary because, in a 
>general multi-shard query, the responses have to go back to Server 1 and be 
>consolidated (not 100% sure of this area, but I believe this is true and it 
>makes logical sense to me). So if you had a query for 10 records that needed 
>to access 4 shards, it would ask for the "top 10" from each shard, 
>combine/sort them to get the overall "top 10", and then fetch the stored 
>fields for those 10 (which might be 5 from shard 1, 2 from shard 2, 3 from 
>shard 3, and nothing from shard 4, for example).
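>
>To make that two-stage flow concrete, the two requests show up in the logs 
>roughly like this (parameter values are illustrative, and Solr adds other 
>internal parameters I'm glossing over):
>
>  Stage 1 - collect the ids and scores from each shard:
>  /select?q=...&fl=id,score&start=0&rows=10&isShard=true&shard.url=...&distrib=false
>
>  Stage 2 - fetch the stored fields for the winning ids:
>  /select?ids=doc1,doc5,doc9&isShard=true&shard.url=...&distrib=false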
>
>You are right that it seems counter-intuitive from the user's perspective, 
>but I don't think SolrCloud currently has any logic to favour a local 
>instance over a remote one; I guess that would be a change to CloudSolrServer? 
>Alternatively, you can do it in your client: send a non-distributed query by 
>appending "distrib=false&shards=localhost:8983/solr,localhost:7574/solr".
>
>-----Original Message----- From: Niran Fajemisin
>Sent: Friday, May 31, 2013 5:00 PM
>To: Solr User
>Subject: Shard Keys and Distributed Search
>
>Hi all,
>
>I'm trying to make sure that I understand under what circumstances a 
>distributed search is performed against Solr, and whether my general 
>understanding of what constitutes a distributed search is correct.
>
>I have a Solr collection that was created using the Collections API with the 
>following parameters: numShards=5, maxShardsPerNode=5, replicationFactor=4. 
>Given that we have 4 servers, this results in a replica of each of the 5 
>shards on every server. All documents indexed into Solr have a shard key 
>specified as part of their document id, such that we can use the same shard 
>key prefix as part of our query by specifying: shard.keys=myshardkey!
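>
>As a concrete illustration (the id value is made up): a document indexed with 
>id "myshardkey!12345" is placed on the shard the compositeId router derives 
>by hashing the "myshardkey" prefix, and the query then asks for that same 
>prefix:
>
>  /select?q=*:*&shard.keys=myshardkey!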
>
>My assumption was that when the search request is submitted, given that my 
>deployment topology has all possible shards available on each server, there 
>would be no need to call out to other servers in the cluster to fulfill the 
>search. What I am noticing is the following:
>
>1. Submit a search to Server 1 with the shard.keys parameter specified. (Note 
>again that replicas for shards 1-5 are all available on Server 1.)
>2. The request is forwarded to a server other than Server 1, for example 
>Server 3.
>3. The /select request handler of Server 3 is invoked. This proceeds to 
>execute the /select request, asking for the id and score fields for each 
>document that matches the submitted query. I also noticed that it passes the 
>shard.url parameter but states that distrib=false.
>4. Then *another* request is executed on Server 3 against the /select request 
>handler *again*. This time the ids returned from the previous search are 
>passed in via the ids parameter.
>5. Finally the results are passed back to the caller through the original 
>server, Server 1.
>
>This appears to be a full-blown distributed search being performed. My 
>expectation was that the search would be localized to the original server 
>(Server 1 in the example used above), given that it *should* be able to 
>deduce that the current server has a replica that can fulfill the requested 
>search, or at the very least localize the search to the shards on Server 1 
>rather than going against the entire Solr cluster.
>
>My hope was that we would not have to go across the network, paying the 
>network transport penalty, for a search that could have been fulfilled by the 
>original Solr node when the shard.keys param is specified.
>
>Any insight that can be provided will be greatly appreciated.
>
>Thanks all! 
>