Thanks Erick,
I should add that commits are issued only by the loading program. 
We have turned off soft and autoCommits in the solrconfig.xml. 
This is so that when we upload, we move from one list of docs to the new list 
in one atomic operation (delete, add and then commit).
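
To illustrate the delete, add, commit sequence described above, here is a minimal sketch in Python that only builds the three update requests, using Solr's JSON update format. It is an illustration under assumptions, not our loader: the collection name, the "list_id" field, and the helper name build_swap_requests are hypothetical, and nothing is sent over the network.

```python
import json


def build_swap_requests(collection, delete_query, new_docs):
    """Return (path, json_body) pairs for a delete, add, commit sequence.

    Each body is meant to be POSTed to the collection's /update handler
    with Content-Type: application/json. This helper only builds the
    requests; it does not send them.
    """
    update_path = f"/solr/{collection}/update"
    return [
        # 1. Remove the old list of docs (delete-by-query).
        (update_path, json.dumps({"delete": {"query": delete_query}})),
        # 2. Add the new list of docs (a JSON array of documents).
        (update_path, json.dumps(new_docs)),
        # 3. Explicit commit issued by the loader.
        (update_path, json.dumps({"commit": {}})),
    ]


# Hypothetical collection, field name and values, for illustration only.
requests = build_swap_requests(
    "collection1",
    "list_id:old",
    [{"id": "doc-1", "list_id": "new"}],
)
for path, body in requests:
    print(path, body)
```

The idea is that queries only see the switch once the final commit opens a new searcher, provided no other commit opens one in between.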

I'll also add: This index holds 500,000,000 docs and under heavy uploading we 
get the nodes going into recovery. I'm presuming it's down to the commits being 
too far apart and causing the replication nodes to falter. This heavy upload is 
a small window of time and to get around this issue, I remove the replicas 
during this period and then add them back afterwards. The new recovery mode 
issue looks like it was down to heavy upload but outside the designated period.

So the most likely scenario is that I've created the issue with my tweaking. 
I hope you can point me in the right direction.


<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
</autoSoftCommit>

Regards

Russell Taylor



-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: 22 May 2019 16:45
To: solr-user@lucene.apache.org
Subject: Re: CloudSolrClient (any version). Find the node your query has 
connected to.

WARNING - External email from lucene.apache.org

OK, now we’re cooking with oil.

First, nodes in recovery shouldn’t make any difference to a query. They should 
not serve any part of a query so I think/hope that’s a red herring. At worst a 
node in recovery should pass the query on to another replica that is _not_ 
recovering.

When you’re looking at this, be aware that as long as _Solr_ is up and running 
on a node, it’ll accept queries. For simplicity let's say Solr1 hosts _only_ 
collection1_shard1_replica1 (cs1r1).

Now you fire a query at Solr1. It has the topology from ZooKeeper as well as 
its own internal knowledge of hosted replicas. For a top-level query it should 
send sub-queries out only to healthy replicas, bypassing its own recovering 
replica.

Let’s claim you fire the query at Solr2. First if there’s been time to 
propagate the down state of cs1r1 to ZooKeeper and Solr2 has the state, it 
shouldn’t even send a subrequest to cs1r1.

Now let’s say Solr2 hasn’t gotten the message yet and does send a query to 
cs1r1. cs1r1 should know its state is recovering and either return an error to 
Solr2 (which will pick a new replica to send that subrequest to) or forward it 
on to another healthy replica, I’m not quite sure which. In any case it should 
_not_ service the request from cs1r1.

If you do prove that a node that is really in recovery is serving requests, 
that’s a fairly serious bug and we need to know lots of details.


Second, even if you did have the URL Solr sends the query to it wouldn’t help. 
Once a Solr node receives a query, it does its _own_ round robin for a 
subrequest to one replica of each shard, gets the replies back, then goes back 
out to the same replica for the final documents. So you still wouldn’t know 
what replica served the queries.

The fact that you say things come back into sync after commit points to 
autocommit times. I’m assuming you have an autocommit setting that opens a new 
searcher (<openSearcher>true</openSearcher> in the “autoCommit” section or any 
positive time in the autoSoftCommit section of solrconfig.xml). These commit 
points will fire at different wall-clock times, resulting in replicas temporarily having 
different searchable documents. BTW, the same thing applies if you send 
“commitWithin” in a SolrJ cloudSolrClient.add command…
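
As a rough way to check the two conditions above, here is a minimal Python sketch that parses an updateHandler fragment and reports which automatic commits would open a new searcher. It is a sketch under assumptions: the function names are hypothetical, it only looks at maxTime (not maxDocs), it resolves ${name:default} placeholders to their defaults rather than to real system properties, and it treats a missing openSearcher as true. The sample fragment mirrors the settings quoted at the top of this thread.

```python
import re
import xml.etree.ElementTree as ET

# Matches property substitution with a default value: ${name:default}
PROP_WITH_DEFAULT = re.compile(r"\$\{[^:}]+:([^}]*)\}")


def resolve_default(text):
    """Replace ${name:default} with its default (ignores real properties)."""
    return PROP_WITH_DEFAULT.sub(lambda m: m.group(1), text.strip())


def searcher_opening_commits(update_handler_xml):
    """Return the names of auto-commit sections that would open a searcher."""
    root = ET.fromstring(update_handler_xml)
    found = []

    hard = root.find("autoCommit")
    if hard is not None:
        max_time = int(resolve_default(hard.findtext("maxTime", "-1")))
        # Assumption: a missing <openSearcher> is treated as true.
        opens = resolve_default(hard.findtext("openSearcher", "true")).lower() == "true"
        if max_time > 0 and opens:
            found.append("autoCommit")

    soft = root.find("autoSoftCommit")
    if soft is not None:
        # Any positive soft-commit time makes new documents searchable.
        if int(resolve_default(soft.findtext("maxTime", "-1"))) > 0:
            found.append("autoSoftCommit")

    return found


CONFIG = """
<updateHandler>
  <autoCommit>
    <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
  </autoSoftCommit>
</updateHandler>
"""

print(searcher_opening_commits(CONFIG))
```

An empty list means neither section opens a searcher on its own, so only explicit commits (or commitWithin) would change what is searchable.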

Anyway, if you just fire a query at a specific replica and add &distrib=false, 
the replica will bring back only documents from that replica. We’re talking the 
replica, so part of the URL will be the complete replica name like 
"…./solr/collection1_shard1_replica_n1/query?q=*:*&distrib=false"

A very quick test would be, when you have a replica in recovery, stop indexing 
and wait for your autocommit interval to expire (one that opens a new searcher) 
or issue a commit to the collection. My bet/hope is that your counts will be 
just fine. You can use the &distrib=false parameter to query each replica of 
the relevant shard directly…
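
The per-replica check described above could be scripted roughly as follows. This is a minimal sketch, not a definitive tool: the host names and the helper names are hypothetical, the core names follow the collection1_shard1_replica_n1 pattern from the example URL, and it assumes the /query handler returns the usual JSON with response.numFound.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def replica_query_url(base_url, core_name, query="*:*"):
    """Build a non-distributed count query URL for one replica (core)."""
    params = urlencode({"q": query, "distrib": "false", "rows": 0, "wt": "json"})
    return f"{base_url.rstrip('/')}/solr/{core_name}/query?{params}"


def fetch_num_found(url):
    """Run the query and return numFound from the JSON response."""
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)["response"]["numFound"]


def compare_replica_counts(replicas, fetch=fetch_num_found):
    """Query each (base_url, core_name) pair and report whether counts agree.

    Returns (counts, in_sync), where counts maps core name to numFound.
    The fetch argument can be swapped out, for example in a dry run.
    """
    counts = {
        core: fetch(replica_query_url(base_url, core))
        for base_url, core in replicas
    }
    in_sync = len(set(counts.values())) <= 1
    return counts, in_sync


if __name__ == "__main__":
    # Hypothetical hosts; replace with the nodes hosting the shard's replicas.
    replicas = [
        ("http://solr1.example:8983", "collection1_shard1_replica_n1"),
        ("http://solr2.example:8983", "collection1_shard1_replica_n2"),
    ]
    print(compare_replica_counts(replicas))
```

Run it once while a replica is recovering and again after indexing has stopped and a commit has gone through; differing counts in the second run would be the interesting case.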

Best,
Erick

> On May 22, 2019, at 8:09 AM, Russell Taylor <russell.tay...@theice.com> wrote:
>
> Hi Erick,
> Every time any of the replication nodes goes into recovery mode we start 
> seeing queries which don't match the correct count. I'm being told zookeeper 
> will give me the correct node (Not one in recovery), but I want to prove it 
> as the query issue only comes up when any of the nodes are in recovery mode. 
> The application loading the data shows the correct counts and after 
> committing we check the results and they look correct.
>
> If I can get the URL I can prove that the problem is due to doing the query 
> against a node in recovery mode.
>
> I hope that explains the problem, thanks for your time.
>
> Regards
>
> Russell Taylor
>
>
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: 22 May 2019 15:50
> To: solr-user@lucene.apache.org
> Subject: Re: CloudSolrClient (any version). Find the node your query has 
> connected to.
>
> Why do you want to know? You’ve asked how to do X without telling us what 
> problem Y you’re trying to solve (the XY problem) and frequently that leads 
> to a lot of wasted time…..
>
> Under the covers CloudSolrClient uses a pretty simple round-robin load 
> balancer to pick a Solr node to send the query to so “it depends”…..
>
>> On May 22, 2019, at 5:51 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>
>> You have to provide the addresses of the zookeeper ensemble - it will figure 
>> it out on its own based on information in Zookeeper.
>>
>>> Am 22.05.2019 um 14:38 schrieb Russell Taylor <russell.tay...@theice.com>:
>>>
>>> Hi,
>>> Using CloudSolrClient, how do I find the node (I have 3 nodes for this 
>>> collection on our 6 node cluster) the query has connected to.
>>> I'm hoping to get the full URL if possible.
>>>
>>>
>>> Regards
>>>
>>> Russell Taylor
>>>
>>>
>>>
>>> ________________________________
>>>
>>> This message may contain confidential information and is intended for 
>>> specific recipients unless explicitly noted otherwise. If you have reason 
>>> to believe you are not an intended recipient of this message, please delete 
>>> it and notify the sender. This message may not represent the opinion of 
>>> Intercontinental Exchange, Inc. (ICE), its subsidiaries or affiliates, and 
>>> does not constitute a contract or guarantee. Unencrypted electronic mail is 
>>> not secure and the recipient of this message is expected to provide 
>>> safeguards from viruses and pursue alternate means of communication where 
>>> privacy or a binding message is desired.


