Re: Cluster down for long time after zookeeper disconnection

Alexandre Rafalovitch Mon, 10 Aug 2015 08:57:19 -0700

Did you look at release notes for Solr versions after your own?

I am pretty sure some similar issues were identified and/or resolved
in 5.x. That may not help if you cannot migrate, but it would at least
confirm what you are facing and perhaps suggest a workaround.


Regards,
   Alex.
----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 10 August 2015 at 11:37, danny teichthal <dannyt...@gmail.com> wrote:
> Hi,
> We are using SolrCloud with Solr 4.10.4.
> In the past week we encountered a problem where all of our servers
> disconnected from the ZooKeeper cluster.
> That in itself might be OK; the problem is that after reconnecting to
> ZooKeeper, it looks like for every collection both replicas have no leader
> and are stuck in some kind of deadlock for a few minutes.
>
> From what we understand:
> One of the replicas assumes it will be the leader and at some point starts
> to wait on leaderVoteWait, which is 3 minutes by default.
> The other replica is stuck in this part of the code for a few minutes:
>         at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:957)
>         at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:921)
>         at org.apache.solr.cloud.ZkController.waitForLeaderToSeeDownState(ZkController.java:1521)
>         at org.apache.solr.cloud.ZkController.registerAllCoresAsDown(ZkController.java:392)
>
> It looks like replica 1 waits for a leader to be registered in ZooKeeper,
> while replica 2 is waiting for replica 1
> (org.apache.solr.cloud.ShardLeaderElectionContext.waitForReplicasToComeUp).
>
> We have 100 collections distributed in 3 pairs of Solr nodes. Each
> collection has one shard with 2 replicas.
> As I understand from the code and logs, all the collections are
> registered synchronously, which means that we have to wait 3 minutes *
> number of collections for the whole cluster to come up. It could be more
> than an hour!
>
>
>
> 1. We thought about lowering leaderVoteWait to solve the problem, but we
> are not sure what the risk is.
>
> 2. The following thread is very similar to our case:
> http://qnalist.com/questions/4812859/waitforleadertoseedownstate-when-leader-is-down.
> Does anybody know if it is indeed a bug and if there's a related JIRA issue?
>
> 3. I see this in the logs before the reconnection: "Client session timed out,
> have not heard from server in 48865ms for sessionid 0x44efbb91b5f0001,
> closing socket connection and attempting reconnect". Does it mean that
> there was a disconnection of over 50 seconds between Solr and ZooKeeper?
>
>
> Thanks in advance for your kind answer
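
To put rough numbers on the "3 minutes * number of collections" estimate
quoted above, here is a minimal sketch of the worst case. It assumes every
collection really does wait the full leaderVoteWait (180000 ms, the 3-minute
default mentioned in the message) and that registration is strictly serial.
The function name, the `parallelism` parameter, and the figure of about 34
collections per node pair (100 spread evenly over 3 pairs) are illustrative
assumptions, not anything taken from Solr itself.

```python
import math


def worst_case_startup_minutes(collections, leader_vote_wait_ms=180_000, parallelism=1):
    """Upper bound on startup time if each core waits the full leaderVoteWait.

    Hypothetical helper for back-of-the-envelope estimates only.
    parallelism=1 models the strictly serial registration described
    in the quoted message.
    """
    # Number of sequential "rounds" of waiting needed to cover all collections.
    rounds = math.ceil(collections / parallelism)
    # Convert the accumulated wait from milliseconds to minutes.
    return rounds * leader_vote_wait_ms / 60_000


if __name__ == "__main__":
    # All 100 collections registered one after another on a single node.
    print(worst_case_startup_minutes(100))
    # Roughly one node pair's share (assumed even split of 100 over 3 pairs).
    print(worst_case_startup_minutes(34))
    # Same share with leaderVoteWait lowered to 30 seconds.
    print(worst_case_startup_minutes(34, leader_vote_wait_ms=30_000))
```

Under these assumptions even a single node pair's share lands well over an
hour, which matches the estimate in the message. The bound scales linearly
with leaderVoteWait, so lowering it shortens the wait proportionally; the
sketch says nothing about the risk of doing so.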
