Did you look at the release notes for Solr versions after your own? I am pretty sure some similar issues were identified and/or resolved in 5.x. That may not help if you cannot migrate, but it would at least give you confirmation of, and maybe a workaround for, what you are facing.
Regards,
   Alex.
----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/

On 10 August 2015 at 11:37, danny teichthal <dannyt...@gmail.com> wrote:
> Hi,
> We are using SolrCloud with Solr 4.10.4.
> Over the past week we encountered a problem where all of our servers
> disconnected from the ZooKeeper cluster.
> This might be OK; the problem is that after reconnecting to ZooKeeper, it
> looks like for every collection both replicas do not have a leader and are
> stuck in some kind of deadlock for a few minutes.
>
> From what we understand:
> One of the replicas assumes it will be the leader and at some point starts
> to wait on leaderVoteWait, which is 3 minutes by default.
> The other replica is stuck in this part of the code for a few minutes:
>     at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:957)
>     at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:921)
>     at org.apache.solr.cloud.ZkController.waitForLeaderToSeeDownState(ZkController.java:1521)
>     at org.apache.solr.cloud.ZkController.registerAllCoresAsDown(ZkController.java:392)
>
> Looks like replica 1 waits for a leader to be registered in ZooKeeper,
> but replica 2 is waiting for replica 1
> (org.apache.solr.cloud.ShardLeaderElectionContext.waitForReplicasToComeUp).
>
> We have 100 collections distributed across 3 pairs of Solr nodes. Each
> collection has one shard with 2 replicas.
> As I understand from the code and logs, all the collections are
> registered synchronously, which means that we have to wait 3 minutes *
> number of collections for the whole cluster to come up. It could be more
> than an hour!
>
> 1. We thought about lowering leaderVoteWait to solve the problem, but we
> are not sure what the risk is.
>
> 2. The following thread is very similar to our case:
> http://qnalist.com/questions/4812859/waitforleadertoseedownstate-when-leader-is-down.
> Does anybody know if it is indeed a bug and if there's a related JIRA issue?
>
> 3. I see this in the logs before the reconnection: "Client session timed out,
> have not heard from server in 48865ms for sessionid 0x44efbb91b5f0001,
> closing socket connection and attempting reconnect". Does it mean that
> there was a disconnection of over 50 seconds between Solr and ZooKeeper?
>
> Thanks in advance for your kind answer
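
For reference, the "3 minutes * number of collections" estimate in the quoted message can be sketched as a rough worst-case calculation. This is only an illustration, not Solr code. It assumes the collections are spread evenly over the node pairs, that each node registers its cores one after another, and that every core waits the full leaderVoteWait. The function name and the 30-second alternative value are hypothetical.

```python
import math


def worst_case_startup_minutes(collections, node_pairs, leader_vote_wait_ms=180_000):
    """Upper bound on startup time if every core on a node waits the full
    leaderVoteWait, one core after another (assumed serial registration)."""
    # One shard with two replicas per collection, so each node in a pair
    # hosts one core for roughly collections / node_pairs collections.
    cores_per_node = math.ceil(collections / node_pairs)
    return cores_per_node * leader_vote_wait_ms / 60_000


# 100 collections over 3 node pairs with the default 3-minute wait
print(worst_case_startup_minutes(100, 3))          # 102.0
# Same cluster with a hypothetical 30-second leaderVoteWait
print(worst_case_startup_minutes(100, 3, 30_000))  # 17.0
```

Under these assumptions the bound per node is about 102 minutes, which is consistent with the "more than an hour" figure above. The actual delay depends on how many cores really hit the full wait.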