Re: Cluster down for long time after zookeeper disconnection

2015-08-11 Thread danny teichthal
1. Erick, thanks, I agree that it is really serious, but I think that the 3
minutes in this case were not mandatory.
In my case it was a deadlock, which smells like some kind of bug.
One replica is waiting for the other to come up before it takes leadership,
while the other is waiting for the election results.
If I am able to reproduce it on 5.2.1, is it legitimate to file a JIRA
issue for that?

2. Regarding session timeouts, there's something about the configuration that I
don't understand.
If zkClientTimeout is set to 30 seconds, how come I see in the log that the
session expired after ~50 seconds?
Maybe I have a mismatch between the ZooKeeper and Solr configurations?

3. Returning to the question of the leaderVoteWait parameter, I have seen in a few
threads that it may be reduced to a minimum.
I'm not clear about the full meaning, but I understand that it is meant to
prevent loss of updates on cluster startup.
Can anyone confirm/clarify that?




Links for leaderVoteWait:

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201307.mbox/%3ccajt9wnhivirpn79kttcn8ekafevhhmqwkfl-+i16kbz0ogl...@mail.gmail.com%3E

http://qnalist.com/questions/4812859/waitforleadertoseedownstate-when-leader-is-down

Relevant part of my ZooKeeper conf:
tickTime=2000
initLimit=10
syncLimit=5
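Regarding question 2: one possible explanation is how the ZooKeeper server negotiates session timeouts. By default the server clamps the client's requested timeout to the range [2 * tickTime, 20 * tickTime] (minSessionTimeout / maxSessionTimeout, unless overridden in zoo.cfg). A minimal sketch of that clamping rule, with the numbers from this thread:

```python
def negotiated_session_timeout(requested_ms: int, tick_time_ms: int) -> int:
    """Clamp a requested ZooKeeper session timeout to the server's
    default bounds: [2 * tickTime, 20 * tickTime]."""
    lo, hi = 2 * tick_time_ms, 20 * tick_time_ms
    return max(lo, min(hi, requested_ms))

# With tickTime=2000 the server allows 4000..40000 ms:
print(negotiated_session_timeout(30_000, 2000))  # 30000 (within bounds)
print(negotiated_session_timeout(60_000, 2000))  # 40000 (clamped down)
```

So with tickTime=2000 a requested 30-second zkClientTimeout would not be clamped. Note also that the "have not heard from server in 48865ms" message is the client's locally measured silence before closing the socket (which includes reconnect attempts), so it need not equal the negotiated session timeout.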




Cluster down for long time after zookeeper disconnection

2015-08-10 Thread danny teichthal
Hi,
We are using SolrCloud with Solr 4.10.4.
Over the past week we encountered a problem where all of our servers
disconnected from the ZooKeeper cluster.
This by itself might be OK; the problem is that after reconnecting to ZooKeeper,
it looks like, for every collection, neither replica has a leader and both are
stuck in some kind of deadlock for a few minutes.

From what we understand:
One of the replicas assumes it will be the leader and at some point starts
to wait on leaderVoteWait, which is by default 3 minutes.
The other replica is stuck on this part of the code for a few minutes:
at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:957)
at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:921)
at org.apache.solr.cloud.ZkController.waitForLeaderToSeeDownState(ZkController.java:1521)
at org.apache.solr.cloud.ZkController.registerAllCoresAsDown(ZkController.java:392)

It looks like replica 1 waits for a leader to be registered in ZooKeeper,
but replica 2 is waiting for replica 1
(org.apache.solr.cloud.ShardLeaderElectionContext.waitForReplicasToComeUp).
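The circular wait described above can be modeled as a toy sketch (this is not Solr code; the names are made up): "replica 1" won't take leadership until it sees the other replica up, while "replica 2" won't publish itself until it sees a leader, so neither ever unblocks:

```python
import threading

# Toy model of the circular wait: neither event is ever set,
# so both threads just burn their full timeout.
leader_registered = threading.Event()
replica2_up = threading.Event()

def replica1():
    # waitForReplicasToComeUp analogue: block until replica 2 appears
    if replica2_up.wait(timeout=0.1):
        leader_registered.set()

def replica2():
    # waitForLeaderToSeeDownState analogue: block until a leader exists
    if leader_registered.wait(timeout=0.1):
        replica2_up.set()

t1 = threading.Thread(target=replica1)
t2 = threading.Thread(target=replica2)
t1.start(); t2.start(); t1.join(); t2.join()
print(leader_registered.is_set(), replica2_up.is_set())  # False False
```

In Solr the timeouts (leaderVoteWait on one side) eventually break the cycle, which is why the cluster recovers after minutes rather than hanging forever.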

We have 100 collections distributed over 3 pairs of Solr nodes. Each
collection has one shard with 2 replicas.
As I understand from the code and logs, all the collections are registered
synchronously, which means that we have to wait 3 minutes * the number of
collections for the whole cluster to come up. It could be more
than an hour!
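A rough back-of-envelope for the "more than an hour" claim, assuming the worst case where every collection's registration blocks for the full leaderVoteWait and registration is fully sequential per node pair (numbers taken from this thread):

```python
LEADER_VOTE_WAIT_S = 180        # default leaderVoteWait: 3 minutes
COLLECTIONS = 100
NODE_PAIRS = 3

per_pair = -(-COLLECTIONS // NODE_PAIRS)        # ceil(100 / 3) = 34
worst_case_min = per_pair * LEADER_VOTE_WAIT_S / 60
print(worst_case_min)  # 102.0 minutes -- indeed "more than an hour"
```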



1. We thought about lowering leaderVoteWait to solve the problem, but we
are not sure what the risks are.

2. The following thread is very similar to our case:
http://qnalist.com/questions/4812859/waitforleadertoseedownstate-when-leader-is-down.
Does anybody know if it is indeed a bug and if there's a related JIRA issue?

3. I see this in the logs before the reconnection: "Client session timed out,
have not heard from server in 48865ms for sessionid 0x44efbb91b5f0001,
closing socket connection and attempting reconnect". Does it mean that
there was a disconnection of over 50 seconds between Solr and ZooKeeper?


Thanks in advance for your kind answer


Re: Cluster down for long time after zookeeper disconnection

2015-08-10 Thread Alexandre Rafalovitch
Did you look at the release notes for Solr versions after your own?

I am pretty sure some similar issues were identified and/or resolved
in 5.x. It may not help if you cannot migrate, but it would at least
give confirmation and maybe a workaround for what you are facing.

Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/




Re: Cluster down for long time after zookeeper disconnection

2015-08-10 Thread Erick Erickson
I didn't see the ZK timeout you set (just skimmed). But if your ZooKeeper was
down _very_ temporarily, it may suffice to up the ZK timeout. The default
in the 4.10 time-frame (if I remember correctly) was 15 seconds, which has
proven to be too short in many circumstances.

Of course if your ZK was down for minutes this wouldn't help.

Best,
Erick
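For reference, a sketch of where these knobs live in a new-style (Solr 4.x/5.x) solr.xml. The values below are illustrative only, not recommendations, and should be checked against the reference guide for your Solr version:

```xml
<!-- solr.xml fragment: illustrative values only -->
<solrcloud>
  <!-- ZK session timeout Erick refers to; too short a value lets
       transient network blips expire the session -->
  <int name="zkClientTimeout">30000</int>
  <!-- how long a would-be leader waits for other replicas (ms);
       the 3-minute default discussed in this thread -->
  <int name="leaderVoteWait">180000</int>
</solrcloud>
```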


Re: Cluster down for long time after zookeeper disconnection

2015-08-10 Thread Erick Erickson
Not that I know of. With ZK as the one source of truth, dropping below quorum
is Really Serious, so having to wait 3 minutes or so for action to be taken
is the fallback.

Best,
Erick


Re: Cluster down for long time after zookeeper disconnection

2015-08-10 Thread danny teichthal
Hi Alexandre,
Thanks for your reply, I looked at the release notes.
There is one bug fix - SOLR-7503
https://issues.apache.org/jira/browse/SOLR-7503 – register cores
asynchronously.
It may reduce the registration time since it is done in parallel, but
still, 3 minutes (leaderVoteWait) is a long time to recover from a few
seconds of disconnection.
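A toy illustration of why asynchronous core registration (SOLR-7503) helps: sequential registration costs roughly the *sum* of the per-core waits, while parallel registration costs roughly the *max*. The names and timings here are made up for the sketch:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def register_core(name: str) -> str:
    time.sleep(0.05)            # stand-in for a blocking registration step
    return name

cores = [f"core{i}" for i in range(8)]

start = time.time()
with ThreadPoolExecutor(max_workers=len(cores)) as pool:
    done = list(pool.map(register_core, cores))
elapsed = time.time() - start    # ~0.05 s instead of ~8 * 0.05 s sequentially
```

Note this only collapses the waits across cores; a single election can still block for the full leaderVoteWait, which is why 3 minutes remains long even with the fix.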

Apart from that one, I don't see any bug fix that addresses the same
problem.
I am able to reproduce it on 4.10.4 pretty easily; I will also try it with
5.2.1 and see if it reproduces.

Anyway, since migrating to 5.2.1 is not an option for me in the short term,
I'm left with the question of whether reducing leaderVoteWait may help here,
and what the consequences may be.
If I understand correctly, there might be a chance of losing updates that
were made on the leader.
From my side, losing availability for 3 minutes is a lot worse.

I would really appreciate feedback on this.






Re: Cluster down for long time after zookeeper disconnection

2015-08-10 Thread danny teichthal
Erick, I assume you are referring to zkClientTimeout; it is set to 30
seconds. I also see these messages on the Solr side:
"Client session timed out, have not heard from server in 48865ms for
sessionid 0x44efbb91b5f0001, closing socket connection and attempting
reconnect."
So I'm not sure what the actual disconnection duration was, but it
could have been up to a minute.
We are working on finding the root cause of the network issues, but assuming
disconnections will always occur, are there any other options to overcome
this issue?


