Re: Cluster down for long time after zookeeper disconnection
1. Erick, thanks, I agree that it is really serious, but I think that the 3 minutes were not mandatory in this case. In my case it was a deadlock, which smells like some kind of bug: one replica is waiting for the other to come up before it takes leadership, while the other is waiting for the election results. If I am able to reproduce it on 5.2.1, is it legitimate to file a JIRA issue for that?
2. Regarding session timeouts, there's something about the configuration that I don't understand. If zkClientTimeout is set to 30 seconds, how come I see in the log that the session expired after ~50 seconds? Maybe I have a mismatch between the ZooKeeper and Solr configuration?
3. Returning to the question of the leaderVoteWait parameter, I have seen in a few threads that it may be reduced to a minimum. I'm not clear about its full meaning, but I understand that it is meant to prevent loss of updates on cluster startup. Can anyone confirm/clarify that?

Links for leaderVoteWait:
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201307.mbox/%3ccajt9wnhivirpn79kttcn8ekafevhhmqwkfl-+i16kbz0ogl...@mail.gmail.com%3E
http://qnalist.com/questions/4812859/waitforleadertoseedownstate-when-leader-is-down

Relevant part from my ZooKeeper conf:
tickTime=2000
initLimit=10
syncLimit=5

On Tue, Aug 11, 2015 at 1:06 AM, Erick Erickson erickerick...@gmail.com wrote: Not that I know of. With ZK as the one source of truth, dropping below quorum is Really Serious, so having to wait 3 minutes or so for action to be taken is the fallback. Best, Erick
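An editorial aside on question 2: by default, a ZooKeeper server clamps any client-requested session timeout between 2 * tickTime (minSessionTimeout) and 20 * tickTime (maxSessionTimeout). With the tickTime=2000 shown above, that window is 4 to 40 seconds, so a 30-second zkClientTimeout is honored as-is. A minimal sketch of that negotiation, assuming ZooKeeper's default min/max settings:

```python
# Sketch of ZooKeeper's session-timeout negotiation: the server clamps
# the client's requested timeout between minSessionTimeout (2 * tickTime)
# and maxSessionTimeout (20 * tickTime), unless overridden in zoo.cfg.

def negotiated_session_timeout(requested_ms, tick_time_ms=2000):
    min_timeout = 2 * tick_time_ms    # default minSessionTimeout
    max_timeout = 20 * tick_time_ms   # default maxSessionTimeout
    return max(min_timeout, min(requested_ms, max_timeout))

# With tickTime=2000, a zkClientTimeout of 30000 ms lies inside
# [4000, 40000], so the server grants the full 30 seconds.
print(negotiated_session_timeout(30000))  # 30000
# A 60-second request would be silently capped at 40 seconds.
print(negotiated_session_timeout(60000))  # 40000
```

Note also that the "48865ms" in the log line is the time since the last server response, measured across reconnect attempts, so it can legitimately exceed the negotiated 30-second timeout.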
Cluster down for long time after zookeeper disconnection
Hi, We are using SolrCloud with Solr 4.10.4. Over the past week we encountered a problem where all of our servers disconnected from the ZooKeeper cluster. That by itself might be OK; the problem is that after reconnecting to ZooKeeper, it looks like for every collection both replicas have no leader and are stuck in some kind of deadlock for a few minutes.

From what we understand: one of the replicas assumes it will be the leader and at some point starts to wait on leaderVoteWait, which is 3 minutes by default. The other replica is stuck on this part of the code for a few minutes:

at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:957)
at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:921)
at org.apache.solr.cloud.ZkController.waitForLeaderToSeeDownState(ZkController.java:1521)
at org.apache.solr.cloud.ZkController.registerAllCoresAsDown(ZkController.java:392)

It looks like replica 1 waits for a leader to be registered in ZooKeeper, but replica 2 is waiting for replica 1 (org.apache.solr.cloud.ShardLeaderElectionContext.waitForReplicasToComeUp). We have 100 collections distributed over 3 pairs of Solr nodes. Each collection has one shard with 2 replicas. As I understand from the code and logs, all the collections are registered synchronously, which means we have to wait 3 minutes * number of collections for the whole cluster to come up. It could be more than an hour!

1. We thought about lowering leaderVoteWait to solve the problem, but we are not sure what the risk is.
2. The following thread is very similar to our case: http://qnalist.com/questions/4812859/waitforleadertoseedownstate-when-leader-is-down. Does anybody know if it is indeed a bug and if there's a related JIRA issue?
3. I see this in the logs before the reconnection: "Client session timed out, have not heard from server in 48865ms for sessionid 0x44efbb91b5f0001, closing socket connection and attempting reconnect". Does it mean that there was a disconnection of over 50 seconds between Solr and ZooKeeper?

Thanks in advance for your kind answer
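For readers wondering where these knobs live: in the new-style solr.xml (Solr 4.4 and later), leaderVoteWait and zkClientTimeout are integer settings (in milliseconds) under the solrcloud element. A hypothetical fragment, with values chosen for illustration only:

```xml
<!-- solr.xml fragment (new-style format, Solr 4.4+). Values below are
     illustrative; leaderVoteWait defaults to 180000 ms (3 minutes), and
     lowering it carries the data-loss trade-off debated in this thread. -->
<solrcloud>
  <str name="host">${host:}</str>
  <int name="hostPort">${jetty.port:8983}</int>
  <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
  <int name="leaderVoteWait">10000</int>
</solrcloud>
```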
Re: Cluster down for long time after zookeeper disconnection
Did you look at the release notes for Solr versions after your own? I am pretty sure some similar things were identified and/or resolved for 5.x. It may not help if you cannot migrate, but it would at least give a confirmation and maybe a workaround for what you are facing. Regards, Alex.
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/

On 10 August 2015 at 11:37, danny teichthal dannyt...@gmail.com wrote: Hi, We are using SolrCloud with Solr 4.10.4. Over the past week we encountered a problem where all of our servers disconnected from the ZooKeeper cluster. That by itself might be OK; the problem is that after reconnecting to ZooKeeper, it looks like for every collection both replicas have no leader and are stuck in some kind of deadlock for a few minutes. From what we understand: one of the replicas assumes it will be the leader and at some point starts to wait on leaderVoteWait, which is 3 minutes by default. The other replica is stuck in ZkController.waitForLeaderToSeeDownState (called from registerAllCoresAsDown) for a few minutes. It looks like replica 1 waits for a leader to be registered in ZooKeeper, but replica 2 is waiting for replica 1 (org.apache.solr.cloud.ShardLeaderElectionContext.waitForReplicasToComeUp). We have 100 collections distributed over 3 pairs of Solr nodes. Each collection has one shard with 2 replicas. As I understand from the code and logs, all the collections are registered synchronously, which means we have to wait 3 minutes * number of collections for the whole cluster to come up. It could be more than an hour! 1. We thought about lowering leaderVoteWait to solve the problem, but we are not sure what the risk is. 2. The following thread is very similar to our case: http://qnalist.com/questions/4812859/waitforleadertoseedownstate-when-leader-is-down. Does anybody know if it is indeed a bug and if there's a related JIRA issue? 3. I see this in the logs before the reconnection: "Client session timed out, have not heard from server in 48865ms for sessionid 0x44efbb91b5f0001, closing socket connection and attempting reconnect". Does it mean that there was a disconnection of over 50 seconds between Solr and ZooKeeper? Thanks in advance for your kind answer
Re: Cluster down for long time after zookeeper disconnection
I didn't see the ZK timeout you set (just skimmed). But if your ZooKeeper was down _very_ temporarily, it may suffice to up the ZK timeout. The default in the 4.10 time-frame (if I remember correctly) was 15 seconds, which has proven to be too short in many circumstances. Of course, if your ZK was down for minutes this wouldn't help. Best, Erick

On Mon, Aug 10, 2015 at 1:06 PM, danny teichthal dannyt...@gmail.com wrote: Hi Alexander, thanks for your reply. I looked at the release notes. There is one relevant bug fix, SOLR-7503 (https://issues.apache.org/jira/browse/SOLR-7503, register cores asynchronously). It may reduce the registration time since it is done in parallel, but still, 3 minutes (leaderVoteWait) is a long time to recover from a few seconds of disconnection. Apart from that one, I don't see any bug fix that addresses the same problem. I am able to reproduce it on 4.10.4 pretty easily; I will also try it with 5.2.1 and see if it reproduces. Anyway, since migrating to 5.2.1 is not an option for me in the short term, I'm left with the question of whether reducing leaderVoteWait may help here, and what the consequences may be. If I understand correctly, there might be a chance of losing updates that were made on the leader. From my side it is a lot worse to lose availability for 3 minutes. I would really appreciate feedback on this.
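One caveat when upping the Solr-side timeout (editorial note, not from the thread): ZooKeeper itself caps sessions at maxSessionTimeout, which defaults to 20 * tickTime, i.e. 40 seconds with tickTime=2000. Raising zkClientTimeout beyond that cap has no effect unless zoo.cfg is changed as well, for example:

```
# zoo.cfg fragment - hypothetical values for illustration
tickTime=2000
initLimit=10
syncLimit=5
# Session-timeout bounds default to 2*tickTime and 20*tickTime; raise
# the max if Solr's zkClientTimeout needs to exceed 40000 ms.
maxSessionTimeout=60000
```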
Re: Cluster down for long time after zookeeper disconnection
Not that I know of. With ZK as the one source of truth, dropping below quorum is Really Serious, so having to wait 3 minutes or so for action to be taken is the fallback. Best, Erick

On Mon, Aug 10, 2015 at 1:34 PM, danny teichthal dannyt...@gmail.com wrote: Erick, I assume you are referring to zkClientTimeout; it is set to 30 seconds. I also see these messages on the Solr side: "Client session timed out, have not heard from server in 48865ms for sessionid 0x44efbb91b5f0001, closing socket connection and attempting reconnect". So I'm not sure what the actual disconnection duration was, but it could have been up to a minute. We are working on finding the root cause of the network issues, but assuming disconnections will always occur, are there any other options to overcome these issues?
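On "dropping below quorum": a ZooKeeper ensemble stays available only while a strict majority of its servers can reach each other. The majority arithmetic is standard quorum math, sketched here for reference:

```python
def quorum_size(ensemble_size):
    """Smallest number of ZooKeeper servers that forms a majority."""
    return ensemble_size // 2 + 1

def tolerated_failures(ensemble_size):
    """How many servers can fail before the ensemble loses quorum."""
    return ensemble_size - quorum_size(ensemble_size)

# 1 server tolerates 0 failures, 3 tolerate 1, 5 tolerate 2 - and a
# 4-node ensemble tolerates no more than a 3-node one, which is why
# ensembles are usually sized with an odd number of servers.
for n in (1, 3, 4, 5):
    print(n, quorum_size(n), tolerated_failures(n))
```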
Re: Cluster down for long time after zookeeper disconnection
Hi Alexander, thanks for your reply. I looked at the release notes. There is one relevant bug fix, SOLR-7503 (https://issues.apache.org/jira/browse/SOLR-7503, register cores asynchronously). It may reduce the registration time since it is done in parallel, but still, 3 minutes (leaderVoteWait) is a long time to recover from a few seconds of disconnection. Apart from that one, I don't see any bug fix that addresses the same problem. I am able to reproduce it on 4.10.4 pretty easily; I will also try it with 5.2.1 and see if it reproduces. Anyway, since migrating to 5.2.1 is not an option for me in the short term, I'm left with the question of whether reducing leaderVoteWait may help here, and what the consequences may be. If I understand correctly, there might be a chance of losing updates that were made on the leader. From my side it is a lot worse to lose availability for 3 minutes. I would really appreciate feedback on this.

On Mon, Aug 10, 2015 at 6:55 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: Did you look at the release notes for Solr versions after your own? I am pretty sure some similar things were identified and/or resolved for 5.x. It may not help if you cannot migrate, but it would at least give a confirmation and maybe a workaround for what you are facing. Regards, Alex.
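A toy model of why asynchronous registration (SOLR-7503) matters for the startup time complained about here. This is illustrative arithmetic only, with a hypothetical per-node core count; it is not the actual Solr implementation:

```python
LEADER_VOTE_WAIT_S = 180   # Solr default leaderVoteWait: 3 minutes
CORES_PER_NODE = 33        # hypothetical: ~100 collections x 2 replicas
                           # spread over 6 nodes

# Sequential registration (pre-SOLR-7503): if every core has to wait
# out leaderVoteWait one after another, the waits accumulate.
sequential_worst_case = CORES_PER_NODE * LEADER_VOTE_WAIT_S   # 5940 s

# Parallel registration: the total is bounded by the single longest
# wait, regardless of how many cores the node hosts.
parallel_worst_case = LEADER_VOTE_WAIT_S                      # 180 s

print(sequential_worst_case / 60)  # 99.0 (minutes)
print(parallel_worst_case / 60)    # 3.0 (minutes)
```

Under those assumptions the sequential worst case is over an hour and a half per node, consistent with the "more than an hour" estimate in the original message, while parallel registration keeps it at a single leaderVoteWait.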
Re: Cluster down for long time after zookeeper disconnection
Erick, I assume you are referring to zkClientTimeout; it is set to 30 seconds. I also see these messages on the Solr side: "Client session timed out, have not heard from server in 48865ms for sessionid 0x44efbb91b5f0001, closing socket connection and attempting reconnect". So I'm not sure what the actual disconnection duration was, but it could have been up to a minute. We are working on finding the root cause of the network issues, but assuming disconnections will always occur, are there any other options to overcome these issues?

On Mon, Aug 10, 2015 at 11:18 PM, Erick Erickson erickerick...@gmail.com wrote: I didn't see the ZK timeout you set (just skimmed). But if your ZooKeeper was down _very_ temporarily, it may suffice to up the ZK timeout. The default in the 4.10 time-frame (if I remember correctly) was 15 seconds, which has proven to be too short in many circumstances. Of course, if your ZK was down for minutes this wouldn't help. Best, Erick
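An editorial footnote on the 30 s vs ~49 s discrepancy: the ZooKeeper Java client (3.4 line) derives two internal timeouts from the negotiated session timeout, and the "have not heard from server in Nms" counter keeps running while the client cycles through reconnect attempts, which is why it can read well above the 30-second session timeout. The sketch below models that derivation; it is an assumption-labeled illustration of the client's behavior, not Solr code:

```python
def client_timeouts(session_timeout_ms, num_zk_hosts):
    """Model of the ZooKeeper Java client's derived timeouts."""
    # The client treats the connection as lost after hearing nothing
    # for two-thirds of the session timeout, then starts reconnecting.
    read_timeout = session_timeout_ms * 2 // 3
    # Each reconnect attempt gets an even share of the session timeout.
    connect_timeout = session_timeout_ms // num_zk_hosts
    return read_timeout, connect_timeout

# With zkClientTimeout=30000 and a 3-node ZK ensemble:
print(client_timeouts(30000, 3))  # (20000, 10000)
```

So the client here would have started reconnecting after ~20 s of silence, and the 48865 ms figure would then include the time spent failing to reconnect.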