[jira] [Commented] (SOLR-5952) Recovery race/ error
[ https://issues.apache.org/jira/browse/SOLR-5952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959941#comment-13959941 ]

Mark Miller commented on SOLR-5952:
------------------------------------

I've got ApacheCon coming up next week, so I might be a bit behind on things, but I want to try to get this addressed soon.

> Recovery race/ error
> --------------------
>
>                 Key: SOLR-5952
>                 URL: https://issues.apache.org/jira/browse/SOLR-5952
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.7
>            Reporter: Jessica Cheng
>            Assignee: Mark Miller
>              Labels: leader, recovery, solrcloud, zookeeper
>             Fix For: 4.8, 5.0
>         Attachments: recovery-failure-host1-log.txt, recovery-failure-host2-log.txt
>
> We're seeing some shard recovery errors in our cluster after a ZooKeeper error event happened. In this particular case, we had two replicas. The events from the logs look roughly like this:
>
> 18:40:36  follower (host2) disconnected from zk
> 18:40:38  original leader started recovery (there was no log about why it needed recovery, though) and failed because the cluster state still said it was the leader
> 18:40:39  follower successfully connected to zk after some trouble
> 19:03:35  follower registered its core/replica
> 19:16:36  follower registration failed due to no leader (NoNode for /collections/test-1/leaders/shard2)
>
> Essentially, I think the problem is that the isLeader property in the cluster state is never cleaned up, so neither replica is able to recover/register in order to participate in leader election again. From the code, it looks like the only place the isLeader property is cleared from the cluster state is ElectionContext.runLeaderProcess, which assumes that the replica with the next election seqId will notice the leader's node disappearing and run the leader process. That assumption fails in this scenario because the follower experienced the same ZooKeeper error event and never handled the event of the leader going away. (Mark, this is where I was saying in SOLR-3582 that maybe the watcher in LeaderElector.checkIfIamLeader does need to handle Expired by somehow realizing that the leader is gone and at least clearing the isLeader state, but it currently ignores all EventType.None events.)
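To make the watcher gap in that last paragraph concrete: in the ZooKeeper Java client, session-level events (Disconnected, Expired, SyncConnected) are delivered to watchers with type EventType.None, so a watcher that only reacts to node events never notices an expired session. The sketch below is a minimal editorial illustration of that distinction, not Solr's actual LeaderElector code; onLeaderPossiblyGone is a hypothetical hook.

{code:java}
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.Watcher.Event.EventType;
import org.apache.zookeeper.Watcher.Event.KeeperState;

public class ElectionNodeWatcher implements Watcher {
  @Override
  public void process(WatchedEvent event) {
    if (event.getType() == EventType.None) {
      // Session events arrive here. A watcher that simply returns on
      // EventType.None (as described above) never reacts to Expired, even
      // though an expired session means the ephemeral leader node is gone.
      if (event.getState() == KeeperState.Expired) {
        onLeaderPossiblyGone(); // hypothetical: clear isLeader, rejoin election
      }
      return;
    }
    if (event.getType() == EventType.NodeDeleted) {
      // The normal path: the election node we were watching was deleted.
      onLeaderPossiblyGone();
    }
  }

  private void onLeaderPossiblyGone() {
    // Placeholder for clearing leader state and re-running the election.
  }
}
{code}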
[jira] [Commented] (SOLR-5952) Recovery race/ error
[ https://issues.apache.org/jira/browse/SOLR-5952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960240#comment-13960240 ]

Mark Miller commented on SOLR-5952:
------------------------------------

I've got a collection with 3 shards and no replication, and during heavy indexing I just saw the leader flash to DOWN. It stays there, though in ZooKeeper it is still the valid leader. This is 4.4 with heavy back-porting from future releases, but it may help track down this mysterious DOWN publication. I'm collecting the logs.
[jira] [Commented] (SOLR-5952) Recovery race/ error
[ https://issues.apache.org/jira/browse/SOLR-5952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960811#comment-13960811 ]

Jessica Cheng commented on SOLR-5952:
--------------------------------------

I tried to debug this a bit more today, and I think my particular issue is actually with external collections (the state.json-per-collection mode). I'm unable to reproduce the mysterious DOWN state though, so it's great that you have. I'm going to open a separate JIRA to track stale state.json for external collections. Should we close this one, or will you take it for the DOWN state?
[jira] [Commented] (SOLR-5952) Recovery race/ error
[ https://issues.apache.org/jira/browse/SOLR-5952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960818#comment-13960818 ]

Mark Miller commented on SOLR-5952:
------------------------------------

Unfortunately, it did not end up being so mysterious for me - digging through the logs led to more answers. My perspective led me to think something like this was happening, but I was missing some info.
[jira] [Commented] (SOLR-5952) Recovery race/ error
[ https://issues.apache.org/jira/browse/SOLR-5952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960852#comment-13960852 ]

Jessica Cheng commented on SOLR-5952:
--------------------------------------

OK, I'm going to close this one and open a new bug on stale state.json in external collections then. Thanks Mark!
[jira] [Commented] (SOLR-5952) Recovery race/ error
[ https://issues.apache.org/jira/browse/SOLR-5952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13958614#comment-13958614 ]

Daniel Collins commented on SOLR-5952:
---------------------------------------

I know there have been issues where, if the follower is disconnected from ZK, it will fail to take updates from the leader (since it can't confirm that the source of the messages is the real leader), so the follower will get asked to recover and will have to wait until it has a valid ZK connection in order to do that. But I believe there have been fixes around that area.

In the example logs here, though (I'm assuming host1 was the leader), host1 says that its last published state was down? We might need to go further back in the trace history of that node: why did it publish itself as down while it was still the leader?
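As a rough illustration of that first behavior - all names below (FollowerUpdateGuard, zkSessionLive, handleForwardedUpdate) are hypothetical, and this is not Solr's actual update-processing code:

{code:java}
// A follower that cannot see ZK cannot verify that a forwarded update really
// comes from the current leader, so it rejects the update; the leader then
// asks it to recover, and recovery itself must wait for a live ZK session.
public class FollowerUpdateGuard {

  private volatile boolean zkSessionLive = true; // flipped by a connection listener

  public void handleForwardedUpdate(String update) {
    if (!zkSessionLive) {
      // Without a live ZK session we cannot confirm the sender is still the
      // real leader, so refuse the update and let the leader request recovery.
      throw new IllegalStateException(
          "disconnected from ZooKeeper; cannot verify leader");
    }
    // ... apply the update to the local index ...
  }
}
{code}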
[jira] [Commented] (SOLR-5952) Recovery race/ error
[ https://issues.apache.org/jira/browse/SOLR-5952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959364#comment-13959364 ]

Jessica Cheng commented on SOLR-5952:
--------------------------------------

Hi Daniel,

{quote}
I know there have been issues where, if the follower is disconnected from ZK, it will fail to take updates from the leader (since it can't confirm that the source of the messages is the real leader), so the follower will get asked to recover and will have to wait until it has a valid ZK connection in order to do that. But I believe there have been fixes around that area.
{quote}

What you describe doesn't seem to be related to this case. In this case, when the follower finally connected to zk again, there was no leader at all, and it failed to register itself when it hit the NoNodeException on /collections/test-1/leaders/shard2 while trying to find the leader. It neither got to re-join the election nor to recover.

{quote}
In the example logs here, though (I'm assuming host1 was the leader), host1 says that its last published state was down? We might need to go further back in the trace history of that node: why did it publish itself as down while it was still the leader?
{quote}

Yes, this is what both Mark and I were expressing confusion about. However, I went back in the logs for hours trying to find the core being marked as down, and I couldn't find it. (I grepped for the "publishing core" log message from ZkController.publish.)
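For readers unfamiliar with that failure mode, the read that fails can be reproduced with the plain ZooKeeper client. This is only an editorial sketch - the connect string and session timeout are placeholders, and Solr's actual registration path does more than this single read:

{code:java}
import java.nio.charset.StandardCharsets;

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

public class LeaderNodeCheck {
  public static void main(String[] args) throws Exception {
    ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {});
    try {
      // The leader znode is ephemeral; if no replica has re-run the leader
      // process since the session expired, this read throws
      // KeeperException.NoNodeException and registration cannot find a
      // leader to sync against.
      byte[] leaderProps =
          zk.getData("/collections/test-1/leaders/shard2", false, null);
      System.out.println(new String(leaderProps, StandardCharsets.UTF_8));
    } catch (KeeperException.NoNodeException e) {
      System.out.println("no leader registered: " + e.getPath());
    } finally {
      zk.close();
    }
  }
}
{code}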
[jira] [Commented] (SOLR-5952) Recovery race/ error
[ https://issues.apache.org/jira/browse/SOLR-5952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13958442#comment-13958442 ]

Mark Miller commented on SOLR-5952:
------------------------------------

bq. original leader started recovery

That's odd - the only thing that should cause this (especially if you don't see logging about the recovery being requested) is a zk expiration - and that is what would cause isLeader to be reset. I'd expect that to be logged though. I'll start reading through the logs in a bit.
[jira] [Commented] (SOLR-5952) Recovery race/ error
[ https://issues.apache.org/jira/browse/SOLR-5952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13958445#comment-13958445 ]

Mark Miller commented on SOLR-5952:
------------------------------------

To note, I've seen a report or two that does resemble this behavior.
[jira] [Commented] (SOLR-5952) Recovery race/ error
[ https://issues.apache.org/jira/browse/SOLR-5952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13958476#comment-13958476 ]

Jessica Cheng commented on SOLR-5952:
--------------------------------------

{quote}
That's odd - the only thing that should cause this (especially if you don't see logging about the recovery being requested) is a zk expiration - and that is what would cause isLeader to be reset. I'd expect that to be logged though.
{quote}

As far as I can tell, isLeader (I'm talking about the one in clusterstate, not the node under /collections/xxx) is only cleared in ElectionContext.runLeaderProcess (I did a find-usages on ZkStateReader.LEADER_PROP). I believe a zk expiration wouldn't automatically cause this to be cleared from clusterstate.
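The distinction matters because the two pieces of leader state live in different places and age differently: the ephemeral leader znode disappears on its own when the leader's session expires, while the leader property serialized into the cluster state is only rewritten when some replica runs the leader process. A small diagnostic sketch against raw ZooKeeper (editorial illustration; the connect string is a placeholder, and /clusterstate.json is the shared-state location in Solr 4.x, not the per-collection state.json mode):

{code:java}
import java.nio.charset.StandardCharsets;

import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class LeaderStateProbe {
  public static void main(String[] args) throws Exception {
    ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {});
    try {
      // 1. Ephemeral leader registration: vanishes automatically when the
      //    leader's ZK session expires.
      Stat leaderNode = zk.exists("/collections/test-1/leaders/shard2", false);
      System.out.println("leader znode exists: " + (leaderNode != null));

      // 2. Serialized cluster state: only rewritten by the leader process, so
      //    it can keep claiming a leader long after the znode above is gone.
      byte[] state = zk.getData("/clusterstate.json", false, null);
      System.out.println(new String(state, StandardCharsets.UTF_8));
    } finally {
      zk.close();
    }
  }
}
{code}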
[jira] [Commented] (SOLR-5952) Recovery race/ error
[ https://issues.apache.org/jira/browse/SOLR-5952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13958477#comment-13958477 ]

Jessica Cheng commented on SOLR-5952:
--------------------------------------

However, I do share your confusion about the lack of zookeeper-related logging. I went back for hours searching for it.
[jira] [Commented] (SOLR-5952) Recovery race/ error
[ https://issues.apache.org/jira/browse/SOLR-5952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13958484#comment-13958484 ]

Mark Miller commented on SOLR-5952:
------------------------------------

bq. I'm talking about the one in clusterstate

Oh - I thought you were talking about the isLeader flag that is kept on the CloudDescriptor. That clears things up a bit.
[jira] [Commented] (SOLR-5952) Recovery race/ error
[ https://issues.apache.org/jira/browse/SOLR-5952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13958489#comment-13958489 ]

Jessica Cheng commented on SOLR-5952:
--------------------------------------

{quote}
Oh - I thought you were talking about the isLeader flag that is kept on the CloudDescriptor.
{quote}

Ah, I see. Well, I guess it could've been either. I'd just assumed that clusterstate was the one that said it was the leader and the CloudDescriptor was the one that said it wasn't, based on the if statement below failing:

{code:java}
if (isLeader && !cloudDesc.isLeader()) {
  throw new SolrException(ErrorCode.SERVER_ERROR,
      "Cloud state still says we are leader.");
}
{code}

where isLeader was determined from clusterstate.