[jira] [Commented] (SOLR-5952) Recovery race/ error

2014-04-04 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959941#comment-13959941
 ] 

Mark Miller commented on SOLR-5952:
---

I've got ApacheCon coming up next week, so I might be a bit behind on things, 
but I want to try and get this addressed pretty soon.

 Recovery race/ error
 

 Key: SOLR-5952
 URL: https://issues.apache.org/jira/browse/SOLR-5952
 Project: Solr
  Issue Type: Bug
  Components: SolrCloud
Affects Versions: 4.7
Reporter: Jessica Cheng
Assignee: Mark Miller
  Labels: leader, recovery, solrcloud, zookeeper
 Fix For: 4.8, 5.0

 Attachments: recovery-failure-host1-log.txt, 
 recovery-failure-host2-log.txt


 We're seeing some shard recovery errors in our cluster when a zookeeper 
 error event occurs. In this particular case, we had two replicas. The 
 events from the logs look roughly like this:
 18:40:36 follower (host2) disconnected from zk
 18:40:38 original leader started recovery (there was no log about why it 
 needed recovery though) and failed because cluster state still says it's the 
 leader
 18:40:39 follower successfully connected to zk after some trouble
 19:03:35 follower register core/replica
 19:16:36 follower registration fails due to no leader (NoNode for 
 /collections/test-1/leaders/shard2)
 Essentially, I think the problem is that the isLeader property on the cluster 
 state is never cleaned up, so neither replica is able to recover/register 
 in order to participate in leader election again.
 From the code, it looks like the only place the isLeader property is 
 cleared from the cluster state is ElectionContext.runLeaderProcess, 
 which assumes that the replica with the next election seqId will notice the 
 leader's node disappearing and run the leader process. This assumption fails 
 in this scenario because the follower experienced the same zookeeper error 
 event and never handled the event of the leader going away. (Mark, this is 
 where I was saying in SOLR-3582 that maybe the watcher in 
 LeaderElector.checkIfIamLeader does need to handle Expired by somehow 
 realizing that the leader is gone and clearing the isLeader state at least, 
 but it currently ignores all EventType.None events.)
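
To make the suggestion at the end of the description concrete, here is a 
minimal sketch (not actual Solr code; the two Runnable callbacks are 
hypothetical stand-ins for the election and cluster-state plumbing) of a 
leader-node watcher that handles EventType.None session events instead of 
ignoring them:

{code:java}
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;

public class LeaderNodeWatcher implements Watcher {
  private final Runnable runLeaderProcess;    // hypothetical: start the leader process
  private final Runnable clearIsLeaderState;  // hypothetical: clear isLeader in cluster state

  public LeaderNodeWatcher(Runnable runLeaderProcess, Runnable clearIsLeaderState) {
    this.runLeaderProcess = runLeaderProcess;
    this.clearIsLeaderState = clearIsLeaderState;
  }

  @Override
  public void process(WatchedEvent event) {
    if (event.getType() == Event.EventType.None) {
      // Session-level event. Today the election watcher ignores these; the
      // suggestion above is to treat Expired as "the leader may be gone" and
      // at least clear the stale isLeader flag so replicas can re-register.
      if (event.getState() == Event.KeeperState.Expired) {
        clearIsLeaderState.run();
      }
      return;
    }
    if (event.getType() == Event.EventType.NodeDeleted) {
      // Normal path: the leader's ephemeral node went away, so the replica
      // with the next election seqId runs the leader process.
      runLeaderProcess.run();
    }
  }
}
{code}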






[jira] [Commented] (SOLR-5952) Recovery race/ error

2014-04-04 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960240#comment-13960240
 ] 

Mark Miller commented on SOLR-5952:
---

I've got a collection with 3 shards and no replication, and during heavy 
indexing I just saw the leader flash to DOWN. It stays there, though in 
ZooKeeper it is still the valid leader. This is 4.4 with heavy back-porting 
from later releases, but this may help track down this mysterious DOWN 
publication. I'm collecting the logs.




[jira] [Commented] (SOLR-5952) Recovery race/ error

2014-04-04 Thread Jessica Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960811#comment-13960811
 ] 

Jessica Cheng commented on SOLR-5952:
-

I tried to debug this a bit more today, and I think my particular issue is 
actually with external collections (the per-collection state.json mode). I'm 
unable to reproduce the mysterious DOWN state though, so it's great that you 
have. I'm going to open a separate jira to track stale state.json for 
external collections. Should we close this one, or will you take it for the 
DOWN state?




[jira] [Commented] (SOLR-5952) Recovery race/ error

2014-04-04 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960818#comment-13960818
 ] 

Mark Miller commented on SOLR-5952:
---

Unfortunately, it did not end up being so mysterious for me - digging through 
the logs led to more answers. My perspective led me to think something like 
this was happening, but I was missing some info.




[jira] [Commented] (SOLR-5952) Recovery race/ error

2014-04-04 Thread Jessica Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960852#comment-13960852
 ] 

Jessica Cheng commented on SOLR-5952:
-

OK, I'm going to close this one and open a new bug on stale state.json in 
external collections, then. Thanks Mark!




[jira] [Commented] (SOLR-5952) Recovery race/ error

2014-04-03 Thread Daniel Collins (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13958614#comment-13958614
 ] 

Daniel Collins commented on SOLR-5952:
--

I know there have been issues where, if the follower disconnects from ZK, it 
will fail to take updates from the leader (since it can't confirm that the 
source of the messages is the real leader), so the follower will be asked to 
recover and will have to wait until it has a valid ZK connection in order to 
do that. But I believe there have been fixes around that area.

In the example logs here, though (I'm assuming host1 was the leader), host1 
says that its last published state was down? We might need to go further back 
in the trace history of that node: why did it publish itself as down but 
remain leader?




[jira] [Commented] (SOLR-5952) Recovery race/ error

2014-04-03 Thread Jessica Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959364#comment-13959364
 ] 

Jessica Cheng commented on SOLR-5952:
-

Hi Daniel,

{quote}
I know there have been issues if the follower disconnected from ZK, then it 
will fail to take updates from the leader (since it can't confirm the source of 
the messages is the real leader), so the follower will get asked to recover, 
and will have to wait until it has a valid ZK connection in order to do that. 
But I believe there have been fixes around that area.
{quote}
What you describe doesn't seem to be related to this case. In this case, when 
the follower finally connected to zk again, there was no leader at all, and it 
failed to register itself when it hit the NoNodeException on 
/collections/test-1/leaders/shard2 while trying to find the leader. It neither 
got to re-join the election nor to recover.
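
For illustration, a minimal sketch of the failing lookup, assuming a raw 
ZooKeeper handle (the helper class is hypothetical; Solr actually goes through 
its ZK client and ZkStateReader):

{code:java}
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

public class LeaderLookup {
  // Hypothetical helper: read the leader znode directly. Returns null when
  // no leader has registered -- the state the follower was stuck in.
  static byte[] readLeader(ZooKeeper zk) throws KeeperException, InterruptedException {
    try {
      return zk.getData("/collections/test-1/leaders/shard2", null, null);
    } catch (KeeperException.NoNodeException e) {
      // No elected leader: registration fails here, and since the follower
      // never re-joins the election, nothing ever recreates this node.
      return null;
    }
  }
}
{code}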

{quote}
In the example logs here though (I'm assuming host 1 was the leader) host1 says 
that its last published state was down? We might need to go further back in the 
trace history of that node, why did it publish itself as down but was still 
leader?
{quote}
Yes, this is what both Mark and I were confused about. However, I went back 
in the logs for hours trying to find the core being marked as down, and I 
couldn't find it. (I grepped for "publishing core" from ZkController.publish.)




[jira] [Commented] (SOLR-5952) Recovery race/ error

2014-04-02 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13958442#comment-13958442
 ] 

Mark Miller commented on SOLR-5952:
---

bq. original leader started recovery 

That's odd - the only thing that should cause this (especially if you don't see 
logging about the recovery being requested) is if there is a zk expiration - 
and that is what would cause isLeader to be reset. I'd expect that to be logged 
though.

I'll start reading through the logs in a bit.




[jira] [Commented] (SOLR-5952) Recovery race/ error

2014-04-02 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13958445#comment-13958445
 ] 

Mark Miller commented on SOLR-5952:
---

To note, I've seen a report or two that does resemble this behavior.




[jira] [Commented] (SOLR-5952) Recovery race/ error

2014-04-02 Thread Jessica Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13958476#comment-13958476
 ] 

Jessica Cheng commented on SOLR-5952:
-

{quote}
That's odd - the only thing that should cause this (especially if you don't see 
logging about the recovery being requested) is if there is a zk expiration - 
and that is what would cause isLeader to be reset. I'd expect that to be logged 
though.
{quote}

As far as I can tell, isLeader (I'm talking about the one in clusterstate, not 
the one under /collections/xxx) is only cleared in 
ElectionContext.runLeaderProcess (I did a find-usages on 
ZkStateReader.LEADER_PROP). I believe a zk expiration wouldn't automatically 
cause this to be cleared from clusterstate.




[jira] [Commented] (SOLR-5952) Recovery race/ error

2014-04-02 Thread Jessica Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13958477#comment-13958477
 ] 

Jessica Cheng commented on SOLR-5952:
-

I do, however, share your confusion about the lack of zookeeper-related 
logging. I went back for hours searching for it.




[jira] [Commented] (SOLR-5952) Recovery race/ error

2014-04-02 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13958484#comment-13958484
 ] 

Mark Miller commented on SOLR-5952:
---

bq. I'm talking about the one in clusterstate

Oh - I thought you were talking about the isLeader flag that is kept on the 
CloudDescriptor. That clears things up a bit.




[jira] [Commented] (SOLR-5952) Recovery race/ error

2014-04-02 Thread Jessica Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13958489#comment-13958489
 ] 

Jessica Cheng commented on SOLR-5952:
-

{quote}
Oh - I thought you were talking about the isLeader flag that is kept on the 
CloudDescriptor.
{quote}

Ah, I see. Well, I guess it could've been either. I'd just assumed that 
clusterstate was the one that said it was the leader and CloudDescriptor was 
the one that said it wasn't, based on the if statement below failing:

{quote}
if (isLeader && !cloudDesc.isLeader()) {
  throw new SolrException(ErrorCode.SERVER_ERROR, "Cloud state still says we are leader.");
}
{quote}

where isLeader was determined from clusterstate.
