[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-10-29 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13808004#comment-13808004
 ] 

Flavio Junqueira commented on ZOOKEEPER-1732:
-

I have to think about this one, but from the log excerpt that Raul posted, one 
problem I can see is that n.round has different values, so vote.equals fails in 
termPredicate when comparing the votes, no?

 ZooKeeper server unable to join established ensemble
 

 Key: ZOOKEEPER-1732
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1732
 Project: ZooKeeper
  Issue Type: Bug
  Components: leaderElection
Affects Versions: 3.4.5
 Environment: Windows 7, Java 1.7
Reporter: Germán Blanco
Assignee: Germán Blanco
Priority: Blocker
 Fix For: 3.4.6, 3.5.0

 Attachments: CREATE_INCONSISTENCIES_patch.txt, zklog.tar.gz, 
 ZOOKEEPER-1732-3.4.patch, ZOOKEEPER-1732-3.4.patch, ZOOKEEPER-1732-3.4.patch, 
 ZOOKEEPER-1732-3.4.patch, ZOOKEEPER-1732-b3.4.patch, 
 ZOOKEEPER-1732-b3.4.patch, ZOOKEEPER-1732.patch, ZOOKEEPER-1732.patch, 
 ZOOKEEPER-1732.patch, ZOOKEEPER-1732.patch, ZOOKEEPER-1732.patch


 I have a test in which I do a rolling restart of three ZooKeeper servers and 
 it was failing from time to time.
 I ran the tests in a loop until the failure came out and it seems that at 
 some point one of the servers is unable to join the ensemble formed by the 
 other two.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-10-29 Thread JIRA

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13808081#comment-13808081
 ] 

Germán Blanco commented on ZOOKEEPER-1732:
--

Yes, that is what I mean.
The round value in the votes is updated in updateElectionVote() after the 
election is finished.
In the previous code (without the patch) the vote when the election was 
finished had the epoch of the leader. That is, the epoch that the new leader 
had when the election started.
In the code after the patch, the vote is updated in updateElectionVote() to the 
epoch that the leader is using after the election is finished, which is one 
more than the epoch that it was using when the election started.
I think that if newEpoch-1 is used to update the election vote, then things 
should be ok. If that is done, then servers with and without the patch should 
have the same value of epoch in the vote after the election is finished.
It is very good that [~rgs] has spotted this so soon, since it would have been 
seen in all upgrades from 3.4.5 to 3.4.6. On the other hand, consequences are 
not too serious. It only happens when servers with different versions are 
running in the same quorum and it only happens if there is an ensemble running 
(so there should be no interruption of the service).
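
The epoch arithmetic described above can be written out as a minimal sketch. The class and method names are hypothetical and are not taken from the patch; the value 0x88 is the peerEPoch reported by the older servers in the log excerpt that Raul posted, and 0x89 is the value reported by the newer ones.

```java
final class VoteEpochSketch {
    // Unpatched behavior as described above: the vote keeps the epoch the
    // leader had when the election started.
    static long unpatchedVoteEpoch(long epochAtElectionStart) {
        return epochAtElectionStart;
    }

    // Patched behavior as described above: the vote is updated to the epoch
    // in use after the election, one more than at election start.
    static long patchedVoteEpoch(long epochAtElectionStart) {
        return epochAtElectionStart + 1;
    }

    // Proposed adjustment: update the election vote with newEpoch - 1.
    static long adjustedVoteEpoch(long epochAtElectionStart) {
        long newEpoch = epochAtElectionStart + 1;
        return newEpoch - 1;
    }

    public static void main(String[] args) {
        long start = 0x88; // assumption: epoch at election start, taken from the log excerpt
        System.out.println("unpatched: 0x" + Long.toHexString(unpatchedVoteEpoch(start)));
        System.out.println("patched:   0x" + Long.toHexString(patchedVoteEpoch(start)));
        System.out.println("adjusted:  0x" + Long.toHexString(adjustedVoteEpoch(start)));
    }
}
```

With the adjustment, the value recorded by a patched server is the same as the one recorded by an unpatched server, which is the property the comment above is asking for.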



[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-10-29 Thread Raul Gutierrez Segales (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13808194#comment-13808194
 ] 

Raul Gutierrez Segales commented on ZOOKEEPER-1732:
---

Hmm, I still think this could confuse people rolling a cluster. Sounds like we 
should revert this for the next release unless we have a fix for it. Smooth 
upgrades through rolling restarts are an expectation that ZooKeeper has always 
maintained. 



[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-10-28 Thread Raul Gutierrez Segales (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13807102#comment-13807102
 ] 

Raul Gutierrez Segales commented on ZOOKEEPER-1732:
---

[~fpj], [~abranzyck]: did you guys test this patch when joining a cluster of 
servers running without this patch (i.e.: trunk, only without this patch)?

After rolling the first 2 followers - in a 5 member ensemble - the 3rd follower 
fails to join with this:

{noformat}
2013-10-28 18:43:18,134 - INFO  [WorkerReceiver[myid=4]] - Notification: 4 
(n.leader), 0x890415 (n.zxid), 0x6 (n.round), LOOKING (n.state), 4 (n.sid), 
0x89 (n.peerEPoch), LOOKING (my state)0 (n.config version)
2013-10-28 18:43:18,134 - INFO  [WorkerReceiver[myid=4]] - Notification: 2 
(n.leader), 0x88002c (n.zxid), 0x (n.round), FOLLOWING 
(n.state), 0 (n.sid), 0x89 (n.peerEPoch), LOOKING (my state)0 (n.config version)
2013-10-28 18:43:18,135 - INFO  [WorkerReceiver[myid=4]] - Notification: 2 
(n.leader), 0x88002c (n.zxid), 0x6 (n.round), LEADING (n.state), 2 (n.sid), 
0x88 (n.peerEPoch), LOOKING (my state)0 (n.config version)
2013-10-28 18:43:18,135 - INFO  [WorkerReceiver[myid=4]] - Notification: 2 
(n.leader), 0x88002c (n.zxid), 0x6 (n.round), FOLLOWING (n.state), 3 
(n.sid), 0x88 (n.peerEPoch), LOOKING (my state)0 (n.config version)
2013-10-28 18:43:18,136 - INFO  [WorkerReceiver[myid=4]] - Notification: 2 
(n.leader), 0x88002c (n.zxid), 0x (n.round), FOLLOWING 
(n.state), 1 (n.sid), 0x89 (n.peerEPoch), LOOKING (my state)0 (n.config version)
{noformat}

I am guessing IGNOREVALUE (0x) as the round value is causing 
issues? What was the expected behavior here (i.e.: when dealing with cluster 
members without this patch during an upgrade)?
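
To compare the round and peerEPoch fields across notifications like the ones above, a small parser can help. This is only a sketch: the class name is hypothetical, the field layout is copied from the excerpt rather than from the ZooKeeper source, and the hex digits are treated as optional because some values in the excerpt appear as a bare "0x".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NotificationLine {
    // Field order as it appears in the excerpt; \s+ tolerates wrapped lines.
    private static final Pattern FIELDS = Pattern.compile(
        "Notification:\\s+(\\d+)\\s+\\(n\\.leader\\),"
        + "\\s+(0x[0-9a-fA-F]*)\\s+\\(n\\.zxid\\),"
        + "\\s+(0x[0-9a-fA-F]*)\\s+\\(n\\.round\\),"
        + "\\s+(\\w+)\\s+\\(n\\.state\\),"
        + "\\s+(\\d+)\\s+\\(n\\.sid\\),"
        + "\\s+(0x[0-9a-fA-F]*)\\s+\\(n\\.peerEPoch\\)");

    // Values are kept as strings so that a bare "0x" survives unchanged.
    final String leader, zxid, round, state, sid, peerEpoch;

    private NotificationLine(Matcher m) {
        leader = m.group(1);
        zxid = m.group(2);
        round = m.group(3);
        state = m.group(4);
        sid = m.group(5);
        peerEpoch = m.group(6);
    }

    // Returns null when the text does not contain a notification.
    static NotificationLine parse(String text) {
        Matcher m = FIELDS.matcher(text);
        return m.find() ? new NotificationLine(m) : null;
    }

    public static void main(String[] args) {
        String line = "2013-10-28 18:43:18,135 - INFO  [WorkerReceiver[myid=4]] - Notification: 2 "
            + "(n.leader), 0x88002c (n.zxid), 0x6 (n.round), LEADING (n.state), 2 (n.sid), "
            + "0x88 (n.peerEPoch), LOOKING (my state)0 (n.config version)";
        NotificationLine n = parse(line);
        System.out.println("sid=" + n.sid + " round=" + n.round + " peerEpoch=" + n.peerEpoch);
    }
}
```

Grouping the parsed lines by sid makes it easy to see which senders report which round and peerEPoch values.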



[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-10-28 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13807553#comment-13807553
 ] 

Flavio Junqueira commented on ZOOKEEPER-1732:
-

hmm, this is odd. I don't understand why the notifications don't have the same 
round value, the don't care value in this case. The value is also not what I 
expected, so I might have done something wrong there. Let me have a closer look 
and report back.

Thanks for reporting, [~rgs].



[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-10-28 Thread Raul Gutierrez Segales (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13807565#comment-13807565
 ] 

Raul Gutierrez Segales commented on ZOOKEEPER-1732:
---

What's wrong with the round values? i.e.: the two new servers have IGNOREVALUE 
(sounds correct right?) and the older followers have the current round value 
(i.e.: 0x6). I thought the problem would be here:

{noformat}
 * @see https://issues.apache.org/jira/browse/ZOOKEEPER-1732
 */
outofelection.put(n.sid, new Vote(n.leader,
        IGNOREVALUE, IGNOREVALUE, n.peerEpoch, n.state));
if (termPredicate(outofelection, new Vote(n.leader,
        IGNOREVALUE, IGNOREVALUE, n.peerEpoch, n.state))
        && checkLeader(outofelection, n.leader, IGNOREVALUE)) {
{noformat}

IGNOREVALUE doesn't work here, because we are talking to un-patched cluster 
members.

Sorry if I am completely misleading you :) That's as far as I got with my 
analysis today. 
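
To illustrate why a sentinel round value might not be "ignored" when it meets a real round value, here is a minimal sketch. It is not the FastLeaderElection code: the Vote class, the comparison methods, and the sentinel value -1 are all hypothetical, and the other fields are deliberately kept equal so that only the round differs.

```java
import java.util.HashMap;
import java.util.Map;

final class IgnoreValueSketch {
    // Hypothetical sentinel for this sketch; not the constant from the patch.
    static final long IGNOREVALUE = -1L;

    static final class Vote {
        final long leader;
        final long round;
        final long peerEpoch;

        Vote(long leader, long round, long peerEpoch) {
            this.leader = leader;
            this.round = round;
            this.peerEpoch = peerEpoch;
        }
    }

    // Strict comparison: the sentinel is compared like any other round value.
    static boolean strictMatch(Vote a, Vote b) {
        return a.leader == b.leader && a.round == b.round && a.peerEpoch == b.peerEpoch;
    }

    // Lenient comparison: the round is skipped when either side carries the sentinel.
    static boolean lenientMatch(Vote a, Vote b) {
        boolean roundOk = a.round == IGNOREVALUE || b.round == IGNOREVALUE || a.round == b.round;
        return a.leader == b.leader && roundOk && a.peerEpoch == b.peerEpoch;
    }

    // True when more than half of the ensemble holds a vote matching the target.
    static boolean hasQuorum(Map<Long, Vote> votes, Vote target, int ensembleSize, boolean lenient) {
        int matching = 0;
        for (Vote v : votes.values()) {
            if (lenient ? lenientMatch(target, v) : strictMatch(target, v)) {
                matching++;
            }
        }
        return matching > ensembleSize / 2;
    }

    // Four votes for leader 2: two carry the sentinel round, two carry round 0x6.
    static Map<Long, Vote> sampleVotes() {
        Map<Long, Vote> votes = new HashMap<>();
        votes.put(0L, new Vote(2, IGNOREVALUE, 0x88));
        votes.put(1L, new Vote(2, IGNOREVALUE, 0x88));
        votes.put(2L, new Vote(2, 0x6, 0x88));
        votes.put(3L, new Vote(2, 0x6, 0x88));
        return votes;
    }

    public static void main(String[] args) {
        Vote target = new Vote(2, IGNOREVALUE, 0x88);
        System.out.println("strict quorum:  " + hasQuorum(sampleVotes(), target, 5, false));
        System.out.println("lenient quorum: " + hasQuorum(sampleVotes(), target, 5, true));
    }
}
```

In a five-member ensemble the strict comparison finds only two matching votes, which is not a majority, while the lenient comparison finds four. Whether this is what actually happens in the mixed-version case still needs to be checked against the real code.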



[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-10-28 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13807588#comment-13807588
 ] 

Flavio Junqueira commented on ZOOKEEPER-1732:
-

I see, my mental model of the problem ignored the fact that there were servers 
with newer and older versions, my bad. I think the IGNOREVALUE is not really 
being ignored, I'll come up with a fix, but I'll do it in a different jira.



[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-10-28 Thread Thawan Kooburat (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13807633#comment-13807633
 ] 

Thawan Kooburat commented on ZOOKEEPER-1732:


Maybe we should start considering an automated rolling upgrade test? In Jenkins 
we might be able to continuously grab the 3.4 branch, perform a rolling upgrade 
to 3.5, and verify that the quorum comes up.




[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-10-28 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13807673#comment-13807673
 ] 

Flavio Junqueira commented on ZOOKEEPER-1732:
-

Given that rolling upgrades seem to be very common, it doesn't sound like a bad 
idea to automate the testing. I think we can't do it with junit, or at least I 
don't know how.



[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-10-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13801717#comment-13801717
 ] 

Hudson commented on ZOOKEEPER-1732:
---

SUCCESS: Integrated in ZooKeeper-trunk #2097 (See 
[https://builds.apache.org/job/ZooKeeper-trunk/2097/])
ZOOKEEPER-1732. ZooKeeper server unable to join established ensemble (German 
Blanco via fpj) (fpj: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1534390)
* /zookeeper/trunk/CHANGES.txt
* 
/zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java
* /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/Leader.java
* /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/Learner.java
* 
/zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java
* /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/FLETest.java




[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-10-21 Thread JIRA

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13800766#comment-13800766
 ] 

Germán Blanco commented on ZOOKEEPER-1732:
--

Sorry, I didn't realise you had changed the name, and I was commenting on my 
last version vs. the one before that.
On your changes, my only comment is that I assume you intend to apply the same 
formatting to the trunk patch, right?
I can do that if you want.



[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-10-21 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13800769#comment-13800769
 ] 

Flavio Junqueira commented on ZOOKEEPER-1732:
-

I'm sorry too, I haven't been able to get to it. If you have some time and can 
generate the trunk patch, I'd appreciate it.



[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-10-21 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13800881#comment-13800881
 ] 

Hadoop QA commented on ZOOKEEPER-1732:
--

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12609472/ZOOKEEPER-1732.patch
  against trunk revision 1533161.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1713//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1713//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1713//console

This message is automatically generated.



[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-10-20 Thread JIRA

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13800150#comment-13800150
 ] 

Germán Blanco commented on ZOOKEEPER-1732:
--

+1 to the 3.4 patch. Thank you [~fpj], for the review and for making those 
changes, I shouldn't have left it like that. 
That means one blocker less to go for 3.4.6! Thanks a lot also for your work on 
the release.



[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-10-08 Thread JIRA

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788947#comment-13788947
 ] 

Germán Blanco commented on ZOOKEEPER-1732:
--

Is there anything else to be done in this one?



[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-10-08 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789374#comment-13789374
 ] 

Flavio Junqueira commented on ZOOKEEPER-1732:
-

One thing I'm not happy about in your patch is that you use zero as a 
don't-care value. For readability, I'd rather have different method calls or 
constants reflecting the fact that we are not taking those values into account. 
Adding comments to the code explaining what's going on sounds like a good thing 
to do.



[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-10-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13784955#comment-13784955
 ] 

Hadoop QA commented on ZOOKEEPER-1732:
--

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12606543/ZOOKEEPER-1732.patch
  against trunk revision 1528586.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1634//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1634//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1634//console

This message is automatically generated.



[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-10-03 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13785329#comment-13785329
 ] 

Marshall McMullen commented on ZOOKEEPER-1732:
--

We've just run into this issue running tip of trunk 3.5.0 *without* this patch 
applied. Are there any proposed workarounds to this problem? I tried removing 
the stuck node from the ensemble and adding another node in as a replacement 
but it is now hitting the same problem... It can't join the ensemble either. 
I'm considering restarting all zookeeper servers in the hopes that a new round 
of leader election will reset things. Does this sound safe? Are there any other 
alternatives? Really appreciate any help.



[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-10-03 Thread Marshall McMullen (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13785345#comment-13785345
 ] 

Marshall McMullen commented on ZOOKEEPER-1732:
--

Flavio, that suggestion worked perfectly! Simply restarting the leader caused a 
new round of leader election and things sorted themselves out within a few 
seconds. Thank you so much for such a prompt reply. Love this community! 

 ZooKeeper server unable to join established ensemble
 

 Key: ZOOKEEPER-1732
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1732
 Project: ZooKeeper
  Issue Type: Bug
  Components: leaderElection
Affects Versions: 3.4.5
 Environment: Windows 7, Java 1.7
Reporter: Germán Blanco
Assignee: Germán Blanco
Priority: Critical
 Fix For: 3.4.6, 3.5.0

 Attachments: CREATE_INCONSISTENCIES_patch.txt, zklog.tar.gz, 
 ZOOKEEPER-1732-3.4.patch, ZOOKEEPER-1732-3.4.patch, ZOOKEEPER-1732-3.4.patch, 
 ZOOKEEPER-1732.patch, ZOOKEEPER-1732.patch, ZOOKEEPER-1732.patch




[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-10-02 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13784565#comment-13784565
 ] 

Flavio Junqueira commented on ZOOKEEPER-1732:
-

I have downgraded this issue to Major. It is a corner case and unlikely to 
happen often, but we still need to fix it. 

I'm thinking that we should update the peer epoch at the end of syncWithLeader 
rather than where it is in registerWithLeader. After syncing, we know the 
current epoch, so we should just update it there. I was also thinking that 
we could update the zxid as well, although it doesn't matter too much.
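
A minimal sketch of the proposal above, under stated assumptions: VoteSketch and LearnerSketch are hypothetical, simplified stand-ins and not the actual ZooKeeper classes, and the method bodies only model where the peer epoch in the vote gets updated. The handshake step leaves the vote alone, and the vote is refreshed only once syncing has finished.

```java
// Hypothetical, simplified model of the proposal; not the actual ZooKeeper classes.
final class VoteSketch {
    final long leaderId;
    final long peerEpoch;

    VoteSketch(long leaderId, long peerEpoch) {
        this.leaderId = leaderId;
        this.peerEpoch = peerEpoch;
    }
}

final class LearnerSketch {
    private VoteSketch currentVote;

    LearnerSketch(VoteSketch voteFromElection) {
        this.currentVote = voteFromElection;
    }

    // Handshake only: learn the epoch the leader proposes, leave the vote untouched.
    long registerWithLeader(long epochProposedByLeader) {
        return epochProposedByLeader;
    }

    // Once syncing has finished the epoch is known, so record it in the vote here.
    void syncWithLeader(long newEpoch) {
        // ... the state transfer from the leader would happen here ...
        currentVote = new VoteSketch(currentVote.leaderId, newEpoch);
    }

    VoteSketch getCurrentVote() {
        return currentVote;
    }
}
```

With this ordering, a learner that fails between the handshake and the end of the sync still advertises the epoch of the vote it elected, not an epoch that was never established.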

The indentation is still wrong for me.

 ZooKeeper server unable to join established ensemble
 

 Key: ZOOKEEPER-1732
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1732
 Project: ZooKeeper
  Issue Type: Bug
  Components: leaderElection
Affects Versions: 3.4.5
 Environment: Windows 7, Java 1.7
Reporter: Germán Blanco
Assignee: Germán Blanco
 Fix For: 3.4.6, 3.5.0

 Attachments: CREATE_INCONSISTENCIES_patch.txt, zklog.tar.gz, 
 ZOOKEEPER-1732-3.4.patch, ZOOKEEPER-1732-3.4.patch, ZOOKEEPER-1732.patch, 
 ZOOKEEPER-1732.patch




[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-10-02 Thread JIRA

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13784807#comment-13784807
 ] 

Germán Blanco commented on ZOOKEEPER-1732:
--

I will update the patches as you propose. It also seems better to me to update 
this information when the ensemble is finally established.
With the current epoch I assume that you mean the new epoch for the ensemble 
and that this will have to be updated in the followers and in the leader (so no 
newEpoch-1, but newEpoch). If I misunderstood, please let me know.
I don't know how to choose a zxid that is agreed between the leader and all the 
followers. I will leave that for now, but if you know how to do it, please let 
me know.
I am assuming that you mean the modification in Learner.java for the 
indentation comment. The indentation was already wrong in that part of the 
code. If I am not wrong, there are 8 spaces and a tab in the lines above and 
below the modification. I believe I have put 8 spaces and a tab also in the 
modified lines. In the editors that I use it looks ok. If you want me to try to 
fix the indentation around the change, it is ok with me. If you mean another 
change or you don't see 8 spaces and a tab there or I should use a different 
combination for these lines, please let me know.

 ZooKeeper server unable to join established ensemble
 

 Key: ZOOKEEPER-1732
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1732
 Project: ZooKeeper
  Issue Type: Bug
  Components: leaderElection
Affects Versions: 3.4.5
 Environment: Windows 7, Java 1.7
Reporter: Germán Blanco
Assignee: Germán Blanco
 Fix For: 3.4.6, 3.5.0

 Attachments: CREATE_INCONSISTENCIES_patch.txt, zklog.tar.gz, 
 ZOOKEEPER-1732-3.4.patch, ZOOKEEPER-1732-3.4.patch, ZOOKEEPER-1732.patch, 
 ZOOKEEPER-1732.patch




[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-10-02 Thread JIRA

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13784822#comment-13784822
 ] 

Germán Blanco commented on ZOOKEEPER-1732:
--

Oooops! I just noticed that the change in Learner.java is the one that needs to 
be moved to another method, so if the last indentation comment was about that 
change, then please never mind about my response. Hopefully I will get the 
indentation right in the new method :-)

 ZooKeeper server unable to join established ensemble
 

 Key: ZOOKEEPER-1732
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1732
 Project: ZooKeeper
  Issue Type: Bug
  Components: leaderElection
Affects Versions: 3.4.5
 Environment: Windows 7, Java 1.7
Reporter: Germán Blanco
Assignee: Germán Blanco
 Fix For: 3.4.6, 3.5.0

 Attachments: CREATE_INCONSISTENCIES_patch.txt, zklog.tar.gz, 
 ZOOKEEPER-1732-3.4.patch, ZOOKEEPER-1732-3.4.patch, ZOOKEEPER-1732.patch, 
 ZOOKEEPER-1732.patch




[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-09-30 Thread JIRA

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13781639#comment-13781639
 ] 

Germán Blanco commented on ZOOKEEPER-1732:
--

Thanks a lot for taking a look at this, and for your comments.
1 - I guess you mean the if (n.leader != self.getId()) check in 
FastLeaderElection.java. I will remove that.
2 - This is because the final election vote in the leader will have newEpoch - 
1. The winning vote will contain the epoch of the leader, but once the 
election finishes, the leader increments the epoch. So actually newEpoch is 
the epoch that won the election plus one. And the vote reported in the Fast 
Leader Election is the vote that won the election. If we set the vote to just 
newEpoch, then we need to update the vote also in the Leader, and not only on 
Learners as it is currently done. At least that is how I think it works.
3 - I will fix indentation and add the comment. The indentation is funny around 
those lines though.
4 - I will change the name and change to protected.
5 - Also will do.
Patches will be uploaded in some minutes.
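
Point 2 above can be illustrated with a small sketch of the epoch arithmetic. EpochSketch and its method names are hypothetical and exist only for this illustration. They are not part of ZooKeeper.

```java
// Hypothetical helper illustrating the epoch arithmetic only.
final class EpochSketch {
    // The leader starts a new epoch one above the epoch carried by the winning vote.
    static long newEpoch(long winningVoteEpoch) {
        return winningVoteEpoch + 1;
    }

    // Two reported votes can only be matched if they carry the same epoch.
    static boolean votesAgree(long leaderVoteEpoch, long learnerVoteEpoch) {
        return leaderVoteEpoch == learnerVoteEpoch;
    }
}
```

For example, if the winning vote carries epoch 4, then newEpoch is 5. A learner that stores 5 while the leader still reports 4 does not agree with it. Storing newEpoch - 1 on the learners, or storing newEpoch on both the learners and the leader, keeps the reported votes consistent.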


 ZooKeeper server unable to join established ensemble
 

 Key: ZOOKEEPER-1732
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1732
 Project: ZooKeeper
  Issue Type: Bug
  Components: leaderElection
Affects Versions: 3.4.5
 Environment: Windows 7, Java 1.7
Reporter: Germán Blanco
Assignee: Germán Blanco
Priority: Blocker
 Fix For: 3.5.0, 3.4.6

 Attachments: CREATE_INCONSISTENCIES_patch.txt, zklog.tar.gz, 
 ZOOKEEPER-1732-3.4.patch, ZOOKEEPER-1732.patch




[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-09-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13781747#comment-13781747
 ] 

Hadoop QA commented on ZOOKEEPER-1732:
--

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12605873/ZOOKEEPER-1732.patch
  against trunk revision 1527398.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1606//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1606//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1606//console

This message is automatically generated.

 ZooKeeper server unable to join established ensemble
 

 Key: ZOOKEEPER-1732
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1732
 Project: ZooKeeper
  Issue Type: Bug
  Components: leaderElection
Affects Versions: 3.4.5
 Environment: Windows 7, Java 1.7
Reporter: Germán Blanco
Assignee: Germán Blanco
Priority: Blocker
 Fix For: 3.5.0, 3.4.6

 Attachments: CREATE_INCONSISTENCIES_patch.txt, zklog.tar.gz, 
 ZOOKEEPER-1732-3.4.patch, ZOOKEEPER-1732-3.4.patch, ZOOKEEPER-1732.patch, 
 ZOOKEEPER-1732.patch




[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-09-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13781430#comment-13781430
 ] 

Hadoop QA commented on ZOOKEEPER-1732:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12604546/ZOOKEEPER-1732-3.4.patch
  against trunk revision 1527129.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

-1 patch.  The patch command could not apply the patch.

Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1604//console

This message is automatically generated.

 ZooKeeper server unable to join established ensemble
 

 Key: ZOOKEEPER-1732
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1732
 Project: ZooKeeper
  Issue Type: Bug
  Components: leaderElection
Affects Versions: 3.4.5
 Environment: Windows 7, Java 1.7
Reporter: Germán Blanco
Assignee: Germán Blanco
Priority: Blocker
 Fix For: 3.5.0, 3.4.6

 Attachments: CREATE_INCONSISTENCIES_patch.txt, zklog.tar.gz, 
 ZOOKEEPER-1732-3.4.patch, ZOOKEEPER-1732.patch




[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-08-30 Thread JIRA

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13754728#comment-13754728
 ] 

János Grásl commented on ZOOKEEPER-1732:


Sorry for the last comment, OoO assistant replied to Germán's comment.

 ZooKeeper server unable to join established ensemble
 

 Key: ZOOKEEPER-1732
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1732
 Project: ZooKeeper
  Issue Type: Bug
  Components: leaderElection
Affects Versions: 3.4.5
 Environment: Windows 7, Java 1.7
Reporter: Germán Blanco
Assignee: Germán Blanco
Priority: Blocker
 Fix For: 3.5.0

 Attachments: CREATE_INCONSISTENCIES_patch.txt, zklog.tar.gz, 
 ZOOKEEPER-1732.patch



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-08-28 Thread JIRA

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13752196#comment-13752196
 ] 

Germán Blanco commented on ZOOKEEPER-1732:
--

Could anybody please review this?

 ZooKeeper server unable to join established ensemble
 

 Key: ZOOKEEPER-1732
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1732
 Project: ZooKeeper
  Issue Type: Bug
  Components: leaderElection
Affects Versions: 3.4.5
 Environment: Windows 7, Java 1.7
Reporter: Germán Blanco
Assignee: Germán Blanco
Priority: Blocker
 Fix For: 3.5.0

 Attachments: CREATE_INCONSISTENCIES_patch.txt, zklog.tar.gz, 
 ZOOKEEPER-1732.patch




[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-08-28 Thread JIRA

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13752197#comment-13752197
 ] 

János Grásl commented on ZOOKEEPER-1732:


-
I am out of the office from 2013-08-26 to 2013-08-31. I check my emails on 
2013-09-01.
If you need immediate assistance or information about question, please contact 
Csaba Nagy(ecsanag).
This is an auto reply message sent by my Out of Office Assistance.
We only send and receive email on the basis of the term set out at
http://www.ericsson.com/email_disclaimer
-


 ZooKeeper server unable to join established ensemble
 

 Key: ZOOKEEPER-1732
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1732
 Project: ZooKeeper
  Issue Type: Bug
  Components: leaderElection
Affects Versions: 3.4.5
 Environment: Windows 7, Java 1.7
Reporter: Germán Blanco
Assignee: Germán Blanco
Priority: Blocker
 Fix For: 3.5.0

 Attachments: CREATE_INCONSISTENCIES_patch.txt, zklog.tar.gz, 
 ZOOKEEPER-1732.patch




[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-08-06 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13730980#comment-13730980
 ] 

Hadoop QA commented on ZOOKEEPER-1732:
--

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12596368/ZOOKEEPER-1732.patch
  against trunk revision 1503101.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1529//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1529//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1529//console

This message is automatically generated.

 ZooKeeper server unable to join established ensemble
 

 Key: ZOOKEEPER-1732
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1732
 Project: ZooKeeper
  Issue Type: Bug
  Components: leaderElection
Affects Versions: 3.4.5
 Environment: Windows 7, Java 1.7
Reporter: Germán Blanco
Priority: Blocker
 Fix For: 3.5.0

 Attachments: CREATE_INCONSISTENCIES_patch.txt, zklog.tar.gz, 
 ZOOKEEPER-1732.patch




[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-07-30 Thread JIRA

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13724085#comment-13724085
 ] 

Germán Blanco commented on ZOOKEEPER-1732:
--

I think now that we also need to do something with the peerEpoch. I can't 
explain why this hasn't failed so far in my tests, maybe the corner case 
causing this problem is even more unlikely than what I thought. But the peer 
epoch value does get sent from the leader to the follower after election, 
right? So it would be possible to just update the value in the leader election 
information of the follower, during the synchronization phase of the Zab 
protocol, instead of loosening the restriction. In that way, there will be at 
least one check verifying that all the votes come from an ensemble established 
with the same epoch.
What do you think?
I will also run the tests again with a trace to see when the inconsistent 
ensemble is created.

 ZooKeeper server unable to join established ensemble
 

 Key: ZOOKEEPER-1732
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1732
 Project: ZooKeeper
  Issue Type: Bug
  Components: leaderElection
Affects Versions: 3.4.5
 Environment: Windows 7, Java 1.7
Reporter: Germán Blanco
Priority: Blocker
 Fix For: 3.5.0, 3.4.6

 Attachments: zklog.tar.gz, ZOOKEEPER-1732.patch




[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-07-26 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13720595#comment-13720595
 ] 

Flavio Junqueira commented on ZOOKEEPER-1732:
-

bq. joining an ensemble that votes me as the leader.

I'm ok with removing it, this is an optimization. If the leader is being 
re-elected, then it means that the ensemble it is trying to join is not 
functional, since the leader is not present. 

To do it, you might as well check the change of ZOOKEEPER-1514 in checkLeader. 
I think the if block you added is not necessary if you make the change in 
checkLeader. 

bq. taking into account my own votes or votes that put me as a leader when 
joining an ensemble.

I don't think we are currently taking into account the vote of a LOOKING server 
when processing FOLLOWING/LEADING notifications. If you're talking about 
endVote, this is the vote corresponding to the leader it elected.

bq. removing the check for the election round when joining an established 
ensemble.

Let me give some insight here first. We need to have servers joining an 
established ensemble because a server may find that a quorum is already 
following some leader and if it follows the standard procedure of processing 
notifications, then there are some corner cases that can cause it to keep 
electing some other server that is also looking.

The danger of joining an established ensemble is the following. Say that a 
minority of followers support a leader L, and a majority M supports L'. L' has 
enough supporters and is able to commit txns. Now say that a server S in the 
ensemble of L' crashes and recovers. S talks to L and its minority now forming 
a majority (say there was one server missing to form a majority). L will tell 
all servers in its ensemble to truncate causing some txns to be lost. 

We have a couple of mechanisms that prevent this incorrect truncation from 
happening. First, S needs to receive FOLLOWING/LEADING notifications from a 
quorum, not including itself. In this case, the incorrect truncation only 
happens if S receives a stale message, a message from a server S' that later on 
followed L'. We prevent this case by allowing at most one outstanding 
notification in the QuorumCnxManager queue of each peer. If S' has followed L', 
then its notification must reflect it and S won't receive such a stale message.
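
The single-outstanding-notification rule described above can be sketched as a one-slot mailbox per peer. This is only an illustration under assumptions: LatestNotificationSlot is a hypothetical class and is not how QuorumCnxManager is actually written. It shows the property being relied on, namely that a newer notification replaces an undelivered older one.

```java
// Hypothetical one-slot mailbox: holds at most one undelivered notification per peer.
final class LatestNotificationSlot<T> {
    private T pending;

    // A newer notification overwrites any notification that has not been sent yet.
    synchronized void put(T notification) {
        pending = notification;
    }

    // Returns the pending notification and empties the slot, or null if the slot is empty.
    synchronized T take() {
        T next = pending;
        pending = null;
        return next;
    }
}
```

If S' first queues "following L" and then, before it is delivered, queues "following L'", the receiver only ever sees the second message.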

Overall it sounds fine to only consider the server the followers are following. 
Note that not only the round could be different, but I believe the zxid could 
also be different.




 ZooKeeper server unable to join established ensemble
 

 Key: ZOOKEEPER-1732
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1732
 Project: ZooKeeper
  Issue Type: Bug
  Components: leaderElection
Affects Versions: 3.4.5
 Environment: Windows 7, Java 1.7
Reporter: Germán Blanco
Priority: Blocker
 Fix For: 3.5.0, 3.4.6

 Attachments: test_loosen_restrictions.tar.gz, zklog.tar.gz, 
 ZOOKEEPER-1732-LOOSEN_RESTRICTIONS.patch




[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-07-26 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13720835#comment-13720835
 ] 

Hadoop QA commented on ZOOKEEPER-1732:
--

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12594398/ZOOKEEPER-1732.patch
  against trunk revision 1503101.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1522//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1522//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1522//console

This message is automatically generated.

 ZooKeeper server unable to join established ensemble
 

 Key: ZOOKEEPER-1732
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1732
 Project: ZooKeeper
  Issue Type: Bug
  Components: leaderElection
Affects Versions: 3.4.5
 Environment: Windows 7, Java 1.7
Reporter: Germán Blanco
Priority: Blocker
 Fix For: 3.5.0, 3.4.6

 Attachments: zklog.tar.gz, ZOOKEEPER-1732.patch




[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-07-25 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13719402#comment-13719402
 ] 

Flavio Junqueira commented on ZOOKEEPER-1732:
-

By agree to vote, don't you need a different message pattern, even if the 
message content is the same? You're still changing the protocol here. Also, we 
don't need agreement, since different processes can have a different opinion 
about who the leader should be. They need to agree before they start a new 
epoch, but that's precisely what the recovery phase of zab does. It does a bit 
more actually, but the whole state sync up is not relevant to this discussion.

bq. it actually doesn't take part in the leader election logic

This is not entirely true, the LE step exposes a leader that has the highest 
zxid among a quorum of servers. Also, I think that you're using LE as the 
recovery phase of Zab, not as the initial protocol that finds a prospective 
leader.

bq. The new server just checks if the ensemble has a quorum and the leader is 
alive (sends a notification voting for itself)

I believe we have discussed this point in this jira. As you have observed, the 
ensemble is still able to make progress in the situation you have originally 
described, so the inconsistent LE information doesn't prevent zookeeper from 
doing work. The problem is getting a server stuck, which we fix by making sure 
that a follower is able to send notifications with state that reflects the 
latest leader election. 

One option I was actually considering is to loosen the constraint that all 
FOLLOWING/LEADING notifications need to come from the same LE round. This is 
possibly too conservative, so it might be ok to change it, but I need to think 
a bit more about it.
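
To make the option above concrete, here is a rough sketch of the two rules side by side. NotificationSketch and JoinCheckSketch are hypothetical names introduced for this illustration, the array is assumed to hold one FOLLOWING/LEADING notification per distinct sender, and whether the loosened rule is safe is exactly what still needs thought.

```java
// Hypothetical types; a real notification carries more fields (zxid, peer epoch, state).
final class NotificationSketch {
    final long senderId;
    final long leaderId;
    final long electionRound;

    NotificationSketch(long senderId, long leaderId, long electionRound) {
        this.senderId = senderId;
        this.leaderId = leaderId;
        this.electionRound = electionRound;
    }
}

final class JoinCheckSketch {
    // Strict rule: supporters must agree on the leader and on the election round.
    static boolean strictQuorum(NotificationSketch[] received, long leaderId,
                                long electionRound, int ensembleSize) {
        int supporters = 0;
        for (NotificationSketch n : received) {
            if (n.leaderId == leaderId && n.electionRound == electionRound) {
                supporters++;
            }
        }
        return supporters > ensembleSize / 2;
    }

    // Loosened rule: only the leader being followed has to match.
    static boolean loosenedQuorum(NotificationSketch[] received, long leaderId,
                                  int ensembleSize) {
        int supporters = 0;
        for (NotificationSketch n : received) {
            if (n.leaderId == leaderId) {
                supporters++;
            }
        }
        return supporters > ensembleSize / 2;
    }
}
```

In a three-server ensemble where servers 1 and 2 both report leader 2 but with rounds 5 and 6, the strict rule finds no quorum for either round, while the loosened rule accepts leader 2.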

 ZooKeeper server unable to join established ensemble
 

 Key: ZOOKEEPER-1732
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1732
 Project: ZooKeeper
  Issue Type: Bug
  Components: leaderElection
Affects Versions: 3.4.5
 Environment: Windows 7, Java 1.7
Reporter: Germán Blanco
Priority: Blocker
 Fix For: 3.5.0, 3.4.6

 Attachments: zklog.tar.gz




[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-07-25 Thread JIRA

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13719483#comment-13719483
 ] 

Germán Blanco commented on ZOOKEEPER-1732:
--

The proposal meant to set the election epoch to the same value during the 
initial phase of the Zab protocol. That same value would be the proposed 
epoch in the LEADERINFO structure. As you say, it is a change in the protocol, 
even if messages have the same information. And it wouldn't work if there are 
servers in the quorum with different behaviours (e.g. 3.4.5 and 3.4.6 with this 
change implemented), since they will end up reporting different election epochs.
I hope loosening the constraint works; that would really be an easy solution :-)




[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-07-23 Thread JIRA

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13716152#comment-13716152
 ] 

Germán Blanco commented on ZOOKEEPER-1732:
--

That sounds good to me.
I still see option 3 as a more direct way of solving the problem, but it does 
involve quite a mess with updating the protocol and test cases and so on for 
such a corner case.
It seems that there are no more opinions on this.
Is it ok if I prepare a patch with the change in the leader election process 
that you suggest?
The test case might be a bit tricky, do you have any suggestion for that?



[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-07-23 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13716210#comment-13716210
 ] 

Flavio Junqueira commented on ZOOKEEPER-1732:
-

I think you say that 3 is a more direct way of solving the problem because we 
would be enforcing that a follower is following the right instance of the 
leader. It is a fair observation, although I tend to think that leader election 
is unreliable in nature, so I can really go either way. 

Given that you are keen on implementing the changes to the recovery handshake, 
what if we try to outline the precise changes in both cases and try to 
determine which one to go with once we have that? Here are some initial thoughts.

For option 3, we add LE information to the LearnerInfo message. The leader 
checks the version of the protocol and uses the new information in LearnerInfo 
in the case the protocol version is appropriate. In the case the leader 
instance information doesn't match, we have two choices:

# The leader drops the connection to the follower and the follower goes back to 
LOOKING;
# The leader sends to the follower the LE instance information and the follower 
updates its vote info.

For the second option, I believe we would need a new message, since LEADERINFO 
only contains a zxid. I'd rather avoid adding another message, though.

If we don't change the recovery handshake and use the other approach I 
outlined, then I believe all changes are concentrated in the FLE class, perhaps 
some in QuorumPeer as well, I'm not sure. We just need to call 
sendNotifications() upon receiving a notification while leading. For a 
follower, when receiving a notification from a LEADING server, it checks if its 
vote is still valid, updating otherwise. 
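
A rough Java sketch of the follower-side check described above, under simplifying assumptions: SimpleVote and FollowerVoteRefreshSketch are hypothetical names, the vote carries only leader id, zxid and LE round, and a notification is acted upon only if it comes from the server the follower already follows. It is not the FastLeaderElection code.

```java
/** Hypothetical, simplified vote; ZooKeeper's real Vote class carries more fields. */
final class SimpleVote {
    final long leader;  // id of the elected leader
    final long zxid;    // zxid the leader was elected with
    final long round;   // LE round (election epoch)

    SimpleVote(long leader, long zxid, long round) {
        this.leader = leader;
        this.zxid = zxid;
        this.round = round;
    }

    boolean sameAs(SimpleVote other) {
        return leader == other.leader && zxid == other.zxid && round == other.round;
    }
}

final class FollowerVoteRefreshSketch {
    /**
     * Called by a FOLLOWING server when a notification arrives from a LEADING server.
     * Returns the vote the follower should hold afterwards.
     */
    static SimpleVote onNotificationFromLeading(SimpleVote current, long senderId, SimpleVote senderVote) {
        // Assumption: only the leader we already follow may refresh our vote.
        boolean fromMyLeader = senderId == current.leader && senderVote.leader == current.leader;
        if (fromMyLeader && !current.sameAs(senderVote)) {
            return senderVote;  // our LE information is stale, adopt the leader's
        }
        return current;         // still valid, or not from our leader: keep it
    }
}
```

A mock-server test could then feed the follower a notification with a newer round and assert that the returned vote matches the leader's.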


bq. Is it ok if I prepare a patch with the change in the leader election 
process that you suggest?

Sure, it would be great if you could propose a patch, independent of the 
approach we end up choosing.

bq. The test case might be a bit tricky, do you have any suggestion for that?

There are some FLE test cases that implement a mock server. I think we should 
do something similar here. Instead of trying to reproduce the race, we could 
just test that the follower correctly updates its information upon receiving a 
notification.
 



[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-07-23 Thread JIRA

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13716227#comment-13716227
 ] 

Germán Blanco commented on ZOOKEEPER-1732:
--

bq. For option 3, we add LE information to the LearnerInfo message ... 
How would you add the information to the LearnerInfo message? If the receiving 
side doesn't support the same version of the protocol it will not be able to 
parse the message, right? Do you have in mind using some currently unused field? 
My suggestion was that, in the message with LearnerInfo, the Learner only 
reports that it supports the additional information, i.e. by sending 0x10001 as 
the protocol version. The Leader sees this and includes the leader election 
information together with the LeaderInfo only if the Learner supports this 
additional information. The Learner receives the message and reads the leader 
election info only if the protocol version supported by the Leader is also 
0x10001. At this point the Learner can just update its leader election 
information with the one it got from the Leader. No new message that way :-)
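
As an illustration only, the negotiation could be sketched in Java like this. The payload layout (a 4-byte version followed by an optional 8-byte LE round), the constant BASE_VERSION and the class HandshakeSketch are assumptions made for the example; they are not the actual LEADERINFO wire format.

```java
import java.nio.ByteBuffer;

final class HandshakeSketch {
    static final int BASE_VERSION = 0x10000;     // assumed version without LE info
    static final int LE_INFO_VERSION = 0x10001;  // bumped version from the comment above

    /** Leader side: append the LE round only when both ends advertise the bumped version. */
    static byte[] leaderInfoPayload(int leaderVersion, int learnerVersion, long leRound) {
        boolean extended = leaderVersion >= LE_INFO_VERSION && learnerVersion >= LE_INFO_VERSION;
        ByteBuffer buf = ByteBuffer.allocate(extended ? 12 : 4);
        buf.putInt(extended ? LE_INFO_VERSION : BASE_VERSION);
        if (extended) {
            buf.putLong(leRound);
        }
        return buf.array();
    }

    /** Learner side: returns the LE round, or -1 if the leader did not include it. */
    static long readLeRound(byte[] payload) {
        ByteBuffer buf = ByteBuffer.wrap(payload);
        int version = buf.getInt();
        if (version >= LE_INFO_VERSION && buf.remaining() >= 8) {
            return buf.getLong();
        }
        return -1L;
    }
}
```

An old learner never sees the extra bytes, and a new learner talking to an old leader simply gets -1 and keeps its current vote.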
 
bq. If we don't change the recovery handshake and use the other approach I 
outlined, then I believe all changes are concentrated in the FLE class, ...

I also see it that way, and I would do it exactly as you say. It is around 20 
lines of code near the end of FastLeaderElection$Messenger$WorkerReceiver, 
something like this: [https://gist.github.com/germanblanco/6060741]. I don't 
see any changes in QuorumPeer, but maybe I am missing something.

bq. There are some FLE test cases that implement a mock server. I think we 
should do something similar here. Instead of trying to reproduce the race, we 
could just test that the follower correctly updates its information upon 
receiving a notification.

Sounds very good.



[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-07-23 Thread JIRA

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13716234#comment-13716234
 ] 

Germán Blanco commented on ZOOKEEPER-1732:
--

The changes in option 3 could be something like this: 
[https://gist.github.com/germanblanco/6060844].
The regression test doesn't work for these changes, so they still need work, 
but maybe it helps to explain my intention.



[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-07-22 Thread JIRA

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13714965#comment-13714965
 ] 

Germán Blanco commented on ZOOKEEPER-1732:
--

I tried to refresh the proposal simply by calling updateProposal(getInitId(), 
getInitLastLoggedZxid(), getPeerEpoch()) before sending a notification when 
the server that sends it is part of an established ensemble. The test didn't 
run for long enough, for other reasons, but I now think that it can't work 
anyway. Reading your alternatives and the way Votes are compared, I see that 
zxid and epoch need to be the same in all members of the ensemble, and in this 
race the follower hasn't received the zxid that the leader used to finish the 
election.
My personal preference would be 3. Because it is faster (follower doesn't go 
back to LOOKING, it can just update the proposal with the info in LeaderInfo), 
and it doesn't depend on any more races that could lead to the follower not 
processing the notification from the leader.
If the protocol backward compatibility issues are just more work, then I will 
be very willing to help as much as I can.



[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-07-22 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13714986#comment-13714986
 ] 

Flavio Junqueira commented on ZOOKEEPER-1732:
-

bq. My personal preference would be 3. Because it is faster (follower doesn't 
go back to LOOKING, it can just update the proposal with the info in LeaderInfo)

I guess my description was not clear for 3. The idea was that the learner sends 
its LE info so that the leader can drop it if the learner info is stale, so the 
follower goes back to looking in this option as well. 

I'm not worried about making this case fast because this is such a corner case: 
a follower f elects a leader l, l crashes, l comes back, l is re-elected by a 
quorum that does not include f, f is able to connect to l.



[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-07-22 Thread JIRA

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13715026#comment-13715026
 ] 

Germán Blanco commented on ZOOKEEPER-1732:
--

I see, trying to make this a bit faster doesn't make sense at all.
Sorry, but I am confused about the epoch handling in the initial negotiation 
between the Follower and the Leader.
The current FOLLOWERINFO QuorumPacket already seems to contain the 
acceptedEpoch. Isn't it already possible for the Leader to check that value 
and reject the connection if it is wrong?



[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-07-22 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13715042#comment-13715042
 ] 

Flavio Junqueira commented on ZOOKEEPER-1732:
-

Right, this is a bit confusing. Epoch in the context of leader election is not 
the same as the epoch of zxids, that's one reason why I often call it LE round, 
to disambiguate. The epoch you're referring to has to do with txn identifiers 
(zxids), not LE epochs (or rounds). 
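
For readers following along: a zxid is a 64-bit number with the epoch in the high 32 bits and a per-epoch counter in the low 32 bits, while the LE round is a separate counter kept by leader election. The small Java sketch below only illustrates the zxid layout; ZxidEpochSketch and its method names are made up for the example.

```java
final class ZxidEpochSketch {
    /** Epoch of a transaction id: the high 32 bits. */
    static long epochOf(long zxid) {
        return zxid >> 32L;
    }

    /** Counter within the epoch: the low 32 bits. */
    static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }

    /** Builds a zxid from an epoch and a counter. */
    static long makeZxid(long epoch, long counter) {
        return (epoch << 32L) | (counter & 0xffffffffL);
    }
}
```

Nothing in this value says which LE round elected the leader, which is why checking it during the handshake does not catch a follower with stale election information.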



[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-07-22 Thread JIRA

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13715056#comment-13715056
 ] 

Germán Blanco commented on ZOOKEEPER-1732:
--

Answering myself, I guess the acceptedEpoch in FOLLOWERINFO is not the epoch 
that was seen last in the election, but the epoch of the last transaction 
recorded by the Follower. So it doesn't help in this case.



[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-07-22 Thread JIRA

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13715057#comment-13715057
 ] 

Germán Blanco commented on ZOOKEEPER-1732:
--

Sorry, I forgot to refresh the page :-)



[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-07-22 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13715060#comment-13715060
 ] 

Flavio Junqueira commented on ZOOKEEPER-1732:
-

;-)



[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-07-22 Thread JIRA

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13715069#comment-13715069
 ] 

Germán Blanco commented on ZOOKEEPER-1732:
--

I am thinking that, given that we don't care if this is solved a few 
milliseconds later, one way to handle the backward compatibility issues would 
be this:
- Follower/Observer sends an increased protocol number in 
FOLLOWERINFO/OBSERVERINFO.
- When the Leader sees the new version, it sends the LE epoch and zxid in 
LeaderInfo.
- Follower/Observer check if the Leader has an updated protocol version and if 
it does, they read the LE epoch and zxid from LeaderInfo. If they are not the 
same as the LE epoch and zxid that they have, they start LOOKING again.

Does that sound ok?
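
The last step of that proposal, sketched in Java under the assumption that the learner has already parsed the leader's LE epoch and zxid. LeaderInfoCheckSketch, Next and check are hypothetical names, not existing ZooKeeper code.

```java
final class LeaderInfoCheckSketch {
    enum Next { KEEP_FOLLOWING, START_LOOKING }

    /**
     * Decides what a Follower/Observer does after reading LeaderInfo.
     * If the leader speaks the old protocol there is nothing to compare,
     * so the learner behaves as it does today.
     */
    static Next check(boolean leaderSupportsLeInfo,
                      long myLeEpoch, long myZxid,
                      long leaderLeEpoch, long leaderZxid) {
        if (!leaderSupportsLeInfo) {
            return Next.KEEP_FOLLOWING;
        }
        boolean matches = myLeEpoch == leaderLeEpoch && myZxid == leaderZxid;
        return matches ? Next.KEEP_FOLLOWING : Next.START_LOOKING;
    }
}
```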



[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-07-22 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13715077#comment-13715077
 ] 

Flavio Junqueira commented on ZOOKEEPER-1732:
-

I'm not sure why you think it is better to touch the Zab protocol rather than 
simply stop following if the follower vote is outdated. It is the simplest 
thing we can do and it is correct. 

I don't think touching the Zab protocol is a good idea, so I still prefer 
option 1 above.



[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-07-22 Thread JIRA

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13715089#comment-13715089
 ] 

Germán Blanco commented on ZOOKEEPER-1732:
--

I would prefer not to touch the Zab protocol as well, but I don't want to leave 
any loose ends. I thought that there was no guarantee that the Follower will 
receive a notification from the Leader after it connects. 
Do you think that this will always happen?
This notification is sent through the leader election port, so I guess that it 
is difficult to make assumptions about when it reaches its destination 
relative to when the Follower connects to the Leader. And going deeper into the 
unlikely combinations, the leader election connection could be lost at some 
point while Follower and Leader remain connected through the Zab connection.



[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-07-22 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13715115#comment-13715115
 ] 

Flavio Junqueira commented on ZOOKEEPER-1732:
-

I understand your concern, it is a valid one. Our goal however shouldn't be to 
have every follower having the most recent notification from the leader. Our 
goal is to have a leader that has enough supporters so that we can make 
progress. If all servers are either following/leading, then it doesn't matter 
if a follower has stale LE information. But, it does become an issue in the 
case that you uncovered with your logs. In this scenario, the stale follower 
will receive a more recent notification from either the leader, when it is 
trying to be re-elected, or from the follower that is stuck. In either case, it 
will be able to determine that it is stale and stop following.

The bottom line is that the follower stops following only if it realizes that 
it is stale. If it doesn't hear anything, then it just keeps going. Does it 
work?

About changing the protocol, in my experience, changing messages is a pain 
because there are many subtle cases and it is quite easy to get it wrong. It is 
best not to touch it. I think we can take care of this case without really 
changing the messages we are sending. 



[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-07-22 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13715120#comment-13715120
 ] 

Flavio Junqueira commented on ZOOKEEPER-1732:
-

... I should have said from the server that is stuck, not from the follower 
that is stuck. 



[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-07-22 Thread JIRA

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13715130#comment-13715130
 ] 

Germán Blanco commented on ZOOKEEPER-1732:
--

If I am not wrong, the follower that is stuck never accepts the leadership of 
the Leader in the ensemble, because it is an established ensemble and it sees 
no quorum in it. So it will only send notifications proposing itself as the 
leader. And the leader of the ensemble sends the Notifications only to the 
follower that is stuck, right? So there is actually no chance for the stale 
follower to receive the updated leader election information after the initial 
election is finished.
I agree that one must be careful when changing network protocols and dealing 
with backwards compatibility, but if you asked me, I think that it is much 
easier to make mistakes doing multithreaded concurrent Java programming :-P



[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-07-22 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13715474#comment-13715474
 ] 

Flavio Junqueira commented on ZOOKEEPER-1732:
-

bq. there is actually no chance for the stale follower to receive the updated 
leader election information after the initial election is finished.

What if the leader, upon receiving a notification, sends a batch of 
notifications instead of responding only to the sender? This way we can perform 
the check I was mentioning before, and there is no real change to the protocol; 
we are just sending more messages every now and then. 





[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-07-22 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13715514#comment-13715514
 ] 

Flavio Junqueira commented on ZOOKEEPER-1732:
-

I was actually thinking that if we do what I just proposed above, then the 
follower should just update its vote. If a server is following another that it 
believes to be the leader, then it doesn't matter how it got there; what matters 
is that one believes it is following and the other believes it is leading. More 
concretely, my proposal is to have the leader broadcast its notification 
upon receiving a notification from a server that is LOOKING. If a follower 
receives a notification from the guy it is following and it notices that its 
own vote is outdated, then it updates its vote accordingly. 
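
The leader-side half of this proposal could be sketched roughly as follows. This is only an illustration under simplifying assumptions: LeaderBroadcastSketch, its State enum and the recipients() method are hypothetical names, not part of FastLeaderElection, and server ids are plain longs.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these names exist in ZooKeeper. */
final class LeaderBroadcastSketch {
    enum State { LOOKING, FOLLOWING, LEADING }

    /**
     * Returns the server ids the leader would send its own notification to
     * after receiving a notification from senderId.
     *
     * broadcast == false models replying to the sender only;
     * broadcast == true models the proposed behaviour of notifying every
     * other server, so that a follower holding an old vote sees the new one.
     */
    static List<Long> recipients(long myId, List<Long> ensemble,
                                 long senderId, State senderState,
                                 boolean broadcast) {
        List<Long> out = new ArrayList<>();
        if (senderState != State.LOOKING) {
            return out; // assumption: only a LOOKING sender triggers a reply
        }
        if (!broadcast) {
            out.add(senderId);
            return out;
        }
        for (long sid : ensemble) {
            if (sid != myId) {
                out.add(sid); // everyone except the leader itself
            }
        }
        return out;
    }
}
```

With an ensemble of servers 1, 2 and 3 led by Server 3, a notification from a LOOKING Server 1 would reach only Server 1 in the reply-only case, but both Server 1 and Server 2 in the broadcast case. The cost is a few extra messages, not a change to the message format.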



[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-07-21 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13714783#comment-13714783
 ] 

Flavio Junqueira commented on ZOOKEEPER-1732:
-

bq. If that is the case, then it would be solved by refreshing the proposal 
after the Follower connects to the Leader, right?

How would you refresh the proposal?

In my view, there are three ways to solve this issue:

# If a follower receives a notification from the guy it thinks it is following 
and the notification is more recent (later round or higher zxid), then it stops 
following and goes back to LOOKING.
# Same as before but instead of transitioning to LOOKING, the follower updates 
its notification and keeps trying to connect to the leader.
# We change LearnerInfo so that it carries leader election information for 
verification purposes.

I think we should do the first one. The second one bypasses the LE protocol, so 
I'm not in favor of going that direction, although it might not break anything 
if we do it. The third option changes the protocol, so it is a bit of a pain to 
deal with the backward compatibility stuff.  
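
To make the difference between the first two options concrete, here is a minimal sketch of how a follower might react in each case. It assumes a heavily simplified model: StaleVoteSketch and everything inside it are hypothetical names, not ZooKeeper classes, and "more recent" is taken literally as "later round or higher zxid".

```java
/** Hypothetical, simplified model; none of these types are ZooKeeper classes. */
final class StaleVoteSketch {
    enum State { LOOKING, FOLLOWING }
    enum Option { BACK_TO_LOOKING, UPDATE_VOTE }

    static final class Vote {
        final long leader;
        final long zxid;
        final long round;

        Vote(long leader, long zxid, long round) {
            this.leader = leader;
            this.zxid = zxid;
            this.round = round;
        }
    }

    /** "More recent" as described above: later round or higher zxid. */
    static boolean isMoreRecent(Vote received, Vote mine) {
        return received.round > mine.round || received.zxid > mine.zxid;
    }

    static final class Follower {
        State state = State.FOLLOWING;
        Vote vote;

        Follower(Vote vote) {
            this.vote = vote;
        }

        void onNotification(long senderId, Vote received, Option option) {
            // Only notifications from the server we think we follow matter here.
            if (state != State.FOLLOWING || senderId != vote.leader) {
                return;
            }
            if (!isMoreRecent(received, vote)) {
                return;
            }
            if (option == Option.BACK_TO_LOOKING) {
                state = State.LOOKING; // option 1: rejoin leader election
            } else {
                vote = received;       // option 2: refresh the vote in place
            }
        }
    }
}
```

Under option 1 the follower re-enters leader election and picks up the current round through the normal protocol. Under option 2 it stays in FOLLOWING and only its advertised vote changes, which is the part that bypasses the LE protocol.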

 



[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-07-20 Thread JIRA

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13714506#comment-13714506
 ] 

Germán Blanco commented on ZOOKEEPER-1732:
--

Thanks a lot for your analysis, Flavio!
If that is the case, then it would be solved by refreshing the proposal after 
the Follower connects to the Leader, right?
I will try a modification that refreshes the proposal and run the test again 
to check what happens.



[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-07-19 Thread JIRA

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13713801#comment-13713801
 ] 

Germán Blanco commented on ZOOKEEPER-1732:
--

It seems that the two servers in the ensemble are sending Notifications with a 
different peerEpoch to the one outside the ensemble:
2013-07-19 10:17:00,833 [myid:1] - INFO  
[WorkerReceiver[myid=1]:FastLeaderElection@542] - Notification: 3 (n.leader), 
0xb80099 (n.zxid), 0xb9 (n.round), FOLLOWING (n.state), 2 (n.sid), 0xb8 
(n.peerEPoch), LOOKING (my state)
2013-07-19 10:17:00,833 [myid:1] - INFO  
[WorkerReceiver[myid=1]:FastLeaderElection@542] - Notification: 3 (n.leader), 
0xb90052 (n.zxid), 0xba (n.round), LEADING (n.state), 3 (n.sid), 0xb9 
(n.peerEPoch), LOOKING (my state)
Is that correct?
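
For anyone who wants to check this across the whole log instead of by eye, a small helper along these lines could extract the fields and flag the mismatch. It is only a sketch: NotificationLine is a hypothetical class, and the pattern assumes the exact "Notification: ..." layout shown in the two entries above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; fields mirror the labels printed in the log entry. */
final class NotificationLine {
    private static final Pattern FORMAT = Pattern.compile(
        "Notification: (\\d+) \\(n\\.leader\\), (0x[0-9a-fA-F]+) \\(n\\.zxid\\), "
        + "(0x[0-9a-fA-F]+) \\(n\\.round\\), (\\w+) \\(n\\.state\\), "
        + "(\\d+) \\(n\\.sid\\), (0x[0-9a-fA-F]+) \\(n\\.peerEPoch\\)");

    final long leader;
    final long zxid;
    final long round;
    final String state;
    final long sid;
    final long peerEpoch;

    private NotificationLine(Matcher m) {
        leader = Long.parseLong(m.group(1));
        zxid = Long.decode(m.group(2));      // Long.decode accepts the 0x prefix
        round = Long.decode(m.group(3));
        state = m.group(4);
        sid = Long.parseLong(m.group(5));
        peerEpoch = Long.decode(m.group(6));
    }

    /** Parses one log entry; wrapped lines are collapsed to single spaces first. */
    static NotificationLine parse(String logEntry) {
        String flat = logEntry.replaceAll("\\s+", " ");
        Matcher m = FORMAT.matcher(flat);
        if (!m.find()) {
            throw new IllegalArgumentException("not a Notification entry: " + flat);
        }
        return new NotificationLine(m);
    }

    /** True when both name the same leader but differ in round or peer epoch. */
    static boolean disagree(NotificationLine a, NotificationLine b) {
        return a.leader == b.leader
            && (a.round != b.round || a.peerEpoch != b.peerEpoch);
    }
}
```

Applied to the two entries above, both parse with n.leader 3, and disagree() is true because the round (0xb9 against 0xba) and the peer epoch (0xb8 against 0xb9) differ.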

 ZooKeeper server unable to join established ensemble
 

 Key: ZOOKEEPER-1732
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1732
 Project: ZooKeeper
  Issue Type: Bug
  Components: leaderElection
Affects Versions: 3.4.5
 Environment: Windows 7, Java 1.7
Reporter: Germán Blanco
Priority: Critical
 Attachments: zklog.tar.gz


 I have a test in which I do a rolling restart of three ZooKeeper servers and 
 it was failing from time to time.
 I ran the tests in a loop until the failure came out and it seems that at 
 some point one of the servers is unable to join the ensemble formed by the 
 other two.



[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble

2013-07-19 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13713933#comment-13713933
 ] 

Flavio Junqueira commented on ZOOKEEPER-1732:
-

Here is my analysis of the logs.

Server 3 has been elected two times, both times with support of Server 1:

{noformat}
2013-07-19 10:16:09,746 [myid:3] - DEBUG 
[QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:30103:FastLeaderElection@493] - About to 
leave FLE instance: leader=3, zxid=0xb80099, my id=3, my state=LEADING

2013-07-19 10:16:26,667 [myid:3] - DEBUG 
[QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:30103:FastLeaderElection@493] - About to 
leave FLE instance: leader=3, zxid=0xb90052, my id=3, my state=LEADING
{noformat}

Server 2 elects Server 3 but loses the connection to Server 3 right after:

{noformat}
2013-07-19 10:16:20,858 [myid:2] - INFO  
[QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:30102:Follower@63] - FOLLOWING - LEADER 
ELECTION TOOK - 47
2013-07-19 10:16:20,858 [myid:2] - WARN  
[RecvWorker:3:QuorumCnxManager$RecvWorker@762] - Connection broken for id 3, my 
id = 2, error = 
{noformat}

And it doesn't seem to go into a new round of leader election. Because it is 
not trying to elect a new leader, its vote reflects the state of the first 
leader instance of Server 3.

Now, Server 3 later on loses its connection to Server 1:

{noformat}
2013-07-19 10:16:34,307 [myid:3] - WARN  
[RecvWorker:1:QuorumCnxManager$RecvWorker@762] - Connection broken for id 1, my 
id = 3, error = 
{noformat}

but it doesn't seem to care, so it must have the support of Server 2. Server 2 
again seems to be referring to a previous leader instance of Server 3, so its 
support for Server 3 must be surviving the crash of Server 3 around 2013-07-19 
10:16:20,858 [myid:2]. My guess is that Server 2 is getting confused by the 
connection dropping right after it elected Server 3, and it keeps trying to 
establish a new connection, which succeeds when Server 3 comes back up. I think 
there is a race there.
