[jira] Updated: (ZOOKEEPER-702) GSoC 2010: Failure Detector Model
[ https://issues.apache.org/jira/browse/ZOOKEEPER-702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Abmar Barros updated ZOOKEEPER-702:
---
Attachment: ZOOKEEPER-702.patch

I agree with what Flavio said about the second point. The application scheduling interval is, indeed, much lower than the FD pinging interval.

I have attached the implementations of all the initially suggested FDs and their unit tests. I have also included the suggestions Flavio gave concerning package naming and method scope.

Since the Phi Accrual implementation needs to compute the normal distribution CDF, I have used the math-commons API; however, it still has to be added as an Ivy dependency.

So, before experimenting, there are a few steps that need to be accomplished:
* Create a receiveAppHeartbeat() method on the interface so that we can keep using ZooKeeper messages as heartbeats, analyze its impact, and adapt it to each FD implementation
* Adapt the server-side code to use the proposed FD interface
* Add comments to the pseudocode
* Expand unit tests
* Enhance code documentation
* Add the math-commons dependency to Ivy

> GSoC 2010: Failure Detector Model
>
> Key: ZOOKEEPER-702
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-702
> Project: Zookeeper
> Issue Type: Wish
> Reporter: Henry Robinson
> Assignee: Abmar Barros
> Attachments: bertier-pseudo.txt, chen-pseudo.txt, phiaccrual-pseudo.txt, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch
>
> Failure Detector Module
> Possible Mentor
> Henry Robinson (henry at apache dot org)
> Requirements
> Java, some distributed systems knowledge, comfort implementing distributed systems protocols
> Description
> ZooKeeper servers detect the failure of other servers and clients by counting the number of 'ticks' for which they don't get a heartbeat from other machines.
> This is the 'timeout' method of failure detection and works very well; however, it is possible that it is too aggressive and not easily tuned for some more unusual ZooKeeper installations (such as in a wide-area network, or even in a mobile ad-hoc network).
> This project would abstract the notion of failure detection to a dedicated Java module, and implement several failure detectors to compare and contrast their appropriateness for ZooKeeper. For example, Apache Cassandra uses a phi-accrual failure detector (http://ddsg.jaist.ac.jp/pub/HDY+04.pdf) which is much more tunable and has some very interesting properties. This is a great project if you are interested in distributed algorithms, or want to help re-factor some of ZooKeeper's internal code.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
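To make the phi-accrual idea mentioned in the message above more concrete, here is a minimal sketch in Java. It is not the code from ZOOKEEPER-702.patch: the class name, method names, window handling, and the 1 ms floor on the standard deviation are all assumptions made for illustration. It also swaps in a logistic approximation of the normal CDF so that the example stays self-contained, whereas the comment above says the patch uses the math-commons API for the CDF.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Illustrative phi-accrual sketch; not the code in ZOOKEEPER-702.patch. */
final class PhiAccrualSketch {
    private final Deque<Long> intervals = new ArrayDeque<>();
    private final int windowSize;
    private long lastHeartbeatMillis = -1;

    PhiAccrualSketch(int windowSize) {
        this.windowSize = windowSize;
    }

    /** Records a heartbeat arrival and keeps a sliding window of inter-arrival times. */
    void heartbeat(long nowMillis) {
        if (lastHeartbeatMillis >= 0) {
            if (intervals.size() == windowSize) {
                intervals.removeFirst();
            }
            intervals.addLast(nowMillis - lastHeartbeatMillis);
        }
        lastHeartbeatMillis = nowMillis;
    }

    /** Suspicion level: -log10 of the estimated probability that a heartbeat arrives later than now. */
    double phi(long nowMillis) {
        if (intervals.isEmpty()) {
            return 0.0; // no samples yet, so nothing to suspect
        }
        double mean = 0.0;
        for (long i : intervals) {
            mean += i;
        }
        mean /= intervals.size();
        double variance = 0.0;
        for (long i : intervals) {
            variance += (i - mean) * (i - mean);
        }
        variance /= intervals.size();
        // Assumption: floor the deviation at 1 ms to avoid dividing by zero.
        double stddev = Math.max(Math.sqrt(variance), 1.0);

        double elapsed = nowMillis - lastHeartbeatMillis;
        double y = (elapsed - mean) / stddev;
        // Logistic approximation of the normal CDF (swapped in for a library call).
        double e = Math.exp(-y * (1.5976 + 0.070566 * y * y));
        // Two algebraically equal forms of 1 - CDF; the branch avoids Infinity / Infinity.
        double pLater = (elapsed > mean) ? e / (1.0 + e) : 1.0 - 1.0 / (1.0 + e);
        return -Math.log10(pLater);
    }
}
```

A caller would compare phi(now) against a threshold it chooses; unlike a fixed tick count, the threshold expresses how unlikely the current silence is given the observed heartbeat history.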
[jira] Commented: (ZOOKEEPER-789) Improve FLE log messages
[ https://issues.apache.org/jira/browse/ZOOKEEPER-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881337#action_12881337 ]

Hadoop QA commented on ZOOKEEPER-789:
-

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12447713/ZOOKEEPER-789.patch
against trunk revision 953041.

+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h1.grid.sp2.yahoo.net/119/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h1.grid.sp2.yahoo.net/119/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h1.grid.sp2.yahoo.net/119/console

This message is automatically generated.

> Improve FLE log messages
>
> Key: ZOOKEEPER-789
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-789
> Project: Zookeeper
> Issue Type: Improvement
> Affects Versions: 3.3.1
> Reporter: Flavio Paiva Junqueira
> Assignee: Flavio Paiva Junqueira
> Fix For: 3.3.2, 3.4.0
> Attachments: ZOOKEEPER-789.patch
>
> Notification messages are quite important to determine what is going on with leader election. The main idea of this improvement is to name the fields we output in notification log messages.

--
This message is automatically generated by JIRA.
[jira] Commented: (ZOOKEEPER-784) server-side functionality for read-only mode
[ https://issues.apache.org/jira/browse/ZOOKEEPER-784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881313#action_12881313 ]

Sergey Doroshenko commented on ZOOKEEPER-784:
-

--- Question about session handling ---

For now (latest patch), when an r/o client connects to a partitioned server, a new "fake" session is created for it; it is fake because it exists only on this server. When the server regains the quorum, this session will be rejected as invalid during revalidation, since the leader never saw it.

From the users' point of view, it would be good to transparently upgrade such a session to a usual session. The idea is that if the leader sees that a given session is invalid but also belongs to a read-only client, then it re-assigns a new id to it and sends this new id to the client.

The idea is good but seems error-prone. For example, what should be done with r/o clients that were partitioned (and connected to a partitioned server) for longer than the session timeout? At the leader their sessions have already expired, so they should be rejected. Re-assigning a new session to such clients doesn't look right. But the problem here is that when servers see an invalid session, they can't tell whether they never saw it or whether they saw it but it has expired.

So, if the quorum rejects an r/o session (which implies this session either was never seen by the quorum or has expired), there are two options from the users' point of view:
* The ZooKeeper object becomes invalid, and the application should create a new one. Reliable and consistent with current ZK
* The server transparently re-assigns the session id for such a client. Seems to have many potential problems, as described above

What do you think?
> server-side functionality for read-only mode > > > Key: ZOOKEEPER-784 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-784 > Project: Zookeeper > Issue Type: Sub-task >Reporter: Sergey Doroshenko >Assignee: Sergey Doroshenko > Attachments: ZOOKEEPER-784.patch, ZOOKEEPER-784.patch, > ZOOKEEPER-784.patch > > > As per http://wiki.apache.org/hadoop/ZooKeeper/GSoCReadOnlyMode , create > ReadOnlyZooKeeperServer which comes into play when peer is partitioned. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
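The ambiguity raised in the comment above (a session the quorum never saw looks the same as one it saw and expired) can be illustrated with a small Java sketch. This is not ZooKeeper's session tracker: the class, the Outcome enum, and the reassignPolicy flag are hypothetical names introduced only to contrast the two options listed in the comment.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative only; not ZooKeeper's session tracking code. */
final class SessionRevalidationSketch {
    enum Outcome { VALID, REJECT, REASSIGN_ID }

    // Assumption: the leader remembers only live sessions; expired ones are removed.
    private final Set<Long> liveSessions = new HashSet<>();

    void create(long sessionId) {
        liveSessions.add(sessionId);
    }

    void expire(long sessionId) {
        liveSessions.remove(sessionId);
    }

    /**
     * Revalidates a session. With reassignPolicy == false this models the first
     * option (the client must create a new ZooKeeper object); with true it models
     * the second option (transparent re-assignment for read-only clients).
     */
    Outcome revalidate(long sessionId, boolean readOnlyClient, boolean reassignPolicy) {
        if (liveSessions.contains(sessionId)) {
            return Outcome.VALID;
        }
        // "Never seen" and "seen but expired" are indistinguishable at this point.
        if (readOnlyClient && reassignPolicy) {
            return Outcome.REASSIGN_ID;
        }
        return Outcome.REJECT;
    }
}
```

Under the re-assignment policy, an expired session and a fake r/o session both come back as REASSIGN_ID, which is exactly the case the comment flags as not looking right.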
[jira] Commented: (ZOOKEEPER-790) Last processed zxid set prematurely while establishing leadership
[ https://issues.apache.org/jira/browse/ZOOKEEPER-790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881282#action_12881282 ] Vishal K commented on ZOOKEEPER-790: I will try out the patch. I will try it out on 3.3.0 since that is the version we are currently using. -Vishal > Last processed zxid set prematurely while establishing leadership > - > > Key: ZOOKEEPER-790 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-790 > Project: Zookeeper > Issue Type: Bug > Components: quorum >Affects Versions: 3.3.1 >Reporter: Flavio Paiva Junqueira >Assignee: Flavio Paiva Junqueira >Priority: Blocker > Fix For: 3.3.2, 3.4.0 > > Attachments: ZOOKEEPER-790.patch > > > The leader code is setting the last processed zxid to the first of the new > epoch even before connecting to a quorum of followers. Because the leader > code sets this value before connecting to a quorum of followers > (Leader.java:281) and the follower code throws an IOException > (Follower.java:73) if the leader epoch is smaller, we have that when the > false leader drops leadership and becomes a follower, it finds a smaller > epoch and kills itself. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881280#action_12881280 ] Vishal K commented on ZOOKEEPER-335: I will try out the patch. FYI I am using 3.3.0. > zookeeper servers should commit the new leader txn to their logs. > - > > Key: ZOOKEEPER-335 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-335 > Project: Zookeeper > Issue Type: Bug > Components: server >Affects Versions: 3.1.0 >Reporter: Mahadev konar >Assignee: Mahadev konar >Priority: Blocker > Fix For: 3.4.0 > > Attachments: faultynode-vishal.txt, zk.log.gz, zklogs.tar.gz > > > currently the zookeeper followers do not commit the new leader election. This > will cause problems in a failure scenarios with a follower acking to the same > leader txn id twice, which might be two different intermittent leaders and > allowing them to propose two different txn's of the same zxid. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-790) Last processed zxid set prematurely while establishing leadership
[ https://issues.apache.org/jira/browse/ZOOKEEPER-790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Hunt updated ZOOKEEPER-790: --- Assignee: Flavio Paiva Junqueira Fix Version/s: 3.4.0 Priority: Blocker (was: Major) > Last processed zxid set prematurely while establishing leadership > - > > Key: ZOOKEEPER-790 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-790 > Project: Zookeeper > Issue Type: Bug > Components: quorum >Affects Versions: 3.3.1 >Reporter: Flavio Paiva Junqueira >Assignee: Flavio Paiva Junqueira >Priority: Blocker > Fix For: 3.3.2, 3.4.0 > > Attachments: ZOOKEEPER-790.patch > > > The leader code is setting the last processed zxid to the first of the new > epoch even before connecting to a quorum of followers. Because the leader > code sets this value before connecting to a quorum of followers > (Leader.java:281) and the follower code throws an IOException > (Follower.java:73) if the leader epoch is smaller, we have that when the > false leader drops leadership and becomes a follower, it finds a smaller > epoch and kills itself. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881244#action_12881244 ] Flavio Paiva Junqueira commented on ZOOKEEPER-335: -- I have created a new jira for this issue: ZOOKEEPER-790. There is a patch there. > zookeeper servers should commit the new leader txn to their logs. > - > > Key: ZOOKEEPER-335 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-335 > Project: Zookeeper > Issue Type: Bug > Components: server >Affects Versions: 3.1.0 >Reporter: Mahadev konar >Assignee: Mahadev konar >Priority: Blocker > Fix For: 3.4.0 > > Attachments: faultynode-vishal.txt, zk.log.gz, zklogs.tar.gz > > > currently the zookeeper followers do not commit the new leader election. This > will cause problems in a failure scenarios with a follower acking to the same > leader txn id twice, which might be two different intermittent leaders and > allowing them to propose two different txn's of the same zxid. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-790) Last processed zxid set prematurely while establishing leadership
[ https://issues.apache.org/jira/browse/ZOOKEEPER-790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Flavio Paiva Junqueira updated ZOOKEEPER-790: - Attachment: ZOOKEEPER-790.patch > Last processed zxid set prematurely while establishing leadership > - > > Key: ZOOKEEPER-790 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-790 > Project: Zookeeper > Issue Type: Bug > Components: quorum >Affects Versions: 3.3.1 >Reporter: Flavio Paiva Junqueira > Fix For: 3.3.2 > > Attachments: ZOOKEEPER-790.patch > > > The leader code is setting the last processed zxid to the first of the new > epoch even before connecting to a quorum of followers. Because the leader > code sets this value before connecting to a quorum of followers > (Leader.java:281) and the follower code throws an IOException > (Follower.java:73) if the leader epoch is smaller, we have that when the > false leader drops leadership and becomes a follower, it finds a smaller > epoch and kills itself. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (ZOOKEEPER-790) Last processed zxid set prematurely while establishing leadership
Last processed zxid set prematurely while establishing leadership
-

Key: ZOOKEEPER-790
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-790
Project: Zookeeper
Issue Type: Bug
Components: quorum
Affects Versions: 3.3.1
Reporter: Flavio Paiva Junqueira
Fix For: 3.3.2
Attachments: ZOOKEEPER-790.patch

The leader code sets the last processed zxid to the first zxid of the new epoch even before connecting to a quorum of followers (Leader.java:281), and the follower code throws an IOException (Follower.java:73) if the leader epoch is smaller. As a result, when the false leader drops leadership and becomes a follower, it finds a smaller epoch and kills itself.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
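The failure described in the report above can be sketched in Java as follows. This is not the code at Leader.java:281 or Follower.java:73; the class and method names are hypothetical. The one fact relied on is that a ZooKeeper zxid is a 64-bit number whose high-order 32 bits hold the epoch and whose low-order 32 bits hold a counter.

```java
import java.io.IOException;

/** Illustrative only; not the code in Leader.java or Follower.java. */
final class ZxidEpochSketch {
    /** Epoch stored in the high-order 32 bits of a zxid. */
    static long epochOf(long zxid) {
        return zxid >>> 32;
    }

    /** First zxid of an epoch: the counter in the low-order 32 bits is zero. */
    static long firstZxidOfEpoch(long epoch) {
        return epoch << 32;
    }

    /** Hypothetical stand-in for the follower-side check described in the report. */
    static void checkLeaderEpoch(long leaderZxid, long ownLastProcessedZxid) throws IOException {
        if (epochOf(leaderZxid) < epochOf(ownLastProcessedZxid)) {
            throw new IOException("Leader epoch " + epochOf(leaderZxid)
                    + " is smaller than own epoch " + epochOf(ownLastProcessedZxid));
        }
    }
}
```

For example, a server whose last zxid is in epoch 5 and that bumps its own last processed zxid to firstZxidOfEpoch(6) before it has a quorum will, once it becomes a follower of a leader still in epoch 5, fail this check.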
[jira] Updated: (ZOOKEEPER-789) Improve FLE log messages
[ https://issues.apache.org/jira/browse/ZOOKEEPER-789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Hunt updated ZOOKEEPER-789: --- Fix Version/s: 3.4.0 > Improve FLE log messages > > > Key: ZOOKEEPER-789 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-789 > Project: Zookeeper > Issue Type: Improvement >Affects Versions: 3.3.1 >Reporter: Flavio Paiva Junqueira >Assignee: Flavio Paiva Junqueira > Fix For: 3.3.2, 3.4.0 > > Attachments: ZOOKEEPER-789.patch > > > Notification messages are quite important to determine what is going with > leader election. The main idea of this improvement is name the fields we > output in notification log messages. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-789) Improve FLE log messages
[ https://issues.apache.org/jira/browse/ZOOKEEPER-789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Flavio Paiva Junqueira updated ZOOKEEPER-789: - Status: Patch Available (was: Open) Affects Version/s: 3.3.1 Fix Version/s: 3.3.2 > Improve FLE log messages > > > Key: ZOOKEEPER-789 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-789 > Project: Zookeeper > Issue Type: Improvement >Affects Versions: 3.3.1 >Reporter: Flavio Paiva Junqueira >Assignee: Flavio Paiva Junqueira > Fix For: 3.3.2 > > Attachments: ZOOKEEPER-789.patch > > > Notification messages are quite important to determine what is going with > leader election. The main idea of this improvement is name the fields we > output in notification log messages. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-789) Improve FLE log messages
[ https://issues.apache.org/jira/browse/ZOOKEEPER-789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Flavio Paiva Junqueira updated ZOOKEEPER-789: - Assignee: Flavio Paiva Junqueira > Improve FLE log messages > > > Key: ZOOKEEPER-789 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-789 > Project: Zookeeper > Issue Type: Improvement >Reporter: Flavio Paiva Junqueira >Assignee: Flavio Paiva Junqueira > Attachments: ZOOKEEPER-789.patch > > > Notification messages are quite important to determine what is going with > leader election. The main idea of this improvement is name the fields we > output in notification log messages. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-789) Improve FLE log messages
[ https://issues.apache.org/jira/browse/ZOOKEEPER-789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Flavio Paiva Junqueira updated ZOOKEEPER-789:
-
Attachment: ZOOKEEPER-789.patch

I have created a method to print information about a notification and placed the call to it right after the notification structure is created in FastLeaderElection.WorkerReceiver.run(). This way, all received notifications are logged. There is no test, since this is just a modification to message logging.

> Improve FLE log messages
>
> Key: ZOOKEEPER-789
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-789
> Project: Zookeeper
> Issue Type: Improvement
> Reporter: Flavio Paiva Junqueira
> Attachments: ZOOKEEPER-789.patch
>
> Notification messages are quite important to determine what is going on with leader election. The main idea of this improvement is to name the fields we output in notification log messages.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
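As a rough illustration of what "naming the fields" in a notification log message could look like, here is a hypothetical Java helper. It is not the method added by ZOOKEEPER-789.patch; the class name, parameter names, and output layout are assumptions chosen for the example.

```java
/** Illustrative only; field names are assumptions, not those of ZOOKEEPER-789.patch. */
final class NotificationLogSketch {
    /** Builds a log line in which every value is preceded by its field name. */
    static String describe(long sender, long proposedLeader, long proposedZxid,
                           long electionRound, String senderState) {
        return "Notification:"
                + " sender=" + sender
                + ", leader=" + proposedLeader
                + ", zxid=0x" + Long.toHexString(proposedZxid)
                + ", round=" + electionRound
                + ", state=" + senderState;
    }
}
```

Calling such a helper once, right where the notification is constructed, gives every received notification a uniform, greppable line instead of a bare sequence of numbers.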
[jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881236#action_12881236 ] Patrick Hunt commented on ZOOKEEPER-335: Vishal, if Flavio provides you with a patch could you apply it and verify with your configuration? Flavio, please provide an initial patch that people could use to verify. We'll hold off on a release until you add the test(s), but this would be great to start with. Thanks all for helping to track this down! I'd like to fast track a 3.3.2 release, so if possible please make this a priority. > zookeeper servers should commit the new leader txn to their logs. > - > > Key: ZOOKEEPER-335 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-335 > Project: Zookeeper > Issue Type: Bug > Components: server >Affects Versions: 3.1.0 >Reporter: Mahadev konar >Assignee: Mahadev konar >Priority: Blocker > Fix For: 3.4.0 > > Attachments: faultynode-vishal.txt, zk.log.gz, zklogs.tar.gz > > > currently the zookeeper followers do not commit the new leader election. This > will cause problems in a failure scenarios with a follower acking to the same > leader txn id twice, which might be two different intermittent leaders and > allowing them to propose two different txn's of the same zxid. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881168#action_12881168 ]

Flavio Paiva Junqueira commented on ZOOKEEPER-335:
--

Thanks for the detailed assessment, Vishal.

In Step b, the fact that the process believes it is the leader is not a problem; it happens because we queue notification messages during leader election. The real issue is that the leader code sets the last processed zxid to the first zxid of the new epoch even before connecting to a quorum of followers (Leader.java:281). Because the follower code throws an IOException (Follower.java:73) if the leader epoch is smaller, when the false leader drops leadership and becomes a follower, it finds a smaller epoch and kills itself. I noticed that this follower check was not there before (it is not present in the 3.0 branch), and it might have been introduced when we did the observer reorganization.

For now I propose that we move line Leader.java:281 to Leader.java:470. This simply moves the point at which we set the last processed zxid to one at which we know that a quorum of followers supports the leader. I reasoned a bit about it and verified that the tests pass.

A patch for the change I'm proposing is trivial, but a unit test will require some work, so I'd rather hear opinions first. Also, please note that this problem is not related to the topic of this jira, so we might consider working on a different jira from this point on.

> zookeeper servers should commit the new leader txn to their logs.
> - > > Key: ZOOKEEPER-335 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-335 > Project: Zookeeper > Issue Type: Bug > Components: server >Affects Versions: 3.1.0 >Reporter: Mahadev konar >Assignee: Mahadev konar >Priority: Blocker > Fix For: 3.4.0 > > Attachments: faultynode-vishal.txt, zk.log.gz, zklogs.tar.gz > > > currently the zookeeper followers do not commit the new leader election. This > will cause problems in a failure scenarios with a follower acking to the same > leader txn id twice, which might be two different intermittent leaders and > allowing them to propose two different txn's of the same zxid. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
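The ordering change proposed in the comment above (set the last processed zxid only once a quorum of followers supports the leader) can be sketched in Java as below. This is a simplified model, not the ZOOKEEPER-790 patch: the class, its method, and the majority-quorum rule that counts the leader itself are assumptions for illustration.

```java
/** Illustrative only; a simplified model of the proposed ordering, not Leader.java. */
final class LeaderOrderingSketch {
    private long lastProcessedZxid;

    LeaderOrderingSketch(long lastProcessedZxid) {
        this.lastProcessedZxid = lastProcessedZxid;
    }

    long lastProcessedZxid() {
        return lastProcessedZxid;
    }

    /**
     * Tries to establish leadership. The zxid is advanced to the first zxid of
     * the new epoch only after a quorum is known to support this leader.
     */
    boolean lead(int supportingFollowers, int ensembleSize) {
        long newEpoch = (lastProcessedZxid >>> 32) + 1;
        // Assumption: simple majority quorum, with the leader counting itself.
        boolean quorum = (supportingFollowers + 1) > ensembleSize / 2;
        if (!quorum) {
            return false; // give up leadership with the old zxid left untouched
        }
        lastProcessedZxid = newEpoch << 32;
        return true;
    }
}
```

With this ordering, a false leader that never gathers a quorum keeps its old epoch, so the epoch comparison it later faces as a follower no longer fails because of its own premature update.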