[ https://issues.apache.org/jira/browse/HADOOP-8163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13236934#comment-13236934 ]
Todd Lipcon commented on HADOOP-8163:
-------------------------------------

Hi Bikas. I think your ideas have some merit, especially with regard to a fully general election framework. But since we have only one user of this framework at this point (HDFS), and we currently support only a single standby node, I would prefer to punt these changes to another JIRA as additional improvements. This will let us move forward with the high-priority task of auto failover for HA NNs, rather than getting distracted by making this extremely general.

bq. Secondly, we are performing blocking calls on the ZKClient callback that happens on the ZK threads. It is advisable to not block ZK client threads for long

This is only the case if you have other operations that are waiting on timely delivery of callbacks. In the case of the election framework, all of our notifications from ZK have to be received in order and processed sequentially, or else we have a huge explosion of possible interactions to worry about. Doing blocking calls in the callbacks will _not_ result in lost ZK leases, etc. To quote from the ZK programmer's guide:

"All IO happens on the IO thread (using Java NIO). All event callbacks happen on the event thread. Session maintenance such as reconnecting to ZooKeeper servers and maintaining heartbeat is done on the IO thread. Responses for synchronous methods are also processed in the IO thread. All responses to asynchronous methods and watch events are processed on the event thread... Callbacks do not block the processing of the IO thread or the processing of the synchronous calls"

bq. Thirdly, how about using the setData(breadcrumb, appData, version)?

Let me see about making this change. Like you said, it's a good safety check.
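To make the threading argument concrete, here is a toy model using only java.util.concurrent. It is not the ZooKeeper client, and the class name `EventThreadModel`, the 10 ms heartbeat period and the event strings are assumptions for illustration. One scheduled "IO" thread keeps heartbeating while a single "event" thread runs deliberately blocking callbacks. Because the event thread is single-threaded, callbacks complete strictly in delivery order, and the slow callbacks do not stop the heartbeat.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical toy model of the IO-thread / event-thread split; not the real ZK client. */
class EventThreadModel {
    /** Outcome of one run: callbacks in the order they finished, plus heartbeats sent meanwhile. */
    static final class Result {
        final List<String> processed;
        final int heartbeats;

        Result(List<String> processed, int heartbeats) {
            this.processed = processed;
            this.heartbeats = heartbeats;
        }
    }

    static Result run(List<String> events, long callbackMillis) {
        AtomicInteger heartbeats = new AtomicInteger();
        List<String> processed = Collections.synchronizedList(new ArrayList<>());

        // "IO thread": keeps heartbeating every 10 ms no matter what the event thread does.
        ScheduledExecutorService io = Executors.newSingleThreadScheduledExecutor();
        io.scheduleAtFixedRate(heartbeats::incrementAndGet, 0, 10, TimeUnit.MILLISECONDS);

        // "Event thread": one thread, so callbacks run one at a time in delivery order.
        ExecutorService eventThread = Executors.newSingleThreadExecutor();
        for (String event : events) {
            eventThread.execute(() -> {
                try {
                    Thread.sleep(callbackMillis); // a deliberately slow, blocking callback
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                processed.add(event);
            });
        }

        eventThread.shutdown();
        try {
            eventThread.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        io.shutdownNow();
        return new Result(new ArrayList<>(processed), heartbeats.get());
    }

    public static void main(String[] args) {
        Result r = run(List.of("NodeDeleted", "NodeCreated", "NodeDataChanged"), 100);
        System.out.println("processed in order: " + r.processed);
        System.out.println("heartbeats sent meanwhile: " + r.heartbeats);
    }
}
```

With three 100 ms callbacks, the events come out in submission order, and the heartbeat counter keeps climbing for the whole ~300 ms that the event thread is blocked. The exact count depends on scheduling.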
> Improve ActiveStandbyElector to provide hooks for fencing old active
> --------------------------------------------------------------------
>
>                 Key: HADOOP-8163
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8163
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: ha
>    Affects Versions: 0.23.3, 0.24.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>         Attachments: hadoop-8163.txt, hadoop-8163.txt, hadoop-8163.txt, hadoop-8163.txt
>
>
> When a new node becomes active in an HA setup, it may sometimes have to take
> fencing actions against the node that was formerly active. This JIRA extends
> the ActiveStandbyElector so that it adds an extra non-ephemeral node into the ZK
> directory, which acts as a second copy of the active node's information.
> Then, if the active loses its ZK session, the next active to be elected may
> easily locate the unfenced node to take the appropriate actions.
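The breadcrumb scheme in the description, combined with the version-checked setData(breadcrumb, appData, version) suggested in the comment, can be sketched against an in-memory stand-in for ZooKeeper. Everything here is hypothetical: `FakeZk`, the `/lock` and `/breadcrumb` paths, `tryBecomeActive`, `quitActive` and the `fencer` callback are illustrative names, not the ActiveStandbyElector or ZooKeeper APIs. Only the semantics are modelled: ephemeral nodes vanish with their session, persistent ones do not, and a versioned write fails if the version has moved.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical sketch of breadcrumb-based fencing; not the Hadoop implementation. */
class BreadcrumbSketch {
    /** Thrown when a versioned write sees a different version (mirrors ZK's bad-version error). */
    static final class BadVersionException extends RuntimeException {
        BadVersionException(String path) { super("bad version for " + path); }
    }

    /** A node's data, its version, and its owning session if ephemeral (null = persistent). */
    static final class Node {
        String data;
        int version;
        final Long ephemeralOwner;
        Node(String data, Long ephemeralOwner) { this.data = data; this.ephemeralOwner = ephemeralOwner; }
    }

    /** In-memory stand-in for a ZooKeeper tree (assumption, for illustration only). */
    static final class FakeZk {
        private final Map<String, Node> nodes = new HashMap<>();

        synchronized boolean create(String path, String data, Long ephemeralOwner) {
            return nodes.putIfAbsent(path, new Node(data, ephemeralOwner)) == null;
        }
        /** Returns a snapshot copy, so the caller's version number is the one it actually read. */
        synchronized Node get(String path) {
            Node n = nodes.get(path);
            if (n == null) return null;
            Node copy = new Node(n.data, n.ephemeralOwner);
            copy.version = n.version;
            return copy;
        }
        synchronized void setData(String path, String data, int expectedVersion) {
            Node n = nodes.get(path);
            if (n == null || n.version != expectedVersion) throw new BadVersionException(path);
            n.data = data;
            n.version++;
        }
        synchronized void delete(String path) { nodes.remove(path); }
        /** Session expiry drops that session's ephemeral nodes; persistent nodes survive. */
        synchronized void expireSession(long session) {
            nodes.values().removeIf(n -> n.ephemeralOwner != null && n.ephemeralOwner == session);
        }
    }

    static final String LOCK = "/lock";             // hypothetical ephemeral election node
    static final String BREADCRUMB = "/breadcrumb"; // hypothetical persistent copy of active info

    /** Tries to become active; fences the previous active if its breadcrumb was left behind. */
    static boolean tryBecomeActive(FakeZk zk, long session, String me, Consumer<String> fencer) {
        if (!zk.create(LOCK, me, session)) {
            return false; // another node holds the ephemeral lock
        }
        Node crumb = zk.get(BREADCRUMB);
        if (crumb == null) {
            zk.create(BREADCRUMB, me, null); // persistent: outlives our session
            return true;
        }
        if (!crumb.data.equals(me)) {
            fencer.accept(crumb.data); // old active lost its session without cleaning up
        }
        // Version-checked overwrite: fails if the breadcrumb changed since we read it.
        zk.setData(BREADCRUMB, me, crumb.version);
        return true;
    }

    /** Graceful exit (assumed behaviour): drop our breadcrumb so the successor need not fence us. */
    static void quitActive(FakeZk zk, String me) {
        Node crumb = zk.get(BREADCRUMB);
        if (crumb != null && crumb.data.equals(me)) zk.delete(BREADCRUMB);
        zk.delete(LOCK);
    }

    public static void main(String[] args) {
        FakeZk zk = new FakeZk();
        List<String> fenced = new ArrayList<>();
        tryBecomeActive(zk, 1L, "nn1", fenced::add); // nn1 wins and writes the breadcrumb
        zk.expireSession(1L);                        // lock vanishes, breadcrumb stays
        tryBecomeActive(zk, 2L, "nn2", fenced::add); // nn2 wins and must fence nn1
        System.out.println("fenced: " + fenced);     // prints "fenced: [nn1]"
    }
}
```

The design point is that the ephemeral lock node alone cannot tell the new active who to fence, because it disappears together with the old session. The persistent breadcrumb survives that loss. The version check makes the overwrite fail instead of silently clobbering a breadcrumb that changed between the read and the write.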