[ https://issues.apache.org/jira/browse/HADOOP-8163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13236151#comment-13236151 ]
Bikas Saha commented on HADOOP-8163: ------------------------------------ Thanks for clarifying! I was afraid I would have to look out for code changes that might have been covered by the "cleaned up" tests. The new test code looks cleaner. Thanks! :) I still have to get my head around the changes to the real ZK test :P - consider renaming zkMostRecentFilePath to something like zkMostRecentActiveFilePath. zkMostRecentFilePath is open to being misunderstood. Same for MOST_RECENT_FILENAME. - actually MostRecent seems to be a misnomer to me. I think it actually is LockOwnerInfo/LeaderInfo. zkLockOwnerInfoPath/tryDeleteLeaderInfo etc. - why is ensureBaseNode() needed? In it we are creating a new set of znodes with the given zkAcls which may or may not be the correct thing. eg. if the admin simply forgot to create the appropriate znode path before starting the service it might be ok to fail. Instead of trying to create the path ourselves with permissions that may or may not be appropriate for the entire path. I would be wary of doing this. What is the use case? - consider renaming baseNodeExists() to parentNodeExists() or renaming the parentZnodeName parameter in the constructor to baseNode for consistency. Perhaps this could be called in the constructor to check that the znode exists and be done with config issues. No need for ensureBaseNode() above. - this must be my newbie java skills but I find something like - prefixPath.append("/").append(pathParts[index]) or znodeWorkingDir.subString(0, znodeWorkingDir.nextIndexOf('/')) - more readable than prefixPath = Joiner.on("/").join(Arrays.asList(pathParts).subList(0, i)). It might also be more efficient but thats not relevant for this situation. - public synchronized void quitElection(boolean needFence) - Dont we want to delete the permanent znode for standby's too? Why check if state is active. It anyways calls a tryDelete* method that should be harmless. - tryDeleteMostRecentNode() - From my understanding of "tryFunction" - this function should be not really be asserting that some state holds. If it should assert then we should remove "try" from the name. - in zkDoWithRetries there is a NUM_RETRIES field that could be used instead of 3. - why are we exposing "public synchronized ZooKeeper getZKClient()"? What operations does the HA service need to perform that needs direct access to the zkClient? I am stressing on this because the Elector is trying to provide an interface for leader election and management protocol. Hence, any necessary activity should be available through the interface. The client of the Elector should not manipulate/read/interact directly with ZK. If it needs ZK for some other activity then it could create another ZK client. Making this a public method adds it to the public API of the elector and I am wary of doing it. - the following code seems to have issues LOG.info("Old node exists: " + StringUtils.byteToHexString(data)); appClient.fenceOldActive(data); try { deleteWithRetries(zkMostRecentFilePath, stat.getVersion()); } catch (KeeperException e) { LOG.error("Hit error " + e + " trying to delete info znode " + "after fencing!", e); throw e; } say appClient.fenceOldActive(data) is a long running operation (and likely to be so). While that is happening, the state of the world changes and this elector is not longer the lock owner. When appClient.fenceOldActive(data) will complete then the code will go ahead and delete the lockOwnerZnode at zkMostRecentFilePath. This node could be from the new leader who had successfully fenced and become active. The version number parameter might accidentally save us but would likely be 0 all the time. - what happens if the leader lost the lock, tried to delete its znode, failed to do so, exited anyways, then became the next owner and found the existing mostrecent znode. I think it will try to fence itself. We could add service side logic that checks against this but IMO since the fencing strategy via permanent znodes is an artifact of ActiveStandbyElector, the elector should itself be guarding against this. - It would be great to see the functional test with RealZK enhanced to cover some of the permanent znode use cases. I remember fixing some issues in my own code that were exposed to the reality of the real ZK :P > Improve ActiveStandbyElector to provide hooks for fencing old active > -------------------------------------------------------------------- > > Key: HADOOP-8163 > URL: https://issues.apache.org/jira/browse/HADOOP-8163 > Project: Hadoop Common > Issue Type: Improvement > Components: ha > Affects Versions: 0.23.3, 0.24.0 > Reporter: Todd Lipcon > Assignee: Todd Lipcon > Attachments: hadoop-8163.txt, hadoop-8163.txt, hadoop-8163.txt > > > When a new node becomes active in an HA setup, it may sometimes have to take > fencing actions against the node that was formerly active. This JIRA extends > the ActiveStandbyElector which adds an extra non-ephemeral node into the ZK > directory, which acts as a second copy of the active node's information. > Then, if the active loses its ZK session, the next active to be elected may > easily locate the unfenced node to take the appropriate actions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira