[ 
https://issues.apache.org/jira/browse/HBASE-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5572:
---------------------------

    Fix Version/s: 0.96.0
           Status: Patch Available  (was: Open)
    
> KeeperException.SessionExpiredException management could be improved in Master
> ------------------------------------------------------------------------------
>
>                 Key: HBASE-5572
>                 URL: https://issues.apache.org/jira/browse/HBASE-5572
>             Project: HBase
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>            Priority: Minor
>             Fix For: 0.96.0
>
>         Attachments: 5572.v1.patch, 5572.v2.patch, 5572.v2.patch
>
>
> Synthesis:
>  1) TestMasterZKSessionRecovery distinguish two cases on 
> SessionExpiredException. One is explicitly not managed. However, is seems 
> that there is no reason for this.
>  2) The issue lies in ActiveMasterManager#blockUntilBecomingActiveMaster, a 
> quite complex function, with a useless recursive call.
>  3) TestMasterZKSessionRecovery#testMasterZKSessionRecoverySuccess is 
> equivalent to TestZooKeeper#testMasterSessionExpired
>  4) TestMasterZKSessionRecovery#testMasterZKSessionRecoveryFailure can be 
> removed if we merge the two cases mentioned above.
> Changes are:
>  2) Changing ActiveMasterManager#blockUntilBecomingActiveMaster to have a 
> single case and remove recursion
>  1) Removing TestMasterZKSessionRecovery
> Detailed justification:
> testMasterZKSessionRecoveryFailure says:
> {noformat}
>   /**
>    * Negative test of master recovery from zk session expiry.
>    *
>    * Starts with one master. Fakes the master zk session expired.
>    * Ensures the master cannot recover the expired zk session since
>    * the master zk node is still there.
>    */
>   public void testMasterZKSessionRecoveryFailure() throws Exception {
>     MiniHBaseCluster cluster = TEST_UTIL.getHBaseCluster();
>     HMaster m = cluster.getMaster();
>     m.abort("Test recovery from zk session expired",
>       new KeeperException.SessionExpiredException());
>     assertTrue(m.isStopped());
>   }
> {noformat}
> This tests works, i.e. the assertion is always verified.
> But do we really want this behavior?
> When looking at the code, we see that this what's happening is strange:
> - HMaster#abort calls Master#abortNow. If HMaster#abortNow returns false 
> HMaster#abort stops the master.
> - HMaster#abortNow checks the exception type. As it's a 
> SessionExpiredException it will try to recover, calling 
> HMaster#tryRecoveringExpiredZKSession. If it cannot, it will return false 
> (and that will make HMaster#abort stopping the master)
> - HMaster#tryRecoveringExpiredZKSession recreates a ZooKeeperConnection and 
> then try to become the active master. If it cannot, it will return false (and 
> that will make HMaster#abort stopping the master).
> - HMaster#becomeActiveMaster returns the result of 
> ActiveMasterManager#blockUntilBecomingActiveMaster. 
> blockUntilBecomingActiveMaster says it will return false if there is any 
> error preventing it to become the active master.
> - ActiveMasterManager#blockUntilBecomingActiveMaster reads ZK for the master 
> address. If it's the same port & host, it deletes the nodes, that will start 
> a recursive call to blockUntilBecomingActiveMaster. This second call succeeds 
> (we became the active master) and return true. This result is ignored by the 
> first blockUntilBecomingActiveMaster: it return false (even if we actually 
> became the active master), hence the whole suite call returns false and 
> HMaster#abort stops the master.
> In other words, the comment says "Ensures the master cannot recover the 
> expired zk session since the master zk node is still there." but we're 
> actually doing a check just for this and deleting the node. If we were not 
> ignoring the result, we would return "true", so we would not stop the master, 
> so the test would fail.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to