[jira] [Commented] (HBASE-7407) TestMasterFailover under tests some cases and over tests some others

nkeywal (JIRA) Wed, 02 Jan 2013 10:00:17 -0800

    [ 
https://issues.apache.org/jira/browse/HBASE-7407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542260#comment-13542260
 ]


nkeywal commented on HBASE-7407:
--------------------------------

v2 contains the fix + an unrelated doc fix.

Here are the changes:

Core:
1) When the master was restarting, if .META. was currently opening, it used to 
start another assignment. That seems extreme and risky, so I removed it.
2) When we restart the master, it resends an open to the regionserver for the 
znode in the M_ZK_REGION_CLOSING state. Previously it was relying on the 
timeout (5 minutes by default) before offlining the node.
3) Same as 2) for nodes in M_ZK_REGION_OFFLINE
4) If a znode is RS_ZK_REGION_SPLIT and the regionserver is offline, we do a 
forceOffline to trigger a reassign.
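
For readers who want the restart handling in points 2) to 4) at a glance, here is a minimal sketch of it as a decision function. This is not the code from the patch: the class, the Action enum and the method name are hypothetical, and only the znode state names come from the list above. The comment does not say what happens to an RS_ZK_REGION_SPLIT znode when the regionserver is still online, so the sketch marks that case as not covered rather than guessing.

```java
public class RestartRecoverySketch {
    // Znode states named in the comment; this enum is illustrative only.
    enum ZnodeState { M_ZK_REGION_CLOSING, M_ZK_REGION_OFFLINE, RS_ZK_REGION_SPLIT }

    // Hypothetical action labels, not HBase identifiers.
    enum Action { RESEND_TO_REGIONSERVER, FORCE_OFFLINE_AND_REASSIGN, NOT_COVERED_BY_THIS_COMMENT }

    // Decide what a restarting master does for one znode, per points 2) to 4).
    static Action onMasterRestart(ZnodeState state, boolean regionServerOnline) {
        switch (state) {
            case M_ZK_REGION_CLOSING:
            case M_ZK_REGION_OFFLINE:
                // Points 2) and 3): resend the request instead of waiting for the timeout.
                return Action.RESEND_TO_REGIONSERVER;
            case RS_ZK_REGION_SPLIT:
                // Point 4): only the offline-regionserver case is described.
                return regionServerOnline
                        ? Action.NOT_COVERED_BY_THIS_COMMENT
                        : Action.FORCE_OFFLINE_AND_REASSIGN;
            default:
                throw new IllegalArgumentException("Unhandled state: " + state);
        }
    }

    public static void main(String[] args) {
        for (ZnodeState state : ZnodeState.values()) {
            System.out.println(state + ", regionserver offline -> " + onMasterRestart(state, false));
        }
    }
}
```

The point of the sketch is that none of the branches depends on the timeout monitor any more.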

Test:
1) I removed the timeout change: it was set to a few seconds instead of 5 
minutes. It was hiding that the recovery was not very efficient in some cases.
2) I removed a few test cases that were creating wrong internal states and 
waiting for the timeout to clean them up.
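
To give a rough idea of how much the shortened timeout was hiding, the arithmetic can be sketched as below. The 4000 ms value is the test setting quoted in the issue description and the 5 minutes is the default mentioned above; the class and method names are made up for the illustration.

```java
import java.util.concurrent.TimeUnit;

public class TimeoutGapSketch {
    // Test override from the issue description (timeoutmonitor.timeout).
    static final long TEST_TIMEOUT_MS = 4000;
    // Default quoted in this comment: 5 minutes.
    static final long DEFAULT_TIMEOUT_MS = TimeUnit.MINUTES.toMillis(5);

    // How many times longer a timeout-driven recovery waits with the default.
    static long slowdownFactor(long defaultMs, long testMs) {
        return defaultMs / testMs;
    }

    public static void main(String[] args) {
        System.out.println("Timeout-driven recovery waits "
                + slowdownFactor(DEFAULT_TIMEOUT_MS, TEST_TIMEOUT_MS)
                + "x longer with the default than in the old test setup");
    }
}
```

So any scenario that only recovered through the timeout looked fine in seconds in the test, but would have taken minutes with the default.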

                
> TestMasterFailover under tests some cases and over tests some others
> --------------------------------------------------------------------
>
>                 Key: HBASE-7407
>                 URL: https://issues.apache.org/jira/browse/HBASE-7407
>             Project: HBase
>          Issue Type: Bug
>          Components: master, test
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>            Priority: Minor
>         Attachments: 7407.v1.patch, 7407.v2.patch
>
>
> The tests are done with these settings:
>     conf.setInt("hbase.master.assignment.timeoutmonitor.period", 2000);
>     conf.setInt("hbase.master.assignment.timeoutmonitor.timeout", 4000);
> As a result:
> 1) some tests seem to work, but in real life, the recovery would take 5 
> minutes or more, as in production the timeouts are always higher. So we 
> don't see the real issues.
> 2) The tests include specific cases that should not happen in production. It 
> works because the timeout catches everything, but these scenarios do not need 
> to be optimized, as they cannot happen. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
