[
https://issues.apache.org/jira/browse/HBASE-7407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542260#comment-13542260
]
nkeywal commented on HBASE-7407:
--------------------------------
v2 contains the fix + an unrelated doc fix.
Here are the changes:
Core:
1) When the master was restarting, if .META. was currently opening, it used to
restart another assignment. That seems extreme and risky, so I removed it.
2) When we restart the master, it resends an open to the regionserver for the
znode in the M_ZK_REGION_CLOSING state. Previously it was relying on the
timeout (5 minutes by default) before offlining the node.
3) Same as 2) for nodes in M_ZK_REGION_OFFLINE
4) If a znode is RS_ZK_REGION_SPLIT and the regionserver is offline, we do a
forceOffline to trigger a reassign.
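As a rough illustration of the four core changes above, here is a minimal Java sketch of the decision the master makes per znode on restart. It is not the patch itself: the class `FailoverSketch`, the enums `Observed` and `Action`, and the method `onMasterRestart` are hypothetical names, `META_OPENING` is only a label for case 1), and leaving a split znode alone when its regionserver is still online is an assumption, since the list above only covers the offline case.

```java
/** Hypothetical model of the restart handling described above; not HBase code. */
final class FailoverSketch {

    /** What the restarting master observes. META_OPENING is a label for case 1). */
    enum Observed { META_OPENING, M_ZK_REGION_CLOSING, M_ZK_REGION_OFFLINE, RS_ZK_REGION_SPLIT }

    /** Hypothetical action names for what the master does next. */
    enum Action { LEAVE_AS_IS, RESEND_OPEN, FORCE_OFFLINE }

    static Action onMasterRestart(Observed observed, boolean regionServerOnline) {
        switch (observed) {
            case META_OPENING:
                // 1) No second assignment is started while .META. is opening.
                return Action.LEAVE_AS_IS;
            case M_ZK_REGION_CLOSING:
            case M_ZK_REGION_OFFLINE:
                // 2) and 3) Resend the open instead of waiting for the timeout.
                return Action.RESEND_OPEN;
            case RS_ZK_REGION_SPLIT:
                // 4) Force offline only if the regionserver is gone.
                // The online branch is an assumption, not stated in the comment.
                return regionServerOnline ? Action.LEAVE_AS_IS : Action.FORCE_OFFLINE;
            default:
                throw new IllegalArgumentException("unhandled: " + observed);
        }
    }

    public static void main(String[] args) {
        for (Observed o : Observed.values()) {
            System.out.println(o + " (regionserver offline) -> " + onMasterRestart(o, false));
        }
    }
}
```

The point of the sketch is that none of the branches falls back on the timeout monitor: each observed state maps directly to an action.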
Test:
1) I removed the timeout change: it was set to a few seconds instead of 5
minutes. It was hiding that the recovery was not very efficient in some cases.
2) I removed a few test cases that were creating wrong internal states and
waiting for the timeout to clean them up.
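To illustrate why keeping the default timeout exposes slow recoveries, here is a minimal sketch of a bounded wait a test could use. The helper `BoundedWait.waitFor` is a hypothetical name, not something in the HBase test code, and the bound in the usage note is only an example value. With the timeout left at 5 minutes, a wait bounded well below that fails whenever recovery depends on the timeout monitor.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical test helper; not part of HBase. */
final class BoundedWait {

    /**
     * Polls the condition every pollMs milliseconds until it holds or
     * maxWaitMs milliseconds have elapsed. Returns true if the condition
     * held, false on expiry or if the thread was interrupted.
     */
    static boolean waitFor(BooleanSupplier condition, long maxWaitMs, long pollMs) {
        long deadline = System.nanoTime() + maxWaitMs * 1_000_000L;
        while (System.nanoTime() - deadline < 0) {
            if (condition.getAsBoolean()) {
                return true;
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                // Preserve the interrupt status and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
        // One last check in case the condition became true during the final sleep.
        return condition.getAsBoolean();
    }
}
```

A test would then assert something like `waitFor(regionIsOnline, 60_000, 200)`, where `regionIsOnline` is whatever check the test already has. The 60 second bound is an assumed value, chosen only to be far below the 5 minute timeout.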
> TestMasterFailover under tests some cases and over tests some others
> --------------------------------------------------------------------
>
> Key: HBASE-7407
> URL: https://issues.apache.org/jira/browse/HBASE-7407
> Project: HBase
> Issue Type: Bug
> Components: master, test
> Affects Versions: 0.96.0
> Reporter: nkeywal
> Assignee: nkeywal
> Priority: Minor
> Attachments: 7407.v1.patch, 7407.v2.patch
>
>
> The tests are done with these settings:
> conf.setInt("hbase.master.assignment.timeoutmonitor.period", 2000);
> conf.setInt("hbase.master.assignment.timeoutmonitor.timeout", 4000);
> As a result:
> 1) some tests seem to work, but in real life the recovery would take 5
> minutes or more, as in production they are always higher. So we don't see the
> real issues.
> 2) The tests include specific cases that should not happen in production. It
> works because the timeout catches everything, but these scenarios do not need
> to be optimized, as they cannot happen.