[ https://issues.apache.org/jira/browse/HBASE-7407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542260#comment-13542260 ]
nkeywal commented on HBASE-7407:
--------------------------------

v2 contains the fix plus an unrelated doc fix. Here are the changes:

Core:
1) When the master was restarting, if .META. was currently opening, it used to restart another assignment. It seems extreme and risky. I removed it.
2) When we restart the master, it resends an open to the regionserver for a znode in the M_ZK_REGION_CLOSING state. Previously it was relying on the timeout (5 minutes by default) before offlining the node.
3) Same as 2) for nodes in M_ZK_REGION_OFFLINE.
4) If a znode is RS_ZK_REGION_SPLIT and the regionserver is offline, we do a forceOffline to trigger a reassign.

Test:
1) I removed the timeout change: it was set to a few seconds instead of 5 minutes. It was hiding that the recovery was not very efficient in some cases.
2) I removed a few test cases that were creating wrong internal states and waiting for the timeout to clean them up.

> TestMasterFailover under tests some cases and over tests some others
> --------------------------------------------------------------------
>
>                 Key: HBASE-7407
>                 URL: https://issues.apache.org/jira/browse/HBASE-7407
>             Project: HBase
>          Issue Type: Bug
>          Components: master, test
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>            Priority: Minor
>         Attachments: 7407.v1.patch, 7407.v2.patch
>
>
> The tests are done with these settings:
>     conf.setInt("hbase.master.assignment.timeoutmonitor.period", 2000);
>     conf.setInt("hbase.master.assignment.timeoutmonitor.timeout", 4000);
> As a result:
> 1) Some tests seem to work, but in real life the recovery would take 5
> minutes or more, as in production these values are always higher. So we don't see the
> real issues.
> 2) The tests include specific cases that should not happen in production. It
> works because the timeout catches everything, but these scenarios do not need
> to be optimized, as they cannot happen.

--
This message is automatically generated by JIRA.