[ https://issues.apache.org/jira/browse/HBASE-7407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13546008#comment-13546008 ]
nkeywal edited comment on HBASE-7407 at 1/7/13 4:23 PM:
--------------------------------------------------------

bq. I was thinking should we call unassign/open instead which handles more error cases, such as ServerNotRunningYetException in case the regionserver is online but not ready for RPC?

I've refactored the code to share the open region call with the assign function. I don't think we can do a full assign in this case: the opening may already be in progress. In this code, we still abort if we get an error. That can be changed.

bq. Another thing is that RegionStates@regionOnline should be used to online a region when we revert to original state if sendRegionClose returns false. #updateRegionState will keep the region in transition.

I've changed it to a full assign. It seems correct as well, and then we should not have this issue. What do you think?

Lastly, I've added a lock during the process failover; imho we can hit race conditions without it.

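To illustrate the last point, here is a minimal sketch of what holding a lock across the failover processing could look like. It is not taken from the patch: the class FailoverLockSketch, its methods and the event list are hypothetical names used only for illustration, and the real assignment code in HBase is considerably more involved. The only idea shown is that the whole failover pass and each individual assign take the same lock, so an assign cannot interleave with a failover pass that is still walking the regions in transition.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical sketch; this is NOT the HBase AssignmentManager. */
class FailoverLockSketch {
    // Single lock shared by the failover pass and by regular assignments.
    private final ReentrantLock failoverLock = new ReentrantLock();
    // Records the order in which operations ran (illustration only).
    private final List<String> events = new ArrayList<>();

    /** Processes all regions in transition while holding the lock. */
    void processFailover(List<String> regionsInTransition) {
        failoverLock.lock();
        try {
            for (String region : regionsInTransition) {
                events.add("failover:" + region);
            }
        } finally {
            failoverLock.unlock();
        }
    }

    /** A regular assignment must wait until any failover pass has finished. */
    void assign(String region) {
        failoverLock.lock();
        try {
            events.add("assign:" + region);
        } finally {
            failoverLock.unlock();
        }
    }

    /** Returns a copy of the recorded events. */
    List<String> events() {
        failoverLock.lock();
        try {
            return new ArrayList<>(events);
        } finally {
            failoverLock.unlock();
        }
    }
}
```

With this shape, a thread calling assign() while processFailover() is running blocks until the failover pass releases the lock, so the failover entries always appear as one contiguous run in the event list.
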
> TestMasterFailover under tests some cases and over tests some others
> --------------------------------------------------------------------
>
>                 Key: HBASE-7407
>                 URL: https://issues.apache.org/jira/browse/HBASE-7407
>             Project: HBase
>          Issue Type: Bug
>          Components: master, Region Assignment, test
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>            Priority: Minor
>         Attachments: 7407.v1.patch, 7407.v2.patch, 7407.v3.patch
>
>
> The tests are run with these settings:
>     conf.setInt("hbase.master.assignment.timeoutmonitor.period", 2000);
>     conf.setInt("hbase.master.assignment.timeoutmonitor.timeout", 4000);
> As a result:
> 1) Some tests seem to work, but in real life the recovery would take 5
> minutes or more, as in production these values are always higher. So we
> don't see the real issues.
> 2) The tests include specific cases that should not happen in production.
> They pass because the timeout catches everything, but these scenarios do
> not need to be optimized, as they cannot happen.