[ https://issues.apache.org/jira/browse/HBASE-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13435353#comment-13435353 ]
Hadoop QA commented on HBASE-6060:
----------------------------------

-1 overall. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12541080/HBASE-6060_latest.patch
  against trunk revision .

    +1 @author. The patch does not contain any @author tags.

    +1 tests included. The patch appears to include 21 new or modified tests.

    +1 hadoop2.0. The patch compiles against the hadoop 2.0 profile.

    +1 javadoc. The javadoc tool did not generate any warning messages.

    -1 javac. The applied patch generated 5 javac compiler warnings (more than the trunk's current 4 warnings).

    -1 findbugs. The patch appears to introduce 8 new Findbugs (version 1.3.9) warnings.

    +1 release audit. The applied patch does not increase the total number of release audit warnings.

    -1 core tests. The patch failed these unit tests:
                  org.apache.hadoop.hbase.master.TestRollingRestart
                  org.apache.hadoop.hbase.master.TestDistributedLogSplitting
                  org.apache.hadoop.hbase.regionserver.handler.TestOpenRegionHandler

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/2589//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2589//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2589//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2589//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2589//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/2589//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/2589//console

This message is automatically generated.

> Regions's in OPENING state from failed regionservers takes a long time to recover
> ---------------------------------------------------------------------------------
>
>                 Key: HBASE-6060
>                 URL: https://issues.apache.org/jira/browse/HBASE-6060
>             Project: HBase
>          Issue Type: Bug
>          Components: master, regionserver
>            Reporter: Enis Soztutar
>            Assignee: rajeshbabu
>             Fix For: 0.96.0, 0.92.3, 0.94.2
>
>         Attachments: 6060-94-v3.patch, 6060-94-v4_1.patch,
> 6060-94-v4_1.patch, 6060-94-v4.patch, 6060_alternative_suggestion.txt,
> 6060_suggestion2_based_off_v3.patch, 6060_suggestion_based_off_v3.patch,
> 6060_suggestion_toassign_rs_wentdown_beforerequest.patch, 6060-trunk_2.patch,
> 6060-trunk_3.patch, 6060-trunk.patch, 6060-trunk.patch, HBASE-6060-92.patch,
> HBASE-6060-94.patch, HBASE-6060_latest.patch, HBASE-6060_latest.patch,
> HBASE-6060-trunk_4.patch, HBASE-6060_trunk_5.patch
>
>
> We have seen a pattern in tests where regions are stuck in OPENING state
> for a very long time when the region server that is opening the region fails.
> My understanding of the process:
>
>  - The master calls the rs to open the region. If the rs is offline, a new
> plan is generated (a new rs is chosen). RegionState is set to PENDING_OPEN
> (only in master memory; zk still shows OFFLINE). See
> HRegionServer.openRegion(), HMaster.assign()
>  - The RegionServer starts opening the region and changes the state in the
> znode. But that znode is not ephemeral. (see ZkAssign)
>  - The rs transitions the zk node from OFFLINE to OPENING. See
> OpenRegionHandler.process()
>  - The rs then opens the region and changes the znode from OPENING to OPENED.
>  - When the rs is killed between the OPENING and OPENED states, zk still
> shows OPENING, and the master just waits for the rs to change the region
> state. Since the rs is down, that won't happen.
>  - There is an AssignmentManager.TimeoutMonitor, which guards against exactly
> this kind of condition. It periodically checks (every 10 sec by default) the
> regions in transition to see whether they have timed out
> (hbase.master.assignment.timeoutmonitor.timeout). The default timeout is
> 30 min, which explains what you and I are seeing.
>  - ServerShutdownHandler in Master does not reassign regions in OPENING
> state, although it handles other states.
> Lowering that threshold from the configuration is one option, but still I
> think we can do better.
> Will investigate more.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
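
The gap described in the quoted issue can be illustrated with a small, self-contained Java sketch. It is not taken from HBase or from any attached patch: the class StuckOpeningSketch, its Transition record, and both helper methods are hypothetical names chosen for illustration. It contrasts a periodic timeout check (which only fires after the configured timeout, 30 min by default per the description) with a check keyed on the set of dead servers, which could flag OPENING regions as soon as their server is known to be down.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch; none of these names are real HBase APIs. */
final class StuckOpeningSketch {

    /** Region states named in the issue description. */
    enum State { OFFLINE, PENDING_OPEN, OPENING, OPENED }

    /** One region in transition, as a master might track it in memory (assumed shape). */
    static final class Transition {
        final State state;
        final String server;     // region server that was asked to open the region
        final long stampMillis;  // time of the last state change

        Transition(State state, String server, long stampMillis) {
            this.state = state;
            this.server = server;
            this.stampMillis = stampMillis;
        }
    }

    /** Timeout-monitor style check: flags regions whose last state change is older than the timeout. */
    static List<String> timedOut(Map<String, Transition> regionsInTransition,
                                 long nowMillis, long timeoutMillis) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Transition> e : regionsInTransition.entrySet()) {
            if (nowMillis - e.getValue().stampMillis > timeoutMillis) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    /** Dead-server style check: flags OPENING regions whose server is known to be down. */
    static List<String> openingOnDeadServers(Map<String, Transition> regionsInTransition,
                                             Set<String> deadServers) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Transition> e : regionsInTransition.entrySet()) {
            Transition t = e.getValue();
            if (t.state == State.OPENING && deadServers.contains(t.server)) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}
```

With a region that entered OPENING at time 0 on a server that then died, timedOut reports nothing until more than the full timeout has elapsed, while openingOnDeadServers reports the region immediately once the server appears in the dead set. Whether the actual fix takes this shape is not stated in this message.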