[ https://issues.apache.org/jira/browse/HBASE-5806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13264950#comment-13264950 ]
Chinna Rao Lalam commented on HBASE-5806:
-----------------------------------------

For #1 above, the RegionServer crashed in SplitTransaction.createDaughters(Server, RegionServerServices) while removing the parent region from the online regions:

{code}
if (!testing) {
  services.removeFromOnlineRegions(this.parent.getRegionInfo().getEncodedName());
}
{code}

Whenever the regionserver crashes, the ephemeral node is deleted and the master gets the nodeDeleted() notification, where the region is cleared from RIT. But in this case the ServerShutdownHandler executed before the nodeDeleted() event for the region node. You can see that from the logs below:

{noformat}
2012-04-06 14:35:08,841 DEBUG org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Removed test,,1333702991530.cdfa837563e75ac5f4dc128680cc8da8. from list of regions to assign because in RIT; region state: SPLITTING
2012-04-06 14:35:12,981 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Ephemeral node deleted, regionserver crashed?, clearing from RIT; rs=test,,1333702991530.cdfa837563e75ac5f4dc128680cc8da8. state=SPLITTING, ts=1333703059260, server=HOST-10-18-40-25,60020,1333695183392
{noformat}

In this situation the code below returned that region, because it was still in RIT:

{code}
List<RegionState> regionsInTransition =
  this.services.getAssignmentManager().processServerShutdown(this.serverName);
{code}

Its state satisfies !rit.isClosing() && !rit.isPendingClose(), so the region is removed from hris and is never reassigned:

{code}
for (RegionState rit : regionsInTransition) {
  if (!rit.isClosing() && !rit.isPendingClose()) {
    LOG.debug("Removed " + rit.getRegion().getRegionNameAsString() +
      " from list of regions to assign because in RIT; region state: " +
      rit.getState());
    if (hris != null) hris.remove(rit.getRegion());
  }
}
{code}

The fix in SSH addresses #1. #2 came because of HBASE-5615; however, HBASE-5615 was reverted. #3 occurs when the master restarts after splitting is done and before CJ has cleared the region from META.
So while rebuilding the user regions we ensure that the offlined parent region is not taken into account again. #2 and #3 are handled together in this patch, so the fix solves both problems.

> Handle split region related failures on master restart and RS restart
> ---------------------------------------------------------------------
>
>                 Key: HBASE-5806
>                 URL: https://issues.apache.org/jira/browse/HBASE-5806
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.1
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: Chinna Rao Lalam
>             Fix For: 0.92.2, 0.96.0, 0.94.1
>
>         Attachments: HBASE-5806.patch
>
>
> This issue is raised to solve issues that comes out of partial region split
> happened and the region node in the ZK which is in RS_ZK_REGION_SPLITTING and
> RS_ZK_REGION_SPLIT is not yet processed.
> This also tries to address HBASE-5615.
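
The ServerShutdownHandler behaviour described for #1 can be illustrated with a small standalone sketch. It is not HBase code: the class `SshRitFilterSketch`, the `State` enum, and the use of plain region-name strings are hypothetical stand-ins chosen for illustration. It only mirrors the quoted loop, in which any region in transition that is neither CLOSING nor PENDING_CLOSE is dropped from the list of regions to assign.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SshRitFilterSketch {
    // Hypothetical stand-in for the region states mentioned in the comment.
    public enum State { SPLITTING, CLOSING, PENDING_CLOSE }

    // Mirrors the quoted loop: a region in transition that is neither CLOSING
    // nor PENDING_CLOSE is removed from the list of regions to assign.
    public static List<String> regionsToAssign(List<String> hris,
                                               Map<String, State> regionsInTransition) {
        List<String> result = new ArrayList<>(hris);
        for (Map.Entry<String, State> rit : regionsInTransition.entrySet()) {
            State state = rit.getValue();
            if (state != State.CLOSING && state != State.PENDING_CLOSE) {
                result.remove(rit.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Regions that were hosted on the crashed server (names are made up).
        List<String> hris = List.of("parent", "other");

        // The parent is still SPLITTING in RIT because the handler ran
        // before the nodeDeleted() event cleared it.
        Map<String, State> rit = new LinkedHashMap<>();
        rit.put("parent", State.SPLITTING);

        // The SPLITTING parent is dropped, so nothing reassigns it.
        System.out.println(regionsToAssign(hris, rit)); // prints [other]
    }
}
```

Under these assumptions, the parent region stuck in SPLITTING is filtered out of the assignment list, which matches the "Removed ... from list of regions to assign because in RIT" log line in the comment.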