[ https://issues.apache.org/jira/browse/HBASE-3937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13041952#comment-13041952 ]
Jieshan Bean commented on HBASE-3937:
-------------------------------------

How about just modifying the PENDING_OPEN case as follows? Or should we just modify the assign method as suggested by J-D?

{noformat}
case PENDING_OPEN:
  LOG.info("Region has been PENDING_OPEN for too " +
      "long, reassigning region=" +
      regionInfo.getRegionNameAsString());
  // If the ZK node is in OPENING or any other non-OFFLINE state,
  // change it back to OFFLINE before reassigning
  String pendingNode = ZKAssign.getNodeName(watcher,
      regionInfo.getEncodedName());
  Stat pendingStat = new Stat();
  try {
    RegionTransitionData pendingData = ZKAssign.getDataNoWatch(
        watcher, pendingNode, pendingStat);
    if ((null != pendingData) &&
        (pendingData.getEventType() != EventType.M_ZK_REGION_OFFLINE)) {
      // Remember the old state so the log line below reports it correctly
      EventType previousType = pendingData.getEventType();
      pendingData = new RegionTransitionData(
          EventType.M_ZK_REGION_OFFLINE,
          regionInfo.getRegionName(), master.getServerName());
      if (ZKUtil.setData(watcher, pendingNode,
          pendingData.getBytes(), pendingStat.getVersion())) {
        // Node is now OFFLINE, let's trigger another assignment
        ZKUtil.getDataAndWatch(watcher, pendingNode);
        LOG.info("Successfully transitioned region=" +
            regionInfo.getRegionNameAsString() + " from " +
            previousType + " to OFFLINE and forcing a new assignment.");
      }
    }
  } catch (KeeperException ke) {
    LOG.error("ZK KeeperException timing out PENDING_OPEN region", ke);
  }
  assigns.put(regionState.getRegion(), Boolean.TRUE);
  break;
{noformat}

> Region PENDING-OPEN timeout with un-expected ZK node state leads to an endless loop
> -----------------------------------------------------------------------------------
>
>                 Key: HBASE-3937
>                 URL: https://issues.apache.org/jira/browse/HBASE-3937
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>             Fix For: 0.90.4
>
>
> The scenario in which this problem happened is as follows:
> 1. HMaster assigned region A to RS1, so the RegionState was set to PENDING_OPEN.
> 2. Because there were too many open requests, the open process on RS1 was blocked.
> 3. Some time later, TimeoutMonitor found that the assignment of A had timed out. Because the RegionState was PENDING_OPEN, it went into the following handler (which just puts the region into the set of regions waiting to be assigned):
>     case PENDING_OPEN:
>       LOG.info("Region has been PENDING_OPEN for too " +
>           "long, reassigning region=" +
>           regionInfo.getRegionNameAsString());
>       assigns.put(regionState.getRegion(), Boolean.TRUE);
>       break;
> So we can see that, in this case, the code assumes the ZK node state is OFFLINE. In the normal flow, that assumption holds.
> 4. But before the actual reassignment, the blocked request on RS1 was processed. That affected the new assignment, because it updated the ZK node state from OFFLINE to OPENING.
> 5. The new assignment started, so the master sent the region to RS2 to open. While opening, RS2 has to update the ZK node state from OFFLINE to OPENING. Because the current state was already OPENING, this operation failed.
> So this region could never be opened successfully.
> So I think that, to avoid this problem, in the PENDING_OPEN case of TimeoutMonitor we should transition the ZK node state to OFFLINE first.
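
The race in steps 1-5 can be reproduced in miniature. The following is only a sketch and does not use the HBase or ZooKeeper APIs: `RegionNode`, `transition`, `forceOffline` and `reassign` are hypothetical names, and an `AtomicReference` stands in for the region's ZK node. It assumes that RS1's late OFFLINE -> OPENING update lands before the master resets the node; if the late update arrives after the reset, the second open attempt can still fail.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for the region's ZK node; not HBase code.
final class PendingOpenRaceSketch {

    enum NodeState { OFFLINE, OPENING, OPENED }

    static final class RegionNode {
        private final AtomicReference<NodeState> state =
            new AtomicReference<>(NodeState.OFFLINE);

        // Conditional update: succeeds only if the node is still in 'expected'.
        boolean transition(NodeState expected, NodeState next) {
            return state.compareAndSet(expected, next);
        }

        // Unconditional reset, modelling the master forcing the node to OFFLINE.
        void forceOffline() {
            state.set(NodeState.OFFLINE);
        }

        NodeState get() {
            return state.get();
        }
    }

    // Returns true if the second region server (RS2) manages to start opening.
    static boolean reassign(boolean forceOfflineFirst) {
        // Step 1: master creates the node as OFFLINE and asks RS1 to open.
        RegionNode node = new RegionNode();

        // Steps 2-4: RS1 was blocked, then finally handles the request
        // and moves the node from OFFLINE to OPENING.
        node.transition(NodeState.OFFLINE, NodeState.OPENING);

        // Proposed change: the master resets the node before reassigning.
        if (forceOfflineFirst) {
            node.forceOffline();
        }

        // Step 5: RS2 expects to find the node OFFLINE.
        return node.transition(NodeState.OFFLINE, NodeState.OPENING);
    }

    public static void main(String[] args) {
        System.out.println("without reset: " + reassign(false)); // prints "without reset: false"
        System.out.println("with reset: " + reassign(true));     // prints "with reset: true"
    }
}
```

Without the reset, RS2's conditional update finds OPENING instead of OFFLINE and fails, which corresponds to step 5 above. With the reset, the same update succeeds.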