I have filed an issue, and I'll commit a patch soon (I still need to do some testing on the patch). Issue address: https://issues.apache.org/jira/browse/HBASE-3937
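The rough idea of the patch, in the PENDING_OPEN case of TimeoutMonitor, is to force the unassigned znode back to OFFLINE before queueing the region for reassignment. Below is only a sketch for discussion, not the committed patch: it assumes the ZKAssign.createOrForceNodeOffline() helper the master already uses during normal assignment, the surrounding field names are illustrative, and it still has to settle what to do when we lose the race because a regionserver has already moved the node to OPENING:

  case PENDING_OPEN:
    LOG.info("Region has been PENDING_OPEN for too " +
        "long, reassigning region=" +
        regionInfo.getRegionNameAsString());
    try {
      // Push the unassigned znode back to OFFLINE so the new open attempt
      // does not trip over a stale OPENING left behind by the slow RS.
      // (Helper name/signature are an assumption, not the final patch.)
      ZKAssign.createOrForceNodeOffline(watcher,
          regionState.getRegion(), master.getServerName());
    } catch (KeeperException ke) {
      LOG.error("Failed forcing OFFLINE for " +
          regionInfo.getRegionNameAsString() + "; skipping reassign", ke);
      break;
    }
    assigns.put(regionState.getRegion(), Boolean.TRUE);
    break;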
It indeed has something to do with HBASE-3789. I'm still looking into this issue; any further discussion I'll add in the comments. Thanks for looking into this problem, Stack.

Jieshan Bean

--------------

Thanks for digging in, Jean. Your diagnosis below looks right to me -- the bit about the master trying to reset the node to OFFLINE before reassigning. It will help if a regionserver has set it OPENING in the meantime. How do you propose to handle the case where we fail setting it to OFFLINE because RS1 has already set it OPENING? Will you just drop the transaction?

Mind filing an issue and, if possible, a patch?

You might want to check out J-D's work in this area too: HBASE-3789 [1] (in particular, the comment where he describes his fix: [2]). There is some overlap, but I do not think he has addressed what you see below.

Yours,
St.Ack

1. https://issues.apache.org/jira/browse/HBASE-3789
2. https://issues.apache.org/jira/browse/HBASE-3789?focusedCommentId=13039368&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13039368

On Sun, May 29, 2011 at 6:48 PM, bijieshan <bijies...@huawei.com> wrote:
> After I traced through the logs and the code, I found the problem.
> Maybe I didn't describe the problem correctly; the title is also puzzling.
>
> Let me try again to show the scenario that creates the problem:
>
> 1. HMaster assigned region A to RS1, so the RegionState was set to PENDING_OPEN.
> 2. Because there were too many opening requests, the open process on RS1 was blocked.
> 3. Some time later, TimeoutMonitor found that the assignment of A had timed out. Since the
> RegionState was PENDING_OPEN, it went into the following handler code (which just puts the
> region into a waiting-to-assign set):
>
> case PENDING_OPEN:
>   LOG.info("Region has been PENDING_OPEN for too " +
>       "long, reassigning region=" +
>       regionInfo.getRegionNameAsString());
>   assigns.put(regionState.getRegion(), Boolean.TRUE);
>   break;
>
> So we can see that in this case we assume the ZK node state is OFFLINE. In the normal flow,
> that assumption holds.
>
> 4. But before the real reassignment happened, the queued request on RS1 was processed, and
> that interfered with the new assignment because it updated the ZK node state from OFFLINE
> to OPENING.
> 5. The new assignment then sent the region to be opened on RS2. While opening, RS2 had to
> update the ZK node state from OFFLINE to OPENING, but because the current state was already
> OPENING, the operation failed. So this region could never be opened successfully.
>
> So I think that to avoid this problem, in the PENDING_OPEN case of TimeoutMonitor we should
> transition the ZK node state back to OFFLINE first.
>
> Thanks!
>
> Jieshan Bean
>
> ------------------------
>
> Hi,
> During that time, too many regions were being assigned.
> I have read the related code, but the problem still has me scratching my head.
> The fact is that the region could not be opened because the ZK state was not the expected one:
>
> 2011-05-20 16:02:58,993 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign:
> regionserver:20020-0x1300c11b4f30051 Attempt to transition the unassigned
> node for d7555a12586e6c788ca55017224b5a51 from M_ZK_REGION_OFFLINE to
> RS_ZK_REGION_OPENING failed, the node existed but was in the state
> RS_ZK_REGION_OPENING set by the server 157-5-111-11,20020,1305875930161
>
> So the question is, what condition could cause the inconsistent states?
> This is a segment of the HMaster logs around that time (there are many logs like this):
>
> 15:49:47,864 DEBUG org.apache.hadoop.hbase.master.AssignmentManager:
> Assigning region ufdr,051410,1305873959469.14cfc2222fff69c0b44bf2cdc9e20dd1.
> to 157-5-111-13,20020,1305877624933
> 2011-05-20 15:49:47,867 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Handling
> transition=RS_ZK_REGION_OPENED, server=157-5-111-14,20020,1305877627727,
> region=5910a81f573f8e9e255db473e9407ab4
> 2011-05-20 15:49:47,867 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE;
> was=ufdr,051998,1305873973067.193c64299a34361f21e637ad203c8abb.
> state=PENDING_OPEN, ts=1305877600490
> 2011-05-20 15:49:47,867 DEBUG
> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED
> event for 5910a81f573f8e9e255db473e9407ab4; deleting unassigned node
> 2011-05-20 15:49:47,867 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan
> was found (or we are ignoring an existing plan) for
> ufdr,051998,1305873973067.193c64299a34361f21e637ad203c8abb. so generated a
> random one; hri=ufdr,051998,1305873973067.193c64299a34361f21e637ad203c8abb.,
> src=, dest=157-5-111-12,20020,1305877626108; 4 (online=4, exclude=null)
> available servers
>
> Regards,
> Jieshan Bean
>
> --------------
>
> I was asking about what was going on in the master during that time; I
> really would like to see it. It should be some time after that exception:
>
> 2011-05-20 15:49:48,122 ERROR
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed
> open of region=ufdr,010142,1305873720296.46a1a44714226105c11f82a2f7c6d8fa.
>
> About resetting the znode, as you can see in TimeoutMonitor, we don't
> really care whether it was reset or not, as it should take care of doing that.
> The issue here is getting at the root of the problem.
>
> J-D