[ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ramkrishna.s.vasudevan updated HBASE-5120: ------------------------------------------ Status: Open (was: Patch Available) > Timeout monitor races with table disable handler > ------------------------------------------------ > > Key: HBASE-5120 > URL: https://issues.apache.org/jira/browse/HBASE-5120 > Project: HBase > Issue Type: Bug > Affects Versions: 0.92.0 > Reporter: Zhihong Yu > Priority: Blocker > Fix For: 0.94.0, 0.92.1 > > Attachments: HBASE-5120.patch, HBASE-5120_1.patch, > HBASE-5120_2.patch, HBASE-5120_3.patch > > > Here is what J-D described here: > https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176 > I think I will retract from my statement that it "used to be extremely racy > and caused more troubles than it fixed", on my first test I got a stuck > region in transition instead of being able to recover. The timeout was set to > 2 minutes to be sure I hit it. > First the region gets closed > {quote} > 2012-01-04 00:16:25,811 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to > sv4r5s38,62023,1325635980913 for region > test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. > {quote} > 2 minutes later it times out: > {quote} > 2012-01-04 00:18:30,026 INFO > org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed > out: test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. > state=PENDING_CLOSE, ts=1325636185810, server=null > 2012-01-04 00:18:30,026 INFO > org.apache.hadoop.hbase.master.AssignmentManager: Region has been > PENDING_CLOSE for too long, running forced unassign again on > region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. > 2012-01-04 00:18:30,027 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of > region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. > (offlining) > {quote} > 100ms later the master finally gets the event: > {quote} > 2012-01-04 00:18:30,129 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Handling > transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, > region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late > 2012-01-04 00:18:30,129 DEBUG > org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED > event for 1a4b111bcc228043e89f59c4c3f6a791 > 2012-01-04 00:18:30,129 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so > deleting ZK node and removing from regions in transition, skipping assignment > of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. > 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: > master:62003-0x134589d3db03587 Deleting existing unassigned node for > 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED > 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: > master:62003-0x134589d3db03587 Successfully deleted unassigned node for > region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED > {quote} > At this point everything is fine, the region was processed as closed. But > wait, remember that line where it said it was going to force an unassign? > {quote} > 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: > master:62003-0x134589d3db03587 Creating unassigned node for > 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state > 2012-01-04 00:18:30,328 INFO > org.apache.hadoop.hbase.master.AssignmentManager: Server null returned > java.lang.NullPointerException: Passed server is null for > 1a4b111bcc228043e89f59c4c3f6a791 > {quote} > Now the master is confused, it recreated the RIT znode but the region doesn't > even exist anymore. It even tries to shut it down but is blocked by NPEs. Now > this is what's going on. > The late ZK notification that the znode was deleted (but it got recreated > after): > {quote} > 2012-01-04 00:19:33,285 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: The znode of region > test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been > deleted. > {quote} > Then it prints this, and much later tries to unassign it again: > {quote} > 2012-01-04 00:19:46,607 DEBUG > org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on region > to clear regions in transition; > test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. > state=PENDING_CLOSE, ts=1325636310328, server=null > ... > 2012-01-04 00:20:39,623 DEBUG > org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on region > to clear regions in transition; > test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. > state=PENDING_CLOSE, ts=1325636310328, server=null > 2012-01-04 00:20:39,864 INFO > org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed > out: test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. > state=PENDING_CLOSE, ts=1325636310328, server=null > 2012-01-04 00:20:39,864 INFO > org.apache.hadoop.hbase.master.AssignmentManager: Region has been > PENDING_CLOSE for too long, running forced unassign again on > region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. > 2012-01-04 00:20:39,865 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of > region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. > (offlining) > 2012-01-04 00:20:39,865 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign > region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it > is not currently assigned anywhere > {quote} > And this is still ongoing. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira