[ https://issues.apache.org/jira/browse/HBASE-4729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13159113#comment-13159113 ]
Ted Yu commented on HBASE-4729:
-------------------------------

{code}
+   * Updates the RegionState and sends the CLOSE RPC unless regions is being
{code}
The above should read '... region is being'

{code}
+   * of existence). TODO: What to do if split fails and is rolled back and
+   * parent is revivified?
{code}
The above would be handled in another JIRA.
How about introducing a new state, such as RS_ZK_PARENT_REGION_CLOSE_THRU_SPLIT, so that rolling back a failed split can decide what to do?

{code}
+    } catch (KeeperException ke) {
+      LOG.warn("Presuming failed getData on " + path + "; presuming " +
+        "split and that the region to unassign, " + encodedName +
+        ", no longer exists -- confirm", ke);
+      return;
+    }
{code}
We should verify the above assumption by checking that ke is NoNodeException. If ke is not NoNodeException, we should abort.

> Race between online altering and splitting kills the master
> -----------------------------------------------------------
>
>                 Key: HBASE-4729
>                 URL: https://issues.apache.org/jira/browse/HBASE-4729
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Jean-Daniel Cryans
>            Assignee: ramkrishna.s.vasudevan
>            Priority: Critical
>             Fix For: 0.92.0, 0.94.0
>
>         Attachments: 4729-v2.txt, 4729-v3.txt, 4729.txt
>
>
> I was running an online alter while regions were splitting, and suddenly the
> master died and left my table half-altered (haven't restarted the master yet).
> What killed the master:
> {quote}
> 2011-11-02 17:06:44,428 FATAL org.apache.hadoop.hbase.master.HMaster: Unexpected ZK exception creating node CLOSING
> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /hbase/unassigned/f7e1783e65ea8d621a4bc96ad310f101
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:110)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>         at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:637)
>         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.createNonSequential(RecoverableZooKeeper.java:459)
>         at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.create(RecoverableZooKeeper.java:441)
>         at org.apache.hadoop.hbase.zookeeper.ZKUtil.createAndWatch(ZKUtil.java:769)
>         at org.apache.hadoop.hbase.zookeeper.ZKAssign.createNodeClosing(ZKAssign.java:568)
>         at org.apache.hadoop.hbase.master.AssignmentManager.unassign(AssignmentManager.java:1722)
>         at org.apache.hadoop.hbase.master.AssignmentManager.unassign(AssignmentManager.java:1661)
>         at org.apache.hadoop.hbase.master.BulkReOpen$1.run(BulkReOpen.java:69)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:662)
> {quote}
> A znode was created because the region server was splitting the region 4
> seconds before:
> {quote}
> 2011-11-02 17:06:40,704 INFO org.apache.hadoop.hbase.regionserver.SplitTransaction: Starting split of region TestTable,0012469153,1320253135043.f7e1783e65ea8d621a4bc96ad310f101.
> 2011-11-02 17:06:40,704 DEBUG org.apache.hadoop.hbase.regionserver.SplitTransaction: regionserver:62023-0x132f043bbde0710 Creating ephemeral node for f7e1783e65ea8d621a4bc96ad310f101 in SPLITTING state
> 2011-11-02 17:06:40,751 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:62023-0x132f043bbde0710 Attempting to transition node f7e1783e65ea8d621a4bc96ad310f101 from RS_ZK_REGION_SPLITTING to RS_ZK_REGION_SPLITTING
> ...
> 2011-11-02 17:06:44,061 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:62023-0x132f043bbde0710 Successfully transitioned node f7e1783e65ea8d621a4bc96ad310f101 from RS_ZK_REGION_SPLITTING to RS_ZK_REGION_SPLIT
> 2011-11-02 17:06:44,061 INFO org.apache.hadoop.hbase.regionserver.SplitTransaction: Still waiting on the master to process the split for f7e1783e65ea8d621a4bc96ad310f101
> {quote}
> Now that the master is dead the region server is spewing those last two lines
> like mad.
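
The review comment above asks that only a NoNodeException be taken as evidence that the region was split away, and that any other KeeperException lead to an abort. A minimal sketch of that check follows. It is an illustration only, not code from the patch: the KeeperException and NoNodeException classes here are local stand-ins for the ZooKeeper classes of the same names so the example compiles on its own, and UnassignGuard, ZkRead, and Outcome are hypothetical names introduced for the example.

```java
// Local stand-ins for org.apache.zookeeper.KeeperException and its nested
// NoNodeException, so this sketch compiles without ZooKeeper on the classpath.
class KeeperException extends Exception {
    KeeperException(String message) {
        super(message);
    }

    static class NoNodeException extends KeeperException {
        NoNodeException(String path) {
            super("KeeperErrorCode = NoNode for " + path);
        }
    }
}

// Hypothetical helper: decides what unassign should do after reading the znode.
class UnassignGuard {
    // Hypothetical outcomes; the real master would proceed, return, or abort.
    enum Outcome { PROCEED, SKIP_REGION_GONE, ABORT }

    // Hypothetical abstraction over the getData call made on the znode.
    interface ZkRead {
        byte[] getData(String path) throws KeeperException;
    }

    static Outcome checkBeforeUnassign(ZkRead zk, String path) {
        try {
            zk.getData(path);
            return Outcome.PROCEED;
        } catch (KeeperException.NoNodeException e) {
            // The znode is gone: presume the region was split and skip it.
            return Outcome.SKIP_REGION_GONE;
        } catch (KeeperException e) {
            // Any other ZooKeeper failure says nothing about a split: abort.
            return Outcome.ABORT;
        }
    }
}
```

The more specific catch clause comes first, so a missing znode is the only failure treated as "region no longer exists"; every other KeeperException maps to ABORT rather than being silently presumed to be a split.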