[ https://issues.apache.org/jira/browse/HBASE-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113145#comment-13113145 ]
Hudson commented on HBASE-4452:
-------------------------------

Integrated in HBase-0.92 #15 (See [https://builds.apache.org/job/HBase-0.92/15/])
    HBASE-4452 Possibility of RS opening a region though tickleOpening fails due to znode version mismatch (Ramkrishna)

tedyu :
Files :
* /hbase/branches/0.92/CHANGES.txt
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java


> Possibility of RS opening a region though tickleOpening fails due to znode version mismatch
> -------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4452
>                 URL: https://issues.apache.org/jira/browse/HBASE-4452
>             Project: HBase
>          Issue Type: Bug
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>            Priority: Critical
>             Fix For: 0.90.5
>
>         Attachments: 4452.90, HBASE-4452.patch
>
>
> Consider the following code:
> {code}
>     long period = Math.max(1, assignmentTimeout/ 3);
>     long lastUpdate = now;
>     while (!signaller.get() && t.isAlive() && !this.server.isStopped() &&
>         !this.rsServices.isStopping() && (endTime > now)) {
>       long elapsed = now - lastUpdate;
>       if (elapsed > period) {
>         // Only tickle OPENING if postOpenDeployTasks is taking some time.
>         lastUpdate = now;
>         tickleOpening("post_open_deploy");
>       }
> {code}
> Whenever postOpenDeployTasks takes considerable time, we call tickleOpening so that no timeout is detected. But if, before this happens, the TimeoutMonitor tries to assign the node to another RS, the other RS will move the node from OFFLINE to OPENING. Hence, when the first RS calls tickleOpening, the operation will fail.
> Now here lies the problem:
> {code}
>     String encodedName = this.regionInfo.getEncodedName();
>     try {
>       this.version =
>         ZKAssign.retransitionNodeOpening(server.getZooKeeper(),
>           this.regionInfo, this.server.getServerName(), this.version);
>     } catch (KeeperException e) {
> {code}
> this.version now becomes -1 because the operation failed.
> In the first code snippet the return value of tickleOpening() is not checked, so after it fails we still go on to move the node to OPENED. Here again there is no check that catches this condition, because the version has already been changed to -1. Hence the OPENING to OPENED transition succeeds, and there is a chance of double assignment.
> {noformat}
> 2011-09-22 00:57:29,930 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x1328ceaa1ff000d Attempt to transition the unassigned node for 69797d064f773d1aa9adba56e7ff90a3 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENING failed, the node existed but was version 5 not the expected version 2
> 2011-09-22 00:57:33,494 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed refreshing OPENING; region=69797d064f773d1aa9adba56e7ff90a3, context=post_open_deploy
> 2011-09-22 00:58:02,356 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x1328ceaa1ff000d Attempting to transition node 69797d064f773d1aa9adba56e7ff90a3 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENED
> 2011-09-22 00:58:11,853 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x1328ceaa1ff000d Successfully transitioned node 69797d064f773d1aa9adba56e7ff90a3 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENED
> 2011-09-22 00:58:13,956 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Opened t9,,1316633193606.69797d064f773d1aa9adba56e7ff90a3.
> {noformat}
> Correct me if this analysis is wrong.
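
The failure mode reported in this issue can be illustrated with a small self-contained Java sketch. It is not HBase code and not the committed patch: OpenSketch, VersionedNode, and every method name below are hypothetical stand-ins. The sketch assumes that an expected version of -1 skips the version check, as it does for ZooKeeper's setData. It contrasts an opener that ignores a failed tickle with one that aborts when the tickle fails.

```java
/** Hypothetical illustration only; none of these names exist in HBase. */
final class OpenSketch {

    /** Toy stand-in for the versioned unassigned znode of one region. */
    static final class VersionedNode {
        private int version = 0;
        private String state = "OFFLINE";

        /**
         * Compare-and-set transition. Returns the new version, or -1 on failure.
         * Assumption: expectedVersion == -1 matches any version.
         */
        synchronized int transition(String from, String to, int expectedVersion) {
            if (!state.equals(from)) {
                return -1; // node is not in the state the caller expects
            }
            if (expectedVersion != -1 && expectedVersion != version) {
                return -1; // version mismatch: someone else touched the node
            }
            state = to;
            version++;
            return version;
        }

        synchronized String state() {
            return state;
        }
    }

    /** Version the first region server still holds after being timed out. */
    static final int STALE_VERSION = 1;

    /** Builds a node that was reassigned behind the first region server's back. */
    static VersionedNode reassignedNode() {
        VersionedNode node = new VersionedNode();
        node.transition("OFFLINE", "OPENING", 0);  // first RS starts opening, version 1
        node.transition("OPENING", "OFFLINE", -1); // timeout: node forced OFFLINE, version 2
        node.transition("OFFLINE", "OPENING", 2);  // second RS starts opening, version 3
        return node;
    }

    /** Problematic flow: the tickle result is not checked, so -1 is carried forward. */
    static boolean openIgnoringTickle(VersionedNode node, int version) {
        version = node.transition("OPENING", "OPENING", version); // -1 when stale
        return node.transition("OPENING", "OPENED", version) != -1;
    }

    /** Guarded flow: a failed tickle aborts the open. */
    static boolean openCheckingTickle(VersionedNode node, int version) {
        int refreshed = node.transition("OPENING", "OPENING", version);
        if (refreshed == -1) {
            return false; // another server owns the node now; give up
        }
        return node.transition("OPENING", "OPENED", refreshed) != -1;
    }

    public static void main(String[] args) {
        System.out.println("ignoring tickle result: opened="
            + openIgnoringTickle(reassignedNode(), STALE_VERSION));
        System.out.println("checking tickle result: opened="
            + openCheckingTickle(reassignedNode(), STALE_VERSION));
    }
}
```

In this toy model the unchecked flow moves the node to OPENED even though another server is opening the same region, which corresponds to the double assignment described in the issue. The guarded flow leaves the node in OPENING for the second server.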