[ https://issues.apache.org/jira/browse/HBASE-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112535#comment-13112535 ]
Ted Yu commented on HBASE-4452:
-------------------------------

Minor comment:
{code}
+ // InterruptedException too. If so, we failed. Even if tickle opening fails
+ // then it is a failure.
{code}
I think we don't need 'Even' above.
Also, I would initialize the new boolean with false.

Running test suite.

> Possibility of RS opening a region though tickleOpening fails due to znode version mismatch
> -------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4452
>                 URL: https://issues.apache.org/jira/browse/HBASE-4452
>             Project: HBase
>          Issue Type: Bug
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>            Priority: Critical
>         Attachments: HBASE-4452.patch
>
>
> Consider the following code
> {code}
> long period = Math.max(1, assignmentTimeout/ 3);
> long lastUpdate = now;
> while (!signaller.get() && t.isAlive() && !this.server.isStopped() &&
>     !this.rsServices.isStopping() && (endTime > now)) {
>   long elapsed = now - lastUpdate;
>   if (elapsed > period) {
>     // Only tickle OPENING if postOpenDeployTasks is taking some time.
>     lastUpdate = now;
>     tickleOpening("post_open_deploy");
>   }
> {code}
> Whenever postOpenDeployTasks takes considerable time, we call tickleOpening
> so that no timeout is detected. But if the TimeoutMonitor assigns the region
> to another RS before that happens, the other RS moves the node from OFFLINE
> to OPENING. Hence, when the first RS calls tickleOpening, the operation
> fails. Here lies the problem:
> {code}
> String encodedName = this.regionInfo.getEncodedName();
> try {
>   this.version =
>     ZKAssign.retransitionNodeOpening(server.getZooKeeper(),
>       this.regionInfo, this.server.getServerName(), this.version);
> } catch (KeeperException e) {
> {code}
> Now this.version becomes -1 as the operation failed.
> As the first code snippet shows, the return value of tickleOpening() is not
> captured, so after it fails we go on to move the node to OPENED. Here again
> there is no check for this condition, because the version has already been
> changed to -1. Hence the OPENING to OPENED transition succeeds, with a
> chance of double assignment.
> {noformat}
> 2011-09-22 00:57:29,930 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x1328ceaa1ff000d Attempt to transition the unassigned node for 69797d064f773d1aa9adba56e7ff90a3 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENING failed, the node existed but was version 5 not the expected version 2
> 2011-09-22 00:57:33,494 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed refreshing OPENING; region=69797d064f773d1aa9adba56e7ff90a3, context=post_open_deploy
> 2011-09-22 00:58:02,356 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x1328ceaa1ff000d Attempting to transition node 69797d064f773d1aa9adba56e7ff90a3 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENED
> 2011-09-22 00:58:11,853 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x1328ceaa1ff000d Successfully transitioned node 69797d064f773d1aa9adba56e7ff90a3 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENED
> 2011-09-22 00:58:13,956 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Opened t9,,1316633193606.69797d064f773d1aa9adba56e7ff90a3.
> {noformat}
> Correct me if this analysis is wrong.
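
For illustration, the sequence described in the issue can be modelled with a toy versioned node. This is only a minimal sketch, not HBase or ZooKeeper code and not the attached patch: FakeZnode, openIgnoringTickle, openCheckingTickle and the tickleFailed flag are hypothetical names. The one behaviour it assumes is that an expected version of -1 skips the version check, which is what the description relies on. The version numbers 5 and 2 are taken from the log above.

```java
/** Toy model of the race in this issue; all names here are hypothetical. */
public class TickleSketch {

    /** Stand-in for the unassigned znode: only a version counter is modelled. */
    static final class FakeZnode {
        private int version;

        FakeZnode(int initialVersion) {
            this.version = initialVersion;
        }

        /**
         * Bumps the version if expectedVersion matches the current one.
         * An expectedVersion of -1 skips the check (assumed behaviour).
         * Returns the new version, or -1 on a mismatch.
         */
        synchronized int transition(int expectedVersion) {
            if (expectedVersion != -1 && expectedVersion != version) {
                return -1;
            }
            version++;
            return version;
        }
    }

    /** Flow from the description: the result of the tickle is ignored. */
    static boolean openIgnoringTickle(FakeZnode node, int knownVersion) {
        // Tickle (OPENING -> OPENING); yields -1 if another RS touched the node.
        int version = node.transition(knownVersion);
        // No check here, so a -1 is passed on and disables the version check.
        return node.transition(version) != -1; // OPENING -> OPENED
    }

    /** Same flow, but a failed tickle aborts the open. */
    static boolean openCheckingTickle(FakeZnode node, int knownVersion) {
        boolean tickleFailed = false;
        int version = node.transition(knownVersion);
        if (version == -1) {
            tickleFailed = true;
        }
        if (tickleFailed) {
            return false; // give up instead of moving the node to OPENED
        }
        return node.transition(version) != -1;
    }

    public static void main(String[] args) {
        // The node is at version 5, but this RS still expects version 2.
        System.out.println(openIgnoringTickle(new FakeZnode(5), 2));
        System.out.println(openCheckingTickle(new FakeZnode(5), 2));
    }
}
```

With the stale version, the first flow reports the region as opened even though the tickle failed, while the second one stops at the failed tickle. When the versions match, both flows succeed.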