[ 
https://issues.apache.org/jira/browse/HBASE-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ramkrishna.s.vasudevan updated HBASE-4452:
------------------------------------------

    Description: 
Consider the following code
{code}
    long period = Math.max(1, assignmentTimeout/ 3);
    long lastUpdate = now;
    while (!signaller.get() && t.isAlive() && !this.server.isStopped() &&
        !this.rsServices.isStopping() && (endTime > now)) {
      long elapsed = now - lastUpdate;
      if (elapsed > period) {
        // Only tickle OPENING if postOpenDeployTasks is taking some time.
        lastUpdate = now;
        tickleOpening("post_open_deploy");
      }
{code}
Whenever the postopenDeploy tasks takes considerable time we try to 
tickleOpening so that there is no timeout deducted.  But before it could do 
this if the TimeoutMonitor tries to assign the node to another RS then the 
other RS will move the node from OFFLINE to OPENING.  Hence when the first RS 
tries to do tickleOpening the operation will fail. Now here lies the problem,
{code}
    String encodedName = this.regionInfo.getEncodedName();
    try {
      this.version =
        ZKAssign.retransitionNodeOpening(server.getZooKeeper(),
          this.regionInfo, this.server.getServerName(), this.version);
    } catch (KeeperException e) {
{code}
Now this.version becomes -1 as the operation failed.
Now as in the first code snippet as the return type is not captured after 
tickleOpening() fails we go on with moving the node to OPENED.  Here again we 
dont have any check for this condition as already the version has been changed 
to -1.  Hence the OPENING to OPENED becomes successful. Chances of double 
assignment.
{noformat}
2011-09-22 00:57:29,930 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: 
regionserver:60020-0x1328ceaa1ff000d Attempt to transition the unassigned node 
for 69797d064f773d1aa9adba56e7ff90a3 from RS_ZK_REGION_OPENING to 
RS_ZK_REGION_OPENING failed, the node existed but was version 5 not the 
expected version 2
2011-09-22 00:57:33,494 WARN 
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed 
refreshing OPENING; region=69797d064f773d1aa9adba56e7ff90a3, 
context=post_open_deploy
2011-09-22 00:58:02,356 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
regionserver:60020-0x1328ceaa1ff000d Attempting to transition node 
69797d064f773d1aa9adba56e7ff90a3 from RS_ZK_REGION_OPENING to 
RS_ZK_REGION_OPENED
2011-09-22 00:58:11,853 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
regionserver:60020-0x1328ceaa1ff000d Successfully transitioned node 
69797d064f773d1aa9adba56e7ff90a3 from RS_ZK_REGION_OPENING to 
RS_ZK_REGION_OPENED
2011-09-22 00:58:13,956 DEBUG 
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Opened 
t9,,1316633193606.69797d064f773d1aa9adba56e7ff90a3.

{noformat}
Correct me if this analysis is wrong.

  was:
Consider the following code
{code}
    long period = Math.max(1, assignmentTimeout/ 3);
    long lastUpdate = now;
    while (!signaller.get() && t.isAlive() && !this.server.isStopped() &&
        !this.rsServices.isStopping() && (endTime > now)) {
      long elapsed = now - lastUpdate;
      if (elapsed > period) {
        // Only tickle OPENING if postOpenDeployTasks is taking some time.
        lastUpdate = now;
        tickleOpening("post_open_deploy");
      }
{code}
Whenever the postopenDeploy tasks takes considerable time we try to 
tickleOpening so that there is no timeout deducted.  But before it could do 
this if the TimeoutMonitor tries to assign the node to another RS then the 
other RS will move the node from OFFLINE to OPENING.  Hence when the first RS 
tries to do tickleOpening the operation will fail. Now here lies the problem,
{code}
    String encodedName = this.regionInfo.getEncodedName();
    try {
      this.version =
        ZKAssign.retransitionNodeOpening(server.getZooKeeper(),
          this.regionInfo, this.server.getServerName(), this.version);
    } catch (KeeperException e) {
{code}
Now this.version becomes -1 as the operation failed.
Now as in the first code snippet as the return type is not captured after 
tickleOpening() failes we go on with moving the node to OPENED.  Here again we 
dont have any check for this condition as already the version has been changed 
to -1.  Hence the OPENING to OPENED becomes successful. Chances of double 
assignment.
{noformat}
2011-09-22 00:57:29,930 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: 
regionserver:60020-0x1328ceaa1ff000d Attempt to transition the unassigned node 
for 69797d064f773d1aa9adba56e7ff90a3 from RS_ZK_REGION_OPENING to 
RS_ZK_REGION_OPENING failed, the node existed but was version 5 not the 
expected version 2
2011-09-22 00:57:33,494 WARN 
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed 
refreshing OPENING; region=69797d064f773d1aa9adba56e7ff90a3, 
context=post_open_deploy
2011-09-22 00:58:02,356 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
regionserver:60020-0x1328ceaa1ff000d Attempting to transition node 
69797d064f773d1aa9adba56e7ff90a3 from RS_ZK_REGION_OPENING to 
RS_ZK_REGION_OPENED
2011-09-22 00:58:11,853 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
regionserver:60020-0x1328ceaa1ff000d Successfully transitioned node 
69797d064f773d1aa9adba56e7ff90a3 from RS_ZK_REGION_OPENING to 
RS_ZK_REGION_OPENED
2011-09-22 00:58:13,956 DEBUG 
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Opened 
t9,,1316633193606.69797d064f773d1aa9adba56e7ff90a3.

{noformat}
Correct me if this analysis is wrong.

        Summary: Possibility of RS opening a region though tickleOpening fails 
due to znode version mismatch  (was: Possiblilty of RS opening a region though 
tickleOpening fails due to znode version mismatch)

> Possibility of RS opening a region though tickleOpening fails due to znode 
> version mismatch
> -------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4452
>                 URL: https://issues.apache.org/jira/browse/HBASE-4452
>             Project: HBase
>          Issue Type: Bug
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>            Priority: Critical
>
> Consider the following code
> {code}
>     long period = Math.max(1, assignmentTimeout/ 3);
>     long lastUpdate = now;
>     while (!signaller.get() && t.isAlive() && !this.server.isStopped() &&
>         !this.rsServices.isStopping() && (endTime > now)) {
>       long elapsed = now - lastUpdate;
>       if (elapsed > period) {
>         // Only tickle OPENING if postOpenDeployTasks is taking some time.
>         lastUpdate = now;
>         tickleOpening("post_open_deploy");
>       }
> {code}
> Whenever the postopenDeploy tasks takes considerable time we try to 
> tickleOpening so that there is no timeout deducted.  But before it could do 
> this if the TimeoutMonitor tries to assign the node to another RS then the 
> other RS will move the node from OFFLINE to OPENING.  Hence when the first RS 
> tries to do tickleOpening the operation will fail. Now here lies the problem,
> {code}
>     String encodedName = this.regionInfo.getEncodedName();
>     try {
>       this.version =
>         ZKAssign.retransitionNodeOpening(server.getZooKeeper(),
>           this.regionInfo, this.server.getServerName(), this.version);
>     } catch (KeeperException e) {
> {code}
> Now this.version becomes -1 as the operation failed.
> Now as in the first code snippet as the return type is not captured after 
> tickleOpening() fails we go on with moving the node to OPENED.  Here again we 
> dont have any check for this condition as already the version has been 
> changed to -1.  Hence the OPENING to OPENED becomes successful. Chances of 
> double assignment.
> {noformat}
> 2011-09-22 00:57:29,930 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> regionserver:60020-0x1328ceaa1ff000d Attempt to transition the unassigned 
> node for 69797d064f773d1aa9adba56e7ff90a3 from RS_ZK_REGION_OPENING to 
> RS_ZK_REGION_OPENING failed, the node existed but was version 5 not the 
> expected version 2
> 2011-09-22 00:57:33,494 WARN 
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed 
> refreshing OPENING; region=69797d064f773d1aa9adba56e7ff90a3, 
> context=post_open_deploy
> 2011-09-22 00:58:02,356 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> regionserver:60020-0x1328ceaa1ff000d Attempting to transition node 
> 69797d064f773d1aa9adba56e7ff90a3 from RS_ZK_REGION_OPENING to 
> RS_ZK_REGION_OPENED
> 2011-09-22 00:58:11,853 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> regionserver:60020-0x1328ceaa1ff000d Successfully transitioned node 
> 69797d064f773d1aa9adba56e7ff90a3 from RS_ZK_REGION_OPENING to 
> RS_ZK_REGION_OPENED
> 2011-09-22 00:58:13,956 DEBUG 
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Opened 
> t9,,1316633193606.69797d064f773d1aa9adba56e7ff90a3.
> {noformat}
> Correct me if this analysis is wrong.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to