[jira] [Commented] (HBASE-4015) Refactor the TimeoutMonitor to make it less racy

[email protected] (JIRA) Wed, 07 Sep 2011 03:16:14 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-4015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13098828#comment-13098828
 ]

[email protected] commented on HBASE-4015:
------------------------------------------------------

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/1668/#review1783
-----------------------------------------------------------

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
<https://reviews.apache.org/r/1668/#comment4098>

    Now i have moved it into AM itself.  

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
<https://reviews.apache.org/r/1668/#comment4099>

    If you see the implementation of processOpeningState it is same as the 
handling of the previous OPENING state handling by timeout monitor.   Just that 
now we call it immediately instead of timeout monitor acting on it.

    "Should we at least check the znode again before going ahead with assign? "

    Now for the possibility of the znode moving to OPENED state before we try 
to assign to a new RS, our new logic of mulitiple checks that we do before a 
node can be forcefully moved to OFFLINE will handle the scenario that you have 
told.

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
<https://reviews.apache.org/r/1668/#comment4100>

    changed the method name

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
<https://reviews.apache.org/r/1668/#comment4101>

    formatting done and the variable name has been changed from reassign to 
hijackAndPreempt.

    Stack last time you asked to change the variable name to reassign :)).

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
<https://reviews.apache.org/r/1668/#comment4102>

    invoke from timeout monitor has been removed everywhere.

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
<https://reviews.apache.org/r/1668/#comment4103>

    done

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
<https://reviews.apache.org/r/1668/#comment4104>

    done

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
<https://reviews.apache.org/r/1668/#comment4105>

    Before we try changing the znode state if some failure happens then it 
means RS succeeded.  So in this place we need not change the RS state. Correct 
me if am wrong 

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
<https://reviews.apache.org/r/1668/#comment4106>

    Next time will not repeat.  When we refactor a piece of code to a new api 
and apply format on that the formatter does this.
    Sometimes it gets missed out stack. Sorry

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
<https://reviews.apache.org/r/1668/#comment4107>

    This is an existing code.  just the code has been moved into a method

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
<https://reviews.apache.org/r/1668/#comment4108>

    done

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
<https://reviews.apache.org/r/1668/#comment4109>

    done

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
<https://reviews.apache.org/r/1668/#comment4110>

    done

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKAssign.java
<https://reviews.apache.org/r/1668/#comment4111>

    done

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKAssign.java
<https://reviews.apache.org/r/1668/#comment4112>

    done

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKAssign.java
<https://reviews.apache.org/r/1668/#comment4113>

    done

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKAssign.java
<https://reviews.apache.org/r/1668/#comment4114>

    This statement will increment the version number.  But the stat object is 
got before this step.
    hence on successful creation of the node with offline state we create a new 
watcher and so the version+1 will give me the current exact version in the node.

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKAssign.java
<https://reviews.apache.org/r/1668/#comment4115>

    done

- ramkrishna

On 2011-08-29 12:11:23, ramkrishna vasudevan wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/1668/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-08-29 12:11:23)
bq.  
bq.  
bq.  Review request for Ted Yu, Michael Stack, Jean-Daniel Cryans, and Jonathan 
Gray.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  HBASE-4015 - updated patch.
bq.  
bq.  
bq.          invokeTimeOutManager(regionState.getRegion(),
bq.              TimeOutOperationType.ASSIGN);
bq.  
bq.  Instead of storing the state when timeout was deducted we process it
bq.  using future task.
bq.  ================================================================
bq.  
bq.        case RS_ZK_REGION_OPENING:
bq.          // TODO: Could check if it was on deadServers.  If it was, then we 
could
bq.          // do what happens in TimeoutMonitor when it sees this condition.
bq.  
bq.          // Just insert region into RIT
bq.          // If this never updates the timeout will trigger new assignment
bq.          if (regionInfo.isMetaRegion() || regionInfo.isRootRegion()) {
bq.            regionsInTransition.put(encodedRegionName, new RegionState(
bq.                regionInfo, RegionState.State.OPENING, data.getStamp(), data
bq.                    .getOrigin()));
bq.            processOpeningState(regionInfo);
bq.            break;
bq.          }
bq.          regionsInTransition.put(encodedRegionName, new 
RegionState(regionInfo,
bq.              RegionState.State.OPENING, data.getStamp(), data.getOrigin()));
bq.          break;
bq.  
bq.  This change is for HBASE-4203. META and ROOT table need not wait till 
timeout
bq.  ======================================================================
bq.  
bq.  In forceRegionStateToOffline()
bq.  
bq.      } else {
bq.        // If invoked from timeout monitor donot force it to OFFLINE. Based 
on the
bq.        // state we will decide if to change in-memory state to OFFLINE or 
not.  It will
bq.        // be done before setting the znode to OFFLINE state.
bq.        if (!timeOutMonitorReAllocate) {
bq.          LOG.debug("Forcing OFFLINE; was=" + state);
bq.          state.update(RegionState.State.OFFLINE);
bq.        }
bq.  If the timeout monitor tries to reallcoate the node then dont make
bq.  the inmemory state to OFFLINE.
bq.  But the noraml assign flow doesnot expect the inmemory state to OFFLINE.
bq.  Hence the above change.  This is continued with the check in 
bq.  assign()
bq.  int setOfflineInZooKeeper(final RegionState state,
bq.        boolean timeOutMonitorReAllocate) {
bq.      // If invoked from timeoutmonitor the current state in memory need not 
be
bq.      // OFFLINE.  
bq.      if (!timeOutMonitorReAllocate && !state.isClosed() && 
!state.isOffline()) {
bq.            this.master.abort("Unexpected state trying to OFFLINE; " + state,
bq.            new IllegalStateException());
bq.        return -1;
bq.      }
bq.  ======================================================================
bq.  
bq.      boolean allowCreation = false;
bq.      // If the isReAllocate is true and the current state is PENDING_OPEN
bq.      // or OPENING then update the inmemory state to PENDING_OPEN. This is
bq.      // important because
bq.      // if timeoutmonitor deducts that a region was in OPENING state for a 
long
bq.      // time but by the
bq.      // time timeout monitor tranits the node to OFFLINE the RS would have 
opened
bq.      // the node and the
bq.      // state in znode will be RS_ZK_REGION_OPENED. Inorder to invoke the
bq.      // OpenedRegionHandler
bq.      // we expect the inmemeory state to be PENDING_OPEN or OPENING.
bq.      // For all other cases we can change the inmemory state to OFFLINE.
bq.      if (timeOutMonitorReAllocate
bq.          && (state.getState().equals(RegionState.State.PENDING_OPEN) || 
state
bq.              .getState().equals(RegionState.State.OPENING))) {
bq.        state.update(RegionState.State.PENDING_OPEN);
bq.        allowCreation = false;
bq.      } else {
bq.        state.update(RegionState.State.OFFLINE);
bq.        allowCreation = true;
bq.      }
bq.  This change is quite tricky.  
bq.  In normal assign flow the unassigned node for the region will not be 
present
bq.  Hence we need to allow the creation of the node newly.
bq.  But in timeout monitor case we will have the node present in some state 
hence 
bq.  we decide whether to create node newly or not inside 
ZKAssign.createOrForceNodeOffline
bq.  
bq.  The above code also updates the inmemory state of OFFLINE or PENDING_OPEN
bq.  which was not update in the previous forceRegionStateToOffline() call.
bq.  
==============================================================================
bq.  In ZKAssign.java()
bq.      if (version == -1) {
bq.        // If timeoutmonitor deducts a node to be in OPENING state but 
before it
bq.        // could
bq.        // transit to OFFLINE state if RS had opened the region then the 
Master
bq.        // deletes the
bq.        // assigned region znode. In that case the znode will not exist. So 
we
bq.        // should not
bq.        // create the znode again which will lead to double assignment.
bq.        if (timeOutMonitorReAllocate && !allowCreation) {
bq.          return -1;
bq.        }
bq.  this part prevents double assignment
bq.  If timeoutmonitor tries to force to OFFLINE state an existing region which 
was in RIT
bq.  but before it could do that if the node was opened then 
openedregionhandler will delete
bq.  the node hence we should not create the node again.
bq.  
bq.  
=======================================================================================
bq.  In ZkAssign.java()
bq.      } else {
bq.        RegionTransitionData curDataInZNode = ZKAssign.getDataNoWatch(zkw, 
region
bq.            .getEncodedName(), stat);
bq.        // Do not move the node to OFFLINE if znode is in any of the 
following
bq.        // state.
bq.        // Because these are already executed states.
bq.        if (timeOutMonitorReAllocate && null != curDataInZNode) {
bq.          EventType eventType = curDataInZNode.getEventType();
bq.          if (eventType.equals(EventType.RS_ZK_REGION_CLOSING)
bq.              || eventType.equals(EventType.RS_ZK_REGION_CLOSED)
bq.              || eventType.equals(EventType.RS_ZK_REGION_OPENED)) {
bq.            return -1;
bq.          }
bq.        }
bq.  This check prevents from moving the node to OFFLINE state if just before
bq.  the node is tried to force to OFFLINE the RS would have changed the state 
bq.  to either CLOSING or CLOSED or OPENED.  now again moving to OFFLINE
bq.  will lead to douoble assignment and will an additional operaition.
bq.  
bq.  
====================================================================================
bq.  In ZKassign.java()
bq.        boolean setData = false;
bq.        try {
bq.          setData = ZKUtil.setData(zkw, node, data.getBytes(), version);
bq.          // Setdata throws KeeperException which aborts the Master. So we 
are
bq.          // catching it here.
bq.          // If just before setting the znode to OFFLINE if the RS has made 
any
bq.          // change to the
bq.          // znode state then we need to return -1.
bq.        } catch (KeeperException kpe) {
bq.          LOG.info("Version mismatch while setting the node to OFFLINE 
state.");
bq.          return -1;
bq.        }
bq.  This change is actually to avoid the master from abortng. If the forceful 
OFFLINE is in
bq.  progrss but just before setting the RS has changed the state.
bq.  Then the setData will fail leading to master abort.
bq.  Hence we are catching the exception.
bq.  
====================================================================================
bq.  
bq.  In assignmentManager.java
bq.  +      if (setOfflineInZK && versionOfOfflineNode == -1)
bq.  +        return;
bq.  This is nothing but the refactoring done in the existing code.
bq.  -     if (setOfflineInZK && !setOfflineInZooKeeper(state)) return;
bq.  So if setting the version is unsuccessful return.
bq.  
=====================================================================================
bq.  In ZKassign.java()
bq.  // the below check ensure that double assignment doesnot happen.
bq.      // When the node is created for the first time then the expected 
version
bq.      // that is
bq.      // passed will be -1 and the version in znode will be 0.
bq.      // In all other cases the version in znode will be > 0.
bq.      else if (beginState.equals(EventType.M_ZK_REGION_OFFLINE)
bq.          && endState.equals(EventType.RS_ZK_REGION_OPENING)
bq.          && expectedVersion == -1 && stat.getVersion() != 0) {
bq.        LOG.warn(zkw.prefix("Attempt to transition the " + "unassigned node 
for "
bq.            + encoded + " from " + beginState + " to " + endState + " 
failed, "
bq.            + "the node existed but was version " + stat.getVersion()
bq.            + " not the expected version " + expectedVersion));
bq.        return -1;
bq.      }
bq.  As the comment explains when the node is created for first time the 
expectedversion will
bq.  be -1 but the actual version will be 0.  Here the scenario is
bq.  If RS1 has not tranitioned the node from OFFLINE to OPENING if
bq.  RS2 gets the call from Master after forcefully chaning to OFFLINE the
bq.  Rs2 will take the control of the node.
bq.  At that time if RS1 starts transmitting the node then we should not allow
bq.  it.
bq.  
===============================================================================
bq.  
bq.  
bq.  This addresses bug HBASE-4015.
bq.      https://issues.apache.org/jira/browse/HBASE-4015
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    
http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
 1161985 
bq.    
http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/ipc/HRegionInterface.java
 1161985 
bq.    
http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
 1161985 
bq.    
http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
 1161985 
bq.    
http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/TimeOutManagerCallable.java
 PRE-CREATION 
bq.    
http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
 1161985 
bq.    
http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenMetaHandler.java
 1161985 
bq.    
http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java
 1161985 
bq.    
http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRootHandler.java
 1161985 
bq.    
http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKAssign.java
 1161985 
bq.  
bq.  Diff: https://reviews.apache.org/r/1668/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  Yes.  But could not add new test case.
bq.  TestMasterFailOver is passing with the current changes also.
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  ramkrishna
bq.  
bq.

> Refactor the TimeoutMonitor to make it less racy
> ------------------------------------------------
>
>                 Key: HBASE-4015
>                 URL: https://issues.apache.org/jira/browse/HBASE-4015
>             Project: HBase
>          Issue Type: Sub-task
>    Affects Versions: 0.90.3
>            Reporter: Jean-Daniel Cryans
>            Assignee: ramkrishna.s.vasudevan
>            Priority: Blocker
>             Fix For: 0.92.0
>
>         Attachments: HBASE-4015_1_trunk.patch, HBASE-4015_2_trunk.patch, 
> HBASE-4015_reprepared_trunk_2.patch, Timeoutmonitor with state diagrams.pdf
>
>
> The current implementation of the TimeoutMonitor acts like a race condition 
> generator, mostly making things worse rather than better. It does it's own 
> thing for a while without caring for what's happening in the rest of the 
> master.
> The first thing that needs to happen is that the regions should not be 
> processed in one big batch, because that sometimes can take minutes to 
> process (meanwhile a region that timed out opening might have opened, then 
> what happens is it will be reassigned by the TimeoutMonitor generating the 
> never ending PENDING_OPEN situation).
> Those operations should also be done more atomically, although I'm not sure 
> how to do it in a scalable way in this case.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4015) Refactor the TimeoutMonitor to make it less racy

Reply via email to