[jira] [Commented] (HBASE-4455) Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager

jirapos...@reviews.apache.org (JIRA) Fri, 23 Sep 2011 18:51:56 -0700

    [ https://issues.apache.org/jira/browse/HBASE-4455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113869#comment-13113869 ]


jirapos...@reviews.apache.org commented on HBASE-4455:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2007/
-----------------------------------------------------------

(Updated 2011-09-24 01:50:02.986731)


Review request for hbase.


Changes
-------

Thanks folks for the review. I have fixed most of the issues. The comments 
below explain why the rest aren't fixed and answer some general questions.

1. I wonder whether RootRegionTracker and MetaNodeTracker are really needed. 
Instead of waiting for a ZK notification, checking with ZK directly should be 
ok. This won't have much impact on performance, given that most of the time 
there isn't much region movement. For now, we can keep RootRegionTracker and 
MetaNodeTracker. We can open a separate jira if that is needed.
2. Took out the "refresh" parameter in CatalogTracker.getMetaServerConnection. 
All the callers of this function call it with "true", so at this point I just 
took it out.
3. Per Jonathan's suggestion, modified ZooKeeperNodeTracker to add a refresh 
flag to getData and blockUntilAvailable.
4. About the question from Stack: yes, it looks like the same issue as 3809. 
It could also be the same as 4245.
5. I put a more detailed description in ServerShutdownHandler.java about why we 
need to resubmit another ServerShutdownHandler request back to the thread pool 
if the server carries -ROOT- or .META.
6. Regarding Jonathan's suggestion about relying on notifyAll() from -ROOT- 
inside waitForMeta, I just fixed the timeout value issue instead, in case later 
we decide RootRegionTracker isn't that useful.
7. Regarding Stack's HLogSplitting question, if the shutdown server carries 
-ROOT- or .META., it will first do HLogSplitting, and then resubmit another 
ServerShutdownHandler request for the same server which doesn't do 
HLogSplitting.
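
To illustrate the refresh flag mentioned in item 3, here is a minimal sketch. 
It assumes a hypothetical NodeStore interface standing in for a direct ZK read, 
and CachedNodeTracker is an illustrative name; neither is the actual 
ZooKeeperNodeTracker API. It only shows the idea of a cached value that a 
caller can bypass when it suspects the notification is late.

```java
// Hypothetical stand-in for a direct ZooKeeper read; not an HBase or ZooKeeper API.
interface NodeStore {
    byte[] read(String path);
}

// Illustrative tracker: caches the last value seen via notification and can
// bypass that cache when the caller asks for a refresh.
class CachedNodeTracker {
    private final NodeStore store;
    private final String path;
    private byte[] cached; // updated by notifications, so it may be stale

    CachedNodeTracker(NodeStore store, String path) {
        this.store = store;
        this.path = path;
    }

    // Called when a (possibly delayed) change notification arrives.
    synchronized void nodeDataChanged() {
        cached = store.read(path);
    }

    // refresh=false returns the cached value; refresh=true re-reads the store first.
    synchronized byte[] getData(boolean refresh) {
        if (refresh) {
            cached = store.read(path);
        }
        return cached;
    }
}
```

With refresh set to false the caller sees whatever the last notification 
delivered; with refresh set to true it pays for one extra read and sees the 
current value.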


Summary
-------

1. Add more logging.
2. Clean up CatalogTracker. waitForMeta waits for the "timeout" value. When 
waitForMetaServerConnectionDefault is called by MetaNodeTracker, the timeout 
value is large, so it doesn't retry in case -ROOT- is updated.
3. Add the proper implementation for CatalogTracker.verifyMetaRegionLocation.
4. Check for the latest -ROOT- and .META. region location during the handling 
of server shutdown.
5. Right after assigning the -ROOT- or .META. in ServerShutdownHandler, don't 
block and wait for .META. availability. Resubmit another ServerShutdownHandler 
for regular regions.
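
The resubmit-instead-of-block idea in the last item can be sketched as 
follows. This is only an illustration under stated assumptions: ShutdownTask 
and ShutdownDemo are hypothetical names, the "trace" list replaces real work, 
and the code is not the actual ServerShutdownHandler. The first pass splits 
logs and assigns the catalog region, then queues a second pass for the same 
server that skips log splitting and handles the regular regions.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Illustrative two-pass shutdown handler; hypothetical, not the HBase class.
class ShutdownTask implements Runnable {
    private final ExecutorService pool;
    private final String server;
    private final boolean carriedCatalog; // server hosted -ROOT- or .META.
    private final boolean splitLogs;      // false on the resubmitted pass
    private final List<String> trace;     // records the steps taken

    ShutdownTask(ExecutorService pool, String server, boolean carriedCatalog,
                 boolean splitLogs, List<String> trace) {
        this.pool = pool;
        this.server = server;
        this.carriedCatalog = carriedCatalog;
        this.splitLogs = splitLogs;
        this.trace = trace;
    }

    @Override
    public void run() {
        if (splitLogs) {
            trace.add("split-logs:" + server);
        }
        if (carriedCatalog) {
            trace.add("assign-catalog:" + server);
            // Do not block this worker waiting for .META.; queue a second pass.
            pool.submit(new ShutdownTask(pool, server, false, false, trace));
            return;
        }
        trace.add("assign-user-regions:" + server);
    }
}

class ShutdownDemo {
    // Runs both passes on a single worker thread and returns the recorded steps.
    static List<String> run(String server) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        List<String> trace = new CopyOnWriteArrayList<>();
        try {
            // When the first pass completes, the second pass is already queued.
            pool.submit(new ShutdownTask(pool, server, true, true, trace)).get();
            pool.shutdown(); // tasks queued before shutdown() still run
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
        return trace;
    }
}
```

Because the worker returns right after queuing the second pass, a single 
worker thread is enough to finish both passes, whereas a worker that blocked 
waiting for .META. would hold its thread for the whole wait.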


This addresses bug HBASE-4455.
    https://issues.apache.org/jira/browse/HBASE-4455


Diffs (updated)
-----

  
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/RootRegionTracker.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperNodeTracker.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/test/java/org/apache/hadoop/hbase/zookeeper/TestZooKeeperNodeTracker.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ClusterStatusTracker.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/CloseRegionHandler.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/MetaServerShutdownHandler.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/OpenedRegionHandler.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/MasterAddressTracker.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java 1172205 

Diff: https://reviews.apache.org/r/2007/diff


Testing
-------

Keep Master up all the time and do a rolling restart of RSs like this: stop 
RS1, wait for 2 seconds, stop RS2, start RS1, wait for 2 seconds, stop RS3, 
start RS2, wait for 2 seconds, etc. The program can run for a couple of hours 
until it stops. -ROOT- and .META. are available during that time.
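
For reference, the restart order described above can be generated with a small 
helper like the one below. RollingRestartPlan is a hypothetical name, the 
steps are plain strings rather than real start/stop commands, and the final 
"start" of the last server is an assumption made to close out a single pass.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative generator for one pass of the rolling restart order.
class RollingRestartPlan {
    // Each stop of RS(i) is followed by a start of RS(i-1), then a fixed wait.
    static List<String> steps(int servers, int waitSeconds) {
        if (servers < 1) {
            throw new IllegalArgumentException("need at least one region server");
        }
        List<String> out = new ArrayList<>();
        for (int i = 1; i <= servers; i++) {
            out.add("stop RS" + i);
            if (i > 1) {
                out.add("start RS" + (i - 1));
            }
            out.add("wait " + waitSeconds + "s");
        }
        // Assumption: bring the last server back to finish the pass.
        out.add("start RS" + servers);
        return out;
    }
}
```

Calling RollingRestartPlan.steps(3, 2) yields stop RS1, wait 2s, stop RS2, 
start RS1, wait 2s, stop RS3, start RS2, wait 2s, start RS3.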


Thanks,

Ming



> Rolling restart RSs scenario, -ROOT-, .META. regions are lost in 
> AssignmentManager
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-4455
>                 URL: https://issues.apache.org/jira/browse/HBASE-4455
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>             Fix For: 0.92.0
>
>
> Keep Master up all the time, do rolling restart of RSs like this - stop RS1, 
> wait for 2 seconds, stop RS2, start RS1, wait for 2 seconds, stop RS3, start 
> RS2, wait for 2 seconds, etc. After a while, you will find the -ROOT-, .META. 
> regions aren't in "regions in transition" from AssignmentManager's point of 
> view, but they aren't assigned to any region servers. Here are the issues.
> 1. The -ROOT- or .META. location is stale when MetaServerShutdownHandler is 
> invoked to check if the server contains the -ROOT- region. That is due to the 
> long delay of ZK notification and the async nature of the system. Here is an 
> example: even though the new root region server 
> sea-lab-1,60020,1316380133656 is set at T2, at T3, during the shutdown 
> process for sea-lab-1,60020,1316380133656, the root location still points to 
> the old server sea-lab-3,60020,1316380037898.
> T1: 2011-09-18 14:08:52,470 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: 
> master:60000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode 
> /hbase/root-region-server and set watcher; sea-lab-3,60020,1316380037898
> T2: 2011-09-18 14:08:57,173 INFO 
> org.apache.hadoop.hbase.catalog.RootLocationEditor: Setting ROOT region 
> location in ZooKeeper as sea-lab-1,60020,1316380133656
> T3: 2011-09-18 14:10:26,393 DEBUG 
> org.apache.hadoop.hbase.master.ServerManager: 
> Added=sea-lab-1,60020,1316380133656 to dead servers, submitted shutdown 
> handler to be executed, root=false, meta=true, current Root Location: 
> sea-lab-3,60020,1316380037898
> T4: 2011-09-18 14:12:37,314 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: 
> master:60000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode 
> /hbase/root-region-server and set watcher; sea-lab-1,60020,1316380133656
> 2. The MetaServerShutdownHandler worker thread that waits for -ROOT- or 
> .META. availability could be blocked. If, meanwhile, the new server that 
> -ROOT- or .META. is being assigned to is restarted, another instance of 
> MetaServerShutdownHandler is queued. Eventually, all 
> MetaServerShutdownHandler worker threads are filled up. It looks like 
> HBASE-4245.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
