[ https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210443#comment-13210443 ]
stack commented on HBASE-5270:
------------------------------

Why do we do this, Chunhui?
{code}
+ // We set serverLoad with one region, it could differentiate with
+ // regionserver which is started just now
+ HServerLoad serverLoad = new HServerLoad();
+ serverLoad.setNumberOfRegions(1);
{code}
How do you know it has a region?

getServerInfo needs a bit of javadoc explaining that it's parsing a ServerName.

Change the data member name from isLogSplitted to logSplit; the method name then becomes isLogSplit.

Fix this long line:
{code}
+ return getConfiguration().getBoolean("TestingMaster.sleep", false) ? new GatedNodeDeleteRegionServerTracker(
{code}

Can you just do
+ super.nodeDeleted(path);
instead of
+ GatedNodeDeleteRegionServerTracker.super.nodeDeleted(path);
?

Just wondering if you can do your test without timeouts, using gates instead? For example, this wait:
+ cluster.getConfiguration().setInt("TestingMaster.sleep.duration", 10000);
Or, can you interrupt the wait once your conditions are met so we don't have to wait the ten seconds?

For your trunk patch you need to categorize your test (see other tests in trunk: they have a bit of code at the end and an annotation at the top).

Sleep for a shorter interval here:
{code}
+ while (!master.isLogSplittedAfterStartup()) {
+   Thread.sleep(1000);
{code}
See '13.4.1.2. Writing Tests' in the manual here http://hbase.apache.org/book.html#hbase.tests for some recommendations around timeouts, etc.

Why the need for this timeout:
{code}
+ Thread.sleep(10000 * 2);
+ ((GatedNodeDeleteRegionServerTracker) master.getRegionServerTracker()).gate
+     .set(false);
{code}
Can we do without it? Timing-dependent tests can be flaky; the timings produce different results in different contexts.

What happens if this is not true:
{code}
+ if (rootServerNum != metaServerNum) {
{code}
Do we not test?

There is some duplicated code in this method. Maybe break it out into a method?
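One way to replace the fixed sleeps with a gate, as suggested above, is a CountDownLatch the tracker opens when the awaited event fires; the test then proceeds immediately instead of sleeping a hard-coded interval. A minimal sketch with hypothetical names (GatedTracker is not the class from the patch):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class GateExample {
  // Hypothetical tracker: opens a latch when the watched znode is deleted,
  // so a waiting test wakes up as soon as the event happens.
  static class GatedTracker {
    final CountDownLatch deleted = new CountDownLatch(1);

    void nodeDeleted(String path) {
      // ... normal event processing would go here ...
      deleted.countDown(); // open the gate for any waiting test
    }
  }

  public static void main(String[] args) throws InterruptedException {
    GatedTracker tracker = new GatedTracker();
    // Simulate the event arriving from another thread.
    new Thread(() -> tracker.nodeDeleted("/hbase/rs/server1")).start();
    // Wait for the condition with an upper bound, rather than a blind sleep.
    boolean opened = tracker.deleted.await(30, TimeUnit.SECONDS);
    System.out.println("gate opened: " + opened);
  }
}
```

The timeout here is only a safety cap for the failure case; in the normal case the test continues the instant `countDown()` runs.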
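On the `super.nodeDeleted(path)` question: plain `super` is only available in the subclass's own body; the qualified `Outer.super` form is required when the call sits inside an inner or anonymous class of that subclass, which may be why the patch uses it. A small illustration with hypothetical names (not the patch's classes):

```java
class Tracker {
  void nodeDeleted(String path) { System.out.println("base handled " + path); }
}

class GatedTracker extends Tracker {
  @Override
  void nodeDeleted(String path) {
    // Direct call from the subclass body: plain super is enough.
    super.nodeDeleted(path);
  }

  Runnable deferred(final String path) {
    return new Runnable() {
      @Override
      public void run() {
        // Inside the anonymous class, plain super would refer to the
        // anonymous class's own supertype, so the qualified form is needed
        // to reach Tracker.nodeDeleted and bypass the override.
        GatedTracker.super.nodeDeleted(path);
      }
    };
  }
}

public class SuperExample {
  public static void main(String[] args) {
    new GatedTracker().nodeDeleted("/a");
    new GatedTracker().deferred("/b").run();
  }
}
```

So if the call in the patch is made directly from the subclass, plain `super` works; if it is made from inside an anonymous watcher, it cannot be simplified.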
For example, this code:
{code}
+ // First abort master
+ for (MasterThread mt : cluster.getLiveMasterThreads()) {
+   if (mt.getMaster().isActiveMaster()) {
+     mt.getMaster().abort("Aborting for tests",
+         new Exception("Trace info"));
+     mt.join();
+     break;
+   }
+ }
+ LOG.debug("Master is aborted");
+ master = (TestingMaster) cluster.startMaster().getMaster();
+ while (!master.isLogSplittedAfterStartup()) {
+   Thread.sleep(1000);
+ }
+ LOG.debug("splitted:" + master.isLogSplittedAfterStartup()
+     + ",initialized:" + master.isInitialized());
{code}

In the dead-servers code, one method returns a boolean and the other does not; one ups a counter and the other does not. Should they be the same?

Can we be more specific in this note?
{code}
+ * Dead servers under processing by the ServerShutdownHander.
{code}
What does that mean? Is it while the server is being processed exclusively by the ServerShutdownHandler -- are these the inProgress servers?

Perhaps rename getLogDirIfExists as getServerLogDirIfExists?

In HMaster, fix my bad javadoc. It says 'Used testing' for createRegionServerTracker. It should be 'Override to change master's RegionServerTracker creation. Used testing'. Similarly for splitLogAfterStartup.

So, what happens if a server had root and meta and it's not expired when we do failover? We'll expire it processing root. Will we expire it a second time processing meta? Perhaps the answer is no, because the first expiration will clear the meta state in master?

Who is clearing
+ for (; serverName != null; ) {
in the wait loop? Should we check the master stopped flag in this loop too, in case it's been stopped while this is running?

Does getRegionServerTracker not already exist in HMaster? Does it have to be public (maybe it does, for tests)?

Do we need to add getOnlineServerNames? Can you not get it from getOnlineServers? Is this a convenience utility?

Is there a possibility of a race here?
{code}
+ // Remove regions in RIT, they are may being processed by the SSH.
+ synchronized (regionsInTransition) {
+   nodes.removeAll(regionsInTransition.keySet());
+ }
{code}
Perhaps SSH has put something up in RIT because it has done an assign, and here we are blanket removing them all?

Otherwise, the patch is coming along, Chunhui. Thanks.

> Handle potential data loss due to concurrent processing of processFaileOver
> and ServerShutdownHandler
> -----------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5270
>                 URL: https://issues.apache.org/jira/browse/HBASE-5270
>             Project: HBase
>          Issue Type: Sub-task
>          Components: master
>            Reporter: Zhihong Yu
>             Fix For: 0.94.0, 0.92.1
>
>         Attachments: 5270-90-testcase.patch, 5270-90-testcasev2.patch,
> 5270-90.patch, 5270-90v2.patch, 5270-testcase.patch, 5270-testcasev2.patch,
> hbase-5270.patch, hbase-5270v2.patch, sampletest.txt
>
>
> This JIRA continues the effort from HBASE-5179. Starting with Stack's
> comments about patches for 0.92 and TRUNK:
> Reviewing 0.92v17:
> isDeadServerInProgress is a new public method in ServerManager, but it does
> not seem to be used anywhere.
> Does isDeadRootServerInProgress need to be public? Ditto for the meta version.
> This method's param names are not right: 'definitiveRootServer'; what is meant
> by definitive? Do they need this qualifier?
> Is there anything in place to stop us expiring a server twice if it's carrying
> root and meta?
> What is the difference between asking the assignment manager isCarryingRoot and
> this variable that is passed in? It should be doc'd at least. Ditto for meta.
> I think I've asked for this a few times: onlineServers needs to be
> explained, either in javadoc or in a comment. This is the param passed into
> joinCluster. How does it arise? I think I know, but am unsure. God love the
> poor noob that comes a-wandering through this code trying to make sense of it all.
> It looks like we get the list by trawling zk for regionserver znodes that
> have not checked in.
> Don't we do this operation earlier in master setup? Are we doing it again here?
> Though distributed log splitting is configured, with this patch we will do
> single-process splitting in the master under some conditions. It's not explained
> in the code why we would do this. Why do we think master log splitting 'high
> priority' when it could very well be slower? Should we only go this route if
> distributed splitting is not going on? Do we know if concurrent distributed
> log splitting and master splitting work together?
> Why would we have dead servers in progress here in master startup? Because a
> ServerShutdownHandler fired?
> This patch is different from the patch for 0.90. It should go into trunk first
> with tests, then 0.92. Should it be in this issue? This issue is really hard
> to follow now. Maybe this issue is for 0.90.x, and a new issue for more work on
> this trunk patch?
> This patch needs to have the v18 differences applied.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira