[jira] [Commented] (HBASE-13895) DATALOSS: Region assigned before WAL replay when abort

stack (JIRA) Wed, 01 Jul 2015 23:25:36 -0700

    [ https://issues.apache.org/jira/browse/HBASE-13895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611546#comment-14611546 ]


stack commented on HBASE-13895:
-------------------------------

OK. Added the missing patch and the addendum that fixes the failing 
TestAssignmentManagerOnCluster tests. Agree with the fix for the UT (I love unit tests).

For branch-1+ I applied the addendum and checked that I got all of the patch this time.

On branch-2, I applied the original patch plus a version of the master addendum. I 
made master the same as branch-1's. The master addendum makes the logic different. Why, 
[~enis]? I'll add an addendum to master if the difference is intended. I am talking 
about this hunk in the master addendum patch:

{code}
@@ -891,12 +891,16 @@ public class AssignmentManager {
         LOG.warn("Server " + server + " region CLOSE RPC returned false for " +
           region.getRegionNameAsString());
       } catch (Throwable t) {
+        long sleepTime = 0;
+        Configuration conf = this.server.getConfiguration();
         if (t instanceof RemoteException) {
           t = ((RemoteException)t).unwrapRemoteException();
         }
-        if (t instanceof NotServingRegionException
+        if (t instanceof RegionServerAbortedException
             || t instanceof RegionServerStoppedException
             || t instanceof ServerNotRunningYetException) {
+
+        } else if (t instanceof NotServingRegionException) {
           LOG.debug("Offline " + region.getRegionNameAsString()
             + ", it's not any more on " + server, t);
           regionStates.updateRegionState(region, State.OFFLINE);
{code}

whereas in the original patch we have this (it sets a sleep time...):

{code}
@@ -1866,11 +1867,19 @@ public class AssignmentManager extends ZooKeeperListener {
         LOG.warn("Server " + server + " region CLOSE RPC returned false for " +
           region.getRegionNameAsString());
       } catch (Throwable t) {
+        long sleepTime = 0;
+        Configuration conf = this.server.getConfiguration();
         if (t instanceof RemoteException) {
           t = ((RemoteException)t).unwrapRemoteException();
         }
         boolean logRetries = true;
-        if (t instanceof NotServingRegionException
+        if (t instanceof RegionServerAbortedException) {
+          // RS is aborting, we cannot offline the region since the region may need to do WAL
+          // recovery. Until we see  the RS expiration, we should retry.
+          sleepTime = 1 + conf.getInt(RpcClient.FAILED_SERVER_EXPIRY_KEY,
+            RpcClient.FAILED_SERVER_EXPIRY_DEFAULT);
+
+        } else if (t instanceof NotServingRegionException
             || t instanceof RegionServerStoppedException
             || t instanceof ServerNotRunningYetException) {
{code}
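
A minimal standalone sketch of the two branchings, as read off the hunks above, may make the difference easier to see. It is not the AssignmentManager code: the exception classes are empty stand-ins for the HBase ones, the Action enum and the method names are hypothetical, and the 2000 ms passed in main is only an example value, not necessarily RpcClient.FAILED_SERVER_EXPIRY_DEFAULT. It also assumes the second branch of the original patch keeps the offline behaviour that the master hunk shows for the pre-patch condition.

```java
import java.io.IOException;

// Standalone sketch; every type here is a hypothetical stand-in, not an HBase class.
class CloseFailureSketch {
  static class NotServingRegionException extends IOException {}
  static class RegionServerAbortedException extends IOException {}
  static class RegionServerStoppedException extends IOException {}
  static class ServerNotRunningYetException extends IOException {}

  enum Action { RETRY_AFTER_SLEEP, OFFLINE_REGION, NOTHING_SHOWN, FALL_THROUGH }

  // Original patch: an aborting RS gets its own branch, which sets a sleep time and retries.
  static Action originalPatch(Throwable t) {
    if (t instanceof RegionServerAbortedException) {
      return Action.RETRY_AFTER_SLEEP;
    } else if (t instanceof NotServingRegionException
        || t instanceof RegionServerStoppedException
        || t instanceof ServerNotRunningYetException) {
      return Action.OFFLINE_REGION; // assumed: the pre-patch body of this branch
    }
    return Action.FALL_THROUGH;
  }

  // Master addendum hunk: aborted/stopped/not-running share a branch whose body
  // is empty in the quoted hunk; only NotServingRegionException offlines the region.
  static Action masterAddendumHunk(Throwable t) {
    if (t instanceof RegionServerAbortedException
        || t instanceof RegionServerStoppedException
        || t instanceof ServerNotRunningYetException) {
      return Action.NOTHING_SHOWN;
    } else if (t instanceof NotServingRegionException) {
      return Action.OFFLINE_REGION;
    }
    return Action.FALL_THROUGH;
  }

  // Mirrors "sleepTime = 1 + conf.getInt(...)" with the config value passed in.
  static long retrySleepMs(int failedServerExpiryMs) {
    return 1L + failedServerExpiryMs;
  }

  public static void main(String[] args) {
    Throwable stopped = new RegionServerStoppedException();
    System.out.println("original patch:       " + originalPatch(stopped));
    System.out.println("master addendum hunk: " + masterAddendumHunk(stopped));
    System.out.println("retry sleep (ms):     " + retrySleepMs(2000));
  }
}
```

Under these assumptions the two versions disagree for RegionServerStoppedException and ServerNotRunningYetException, which is the difference being asked about.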

Thanks for catching my misapply.

> DATALOSS: Region assigned before WAL replay when abort
> ------------------------------------------------------
>
>                 Key: HBASE-13895
>                 URL: https://issues.apache.org/jira/browse/HBASE-13895
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 1.2.0
>            Reporter: stack
>            Assignee: stack
>            Priority: Critical
>             Fix For: 2.0.0, 1.2.0, 1.1.2, 1.3.0
>
>         Attachments: 13895.master.patch, hbase-13895_addendum-master.patch, 
> hbase-13895_addendum.patch, hbase-13895_v1-branch-1.1.patch
>
>
> Opening a place holder till finish analysis.
> I have dataloss running ITBLL at 3B (testing HBASE-13877). Most obvious 
> culprit is the double-assignment that I can see.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
