[ https://issues.apache.org/jira/browse/HBASE-13895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611546#comment-14611546 ]
stack commented on HBASE-13895:
-------------------------------

Ok. Added the missing patch and the addendum that fixes the failing TestAssignmentManagerOnCluster tests. Agree with the fix for the UT (I love unit tests).

For branch-1+ I applied the addendum and checked that I got the whole patch this time.

On master, I applied the original patch plus a version of the master addendum. I made master the same as branch-1's. The master addendum makes the logic different. Why, [~enis]? I'll apply the master addendum if the difference is intended.

I am talking about this hunk in the master addendum patch:

{code}
@@ -891,12 +891,16 @@ public class AssignmentManager {
         LOG.warn("Server " + server + " region CLOSE RPC returned false for " +
           region.getRegionNameAsString());
       } catch (Throwable t) {
+        long sleepTime = 0;
+        Configuration conf = this.server.getConfiguration();
         if (t instanceof RemoteException) {
           t = ((RemoteException)t).unwrapRemoteException();
         }
-        if (t instanceof NotServingRegionException
+        if (t instanceof RegionServerAbortedException
             || t instanceof RegionServerStoppedException
             || t instanceof ServerNotRunningYetException) {
+
+        } else if (t instanceof NotServingRegionException) {
           LOG.debug("Offline " + region.getRegionNameAsString()
             + ", it's not any more on " + server, t);
           regionStates.updateRegionState(region, State.OFFLINE);
{code}

whereas in the original patch we have this (set a sleeptime...):
{code}
@@ -1866,11 +1867,19 @@ public class AssignmentManager extends ZooKeeperListener {
           LOG.warn("Server " + server + " region CLOSE RPC returned false for " +
             region.getRegionNameAsString());
         } catch (Throwable t) {
+          long sleepTime = 0;
+          Configuration conf = this.server.getConfiguration();
           if (t instanceof RemoteException) {
             t = ((RemoteException)t).unwrapRemoteException();
           }
           boolean logRetries = true;
-          if (t instanceof NotServingRegionException
+          if (t instanceof RegionServerAbortedException) {
+            // RS is aborting, we cannot offline the region since the region may need to do WAL
+            // recovery. Until we see the RS expiration, we should retry.
+            sleepTime = 1 + conf.getInt(RpcClient.FAILED_SERVER_EXPIRY_KEY,
+              RpcClient.FAILED_SERVER_EXPIRY_DEFAULT);
+
+          } else if (t instanceof NotServingRegionException
               || t instanceof RegionServerStoppedException
               || t instanceof ServerNotRunningYetException) {
{code}

Thanks for catching my misapply.

> DATALOSS: Region assigned before WAL replay when abort
> ------------------------------------------------------
>
>                 Key: HBASE-13895
>                 URL: https://issues.apache.org/jira/browse/HBASE-13895
>             Project: HBase
>          Issue Type: Bug
>   Affects Versions: 1.2.0
>            Reporter: stack
>            Assignee: stack
>            Priority: Critical
>             Fix For: 2.0.0, 1.2.0, 1.1.2, 1.3.0
>
>         Attachments: 13895.master.patch, hbase-13895_addendum-master.patch,
> hbase-13895_addendum.patch, hbase-13895_v1-branch-1.1.patch
>
>
> Opening a place holder till finish analysis.
> I have dataloss running ITBLL at 3B (testing HBASE-13877). Most obvious
> culprit is the double-assignment that I can see.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)