[ 
https://issues.apache.org/jira/browse/HBASE-22017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790289#comment-16790289
 ] 

lujie commented on HBASE-22017:
-------------------------------

After a long debugging session, I have found the cause of this bug. It is a 
*data race*.

While the RegionServer that holds the meta table is shutting down, it closes 
the leases:

 
{code:java}
public void close() {
  this.stopRequested = true;
  leases.clear();
  LOG.info("Closed leases");
}
{code}
Meanwhile, when the HMaster scans the table through RSRpcServices#scan, the 
RegionServer removes the scanner's lease:
{code:java}
// see line 3345
try {
  // Remove lease while its being processed in server; protects against case
  // where processing of request takes > lease expiration time.
  lease = regionServer.leases.removeLease(scannerName);
} catch (LeaseException e) {
  throw new ServiceException(e);
}
{code}
removeLease does the following:
{code:java}
Lease removeLease(final String leaseName) throws LeaseException {
  Lease lease = leases.remove(leaseName);
  if (lease == null) {
    throw new LeaseException("lease '" + leaseName + "' does not exist");
  }
  return lease;
}
{code}
Because close() has already cleared the leases, leases.remove returns null and 
removeLease throws a LeaseException.
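
To make the interleaving concrete, here is a minimal sketch that replays it on a single thread so the outcome is deterministic. LeaseRaceSketch, addLease and replay are hypothetical names for illustration, and the lease values are plain Objects; this is a simplified stand-in, not the HBase source.

```java
import java.util.concurrent.ConcurrentHashMap;

// Simplified stand-in for the Leases class (an assumption for illustration,
// not the real HBase code).
class LeaseRaceSketch {
  static class LeaseException extends Exception {
    LeaseException(String message) {
      super(message);
    }
  }

  private final ConcurrentHashMap<String, Object> leases = new ConcurrentHashMap<>();
  private volatile boolean stopRequested = false;

  // Hypothetical helper: registers a lease under the given name.
  void addLease(String leaseName) {
    leases.put(leaseName, new Object());
  }

  // Shutdown path: sets the flag, then drops every lease.
  void close() {
    this.stopRequested = true;
    leases.clear();
  }

  // Scan path: no stopRequested check, so a cleared map looks like a missing lease.
  Object removeLease(String leaseName) throws LeaseException {
    Object lease = leases.remove(leaseName);
    if (lease == null) {
      throw new LeaseException("lease '" + leaseName + "' does not exist");
    }
    return lease;
  }

  // Replays the problematic order of events: register, close, then remove.
  static String replay() {
    LeaseRaceSketch sketch = new LeaseRaceSketch();
    sketch.addLease("scanner-1"); // scanner opened, lease registered
    sketch.close();               // shutdown runs first
    try {
      sketch.removeLease("scanner-1"); // scan handler runs afterwards
      return "removed";
    } catch (LeaseException e) {
      return e.getMessage();
    }
  }
}
```

In the real server the two paths run on different threads, so whether the exception appears depends on timing.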

So this is a data race, and the shared memory is 
{code:java}
leases{code}
I have checked the other places that access *leases* and found that they have a 
safety check, for example:
{code:java}
public void renewLease(final String leaseName) throws LeaseException {
  if (this.stopRequested) { // here is the safety check
    return;
  }
  Lease lease = leases.get(leaseName);

  if (lease == null) {
    throw new LeaseException("lease '" + leaseName +
        "' does not exist or has already expired");
  }
  lease.resetExpirationTime();
}
{code}
I will post a patch soon.
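
As a rough illustration of the kind of guard meant here (not the actual patch), the sketch below applies a stopRequested check to removeLease. GuardedLeasesSketch, addLease and tryRemove are hypothetical names, and returning null during shutdown is an assumption; callers would have to handle that case. The flag is checked after the remove because close() sets stopRequested before clearing the map, so a null caused by the clear is always seen together with the flag.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical guarded variant of the Leases class; illustration only.
class GuardedLeasesSketch {
  static class LeaseException extends Exception {
    LeaseException(String message) {
      super(message);
    }
  }

  private final ConcurrentHashMap<String, Object> leases = new ConcurrentHashMap<>();
  private volatile boolean stopRequested = false;

  // Hypothetical helper: registers a lease under the given name.
  void addLease(String leaseName) {
    leases.put(leaseName, new Object());
  }

  // Same order as the close() shown above: flag first, then clear.
  void close() {
    this.stopRequested = true;
    leases.clear();
  }

  // Returns null instead of throwing when the lease is gone because of shutdown.
  Object removeLease(String leaseName) throws LeaseException {
    Object lease = leases.remove(leaseName);
    if (lease == null) {
      if (this.stopRequested) {
        return null; // assumption: shutdown is not reported as an error
      }
      throw new LeaseException("lease '" + leaseName + "' does not exist");
    }
    return lease;
  }

  // Hypothetical helper that summarises the outcome of one removeLease call.
  static String tryRemove(GuardedLeasesSketch target, String leaseName) {
    try {
      return target.removeLease(leaseName) == null ? "null" : "lease";
    } catch (LeaseException e) {
      return "exception";
    }
  }
}
```

With this variant, a remove after close() yields null, while a genuinely unknown lease on a running server still throws LeaseException.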

 

> Failed to become active master due to lease 'XXX' does not exist
> ----------------------------------------------------------------
>
>                 Key: HBASE-22017
>                 URL: https://issues.apache.org/jira/browse/HBASE-22017
>             Project: HBase
>          Issue Type: Bug
>            Reporter: lujie
>            Assignee: lujie
>            Priority: Critical
>         Attachments: logs.zip
>
>
> Test cluster: hadoop11 (master), hadoop14 (slave), hadoop15 (slave).
> If hadoop15 is shut down before the code at 
> org.apache.hadoop.hbase.regionserver.HStore#getScanner (line 2027) executes, 
> master startup fails:
> {code:java}
> 2019-03-06 01:36:17,040 ERROR [master/hadoop11:16000:becomeActiveMaster] 
> master.HMaster: ***** ABORTING master hadoop11,16000,1551807353275: Unhandled 
> exception. Starting shutdown. *****
> org.apache.hadoop.hbase.regionserver.LeaseException: 
> org.apache.hadoop.hbase.regionserver.LeaseException: lease 
> '3449673378019934209' does not exist
> at org.apache.hadoop.hbase.regionserver.Leases.removeLease(Leases.java:224)
> at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3434)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:42002)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130)
> at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
> at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at 
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException.instantiateException(RemoteWithExtrasException.java:100)
> at 
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException.unwrapRemoteException(RemoteWithExtrasException.java:90)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.makeIOExceptionOfException(ProtobufUtil.java:361)
> at 
> org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.handleRemoteException(ProtobufUtil.java:349)
> at 
> org.apache.hadoop.hbase.client.ScannerCallable.openScanner(ScannerCallable.java:344)
> at 
> org.apache.hadoop.hbase.client.ScannerCallable.rpcCall(ScannerCallable.java:242)
> at 
> org.apache.hadoop.hbase.client.ScannerCallable.rpcCall(ScannerCallable.java:58)
> at 
> org.apache.hadoop.hbase.client.RegionServerCallable.call(RegionServerCallable.java:127)
> at 
> org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithoutRetries(RpcRetryingCallerImpl.java:192)
> at 
> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:387)
> at 
> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:361)
> at 
> org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:107)
> at 
> org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:80)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
