[ https://issues.apache.org/jira/browse/HBASE-22017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16790289#comment-16790289 ]
lujie commented on HBASE-22017:
-------------------------------

After a long debugging session, I have found the cause of this bug. It is purely a *data race bug*.

While the RegionServer that holds the meta table is shutting down, it closes the leases:
{code:java}
public void close() {
  this.stopRequested = true;
  leases.clear();
  LOG.info("Closed leases");
}
{code}
Meanwhile, when the HMaster uses RSRpcServices#scan to scan the table, it does:
{code:java}
// see line 3345
try {
  // Remove lease while its being processed in server; protects against case
  // where processing of request takes > lease expiration time.
  lease = regionServer.leases.removeLease(scannerName);
} catch (LeaseException e) {
  throw new ServiceException(e);
}
{code}
removeLease does the following:
{code:java}
Lease removeLease(final String leaseName) throws LeaseException {
  Lease lease = leases.remove(leaseName);
  if (lease == null) {
    throw new LeaseException("lease '" + leaseName + "' does not exist");
  }
  return lease;
}
{code}
Because close() has already cleared the leases, leases.remove returns null, so lease == null and removeLease throws LeaseException.

So it is a data race bug, and the shared state is
{code:java}
leases
{code}
I have checked the other places that access *leases* and found that they have a safety check, like:
{code:java}
public void renewLease(final String leaseName) throws LeaseException {
  if (this.stopRequested) { // here is the safety check
    return;
  }
  Lease lease = leases.get(leaseName);
  if (lease == null) {
    throw new LeaseException("lease '" + leaseName + "' does not exist or has already expired");
  }
  lease.resetExpirationTime();
}
{code}
I will provide a patch soon.

> Failed to become active master due to lease 'XXX' does not exist
> ----------------------------------------------------------------
>
>                 Key: HBASE-22017
>                 URL: https://issues.apache.org/jira/browse/HBASE-22017
>             Project: HBase
>          Issue Type: Bug
>            Reporter: lujie
>            Assignee: lujie
>            Priority: Critical
>         Attachments: logs.zip
>
>
> Test cluster: hadoop11(master), hadoop14(slave), hadoop15(slave).
> If hadoop15 shuts down just before execution reaches org.apache.hadoop.hbase.regionserver.HStore#getScanner (line 2027), master startup fails:
> {code:java}
> 2019-03-06 01:36:17,040 ERROR [master/hadoop11:16000:becomeActiveMaster] master.HMaster: ***** ABORTING master hadoop11,16000,1551807353275: Unhandled exception. Starting shutdown. *****
> org.apache.hadoop.hbase.regionserver.LeaseException: org.apache.hadoop.hbase.regionserver.LeaseException: lease '3449673378019934209' does not exist
> at org.apache.hadoop.hbase.regionserver.Leases.removeLease(Leases.java:224)
> at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3434)
> at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:42002)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130)
> at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
> at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at org.apache.hadoop.hbase.ipc.RemoteWithExtrasException.instantiateException(RemoteWithExtrasException.java:100)
> at org.apache.hadoop.hbase.ipc.RemoteWithExtrasException.unwrapRemoteException(RemoteWithExtrasException.java:90)
> at org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.makeIOExceptionOfException(ProtobufUtil.java:361)
> at org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.handleRemoteException(ProtobufUtil.java:349)
> at org.apache.hadoop.hbase.client.ScannerCallable.openScanner(ScannerCallable.java:344)
> at org.apache.hadoop.hbase.client.ScannerCallable.rpcCall(ScannerCallable.java:242)
> at org.apache.hadoop.hbase.client.ScannerCallable.rpcCall(ScannerCallable.java:58)
> at org.apache.hadoop.hbase.client.RegionServerCallable.call(RegionServerCallable.java:127)
> at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithoutRetries(RpcRetryingCallerImpl.java:192)
> at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:387)
> at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:361)
> at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:107)
> at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:80)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
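
The interleaving described in the comment can be reproduced in isolation. The following is only a minimal sketch, not the HBase code and not the promised patch: the class LeaseRaceSketch, its nested Leases and LeaseException types, the guarded flag, and the choice to return null once stopRequested is set are all hypothetical stand-ins. It forces close() to run before removeLease() so the outcome is deterministic, then repeats the same sequence with a stopRequested check modelled on the one quoted for renewLease.

```java
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical, simplified model of the lease table; not the HBase classes. */
public class LeaseRaceSketch {

    /** Stand-in for the LeaseException mentioned in the comment. */
    static class LeaseException extends Exception {
        LeaseException(String message) { super(message); }
    }

    static class Leases {
        private final ConcurrentHashMap<String, Object> leases = new ConcurrentHashMap<>();
        private volatile boolean stopRequested = false;
        private final boolean guarded; // hypothetical switch: apply the stopRequested check or not

        Leases(boolean guarded) { this.guarded = guarded; }

        void createLease(String name) { leases.put(name, new Object()); }

        /** Shutdown path: mark the table as stopped and drop every lease. */
        void close() {
            stopRequested = true;
            leases.clear();
        }

        /** Scan path: remove the named lease, or fail if it is gone. */
        Object removeLease(String name) throws LeaseException {
            if (guarded && stopRequested) {
                return null; // assumed policy: nothing to remove once closed
            }
            Object lease = leases.remove(name);
            if (lease == null) {
                throw new LeaseException("lease '" + name + "' does not exist");
            }
            return lease;
        }
    }

    /** Runs close() before removeLease() to force the bad ordering. */
    static String run(boolean guarded) {
        Leases leases = new Leases(guarded);
        leases.createLease("scanner-1");
        leases.close(); // the shutdown path wins the race
        try {
            Object lease = leases.removeLease("scanner-1"); // the scan path arrives late
            return lease == null ? "skipped: leases closed" : "removed";
        } catch (LeaseException e) {
            return "LeaseException: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(run(false)); // LeaseException: lease 'scanner-1' does not exist
        System.out.println(run(true));  // skipped: leases closed
    }
}
```

Note that a check of this kind only narrows the window: close() can still run between the stopRequested test and leases.remove. Closing it completely would require the check and the removal to happen under the same lock as close().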