Hello, I am testing how HBase behaves on failure and ran into unexpected behavior. If someone could tell me how to shorten the recovery time, I would appreciate it.
What I tried: first, a client starts writing to HBase; then I shut down one of the slave node machines. I expected the write operations to stall for a short while (a minute or so) and, after HBase automatically moved the failed regions to another slave node, the client to be able to write again. In practice, it took about 15 minutes for the cluster to recover. What was going on? I have already applied the configurations presented at https://hbase.apache.org/book.html#mttr

My environment:
- 3 master nodes (NN, HMaster, ZK, etc.) and 4 slave nodes (DN, RS); all of them are virtual machines.
- HDP 2.4.3 (HBase 1.1.1)
- CentOS 6.7

Below are the HMaster and NameNode logs. The HMaster detected the failure of a RegionServer at 15:41:16 and tried to start recovery, but kept throwing RetriesExhaustedException again and again, finally finishing recovery at 15:57:11. I noticed some lines about lease recovery in the NameNode log, but I am not sure whether they are related.

---- HMaster log ---
2016-02-17 15:41:16,003 INFO [main-EventThread] zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [host7516.mydomain,16020,1455618299902] 2016-02-17 15:41:16,005 INFO [MASTER_META_SERVER_OPERATIONS-HOST7512:16000-1] handler.MetaServerShutdownHandler: Splitting hbase:meta logs for host7516.mydomain,16020,1455618299902 2016-02-17 15:41:16,010 INFO [MASTER_META_SERVER_OPERATIONS-HOST7512:16000-1] master.SplitLogManager: dead splitlog workers [host7516.mydomain,16020,1455618299902] 2016-02-17 15:41:16,011 INFO [MASTER_META_SERVER_OPERATIONS-HOST7512:16000-1] master.SplitLogManager: started splitting 1 logs in [hdfs://hdpts/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting] for [host7516.
mydomain,16020,1455618299902] 2016-02-17 15:41:16,016 INFO [main-EventThread] coordination.SplitLogManagerCoordination: task /hbase-unsecure/splitWAL/WALs%2Fhost7516.mydomain%2C16020%2C1455618299902-splitting%2Fhost7516.mydomain%252C16020%252C1455618299902..meta.1455690873222.meta acquired by host7518.mydomain,16020,1455618298475 2016-02-17 15:41:16,266 INFO [host7512.mydomain,16000,1455618257882_splitLogManager__ChoreService_1] master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase-unsecure/splitWAL/WALs%2Fhost7516.mydomain%2C16020%2C1455618299902-splitting%2Fhost7516.mydomain%252C16020%252C1455618299902..meta.1455690873222.meta=last_update = 1455691276196 last_version = 2 cur_worker_name = host7518.mydomain,16020,1455618298475 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0} 2016-02-17 15:41:20,481 INFO [main-EventThread] coordination.SplitLogManagerCoordination: task /hbase-unsecure/splitWAL/WALs%2Fhost7516.mydomain%2C16020%2C1455618299902-splitting%2Fhost7516.mydomain%252C16020%252C1455618299902..meta.1455690873222.meta entered state: DONE host7518.mydomain,16020,1455618298475 2016-02-17 15:41:20,487 INFO [main-EventThread] wal.WALSplitter: Archived processed log hdfs://hdpts/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting/host7516.mydomain%2C16020%2C1455618299902..meta.1455690873222.meta to hdfs://hdpts/apps/hbase/data/oldWALs/host7516.mydomain%2C16020%2C1455618299902..meta.1455690873222.meta 2016-02-17 15:41:20,487 INFO [main-EventThread] coordination.SplitLogManagerCoordination: Done splitting /hbase-unsecure/splitWAL/WALs%2Fhost7516.mydomain%2C16020%2C1455618299902-splitting%2Fhost7516.mydomain%252C16020%252C1455618299902..meta.1455690873222.meta 2016-02-17 15:41:20,492 WARN [MASTER_META_SERVER_OPERATIONS-HOST7512:16000-1] master.SplitLogManager: returning success without actually splitting and deleting all the log files in path 
hdfs://hdpts/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting 2016-02-17 15:41:20,492 INFO [MASTER_META_SERVER_OPERATIONS-HOST7512:16000-1] master.SplitLogManager: finished splitting (more than or equal to) 83 bytes in 1 log files in [hdfs://hdpts/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting] in 4481ms 2016-02-17 15:41:20,492 INFO [MASTER_META_SERVER_OPERATIONS-HOST7512:16000-1] handler.MetaServerShutdownHandler: Server host7516.mydomain,16020,1455618299902 was carrying META. Trying to assign. 2016-02-17 15:41:20,492 INFO [MASTER_META_SERVER_OPERATIONS-HOST7512:16000-1] master.RegionStates: Transition {1588230740 state=OPEN, ts=1455690873330, server=host7516.mydomain,16020,1455618299902} to {1588230740 state=OFFLINE, ts=1455691280492, server=host7516.mydomain,16020,1455618299902} 2016-02-17 15:41:20,493 INFO [MASTER_META_SERVER_OPERATIONS-HOST7512:16000-1] master.RegionStates: Offlined 1588230740 from host7516.mydomain,16020,1455618299902 2016-02-17 15:41:27,495 INFO [MASTER_META_SERVER_OPERATIONS-HOST7512:16000-1] zookeeper.MetaTableLocator: Failed verification of hbase:meta,,1 at address=host7516.mydomain,16020,1455618299902, exception=Connection timed out 2016-02-17 15:41:27,495 INFO [MASTER_META_SERVER_OPERATIONS-HOST7512:16000-1] zookeeper.MetaTableLocator: Deleting hbase:meta region location in ZooKeeper 2016-02-17 15:41:27,500 INFO [MASTER_META_SERVER_OPERATIONS-HOST7512:16000-1] master.AssignmentManager: Setting node as OFFLINED in ZooKeeper for region {ENCODED => 1588230740, NAME => 'hbase:meta,,1', STARTKEY => '', ENDKEY => ''} 2016-02-17 15:41:27,502 INFO [MASTER_META_SERVER_OPERATIONS-HOST7512:16000-1] master.AssignmentManager: Assigning hbase:meta,,1.1588230740 to host7517.mydomain,16020,1455618296942 2016-02-17 15:41:27,502 INFO [MASTER_META_SERVER_OPERATIONS-HOST7512:16000-1] master.RegionStates: Transition {1588230740 state=OFFLINE, ts=1455691287500, server=host7516.mydomain,16020,1455618299902} to 
{1588230740 state=PENDING_OPEN, ts=1455691287502, server=host7517.mydomain,16020,1455618296942} 2016-02-17 15:41:27,512 INFO [AM.ZK.Worker-pool2-t62] master.RegionStates: Transition {1588230740 state=PENDING_OPEN, ts=1455691287502, server=host7517.mydomain,16020,1455618296942} to {1588230740 state=OPENING, ts=1455691287512, server=host7517.mydomain,16020,1455618296942} 2016-02-17 15:41:27,598 INFO [AM.ZK.Worker-pool2-t63] master.RegionStates: Transition {1588230740 state=OPENING, ts=1455691287512, server=host7517.mydomain,16020,1455618296942} to {1588230740 state=OPEN, ts=1455691287598, server=host7517.mydomain,16020,1455618296942} 2016-02-17 15:41:27,598 INFO [AM.ZK.Worker-pool2-t63] coordination.ZkOpenRegionCoordination: Handling OPENED of 1588230740 from host7512.mydomain,16000,1455618257882; deleting unassigned node 2016-02-17 15:42:57,601 INFO [MASTER_SERVER_OPERATIONS-HOST7512:16000-4] handler.ServerShutdownHandler: Received exception accessing hbase:meta during server shutdown of host7516.mydomain,16020,1455618299902, retrying hbase:meta read org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=351, exceptions: Wed Feb 17 15:42:57 KST 2016, null, java.net.SocketTimeoutException: callTimeout=60000, callDuration=90304: row '' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=host7516.mydomain,16020,1455618299902, seqNum=0 at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:271) at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:195) at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:59) at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200) at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:320) at org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:295) at 
org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:160) at org.apache.hadoop.hbase.client.ClientScanner.<init>(ClientScanner.java:155) at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:821) at org.apache.hadoop.hbase.MetaTableAccessor.fullScan(MetaTableAccessor.java:602) at org.apache.hadoop.hbase.MetaTableAccessor.fullScan(MetaTableAccessor.java:156) at org.apache.hadoop.hbase.MetaTableAccessor.getServerUserRegions(MetaTableAccessor.java:555) at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:177) at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.net.SocketTimeoutException: callTimeout=60000, callDuration=90304: row '' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=host7516.mydomain,16020,1455618299902, seqNum=0 at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:159) at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:64) ... 3 more Caused by: java.io.IOException: Call to HOST7516.mydomain/192.168.184.73:16020 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=1388, waitTime=90001, operationTimeout=90000 expired. 
at org.apache.hadoop.hbase.ipc.RpcClientImpl.wrapException(RpcClientImpl.java:1262) at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1230) at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:213) at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287) at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.scan(ClientProtos.java:32651) at org.apache.hadoop.hbase.client.ScannerCallable.openScanner(ScannerCallable.java:372) at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:199) at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:62) at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200) at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:346) at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:320) at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:126) ... 4 more Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=1388, waitTime=90001, operationTimeout=90000 expired. at org.apache.hadoop.hbase.ipc.Call.checkAndSetTimeout(Call.java:70) at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1204) ... 
14 more ...(Same error happens again and again) 2016-02-17 15:56:55,625 INFO [MASTER_SERVER_OPERATIONS-HOST7512:16000-4] handler.ServerShutdownHandler: Splitting logs for host7516.mydomain,16020,1455618299902 before assignment; region count=73 2016-02-17 15:56:55,627 INFO [MASTER_SERVER_OPERATIONS-HOST7512:16000-4] master.SplitLogManager: dead splitlog workers [host7516.mydomain,16020,1455618299902] 2016-02-17 15:56:55,628 INFO [MASTER_SERVER_OPERATIONS-HOST7512:16000-4] master.SplitLogManager: started splitting 1 logs in [hdfs://hdpts/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting] for [host7516.mydomain,16020,1455618299902] 2016-02-17 15:56:55,634 INFO [main-EventThread] coordination.SplitLogManagerCoordination: task /hbase-unsecure/splitWAL/WALs%2Fhost7516.mydomain%2C16020%2C1455618299902-splitting%2Fhost7516.mydomain%252C16020%252C1455618299902.default.1455690304230 acquired by host7517.mydomain,16020,1455618296942 2016-02-17 15:56:56,266 INFO [host7512.mydomain,16000,1455618257882_splitLogManager__ChoreService_1] master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase-unsecure/splitWAL/WALs%2Fhost7516.mydomain%2C16020%2C1455618299902-splitting%2Fhost7516.mydomain%252C16020%252C1455618299902.default.1455690304230=last_update = 1455692215679 last_version = 2 cur_worker_name = host7517.mydomain,16020,1455618296942 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0} 2016-02-17 15:57:01,336 INFO [main-EventThread] coordination.SplitLogManagerCoordination: task /hbase-unsecure/splitWAL/WALs%2Fhost7516.mydomain%2C16020%2C1455618299902-splitting%2Fhost7516.mydomain%252C16020%252C1455618299902.default.1455690304230 entered state: DONE host7517.mydomain,16020,1455618296942 2016-02-17 15:57:01,343 INFO [main-EventThread] wal.WALSplitter: Archived processed log 
hdfs://hdpts/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting/host7516.mydomain%2C16020%2C1455618299902.default.1455690304230 to hdfs://hdpts/apps/hbase/data/oldWALs/host7516.mydomain%2C16020%2C1455618299902.default.1455690304230 2016-02-17 15:57:01,343 INFO [main-EventThread] coordination.SplitLogManagerCoordination: Done splitting /hbase-unsecure/splitWAL/WALs%2Fhost7516.mydomain%2C16020%2C1455618299902-splitting%2Fhost7516.mydomain%252C16020%252C1455618299902.default.1455690304230 2016-02-17 15:57:01,346 INFO [MASTER_SERVER_OPERATIONS-HOST7512:16000-4] master.SplitLogManager: finished splitting (more than or equal to) 83 bytes in 1 log files in [hdfs://hdpts/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting] in 5718ms ... 2016-02-17 15:57:01,353 INFO [MASTER_SERVER_OPERATIONS-HOST7512:16000-4] master.AssignmentManager: Bulk assigning 73 region(s) across 3 server(s), round-robin=true 2016-02-17 15:57:01,353 INFO [host7512.mydomain,16000,1455618257882-GeneralBulkAssigner-0] master.AssignmentManager: Assigning 24 region(s) to host7515.mydomain,16020,1455691124033 2016-02-17 15:57:01,353 INFO [host7512.mydomain,16000,1455618257882-GeneralBulkAssigner-1] master.AssignmentManager: Assigning 24 region(s) to host7517.mydomain,16020,1455618296942 2016-02-17 15:57:01,353 INFO [host7512.mydomain,16000,1455618257882-GeneralBulkAssigner-2] master.AssignmentManager: Assigning 25 region(s) to host7518.mydomain,16020,1455618298475 ... 
2016-02-17 15:57:11,005 INFO [AM.ZK.Worker-pool2-t71] master.RegionStates: Offlined 50cc594cd9e74287ddd5b960bfa2d8e9 from host7516.mydomain,16020,1455618299902 2016-02-17 15:57:11,005 INFO [MASTER_SERVER_OPERATIONS-HOST7512:16000-4] master.AssignmentManager: Bulk assigning done 2016-02-17 15:57:11,005 INFO [MASTER_SERVER_OPERATIONS-HOST7512:16000-4] handler.ServerShutdownHandler: Finished processing of shutdown of host7516.mydomain,16020,1455618299902 ---- ---- NameNode log --- 2016-02-17 15:41:16,008 INFO namenode.FSEditLog (FSEditLog.java:printStatistics(699)) - Number of transactions: 45 Total time for transactions(ms): 4 Number of transactions batched in Syncs: 0 Number of syncs: 42 SyncTimes(ms): 35 11 2016-02-17 15:41:16,383 INFO namenode.FSNamesystem (FSNamesystem.java:recoverLeaseInternal(2907)) - recoverLease: [Lease. Holder: DFSClient_NONMAPREDUCE_1038296970_1, pendingcreates: 2], src=/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting/host7516.mydomain%2C16020%2C1455618299902..meta.1455690873222.meta from client DFSClient_NONMAPREDUCE_1038296970_1 2016-02-17 15:41:16,383 INFO namenode.FSNamesystem (FSNamesystem.java:internalReleaseLease(4036)) - Recovering [Lease. Holder: DFSClient_NONMAPREDUCE_1038296970_1, pendingcreates: 2], src=/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting/host7516.mydomain%2C16020%2C1455618299902..meta.1455690873222.meta 2016-02-17 15:41:16,383 WARN hdfs.StateChange (FSNamesystem.java:internalReleaseLease(4157)) - DIR* NameSystem.internalReleaseLease: File /apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting/host7516.mydomain%2C16020%2C1455618299902..meta.1455690873222.meta has not been closed. Lease recovery is in progress. 
RecoveryId = 13180 for block blk_1073753851_13131{UCState=UNDER_RECOVERY, truncateBlock=null, primaryNodeIndex=2, replicas=[ReplicaUC[[DISK]DS-9787b201-fc64-450e-a20f-dcc79fb94b6f:NORMAL:192.168.184.73:50010|RBW], ReplicaUC[[DISK]DS-52f00167-e863-48d8-8f98-3b613f774b0c:NORMAL:192.168.185.249:50010|RBW], ReplicaUC[[DISK]DS-f8c4c3b2-a7cc-40e0-9ebb-2b9c352c0d28:NORMAL:192.168.184.85:50010|RBW]]} 2016-02-17 15:41:18,588 INFO BlockStateChange (UnderReplicatedBlocks.java:chooseUnderReplicatedBlocks(394)) - chooseUnderReplicatedBlocks selected 8 blocks at priority level 2; Total=8 Reset bookmarks? false 2016-02-17 15:41:18,588 INFO BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1527)) - BLOCK* neededReplications = 75, pendingReplications = 0. 2016-02-17 15:41:18,588 INFO blockmanagement.BlockManager (BlockManager.java:computeReplicationWorkForBlocks(1534)) - Blocks chosen but could not be replicated = 8; of which 8 have no target, 0 have no source, 0 are UC, 0 are abandoned, 0 already have enough replicas. 
2016-02-17 15:41:18,960 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user anonymous ...(the same WARN repeated between 15:41:18,961 and 15:41:19,099)... 2016-02-17 15:41:19,307 INFO blockmanagement.DatanodeManager (DatanodeManager.java:handleHeartbeat(1426)) - Skipped stale nodes for recovery : 1 2016-02-17 15:41:19,694 INFO namenode.FSNamesystem 
(FSNamesystem.java:commitBlockSynchronization(4269)) - commitBlockSynchronization(oldBlock=BP-1843495860-192.168.189.219-1453778090403:blk_1073753851_13131, newgenerationstamp=13180, newlength=60238, newtargets=[192.168.185.249:50010, 192.168.184.85:50010], closeFile=true, deleteBlock=false) 2016-02-17 15:41:19,696 INFO namenode.FSNamesystem (FSNamesystem.java:commitBlockSynchronization(4434)) - commitBlockSynchronization(oldBlock=BP-1843495860-192.168.189.219-1453778090403:blk_1073753851_13131, file=/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting/host7516.mydomain%2C16020%2C1455618299902..meta.1455690873222.meta, newgenerationstamp=13180, newlength=60238, newtargets=[192.168.185.249:50010, 192.168.184.85:50010]) successful 2016-02-17 15:41:20,459 INFO hdfs.StateChange (FSNamesystem.java:saveAllocatedBlock(3583)) - BLOCK* allocate blk_1073753897_13181{UCState=UNDER_CONSTRUCTION, truncateBlock=null, primaryNodeIndex=-1, replicas=[ReplicaUC[[DISK]DS-f8c4c3b2-a7cc-40e0-9ebb-2b9c352c0d28:NORMAL:192.168.184.85:50010|RBW], ReplicaUC[[DISK]DS-882dca14-8651-4ffe-a102-8beebc774575:NORMAL:192.168.189.208:50010|RBW], ReplicaUC[[DISK]DS-52f00167-e863-48d8-8f98-3b613f774b0c:NORMAL:192.168.185.249:50010|RBW]]} for /apps/hbase/data/data/hbase/meta/1588230740/recovered.edits/0000000000000009659.temp 2016-02-17 15:41:20,474 INFO hdfs.StateChange (FSNamesystem.java:completeFile(3503)) - DIR* completeFile: /apps/hbase/data/data/hbase/meta/1588230740/recovered.edits/0000000000000009659.temp is closed by DFSClient_NONMAPREDUCE_443617401_1 2016-02-17 15:41:20,488 INFO ipc.Server (Server.java:run(2172)) - IPC Server handler 8 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.delete from 192.168.189.219:51393 Call#8040 Retry#0 org.apache.hadoop.fs.PathIsNotEmptyDirectoryException: `/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting is non empty': Directory is not empty at 
org.apache.hadoop.hdfs.server.namenode.FSDirDeleteOp.delete(FSDirDeleteOp.java:89) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:3723) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.delete(NameNodeRpcServer.java:1017) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.delete(ClientNamenodeProtocolServerSideTranslatorPB.java:611) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145) 2016-02-17 15:41:21,588 INFO BlockStateChange (UnderReplicatedBlocks.java:chooseUnderReplicatedBlocks(394)) - chooseUnderReplicatedBlocks selected 8 blocks at priority level 2; Total=8 Reset bookmarks? false 2016-02-17 15:41:21,588 INFO BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1527)) - BLOCK* neededReplications = 76, pendingReplications = 0. ... 
2016-02-17 15:56:15,384 INFO namenode.FSNamesystem (FSNamesystem.java:rollEditLog(5937)) - Roll Edit Log from 192.168.184.72 2016-02-17 15:56:15,384 INFO namenode.FSEditLog (FSEditLog.java:rollEditLog(1202)) - Rolling edit logs 2016-02-17 15:56:15,384 INFO namenode.FSEditLog (FSEditLog.java:endCurrentLogSegment(1258)) - Ending log segment 369158 2016-02-17 15:56:15,395 INFO namenode.FSEditLog (FSEditLog.java:printStatistics(699)) - Number of transactions: 14 Total time for transactions(ms): 5 Number of transactions batched in Syncs: 0 Number of syncs: 14 SyncTimes(ms): 26 6 2016-02-17 15:56:15,421 INFO namenode.FileJournalManager (FileJournalManager.java:finalizeLogSegment(133)) - Finalizing edits file /hadoop/hdfs/namenode/current/edits_inprogress_0000000000000369158 -> /hadoop/hdfs/namenode/current/edits_0000000000000369158-0000000000000369171 2016-02-17 15:56:15,421 INFO namenode.FSEditLog (FSEditLog.java:startLogSegment(1218)) - Starting log segment at 369172 2016-02-17 15:56:15,764 INFO BlockStateChange (UnderReplicatedBlocks.java:chooseUnderReplicatedBlocks(394)) - chooseUnderReplicatedBlocks selected 3 blocks at priority level 1; 3 blocks at priority level 2; Total=6 Reset bookmarks? false 2016-02-17 15:56:15,764 INFO BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1527)) - BLOCK* neededReplications = 91, pendingReplications = 4. 2016-02-17 15:56:15,764 INFO blockmanagement.BlockManager (BlockManager.java:computeReplicationWorkForBlocks(1534)) - Blocks chosen but could not be replicated = 3; of which 3 have no target, 0 have no source, 0 are UC, 0 are abandoned, 0 already have enough replicas. 2016-02-17 15:56:18,764 INFO BlockStateChange (UnderReplicatedBlocks.java:chooseUnderReplicatedBlocks(394)) - chooseUnderReplicatedBlocks selected 6 blocks at priority level 2; Total=6 Reset bookmarks? 
false 2016-02-17 15:56:18,765 INFO BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1527)) - BLOCK* neededReplications = 87, pendingReplications = 5. 2016-02-17 15:56:18,765 INFO blockmanagement.BlockManager (BlockManager.java:computeReplicationWorkForBlocks(1534)) - Blocks chosen but could not be replicated = 2; of which 0 have no target, 2 have no source, 0 are UC, 0 are abandoned, 0 already have enough replicas. ... 2016-02-17 15:56:55,688 INFO namenode.FSNamesystem (FSNamesystem.java:recoverLeaseInternal(2907)) - recoverLease: [Lease. Holder: DFSClient_NONMAPREDUCE_1038296970_1, pendingcreates: 1], src=/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting/host7516.mydomain%2C16020%2C1455618299902.default.1455690304230 from client DFSClient_NONMAPREDUCE_1038296970_1 2016-02-17 15:56:55,688 INFO namenode.FSNamesystem (FSNamesystem.java:internalReleaseLease(4036)) - Recovering [Lease. Holder: DFSClient_NONMAPREDUCE_1038296970_1, pendingcreates: 1], src=/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting/host7516.mydomain%2C16020%2C1455618299902.default.1455690304230 2016-02-17 15:56:55,688 WARN hdfs.StateChange (FSNamesystem.java:internalReleaseLease(4157)) - DIR* NameSystem.internalReleaseLease: File /apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting/host7516.mydomain%2C16020%2C1455618299902.default.1455690304230 has not been closed. Lease recovery is in progress. 
RecoveryId = 13183 for block blk_1073753849_13127{UCState=UNDER_RECOVERY, truncateBlock=null, primaryNodeIndex=2, replicas=[ReplicaUC[[DISK]DS-9787b201-fc64-450e-a20f-dcc79fb94b6f:NORMAL:192.168.184.73:50010|RBW], ReplicaUC[[DISK]DS-f8c4c3b2-a7cc-40e0-9ebb-2b9c352c0d28:NORMAL:192.168.184.85:50010|RBW], ReplicaUC[[DISK]DS-52f00167-e863-48d8-8f98-3b613f774b0c:NORMAL:192.168.185.249:50010|RBW]]} 2016-02-17 15:56:58,687 INFO namenode.FSNamesystem (FSNamesystem.java:commitBlockSynchronization(4434)) - commitBlockSynchronization(oldBlock=BP-1843495860-192.168.189.219-1453778090403:blk_1073753849_13127, file=/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting/host7516.mydomain%2C16020%2C1455618299902.default.1455690304230, newgenerationstamp=13183, newlength=118884436, newtargets=[192.168.184.85:50010, 192.168.185.249:50010]) successful ----

Thanks,
Hironori Ogibayashi
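P.S. For reference, here is a sketch of the stale-DataNode and failover settings from the MTTR section of the HBase book that I mentioned above. The property names are the ones documented there; the values shown are only illustrative, not the exact ones on my cluster:

```xml
<!-- hdfs-site.xml: mark a dead DataNode "stale" quickly so the NameNode
     avoids routing reads/writes (and lease recovery) to it.
     Values are illustrative. -->
<property>
  <name>dfs.namenode.avoid.read.stale.datanode</name>
  <value>true</value>
</property>
<property>
  <name>dfs.namenode.avoid.write.stale.datanode</name>
  <value>true</value>
</property>
<property>
  <name>dfs.namenode.stale.datanode.interval</name>
  <value>20000</value> <!-- ms; default is 30000 -->
</property>

<!-- hbase-site.xml: detect RegionServer death and give up on it faster. -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>30000</value> <!-- ms; default is 90000 -->
</property>
<property>
  <name>hbase.client.retries.number</name>
  <value>10</value>
</property>
```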