Hello, I am testing how HBase behaves on failure and ran into unexpected behavior. If someone could tell me how to shorten the recovery time, I would appreciate it.
What I tried: first, a client starts writing to HBase; then I shut down one of the slave node machines. I expected the write operations to stall for a short while (a minute or so) and, after HBase automatically moved the failed regions to another slave node, the client to be able to write again. In practice, it took about 15 minutes for the cluster to recover. What was going on? I have already applied the configurations presented at https://hbase.apache.org/book.html#mttr

My environment:
- 3 master nodes (NN, HMaster, ZK, etc.) and 4 slave nodes (DN, RS); all of them are virtual machines.
- HDP 2.4.3 (HBase 1.1.1)
- CentOS 6.7

Below are the HMaster and NameNode logs. The HMaster detected the failure of a RegionServer at 15:41:16 and tried to start recovery, but kept throwing RetriesExhaustedException again and again, finally finishing recovery at 15:57:11. I noticed some lines about lease recovery in the NameNode log, but I am not sure whether they are related.

---- HMaster log ---
2016-02-17 15:41:16,003 INFO [main-EventThread] zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [host7516.mydomain,16020,1455618299902] 2016-02-17 15:41:16,005 INFO [MASTER_META_SERVER_OPERATIONS-HOST7512:16000-1] handler.MetaServerShutdownHandler: Splitting hbase:meta logs for host7516.mydomain,16020,1455618299902 2016-02-17 15:41:16,010 INFO [MASTER_META_SERVER_OPERATIONS-HOST7512:16000-1] master.SplitLogManager: dead splitlog workers [host7516.mydomain,16020,1455618299902] 2016-02-17 15:41:16,011 INFO [MASTER_META_SERVER_OPERATIONS-HOST7512:16000-1] master.SplitLogManager: started splitting 1 logs in [hdfs://hdpts/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting] for [host7516.
mydomain,16020,1455618299902] 2016-02-17 15:41:16,016 INFO [main-EventThread] coordination.SplitLogManagerCoordination: task /hbase-unsecure/splitWAL/WALs%2Fhost7516.mydomain%2C16020%2C1455618299902-splitting%2Fhost7516.mydomain%252C16020%252C1455618299902..meta.1455690873222.meta acquired by host7518.mydomain,16020,1455618298475 2016-02-17 15:41:16,266 INFO [host7512.mydomain,16000,1455618257882_splitLogManager__ChoreService_1] master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase-unsecure/splitWAL/WALs%2Fhost7516.mydomain%2C16020%2C1455618299902-splitting%2Fhost7516.mydomain%252C16020%252C1455618299902..meta.1455690873222.meta=last_update = 1455691276196 last_version = 2 cur_worker_name = host7518.mydomain,16020,1455618298475 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0} 2016-02-17 15:41:20,481 INFO [main-EventThread] coordination.SplitLogManagerCoordination: task /hbase-unsecure/splitWAL/WALs%2Fhost7516.mydomain%2C16020%2C1455618299902-splitting%2Fhost7516.mydomain%252C16020%252C1455618299902..meta.1455690873222.meta entered state: DONE host7518.mydomain,16020,1455618298475 2016-02-17 15:41:20,487 INFO [main-EventThread] wal.WALSplitter: Archived processed log hdfs://hdpts/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting/host7516.mydomain%2C16020%2C1455618299902..meta.1455690873222.meta to hdfs://hdpts/apps/hbase/data/oldWALs/host7516.mydomain%2C16020%2C1455618299902..meta.1455690873222.meta 2016-02-17 15:41:20,487 INFO [main-EventThread] coordination.SplitLogManagerCoordination: Done splitting /hbase-unsecure/splitWAL/WALs%2Fhost7516.mydomain%2C16020%2C1455618299902-splitting%2Fhost7516.mydomain%252C16020%252C1455618299902..meta.1455690873222.meta 2016-02-17 15:41:20,492 WARN [MASTER_META_SERVER_OPERATIONS-HOST7512:16000-1] master.SplitLogManager: returning success without actually splitting and deleting all the log files in path 
hdfs://hdpts/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting 2016-02-17 15:41:20,492 INFO [MASTER_META_SERVER_OPERATIONS-HOST7512:16000-1] master.SplitLogManager: finished splitting (more than or equal to) 83 bytes in 1 log files in [hdfs://hdpts/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting] in 4481ms 2016-02-17 15:41:20,492 INFO [MASTER_META_SERVER_OPERATIONS-HOST7512:16000-1] handler.MetaServerShutdownHandler: Server host7516.mydomain,16020,1455618299902 was carrying META. Trying to assign. 2016-02-17 15:41:20,492 INFO [MASTER_META_SERVER_OPERATIONS-HOST7512:16000-1] master.RegionStates: Transition {1588230740 state=OPEN, ts=1455690873330, server=host7516.mydomain,16020,1455618299902} to {1588230740 state=OFFLINE, ts=1455691280492, server=host7516.mydomain,16020,1455618299902} 2016-02-17 15:41:20,493 INFO [MASTER_META_SERVER_OPERATIONS-HOST7512:16000-1] master.RegionStates: Offlined 1588230740 from host7516.mydomain,16020,1455618299902 2016-02-17 15:41:27,495 INFO [MASTER_META_SERVER_OPERATIONS-HOST7512:16000-1] zookeeper.MetaTableLocator: Failed verification of hbase:meta,,1 at address=host7516.mydomain,16020,1455618299902, exception=Connection timed out 2016-02-17 15:41:27,495 INFO [MASTER_META_SERVER_OPERATIONS-HOST7512:16000-1] zookeeper.MetaTableLocator: Deleting hbase:meta region location in ZooKeeper 2016-02-17 15:41:27,500 INFO [MASTER_META_SERVER_OPERATIONS-HOST7512:16000-1] master.AssignmentManager: Setting node as OFFLINED in ZooKeeper for region {ENCODED => 1588230740, NAME => 'hbase:meta,,1', STARTKEY => '', ENDKEY => ''} 2016-02-17 15:41:27,502 INFO [MASTER_META_SERVER_OPERATIONS-HOST7512:16000-1] master.AssignmentManager: Assigning hbase:meta,,1.1588230740 to host7517.mydomain,16020,1455618296942 2016-02-17 15:41:27,502 INFO [MASTER_META_SERVER_OPERATIONS-HOST7512:16000-1] master.RegionStates: Transition {1588230740 state=OFFLINE, ts=1455691287500, server=host7516.mydomain,16020,1455618299902} to 
{1588230740 state=PENDING_OPEN, ts=1455691287502, server=host7517.mydomain,16020,1455618296942} 2016-02-17 15:41:27,512 INFO [AM.ZK.Worker-pool2-t62] master.RegionStates: Transition {1588230740 state=PENDING_OPEN, ts=1455691287502, server=host7517.mydomain,16020,1455618296942} to {1588230740 state=OPENING, ts=1455691287512, server=host7517.mydomain,16020,1455618296942} 2016-02-17 15:41:27,598 INFO [AM.ZK.Worker-pool2-t63] master.RegionStates: Transition {1588230740 state=OPENING, ts=1455691287512, server=host7517.mydomain,16020,1455618296942} to {1588230740 state=OPEN, ts=1455691287598, server=host7517.mydomain,16020,1455618296942} 2016-02-17 15:41:27,598 INFO [AM.ZK.Worker-pool2-t63] coordination.ZkOpenRegionCoordination: Handling OPENED of 1588230740 from host7512.mydomain,16000,1455618257882; deleting unassigned node 2016-02-17 15:42:57,601 INFO [MASTER_SERVER_OPERATIONS-HOST7512:16000-4] handler.ServerShutdownHandler: Received exception accessing hbase:meta during server shutdown of host7516.mydomain,16020,1455618299902, retrying hbase:meta read org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=351, exceptions: Wed Feb 17 15:42:57 KST 2016, null, java.net.SocketTimeoutException: callTimeout=60000, callDuration=90304: row '' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=host7516.mydomain,16020,1455618299902, seqNum=0 at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:271) at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:195) at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:59) at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200) at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:320) at org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:295) at 
org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:160) at org.apache.hadoop.hbase.client.ClientScanner.<init>(ClientScanner.java:155) at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:821) at org.apache.hadoop.hbase.MetaTableAccessor.fullScan(MetaTableAccessor.java:602) at org.apache.hadoop.hbase.MetaTableAccessor.fullScan(MetaTableAccessor.java:156) at org.apache.hadoop.hbase.MetaTableAccessor.getServerUserRegions(MetaTableAccessor.java:555) at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:177) at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.net.SocketTimeoutException: callTimeout=60000, callDuration=90304: row '' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=host7516.mydomain,16020,1455618299902, seqNum=0 at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:159) at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:64) ... 3 more Caused by: java.io.IOException: Call to HOST7516.mydomain/192.168.184.73:16020 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=1388, waitTime=90001, operationTimeout=90000 expired. 
at org.apache.hadoop.hbase.ipc.RpcClientImpl.wrapException(RpcClientImpl.java:1262) at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1230) at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:213) at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287) at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.scan(ClientProtos.java:32651) at org.apache.hadoop.hbase.client.ScannerCallable.openScanner(ScannerCallable.java:372) at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:199) at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:62) at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200) at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:346) at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:320) at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:126) ... 4 more Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=1388, waitTime=90001, operationTimeout=90000 expired. at org.apache.hadoop.hbase.ipc.Call.checkAndSetTimeout(Call.java:70) at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1204) ... 
14 more ...(Same error happens again and again) 2016-02-17 15:56:55,625 INFO [MASTER_SERVER_OPERATIONS-HOST7512:16000-4] handler.ServerShutdownHandler: Splitting logs for host7516.mydomain,16020,1455618299902 before assignment; region count=73 2016-02-17 15:56:55,627 INFO [MASTER_SERVER_OPERATIONS-HOST7512:16000-4] master.SplitLogManager: dead splitlog workers [host7516.mydomain,16020,1455618299902] 2016-02-17 15:56:55,628 INFO [MASTER_SERVER_OPERATIONS-HOST7512:16000-4] master.SplitLogManager: started splitting 1 logs in [hdfs://hdpts/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting] for [host7516.mydomain,16020,1455618299902] 2016-02-17 15:56:55,634 INFO [main-EventThread] coordination.SplitLogManagerCoordination: task /hbase-unsecure/splitWAL/WALs%2Fhost7516.mydomain%2C16020%2C1455618299902-splitting%2Fhost7516.mydomain%252C16020%252C1455618299902.default.1455690304230 acquired by host7517.mydomain,16020,1455618296942 2016-02-17 15:56:56,266 INFO [host7512.mydomain,16000,1455618257882_splitLogManager__ChoreService_1] master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase-unsecure/splitWAL/WALs%2Fhost7516.mydomain%2C16020%2C1455618299902-splitting%2Fhost7516.mydomain%252C16020%252C1455618299902.default.1455690304230=last_update = 1455692215679 last_version = 2 cur_worker_name = host7517.mydomain,16020,1455618296942 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0} 2016-02-17 15:57:01,336 INFO [main-EventThread] coordination.SplitLogManagerCoordination: task /hbase-unsecure/splitWAL/WALs%2Fhost7516.mydomain%2C16020%2C1455618299902-splitting%2Fhost7516.mydomain%252C16020%252C1455618299902.default.1455690304230 entered state: DONE host7517.mydomain,16020,1455618296942 2016-02-17 15:57:01,343 INFO [main-EventThread] wal.WALSplitter: Archived processed log 
hdfs://hdpts/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting/host7516.mydomain%2C16020%2C1455618299902.default.1455690304230 to hdfs://hdpts/apps/hbase/data/oldWALs/host7516.mydomain%2C16020%2C1455618299902.default.1455690304230 2016-02-17 15:57:01,343 INFO [main-EventThread] coordination.SplitLogManagerCoordination: Done splitting /hbase-unsecure/splitWAL/WALs%2Fhost7516.mydomain%2C16020%2C1455618299902-splitting%2Fhost7516.mydomain%252C16020%252C1455618299902.default.1455690304230 2016-02-17 15:57:01,346 INFO [MASTER_SERVER_OPERATIONS-HOST7512:16000-4] master.SplitLogManager: finished splitting (more than or equal to) 83 bytes in 1 log files in [hdfs://hdpts/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting] in 5718ms ... 2016-02-17 15:57:01,353 INFO [MASTER_SERVER_OPERATIONS-HOST7512:16000-4] master.AssignmentManager: Bulk assigning 73 region(s) across 3 server(s), round-robin=true 2016-02-17 15:57:01,353 INFO [host7512.mydomain,16000,1455618257882-GeneralBulkAssigner-0] master.AssignmentManager: Assigning 24 region(s) to host7515.mydomain,16020,1455691124033 2016-02-17 15:57:01,353 INFO [host7512.mydomain,16000,1455618257882-GeneralBulkAssigner-1] master.AssignmentManager: Assigning 24 region(s) to host7517.mydomain,16020,1455618296942 2016-02-17 15:57:01,353 INFO [host7512.mydomain,16000,1455618257882-GeneralBulkAssigner-2] master.AssignmentManager: Assigning 25 region(s) to host7518.mydomain,16020,1455618298475 ... 
2016-02-17 15:57:11,005 INFO [AM.ZK.Worker-pool2-t71] master.RegionStates: Offlined 50cc594cd9e74287ddd5b960bfa2d8e9 from host7516.mydomain,16020,1455618299902 2016-02-17 15:57:11,005 INFO [MASTER_SERVER_OPERATIONS-HOST7512:16000-4] master.AssignmentManager: Bulk assigning done 2016-02-17 15:57:11,005 INFO [MASTER_SERVER_OPERATIONS-HOST7512:16000-4] handler.ServerShutdownHandler: Finished processing of shutdown of host7516.mydomain,16020,1455618299902 ---- ---- NameNode log --- 2016-02-17 15:41:16,008 INFO namenode.FSEditLog (FSEditLog.java:printStatistics(699)) - Number of transactions: 45 Total time for transactions(ms): 4 Number of transactions batched in Syncs: 0 Number of syncs: 42 SyncTimes(ms): 35 11 2016-02-17 15:41:16,383 INFO namenode.FSNamesystem (FSNamesystem.java:recoverLeaseInternal(2907)) - recoverLease: [Lease. Holder: DFSClient_NONMAPREDUCE_1038296970_1, pendingcreates: 2], src=/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting/host7516.mydomain%2C16020%2C1455618299902..meta.1455690873222.meta from client DFSClient_NONMAPREDUCE_1038296970_1 2016-02-17 15:41:16,383 INFO namenode.FSNamesystem (FSNamesystem.java:internalReleaseLease(4036)) - Recovering [Lease. Holder: DFSClient_NONMAPREDUCE_1038296970_1, pendingcreates: 2], src=/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting/host7516.mydomain%2C16020%2C1455618299902..meta.1455690873222.meta 2016-02-17 15:41:16,383 WARN hdfs.StateChange (FSNamesystem.java:internalReleaseLease(4157)) - DIR* NameSystem.internalReleaseLease: File /apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting/host7516.mydomain%2C16020%2C1455618299902..meta.1455690873222.meta has not been closed. Lease recovery is in progress. 
RecoveryId = 13180 for block blk_1073753851_13131{UCState=UNDER_RECOVERY, truncateBlock=null, primaryNodeIndex=2, replicas=[ReplicaUC[[DISK]DS-9787b201-fc64-450e-a20f-dcc79fb94b6f:NORMAL:192.168.184.73:50010|RBW], ReplicaUC[[DISK]DS-52f00167-e863-48d8-8f98-3b613f774b0c:NORMAL:192.168.185.249:50010|RBW], ReplicaUC[[DISK]DS-f8c4c3b2-a7cc-40e0-9ebb-2b9c352c0d28:NORMAL:192.168.184.85:50010|RBW]]} 2016-02-17 15:41:18,588 INFO BlockStateChange (UnderReplicatedBlocks.java:chooseUnderReplicatedBlocks(394)) - chooseUnderReplicatedBlocks selected 8 blocks at priority level 2; Total=8 Reset bookmarks? false 2016-02-17 15:41:18,588 INFO BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1527)) - BLOCK* neededReplications = 75, pendingReplications = 0. 2016-02-17 15:41:18,588 INFO blockmanagement.BlockManager (BlockManager.java:computeReplicationWorkForBlocks(1534)) - Blocks chosen but could not be replicated = 8; of which 8 have no target, 0 have no source, 0 are UC, 0 are abandoned, 0 already have enough replicas. 
2016-02-17 15:41:18,960 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user anonymous ...(the same WARN repeated between 15:41:18,961 and 15:41:19,099)... 2016-02-17 15:41:19,307 INFO blockmanagement.DatanodeManager (DatanodeManager.java:handleHeartbeat(1426)) - Skipped stale nodes for recovery : 1 2016-02-17 15:41:19,694 INFO namenode.FSNamesystem 
(FSNamesystem.java:commitBlockSynchronization(4269)) - commitBlockSynchronization(oldBlock=BP-1843495860-192.168.189.219-1453778090403:blk_1073753851_13131, newgenerationstamp=13180, newlength=60238, newtargets=[192.168.185.249:50010, 192.168.184.85:50010], closeFile=true, deleteBlock=false) 2016-02-17 15:41:19,696 INFO namenode.FSNamesystem (FSNamesystem.java:commitBlockSynchronization(4434)) - commitBlockSynchronization(oldBlock=BP-1843495860-192.168.189.219-1453778090403:blk_1073753851_13131, file=/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting/host7516.mydomain%2C16020%2C1455618299902..meta.1455690873222.meta, newgenerationstamp=13180, newlength=60238, newtargets=[192.168.185.249:50010, 192.168.184.85:50010]) successful 2016-02-17 15:41:20,459 INFO hdfs.StateChange (FSNamesystem.java:saveAllocatedBlock(3583)) - BLOCK* allocate blk_1073753897_13181{UCState=UNDER_CONSTRUCTION, truncateBlock=null, primaryNodeIndex=-1, replicas=[ReplicaUC[[DISK]DS-f8c4c3b2-a7cc-40e0-9ebb-2b9c352c0d28:NORMAL:192.168.184.85:50010|RBW], ReplicaUC[[DISK]DS-882dca14-8651-4ffe-a102-8beebc774575:NORMAL:192.168.189.208:50010|RBW], ReplicaUC[[DISK]DS-52f00167-e863-48d8-8f98-3b613f774b0c:NORMAL:192.168.185.249:50010|RBW]]} for /apps/hbase/data/data/hbase/meta/1588230740/recovered.edits/0000000000000009659.temp 2016-02-17 15:41:20,474 INFO hdfs.StateChange (FSNamesystem.java:completeFile(3503)) - DIR* completeFile: /apps/hbase/data/data/hbase/meta/1588230740/recovered.edits/0000000000000009659.temp is closed by DFSClient_NONMAPREDUCE_443617401_1 2016-02-17 15:41:20,488 INFO ipc.Server (Server.java:run(2172)) - IPC Server handler 8 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.delete from 192.168.189.219:51393 Call#8040 Retry#0 org.apache.hadoop.fs.PathIsNotEmptyDirectoryException: `/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting is non empty': Directory is not empty at 
org.apache.hadoop.hdfs.server.namenode.FSDirDeleteOp.delete(FSDirDeleteOp.java:89) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:3723) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.delete(NameNodeRpcServer.java:1017) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.delete(ClientNamenodeProtocolServerSideTranslatorPB.java:611) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145) 2016-02-17 15:41:21,588 INFO BlockStateChange (UnderReplicatedBlocks.java:chooseUnderReplicatedBlocks(394)) - chooseUnderReplicatedBlocks selected 8 blocks at priority level 2; Total=8 Reset bookmarks? false 2016-02-17 15:41:21,588 INFO BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1527)) - BLOCK* neededReplications = 76, pendingReplications = 0. ... 
2016-02-17 15:56:15,384 INFO namenode.FSNamesystem (FSNamesystem.java:rollEditLog(5937)) - Roll Edit Log from 192.168.184.72 2016-02-17 15:56:15,384 INFO namenode.FSEditLog (FSEditLog.java:rollEditLog(1202)) - Rolling edit logs 2016-02-17 15:56:15,384 INFO namenode.FSEditLog (FSEditLog.java:endCurrentLogSegment(1258)) - Ending log segment 369158 2016-02-17 15:56:15,395 INFO namenode.FSEditLog (FSEditLog.java:printStatistics(699)) - Number of transactions: 14 Total time for transactions(ms): 5 Number of transactions batched in Syncs: 0 Number of syncs: 14 SyncTimes(ms): 26 6 2016-02-17 15:56:15,421 INFO namenode.FileJournalManager (FileJournalManager.java:finalizeLogSegment(133)) - Finalizing edits file /hadoop/hdfs/namenode/current/edits_inprogress_0000000000000369158 -> /hadoop/hdfs/namenode/current/edits_0000000000000369158-0000000000000369171 2016-02-17 15:56:15,421 INFO namenode.FSEditLog (FSEditLog.java:startLogSegment(1218)) - Starting log segment at 369172 2016-02-17 15:56:15,764 INFO BlockStateChange (UnderReplicatedBlocks.java:chooseUnderReplicatedBlocks(394)) - chooseUnderReplicatedBlocks selected 3 blocks at priority level 1; 3 blocks at priority level 2; Total=6 Reset bookmarks? false 2016-02-17 15:56:15,764 INFO BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1527)) - BLOCK* neededReplications = 91, pendingReplications = 4. 2016-02-17 15:56:15,764 INFO blockmanagement.BlockManager (BlockManager.java:computeReplicationWorkForBlocks(1534)) - Blocks chosen but could not be replicated = 3; of which 3 have no target, 0 have no source, 0 are UC, 0 are abandoned, 0 already have enough replicas. 2016-02-17 15:56:18,764 INFO BlockStateChange (UnderReplicatedBlocks.java:chooseUnderReplicatedBlocks(394)) - chooseUnderReplicatedBlocks selected 6 blocks at priority level 2; Total=6 Reset bookmarks? 
false 2016-02-17 15:56:18,765 INFO BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1527)) - BLOCK* neededReplications = 87, pendingReplications = 5. 2016-02-17 15:56:18,765 INFO blockmanagement.BlockManager (BlockManager.java:computeReplicationWorkForBlocks(1534)) - Blocks chosen but could not be replicated = 2; of which 0 have no target, 2 have no source, 0 are UC, 0 are abandoned, 0 already have enough replicas. ... 2016-02-17 15:56:55,688 INFO namenode.FSNamesystem (FSNamesystem.java:recoverLeaseInternal(2907)) - recoverLease: [Lease. Holder: DFSClient_NONMAPREDUCE_1038296970_1, pendingcreates: 1], src=/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting/host7516.mydomain%2C16020%2C1455618299902.default.1455690304230 from client DFSClient_NONMAPREDUCE_1038296970_1 2016-02-17 15:56:55,688 INFO namenode.FSNamesystem (FSNamesystem.java:internalReleaseLease(4036)) - Recovering [Lease. Holder: DFSClient_NONMAPREDUCE_1038296970_1, pendingcreates: 1], src=/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting/host7516.mydomain%2C16020%2C1455618299902.default.1455690304230 2016-02-17 15:56:55,688 WARN hdfs.StateChange (FSNamesystem.java:internalReleaseLease(4157)) - DIR* NameSystem.internalReleaseLease: File /apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting/host7516.mydomain%2C16020%2C1455618299902.default.1455690304230 has not been closed. Lease recovery is in progress. 
RecoveryId = 13183 for block blk_1073753849_13127{UCState=UNDER_RECOVERY, truncateBlock=null, primaryNodeIndex=2, replicas=[ReplicaUC[[DISK]DS-9787b201-fc64-450e-a20f-dcc79fb94b6f:NORMAL:192.168.184.73:50010|RBW], ReplicaUC[[DISK]DS-f8c4c3b2-a7cc-40e0-9ebb-2b9c352c0d28:NORMAL:192.168.184.85:50010|RBW], ReplicaUC[[DISK]DS-52f00167-e863-48d8-8f98-3b613f774b0c:NORMAL:192.168.185.249:50010|RBW]]} 2016-02-17 15:56:58,687 INFO namenode.FSNamesystem (FSNamesystem.java:commitBlockSynchronization(4434)) - commitBlockSynchronization(oldBlock=BP-1843495860-192.168.189.219-1453778090403:blk_1073753849_13127, file=/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting/host7516.mydomain%2C16020%2C1455618299902.default.1455690304230, newgenerationstamp=13183, newlength=118884436, newtargets=[192.168.184.85:50010, 192.168.185.249:50010]) successful ----

Thanks,
Hironori Ogibayashi
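P.S. For reference, here is a sketch of the stale-DataNode and failover settings from the MTTR section of the HBase book that I mentioned above. The property names are the ones documented there; the values shown are only illustrative, not the exact ones on my cluster:

```xml
<!-- hdfs-site.xml: mark a dead DataNode "stale" quickly so the NameNode
     avoids routing reads/writes (and lease recovery) to it.
     Values are illustrative. -->
<property>
  <name>dfs.namenode.avoid.read.stale.datanode</name>
  <value>true</value>
</property>
<property>
  <name>dfs.namenode.avoid.write.stale.datanode</name>
  <value>true</value>
</property>
<property>
  <name>dfs.namenode.stale.datanode.interval</name>
  <value>20000</value> <!-- ms; default is 30000 -->
</property>

<!-- hbase-site.xml: detect RegionServer death and give up on it faster. -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>30000</value> <!-- ms; default is 90000 -->
</property>
<property>
  <name>hbase.client.retries.number</name>
  <value>10</value>
</property>
```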