Thanks Samir,

I tried the same operation again, and this time the cluster recovered correctly, so I compared the logs.
Here is the host7517 regionserver log from the cluster-not-responding case. The "NB dead servers" and "Log hdfs://... still exists at ..." lines appear again and again.

----
2016-02-17 15:41:49,160 INFO [ReplicationExecutor-0] replication.ReplicationQueuesZKImpl: Atomically moving host7516.mydomain,16020,1455618299902's wals to my queue
2016-02-17 15:41:49,171 INFO [ReplicationExecutor-0] replication.ReplicationQueuesZKImpl: Atomically moved the dead regionserver logs.
2016-02-17 15:41:49,268 INFO [ReplicationExecutor-0] zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x1cfa9c9f connecting to ZooKeeper ensemble=repl7503.mydomain:2181,repl7504.mydomain:2181,repl7502.mydomain:2181
2016-02-17 15:41:49,268 INFO [ReplicationExecutor-0] zookeeper.ZooKeeper: Initiating client connection, connectString=repl7503.mydomain:2181,repl7504.mydomain:2181,repl7502.mydomain:2181 sessionTimeout=90000 watcher=hconnection-0x1cfa9c9f0x0, quorum=repl7503.mydomain:2181,repl7504.mydomain:2181,repl7502.mydomain:2181, baseZNode=/hbase-unsecure
2016-02-17 15:41:49,276 INFO [ReplicationExecutor-0-SendThread(192.168.167.71:2181)] zookeeper.ClientCnxn: Opening socket connection to server 192.168.167.71/192.168.167.71:2181. Will not attempt to authenticate using SASL (unknown error)
2016-02-17 15:41:49,276 INFO [ReplicationExecutor-0-SendThread(192.168.167.71:2181)] zookeeper.ClientCnxn: Socket connection established to 192.168.167.71/192.168.167.71:2181, initiating session
2016-02-17 15:41:49,279 INFO [ReplicationExecutor-0-SendThread(192.168.167.71:2181)] zookeeper.ClientCnxn: Session establishment complete on server 192.168.167.71/192.168.167.71:2181, sessionid = 0x25210f2711800fb, negotiated timeout = 40000
2016-02-17 15:41:49,281 INFO [ReplicationExecutor-0.replicationSource,1-host7516.mydomain,16020,1455618299902] zookeeper.RecoverableZooKeeper: Process identifier=connection to cluster: 1-host7516.mydomain,16020,1455618299902 connecting to ZooKeeper ensemble=repl7503.mydomain:2181,repl7504.mydomain:2181,repl7502.mydomain:2181
2016-02-17 15:41:49,281 INFO [ReplicationExecutor-0.replicationSource,1-host7516.mydomain,16020,1455618299902] zookeeper.ZooKeeper: Initiating client connection, connectString=repl7503.mydomain:2181,repl7504.mydomain:2181,repl7502.mydomain:2181 sessionTimeout=90000 watcher=connection to cluster: 1-host7516.mydomain,16020,14556182999020x0, quorum=repl7503.mydomain:2181,repl7504.mydomain:2181,repl7502.mydomain:2181, baseZNode=/hbase-unsecure
2016-02-17 15:41:49,283 INFO [ReplicationExecutor-0.replicationSource,1-host7516.mydomain,16020,1455618299902-SendThread(192.168.189.222:2181)] zookeeper.ClientCnxn: Opening socket connection to server 192.168.189.222/192.168.189.222:2181. Will not attempt to authenticate using SASL (unknown error)
2016-02-17 15:41:49,283 INFO [ReplicationExecutor-0.replicationSource,1-host7516.mydomain,16020,1455618299902-SendThread(192.168.189.222:2181)] zookeeper.ClientCnxn: Socket connection established to 192.168.189.222/192.168.189.222:2181, initiating session
2016-02-17 15:41:49,286 INFO [ReplicationExecutor-0.replicationSource,1-host7516.mydomain,16020,1455618299902-SendThread(192.168.189.222:2181)] zookeeper.ClientCnxn: Session establishment complete on server 192.168.189.222/192.168.189.222:2181, sessionid = 0x15210f2693c00d1, negotiated timeout = 40000
2016-02-17 15:41:49,286 INFO [ReplicationExecutor-0.replicationSource,1-host7516.mydomain,16020,1455618299902] regionserver.ReplicationSource: Replicating 5d5e12bc-25ce-411f-8393-2b0f3ab0a779 -> 0cd74950-c586-4fcb-aca5-1db49ce76d14
2016-02-17 15:41:49,290 INFO [ReplicationExecutor-0.replicationSource,1-host7516.mydomain,16020,1455618299902] regionserver.ReplicationSource: NB dead servers : 1
2016-02-17 15:41:49,291 INFO [ReplicationExecutor-0.replicationSource,1-host7516.mydomain,16020,1455618299902] regionserver.ReplicationSource: Possible location hdfs://hdpts/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902/host7516.mydomain%252C16020%252C1455618299902.default.1455690304230
2016-02-17 15:41:49,291 INFO [ReplicationExecutor-0.replicationSource,1-host7516.mydomain,16020,1455618299902] regionserver.ReplicationSource: Possible location hdfs://hdpts/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting/host7516.mydomain%252C16020%252C1455618299902.default.1455690304230
2016-02-17 15:41:49,292 INFO [ReplicationExecutor-0.replicationSource,1-host7516.mydomain,16020,1455618299902] regionserver.ReplicationSource: Log hdfs://hdpts/apps/hbase/data/oldWALs/host7516.mydomain%2C16020%2C1455618299902.default.1455690304230 still exists at hdfs://hdpts/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting/host7516.mydomain%2C16020%2C1455618299902.default.1455690304230
----
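(Side note: while the source is looping like this, the recovered queue should still be visible under the replication znodes, so it can be watched directly. A minimal sketch with zkCli.sh from the ZooKeeper client package, assuming our base znode /hbase-unsecure and the HBase 1.1 znode layout; zk7501.mydomain is just a placeholder for one of the source cluster's ZooKeeper hosts:)

----
# List the replication queues owned by host7517. A recovered queue shows up
# as "<peerId>-<dead server name>", e.g. 1-host7516.mydomain,16020,1455618299902
zkCli.sh -server zk7501.mydomain:2181 \
  ls /hbase-unsecure/replication/rs/host7517.mydomain,16020,1455618296942

# Each child of that queue is a WAL still waiting to be shipped; the znode
# data holds the byte offset the source has replicated up to.
zkCli.sh -server zk7501.mydomain:2181 \
  ls /hbase-unsecure/replication/rs/host7517.mydomain,16020,1455618296942/1-host7516.mydomain,16020,1455618299902
----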
Finally, at 15:56:55, the master server started splitting the logs of host7516.mydomain,16020,1455618299902. After that, the cluster recovered.

---
2016-02-17 15:56:55,625 INFO [MASTER_SERVER_OPERATIONS-HOST7512:16000-4] handler.ServerShutdownHandler: Splitting logs for host7516.mydomain,16020,1455618299902 before assignment; region count=73
2016-02-17 15:56:55,627 INFO [MASTER_SERVER_OPERATIONS-HOST7512:16000-4] master.SplitLogManager: dead splitlog workers [host7516.mydomain,16020,1455618299902]
2016-02-17 15:56:55,628 INFO [MASTER_SERVER_OPERATIONS-HOST7512:16000-4] master.SplitLogManager: started splitting 1 logs in [hdfs://hdpts/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting] for [host7516.mydomain,16020,1455618299902]
---
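(That 15-minute window before this split is what I would like to shrink. For reference, the MTTR section of the reference guide mostly points at HDFS-side failure detection; the snippet below is only an illustrative sketch with example values, not our exact configuration:)

----
<!-- hdfs-site.xml: stop scheduling reads/writes to a DataNode that has not
     heartbeated for ~30s, instead of waiting for the full dead-node timeout -->
<property>
  <name>dfs.namenode.avoid.read.stale.datanode</name>
  <value>true</value>
</property>
<property>
  <name>dfs.namenode.avoid.write.stale.datanode</name>
  <value>true</value>
</property>
<property>
  <name>dfs.namenode.stale.datanode.interval</name>
  <value>30000</value>
</property>

<!-- hbase-site.xml: how long the WAL splitter waits on a recoverLease
     invocation before issuing another one (64s is the default) -->
<property>
  <name>hbase.lease.recovery.dfs.timeout</name>
  <value>64000</value>
</property>
----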
Here is the host7517 regionserver log from the correctly-recovered case. (This time I shut down host7515, which was hosting the hbase:meta region.) I can see the "Done with the recovered queue" log line.

----
2016-02-23 16:52:24,632 INFO [ReplicationExecutor-0] replication.ReplicationQueuesZKImpl: Atomically moving host7515.mydomain,16020,1455786190819's wals to my queue
2016-02-23 16:52:24,637 INFO [ReplicationExecutor-0] replication.ReplicationQueuesZKImpl: Atomically moved the dead regionserver logs.
2016-02-23 16:52:24,705 INFO [ReplicationExecutor-0] zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x4610fe09 connecting to ZooKeeper ensemble=repl7503.mydomain:2181,repl7504.mydomain:2181,repl7502.mydomain:2181
2016-02-23 16:52:24,705 INFO [ReplicationExecutor-0] zookeeper.ZooKeeper: Initiating client connection, connectString=repl7503.mydomain:2181,repl7504.mydomain:2181,repl7502.mydomain:2181 sessionTimeout=90000 watcher=hconnection-0x4610fe090x0, quorum=repl7503.mydomain:2181,repl7504.mydomain:2181,repl7502.mydomain:2181, baseZNode=/hbase-unsecure
2016-02-23 16:52:24,714 INFO [ReplicationExecutor-0-SendThread(192.168.189.222:2181)] zookeeper.ClientCnxn: Opening socket connection to server 192.168.189.222/192.168.189.222:2181. Will not attempt to authenticate using SASL (unknown error)
2016-02-23 16:52:24,715 INFO [ReplicationExecutor-0-SendThread(192.168.189.222:2181)] zookeeper.ClientCnxn: Socket connection established to 192.168.189.222/192.168.189.222:2181, initiating session
2016-02-23 16:52:24,718 INFO [ReplicationExecutor-0-SendThread(192.168.189.222:2181)] zookeeper.ClientCnxn: Session establishment complete on server 192.168.189.222/192.168.189.222:2181, sessionid = 0x15210f2693c00e2, negotiated timeout = 40000
2016-02-23 16:52:24,719 INFO [ReplicationExecutor-0.replicationSource,1-host7515.mydomain,16020,1455786190819] zookeeper.RecoverableZooKeeper: Process identifier=connection to cluster: 1-host7515.mydomain,16020,1455786190819 connecting to ZooKeeper ensemble=repl7503.mydomain:2181,repl7504.mydomain:2181,repl7502.mydomain:2181
2016-02-23 16:52:24,720 INFO [ReplicationExecutor-0.replicationSource,1-host7515.mydomain,16020,1455786190819] zookeeper.ZooKeeper: Initiating client connection, connectString=repl7503.mydomain:2181,repl7504.mydomain:2181,repl7502.mydomain:2181 sessionTimeout=90000 watcher=connection to cluster: 1-host7515.mydomain,16020,14557861908190x0, quorum=repl7503.mydomain:2181,repl7504.mydomain:2181,repl7502.mydomain:2181, baseZNode=/hbase-unsecure
2016-02-23 16:52:24,722 INFO [ReplicationExecutor-0.replicationSource,1-host7515.mydomain,16020,1455786190819-SendThread(192.168.167.71:2181)] zookeeper.ClientCnxn: Opening socket connection to server 192.168.167.71/192.168.167.71:2181. Will not attempt to authenticate using SASL (unknown error)
2016-02-23 16:52:24,722 INFO [ReplicationExecutor-0.replicationSource,1-host7515.mydomain,16020,1455786190819-SendThread(192.168.167.71:2181)] zookeeper.ClientCnxn: Socket connection established to 192.168.167.71/192.168.167.71:2181, initiating session
2016-02-23 16:52:24,724 INFO [ReplicationExecutor-0.replicationSource,1-host7515.mydomain,16020,1455786190819-SendThread(192.168.167.71:2181)] zookeeper.ClientCnxn: Session establishment complete on server 192.168.167.71/192.168.167.71:2181, sessionid = 0x25210f271180110, negotiated timeout = 40000
2016-02-23 16:52:24,725 INFO [ReplicationExecutor-0.replicationSource,1-host7515.mydomain,16020,1455786190819] regionserver.ReplicationSource: Replicating 5d5e12bc-25ce-411f-8393-2b0f3ab0a779 -> 0cd74950-c586-4fcb-aca5-1db49ce76d14
2016-02-23 16:52:25,747 INFO [ReplicationExecutor-0.replicationSource,1-host7515.mydomain,16020,1455786190819] regionserver.ReplicationSourceManager: Done with the recovered queue 1-host7515.mydomain,16020,1455786190819
2016-02-23 16:52:25,750 INFO [ReplicationExecutor-0.replicationSource,1-host7515.mydomain,16020,1455786190819] regionserver.ReplicationSource: Finished recovering the queue with the following stats Total replicated edits: 0, currently replicating from: hdfs://hdpts/apps/hbase/data/oldWALs/host7515.mydomain%2C16020%2C1455786190819.default.1456212089608 at position: 81519135
----

Actually, there was a replication setting on the 'ns2:usertbl' table, which has just one region, and that region was on host7516. I have not found the root cause, but this is what I noticed.
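(For completeness, the peer and the table's replication scope can be confirmed from the shell; a quick sketch using the standard commands:)

----
# Show the configured replication peers (peer 1 here points at the repl750x ensemble)
echo "list_peers" | hbase shell

# Column families with REPLICATION_SCOPE => '1' are the ones that feed the
# replication queues for peer 1
echo "describe 'ns2:usertbl'" | hbase shell
----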
Regards,
Hironori

2016-02-19 20:51 GMT+09:00 Samir Ahmic <ahmic.sa...@gmail.com>:
> I will search for related jiras; maybe you have hit some known bug. If you
> find something interesting in logs, please post it here so we can open a
> ticket if needed.
>
> Regards
> Samir
>
> On Fri, Feb 19, 2016 at 12:30 PM, おぎばやしひろのり <ogibaya...@gmail.com> wrote:
>
>> Thanks Samir,
>>
>> I checked that blog; it helped me to understand what is going on when a
>> datanode dies. But I am still not sure why it took 15 minutes.
>> I will check the server logs again, carefully.
>>
>> Regards,
>> Hironori Ogibayashi
>>
>> 2016-02-18 22:56 GMT+09:00 Samir Ahmic <ahmic.sa...@gmail.com>:
>> > Thanks for explaining, Hironori.
>> > By shutting down the DataNode process on host7516 we also have to add
>> > Hadoop recovery time to the total recovery time.
>> > Here is a great blog explaining how the Hadoop recovery process works:
>> >
>> > http://blog.cloudera.com/blog/2015/02/understanding-hdfs-recovery-processes-part-1/
>> >
>> > Regards
>> > Samir
>> >
>> > On Thu, Feb 18, 2016 at 11:04 AM, おぎばやしひろのり <ogibaya...@gmail.com> wrote:
>> >
>> >> Samir,
>> >>
>> >> Thank you for your advice.
>> >> The actual operation I used to shut down host7516 was just clicking the
>> >> "stop" button on our VM console. I don't know the internals, but I can
>> >> see some process termination messages in /var/log/messages, so it looks
>> >> like a kind of graceful shutdown.
>> >> About reproducing: yes, I can, but not every time; maybe once in three
>> >> or four attempts. I am not sure what makes the difference.
>> >>
>> >> I also checked the regionserver logs. There were warnings related to
>> >> DataNode access; this is because there was a DataNode on the shut-down
>> >> host7516, too.
>> >> Regarding log split, I couldn't find any errors. Both hosts finished
>> >> the task within a few seconds.
>> >>
>> >> --- host7517
>> >> 2016-02-17 15:39:59,002 INFO [LruBlockCacheStatsExecutor] hfile.LruBlockCache: totalSize=420.92 KB, freeSize=399.19 MB, max=399.60 MB, blockCount=0, accesses=24227, hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0, cachingHitsRatio=0, evictions=7289, evicted=0, evictedPerRun=0.0
>> >> 2016-02-17 15:39:59,869 INFO [HOST7517:16020Replication Statistics #0] regionserver.Replication: Normal source for cluster 1: Total replicated edits: 0, currently replicating from: hdfs://hdpts/apps/hbase/data/WALs/host7517.mydomain,16020,1455618296942/host7517.mydomain%2C16020%2C1455618296942.default.1455690301451 at position: 96609590
>> >>
>> >> 2016-02-17 15:40:50,953 WARN [ResponseProcessor for block BP-1843495860-192.168.189.219-1453778090403:blk_1073753847_13125] hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-1843495860-192.168.189.219-1453778090403:blk_1073753847_13125
>> >> java.io.IOException: Bad response ERROR for block BP-1843495860-192.168.189.219-1453778090403:blk_1073753847_13125 from datanode DatanodeInfoWithStorage[192.168.184.73:50010,DS-9787b201-fc64-450e-a20f-dcc79fb94b6f,DISK]
>> >>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:786)
>> >> 2016-02-17 15:40:50,954 WARN [DataStreamer for file /apps/hbase/data/WALs/host7517.mydomain,16020,1455618296942/host7517.mydomain%2C16020%2C1455618296942.default.1455690301451 block BP-1843495860-192.168.189.219-1453778090403:blk_1073753847_13125] hdfs.DFSClient: Error Recovery for block BP-1843495860-192.168.189.219-1453778090403:blk_1073753847_13125 in pipeline DatanodeInfoWithStorage[192.168.185.249:50010,DS-52f00167-e863-48d8-8f98-3b613f774b0c,DISK], DatanodeInfoWithStorage[192.168.184.85:50010,DS-f8c4c3b2-a7cc-40e0-9ebb-2b9c352c0d28,DISK], DatanodeInfoWithStorage[192.168.184.73:50010,DS-9787b201-fc64-450e-a20f-dcc79fb94b6f,DISK]: bad datanode DatanodeInfoWithStorage[192.168.184.73:50010,DS-9787b201-fc64-450e-a20f-dcc79fb94b6f,DISK]
>> >> ...
>> >> 2016-02-17 15:56:55,633 INFO [SplitLogWorker-HOST7517:16020] coordination.ZkSplitLogWorkerCoordination: worker host7517.mydomain,16020,1455618296942 acquired task /hbase-unsecure/splitWAL/WALs%2Fhost7516.mydomain%2C16020%2C1455618299902-splitting%2Fhost7516.mydomain%252C16020%252C1455618299902.default.1455690304230
>> >> 2016-02-17 15:56:55,677 INFO [RS_LOG_REPLAY_OPS-HOST7517:16020-0] wal.WALSplitter: Splitting wal: hdfs://hdpts/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting/host7516.mydomain%2C16020%2C1455618299902.default.1455690304230, length=83
>> >> 2016-02-17 15:56:55,677 INFO [RS_LOG_REPLAY_OPS-HOST7517:16020-0] wal.WALSplitter: DistributedLogReplay = false
>> >> 2016-02-17 15:56:55,685 INFO [RS_LOG_REPLAY_OPS-HOST7517:16020-0] util.FSHDFSUtils: Recovering lease on dfs file hdfs://hdpts/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting/host7516.mydomain%2C16020%2C1455618299902.default.1455690304230
>> >> 2016-02-17 15:56:55,696 INFO [RS_LOG_REPLAY_OPS-HOST7517:16020-0] util.FSHDFSUtils: recoverLease=false, attempt=0 on file=hdfs://hdpts/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting/host7516.mydomain%2C16020%2C1455618299902.default.1455690304230 after 11ms
>> >> 2016-02-17 15:56:59,698 INFO [RS_LOG_REPLAY_OPS-HOST7517:16020-0] util.FSHDFSUtils: recoverLease=true, attempt=1 on file=hdfs://hdpts/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting/host7516.mydomain%2C16020
>> >> 2016-02-17 15:56:59,771 INFO [RS_LOG_REPLAY_OPS-HOST7517:16020-0-Writer-1] wal.WALSplitter: Creating writer path=hdfs://hdpts/apps/hbase/data/data/default/usertable/64199c31957c01b5bd9ee50b02e1f7fd/recovered.edits/0000000000000542072.temp region=64199c31957c01b5bd9ee50b02e1f7fd
>> >> ...(similar lines)
>> >> 2016-02-17 15:57:01,001 INFO [RS_LOG_REPLAY_OPS-HOST7517:16020-0] wal.WALSplitter: Split writers finished
>> >> 2016-02-17 15:57:01,034 INFO [split-log-closeStream-3] wal.WALSplitter: Rename hdfs://hdpts/apps/hbase/data/data/default/usertable/06ed7277b7b9539a3ba597e0041acb12/recovered.edits/0000000000000054100.temp to hdfs://hdpts/apps/hbase/data/data/default/usertable/06ed7277b7b9539a3ba597e0041acb12/recovered.edits/0000000000000055498
>> >> ...(similar lines)
>> >> 2016-02-17 15:57:01,332 INFO [RS_LOG_REPLAY_OPS-HOST7517:16020-0] wal.WALSplitter: Processed 73326 edits across 53 regions; edits skipped=0; log file=hdfs://hdpts/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting/host7516.mydomain%2C16020%2C1455618299902.default.1455690304230, length=83, corrupted=false, progress failed=false
>> >> 2016-02-17 15:57:01,336 INFO [RS_LOG_REPLAY_OPS-HOST7517:16020-0] coordination.ZkSplitLogWorkerCoordination: successfully transitioned task /hbase-unsecure/splitWAL/WALs%2Fhost7516.mydomain%2C16020%2C1455618299902-splitting%2Fhost7516.mydomain%252C16020%252C1455618299902.default.1455690304230 to final state DONE host7517.mydomain,16020,1455618296942
>> >> 2016-02-17 15:57:01,336 INFO [RS_LOG_REPLAY_OPS-HOST7517:16020-0] handler.WALSplitterHandler: worker host7517.mydomain,16020,1455618296942 done with task org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@7b0d8dc7 in 5691ms
>> >>
>> >> --- host7518
>> >>
>> >> 2016-02-17 15:40:55,957 WARN [ResponseProcessor for block BP-1843495860-192.168.189.219-1453778090403:blk_1073753896_13177] hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-1843495860-192.168.189.219-1453778090403:blk_1073753896_13177
>> >> java.io.IOException: Bad response ERROR for block BP-1843495860-192.168.189.219-1453778090403:blk_1073753896_13177 from datanode DatanodeInfoWithStorage[192.168.184.73:50010,DS-9787b201-fc64-450e-a20f-dcc79fb94b6f,DISK]
>> >>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:786)
>> >> 2016-02-17 15:40:55,959 WARN [DataStreamer for file /apps/hbase/data/WALs/host7518.mydomain,16020,1455618298475/host7518.mydomain%2C16020%2C1455618298475.default.1455691227152 block BP-1843495860-192.168.189.219-1453778090403:blk_1073753896_13177] hdfs.DFSClient: Error Recovery for block BP-1843495860-192.168.189.219-1453778090403:blk_1073753896_13177 in pipeline DatanodeInfoWithStorage[192.168.184.85:50010,DS-f8c4c3b2-a7cc-40e0-9ebb-2b9c352c0d28,DISK], DatanodeInfoWithStorage[192.168.184.73:50010,DS-9787b201-fc64-450e-a20f-dcc79fb94b6f,DISK], DatanodeInfoWithStorage[192.168.185.249:50010,DS-52f00167-e863-48d8-8f98-3b613f774b0c,DISK]: bad datanode DatanodeInfoWithStorage[192.168.184.73:50010,DS-9787b201-fc64-450e-a20f-dcc79fb94b6f,DISK]
>> >> 2016-02-17 15:40:56,043 INFO [sync.4] wal.FSHLog: Slow sync cost: 3774 ms, current pipeline: [DatanodeInfoWithStorage[192.168.184.85:50010,DS-f8c4c3b2-a7cc-40e0-9ebb-2b9c352c0d28,DISK], DatanodeInfoWithStorage[192.168.185.249:50010,DS-52f00167-e863-48d8-8f98-3b613f774b0c,DISK], DatanodeInfoWithStorage[192.168.189.208:50010,DS-882dca14-8651-4ffe-a102-8beebc774575,DISK]]
>> >> 2016-02-17 15:40:56,044 INFO [sync.3] wal.FSHLog: Slow sync cost: 20103 ms, current pipeline: [DatanodeInfoWithStorage[192.168.184.85:50010,DS-f8c4c3b2-a7cc-40e0-9ebb-2b9c352c0d28,DISK], DatanodeInfoWithStorage[192.168.185.249:50010,DS-52f00167-e863-48d8-8f98-3b613f774b0c,DISK], DatanodeInfoWithStorage[192.168.189.208:50010,DS-882dca14-8651-4ffe-a102-8beebc774575,DISK]]
>> >> ...
>> >> 2016-02-17 15:41:16,016 INFO [SplitLogWorker-HOST7518:16020] coordination.ZkSplitLogWorkerCoordination: worker host7518.mydomain,16020,1455618298475 acquired task /hbase-unsecure/splitWAL/WALs%2Fhost7516.mydomain%2C16020%2C1455618299902-splitting%2Fhost7516.mydomain%252C16020%252C1455618299902..meta.1455690873222.meta
>> >> 2016-02-17 15:41:16,194 INFO [RS_LOG_REPLAY_OPS-HOST7518:16020-0] wal.WALSplitter: Splitting wal: hdfs://hdpts/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting/host7516.mydomain%2C16020%2C1455618299902..meta.1455690873222.meta, length=83
>> >> 2016-02-17 15:41:16,194 INFO [RS_LOG_REPLAY_OPS-HOST7518:16020-0] wal.WALSplitter: DistributedLogReplay = false
>> >> 2016-02-17 15:41:16,252 INFO [RS_LOG_REPLAY_OPS-HOST7518:16020-0] util.FSHDFSUtils: Recovering lease on dfs file hdfs://hdpts/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting/host7516.mydomain%2C16020%2C1455618299902..meta.1455690873222.meta
>> >> 2016-02-17 15:41:16,396 INFO [RS_LOG_REPLAY_OPS-HOST7518:16020-0] util.FSHDFSUtils: recoverLease=false, attempt=0 on file=hdfs://hdpts/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting/host7516.mydomain%2C16020%2C1455618299902..meta.1455690873222.meta after 144ms
>> >> 2016-02-17 15:41:20,398 INFO [RS_LOG_REPLAY_OPS-HOST7518:16020-0] util.FSHDFSUtils: recoverLease=true, attempt=1 on file=hdfs://hdpts/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting/host7516.mydomain%2C16020%2C1455618299902..meta.1455690873222.meta after 4146ms
>> >> 2016-02-17 15:41:20,456 INFO [RS_LOG_REPLAY_OPS-HOST7518:16020-0-Writer-2] wal.WALSplitter: Creating writer path=hdfs://hdpts/apps/hbase/data/data/hbase/meta/1588230740/recovered.edits/0000000000000009659.temp region=1588230740
>> >> 2016-02-17 15:41:20,457 INFO [RS_LOG_REPLAY_OPS-HOST7518:16020-0] wal.WALSplitter: Split writers finished
>> >> 2016-02-17 15:41:20,479 INFO [split-log-closeStream-1] wal.WALSplitter: Rename hdfs://hdpts/apps/hbase/data/data/hbase/meta/1588230740/recovered.edits/0000000000000009659.temp to hdfs://hdpts/apps/hbase/data/data/hbase/meta/1588230740/recovered.edits/0000000000000009661
>> >> 2016-02-17 15:41:20,479 INFO [RS_LOG_REPLAY_OPS-HOST7518:16020-0] wal.WALSplitter: Processed 3 edits across 1 regions; edits skipped=148; log file=hdfs://hdpts/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting/host7516.mydomain%2C16020%2C1455618299902..meta.1455690873222.meta, length=83, corrupted=false, progress failed=false
>> >> 2016-02-17 15:41:20,482 INFO [RS_LOG_REPLAY_OPS-HOST7518:16020-0] coordination.ZkSplitLogWorkerCoordination: successfully transitioned task /hbase-unsecure/splitWAL/WALs%2Fhost7516.mydomain%2C16020%2C1455618299902-splitting%2Fhost7516.mydomain%252C16020%252C1455618299902..meta.1455690873222.meta to final state DONE host7518.mydomain,16020,1455618298475
>> >> 2016-02-17 15:41:20,482 INFO [RS_LOG_REPLAY_OPS-HOST7518:16020-0] handler.WALSplitterHandler: worker host7518.mydomain,16020,1455618298475 done with task org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@42ed43fe in 4359ms
>> >>
>> >> Regards,
>> >> Hironori Ogibayashi
>> >>
>> >> 2016-02-18 5:12 GMT+09:00 Samir Ahmic <ahmic.sa...@gmail.com>:
>> >> > Hi Hironori,
>> >> >
>> >> > Based on your logs, you have shut down the regionserver that was hosting
>> >> > the system "hbase:meta" table. This is a special case, since user regions
>> >> > cannot be reassigned to other servers until the "hbase:meta" table region
>> >> > is reassigned. This can partially explain the delay in recovery time you
>> >> > have experienced. Also, there is the WAL splitting process, which must be
>> >> > completed before regions are reassigned. Still, 15 minutes is too long a
>> >> > period for recovery.
>> >> > How did you shut down host7516? Also, you should check the logs from
>> >> > host7517 and host7518 for errors around 15:41, because they acquired the
>> >> > WAL splitting tasks.
>> >> > Also, can you reproduce this every time in your tests?
>> >> >
>> >> > Regards
>> >> > Samir
>> >> >
>> >> > On Wed, Feb 17, 2016 at 8:49 AM, おぎばやしひろのり <ogibaya...@gmail.com> wrote:
>> >> >
>> >> >> Hello,
>> >> >>
>> >> >> I am testing how HBase works in case of failure and experienced
>> >> >> unexpected behavior. If someone could tell me how to shorten the
>> >> >> recovery time, that would be appreciated.
>> >> >>
>> >> >> What I tried was: first, a client starts writing to HBase; then I shut
>> >> >> down one of the slave node machines. I expected the write operations to
>> >> >> stop for a while (one minute or so), and that after HBase automatically
>> >> >> moved the failed regions to another slave node, the client would be able
>> >> >> to write again. But actually, it took about 15 minutes for the cluster
>> >> >> to recover. What was going on? I have already applied the configurations
>> >> >> presented at https://hbase.apache.org/book.html#mttr
>> >> >>
>> >> >> My environment is:
>> >> >> - 3 master (NN, HMaster, ZK, etc.) nodes and 4 slave nodes (DN, RS);
>> >> >>   all of them are virtual machines.
>> >> >> - HDP 2.4.3 (HBase 1.1.1)
>> >> >> - CentOS 6.7
>> >> >>
>> >> >> Here are the HMaster and NameNode logs. The HMaster detected the failure
>> >> >> of a regionserver at 15:41:16 and tried to start recovery, but threw
>> >> >> RetriesExhaustedException again and again, then finally finished
>> >> >> recovering at 15:57:11. I noticed that there are some logs about lease
>> >> >> recovery in the NameNode log, but I am not sure whether this is related.
>> >> >>
>> >> >> ---- HMaster log ---
>> >> >> 2016-02-17 15:41:16,003 INFO [main-EventThread] zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [host7516.mydomain,16020,1455618299902]
>> >> >> 2016-02-17 15:41:16,005 INFO [MASTER_META_SERVER_OPERATIONS-HOST7512:16000-1] handler.MetaServerShutdownHandler: Splitting hbase:meta logs for host7516.mydomain,16020,1455618299902
>> >> >> 2016-02-17 15:41:16,010 INFO [MASTER_META_SERVER_OPERATIONS-HOST7512:16000-1] master.SplitLogManager: dead splitlog workers [host7516.mydomain,16020,1455618299902]
>> >> >> 2016-02-17 15:41:16,011 INFO [MASTER_META_SERVER_OPERATIONS-HOST7512:16000-1] master.SplitLogManager: started splitting 1 logs in [hdfs://hdpts/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting] for [host7516.mydomain,16020,1455618299902]
>> >> >> 2016-02-17 15:41:16,016 INFO [main-EventThread] coordination.SplitLogManagerCoordination: task /hbase-unsecure/splitWAL/WALs%2Fhost7516.mydomain%2C16020%2C1455618299902-splitting%2Fhost7516.mydomain%252C16020%252C1455618299902..meta.1455690873222.meta acquired by host7518.mydomain,16020,1455618298475
>> >> >> 2016-02-17 15:41:16,266 INFO [host7512.mydomain,16000,1455618257882_splitLogManager__ChoreService_1] master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase-unsecure/splitWAL/WALs%2Fhost7516.mydomain%2C16020%2C1455618299902-splitting%2Fhost7516.mydomain%252C16020%252C1455618299902..meta.1455690873222.meta=last_update = 1455691276196 last_version = 2 cur_worker_name = host7518.mydomain,16020,1455618298475 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
>> >> >> 2016-02-17 15:41:20,481 INFO [main-EventThread] coordination.SplitLogManagerCoordination: task /hbase-unsecure/splitWAL/WALs%2Fhost7516.mydomain%2C16020%2C1455618299902-splitting%2Fhost7516.mydomain%252C16020%252C1455618299902..meta.1455690873222.meta entered state: DONE host7518.mydomain,16020,1455618298475
>> >> >> 2016-02-17 15:41:20,487 INFO [main-EventThread] wal.WALSplitter: Archived processed log hdfs://hdpts/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting/host7516.mydomain%2C16020%2C1455618299902..meta.1455690873222.meta to hdfs://hdpts/apps/hbase/data/oldWALs/host7516.mydomain%2C16020%2C1455618299902..meta.1455690873222.meta
>> >> >> 2016-02-17 15:41:20,487 INFO [main-EventThread] coordination.SplitLogManagerCoordination: Done splitting /hbase-unsecure/splitWAL/WALs%2Fhost7516.mydomain%2C16020%2C1455618299902-splitting%2Fhost7516.mydomain%252C16020%252C1455618299902..meta.1455690873222.meta
>> >> >> 2016-02-17 15:41:20,492 WARN [MASTER_META_SERVER_OPERATIONS-HOST7512:16000-1] master.SplitLogManager: returning success without actually splitting and deleting all the log files in path hdfs://hdpts/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting
>> >> >> 2016-02-17 15:41:20,492 INFO [MASTER_META_SERVER_OPERATIONS-HOST7512:16000-1] master.SplitLogManager: finished splitting (more than or equal to) 83 bytes in 1 log files in [hdfs://hdpts/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting] in 4481ms
>> >> >> 2016-02-17 15:41:20,492 INFO [MASTER_META_SERVER_OPERATIONS-HOST7512:16000-1] handler.MetaServerShutdownHandler: Server host7516.mydomain,16020,1455618299902 was carrying META. Trying to assign.
>> >> >> 2016-02-17 15:41:20,492 INFO [MASTER_META_SERVER_OPERATIONS-HOST7512:16000-1] master.RegionStates: Transition {1588230740 state=OPEN, ts=1455690873330, server=host7516.mydomain,16020,1455618299902} to {1588230740 state=OFFLINE, ts=1455691280492, server=host7516.mydomain,16020,1455618299902}
>> >> >> 2016-02-17 15:41:20,493 INFO [MASTER_META_SERVER_OPERATIONS-HOST7512:16000-1] master.RegionStates: Offlined 1588230740 from host7516.mydomain,16020,1455618299902
>> >> >> 2016-02-17 15:41:27,495 INFO [MASTER_META_SERVER_OPERATIONS-HOST7512:16000-1] zookeeper.MetaTableLocator: Failed verification of hbase:meta,,1 at address=host7516.mydomain,16020,1455618299902, exception=Connection timed out
>> >> >> 2016-02-17 15:41:27,495 INFO [MASTER_META_SERVER_OPERATIONS-HOST7512:16000-1] zookeeper.MetaTableLocator: Deleting hbase:meta region location in ZooKeeper
>> >> >> 2016-02-17 15:41:27,500 INFO [MASTER_META_SERVER_OPERATIONS-HOST7512:16000-1] master.AssignmentManager: Setting node as OFFLINED in ZooKeeper for region {ENCODED => 1588230740, NAME => 'hbase:meta,,1', STARTKEY => '', ENDKEY => ''}
>> >> >> 2016-02-17 15:41:27,502 INFO [MASTER_META_SERVER_OPERATIONS-HOST7512:16000-1] master.AssignmentManager: Assigning hbase:meta,,1.1588230740 to host7517.mydomain,16020,1455618296942
>> >> >> 2016-02-17 15:41:27,502 INFO [MASTER_META_SERVER_OPERATIONS-HOST7512:16000-1] master.RegionStates: Transition {1588230740 state=OFFLINE, ts=1455691287500, server=host7516.mydomain,16020,1455618299902} to {1588230740 state=PENDING_OPEN, ts=1455691287502, server=host7517.mydomain,16020,1455618296942}
>> >> >> 2016-02-17 15:41:27,512 INFO [AM.ZK.Worker-pool2-t62] master.RegionStates: Transition {1588230740 state=PENDING_OPEN, ts=1455691287502, server=host7517.mydomain,16020,1455618296942} to {1588230740 state=OPENING, ts=1455691287512, server=host7517.mydomain,16020,1455618296942}
>> >> >> 2016-02-17 15:41:27,598 INFO [AM.ZK.Worker-pool2-t63] master.RegionStates: Transition {1588230740 state=OPENING, ts=1455691287512, server=host7517.mydomain,16020,1455618296942} to {1588230740 state=OPEN, ts=1455691287598, server=host7517.mydomain,16020,1455618296942}
>> >> >> 2016-02-17 15:41:27,598 INFO [AM.ZK.Worker-pool2-t63] coordination.ZkOpenRegionCoordination: Handling OPENED of 1588230740 from host7512.mydomain,16000,1455618257882; deleting unassigned node
>> >> >> 2016-02-17 15:42:57,601 INFO [MASTER_SERVER_OPERATIONS-HOST7512:16000-4] handler.ServerShutdownHandler: Received exception accessing hbase:meta during server shutdown of host7516.mydomain,16020,1455618299902, retrying hbase:meta read
>> >> >> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=351, exceptions:
>> >> >> Wed Feb 17 15:42:57 KST 2016, null, java.net.SocketTimeoutException: callTimeout=60000, callDuration=90304: row '' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=host7516.mydomain,16020,1455618299902, seqNum=0
>> >> >>     at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:271)
>> >> >>     at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:195)
>> >> >>     at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:59)
>> >> >>     at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
>> >> >>     at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:320)
>> >> >>     at org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:295)
>> >> >>     at org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:160)
>> >> >>     at org.apache.hadoop.hbase.client.ClientScanner.<init>(ClientScanner.java:155)
>> >> >>     at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:821)
>> >> >>     at org.apache.hadoop.hbase.MetaTableAccessor.fullScan(MetaTableAccessor.java:602)
>> >> >>     at org.apache.hadoop.hbase.MetaTableAccessor.fullScan(MetaTableAccessor.java:156)
>> >> >>     at org.apache.hadoop.hbase.MetaTableAccessor.getServerUserRegions(MetaTableAccessor.java:555)
>> >> >>     at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:177)
>> >> >>     at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
>> >> >>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> >> >>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> >> >>     at java.lang.Thread.run(Thread.java:745)
>> >> >> Caused by: java.net.SocketTimeoutException: callTimeout=60000, callDuration=90304: row '' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=host7516.mydomain,16020,1455618299902, seqNum=0
>> >> >>     at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:159)
>> >> >>     at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:64)
>> >> >>     ... 3 more
>> >> >> Caused by: java.io.IOException: Call to HOST7516.mydomain/192.168.184.73:16020 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=1388, waitTime=90001, operationTimeout=90000 expired.
>> >> >>     at org.apache.hadoop.hbase.ipc.RpcClientImpl.wrapException(RpcClientImpl.java:1262)
>> >> >>     at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1230)
>> >> >>     at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:213)
>> >> >>     at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)
>> >> >>     at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.scan(ClientProtos.java:32651)
>> >> >>     at org.apache.hadoop.hbase.client.ScannerCallable.openScanner(ScannerCallable.java:372)
>> >> >>     at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:199)
>> >> >>     at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:62)
>> >> >>     at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
>> >> >>     at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:346)
>> >> >>     at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:320)
>> >> >>     at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:126)
>> >> >>     ... 4 more
>> >> >> Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=1388, waitTime=90001, operationTimeout=90000 expired.
>> >> >>     at org.apache.hadoop.hbase.ipc.Call.checkAndSetTimeout(Call.java:70)
>> >> >>     at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1204)
>> >> >>     ... 14 more
>> >> >> ...(Same error happens again and again)
>> >> >> 2016-02-17 15:56:55,625 INFO [MASTER_SERVER_OPERATIONS-HOST7512:16000-4] handler.ServerShutdownHandler: Splitting logs for host7516.mydomain,16020,1455618299902 before assignment; region count=73
>> >> >> 2016-02-17 15:56:55,627 INFO [MASTER_SERVER_OPERATIONS-HOST7512:16000-4] master.SplitLogManager: dead splitlog workers [host7516.mydomain,16020,1455618299902]
>> >> >> 2016-02-17 15:56:55,628 INFO [MASTER_SERVER_OPERATIONS-HOST7512:16000-4] master.SplitLogManager: started splitting 1 logs in [hdfs://hdpts/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting] for [host7516.mydomain,16020,1455618299902]
>> >> >> 2016-02-17 15:56:55,634 INFO [main-EventThread] coordination.SplitLogManagerCoordination: task /hbase-unsecure/splitWAL/WALs%2Fhost7516.mydomain%2C16020%2C1455618299902-splitting%2Fhost7516.mydomain%252C16020%252C1455618299902.default.1455690304230 acquired by host7517.mydomain,16020,1455618296942
>> >> >> 2016-02-17 15:56:56,266 INFO [host7512.mydomain,16000,1455618257882_splitLogManager__ChoreService_1] master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase-unsecure/splitWAL/WALs%2Fhost7516.mydomain%2C16020%2C1455618299902-splitting%2Fhost7516.mydomain%252C16020%252C1455618299902.default.1455690304230=last_update = 1455692215679 last_version = 2 cur_worker_name = host7517.mydomain,16020,1455618296942 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
>> >> >> 2016-02-17 15:57:01,336 INFO [main-EventThread] coordination.SplitLogManagerCoordination: task /hbase-unsecure/splitWAL/WALs%2Fhost7516.mydomain%2C16020%2C1455618299902-splitting%2Fhost7516.mydomain%252C16020%252C1455618299902.default.1455690304230 entered state: DONE host7517.mydomain,16020,1455618296942
>> >> >> 2016-02-17 15:57:01,343 INFO [main-EventThread] wal.WALSplitter: Archived processed log hdfs://hdpts/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting/host7516.mydomain%2C16020%2C1455618299902.default.1455690304230 to hdfs://hdpts/apps/hbase/data/oldWALs/host7516.mydomain%2C16020%2C1455618299902.default.1455690304230
>> >> >> 2016-02-17 15:57:01,343 INFO [main-EventThread] coordination.SplitLogManagerCoordination: Done splitting /hbase-unsecure/splitWAL/WALs%2Fhost7516.mydomain%2C16020%2C1455618299902-splitting%2Fhost7516.mydomain%252C16020%252C1455618299902.default.1455690304230
>> >> >> 2016-02-17 15:57:01,346 INFO [MASTER_SERVER_OPERATIONS-HOST7512:16000-4] master.SplitLogManager: finished splitting (more than or equal to) 83 bytes in 1 log files in [hdfs://hdpts/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting] in 5718ms
>> >> >> ...
>> >> >> 2016-02-17 15:57:01,353 INFO [MASTER_SERVER_OPERATIONS-HOST7512:16000-4] master.AssignmentManager: Bulk assigning 73 region(s) across 3 server(s), round-robin=true
>> >> >> 2016-02-17 15:57:01,353 INFO [host7512.mydomain,16000,1455618257882-GeneralBulkAssigner-0] master.AssignmentManager: Assigning 24 region(s) to host7515.mydomain,16020,1455691124033
>> >> >> 2016-02-17 15:57:01,353 INFO [host7512.mydomain,16000,1455618257882-GeneralBulkAssigner-1] master.AssignmentManager: Assigning 24 region(s) to host7517.mydomain,16020,1455618296942
>> >> >> 2016-02-17 15:57:01,353 INFO [host7512.mydomain,16000,1455618257882-GeneralBulkAssigner-2] master.AssignmentManager: Assigning 25 region(s) to host7518.mydomain,16020,1455618298475
>> >> >> ...
>> >> >> 2016-02-17 15:57:11,005 INFO [AM.ZK.Worker-pool2-t71] master.RegionStates: Offlined 50cc594cd9e74287ddd5b960bfa2d8e9 from host7516.mydomain,16020,1455618299902
>> >> >> 2016-02-17 15:57:11,005 INFO [MASTER_SERVER_OPERATIONS-HOST7512:16000-4] master.AssignmentManager: Bulk assigning done
>> >> >> 2016-02-17 15:57:11,005 INFO [MASTER_SERVER_OPERATIONS-HOST7512:16000-4] handler.ServerShutdownHandler: Finished processing of shutdown of host7516.mydomain,16020,1455618299902
>> >> >> ----
>> >> >>
>> >> >> ---- NameNode log ---
>> >> >> 2016-02-17 15:41:16,008 INFO namenode.FSEditLog (FSEditLog.java:printStatistics(699)) - Number of transactions: 45 Total time for transactions(ms): 4 Number of transactions batched in Syncs: 0 Number of syncs: 42 SyncTimes(ms): 35 11
>> >> >> 2016-02-17 15:41:16,383 INFO namenode.FSNamesystem (FSNamesystem.java:recoverLeaseInternal(2907)) - recoverLease: [Lease. Holder: DFSClient_NONMAPREDUCE_1038296970_1, pendingcreates: 2], src=/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting/host7516.mydomain%2C16020%2C1455618299902..meta.1455690873222.meta from client DFSClient_NONMAPREDUCE_1038296970_1
>> >> >> 2016-02-17 15:41:16,383 INFO namenode.FSNamesystem (FSNamesystem.java:internalReleaseLease(4036)) - Recovering [Lease. Holder: DFSClient_NONMAPREDUCE_1038296970_1, pendingcreates: 2], src=/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting/host7516.mydomain%2C16020%2C1455618299902..meta.1455690873222.meta
>> >> >> 2016-02-17 15:41:16,383 WARN hdfs.StateChange (FSNamesystem.java:internalReleaseLease(4157)) - DIR* NameSystem.internalReleaseLease: File /apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting/host7516.mydomain%2C16020%2C1455618299902..meta.1455690873222.meta has not been closed. Lease recovery is in progress. RecoveryId = 13180 for block blk_1073753851_13131{UCState=UNDER_RECOVERY, truncateBlock=null, primaryNodeIndex=2, replicas=[ReplicaUC[[DISK]DS-9787b201-fc64-450e-a20f-dcc79fb94b6f:NORMAL:192.168.184.73:50010|RBW], ReplicaUC[[DISK]DS-52f00167-e863-48d8-8f98-3b613f774b0c:NORMAL:192.168.185.249:50010|RBW], ReplicaUC[[DISK]DS-f8c4c3b2-a7cc-40e0-9ebb-2b9c352c0d28:NORMAL:192.168.184.85:50010|RBW]]}
>> >> >> 2016-02-17 15:41:18,588 INFO BlockStateChange (UnderReplicatedBlocks.java:chooseUnderReplicatedBlocks(394)) - chooseUnderReplicatedBlocks selected 8 blocks at priority level 2; Total=8 Reset bookmarks? false
>> >> >> 2016-02-17 15:41:18,588 INFO BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1527)) - BLOCK* neededReplications = 75, pendingReplications = 0.
>> >> >> 2016-02-17 15:41:18,588 INFO blockmanagement.BlockManager (BlockManager.java:computeReplicationWorkForBlocks(1534)) - Blocks chosen but could not be replicated = 8; of which 8 have no target, 0 have no source, 0 are UC, 0 are abandoned, 0 already have enough replicas.
>> >> >> 2016-02-17 15:41:18,960 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user anonymous
>> >> >> 2016-02-17 15:41:18,961 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user anonymous
>> >> >> 2016-02-17 15:41:18,962 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user anonymous
>> >> >> 2016-02-17 15:41:18,963 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user anonymous
>> >> >> 2016-02-17 15:41:18,964 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user anonymous
>> >> >> 2016-02-17 15:41:18,966 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user anonymous
>> >> >> 2016-02-17 15:41:18,967 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user anonymous
>> >> >> 2016-02-17 15:41:18,967 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user anonymous
>> >> >> 2016-02-17 15:41:18,969 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user anonymous
>> >> >> 2016-02-17 15:41:19,096 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user anonymous
>> >> >> 2016-02-17 15:41:19,098 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user anonymous
>> >> >> 2016-02-17 15:41:19,099 WARN security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1521)) - No groups available for user anonymous
>> >> >> 2016-02-17 15:41:19,307 INFO blockmanagement.DatanodeManager (DatanodeManager.java:handleHeartbeat(1426)) - Skipped stale nodes for recovery : 1
>> >> >> 2016-02-17 15:41:19,694 INFO namenode.FSNamesystem (FSNamesystem.java:commitBlockSynchronization(4269)) - commitBlockSynchronization(oldBlock=BP-1843495860-192.168.189.219-1453778090403:blk_1073753851_13131, newgenerationstamp=13180, newlength=60238, newtargets=[192.168.185.249:50010, 192.168.184.85:50010], closeFile=true, deleteBlock=false)
>> >> >> 2016-02-17 15:41:19,696 INFO namenode.FSNamesystem (FSNamesystem.java:commitBlockSynchronization(4434)) - commitBlockSynchronization(oldBlock=BP-1843495860-192.168.189.219-1453778090403:blk_1073753851_13131, file=/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting/host7516.mydomain%2C16020%2C1455618299902..meta.1455690873222.meta, newgenerationstamp=13180, newlength=60238, newtargets=[192.168.185.249:50010, 192.168.184.85:50010]) successful
>> >> >> 2016-02-17 15:41:20,459 INFO hdfs.StateChange (FSNamesystem.java:saveAllocatedBlock(3583)) - BLOCK* allocate blk_1073753897_13181{UCState=UNDER_CONSTRUCTION, truncateBlock=null, primaryNodeIndex=-1, replicas=[ReplicaUC[[DISK]DS-f8c4c3b2-a7cc-40e0-9ebb-2b9c352c0d28:NORMAL:192.168.184.85:50010|RBW], ReplicaUC[[DISK]DS-882dca14-8651-4ffe-a102-8beebc774575:NORMAL:192.168.189.208:50010|RBW], ReplicaUC[[DISK]DS-52f00167-e863-48d8-8f98-3b613f774b0c:NORMAL:192.168.185.249:50010|RBW]]} for /apps/hbase/data/data/hbase/meta/1588230740/recovered.edits/0000000000000009659.temp
>> >> >> 2016-02-17 15:41:20,474 INFO hdfs.StateChange (FSNamesystem.java:completeFile(3503)) - DIR* completeFile: /apps/hbase/data/data/hbase/meta/1588230740/recovered.edits/0000000000000009659.temp is closed by DFSClient_NONMAPREDUCE_443617401_1
>> >> >> 2016-02-17 15:41:20,488 INFO ipc.Server (Server.java:run(2172)) - IPC Server handler 8 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.delete from 192.168.189.219:51393 Call#8040 Retry#0
>> >> >> org.apache.hadoop.fs.PathIsNotEmptyDirectoryException: `/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting is non empty': Directory is not empty
>> >> >>     at org.apache.hadoop.hdfs.server.namenode.FSDirDeleteOp.delete(FSDirDeleteOp.java:89)
>> >> >>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:3723)
>> >> >>     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.delete(NameNodeRpcServer.java:1017)
>> >> >>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.delete(ClientNamenodeProtocolServerSideTranslatorPB.java:611)
>> >> >>     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>> >> >>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>> >> >>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
>> >> >>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
>> >> >>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
>> >> >>     at java.security.AccessController.doPrivileged(Native Method)
>> >> >>     at javax.security.auth.Subject.doAs(Subject.java:422)
>> >> >>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>> >> >>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
>> >> >> 2016-02-17 15:41:21,588 INFO BlockStateChange (UnderReplicatedBlocks.java:chooseUnderReplicatedBlocks(394)) - chooseUnderReplicatedBlocks selected 8 blocks at priority level 2; Total=8 Reset bookmarks? false
>> >> >> 2016-02-17 15:41:21,588 INFO BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1527)) - BLOCK* neededReplications = 76, pendingReplications = 0.
>> >> >> ...
>> >> >> 2016-02-17 15:56:15,384 INFO namenode.FSNamesystem (FSNamesystem.java:rollEditLog(5937)) - Roll Edit Log from 192.168.184.72
>> >> >> 2016-02-17 15:56:15,384 INFO namenode.FSEditLog (FSEditLog.java:rollEditLog(1202)) - Rolling edit logs
>> >> >> 2016-02-17 15:56:15,384 INFO namenode.FSEditLog (FSEditLog.java:endCurrentLogSegment(1258)) - Ending log segment 369158
>> >> >> 2016-02-17 15:56:15,395 INFO namenode.FSEditLog (FSEditLog.java:printStatistics(699)) - Number of transactions: 14 Total time for transactions(ms): 5 Number of transactions batched in Syncs: 0 Number of syncs: 14 SyncTimes(ms): 26 6
>> >> >> 2016-02-17 15:56:15,421 INFO namenode.FileJournalManager (FileJournalManager.java:finalizeLogSegment(133)) - Finalizing edits file /hadoop/hdfs/namenode/current/edits_inprogress_0000000000000369158 -> /hadoop/hdfs/namenode/current/edits_0000000000000369158-0000000000000369171
>> >> >> 2016-02-17 15:56:15,421 INFO namenode.FSEditLog (FSEditLog.java:startLogSegment(1218)) - Starting log segment at 369172
>> >> >> 2016-02-17 15:56:15,764 INFO BlockStateChange (UnderReplicatedBlocks.java:chooseUnderReplicatedBlocks(394)) - chooseUnderReplicatedBlocks selected 3 blocks at priority level 1; 3 blocks at priority level 2; Total=6 Reset bookmarks? false
>> >> >> 2016-02-17 15:56:15,764 INFO BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1527)) - BLOCK* neededReplications = 91, pendingReplications = 4.
>> >> >> 2016-02-17 15:56:15,764 INFO blockmanagement.BlockManager (BlockManager.java:computeReplicationWorkForBlocks(1534)) - Blocks chosen but could not be replicated = 3; of which 3 have no target, 0 have no source, 0 are UC, 0 are abandoned, 0 already have enough replicas.
>> >> >> 2016-02-17 15:56:18,764 INFO BlockStateChange (UnderReplicatedBlocks.java:chooseUnderReplicatedBlocks(394)) - chooseUnderReplicatedBlocks selected 6 blocks at priority level 2; Total=6 Reset bookmarks? false
>> >> >> 2016-02-17 15:56:18,765 INFO BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1527)) - BLOCK* neededReplications = 87, pendingReplications = 5.
>> >> >> 2016-02-17 15:56:18,765 INFO blockmanagement.BlockManager (BlockManager.java:computeReplicationWorkForBlocks(1534)) - Blocks chosen but could not be replicated = 2; of which 0 have no target, 2 have no source, 0 are UC, 0 are abandoned, 0 already have enough replicas.
>> >> >> ...
>> >> >> 2016-02-17 15:56:55,688 INFO namenode.FSNamesystem (FSNamesystem.java:recoverLeaseInternal(2907)) - recoverLease: [Lease. Holder: DFSClient_NONMAPREDUCE_1038296970_1, pendingcreates: 1], src=/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting/host7516.mydomain%2C16020%2C1455618299902.default.1455690304230 from client DFSClient_NONMAPREDUCE_1038296970_1
>> >> >> 2016-02-17 15:56:55,688 INFO namenode.FSNamesystem (FSNamesystem.java:internalReleaseLease(4036)) - Recovering [Lease. Holder: DFSClient_NONMAPREDUCE_1038296970_1, pendingcreates: 1], src=/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting/host7516.mydomain%2C16020%2C1455618299902.default.1455690304230
>> >> >> 2016-02-17 15:56:55,688 WARN hdfs.StateChange (FSNamesystem.java:internalReleaseLease(4157)) - DIR* NameSystem.internalReleaseLease: File /apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting/host7516.mydomain%2C16020%2C1455618299902.default.1455690304230 has not been closed. Lease recovery is in progress. RecoveryId = 13183 for block blk_1073753849_13127{UCState=UNDER_RECOVERY, truncateBlock=null, primaryNodeIndex=2, replicas=[ReplicaUC[[DISK]DS-9787b201-fc64-450e-a20f-dcc79fb94b6f:NORMAL:192.168.184.73:50010|RBW], ReplicaUC[[DISK]DS-f8c4c3b2-a7cc-40e0-9ebb-2b9c352c0d28:NORMAL:192.168.184.85:50010|RBW], ReplicaUC[[DISK]DS-52f00167-e863-48d8-8f98-3b613f774b0c:NORMAL:192.168.185.249:50010|RBW]]}
>> >> >> 2016-02-17 15:56:58,687 INFO namenode.FSNamesystem (FSNamesystem.java:commitBlockSynchronization(4434)) - commitBlockSynchronization(oldBlock=BP-1843495860-192.168.189.219-1453778090403:blk_1073753849_13127, file=/apps/hbase/data/WALs/host7516.mydomain,16020,1455618299902-splitting/host7516.mydomain%2C16020%2C1455618299902.default.1455690304230, newgenerationstamp=13183, newlength=118884436, newtargets=[192.168.184.85:50010, 192.168.185.249:50010]) successful
>> >> >> ----
>> >> >>
>> >> >> Thanks,
>> >> >> Hironori Ogibayashi