Is it possible for you to upgrade to 0.98.10+ ?

I will take a look at your logs later. 

Thanks

Friday, July 24, 2015, 7:15 PM +0800 from Konstantin Chudinov  
<kchudi...@griddynamics.com>:
>Hello Ted,
>Thank you for your answer!
>Hadoop and HBase versions are: 
>2.3.0-cdh5.1.0 - Hadoop (and HDFS) version
>hbase-0.98.1
>About hdfs.. I don’t see anything special in the logs. I’ve attached them to 
>this message. Btw, it’s another server, which also crashed (I’ve lost the hdfs 
>logs of the previous server), so hbase logs are in the archive as well.
>
>Best regards,
>
>Konstantin Chudinov
>
>On 23 Jul 2015, at 20:44, Ted Yu < yuzhih...@gmail.com > wrote:
>>
>>What release of HBase do you use ?
>>
>>I looked at the two log files but didn't find such information. 
>>In the log for node 118, I saw something such as the following:
>>Failed to connect to /10.0.229.16:50010 for block, add to deadNodes and 
>>continue 
>>
>>Was hdfs healthy around the time region server got stuck ?
>>
>>Cheers
>>
>>
>>Friday, July 24, 2015, 12:21 AM +0800 from Konstantin Chudinov  < 
>>kchudi...@griddynamics.com >:
>>>Hi all,
>>>Our team ran into a cascade of stuck region servers. RS logs are similar to those in 
>>>HBASE-10499 (  https://issues.apache.org/jira/browse/HBASE-10499 ) except 
>>>there is no RegionTooBusyException before flush loop:
>>>2015-07-19 07:32:41,961 INFO org.apache.hadoop.hbase.regionserver.HStore: 
>>>Completed major compaction of 2 file(s) in s of table4,\xC7 
>>>,1390920313296.9f554d5828cfa9689de27c1a42d844e3. into 
>>>65dae45c82264b4d80fc7ed0818a4094(size=1.2 M), total size for store is 1.2 M. 
>>>This selection was in queue for 0sec, and took 0sec to execute.
>>>2015-07-19 07:32:41,961 INFO 
>>>org.apache.hadoop.hbase.regionserver.CompactSplitThread: Completed 
>>>compaction: Request = regionName=table4,\xC7 
>>>,1390920313296.9f554d5828cfa9689de27c1a42d844e3., storeName=s, fileCount=2, 
>>>fileSize=1.2 M, priority=998, time=24425664829680753; duration=0sec
>>>2015-07-19 07:32:41,962 INFO 
>>>org.apache.hadoop.hbase.regionserver.compactions.RatioBasedCompactionPolicy: 
>>>Default compaction algorithm has selected 1 files from 1 candidates
>>>2015-07-19 07:32:44,764 INFO 
>>>org.apache.hadoop.hbase.regionserver.HRegionServer: 
>>>regionserver60020.periodicFlusher requesting flush for region 
>>>webpage_table,51000000,1432632712750.5d3471db423cb08f9ed294c4f3094825. after 
>>>a delay of 18943
>>>2015-07-19 07:32:54,765 INFO 
>>>org.apache.hadoop.hbase.regionserver.HRegionServer: 
>>>regionserver60020.periodicFlusher requesting flush for region 
>>>webpage_table,51000000,1432632712750.5d3471db423cb08f9ed294c4f3094825. after 
>>>a delay of 4851
>>>2015-07-19 07:33:04,764 INFO 
>>>org.apache.hadoop.hbase.regionserver.HRegionServer: 
>>>regionserver60020.periodicFlusher requesting flush for region 
>>>webpage_table,51000000,1432632712750.5d3471db423cb08f9ed294c4f3094825. after 
>>>a delay of 7466
>>>2015-07-19 07:33:14,764 INFO 
>>>org.apache.hadoop.hbase.regionserver.HRegionServer: 
>>>regionserver60020.periodicFlusher requesting flush for region 
>>>webpage_table,51000000,1432632712750.5d3471db423cb08f9ed294c4f3094825. after 
>>>a delay of 4940
>>>2015-07-19 07:33:24,765 INFO 
>>>org.apache.hadoop.hbase.regionserver.HRegionServer: 
>>>regionserver60020.periodicFlusher requesting flush for region 
>>>webpage_table,51000000,1432632712750.5d3471db423cb08f9ed294c4f3094825. after 
>>>a delay of 12909
>>>2015-07-19 07:33:34,764 INFO 
>>>org.apache.hadoop.hbase.regionserver.HRegionServer: 
>>>regionserver60020.periodicFlusher requesting flush for region 
>>>webpage_table,51000000,1432632712750.5d3471db423cb08f9ed294c4f3094825. after 
>>>a delay of 5897
>>>2015-07-19 07:33:44,764 INFO 
>>>org.apache.hadoop.hbase.regionserver.HRegionServer: 
>>>regionserver60020.periodicFlusher requesting flush for region 
>>>webpage_table,51000000,1432632712750.5d3471db423cb08f9ed294c4f3094825. after 
>>>a delay of 9110
>>>2015-07-19 07:33:54,764 INFO 
>>>org.apache.hadoop.hbase.regionserver.HRegionServer: 
>>>regionserver60020.periodicFlusher requesting flush for region 
>>>webpage_table,51000000,1432632712750.5d3471db423cb08f9ed294c4f3094825. after 
>>>a delay of 7109
>>>....
>>>until we rebooted the RS at 10:08.
>>>8 servers got stuck at the same time.
>>>I haven't found anything in the hmaster's logs. Thread dumps show that many 
>>>threads (including the flush thread) are waiting for a read lock while accessing HDFS:
>>>"RpcServer.handler=19,port=60020" - Thread t@90
>>>  java.lang.Thread.State: WAITING
>>>at java.lang.Object.wait(Native Method)
>>>- waiting on <77770184> (a org.apache.hadoop.hbase.util.IdLock$Entry)
>>>at java.lang.Object.wait(Object.java:503)
>>>at org.apache.hadoop.hbase.util.IdLock.getLockEntry(IdLock.java:79)
>>>at 
>>>org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:319)
>>>at 
>>>org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:253)
>>>at 
>>>org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:494)
>>>at 
>>>org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.reseekTo(HFileReaderV2.java:542)
>>>at 
>>>org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseekAtOrAfter(StoreFileScanner.java:257)
>>>at 
>>>org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:173)
>>>at 
>>>org.apache.hadoop.hbase.regionserver.StoreFileScanner.enforceSeek(StoreFileScanner.java:377)
>>>at 
>>>org.apache.hadoop.hbase.regionserver.KeyValueHeap.pollRealKV(KeyValueHeap.java:347)
>>>at 
>>>org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:304)
>>>at 
>>>org.apache.hadoop.hbase.regionserver.KeyValueHeap.requestSeek(KeyValueHeap.java:269)
>>>at 
>>>org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:695)
>>>at 
>>>org.apache.hadoop.hbase.regionserver.NonLazyKeyValueScanner.doRealSeek(NonLazyKeyValueScanner.java:55)
>>>at 
>>>org.apache.hadoop.hbase.regionserver.NonLazyKeyValueScanner.requestSeek(NonLazyKeyValueScanner.java:39)
>>>at 
>>>org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:311)
>>>at 
>>>org.apache.hadoop.hbase.regionserver.KeyValueHeap.requestSeek(KeyValueHeap.java:269)
>>>at 
>>>org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:3987)
>>>at 
>>>org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:3814)
>>>at 
>>>org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:3805)
>>>at 
>>>org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3136)
>>>- locked <1623a240> (a 
>>>org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl)
>>>at 
>>>org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29497)
>>>at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2012)
>>>at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:98)
>>>at 
>>>org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
>>>at 
>>>org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
>>>at 
>>>org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
>>>at java.lang.Thread.run(Thread.java:745)
>>>"RpcServer.handler=29,port=60020" - Thread t@100
>>>  java.lang.Thread.State: BLOCKED
>>>at 
>>>org.apache.hadoop.hdfs.DFSInputStream.getFileLength(DFSInputStream.java:354)
>>>- waiting to lock <399a6ff3> (a org.apache.hadoop.hdfs.DFSInputStream) owned 
>>>by "RpcServer.handler=21,port=60020" t@92
>>>at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1270)
>>>at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:90)
>>>at 
>>>org.apache.hadoop.hbase.io.hfile.HFileBlock$AbstractFSReader.readAtOffset(HFileBlock.java:1224)
>>>at 
>>>org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderV2.readBlockDataInternal(HFileBlock.java:1432)
>>>at 
>>>org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderV2.readBlockData(HFileBlock.java:1314)
>>>at 
>>>org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:355)
>>>at 
>>>org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:253)
>>>at 
>>>org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:494)
>>>at 
>>>org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.reseekTo(HFileReaderV2.java:542)
>>>at 
>>>org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseekAtOrAfter(StoreFileScanner.java:257)
>>>at 
>>>org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:173)
>>>at 
>>>org.apache.hadoop.hbase.regionserver.StoreFileScanner.enforceSeek(StoreFileScanner.java:377)
>>>at 
>>>org.apache.hadoop.hbase.regionserver.KeyValueHeap.pollRealKV(KeyValueHeap.java:347)
>>>at 
>>>org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:304)
>>>at 
>>>org.apache.hadoop.hbase.regionserver.KeyValueHeap.requestSeek(KeyValueHeap.java:269)
>>>at 
>>>org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:695)
>>>at 
>>>org.apache.hadoop.hbase.regionserver.StoreScanner.seekAsDirection(StoreScanner.java:683)
>>>at 
>>>org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:533)
>>>at 
>>>org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:140)
>>>at 
>>>org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:3866)
>>>at 
>>>org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateFromJoinedHeap(HRegion.java:3840)
>>>at 
>>>org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:3995)
>>>at 
>>>org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:3814)
>>>at 
>>>org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:3805)
>>>at 
>>>org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3136)
>>>- locked <3af54140> (a 
>>>org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl)
>>>at 
>>>org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29497)
>>>at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2012)
>>>at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:98)
>>>at 
>>>org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
>>>at 
>>>org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
>>>at 
>>>org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
>>>at java.lang.Thread.run(Thread.java:745)
>>>  Locked ownable synchronizers:
>>>- locked <5320bfc4> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
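>>>The IdLock.getLockEntry wait at the top of the first dump can be illustrated with a
>>>small sketch (my simplified stand-in, not HBase's actual IdLock code): every reader
>>>of the same HFile block serializes on one per-block lock entry, so a single handler
>>>stuck in a slow HDFS read holds up every other handler that needs that block.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified stand-in for the per-block-id locking pattern visible in the dumps.
// Class and method names here are illustrative, not HBase's real API.
public class IdLockSketch {
    // One monitor per block id: readers of the same block take turns.
    private final ConcurrentHashMap<Long, Object> monitors = new ConcurrentHashMap<>();
    private final AtomicInteger inRead = new AtomicInteger();
    volatile int maxConcurrent = 0;  // highest number of simultaneous readers observed

    void readBlock(long blockId, long readMillis) throws InterruptedException {
        Object monitor = monitors.computeIfAbsent(blockId, k -> new Object());
        synchronized (monitor) {          // other handlers of this block wait here
            int now = inRead.incrementAndGet();
            if (now > maxConcurrent) maxConcurrent = now;
            Thread.sleep(readMillis);     // stands in for the (possibly stuck) HDFS read
            inRead.decrementAndGet();
        }
    }

    public static void main(String[] args) throws Exception {
        IdLockSketch lock = new IdLockSketch();
        Thread[] handlers = new Thread[4];
        for (int i = 0; i < handlers.length; i++) {
            handlers[i] = new Thread(() -> {
                try { lock.readBlock(42L, 50); } catch (InterruptedException ignored) {}
            });
            handlers[i].start();
        }
        for (Thread t : handlers) t.join();
        // All four "handlers" wanted block 42; mutual exclusion kept them to one at a time.
        System.out.println("max concurrent readers of block 42: " + lock.maxConcurrent);
    }
}
```

>>>If the thread holding the entry never returns from DFSInputStream.read (as in the
>>>second dump, blocked on the stream's monitor), everything queued behind it waits
>>>forever, which would match the pile-up of RPC handlers and the stalled flush.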
>>>I have zipped all logs and dumps and attached them to this mail.
>>>This problem occurs about once a month on our cluster.
>>>Does anybody know the reason for this cascading server failure? 
>>>Thank you in advance!
>>>
>>>Konstantin Chudinov
