Ted:

Unfortunately, HBase 0.98.10 is not an option, as CDH does not offer that
version. However, there are a lot of upstream fixes included in 0.98.6 (CDH 5.3.5):
http://archive.cloudera.com/cdh5/cdh/5/hbase-0.98.6-cdh5.3.5.releasenotes.html

Later CDH versions ship HBase 1.0.0+.

On 27 July 2015 at 12:54, Konstantin Chudinov <kchudi...@griddynamics.com>
wrote:

> >>or maybe you run HDFS-balancer?
> Yes, we run the HDFS balancer almost all of the time, and we often see a
> similar exception in other logs. So I don't think this is the reason for the
> failure.
> >>Are all your RS collocated with DN?
> Yes, all of them.
> >>can we get logs during "stuck time" from DN located here: /10.0.240.163
> Unfortunately, our log rolling strategy doesn't allow us to retrieve the
> logs from the day of the failure.
> Question from Ted:
> >>
> > Is it possible for you to upgrade to 0.98.10+
> It's a very expensive upgrade for us. To initiate an upgrade like this, we
> would need proof that it actually fixes the current issue. That's why I'm
> asking here about the root cause.
>
> Best regards,
>
> Konstantin Chudinov
> Software Engineer
> Grid Dynamics
>
> On 24 Jul 2015, at 15:23, Serega Sheypak <serega.shey...@gmail.com> wrote:
>
> > Probably the block was being replicated because of a DN failure, and HBase
> got stuck trying to access that replica?
> > I can see that the DN answers that some blocks are missing.
> > or maybe you run HDFS-balancer?
> >
> > The other thing is that, by design, you should always get read access to
> HDFS. You are not allowed to modify a file concurrently: the first writer
> takes the lease, and the NN does not grant concurrent leases, if I remember
> correctly...
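> >
> > A minimal sketch of that lease behavior, as I understand it (illustrative
> > only; the path is made up and the exact exception can differ between HDFS
> > versions):
> >
> > import java.io.IOException;
> > import org.apache.hadoop.conf.Configuration;
> > import org.apache.hadoop.fs.FSDataOutputStream;
> > import org.apache.hadoop.fs.FileSystem;
> > import org.apache.hadoop.fs.Path;
> >
> > public class LeaseDemo {
> >   public static void main(String[] args) throws IOException {
> >     FileSystem fs = FileSystem.get(new Configuration());
> >     Path p = new Path("/tmp/lease-demo");           // hypothetical path
> >     FSDataOutputStream first = fs.create(p, true);  // first writer takes the lease on the file
> >     first.writeBytes("hello");
> >     first.hflush();
> >     fs.open(p).close();                             // reads are still allowed while the file is open for write
> >     try {
> >       fs.append(p);                                 // a second writer asks for the same lease
> >     } catch (IOException e) {
> >       // the NameNode refuses, typically with an AlreadyBeingCreatedException behind a RemoteException
> >       System.out.println("concurrent lease rejected: " + e.getMessage());
> >     } finally {
> >       first.close();
> >     }
> >   }
> > }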
> >
> > See what happens with block 1099777976128
> >
> > RS:
> > 2015-07-19 07:25:08,533 INFO org.apache.hadoop.hbase.regionserver.HStore:
> Starting compaction of 2 file(s) in i of
> table7,\x8C\xA0,1435936455217.12a2d1e37fd8f0f9870fc1b5afd6046d. into
> tmpdir=hdfs://server1/hbase/data/default/table7/12a2d1e37fd8f0f9870fc1b5afd6046d/.tmp,
> totalSize=416.0 M
> > 2015-07-19 07:25:08,556 WARN org.apache.hadoop.hdfs.BlockReaderFactory:
> BlockReaderFactory(fileName=/hbase/data/default/table7/12a2d1e37fd8f0f9870fc1b5afd6046d/i/983cf03fddfa480f92346f25a61b0b9e,
> block=BP-1892992341-10.10.122.111-1352825964285:blk_1195579097_1099777976128):
> unknown response code ERROR while attempting to set up short-circuit
> access. Block
> BP-1892992341-10.10.122.111-1352825964285:blk_1195579097_1099777976128 is
> not valid
> > 2015-07-19 07:25:08,556 WARN
> org.apache.hadoop.hdfs.client.ShortCircuitCache:
> ShortCircuitCache(0x6b1f04e2): failed to load
> 1195579097_BP-1892992341-10.10.122.111-1352825964285
> > 2015-07-19 07:25:08,557 WARN org.apache.hadoop.hdfs.BlockReaderFactory:
> I/O error constructing remote block reader.
> > java.io.IOException: Got error for OP_READ_BLOCK, self=/
> 10.0.241.39:53420, remote=/10.0.241.39:50010, for file
> /hbase/data/default/table7/12a2d1e37fd8f0f9870fc1b5afd6046d/i/983cf03fddfa480f92346f25a61b0b9e,
> for pool BP-1892992341-10.10.122.111-1352825964285 block
> 1195579097_1099777976128
> >       at
> org.apache.hadoop.hdfs.RemoteBlockReader2.checkSuccess(RemoteBlockReader2.java:432)
> >       at
> org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:397)
> >       at
> org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:786)
> >       at
> org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:665)
> >       at
> org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:325)
> >       at
> org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:566)
> >       at
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:789)
> >       at
> org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:836)
> >       at java.io.DataInputStream.read(DataInputStream.java:149)
> >       at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
> >       at
> org.apache.hadoop.hbase.io.hfile.HFileBlock$AbstractFSReader.readAtOffset(HFileBlock.java:1210)
> >       at
> org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderV2.readBlockDataInternal(HFileBlock.java:1483)
> >       at
> org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderV2.readBlockData(HFileBlock.java:1314)
> >       at
> org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:355)
> >       at
> org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.seekTo(HFileReaderV2.java:1052)
> >       at
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:244)
> >       at
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:152)
> >       at
> org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:317)
> >       at
> org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:240)
> >       at
> org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:202)
> >       at
> org.apache.hadoop.hbase.regionserver.compactions.Compactor.createScanner(Compactor.java:257)
> >       at
> org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:65)
> >       at
> org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:109)
> >       at
> org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:1080)
> >       at
> org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:1482)
> >       at
> org.apache.hadoop.hbase.regionserver.CompactSplitThread$CompactionRunner.run(CompactSplitThread.java:475)
> >       at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >       at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >       at java.lang.Thread.run(Thread.java:745)
> > 2015-07-19 07:25:08,558 WARN org.apache.hadoop.hdfs.DFSClient: Failed to
> connect to /10.0.241.39:50010 for block, add to deadNodes and continue.
> java.io.IOException: Got error for OP_READ_BLOCK, self=/10.0.241.39:53420,
> remote=/10.0.241.39:50010, for file
> /hbase/data/default/table7/12a2d1e37fd8f0f9870fc1b5afd6046d/i/983cf03fddfa480f92346f25a61b0b9e,
> for pool BP-1892992341-10.10.122.111-1352825964285 block
> 1195579097_1099777976128
> > java.io.IOException: Got error for OP_READ_BLOCK, self=/
> 10.0.241.39:53420, remote=/10.0.241.39:50010, for file
> /hbase/data/default/table7/12a2d1e37fd8f0f9870fc1b5afd6046d/i/983cf03fddfa480f92346f25a61b0b9e,
> for pool BP-1892992341-10.10.122.111-1352825964285 block
> 1195579097_1099777976128
> >       at
> org.apache.hadoop.hdfs.RemoteBlockReader2.checkSuccess(RemoteBlockReader2.java:432)
> >       at
> org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:397)
> >       at
> org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:786)
> >       at
> org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:665)
> >       at
> org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:325)
> >       at
> org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:566)
> >       at
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:789)
> >       at
> org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:836)
> >       at java.io.DataInputStream.read(DataInputStream.java:149)
> >       at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
> >       at
> org.apache.hadoop.hbase.io.hfile.HFileBlock$AbstractFSReader.readAtOffset(HFileBlock.java:1210)
> >       at
> org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderV2.readBlockDataInternal(HFileBlock.java:1483)
> >       at
> org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderV2.readBlockData(HFileBlock.java:1314)
> >       at
> org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:355)
> >       at
> org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.seekTo(HFileReaderV2.java:1052)
> >       at
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:244)
> >       at
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:152)
> >       at
> org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:317)
> >       at
> org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:240)
> >       at
> org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:202)
> >       at
> org.apache.hadoop.hbase.regionserver.compactions.Compactor.createScanner(Compactor.java:257)
> >       at
> org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:65)
> >       at
> org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:109)
> >       at
> org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:1080)
> >       at
> org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:1482)
> >       at
> org.apache.hadoop.hbase.regionserver.CompactSplitThread$CompactionRunner.run(CompactSplitThread.java:475)
> >       at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >       at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >       at java.lang.Thread.run(Thread.java:745)
> > 2015-07-19 07:25:08,559 INFO org.apache.hadoop.hdfs.DFSClient:
> Successfully connected to /10.0.240.163:50010 for
> BP-1892992341-10.10.122.111-1352825964285:blk_1195579097_1099777976128
> > 2015-07-19 07:25:12,382 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer:
> regionserver60020.periodicFlusher requesting flush for region
> webpage_table,c7000000,1432632712751.736fd216603c2368a7001f34c944a7a0.
> after a delay of 17793
> >
> >
> > DN:
> > 2015-07-19 07:25:08,556 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: opReadBlock
> BP-1892992341-10.10.122.111-1352825964285:blk_1195579097_1099777976128
> received exception
> org.apache.hadoop.hdfs.server.datanode.ReplicaNotFoundException: Replica
> not found for
> BP-1892992341-10.10.122.111-1352825964285:blk_1195579097_1099777976128
> > 2015-07-19 07:25:08,557 WARN
> org.apache.hadoop.hdfs.server.datanode.DataNode:
> DatanodeRegistration(10.0.241.39,
> datanodeUuid=b5972c62-421f-4eab-9ae7-641b8019d406, infoPort=50075,
> ipcPort=50020,
> storageInfo=lv=-55;cid=cluster10;nsid=1415935480;c=1418135836666):Got
> exception while serving
> BP-1892992341-10.10.122.111-1352825964285:blk_1195579097_1099777976128 to /
> 10.0.241.39:53420
> > org.apache.hadoop.hdfs.server.datanode.ReplicaNotFoundException: Replica
> not found for
> BP-1892992341-10.10.122.111-1352825964285:blk_1195579097_1099777976128
> >       at
> org.apache.hadoop.hdfs.server.datanode.BlockSender.getReplica(BlockSender.java:419)
> >       at
> org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:228)
> >       at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:466)
> >       at
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:110)
> >       at
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:68)
> >       at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:229)
> >       at java.lang.Thread.run(Thread.java:745)
> > 2015-07-19 07:25:08,557 ERROR
> org.apache.hadoop.hdfs.server.datanode.DataNode:
> hbase101.nnn.pvt:50010:DataXceiver error processing READ_BLOCK operation
> src: /10.0.241.39:53420 dest: /10.0.241.39:50010
> > org.apache.hadoop.hdfs.server.datanode.ReplicaNotFoundException: Replica
> not found for
> BP-1892992341-10.10.122.111-1352825964285:blk_1195579097_1099777976128
> >       at
> org.apache.hadoop.hdfs.server.datanode.BlockSender.getReplica(BlockSender.java:419)
> >       at
> org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:228)
> >       at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:466)
> >       at
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:110)
> >       at
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:68)
> >       at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:229)
> >       at java.lang.Thread.run(Thread.java:745)
> > 2015-07-19 07:25:09,022 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving
> BP-1892992341-10.10.122.111-1352825964285:blk_1201101758_1099783498791 src:
> /10.0.241.39:53422 dest: /10.0.241.39:50010
> >
> > Are all your RS collocated with DNs? Or do you have an RS which doesn't
> have a local DN?
> > can we get logs during "stuck time" from DN located here: /10.0.240.163
> >
> > 2015-07-24 13:49 GMT+02:00 Ted Yu <yuzhih...@gmail.com>:
> >
> > Is it possible for you to upgrade to 0.98.10+ ?
> >
> > I will take a look at your logs later.
> >
> > Thanks
> >
> > Friday, July 24, 2015, 7:15 PM +0800 from Konstantin Chudinov  <
> kchudi...@griddynamics.com>:
> > >Hello Ted,
> > >Thank you for your answer!
> > >Hadoop and HBase versions are:
> > >2.3.0-cdh5.1.0 - Hadoop (and HDFS) version
> > >hbase-0.98.1
> > >About HDFS: I don't see anything special in the logs. I've attached them
> to this message. By the way, these logs are from another server that also
> crashed (I've lost the HDFS logs of the previous server), so its HBase logs
> are in the archive as well.
> > >
> > >Best regards,
> > >
> > >Konstantin Chudinov
> > >
> > >On 23 Jul 2015, at 20:44, Ted Yu < yuzhih...@gmail.com > wrote:
> > >>
> > >>What release of HBase do you use ?
> > >>
> > >>I looked at the two log files but didn't find such information.
> > >>In the log for node 118, I saw something such as the following:
> > >>Failed to connect to /10.0.229.16:50010 for block, add to deadNodes
> and continue
> > >>
> > >>Was HDFS healthy around the time the region server got stuck?
> > >>
> > >>Cheers
> > >>
> > >>
> > >>Friday, July 24, 2015, 12:21 AM +0800 from Konstantin Chudinov  <
> kchudi...@griddynamics.com >:
> > >>>Hi all,
> > >>>Our team faced a cascading region server hang. The RS logs are similar
> to those in HBASE-10499 ( https://issues.apache.org/jira/browse/HBASE-10499 )
> except there is no RegionTooBusyException before the flush loop:
> > >>>2015-07-19 07:32:41,961 INFO
> org.apache.hadoop.hbase.regionserver.HStore: Completed major compaction of
> 2 file(s) in s of table4,\xC7
> ,1390920313296.9f554d5828cfa9689de27c1a42d844e3. into
> 65dae45c82264b4d80fc7ed0818a4094(size=1.2 M), total size for store is 1.2
> M. This selection was in queue for 0sec, and took 0sec to execute.
> > >>>2015-07-19 07:32:41,961 INFO
> org.apache.hadoop.hbase.regionserver.CompactSplitThread: Completed
> compaction: Request = regionName=table4,\xC7
> ,1390920313296.9f554d5828cfa9689de27c1a42d844e3., storeName=s, fileCount=2,
> fileSize=1.2 M, priority=998, time=24425664829680753; duration=0sec
> > >>>2015-07-19 07:32:41,962 INFO
> org.apache.hadoop.hbase.regionserver.compactions.RatioBasedCompactionPolicy:
> Default compaction algorithm has selected 1 files from 1 candidates
> > >>>2015-07-19 07:32:44,764 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer:
> regionserver60020.periodicFlusher requesting flush for region
> webpage_table,51000000,1432632712750.5d3471db423cb08f9ed294c4f3094825.
> after a delay of 18943
> > >>>2015-07-19 07:32:54,765 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer:
> regionserver60020.periodicFlusher requesting flush for region
> webpage_table,51000000,1432632712750.5d3471db423cb08f9ed294c4f3094825.
> after a delay of 4851
> > >>>2015-07-19 07:33:04,764 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer:
> regionserver60020.periodicFlusher requesting flush for region
> webpage_table,51000000,1432632712750.5d3471db423cb08f9ed294c4f3094825.
> after a delay of 7466
> > >>>2015-07-19 07:33:14,764 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer:
> regionserver60020.periodicFlusher requesting flush for region
> webpage_table,51000000,1432632712750.5d3471db423cb08f9ed294c4f3094825.
> after a delay of 4940
> > >>>2015-07-19 07:33:24,765 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer:
> regionserver60020.periodicFlusher requesting flush for region
> webpage_table,51000000,1432632712750.5d3471db423cb08f9ed294c4f3094825.
> after a delay of 12909
> > >>>2015-07-19 07:33:34,764 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer:
> regionserver60020.periodicFlusher requesting flush for region
> webpage_table,51000000,1432632712750.5d3471db423cb08f9ed294c4f3094825.
> after a delay of 5897
> > >>>2015-07-19 07:33:44,764 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer:
> regionserver60020.periodicFlusher requesting flush for region
> webpage_table,51000000,1432632712750.5d3471db423cb08f9ed294c4f3094825.
> after a delay of 9110
> > >>>2015-07-19 07:33:54,764 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer:
> regionserver60020.periodicFlusher requesting flush for region
> webpage_table,51000000,1432632712750.5d3471db423cb08f9ed294c4f3094825.
> after a delay of 7109
> > >>>....
> > >>>until we rebooted the RS at 10:08.
> > >>>8 servers got stuck at the same time.
> > >>>I haven't found anything in the HMaster's logs. The thread dumps show
> that many threads (including the flush thread) are waiting for a read lock
> while accessing HDFS:
> > >>>"RpcServer.handler=19,port=60020" - Thread t@90
> > >>>  java.lang.Thread.State: WAITING
> > >>>at java.lang.Object.wait(Native Method)
> > >>>- waiting on <77770184> (a org.apache.hadoop.hbase.util.IdLock$Entry)
> > >>>at java.lang.Object.wait(Object.java:503)
> > >>>at org.apache.hadoop.hbase.util.IdLock.getLockEntry(IdLock.java:79)
> > >>>at
> org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:319)
> > >>>at
> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:253)
> > >>>at
> org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:494)
> > >>>at
> org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.reseekTo(HFileReaderV2.java:542)
> > >>>at
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseekAtOrAfter(StoreFileScanner.java:257)
> > >>>at
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:173)
> > >>>at
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.enforceSeek(StoreFileScanner.java:377)
> > >>>at
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.pollRealKV(KeyValueHeap.java:347)
> > >>>at
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:304)
> > >>>at
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.requestSeek(KeyValueHeap.java:269)
> > >>>at
> org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:695)
> > >>>at
> org.apache.hadoop.hbase.regionserver.NonLazyKeyValueScanner.doRealSeek(NonLazyKeyValueScanner.java:55)
> > >>>at
> org.apache.hadoop.hbase.regionserver.NonLazyKeyValueScanner.requestSeek(NonLazyKeyValueScanner.java:39)
> > >>>at
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:311)
> > >>>at
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.requestSeek(KeyValueHeap.java:269)
> > >>>at
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:3987)
> > >>>at
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:3814)
> > >>>at
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:3805)
> > >>>at
> org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3136)
> > >>>- locked <1623a240> (a
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl)
> > >>>at
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29497)
> > >>>at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2012)
> > >>>at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:98)
> > >>>at
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
> > >>>at
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
> > >>>at
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
> > >>>at java.lang.Thread.run(Thread.java:745)
> > >>>"RpcServer.handler=29,port=60020" - Thread t@100
> > >>>  java.lang.Thread.State: BLOCKED
> > >>>at
> org.apache.hadoop.hdfs.DFSInputStream.getFileLength(DFSInputStream.java:354)
> > >>>- waiting to lock <399a6ff3> (a
> org.apache.hadoop.hdfs.DFSInputStream) owned by
> "RpcServer.handler=21,port=60020" t@92
> > >>>at
> org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1270)
> > >>>at
> org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:90)
> > >>>at
> org.apache.hadoop.hbase.io.hfile.HFileBlock$AbstractFSReader.readAtOffset(HFileBlock.java:1224)
> > >>>at
> org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderV2.readBlockDataInternal(HFileBlock.java:1432)
> > >>>at
> org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderV2.readBlockData(HFileBlock.java:1314)
> > >>>at
> org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:355)
> > >>>at
> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:253)
> > >>>at
> org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:494)
> > >>>at
> org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.reseekTo(HFileReaderV2.java:542)
> > >>>at
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseekAtOrAfter(StoreFileScanner.java:257)
> > >>>at
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:173)
> > >>>at
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.enforceSeek(StoreFileScanner.java:377)
> > >>>at
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.pollRealKV(KeyValueHeap.java:347)
> > >>>at
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:304)
> > >>>at
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.requestSeek(KeyValueHeap.java:269)
> > >>>at
> org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:695)
> > >>>at
> org.apache.hadoop.hbase.regionserver.StoreScanner.seekAsDirection(StoreScanner.java:683)
> > >>>at
> org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:533)
> > >>>at
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:140)
> > >>>at
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:3866)
> > >>>at
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateFromJoinedHeap(HRegion.java:3840)
> > >>>at
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:3995)
> > >>>at
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:3814)
> > >>>at
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:3805)
> > >>>at
> org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3136)
> > >>>- locked <3af54140> (a
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl)
> > >>>at
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29497)
> > >>>at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2012)
> > >>>at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:98)
> > >>>at
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
> > >>>at
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
> > >>>at
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
> > >>>at java.lang.Thread.run(Thread.java:745)
> > >>>  Locked ownable synchronizers:
> > >>>- locked <5320bfc4> (a
> java.util.concurrent.locks.ReentrantLock$NonfairSync)
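> > >>>
> > >>>To show why one stalled HDFS read backs up so many handlers, here is a
> > >>>rough sketch of the per-block locking pattern suggested by the IdLock and
> > >>>DFSInputStream frames above (illustrative only, not the HBase
> > >>>implementation; the class and method names are made up):
> > >>>
> > >>>import java.util.concurrent.ConcurrentHashMap;
> > >>>import java.util.concurrent.CountDownLatch;
> > >>>
> > >>>class BlockLoadLock {
> > >>>  private final ConcurrentHashMap<Long, CountDownLatch> inFlight =
> > >>>      new ConcurrentHashMap<Long, CountDownLatch>();
> > >>>
> > >>>  // Only one thread loads a given block; every other reader of that block
> > >>>  // parks on the owner's latch. If the owner hangs inside the HDFS read,
> > >>>  // the waiters never wake up, which matches the WAITING handlers and the
> > >>>  // stuck flush thread in the dumps.
> > >>>  void loadOnce(long blockId, Runnable loadBlock) throws InterruptedException {
> > >>>    CountDownLatch mine = new CountDownLatch(1);
> > >>>    CountDownLatch owner = inFlight.putIfAbsent(blockId, mine);
> > >>>    if (owner != null) {
> > >>>      owner.await();          // piles up behind the stuck loader
> > >>>      return;
> > >>>    }
> > >>>    try {
> > >>>      loadBlock.run();        // e.g. the remote block read that never returned
> > >>>    } finally {
> > >>>      inFlight.remove(blockId);
> > >>>      mine.countDown();
> > >>>    }
> > >>>  }
> > >>>}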
> > >>>I have zipped all the logs and dumps and attached them to this mail.
> > >>>This problem occurs once a month on our cluster.
> > >>>Does anybody know the reason for this cascading server failure?
> > >>>Thank you in advance!
> > >>>
> > >>>Konstantin Chudinov
> >
>
>


-- 
Sincerely yours,
Andrey Shevchenko

Big Data, Grid Dynamics
St.Petersburg, Russia
Mobile: +7(931)378-19-86
Skype: dioeraclier
