Rushabh S Shah created HBASE-21607:
--------------------------------------

             Summary: HBase Region server read fails in case of datanode disk error.
                 Key: HBASE-21607
                 URL: https://issues.apache.org/jira/browse/HBASE-21607
             Project: HBase
          Issue Type: Bug
          Components: regionserver
    Affects Versions: 1.3.2
            Reporter: Rushabh S Shah


HBase region server reads failed with the following error:
{noformat}
2018-11-30 16:49:18,760 WARN [,queue=12,port=60020] hdfs.BlockReaderFactory - BlockReaderFactory(fileName=<fileName>, block=BP-1618467445-<ip>-1516873463430:blk_1090719164_16982933): error creating ShortCircuitReplica.
java.io.IOException: invalid metadata header version 0. Can only handle version 1.
at org.apache.hadoop.hdfs.shortcircuit.ShortCircuitReplica.<init>(ShortCircuitReplica.java:129)
at org.apache.hadoop.hdfs.BlockReaderFactory.requestFileDescriptors(BlockReaderFactory.java:558)
at org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:490)
at org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:782)
at org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:716)
at org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:422)
at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:333)
at org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1145)
at org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:1087)
at org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1444)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1407)
at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:89)
at org.apache.hadoop.hbase.io.hfile.HFileBlock.positionalReadWithExtra(HFileBlock.java:834)
at org.apache.hadoop.hbase.io.hfile.HFileBlock$AbstractFSReader.readAtOffset(HFileBlock.java:1530)
at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1781)
at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1624)
at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:455)
at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.seekTo(HFileReaderV2.java:1263)
at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:297)
at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:189)
at org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:372)
at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:220)
at org.apache.hadoop.hbase.regionserver.HStore.getScanner(HStore.java:2164)
at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.initializeScanners(HRegion.java:5916)
at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.<init>(HRegion.java:5890)
at org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:2739)
at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2719)
at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:7197)
at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:7156)
at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:7149)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2250)
at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:35068)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2373)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:188)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:168)

 

2018-11-30 16:49:18,760 WARN [,queue=12,port=60020] shortcircuit.ShortCircuitCache - ShortCircuitCache(0x16fd768f): failed to load 1090719164_BP-1618467445-<ip>-1516873463430
2018-11-30 16:49:18,761 DEBUG [,queue=12,port=60020] ipc.RpcServer - RpcServer.FifoWFPBQ.default.handler=246,queue=12,port=60020: callId: 46940 service: ClientService methodName: Get size: 443 connection: <ip>:48798 deadline: 1543596678759
java.io.IOException: Could not seek StoreFileScanner[HFileScanner for reader reader=hdfs://<fileName>, compression=none, cacheConf=blockCache=LruBlockCache{blockCount=696858, currentSize=7625195376, freeSize=5689203856, maxSize=13314399232, heapSize=7625195376, minSize=12648679424, minFactor=0.95, multiSize=6324339712, multiFactor=0.5, singleSize=3162169856, singleFactor=0.25}, cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, prefetchOnOpen=false, firstKey=<keyName>
at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:207)
at org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:372)
at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:220)
at org.apache.hadoop.hbase.regionserver.HStore.getScanner(HStore.java:2164)
at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.initializeScanners(HRegion.java:5916)
at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.<init>(HRegion.java:5890)
at org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:2739)
at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2719)
at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:7197)
at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:7156)
at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:7149)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2250)
at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:35068)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2373)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:188)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:168)
Caused by: java.io.IOException: Invalid HFile block magic: \x00\x00\x00\x00\x00\x00\x00\x00
at org.apache.hadoop.hbase.io.hfile.BlockType.parse(BlockType.java:155)
at org.apache.hadoop.hbase.io.hfile.BlockType.read(BlockType.java:167)
at org.apache.hadoop.hbase.io.hfile.HFileBlock.<init>(HFileBlock.java:345)
at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1821)
at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1624)
at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:455)
at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.seekTo(HFileReaderV2.java:1263)
at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:297)
at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:189)
... 16 more


{noformat}
The "Invalid HFile block magic" exception was thrown all the way back to the client.
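The failing check itself is easy to picture. Here is a minimal, hypothetical sketch (not HBase's actual {{BlockType}} code) of comparing a block's leading bytes against an expected 8-byte magic; the all-zero block our bad disk produced can never match, which is exactly the "Invalid HFile block magic: \x00..." failure above. The magic value "DATABLK*" is assumed here for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class MagicCheck {
    // Hypothetical 8-byte magic marking an HFile data block on disk.
    static final byte[] DATA_MAGIC = "DATABLK*".getBytes(StandardCharsets.US_ASCII);

    // A block is plausible only if its first 8 bytes match the expected magic.
    static boolean hasValidMagic(byte[] block) {
        if (block.length < DATA_MAGIC.length) {
            return false;
        }
        byte[] head = Arrays.copyOfRange(block, 0, DATA_MAGIC.length);
        return Arrays.equals(head, DATA_MAGIC);
    }

    public static void main(String[] args) {
        byte[] goodBlock = "DATABLK*...payload...".getBytes(StandardCharsets.US_ASCII);
        byte[] zeroBlock = new byte[64]; // what the bad disk returned: all \x00

        System.out.println(hasValidMagic(goodBlock)); // true
        System.out.println(hasValidMagic(zeroBlock)); // false -> "Invalid HFile block magic"
    }
}
```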

The first stack trace indicates that the HDFS short-circuit read failed, after which the client fell back to an HDFS local RPC read, which succeeded at the HDFS level.

The underlying cause was a write error on one of the datanode's disks. The datanode was receiving and mirroring data (to downstream nodes in the pipeline) just fine, but was writing garbage (all \x00 bytes) to disk. The datanode had not yet detected the disk as bad.

We have HBase checksums turned on and HDFS checksum verification turned off. While reading this corrupt block, the datanode therefore performed no checksum validation of its own, streamed the data as-is to the region server, and relied on HBase checksum validation to catch the corruption. But HBase checksum validation has a special case: if the checksumType byte is 0, it is treated as NULL and checksum validation is considered unnecessary for that block. Refer to [ChecksumUtil|https://github.com/apache/hbase/blob/branch-1.3/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/ChecksumUtil.java#L95] and [ChecksumType|https://github.com/apache/hbase/blob/branch-1.3/hbase-common/src/main/java/org/apache/hadoop/hbase/util/ChecksumType.java#L33]. Thanks [~lhofhansl] and [~karanmehta93] for help debugging this.
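The special case described above can be sketched as follows (a hypothetical simplification, not the actual {{ChecksumUtil}} code; the header offset is invented for illustration). The trap is that an all-zero block necessarily has a checksum-type byte of 0, which reads as NULL, so validation is skipped for exactly the blocks that most need it:

```java
public class ChecksumBypass {
    // Sketch of the branch-1.3 behavior: checksum-type code 0 means NULL,
    // i.e. "no checksum was written for this block, skip validation".
    static final byte NULL_CHECKSUM_CODE = 0;

    static boolean needsChecksumValidation(byte checksumTypeByte) {
        return checksumTypeByte != NULL_CHECKSUM_CODE;
    }

    public static void main(String[] args) {
        byte[] corruptBlock = new byte[64]; // all \x00 from the bad disk

        // Suppose the checksum-type byte lives at some header offset
        // (offset 0 here is hypothetical). In an all-zero block it reads
        // as 0, so validation is skipped and the garbage sails through
        // to the HFile-magic check, where it finally blows up.
        byte checksumType = corruptBlock[0];
        System.out.println(needsChecksumValidation(checksumType)); // false
    }
}
```

A fix would need to distinguish "checksum deliberately disabled" from "header bytes happen to be zero", e.g. by validating the rest of the block header before trusting the NULL checksum code.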



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
