Eungsop Yoo created HBASE-29299:
-----------------------------------
Summary: Reopen initialReader of HStoreFile to refresh metadata when a read fails
Key: HBASE-29299
URL: https://issues.apache.org/jira/browse/HBASE-29299
Project: HBase
Issue Type: Bug
Reporter: Eungsop Yoo
Assignee: Eungsop Yoo
I discovered an issue while testing Erasure Coding. If more DataNodes go down
than the number of parity units, the Scan naturally fails; with the RS-3-2
policy used here (3 data units, 2 parity units per block group), losing 3 of
the 5 DataNodes holding a block group leaves too few chunks to reconstruct the
data, which matches the "3 missing blocks" error in the log below. However,
even after the downed DataNodes are restarted, the Scan continues to fail. The
issue does not occur every time, but it reproduces with high probability. The
root cause is that the initialReader inside the HStoreFile holds cached HDFS
block-location metadata that is never refreshed. Therefore, I modified the
logic to close the initialReader and reopen it when a read fails, as sketched
below.
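The change is conceptually small. Here is a minimal, self-contained sketch of
the reopen-on-failure idea (the ReopeningReader wrapper and its method names
are illustrative only, not the actual patch; in the real fix the equivalent
logic would live in HStoreFile around initialReader): on an IOException, close
the cached reader, open a fresh one so the new input stream re-fetches block
locations from the NameNode, and retry the read once.
{code:java}
import java.io.IOException;

// Illustrative sketch only -- not the actual HBASE-29299 patch. It models the
// idea from the description: when a read through a cached reader fails, close
// the reader and open a fresh one so stale HDFS block-location metadata is
// re-fetched, then retry the read once.
final class ReopeningReader<R extends AutoCloseable> {

  interface ReaderFactory<R> {
    R open() throws IOException;
  }

  interface ReadOp<R, T> {
    T apply(R reader) throws IOException;
  }

  private final ReaderFactory<R> factory;
  private R reader;

  ReopeningReader(ReaderFactory<R> factory) throws IOException {
    this.factory = factory;
    this.reader = factory.open();
  }

  synchronized <T> T read(ReadOp<R, T> op) throws IOException {
    try {
      return op.apply(reader);
    } catch (IOException first) {
      // The cached reader may hold stale block locations (e.g. after DataNode
      // restarts under Erasure Coding). Drop it and open a new reader; the new
      // stream re-resolves block locations, analogous to closing and
      // re-initializing HStoreFile's initialReader.
      try {
        reader.close();
      } catch (Exception suppressed) {
        first.addSuppressed(suppressed);
      }
      reader = factory.open();
      return op.apply(reader); // single retry; rethrows if it still fails
    }
  }
}
{code}
A single retry keeps the behavior close to the description: the first failure
triggers exactly one reopen, and a second failure propagates to the caller as
before.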
Here is the log captured when the scan fails:
{code}
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=8, exceptions:
2025-05-07T08:17:57.123Z, RpcRetryingCaller{globalStartTime=2025-05-07T08:17:57.084Z, pause=100, maxAttempts=8}, java.io.IOException: java.io.IOException: Could not seek StoreFileScanner[HFileScanner for reader reader=hdfs://hbase-alpha25/hbase/data/default/test1/8a9fd0285a94ed3a8a16f595842e17fa/c/0ca5ca4cd7d14fe993d19e4632b2fb52, compression=none, cacheConf=cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, prefetchOnOpen=false, firstKey=Optional[user00000000000000000000000000000000256006064453599002/c:field0/1745905948161/Put/seqid=0], lastKey=Optional[user00000000000000000000000000000000511999723045682420/c:field3/1745905845638/Put/seqid=0], avgKeyLen=73, avgValueLen=30, entries=134592, length=15040759, cur=null] to key org.apache.hadoop.hbase.PrivateCellUtil$FirstOnRowDeleteFamilyCell@1e25b769
    at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:232)
    at org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:416)
    at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:260)
    at org.apache.hadoop.hbase.regionserver.HStore.createScanner(HStore.java:1712)
    at org.apache.hadoop.hbase.regionserver.HStore.getScanner(HStore.java:1703)
    at org.apache.hadoop.hbase.regionserver.RegionScannerImpl.initializeScanners(RegionScannerImpl.java:166)
    at org.apache.hadoop.hbase.regionserver.RegionScannerImpl.<init>(RegionScannerImpl.java:146)
    at org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:3019)
    at org.apache.hadoop.hbase.regionserver.HRegion.lambda$getScanner$3(HRegion.java:3004)
    at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:216)
    at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2990)
    at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2985)
    at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2979)
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.newRegionScanner(RSRpcServices.java:3203)
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3580)
    at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:45006)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:415)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
    at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:102)
    at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:82)
Caused by: java.io.IOException: Encountered an exception when invoking ByteBuffer positioned read when trying to read 0 bytes from position 0
    at org.apache.hadoop.hbase.io.util.BlockIOUtils.preadWithExtraDirectly(BlockIOUtils.java:368)
    at org.apache.hadoop.hbase.io.util.BlockIOUtils.preadWithExtra(BlockIOUtils.java:311)
    at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readAtOffset(HFileBlock.java:1481)
    at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1719)
    at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1519)
    at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.readBlock(HFileReaderImpl.java:1331)
    at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.readBlock(HFileReaderImpl.java:1252)
    at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.readAndUpdateNewBlock(HFileReaderImpl.java:943)
    at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.seekTo(HFileReaderImpl.java:932)
    at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:311)
    at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:214)
    ... 19 more
Caused by: java.lang.reflect.InvocationTargetException
    at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:118)
    at java.base/java.lang.reflect.Method.invoke(Method.java:580)
    at org.apache.hadoop.hbase.io.util.BlockIOUtils.preadWithExtraDirectly(BlockIOUtils.java:363)
    ... 29 more
Caused by: java.io.IOException: 3 missing blocks, the stripe is: AlignedStripe(Offset=0, length=33, fetchedChunksNum=0, missingChunksNum=3); locatedBlocks is: LocatedBlocks{; fileLength=15040759; underConstruction=false; blocks=[LocatedStripedBlock{BP-5442367-10.202.27.120-1743751500104:blk_-9223372036854771360_190437; getBlockSize()=15040759; corrupt=false; offset=0; locs=[DatanodeInfoWithStorage[10.202.5.226:1004,DS-7207429b-7335-4e37-9848-4d9b88ab83e0,DISK], DatanodeInfoWithStorage[10.202.4.17:1004,DS-42a094e9-a2df-4317-b7bb-7685c3a4e13e,DISK], DatanodeInfoWithStorage[10.203.21.242:1004,DS-19008e58-49f0-4820-945a-0533a6fb4d0a,DISK], DatanodeInfoWithStorage[10.202.15.79:1004,DS-1816a8ec-7dbd-4889-afc7-063ecb6521ec,DISK], DatanodeInfoWithStorage[10.202.12.73:1004,DS-d5c48efe-cfdd-4ea9-9eb3-59af4afe3824,DISK]]; indices=[0, 1, 2, 3, 4]}]; lastLocatedBlock=LocatedStripedBlock{BP-5442367-10.202.27.120-1743751500104:blk_-9223372036854771360_190437; getBlockSize()=15040759; corrupt=false; offset=0; locs=[DatanodeInfoWithStorage[10.202.5.226:1004,DS-7207429b-7335-4e37-9848-4d9b88ab83e0,DISK], DatanodeInfoWithStorage[10.202.4.17:1004,DS-42a094e9-a2df-4317-b7bb-7685c3a4e13e,DISK], DatanodeInfoWithStorage[10.203.21.242:1004,DS-19008e58-49f0-4820-945a-0533a6fb4d0a,DISK], DatanodeInfoWithStorage[10.202.15.79:1004,DS-1816a8ec-7dbd-4889-afc7-063ecb6521ec,DISK], DatanodeInfoWithStorage[10.202.12.73:1004,DS-d5c48efe-cfdd-4ea9-9eb3-59af4afe3824,DISK]]; indices=[0, 1, 2, 3, 4]}; isLastBlockComplete=true; ecPolicy=ErasureCodingPolicy=[Name=RS-3-2-1024k, Schema=[ECSchema=[Codec=rs, numDataUnits=3, numParityUnits=2]], CellSize=1048576, Id=2]}
    at org.apache.hadoop.hdfs.StripeReader.checkMissingBlocks(StripeReader.java:180)
    at org.apache.hadoop.hdfs.StripeReader.readDataForDecoding(StripeReader.java:198)
    at org.apache.hadoop.hdfs.StripeReader.readStripe(StripeReader.java:344)
    at org.apache.hadoop.hdfs.DFSStripedInputStream.fetchBlockByteRange(DFSStripedInputStream.java:506)
    at org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1499)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1708)
    at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:259)
    at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
    ... 31 more
{code}