Wellington Chevreuil created HBASE-29627:
--------------------------------------------
Summary: Handle any block cache fetching errors when reading a block in HFileReaderImpl
Key: HBASE-29627
URL: https://issues.apache.org/jira/browse/HBASE-29627
Project: HBase
Issue Type: Improvement
Reporter: Wellington Chevreuil
Assignee: Wellington Chevreuil
One of our customers faced a situation where blocks were getting cached
uncompressed into the bucket cache (due to this other issue reported in
HBASE-29623) following flushes or compactions. Then at read time, during cache
retrieval in HFileReaderImpl, decoding of the block fails with the uncaught
exception below, failing the read indefinitely:
{noformat}
2025-09-17 06:38:25,607 ERROR org.apache.hadoop.hbase.regionserver.CompactSplit: Compaction failed region=XXXXXXXXXXXXXXXXXXXXXX,1721528038124.a3012627f502c78738430343b0b54966., storeName=a3012627f502c78738430343b0b54966/0, priority=45, startTime=1758091104691
java.lang.IllegalArgumentException: There is no data block encoder for given id '5'
  at org.apache.hadoop.hbase.io.encoding.DataBlockEncoding.getEncodingById(DataBlockEncoding.java:157)
  at org.apache.hadoop.hbase.io.hfile.HFileBlock.getDataBlockEncoding(HFileBlock.java:2003)
  at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.getCachedBlock(HFileReaderImpl.java:1122)
  at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.readBlock(HFileReaderImpl.java:1288)
  at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.readBlock(HFileReaderImpl.java:1249)
  at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.readNextDataBlock(HFileReaderImpl.java:750)
  at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$EncodedScanner.next(HFileReaderImpl.java:1528)
  at org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:194)
  at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:112)
  at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:681)
  at org.apache.hadoop.hbase.regionserver.compactions.Compactor.performCompaction(Compactor.java:437)
  at org.apache.hadoop.hbase.regionserver.compactions.Compactor.compact(Compactor.java:354)
{noformat}
I suspect this is mostly related to an error in calculating the meta_space initial
offset, due to some extra byte in the byte buffer (along the lines of this
[comment from
HBASE-27053|https://issues.apache.org/jira/browse/HBASE-27053?focusedCommentId=17564026&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17564026].
If the buffer has extra bytes, we could miscalculate the meta space offset and
read a wrong byte value for the "usesChecksum" flag. That would lead to a wrong
header size calculation (24 without checksum, 33 with checksum) and, in turn,
to wrong positioning when reading the encoding type short.
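To illustrate the failure mode, here is a minimal sketch (plain Java, with a hypothetical layout, NOT the actual HFileBlock serialization) of how a single extra byte in the buffer shifts a subsequent short read and yields a bogus encoding id:

```java
import java.nio.ByteBuffer;

public class OffsetShiftSketch {
  // Hypothetical layout: a 1-byte "usesChecksum" flag followed by a
  // 2-byte encoding id. The reader skips the flag to reach the id.
  public static short readEncodingId(ByteBuffer buf, int metaOffset) {
    return buf.getShort(metaOffset + 1); // skip the 1-byte flag
  }

  public static void main(String[] args) {
    ByteBuffer buf = ByteBuffer.allocate(8);
    buf.put((byte) 1);       // usesChecksum flag at offset 0
    buf.putShort((short) 3); // encoding id at offsets 1-2
    // Correct meta offset recovers the real id:
    System.out.println(readEncodingId(buf, 0)); // 3
    // A meta offset that is off by one byte reads straddling bytes
    // and produces an id that matches no registered encoder:
    System.out.println(readEncodingId(buf, 1)); // 768
  }
}
```

With the real (larger) header sizes involved, the same off-by-some-bytes positioning would explain landing on an arbitrary value such as the '5' seen in the exception.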
Unfortunately, I could not reproduce this issue in a controlled test
environment. However, I think we should still change HFileReaderImpl to handle
any possible exception happening when retrieving blocks from the cache, so that
instead of failing the whole read operation, it evicts the given corrupt block
from the cache and falls back to reading it from the file system.
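The proposed handling could look something like the following sketch (plain Java with hypothetical cache/loader stand-ins, not the actual HFileReaderImpl/BlockCache API): catch any exception from the cached-block path, evict the offending key, and fall through to the backing read:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class SafeCacheRead {
  // Hypothetical stand-in for the block cache, keyed by block offset.
  static final Map<Long, byte[]> cache = new ConcurrentHashMap<>();

  // Try the cache first; if decoding the cached copy throws for any
  // reason, evict the corrupt entry and fall back to the slower read
  // (the file system, in HBase terms) instead of failing the caller.
  static byte[] readBlock(long key, Function<Long, byte[]> fsRead,
                          Function<byte[], byte[]> decode) {
    byte[] cached = cache.get(key);
    if (cached != null) {
      try {
        return decode.apply(cached);
      } catch (RuntimeException e) {
        cache.remove(key); // evict the corrupt block
      }
    }
    byte[] fromFs = fsRead.apply(key); // re-read from the backing store
    cache.put(key, fromFs);            // re-cache the good copy
    return fromFs;                     // (decoding of the fs copy elided)
  }
}
```

In the sketch, a cache entry whose decode step throws (e.g. the IllegalArgumentException above) is silently replaced by a fresh file-system read, so the scan or compaction proceeds instead of failing indefinitely.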
--
This message was sent by Atlassian Jira
(v8.20.10#820010)