Wellington Chevreuil created HBASE-29627:
--------------------------------------------

             Summary: Handle any block cache fetching errors when reading a 
block in HFileReaderImpl
                 Key: HBASE-29627
                 URL: https://issues.apache.org/jira/browse/HBASE-29627
             Project: HBase
          Issue Type: Improvement
            Reporter: Wellington Chevreuil
            Assignee: Wellington Chevreuil


One of our customers faced a situation where blocks were getting cached 
uncompressed into the bucket cache (due to the separate issue reported in 
HBASE-29623) following flushes or compactions. Then at read time, during cache 
retrieval in HFileReaderImpl, decoding the block fails with the uncaught 
exception below, failing the read indefinitely: 
{noformat}
2025-09-17 06:38:25,607 ERROR 
org.apache.hadoop.hbase.regionserver.CompactSplit: Compaction failed 
region=XXXXXXXXXXXXXXXXXXXXXX,1721528038124.a3012627f502c78738430343b0b54966., 
storeName=a3012627f502c78738430343b0b54966/0, priority=45, 
startTime=1758091104691
java.lang.IllegalArgumentException: There is no data block encoder for given id 
'5'
        at 
org.apache.hadoop.hbase.io.encoding.DataBlockEncoding.getEncodingById(DataBlockEncoding.java:157)
        at 
org.apache.hadoop.hbase.io.hfile.HFileBlock.getDataBlockEncoding(HFileBlock.java:2003)
        at 
org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.getCachedBlock(HFileReaderImpl.java:1122)
        at 
org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.readBlock(HFileReaderImpl.java:1288)
        at 
org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.readBlock(HFileReaderImpl.java:1249)
        at 
org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.readNextDataBlock(HFileReaderImpl.java:750)
        at 
org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$EncodedScanner.next(HFileReaderImpl.java:1528)
        at 
org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:194)
        at 
org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:112)
        at 
org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:681)
        at 
org.apache.hadoop.hbase.regionserver.compactions.Compactor.performCompaction(Compactor.java:437)
        at 
org.apache.hadoop.hbase.regionserver.compactions.Compactor.compact(Compactor.java:354)
 {noformat}

I suspect this is mostly related to an error in calculating the meta_space 
initial offset, due to some extra bytes in the byte buffer (something along the 
lines of this [comment from 
HBASE-27053|https://issues.apache.org/jira/browse/HBASE-27053?focusedCommentId=17564026&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17564026]).
 If the buffer has extra bytes, we could miscalculate the meta space offset, 
reading a wrong byte value for the "usesChecksum" flag, which would then lead 
to a wrong header size calculation (24 without checksum, 33 with checksum), in 
turn leading to a wrong position for reading the encoding type short. 
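
To illustrate the failure mode above, here is a minimal sketch (not HBase code; the class and method names are hypothetical) showing how misreading the "usesChecksum" flag shifts the offset at which the 2-byte encoding id is read, so an arbitrary byte pair gets interpreted as an encoding id:

```java
import java.nio.ByteBuffer;

public class OffsetSketch {
    // Header sizes as described above: 24 without checksum, 33 with checksum.
    static final int HEADER_SIZE_NO_CHECKSUM = 24;
    static final int HEADER_SIZE_WITH_CHECKSUM = 33;

    // Hypothetical helper: read the 2-byte encoding id that follows the header.
    static short readEncodingId(ByteBuffer buf, boolean usesChecksum) {
        int headerSize = usesChecksum ? HEADER_SIZE_WITH_CHECKSUM : HEADER_SIZE_NO_CHECKSUM;
        return buf.getShort(headerSize);
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(64);
        // Write the real encoding id at the checksum-header offset.
        buf.putShort(HEADER_SIZE_WITH_CHECKSUM, (short) 4);
        // Correct flag: reads the intended id.
        System.out.println(readEncodingId(buf, true));  // prints 4
        // Miscalculated flag: reads whatever bytes sit at the wrong offset.
        System.out.println(readEncodingId(buf, false)); // prints 0 here; arbitrary in practice
    }
}
```

With real block contents instead of a zeroed buffer, the misread short can be any value, which is consistent with the "no data block encoder for given id '5'" error in the stack trace.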

Unfortunately, I could not reproduce this issue in a controlled test 
environment. However, I think we should still change HFileReaderImpl to handle 
any exception thrown while retrieving blocks from the cache, so that instead 
of failing the whole read operation, it evicts the corrupt block from the 
cache and falls back to reading it from the file system.
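
The proposed defensive pattern could look something like the sketch below. This is not the actual HFileReaderImpl change; the interface and names (BlockCacheLike, readFromFs) are hypothetical stand-ins for the cache lookup and the filesystem read path:

```java
import java.util.function.Function;

public class SafeCacheRead {
    // Hypothetical minimal cache contract for the sketch.
    interface BlockCacheLike<K, V> {
        V get(K key);     // may throw while decoding a corrupt entry
        void evict(K key);
    }

    // On any cache decoding failure, evict the suspect entry and fall back
    // to the authoritative source (the filesystem, in HFileReaderImpl's case).
    static <K, V> V readBlock(BlockCacheLike<K, V> cache, K key, Function<K, V> readFromFs) {
        try {
            V cached = cache.get(key);
            if (cached != null) {
                return cached;
            }
        } catch (RuntimeException e) {
            // Drop the corrupt block instead of failing the whole read.
            cache.evict(key);
        }
        return readFromFs.apply(key);
    }

    public static void main(String[] args) {
        // A cache whose entry always fails to decode, mimicking the reported error.
        BlockCacheLike<String, String> corrupt = new BlockCacheLike<String, String>() {
            public String get(String key) {
                throw new IllegalArgumentException("There is no data block encoder for given id '5'");
            }
            public void evict(String key) { /* no-op for the sketch */ }
        };
        // The read still succeeds via the filesystem path.
        System.out.println(readBlock(corrupt, "block-1", k -> "fs-copy-of-" + k));
    }
}
```

The key design point is that the cache is only an optimization: any failure on the cache path should degrade to a filesystem read rather than surface to the caller.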



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
