[ 
https://issues.apache.org/jira/browse/HBASE-29627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wellington Chevreuil updated HBASE-29627:
-----------------------------------------
    Description: 
One of our customers faced a situation where blocks were getting cached 
uncompressed into the bucket cache (due to the separate issue reported in 
HBASE-29623) following flushes or compactions. At read time, during cache 
retrieval in HFileReaderImpl, decoding of the block then fails, throwing the 
uncaught exception below and failing the read indefinitely: 
{noformat}
2025-09-17 06:38:25,607 ERROR org.apache.hadoop.hbase.regionserver.CompactSplit: Compaction failed region=XXXXXXXXXXXXXXXXXXXXXX,1721528038124.a3012627f502c78738430343b0b54966., storeName=a3012627f502c78738430343b0b54966/0, priority=45, startTime=1758091104691
java.lang.IllegalArgumentException: There is no data block encoder for given id '5'
        at org.apache.hadoop.hbase.io.encoding.DataBlockEncoding.getEncodingById(DataBlockEncoding.java:157)
        at org.apache.hadoop.hbase.io.hfile.HFileBlock.getDataBlockEncoding(HFileBlock.java:2003)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.getCachedBlock(HFileReaderImpl.java:1122)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.readBlock(HFileReaderImpl.java:1288)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.readBlock(HFileReaderImpl.java:1249)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.readNextDataBlock(HFileReaderImpl.java:750)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$EncodedScanner.next(HFileReaderImpl.java:1528)
        at org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:194)
        at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:112)
        at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:681)
        at org.apache.hadoop.hbase.regionserver.compactions.Compactor.performCompaction(Compactor.java:437)
        at org.apache.hadoop.hbase.regionserver.compactions.Compactor.compact(Compactor.java:354)
{noformat}

I suspect this is mostly related to an error in calculating the initial offset 
of the meta space, caused by extra bytes in the byte buffer (something along 
the lines of this [comment from 
HBASE-27053|https://issues.apache.org/jira/browse/HBASE-27053?focusedCommentId=17564026&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17564026]).
 If the buffer has extra bytes, we could miscalculate the meta space offset 
and read a wrong byte value for the "usesChecksum" flag, which would then lead 
to a wrong header size calculation (24 bytes without checksum, 33 bytes with 
checksum) and, in turn, to a wrong position for reading the encoding type 
short. 
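
To make the suspected mechanism concrete, here is a minimal, self-contained 
Java sketch of that failure mode. The layout, constants, and names are 
illustrative assumptions modelled on the description above, not the actual 
HFileBlock serialization format:
{code:java}
import java.nio.ByteBuffer;

/**
 * Hypothetical model of the suspected bug: an extra trailing byte in the
 * cached buffer shifts where the "usesChecksum" flag is read from, so the
 * wrong header size is used and a stray header byte gets decoded as the
 * data block encoding id. Names and layout are illustrative only.
 */
public class OffsetShiftSketch {
  // Header sizes quoted in the description above.
  static final int HEADER_SIZE_NO_CHECKSUM = 24;
  static final int HEADER_SIZE_WITH_CHECKSUM = 33;
  // For simplicity, the meta space here is just the one flag byte.
  static final int META_SPACE_SIZE = 1;

  static short readEncodingId(ByteBuffer buf) {
    // The meta space offset is derived from the buffer's limit, so any
    // unexpected trailing byte shifts this read.
    int metaOffset = buf.limit() - META_SPACE_SIZE;
    boolean usesChecksum = buf.get(metaOffset) != 0;
    int headerSize = usesChecksum ? HEADER_SIZE_WITH_CHECKSUM : HEADER_SIZE_NO_CHECKSUM;
    // The encoding id is assumed to sit right after the header.
    return buf.getShort(headerSize);
  }

  public static void main(String[] args) {
    // Well-formed buffer: 33-byte header, encoding id 4 right after it,
    // and the usesChecksum flag (1) as the last byte.
    ByteBuffer good = ByteBuffer.allocate(36);
    good.putShort(24, (short) 5); // stray header bytes, never meant to be read
    good.putShort(33, (short) 4); // the real encoding id
    good.put(35, (byte) 1);       // usesChecksum = true
    System.out.println("good: encoding id = " + readEncodingId(good)); // prints 4

    // Same bytes plus one extra trailing byte: the flag read lands on the
    // extra 0, the header size drops to 24, and the stray value 5 is decoded
    // as the encoding id, matching the exception in the report.
    ByteBuffer bad = ByteBuffer.allocate(37);
    bad.put(good.array(), 0, 36);
    System.out.println("bad:  encoding id = " + readEncodingId(bad)); // prints 5
  }
}
{code}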

Unfortunately, I could not reproduce this issue in a controlled test 
environment. However, I think we should still change HFileReaderImpl to handle 
any possible exception thrown while retrieving blocks from the cache, so that 
instead of failing the whole read operation, it evicts the given corrupt block 
from the cache and falls back to reading it from the file system.
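
As a rough sketch of the proposed handling, here is a self-contained, 
simplified model of that read path. The class, method, and type names are 
placeholders standing in for HFileReaderImpl's getCachedBlock/readBlock flow 
and the block cache API; only the catch/evict/fall-back shape is the point:
{code:java}
import java.io.IOException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/**
 * Hypothetical sketch of the proposed defensive read path. Block stands in
 * for HFileBlock, the map stands in for the block cache, and the method
 * names are placeholders, not the actual HBase code.
 */
public class CacheFallbackSketch {
  static class Block {
    final byte[] data;
    Block(byte[] data) { this.data = data; }
  }

  final ConcurrentMap<String, Block> cache = new ConcurrentHashMap<>();

  Block readBlock(String cacheKey) throws IOException {
    try {
      Block cached = decodeCachedBlock(cacheKey);
      if (cached != null) {
        return cached;
      }
    } catch (RuntimeException e) {
      // The core of the proposal: a failure while decoding a cached block
      // (e.g. the IllegalArgumentException for an invalid encoding id seen
      // in the stack trace above) must not fail the whole read. Evict the
      // presumably corrupt entry and fall through to a file system read.
      cache.remove(cacheKey);
    }
    return readFromFileSystem(cacheKey);
  }

  // Placeholder for HFileReaderImpl.getCachedBlock: may throw if the cached
  // bytes are corrupt.
  Block decodeCachedBlock(String cacheKey) {
    return cache.get(cacheKey);
  }

  // Placeholder for the normal read-from-HFile path.
  Block readFromFileSystem(String cacheKey) throws IOException {
    return new Block(new byte[0]);
  }
}
{code}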

> Handle any block cache fetching errors when reading a block in HFileReaderImpl
> ------------------------------------------------------------------------------
>
>                 Key: HBASE-29627
>                 URL: https://issues.apache.org/jira/browse/HBASE-29627
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Wellington Chevreuil
>            Assignee: Wellington Chevreuil
>            Priority: Major
>              Labels: pull-request-available



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
