[
https://issues.apache.org/jira/browse/HBASE-29627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wellington Chevreuil updated HBASE-29627:
-----------------------------------------
Description:
One of our customers faced a situation where blocks were getting cached
uncompressed into the bucket cache (due to the separate issue reported in
HBASE-29623) following flushes or compactions. At read time, during cache
retrieval in HFileReaderImpl, decoding of the cached block then fails with the
uncaught exception below, failing the read indefinitely:
{noformat}
2025-09-17 06:38:25,607 ERROR org.apache.hadoop.hbase.regionserver.CompactSplit: Compaction failed region=XXXXXXXXXXXXXXXXXXXXXX,1721528038124.a3012627f502c78738430343b0b54966., storeName=a3012627f502c78738430343b0b54966/0, priority=45, startTime=1758091104691
java.lang.IllegalArgumentException: There is no data block encoder for given id '5'
  at org.apache.hadoop.hbase.io.encoding.DataBlockEncoding.getEncodingById(DataBlockEncoding.java:157)
  at org.apache.hadoop.hbase.io.hfile.HFileBlock.getDataBlockEncoding(HFileBlock.java:2003)
  at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.getCachedBlock(HFileReaderImpl.java:1122)
  at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.readBlock(HFileReaderImpl.java:1288)
  at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.readBlock(HFileReaderImpl.java:1249)
  at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.readNextDataBlock(HFileReaderImpl.java:750)
  at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$EncodedScanner.next(HFileReaderImpl.java:1528)
  at org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:194)
  at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:112)
  at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:681)
  at org.apache.hadoop.hbase.regionserver.compactions.Compactor.performCompaction(Compactor.java:437)
  at org.apache.hadoop.hbase.regionserver.compactions.Compactor.compact(Compactor.java:354)
{noformat}
I suspect this is mostly related to an error in calculating the meta_space
initial offset, caused by some extra byte in the byte buffer (something along
the lines of this [comment from
HBASE-27053|https://issues.apache.org/jira/browse/HBASE-27053?focusedCommentId=17564026&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17564026]).
If the buffer has extra bytes, we could miscalculate the meta space offset and
read a wrong byte value for the "usesChecksum" flag, which would then lead to a
wrong header size calculation (24 without checksum, 33 with checksum) and, in
turn, to a wrong position for reading the encoding type short.
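To make the suspected failure mode concrete, below is a minimal sketch of how a
shifted meta space offset can turn into a bogus encoding id. The class, offsets
and field layout are hypothetical, for illustration only, and are not the
actual HFileBlock deserialization code:
{code:java}
import java.nio.ByteBuffer;

public class EncodingOffsetSketch {
  // Assumed serialized header sizes, per the description above.
  static final int HEADER_SIZE_NO_CHECKSUM = 24;
  static final int HEADER_SIZE_WITH_CHECKSUM = 33;

  static short readEncodingId(ByteBuffer cachedBlock, int metaSpaceOffset) {
    // If metaSpaceOffset is off by a few bytes (e.g. an extra trailing byte in
    // the backing buffer), this flag byte is read from the wrong position.
    boolean usesChecksum = cachedBlock.get(metaSpaceOffset) != 0;
    int headerSize = usesChecksum ? HEADER_SIZE_WITH_CHECKSUM : HEADER_SIZE_NO_CHECKSUM;
    // A wrong headerSize then points this read at arbitrary bytes, yielding an
    // encoding id (such as '5') that no DataBlockEncoding maps to.
    return cachedBlock.getShort(headerSize);
  }
}
{code}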
Unfortunately, I could not reproduce this issue in a controlled test
environment. However, I think we should still change HFileReaderImpl to handle
any exception thrown while retrieving blocks from the cache: instead of failing
the whole read operation, it should evict the corrupt block from the cache and
fall back to reading it from the file system, as sketched below.
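Roughly, the handling I have in mind would look like this. It is a
simplification, not the actual HFileReaderImpl change; the Callable/Runnable
parameters stand in for the real cache lookup, eviction and HFile read paths:
{code:java}
import java.util.concurrent.Callable;

public class CacheFallbackSketch {
  // Any exception thrown while fetching/decoding a cached block leads to
  // evicting that cache entry and falling back to the file-system read,
  // instead of failing the whole read operation.
  public static <B> B readWithCacheFallback(Callable<B> cacheLookup,
      Runnable evictCorruptEntry, Callable<B> fsRead) throws Exception {
    B block = null;
    try {
      block = cacheLookup.call();
    } catch (Exception e) {
      // A corrupt cached block (e.g. the "no data block encoder for given id
      // '5'" failure above) should not fail the read; drop it from the cache.
      evictCorruptEntry.run();
    }
    // Cache miss or corrupt entry: resort to reading the block from the HFile.
    return block != null ? block : fsRead.call();
  }
}
{code}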
> Handle any block cache fetching errors when reading a block in HFileReaderImpl
> ------------------------------------------------------------------------------
>
> Key: HBASE-29627
> URL: https://issues.apache.org/jira/browse/HBASE-29627
> Project: HBase
> Issue Type: Improvement
> Reporter: Wellington Chevreuil
> Assignee: Wellington Chevreuil
> Priority: Major
> Labels: pull-request-available
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)