[
https://issues.apache.org/jira/browse/HBASE-28589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ZhenyuLi updated HBASE-28589:
-----------------------------
Description:
When an IOException occurs during response creation in
ServerCall.setResponse(), the method catches the exception, logs a warning, and
sets the response to null. The client then receives no response, or sees
connection errors, with no indication of what went wrong on the server side.
The incomplete fix for HBASE-14598 is one example of this behavior.
The original fix for HBASE-14598 addressed two aspects:
# When a Scan/Get RPC attempts to allocate an excessively large array that
could trigger an OutOfMemoryError (OOM), the server checks the array size
before allocation and throws a BufferOverflowException to prevent the OOM.
# The fix was intended to stop client retries for such failures by throwing a
DoNotRetryIOException when a BufferOverflowException occurs, since retrying
cannot resolve the underlying issue.
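The two aspects above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the HBase source: the nested DoNotRetryIOException is a local stand-in for the HBase class, and MAX_ARRAY_SIZE, the checkSizeAndGrow() signature, and encode() are assumptions made for the example.

```java
import java.io.IOException;
import java.nio.BufferOverflowException;

class SizeCheckSketch {
    /** Local stand-in for HBase's DoNotRetryIOException (assumption for this sketch). */
    static class DoNotRetryIOException extends IOException {
        DoNotRetryIOException(Throwable cause) {
            super(cause);
        }
    }

    /** Assumed upper bound on the array size; the real limit may differ. */
    static final long MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;

    /** Aspect 1: validate the requested size before allocating anything. */
    static void checkSizeAndGrow(long currentSize, long extra) {
        if (currentSize + extra > MAX_ARRAY_SIZE) {
            throw new BufferOverflowException();
        }
        // A real implementation would grow the buffer here.
    }

    /** Aspect 2: convert the overflow into an exception that tells clients not to retry. */
    static void encode(long currentSize, long extra) throws IOException {
        try {
            checkSizeAndGrow(currentSize, extra);
        } catch (BufferOverflowException e) {
            throw new DoNotRetryIOException(e);
        }
    }

    public static void main(String[] args) {
        try {
            encode(MAX_ARRAY_SIZE, 1);
        } catch (IOException e) {
            System.out.println(e.getClass().getSimpleName());
        }
    }
}
```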
*The Problem:* The DoNotRetryIOException is never propagated to the client
side. Here is the flow:
# ByteBufferOutputStream.checkSizeAndGrow() throws BufferOverflowException
# The exception propagates through the call stack:
** ByteBufferOutputStream.checkSizeAndGrow()
** encoder.write()
** encodeCellsTo() (Catches BufferOverflowException and turns it into
DoNotRetryIOException)
** this.cellBlockBuilder.buildCellBlockStream()
** call.setResponse()
# The DoNotRetryIOException is ultimately caught in call.setResponse(), where
it is logged but never sent back to the client.
# As a result, the response is null, the Netty connection is closed, and the
client keeps retrying indefinitely.
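The catch-and-swallow pattern in this flow can be illustrated with a small stand-alone sketch. Nothing here is actual HBase code: CellBlockSource, Response, and both methods are hypothetical names, and propagating() shows only one possible way to carry the exception back to the caller, not the fix chosen for this issue.

```java
import java.io.IOException;

class SetResponseSketch {
    /** Local stand-in for HBase's DoNotRetryIOException (assumption for this sketch). */
    static class DoNotRetryIOException extends IOException {
        DoNotRetryIOException(String message) {
            super(message);
        }
    }

    /** Hypothetical source of an encoded cell block that may fail while encoding. */
    interface CellBlockSource {
        byte[] build() throws IOException;
    }

    /** Hypothetical response: either a payload or an error to report to the client. */
    static final class Response {
        final byte[] payload;
        final IOException error;

        Response(byte[] payload, IOException error) {
            this.payload = payload;
            this.error = error;
        }
    }

    /** Mirrors the behavior described above: log a warning and leave the response null. */
    static Response swallowing(CellBlockSource source) {
        try {
            return new Response(source.build(), null);
        } catch (IOException e) {
            System.err.println("WARN: exception while creating response: " + e);
            return null; // the client never learns why the call failed
        }
    }

    /** One possible alternative: keep the exception so it can be sent to the client. */
    static Response propagating(CellBlockSource source) {
        try {
            return new Response(source.build(), null);
        } catch (IOException e) {
            return new Response(null, e);
        }
    }

    public static void main(String[] args) {
        CellBlockSource failing = () -> {
            throw new DoNotRetryIOException("cell block too large");
        };
        System.out.println(swallowing(failing) == null);
        System.out.println(propagating(failing).error instanceof DoNotRetryIOException);
    }
}
```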
*Current Status:* This issue still exists in the latest branches (3.0 and
2.6). In ServerCall.java, when ALLOCATOR_POOL_ENABLED_KEY
(hbase.ipc.server.reservoir.enabled) is set to false, the setResponse() method
follows the same problematic path. If a DoNotRetryIOException is thrown in
{{ByteBuffer b = this.cellBlockBuilder.buildCellBlock(this.connection.codec,
this.connection.compressionCodec, cells);}}, it is swallowed by the
setResponse() catch block and never reaches the client.
*Steps to Reproduce:*
# Set up a 3-node HBase cluster with 3 RegionServers
# Set hbase.ipc.server.reservoir.enabled to false to use ByteBufferOutputStream
# Inject a BufferOverflowException at
ByteBufferOutputStream.checkSizeAndGrow() to simulate an OOM condition
# Send a scan request
# Observe endless client retries
*Expected Behavior:* The DoNotRetryIOException should be propagated to the
client so that it stops retrying.
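Why propagation matters on the client side can be shown with a simplified retry loop. This is a sketch under stated assumptions, not the HBase client: RpcCall and attempts() are hypothetical, the nested DoNotRetryIOException is a local stand-in, and a closed connection is modeled as a plain IOException.

```java
import java.io.IOException;

class RetrySketch {
    /** Local stand-in for HBase's DoNotRetryIOException (assumption for this sketch). */
    static class DoNotRetryIOException extends IOException {
        DoNotRetryIOException(String message) {
            super(message);
        }
    }

    /** Hypothetical RPC invocation. */
    interface RpcCall {
        String call() throws IOException;
    }

    /** Returns how many attempts were made before success, fast failure, or exhaustion. */
    static int attempts(RpcCall rpc, int maxAttempts) {
        int attempts = 0;
        while (attempts < maxAttempts) {
            attempts++;
            try {
                rpc.call();
                return attempts; // success
            } catch (DoNotRetryIOException e) {
                return attempts; // server said retrying cannot help: stop now
            } catch (IOException e) {
                // Generic failure (for example a closed connection): try again.
            }
        }
        return attempts;
    }

    public static void main(String[] args) {
        RpcCall propagated = () -> {
            throw new DoNotRetryIOException("cell block too large");
        };
        RpcCall swallowed = () -> {
            throw new IOException("connection closed");
        };
        System.out.println(attempts(propagated, 5));
        System.out.println(attempts(swallowed, 5));
    }
}
```

With the exception propagated the loop stops after the first attempt; when the client only sees a closed connection it uses every attempt it is allowed.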
was:
I have discovered that the fix for HBASE-14598 does not completely resolve the
issue, and the problem persists in the latest branches (3.0 and 2.6).
The original fix for HBASE-14598 addressed two aspects:
# When a Scan/Get RPC attempts to allocate an excessively large array that
could trigger an OutOfMemoryError (OOM), it checks the array size before
allocation and throws a {{BufferOverflowException}} to prevent OOM.
# The fix intended to stop client retries for such failures by throwing a
{{DoNotRetryException}} when a {{BufferOverflowException}} occurs, as retrying
cannot resolve the underlying issue.
*The Problem:* The {{DoNotRetryException}} is never propagated to the client
side. Here's the issue flow:
# {{ByteBufferOutputStream.checkSizeAndGrow()}} throws
{{BufferOverflowException}}
# The exception propagates through the call stack:
** {{ByteBufferOutputStream.checkSizeAndGrow()}}
** {{encoder.write()}}
** {{encodeCellsTo()}} (catches {{BufferOverflowException}} and turns it into
{{DoNotRetryIOException}})
** {{this.cellBlockBuilder.buildCellBlockStream()}}
** {{call.setResponse()}}
# The {{DoNotRetryException}} is ultimately caught in call.setResponse, where
it is merely logged but not sent back to the client
# As a result, the client continues retrying indefinitely since the response
is null and netty connection will be closed.
*Current Status:* In the latest branches (3.0 and 2.6), this issue still
exists. In {{ServerCall.java}}, when {{ALLOCATOR_POOL_ENABLED_KEY}}
({{hbase.ipc.server.reservoir.enabled}}) is set to {{false}}, the
{{setResponse()}} method follows the same problematic path. If a
{{DoNotRetryException}} is thrown in
{{ByteBuffer b = this.cellBlockBuilder.buildCellBlock(this.connection.codec,
this.connection.compressionCodec, cells);}}, it gets swallowed in the
{{setResponse()}} catch block and never reaches the client.
*Steps to Reproduce:*
# Set up a 3-node HBase cluster with 3 RegionServers
# Set {{hbase.ipc.server.reservoir.enabled}} to {{false}} to use
{{ByteBufferOutputStream}}
# Inject a {{BufferOverflowException}} at
{{ByteBufferOutputStream.checkSizeAndGrow()}} to simulate an OOM condition
# Send a scan request
# Observe endless client retries
*Expected Behavior:* The {{DoNotRetryException}} should be properly propagated
to the client to prevent retry attempts.
> Server side DoNotRetryException not propagated to client
> --------------------------------------------------------
>
> Key: HBASE-28589
> URL: https://issues.apache.org/jira/browse/HBASE-28589
> Project: HBase
> Issue Type: Bug
> Components: IPC/RPC
> Affects Versions: 2.0.0, 2.4.0, 2.5.0, 2.6.0, 3.0.0
> Reporter: ZhenyuLi
> Priority: Major