[
https://issues.apache.org/jira/browse/HBASE-28638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Viraj Jasani updated HBASE-28638:
---------------------------------
Description:
In a recent incident, some regions suffered an availability drop of more than
5 minutes because, before the active master could initiate SCP for the dead
server, some region moves tried to assign regions to the already dead
regionserver. Sometimes, due to transient issues, the active master is only
notified of the dead server after a few minutes (5+ minutes in this case).
{code:java}
2024-05-08 03:47:38,518 WARN [RSProcedureDispatcher-pool-4790]
procedure.RSProcedureDispatcher - request to host1,61020,1713411866443 failed
due to org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Call to
address=host1:61020 failed on local exception:
org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Connection
closed, try=0, retrying... {code}
Because retries are infinite here, the dispatcher kept retrying.
SCP could be initiated only after the active master discovered that the
server was dead:
{code:java}
2024-05-08 03:50:01,038 DEBUG [RegionServerTracker-0] master.DeadServer -
Processing host1,61020,1713411866443; numProcessing=1
2024-05-08 03:50:01,038 INFO [RegionServerTracker-0]
master.RegionServerTracker - RegionServer ephemeral node deleted, processing
expiration [host1,61020,1713411866443] {code}
leading to
{code:java}
2024-05-08 03:50:02,313 DEBUG [RSProcedureDispatcher-pool-4833]
assignment.RegionRemoteProcedureBase - pid=54800701, ppid=54800691,
state=RUNNABLE; OpenRegionProcedure 5cafbe54d5685acc6c4866758e67fd51,
server=host1,61020,1713411866443 for region state=OPENING,
location=host1,61020,1713411866443, table=T1,
region=5cafbe54d5685acc6c4866758e67fd51, targetServer host1,61020,1713411866443
is dead, SCP will interrupt us, give up {code}
The entire outage could have been avoided if we failed fast on connection-drop
errors.
*Problem Statement:*
Master-initiated remote procedures are scheduled by RSProcedureDispatcher. If
it encounters specific errors on the first attempt (e.g.
CallQueueTooBigException or SaslException), the remote call is guaranteed not
to have reached the regionserver, so the call is marked as failed, prompting
the parent procedure to select a different target regionserver and resume the
operation.
If the first attempt does not hit one of these errors, RSProcedureDispatcher
continues with infinite retries. We can then encounter a legitimate failure
(e.g. ConnectionClosedException) that stalls the remote operation. Without
manual intervention, this can delay the region-in-transition by several minutes
or even hours.
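The behaviour described in the problem statement can be sketched roughly as follows. This is an illustration only, not the RSProcedureDispatcher source: the class `DispatchDecisionSketch`, the `Decision` enum, and the stand-in `NotReachedException` (representing errors such as CallQueueTooBigException or SaslException) are all hypothetical names.

```java
import java.io.IOException;

/** Hypothetical sketch of the retry decision described above; not HBase code. */
final class DispatchDecisionSketch {

  /** Possible outcomes after a remote call fails. */
  enum Decision { FAIL_REMOTE_CALL, RETRY }

  /** Stand-in for errors that prove the request never reached the regionserver. */
  static class NotReachedException extends IOException {
    NotReachedException(String msg) {
      super(msg);
    }
  }

  /**
   * @param attemptsSoFar attempts already made before this one (0 on the first attempt)
   * @param error the error returned by the latest attempt
   */
  static Decision decide(int attemptsSoFar, IOException error) {
    // Only a first-attempt failure that provably never reached the server is
    // failed outright, so the parent procedure can pick another target.
    if (attemptsSoFar == 0 && error instanceof NotReachedException) {
      return Decision.FAIL_REMOTE_CALL;
    }
    // Everything else, including a closed connection, is retried with no upper bound.
    return Decision.RETRY;
  }
}
```

Under these assumptions, a connection-closed error is retried no matter how many attempts have already been made, which matches the `try=0, retrying...` log line above.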
*Proposed Solution:*
The purpose of this Jira is to impose a retry limit for specific error types,
so that once the limit is reached, the master can recover from the failing
remote call by initiating SCP (ServerCrashProcedure) on the target server. The
SCP will override the TRSP (TransitRegionStateProcedure) if required. This
ensures that the target server has no regions online before we suspend the
ongoing TRSP.
Scheduling SCP for the target server will always leave the regionserver in a
stopped state: either the regionserver is stopped automatically, or, if it is
still able to send its region report to the master, the master rejects the
report, which in turn causes the regionserver to abort.
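A minimal sketch of the retry-limit idea, under stated assumptions: the class `RetryLimitSketch`, the `maxAttempts` parameter, and the `scheduleServerCrash` callback are hypothetical names used for illustration. The callback merely stands in for whatever mechanism the master uses to schedule SCP; the actual patch may be structured quite differently.

```java
import java.util.function.Consumer;

/** Hypothetical sketch of a bounded retry that escalates to a server crash. */
final class RetryLimitSketch {
  private final int maxAttempts;
  // Stand-in for scheduling SCP on the target server (assumption, not an HBase API).
  private final Consumer<String> scheduleServerCrash;
  private int failedAttempts = 0;

  RetryLimitSketch(int maxAttempts, Consumer<String> scheduleServerCrash) {
    this.maxAttempts = maxAttempts;
    this.scheduleServerCrash = scheduleServerCrash;
  }

  /**
   * Records one connection-drop failure against the given server.
   *
   * @return true if the caller should retry, false once the limit is reached
   *         and recovery has been handed over to the crash callback
   */
  boolean onConnectionDropFailure(String serverName) {
    failedAttempts++;
    if (failedAttempts >= maxAttempts) {
      // Limit reached: stop retrying and let server-crash handling take over.
      scheduleServerCrash.accept(serverName);
      return false;
    }
    return true;
  }
}
```

With `maxAttempts` set to 3, the first two failures return true (keep retrying) and the third invokes the callback once and returns false. Which error types count toward the limit, and what the limit should be, are left open here.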
was:
As per one of the recent incidents, some regions faced 5+ minute of
availability drop because before active master could initiate SCP for the dead
server, some region moves tried to assign regions on the already dead
regionserver. Sometimes, due to transient issues, we see that active master
gets notified after few minutes (5+ minute in this case).
{code:java}
2024-05-08 03:47:38,518 WARN [RSProcedureDispatcher-pool-4790]
procedure.RSProcedureDispatcher - request to host1,61020,1713411866443 failed
due to org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Call to
address=host1:61020 failed on local exception:
org.apache.hadoop.hbase.exceptions.ConnectionClosedException: Connection
closed, try=0, retrying... {code}
And as we know, we have infinite retries here, so it kept going on..
Eventually, SCP could be initiated only after active master discovered the
server as dead:
{code:java}
2024-05-08 03:50:01,038 DEBUG [RegionServerTracker-0] master.DeadServer -
Processing host1,61020,1713411866443; numProcessing=1
2024-05-08 03:50:01,038 INFO [RegionServerTracker-0]
master.RegionServerTracker - RegionServer ephemeral node deleted, processing
expiration [host1,61020,1713411866443] {code}
leading to
{code:java}
2024-05-08 03:50:02,313 DEBUG [RSProcedureDispatcher-pool-4833]
assignment.RegionRemoteProcedureBase - pid=54800701, ppid=54800691,
state=RUNNABLE; OpenRegionProcedure 5cafbe54d5685acc6c4866758e67fd51,
server=host1,61020,1713411866443 for region state=OPENING,
location=host1,61020,1713411866443, table=T1,
region=5cafbe54d5685acc6c4866758e67fd51, targetServer host1,61020,1713411866443
is dead, SCP will interrupt us, give up {code}
This entire duration of outage could be avoided if we can fail-fast for
connection drop errors.
*Problem Statement:*
Master initiated remote procedures are scheduled by RSProcedureDispatcher. If
it encounters specific errors on first retry (e.g. CallQueueTooBigException or
SaslException), it is guaranteed that the remote call has not reached the
regionserver, therefore the remote call is marked failed prompting the parent
procedure to select different target regionserver to resume the operation.
If the first attempt is successful, RSProcedureDispatcher continues with
infinite retries. We can encounter valid case (e.g. ConnectionClosedException)
which is halting the remote operation. Without manual intervention, it can
cause significant delay upto several minutes or hours to the
region-in-transition.
The purpose of this Jira is to impose retry limit for specific error types such
that if the retry limit is reached, the master can recover the state of the
ongoing remote call failure by initiating SCP (ServerCrashProcedure) on the
target server. The SCP is going to override the TRSP
(TransitRegionStateProcedure) if required. This can ensure that the target
server has no region hosted online before we suspend the ongoing TRSP.
Scheduling SCP for the target server will always lead to the regionserver in
stopped state. Either regionserver would be automatically stopped, or if the
regionserver is able to send the region report to master, master will reject
it, which will further lead to regionserver abort.
> Impose retry limit for specific errors to recover from remote procedure
> failure using server crash
> --------------------------------------------------------------------------------------------------
>
> Key: HBASE-28638
> URL: https://issues.apache.org/jira/browse/HBASE-28638
> Project: HBase
> Issue Type: Sub-task
> Components: amv2, master, Region Assignment
> Affects Versions: 3.0.0-beta-1, 2.6.1, 2.5.10
> Reporter: Viraj Jasani
> Assignee: Viraj Jasani
> Priority: Major
>