[ 
https://issues.apache.org/jira/browse/HBASE-27951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-27951:
------------------------------
    Fix Version/s:     (was: 4.0.0-alpha-1)

> Use ADMIN_QOS in MasterRpcServices for regionserver operational dependencies
> ----------------------------------------------------------------------------
>
>                 Key: HBASE-27951
>                 URL: https://issues.apache.org/jira/browse/HBASE-27951
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.4.10
>            Reporter: Andrew Kyle Purtell
>            Assignee: Andrew Kyle Purtell
>            Priority: Major
>             Fix For: 2.6.0, 2.4.18, 2.5.6, 3.0.0-beta-1
>
>
> Analysis of a recent production incident is not yet complete but an item of 
> note is an apparent deadlock. Imagine you are gracefully draining a 
> regionserver by way of a flurry of moveRegion requests. The handler for 
> moveRegion submits a TRSP and then waits on its future without timeout. 
> Imagine that there are enough moveRegion requests to tie up the 
> normal priority master RPC pool. Now imagine that all of those requests are 
> waiting on TRSPs pending on a regionserver that is concurrently bounced, or 
> that fails. The TRSPs are blocked in REGION_STATE_TRANSITION_CLOSE 
> because the target regionserver terminated before responding to the close 
> requests, blocking the moveRegion requests, blocking the RPC handlers. The 
> regionserver restarts and tries to check in, but cannot report to the master 
> because there are no free normal priority handlers to handle it. It seems 
> incorrect to have the regionserver operational dependencies 
> (regionServerStartup, regionServerReport, and reportFatalRSError) contending 
> with normal priority requests.
> They should be made ADMIN_QOS priority to avoid this case.
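
The starvation described above can be reproduced in miniature. The following is only a sketch and not HBase code: it models the normal priority handlers and a separate higher-priority pool as two plain java.util.concurrent executors. The class name, pool sizes, and the 500 ms timeout are all hypothetical. Each simulated moveRegion handler waits without timeout on a latch that only the simulated regionServerReport call can release, which stands in for the TRSPs that cannot progress until the regionserver checks in.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class HandlerStarvationSketch {

    /**
     * Returns true if the simulated regionServerReport call completes while
     * every normal priority handler is blocked waiting on a region close.
     */
    public static boolean reportCompletes(boolean useSeparatePool) throws Exception {
        int handlers = 3; // hypothetical size of the normal priority pool
        ExecutorService normalPool = Executors.newFixedThreadPool(handlers);
        ExecutorService adminPool = Executors.newFixedThreadPool(1);
        // Released only when the "regionserver" manages to report in.
        CountDownLatch regionServerReported = new CountDownLatch(1);
        try {
            // Simulated moveRegion handlers: each occupies one handler thread
            // and waits without timeout, like a handler waiting on its TRSP.
            for (int i = 0; i < handlers; i++) {
                normalPool.submit(() -> {
                    regionServerReported.await();
                    return null;
                });
            }
            // Simulated regionServerReport: either contends with the blocked
            // handlers or runs on its own pool.
            ExecutorService reportPool = useSeparatePool ? adminPool : normalPool;
            Future<?> report = reportPool.submit(regionServerReported::countDown);
            try {
                report.get(500, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                return false; // the report is stuck behind the blocked handlers
            }
        } finally {
            normalPool.shutdownNow();
            adminPool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("shared pool:   report completes = " + reportCompletes(false));
        System.out.println("separate pool: report completes = " + reportCompletes(true));
    }
}
```

With a shared pool the report stays queued behind the blocked handlers and the call times out. With a separate pool it runs immediately and unblocks them, which is the effect the proposed priority change is meant to have for the three regionserver calls.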



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
