[ 
https://issues.apache.org/jira/browse/HBASE-28522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17837454#comment-17837454
 ] 

Prathyusha commented on HBASE-28522:
------------------------------------

>The flow by design is SCP will interrupt the TRSP to assign the region first, 
>and then unassign it.

True. From my understanding, this code path in SCP#assignRegions should take
care of it:
 

    if (regionNode.getProcedure() != null) {
      LOG.info("{} found RIT {}; {}", this, regionNode.getProcedure(), regionNode);
      regionNode.getProcedure().serverCrashed(env, regionNode, getServerName(),
        !retainAssignment);
      continue;
    }
    if (
      env.getMasterServices().getTableStateManager().isTableState(regionNode.getTable(),
        TableState.State.DISABLING)
    ) {
      // We need to change the state here otherwise the TRSP scheduled by DTP will try to
      // close the region from a dead server and will never succeed. Please see HBASE-23636
      // for more details.
      env.getAssignmentManager().regionClosedAbnormally(regionNode);
      LOG.info("{} found table disabling for region {}, set it state to ABNORMALLY_CLOSED.",
        this, regionNode);
      continue;
    }
    if (
      env.getMasterServices().getTableStateManager().isTableState(regionNode.getTable(),
        TableState.State.DISABLED)
    ) {
      // This should not happen, table disabled but has regions on server.
      LOG.warn("Found table disabled for region {}, procDetails: {}", regionNode, this);
      continue;
    }
    TransitRegionStateProcedure proc =
      TransitRegionStateProcedure.assign(env, region, !retainAssignment, null);
    regionNode.setProcedure(proc);
    addChildProcedure(proc);
---------
However, we did not see any "found RIT" log lines, and the SCP was triggered a
bit before DisableTableProc set the table state to DISABLING.
So the SCP had already set the ASSIGN proc in regionNode before
DisableTableProc triggered forceCreateUnssignProcedure, which in turn
overrides the current proc of regionNode (which should be the child ASSIGN
TRSP of the SCP):
    if (regionNode.getProcedure() != null) {
      regionNode.unsetProcedure(regionNode.getProcedure());
    }
    return regionNode.setProcedure(TransitRegionStateProcedure.unassign(getProcedureEnvironment(),
      regionNode.getRegionInfo()));
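To make the override concrete, here is a minimal, self-contained sketch of what the unset-then-set sequence does to a node that already carries an ASSIGN. It is not HBase code: `Proc`, `RegionNode`, `forceUnassign`, and the pids are hypothetical stand-ins, and only the order of the calls mirrors the snippet above.

```java
public class RegionNodeOverrideSketch {

    /** Hypothetical stand-in for a TransitRegionStateProcedure: just a pid and a type label. */
    static final class Proc {
        final long pid;
        final String type;

        Proc(long pid, String type) {
            this.pid = pid;
            this.type = type;
        }
    }

    /** Hypothetical stand-in for the region node: at most one procedure attached at a time. */
    static final class RegionNode {
        private Proc procedure;

        Proc getProcedure() {
            return procedure;
        }

        Proc setProcedure(Proc proc) {
            if (procedure != null) {
                throw new IllegalStateException("already attached: " + procedure.type);
            }
            procedure = proc;
            return proc;
        }

        void unsetProcedure(Proc proc) {
            if (procedure != proc) {
                throw new IllegalStateException("not the attached procedure");
            }
            procedure = null;
        }
    }

    /** Same call order as the snippet above: detach whatever is attached, then attach an UNASSIGN. */
    static Proc forceUnassign(RegionNode node, long pid) {
        if (node.getProcedure() != null) {
            node.unsetProcedure(node.getProcedure());
        }
        return node.setProcedure(new Proc(pid, "UNASSIGN"));
    }

    public static void main(String[] args) {
        RegionNode node = new RegionNode();
        // The crash handling attaches its ASSIGN first (pid 1 is made up).
        Proc assign = node.setProcedure(new Proc(1L, "ASSIGN"));
        // The disable path then force-attaches its UNASSIGN (pid 2 is made up).
        Proc unassign = forceUnassign(node, 2L);
        System.out.println("attached: " + node.getProcedure().type + " pid=" + unassign.pid);
        System.out.println("ASSIGN still attached: " + (node.getProcedure() == assign));
    }
}
```

In this model the ASSIGN object still exists after the override, but the node no longer points at it, so a later check of `getProcedure()` would only ever see the UNASSIGN.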

 ------

The ASSIGN proc of the SCP was also waiting on the shared table lock, but
DisableTableProc must already have taken the exclusive table lock, which
blocked the SCP's ASSIGN:

2024-03-16 17:59:23,003 DEBUG [PEWorker-40]
procedure2.ProcedureExecutor - LOCK_EVENT_WAIT pid=21594220, ppid=21592440,
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE;
TransitRegionStateProcedure table=<TABLE_NAME>, region=<regionhash>,
ASSIGN
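As an analogy for the shared-versus-exclusive wait seen in that log line, the sketch below uses `java.util.concurrent.locks.ReentrantReadWriteLock`. This is an assumption made purely for illustration: the procedure scheduler has its own lock implementation, and the class and method names here are hypothetical.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class TableLockSketch {

    /**
     * Returns whether a second thread could take the shared (read) lock while
     * the calling thread holds the exclusive (write) lock.
     */
    static boolean sharedLockGrantedWhileExclusiveHeld() {
        ReentrantReadWriteLock tableLock = new ReentrantReadWriteLock();
        AtomicBoolean granted = new AtomicBoolean(false);

        // Plays the role of the table-level procedure holding the exclusive lock.
        tableLock.writeLock().lock();
        try {
            // Plays the role of the region-level ASSIGN asking for the shared lock.
            Thread assign = new Thread(() -> {
                if (tableLock.readLock().tryLock()) {
                    granted.set(true);
                    tableLock.readLock().unlock();
                }
            });
            assign.start();
            assign.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } finally {
            tableLock.writeLock().unlock();
        }
        return granted.get();
    }

    public static void main(String[] args) {
        System.out.println("shared lock granted: " + sharedLockGrantedWhileExclusiveHeld());
    }
}
```

The shared request is refused for as long as the exclusive holder keeps the lock, which matches the ASSIGN sitting in LOCK_EVENT_WAIT.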

It looks like if the SCP had been triggered a bit later, it would have
interrupted the current child UNASSIGN of DisableTableProc.

> UNASSIGN proc indefinitely stuck on dead rs
> -------------------------------------------
>
>                 Key: HBASE-28522
>                 URL: https://issues.apache.org/jira/browse/HBASE-28522
>             Project: HBase
>          Issue Type: Improvement
>          Components: proc-v2
>            Reporter: Prathyusha
>            Priority: Minor
>
> One scenario we noticed in production:
> DisableTableProc and SCP were triggered at almost the same time
> 2024-03-16 17:59:23,014 INFO [PEWorker-11] procedure.DisableTableProcedure - 
> Set <TABLE_NAME> to state=DISABLING
> 2024-03-16 17:59:15,243 INFO [PEWorker-26] procedure.ServerCrashProcedure - 
> Start pid=21592440, state=RUNNABLE:SERVER_CRASH_START, locked=true; 
> ServerCrashProcedure 
> <regionserver>, splitWal=true, meta=false
> DisableTableProc creates UNASSIGN procs while the ASSIGNs of the SCP have 
> not yet completed
> {{2024-03-16 17:59:23,003 DEBUG [PEWorker-40] procedure2.ProcedureExecutor - 
> LOCK_EVENT_WAIT pid=21594220, ppid=21592440, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; 
> TransitRegionStateProcedure table=<TABLE_NAME>, region=<regionhash>, ASSIGN}}
> UNASSIGN created by DisableTableProc is stuck on the dead regionserver and we 
> had to manually bypass unassign of DisableTableProc and then do ASSIGN.
> If we can make the UNASSIGN procedure stop retrying when there is an SCP 
> for that server, would that avoid the manual intervention? At the least, 
> could the DisableTableProc go to a rollback state?
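One way to picture the suggestion in the description is a retry loop that checks for a crash procedure before every attempt. The sketch below is a hypothetical model only: `closeWithRetry`, `Outcome`, and both predicates are invented for illustration and do not correspond to existing HBase methods.

```java
import java.util.function.Predicate;

public class UnassignRetrySketch {

    enum Outcome { CLOSED, GAVE_UP_SERVER_DEAD, RETRIES_EXHAUSTED }

    /**
     * Tries to close a region on the given server, but stops retrying as soon
     * as a crash procedure is known for that server.
     *
     * @param server            name of the server hosting the region
     * @param tryClose          hypothetical close attempt; returns true on success
     * @param hasCrashProcedure hypothetical check for a crash procedure on the server
     * @param maxAttempts       upper bound on attempts, so the loop always terminates
     */
    static Outcome closeWithRetry(String server, Predicate<String> tryClose,
            Predicate<String> hasCrashProcedure, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (hasCrashProcedure.test(server)) {
                // The server is being handled as dead; retrying the close cannot succeed.
                return Outcome.GAVE_UP_SERVER_DEAD;
            }
            if (tryClose.test(server)) {
                return Outcome.CLOSED;
            }
        }
        return Outcome.RETRIES_EXHAUSTED;
    }

    public static void main(String[] args) {
        // "rs1" is a made-up server name; the close never succeeds and a crash procedure exists.
        Outcome outcome = closeWithRetry("rs1", s -> false, s -> true, 5);
        System.out.println(outcome);
    }
}
```

What the caller does with `GAVE_UP_SERVER_DEAD` (roll back, or hand the region to the crash handling) is left open here, as it is in the description.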



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
