[ https://issues.apache.org/jira/browse/HBASE-28522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17837454#comment-17837454 ]
Prathyusha commented on HBASE-28522:
------------------------------------

>The flow by design is SCP will interrupt the TRSP to assign the region first,
>and then unassign it.

True, from my understanding this code path should take care of it:

SCP#assignRegions

    if (regionNode.getProcedure() != null) {
      LOG.info("{} found RIT {}; {}", this, regionNode.getProcedure(), regionNode);
      regionNode.getProcedure().serverCrashed(env, regionNode, getServerName(),
        !retainAssignment);
      continue;
    }
    if (
      env.getMasterServices().getTableStateManager().isTableState(regionNode.getTable(),
        TableState.State.DISABLING)
    ) {
      // We need to change the state here otherwise the TRSP scheduled by DTP will try to
      // close the region from a dead server and will never succeed. Please see HBASE-23636
      // for more details.
      env.getAssignmentManager().regionClosedAbnormally(regionNode);
      LOG.info("{} found table disabling for region {}, set it state to ABNORMALLY_CLOSED.",
        this, regionNode);
      continue;
    }
    if (
      env.getMasterServices().getTableStateManager().isTableState(regionNode.getTable(),
        TableState.State.DISABLED)
    ) {
      // This should not happen, table disabled but has regions on server.
      LOG.warn("Found table disabled for region {}, procDetails: {}", regionNode, this);
      continue;
    }
    TransitRegionStateProcedure proc =
      TransitRegionStateProcedure.assign(env, region, !retainAssignment, null);
    regionNode.setProcedure(proc);
    addChildProcedure(proc);

---------
But we did not see any "found RIT" log lines, and the SCP was triggered a bit before DisableTableProc set the table state to DISABLING. So the SCP had already set the ASSIGN proc in regionNode before DisableTableProc triggered forceCreateUnssignProcedure, which essentially overrides the current proc of regionNode (which should be the child ASSIGN TRSP of the SCP):

    if (regionNode.getProcedure() != null) {
      regionNode.unsetProcedure(regionNode.getProcedure());
    }
    return regionNode.setProcedure(TransitRegionStateProcedure.unassign(getProcedureEnvironment(),
      regionNode.getRegionInfo()));

------
Now the ASSIGN proc of the SCP was also waiting on the shared table lock, but DisableTableProc must have taken the exclusive table lock, blocking the ASSIGN of the SCP.
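
To make the ordering concrete, here is a minimal toy model of the two orderings. It is not HBase code: Node, Proc, scpAssignRegion, dtpForceCreateUnassign and replay are hypothetical stand-ins that only mirror the branch order of the two snippets above, and locking is left out entirely.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: these types are hypothetical stand-ins, not HBase classes.
final class RegionNodeRaceSketch {
  enum TableState { ENABLED, DISABLING, DISABLED }
  enum Proc { ASSIGN, UNASSIGN }

  /** Stand-in for the region node: holds at most one attached procedure. */
  static final class Node {
    Proc procedure; // null means no procedure is attached to the region
  }

  /** Mirrors the branch order of the SCP snippet quoted above. */
  static String scpAssignRegion(Node node, TableState state) {
    if (node.procedure != null) {
      // The real SCP would call serverCrashed(...) on the attached procedure here.
      return "found RIT " + node.procedure;
    }
    if (state == TableState.DISABLING) {
      return "set ABNORMALLY_CLOSED";
    }
    if (state == TableState.DISABLED) {
      return "skipped, table disabled";
    }
    node.procedure = Proc.ASSIGN;
    return "scheduled ASSIGN";
  }

  /** Mirrors the force-create snippet: replaces whatever procedure is attached. */
  static String dtpForceCreateUnassign(Node node) {
    Proc displaced = node.procedure;
    node.procedure = Proc.UNASSIGN;
    return displaced == null ? "scheduled UNASSIGN" : "UNASSIGN displaced " + displaced;
  }

  /** Replays one of the two orderings and returns the events in order. */
  static List<String> replay(boolean scpFirst) {
    Node node = new Node();
    List<String> events = new ArrayList<>();
    if (scpFirst) {
      // SCP runs while the table is still ENABLED, then DTP overrides its ASSIGN.
      events.add("SCP: " + scpAssignRegion(node, TableState.ENABLED));
      events.add("DTP: " + dtpForceCreateUnassign(node));
    } else {
      // DTP has already set DISABLING and attached UNASSIGN before SCP runs.
      events.add("DTP: " + dtpForceCreateUnassign(node));
      events.add("SCP: " + scpAssignRegion(node, TableState.DISABLING));
    }
    return events;
  }

  public static void main(String[] args) {
    System.out.println("SCP first: " + replay(true));
    System.out.println("DTP first: " + replay(false));
  }
}
```

In this model only the "DTP first" ordering reaches the "found RIT" branch. In the "SCP first" ordering the ASSIGN is silently displaced, which matches what we saw in the logs.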

2024-03-16 17:59:23,003 DEBUG [PEWorker-40] procedure2.ProcedureExecutor - LOCK_EVENT_WAIT pid=21594220, ppid=21592440, state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; TransitRegionStateProcedure table=<TABLE_NAME>, region=<regionhash>, ASSIGN

It looks like if the SCP had been triggered a bit later, it would have interrupted the current child UNASSIGN of DisableTableProc.

> UNASSIGN proc indefinitely stuck on dead rs
> -------------------------------------------
>
>                 Key: HBASE-28522
>                 URL: https://issues.apache.org/jira/browse/HBASE-28522
>             Project: HBase
>          Issue Type: Improvement
>          Components: proc-v2
>            Reporter: Prathyusha
>            Priority: Minor
>
> One scenario we noticed in production:
> DisableTableProc and SCP were triggered at almost the same time.
> 2024-03-16 17:59:23,014 INFO [PEWorker-11] procedure.DisableTableProcedure -
> Set <TABLE_NAME> to state=DISABLING
> 2024-03-16 17:59:15,243 INFO [PEWorker-26] procedure.ServerCrashProcedure -
> Start pid=21592440, state=RUNNABLE:SERVER_CRASH_START, locked=true;
> ServerCrashProcedure
> <regionserver>, splitWal=true, meta=false
> DisableTableProc creates unassign procs, and at this time the ASSIGNs of the SCP
> are not completed.
> {{2024-03-16 17:59:23,003 DEBUG [PEWorker-40] procedure2.ProcedureExecutor -
> LOCK_EVENT_WAIT pid=21594220, ppid=21592440,
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE;
> TransitRegionStateProcedure table=<TABLE_NAME>, region=<regionhash>, ASSIGN}}
> The UNASSIGN created by DisableTableProc is stuck on the dead regionserver and we
> had to manually bypass the unassign of DisableTableProc and then do the ASSIGN.
> If we can break the retry loop of the UNASSIGN procedure when there is an SCP
> for that server, would we no longer need manual intervention? At least the
> DisableTableProc could go to a rollback state?

--
This message was sent by Atlassian Jira
(v8.20.10#820010)