[ https://issues.apache.org/jira/browse/HBASE-28522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17845085#comment-17845085 ]
Prathyusha edited comment on HBASE-28522 at 5/9/24 5:59 PM:
------------------------------------------------------------

[~zhangduo] yes, this is exactly the condition I was trying to describe in my comment above (sorry if I was a bit unclear). Below is the sequence of events that happened; it ended with stuck procedures, and bypass was the only way out.

fyi [~apurtell] [~vjasani]

!timeline.jpg!


was (Author: prathyu6):
[~zhangduo] yes, this is exactly the condition I was trying to describe in my comment above (sorry if I was a bit unclear). Below is the sequence of events that happened; it ended with stuck procedures, and bypass was the only way out.

!timeline.jpg!

> UNASSIGN proc indefinitely stuck on dead rs
> -------------------------------------------
>
>                 Key: HBASE-28522
>                 URL: https://issues.apache.org/jira/browse/HBASE-28522
>             Project: HBase
>          Issue Type: Improvement
>          Components: proc-v2, Region Assignment
>            Reporter: Prathyusha
>            Assignee: Prathyusha
>            Priority: Critical
>         Attachments: timeline.jpg
>
>
> One scenario we noticed in production:
> a DisableTableProc and an SCP were triggered at almost the same time.
> 2024-03-16 17:59:23,014 INFO [PEWorker-11] procedure.DisableTableProcedure - Set <TABLE_NAME> to state=DISABLING
> 2024-03-16 17:59:15,243 INFO [PEWorker-26] procedure.ServerCrashProcedure - Start pid=21592440, state=RUNNABLE:SERVER_CRASH_START, locked=true; ServerCrashProcedure <regionserver>, splitWal=true, meta=false
> The DisableTableProc creates UNASSIGN procs, and at this time the ASSIGNs of the SCP have not completed.
> {{2024-03-16 17:59:23,003 DEBUG [PEWorker-40] procedure2.ProcedureExecutor - LOCK_EVENT_WAIT pid=21594220, ppid=21592440, state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; TransitRegionStateProcedure table=<TABLE_NAME>, region=<regionhash>, ASSIGN}}
> The UNASSIGN created by the DisableTableProc is stuck on the dead regionserver, and we had to manually bypass the UNASSIGN of the DisableTableProc and then do the ASSIGN.
> If we can break the loop so that the UNASSIGN procedure does not retry when there is an SCP for that server, would we still need manual intervention? At the least, could the DisableTableProc go to a rollback state?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
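
The proposal in the issue description (stop retrying the UNASSIGN when an SCP already exists for the target server) could be sketched roughly as below. This is only a simplified model under stated assumptions: `UnassignRetrySketch`, `Action`, `onCloseFailure`, and the `serversWithScp` set are hypothetical names for illustration and are not HBase classes or APIs. How the real procedures would detect an SCP or hand over to it is not covered.

```java
import java.util.Set;

// Simplified model of the proposal. None of these types are HBase classes.
class UnassignRetrySketch {

    // Hypothetical outcomes for an UNASSIGN whose close request failed.
    enum Action { RETRY_CLOSE, STOP_AND_DEFER_TO_SCP }

    // Decide what to do after a failed close request to targetServer.
    // serversWithScp is an assumed input: the servers that already have a
    // ServerCrashProcedure scheduled.
    static Action onCloseFailure(String targetServer, Set<String> serversWithScp) {
        if (serversWithScp.contains(targetServer)) {
            // The server is already being handled by an SCP, so retrying
            // the close against it can never succeed.
            return Action.STOP_AND_DEFER_TO_SCP;
        }
        return Action.RETRY_CLOSE;
    }

    public static void main(String[] args) {
        Set<String> serversWithScp = Set.of("rs-dead");
        System.out.println(onCloseFailure("rs-dead", serversWithScp)); // STOP_AND_DEFER_TO_SCP
        System.out.println(onCloseFailure("rs-live", serversWithScp)); // RETRY_CLOSE
    }
}
```

Whether "stop" should mean failing the UNASSIGN so the DisableTableProc can roll back, or waiting for the SCP to finish, is the open question raised in the description; the sketch only shows where such a check would break the retry loop.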