[ https://issues.apache.org/jira/browse/HBASE-21611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16725278#comment-16725278 ]
Sergey Shelukhin commented on HBASE-21611:
------------------------------------------

Well, there are tens or hundreds of regions in this state, so retries, even at the maximum interval of 10 minutes, log 5-8 lines every few seconds on average, and more before they reach the 10-minute wait.

Looking at how SCP already checks for the RIT procedure, I wonder if it should instead replace the RIT procedure at the beginning and make it a dependency, instead of checking for it at the end. Not sure if ProcedureV2 would allow making it a dependency retroactively. Then RIT itself could avoid waiting forever because it expects SCP to take over; so if there's no SCP, it's some sort of a bug.

> REGION_STATE_TRANSITION_CONFIRM_CLOSED should interact better with crash procedure
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-21611
>                 URL: https://issues.apache.org/jira/browse/HBASE-21611
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Priority: Major
>
> 1) Not a bug per se, since HDFS is not supposed to lose files, just a bit fragile.
> When a dead server's WAL directory is deleted (due to manual intervention, or some issue with HDFS) while some regions are in CLOSING state on that server, they get stuck forever in the REGION_STATE_TRANSITION_CONFIRM_CLOSED - REGION_STATE_TRANSITION_CLOSE - "give up and mark the procedure as complete, the parent procedure will take care of this" loop. There's no crash procedure for the server, so nobody ever takes care of that.
> 2) Under normal circumstances, when a large WAL is being split, this same loop keeps spamming the logs and wasting resources for no reason, until the crash procedure completes. There's no reason for it to retry - it should just wait for the crash procedure.
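
The log-volume figure in the comment can be sanity-checked with some quick arithmetic. The sketch below is only a rough estimate and is not based on HBase code: it assumes retries from all stuck regions are spread evenly over time, that every region has already reached the 10-minute maximum interval, and that each retry attempt logs 5-8 lines (one reading of the comment). The function names are made up for illustration.

```python
# Back-of-the-envelope estimate of aggregate log volume from stuck regions.
# Assumptions (illustrative, not taken from HBase): retries are spread evenly
# over time, every region is at the maximum retry interval, and each retry
# attempt logs between 5 and 8 lines.

def seconds_between_retries(stuck_regions, retry_interval_s):
    """Average gap between consecutive retries across all stuck regions."""
    return retry_interval_s / stuck_regions


def lines_per_minute(stuck_regions, retry_interval_s, lines_per_retry):
    """Aggregate number of log lines written per minute by all retries."""
    retries_per_minute = stuck_regions * 60.0 / retry_interval_s
    return retries_per_minute * lines_per_retry


MAX_INTERVAL_S = 10 * 60  # the 10-minute maximum retry interval from the comment

for regions in (10, 100):
    gap = seconds_between_retries(regions, MAX_INTERVAL_S)
    low = lines_per_minute(regions, MAX_INTERVAL_S, 5)
    high = lines_per_minute(regions, MAX_INTERVAL_S, 8)
    print(f"{regions} regions: one retry every {gap:.0f} s, "
          f"{low:.0f}-{high:.0f} lines/min")
```

Under these assumptions, 100 stuck regions give one retry roughly every 6 seconds, which is in line with "5-8 lines every few seconds". The rate would be higher while the regions are still backing off toward the 10-minute interval.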