[ 
https://issues.apache.org/jira/browse/HBASE-21611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16725278#comment-16725278
 ] 

Sergey Shelukhin commented on HBASE-21611:
------------------------------------------

Well, there are tens or hundreds of regions in this state, so even at the 
maximum retry interval of 10 minutes the retries log 5-8 lines every few 
seconds on average, and more before they reach the 10-minute wait.
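
To make the arithmetic concrete, here is a rough sketch of the estimate. It 
assumes each stuck region retries independently once per interval; the class 
name, method names, and region counts are made up for illustration and are not 
HBase code.

```java
// Back-of-the-envelope estimate of log volume from stuck-region retries.
// All names and the region counts are illustrative assumptions.
class RetryLogRate {
    // Average seconds between two retries cluster-wide, when each of
    // `regions` stuck regions retries once every `intervalSeconds`.
    static double secondsBetweenRetries(int regions, double intervalSeconds) {
        return intervalSeconds / regions;
    }

    // Average log lines per minute, given the lines written per retry.
    static double linesPerMinute(int regions, double intervalSeconds, int linesPerRetry) {
        return regions * linesPerRetry * 60.0 / intervalSeconds;
    }

    public static void main(String[] args) {
        double maxIntervalSeconds = 600.0;   // the 10-minute maximum interval
        int[] regionCounts = {10, 100, 500}; // "tens or hundreds" of regions
        for (int regions : regionCounts) {
            System.out.printf(
                "%d regions: one retry every %.1f s on average, %.0f to %.0f lines/min%n",
                regions,
                secondsBetweenRetries(regions, maxIntervalSeconds),
                linesPerMinute(regions, maxIntervalSeconds, 5),
                linesPerMinute(regions, maxIntervalSeconds, 8));
        }
    }
}
```

With 100 regions at the 600-second interval this comes to one retry every 
6 seconds, or 50 to 80 lines per minute, and the rate is higher while the 
backoff is still below the maximum.
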
Looking at how SCP already checks for the RIT procedure, I wonder if it should 
instead replace the RIT procedure at the beginning and make it a dependency, 
rather than checking for it at the end. Not sure if ProcedureV2 would allow 
making it a dependency retroactively. Then RIT itself could avoid waiting 
forever because it expects SCP to take over; so if there's no SCP, it's some 
sort of a bug. 
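
A minimal sketch of the "wait on SCP instead of retrying" idea, using plain 
java.util.concurrent rather than ProcedureV2. Everything here is hypothetical: 
the class, the registry of crash procedures, and the method names are 
assumptions for illustration and do not correspond to actual HBase APIs.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model: a region transition chains onto the server's crash
// procedure instead of retrying on a timer, and fails fast if none exists.
class WaitForCrashProcedure {
    // Assumed registry: server name -> future completed when that server's
    // crash procedure finishes.
    private final Map<String, CompletableFuture<Void>> crashProcedures =
        new ConcurrentHashMap<>();

    void registerCrashProcedure(String server, CompletableFuture<Void> done) {
        crashProcedures.put(server, done);
    }

    // Returns a future that completes once the crash procedure completes.
    // A missing crash procedure is reported as an error right away rather
    // than being retried forever.
    CompletableFuture<String> confirmClosed(String region, String server) {
        CompletableFuture<Void> scp = crashProcedures.get(server);
        if (scp == null) {
            CompletableFuture<String> failed = new CompletableFuture<>();
            failed.completeExceptionally(
                new IllegalStateException("no crash procedure for " + server));
            return failed;
        }
        return scp.thenApply(ignored ->
            region + " handed over after crash procedure for " + server);
    }

    public static void main(String[] args) {
        WaitForCrashProcedure tracker = new WaitForCrashProcedure();
        CompletableFuture<Void> scp = new CompletableFuture<>();
        tracker.registerCrashProcedure("rs1", scp);

        CompletableFuture<String> waiting = tracker.confirmClosed("region-a", "rs1");
        System.out.println("done before SCP finishes: " + waiting.isDone());
        scp.complete(null); // crash procedure finishes, e.g. after WAL splitting
        System.out.println(waiting.join());

        // No crash procedure registered for rs2: surfaces as a failure.
        System.out.println("failed without SCP: "
            + tracker.confirmClosed("region-b", "rs2").isCompletedExceptionally());
    }
}
```

The point of the sketch is only the shape of the control flow: no timer and 
no repeated log lines while the crash procedure is running, and a visible 
failure when no crash procedure exists.
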


> REGION_STATE_TRANSITION_CONFIRM_CLOSED should interact better with crash 
> procedure
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-21611
>                 URL: https://issues.apache.org/jira/browse/HBASE-21611
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Priority: Major
>
> 1) Not a bug per se, since HDFS is not supposed to lose files, just a bit 
> fragile.
> When a dead server's WAL directory is deleted (due to a manual intervention, 
> or some issue with HDFS) while some regions are in CLOSING state on that 
> server, they get stuck forever in REGION_STATE_TRANSITION_CONFIRM_CLOSED - 
> REGION_STATE_TRANSITION_CLOSE - "give up and mark the procedure as complete, 
> the parent procedure will take care of this" loop. There's no crash procedure 
> for the server so nobody ever takes care of that.
> 2) Under normal circumstances, when a large WAL is being split, this same 
> loop keeps spamming the logs and wasting resources for no reason, until the 
> crash procedure completes. There's no reason for it to retry - it should just 
> wait for crash procedure.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
