[
https://issues.apache.org/jira/browse/HBASE-21611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16725278#comment-16725278
]
Sergey Shelukhin commented on HBASE-21611:
--
Well, there are tens or hundreds of regions in this state, so retries, even at
the maximum interval of 10 minutes, log 5-8 lines every few seconds on average,
and more before they reach the 10-minute wait.
Looking at how SCP already checks for a RIT procedure, I wonder if it should
instead replace the RIT procedure at the beginning and make it a dependency,
rather than checking for it at the end. I'm not sure whether ProcedureV2 would
allow making it a dependency retroactively. Then the RIT procedure itself would
not have to wait forever on the expectation that SCP will take over; if there
is no SCP, that is some sort of bug.
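To put rough numbers on the log volume mentioned above, here is a
back-of-the-envelope sketch. It assumes every stuck region has already backed
off to the 10-minute maximum and that retries are spread evenly over time; the
function names are illustrative only and are not HBase APIs.

```python
def seconds_between_retries(regions, retry_interval_s):
    """Average gap between any two retries when each region retries once per interval."""
    return retry_interval_s / regions


def log_lines_per_second(regions, lines_per_retry, retry_interval_s):
    """Average log lines per second across all stuck regions."""
    return regions * lines_per_retry / retry_interval_s


MAX_INTERVAL_S = 10 * 60  # maximum retry interval from the comment: 10 minutes

for regions in (10, 100):  # "tens or hundreds" of stuck regions
    gap = seconds_between_retries(regions, MAX_INTERVAL_S)
    low = log_lines_per_second(regions, 5, MAX_INTERVAL_S)   # 5 lines per retry
    high = log_lines_per_second(regions, 8, MAX_INTERVAL_S)  # 8 lines per retry
    print(f"{regions} regions: one retry every {gap:.0f}s, "
          f"{low:.2f}-{high:.2f} log lines/s")
```

With 100 regions this works out to one retry, and so one burst of 5-8 lines,
roughly every 6 seconds, which is in line with the "every few seconds" estimate.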
> REGION_STATE_TRANSITION_CONFIRM_CLOSED should interact better with crash
> procedure
> --
>
> Key: HBASE-21611
> URL: https://issues.apache.org/jira/browse/HBASE-21611
> Project: HBase
> Issue Type: Bug
> Reporter: Sergey Shelukhin
> Priority: Major
>
> 1) Not a bug per se, since HDFS is not supposed to lose files, just a bit
> fragile.
> When a dead server's WAL directory is deleted (due to a manual intervention,
> or some issue with HDFS) while some regions are in CLOSING state on that
> server, they get stuck forever in REGION_STATE_TRANSITION_CONFIRM_CLOSED -
> REGION_STATE_TRANSITION_CLOSE - "give up and mark the procedure as complete,
> the parent procedure will take care of this" loop. There's no crash procedure
> for the server so nobody ever takes care of that.
> 2) Under normal circumstances, when a large WAL is being split, this same
> loop keeps spamming the logs and wasting resources for no reason, until the
> crash procedure completes. There's no reason for it to retry - it should just
> wait for crash procedure.