[jira] [Commented] (HBASE-21611) REGION_STATE_TRANSITION_CONFIRM_CLOSED should interact better with crash procedure

2018-12-19 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16725066#comment-16725066
 ] 

Duo Zhang commented on HBASE-21611:
---

This is by design, I'd say: we have to retry until the SCP 
(ServerCrashProcedure) interrupts us. Checking for an SCP may be possible, but 
it would lead to more complicated logic and more potential races and bugs. 
Does it really spam the logs? Maybe the problem is that the backoff logic is 
broken; otherwise the retry interval should quickly grow to seconds or even 
minutes.
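
For illustration only, here is a minimal sketch of the kind of capped 
exponential backoff this comment has in mind. It is not the HBase 
implementation: the class name, the 1-second base delay and the 10-minute cap 
are all assumptions chosen for the example.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of capped exponential backoff; not HBase's actual retry code.
final class RetryBackoffSketch {
    // Assumed first delay (1 second) and assumed upper bound (10 minutes).
    static final long BASE_MILLIS = 1_000L;
    static final long MAX_MILLIS = TimeUnit.MINUTES.toMillis(10);

    // Delay before retry number `attempt` (0-based): BASE_MILLIS * 2^attempt, capped.
    static long delayMillis(int attempt) {
        if (attempt < 0) {
            throw new IllegalArgumentException("attempt must be >= 0");
        }
        // Short-circuit large attempts so the shift below cannot overflow.
        if (attempt >= 30) {
            return MAX_MILLIS;
        }
        return Math.min(MAX_MILLIS, BASE_MILLIS << attempt);
    }

    public static void main(String[] args) {
        // Print the delay schedule for the first few retries.
        for (int attempt = 0; attempt <= 12; attempt++) {
            System.out.println("attempt " + attempt + " -> " + delayMillis(attempt) + " ms");
        }
    }
}
```

With these assumed values the delay doubles on every retry and reaches the cap 
at the eleventh retry, so a working backoff should leave the sub-second range 
almost immediately.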

> REGION_STATE_TRANSITION_CONFIRM_CLOSED should interact better with crash 
> procedure
> --
>
> Key: HBASE-21611
> URL: https://issues.apache.org/jira/browse/HBASE-21611
> Project: HBase
> Issue Type: Bug
> Reporter: Sergey Shelukhin
> Priority: Major
>
> 1) Not a bug per se, since HDFS is not supposed to lose files, just a bit 
> fragile.
> When a dead server's WAL directory is deleted (due to manual intervention 
> or some issue with HDFS) while some regions are in CLOSING state on that 
> server, they get stuck forever in a REGION_STATE_TRANSITION_CONFIRM_CLOSED - 
> REGION_STATE_TRANSITION_CLOSE - "give up and mark the procedure as complete, 
> the parent procedure will take care of this" loop. There's no crash procedure 
> for the server, so nobody ever takes care of that.
> 2) Under normal circumstances, when a large WAL is being split, this same 
> loop keeps spamming the logs and wasting resources for no reason until the 
> crash procedure completes. There's no reason for it to retry; it should just 
> wait for the crash procedure.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21611) REGION_STATE_TRANSITION_CONFIRM_CLOSED should interact better with crash procedure

2018-12-19 Thread Sergey Shelukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16725278#comment-16725278
 ] 

Sergey Shelukhin commented on HBASE-21611:
--

Well, there are tens or hundreds of regions in this state, so even at the 
maximum retry interval of 10 minutes the retries log 5-8 lines every few 
seconds on average, and more before they reach the 10-minute wait.
Looking at how SCP already checks for the RIT (region-in-transition) 
procedure, I wonder if it should instead replace the RIT procedure at the 
beginning and make it a dependency, rather than checking for it at the end. 
I'm not sure whether ProcedureV2 would allow making it a dependency 
retroactively. Then RIT itself could avoid waiting forever, because it expects 
SCP to take over; so if there's no SCP, it's some sort of bug.
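
The log-volume estimate at the start of this comment can be checked with a few 
lines of arithmetic. The sketch below is illustrative only: the class and 
method names are made up for the example, and the inputs (100 regions, 6 lines 
per retry) are sample values picked from the ranges quoted above.

```java
// Hypothetical helper for estimating steady-state log volume; not part of HBase.
final class LogRateSketch {
    // Average seconds between consecutive retries across all stuck regions,
    // assuming every region retries once per interval.
    static double secondsBetweenRetries(int regions, long intervalSeconds) {
        return (double) intervalSeconds / regions;
    }

    // Average log lines per second once every region is at the capped interval.
    static double linesPerSecond(int regions, int linesPerRetry, long intervalSeconds) {
        return (double) regions * linesPerRetry / intervalSeconds;
    }

    public static void main(String[] args) {
        long tenMinutes = 600L; // maximum retry interval from the comment, in seconds
        System.out.println(secondsBetweenRetries(100, tenMinutes) + " s between retries");
        System.out.println(linesPerSecond(100, 6, tenMinutes) + " lines/s");
    }
}
```

With 100 regions at a 10-minute interval, some region retries every 6 seconds 
on average, and each retry writes its 5-8 lines, which is consistent with the 
"5-8 lines every few seconds" figure.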




