[ 
https://issues.apache.org/jira/browse/HBASE-25212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Kyle Purtell updated HBASE-25212:
----------------------------------------
    Description: 
After deciding a region should be closed, the regionserver will set the 
internal region state to closing and wait for all pending requests to complete, 
via a rendezvous on the region lock. In closing state the region will not 
accept any new requests but requests in progress will be allowed to complete 
before the close action takes place. In our production we see outlier wait 
times on this lock in excess of several minutes. 

During close when there are requests in flight the regionserver is subject to 
any conceivable reason for delay, like full scans over large regions, expensive 
filtering hierarchies, bugs, or store level performance problems like slow 
HDFS. The regionserver should, on an opt-in basis, interrupt requests in 
progress to shorten close times.

Optionally, via a configuration parameter -- in common practice a system-wide 
default set in hbase-site.xml, but overridable in the table schema for 
per-table settings -- interrupt requests in progress that hold the region lock 
rather than wait for all operations in flight to complete. Send back 
NotServingRegionException("region is closing") to the clients of the 
interrupted operations, as we do after the write lock is acquired. The client 
will transparently relocate the region data and resubmit the aborted requests 
per normal retry policy. This can be less disruptive than waiting a very long 
time for a region to close in extreme outlier cases (e.g. 50 minutes). In such 
extreme cases it is better to abort the regionserver if the close lock cannot 
be acquired in a reasonable amount of time.
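
As an illustration only, the behavior proposed above might be sketched roughly 
as follows. This is not the HBase implementation: ClosingRegionSketch, 
RegionClosingException (a stand-in for NotServingRegionException), the 
interruptOnClose flag, and the maxWaitMs/intervalMs parameters are all 
hypothetical names chosen for this sketch. Requests hold the read lock, close 
wants the write lock, and with the opt-in enabled close interrupts the threads 
still holding the read lock instead of waiting for them indefinitely.

```java
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for HBase's NotServingRegionException.
class RegionClosingException extends Exception {
    RegionClosingException(String msg) { super(msg); }
}

// Simplified model of a region: requests take the read lock, close takes the write lock.
class ClosingRegionSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    // Threads currently running a request under the read lock.
    private final Set<Thread> inFlight = ConcurrentHashMap.newKeySet();
    private final boolean interruptOnClose; // the opt-in switch (hypothetical name)
    private volatile boolean closing = false;

    ClosingRegionSketch(boolean interruptOnClose) {
        this.interruptOnClose = interruptOnClose;
    }

    // Runs one request under the read lock; rejects it if the region is closing.
    <T> T runOperation(Callable<T> op) throws Exception {
        if (closing) {
            throw new RegionClosingException("region is closing");
        }
        lock.readLock().lock();
        inFlight.add(Thread.currentThread());
        try {
            if (closing) { // re-check now that the lock is held
                throw new RegionClosingException("region is closing");
            }
            return op.call();
        } catch (InterruptedException e) {
            if (closing) {
                // Interrupted by close(): report it so the client can retry elsewhere.
                throw new RegionClosingException("region is closing");
            }
            throw e;
        } finally {
            // Simplification: a stray interrupt may still arrive after this point;
            // a real implementation has to account for that.
            inFlight.remove(Thread.currentThread());
            lock.readLock().unlock();
        }
    }

    // Waits for in-flight requests; with the opt-in, interrupts them and bounds the wait.
    void close(long maxWaitMs, long intervalMs) throws InterruptedException {
        closing = true;
        if (!interruptOnClose) {
            lock.writeLock().lock(); // wait as long as it takes
        } else {
            long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(maxWaitMs);
            while (true) {
                for (Thread t : inFlight) {
                    t.interrupt();
                }
                if (lock.writeLock().tryLock(intervalMs, TimeUnit.MILLISECONDS)) {
                    break;
                }
                if (System.nanoTime() - deadline >= 0) {
                    // Stands in for aborting the regionserver.
                    throw new IllegalStateException("close lock not acquired in time");
                }
            }
        }
        try {
            // Flush the memstore and finish the close here (out of scope of the proposal).
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

In this sketch, a request that ignores interruption is never forced off the 
lock, so close() gives up once maxWaitMs has elapsed; that exception path is 
where a real regionserver would abort.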

After all requests have completed, we flush the region's memstore and finish 
the close. The flush portion of the close process is out of scope of this 
proposal. Under normal conditions the flush portion of the close completes 
quickly. It is specifically the waits on the close lock that have been an 
occasional issue in our production, making it difficult to achieve 99.99% 
availability.

  was:
After deciding a region should be closed, the regionserver will set the 
internal region state to closing and wait for all pending requests to complete, 
via a rendezvous on the region lock. In closing state the region will not 
accept any new requests but requests in progress will be allowed to complete 
before the close action takes place. In our production we see outlier wait 
times on this lock in excess of several minutes. 

During close when there are requests in flight the regionserver is subject to 
any conceivable reason for delay, like full scans over large regions, expensive 
filtering hierarchies, bugs, or store level performance problems like slow 
HDFS. The regionserver should interrupt requests in progress to facilitate 
smaller/shorter close times on an opt-in basis.

Optionally, via configuration parameter -- which would be a system wide default 
set in hbase-site.xml in common practice but could be overridden in table 
schema for per table settings -- interrupt requests in progress holding the 
region lock rather than wait for completion of all operations in flight. Send 
back NotServingRegionException("region is closing") to the clients of the 
interrupted operations, like we do after the write lock is acquired. The client 
will transparently relocate the region data and resubmit the aborted requests 
per normal retry policy. This can be less disruptive than waiting for very long 
times for a region to close in extreme outlier cases (e.g. 50 minutes).

After waiting for all requests to complete then we flush the region's memstore 
and finish the close. The flush portion of the close process is out of scope of 
this proposal. Under normal conditions the flush portion of the close completes 
quickly. It is specifically waits on the close lock that has been an occasional 
issue in our production that causes difficulty achieving 99.99% availability.


> Optionally abort requests in progress after deciding a region should close
> --------------------------------------------------------------------------
>
>                 Key: HBASE-25212
>                 URL: https://issues.apache.org/jira/browse/HBASE-25212
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver
>            Reporter: Andrew Kyle Purtell
>            Assignee: Andrew Kyle Purtell
>            Priority: Major
>             Fix For: 3.0.0-alpha-1, 1.7.0, 2.4.0
>
>
> After deciding a region should be closed, the regionserver will set the 
> internal region state to closing and wait for all pending requests to 
> complete, via a rendezvous on the region lock. In closing state the region 
> will not accept any new requests but requests in progress will be allowed to 
> complete before the close action takes place. In our production we see 
> outlier wait times on this lock in excess of several minutes. 
> During close when there are requests in flight the regionserver is subject to 
> any conceivable reason for delay, like full scans over large regions, 
> expensive filtering hierarchies, bugs, or store level performance problems 
> like slow HDFS. The regionserver should, on an opt-in basis, interrupt 
> requests in progress to shorten close times.
> Optionally, via configuration parameter -- which would be a system wide 
> default set in hbase-site.xml in common practice but could be overridden in 
> table schema for per table settings -- interrupt requests in progress holding 
> the region lock rather than wait for completion of all operations in flight. 
> Send back NotServingRegionException("region is closing") to the clients of 
> the interrupted operations, like we do after the write lock is acquired. The 
> client will transparently relocate the region data and resubmit the aborted 
> requests per normal retry policy. This can be less disruptive than waiting 
> for very long times for a region to close in extreme outlier cases (e.g. 50 
> minutes). In such extreme cases it is better to abort the regionserver if the 
> close lock cannot be acquired in a reasonable amount of time.
> After all requests have completed, we flush the region's memstore and finish 
> the close. The flush portion of the close process is out of scope of this 
> proposal. Under normal conditions the flush portion of the close completes 
> quickly. It is specifically the waits on the close lock that have been an 
> occasional issue in our production, making it difficult to achieve 99.99% 
> availability.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
