[ 
https://issues.apache.org/jira/browse/HBASE-28271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17799543#comment-17799543
 ] 

David Manning commented on HBASE-28271:
---------------------------------------

{quote}In cases where a region stays in RIT for considerable time, if enough 
attempts are made by the client to create snapshots on the table, it can easily 
exhaust all handler threads, leading to potentially unresponsive master.{quote}

It can happen more easily than this, too, because you don't have to make 
repeated attempts to snapshot the same table. An attempt to snapshot a 
different table will also hang a new RPC handler.

This is because {{SnapshotManager#snapshotTable}} is {{synchronized}}, and it 
is where {{handler.prepare()}} is called to acquire the table lock. We wait 
indefinitely for the lock held by the region in transition, and we do so while 
still inside {{SnapshotManager}}'s synchronized method, so its monitor is never 
released.

Any additional snapshot RPC, even for a different table, then blocks on 
entering a separate {{synchronized}} method, 
{{SnapshotManager#cleanupSentinels}}. This makes the condition easier to hit if 
you run a job that snapshots every table in the cluster.
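
The blocking pattern described above can be reproduced outside HBase. The 
following is only a minimal sketch, not the actual HBase code: 
{{ToySnapshotManager}}, the latches, and {{secondCallCompletes}} are 
hypothetical names made up for illustration. It assumes nothing beyond the 
fact that two {{synchronized}} instance methods share one monitor.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class SynchronizedBlockingSketch {

    /** Hypothetical stand-in for SnapshotManager; NOT the real HBase class. */
    static class ToySnapshotManager {
        // Stands in for the table lock that stays unavailable while a region is in transition.
        private final CountDownLatch tableLockAvailable;
        // Signals that the first handler is inside the synchronized method.
        final CountDownLatch firstHandlerEntered = new CountDownLatch(1);

        ToySnapshotManager(CountDownLatch tableLockAvailable) {
            this.tableLockAvailable = tableLockAvailable;
        }

        // Waits for the "table lock" while still holding this object's monitor.
        synchronized void snapshotTable(String table) throws InterruptedException {
            firstHandlerEntered.countDown();
            tableLockAvailable.await();
        }

        // Does no waiting of its own, but needs the same monitor to be entered.
        synchronized void cleanupSentinels() {
        }
    }

    /**
     * Returns true if a second handler could run cleanupSentinels() within
     * waitMillis while the first handler was still stuck in snapshotTable().
     */
    static boolean secondCallCompletes(long waitMillis) {
        CountDownLatch tableLockAvailable = new CountDownLatch(1);
        ToySnapshotManager manager = new ToySnapshotManager(tableLockAvailable);
        CountDownLatch secondDone = new CountDownLatch(1);
        try {
            Thread handler1 = new Thread(() -> {
                try {
                    manager.snapshotTable("tableA");
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            handler1.start();
            manager.firstHandlerEntered.await();

            // A request for a different table still has to pass through cleanupSentinels().
            Thread handler2 = new Thread(() -> {
                manager.cleanupSentinels();
                secondDone.countDown();
            });
            handler2.start();

            boolean completedWhileStuck = secondDone.await(waitMillis, TimeUnit.MILLISECONDS);

            // Simulate the region leaving transition so both threads can finish.
            tableLockAvailable.countDown();
            handler1.join();
            handler2.join();
            return completedWhileStuck;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("second handler completed while first was stuck: "
            + secondCallCompletes(200));
    }
}
```

In this sketch the second thread never touches "tableA", yet it cannot make 
progress until the first thread leaves {{snapshotTable}}. That is the same 
shape as a snapshot RPC for an unrelated table waiting at 
{{cleanupSentinels}}.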

> Infinite waiting on lock acquisition by snapshot can result in unresponsive 
> master
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-28271
>                 URL: https://issues.apache.org/jira/browse/HBASE-28271
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 3.0.0-alpha-4, 2.4.17, 2.5.7
>            Reporter: Viraj Jasani
>            Assignee: Viraj Jasani
>            Priority: Major
>         Attachments: image.png
>
>
> When a region is stuck in transition for significant time, any attempt to 
> take snapshot on the table would keep master handler thread in forever 
> waiting state. As part of the creating snapshot on enabled or disabled table, 
> in order to get the table level lock, LockProcedure is executed but if any 
> region of the table is in transition, LockProcedure could not be executed by 
> the snapshot handler, resulting in forever waiting until the region 
> transition is completed, allowing the table level lock to be acquired by the 
> snapshot handler.
> In cases where a region stays in RIT for considerable time, if enough 
> attempts are made by the client to create snapshots on the table, it can 
> easily exhaust all handler threads, leading to potentially unresponsive 
> master. Attached a sample thread dump.
> Proposal: The snapshot handler should not stay stuck forever if it cannot 
> take table level lock, it should fail-fast.
> !image.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
