[ https://issues.apache.org/jira/browse/HBASE-28271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17799543#comment-17799543 ]
David Manning commented on HBASE-28271:
---------------------------------------

{quote}In cases where a region stays in RIT for a considerable time, if enough attempts are made by the client to create snapshots on the table, it can easily exhaust all handler threads, leading to a potentially unresponsive master.{quote}

It can happen more easily than this, too, because you don't have to make repeated attempts to snapshot the same table. An attempt to snapshot a different table will still hang a new RPC handler. This is because {{SnapshotManager#snapshotTable}} is {{synchronized}}, and this is where the {{handler.prepare()}} call is made to acquire the lock. We wait indefinitely for the lock held by the region in transition, but we do so within {{SnapshotManager}}'s synchronized block. Any additional snapshot RPC, even for a different table, will end up blocked on entering a separate {{synchronized}} method, {{SnapshotManager#cleanupSentinels}}. This makes the condition easier to hit if you are running a process that snapshots all tables in the cluster.

> Infinite waiting on lock acquisition by snapshot can result in unresponsive master
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-28271
>                 URL: https://issues.apache.org/jira/browse/HBASE-28271
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 3.0.0-alpha-4, 2.4.17, 2.5.7
>            Reporter: Viraj Jasani
>            Assignee: Viraj Jasani
>            Priority: Major
>         Attachments: image.png
>
>
> When a region is stuck in transition for a significant time, any attempt to
> take a snapshot of the table keeps a master handler thread waiting forever.
> As part of creating a snapshot on an enabled or disabled table, a
> LockProcedure is executed to get the table-level lock. If any region of the
> table is in transition, the snapshot handler cannot execute the
> LockProcedure, so it waits until the region transition completes and the
> table-level lock can be acquired.
> In cases where a region stays in RIT for a considerable time, if enough
> attempts are made by the client to create snapshots on the table, it can
> easily exhaust all handler threads, leading to a potentially unresponsive
> master. Attached a sample thread dump.
> Proposal: The snapshot handler should not stay stuck forever if it cannot
> take the table-level lock; it should fail fast.
> !image.png!

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
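
To illustrate the fail-fast proposal in the issue description, here is a minimal Java sketch. It is not HBase code: the class {{FailFastSnapshotSketch}}, its {{lockFor}} and {{snapshotTable}} methods, and the per-table {{ReentrantLock}} map are all hypothetical stand-ins for the table-level lock, and a background thread stands in for a region stuck in transition. The two ideas it tries to show are that the wait for the table lock is bounded by a timeout, and that the wait does not happen inside a manager-wide {{synchronized}} method, so a request for an unrelated table is not blocked.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch only; none of these names exist in HBase.
class FailFastSnapshotSketch {
    // Stand-in for table-level locks, one per table name.
    private final ConcurrentHashMap<String, ReentrantLock> tableLocks = new ConcurrentHashMap<>();

    ReentrantLock lockFor(String table) {
        return tableLocks.computeIfAbsent(table, t -> new ReentrantLock());
    }

    // Returns true if the "snapshot" ran, false if the table lock could not
    // be acquired within timeoutMs. Deliberately NOT synchronized, so a stuck
    // table cannot block requests for other tables.
    boolean snapshotTable(String table, long timeoutMs) {
        ReentrantLock lock = lockFor(table);
        try {
            if (!lock.tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
                return false; // fail fast instead of waiting forever
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        try {
            // The actual snapshot work would go here.
            return true;
        } finally {
            lock.unlock();
        }
    }

    // Runs the scenario: t1 is locked by a simulated region in transition.
    // Returns { t1 while locked, t2 while t1 is locked, t1 after release }.
    static boolean[] demo() {
        FailFastSnapshotSketch mgr = new FailFastSnapshotSketch();
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread rit = new Thread(() -> {
            ReentrantLock lock = mgr.lockFor("t1");
            lock.lock();
            try {
                held.countDown();
                release.await(); // hold the lock until told to let go
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                lock.unlock();
            }
        });
        rit.setDaemon(true);
        rit.start();
        try {
            held.await();
            boolean whileLocked = mgr.snapshotTable("t1", 100);
            boolean otherTable = mgr.snapshotTable("t2", 100);
            release.countDown();
            rit.join();
            boolean afterRelease = mgr.snapshotTable("t1", 100);
            return new boolean[] {whileLocked, otherTable, afterRelease};
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] r = demo();
        System.out.println("t1 while locked: " + r[0]); // false
        System.out.println("t2 while t1 locked: " + r[1]); // true
        System.out.println("t1 after release: " + r[2]); // true
    }
}
```

In this sketch the caller gets {{false}} after the timeout and can report an error to the client, rather than holding a handler thread until the region transition completes. How a real fix would surface the failure, and what timeout it would use, is left open here.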