[ https://issues.apache.org/jira/browse/HBASE-27498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tak-Lon (Stephen) Wu resolved HBASE-27498. ------------------------------------------ Fix Version/s: 2.4.16 2.5.3 Resolution: Fixed > Observed lot of threads blocked in > ConnectionImplementation.getKeepAliveMasterService > ------------------------------------------------------------------------------------- > > Key: HBASE-27498 > URL: https://issues.apache.org/jira/browse/HBASE-27498 > Project: HBase > Issue Type: Bug > Components: Client > Affects Versions: 2.5.0 > Reporter: Vaibhav Joshi > Priority: Major > Fix For: 2.4.16, 2.5.3 > > Attachments: Screenshot 2022-11-16 at 10.06.59 AM.png > > > Recently We observed that lot of threads are blocked in method > "ConnectionImplementation.getKeepAliveMasterService" during some > initialization stages of rolling restart workflow. > During rolling restart, we make RPC calls to Master using > RpcRetryingCallerImpl, so as part of initialization we call > "ConnectionImplementation.getKeepAliveMasterService" for each thread. > Internally this method do RPC call within a synchronized block to check if > master is running (mss.isMasterRunning). > Lots of threads are in blocked state due following synchronized block > synchronized (masterLock) { > if (!isKeepAliveMasterConnectedAndRunning(this.masterServiceState)) > { MasterServiceStubMaker stubMaker = new MasterServiceStubMaker(); > this.masterServiceState.stub = stubMaker.makeStub(); } > resetMasterServiceState(this.masterServiceState); > } > In Thread Dump Analyzer (2.4), we get warning that "A lot of threads are > waiting for this monitor to become available again. > This might indicate a congestion. You also should analyze other locks > blocked by threads waiting for this monitor as there might be much more > threads waiting for it.". Please check attached screenshot !Screenshot > 2022-11-16 at 10.06.59 AM.png|width=1639,height=971! > -------------------- > "pool-11-thread-158" #313 prio=5 os_prio=0 tid=0x000055b88bcb8800 nid=0x404e > waiting for monitor entry [0x00007fa48aa86000] > java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.hadoop.hbase.client.ConnectionImplementation.getKeepAliveMasterService(ConnectionImplementation.java:1336) > - waiting to lock <0x00000005d30ecb68> (a java.lang.Object) > at > org.apache.hadoop.hbase.client.ConnectionImplementation.getMaster(ConnectionImplementation.java:1327) > at > org.apache.hadoop.hbase.client.MasterCallable.prepare(MasterCallable.java:57) > at > org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:103) > at > org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3019) > at > org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3011) > at org.apache.hadoop.hbase.client.HBaseAdmin.move(HBaseAdmin.java:1458) > at > org.apache.hadoop.hbase.util.MoveWithoutAck.call(MoveWithoutAck.java:58) > at > org.apache.hadoop.hbase.util.MoveWithoutAck.call(MoveWithoutAck.java:33) > ------------------- > > *Proposal:* > We can optimize this flow as follows > 1. Use double checked lock for > "isKeepAliveMasterConnectedAndRunning(this.masterServiceState)" so that > theads don't race for monitor, when master is running. > 2. "isKeepAliveMasterConnectedAndRunning()" method should reuse the Globally > cached state of isMasterRunning instead of doing expensive Call in for each > thread. > Check PR [https://github.com/apache/hbase/pull/4889] for more details. > Note: The "master" branch uses "AsyncConnectionImpl" so apparently we don't > have issues there. > > -- This message was sent by Atlassian Jira (v8.20.10#820010)