[ https://issues.apache.org/jira/browse/HBASE-24526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17149647#comment-17149647 ]
Nick Dimiduk commented on HBASE-24526:
--------------------------------------

My test harness found this issue again last night: the PEWorker pool was full of stuck workers.

> Deadlock executing assign meta procedure
> ----------------------------------------
>
>                 Key: HBASE-24526
>                 URL: https://issues.apache.org/jira/browse/HBASE-24526
>             Project: HBase
>          Issue Type: Bug
>          Components: proc-v2, Region Assignment
>    Affects Versions: 2.3.0
>            Reporter: Nick Dimiduk
>            Priority: Critical
>
> I have what appears to be a deadlock while assigning meta. During recovery, the master creates the assign procedure for meta and immediately marks meta as assigned in ZooKeeper. It then creates the subprocedure to open meta on the target region server. However, the PEWorker pool is full of procedures that are stuck, I think because their calls to update meta are going nowhere. For what it's worth, the balancer is running concurrently and has calculated a plan of size 41.
> From the master log:
> {noformat}
> 2020-06-06 00:34:07,314 INFO org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure: Starting pid=17802, ppid=17801, state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true; TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN; state=OPEN, location=null; forceNewPlan=true, retain=false
> 2020-06-06 00:34:07,465 INFO org.apache.hadoop.hbase.zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in ZooKeeper as hbasedn139.example.com,16020,1591403576247
> 2020-06-06 00:34:07,466 INFO org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=17803, ppid=17802, state=RUNNABLE; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure}]
> {noformat}
> {{pid=17803}} is not mentioned again. hbasedn139 never receives an {{openRegion}} RPC.
> Meanwhile, additional procedures are scheduled and picked up by workers, each of which gets "stuck". I see log lines for all 16 PEWorker threads saying that they are stuck.
> {noformat}
> 2020-06-06 00:34:07,961 INFO org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Took xlock for pid=17804, state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE; TransitRegionStateProcedure table=IntegrationTestBigLinkedList, region=54f4f6c0e921e6d25e6043cba79c09aa, REOPEN/MOVE
> 2020-06-06 00:34:07,961 INFO org.apache.hadoop.hbase.master.assignment.RegionStateStore: pid=17804 updating hbase:meta row=54f4f6c0e921e6d25e6043cba79c09aa, regionState=CLOSING, regionLocation=hbasedn046.example.com,16020,1591402383956
> ...
> 2020-06-06 00:34:22,295 WARN org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Worker stuck PEWorker-16(pid=17804), run time 14.3340 sec
> ...
> 2020-06-06 00:34:27,295 WARN org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Worker stuck PEWorker-16(pid=17804), run time 19.3340 sec
> ...
> {noformat}
> The cluster stays in this state, with PEWorker threads stuck, for upwards of 15 minutes. Eventually the master starts logging
> {noformat}
> 2020-06-06 00:50:18,033 INFO org.apache.hadoop.hbase.client.RpcRetryingCallerImpl: Call exception, tries=30, retries=31, started=970072 ms ago, cancelled=false, msg=Call queue is full on hbasedn139.example.com,16020,1591403576247, too many items queued ?, details=row 'IntegrationTestBigLinkedList,,1591398987965.54f4f6c0e921e6d25e6043cba79c09aa.' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=hbasedn139.example.com,16020,1591403576247, seqNum=-1, see https://s.apache.org/timeout
> {noformat}
> The master never recovers on its own.
> I'm not sure how common this condition might be. This popped up after about 20 total hours of running ITBLL with ServerKillingMonkey.
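
To make the failure mode concrete, below is a minimal sketch of the pool-exhaustion shape described above. It is plain Java, not HBase internals; the class, latch, and pid values are made up for illustration. Every worker slot is occupied by a task that blocks waiting for meta to come online, while the one queued task that would bring meta online (the {{OpenRegionProcedure}} analogue) never gets a thread to run on.

{code:java}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Not HBase code: a toy model of the wedge seen in the master log above.
public class StuckWorkerPoolSketch {
  static final int WORKERS = 16;  // mirrors the 16 PEWorker threads in the log
  static final CountDownLatch metaOnline = new CountDownLatch(1);

  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(WORKERS);

    // Fill every worker slot with a procedure that must write to meta first.
    for (int i = 0; i < WORKERS; i++) {
      final int pid = 17804 + i;  // hypothetical pids, for flavor only
      pool.submit(() -> {
        try {
          // Stand-in for "update hbase:meta": blocks until meta is open.
          metaOnline.await();
          System.out.println("pid=" + pid + " updated meta");
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      });
    }

    // The subprocedure that would actually bring meta online is queued last;
    // with every worker blocked above, it never gets to run.
    pool.submit(() -> {
      System.out.println("pid=17803 opening hbase:meta");
      metaOnline.countDown();
    });

    pool.shutdown();
    // Never completes: prints terminated=false after the timeout.
    System.out.println("terminated=" + pool.awaitTermination(5, TimeUnit.SECONDS));
  }
}
{code}

Run as written, no task ever prints "updated meta" and the opener never runs, which is the same wedge the "Worker stuck PEWorker-..." warnings show: the workers cannot finish without meta, and meta cannot open without a free worker.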