[jira] [Commented] (HBASE-24526) Deadlock executing assign meta procedure
[ https://issues.apache.org/jira/browse/HBASE-24526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17163209#comment-17163209 ] Michael Stack commented on HBASE-24526: --- Update: I've put aside trying to find the issue we saw on the RegionServer side which we originally thought a deadlock; I cannot reproduce it after multiple runs. Hopefully next time it happens we have enough debug in place. The notes above [~zhangduo] are great. Hoist them out into new Pv2 Scheduling improvements issue? I'm afraid they'll be lost in the shuffle if left deep down here. Could include working on stuff like "It is not easy to fix since the MasterProcedureScheduler is really really complicated...", quoting you from HBASE-19976. > Deadlock executing assign meta procedure > > > Key: HBASE-24526 > URL: https://issues.apache.org/jira/browse/HBASE-24526 > Project: HBase > Issue Type: Bug > Components: proc-v2, Region Assignment >Affects Versions: 2.3.0 >Reporter: Nick Dimiduk >Priority: Critical > > I have what appears to be a deadlock while assigning meta. During recovery, > master creates the assign procedure for meta, and immediately marks meta as > assigned in zookeeper. It then creates the subprocedure to open meta on the > target region. However, the PEWorker pool is full of procedures that are > stuck, I think because their calls to update meta are going nowhere. For what > it's worth, the balancer is running concurrently, and has calculated a plan > size of 41. 
> From the master log, > {noformat} > 2020-06-06 00:34:07,314 INFO > org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure: > Starting pid=17802, ppid=17801, > state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true; > TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN; > state=OPEN, location=null; forceNewPlan=true, retain=false > 2020-06-06 00:34:07,465 INFO > org.apache.hadoop.hbase.zookeeper.MetaTableLocator: Setting hbase:meta > (replicaId=0) location in ZooKeeper as > hbasedn139.example.com,16020,1591403576247 > 2020-06-06 00:34:07,466 INFO > org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Initialized > subprocedures=[{pid=17803, ppid=17802, state=RUNNABLE; > org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure}] > {noformat} > {{pid=17803}} is not mentioned again. hbasedn139 never receives an > {{openRegion}} RPC. > Meanwhile, additional procedures are scheduled and picked up by workers, each > getting "stuck". I see log lines for all 16 PEWorker threads, saying that > they are stuck. > {noformat} > 2020-06-06 00:34:07,961 INFO > org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Took xlock > for pid=17804, state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE; > TransitRegionStateProcedure table=IntegrationTestBigLinkedList, > region=54f4f6c0e921e6d25e6043cba79c09aa, REOPEN/MOVE > 2020-06-06 00:34:07,961 INFO > org.apache.hadoop.hbase.master.assignment.RegionStateStore: pid=17804 > updating hbase:meta row=54f4f6c0e921e6d25e6043cba79c09aa, > regionState=CLOSING, regionLocation=hbasedn046.example.com,16020,1591402383956 > ... > 2020-06-06 00:34:22,295 WARN > org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Worker stuck > PEWorker-16(pid=17804), run time 14.3340 sec > ... > 2020-06-06 00:34:27,295 WARN > org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Worker stuck > PEWorker-16(pid=17804), run time 19.3340 sec > ... 
> {noformat} > The cluster stays in this state, with PEWorker thread stuck for upwards of 15 > minutes. Eventually master starts logging > {noformat} > 2020-06-06 00:50:18,033 INFO > org.apache.hadoop.hbase.client.RpcRetryingCallerImpl: Call exception, > tries=30, retries=31, started=970072 ms ago, cancelled=false, msg=Call queue > is full on hbasedn139.example.com,16020,1591403576247, too many items queued > ?, details=row > 'IntegrationTestBigLinkedList,,1591398987965.54f4f6c0e921e6d25e6043cba79c09aa.' > on table 'hbase:meta' at region=hbase:meta,,1. > 1588230740, hostname=hbasedn139.example.com,16020,1591403576247, seqNum=-1, > see https://s.apache.org/timeout > {noformat} > The master never recovers on its own. > I'm not sure how common this condition might be. This popped after about 20 > total hours of running ITBLL with ServerKillingMonkey. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[ https://issues.apache.org/jira/browse/HBASE-24526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158522#comment-17158522 ] Huaxiang Sun commented on HBASE-24526: -- Thanks [~zhangduo] for the detailed explanation. We traced further; the issue is not on the master side. It is on the RS side, where the RS receives the openMeta RPC but is not able to process it. More logging has been added to the RS code, and we are still working on reproducing.
[ https://issues.apache.org/jira/browse/HBASE-24526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151574#comment-17151574 ] Duo Zhang commented on HBASE-24526: --- In general there are two problems. 1. Updating meta is a blocking operation, which causes the PEWorker to wait. So if meta is temporarily unavailable, it is possible that all PEWorkers are stuck and we cannot make any progress for a long time. This cannot be fully solved by checking meta availability before executing the procedure, as we need some time to find out that meta is unavailable, and the state of meta is reset by TRSP, which is also executed by a PEWorker, which could make things worse. See HBASE-19976 and its sub-tasks for how we plan to partially solve this by introducing priority for procedures and also adding more workers if all workers are stuck. There is a UT called TestProcedurePriority which confirms that the mechanism can eventually make progress. IIRC there is still a problem that we always poll SCP ahead of TRSP, so if a bunch of servers crash at the same time, we have to finish all the WAL splitting before we actually assign meta... In general, on the master branch, since we have built-in async connection support for the master service, we could introduce a sub-procedure for updating meta in TRSP and SCP, so the parent procedure releases its PEWorker, and in the update-meta procedure we could use the async connection to release the PEWorker while the meta update is in flight. On branch-2.x this is not possible, but one possible approach is to reduce the retry count and timeout so the PEWorker is released sooner and we have a chance to schedule the highest-priority procedure. 2. All meta-updating operations are performed under the region node lock, together with the modification to the in-memory state, which implies that the in-memory state of a region is always the same as the record in meta.
This is a very strong assumption which makes the logic much easier to implement, but it also makes the optimization in #1 impossible: when a meta update fails, if we still want to keep the state in sync, the only way is to abort... Maybe in some places we retry on meta-update failure, but I'm not sure the logic is correct. We should revisit the related code to remove this assumption, so we can apply the optimization in #1. Thanks.
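The saturation shape in problem #1 can be sketched as a toy model: a fixed worker pool where every worker blocks waiting on meta, while the one task that could bring meta online waits in the queue for a free worker. This is a hypothetical illustration, not HBase code; all names here are invented.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Toy model of PEWorker saturation: every worker blocks waiting for meta,
// and the one task that could bring meta online never gets a thread.
public class WorkerSaturationDemo {

    // Returns true if the "assign meta" task managed to run within 500 ms.
    static boolean simulate(int workers) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        CountDownLatch metaOnline = new CountDownLatch(1);
        // Fill every worker with a procedure that blocks on meta, like a
        // TRSP whose meta update can never complete.
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                try {
                    metaOnline.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        // The meta-assign work is queued last; no worker is free to run it.
        Future<?> assignMeta = pool.submit(() -> metaOnline.countDown());
        boolean progressed;
        try {
            assignMeta.get(500, TimeUnit.MILLISECONDS);
            progressed = true;
        } catch (TimeoutException | ExecutionException e) {
            // Deadlock shape: workers wait on meta, meta waits on a worker.
            progressed = false;
        }
        metaOnline.countDown(); // unblock everything so the JVM can exit
        pool.shutdownNow();
        return progressed;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(simulate(16) ? "progressed" : "stuck"); // prints "stuck"
    }
}
```

The sub-procedure idea above breaks this cycle by having the blocking step yield its worker instead of occupying it; raising procedure priority or adding workers only helps if a free worker eventually reaches the meta-assign task.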
[ https://issues.apache.org/jira/browse/HBASE-24526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17149758#comment-17149758 ] Nick Dimiduk commented on HBASE-24526: -- Another note: today's run is with the new procedure-based wal splitting, thus putting more load on the procedure engine.
[ https://issues.apache.org/jira/browse/HBASE-24526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17149739#comment-17149739 ] Nick Dimiduk commented on HBASE-24526: -- It also appears we're setting meta location in ZooKeeper before meta has been confirmed open on the target region server. I think this is a big part of why the worker pool gets backed up. In this case, it looks like the assigned rs crashes while opening meta, but since a location is set in ZK, all the other assign procedures start executing. They flood the worker pool, but none of them can make progress because meta isn't online. Meanwhile, recovery of meta cannot proceed because the procedure worker pool is saturated and cannot accept new work. The cluster hangs in this brown-out state until master is restarted, and the new master gets to try again.
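The ordering concern described here can be sketched in isolation: publish-then-open leaves a stale location advertised when the open fails, while open-then-publish does not. `ZkLocationStore`, the method names, and the server name are illustrative stand-ins, not HBase APIs.

```java
import java.util.Optional;

// Hedged sketch of publishing a region location before vs. after the open
// on the target server is confirmed.
public class MetaPublishOrdering {

    static class ZkLocationStore {
        private volatile String location; // null until a location is published
        void publish(String loc) { location = loc; }
        Optional<String> get() { return Optional.ofNullable(location); }
    }

    // Problematic ordering: publish the location first, then attempt the open.
    // If the open fails (e.g. the RS crashes mid-open), ZK keeps advertising a
    // dead location and dependent procedures stampede it.
    static Optional<String> publishThenOpen(ZkLocationStore zk, boolean openSucceeds) {
        zk.publish("rs1.example.com,16020,1591403576247");
        return zk.get(); // present even when openSucceeds is false
    }

    // Safer ordering: confirm the open first, publish only on success, so a
    // failed open leaves no location for other procedures to chase.
    static Optional<String> openThenPublish(ZkLocationStore zk, boolean openSucceeds) {
        if (openSucceeds) {
            zk.publish("rs1.example.com,16020,1591403576247");
        }
        return zk.get();
    }

    public static void main(String[] args) {
        System.out.println(publishThenOpen(new ZkLocationStore(), false).isPresent()); // true
        System.out.println(openThenPublish(new ZkLocationStore(), false).isPresent()); // false
    }
}
```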
[ https://issues.apache.org/jira/browse/HBASE-24526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17149736#comment-17149736 ] Nick Dimiduk commented on HBASE-24526: -- One curiosity in the log, it looks like master believes log splitting is complete before the worker has finished. The master first sees the split log dir is empty, but when it tries to delete the directory, hdfs refuses because there are files present. From the master, I see {noformat} 2020-07-01 06:17:27,237 INFO org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: Start pid=116621, state=RUNNABLE:SERVER_CRASH_START, locked=true; ServerCrashProcedure server=hbasedn189.example.com,16020,1593583050478, splitWal=true, meta=true ... 2020-07-01 06:17:27,241 INFO org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=116622, ppid=116621, state=RUNNABLE:ACQUIRE_SPLIT_WAL_WORKER; org.apache.hadoop.hbase.master.procedure.SplitWALProcedure}, {pid=116623, ppid=116621, state=RUNNABLE:ACQUIRE_SPLIT_WAL_WORKER; org.apache.hadoop.hbase.master.procedure.SplitWALProcedure}] 2020-07-01 06:17:31,449 INFO org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished subprocedure pid=116623, resume processing parent pid=116621, state=RUNNABLE:SERVER_CRASH_DELETE_SPLIT_META_WALS_DIR, locked=true; ServerCrashProcedure server=hbasedn189.example.com,16020,1593583050478, splitWal=true, meta=true 2020-07-01 06:17:31,449 INFO org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=116623, ppid=116621, state=SUCCESS; org.apache.hadoop.hbase.master.procedure.SplitWALProcedure in 4.2070 sec 2020-07-01 06:17:31,452 INFO org.apache.hadoop.hbase.master.SplitLogManager: hdfs://nick-itbll/hbase-itbll/WALs/hbasedn189.example.com,16020,1593583050478-splitting dir is empty, no logs to split. 
2020-07-01 06:17:31,452 INFO org.apache.hadoop.hbase.master.SplitWALManager: size of WALs of hbasedn189.example.com,16020,1593583050478 is 0, isMeta: true 2020-07-01 06:17:31,454 WARN org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: Remove WAL directory of server hbasedn189.example.com,16020,1593583050478 failed, ignore... org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.fs.PathIsNotEmptyDirectoryException): `/hbase-itbll/WALs/hbasedn189.example.com,16020,1593583050478-splitting is non empty': Directory is not empty ... at org.apache.hadoop.hdfs.DistributedFileSystem.delete(DistributedFileSystem.java:887) at org.apache.hadoop.hbase.master.SplitWALManager.deleteWALDir(SplitWALManager.java:125) at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.cleanupSplitDir(ServerCrashProcedure.java:259) at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:168) at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:65) {noformat}
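The failure pattern in that log is a check-then-act race: the directory is observed empty, then the delete fails because a file appeared in between. The sketch below is a local-filesystem analogue using java.nio in place of the HDFS client, purely for illustration; attempting the delete and handling the not-empty failure avoids trusting a stale listing.

```java
import java.io.IOException;
import java.nio.file.DirectoryNotEmptyException;
import java.nio.file.Files;
import java.nio.file.Path;

// Local-filesystem analogue of the "dir is empty, delete fails" race.
public class DeleteDirRace {

    // Returns true if the directory was deleted, false if it turned out to be
    // non-empty at delete time (the condition the master logs as a WARN).
    static boolean tryDeleteEmptyDir(Path dir) throws IOException {
        try {
            Files.delete(dir); // atomic: refuses to delete a non-empty directory
            return true;
        } catch (DirectoryNotEmptyException e) {
            return false; // something wrote into the dir after it looked empty
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("splitting");
        // Simulate the race: a file lands after the "dir is empty" observation.
        Files.createFile(dir.resolve("wal.1"));
        System.out.println(tryDeleteEmptyDir(dir)); // prints "false"
        Files.delete(dir.resolve("wal.1"));
        System.out.println(tryDeleteEmptyDir(dir)); // prints "true"
    }
}
```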
[ https://issues.apache.org/jira/browse/HBASE-24526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17149647#comment-17149647 ] Nick Dimiduk commented on HBASE-24526: -- My test harness found this issue again last night: full of stuck workers.
[ https://issues.apache.org/jira/browse/HBASE-24526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17129770#comment-17129770 ] Michael Stack commented on HBASE-24526: --- These went in last night; seem related: {code} commit 4486a565b5cd9b9304701bc24c0f7d30cf174711 Author: Duo Zhang Date: Tue Jun 9 11:07:16 2020 +0800 HBASE-24117 Shutdown AssignmentManager before ProcedureExecutor may cause SCP to accidentally skip assigning a region (#1865) Signed-off-by: Michael Stack commit dd1010c15d1737d6f83497ef56e4dad09d80ac74 Author: Duo Zhang Date: Tue Jun 9 08:14:00 2020 +0800 HBASE-24517 AssignmentManager.start should add meta region to ServerStateNode (#1866) Signed-off-by: Viraj Jasani Signed-off-by: Wellington Ramos Chevreuil {code}
[ https://issues.apache.org/jira/browse/HBASE-24526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17129760#comment-17129760 ] Nick Dimiduk commented on HBASE-24526: -- I have a few observations; I propose each here for discussion:
# Master should not set the meta location in ZooKeeper until after the region open has been acknowledged by the region server. Other procedures that edit meta could then short-circuit abort and reschedule, since they would know they cannot make progress in this state.
# The PEWorker pool should prioritize operations involving meta (and other system tables). I presume there's some kind of queuing mechanism here, which may be a false assumption.
# Work dispatched to a PEWorker thread should not be permitted to hold that thread indefinitely. We should have an external mechanism, or something in the PEWorker's run loop, that interrupts procedure execution after a time limit. Procedures are designed to be durable and resumable, so this shouldn't impact correctness.
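The third observation can be sketched with a time budget around procedure execution: on overrun, interrupt the procedure, free the worker thread, and let the scheduler retry. This is a hypothetical illustration only — `runWithBudget` and the single-thread executor are stand-ins, not the ProcedureExecutor API:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class WorkerTimeoutDemo {

    // Run one "procedure" on a worker thread under a time budget. On overrun,
    // interrupt it and hand the thread back; the caller would reschedule the
    // procedure, which is safe only because procedures are durable/resumable.
    static boolean runWithBudget(Runnable procedure, long budgetMillis) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            Future<?> f = worker.submit(procedure);
            try {
                f.get(budgetMillis, TimeUnit.MILLISECONDS);
                return true;               // finished within budget
            } catch (TimeoutException e) {
                f.cancel(true);            // interrupt the stuck procedure
                return false;              // caller reschedules it later
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        } finally {
            worker.shutdownNow();
        }
    }

    public static void main(String[] args) {
        Runnable stuckMetaEdit = () -> {
            try {
                new CountDownLatch(1).await(); // simulates a meta RPC that never returns
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        System.out.println(runWithBudget(stuckMetaEdit, 200)
            ? "completed" : "interrupted; reschedule"); // prints "interrupted; reschedule"
    }
}
```

With a budget like this in place, a procedure stuck on an unreachable meta region would surrender its worker instead of pinning it, so the pool could eventually run the procedure that unsticks everything.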
[ https://issues.apache.org/jira/browse/HBASE-24526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17129756#comment-17129756 ] Nick Dimiduk commented on HBASE-24526: -- bq. Is there reason to believe this is new for 2.3, or we just haven't gotten evidence that it's elsewhere yet?

I don't have an assessment one way or the other. I suspect it's possible on anything running proc-v2 (the PEWorker pool getting completely consumed), but it may only be reachable in combination with the details of this {{TransitRegionStateProcedure}} and how it handles meta. I was running branch-2.3 at {{02c099d566}}.
[ https://issues.apache.org/jira/browse/HBASE-24526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17129749#comment-17129749 ] Sean Busbey commented on HBASE-24526: - Is there reason to believe this is new for 2.3, or we just haven't gotten evidence that it's elsewhere yet?