[ https://issues.apache.org/jira/browse/HBASE-20680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Josh Elser resolved HBASE-20680. -------------------------------- Resolution: Incomplete Closing this as stale. Need to revisit on the heels of the other good work to stabilize recovery/assignment that have gone in. Still an issue with RSGroups. > Master hung during initialization waiting on hbase:meta to be assigned which > never does > --------------------------------------------------------------------------------------- > > Key: HBASE-20680 > URL: https://issues.apache.org/jira/browse/HBASE-20680 > Project: HBase > Issue Type: Bug > Reporter: Josh Elser > Priority: Critical > Attachments: 20680-logs.tar.gz > > > When running IntegrationTestRSGroups, the test became hung waiting on the > master to be initialized. > The hbase cluster was launched without RSGroup config. The test script adds > required RSGroup configs to hbase-site.xml and restarts the cluster. > It seems that, at one point while the master was trying to assign meta, the > destination regionserver was in the middle of going down. This has now left > HBase in a state where it starts the regionserver recovery procedures, but > never actually gets hbase:meta assigned. > {code} > 2018-06-01 10:47:50,024 INFO [PEWorker-5] procedure2.ProcedureExecutor: > Initialized subprocedures=[{pid=41, ppid=40, > state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, > region=1588230740}] > 2018-06-01 10:47:50,026 DEBUG [WALProcedureStoreSyncThread] > wal.WALProcedureStore: hsync completed for > hdfs://ctr-e138-1518143905142-340983-03-000014.hwx.site:8020/apps/hbase/data/MasterProcWALs/pv2-00000000000000000002.log > 2018-06-01 10:47:50,026 INFO [PEWorker-3] > procedure.MasterProcedureScheduler: pid=41, ppid=40, > state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, > region=1588230740 checking lock on 1588230740 > 2018-06-01 10:47:50,026 DEBUG [PEWorker-3] assignment.RegionStates: setting > location=ctr-e138-1518143905142-340983-03-000014.hwx.site,16020,1527849994190 > for rit=OFFLINE, location=ctr- > e138-1518143905142-340983-03-000014.hwx.site,16020,1527849994190, > table=hbase:meta, region=1588230740 last loc=null > 2018-06-01 10:47:50,026 INFO [PEWorker-3] assignment.AssignProcedure: > Starting pid=41, ppid=40, state=RUNNABLE:REGION_TRANSITION_QUEUE; > AssignProcedure table=hbase:meta, region=1588230740; > rit=OFFLINE, > location=ctr-e138-1518143905142-340983-03-000014.hwx.site,16020,1527849994190; > forceNewPlan=false, retain=true target svr=null > {code} > At Fri Jun 1 10:48:04, master was restarted. > The new master picked up pid=41: > {code} > 2018-06-01 10:48:47,971 INFO [PEWorker-1] assignment.AssignProcedure: > Starting pid=41, ppid=40, state=RUNNABLE:REGION_TRANSITION_QUEUE; > AssignProcedure table=hbase:meta, region=1588230740; > rit=OFFLINE, location=null; forceNewPlan=false, retain=false target svr=null > {code} > There was no further log for pid=41 after above. > Later when master initiated another meta recovery procedure (pid=42), the > second procedure seems to be locked out by the former: > {code} > 2018-06-01 10:49:34,292 INFO [PEWorker-2] > procedure.MasterProcedureScheduler: pid=43, ppid=42, > state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, > region=1588230740, > target=ctr-e138-1518143905142-340983-03-000014.hwx.site,16020,1527849994190 > checking lock on 1588230740 > 2018-06-01 10:49:34,293 DEBUG [PEWorker-2] > assignment.RegionTransitionProcedure: LOCK_EVENT_WAIT pid=43 serverLocks={}, > namespaceLocks={}, tableLocks={}, > regionLocks={{1588230740=exclusiveLockOwner=41, sharedLockCount=0, > waitingProcCount=1}}, peerLocks={} > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)