[ 
https://issues.apache.org/jira/browse/HBASE-20680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Elser resolved HBASE-20680.
--------------------------------
    Resolution: Incomplete

Closing this as stale. Need to revisit on the heels of the other good work to 
stabilize recovery/assignment that has gone in. Still an issue with RSGroups.

> Master hung during initialization waiting on hbase:meta to be assigned which 
> never does
> ---------------------------------------------------------------------------------------
>
>                 Key: HBASE-20680
>                 URL: https://issues.apache.org/jira/browse/HBASE-20680
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Josh Elser
>            Priority: Critical
>         Attachments: 20680-logs.tar.gz
>
>
> When running IntegrationTestRSGroups, the test hung waiting for the master 
> to finish initializing.
> The HBase cluster was launched without RSGroup config. The test script adds 
> the required RSGroup configs to hbase-site.xml and restarts the cluster.
> It seems that, at one point while the master was trying to assign meta, the 
> destination regionserver was in the middle of going down. This left HBase in 
> a state where it starts the regionserver recovery procedures but never 
> actually gets hbase:meta assigned.
> {code}
> 2018-06-01 10:47:50,024 INFO  [PEWorker-5] procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=41, ppid=40, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, region=1588230740}]
> 2018-06-01 10:47:50,026 DEBUG [WALProcedureStoreSyncThread] wal.WALProcedureStore: hsync completed for hdfs://ctr-e138-1518143905142-340983-03-000014.hwx.site:8020/apps/hbase/data/MasterProcWALs/pv2-00000000000000000002.log
> 2018-06-01 10:47:50,026 INFO  [PEWorker-3] procedure.MasterProcedureScheduler: pid=41, ppid=40, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, region=1588230740 checking lock on 1588230740
> 2018-06-01 10:47:50,026 DEBUG [PEWorker-3] assignment.RegionStates: setting location=ctr-e138-1518143905142-340983-03-000014.hwx.site,16020,1527849994190 for rit=OFFLINE, location=ctr-e138-1518143905142-340983-03-000014.hwx.site,16020,1527849994190, table=hbase:meta, region=1588230740 last loc=null
> 2018-06-01 10:47:50,026 INFO  [PEWorker-3] assignment.AssignProcedure: Starting pid=41, ppid=40, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, region=1588230740; rit=OFFLINE, location=ctr-e138-1518143905142-340983-03-000014.hwx.site,16020,1527849994190; forceNewPlan=false, retain=true target svr=null
> {code}
> At Fri Jun 1 10:48:04, the master was restarted.
> The new master picked up pid=41:
> {code}
> 2018-06-01 10:48:47,971 INFO  [PEWorker-1] assignment.AssignProcedure: Starting pid=41, ppid=40, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, region=1588230740; rit=OFFLINE, location=null; forceNewPlan=false, retain=false target svr=null
> {code}
> There were no further log entries for pid=41 after the one above.
> Later, when the master initiated another meta recovery procedure (pid=42), 
> its child AssignProcedure (pid=43) appears to be locked out by pid=41, which 
> still holds the exclusive lock on the region:
> {code}
> 2018-06-01 10:49:34,292 INFO  [PEWorker-2] procedure.MasterProcedureScheduler: pid=43, ppid=42, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, region=1588230740, target=ctr-e138-1518143905142-340983-03-000014.hwx.site,16020,1527849994190 checking lock on 1588230740
> 2018-06-01 10:49:34,293 DEBUG [PEWorker-2] assignment.RegionTransitionProcedure: LOCK_EVENT_WAIT pid=43 serverLocks={}, namespaceLocks={}, tableLocks={}, regionLocks={{1588230740=exclusiveLockOwner=41, sharedLockCount=0, waitingProcCount=1}}, peerLocks={}
> {code}
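
For anyone triaging similar hangs from master logs, the LOCK_EVENT_WAIT entry quoted above can be picked apart mechanically to see which procedure is waiting and which one holds the region lock. The following is only a rough sketch: the function name blocked_by and the regular expressions are illustrative assumptions based on the single log entry in this report, not an HBase tool or a guaranteed log format.

```python
import re

# Matches one entry inside regionLocks={{...}}, e.g.
#   1588230740=exclusiveLockOwner=41, sharedLockCount=0, waitingProcCount=1
# The pattern is inferred from the one log entry in this report (assumption).
REGION_LOCK = re.compile(
    r"(?P<region>\w+)=exclusiveLockOwner=(?P<owner>\d+),\s*"
    r"sharedLockCount=(?P<shared>\d+),\s*"
    r"waitingProcCount=(?P<waiting>\d+)"
)
WAITER = re.compile(r"LOCK_EVENT_WAIT pid=(?P<pid>\d+)")


def blocked_by(log_entry):
    """Return (waiting_pid, {region: exclusive_owner_pid}) for a
    LOCK_EVENT_WAIT entry, or None if the entry is not a lock wait.

    Expects the raw log entry, with any email quote markers removed.
    """
    # Collapse wrapped lines and padding into single spaces.
    text = " ".join(log_entry.split())
    waiter = WAITER.search(text)
    if waiter is None:
        return None
    owners = {m["region"]: int(m["owner"]) for m in REGION_LOCK.finditer(text)}
    return int(waiter["pid"]), owners
```

Applied to the entry above, this reports pid=43 waiting on region 1588230740, whose exclusive lock is owned by pid=41.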



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
