[ 
https://issues.apache.org/jira/browse/HBASE-21625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16726213#comment-16726213
 ] 

Sergey Shelukhin commented on HBASE-21625:
------------------------------------------

Looked a little bit at scheduling... that is much more convoluted than I 
expected. Seems like it's not a startup race as the procedure did make it into 
the array that is used to display the UI. So, it's probably some race in the 
table scheduler.
Given that this is meta replica, and there was no other meta related activity, 
I don't expect races with table queue... either there's some glaring bug there 
or the issue is with adding the table queue to the main queue? Since this 
procedure doesn't hold a lock - the parent does - I'm not sure who even ever 
adds to the main queue, cause it doesn't look like doAdd does; however this 
path normally works so I'm assuming there isn't a glaring bug there, just a 
general race condition.
I won't be looking into this anymore for now.


> a runnable procedure v2 does not run
> ------------------------------------
>
>                 Key: HBASE-21625
>                 URL: https://issues.apache.org/jira/browse/HBASE-21625
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Priority: Major
>
> This is on master snapshot as of a few weeks ago.
> Haven't looked at the code much yet, but it seems rather fundamental. The 
> procedure comes from meta replica assignment (HBASE-21624), in case it 
> matters w.r.t. the engine initialization; however, the master is functional 
> and other procedures run fine. I can also see lots of other open region 
> procedures with a similar patterns that were initialized before this one and 
> have run fine.
> Currently, there are no other runnable procedures on master - a lot of 
> succeeded procedures since then, the parent blocked on this procedure, and 
> one unrelated RIT procedure waiting with timeout and being updated 
> periodically.
> The procedure itself is 
> {noformat}
> 157156        157155  RUNNABLE        hadoop  
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure   Wed Dec 19 
> 17:20:27 PST 2018    Wed Dec 19 17:20:28 PST 2018            [ { region => { 
> regionId => '1', tableName => { ... }, startKey => '', endKey => '', offline 
> => 'false', split => 'false', replicaId => '1' }, targetServer => { hostName 
> => 'server1', port => '17020', startCode => '1545266805778' } }, {} ]
> {noformat}
> This is in PST so it's been like that for ~19 hours.
> The only line involving this PID in the log is {noformat}
> 2018-12-19 17:20:27,974 INFO  [PEWorker-4] procedure2.ProcedureExecutor: 
> Initialized subprocedures=[{pid=157156, ppid=157155, state=RUNNABLE, 
> hasLock=false; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure}]
> {noformat}
> There are no other useful logs for either this PID, parent PID, or region in 
> question since. This PEWorker (4) is also alive and did some work since then, 
> so it's not like the thread errored out somewhere.
> All the PEWorker-s are waiting for work:
> {noformat}
> Thread 158 (PEWorker-16):
>   State: TIMED_WAITING
>   Blocked count: 1340
>   Waited count: 5064
>   Stack:
>     sun.misc.Unsafe.park(Native Method)
>     java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>     
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
>     
> org.apache.hadoop.hbase.procedure2.AbstractProcedureScheduler.poll(AbstractProcedureScheduler.java:171)
>     
> org.apache.hadoop.hbase.procedure2.AbstractProcedureScheduler.poll(AbstractProcedureScheduler.java:153)
>     
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1949)
> {noformat}
> The main assignment procedure for this region is blocked on it:
> {noformat}
> 157155                WAITING         hadoop  TransitRegionStateProcedure 
> table=hbase:meta, region=534574363, ASSIGN  Wed Dec 19 17:20:27 PST 2018    
> Wed Dec 19 17:20:27 PST 2018            [ { state => [ '1', '2', '3' ] }, { 
> regionId => '1', tableName => { ... }, startKey => '', endKey => '', offline 
> => 'false', split => 'false', replicaId => '1' }, { initialState => 
> 'REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE', lastState => 
> 'REGION_STATE_TRANSITION_CONFIRM_OPENED', assignCandidate => { hostName => 
> 'server1', port => '17020', startCode => '1545266805778' }, forceNewPlan => 
> 'false' } ]
> 2018-12-19 17:20:27,673 INFO  [PEWorker-9] 
> procedure.MasterProcedureScheduler: Took xlock for pid=157155, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; 
> TransitRegionStateProcedure table=hbase:meta, region=..., ASSIGN
> 2018-12-19 17:20:27,809 INFO  [PEWorker-9] 
> assignment.TransitRegionStateProcedure: Starting pid=157155, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=hbase:meta, region=..., ASSIGN; 
> rit=OFFLINE, location=server1,17020,1545266805778; forceNewPlan=false, 
> retain=false
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to