[ 
https://issues.apache.org/jira/browse/HBASE-23593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999801#comment-16999801
 ] 

Michael Stack commented on HBASE-23593:
---------------------------------------

[~zhangduo] I don't see those 'Received procedure pid...' messages on the RS 
side. Dang. (Thanks for the help.)

> Stalled SCP Assigns
> -------------------
>
>                 Key: HBASE-23593
>                 URL: https://issues.apache.org/jira/browse/HBASE-23593
>             Project: HBase
>          Issue Type: Bug
>          Components: proc-v2
>    Affects Versions: 2.2.3
>            Reporter: Michael Stack
>            Priority: Major
>
> I'm stuck on this one, so I'm doing a write-up here in case anyone else has 
> ideas.
> Heavily loaded cluster. Server crashes. SCP cuts in, and usually there is no 
> problem, but from time to time I'll see the SCP stuck waiting on an Assign to 
> finish. The assign seems stuck at the queuing of the OpenRegionProcedure. 
> We've stored the procedure, but there's not a peep from it thereafter. Later 
> we'll see a complaint that the region is STUCK. It doesn't recover. It 
> doesn't run.
> Basic story is as follows:
> Server dies:
> {code}
>  2019-12-17 11:10:42,002 INFO 
> org.apache.hadoop.hbase.master.RegionServerTracker: RegionServer ephemeral 
> node deleted, processing expiration [s011.example.org,16020,1576561318119]
>  2019-12-17 11:10:42,002 DEBUG org.apache.hadoop.hbase.master.DeadServer: 
> Added s011.example.org,16020,1576561318119; numProcessing=1
> ...
>  2019-12-17 11:10:42,110 DEBUG org.apache.hadoop.hbase.master.DeadServer: 
> Started processing s011.example.org,16020,1576561318119; numProcessing=1
> {code}
> The dead server restarts, which purges the old server instance from the dead 
> server and processing lists:
> {code}
>  2019-12-17 11:10:58,145 DEBUG org.apache.hadoop.hbase.master.DeadServer: 
> Removed s011.example.org,16020,1576561318119, processing=true, numProcessing=0
>  2019-12-17 11:10:58,145 DEBUG org.apache.hadoop.hbase.master.ServerManager: 
> STARTUP: Server s011.example.org,16020,1576581054424 came back up, removed it 
> from the dead servers list
> {code}
>  
> ....even though we are still processing logs in the SCP of the old server...
> {code}
>  2019-12-17 11:10:58,392 INFO org.apache.hadoop.hbase.wal.WALSplitUtil: 
> Archived processed log 
> hdfs://nameservice1/hbase/WALs/s011.example.org,16020,1576561318119-splitting/s011.example.org%2C16020%2C1576561318119.s011.example.org%2C16020%2C1576561318119.regiongroup-0.1576580737491
>  to hdfs://nameservice1/hbase/oldWALs/s011.example.org%2C16020%2C1576561318119.s011.example.org%2C16020%2C1576561318119.regiongroup-0.1576580737491
> {code}
> I thought the early purge of the dead server was the problem, but after 
> study I don't think so.
> The WAL split took over two minutes (143236ms), so the server had been 
> removed from the dead servers list about two minutes before the split 
> finished:
> {code}
>  2019-12-17 11:13:05,356 INFO org.apache.hadoop.hbase.master.SplitLogManager: 
> Finished splitting (more than or equal to) 30.6G (32908464448 bytes) in 228 
> log files in 
> [hdfs://nameservice1/hbase/WALs/s011.example.org,16020,1576561318119-splitting]
>  in 143236ms
> {code}
>  Almost immediately we get this:
> {code}
>  2019-12-17 11:14:08,649 WARN 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK 
> Region-In-Transition state=OPEN, 
> location=s011.example.org,16020,1576561318119, table=t1, 
> region=9d6d6d5f261a0cbe7c9e85091f2c2bd4
> {code}
> For this region, I see the SCP proc creating an assign, which then spawns an 
> OpenRegionProcedure (ORP) as a subprocedure. This is where it gets stuck. 
> There is no progress after this. The procedure never comes alive to run.
> Here are the logs for the ORP pid=421761:
> {code}
> 2019-12-17 11:38:34,761 INFO 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Initialized 
> subprocedures=[{pid=421761, ppid=402475, state=RUNNABLE; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure}]
> 2019-12-17 11:38:34,765 DEBUG 
> org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Add 
> TableQueue(t1, xlock=false sharedLock=3144 size=427) to run queue because: 
> the exclusive lock is not held by anyone when adding pid=421761, ppid=402475, 
> state=RUNNABLE; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure
> 2019-12-17 11:38:34,770 DEBUG 
> org.apache.hadoop.hbase.procedure2.RootProcedureState: Add procedure 
> pid=421761, ppid=402475, state=RUNNABLE, locked=true; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure as the 3193th 
> rollback step
> {code}
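
To make the timeline in the quoted logs easier to check, here is a minimal Python sketch that works out the intervals between the events. The timestamps and the 143236ms split duration are copied from the log lines above; the variable names and the helper `ts` are made up for illustration and are not part of HBase.

```python
from datetime import datetime, timedelta

# Timestamp layout used by the quoted master logs (comma before milliseconds).
FMT = "%Y-%m-%d %H:%M:%S,%f"


def ts(text):
    """Parse a log timestamp such as '2019-12-17 11:10:58,145'."""
    return datetime.strptime(text, FMT)


# Values copied from the quoted log lines.
removed_from_dead = ts("2019-12-17 11:10:58,145")  # DeadServer: Removed ...
split_finished = ts("2019-12-17 11:13:05,356")     # SplitLogManager: Finished splitting
stuck_warning = ts("2019-12-17 11:14:08,649")      # AssignmentManager: STUCK
split_duration = timedelta(milliseconds=143236)    # "in 143236ms"

# Derived intervals.
split_started = split_finished - split_duration
removal_to_split_end = split_finished - removed_from_dead
split_end_to_stuck = stuck_warning - split_finished

print("split started (derived):   ", split_started)
print("dead-list removal -> split end:", removal_to_split_end.total_seconds(), "s")
print("split end -> STUCK warning:    ", split_end_to_stuck.total_seconds(), "s")
```

By these timestamps the old server instance left the dead servers list roughly two minutes before the WAL split finished, and the STUCK warning followed about a minute after the split completed.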



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
