[ https://issues.apache.org/jira/browse/HBASE-23593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999801#comment-16999801 ]
Michael Stack commented on HBASE-23593: --------------------------------------- [~zhangduo] I don't see those 'Received procedure pid.. ' on the RS side. Dang (Thanks for help). > Stalled SCP Assigns > ------------------- > > Key: HBASE-23593 > URL: https://issues.apache.org/jira/browse/HBASE-23593 > Project: HBase > Issue Type: Bug > Components: proc-v2 > Affects Versions: 2.2.3 > Reporter: Michael Stack > Priority: Major > > I'm stuck on this one so doing a write up here in case anyone else has ideas. > Heavily loaded cluster. Server crashes. SCP cuts in and usually no problem > but from time to time I'll see the SCP stuck waiting on an Assign to finish. > The assign seems stuck at the queuing of the OpenRegionProcedure. We've > stored the procedure but then not a peek thereafter. Later we'll see > complaint that the region is STUCK. Doesn't recover. Doesn't run. > Basic story is as follows: > Server dies: > {code} > 2019-12-17 11:10:42,002 INFO > org.apache.hadoop.hbase.master.RegionServerTracker: RegionServer ephemeral > node deleted, processing expiration [s011.example.org,16020,1576561318119] > 2019-12-17 11:10:42,002 DEBUG org.apache.hadoop.hbase.master.DeadServer: > Added s011.example.org,16020,1576561318119; numProcessing=1 > ... > 2019-12-17 11:10:42,110 DEBUG org.apache.hadoop.hbase.master.DeadServer: > Started processing s011.example.org,16020,1576561318119; numProcessing=1 > {code} > The dead server restarts which purges the old server from dead server and > processing lists: > {code} > 2019-12-17 11:10:58,145 DEBUG org.apache.hadoop.hbase.master.DeadServer: > Removed s011.example.org,16020,1576561318119, processing=true, numProcessing=0 > 2019-12-17 11:10:58,145 DEBUG org.apache.hadoop.hbase.master.ServerManager: > STARTUP: Server s011.example.org,16020,1576581054424 came back up, removed it > from the dead servers list > {code} > > ....even though we are still processing logs in the SCP of the old server... > {code} > 2019-12-17 11:10:58,392 INFO org.apache.hadoop.hbase.wal.WALSplitUtil: > Archived processed log > hdfs://nameservice1/hbase/WALs/s011.example.org,16020,1576561318119-splitting/s011.example.org%2C16020%2C1576561318119.s011.example.org%2C16020%2C1576561318119.regiongroup-0.1576580737491 > to hdfs://nameservice1/hbase/oldWALs/s011.example. > > org%2C16020%2C1576561318119.s011.example.org%2C16020%2C1576561318119.regiongroup-0.1576580737491 > {code} > I thought early purge of deadserver was a problem but I don't think so after > study. > WALS split took two minutes to split and server was removed from dead > servers... three minutes earlier... > {code} > 2019-12-17 11:13:05,356 INFO org.apache.hadoop.hbase.master.SplitLogManager: > Finished splitting (more than or equal to) 30.6G (32908464448 bytes) in 228 > log files in > [hdfs://nameservice1/hbase/WALs/s011.example.org,16020,1576561318119-splitting] > in 143236ms > {code} > Almost immediately we get this: > {code} > 2019-12-17 11:14:08,649 WARN > org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK > Region-In-Transition state=OPEN, > location=s011.example.org,16020,1576561318119, table=t1, > region=9d6d6d5f261a0cbe7c9e85091f2c2bd4 > {code} > For this region assign, I see the SCP proc making an assign for this region > which then makes a subtask to OpenRegionProcedure. This is where it gets > stuck. No progress after this. The procedure does not come alive to run. > Here are logs for the ORP pid=421761: > {code} > 2019-12-17 11:38:34,761 INFO > org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Initialized > subprocedures=[{pid=421761, ppid=402475, state=RUNNABLE; > org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure}] > 2019-12-17 11:38:34,765 DEBUG > org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Add > TableQueue(t1, xlock=false sharedLock=3144 size=427) to run queue because: > the exclusive lock is not held by anyone when adding pid=421761, ppid=402475, > state=RUNNABLE; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure > 2019-12-17 11:38:34,770 DEBUG > org.apache.hadoop.hbase.procedure2.RootProcedureState: Add procedure > pid=421761, ppid=402475, state=RUNNABLE, locked=true; > org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure as the 3193th > rollback step > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)