Michael Stack created HBASE-23593:
-------------------------------------

             Summary: Stalled SCP Assigns
                 Key: HBASE-23593
                 URL: https://issues.apache.org/jira/browse/HBASE-23593
             Project: HBase
          Issue Type: Bug
          Components: proc-v2
    Affects Versions: 2.2.3
            Reporter: Michael Stack


I'm stuck on this one so doing a write up here in case anyone else has ideas.

Heavily loaded cluster. Server crashes. SCP cuts in and usually no problem but 
from time to time I'll see the SCP stuck waiting on an Assign to finish. The 
assign seems stuck at the queuing of the OpenRegionProcedure. We've stored the 
procedure but then not a peek thereafter. Later we'll see complaint that the 
region is STUCK. Doesn't recover. Doesn't run.

Basic story is as follows:

Server dies:
{code}
 2019-12-17 11:10:42,002 INFO 
org.apache.hadoop.hbase.master.RegionServerTracker: RegionServer ephemeral node 
deleted, processing expiration [s011.example.org,16020,1576561318119]
 2019-12-17 11:10:42,002 DEBUG org.apache.hadoop.hbase.master.DeadServer: Added 
s011.example.org,16020,1576561318119; numProcessing=1
...
 2019-12-17 11:10:42,110 DEBUG org.apache.hadoop.hbase.master.DeadServer: 
Started processing s011.example.org,16020,1576561318119; numProcessing=1
{code}

The dead server restarts which purges the old server from dead server and 
processing lists:

{code}
 2019-12-17 11:10:58,145 DEBUG org.apache.hadoop.hbase.master.DeadServer: 
Removed s011.example.org,16020,1576561318119, processing=true, numProcessing=0
 2019-12-17 11:10:58,145 DEBUG org.apache.hadoop.hbase.master.ServerManager: 
STARTUP: Server s011.example.org,16020,1576581054424 came back up, removed it 
from the dead servers list
{code}
 

....even though we are still processing logs in the SCP of the old server...

{code}
 2019-12-17 11:10:58,392 INFO org.apache.hadoop.hbase.wal.WALSplitUtil: 
Archived processed log 
hdfs://nameservice1/hbase/WALs/s011.example.org,16020,1576561318119-splitting/s011.example.org%2C16020%2C1576561318119.s011.example.org%2C16020%2C1576561318119.regiongroup-0.1576580737491
 to hdfs://nameservice1/hbase/oldWALs/s011.example.                            
org%2C16020%2C1576561318119.s011.example.org%2C16020%2C1576561318119.regiongroup-0.1576580737491
{code}

I thought early purge of deadserver was a problem but I don't think so after 
study.

WALS split took two minutes to split and server was removed from dead 
servers...  three minutes earlier...
{code}
 2019-12-17 11:13:05,356 INFO org.apache.hadoop.hbase.master.SplitLogManager: 
Finished splitting (more than or equal to) 30.6G (32908464448 bytes) in 228 log 
files in 
[hdfs://nameservice1/hbase/WALs/s011.example.org,16020,1576561318119-splitting] 
in 143236ms
{code}

 Almost immediately we get this:

{code}
 2019-12-17 11:14:08,649 WARN 
org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK 
Region-In-Transition state=OPEN, location=s011.example.org,16020,1576561318119, 
table=t1, region=9d6d6d5f261a0cbe7c9e85091f2c2bd4
{code}

For this region assign, I see the SCP proc making an assign for this region 
which then makes a subtask to OpenRegionProcedure. This is where it gets stuck. 
No progress after this. The procedure does not come alive to run.

Here are logs for the ORP pid=421761:

{code}
2019-12-17 11:38:34,761 INFO 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Initialized 
subprocedures=[{pid=421761, ppid=402475, state=RUNNABLE; 
org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure}]

2019-12-17 11:38:34,765 DEBUG 
org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Add 
TableQueue(t1, xlock=false sharedLock=3144 size=427) to run queue because: the 
exclusive lock is not held by anyone when adding pid=421761, ppid=402475, 
state=RUNNABLE; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure
2019-12-17 11:38:34,770 DEBUG 
org.apache.hadoop.hbase.procedure2.RootProcedureState: Add procedure 
pid=421761, ppid=402475, state=RUNNABLE, locked=true; 
org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure as the 3193th 
rollback step
{code}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to