Andrey Elenskiy created HBASE-24438:
---------------------------------------

             Summary: Stale ServerCrashProcedure task in HBase Master UI
                 Key: HBASE-24438
                 URL: https://issues.apache.org/jira/browse/HBASE-24438
             Project: HBase
          Issue Type: Bug
          Components: master
    Affects Versions: 2.2.4
         Environment: HBase 2.2.4
HDFS 3.1.3 with erasure coding enabled
Kubernetes
            Reporter: Andrey Elenskiy


Tasks section (show non-RPC Tasks) in HBase Master UI has stale entries with 
ServerCrashProcedure after master failover. The procedures have finished with 
SUCCESS on a previously active HBase master and aren't showing in "Procedures & 
Locks".

Based on the logs, both of those regionserver were carrying hbase:meta (logs 
are sorted newest first grepped for those specific servers that have stale 
ServerCrashProcedures):
{noformat}
2020-05-21 19:04:09,176 INFO [KeepAlivePEWorker-28] 
procedure2.ProcedureExecutor: Finished pid=38, state=SUCCESS; 
ServerCrashProcedure 
server=regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347, 
splitWal=true, meta=true in 2.5290sec
 2020-05-21 19:04:08,962 INFO [KeepAlivePEWorker-28] 
procedure.ServerCrashProcedure: removed crashed server 
regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347 after 
splitting done
 2020-05-21 19:04:08,747 INFO [KeepAlivePEWorker-28] master.SplitLogManager: 
dead splitlog workers 
[regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347]
 2020-05-21 19:04:08,746 INFO [KeepAlivePEWorker-28] master.MasterWalManager: 
Log dir for server 
regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347 does not exist
 2020-05-21 19:04:08,636 INFO [KeepAlivePEWorker-28] 
procedure.ServerCrashProcedure: 
regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347 had 0 regions
 2020-05-21 19:04:08,529 INFO [KeepAlivePEWorker-28] 
procedure.ServerCrashProcedure: pid=38, 
state=RUNNABLE:SERVER_CRASH_ASSIGN_META, locked=true; ServerCrashProcedure 
server=regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347, 
splitWal=true, meta=true found RIT pid=20, ppid=18, 
state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN; 
rit=ABNORMALLY_CLOSED, 
location=regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347, 
table=hbase:meta, region=1588230740
 2020-05-21 19:04:08,422 INFO [KeepAlivePEWorker-28] master.SplitLogManager: 
Finished splitting (more than or equal to) 0 (0 bytes) in 0 log files in 
[hdfs://aeris/hbase/WALs/regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347-splitting]
 in 0ms
 2020-05-21 19:04:08,416 INFO [KeepAlivePEWorker-28] master.SplitLogManager: 
hdfs://aeris/hbase/WALs/regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347-splitting
 dir is empty, no logs to split.
 2020-05-21 19:04:08,414 INFO [KeepAlivePEWorker-28] master.SplitLogManager: 
dead splitlog workers 
[regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347]
 2020-05-21 19:04:08,300 INFO [KeepAlivePEWorker-28] 
procedure.ServerCrashProcedure: Start pid=38, 
state=RUNNABLE:SERVER_CRASH_START, locked=true; ServerCrashProcedure 
server=regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347, 
splitWal=true, meta=true
 2020-05-21 19:04:06,544 INFO [RegionServerTracker-0] 
assignment.AssignmentManager: Scheduled SCP pid=38 for 
regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347 
(carryingMeta=true) 
regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347/CRASHED/regionCount=0/lock=java.util.concurrent.locks.ReentrantReadWriteLock@14e57294[Write
 locks = 1, Read locks = 0], oldState=ONLINE.
 2020-05-21 19:04:06,434 INFO [RegionServerTracker-0] master.ServerManager: 
Processing expiration of 
regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347 on 
hbasemaster-0.hbase.hbase.svc.cluster.local,16000,1590087665366
 2020-05-21 19:04:06,434 INFO [RegionServerTracker-0] 
master.RegionServerTracker: RegionServer ephemeral node deleted, processing 
expiration [regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347]
...
2020-05-21 19:04:04,711 INFO [KeepAlivePEWorker-27] 
procedure2.ProcedureExecutor: Finished pid=37, state=SUCCESS; 
ServerCrashProcedure 
server=regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010, 
splitWal=true, meta=true in 997msec
 2020-05-21 19:04:04,497 INFO [KeepAlivePEWorker-27] 
procedure.ServerCrashProcedure: removed crashed server 
regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010 after 
splitting done
 2020-05-21 19:04:04,284 INFO [KeepAlivePEWorker-27] master.SplitLogManager: 
dead splitlog workers 
[regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010]
 2020-05-21 19:04:04,284 INFO [KeepAlivePEWorker-27] master.MasterWalManager: 
Log dir for server 
regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010 does not exist
 2020-05-21 19:04:04,175 INFO [KeepAlivePEWorker-27] 
procedure.ServerCrashProcedure: 
regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010 had 1 regions
 2020-05-21 19:04:03,958 INFO [KeepAlivePEWorker-27] 
procedure.ServerCrashProcedure: pid=37, 
state=RUNNABLE:SERVER_CRASH_ASSIGN_META, locked=true; ServerCrashProcedure 
server=regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010, 
splitWal=true, meta=true found RIT pid=20, ppid=18, 
state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true; 
TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN; 
rit=OPENING, 
location=regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010, 
table=hbase:meta, region=1588230740
 2020-05-21 19:04:03,852 INFO [KeepAlivePEWorker-27] master.SplitLogManager: 
Finished splitting (more than or equal to) 0 (0 bytes) in 0 log files in 
[hdfs://aeris/hbase/WALs/regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010-splitting]
 in 0ms
 2020-05-21 19:04:03,845 INFO [KeepAlivePEWorker-27] master.SplitLogManager: 
hdfs://aeris/hbase/WALs/regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010-splitting
 dir is empty, no logs to split.
 2020-05-21 19:04:03,843 INFO [KeepAlivePEWorker-27] master.SplitLogManager: 
dead splitlog workers 
[regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010]
 2020-05-21 19:04:03,725 INFO [KeepAlivePEWorker-27] 
procedure.ServerCrashProcedure: Start pid=37, 
state=RUNNABLE:SERVER_CRASH_START, locked=true; ServerCrashProcedure 
server=regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010, 
splitWal=true, meta=true
 2020-05-21 19:04:03,512 INFO [RegionServerTracker-0] 
assignment.AssignmentManager: Scheduled SCP pid=37 for 
regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010 
(carryingMeta=true) 
regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010/CRASHED/regionCount=1/lock=java.util.concurrent.locks.ReentrantReadWriteLock@5bcc7453[Write
 locks = 1, Read locks = 0], oldState=ONLINE.
 2020-05-21 19:04:03,500 INFO [RegionServerTracker-0] master.ServerManager: 
Processing expiration of 
regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010 on 
hbasemaster-0.hbase.hbase.svc.cluster.local,16000,1590087665366
 2020-05-21 19:04:03,500 INFO [RegionServerTracker-0] 
master.RegionServerTracker: RegionServer ephemeral node deleted, processing 
expiration [regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010]
{noformat}
 

The currently active HBase Master (that has the stale tasks) logged only about 
one regionserver that is shown in stale tasks (logs sorted oldest first):

 
{noformat}
2020-05-21 19:04:26,467 INFO [master/hbasemaster-1:16000:becomeActiveMaster] 
assignment.AssignmentManager: Attach pid=20, ppid=18, 
state=RUNNABLE:REGION_STATE_TRANSITION_OPEN; TransitRegionStateProcedure 
table=hbase:meta, region=1588230740, ASSIGN to rit=OPENING, 
location=regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010, 
table=hbase:meta, region=1588230740 to restore RIT
2020-05-21 19:04:26,758 INFO [master/hbasemaster-1:16000:becomeActiveMaster] 
master.HMaster: hbase:meta {1588230740 state=OPENING, ts=1590087849283, 
server=regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010}
2020-05-21 19:04:26,784 INFO [PEWorker-1] zookeeper.MetaTableLocator: Setting 
hbase:meta (replicaId=0) location in ZooKeeper as 
regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010
2020-05-21 19:04:27,022 WARN [PEWorker-1] assignment.RegionRemoteProcedureBase: 
Can not add remote operation pid=39, ppid=20, state=RUNNABLE, locked=true; 
org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region 
{ENCODED => 1588230740, NAME => 'hbase:meta,,1', STARTKEY => '', ENDKEY => ''} 
to server regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010, 
this usually because the server is alread dead, give up and mark the procedure 
as complete, the parent procedure will take care of this.
org.apache.hadoop.hbase.procedure2.NoServerDispatchException: 
regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010; pid=39, 
ppid=20, state=RUNNABLE, locked=true; 
org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure
2020-05-21 19:04:27,048 INFO [PEWorker-2] 
assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; pid=20, 
ppid=18, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true; 
TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN; 
rit=OPENING, 
location=regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010
{noformat}
 
It hasn't logged anything about procedures 37 and 38 (the successfully 
completed ServerCrashProcedures).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to