Andrey Elenskiy created HBASE-24438:
---------------------------------------
Summary: Stale ServerCrashProcedure task in HBase Master UI
Key: HBASE-24438
URL: https://issues.apache.org/jira/browse/HBASE-24438
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 2.2.4
Environment: HBase 2.2.4
HDFS 3.1.3 with erasure coding enabled
Kubernetes
Reporter: Andrey Elenskiy
Tasks section (show non-RPC Tasks) in HBase Master UI has stale entries with
ServerCrashProcedure after master failover. The procedures have finished with
SUCCESS on a previously active HBase master and aren't showing in "Procedures &
Locks".
Based on the logs, both of those regionserver were carrying hbase:meta (logs
are sorted newest first grepped for those specific servers that have stale
ServerCrashProcedures):
{noformat}
2020-05-21 19:04:09,176 INFO [KeepAlivePEWorker-28]
procedure2.ProcedureExecutor: Finished pid=38, state=SUCCESS;
ServerCrashProcedure
server=regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347,
splitWal=true, meta=true in 2.5290sec
2020-05-21 19:04:08,962 INFO [KeepAlivePEWorker-28]
procedure.ServerCrashProcedure: removed crashed server
regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347 after
splitting done
2020-05-21 19:04:08,747 INFO [KeepAlivePEWorker-28] master.SplitLogManager:
dead splitlog workers
[regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347]
2020-05-21 19:04:08,746 INFO [KeepAlivePEWorker-28] master.MasterWalManager:
Log dir for server
regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347 does not exist
2020-05-21 19:04:08,636 INFO [KeepAlivePEWorker-28]
procedure.ServerCrashProcedure:
regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347 had 0 regions
2020-05-21 19:04:08,529 INFO [KeepAlivePEWorker-28]
procedure.ServerCrashProcedure: pid=38,
state=RUNNABLE:SERVER_CRASH_ASSIGN_META, locked=true; ServerCrashProcedure
server=regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347,
splitWal=true, meta=true found RIT pid=20, ppid=18,
state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true;
TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN;
rit=ABNORMALLY_CLOSED,
location=regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347,
table=hbase:meta, region=1588230740
2020-05-21 19:04:08,422 INFO [KeepAlivePEWorker-28] master.SplitLogManager:
Finished splitting (more than or equal to) 0 (0 bytes) in 0 log files in
[hdfs://aeris/hbase/WALs/regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347-splitting]
in 0ms
2020-05-21 19:04:08,416 INFO [KeepAlivePEWorker-28] master.SplitLogManager:
hdfs://aeris/hbase/WALs/regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347-splitting
dir is empty, no logs to split.
2020-05-21 19:04:08,414 INFO [KeepAlivePEWorker-28] master.SplitLogManager:
dead splitlog workers
[regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347]
2020-05-21 19:04:08,300 INFO [KeepAlivePEWorker-28]
procedure.ServerCrashProcedure: Start pid=38,
state=RUNNABLE:SERVER_CRASH_START, locked=true; ServerCrashProcedure
server=regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347,
splitWal=true, meta=true
2020-05-21 19:04:06,544 INFO [RegionServerTracker-0]
assignment.AssignmentManager: Scheduled SCP pid=38 for
regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347
(carryingMeta=true)
regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347/CRASHED/regionCount=0/lock=java.util.concurrent.locks.ReentrantReadWriteLock@14e57294[Write
locks = 1, Read locks = 0], oldState=ONLINE.
2020-05-21 19:04:06,434 INFO [RegionServerTracker-0] master.ServerManager:
Processing expiration of
regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347 on
hbasemaster-0.hbase.hbase.svc.cluster.local,16000,1590087665366
2020-05-21 19:04:06,434 INFO [RegionServerTracker-0]
master.RegionServerTracker: RegionServer ephemeral node deleted, processing
expiration [regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347]
...
2020-05-21 19:04:04,711 INFO [KeepAlivePEWorker-27]
procedure2.ProcedureExecutor: Finished pid=37, state=SUCCESS;
ServerCrashProcedure
server=regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010,
splitWal=true, meta=true in 997msec
2020-05-21 19:04:04,497 INFO [KeepAlivePEWorker-27]
procedure.ServerCrashProcedure: removed crashed server
regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010 after
splitting done
2020-05-21 19:04:04,284 INFO [KeepAlivePEWorker-27] master.SplitLogManager:
dead splitlog workers
[regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010]
2020-05-21 19:04:04,284 INFO [KeepAlivePEWorker-27] master.MasterWalManager:
Log dir for server
regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010 does not exist
2020-05-21 19:04:04,175 INFO [KeepAlivePEWorker-27]
procedure.ServerCrashProcedure:
regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010 had 1 regions
2020-05-21 19:04:03,958 INFO [KeepAlivePEWorker-27]
procedure.ServerCrashProcedure: pid=37,
state=RUNNABLE:SERVER_CRASH_ASSIGN_META, locked=true; ServerCrashProcedure
server=regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010,
splitWal=true, meta=true found RIT pid=20, ppid=18,
state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true;
TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN;
rit=OPENING,
location=regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010,
table=hbase:meta, region=1588230740
2020-05-21 19:04:03,852 INFO [KeepAlivePEWorker-27] master.SplitLogManager:
Finished splitting (more than or equal to) 0 (0 bytes) in 0 log files in
[hdfs://aeris/hbase/WALs/regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010-splitting]
in 0ms
2020-05-21 19:04:03,845 INFO [KeepAlivePEWorker-27] master.SplitLogManager:
hdfs://aeris/hbase/WALs/regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010-splitting
dir is empty, no logs to split.
2020-05-21 19:04:03,843 INFO [KeepAlivePEWorker-27] master.SplitLogManager:
dead splitlog workers
[regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010]
2020-05-21 19:04:03,725 INFO [KeepAlivePEWorker-27]
procedure.ServerCrashProcedure: Start pid=37,
state=RUNNABLE:SERVER_CRASH_START, locked=true; ServerCrashProcedure
server=regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010,
splitWal=true, meta=true
2020-05-21 19:04:03,512 INFO [RegionServerTracker-0]
assignment.AssignmentManager: Scheduled SCP pid=37 for
regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010
(carryingMeta=true)
regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010/CRASHED/regionCount=1/lock=java.util.concurrent.locks.ReentrantReadWriteLock@5bcc7453[Write
locks = 1, Read locks = 0], oldState=ONLINE.
2020-05-21 19:04:03,500 INFO [RegionServerTracker-0] master.ServerManager:
Processing expiration of
regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010 on
hbasemaster-0.hbase.hbase.svc.cluster.local,16000,1590087665366
2020-05-21 19:04:03,500 INFO [RegionServerTracker-0]
master.RegionServerTracker: RegionServer ephemeral node deleted, processing
expiration [regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010]
{noformat}
The currently active HBase Master (that has the stale tasks) logged only about
one regionserver that is shown in stale tasks (logs sorted oldest first):
{noformat}
2020-05-21 19:04:26,467 INFO [master/hbasemaster-1:16000:becomeActiveMaster]
assignment.AssignmentManager: Attach pid=20, ppid=18,
state=RUNNABLE:REGION_STATE_TRANSITION_OPEN; TransitRegionStateProcedure
table=hbase:meta, region=1588230740, ASSIGN to rit=OPENING,
location=regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010,
table=hbase:meta, region=1588230740 to restore RIT
2020-05-21 19:04:26,758 INFO [master/hbasemaster-1:16000:becomeActiveMaster]
master.HMaster: hbase:meta {1588230740 state=OPENING, ts=1590087849283,
server=regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010}
2020-05-21 19:04:26,784 INFO [PEWorker-1] zookeeper.MetaTableLocator: Setting
hbase:meta (replicaId=0) location in ZooKeeper as
regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010
2020-05-21 19:04:27,022 WARN [PEWorker-1] assignment.RegionRemoteProcedureBase:
Can not add remote operation pid=39, ppid=20, state=RUNNABLE, locked=true;
org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region
{ENCODED => 1588230740, NAME => 'hbase:meta,,1', STARTKEY => '', ENDKEY => ''}
to server regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010,
this usually because the server is alread dead, give up and mark the procedure
as complete, the parent procedure will take care of this.
org.apache.hadoop.hbase.procedure2.NoServerDispatchException:
regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010; pid=39,
ppid=20, state=RUNNABLE, locked=true;
org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure
2020-05-21 19:04:27,048 INFO [PEWorker-2]
assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; pid=20,
ppid=18, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true;
TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN;
rit=OPENING,
location=regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010
{noformat}
It hasn't logged anything about procedures 37 and 38 (the successfully
completed ServerCrashProcedures).
--
This message was sent by Atlassian Jira
(v8.3.4#803005)