[ 
https://issues.apache.org/jira/browse/HBASE-24438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17118895#comment-17118895
 ] 

Andrey Elenskiy commented on HBASE-24438:
-----------------------------------------

I guess the issue is "staleness". The procedures clearly finished during the 
reign of previously active master. However, the new master still shows the 
ServerCrashProcedures as running for multiple hours in "Task" -> "show non-RPC 
Tasks" UI. Let me know if I should rephrase the issue/provide more info to help 
you identify the root cause.

> Stale ServerCrashProcedure task in HBase Master UI
> --------------------------------------------------
>
>                 Key: HBASE-24438
>                 URL: https://issues.apache.org/jira/browse/HBASE-24438
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 2.2.4
>         Environment: HBase 2.2.4
> HDFS 3.1.3 with erasure coding enabled
> Kubernetes
>            Reporter: Andrey Elenskiy
>            Priority: Major
>
> Tasks section (show non-RPC Tasks) in HBase Master UI has stale entries with 
> ServerCrashProcedure after master failover. The procedures have finished with 
> SUCCESS on a previously active HBase master and aren't showing in "Procedures 
> & Locks".
> Based on the logs, both of those regionserver were carrying hbase:meta (logs 
> are sorted newest first grepped for those specific servers that have stale 
> ServerCrashProcedures):
> {noformat}
> 2020-05-21 19:04:09,176 INFO [KeepAlivePEWorker-28] 
> procedure2.ProcedureExecutor: Finished pid=38, state=SUCCESS; 
> ServerCrashProcedure 
> server=regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347, 
> splitWal=true, meta=true in 2.5290sec
>  2020-05-21 19:04:08,962 INFO [KeepAlivePEWorker-28] 
> procedure.ServerCrashProcedure: removed crashed server 
> regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347 after 
> splitting done
>  2020-05-21 19:04:08,747 INFO [KeepAlivePEWorker-28] master.SplitLogManager: 
> dead splitlog workers 
> [regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347]
>  2020-05-21 19:04:08,746 INFO [KeepAlivePEWorker-28] master.MasterWalManager: 
> Log dir for server 
> regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347 does not 
> exist
>  2020-05-21 19:04:08,636 INFO [KeepAlivePEWorker-28] 
> procedure.ServerCrashProcedure: 
> regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347 had 0 regions
>  2020-05-21 19:04:08,529 INFO [KeepAlivePEWorker-28] 
> procedure.ServerCrashProcedure: pid=38, 
> state=RUNNABLE:SERVER_CRASH_ASSIGN_META, locked=true; ServerCrashProcedure 
> server=regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347, 
> splitWal=true, meta=true found RIT pid=20, ppid=18, 
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
> TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN; 
> rit=ABNORMALLY_CLOSED, 
> location=regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347, 
> table=hbase:meta, region=1588230740
>  2020-05-21 19:04:08,422 INFO [KeepAlivePEWorker-28] master.SplitLogManager: 
> Finished splitting (more than or equal to) 0 (0 bytes) in 0 log files in 
> [hdfs://aeris/hbase/WALs/regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347-splitting]
>  in 0ms
>  2020-05-21 19:04:08,416 INFO [KeepAlivePEWorker-28] master.SplitLogManager: 
> hdfs://aeris/hbase/WALs/regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347-splitting
>  dir is empty, no logs to split.
>  2020-05-21 19:04:08,414 INFO [KeepAlivePEWorker-28] master.SplitLogManager: 
> dead splitlog workers 
> [regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347]
>  2020-05-21 19:04:08,300 INFO [KeepAlivePEWorker-28] 
> procedure.ServerCrashProcedure: Start pid=38, 
> state=RUNNABLE:SERVER_CRASH_START, locked=true; ServerCrashProcedure 
> server=regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347, 
> splitWal=true, meta=true
>  2020-05-21 19:04:06,544 INFO [RegionServerTracker-0] 
> assignment.AssignmentManager: Scheduled SCP pid=38 for 
> regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347 
> (carryingMeta=true) 
> regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347/CRASHED/regionCount=0/lock=java.util.concurrent.locks.ReentrantReadWriteLock@14e57294[Write
>  locks = 1, Read locks = 0], oldState=ONLINE.
>  2020-05-21 19:04:06,434 INFO [RegionServerTracker-0] master.ServerManager: 
> Processing expiration of 
> regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347 on 
> hbasemaster-0.hbase.hbase.svc.cluster.local,16000,1590087665366
>  2020-05-21 19:04:06,434 INFO [RegionServerTracker-0] 
> master.RegionServerTracker: RegionServer ephemeral node deleted, processing 
> expiration [regionserver-1.hbase.hbase.svc.cluster.local,16020,1590087787347]
> ...
> 2020-05-21 19:04:04,711 INFO [KeepAlivePEWorker-27] 
> procedure2.ProcedureExecutor: Finished pid=37, state=SUCCESS; 
> ServerCrashProcedure 
> server=regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010, 
> splitWal=true, meta=true in 997msec
>  2020-05-21 19:04:04,497 INFO [KeepAlivePEWorker-27] 
> procedure.ServerCrashProcedure: removed crashed server 
> regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010 after 
> splitting done
>  2020-05-21 19:04:04,284 INFO [KeepAlivePEWorker-27] master.SplitLogManager: 
> dead splitlog workers 
> [regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010]
>  2020-05-21 19:04:04,284 INFO [KeepAlivePEWorker-27] master.MasterWalManager: 
> Log dir for server 
> regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010 does not 
> exist
>  2020-05-21 19:04:04,175 INFO [KeepAlivePEWorker-27] 
> procedure.ServerCrashProcedure: 
> regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010 had 1 regions
>  2020-05-21 19:04:03,958 INFO [KeepAlivePEWorker-27] 
> procedure.ServerCrashProcedure: pid=37, 
> state=RUNNABLE:SERVER_CRASH_ASSIGN_META, locked=true; ServerCrashProcedure 
> server=regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010, 
> splitWal=true, meta=true found RIT pid=20, ppid=18, 
> state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true; 
> TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN; 
> rit=OPENING, 
> location=regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010, 
> table=hbase:meta, region=1588230740
>  2020-05-21 19:04:03,852 INFO [KeepAlivePEWorker-27] master.SplitLogManager: 
> Finished splitting (more than or equal to) 0 (0 bytes) in 0 log files in 
> [hdfs://aeris/hbase/WALs/regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010-splitting]
>  in 0ms
>  2020-05-21 19:04:03,845 INFO [KeepAlivePEWorker-27] master.SplitLogManager: 
> hdfs://aeris/hbase/WALs/regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010-splitting
>  dir is empty, no logs to split.
>  2020-05-21 19:04:03,843 INFO [KeepAlivePEWorker-27] master.SplitLogManager: 
> dead splitlog workers 
> [regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010]
>  2020-05-21 19:04:03,725 INFO [KeepAlivePEWorker-27] 
> procedure.ServerCrashProcedure: Start pid=37, 
> state=RUNNABLE:SERVER_CRASH_START, locked=true; ServerCrashProcedure 
> server=regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010, 
> splitWal=true, meta=true
>  2020-05-21 19:04:03,512 INFO [RegionServerTracker-0] 
> assignment.AssignmentManager: Scheduled SCP pid=37 for 
> regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010 
> (carryingMeta=true) 
> regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010/CRASHED/regionCount=1/lock=java.util.concurrent.locks.ReentrantReadWriteLock@5bcc7453[Write
>  locks = 1, Read locks = 0], oldState=ONLINE.
>  2020-05-21 19:04:03,500 INFO [RegionServerTracker-0] master.ServerManager: 
> Processing expiration of 
> regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010 on 
> hbasemaster-0.hbase.hbase.svc.cluster.local,16000,1590087665366
>  2020-05-21 19:04:03,500 INFO [RegionServerTracker-0] 
> master.RegionServerTracker: RegionServer ephemeral node deleted, processing 
> expiration [regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010]
> {noformat}
>  
> The currently active HBase Master (that has the stale tasks) logged only 
> about one regionserver that is shown in stale tasks (logs sorted oldest 
> first):
>  
> {noformat}
> 2020-05-21 19:04:26,467 INFO [master/hbasemaster-1:16000:becomeActiveMaster] 
> assignment.AssignmentManager: Attach pid=20, ppid=18, 
> state=RUNNABLE:REGION_STATE_TRANSITION_OPEN; TransitRegionStateProcedure 
> table=hbase:meta, region=1588230740, ASSIGN to rit=OPENING, 
> location=regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010, 
> table=hbase:meta, region=1588230740 to restore RIT
> 2020-05-21 19:04:26,758 INFO [master/hbasemaster-1:16000:becomeActiveMaster] 
> master.HMaster: hbase:meta {1588230740 state=OPENING, ts=1590087849283, 
> server=regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010}
> 2020-05-21 19:04:26,784 INFO [PEWorker-1] zookeeper.MetaTableLocator: Setting 
> hbase:meta (replicaId=0) location in ZooKeeper as 
> regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010
> 2020-05-21 19:04:27,022 WARN [PEWorker-1] 
> assignment.RegionRemoteProcedureBase: Can not add remote operation pid=39, 
> ppid=20, state=RUNNABLE, locked=true; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region 
> {ENCODED => 1588230740, NAME => 'hbase:meta,,1', STARTKEY => '', ENDKEY => 
> ''} to server 
> regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010, this 
> usually because the server is alread dead, give up and mark the procedure as 
> complete, the parent procedure will take care of this.
> org.apache.hadoop.hbase.procedure2.NoServerDispatchException: 
> regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010; pid=39, 
> ppid=20, state=RUNNABLE, locked=true; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure
> 2020-05-21 19:04:27,048 INFO [PEWorker-2] 
> assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; pid=20, 
> ppid=18, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true; 
> TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN; 
> rit=OPENING, 
> location=regionserver-0.hbase.hbase.svc.cluster.local,16020,1590087787010
> {noformat}
>  
> It hasn't logged anything about procedures 37 and 38 (the successfully 
> completed ServerCrashProcedures).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to