[ https://issues.apache.org/jira/browse/HBASE-23247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16967109#comment-16967109 ]
Hudson commented on HBASE-23247:
--------------------------------

Results for branch branch-2
	[build #2344 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2344/]: (x) *{color:red}-1 overall{color}*
----
details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2344//General_Nightly_Build_Report/]

(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2344//JDK8_Nightly_Build_Report_(Hadoop2)/]

(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2344//JDK8_Nightly_Build_Report_(Hadoop3)/]

(/) {color:green}+1 source release artifact{color}
-- See build output for details.

(/) {color:green}+1 client integration test{color}


> [hbck2] Schedule SCPs for 'Unknown Servers'
> -------------------------------------------
>
>                 Key: HBASE-23247
>                 URL: https://issues.apache.org/jira/browse/HBASE-23247
>             Project: HBase
>          Issue Type: Bug
>          Components: hbck2
>    Affects Versions: 2.2.2
>            Reporter: Michael Stack
>            Assignee: Michael Stack
>            Priority: Major
>             Fix For: 2.2.3
>
>
> I've run into an 'Unknown Server' phenomenon: meta has regions assigned to
> servers that the cluster no longer knows about. You can see the list at the
> end of the 'HBCK Report' page (run 'catalogjanitor_run' in the shell to
> generate a fresh report). The fix is tough if you try
> unassign/assign/close/etc., because a new assign/unassign insists on
> checking that the close succeeded by contacting the 'unknown server', and it
> will not move on until that succeeds; TODO. There are a few ways of getting
> into this state of affairs; I'll list some below.
> Meantime, an hbck2 'fix' seems just the ticket: run an SCP for the 'Unknown
> Server' and it should clear meta of all the bad server references. So just
> schedule an SCP using the scheduleRecoveries command... only in this case it
> fails before scheduling the SCP with the below; i.e. an FNFE because there
> is no WAL dir for the 'Unknown Server'.
> {code}
> 22:41:13.909 [main] INFO org.apache.hadoop.hbase.client.ConnectionImplementation - Closing master protocol: MasterService
> Exception in thread "main" java.io.IOException: org.apache.hbase.thirdparty.com.google.protobuf.ServiceException: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(java.io.FileNotFoundException): java.io.FileNotFoundException: File hdfs://nameservice1/hbase/genie/WALs/s1.d.com,16020,1571170081872 does not exist.
> 	at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:986)
> 	at org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:122)
> 	at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1046)
> 	at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1043)
> 	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> 	at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1053)
> 	at org.apache.hadoop.fs.FilterFileSystem.listStatus(FilterFileSystem.java:258)
> 	at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1802)
> 	at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1844)
> 	at org.apache.hadoop.hbase.master.MasterRpcServices.containMetaWals(MasterRpcServices.java:2709)
> 	at org.apache.hadoop.hbase.master.MasterRpcServices.scheduleServerCrashProcedure(MasterRpcServices.java:2488)
> 	at org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$HbckService$2.callBlockingMethod(MasterProtos.java)
> 	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
> 	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
> 	at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
> 	at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
> 	at org.apache.hadoop.hbase.client.HBaseHbck.scheduleServerCrashProcedures(HBaseHbck.java:175)
> 	at org.apache.hadoop.hbase.client.Hbck.scheduleServerCrashProcedure(Hbck.java:118)
> 	at org.apache.hbase.HBCK2.scheduleRecoveries(HBCK2.java:345)
> 	at org.apache.hbase.HBCK2.doCommandLine(HBCK2.java:746)
> 	at org.apache.hbase.HBCK2.run(HBCK2.java:631)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
> 	at org.apache.hbase.HBCK2.main(HBCK2.java:865)
> {code}
> A simple fix makes it so I can schedule an SCP, which indeed clears out the
> 'Unknown Server' and restores sanity to the cluster.
> As to how to get an 'Unknown Server':
> 1. The current scenario came about because an exception thrown while
> processing a server crash procedure made the SCP exit just after splitting
> logs but before it cleared old assigns.
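The FNFE in the trace above comes from MasterRpcServices.containMetaWals listing the dead server's WAL directory. As a rough illustration of what the 'simple fix' implies (treat a missing WAL dir as "no WALs to split" rather than an error), here is a minimal sketch using plain java.io.File instead of HDFS; the class and method names are hypothetical and this is not the actual patch:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class WalDirGuard {
    // Sketch: list WAL file names for a server's WAL directory, returning an
    // empty list when the directory no longer exists instead of throwing.
    // An 'Unknown Server' whose WAL dir is already gone then no longer blocks
    // scheduling a ServerCrashProcedure.
    public static List<String> listWals(File walDir) {
        if (!walDir.isDirectory()) {
            return Collections.emptyList(); // dir gone: nothing to split
        }
        List<String> names = new ArrayList<>();
        File[] files = walDir.listFiles();
        if (files != null) {
            for (File f : files) {
                names.add(f.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Hypothetical local path mirroring the report; it does not exist here.
        File gone = new File("/tmp/no-such-cluster/WALs/s1.d.com,16020,1571170081872");
        System.out.println("WALs found: " + listWals(gone).size());
    }
}
```

In HBase itself such a guard would sit server-side, next to containMetaWals, so an hbck2 scheduleRecoveries call can go ahead and schedule the SCP even when the 'Unknown Server' has no WAL dir left.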
> A new server instance that came up after this one went down purged the
> server from the dead servers list even though there were still Procedures
> in flight (the cluster was under a crippling overload).
> {code}
> 2019-11-02 21:02:34,775 DEBUG org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: Done splitting WALs pid=112532, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS, locked=true; ServerCrashProcedure server=s1.d.com,16020,1572668980355, splitWal=true, meta=false
> 2019-11-02 21:02:34,775 DEBUG org.apache.hadoop.hbase.procedure2.RootProcedureState: Add procedure pid=112532, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; ServerCrashProcedure server=s1.d.com,16020,1572668980355, splitWal=true, meta=false as the 2th rollback step
> 2019-11-02 21:02:34,779 INFO org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: pid=112532, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; ServerCrashProcedure server=s1.d.com,16020,1572668980355, splitWal=true, meta=false found RIT pid=101251, ppid=101123, state=SUCCESS, bypass=LOG-REDACTED TransitRegionStateProcedure table=GENIE2_modality_syncdata, region=fd2bd0f540756b8eba4c99301d2cf359, ASSIGN; rit=OPENING, location=s1.d.com,16020,1572668980355, table=GENIE2_modality_syncdata, region=fd2bd0f540756b8eba4c99301d2cf359
> 2019-11-02 21:02:34,779 ERROR org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught runtime exception: pid=112532, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; ServerCrashProcedure server=s1.d.com,16020,1572668980355, splitWal=true, meta=false
> java.lang.NullPointerException
> 	at org.apache.hadoop.hbase.procedure2.store.ProcedureStoreTracker.update(ProcedureStoreTracker.java:139)
> 	at org.apache.hadoop.hbase.procedure2.store.ProcedureStoreTracker.update(ProcedureStoreTracker.java:132)
> 	at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.updateStoreTracker(WALProcedureStore.java:786)
> 	at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.pushData(WALProcedureStore.java:741)
> 	at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.update(WALProcedureStore.java:605)
> 	at org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.persistAndWake(RegionRemoteProcedureBase.java:183)
> 	at org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.serverCrashed(RegionRemoteProcedureBase.java:240)
> 	at org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.serverCrashed(TransitRegionStateProcedure.java:409)
> 	at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.assignRegions(ServerCrashProcedure.java:461)
> 	at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:221)
> 	at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:64)
> 	at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:194)
> 	at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:962)
> 	at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1648)
> 	at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1395)
> 	at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78)
> 	at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1965)
> 2019-11-02 21:02:34,779 DEBUG org.apache.hadoop.hbase.procedure2.RootProcedureState: Add procedure pid=112532, state=FAILED:SERVER_CRASH_ASSIGN, locked=true, exception=java.lang.NullPointerException via CODE-BUG: Uncaught runtime exception: pid=112532, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; ServerCrashProcedure server=s1.d.com,16020,1572668980355, splitWal=true, meta=false:java.lang.NullPointerException; ServerCrashProcedure server=s1.d.com,16020,1572668980355, splitWal=true, meta=false as the 3th rollback step
> 2019-11-02 21:02:34,782 ERROR org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught runtime exception for pid=112532, state=FAILED:SERVER_CRASH_ASSIGN, locked=true, exception=java.lang.NullPointerException via CODE-BUG: Uncaught runtime exception: pid=112532, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; ServerCrashProcedure server=s1.d.com,16020,1572668980355, splitWal=true, meta=false:java.lang.NullPointerException; ServerCrashProcedure server=s1.d.com,16020,1572668980355, splitWal=true, meta=false
> java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_ASSIGN
> 	at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:333)
> 	at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:64)
> 	at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:219)
> 	at org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:979)
> 	at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1569)
> 	at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1501)
> 	at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1352)
> 	at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78)
> 	at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1965)
> 2019-11-02 21:02:34,785 ERROR org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught runtime exception for pid=112532, state=FAILED:SERVER_CRASH_ASSIGN, locked=true, exception=java.lang.NullPointerException via CODE-BUG: Uncaught runtime exception: pid=112532, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; ServerCrashProcedure server=s1.d.com,16020,1572668980355, splitWal=true, meta=false:java.lang.NullPointerException; ServerCrashProcedure server=s1.d.com,16020,1572668980355, splitWal=true, meta=false
> java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_ASSIGN
> 	at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:333)
> 	at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:64)
> 	at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:219)
> 	at org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:979)
> 	at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1569)
> 	at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1501)
> 	at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1352)
> 	at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78)
> 	at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1965)
> {code}
> 2. I'm pretty sure I ran into this when I cleared out the MasterProcWAL to
> start over fresh.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)