[ 
https://issues.apache.org/jira/browse/HBASE-23247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Stack resolved HBASE-23247.
-----------------------------------
    Hadoop Flags: Reviewed
      Resolution: Fixed

Pushed on branch-2.1+. Thanks for reviews.

> [hbck2] Schedule SCPs for 'Unknown Servers'
> -------------------------------------------
>
>                 Key: HBASE-23247
>                 URL: https://issues.apache.org/jira/browse/HBASE-23247
>             Project: HBase
>          Issue Type: Bug
>          Components: hbck2
>    Affects Versions: 2.2.2
>            Reporter: Michael Stack
>            Assignee: Michael Stack
>            Priority: Major
>             Fix For: 2.2.3
>
>
> I've run into an 'Unknown Server' phenomenon: meta has regions assigned to 
> servers that the cluster no longer knows about. You can see the list down at 
> the end of the 'HBCK Report' page (run 'catalogjanitor_run' in the shell to 
> generate a fresh report). The fix is tough if you try 
> unassign/assign/close/etc., because the new assign/unassign insists on 
> confirming the close succeeded by contacting the 'unknown server', and will 
> not move on until it succeeds; TODO. There are a few ways of arriving at this 
> state of affairs; I'll list some below in a minute.
> Meantime, an hbck2 'fix' seems just the ticket: run an SCP for the 'Unknown 
> Server' and it should clear meta of all the bad server references.... So 
> just schedule an SCP using the scheduleRecoveries command....only in this case 
> it fails before scheduling the SCP with the below; i.e. an FNFE because there 
> is no WAL dir for the 'Unknown Server'.
> {code}
>  22:41:13.909 [main] INFO  
> org.apache.hadoop.hbase.client.ConnectionImplementation - Closing master 
> protocol: MasterService
>  Exception in thread "main" java.io.IOException: 
> org.apache.hbase.thirdparty.com.google.protobuf.ServiceException: 
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(java.io.FileNotFoundException):
>  java.io.FileNotFoundException: File 
> hdfs://nameservice1/hbase/genie/WALs/s1.d.com,16020,1571170081872 does not 
> exist.
>    at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:986)
>    at 
> org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:122)
>    at 
> org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1046)
>    at 
> org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1043)
>    at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>    at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1053)
>    at 
> org.apache.hadoop.fs.FilterFileSystem.listStatus(FilterFileSystem.java:258)
>    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1802)
>    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1844)
>    at 
> org.apache.hadoop.hbase.master.MasterRpcServices.containMetaWals(MasterRpcServices.java:2709)
>    at 
> org.apache.hadoop.hbase.master.MasterRpcServices.scheduleServerCrashProcedure(MasterRpcServices.java:2488)
>    at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$HbckService$2.callBlockingMethod(MasterProtos.java)
>    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
>    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
>    at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
>    at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
>    at 
> org.apache.hadoop.hbase.client.HBaseHbck.scheduleServerCrashProcedures(HBaseHbck.java:175)
>    at 
> org.apache.hadoop.hbase.client.Hbck.scheduleServerCrashProcedure(Hbck.java:118)
>    at org.apache.hbase.HBCK2.scheduleRecoveries(HBCK2.java:345)
>    at org.apache.hbase.HBCK2.doCommandLine(HBCK2.java:746)
>    at org.apache.hbase.HBCK2.run(HBCK2.java:631)
>    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
>    at org.apache.hbase.HBCK2.main(HBCK2.java:865)
> {code}
> A simple fix makes it so I can schedule an SCP, which indeed clears out the 
> 'Unknown Server' references and restores sanity to the cluster.
> As to how you get an 'Unknown Server':
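For reference, the guard amounts to tolerating a missing WAL directory rather than letting the listing throw FNFE. A minimal, self-contained sketch of the pattern (the actual patch is in MasterRpcServices#containMetaWals; this sketch uses plain java.nio instead of the Hadoop FileSystem API, and the class/method names are illustrative):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class WalDirGuard {
    /**
     * List WAL files for a server, tolerating a missing directory.
     * An 'Unknown Server' has no WAL dir on the filesystem, so return
     * an empty list instead of letting the listing throw
     * FileNotFoundException and abort the scheduleRecoveries call.
     */
    static List<Path> listWals(Path serverWalDir) throws IOException {
        if (!Files.isDirectory(serverWalDir)) {
            return Collections.emptyList(); // no dir: nothing to split
        }
        try (Stream<Path> s = Files.list(serverWalDir)) {
            return s.collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // WAL dir for an 'unknown server' that never existed on disk.
        Path missing = Paths.get("/tmp/does-not-exist-" + System.nanoTime());
        // With the guard, this returns an empty list instead of throwing.
        System.out.println(listWals(missing).size()); // prints 0
    }
}
```

With the directory check in place, scheduleRecoveries can go on to queue the SCP even though the dead server left no WALs behind.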
> 1. The current scenario came about because the exception below, hit while 
> processing a server crash procedure, made the SCP exit just after splitting 
> logs but before it cleared the old assigns. A new server instance that came up 
> after this one went down purged the server from the dead servers list even 
> though there were still Procedures in flight (the cluster was under crippling 
> overload).
> {code}
>  2019-11-02 21:02:34,775 DEBUG 
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: Done splitting 
> WALs pid=112532, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS, locked=true; 
> ServerCrashProcedure server=s1.d.com,16020,1572668980355, splitWal=true, 
> meta=false
>  2019-11-02 21:02:34,775 DEBUG 
> org.apache.hadoop.hbase.procedure2.RootProcedureState: Add procedure 
> pid=112532, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; 
> ServerCrashProcedure server=s1.d.com,16020,1572668980355, splitWal=true, 
> meta=false as the 2th rollback step
>  2019-11-02 21:02:34,779 INFO 
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: pid=112532, 
> state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; ServerCrashProcedure 
> server=s1.d.com,16020,1572668980355, splitWal=true, meta=false found RIT 
> pid=101251, ppid=101123, state=SUCCESS, bypass=LOG-REDACTED 
> TransitRegionStateProcedure                            
> table=GENIE2_modality_syncdata, region=fd2bd0f540756b8eba4c99301d2cf359, 
> ASSIGN; rit=OPENING, location=s1.d.com,16020,1572668980355, 
> table=GENIE2_modality_syncdata, region=fd2bd0f540756b8eba4c99301d2cf359
>  2019-11-02 21:02:34,779 ERROR 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught 
> runtime exception: pid=112532, state=RUNNABLE:SERVER_CRASH_ASSIGN, 
> locked=true; ServerCrashProcedure server=s1.d.com,16020,1572668980355, 
> splitWal=true, meta=false
>  java.lang.NullPointerException
>          at 
> org.apache.hadoop.hbase.procedure2.store.ProcedureStoreTracker.update(ProcedureStoreTracker.java:139)
>          at 
> org.apache.hadoop.hbase.procedure2.store.ProcedureStoreTracker.update(ProcedureStoreTracker.java:132)
>          at 
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.updateStoreTracker(WALProcedureStore.java:786)
>          at 
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.pushData(WALProcedureStore.java:741)
>          at 
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.update(WALProcedureStore.java:605)
>          at 
> org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.persistAndWake(RegionRemoteProcedureBase.java:183)
>          at 
> org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.serverCrashed(RegionRemoteProcedureBase.java:240)
>          at 
> org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.serverCrashed(TransitRegionStateProcedure.java:409)
>          at 
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.assignRegions(ServerCrashProcedure.java:461)
>          at 
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:221)
>          at 
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:64)
>          at 
> org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:194)
>          at 
> org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:962)
>          at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1648)
>          at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1395)
>          at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78)
>          at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1965)
>  2019-11-02 21:02:34,779 DEBUG 
> org.apache.hadoop.hbase.procedure2.RootProcedureState: Add procedure 
> pid=112532, state=FAILED:SERVER_CRASH_ASSIGN, locked=true, 
> exception=java.lang.NullPointerException via CODE-BUG: Uncaught runtime 
> exception: pid=112532, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; 
> ServerCrashProcedure server=s1.d.com,16020,1572668980355,   splitWal=true, 
> meta=false:java.lang.NullPointerException; ServerCrashProcedure 
> server=s1.d.com,16020,1572668980355, splitWal=true, meta=false as the 3th 
> rollback step
>  2019-11-02 21:02:34,782 ERROR 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught 
> runtime exception for pid=112532, state=FAILED:SERVER_CRASH_ASSIGN, 
> locked=true, exception=java.lang.NullPointerException via CODE-BUG: Uncaught 
> runtime exception: pid=112532, state=RUNNABLE:SERVER_CRASH_ASSIGN, 
> locked=true; ServerCrashProcedure server=s1.d.com,16020,1572668980355, 
> splitWal=true, meta=false:java.lang.NullPointerException; 
> ServerCrashProcedure server=s1.d.com,16020,1572668980355, splitWal=true, 
> meta=false
>  java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_ASSIGN
>          at 
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:333)
>          at 
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:64)
>          at 
> org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:219)
>          at 
> org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:979)
>          at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1569)
>          at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1501)
>          at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1352)
>          at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78)
>          at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1965)
>  2019-11-02 21:02:34,785 ERROR 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught 
> runtime exception for pid=112532, state=FAILED:SERVER_CRASH_ASSIGN, 
> locked=true, exception=java.lang.NullPointerException via CODE-BUG: Uncaught 
> runtime exception: pid=112532, state=RUNNABLE:SERVER_CRASH_ASSIGN, 
> locked=true; ServerCrashProcedure server=s1.d.com,16020,1572668980355, 
> splitWal=true, meta=false:java.lang.NullPointerException; 
> ServerCrashProcedure server=s1.d.com,16020,1572668980355, splitWal=true, 
> meta=false
>  java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_ASSIGN
>          at 
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:333)
>          at 
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:64)
>          at 
> org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:219)
>          at 
> org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:979)
>          at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1569)
>          at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1501)
>          at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1352)
>          at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78)
>          at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1965)
> {code}
> 2. I'm pretty sure I ran into this when I cleared out the MasterProcWAL to 
> start over fresh.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
