stack created HBASE-21124:
-----------------------------

             Summary: SCP stuck unable to split WALs (NoServerDispatchException and STUCK regions in CLOSING in logs)
                 Key: HBASE-21124
                 URL: https://issues.apache.org/jira/browse/HBASE-21124
             Project: HBase
          Issue Type: Bug
          Components: amv2
            Reporter: stack
At first blush: we go to dispatch an unassign to a server that is not online, and a NoServerDispatchException is thrown. Because the server is not online, an SCP is NOT scheduled, so the region close never runs to completion. Eventually the regions complain they are STUCK.
{code}
2018-08-27 15:43:41,578 WARN [PEWorker-13] assignment.RegionTransitionProcedure: Remote call failed ve0530.halxg.cloudera.com,16020,1535387669315; pid=20723, ppid=20578, state=RUNNABLE:REGION_TRANSITION_DISPATCH, locked=true; UnassignProcedure table=IntegrationTestBigLinkedList, region=464487fa9329df991cae84519adf253c, server=ve0530.halxg.cloudera.com,16020,1535387669315; rit=CLOSING, location=ve0530.halxg.cloudera.com,16020,1535387669315; exception=NoServerDispatchException
org.apache.hadoop.hbase.procedure2.NoServerDispatchException: ve0530.halxg.cloudera.com,16020,1535387669315; pid=20723, ppid=20578, state=RUNNABLE:REGION_TRANSITION_DISPATCH, locked=true; UnassignProcedure table=IntegrationTestBigLinkedList, region=464487fa9329df991cae84519adf253c, server=ve0530.halxg.cloudera.com,16020,1535387669315
  at org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher.addOperationToNode(RemoteProcedureDispatcher.java:177)
  at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.addToRemoteDispatcher(RegionTransitionProcedure.java:259)
  at org.apache.hadoop.hbase.master.assignment.UnassignProcedure.updateTransition(UnassignProcedure.java:207)
  at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:345)
  at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:97)
  at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:876)
  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1556)
  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1344)
  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$900(ProcedureExecutor.java:76)
  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1855)
2018-08-27 15:43:41,578 WARN [PEWorker-16] assignment.UnassignProcedure: Expiring ve0530.halxg.cloudera.com,16020,1535387669315, pid=20720, ppid=20580, state=RUNNABLE:REGION_TRANSITION_DISPATCH, locked=true; UnassignProcedure table=IntegrationTestBigLinkedList, region=aa45bd6f1aabd0ee778c2664cca76d5e, server=ve0530.halxg.cloudera.com,16020,1535387669315 rit=CLOSING, location=ve0530.halxg.cloudera.com,16020,1535387669315; exception=NoServerDispatchException
2018-08-27 15:43:41,578 WARN [PEWorker-16] master.ServerManager: Expiration of ve0530.halxg.cloudera.com,16020,1535387669315 but server not online
2018-08-27 15:43:41,578 INFO [PEWorker-16] master.ServerManager: Processing expiration of ve0530.halxg.cloudera.com,16020,1535387669315 on ve0524.halxg.cloudera.com,16000,1535403620803
{code}
The root cause is that the crashed-server processing is stuck trying to split WALs. The split never succeeds, so we are stuck in limbo.
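The control flow described above can be sketched as a toy simulation. To be clear, this is NOT the actual HBase code: the class and method bodies below are illustrative stand-ins for the real `RemoteProcedureDispatcher`/`UnassignProcedure`/`ServerManager` interplay, showing how a dispatch to an already-offline server can leave a region parked in CLOSING when the follow-up expiration is a no-op.

```java
import java.util.HashMap;
import java.util.Map;

// Toy simulation (illustrative, not actual HBase classes) of the stuck-unassign
// failure mode: dispatch to an offline server throws, the caller tries to
// expire the server, but expiration does nothing because the server is already
// gone from the online set -- so the region close never completes.
public class StuckUnassignDemo {
    static class NoServerDispatchException extends Exception {
        NoServerDispatchException(String server) { super(server); }
    }

    // Servers the dispatcher currently considers online.
    private final Map<String, Boolean> onlineServers = new HashMap<>();
    // Region -> assignment state (CLOSING, CLOSED, ...).
    final Map<String, String> regionStates = new HashMap<>();
    // Whether a ServerCrashProcedure got scheduled for the dead server.
    boolean scpScheduled = false;

    // Stand-in for RemoteProcedureDispatcher.addOperationToNode: throws when
    // the target server has no dispatch node (i.e. is not online).
    void addOperationToNode(String server) throws NoServerDispatchException {
        if (!onlineServers.containsKey(server)) {
            throw new NoServerDispatchException(server);
        }
    }

    // Stand-in for server expiration: only a server still in the online set
    // gets an SCP; otherwise we just log "server not online" and do nothing.
    void expireServer(String server) {
        if (onlineServers.remove(server) != null) {
            scpScheduled = true;
        }
    }

    // Stand-in for UnassignProcedure.updateTransition.
    void unassign(String region, String server) {
        regionStates.put(region, "CLOSING");
        try {
            addOperationToNode(server);
            regionStates.put(region, "CLOSED");
        } catch (NoServerDispatchException e) {
            expireServer(server);  // no-op: server already dropped out
        }
    }

    public static void main(String[] args) {
        StuckUnassignDemo master = new StuckUnassignDemo();
        // The server crashed and was already removed from the online set
        // before the unassign dispatched.
        master.unassign("464487fa9329df991cae84519adf253c",
                        "ve0530.halxg.cloudera.com,16020,1535387669315");
        System.out.println(master.regionStates);
        System.out.println("scpScheduled=" + master.scpScheduled);
    }
}
```

Running this leaves the region in CLOSING with no SCP scheduled, matching the STUCK complaints in the log.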
In the middle of writing recovered.edits files we throw a FNFE:
{code}
2018-08-27 15:58:01,939 INFO [split-log-closeStream-2] wal.WALSplitter: Rename hdfs://ve0524.halxg.cloudera.com:8020/hbase/data/default/IntegrationTestBigLinkedList/db9d8cae5a9f098be493cb79d40e031f/recovered.edits/0000000000001384580-ve0530.halxg.cloudera.com%2C16020%2C1535387669315.1535406304424.temp to hdfs://ve0524.halxg.cloudera.com:8020/hbase/data/default/IntegrationTestBigLinkedList/db9d8cae5a9f098be493cb79d40e031f/recovered.edits/0000000000001397283
2018-08-27 15:58:01,957 ERROR [split-log-closeStream-2] wal.WALSplitter: Couldn't close log at hdfs://ve0524.halxg.cloudera.com:8020/hbase/data/default/IntegrationTestBigLinkedList/e8f03926ae2f6aa090f1278d3ad73762/recovered.edits/0000000000001332645-ve0530.halxg.cloudera.com%2C16020%2C1535387669315.1535406304424.temp
org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /hbase/data/default/IntegrationTestBigLinkedList/e8f03926ae2f6aa090f1278d3ad73762/recovered.edits/0000000000001332645-ve0530.halxg.cloudera.com%2C16020%2C1535387669315.1535406304424.temp (inode 10892102) [Lease. Holder: DFSClient_NONMAPREDUCE_334149313_1, pending creates: 37]
  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2672)
  at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.completeFileInternal(FSDirWriteFileOp.java:621)
  at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.completeFile(FSDirWriteFileOp.java:601)
  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2717)
  at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:882)
  at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:555)
  at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
  at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:850)
  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:793)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:422)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1840)
  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2489)

  at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1489)
  at org.apache.hadoop.ipc.Client.call(Client.java:1435)
  at org.apache.hadoop.ipc.Client.call(Client.java:1345)
  at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
  at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
  at com.sun.proxy.$Proxy26.complete(Unknown Source)
  at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.complete(ClientNamenodeProtocolTranslatorPB.java:484)
  at sun.reflect.GeneratedMethodAccessor103.invoke(Unknown Source)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:409)
  at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:163)
  at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155)
  at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
  at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:346)
  at com.sun.proxy.$Proxy27.complete(Unknown Source)
  at sun.reflect.GeneratedMethodAccessor103.invoke(Unknown Source)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:372)
  at com.sun.proxy.$Proxy28.complete(Unknown Source)
  at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:908)
  at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:867)
  at org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:850)
  at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:805)
  at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
  at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
  at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.close(ProtobufLogWriter.java:67)
  at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink.closeWriter(WALSplitter.java:1396)
  at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink$2.call(WALSplitter.java:1373)
  at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink$2.call(WALSplitter.java:1370)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
{code}
We then fail, unable to close the file:
{code}
6152452:blk_1088844461_15107988] hdfs.DataStreamer: Error Recovery for BP-1622493648-10.17.240.20-1513896152452:blk_1088844461_15107988 in pipeline [DatanodeInfoWithStorage[10.17.240.25:50031,DS-aa2e553d-0dc0-4199-8cda-a8c1fa453b5b,DISK], DatanodeInfoWithStorage[10.17.240.27:50031,DS-3665f870-e036-46f5-9a6e-44ca6991ca6b,DISK], DatanodeInfoWithStorage[10.17.240.24:50031,DS-fa22485e-20b5-44e2-b634-33260ade2b47,DISK]]: datanode 2(DatanodeInfoWithStorage[10.17.240.24:50031,DS-fa22485e-20b5-44e2-b634-33260ade2b47,DISK]) is bad.
2018-08-27 15:58:02,038 WARN [DataStreamer for file /hbase/data/default/IntegrationTestBigLinkedList/e9aef737cc0991c584dd842ac1d25eee/recovered.edits/0000000000001163341-ve0530.halxg.cloudera.com%2C16020%2C1535387669315.1535406304424.temp block BP-1622493648-10.17.240.20-1513896152452:blk_1088844461_15107988] hdfs.DataStreamer: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): BP-1622493648-10.17.240.20-1513896152452:blk_1088844461_15107988 does not exist.
  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkUCBlock(FSNamesystem.java:5546)
  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.updateBlockForPipeline(FSNamesystem.java:5618)
  at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.updateBlockForPipeline(NameNodeRpcServer.java:901)
  at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.updateBlockForPipeline(ClientNamenodeProtocolServerSideTranslatorPB.java:985)
  at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
  at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:850)
  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:793)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:422)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1840)
  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2489)

  at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1489)
  at org.apache.hadoop.ipc.Client.call(Client.java:1435)
  at org.apache.hadoop.ipc.Client.call(Client.java:1345)
  at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
  at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
  at com.sun.proxy.$Proxy26.updateBlockForPipeline(Unknown Source)
  at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.updateBlockForPipeline(ClientNamenodeProtocolTranslatorPB.java:914)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:409)
  at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:163)
  at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155)
  at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
  at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:346)
  at com.sun.proxy.$Proxy27.updateBlockForPipeline(Unknown Source)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:372)
  at com.sun.proxy.$Proxy28.updateBlockForPipeline(Unknown Source)
  at org.apache.hadoop.hdfs.DataStreamer.updateBlockForPipeline(DataStreamer.java:1595)
  at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1472)
  at org.apache.hadoop.hdfs.DataStreamer.processDatanodeError(DataStreamer.java:1237)
  at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:657)
2018-08-27 15:58:02,041 ERROR [split-log-closeStream-3] wal.WALSplitter: Couldn't close log at hdfs://ve0524.halxg.cloudera.com:8020/hbase/data/default/IntegrationTestBigLinkedList/e9aef737cc0991c584dd842ac1d25eee/recovered.edits/0000000000001163341-ve0530.halxg.cloudera.com%2C16020%2C1535387669315.1535406304424.temp
org.apache.hadoop.ipc.RemoteException(java.io.IOException): BP-1622493648-10.17.240.20-1513896152452:blk_1088844461_15107988 does not exist.
  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkUCBlock(FSNamesystem.java:5546)
  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.updateBlockForPipeline(FSNamesystem.java:5618)
  at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.updateBlockForPipeline(NameNodeRpcServer.java:901)
  at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.updateBlockForPipeline(ClientNamenodeProtocolServerSideTranslatorPB.java:985)
  at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
  at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:850)
  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:793)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:422)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1840)
  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2489)
...
{code}
On the namenode it says:
{code}
2018-08-27 15:53:23,472 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 8020, call Call#2352218 Retry#0 org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 10.17.240.26:46482: java.io.FileNotFoundException: File does not exist: /hbase/data/default/IntegrationTestBigLinkedList/e8f03926ae2f6aa090f1278d3ad73762/recovered.edits/0000000000001332645-ve0530.halxg.cloudera.com%2C16020%2C1535387669315.1535406304424.temp (inode 10890719) [Lease. Holder: DFSClient_NONMAPREDUCE_-638399805_1, pending creates: 4]
{code}
And so on. Nasty.
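The FileNotFoundException comes out of the namenode's lease check on close. As a toy model only (this is NOT HDFS code, and the actual trigger in this incident is not diagnosed here): complete() verifies that the path still resolves to the inode the writer holds a lease on, so if the .temp file was renamed away or recreated by another split attempt while the first writer's stream was still open, that writer's close fails with "File does not exist".

```java
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.Map;

// Toy model (illustrative, not HDFS code) of the lease/inode check behind the
// FNFE above: closing a file completes it by path, and the path must still
// resolve to the inode the writer opened. A concurrent recreate or rename of
// the path invalidates the original writer's close.
public class LeaseCheckDemo {
    static class ToyNameNode {
        private final Map<String, Integer> pathToInode = new HashMap<>();
        private int nextInode = 1;

        // Creating a path hands the writer an inode; recreating the same
        // path replaces the inode, orphaning the earlier writer.
        int create(String path) {
            int inode = nextInode++;
            pathToInode.put(path, inode);
            return inode;
        }

        void rename(String src, String dst) {
            pathToInode.put(dst, pathToInode.remove(src));
        }

        // Mimics the spirit of FSNamesystem.checkLease: the path must still
        // resolve to the writer's inode, otherwise "File does not exist".
        void complete(String path, int writerInode) throws FileNotFoundException {
            Integer current = pathToInode.get(path);
            if (current == null || current != writerInode) {
                throw new FileNotFoundException(
                    "File does not exist: " + path + " (inode " + writerInode + ")");
            }
        }
    }

    public static void main(String[] args) {
        ToyNameNode nn = new ToyNameNode();
        String temp = "/hbase/.../recovered.edits/0000000000001332645.temp";
        int writerInode = nn.create(temp);  // first split task opens the file
        nn.create(temp);                    // a second attempt recreates the path
        try {
            nn.complete(temp, writerInode); // first writer's close now fails
        } catch (FileNotFoundException e) {
            System.out.println("close failed: " + e.getMessage());
        }
    }
}
```

The same check fails if the path was renamed out from under the writer, which is suggestive given the WALSplitter rename logged immediately before the first "Couldn't close log" error.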
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)