stack created HBASE-21124:
-----------------------------

             Summary: SCP stuck unable to split WALs (NoServerDispatchException 
and STUCK regions in CLOSING in logs)
                 Key: HBASE-21124
                 URL: https://issues.apache.org/jira/browse/HBASE-21124
             Project: HBase
          Issue Type: Bug
          Components: amv2
            Reporter: stack


At first blush: we go to dispatch an unassign to a server that is not online, and a 
NoServerDispatchException is thrown. Because the server is not online, an SCP is NOT 
scheduled, so the region close never runs to completion. Eventually the regions 
complain they are STUCK.

{code}
2018-08-27 15:43:41,578 WARN  [PEWorker-13] assignment.RegionTransitionProcedure: Remote call failed ve0530.halxg.cloudera.com,16020,1535387669315; pid=20723, ppid=20578, state=RUNNABLE:REGION_TRANSITION_DISPATCH, locked=true; UnassignProcedure table=IntegrationTestBigLinkedList, region=464487fa9329df991cae84519adf253c, server=ve0530.halxg.cloudera.com,16020,1535387669315; rit=CLOSING, location=ve0530.halxg.cloudera.com,16020,1535387669315; exception=NoServerDispatchException
org.apache.hadoop.hbase.procedure2.NoServerDispatchException: ve0530.halxg.cloudera.com,16020,1535387669315; pid=20723, ppid=20578, state=RUNNABLE:REGION_TRANSITION_DISPATCH, locked=true; UnassignProcedure table=IntegrationTestBigLinkedList, region=464487fa9329df991cae84519adf253c, server=ve0530.halxg.cloudera.com,16020,1535387669315
  at org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher.addOperationToNode(RemoteProcedureDispatcher.java:177)
  at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.addToRemoteDispatcher(RegionTransitionProcedure.java:259)
  at org.apache.hadoop.hbase.master.assignment.UnassignProcedure.updateTransition(UnassignProcedure.java:207)
  at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:345)
  at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:97)
  at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:876)
  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1556)
  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1344)
  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$900(ProcedureExecutor.java:76)
  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1855)
2018-08-27 15:43:41,578 WARN  [PEWorker-16] assignment.UnassignProcedure: Expiring ve0530.halxg.cloudera.com,16020,1535387669315, pid=20720, ppid=20580, state=RUNNABLE:REGION_TRANSITION_DISPATCH, locked=true; UnassignProcedure table=IntegrationTestBigLinkedList, region=aa45bd6f1aabd0ee778c2664cca76d5e, server=ve0530.halxg.cloudera.com,16020,1535387669315 rit=CLOSING, location=ve0530.halxg.cloudera.com,16020,1535387669315; exception=NoServerDispatchException
2018-08-27 15:43:41,578 WARN  [PEWorker-16] master.ServerManager: Expiration of ve0530.halxg.cloudera.com,16020,1535387669315 but server not online
2018-08-27 15:43:41,578 INFO  [PEWorker-16] master.ServerManager: Processing expiration of ve0530.halxg.cloudera.com,16020,1535387669315 on ve0524.halxg.cloudera.com,16000,1535403620803
{code}
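For illustration only, here is a toy Java sketch of the gap the log above shows. All names here are hypothetical stand-ins, not HBase's actual classes: the dispatch throws because the target server is gone from the online set, but the failure handler then tries to expire a server that is already not online, so nothing gets scheduled to recover the region and the unassign never completes.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the failure path: dispatch-to-dead-server throws,
// and the expiration short-circuits, leaving the region in CLOSING forever.
public class DispatchGapDemo {
    static class NoServerDispatchException extends Exception {
        NoServerDispatchException(String server) { super(server); }
    }

    static class Dispatcher {
        private final Set<String> onlineServers = new HashSet<>();
        boolean scpScheduled = false;

        void markOnline(String server) { onlineServers.add(server); }

        // Stands in for addOperationToNode(): throws if the target node is gone.
        void dispatchUnassign(String server) throws NoServerDispatchException {
            if (!onlineServers.contains(server)) {
                throw new NoServerDispatchException(server);
            }
        }

        // Stands in for the expiration handler: for a server that is already
        // not online it just logs and returns, so no SCP is queued and the
        // pending unassign is never finished.
        String expireServer(String server) {
            if (!onlineServers.contains(server)) {
                return "Expiration of " + server + " but server not online";
            }
            scpScheduled = true;
            return "SCP scheduled for " + server;
        }
    }

    public static void main(String[] args) {
        Dispatcher d = new Dispatcher();
        String dead = "ve0530.halxg.cloudera.com,16020,1535387669315";
        try {
            d.dispatchUnassign(dead);
        } catch (NoServerDispatchException e) {
            // Mirrors the WARN lines above: expire is a no-op for a dead server.
            System.out.println(d.expireServer(dead));
        }
        // Nothing was scheduled to recover the region; it stays in CLOSING.
        System.out.println("scpScheduled=" + d.scpScheduled);
    }
}
```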

Root cause is that the crashed-server processing is stuck trying to split the dead 
server's WALs. The split never succeeds, so we sit in limbo.

In the middle of writing recovered.edits files we throw a FileNotFoundException (FNFE):
{code}
2018-08-27 15:58:01,939 INFO  [split-log-closeStream-2] wal.WALSplitter: Rename hdfs://ve0524.halxg.cloudera.com:8020/hbase/data/default/IntegrationTestBigLinkedList/db9d8cae5a9f098be493cb79d40e031f/recovered.edits/0000000000001384580-ve0530.halxg.cloudera.com%2C16020%2C1535387669315.1535406304424.temp to hdfs://ve0524.halxg.cloudera.com:8020/hbase/data/default/IntegrationTestBigLinkedList/db9d8cae5a9f098be493cb79d40e031f/recovered.edits/0000000000001397283
2018-08-27 15:58:01,957 ERROR [split-log-closeStream-2] wal.WALSplitter: Couldn't close log at hdfs://ve0524.halxg.cloudera.com:8020/hbase/data/default/IntegrationTestBigLinkedList/e8f03926ae2f6aa090f1278d3ad73762/recovered.edits/0000000000001332645-ve0530.halxg.cloudera.com%2C16020%2C1535387669315.1535406304424.temp
org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /hbase/data/default/IntegrationTestBigLinkedList/e8f03926ae2f6aa090f1278d3ad73762/recovered.edits/0000000000001332645-ve0530.halxg.cloudera.com%2C16020%2C1535387669315.1535406304424.temp (inode 10892102) [Lease.  Holder: DFSClient_NONMAPREDUCE_334149313_1, pending creates: 37]
  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2672)
  at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.completeFileInternal(FSDirWriteFileOp.java:621)
  at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.completeFile(FSDirWriteFileOp.java:601)
  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2717)
  at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:882)
  at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:555)
  at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
  at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:850)
  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:793)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:422)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1840)
  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2489)

  at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1489)
  at org.apache.hadoop.ipc.Client.call(Client.java:1435)
  at org.apache.hadoop.ipc.Client.call(Client.java:1345)
  at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
  at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
  at com.sun.proxy.$Proxy26.complete(Unknown Source)
  at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.complete(ClientNamenodeProtocolTranslatorPB.java:484)
  at sun.reflect.GeneratedMethodAccessor103.invoke(Unknown Source)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:409)
  at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:163)
  at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155)
  at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
  at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:346)
  at com.sun.proxy.$Proxy27.complete(Unknown Source)
  at sun.reflect.GeneratedMethodAccessor103.invoke(Unknown Source)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:372)
  at com.sun.proxy.$Proxy28.complete(Unknown Source)
  at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:908)
  at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:867)
  at org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:850)
  at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:805)
  at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
  at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
  at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.close(ProtobufLogWriter.java:67)
  at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink.closeWriter(WALSplitter.java:1396)
  at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink$2.call(WALSplitter.java:1373)
  at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink$2.call(WALSplitter.java:1370)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
{code}

We then fail, unable to close the file:

{code}
... block BP-1622493648-10.17.240.20-1513896152452:blk_1088844461_15107988] hdfs.DataStreamer: Error Recovery for BP-1622493648-10.17.240.20-1513896152452:blk_1088844461_15107988 in pipeline [DatanodeInfoWithStorage[10.17.240.25:50031,DS-aa2e553d-0dc0-4199-8cda-a8c1fa453b5b,DISK], DatanodeInfoWithStorage[10.17.240.27:50031,DS-3665f870-e036-46f5-9a6e-44ca6991ca6b,DISK], DatanodeInfoWithStorage[10.17.240.24:50031,DS-fa22485e-20b5-44e2-b634-33260ade2b47,DISK]]: datanode 2(DatanodeInfoWithStorage[10.17.240.24:50031,DS-fa22485e-20b5-44e2-b634-33260ade2b47,DISK]) is bad.
2018-08-27 15:58:02,038 WARN  [DataStreamer for file /hbase/data/default/IntegrationTestBigLinkedList/e9aef737cc0991c584dd842ac1d25eee/recovered.edits/0000000000001163341-ve0530.halxg.cloudera.com%2C16020%2C1535387669315.1535406304424.temp block BP-1622493648-10.17.240.20-1513896152452:blk_1088844461_15107988] hdfs.DataStreamer: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): BP-1622493648-10.17.240.20-1513896152452:blk_1088844461_15107988 does not exist.
  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkUCBlock(FSNamesystem.java:5546)
  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.updateBlockForPipeline(FSNamesystem.java:5618)
  at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.updateBlockForPipeline(NameNodeRpcServer.java:901)
  at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.updateBlockForPipeline(ClientNamenodeProtocolServerSideTranslatorPB.java:985)
  at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
  at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:850)
  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:793)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:422)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1840)
  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2489)

  at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1489)
  at org.apache.hadoop.ipc.Client.call(Client.java:1435)
  at org.apache.hadoop.ipc.Client.call(Client.java:1345)
  at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
  at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
  at com.sun.proxy.$Proxy26.updateBlockForPipeline(Unknown Source)
  at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.updateBlockForPipeline(ClientNamenodeProtocolTranslatorPB.java:914)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:409)
  at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:163)
  at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155)
  at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
  at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:346)
  at com.sun.proxy.$Proxy27.updateBlockForPipeline(Unknown Source)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:372)
  at com.sun.proxy.$Proxy28.updateBlockForPipeline(Unknown Source)
  at org.apache.hadoop.hdfs.DataStreamer.updateBlockForPipeline(DataStreamer.java:1595)
  at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1472)
  at org.apache.hadoop.hdfs.DataStreamer.processDatanodeError(DataStreamer.java:1237)
  at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:657)
2018-08-27 15:58:02,041 ERROR [split-log-closeStream-3] wal.WALSplitter: Couldn't close log at hdfs://ve0524.halxg.cloudera.com:8020/hbase/data/default/IntegrationTestBigLinkedList/e9aef737cc0991c584dd842ac1d25eee/recovered.edits/0000000000001163341-ve0530.halxg.cloudera.com%2C16020%2C1535387669315.1535406304424.temp
org.apache.hadoop.ipc.RemoteException(java.io.IOException): BP-1622493648-10.17.240.20-1513896152452:blk_1088844461_15107988 does not exist.
  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkUCBlock(FSNamesystem.java:5546)
  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.updateBlockForPipeline(FSNamesystem.java:5618)
  at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.updateBlockForPipeline(NameNodeRpcServer.java:901)
  at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.updateBlockForPipeline(ClientNamenodeProtocolServerSideTranslatorPB.java:985)
  at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
  at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:850)
  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:793)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:422)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1840)
  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2489)
...
{code}

On the namenode it says:

{code}
2018-08-27 15:53:23,472 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 8020, call Call#2352218 Retry#0 org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 10.17.240.26:46482: java.io.FileNotFoundException: File does not exist: /hbase/data/default/IntegrationTestBigLinkedList/e8f03926ae2f6aa090f1278d3ad73762/recovered.edits/0000000000001332645-ve0530.halxg.cloudera.com%2C16020%2C1535387669315.1535406304424.temp (inode 10890719) [Lease.  Holder: DFSClient_NONMAPREDUCE_-638399805_1, pending creates: 4]
{code}

And so on.

Nasty.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
