[ https://issues.apache.org/jira/browse/HDFS-8406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544565#comment-14544565 ]

Keith Turner commented on HDFS-8406:
------------------------------------

Here is the full stack trace from the Accumulo master log.   

{noformat}
2015-05-14 17:14:54,301 [recovery.HadoopLogCloser] WARN : Error recovering lease on hdfs://10.1.5.6:10000/accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException): failed to create file /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 for DFSClient_NONMAPREDUCE_950713214_16 for client 10.1.5.158 because pendingCreates is non-null but no leases found.
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:3001)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLease(FSNamesystem.java:2955)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.recoverLease(NameNodeRpcServer.java:591)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.recoverLease(ClientNamenodeProtocolServerSideTranslatorPB.java:641)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)

        at org.apache.hadoop.ipc.Client.call(Client.java:1468)
        at org.apache.hadoop.ipc.Client.call(Client.java:1399)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
        at com.sun.proxy.$Proxy15.recoverLease(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.recoverLease(ClientNamenodeProtocolTranslatorPB.java:584)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
        at com.sun.proxy.$Proxy16.recoverLease(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient.recoverLease(DFSClient.java:1238)
        at org.apache.hadoop.hdfs.DistributedFileSystem$2.doCall(DistributedFileSystem.java:278)
        at org.apache.hadoop.hdfs.DistributedFileSystem$2.doCall(DistributedFileSystem.java:274)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.recoverLease(DistributedFileSystem.java:274)
        at org.apache.accumulo.server.master.recovery.HadoopLogCloser.close(HadoopLogCloser.java:55)
        at org.apache.accumulo.master.recovery.RecoveryManager$LogSortTask.run(RecoveryManager.java:96)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
        at java.lang.Thread.run(Thread.java:745)
{noformat}
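
The trace shows the master's recovery path: HadoopLogCloser.close polling DistributedFileSystem.recoverLease for the WAL until the NameNode reports the file closed. As a minimal, self-contained sketch of that polling pattern (the recoverOnce callback is a stand-in for the real fs.recoverLease(path) call, and the attempt/backoff numbers are illustrative, not Accumulo's actual settings):

```java
import java.util.function.BooleanSupplier;

public class LeaseRecoveryLoop {
    /**
     * Repeatedly invoke one lease-recovery attempt until it reports success,
     * or until maxAttempts is exhausted. In HDFS, one attempt would be
     * fs.recoverLease(walPath): true means the file is closed and the lease
     * released; false (or a RemoteException such as the
     * AlreadyBeingCreatedException above) means "not yet, retry".
     *
     * Returns the number of attempts used, or -1 if recovery never succeeded.
     */
    public static int recoverWithRetries(BooleanSupplier recoverOnce,
                                         int maxAttempts,
                                         long sleepMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (recoverOnce.getAsBoolean()) {
                return attempt;
            }
            try {
                Thread.sleep(sleepMillis); // back off before retrying
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                return -1;
            }
        }
        return -1; // gave up; in this bug the loop effectively never succeeds
    }

    public static void main(String[] args) {
        // Stand-in attempt that "closes the file" on the third call.
        final int[] calls = {0};
        int used = recoverWithRetries(() -> ++calls[0] >= 3, 10, 1L);
        System.out.println("recovered after " + used + " attempts"); // prints 3
    }
}
```

In the failure reported here, every attempt throws AlreadyBeingCreatedException, so the loop never exits successfully no matter how many retries the master allows.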

> Lease recovery continually failed
> ---------------------------------
>
>                 Key: HDFS-8406
>                 URL: https://issues.apache.org/jira/browse/HDFS-8406
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Keith Turner
>
> While testing Accumulo on a cluster and killing processes, I ran into a situation where the lease on an Accumulo write-ahead log in HDFS could not be recovered. Even restarting HDFS and Accumulo did not fix the problem.
> The following message was seen in an Accumulo tablet server log immediately before the tablet server was killed.
> {noformat}
> 2015-05-14 17:12:37,466 [hdfs.DFSClient] WARN : DFSOutputStream ResponseProcessor exception  for block BP-802741494-10.1.5.6-1431557089849:blk_1073932823_192060
> java.io.IOException: Bad response ERROR for block BP-802741494-10.1.5.6-1431557089849:blk_1073932823_192060 from datanode 10.1.5.9:50010
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:897)
> 2015-05-14 17:12:37,466 [hdfs.DFSClient] WARN : Error Recovery for block BP-802741494-10.1.5.6-1431557089849:blk_1073932823_192060 in pipeline 10.1.5.55:50010, 10.1.5.9:5
> {noformat}
> Before recovering data from a write-ahead log, the Accumulo master attempts to recover the lease on it. This repeatedly failed with messages like the following.
> {noformat}
> 2015-05-14 17:14:54,301 [recovery.HadoopLogCloser] WARN : Error recovering lease on hdfs://10.1.5.6:10000/accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException): failed to create file /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 for DFSClient_NONMAPREDUCE_950713214_16 for client 10.1.5.158 because pendingCreates is non-null but no leases found.
> {noformat}
> Below is some info from the NN logs for the problematic file.
> {noformat}
> [ec2-user@leader2 logs]$ grep 3a731759-3594-4535-8086-245 hadoop-ec2-user-namenode-leader2.log
> 2015-05-14 17:10:46,299 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock: /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2. BP-802741494-10.1.5.6-1431557089849 blk_1073932823_192060{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-ffe07d7d-0e68-45b8-b3d5-c976f1716481:NORMAL:10.1.5.55:50010|RBW], ReplicaUnderConstruction[[DISK]DS-6efec702-3f1f-4ec0-a31f-de947e7e6097:NORMAL:10.1.5.9:50010|RBW], ReplicaUnderConstruction[[DISK]DS-5e27df17-abf8-47df-b4bc-c38d0cd426ea:NORMAL:10.1.5.45:50010|RBW]]}
> 2015-05-14 17:10:46,628 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* fsync: /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 for DFSClient_NONMAPREDUCE_-1128465883_16
> 2015-05-14 17:14:49,288 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: recoverLease: [Lease.  Holder: DFSClient_NONMAPREDUCE_-1128465883_16, pendingcreates: 1], src=/accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 from client DFSClient_NONMAPREDUCE_-1128465883_16
> 2015-05-14 17:14:49,288 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease.  Holder: DFSClient_NONMAPREDUCE_-1128465883_16, pendingcreates: 1], src=/accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2
> 2015-05-14 17:14:49,289 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.internalReleaseLease: File /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 has not been closed. Lease recovery is in progress. RecoveryId = 192257 for block blk_1073932823_192060{blockUCState=UNDER_RECOVERY, primaryNodeIndex=2, replicas=[ReplicaUnderConstruction[[DISK]DS-ffe07d7d-0e68-45b8-b3d5-c976f1716481:NORMAL:10.1.5.55:50010|RBW], ReplicaUnderConstruction[[DISK]DS-6efec702-3f1f-4ec0-a31f-de947e7e6097:NORMAL:10.1.5.9:50010|RBW], ReplicaUnderConstruction[[DISK]DS-5e27df17-abf8-47df-b4bc-c38d0cd426ea:NORMAL:10.1.5.45:50010|RBW]]}
> java.lang.IllegalStateException: Failed to finalize INodeFile 3a731759-3594-4535-8086-245eed7cd4c2 since blocks[0] is non-complete, where blocks=[blk_1073932823_192257{blockUCState=COMMITTED, primaryNodeIndex=2, replicas=[ReplicaUnderConstruction[[DISK]DS-ffe07d7d-0e68-45b8-b3d5-c976f1716481:NORMAL:10.1.5.55:50010|RBW], ReplicaUnderConstruction[[DISK]DS-5e27df17-abf8-47df-b4bc-c38d0cd426ea:NORMAL:10.1.5.45:50010|RBW]]}].
> 2015-05-14 17:14:54,292 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 10000, call org.apache.hadoop.hdfs.protocol.ClientProtocol.recoverLease from 10.1.5.158:53784 Call#529 Retry#0: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 for DFSClient_NONMAPREDUCE_950713214_16 for client 10.1.5.158 because pendingCreates is non-null but no leases found.
> 2015-05-14 17:14:54,319 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.append: failed to create file /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 for DFSClient_NONMAPREDUCE_950713214_16 for client 10.1.5.158 because pendingCreates is non-null but no leases found.
> 2015-05-14 17:14:54,320 INFO org.apache.hadoop.ipc.Server: IPC Server handler 6 on 10000, call org.apache.hadoop.hdfs.protocol.ClientProtocol.append from 10.1.5.158:53784 Call#530 Retry#0: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 for DFSClient_NONMAPREDUCE_950713214_16 for client 10.1.5.158 because pendingCreates is non-null but no leases found.
> {noformat}
> Below is expanded info from the NN logs about the exception mentioned above.
> {noformat}
> 2015-05-14 17:14:51,704 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 on 10000, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.commitBlockSynchronization from 10.1.5.45:35687 Call#66737 Retry#0
> java.lang.IllegalStateException: Failed to finalize INodeFile 3a731759-3594-4535-8086-245eed7cd4c2 since blocks[0] is non-complete, where blocks=[blk_1073932823_192257{blockUCState=COMMITTED, primaryNodeIndex=2, replicas=[ReplicaUnderConstruction[[DISK]DS-ffe07d7d-0e68-45b8-b3d5-c976f1716481:NORMAL:10.1.5.55:50010|RBW], ReplicaUnderConstruction[[DISK]DS-5e27df17-abf8-47df-b4bc-c38d0cd426ea:NORMAL:10.1.5.45:50010|RBW]]}].
>         at com.google.common.base.Preconditions.checkState(Preconditions.java:172)
>         at org.apache.hadoop.hdfs.server.namenode.INodeFile.assertAllBlocksComplete(INodeFile.java:214)
>         at org.apache.hadoop.hdfs.server.namenode.INodeFile.toCompleteFile(INodeFile.java:201)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.finalizeINodeFileUnderConstruction(FSNamesystem.java:4663)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.closeFileCommitBlocks(FSNamesystem.java:4878)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.commitBlockSynchronization(FSNamesystem.java:4842)
>         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.commitBlockSynchronization(NameNodeRpcServer.java:734)
>         at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.commitBlockSynchronization(DatanodeProtocolServerSideTranslatorPB.java:270)
>         at org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:26394)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
> {noformat}
> Below is the output of running fsck on the file.
> {noformat}
> [ec2-user@leader2 sbin]$ hdfs fsck /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 -files -blocks -locations -openforwrite
> Connecting to namenode via http://leader2:50070
> FSCK started by ec2-user (auth:SIMPLE) from /10.1.5.6 for path /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 at Thu May 14 19:03:21 UTC 2015
> /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 541965482 bytes, 1 block(s), OPENFORWRITE:  Under replicated BP-802741494-10.1.5.6-1431557089849:blk_1073932823_192257{blockUCState=COMMITTED, primaryNodeIndex=2, replicas=[ReplicaUnderConstruction[[DISK]DS-ffe07d7d-0e68-45b8-b3d5-c976f1716481:NORMAL:10.1.5.55:50010|RBW], ReplicaUnderConstruction[[DISK]DS-5e27df17-abf8-47df-b4bc-c38d0cd426ea:NORMAL:10.1.5.45:50010|RBW], ReplicaUnderConstruction[[DISK]DS-6efec702-3f1f-4ec0-a31f-de947e7e6097:NORMAL:10.1.5.9:50010|RWR], ReplicaUnderConstruction[[DISK]DS-b42bf32b-dc72-4be5-bf33-e1ee73825420:NORMAL:10.1.5.131:50010|RBW]]}. Target Replicas is 3 but found 2 replica(s).
> 0. BP-802741494-10.1.5.6-1431557089849:blk_1073932823_192257{blockUCState=COMMITTED, primaryNodeIndex=2, replicas=[ReplicaUnderConstruction[[DISK]DS-ffe07d7d-0e68-45b8-b3d5-c976f1716481:NORMAL:10.1.5.55:50010|RBW], ReplicaUnderConstruction[[DISK]DS-5e27df17-abf8-47df-b4bc-c38d0cd426ea:NORMAL:10.1.5.45:50010|RBW], ReplicaUnderConstruction[[DISK]DS-6efec702-3f1f-4ec0-a31f-de947e7e6097:NORMAL:10.1.5.9:50010|RWR], ReplicaUnderConstruction[[DISK]DS-b42bf32b-dc72-4be5-bf33-e1ee73825420:NORMAL:10.1.5.131:50010|RBW]]} len=541965482 repl=2 [10.1.5.55:50010, 10.1.5.45:50010, 10.1.5.9:50010, 10.1.5.131:50010]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
