Keith Turner created HDFS-8406: ---------------------------------- Summary: Lease recovery continually failed Key: HDFS-8406 URL: https://issues.apache.org/jira/browse/HDFS-8406 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.6.0 Reporter: Keith Turner
While testing Accumulo on a cluster and killing processes, I ran into a situation where the lease on an accumulo write ahead log in HDFS could not be recovered. Even restarting HDFS and Accumulo would not fix the problem. The following message was seen in an Accumulo tablet server log immediately before the tablet server was killed. {noformat} 2015-05-14 17:12:37,466 [hdfs.DFSClient] WARN : DFSOutputStream ResponseProcessor exception for block BP-802741494-10.1.5.6-1431557089849:blk_1073932823_192060 java.io.IOException: Bad response ERROR for block BP-802741494-10.1.5.6-1431557089849:blk_1073932823_192060 from datanode 10.1.5.9:50010 at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:897) 2015-05-14 17:12:37,466 [hdfs.DFSClient] WARN : Error Recovery for block BP-802741494-10.1.5.6-1431557089849:blk_1073932823_192060 in pipeline 10.1.5.55:50010, 10.1.5.9:5 {noformat} Before recovering data from a write ahead log, the Accumulo master attempts to recover the lease. This repeatedly failed with messages like the following. {noformat} 2015-05-14 17:14:54,301 [recovery.HadoopLogCloser] WARN : Error recovering lease on hdfs://10.1.5.6:10000/accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException): failed to create file /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 for DFSClient_NONMAPREDUCE_950713214_16 for client 10.1.5.158 because pendingCreates is non-null but no leases found. {noformat} Below is some info from the NN logs for the problematic file. {noformat} [ec2-user@leader2 logs]$ grep 3a731759-3594-4535-8086-245 hadoop-ec2-user-namenode-leader2.log 2015-05-14 17:10:46,299 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock: /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2. BP-802741494-10.1.5.6-1431557089849 blk_1073932823_192060{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-ffe07d7d-0e68-45b8-b3d5-c976f1716481:NORMAL:10.1.5.55:50010|RBW], ReplicaUnderConstruction[[DISK]DS-6efec702-3f1f-4ec0-a31f-de947e7e6097:NORMAL:10.1.5.9:50010|RBW], ReplicaUnderConstruction[[DISK]DS-5e27df17-abf8-47df-b4bc-c38d0cd426ea:NORMAL:10.1.5.45:50010|RBW]]} 2015-05-14 17:10:46,628 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* fsync: /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 for DFSClient_NONMAPREDUCE_-1128465883_16 2015-05-14 17:14:49,288 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: recoverLease: [Lease. Holder: DFSClient_NONMAPREDUCE_-1128465883_16, pendingcreates: 1], src=/accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 from client DFSClient_NONMAPREDUCE_-1128465883_16 2015-05-14 17:14:49,288 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease. Holder: DFSClient_NONMAPREDUCE_-1128465883_16, pendingcreates: 1], src=/accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 2015-05-14 17:14:49,289 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.internalReleaseLease: File /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 has not been closed. Lease recovery is in progress. RecoveryId = 192257 for block blk_1073932823_192060{blockUCState=UNDER_RECOVERY, primaryNodeIndex=2, replicas=[ReplicaUnderConstruction[[DISK]DS-ffe07d7d-0e68-45b8-b3d5-c976f1716481:NORMAL:10.1.5.55:50010|RBW], ReplicaUnderConstruction[[DISK]DS-6efec702-3f1f-4ec0-a31f-de947e7e6097:NORMAL:10.1.5.9:50010|RBW], ReplicaUnderConstruction[[DISK]DS-5e27df17-abf8-47df-b4bc-c38d0cd426ea:NORMAL:10.1.5.45:50010|RBW]]} java.lang.IllegalStateException: Failed to finalize INodeFile 3a731759-3594-4535-8086-245eed7cd4c2 since blocks[0] is non-complete, where blocks=[blk_1073932823_192257{blockUCState=COMMITTED, primaryNodeIndex=2, replicas=[ReplicaUnderConstruction[[DISK]DS-ffe07d7d-0e68-45b8-b3d5-c976f1716481:NORMAL:10.1.5.55:50010|RBW], ReplicaUnderConstruction[[DISK]DS-5e27df17-abf8-47df-b4bc-c38d0cd426ea:NORMAL:10.1.5.45:50010|RBW]]}]. 2015-05-14 17:14:54,292 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 10000, call org.apache.hadoop.hdfs.protocol.ClientProtocol.recoverLease from 10.1.5.158:53784 Call#529 Retry#0: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 for DFSClient_NONMAPREDUCE_950713214_16 for client 10.1.5.158 because pendingCreates is non-null but no leases found. 2015-05-14 17:14:54,319 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.append: failed to create file /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 for DFSClient_NONMAPREDUCE_950713214_16 for client 10.1.5.158 because pendingCreates is non-null but no leases found. 2015-05-14 17:14:54,320 INFO org.apache.hadoop.ipc.Server: IPC Server handler 6 on 10000, call org.apache.hadoop.hdfs.protocol.ClientProtocol.append from 10.1.5.158:53784 Call#530 Retry#0: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 for DFSClient_NONMAPREDUCE_950713214_16 for client 10.1.5.158 because pendingCreates is non-null but no leases found. {noformat} Below is expanded info from the NN logs about the exception mentioned above. {noformat} 2015-05-14 17:14:51,704 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 on 10000, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.commitBlockSynchronization from 10.1.5.45:35687 Call#66737 Retry#0 java.lang.IllegalStateException: Failed to finalize INodeFile 3a731759-3594-4535-8086-245eed7cd4c2 since blocks[0] is non-complete, where blocks=[blk_1073932823_192257{blockUCState=COMMITTED, primaryNodeIndex=2, replicas=[ReplicaUnderConstruction[[DISK]DS-ffe07d7d-0e68-45b8-b3d5-c976f1716481:NORMAL:10.1.5.55:50010|RBW], ReplicaUnderConstruction[[DISK]DS-5e27df17-abf8-47df-b4bc-c38d0cd426ea:NORMAL:10.1.5.45:50010|RBW]]}]. at com.google.common.base.Preconditions.checkState(Preconditions.java:172) at org.apache.hadoop.hdfs.server.namenode.INodeFile.assertAllBlocksComplete(INodeFile.java:214) at org.apache.hadoop.hdfs.server.namenode.INodeFile.toCompleteFile(INodeFile.java:201) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.finalizeINodeFileUnderConstruction(FSNamesystem.java:4663) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.closeFileCommitBlocks(FSNamesystem.java:4878) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.commitBlockSynchronization(FSNamesystem.java:4842) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.commitBlockSynchronization(NameNodeRpcServer.java:734) at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.commitBlockSynchronization(DatanodeProtocolServerSideTranslatorPB.java:270) at org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:26394) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) {noformat} Below is the output of running fsck on the file. {noformat} [ec2-user@leader2 sbin]$ hdfs fsck /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 -files -blocks -locations -openforwrite Connecting to namenode via http://leader2:50070 FSCK started by ec2-user (auth:SIMPLE) from /10.1.5.6 for path /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 at Thu May 14 19:03:21 UTC 2015 /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 541965482 bytes, 1 block(s), OPENFORWRITE: Under replicated BP-802741494-10.1.5.6-1431557089849:blk_1073932823_192257{blockUCState=COMMITTED, primaryNodeIndex=2, replicas=[ReplicaUnderConstruction[[DISK]DS-ffe07d7d-0e68-45b8-b3d5-c976f1716481:NORMAL:10.1.5.55:50010|RBW], ReplicaUnderConstruction[[DISK]DS-5e27df17-abf8-47df-b4bc-c38d0cd426ea:NORMAL:10.1.5.45:50010|RBW], ReplicaUnderConstruction[[DISK]DS-6efec702-3f1f-4ec0-a31f-de947e7e6097:NORMAL:10.1.5.9:50010|RWR], ReplicaUnderConstruction[[DISK]DS-b42bf32b-dc72-4be5-bf33-e1ee73825420:NORMAL:10.1.5.131:50010|RBW]]}. Target Replicas is 3 but found 2 replica(s). 0. BP-802741494-10.1.5.6-1431557089849:blk_1073932823_192257{blockUCState=COMMITTED, primaryNodeIndex=2, replicas=[ReplicaUnderConstruction[[DISK]DS-ffe07d7d-0e68-45b8-b3d5-c976f1716481:NORMAL:10.1.5.55:50010|RBW], ReplicaUnderConstruction[[DISK]DS-5e27df17-abf8-47df-b4bc-c38d0cd426ea:NORMAL:10.1.5.45:50010|RBW], ReplicaUnderConstruction[[DISK]DS-6efec702-3f1f-4ec0-a31f-de947e7e6097:NORMAL:10.1.5.9:50010|RWR], ReplicaUnderConstruction[[DISK]DS-b42bf32b-dc72-4be5-bf33-e1ee73825420:NORMAL:10.1.5.131:50010|RBW]]} len=541965482 repl=2 [10.1.5.55:50010, 10.1.5.45:50010, 10.1.5.9:50010, 10.1.5.131:50010] {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)