[ https://issues.apache.org/jira/browse/HDFS-8406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wei-Chiu Chuang resolved HDFS-8406. ----------------------------------- Resolution: Duplicate > Lease recovery continually failed > --------------------------------- > > Key: HDFS-8406 > URL: https://issues.apache.org/jira/browse/HDFS-8406 > Project: Hadoop HDFS > Issue Type: Bug > Affects Versions: 2.6.0 > Reporter: Keith Turner > Priority: Major > Labels: Accumulo, HBase, SolrCloud > > While testing Accumulo on a cluster and killing processes, I ran into a > situation where the lease on an accumulo write ahead log in HDFS could not be > recovered. Even restarting HDFS and Accumulo would not fix the problem. > The following message was seen in an Accumulo tablet server log immediately > before the tablet server was killed. > {noformat} > 2015-05-14 17:12:37,466 [hdfs.DFSClient] WARN : DFSOutputStream > ResponseProcessor exception for block > BP-802741494-10.1.5.6-1431557089849:blk_1073932823_192060 > java.io.IOException: Bad response ERROR for block > BP-802741494-10.1.5.6-1431557089849:blk_1073932823_192060 from datanode > 10.1.5.9:50010 > at > org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:897) > 2015-05-14 17:12:37,466 [hdfs.DFSClient] WARN : Error Recovery for block > BP-802741494-10.1.5.6-1431557089849:blk_1073932823_192060 in pipeline > 10.1.5.55:50010, 10.1.5.9:5 > {noformat} > Before recovering data from a write ahead log, the Accumulo master attempts > to recover the lease. This repeatedly failed with messages like the > following. > {noformat} > 2015-05-14 17:14:54,301 [recovery.HadoopLogCloser] WARN : Error recovering > lease on > hdfs://10.1.5.6:10000/accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException): > failed to create file > /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 for > DFSClient_NONMAPREDUCE_950713214_16 for client 10.1.5.158 because > pendingCreates is non-null but no leases found. > {noformat} > Below is some info from the NN logs for the problematic file. > {noformat} > [ec2-user@leader2 logs]$ grep 3a731759-3594-4535-8086-245 > hadoop-ec2-user-namenode-leader2.log > 2015-05-14 17:10:46,299 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > allocateBlock: > /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2. > BP-802741494-10.1.5.6-1431557089849 > blk_1073932823_192060{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, > replicas=[ReplicaUnderConstruction[[DISK]DS-ffe07d7d-0e68-45b8-b3d5-c976f1716481:NORMAL:10.1.5.55:50010|RBW], > > ReplicaUnderConstruction[[DISK]DS-6efec702-3f1f-4ec0-a31f-de947e7e6097:NORMAL:10.1.5.9:50010|RBW], > > ReplicaUnderConstruction[[DISK]DS-5e27df17-abf8-47df-b4bc-c38d0cd426ea:NORMAL:10.1.5.45:50010|RBW]]} > 2015-05-14 17:10:46,628 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > fsync: /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 for > DFSClient_NONMAPREDUCE_-1128465883_16 > 2015-05-14 17:14:49,288 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: recoverLease: [Lease. > Holder: DFSClient_NONMAPREDUCE_-1128465883_16, pendingcreates: 1], > src=/accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 from > client DFSClient_NONMAPREDUCE_-1128465883_16 > 2015-05-14 17:14:49,288 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease. > Holder: DFSClient_NONMAPREDUCE_-1128465883_16, pendingcreates: 1], > src=/accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 > 2015-05-14 17:14:49,289 WARN org.apache.hadoop.hdfs.StateChange: DIR* > NameSystem.internalReleaseLease: File > /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 has not been > closed. Lease recovery is in progress. RecoveryId = 192257 for block > blk_1073932823_192060{blockUCState=UNDER_RECOVERY, primaryNodeIndex=2, > replicas=[ReplicaUnderConstruction[[DISK]DS-ffe07d7d-0e68-45b8-b3d5-c976f1716481:NORMAL:10.1.5.55:50010|RBW], > > ReplicaUnderConstruction[[DISK]DS-6efec702-3f1f-4ec0-a31f-de947e7e6097:NORMAL:10.1.5.9:50010|RBW], > > ReplicaUnderConstruction[[DISK]DS-5e27df17-abf8-47df-b4bc-c38d0cd426ea:NORMAL:10.1.5.45:50010|RBW]]} > java.lang.IllegalStateException: Failed to finalize INodeFile > 3a731759-3594-4535-8086-245eed7cd4c2 since blocks[0] is non-complete, where > blocks=[blk_1073932823_192257{blockUCState=COMMITTED, primaryNodeIndex=2, > replicas=[ReplicaUnderConstruction[[DISK]DS-ffe07d7d-0e68-45b8-b3d5-c976f1716481:NORMAL:10.1.5.55:50010|RBW], > > ReplicaUnderConstruction[[DISK]DS-5e27df17-abf8-47df-b4bc-c38d0cd426ea:NORMAL:10.1.5.45:50010|RBW]]}]. > 2015-05-14 17:14:54,292 INFO org.apache.hadoop.ipc.Server: IPC Server handler > 1 on 10000, call org.apache.hadoop.hdfs.protocol.ClientProtocol.recoverLease > from 10.1.5.158:53784 Call#529 Retry#0: > org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to > create file /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 > for DFSClient_NONMAPREDUCE_950713214_16 for client 10.1.5.158 because > pendingCreates is non-null but no leases found. > 2015-05-14 17:14:54,319 WARN org.apache.hadoop.hdfs.StateChange: DIR* > NameSystem.append: failed to create file > /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 for > DFSClient_NONMAPREDUCE_950713214_16 for client 10.1.5.158 because > pendingCreates is non-null but no leases found. > 2015-05-14 17:14:54,320 INFO org.apache.hadoop.ipc.Server: IPC Server handler > 6 on 10000, call org.apache.hadoop.hdfs.protocol.ClientProtocol.append from > 10.1.5.158:53784 Call#530 Retry#0: > org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to > create file /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 > for DFSClient_NONMAPREDUCE_950713214_16 for client 10.1.5.158 because > pendingCreates is non-null but no leases found. > {noformat} > Below is expanded info from the NN logs about the exception mentioned above. > {noformat} > 2015-05-14 17:14:51,704 WARN org.apache.hadoop.ipc.Server: IPC Server handler > 5 on 10000, call > org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.commitBlockSynchronization > from 10.1.5.45:35687 Call#66737 Retry#0 > java.lang.IllegalStateException: Failed to finalize INodeFile > 3a731759-3594-4535-8086-245eed7cd4c2 since blocks[0] is non-complete, where > blocks=[blk_1073932823_192257{blockUCState=COMMITTED, primaryNodeIndex=2, > replicas=[ReplicaUnderConstruction[[DISK]DS-ffe07d7d-0e68-45b8-b3d5-c976f1716481:NORMAL:10.1.5.55:50010|RBW], > > ReplicaUnderConstruction[[DISK]DS-5e27df17-abf8-47df-b4bc-c38d0cd426ea:NORMAL:10.1.5.45:50010|RBW]]}]. > at > com.google.common.base.Preconditions.checkState(Preconditions.java:172) > at > org.apache.hadoop.hdfs.server.namenode.INodeFile.assertAllBlocksComplete(INodeFile.java:214) > at > org.apache.hadoop.hdfs.server.namenode.INodeFile.toCompleteFile(INodeFile.java:201) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.finalizeINodeFileUnderConstruction(FSNamesystem.java:4663) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.closeFileCommitBlocks(FSNamesystem.java:4878) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.commitBlockSynchronization(FSNamesystem.java:4842) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.commitBlockSynchronization(NameNodeRpcServer.java:734) > at > org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.commitBlockSynchronization(DatanodeProtocolServerSideTranslatorPB.java:270) > at > org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:26394) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) > {noformat} > Below is the output of running fsck on the file. > {noformat} > [ec2-user@leader2 sbin]$ hdfs fsck > /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 -files > -blocks -locations -openforwrite > Connecting to namenode via http://leader2:50070 > FSCK started by ec2-user (auth:SIMPLE) from /10.1.5.6 for path > /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 at Thu May > 14 19:03:21 UTC 2015 > /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 541965482 > bytes, 1 block(s), OPENFORWRITE: Under replicated > BP-802741494-10.1.5.6-1431557089849:blk_1073932823_192257{blockUCState=COMMITTED, > primaryNodeIndex=2, > replicas=[ReplicaUnderConstruction[[DISK]DS-ffe07d7d-0e68-45b8-b3d5-c976f1716481:NORMAL:10.1.5.55:50010|RBW], > > ReplicaUnderConstruction[[DISK]DS-5e27df17-abf8-47df-b4bc-c38d0cd426ea:NORMAL:10.1.5.45:50010|RBW], > > ReplicaUnderConstruction[[DISK]DS-6efec702-3f1f-4ec0-a31f-de947e7e6097:NORMAL:10.1.5.9:50010|RWR], > > ReplicaUnderConstruction[[DISK]DS-b42bf32b-dc72-4be5-bf33-e1ee73825420:NORMAL:10.1.5.131:50010|RBW]]}. > Target Replicas is 3 but found 2 replica(s). > 0. > BP-802741494-10.1.5.6-1431557089849:blk_1073932823_192257{blockUCState=COMMITTED, > primaryNodeIndex=2, > replicas=[ReplicaUnderConstruction[[DISK]DS-ffe07d7d-0e68-45b8-b3d5-c976f1716481:NORMAL:10.1.5.55:50010|RBW], > > ReplicaUnderConstruction[[DISK]DS-5e27df17-abf8-47df-b4bc-c38d0cd426ea:NORMAL:10.1.5.45:50010|RBW], > > ReplicaUnderConstruction[[DISK]DS-6efec702-3f1f-4ec0-a31f-de947e7e6097:NORMAL:10.1.5.9:50010|RWR], > > ReplicaUnderConstruction[[DISK]DS-b42bf32b-dc72-4be5-bf33-e1ee73825420:NORMAL:10.1.5.131:50010|RBW]]} > len=541965482 repl=2 [10.1.5.55:50010, 10.1.5.45:50010, 10.1.5.9:50010, > 10.1.5.131:50010] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org