Keith Turner created HDFS-8406:
----------------------------------

             Summary: Lease recovery continually failed
                 Key: HDFS-8406
                 URL: https://issues.apache.org/jira/browse/HDFS-8406
             Project: Hadoop HDFS
          Issue Type: Bug
    Affects Versions: 2.6.0
            Reporter: Keith Turner


While testing Accumulo on a cluster and killing processes, I ran into a 
situation where the lease on an Accumulo write-ahead log in HDFS could not be 
recovered. Even restarting HDFS and Accumulo did not fix the problem.

The following message was seen in an Accumulo tablet server log immediately 
before the tablet server was killed.

{noformat}
2015-05-14 17:12:37,466 [hdfs.DFSClient] WARN : DFSOutputStream 
ResponseProcessor exception  for block 
BP-802741494-10.1.5.6-1431557089849:blk_1073932823_192060
java.io.IOException: Bad response ERROR for block 
BP-802741494-10.1.5.6-1431557089849:blk_1073932823_192060 from datanode 
10.1.5.9:50010
        at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:897)
2015-05-14 17:12:37,466 [hdfs.DFSClient] WARN : Error Recovery for block 
BP-802741494-10.1.5.6-1431557089849:blk_1073932823_192060 in pipeline 
10.1.5.55:50010, 10.1.5.9:5
{noformat}

Before recovering data from a write-ahead log, the Accumulo master attempts to 
recover the lease on it. This repeatedly failed with messages like the following.

{noformat}
2015-05-14 17:14:54,301 [recovery.HadoopLogCloser] WARN : Error recovering 
lease on 
hdfs://10.1.5.6:10000/accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException):
 failed to create file 
/accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 for 
DFSClient_NONMAPREDUCE_950713214_16 for client 10.1.5.158 because 
pendingCreates is non-null but no leases found.
{noformat}
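For context, the lease-recovery call the master makes (via Accumulo's HadoopLogCloser) boils down to DistributedFileSystem.recoverLease(Path), which returns true once the NameNode has recovered the lease and closed the file, and is polled until it does. A minimal sketch of that poll loop follows; the HDFS call is abstracted as a BooleanSupplier so the example is self-contained, and the names and retry parameters here are illustrative, not Accumulo's actual code.

```java
import java.util.function.BooleanSupplier;

/** Illustrative sketch of a lease-recovery poll loop. In the real code path
 *  the supplier would be a call to DistributedFileSystem.recoverLease(Path),
 *  which returns true once the NameNode has recovered the lease and closed
 *  the file. Here it is a BooleanSupplier so the sketch runs standalone. */
public class LeaseRecoveryLoop {
    /** Polls until recovery succeeds or maxAttempts is exhausted.
     *  Returns the attempt number that succeeded, or -1 on failure. */
    public static int recover(BooleanSupplier recoverLease,
                              int maxAttempts, long sleepMillis)
            throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (recoverLease.getAsBoolean()) {
                return attempt;          // lease recovered, file closed
            }
            Thread.sleep(sleepMillis);   // back off before retrying
        }
        return -1;  // never succeeded -- the situation in this report
    }
}
```

In the failure described here, recoverLease never returns true no matter how long the caller retries, because the NameNode itself cannot finalize the file, as shown in the logs below.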

Below is the relevant output from the NameNode logs for the problematic file.

{noformat}
[ec2-user@leader2 logs]$ grep 3a731759-3594-4535-8086-245 
hadoop-ec2-user-namenode-leader2.log 
2015-05-14 17:10:46,299 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
allocateBlock: 
/accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2. 
BP-802741494-10.1.5.6-1431557089849 
blk_1073932823_192060{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, 
replicas=[ReplicaUnderConstruction[[DISK]DS-ffe07d7d-0e68-45b8-b3d5-c976f1716481:NORMAL:10.1.5.55:50010|RBW],
 
ReplicaUnderConstruction[[DISK]DS-6efec702-3f1f-4ec0-a31f-de947e7e6097:NORMAL:10.1.5.9:50010|RBW],
 
ReplicaUnderConstruction[[DISK]DS-5e27df17-abf8-47df-b4bc-c38d0cd426ea:NORMAL:10.1.5.45:50010|RBW]]}
2015-05-14 17:10:46,628 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* fsync: 
/accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 for 
DFSClient_NONMAPREDUCE_-1128465883_16
2015-05-14 17:14:49,288 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: recoverLease: [Lease.  
Holder: DFSClient_NONMAPREDUCE_-1128465883_16, pendingcreates: 1], 
src=/accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 from 
client DFSClient_NONMAPREDUCE_-1128465883_16
2015-05-14 17:14:49,288 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease.  
Holder: DFSClient_NONMAPREDUCE_-1128465883_16, pendingcreates: 1], 
src=/accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2
2015-05-14 17:14:49,289 WARN org.apache.hadoop.hdfs.StateChange: DIR* 
NameSystem.internalReleaseLease: File 
/accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 has not been 
closed. Lease recovery is in progress. RecoveryId = 192257 for block 
blk_1073932823_192060{blockUCState=UNDER_RECOVERY, primaryNodeIndex=2, 
replicas=[ReplicaUnderConstruction[[DISK]DS-ffe07d7d-0e68-45b8-b3d5-c976f1716481:NORMAL:10.1.5.55:50010|RBW],
 
ReplicaUnderConstruction[[DISK]DS-6efec702-3f1f-4ec0-a31f-de947e7e6097:NORMAL:10.1.5.9:50010|RBW],
 
ReplicaUnderConstruction[[DISK]DS-5e27df17-abf8-47df-b4bc-c38d0cd426ea:NORMAL:10.1.5.45:50010|RBW]]}
java.lang.IllegalStateException: Failed to finalize INodeFile 
3a731759-3594-4535-8086-245eed7cd4c2 since blocks[0] is non-complete, where 
blocks=[blk_1073932823_192257{blockUCState=COMMITTED, primaryNodeIndex=2, 
replicas=[ReplicaUnderConstruction[[DISK]DS-ffe07d7d-0e68-45b8-b3d5-c976f1716481:NORMAL:10.1.5.55:50010|RBW],
 
ReplicaUnderConstruction[[DISK]DS-5e27df17-abf8-47df-b4bc-c38d0cd426ea:NORMAL:10.1.5.45:50010|RBW]]}].
2015-05-14 17:14:54,292 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 
on 10000, call org.apache.hadoop.hdfs.protocol.ClientProtocol.recoverLease from 
10.1.5.158:53784 Call#529 Retry#0: 
org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create 
file /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 for 
DFSClient_NONMAPREDUCE_950713214_16 for client 10.1.5.158 because 
pendingCreates is non-null but no leases found.
2015-05-14 17:14:54,319 WARN org.apache.hadoop.hdfs.StateChange: DIR* 
NameSystem.append: failed to create file 
/accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 for 
DFSClient_NONMAPREDUCE_950713214_16 for client 10.1.5.158 because 
pendingCreates is non-null but no leases found.
2015-05-14 17:14:54,320 INFO org.apache.hadoop.ipc.Server: IPC Server handler 6 
on 10000, call org.apache.hadoop.hdfs.protocol.ClientProtocol.append from 
10.1.5.158:53784 Call#530 Retry#0: 
org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create 
file /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 for 
DFSClient_NONMAPREDUCE_950713214_16 for client 10.1.5.158 because 
pendingCreates is non-null but no leases found.
{noformat}

Below is the full stack trace from the NameNode logs for the IllegalStateException mentioned above.

{noformat}
2015-05-14 17:14:51,704 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 
on 10000, call 
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.commitBlockSynchronization
 from 10.1.5.45:35687 Call#66737 Retry#0

java.lang.IllegalStateException: Failed to finalize INodeFile 
3a731759-3594-4535-8086-245eed7cd4c2 since blocks[0] is non-complete, where 
blocks=[blk_1073932823_192257{blockUCState=COMMITTED, primaryNodeIndex=2, 
replicas=[ReplicaUnderConstruction[[DISK]DS-ffe07d7d-0e68-45b8-b3d5-c976f1716481:NORMAL:10.1.5.55:50010|RBW],
 
ReplicaUnderConstruction[[DISK]DS-5e27df17-abf8-47df-b4bc-c38d0cd426ea:NORMAL:10.1.5.45:50010|RBW]]}].
        at 
com.google.common.base.Preconditions.checkState(Preconditions.java:172)
        at 
org.apache.hadoop.hdfs.server.namenode.INodeFile.assertAllBlocksComplete(INodeFile.java:214)
        at 
org.apache.hadoop.hdfs.server.namenode.INodeFile.toCompleteFile(INodeFile.java:201)
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.finalizeINodeFileUnderConstruction(FSNamesystem.java:4663)
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.closeFileCommitBlocks(FSNamesystem.java:4878)
        at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.commitBlockSynchronization(FSNamesystem.java:4842)
        at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.commitBlockSynchronization(NameNodeRpcServer.java:734)
        at 
org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.commitBlockSynchronization(DatanodeProtocolServerSideTranslatorPB.java:270)
        at 
org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:26394)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
{noformat}
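The stack trace above shows why recovery can never complete: INodeFile.toCompleteFile() requires every block of the file to be COMPLETE before the file can be finalized, but after the interrupted pipeline recovery blocks[0] is stuck in COMMITTED (the client reported its final length, but the NameNode never received enough finalized replica reports to mark it COMPLETE). So commitBlockSynchronization can never close the file, and the lease can never be released. A simplified model of that invariant, not the actual HDFS source, looks like this:

```java
/** Illustrative model (not HDFS source) of the finalize invariant tripped in
 *  the stack trace above: finalizing an INodeFile requires every block to be
 *  COMPLETE, and a block stuck in COMMITTED fails the check forever. */
public class FinalizeInvariant {
    enum BlockUCState { COMPLETE, COMMITTED, UNDER_CONSTRUCTION, UNDER_RECOVERY }

    // Mirrors (in simplified form) the Preconditions.checkState in
    // INodeFile.assertAllBlocksComplete.
    static void assertAllBlocksComplete(BlockUCState[] blocks) {
        for (int i = 0; i < blocks.length; i++) {
            if (blocks[i] != BlockUCState.COMPLETE) {
                throw new IllegalStateException(
                    "Failed to finalize INodeFile since blocks[" + i
                        + "] is non-complete");
            }
        }
    }
}
```

Since the check is re-evaluated on every commitBlockSynchronization and recoverLease attempt, retrying from the client side (or restarting HDFS and Accumulo) cannot get past it.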

Below is the output of running fsck on the file.

{noformat}
[ec2-user@leader2 sbin]$ hdfs fsck 
/accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 -files -blocks 
-locations -openforwrite
Connecting to namenode via http://leader2:50070
FSCK started by ec2-user (auth:SIMPLE) from /10.1.5.6 for path 
/accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 at Thu May 14 
19:03:21 UTC 2015
/accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 541965482 
bytes, 1 block(s), OPENFORWRITE:  Under replicated 
BP-802741494-10.1.5.6-1431557089849:blk_1073932823_192257{blockUCState=COMMITTED,
 primaryNodeIndex=2, 
replicas=[ReplicaUnderConstruction[[DISK]DS-ffe07d7d-0e68-45b8-b3d5-c976f1716481:NORMAL:10.1.5.55:50010|RBW],
 
ReplicaUnderConstruction[[DISK]DS-5e27df17-abf8-47df-b4bc-c38d0cd426ea:NORMAL:10.1.5.45:50010|RBW],
 
ReplicaUnderConstruction[[DISK]DS-6efec702-3f1f-4ec0-a31f-de947e7e6097:NORMAL:10.1.5.9:50010|RWR],
 
ReplicaUnderConstruction[[DISK]DS-b42bf32b-dc72-4be5-bf33-e1ee73825420:NORMAL:10.1.5.131:50010|RBW]]}.
 Target Replicas is 3 but found 2 replica(s).
0. 
BP-802741494-10.1.5.6-1431557089849:blk_1073932823_192257{blockUCState=COMMITTED,
 primaryNodeIndex=2, 
replicas=[ReplicaUnderConstruction[[DISK]DS-ffe07d7d-0e68-45b8-b3d5-c976f1716481:NORMAL:10.1.5.55:50010|RBW],
 
ReplicaUnderConstruction[[DISK]DS-5e27df17-abf8-47df-b4bc-c38d0cd426ea:NORMAL:10.1.5.45:50010|RBW],
 
ReplicaUnderConstruction[[DISK]DS-6efec702-3f1f-4ec0-a31f-de947e7e6097:NORMAL:10.1.5.9:50010|RWR],
 
ReplicaUnderConstruction[[DISK]DS-b42bf32b-dc72-4be5-bf33-e1ee73825420:NORMAL:10.1.5.131:50010|RBW]]}
 len=541965482 repl=2 [10.1.5.55:50010, 10.1.5.45:50010, 10.1.5.9:50010, 
10.1.5.131:50010]
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
