[ 
https://issues.apache.org/jira/browse/HDFS-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14326067#comment-14326067
 ] 

Kihwal Lee commented on HDFS-7809:
----------------------------------

Stack trace:

{panel}
2015-02-13 01:07:45,628 \[org.apache.hadoop.hdfs.server.datanode.DataNode$2@278a83a0\] WARN datanode.DataNode: recoverBlocks FAILED: RecoveringBlock\{BP-xxxxx:blk_12345_10000; getBlockSize()=4150; corrupt=false; offset=-1; locs=\[1.2.3.4:1004, 1.2.3.5:1004, 1.2.3.6:1004\]\}
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.NSQuotaExceededException): Failed to record modification for snapshot: The NameSpace quota (directories and files) is exceeded: quota=50000 file count=50001
        at org.apache.hadoop.hdfs.server.namenode.DirectoryWithQuotaFeature.verifyNamespaceQuota(DirectoryWithQuotaFeature.java:138)
        at org.apache.hadoop.hdfs.server.namenode.DirectoryWithQuotaFeature.verifyQuota(DirectoryWithQuotaFeature.java:153)
        at org.apache.hadoop.hdfs.server.namenode.DirectoryWithQuotaFeature.addSpaceConsumed(DirectoryWithQuotaFeature.java:96)
        at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.addSpaceConsumed(INodeDirectory.java:136)
        at org.apache.hadoop.hdfs.server.namenode.INode.addSpaceConsumed2Parent(INode.java:484)
        at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.addSpaceConsumed(INodeDirectory.java:138)
        at org.apache.hadoop.hdfs.server.namenode.INode.addSpaceConsumed2Parent(INode.java:484)
        at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.addSpaceConsumed(INodeDirectory.java:138)
        at org.apache.hadoop.hdfs.server.namenode.INode.addSpaceConsumed2Parent(INode.java:484)
        at org.apache.hadoop.hdfs.server.namenode.INode.addSpaceConsumed(INode.java:474)
        at org.apache.hadoop.hdfs.server.namenode.snapshot.AbstractINodeDiffList.addDiff(AbstractINodeDiffList.java:125)
        at org.apache.hadoop.hdfs.server.namenode.snapshot.AbstractINodeDiffList.checkAndAddLatestSnapshotDiff(AbstractINodeDiffList.java:284)
        at org.apache.hadoop.hdfs.server.namenode.snapshot.AbstractINodeDiffList.saveSelf2Snapshot(AbstractINodeDiffList.java:296)
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.recordModification(INodeFile.java:305)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.finalizeINodeFileUnderConstruction(FSNamesystem.java:4202)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.closeFileCommitBlocks(FSNamesystem.java:4419)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.commitBlockSynchronization(FSNamesystem.java:4383)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.commitBlockSynchronization(NameNodeRpcServer.java:699)
        at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.commitBlockSynchronization(DatanodeProtocolServerSideTranslatorPB.java:270)
        at org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:28073)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2053)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1637)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2047)
{panel}

> Block and lease recovery failure caused by snapshot issue
> ---------------------------------------------------------
>
>                 Key: HDFS-7809
>                 URL: https://issues.apache.org/jira/browse/HDFS-7809
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.5.0
>            Reporter: Kihwal Lee
>            Priority: Critical
>
> On a cluster running 2.5, we have observed a decommissioning failure due to a 
> file that had been under construction for 3 days.  It turned out that the 
> file was abandoned and a lease recovery was carried out by the name node 3 
> days ago.
> The block recovery failed because the name node threw a quota exception while 
> serving {{commitBlockSynchronization()}}. After this failure, no further 
> recovery attempt was made, leaving the file in the under-construction state 
> forever.
> Furthermore, the nature of the recovery failure is very strange. Even though 
> *snapshots were never used* in the cluster, the name node was trying to record 
> a snapshot diff, which required incrementing the {{nsquota}} usage by 1. The 
> user happened to run out of {{nsquota}} at that time, so the check failed and 
> caused {{commitBlockSynchronization()}} to fail.  We do see quota discrepancies 
> occasionally. Probably those were caused by something like this all along?
> A few observations:
> - Lease recovery did not complete, yet it was not retried.
> - No snapshot was in use, but somehow it went through a snapshot-related code 
> path.
> - The quota update during {{commitBlockSynchronization()}} should be done 
> unconditionally.
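
To make the last observation concrete, here is a minimal sketch of the failure mode, not the actual HDFS code. Every class and method name in it ({{QuotaCommitSketch}}, {{QuotaDir}}, {{commit}}, and so on) is hypothetical and only loosely mirrors the frames in the stack trace. It assumes a directory already at its namespace quota (50000, as in the log) and a commit that has to account for one extra namespace entry. With quota verification on, the commit aborts and the file stays under construction; with the update applied unconditionally, the file is closed and the usage is still recorded.

```java
// Hypothetical sketch only: these are NOT the real HDFS classes.
class QuotaCommitSketch {

    /** Stand-in for a namespace quota violation. */
    static class QuotaExceeded extends Exception {
        QuotaExceeded(String msg) {
            super(msg);
        }
    }

    /** Simplified directory that tracks namespace usage against a quota. */
    static class QuotaDir {
        final long nsQuota;
        long nsCount;

        QuotaDir(long nsQuota, long nsCount) {
            this.nsQuota = nsQuota;
            this.nsCount = nsCount;
        }

        /** Adds nsDelta to the usage; checks the quota first only if verify is true. */
        void addSpaceConsumed(long nsDelta, boolean verify) throws QuotaExceeded {
            if (verify && nsCount + nsDelta > nsQuota) {
                throw new QuotaExceeded(
                    "quota=" + nsQuota + " file count=" + (nsCount + nsDelta));
            }
            nsCount += nsDelta;
        }
    }

    /** Simplified file whose only state is whether it is still open for write. */
    static class FileUnderConstruction {
        boolean underConstruction = true;
    }

    /**
     * Simplified commit: account for one namespace entry, then close the file.
     * Returns true if the file was closed, false if the quota check aborted it.
     */
    static boolean commit(QuotaDir dir, FileUnderConstruction file, boolean verifyQuota) {
        try {
            dir.addSpaceConsumed(1, verifyQuota);
            file.underConstruction = false;
            return true;
        } catch (QuotaExceeded e) {
            // Nothing retries here, so the file would stay under construction.
            return false;
        }
    }

    public static void main(String[] args) {
        QuotaDir strictDir = new QuotaDir(50000, 50000);
        FileUnderConstruction strictFile = new FileUnderConstruction();
        System.out.println("verified commit closed file: "
            + commit(strictDir, strictFile, true));

        QuotaDir lenientDir = new QuotaDir(50000, 50000);
        FileUnderConstruction lenientFile = new FileUnderConstruction();
        System.out.println("unconditional commit closed file: "
            + commit(lenientDir, lenientFile, false));
        System.out.println("usage after unconditional commit: " + lenientDir.nsCount);
    }
}
```

The sketch treats recovery as an internal operation that must complete, so over-quota usage is recorded rather than rejected. Whether the real fix should skip verification, avoid the snapshot code path entirely, or both is left open here.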



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
