[ https://issues.apache.org/jira/browse/HDFS-7809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14326487#comment-14326487 ]
Kihwal Lee commented on HDFS-7809:
----------------------------------

Thanks, [~jingzhao]. It would have been nicer if the bug had been dealt with in a separate jira. I will dupe this to one of the jiras.

> Block and lease recovery failure caused by snapshot issue
> ---------------------------------------------------------
>
>                 Key: HDFS-7809
>                 URL: https://issues.apache.org/jira/browse/HDFS-7809
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.5.0
>            Reporter: Kihwal Lee
>            Priority: Critical
>
> On a cluster running 2.5, we have observed a decommissioning failure due to a
> file that had been under construction for 3 days. It turned out that the
> file was abandoned and a lease recovery was carried out by the name node 3
> days ago.
> The block recovery failed because the name node threw a quota exception while
> serving {{commitBlockSynchronization()}}. After this failure, no further
> recovery attempt was made, leaving the file in under-construction state
> forever.
> Furthermore, the nature of the recovery failure is very strange. Even though
> *snapshot was never used* in the cluster, it was trying to record the diff
> and that required incrementing {{nsquota}} by 1. The user happened to run out
> of his {{nsquota}} at that time, so it failed and caused
> {{commitBlockSynchronization()}} to fail. We do see quota discrepancies
> occasionally. Probably those were caused by something like this all along?
> A few observations:
> - Lease recovery did not complete, yet didn't get retried.
> - No snapshot was in use, but somehow it went through a snapshot-related code
> path.
> - The quota update during {{commitBlockSynchronization()}} should be done
> unconditionally.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)