[jira] [Commented] (HDFS-12369) Edit log corruption due to hard lease recovery of not-closed file which has snapshots

2017-12-27 Thread Xiao Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16304855#comment-16304855
 ] 

Xiao Chen commented on HDFS-12369:
--

As it turned out, this issue can manifest with different symptoms, depending on
the file's status at the time of recovery. At the core of the issue, the lease
recovery of a deleted file should not log anything to the edit log, which is
what this jira fixed.
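
In sketch form, the fix amounts to not logging a close for a file that no
longer exists. A rough illustration of the idea (the helper names below are
placeholders for illustration, not the committed patch):
{code}
// Illustrative sketch only -- not the actual HDFS-12369 change.
// During hard-limit lease recovery, a file that was deleted (but whose
// inode lingers, e.g. because a snapshot still references it) must not
// produce an OP_CLOSE in the edit log.
if (isFileDeleted(file)) {               // hypothetical helper
  leaseManager.removeLease(lease, src);  // drop the stale lease quietly
  return true;                           // recovered; nothing to log
}
// ... otherwise proceed with the normal close and log an OP_CLOSE ...
{code}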

{noformat}
2017-12-27 15:38:57,360 ERROR 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception 
on operation CloseOp [length=0, inodeId=0, path=/filename, replication=3, 
mtime=1514301485930, atime=1514297585263, blockSize=268435456, 
blocks=[blk_1863506432_791165194, blk_1863506631_791165393, 
blk_1863506826_791165588], permissions=hdfs:superuser:rw-r--r--, 
aclEntries=null, clientName=, clientMachine=, overwrite=false, 
storagePolicyId=0, opCode=OP_CLOSE, txid=10577364851]
java.io.IOException: Mismatched block IDs or generation stamps, attempting to 
replace block blk_1863518793_791177559 with blk_1863506432_791165194 as block # 
0/3 of /filename
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.updateBlocks(FSEditLogLoader.java:942)
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:434)
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:232)
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:141)
at 
org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:897)
at 
org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:750)
at 
org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:318)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1125)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:789)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:614)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:676)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.&lt;init&gt;(NameNode.java:844)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.&lt;init&gt;(NameNode.java:823)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1547)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1615)
{noformat}
This happened when a file with the same name was created (1st time) -> deleted
-> created again by a 2nd user -> lease recovered (for the 1st creation) ->
closed by the 2nd user, causing 2 close ops in the edits.
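
In client terms, the sequence is roughly the following (a hedged sketch of the
workload, assuming {{dfs}} is a {{DistributedFileSystem}} handle and {{path}}
is the shared file name):
{code}
// Sketch: how two OP_CLOSE entries for the same path end up in the edits.
FSDataOutputStream out1 = dfs.create(path);  // 1st creation; stream stays open
out1.write(data);
out1.hflush();
dfs.delete(path, false);                     // delete the 1st file
FSDataOutputStream out2 = dfs.create(path);  // 2nd user re-creates the name
// ... the 1st creation's lease hits the hard limit; hard lease recovery
// logs an OP_CLOSE carrying the 1st file's (now stale) blocks ...
out2.close();                                // 2nd user's close logs OP_CLOSE too
// On replay, the recovery's OP_CLOSE applies stale block IDs to the path
// now owned by the 2nd file -> "Mismatched block IDs" at startup.
{code}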


[jira] [Commented] (HDFS-12369) Edit log corruption due to hard lease recovery of not-closed file which has snapshots

2017-09-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16157876#comment-16157876
 ] 

Hudson commented on HDFS-12369:
---

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #12813 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/12813/])
HDFS-12369. Edit log corruption due to hard lease recovery of not-closed (xiao: 
rev 52b894db33bc68b46eec5cdf2735dfcf4030853a)
* (edit) 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestDeleteRace.java
* (edit) 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
* (edit) 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/LeaseManager.java


> Edit log corruption due to hard lease recovery of not-closed file which has 
> snapshots
> -
>
> Key: HDFS-12369
> URL: https://issues.apache.org/jira/browse/HDFS-12369
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: Xiao Chen
>Assignee: Xiao Chen
> Fix For: 2.9.0, 3.0.0-beta1, 2.8.3
>
> Attachments: HDFS-12369.01.patch, HDFS-12369.02.patch, 
> HDFS-12369.03.patch, HDFS-12369.test.patch
>
>
> HDFS-6257 and HDFS-7707 worked hard to prevent corruption from combinations 
> of client operations.
> Recently, we observed the NN unable to start, with the following exception:
> {noformat}
> 2017-08-17 14:32:18,418 ERROR 
> org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
> java.io.FileNotFoundException: File does not exist: 
> /home/Events/CancellationSurvey_MySQL/2015/12/31/.part-0.9nlJ3M
> at 
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:66)
> at 
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:56)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:429)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:232)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:141)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:897)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:750)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:318)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1125)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:789)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:614)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:676)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.&lt;init&gt;(NameNode.java:844)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.&lt;init&gt;(NameNode.java:823)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1547)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1615)
> {noformat}
> Quoting a nice analysis of the edits:
> {quote}
> In the edits logged about 1 hour later, we see this failing OP_CLOSE. The 
> sequence in the edits shows the file going through:
>   OPEN
>   ADD_BLOCK
>   CLOSE
>   ADD_BLOCK # perhaps this was an append
>   DELETE
>   (about 1 hour later) CLOSE
> It is interesting that there was no CLOSE logged before the delete.
> {quote}
> Grepping for that file name, it turns out the close was triggered by
> {{LeaseManager}} when the lease reached the hard limit.
> {noformat}
> 2017-08-16 15:05:45,927 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: 
>   Recovering [Lease.  Holder: DFSClient_NONMAPREDUCE_-1997177597_28, pending 
> creates: 75], 
>   src=/home/Events/CancellationSurvey_MySQL/2015/12/31/.part-0.9nlJ3M
> 2017-08-16 15:05:45,927 WARN org.apache.hadoop.hdfs.StateChange: BLOCK* 
>   internalReleaseLease: All existing blocks are COMPLETE, lease removed, file 
>   /home/Events/CancellationSurvey_MySQL/2015/12/31/.part-0.9nlJ3M closed.
> {noformat}






[jira] [Commented] (HDFS-12369) Edit log corruption due to hard lease recovery of not-closed file which has snapshots

2017-09-07 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16157349#comment-16157349
 ] 

Yongjun Zhang commented on HDFS-12369:
--

+1 on rev3. Thanks Xiao.








[jira] [Commented] (HDFS-12369) Edit log corruption due to hard lease recovery of not-closed file which has snapshots

2017-09-06 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155041#comment-16155041
 ] 

Hadoop QA commented on HDFS-12369:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
17s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 13m 
43s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
48s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
39s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
53s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
39s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
41s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
49s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
47s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
47s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
36s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
50s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
45s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
38s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 91m 38s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
16s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}117m 15s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.TestBlockStoragePolicy |
|   | hadoop.hdfs.TestLeaseRecoveryStriped |
|   | hadoop.hdfs.TestDatanodeConfig |
|   | hadoop.hdfs.server.namenode.TestReencryptionWithKMS |
| Timed out junit tests | org.apache.hadoop.hdfs.TestWriteReadStripedFile |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:71bbb86 |
| JIRA Issue | HDFS-12369 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12885539/HDFS-12369.03.patch |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  
unit  findbugs  checkstyle  |
| uname | Linux fba8e4ae3c99 3.13.0-117-generic #164-Ubuntu SMP Fri Apr 7 
11:05:26 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh 
|
| git revision | trunk / 1f3bc63 |
| Default Java | 1.8.0_144 |
| findbugs | v3.1.0-RC1 |
| unit | 
https://builds.apache.org/job/PreCommit-HDFS-Build/21016/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/21016/testReport/ |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: 
hadoop-hdfs-project/hadoop-hdfs |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/21016/console |
| Powered by | Apache Yetus 0.6.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.




[jira] [Commented] (HDFS-12369) Edit log corruption due to hard lease recovery of not-closed file which has snapshots

2017-09-06 Thread Xiao Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16154913#comment-16154913
 ] 

Xiao Chen commented on HDFS-12369:
--

Thanks for the review Yongjun.

bq. does this issue only occur when the file has a snapshot? 
Good question! Yes, because only with snapshots does the delete go through
{{FSDirDeleteOp#unprotectedDelete}} -> {{INodeFile#cleanSubtree}}, which
eventually ends up not calling {{clearFile}} in:
{code:title=FileWithSnapshotFeature.java}
  public void collectBlocksAndClear(
  INode.ReclaimContext reclaimContext, final INodeFile file) {
// check if everything is deleted.
if (isCurrentFileDeleted() && getDiffs().asList().isEmpty()) {
  file.clearFile(reclaimContext);
  return;
}
{code}
Added a few comments in the test about this, and also updated the jira title.

Good catch on the extra ';'. Attaching patch 3 to address that and the previous
checkstyle warning, and to add the {{addBlock}} call to the test.
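
To make the snapshot condition concrete, here is a hedged sketch of the
sequence (illustrative only; it assumes a {{DistributedFileSystem}} handle
{{dfs}} and a snapshottable directory {{dir}}):
{code}
// Sketch: with a snapshot, the deleted file's inode survives the delete.
dfs.allowSnapshot(dir);
FSDataOutputStream out = dfs.create(file);  // file under the snapshottable dir
out.hflush();
dfs.createSnapshot(dir, "s1");              // the snapshot now references file
dfs.delete(file, false);                    // delete while the stream is open
// In collectBlocksAndClear() above, isCurrentFileDeleted() is true but
// getDiffs().asList() is non-empty, so clearFile() is NOT called: the inode
// (and its stale lease) linger until hard-limit recovery trips over them.
{code}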







[jira] [Commented] (HDFS-12369) Edit log corruption due to hard lease recovery of not-closed file

2017-09-05 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16154668#comment-16154668
 ] 

Yongjun Zhang commented on HDFS-12369:
--

Hi [~xiaochen],

Thanks for working on this issue. The change looks good to me. 

One question: does this issue only occur when the file has a snapshot? The test
indicates that. If it also occurs when there is no snapshot, it would be nice
to have a test for that.

BTW, I noticed an extra ";" in
{code}
final INodeFile lastINode = iip.getLastINode().asFile();;
{code}








[jira] [Commented] (HDFS-12369) Edit log corruption due to hard lease recovery of not-closed file

2017-08-29 Thread Xiao Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16146552#comment-16146552
 ] 

Xiao Chen commented on HDFS-12369:
--

Following the above, we also brainstormed other scenarios where a non-closed
stream could cause issues. We only came up with {{addBlock}}, since the stream
holder could keep writing to that stream. After further inspection of the code,
though, {{addBlock}} is also safe, since it too requires a path resolution,
which fails once the file is deleted.

I can incorporate the above into the unit test in the next rev, along with any
review comments to come.
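
For illustration, a minimal sketch of why {{addBlock}} is safe (hedged; it
assumes a test with a {{DistributedFileSystem}} handle {{dfs}} and writes
large enough to cross a block boundary):
{code}
// Sketch: the NameNode resolves the path on every addBlock call, so
// writing past the current block fails once the file is deleted.
FSDataOutputStream out = dfs.create(path);
out.write(firstBlockData);              // fills the first block
out.hflush();
dfs.delete(path, false);                // delete while the stream is open
try {
  out.write(moreData);                  // crossing a block boundary forces
  out.hflush();                         // an addBlock RPC for a new block
} catch (IOException e) {
  // expected: path resolution fails on the NameNode (FileNotFoundException)
}
{code}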







[jira] [Commented] (HDFS-12369) Edit log corruption due to hard lease recovery of not-closed file

2017-08-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16146455#comment-16146455
 ] 

Hadoop QA commented on HDFS-12369:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  5m 
38s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 16m 
41s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
1s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
45s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
4s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m  
2s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
43s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
49s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
49s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 39s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch 
generated 2 new + 208 unchanged - 0 fixed = 210 total (was 208) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
59s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
47s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}102m 32s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
17s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}138m 20s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.TestClientProtocolForPipelineRecovery |
|   | hadoop.hdfs.TestLeaseRecoveryStriped |
|   | hadoop.hdfs.TestDFSStripedInputStreamWithRandomECPolicy |
|   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure080 |
|   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure000 |
|   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure |
|   | hadoop.hdfs.TestReadStripedFileWithDecoding |
|   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure120 |
|   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure090 |
|   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure020 |
|   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure140 |
|   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure130 |
|   | hadoop.hdfs.TestMaintenanceState |
|   | hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting |
|   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure030 |
|   | hadoop.hdfs.TestListFilesInFileContext |
|   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure210 |
|   | hadoop.hdfs.TestReconstructStripedFile |
| Timed out junit tests | org.apache.hadoop.hdfs.TestWriteReadStripedFile |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:14b5c93 |
| JIRA Issue | HDFS-12369 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12884362/HDFS-12369.02.patch |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  
unit  findbugs  checkstyle  |
| uname | Linux 2deedb73a66e 3.13.0-117-generic #164-Ubuntu SMP Fri Apr 7 
11:05:26 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 

[jira] [Commented] (HDFS-12369) Edit log corruption due to hard lease recovery of not-closed file

2017-08-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144787#comment-16144787
 ] 

Hadoop QA commented on HDFS-12369:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
25s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 15m 
41s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
51s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
39s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
57s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
56s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
44s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
55s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
55s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 40s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch 
generated 1 new + 184 unchanged - 0 fixed = 185 total (was 184) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
57s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
43s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}119m 13s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
20s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}148m 23s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure060 |
|   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure100 |
|   | hadoop.hdfs.TestReadStripedFileWithMissingBlocks |
|   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure160 |
|   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure130 |
|   | hadoop.hdfs.tools.TestDFSAdminWithHA |
|   | hadoop.hdfs.TestDFSStripedInputStreamWithRandomECPolicy |
|   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure000 |
|   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure180 |
|   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure050 |
|   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure140 |
|   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure190 |
|   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure210 |
|   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure070 |
|   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure170 |
|   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure |
|   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure020 |
|   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure090 |
|   | hadoop.hdfs.TestLeaseRecoveryStriped |
|   | hadoop.hdfs.server.blockmanagement.TestUnderReplicatedBlocks |
|   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure080 |
|   | hadoop.hdfs.server.datanode.TestDirectoryScanner |
|   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure200 |
|   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure040 |
|   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure120 |
|   | hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting |
|   | 

[jira] [Commented] (HDFS-12369) Edit log corruption due to hard lease recovery of not-closed file

2017-08-28 Thread Xiao Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144529#comment-16144529
 ] 

Xiao Chen commented on HDFS-12369:
--

Attaching a unit test that sorta reproduces this - it ends up with an NPE when 
loading {{ReassignLeaseOp}}, instead of the FNFE when loading the {{CloseOp}}.

I'm still looking into this, but wanted to post here for early discussions.
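
For context, the scenario the test drives is roughly the following (a sketch
assuming the usual {{MiniDFSCluster}} test setup, not the attached patch
verbatim):
{code}
// Sketch: open a file under a snapshotted dir, delete it while the lease
// is held, force hard-limit lease recovery, then restart the NN so it
// replays the resulting edits.
DistributedFileSystem dfs = cluster.getFileSystem();
Path dir = new Path("/dir");
dfs.mkdirs(dir);
dfs.allowSnapshot(dir);
Path file = new Path(dir, "file");
FSDataOutputStream out = dfs.create(file);
out.write(new byte[1024]);
out.hflush();
dfs.createSnapshot(dir, "s1");   // keep the inode alive across the delete
dfs.delete(file, false);         // delete without closing the stream
cluster.setLeasePeriod(0, 0);    // expire the soft and hard lease limits
Thread.sleep(2000);              // give the lease monitor a chance to run
cluster.restartNameNode();       // replaying the edits fails here
{code}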






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org