[jira] [Updated] (HDFS-6527) Edit log corruption due to defered INode removal

2014-06-27 Thread Aaron T. Myers (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron T. Myers updated HDFS-6527:
-

Attachment: HDFS-6527-addendum-test.patch

Hey folks,

I've been looking into this a bit and have come to the conclusion that we 
should actually include this fix in 2.4.1. The reason is that though the 
original {{addBlock}} scenario sort of incidentally can't happen in 2.4.0, I 
believe that a similar scenario can happen with a race between {{close}} and 
{{delete}}.

Even though {{close}} doesn't do any sort of dropping of its lock during the 
duration of its RPC, the entirety of a single {{close}} operation can begin and 
end successfully between when the {{delete}} edit log op is logged, and when 
the INode is later removed in the {{delete}} call. See the attached additional 
test case which demonstrates the issue.

This will result in a similarly invalid edit log op sequence wherein you'll see 
an {{OP_ADD}}, {{OP_DELETE}}, and then {{OP_CLOSE}}, which can't be 
successfully replayed by the NN since the {{OP_CLOSE}} will get a 
{{FileNotFound}}. I've seen this happen on two clusters now.

Kihwal/Jing - if you agree with my analysis, let's reopen this JIRA so this fix 
can be included in 2.4.1, though without the {{addBlock}} test case, and with 
only the {{close}} test case.

 Edit log corruption due to defered INode removal
 

 Key: HDFS-6527
 URL: https://issues.apache.org/jira/browse/HDFS-6527
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Kihwal Lee
Assignee: Kihwal Lee
Priority: Blocker
 Fix For: 3.0.0, 2.5.0

 Attachments: HDFS-6527-addendum-test.patch, 
 HDFS-6527.branch-2.4.patch, HDFS-6527.trunk.patch, HDFS-6527.v2.patch, 
 HDFS-6527.v3.patch, HDFS-6527.v4.patch, HDFS-6527.v5.patch


 We have seen a SBN crashing with the following error:
 {panel}
 \[Edit log tailer\] ERROR namenode.FSEditLogLoader:
 Encountered exception on operation AddBlockOp
 [path=/xxx,
 penultimateBlock=NULL, lastBlock=blk_111_111, RpcClientId=,
 RpcCallId=-2]
 java.io.FileNotFoundException: File does not exist: /xxx
 {panel}
 This was caused by the deferred removal of deleted inodes from the inode map. 
 Since getAdditionalBlock() acquires FSN read lock and then write lock, a 
 deletion can happen in between. Because of deferred inode removal outside FSN 
 write lock, getAdditionalBlock() can get the deleted inode from the inode map 
 with FSN write lock held. This allow addition of a block to a deleted file.
 As a result, the edit log will contain OP_ADD, OP_DELETE, followed by
  OP_ADD_BLOCK.  This cannot be replayed by NN, so NN doesn't start up or SBN 
 crashes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HDFS-6527) Edit log corruption due to defered INode removal

2014-06-25 Thread Kihwal Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kihwal Lee updated HDFS-6527:
-

Fix Version/s: (was: 2.4.1)
   2.5.0
   3.0.0

 Edit log corruption due to defered INode removal
 

 Key: HDFS-6527
 URL: https://issues.apache.org/jira/browse/HDFS-6527
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Kihwal Lee
Assignee: Kihwal Lee
Priority: Blocker
 Fix For: 3.0.0, 2.5.0

 Attachments: HDFS-6527.branch-2.4.patch, HDFS-6527.trunk.patch, 
 HDFS-6527.v2.patch, HDFS-6527.v3.patch, HDFS-6527.v4.patch, HDFS-6527.v5.patch


 We have seen a SBN crashing with the following error:
 {panel}
 \[Edit log tailer\] ERROR namenode.FSEditLogLoader:
 Encountered exception on operation AddBlockOp
 [path=/xxx,
 penultimateBlock=NULL, lastBlock=blk_111_111, RpcClientId=,
 RpcCallId=-2]
 java.io.FileNotFoundException: File does not exist: /xxx
 {panel}
 This was caused by the deferred removal of deleted inodes from the inode map. 
 Since getAdditionalBlock() acquires FSN read lock and then write lock, a 
 deletion can happen in between. Because of deferred inode removal outside FSN 
 write lock, getAdditionalBlock() can get the deleted inode from the inode map 
 with FSN write lock held. This allow addition of a block to a deleted file.
 As a result, the edit log will contain OP_ADD, OP_DELETE, followed by
  OP_ADD_BLOCK.  This cannot be replayed by NN, so NN doesn't start up or SBN 
 crashes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HDFS-6527) Edit log corruption due to defered INode removal

2014-06-20 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated HDFS-6527:


Fix Version/s: (was: 2.5.0)
   2.4.1

 Edit log corruption due to defered INode removal
 

 Key: HDFS-6527
 URL: https://issues.apache.org/jira/browse/HDFS-6527
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Kihwal Lee
Assignee: Kihwal Lee
Priority: Blocker
 Fix For: 2.4.1

 Attachments: HDFS-6527.branch-2.4.patch, HDFS-6527.trunk.patch, 
 HDFS-6527.v2.patch, HDFS-6527.v3.patch, HDFS-6527.v4.patch, HDFS-6527.v5.patch


 We have seen a SBN crashing with the following error:
 {panel}
 \[Edit log tailer\] ERROR namenode.FSEditLogLoader:
 Encountered exception on operation AddBlockOp
 [path=/xxx,
 penultimateBlock=NULL, lastBlock=blk_111_111, RpcClientId=,
 RpcCallId=-2]
 java.io.FileNotFoundException: File does not exist: /xxx
 {panel}
 This was caused by the deferred removal of deleted inodes from the inode map. 
 Since getAdditionalBlock() acquires FSN read lock and then write lock, a 
 deletion can happen in between. Because of deferred inode removal outside FSN 
 write lock, getAdditionalBlock() can get the deleted inode from the inode map 
 with FSN write lock held. This allow addition of a block to a deleted file.
 As a result, the edit log will contain OP_ADD, OP_DELETE, followed by
  OP_ADD_BLOCK.  This cannot be replayed by NN, so NN doesn't start up or SBN 
 crashes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HDFS-6527) Edit log corruption due to defered INode removal

2014-06-17 Thread Kihwal Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kihwal Lee updated HDFS-6527:
-

Target Version/s: 2.5.0  (was: 2.4.1)

Changing the target version from 2.4.1 to 2.5.0 since 2.4.1 is already cut.

 Edit log corruption due to defered INode removal
 

 Key: HDFS-6527
 URL: https://issues.apache.org/jira/browse/HDFS-6527
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Kihwal Lee
Assignee: Kihwal Lee
Priority: Blocker
 Attachments: HDFS-6527.branch-2.4.patch, HDFS-6527.trunk.patch, 
 HDFS-6527.v2.patch, HDFS-6527.v3.patch, HDFS-6527.v4.patch, HDFS-6527.v5.patch


 We have seen a SBN crashing with the following error:
 {panel}
 \[Edit log tailer\] ERROR namenode.FSEditLogLoader:
 Encountered exception on operation AddBlockOp
 [path=/xxx,
 penultimateBlock=NULL, lastBlock=blk_111_111, RpcClientId=,
 RpcCallId=-2]
 java.io.FileNotFoundException: File does not exist: /xxx
 {panel}
 This was caused by the deferred removal of deleted inodes from the inode map. 
 Since getAdditionalBlock() acquires FSN read lock and then write lock, a 
 deletion can happen in between. Because of deferred inode removal outside FSN 
 write lock, getAdditionalBlock() can get the deleted inode from the inode map 
 with FSN write lock held. This allow addition of a block to a deleted file.
 As a result, the edit log will contain OP_ADD, OP_DELETE, followed by
  OP_ADD_BLOCK.  This cannot be replayed by NN, so NN doesn't start up or SBN 
 crashes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HDFS-6527) Edit log corruption due to defered INode removal

2014-06-17 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-6527:


   Resolution: Fixed
Fix Version/s: 2.5.0
 Hadoop Flags: Reviewed
   Status: Resolved  (was: Patch Available)

Thanks for the fix, [~kihwal]! I've committed this to trunk and branch-2.

 Edit log corruption due to defered INode removal
 

 Key: HDFS-6527
 URL: https://issues.apache.org/jira/browse/HDFS-6527
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Kihwal Lee
Assignee: Kihwal Lee
Priority: Blocker
 Fix For: 2.5.0

 Attachments: HDFS-6527.branch-2.4.patch, HDFS-6527.trunk.patch, 
 HDFS-6527.v2.patch, HDFS-6527.v3.patch, HDFS-6527.v4.patch, HDFS-6527.v5.patch


 We have seen a SBN crashing with the following error:
 {panel}
 \[Edit log tailer\] ERROR namenode.FSEditLogLoader:
 Encountered exception on operation AddBlockOp
 [path=/xxx,
 penultimateBlock=NULL, lastBlock=blk_111_111, RpcClientId=,
 RpcCallId=-2]
 java.io.FileNotFoundException: File does not exist: /xxx
 {panel}
 This was caused by the deferred removal of deleted inodes from the inode map. 
 Since getAdditionalBlock() acquires FSN read lock and then write lock, a 
 deletion can happen in between. Because of deferred inode removal outside FSN 
 write lock, getAdditionalBlock() can get the deleted inode from the inode map 
 with FSN write lock held. This allow addition of a block to a deleted file.
 As a result, the edit log will contain OP_ADD, OP_DELETE, followed by
  OP_ADD_BLOCK.  This cannot be replayed by NN, so NN doesn't start up or SBN 
 crashes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HDFS-6527) Edit log corruption due to defered INode removal

2014-06-16 Thread Kihwal Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kihwal Lee updated HDFS-6527:
-

Attachment: HDFS-6527.v4.patch

The v4 patch does what you suggested.  Regarding the test code in 
{{FSNamesystem}}, {{delete()}} also needs a delay. We already have fault 
injection in various critical parts of the system.

 Edit log corruption due to defered INode removal
 

 Key: HDFS-6527
 URL: https://issues.apache.org/jira/browse/HDFS-6527
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Kihwal Lee
Assignee: Kihwal Lee
Priority: Blocker
 Attachments: HDFS-6527.branch-2.4.patch, HDFS-6527.trunk.patch, 
 HDFS-6527.v2.patch, HDFS-6527.v3.patch, HDFS-6527.v4.patch


 We have seen a SBN crashing with the following error:
 {panel}
 \[Edit log tailer\] ERROR namenode.FSEditLogLoader:
 Encountered exception on operation AddBlockOp
 [path=/xxx,
 penultimateBlock=NULL, lastBlock=blk_111_111, RpcClientId=,
 RpcCallId=-2]
 java.io.FileNotFoundException: File does not exist: /xxx
 {panel}
 This was caused by the deferred removal of deleted inodes from the inode map. 
 Since getAdditionalBlock() acquires FSN read lock and then write lock, a 
 deletion can happen in between. Because of deferred inode removal outside FSN 
 write lock, getAdditionalBlock() can get the deleted inode from the inode map 
 with FSN write lock held. This allow addition of a block to a deleted file.
 As a result, the edit log will contain OP_ADD, OP_DELETE, followed by
  OP_ADD_BLOCK.  This cannot be replayed by NN, so NN doesn't start up or SBN 
 crashes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HDFS-6527) Edit log corruption due to defered INode removal

2014-06-16 Thread Jing Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-6527:


Attachment: HDFS-6527.v5.patch

Thanks Kihwal! The v4 patch looks good to me. But I guess the unit test now 
cannot cover the non-snapshot case since the inode will not be removed from the 
inodemap if it is still contained in a snapshot.

So based on your v4 patch I added a new unit test to cover both scenarios. Also 
I use a customized block placement policy and use whitebox to add the deleted 
inode back to the inodemap so as to remove the dependency of the fault 
injection code.

 Edit log corruption due to defered INode removal
 

 Key: HDFS-6527
 URL: https://issues.apache.org/jira/browse/HDFS-6527
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Kihwal Lee
Assignee: Kihwal Lee
Priority: Blocker
 Attachments: HDFS-6527.branch-2.4.patch, HDFS-6527.trunk.patch, 
 HDFS-6527.v2.patch, HDFS-6527.v3.patch, HDFS-6527.v4.patch, HDFS-6527.v5.patch


 We have seen a SBN crashing with the following error:
 {panel}
 \[Edit log tailer\] ERROR namenode.FSEditLogLoader:
 Encountered exception on operation AddBlockOp
 [path=/xxx,
 penultimateBlock=NULL, lastBlock=blk_111_111, RpcClientId=,
 RpcCallId=-2]
 java.io.FileNotFoundException: File does not exist: /xxx
 {panel}
 This was caused by the deferred removal of deleted inodes from the inode map. 
 Since getAdditionalBlock() acquires FSN read lock and then write lock, a 
 deletion can happen in between. Because of deferred inode removal outside FSN 
 write lock, getAdditionalBlock() can get the deleted inode from the inode map 
 with FSN write lock held. This allow addition of a block to a deleted file.
 As a result, the edit log will contain OP_ADD, OP_DELETE, followed by
  OP_ADD_BLOCK.  This cannot be replayed by NN, so NN doesn't start up or SBN 
 crashes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HDFS-6527) Edit log corruption due to defered INode removal

2014-06-13 Thread Kihwal Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kihwal Lee updated HDFS-6527:
-

Status: Patch Available  (was: Open)

 Edit log corruption due to defered INode removal
 

 Key: HDFS-6527
 URL: https://issues.apache.org/jira/browse/HDFS-6527
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Kihwal Lee
Assignee: Kihwal Lee
Priority: Blocker
 Attachments: HDFS-6527.branch-2.4.patch, HDFS-6527.trunk.patch


 We have seen a SBN crashing with the following error:
 {panel}
 \[Edit log tailer\] ERROR namenode.FSEditLogLoader:
 Encountered exception on operation AddBlockOp
 [path=/xxx,
 penultimateBlock=NULL, lastBlock=blk_111_111, RpcClientId=,
 RpcCallId=-2]
 java.io.FileNotFoundException: File does not exist: /xxx
 {panel}
 This was caused by the deferred removal of deleted inodes from the inode map. 
 Since startFile() acquires FSN read lock and then write lock, a deletion can 
 happen in between. Because of deferred inode removal outside FSN write lock, 
 startFile() can get the deleted inode from the inode map with FSN write lock 
 held. This allow addition of a block to a deleted file.
 As a result, the edit log will contain OP_ADD, OP_DELETE, followed by
  OP_ADD_BLOCK.  This cannot be replayed by NN, so NN doesn't start up or SBN 
 crashes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HDFS-6527) Edit log corruption due to defered INode removal

2014-06-13 Thread Kihwal Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kihwal Lee updated HDFS-6527:
-

Attachment: HDFS-6527.trunk.patch
HDFS-6527.branch-2.4.patch

 Edit log corruption due to defered INode removal
 

 Key: HDFS-6527
 URL: https://issues.apache.org/jira/browse/HDFS-6527
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Kihwal Lee
Assignee: Kihwal Lee
Priority: Blocker
 Attachments: HDFS-6527.branch-2.4.patch, HDFS-6527.trunk.patch


 We have seen a SBN crashing with the following error:
 {panel}
 \[Edit log tailer\] ERROR namenode.FSEditLogLoader:
 Encountered exception on operation AddBlockOp
 [path=/xxx,
 penultimateBlock=NULL, lastBlock=blk_111_111, RpcClientId=,
 RpcCallId=-2]
 java.io.FileNotFoundException: File does not exist: /xxx
 {panel}
 This was caused by the deferred removal of deleted inodes from the inode map. 
 Since startFile() acquires FSN read lock and then write lock, a deletion can 
 happen in between. Because of deferred inode removal outside FSN write lock, 
 startFile() can get the deleted inode from the inode map with FSN write lock 
 held. This allow addition of a block to a deleted file.
 As a result, the edit log will contain OP_ADD, OP_DELETE, followed by
  OP_ADD_BLOCK.  This cannot be replayed by NN, so NN doesn't start up or SBN 
 crashes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HDFS-6527) Edit log corruption due to defered INode removal

2014-06-13 Thread Kihwal Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kihwal Lee updated HDFS-6527:
-

Affects Version/s: 2.3.0

 Edit log corruption due to defered INode removal
 

 Key: HDFS-6527
 URL: https://issues.apache.org/jira/browse/HDFS-6527
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.3.0, 2.4.0
Reporter: Kihwal Lee
Assignee: Kihwal Lee
Priority: Blocker
 Attachments: HDFS-6527.branch-2.4.patch, HDFS-6527.trunk.patch


 We have seen a SBN crashing with the following error:
 {panel}
 \[Edit log tailer\] ERROR namenode.FSEditLogLoader:
 Encountered exception on operation AddBlockOp
 [path=/xxx,
 penultimateBlock=NULL, lastBlock=blk_111_111, RpcClientId=,
 RpcCallId=-2]
 java.io.FileNotFoundException: File does not exist: /xxx
 {panel}
 This was caused by the deferred removal of deleted inodes from the inode map. 
 Since startFile() acquires FSN read lock and then write lock, a deletion can 
 happen in between. Because of deferred inode removal outside FSN write lock, 
 startFile() can get the deleted inode from the inode map with FSN write lock 
 held. This allow addition of a block to a deleted file.
 As a result, the edit log will contain OP_ADD, OP_DELETE, followed by
  OP_ADD_BLOCK.  This cannot be replayed by NN, so NN doesn't start up or SBN 
 crashes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HDFS-6527) Edit log corruption due to defered INode removal

2014-06-13 Thread Kihwal Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kihwal Lee updated HDFS-6527:
-

Affects Version/s: (was: 2.3.0)

 Edit log corruption due to defered INode removal
 

 Key: HDFS-6527
 URL: https://issues.apache.org/jira/browse/HDFS-6527
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Kihwal Lee
Assignee: Kihwal Lee
Priority: Blocker
 Attachments: HDFS-6527.branch-2.4.patch, HDFS-6527.trunk.patch


 We have seen a SBN crashing with the following error:
 {panel}
 \[Edit log tailer\] ERROR namenode.FSEditLogLoader:
 Encountered exception on operation AddBlockOp
 [path=/xxx,
 penultimateBlock=NULL, lastBlock=blk_111_111, RpcClientId=,
 RpcCallId=-2]
 java.io.FileNotFoundException: File does not exist: /xxx
 {panel}
 This was caused by the deferred removal of deleted inodes from the inode map. 
 Since startFile() acquires FSN read lock and then write lock, a deletion can 
 happen in between. Because of deferred inode removal outside FSN write lock, 
 startFile() can get the deleted inode from the inode map with FSN write lock 
 held. This allow addition of a block to a deleted file.
 As a result, the edit log will contain OP_ADD, OP_DELETE, followed by
  OP_ADD_BLOCK.  This cannot be replayed by NN, so NN doesn't start up or SBN 
 crashes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HDFS-6527) Edit log corruption due to defered INode removal

2014-06-13 Thread Kihwal Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kihwal Lee updated HDFS-6527:
-

Description: 
We have seen a SBN crashing with the following error:
{panel}
\[Edit log tailer\] ERROR namenode.FSEditLogLoader:
Encountered exception on operation AddBlockOp
[path=/xxx,
penultimateBlock=NULL, lastBlock=blk_111_111, RpcClientId=,
RpcCallId=-2]
java.io.FileNotFoundException: File does not exist: /xxx
{panel}

This was caused by the deferred removal of deleted inodes from the inode map. 
Since getAdditionalBlock() acquires FSN read lock and then write lock, a 
deletion can happen in between. Because of deferred inode removal outside FSN 
write lock, getAdditionalBlock() can get the deleted inode from the inode map 
with FSN write lock held. This allow addition of a block to a deleted file.

As a result, the edit log will contain OP_ADD, OP_DELETE, followed by
 OP_ADD_BLOCK.  This cannot be replayed by NN, so NN doesn't start up or SBN 
crashes.


  was:
We have seen a SBN crashing with the following error:
{panel}
\[Edit log tailer\] ERROR namenode.FSEditLogLoader:
Encountered exception on operation AddBlockOp
[path=/xxx,
penultimateBlock=NULL, lastBlock=blk_111_111, RpcClientId=,
RpcCallId=-2]
java.io.FileNotFoundException: File does not exist: /xxx
{panel}

This was caused by the deferred removal of deleted inodes from the inode map. 
Since startFile() acquires FSN read lock and then write lock, a deletion can 
happen in between. Because of deferred inode removal outside FSN write lock, 
startFile() can get the deleted inode from the inode map with FSN write lock 
held. This allow addition of a block to a deleted file.

As a result, the edit log will contain OP_ADD, OP_DELETE, followed by
 OP_ADD_BLOCK.  This cannot be replayed by NN, so NN doesn't start up or SBN 
crashes.



 Edit log corruption due to defered INode removal
 

 Key: HDFS-6527
 URL: https://issues.apache.org/jira/browse/HDFS-6527
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Kihwal Lee
Assignee: Kihwal Lee
Priority: Blocker
 Attachments: HDFS-6527.branch-2.4.patch, HDFS-6527.trunk.patch


 We have seen a SBN crashing with the following error:
 {panel}
 \[Edit log tailer\] ERROR namenode.FSEditLogLoader:
 Encountered exception on operation AddBlockOp
 [path=/xxx,
 penultimateBlock=NULL, lastBlock=blk_111_111, RpcClientId=,
 RpcCallId=-2]
 java.io.FileNotFoundException: File does not exist: /xxx
 {panel}
 This was caused by the deferred removal of deleted inodes from the inode map. 
 Since getAdditionalBlock() acquires FSN read lock and then write lock, a 
 deletion can happen in between. Because of deferred inode removal outside FSN 
 write lock, getAdditionalBlock() can get the deleted inode from the inode map 
 with FSN write lock held. This allow addition of a block to a deleted file.
 As a result, the edit log will contain OP_ADD, OP_DELETE, followed by
  OP_ADD_BLOCK.  This cannot be replayed by NN, so NN doesn't start up or SBN 
 crashes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HDFS-6527) Edit log corruption due to defered INode removal

2014-06-13 Thread Kihwal Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kihwal Lee updated HDFS-6527:
-

Status: Open  (was: Patch Available)

 Edit log corruption due to defered INode removal
 

 Key: HDFS-6527
 URL: https://issues.apache.org/jira/browse/HDFS-6527
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Kihwal Lee
Assignee: Kihwal Lee
Priority: Blocker
 Attachments: HDFS-6527.branch-2.4.patch, HDFS-6527.trunk.patch


 We have seen a SBN crashing with the following error:
 {panel}
 \[Edit log tailer\] ERROR namenode.FSEditLogLoader:
 Encountered exception on operation AddBlockOp
 [path=/xxx,
 penultimateBlock=NULL, lastBlock=blk_111_111, RpcClientId=,
 RpcCallId=-2]
 java.io.FileNotFoundException: File does not exist: /xxx
 {panel}
 This was caused by the deferred removal of deleted inodes from the inode map. 
 Since getAdditionalBlock() acquires FSN read lock and then write lock, a 
 deletion can happen in between. Because of deferred inode removal outside FSN 
 write lock, getAdditionalBlock() can get the deleted inode from the inode map 
 with FSN write lock held. This allow addition of a block to a deleted file.
 As a result, the edit log will contain OP_ADD, OP_DELETE, followed by
  OP_ADD_BLOCK.  This cannot be replayed by NN, so NN doesn't start up or SBN 
 crashes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HDFS-6527) Edit log corruption due to defered INode removal

2014-06-13 Thread Kihwal Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kihwal Lee updated HDFS-6527:
-

Status: Patch Available  (was: Open)

 Edit log corruption due to defered INode removal
 

 Key: HDFS-6527
 URL: https://issues.apache.org/jira/browse/HDFS-6527
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Kihwal Lee
Assignee: Kihwal Lee
Priority: Blocker
 Attachments: HDFS-6527.branch-2.4.patch, HDFS-6527.trunk.patch, 
 HDFS-6527.v2.patch


 We have seen a SBN crashing with the following error:
 {panel}
 \[Edit log tailer\] ERROR namenode.FSEditLogLoader:
 Encountered exception on operation AddBlockOp
 [path=/xxx,
 penultimateBlock=NULL, lastBlock=blk_111_111, RpcClientId=,
 RpcCallId=-2]
 java.io.FileNotFoundException: File does not exist: /xxx
 {panel}
 This was caused by the deferred removal of deleted inodes from the inode map. 
 Since getAdditionalBlock() acquires FSN read lock and then write lock, a 
 deletion can happen in between. Because of deferred inode removal outside FSN 
 write lock, getAdditionalBlock() can get the deleted inode from the inode map 
 with FSN write lock held. This allow addition of a block to a deleted file.
 As a result, the edit log will contain OP_ADD, OP_DELETE, followed by
  OP_ADD_BLOCK.  This cannot be replayed by NN, so NN doesn't start up or SBN 
 crashes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HDFS-6527) Edit log corruption due to defered INode removal

2014-06-13 Thread Kihwal Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kihwal Lee updated HDFS-6527:
-

Attachment: HDFS-6527.v2.patch

The new patch simply checks the parent of inode against null.  This is done in 
checkLease(), which is called by getAdditionalBlock() after acquring the FSN 
writelock.

Also added is a new test case that reproduces the race between delete() and 
getAdditionalBlock(). Without the change in checkLease(), the test case fails. 
Its failure means getAdditionalBlock() was successful even after delete(). This 
causes the problematic edit log sequence.

 Edit log corruption due to defered INode removal
 

 Key: HDFS-6527
 URL: https://issues.apache.org/jira/browse/HDFS-6527
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Kihwal Lee
Assignee: Kihwal Lee
Priority: Blocker
 Attachments: HDFS-6527.branch-2.4.patch, HDFS-6527.trunk.patch, 
 HDFS-6527.v2.patch


 We have seen a SBN crashing with the following error:
 {panel}
 \[Edit log tailer\] ERROR namenode.FSEditLogLoader:
 Encountered exception on operation AddBlockOp
 [path=/xxx,
 penultimateBlock=NULL, lastBlock=blk_111_111, RpcClientId=,
 RpcCallId=-2]
 java.io.FileNotFoundException: File does not exist: /xxx
 {panel}
 This was caused by the deferred removal of deleted inodes from the inode map. 
 Since getAdditionalBlock() acquires FSN read lock and then write lock, a 
 deletion can happen in between. Because of deferred inode removal outside FSN 
 write lock, getAdditionalBlock() can get the deleted inode from the inode map 
 with FSN write lock held. This allow addition of a block to a deleted file.
 As a result, the edit log will contain OP_ADD, OP_DELETE, followed by
  OP_ADD_BLOCK.  This cannot be replayed by NN, so NN doesn't start up or SBN 
 crashes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HDFS-6527) Edit log corruption due to defered INode removal

2014-06-13 Thread Kihwal Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kihwal Lee updated HDFS-6527:
-

Status: Open  (was: Patch Available)

 Edit log corruption due to defered INode removal
 

 Key: HDFS-6527
 URL: https://issues.apache.org/jira/browse/HDFS-6527
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Kihwal Lee
Assignee: Kihwal Lee
Priority: Blocker
 Attachments: HDFS-6527.branch-2.4.patch, HDFS-6527.trunk.patch, 
 HDFS-6527.v2.patch


 We have seen a SBN crashing with the following error:
 {panel}
 \[Edit log tailer\] ERROR namenode.FSEditLogLoader:
 Encountered exception on operation AddBlockOp
 [path=/xxx,
 penultimateBlock=NULL, lastBlock=blk_111_111, RpcClientId=,
 RpcCallId=-2]
 java.io.FileNotFoundException: File does not exist: /xxx
 {panel}
 This was caused by the deferred removal of deleted inodes from the inode map. 
 Since getAdditionalBlock() acquires FSN read lock and then write lock, a 
 deletion can happen in between. Because of deferred inode removal outside FSN 
 write lock, getAdditionalBlock() can get the deleted inode from the inode map 
 with FSN write lock held. This allow addition of a block to a deleted file.
 As a result, the edit log will contain OP_ADD, OP_DELETE, followed by
  OP_ADD_BLOCK.  This cannot be replayed by NN, so NN doesn't start up or SBN 
 crashes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HDFS-6527) Edit log corruption due to defered INode removal

2014-06-13 Thread Kihwal Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kihwal Lee updated HDFS-6527:
-

Status: Patch Available  (was: Open)

 Edit log corruption due to defered INode removal
 

 Key: HDFS-6527
 URL: https://issues.apache.org/jira/browse/HDFS-6527
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Kihwal Lee
Assignee: Kihwal Lee
Priority: Blocker
 Attachments: HDFS-6527.branch-2.4.patch, HDFS-6527.trunk.patch, 
 HDFS-6527.v2.patch


 We have seen a SBN crashing with the following error:
 {panel}
 \[Edit log tailer\] ERROR namenode.FSEditLogLoader:
 Encountered exception on operation AddBlockOp
 [path=/xxx,
 penultimateBlock=NULL, lastBlock=blk_111_111, RpcClientId=,
 RpcCallId=-2]
 java.io.FileNotFoundException: File does not exist: /xxx
 {panel}
 This was caused by the deferred removal of deleted inodes from the inode map. 
 Since getAdditionalBlock() acquires FSN read lock and then write lock, a 
 deletion can happen in between. Because of deferred inode removal outside FSN 
 write lock, getAdditionalBlock() can get the deleted inode from the inode map 
 with FSN write lock held. This allow addition of a block to a deleted file.
 As a result, the edit log will contain OP_ADD, OP_DELETE, followed by
  OP_ADD_BLOCK.  This cannot be replayed by NN, so NN doesn't start up or SBN 
 crashes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HDFS-6527) Edit log corruption due to defered INode removal

2014-06-13 Thread Kihwal Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kihwal Lee updated HDFS-6527:
-

Attachment: HDFS-6527.v3.patch

The new v3 patch implements what I suggested above. It nulls out the client 
name field.  Any further client actions against the file will be rejected.  
Also fixed the javac warning caused by the use of the deprecated delete() 
method in the new test case.

 Edit log corruption due to defered INode removal
 

 Key: HDFS-6527
 URL: https://issues.apache.org/jira/browse/HDFS-6527
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Kihwal Lee
Assignee: Kihwal Lee
Priority: Blocker
 Attachments: HDFS-6527.branch-2.4.patch, HDFS-6527.trunk.patch, 
 HDFS-6527.v2.patch, HDFS-6527.v3.patch


 We have seen a SBN crashing with the following error:
 {panel}
 \[Edit log tailer\] ERROR namenode.FSEditLogLoader:
 Encountered exception on operation AddBlockOp
 [path=/xxx,
 penultimateBlock=NULL, lastBlock=blk_111_111, RpcClientId=,
 RpcCallId=-2]
 java.io.FileNotFoundException: File does not exist: /xxx
 {panel}
 This was caused by the deferred removal of deleted inodes from the inode map. 
 Since getAdditionalBlock() acquires FSN read lock and then write lock, a 
 deletion can happen in between. Because of deferred inode removal outside FSN 
 write lock, getAdditionalBlock() can get the deleted inode from the inode map 
 with FSN write lock held. This allow addition of a block to a deleted file.
 As a result, the edit log will contain OP_ADD, OP_DELETE, followed by
  OP_ADD_BLOCK.  This cannot be replayed by NN, so NN doesn't start up or SBN 
 crashes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)