[jira] [Updated] (HDFS-6527) Edit log corruption due to defered INode removal
[ https://issues.apache.org/jira/browse/HDFS-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron T. Myers updated HDFS-6527:
---------------------------------
    Attachment: HDFS-6527-addendum-test.patch

Hey folks, I've been looking into this a bit and have come to the conclusion that we should actually include this fix in 2.4.1. The reason is that though the original {{addBlock}} scenario sort of incidentally can't happen in 2.4.0, I believe that a similar scenario can happen with a race between {{close}} and {{delete}}. Even though {{close}} doesn't do any sort of dropping of its lock during the duration of its RPC, the entirety of a single {{close}} operation can begin and end successfully between when the {{delete}} edit log op is logged, and when the INode is later removed in the {{delete}} call. See the attached additional test case which demonstrates the issue.

This will result in a similarly invalid edit log op sequence wherein you'll see an {{OP_ADD}}, {{OP_DELETE}}, and then {{OP_CLOSE}}, which can't be successfully replayed by the NN since the {{OP_CLOSE}} will get a {{FileNotFound}}. I've seen this happen on two clusters now.

Kihwal/Jing - if you agree with my analysis, let's reopen this JIRA so this fix can be included in 2.4.1, though without the {{addBlock}} test case, and with only the {{close}} test case.
> Edit log corruption due to defered INode removal
> ------------------------------------------------
>
> Key: HDFS-6527
> URL: https://issues.apache.org/jira/browse/HDFS-6527
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.4.0
> Reporter: Kihwal Lee
> Assignee: Kihwal Lee
> Priority: Blocker
> Fix For: 3.0.0, 2.5.0
> Attachments: HDFS-6527-addendum-test.patch, HDFS-6527.branch-2.4.patch, HDFS-6527.trunk.patch, HDFS-6527.v2.patch, HDFS-6527.v3.patch, HDFS-6527.v4.patch, HDFS-6527.v5.patch
>
> We have seen an SBN crashing with the following error:
> {panel}
> \[Edit log tailer\] ERROR namenode.FSEditLogLoader:
> Encountered exception on operation AddBlockOp [path=/xxx, penultimateBlock=NULL, lastBlock=blk_111_111, RpcClientId=, RpcCallId=-2]
> java.io.FileNotFoundException: File does not exist: /xxx
> {panel}
> This was caused by the deferred removal of deleted inodes from the inode map.
> Since getAdditionalBlock() acquires FSN read lock and then write lock, a
> deletion can happen in between. Because of deferred inode removal outside FSN
> write lock, getAdditionalBlock() can get the deleted inode from the inode map
> with FSN write lock held. This allows the addition of a block to a deleted file.
> As a result, the edit log will contain OP_ADD, OP_DELETE, followed by
> OP_ADD_BLOCK. This cannot be replayed by NN, so NN doesn't start up or SBN
> crashes.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
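
To make the replay failure concrete, here is a minimal sketch of why an op sequence such as OP_ADD, OP_DELETE, OP_ADD_BLOCK (or OP_ADD, OP_DELETE, OP_CLOSE) cannot be replayed. It is a toy model only, not the real FSEditLogLoader: the class EditLogReplaySketch, its methods, and the in-memory namespace set are hypothetical names introduced for illustration, and the only behavior assumed is the one described above, namely that an op against a path that no longer exists fails with a FileNotFoundException.

```java
import java.io.FileNotFoundException;
import java.util.HashSet;
import java.util.Set;

// Toy model of edit log replay. This is NOT the real FSEditLogLoader;
// all class and method names here are hypothetical.
class EditLogReplaySketch {
    enum OpCode { OP_ADD, OP_DELETE, OP_ADD_BLOCK, OP_CLOSE }

    // Replays the given ops, all against the same path, on an empty namespace.
    static void replay(String path, OpCode... ops) throws FileNotFoundException {
        Set<String> namespace = new HashSet<>();
        for (OpCode op : ops) {
            switch (op) {
                case OP_ADD:
                    namespace.add(path);
                    break;
                case OP_DELETE:
                    namespace.remove(path);
                    break;
                default:
                    // OP_ADD_BLOCK and OP_CLOSE both need the file to exist.
                    if (!namespace.contains(path)) {
                        throw new FileNotFoundException("File does not exist: " + path);
                    }
            }
        }
    }

    // Returns true if the sequence replays cleanly, false if replay fails.
    static boolean replays(OpCode... ops) {
        try {
            replay("/xxx", ops);
            return true;
        } catch (FileNotFoundException e) {
            return false;
        }
    }
}
```

In this model, replays(OP_ADD, OP_ADD_BLOCK, OP_DELETE) succeeds, while the two sequences reported in this issue, with OP_ADD_BLOCK or OP_CLOSE logged after OP_DELETE, both fail.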
[jira] [Updated] (HDFS-6527) Edit log corruption due to defered INode removal
[ https://issues.apache.org/jira/browse/HDFS-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kihwal Lee updated HDFS-6527:
-----------------------------
    Fix Version/s:     (was: 2.4.1)
                       2.5.0
                       3.0.0
[jira] [Updated] (HDFS-6527) Edit log corruption due to defered INode removal
[ https://issues.apache.org/jira/browse/HDFS-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HDFS-6527:
--------------------------------
    Fix Version/s:     (was: 2.5.0)
                       2.4.1
[jira] [Updated] (HDFS-6527) Edit log corruption due to defered INode removal
[ https://issues.apache.org/jira/browse/HDFS-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jing Zhao updated HDFS-6527:
----------------------------
       Resolution: Fixed
    Fix Version/s: 2.5.0
     Hadoop Flags: Reviewed
           Status: Resolved  (was: Patch Available)

Thanks for the fix, [~kihwal]! I've committed this to trunk and branch-2.
[jira] [Updated] (HDFS-6527) Edit log corruption due to defered INode removal
[ https://issues.apache.org/jira/browse/HDFS-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kihwal Lee updated HDFS-6527:
-----------------------------
    Target Version/s: 2.5.0  (was: 2.4.1)

Changing the target version from 2.4.1 to 2.5.0 since 2.4.1 is already cut.
[jira] [Updated] (HDFS-6527) Edit log corruption due to defered INode removal
[ https://issues.apache.org/jira/browse/HDFS-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jing Zhao updated HDFS-6527:
----------------------------
    Attachment: HDFS-6527.v5.patch

Thanks Kihwal! The v4 patch looks good to me. But I guess the unit test now cannot cover the non-snapshot case, since the inode will not be removed from the inodemap if it is still contained in a snapshot. So based on your v4 patch I added a new unit test to cover both scenarios. I also use a customized block placement policy and use whitebox to add the deleted inode back to the inodemap, so as to remove the dependency on the fault injection code.
[jira] [Updated] (HDFS-6527) Edit log corruption due to defered INode removal
[ https://issues.apache.org/jira/browse/HDFS-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kihwal Lee updated HDFS-6527:
-----------------------------
    Attachment: HDFS-6527.v4.patch

The v4 patch does what you suggested. Regarding the test code in {{FSNamesystem}}, {{delete()}} also needs a delay. We already have fault injection in various critical parts of the system.
[jira] [Updated] (HDFS-6527) Edit log corruption due to defered INode removal
[ https://issues.apache.org/jira/browse/HDFS-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kihwal Lee updated HDFS-6527:
-----------------------------
    Attachment: HDFS-6527.v3.patch

The new v3 patch implements what I suggested above. It nulls out the client name field. Any further client actions against the file will be rejected. Also fixed the javac warning caused by the use of the deprecated delete() method in the new test case.
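
The idea behind the v3 patch, nulling out the client name so that later client actions are rejected, can be sketched as follows. This is a simplified illustration under stated assumptions, not the patch itself: the class ClientNameSketch, the OpenFile type, markDeleted(), and the use of IllegalStateException are all hypothetical stand-ins, and the real lease check in FSNamesystem is more involved.

```java
// Toy model of rejecting client actions once a file's client name is cleared.
// All names here are hypothetical; this is not the real FSNamesystem code.
class ClientNameSketch {
    // Stand-in for a file that is open for write by one client.
    static final class OpenFile {
        private String clientName;

        OpenFile(String clientName) {
            this.clientName = clientName;
        }

        // Called as part of delete: clear the client name so that
        // no holder can match it any more.
        void markDeleted() {
            clientName = null;
        }

        boolean isHeldBy(String holder) {
            return holder != null && holder.equals(clientName);
        }
    }

    // Stand-in for the lease check made before addBlock, close, and so on.
    static void checkLease(OpenFile file, String holder) {
        if (!file.isHeldBy(holder)) {
            throw new IllegalStateException("No lease held by " + holder);
        }
    }

    // Returns true if the client would be allowed to act on the file.
    static boolean clientMayWrite(OpenFile file, String holder) {
        try {
            checkLease(file, holder);
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }
}
```

In this model a client that opened the file passes the check until markDeleted() runs. After that every holder is rejected, even if the file object is still reachable through some other structure.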
[jira] [Updated] (HDFS-6527) Edit log corruption due to defered INode removal
[ https://issues.apache.org/jira/browse/HDFS-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kihwal Lee updated HDFS-6527:
-----------------------------
    Status: Patch Available  (was: Open)
[jira] [Updated] (HDFS-6527) Edit log corruption due to defered INode removal
[ https://issues.apache.org/jira/browse/HDFS-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kihwal Lee updated HDFS-6527:
-----------------------------
    Status: Open  (was: Patch Available)
[jira] [Updated] (HDFS-6527) Edit log corruption due to defered INode removal
[ https://issues.apache.org/jira/browse/HDFS-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kihwal Lee updated HDFS-6527:
-----------------------------
    Attachment: HDFS-6527.v2.patch

The new patch simply checks the parent of the inode against null. This is done in checkLease(), which is called by getAdditionalBlock() after acquiring the FSN write lock. Also added is a new test case that reproduces the race between delete() and getAdditionalBlock(). Without the change in checkLease(), the test case fails. Its failure means getAdditionalBlock() was successful even after delete(). This causes the problematic edit log sequence.
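
As a rough illustration of the v2 approach, the sketch below models an inode that has been detached from its parent by delete() but is still present in the inode map because removal from the map is deferred. It is a toy, single-threaded model under stated assumptions, not the actual patch: DeferredRemovalSketch, Inode, detachFromParent(), and the checkParent flag are hypothetical names, and locking is omitted entirely.

```java
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.Map;

// Toy model of deferred inode removal. Names are hypothetical and locking
// is left out; this is not the real FSNamesystem or FSDirectory code.
class DeferredRemovalSketch {
    static final class Inode {
        final long id;
        Inode parent; // null once the inode is detached from the tree

        Inode(long id, Inode parent) {
            this.id = id;
            this.parent = parent;
        }
    }

    private final Map<Long, Inode> inodeMap = new HashMap<>();
    private final Inode dir = new Inode(1L, null); // stand-in parent directory

    Inode createFile(long id) {
        Inode file = new Inode(id, dir);
        inodeMap.put(id, file);
        return file;
    }

    // First half of delete: detach from the tree (OP_DELETE is logged here).
    void detachFromParent(Inode file) {
        file.parent = null;
    }

    // Second half of delete: the deferred cleanup of the inode map.
    void removeFromInodeMap(Inode file) {
        inodeMap.remove(file.id);
    }

    // Stand-in for checkLease(); checkParent toggles the extra null check.
    Inode checkLease(long fileId, boolean checkParent) throws FileNotFoundException {
        Inode file = inodeMap.get(fileId);
        if (file == null) {
            throw new FileNotFoundException("No inode with id " + fileId);
        }
        if (checkParent && file.parent == null) {
            throw new FileNotFoundException("Inode " + fileId + " was deleted");
        }
        return file;
    }

    // Simulates addBlock arriving between the two halves of delete.
    static boolean addBlockAcceptedAfterDelete(boolean checkParent) {
        DeferredRemovalSketch fs = new DeferredRemovalSketch();
        Inode file = fs.createFile(2L);
        fs.detachFromParent(file); // delete has run, map cleanup still pending
        try {
            fs.checkLease(2L, checkParent);
            return true;
        } catch (FileNotFoundException e) {
            return false;
        } finally {
            fs.removeFromInodeMap(file);
        }
    }
}
```

Without the parent check the stale inode is accepted, which corresponds to the problematic edit log sequence. With the check the request is rejected even though the inode map still holds the entry.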
[jira] [Updated] (HDFS-6527) Edit log corruption due to defered INode removal
[ https://issues.apache.org/jira/browse/HDFS-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kihwal Lee updated HDFS-6527:
-----------------------------
    Status: Patch Available  (was: Open)
[jira] [Updated] (HDFS-6527) Edit log corruption due to defered INode removal
[ https://issues.apache.org/jira/browse/HDFS-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kihwal Lee updated HDFS-6527:
-----------------------------
    Status: Open  (was: Patch Available)
[jira] [Updated] (HDFS-6527) Edit log corruption due to defered INode removal
[ https://issues.apache.org/jira/browse/HDFS-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kihwal Lee updated HDFS-6527:
-----------------------------
    Description:
We have seen an SBN crashing with the following error:
{panel}
\[Edit log tailer\] ERROR namenode.FSEditLogLoader:
Encountered exception on operation AddBlockOp [path=/xxx, penultimateBlock=NULL, lastBlock=blk_111_111, RpcClientId=, RpcCallId=-2]
java.io.FileNotFoundException: File does not exist: /xxx
{panel}

This was caused by the deferred removal of deleted inodes from the inode map. Since getAdditionalBlock() acquires FSN read lock and then write lock, a deletion can happen in between. Because of deferred inode removal outside FSN write lock, getAdditionalBlock() can get the deleted inode from the inode map with FSN write lock held. This allows the addition of a block to a deleted file.

As a result, the edit log will contain OP_ADD, OP_DELETE, followed by OP_ADD_BLOCK. This cannot be replayed by NN, so NN doesn't start up or SBN crashes.

  was:
We have seen an SBN crashing with the following error:
{panel}
\[Edit log tailer\] ERROR namenode.FSEditLogLoader:
Encountered exception on operation AddBlockOp [path=/xxx, penultimateBlock=NULL, lastBlock=blk_111_111, RpcClientId=, RpcCallId=-2]
java.io.FileNotFoundException: File does not exist: /xxx
{panel}

This was caused by the deferred removal of deleted inodes from the inode map. Since startFile() acquires FSN read lock and then write lock, a deletion can happen in between. Because of deferred inode removal outside FSN write lock, startFile() can get the deleted inode from the inode map with FSN write lock held. This allows the addition of a block to a deleted file.

As a result, the edit log will contain OP_ADD, OP_DELETE, followed by OP_ADD_BLOCK. This cannot be replayed by NN, so NN doesn't start up or SBN crashes.
[jira] [Updated] (HDFS-6527) Edit log corruption due to defered INode removal
[ https://issues.apache.org/jira/browse/HDFS-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kihwal Lee updated HDFS-6527:
-----------------------------
    Affects Version/s:     (was: 2.3.0)
[jira] [Updated] (HDFS-6527) Edit log corruption due to defered INode removal
[ https://issues.apache.org/jira/browse/HDFS-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kihwal Lee updated HDFS-6527:
-----------------------------
    Affects Version/s: 2.3.0
[jira] [Updated] (HDFS-6527) Edit log corruption due to defered INode removal
[ https://issues.apache.org/jira/browse/HDFS-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kihwal Lee updated HDFS-6527:
-----------------------------
    Attachment: HDFS-6527.trunk.patch
                HDFS-6527.branch-2.4.patch
[jira] [Updated] (HDFS-6527) Edit log corruption due to defered INode removal
[ https://issues.apache.org/jira/browse/HDFS-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kihwal Lee updated HDFS-6527:
-----------------------------
    Status: Patch Available  (was: Open)