[ https://issues.apache.org/jira/browse/HDFS-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Aaron T. Myers updated HDFS-6527:
---------------------------------
    Attachment: HDFS-6527-addendum-test.patch

Hey folks,

I've been looking into this and have concluded that we should include this fix in 2.4.1 after all. Although the original {{addBlock}} scenario incidentally can't happen in 2.4.0, I believe a similar scenario can happen through a race between {{close}} and {{delete}}. Even though {{close}} never drops its lock during its RPC, an entire {{close}} operation can begin and end successfully between the point where the {{delete}} edit log op is logged and the point where the INode is later removed in the {{delete}} call. The attached additional test case demonstrates the issue.

This results in a similarly invalid edit log op sequence: an {{OP_ADD}}, an {{OP_DELETE}}, and then an {{OP_CLOSE}}. The NN can't replay it successfully, since the {{OP_CLOSE}} will get a {{FileNotFound}}. I've seen this happen on two clusters now.

Kihwal/Jing - if you agree with my analysis, let's reopen this JIRA so this fix can be included in 2.4.1, with only the {{close}} test case and without the {{addBlock}} test case.
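To make the replay failure concrete, here is a minimal toy sketch of why an {{OP_ADD}}, {{OP_DELETE}}, {{OP_CLOSE}} sequence can't be replayed. It is not the code from the attached patch and not Hadoop's {{FSEditLogLoader}}; the class {{EditLogReplaySketch}}, its {{Op}} type, and the set-of-paths namespace are all assumptions made purely for illustration.

```java
import java.io.FileNotFoundException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of edit log replay. Hypothetical; not Hadoop's FSEditLogLoader. */
final class EditLogReplaySketch {

    /** Op codes named after the ones discussed above; the enum itself is illustrative. */
    enum OpCode { OP_ADD, OP_DELETE, OP_CLOSE }

    /** One logged operation against a single path. */
    static final class Op {
        final OpCode code;
        final String path;

        Op(OpCode code, String path) {
            this.code = code;
            this.path = path;
        }
    }

    /** Applies ops in log order to a namespace modelled as a plain set of paths. */
    static Set<String> replay(List<Op> log) throws FileNotFoundException {
        Set<String> namespace = new HashSet<>();
        for (Op op : log) {
            switch (op.code) {
                case OP_ADD:
                    namespace.add(op.path);
                    break;
                case OP_DELETE:
                    namespace.remove(op.path);
                    break;
                case OP_CLOSE:
                    // A close can only be applied if the file still exists at this point in the log.
                    if (!namespace.contains(op.path)) {
                        throw new FileNotFoundException("File does not exist: " + op.path);
                    }
                    break;
            }
        }
        return namespace;
    }

    /** Returns true if the log replays cleanly, false if replay hits a missing file. */
    static boolean replays(List<Op> log) {
        try {
            replay(log);
            return true;
        } catch (FileNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Expected order: the close is logged before the delete.
        List<Op> valid = Arrays.asList(
            new Op(OpCode.OP_ADD, "/xxx"),
            new Op(OpCode.OP_CLOSE, "/xxx"),
            new Op(OpCode.OP_DELETE, "/xxx"));
        // Racy order: the close slips in after the delete op has already been logged.
        List<Op> racy = Arrays.asList(
            new Op(OpCode.OP_ADD, "/xxx"),
            new Op(OpCode.OP_DELETE, "/xxx"),
            new Op(OpCode.OP_CLOSE, "/xxx"));
        System.out.println("valid order replays: " + replays(valid));
        System.out.println("racy order replays: " + replays(racy));
    }
}
```

The sketch only models the replay side. The live race itself (the {{close}} finding the INode after the {{delete}} op was logged but before the INode was removed) is what the attached test case exercises.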
> Edit log corruption due to defered INode removal
> ------------------------------------------------
>
>                 Key: HDFS-6527
>                 URL: https://issues.apache.org/jira/browse/HDFS-6527
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.4.0
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>            Priority: Blocker
>             Fix For: 3.0.0, 2.5.0
>
>         Attachments: HDFS-6527-addendum-test.patch,
> HDFS-6527.branch-2.4.patch, HDFS-6527.trunk.patch, HDFS-6527.v2.patch,
> HDFS-6527.v3.patch, HDFS-6527.v4.patch, HDFS-6527.v5.patch
>
>
> We have seen an SBN crash with the following error:
> {panel}
> \[Edit log tailer\] ERROR namenode.FSEditLogLoader:
> Encountered exception on operation AddBlockOp
> [path=/xxx,
> penultimateBlock=NULL, lastBlock=blk_111_111, RpcClientId=,
> RpcCallId=-2]
> java.io.FileNotFoundException: File does not exist: /xxx
> {panel}
> This was caused by the deferred removal of deleted inodes from the inode map.
> Since getAdditionalBlock() acquires the FSN read lock and then the write lock, a
> deletion can happen in between. Because the deferred inode removal happens outside
> the FSN write lock, getAdditionalBlock() can get the deleted inode from the inode
> map with the FSN write lock held. This allows a block to be added to a deleted file.
> As a result, the edit log will contain OP_ADD, OP_DELETE, followed by
> OP_ADD_BLOCK. This cannot be replayed by the NN, so the NN doesn't start up or the
> SBN crashes.

--
This message was sent by Atlassian JIRA
(v6.2#6252)