[ 
https://issues.apache.org/jira/browse/HDFS-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron T. Myers updated HDFS-6527:
---------------------------------

    Attachment: HDFS-6527-addendum-test.patch

Hey folks,

I've been looking into this and have concluded that we should include this 
fix in 2.4.1. Although the original {{addBlock}} scenario happens not to be 
possible in 2.4.0, I believe a similar scenario can arise from a race between 
{{close}} and {{delete}}.

Even though {{close}} never drops its lock during its RPC, an entire {{close}} 
operation can begin and end successfully between the point where the 
{{delete}} edit log op is logged and the point where the INode is later 
removed in the {{delete}} call. The attached additional test case 
demonstrates the issue.
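
To make the window concrete, here is a minimal toy model of the interleaving 
described above. It is not HDFS code and not the attached test: the class, 
the method names, and the split of {{delete}} into two phases are 
hypothetical, and locking is only indicated in comments. It just shows that 
if the lookup in {{close}} goes through the inode map, a {{close}} that runs 
after the {{delete}} op is logged but before the inode is removed still 
succeeds and gets logged.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only; every name here is hypothetical and none of it is HDFS code.
class DeferredRemovalSketch {
    // path -> "is the file still open for write"
    final Map<String, Boolean> inodeMap = new HashMap<>();
    final List<String> editLog = new ArrayList<>();

    void create(String path) {
        inodeMap.put(path, true);
        editLog.add("OP_ADD " + path);
    }

    // Phase 1 of delete: would run under the write lock; logs the op.
    void deleteLogOnly(String path) {
        editLog.add("OP_DELETE " + path);
    }

    // Phase 2 of delete: the deferred removal from the inode map.
    void deleteRemoveInode(String path) {
        inodeMap.remove(path);
    }

    // close resolves the file through the inode map, so it still
    // finds an inode whose removal has been deferred.
    boolean close(String path) {
        if (!inodeMap.containsKey(path)) {
            return false;
        }
        inodeMap.put(path, false);
        editLog.add("OP_CLOSE " + path);
        return true;
    }

    // Replays the problematic ordering step by step, single-threaded.
    static List<String> racyInterleaving() {
        DeferredRemovalSketch ns = new DeferredRemovalSketch();
        ns.create("/xxx");
        ns.deleteLogOnly("/xxx");      // delete op is logged
        ns.close("/xxx");              // a whole close fits in the window
        ns.deleteRemoveInode("/xxx");  // inode is removed only now
        return ns.editLog;
    }

    public static void main(String[] args) {
        for (String op : racyInterleaving()) {
            System.out.println(op);
        }
    }
}
```

In this model, moving the inode removal into the same step that logs the 
delete closes the window, since {{close}} then no longer finds the file.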

This results in a similarly invalid edit log op sequence: an {{OP_ADD}}, an 
{{OP_DELETE}}, and then an {{OP_CLOSE}}. The NN can't replay this 
successfully, since the {{OP_CLOSE}} will get a {{FileNotFound}}. I've seen 
this happen on two clusters now.
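
As a rough illustration of why that sequence can't be replayed, the sketch 
below applies a list of ops in order to an empty set of paths. It is a 
simplified stand-in, not the real edit log loader: the class, the method 
names, and the representation of ops as string pairs are assumptions made 
for this example.

```java
import java.io.FileNotFoundException;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified stand-in for edit log replay; all names are hypothetical.
class EditLogReplaySketch {
    // Each op is {opcode, path}. Ops are applied in order to an empty namespace.
    static void replay(List<String[]> ops) throws FileNotFoundException {
        Set<String> files = new HashSet<>();
        for (String[] op : ops) {
            String code = op[0];
            String path = op[1];
            switch (code) {
                case "OP_ADD":
                    files.add(path);
                    break;
                case "OP_DELETE":
                    files.remove(path);
                    break;
                case "OP_CLOSE":
                    // Closing a file that is no longer in the namespace fails.
                    if (!files.contains(path)) {
                        throw new FileNotFoundException("File does not exist: " + path);
                    }
                    break;
                default:
                    throw new IllegalArgumentException("Unknown op: " + code);
            }
        }
    }

    // Convenience wrapper: true if the whole sequence replays cleanly.
    static boolean replays(List<String[]> ops) {
        try {
            replay(ops);
            return true;
        } catch (FileNotFoundException e) {
            return false;
        }
    }
}
```

With this model, {{OP_ADD}}, {{OP_CLOSE}}, {{OP_DELETE}} replays cleanly, 
while {{OP_ADD}}, {{OP_DELETE}}, {{OP_CLOSE}} fails at the last op.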

Kihwal/Jing - if you agree with my analysis, let's reopen this JIRA so this fix 
can be included in 2.4.1, with only the {{close}} test case and without the 
{{addBlock}} test case.

> Edit log corruption due to deferred INode removal
> -------------------------------------------------
>
>                 Key: HDFS-6527
>                 URL: https://issues.apache.org/jira/browse/HDFS-6527
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.4.0
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>            Priority: Blocker
>             Fix For: 3.0.0, 2.5.0
>
>         Attachments: HDFS-6527-addendum-test.patch, 
> HDFS-6527.branch-2.4.patch, HDFS-6527.trunk.patch, HDFS-6527.v2.patch, 
> HDFS-6527.v3.patch, HDFS-6527.v4.patch, HDFS-6527.v5.patch
>
>
> We have seen a SBN crashing with the following error:
> {panel}
> \[Edit log tailer\] ERROR namenode.FSEditLogLoader:
> Encountered exception on operation AddBlockOp
> [path=/xxx,
> penultimateBlock=NULL, lastBlock=blk_111_111, RpcClientId=,
> RpcCallId=-2]
> java.io.FileNotFoundException: File does not exist: /xxx
> {panel}
> This was caused by the deferred removal of deleted inodes from the inode map. 
> Since getAdditionalBlock() acquires FSN read lock and then write lock, a 
> deletion can happen in between. Because of deferred inode removal outside FSN 
> write lock, getAdditionalBlock() can get the deleted inode from the inode map 
> with FSN write lock held. This allows a block to be added to a deleted file.
> As a result, the edit log will contain OP_ADD, OP_DELETE, followed by
> OP_ADD_BLOCK. This cannot be replayed by the NN, so the NN doesn't start up 
> or the SBN crashes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
