[jira] [Commented] (HDFS-7707) Edit log corruption due to delayed block removal again

Kihwal Lee (JIRA) Fri, 30 Jan 2015 12:35:05 -0800

    [ 
https://issues.apache.org/jira/browse/HDFS-7707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14299188#comment-14299188
 ]


Kihwal Lee commented on HDFS-7707:
----------------------------------

bq. Do you mean that we could get a wrong iFile here?
Since the block collection of a block won't magically get updated to a new 
inode file, I don't see how it can be a wrong inode file. So it may not be due 
to delayed block removal.

bq.  what's the reason that tmpParent won't get a null at the dirX when trying 
to get the parent of dirX (if this happened)?
If snapshot is not involved, the parent will be set to null during delete while 
in the fsn write lock. Lack of memory barrier can cause stale values to be used 
in multi-processor and multi-threaded env, but I am not sure whether that is 
the cause here.

If {{commitBlockSynchronization()}} was involved, was it initiated by a client 
(e.g. recoverLease() or create/append())?

> Edit log corruption due to delayed block removal again
> ------------------------------------------------------
>
>                 Key: HDFS-7707
>                 URL: https://issues.apache.org/jira/browse/HDFS-7707
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.6.0
>            Reporter: Yongjun Zhang
>            Assignee: Yongjun Zhang
>
> Edit log corruption is seen again, even with the fix of HDFS-6825. 
> Prior to HDFS-6825 fix, if dirX is deleted recursively, an OP_CLOSE can get 
> into edit log for the fileY under dirX, thus corrupting the edit log 
> (restarting NN with the edit log would fail). 
> To fix this issue, HDFS-6825 detects whether fileY is already deleted by 
> checking the ancestor dirs on its path: if any of them doesn't exist, then 
> fileY is already deleted, and OP_CLOSE is not put into the edit log for the 
> file.
> For this new edit log corruption, what I found was that the client first 
> deleted dirX recursively, then created another dir with exactly the same name 
> as dirX right away. Because HDFS-6825 counts on the namespace check (whether 
> dirX exists in its parent dir) to decide whether a file has been deleted, the 
> newly created dirX defeats this check, and so OP_CLOSE for the already 
> deleted file gets into the edit log, due to delayed block removal.
> What we need to do is to have a more robust way to detect whether a file has 
> been deleted.
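
The scenario in the description above can be illustrated with a toy model. This is only a sketch, not the NameNode code: the Node class and both check methods are hypothetical names, and it assumes the detached dirX still holds a reference to its old parent (which is exactly the condition being questioned in this thread). It compares a name-based ancestor check with one that also compares object identity.

```java
import java.util.HashMap;
import java.util.Map;

public class DeletedFileCheck {
    /** Toy inode (hypothetical): a name, a parent pointer, children keyed by name. */
    public static final class Node {
        public final String name;
        public Node parent;
        public final Map<String, Node> children = new HashMap<>();

        public Node(String name) { this.name = name; }

        public Node addChild(Node child) {
            children.put(child.name, child);
            child.parent = this;
            return child;
        }

        /**
         * Unlinks the child from this directory. ASSUMPTION: the child's
         * parent pointer is deliberately left in place to model a stale view.
         */
        public void removeChild(String childName) {
            children.remove(childName);
        }
    }

    /** Name-based check: deleted if some ancestor's name no longer resolves. */
    public static boolean looksDeletedByName(Node file, Node root) {
        Node cur = file;
        while (cur != root) {
            Node p = cur.parent;
            if (p == null || !p.children.containsKey(cur.name)) {
                return true;
            }
            cur = p;
        }
        return false;
    }

    /** Identity-based check: the name must resolve to the very same node. */
    public static boolean looksDeletedByIdentity(Node file, Node root) {
        Node cur = file;
        while (cur != root) {
            Node p = cur.parent;
            if (p == null || p.children.get(cur.name) != cur) {
                return true;
            }
            cur = p;
        }
        return false;
    }

    public static void main(String[] args) {
        Node root = new Node("/");
        Node dirX = root.addChild(new Node("dirX"));
        Node fileY = dirX.addChild(new Node("fileY"));

        root.removeChild("dirX");          // recursive delete of dirX
        root.addChild(new Node("dirX"));   // same name recreated right away

        System.out.println("by name:     " + looksDeletedByName(fileY, root));
        System.out.println("by identity: " + looksDeletedByIdentity(fileY, root));
    }
}
```

In this model the name-based check reports fileY as not deleted once the new dirX exists, while the identity-based check still reports it as deleted, because the name "dirX" now resolves to a different node than the one fileY hangs under.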



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
