[ https://issues.apache.org/jira/browse/HDFS-6618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14051887#comment-14051887 ]
Kihwal Lee commented on HDFS-6618:
----------------------------------

[~cmccabe], thanks for the review.

bq. What happens if a QuotaExceededException is thrown here .....

This is indeed problematic, but that is also the case for the existing code and for what you are suggesting. If an exception is thrown in the middle of a delete, the partial delete is not undone: the inode at the top of the tree being deleted (and potentially more) will already have been unlinked, and the rest will remain linked but unreachable. If inodes are removed all together at the end, none of them will be removed from the inodeMap when an exception is thrown, causing both inodes and blocks to leak. If we remove inodes as we go, at least some of them will be removed from the inodeMap in the same situation. Either way things leak, but to a lesser degree in the latter case. I wouldn't say the latter is superior because of this difference; I am just saying it is no worse.

One of the key motivations for removing inodes inline was to avoid the overhead of building up a large data structure when deleting a large tree. Although it is now backed by {{ChunkedArrayList}}, there will be a lot of reallocation and quite a bit of memory consumption. All or part of it may be promoted and remain in the heap until the next old-gen collection. This might be acceptable if we were doing deferred removal outside the lock, but since we are trying to do it inside both the FSNamesystem and FSDirectory locks, building the list is just a waste.

About leaking inodes and blocks:
- Inodes were removed from the inodeMap but blocks weren't. This includes a block added after its inode was deleted, due to the delete-addBlock race. The block is not removed from the blocksMap and still holds a reference to its block collection (i.e. the inode). This causes a memory leak, which disappears when the namenode is restarted.
- Unlinked/deleted inodes were not removed from the inodeMap. The deleted inodes will remain in memory.
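To illustrate the trade-off on exception, here is a toy sketch (class, method, and map names are illustrative, not the actual FSDirectory code). If the traversal fails partway, the inline variant has already removed the subtrees it visited from the map, while the collect-then-remove variant never reaches its removal loop and leaks every entry:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: inline removal vs. collect-then-remove when an
// exception (e.g. a quota check failure) interrupts a recursive delete.
public class DeleteSketch {
    static void deleteInline(Map<Long, String> inodeMap, long[] subtree, int failAt) {
        for (int i = 0; i < subtree.length; i++) {
            if (i == failAt) throw new RuntimeException("quota check failed");
            inodeMap.remove(subtree[i]);   // removed as we go
        }
    }

    static void deleteCollected(Map<Long, String> inodeMap, long[] subtree, int failAt) {
        List<Long> removed = new ArrayList<>();
        for (int i = 0; i < subtree.length; i++) {
            if (i == failAt) throw new RuntimeException("quota check failed");
            removed.add(subtree[i]);       // only collected; map untouched so far
        }
        for (long id : removed) inodeMap.remove(id);  // never reached on failure
    }

    public static void main(String[] args) {
        long[] subtree = {1, 2, 3, 4};
        Map<Long, String> a = new HashMap<>(), b = new HashMap<>();
        for (long id : subtree) { a.put(id, "inode" + id); b.put(id, "inode" + id); }

        try { deleteInline(a, subtree, 2); } catch (RuntimeException ignored) { }
        try { deleteCollected(b, subtree, 2); } catch (RuntimeException ignored) { }

        // Inline removal leaked 2 of 4 entries; collect-then-remove leaked all 4.
        System.out.println(a.size() + " " + b.size());  // prints "2 4"
    }
}
```

Either way something leaks, but the inline variant leaks strictly less for the same failure point, which is the "no worse" argument above.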
If blocks were also not removed from the blocksMap, they will remain in memory. If the blocks were collected but not removed from the blocksMap, they will disappear after a restart. When saving the fsimage, the orphaned inodes will be saved in the inode section. The way the INodeDirectorySection is saved also causes all leaked (still linked) children and blocks to be saved, so when the fsimage is loaded, the leak is recreated in memory.

I am a bit depressed after writing this. Let's fix things one at a time...

> Remove deleted INodes from INodeMap right away
> ----------------------------------------------
>
>                 Key: HDFS-6618
>                 URL: https://issues.apache.org/jira/browse/HDFS-6618
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.5.0
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>            Priority: Blocker
>         Attachments: HDFS-6618.AbstractList.patch, HDFS-6618.inodeRemover.patch, HDFS-6618.inodeRemover.v2.patch, HDFS-6618.patch
>
> After HDFS-6527, we have not seen the edit log corruption for weeks on multiple clusters until yesterday. Previously, we would see it within 30 minutes on a cluster.
> But the same condition was reproduced even with HDFS-6527. The only explanation is that the RPC handler thread serving {{addBlock()}} was accessing a stale parent value. Although nulling out the parent is done inside the {{FSNamesystem}} and {{FSDirectory}} write locks, there is no memory barrier because no "synchronized" block is involved in the process.
> I suggest making parent volatile.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
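As a side note, the visibility issue described in the quoted report can be sketched as follows (class and field names are illustrative, not the actual INode code). A field written under one lock but read by a thread that never synchronizes on the same monitor has no happens-before edge, so the reader may see a stale value; declaring the field volatile restores visibility without additional locking:

```java
// Hypothetical sketch of the stale-parent visibility fix. In the real issue,
// the deleter nulls out parent under the FSNamesystem/FSDirectory write locks,
// but an addBlock() RPC handler may read it without any synchronized block.
public class VolatileParentSketch {
    // volatile guarantees that once the deleter nulls out parent, any
    // subsequent read by another thread observes null rather than a stale link.
    volatile VolatileParentSketch parent;

    boolean isUnlinked() {
        return parent == null;  // unsynchronized read, safe because of volatile
    }

    public static void main(String[] args) {
        VolatileParentSketch child = new VolatileParentSketch();
        child.parent = new VolatileParentSketch();  // linked into a tree
        child.parent = null;                        // "delete": unlink from parent
        System.out.println(child.isUnlinked());
    }
}
```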