[ https://issues.apache.org/jira/browse/HDFS-13769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16559171#comment-16559171 ]
Yiqun Lin edited comment on HDFS-13769 at 7/27/18 3:01 AM:
-----------------------------------------------------------

{quote}
Also, clearing the checkpoint in trash is a typical situation of deleting a large dir, since the checkpoint dir of trash accumulates deleted files over several hours.
{quote}
Agree. We also met this problem; there is a big chance of the checkpoint dir being a large dir. As [~kihwal] mentioned, safe deleting might not be an atomic operation, but that should be okay for clearing the trash dir.

{quote}
Wei-Chiu Chuang, Agree! getContentSummary is a recursive method and it may take several seconds if the dir is very large. getContentSummary holds the read-lock in FSNameSystem rather than the write-lock. Also we need a way to know whether a dir is large. If there is a better solution I don't know, please tell me, and I think it need not be very accurate.
{quote}
My thinking on this: we can skip invoking the expensive {{getContentSummary}} call for the first-level dir, since there is a big chance it is a large dir. For the deeper child paths, we can do as the current patch does. I think this might be a better way; a rough sketch follows.
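To make that concrete, here is a minimal sketch of the idea using only the public {{FileSystem}} API. The class name, the {{SMALL_DIR_INODES}} threshold, and the exact recursion layout are illustrative assumptions on my side, not the attached patch:

{code:java}
import java.io.IOException;

import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Sketch: delete a (possibly huge) trash checkpoint dir without one
 * giant delete RPC. The first level is always expanded without calling
 * getContentSummary(); deeper dirs are probed first and deleted in a
 * single RPC only once they are small enough.
 */
public class IncrementalTrashDeleter {

  /** Hypothetical cutoff; a real implementation would make it configurable. */
  private static final long SMALL_DIR_INODES = 10_000L;

  private final FileSystem fs;

  public IncrementalTrashDeleter(FileSystem fs) {
    this.fs = fs;
  }

  /** Entry point: skip the expensive summary call for the first level. */
  public void deleteCheckpoint(Path checkpointDir) throws IOException {
    for (FileStatus child : fs.listStatus(checkpointDir)) {
      deleteIncrementally(child);
    }
    fs.delete(checkpointDir, true); // now empty, cheap to remove
  }

  private void deleteIncrementally(FileStatus status) throws IOException {
    if (status.isFile()) {
      fs.delete(status.getPath(), false);
      return;
    }
    // Deeper children: probe the size first, as the current patch does.
    ContentSummary summary = fs.getContentSummary(status.getPath());
    long inodes = summary.getFileCount() + summary.getDirectoryCount();
    if (inodes <= SMALL_DIR_INODES) {
      fs.delete(status.getPath(), true); // small enough for one RPC
    } else {
      for (FileStatus child : fs.listStatus(status.getPath())) {
        deleteIncrementally(child);
      }
      fs.delete(status.getPath(), true); // children are gone by now
    }
  }
}
{code}

Each {{fs.delete}} above is its own RPC, so the FSNamesystem write lock is taken and released per call instead of being held for the whole subtree. As noted above, the overall deletion is then not atomic, which should be acceptable for trash cleanup.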
> Namenode gets stuck when deleting large dir in trash
> ----------------------------------------------------
>
>                 Key: HDFS-13769
>                 URL: https://issues.apache.org/jira/browse/HDFS-13769
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 2.8.2, 3.1.0
>            Reporter: Tao Jie
>            Assignee: Tao Jie
>            Priority: Major
>         Attachments: HDFS-13769.001.patch
>
> Similar to the situation discussed in HDFS-13671, the Namenode gets stuck for a long time when deleting a trash dir with a large amount of data. We found this log in the namenode:
> {quote}
> 2018-06-08 20:00:59,042 INFO namenode.FSNamesystem (FSNamesystemLock.java:writeUnlock(252)) - FSNamesystem write lock held for 23018 ms via
> java.lang.Thread.getStackTrace(Thread.java:1552)
> org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1033)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:254)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1567)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:2820)
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.delete(NameNodeRpcServer.java:1047)
> {quote}
> One simple solution is to avoid deleting large data in one delete RPC call. We implement a trashPolicy that divides the delete operation into several delete RPCs, so that each single deletion does not delete too many files. Any thoughts? [~linyiqun]
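For reference, a sketch of where such a split deletion could be plugged in. The checkpoint path below is made up, and wiring it into a custom {{TrashPolicy}} this way is an assumption about the approach, not taken from HDFS-13769.001.patch:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SplitTrashDeleteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Made-up checkpoint path; a real TrashPolicy computes this itself.
    Path checkpoint = new Path("/user/foo/.Trash/180608200000");
    // Many small delete RPCs instead of one fs.delete(checkpoint, true)
    // call that removes the whole subtree under a single write lock.
    new IncrementalTrashDeleter(fs).deleteCheckpoint(checkpoint);
  }
}
{code}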