[ https://issues.apache.org/jira/browse/HDFS-15171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17039018#comment-17039018 ]
Stephen O'Donnell commented on HDFS-15171:
------------------------------------------

[~zhuqi] There are a few parts to this.

First, the disk usage is refreshed every 10 minutes by a thread in CachingGetSpaceUsed; however, I had not realised it does not persist the newly calculated value to the cache file. You are correct: it does not, and only the shutdown hook does. The shutdown-time save can also take some time to complete if the disk is large. I wonder if we could use the existing refresh thread in CachingGetSpaceUsed, perhaps by passing a callback (or something like the existing shutdown hook) so it can save the cache file each time it runs?

The next problem is that BlockPoolSlice.loadDfsUsed() will only load the cache file if the mtime on the file is less than "dfs.datanode.cached-dfsused.check.interval.ms" old. This defaults to 10 minutes. This causes another problem: if you shut down a DN for more than 10 minutes, even if it shut down cleanly and saved the cache file, it will not read the cache file on startup, as the file is over 10 minutes old. Seeing as the disk usage will be refreshed within 10 minutes of the DN starting anyway, I think 10 minutes for dfs.datanode.cached-dfsused.check.interval.ms is probably too small, and that default could do with being a bit higher. If we could ensure the cache file was saved approximately every 10 minutes by the refresh thread, you could argue that the DN should always use the cache file if it is there, as it should be reasonably up to date anyway.

I encountered a problem like this recently with a few DNs which were taking a very long time to start up, and we worked around it by adjusting the mtime on the cache file to get them started. That was OK for a one-off, but perhaps we can do better in this Jira.

> Add a thread to call saveDfsUsed periodically, to prevent datanode too long restart time.
> -------------------------------------------------------------------------------------------
>
>                 Key: HDFS-15171
>                 URL: https://issues.apache.org/jira/browse/HDFS-15171
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode
>    Affects Versions: 3.2.0
>            Reporter: zhuqi
>            Assignee: zhuqi
>            Priority: Major
>
> There are 30 storage dirs per datanode in our production cluster, so a restart can take a very long time, because sometimes the datanode does not shut down gracefully. Currently only the datanode graceful-shutdown hook and the BlockPoolSlice shutdown call the saveDfsUsed function, so a restarted datanode sometimes cannot reuse the dfsUsed cache. I think we could add a thread to periodically call the saveDfsUsed function.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
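To make the two ideas above concrete, here is a minimal sketch (these are not the real Hadoop classes; DfsUsedCacheSketch, SaveCallback, refreshOnce and loadDfsUsed are hypothetical names for illustration): the refresh cycle persists the freshly computed usage through a callback each time it runs, and the loader only trusts the cache file when its mtime is younger than the configured interval, mirroring dfs.datanode.cached-dfsused.check.interval.ms, otherwise falling back to a full scan.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch, not the actual CachingGetSpaceUsed/BlockPoolSlice code.
public class DfsUsedCacheSketch {

    // Mirrors dfs.datanode.cached-dfsused.check.interval.ms (default: 10 minutes).
    static final long CACHED_DFSUSED_CHECK_INTERVAL_MS = 600_000L;

    // Callback the refresh thread would invoke to persist the value,
    // analogous to what the shutdown hook does today.
    interface SaveCallback {
        void save(long dfsUsed) throws IOException;
    }

    // One refresh cycle: recompute usage, then persist it immediately, so an
    // unclean shutdown loses at most one refresh interval of accuracy.
    static long refreshOnce(long computedUsage, SaveCallback saver) throws IOException {
        saver.save(computedUsage); // previously only done in the shutdown hook
        return computedUsage;
    }

    // Load the cached value only if the file was written recently enough;
    // otherwise fall back to the slow full disk scan.
    static long loadDfsUsed(Path cacheFile, long fallbackScanResult) throws IOException {
        if (Files.exists(cacheFile)) {
            long mtime = Files.getLastModifiedTime(cacheFile).toMillis();
            long age = System.currentTimeMillis() - mtime;
            if (age < CACHED_DFSUSED_CHECK_INTERVAL_MS) {
                return Long.parseLong(Files.readString(cacheFile).trim());
            }
        }
        return fallbackScanResult; // cache missing or stale: rescan the disk
    }
}
```

With a save on every refresh cycle, the cache file's mtime would normally be under 10 minutes old even after an unclean shutdown, which is what makes the argument that the DN could always trust an existing cache file on startup.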