[ https://issues.apache.org/jira/browse/HDFS-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15041902#comment-15041902 ]
Chris Trezzo commented on HDFS-8791:
------------------------------------

Reviewing HDFS-8578.

> block ID-based DN storage layout can be very slow for datanode on ext4
> ----------------------------------------------------------------------
>
>                 Key: HDFS-8791
>                 URL: https://issues.apache.org/jira/browse/HDFS-8791
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.6.0, 2.8.0, 2.7.1
>            Reporter: Nathan Roberts
>            Assignee: Chris Trezzo
>            Priority: Blocker
>         Attachments: 32x32DatanodeLayoutTesting-v1.pdf,
> 32x32DatanodeLayoutTesting-v2.pdf, HDFS-8791-trunk-v1.patch,
> HDFS-8791-trunk-v2-bin.patch, HDFS-8791-trunk-v2.patch,
> HDFS-8791-trunk-v2.patch, hadoop-56-layout-datanode-dir.tgz,
> test-node-upgrade.txt
>
>
> We are seeing cases where the new directory layout causes the datanode to
> basically cause the disks to seek for 10s of minutes. This can be when the
> datanode is running du, and it can also be when it is performing a
> checkDirs(). Both of these operations currently scan all directories in the
> block pool and that's very expensive in the new layout.
> The new layout creates 256 subdirs, each with 256 subdirs. Essentially 64K
> leaf directories where block files are placed.
> So, what we have on disk is:
> - 256 inodes for the first level directories
> - 256 directory blocks for the first level directories
> - 256*256 inodes for the second level directories
> - 256*256 directory blocks for the second level directories
> - Then the inodes and blocks to store the HDFS blocks themselves.
> The main problem is the 256*256 directory blocks.
> inodes and dentries will be cached by Linux and one can configure how likely
> the system is to prune those entries (vfs_cache_pressure). However, ext4
> relies on the buffer cache to cache the directory blocks and I'm not aware of
> any way to tell Linux to favor buffer cache pages (even if it did I'm not
> sure I would want it to in general).
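
For a sense of scale, the on-disk counts listed above can be turned into a back-of-the-envelope estimate. This is only a sketch: the 256-way fanout and the one-directory-block-per-directory count come from the description, while the 10 ms random seek time is an assumption (a typical figure for a spinning disk, not a measurement from this issue), and the function names are hypothetical.

```python
# Rough model of the block ID-based layout described in the issue.
FANOUT = 256            # subdirs per level, from the description
ASSUMED_SEEK_MS = 10.0  # ASSUMPTION: random seek cost on a spinning disk


def layout_counts(fanout=FANOUT):
    """Count directories and directory blocks for a two-level layout."""
    first_level = fanout
    leaf = fanout * fanout
    return {
        "first_level_dirs": first_level,
        "leaf_dirs": leaf,
        # One directory block per directory, as listed in the description.
        "directory_blocks": first_level + leaf,
    }


def cold_scan_seconds(directory_blocks, seek_ms=ASSUMED_SEEK_MS):
    """Estimate a fully cold scan as one random seek per directory block."""
    return directory_blocks * seek_ms / 1000.0


counts = layout_counts()
cold = cold_scan_seconds(counts["directory_blocks"])
print(counts)
print(f"cold scan estimate: {cold / 60:.1f} minutes")
```

Under these assumptions the estimate comes out at roughly 11 minutes for a single cold scan, which is the same order of magnitude as the "10s of minutes" reported above. Inode reads and fragmentation are ignored, so the real cost could well be higher.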
> Also, ext4 tries hard to spread directories evenly across the entire volume;
> this basically means the 64K directory blocks are probably randomly spread
> across the entire disk. A du type scan will look at directories one at a
> time, so the ioscheduler can't optimize the corresponding seeks, meaning the
> seeks will be random and far.
> In a system I was using to diagnose this, I had 60K blocks. A DU when things
> are hot is less than 1 second. When things are cold, about 20 minutes.
> How do things get cold?
> - A large set of tasks runs on the node. This pushes almost all of the buffer
> cache out, causing the next DU to hit this situation. We are seeing cases
> where a large job can cause a seek storm across the entire cluster.
> Why didn't the previous layout see this?
> - It might have but it wasn't nearly as pronounced. The previous layout would
> be a few hundred directory blocks. Even when completely cold, these would
> only take a few hundred seeks which would mean single digit seconds.
> - With only a few hundred directories, the odds of the directory blocks
> getting modified are quite high; this keeps those blocks hot and much less
> likely to be evicted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)