[ https://issues.apache.org/jira/browse/HDFS-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15056342#comment-15056342 ]
Kihwal Lee commented on HDFS-8791:
----------------------------------

For those who are interested, we have upgraded a 2000+ node busy cluster with this patch. We had to do something extra to speed up the rolling upgrade process.
- Tune the kernel to be less aggressive about evicting vfs-related slab entries:
{noformat}
echo 2 > /proc/sys/vm/vfs_cache_pressure
{noformat}
Then wait 6 hours for the DirectoryScanner to run and warm up the cache.
- Use a custom tool to upgrade the volumes offline in parallel without scanning. This tool utilizes the replica cache file that is created during upgrade-shutdown.

If a node went through the slow (regular) upgrade path, it could have taken over an hour (9-11 minutes * n drives). Via the "fast" path, the layout upgrade finished in 2-3 minutes, depending on the size of the drives. The offline layout upgrade was done in 3-4 seconds on a non-busy cluster. Scanning blocks in the new layout took about 2 seconds (this is done in parallel), so datanodes were registering with the NNs within 6 seconds of startup.


> block ID-based DN storage layout can be very slow for datanode on ext4
> ----------------------------------------------------------------------
>
>                 Key: HDFS-8791
>                 URL: https://issues.apache.org/jira/browse/HDFS-8791
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.6.0, 2.8.0, 2.7.1
>            Reporter: Nathan Roberts
>            Assignee: Chris Trezzo
>            Priority: Blocker
>         Attachments: 32x32DatanodeLayoutTesting-v1.pdf, 32x32DatanodeLayoutTesting-v2.pdf, HDFS-8791-trunk-v1.patch, HDFS-8791-trunk-v2-bin.patch, HDFS-8791-trunk-v2.patch, HDFS-8791-trunk-v2.patch, hadoop-56-layout-datanode-dir.tgz, test-node-upgrade.txt
>
>
> We are seeing cases where the new directory layout basically causes the datanode to make the disks seek for tens of minutes. This can happen when the datanode is running du, and also when it is performing a checkDirs(). Both of these operations currently scan all directories in the block pool, and that is very expensive in the new layout.
> The new layout creates 256 subdirs, each with 256 subdirs: essentially 64K leaf directories where block files are placed.
> So, what we have on disk is:
> - 256 inodes for the first-level directories
> - 256 directory blocks for the first-level directories
> - 256*256 inodes for the second-level directories
> - 256*256 directory blocks for the second-level directories
> - Then the inodes and blocks to store the HDFS blocks themselves.
> The main problem is the 256*256 directory blocks.
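>
> For illustration, here is a minimal, self-contained Java sketch of how a block ID gets mapped onto this two-level directory tree. It is loosely modeled on DatanodeUtil.idToBlockDir(); the class name, loop bounds, and paths below are simplified for the example:
> {code:java}
> import java.io.File;
>
> public class LayoutSketch {
>   // Two levels of 256 subdirs each: 256 * 256 = 65536 leaf directories.
>   static File idToBlockDir(File finalizedDir, long blockId) {
>     int d1 = (int) ((blockId >> 16) & 0xFF); // picks the first-level subdir
>     int d2 = (int) ((blockId >> 8) & 0xFF);  // picks the second-level subdir
>     return new File(finalizedDir,
>         "subdir" + d1 + File.separator + "subdir" + d2);
>   }
>
>   public static void main(String[] args) {
>     File finalized = new File("current/finalized");
>     // Nearby block IDs land in different leaf directories, so a full scan
>     // must touch directory blocks scattered across the whole disk.
>     for (long id = 1073741824L; id <= 1073741824L + 3 * 256; id += 256) {
>       System.out.println(idToBlockDir(finalized, id));
>     }
>   }
> }
> {code}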
> inodes and dentries will be cached by Linux, and one can configure how likely the system is to prune those entries (vfs_cache_pressure). However, ext4 relies on the buffer cache to cache the directory blocks, and I'm not aware of any way to tell Linux to favor buffer cache pages (even if it did, I'm not sure I would want it to in general).
> Also, ext4 tries hard to spread directories evenly across the entire volume, which basically means the 64K directory blocks are probably randomly spread across the entire disk. A du-type scan will look at directories one at a time, so the I/O scheduler can't optimize the corresponding seeks, meaning the seeks will be random and far apart.
> In a system I was using to diagnose this, I had 60K blocks. A du when things are hot takes less than 1 second. When things are cold, about 20 minutes.
> How do things get cold?
> - A large set of tasks run on the node. This pushes almost all of the buffer cache out, causing the next du to hit this situation. We are seeing cases where a large job can cause a seek storm across the entire cluster.
> Why didn't the previous layout see this?
> - It might have, but it wasn't nearly as pronounced. The previous layout would be a few hundred directory blocks. Even when completely cold, these would only take a few hundred seeks, which would mean single-digit seconds.
> - With only a few hundred directories, the odds of a directory block getting modified are quite high, which keeps those blocks hot and much less likely to be evicted.
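>
> As a back-of-the-envelope check on those numbers (the ~10 ms average cost per cold random read is an assumption for illustration, not a measurement from this system):
> {code:java}
> public class SeekMath {
>   public static void main(String[] args) {
>     double seekMs = 10.0;               // assumed cost of one cold random read
>     int newLayoutDirBlocks = 256 * 256; // 64K second-level directory blocks
>     int oldLayoutDirBlocks = 300;       // "a few hundred" in the old layout
>     // ~655 s (11 min) for the directory blocks alone; block inodes and the
>     // block files themselves push a fully cold scan toward the observed 20 min.
>     System.out.printf("new layout: ~%.0f s%n",
>         newLayoutDirBlocks * seekMs / 1000);
>     // ~3 s, i.e. the "single digit seconds" the previous layout showed.
>     System.out.printf("old layout: ~%.0f s%n",
>         oldLayoutDirBlocks * seekMs / 1000);
>   }
> }
> {code}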