Nathan Roberts created HDFS-8791:
------------------------------------

             Summary: block ID-based DN storage layout can be very slow for 
datanode on ext4
                 Key: HDFS-8791
                 URL: https://issues.apache.org/jira/browse/HDFS-8791
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: datanode
    Affects Versions: 2.6.1
            Reporter: Nathan Roberts
            Priority: Critical


We are seeing cases where the new directory layout causes the datanode to 
keep the disks seeking for tens of minutes. This can happen when the 
datanode is running du, and also when it is performing a checkDirs(). 
Both of these operations currently scan all directories in the block pool, and 
that is very expensive in the new layout.

The new layout creates 256 subdirs, each with 256 subdirs, for a total of 64K 
leaf directories where block files are placed.
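
To make the 256 x 256 fan-out concrete, here is a minimal sketch of how a 
block ID could be mapped to its leaf directory. The bit positions (bits 16-23 
and 8-15) and the "subdirN" naming reflect my understanding of the datanode's 
ID-based mapping; they are assumptions and are not stated in this report.

```python
def block_id_to_dir(block_id):
    """Map a block ID to a leaf directory in a 256 x 256 layout.

    Assumption: two bytes of the block ID select the two directory levels.
    """
    d1 = (block_id >> 16) & 0xFF  # first-level index, 0..255
    d2 = (block_id >> 8) & 0xFF   # second-level index, 0..255
    return "subdir%d/subdir%d" % (d1, d2)


if __name__ == "__main__":
    print(block_id_to_dir(0x123456))
```

With a mapping like this, consecutive block IDs move to a new leaf directory 
every 256 IDs, so even a modest number of blocks ends up touching most of the 
64K leaf directories.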

So, what we have on disk is:
- 256 inodes for the first level directories
- 256 directory blocks for the first level directories
- 256*256 inodes for the second level directories
- 256*256 directory blocks for the second level directories
- Then the inodes and blocks to store the HDFS blocks themselves.

The main problem is the 256*256 directory blocks. 

Inodes and dentries will be cached by Linux, and one can configure how likely 
the system is to prune those entries (vfs_cache_pressure). However, ext4 relies 
on the buffer cache to cache the directory blocks, and I'm not aware of any way 
to tell Linux to favor buffer cache pages (even if there were, I'm not sure I 
would want it to in general).
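
For reference, the current vfs_cache_pressure setting can be read from procfs. 
The sketch below assumes a Linux host; the helper name is hypothetical. The 
kernel default is 100, and lower values make the kernel prefer to retain 
dentry and inode caches. As noted above, this knob does not cover the 
directory blocks themselves.

```python
from pathlib import Path


def read_vfs_cache_pressure(path="/proc/sys/vm/vfs_cache_pressure"):
    """Return the vfs_cache_pressure value stored at `path` as an int."""
    return int(Path(path).read_text().strip())


if __name__ == "__main__":
    print("vfs_cache_pressure =", read_vfs_cache_pressure())
```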

Also, ext4 tries hard to spread directories evenly across the entire volume. 
This basically means the 64K directory blocks are probably randomly spread 
across the entire disk. A du-type scan will look at directories one at a time, 
so the I/O scheduler can't optimize the corresponding seeks, meaning the seeks 
will be random and far apart.
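
The access pattern of a du-type scan can be sketched as follows. This is an 
illustration only, not the datanode's code: the function name is hypothetical. 
The point is that each directory has to be read before its children are known, 
so the reads are issued one directory at a time.

```python
import os
import time


def scan_directories(root):
    """Walk `root` one directory at a time.

    Returns (directory_count, file_count, elapsed_seconds).
    """
    start = time.monotonic()
    dirs = 0
    files = 0
    stack = [root]
    while stack:
        current = stack.pop()
        dirs += 1
        # Listing a directory reads its directory block(s); when they are
        # not cached, this is where the disk has to seek.
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                else:
                    files += 1
    return dirs, files, time.monotonic() - start
```

Running this against a block pool directory once with warm caches and once 
after the caches have been evicted should show the same kind of gap as the du 
timings in this report.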

In a system I was using to diagnose this, I had 60K blocks. A du when the 
caches are hot takes less than 1 second. When they are cold, it takes about 
20 minutes.

How do things get cold?
- A large set of tasks runs on the node. This pushes almost all of the buffer 
cache out, causing the next du to hit this situation. We are seeing cases where 
a large job can cause a seek storm across the entire cluster.

Why didn't the previous layout see this?
- It might have, but it wasn't nearly as pronounced. The previous layout would 
be a few hundred directory blocks. Even when completely cold, these would only 
take a few hundred seeks, which would mean single-digit seconds.
- With only a few hundred directories, the odds of the directory blocks getting 
modified are quite high. This keeps those blocks hot and much less likely to be 
evicted.
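
As a rough consistency check on the numbers above, the cold scan time can be 
turned into a per-directory cost and applied to the previous layout. The 
figure of 300 directories is an assumed stand-in for "a few hundred", and the 
model (scan time proportional to directory count) is a simplification.

```python
# Directories in the new layout: 256 first-level plus 256*256 second-level.
NEW_LAYOUT_DIRS = 256 + 256 * 256

# Cold du time reported above: about 20 minutes.
COLD_SCAN_SECONDS = 20 * 60

# Assumption: "a few hundred" directories in the previous layout.
OLD_LAYOUT_DIRS = 300


def seconds_per_directory(total_seconds, directories):
    """Average cold-cache cost of visiting one directory."""
    return total_seconds / directories


def estimated_scan_seconds(directories, per_directory_seconds):
    """Estimated cold scan time, assuming cost scales with directory count."""
    return directories * per_directory_seconds


if __name__ == "__main__":
    per_dir = seconds_per_directory(COLD_SCAN_SECONDS, NEW_LAYOUT_DIRS)
    old = estimated_scan_seconds(OLD_LAYOUT_DIRS, per_dir)
    print("new layout directories: %d" % NEW_LAYOUT_DIRS)
    print("cold cost per directory: %.1f ms" % (per_dir * 1000))
    print("estimated cold scan, previous layout: %.1f s" % old)
```

Under these assumptions the cold cost works out to roughly 18 ms per 
directory, which is in the range of one or two random seeks on a spinning 
disk, and the previous layout comes out at around 5 seconds.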




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)