[ https://issues.apache.org/jira/browse/HDFS-6482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
James Thomas updated HDFS-6482:
-------------------------------
    Attachment: HDFS-6482.5.patch

Added a unit test for the layout upgrade.

After some thought and discussion, I have switched back to the two-level directory structure, with the first level given by the 24th through 17th bits of the block ID and the second level given by the 16th through 9th bits.

First, the primary issue with the three-level structure was that it increased the number of dentries to an uncacheable level. There are up to 256 + 256^2 + 256^3 dentries in that scheme, and assuming that each dentry (including inode) consumes about a kilobyte, we would need gigabytes of dentry cache to store everything, which we almost certainly wouldn't have. So in all likelihood, every operation that needed to touch a block file would incur an additional disk access or two to obtain the inode of the third-level directory. This might be significant for short reads and for the directory scanner, which doesn't read the block file at all. Since the two-level structure has a maximum of only 256 + 256^2 dentries, we should be fine with respect to the dentry cache.

The main concern raised about the two-level structure was the possibility of directory blowup in large clusters. After all, the upper bound on the number of blocks in a single second-level directory on a DN is N/2^16, where N is the number of block sequence numbers that have been assigned by the NN (for the three-level structure it is N/2^24). This concern is overblown for two reasons:

1) Modern filesystems (ext4, xfs) index directories, so lookups are fast even in huge directories. (Slow lookups were the main problem caused by large directories on old filesystems.) Even old filesystems can handle on the order of a few thousand files in a directory, and even for a cluster with around 2^28 (~270 million) blocks ever created, we would not exceed 2^28/2^16 = 4096 blocks in a single directory in the absolute worst case.

2) The worst case is exceedingly unlikely.
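To make the bit-slicing concrete, here is a small sketch of how a block ID would map to its two-level subdirectory under the layout described above (the function name and "subdirN" naming are illustrative, not the patch's actual code):

```python
def block_id_to_subdir(block_id):
    # Two-level layout sketch: first-level directory from bits 24..17
    # of the block ID, second-level directory from bits 16..9.
    d1 = (block_id >> 16) & 0xFF
    d2 = (block_id >> 8) & 0xFF
    return "subdir%d/subdir%d" % (d1, d2)

# e.g. block ID 0x123456 -> bits 24..17 are 0x12 (18), bits 16..9 are 0x34 (52)
print(block_id_to_subdir(0x123456))  # subdir18/subdir52
```

Note that because the low 8 bits are ignored, runs of 256 consecutive sequence numbers land in the same directory pair, which is why the analysis below groups sequence numbers into batches of 256.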
Suppose that there are T disks in our cluster (it helps to think of disks rather than DNs in this analysis). Then we can imagine that each disk has 256^2 buckets, each corresponding to a second-level directory. Let the bucket set (a, b) be the set of buckets, one from each of the T disks, with first-level directory given by a and second-level directory given by b. Every 2^24 sequence numbers, we loop back to the same bucket set, and for each of the next 256 sequence numbers, we choose a random bucket in this bucket set (the corresponding directory on one of the T disks) to place the corresponding block in. In short, for each bucket set, we randomly assign each of N/2^16 blocks to one of the buckets in the set.

Suppose that X is a random variable giving the maximum number of blocks in any single bucket across all bucket sets, and let M = N/2^16 (a large number in large clusters). I can post a derivation if people are interested, but what I find is that for M large with respect to T,

P(X >= a) <= 256^2 * T * (1 - normal_cdf(mu = M/T, sigma = sqrt(M/T), a))

Let's try some actual numbers. Upper bound the 256^2 * T term by 10^10 (this requires more than 10^5 disks in the cluster, which seems well into the future). The term M/T represents the expected number of blocks per second-level directory: the desired number of blocks per disk divided by 2^16. Assuming a 128 MB block size, 6 TB drives, and average block usage of something like 50%, M/T is something like 1. Granted, the normal approximation is not very good here, but it is clear that if a is anything remotely close to worrisome (say 1000 or more), the probability is vanishingly small. (For reference, at 8 standard deviations above the mean, 1 - normal_cdf is already ~10^-16, and at 10 standard deviations it's ~10^-24.) Basically, directory blowup is almost impossible.
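As a rough numerical check of the bound above (a sketch only; the normal approximation and the 10^10 cap on 256^2 * T are taken straight from the argument, and the helper names are my own), the tail term can be evaluated with the complementary error function:

```python
import math

def normal_sf(z):
    # Survival function of the standard normal, 1 - Phi(z),
    # computed via erfc for accuracy in the far tail.
    return 0.5 * math.erfc(z / math.sqrt(2))

def blowup_bound(a, m_over_t, union_term=1e10):
    # Evaluates 256^2 * T * (1 - normal_cdf(mu=M/T, sigma=sqrt(M/T), a)),
    # with the 256^2 * T factor capped at 10^10 as in the comment.
    z = (a - m_over_t) / math.sqrt(m_over_t)
    return union_term * normal_sf(z)

# Far-tail reference points cited in the comment:
print(normal_sf(8))   # ~6.2e-16
print(normal_sf(10))  # ~7.6e-24
# With M/T = 1 and a = 1000, z is ~999 standard deviations, so the
# bound underflows to 0.0 in double precision:
print(blowup_bound(1000, 1.0))
```

Even before multiplying by the 10^10 union-bound factor, the normal tail at 1000 blocks is far below anything representable, which is the "vanishingly small" claim made above.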
> Use block ID-based block layout on datanodes
> --------------------------------------------
>
>                 Key: HDFS-6482
>                 URL: https://issues.apache.org/jira/browse/HDFS-6482
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode
>    Affects Versions: 2.5.0
>            Reporter: James Thomas
>            Assignee: James Thomas
>         Attachments: HDFS-6482.1.patch, HDFS-6482.2.patch, HDFS-6482.3.patch, HDFS-6482.4.patch, HDFS-6482.5.patch, HDFS-6482.patch
>
>
> Right now blocks are placed into directories that are split into many subdirectories when capacity is reached. Instead we can use a block's ID to determine the path it should go in. This eliminates the need for the LDir data structure that facilitates the splitting of directories when they reach capacity as well as fields in ReplicaInfo that keep track of a replica's location.
> An extension of the work in HDFS-3290.

--
This message was sent by Atlassian JIRA
(v6.2#6252)