[ 
https://issues.apache.org/jira/browse/HDFS-6482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Thomas updated HDFS-6482:
-------------------------------

    Attachment: HDFS-6482.5.patch

Added a unit test for the layout upgrade.

After some thought and discussion, I have switched back to the two-level 
directory structure, with the first level given by the 24th through 17th bits 
of the block ID and the second level given by the 16th through 9th bits.
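
To make the mapping concrete, here is a minimal sketch of what the lookup 
might look like (the class/method names and the "subdir" prefix are 
illustrative, not necessarily what the patch uses):

    import java.io.File;

    // Illustrative sketch only: maps a block ID to its two-level directory.
    class BlockIdLayout {
      static File idToBlockDir(File diskRoot, long blockId) {
        int d1 = (int) ((blockId >> 16) & 0xFF); // 24th through 17th bits
        int d2 = (int) ((blockId >> 8) & 0xFF);  // 16th through 9th bits
        return new File(new File(diskRoot, "subdir" + d1), "subdir" + d2);
      }
    }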

First, the primary issue with the three-level structure was that it increased 
the number of dentries to an uncacheable level. There are up to 256 + 256^2 + 
256^3 directory dentries in that scheme, and assuming that each dentry 
(including its inode) consumes about a kilobyte, we would need gigabytes of 
dentry cache to hold everything, which we almost certainly wouldn't have. So 
in all likelihood, any operation that needs to touch a block file would incur 
an additional disk access or two to obtain the inode of the third-level 
directory. That overhead might be significant for short reads and for the 
directory scanner, which doesn't read the block file at all.

Since the two-level structure has a maximum of only 256 + 256^2 dentries, we 
should be fine with respect to dentry cache.
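
To put rough numbers on that comparison (back-of-the-envelope only, using the 
~1 KB per dentry figure above):

    // Back-of-the-envelope only; assumes roughly 1 KB per dentry + inode.
    public class DentryEstimate {
      public static void main(String[] args) {
        long threeLevel = 256L + 256L * 256 + 256L * 256 * 256; // 16,843,008 dirs
        long twoLevel   = 256L + 256L * 256;                    //     65,792 dirs
        System.out.println("three-level: ~" + threeLevel * 1024 / (1L << 30) + " GB of dentry cache");
        System.out.println("two-level:   ~" + twoLevel * 1024 / (1L << 20) + " MB of dentry cache");
      }
    }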

The main concern that was raised for the two-level structure was the 
possibility of directory blowup in large clusters. After all, the upper bound 
on the number of blocks in a single second-level directory on a DN is N/2^16, 
where N is the number of block sequence numbers that have been assigned by the 
NN (for the three-level structure it is N/2^24). This concern is overblown for 
two reasons:

1) Modern filesystems (ext4, xfs) index directories, so lookups are fast even 
in huge directories. (Slow lookups were the main problem caused by large 
directories on old filesystems.) Even old filesystems can handle on the order 
of a few thousand files in a directory, and even for a cluster with around 
2^28 (~270 million) block IDs ever allocated, we would not exceed 5000 blocks 
in a single directory in the absolute worst case (2^28 / 2^16 = 4096).

2) The worst case is exceedingly unlikely. Suppose that there are T disks in 
our cluster (it helps to think of disks rather than DNs in this analysis). Then 
we can imagine that each disk has 256^2 buckets, each corresponding to a 
second-level directory. Let the bucket set (a, b) be the set of buckets, one 
from each of the T disks, with first-level directory given by a and 
second-level directory given by b. Every 2^24 sequence numbers, we loop back to 
the same bucket set, and for each of the next 256 sequence numbers, we choose a 
random bucket in this bucket set (the corresponding directory in one of the T 
disks) to place the corresponding block in. In short, for each bucket set, we 
randomly assign each of N/2^16 blocks to one of the buckets in the set. Suppose 
that X is a random variable that gives the maximum number of blocks in any 
single bucket across all bucket sets, and let M = N/2^16 (some large number in 
large clusters). I can post a derivation if people are interested, but what I 
find is that for large M (large with respect to T)

P(X >= a) <= 256^2*T*(1 - normal_cdf(mu = M/T, sigma = sqrt(M/T), a))

Let's try some actual numbers here. Let us upper bound the 256^2*T term by 
10^10 (exceeding that would require more than 10^5 disks in the cluster, which 
seems well into the future). The term M/T (the expected number of blocks per 
second-level directory on a disk) is the desired number of blocks per disk 
divided by 2^16. Assuming a 128 MB block size, 6 TB drives, and average block 
usage of something like 50%, M/T is something like 1 (6 TB / 128 MB is roughly 
45,000 full blocks per disk, about 90,000 at 50% average fullness, and 
90,000 / 2^16 is about 1.4). Granted, the normal approximation is not very 
good here, but it is clear that if a is anything remotely close to worrisome 
(say 1000 or more), the probability is vanishingly small. (For reference, at 8 
standard devs above the mean, 1 - normal_cdf is already about 10^-16, and at 
10 standard devs, it's about 10^-24.) Basically, directory blowup is almost 
impossible.
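
For anyone who wants to check the arithmetic, here is a quick sanity check of 
that bound using the standard Gaussian tail inequality 1 - Phi(z) <= phi(z)/z 
(evaluated in log space to avoid underflow); the T, M/T, and a values are the 
illustrative assumptions from above, not measurements:

    // Evaluates the union bound P(X >= a) <= 256^2 * T * (1 - normal_cdf(...))
    // using 1 - Phi(z) <= phi(z)/z, in log space so the result doesn't underflow.
    public class DirectoryBlowupBound {
      public static void main(String[] args) {
        double T = 1e5;            // disks in the cluster (generous)
        double mean = 1.0;         // M/T: expected blocks per second-level dir
        double sigma = Math.sqrt(mean);
        double a = 1000.0;         // directory size we'd consider "blowup"
        double z = (a - mean) / sigma;
        // log of phi(z)/z, an upper bound on 1 - Phi(z)
        double logTail = -z * z / 2 - Math.log(z * Math.sqrt(2 * Math.PI));
        double log10Bound = (Math.log(256.0 * 256.0 * T) + logTail) / Math.log(10);
        // Prints a number around -216700, i.e. the bound is about 10^-216700.
        System.out.println("log10 of the bound on P(X >= 1000): ~" + Math.round(log10Bound));
      }
    }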


> Use block ID-based block layout on datanodes
> --------------------------------------------
>
>                 Key: HDFS-6482
>                 URL: https://issues.apache.org/jira/browse/HDFS-6482
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode
>    Affects Versions: 2.5.0
>            Reporter: James Thomas
>            Assignee: James Thomas
>         Attachments: HDFS-6482.1.patch, HDFS-6482.2.patch, HDFS-6482.3.patch, 
> HDFS-6482.4.patch, HDFS-6482.5.patch, HDFS-6482.patch
>
>
> Right now blocks are placed into directories that are split into many 
> subdirectories when capacity is reached. Instead we can use a block's ID to 
> determine the path it should go in. This eliminates the need for the LDir 
> data structure that facilitates the splitting of directories when they reach 
> capacity as well as fields in ReplicaInfo that keep track of a replica's 
> location.
> An extension of the work in HDFS-3290.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
