[ https://issues.apache.org/jira/browse/HDFS-7122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Luke Lu updated HDFS-7122:
--------------------------
    Status: Patch Available  (was: Open)

> Very poor distribution of replication copies
> --------------------------------------------
>
>                 Key: HDFS-7122
>                 URL: https://issues.apache.org/jira/browse/HDFS-7122
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.3.0
>         Environment: medium-large environments with 100s to 1000s of DNs will be most affected, but potentially all environments.
>            Reporter: Jeff Buell
>            Assignee: Andrew Wang
>            Priority: Blocker
>              Labels: performance
>         Attachments: copies_per_slave.jpg, hdfs-7122-cdh5.1.2-testing.001.patch, hdfs-7122.001.patch
>
>
> Summary:
> Since HDFS-6268, the distribution of replica block copies across the DataNodes (replicas 2, 3, ... as distinguished from the first "primary" replica) is extremely poor, to the point that TeraGen slows down by as much as 3X for certain configurations. This is almost certainly due to the introduction of ThreadLocalRandom in HDFS-6268. The mechanism appears to be that this change causes the random numbers drawn in the different threads to be correlated, preventing a truly random choice of DN for each replica copy.
>
> Testing details:
> 1 TB TeraGen on 638 slave nodes (virtual machines on 32 physical hosts), 256 MB block size. This results in 6 "primary" blocks on each DN. With replication=3, there will be on average 12 more copies on each DN that are copies of blocks from other DNs. Because of the random selection of DNs, exactly 12 copies per DN are not expected, but I found that about 160 DNs (1/4 of all DNs!) received absolutely no copies, while one DN received over 100 copies, and the elapsed time increased by about 3X from a pre-HDFS-6268 distro. There was no pattern to which DNs didn't receive copies, nor was the set of such DNs repeatable run-to-run. In addition to the performance problem, there could be capacity problems due to one or a few DNs running out of space. Testing was done on CDH 5.0.0 (before) and CDH 5.1.2 (after), but I don't see a significant difference from the Apache Hadoop source in this regard. The workaround to recover the previous behavior is to set dfs.namenode.handler.count=1, but of course this has scaling implications for large clusters.
>
> I recommend that the ThreadLocalRandom part of HDFS-6268 be reverted until a better algorithm can be implemented and tested. Testing should include a case with many DNs and a small number of blocks on each.
>
> It should also be noted that even pre-HDFS-6268, the random choice-of-DN algorithm produces a rather non-uniform distribution of copies. This is not due to any bug, but purely a case of random distributions being much less uniform than one might intuitively expect. In the above case, pre-HDFS-6268 yields something like a range of 3 to 25 block copies on each DN. Surprisingly, the performance penalty of this non-uniformity is not as big as might be expected (maybe only 10-20%), but HDFS should do better, and in any case the capacity issue remains. Round-robin choice of DN? Better awareness of which DNs currently store fewer blocks? It's not sufficient that the total number of blocks end up similar on each DN; at each point in time, no individual DN should receive a disproportionate number of blocks at once (which could be a danger of an RR algorithm).
>
> Probably should limit this JIRA to tracking the ThreadLocal issue, and track the random-choice issue in another one.
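For anyone who wants to poke at the reported skew outside a cluster, below is a minimal, self-contained simulation sketch. It is plain Java, not the actual BlockPlacementPolicy: there is no rack awareness, and the class name, thread count, and structure are made up for illustration. A pool of worker threads stands in for NameNode handler threads and picks copy targets either from one shared java.util.Random or from ThreadLocalRandom.current(). With the numbers from the report (638 DNs, 4096 blocks of 256 MB, 2 non-primary copies each), independent uniform selection puts roughly 12-13 copies on each DN, and back-of-the-envelope the chance of any given DN getting zero copies is about e^-12.8 (a few in a million), so ~160 empty DNs is far outside what truly independent random placement should produce. On a JDK where ThreadLocalRandom is seeded properly per thread, the two modes of the sketch should look alike; a large gap between them in the affected environment would point at correlated per-thread generators.

{code:java}
import java.util.Random;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicIntegerArray;

/**
 * Toy placement simulator (NOT the HDFS BlockPlacementPolicy): several
 * "handler" threads each pick target DNs for replica copies, either from one
 * shared Random or from ThreadLocalRandom.current(), and we compare how
 * evenly the copies spread across DNs.
 */
public class ReplicaSpreadSim {

  static final int DATANODES = 638;   // slave count from the report
  static final int BLOCKS = 4096;     // 1 TB / 256 MB blocks
  static final int EXTRA_COPIES = 2;  // replication=3 => 2 non-primary copies
  static final int THREADS = 64;      // stand-in for NameNode handler threads

  public static void main(String[] args) throws InterruptedException {
    System.out.println("shared Random:     " + summarize(run(true)));
    System.out.println("ThreadLocalRandom: " + summarize(run(false)));
  }

  static AtomicIntegerArray run(boolean useSharedRandom) throws InterruptedException {
    AtomicIntegerArray copiesPerDn = new AtomicIntegerArray(DATANODES);
    Random shared = new Random();     // single generator shared by all threads
    Thread[] workers = new Thread[THREADS];
    for (int t = 0; t < THREADS; t++) {
      workers[t] = new Thread(() -> {
        // Each thread places its share of the non-primary copies.
        for (int i = 0; i < (BLOCKS * EXTRA_COPIES) / THREADS; i++) {
          int target = useSharedRandom
              ? shared.nextInt(DATANODES)
              : ThreadLocalRandom.current().nextInt(DATANODES);
          copiesPerDn.incrementAndGet(target);
        }
      });
      workers[t].start();
    }
    for (Thread w : workers) {
      w.join();
    }
    return copiesPerDn;
  }

  static String summarize(AtomicIntegerArray copies) {
    int zero = 0, min = Integer.MAX_VALUE, max = 0;
    for (int i = 0; i < copies.length(); i++) {
      int c = copies.get(i);
      if (c == 0) zero++;
      min = Math.min(min, c);
      max = Math.max(max, c);
    }
    return String.format("zero-copy DNs=%d, min=%d, max=%d", zero, min, max);
  }
}
{code}

Compile and run with "javac ReplicaSpreadSim.java && java ReplicaSpreadSim"; the interesting output is the zero-copy DN count and the min/max spread for each mode.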
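On the workaround mentioned above: dfs.namenode.handler.count is the standard NameNode RPC handler-count property, so forcing the previous behavior amounts to the usual hdfs-site.xml entry below (a sketch of the workaround only, not a recommendation; it typically takes effect on NameNode restart and serializes NameNode RPC handling, which is why it does not scale for large clusters).

{code:xml}
<!-- hdfs-site.xml: workaround from the report; reduces NameNode RPC handlers to 1 -->
<property>
  <name>dfs.namenode.handler.count</name>
  <value>1</value>
</property>
{code}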
-- This message was sent by Atlassian JIRA (v6.3.4#6332)