[
https://issues.apache.org/jira/browse/HDFS-7122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152052#comment-14152052
]
Luke Lu commented on HDFS-7122:
-------------------------------
Created HADOOP-11152 to track the investigation of a better RNG.
> Very poor distribution of replication copies
> --------------------------------------------
>
> Key: HDFS-7122
> URL: https://issues.apache.org/jira/browse/HDFS-7122
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 2.3.0
> Environment: medium to large environments with hundreds to thousands of DNs
> will be most affected, but potentially all environments.
> Reporter: Jeff Buell
> Assignee: Andrew Wang
> Priority: Blocker
> Labels: performance
> Attachments: copies_per_slave.jpg,
> hdfs-7122-cdh5.1.2-testing.001.patch, hdfs-7122.001.patch
>
>
> Summary:
> Since HDFS-6268, the distribution of replica block copies across the
> DataNodes (replicas 2,3,... as distinguished from the first "primary"
> replica) is extremely poor, to the point that TeraGen slows down by as much
> as 3X for certain configurations. This is almost certainly due to the
> introduction of ThreadLocalRandom in HDFS-6268. The mechanism appears to
> be that this change causes the random numbers drawn in different threads to
> be correlated, thus preventing a truly random choice of DN for each replica
> copy.
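> To illustrate the suspected mechanism, here is a minimal standalone sketch,
> not the actual NameNode placement code: the thread count, block counts,
> seeds, and class name below are made up, and the identical per-thread seed
> is deliberately exaggerated to force the kind of correlation described
> above. It compares the resulting spread of copies against a single shared
> Random:
> {code:java}
> import java.util.Random;
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
> import java.util.concurrent.TimeUnit;
> import java.util.concurrent.atomic.AtomicIntegerArray;
>
> public class CorrelatedRngDemo {
>   static final int DATANODES = 638;        // DNs in the test cluster
>   static final int THREADS = 64;           // simulated handler threads
>   static final int BLOCKS_PER_THREAD = 64; // blocks placed per thread
>   static final int EXTRA_REPLICAS = 2;     // replicas 2 and 3 of each block
>
>   // correlated=true: each thread gets its own Random with an identical seed,
>   // an extreme stand-in for correlated per-thread generators.
>   // correlated=false: one shared Random, roughly the pre-HDFS-6268 situation.
>   static int[] simulate(boolean correlated) throws InterruptedException {
>     AtomicIntegerArray copies = new AtomicIntegerArray(DATANODES);
>     Random shared = new Random(42);
>     ExecutorService pool = Executors.newFixedThreadPool(THREADS);
>     for (int t = 0; t < THREADS; t++) {
>       pool.execute(() -> {
>         Random rng = correlated ? new Random(12345L) : shared;
>         for (int b = 0; b < BLOCKS_PER_THREAD; b++) {
>           for (int r = 0; r < EXTRA_REPLICAS; r++) {
>             int dn;
>             synchronized (rng) {            // the shared Random needs a lock
>               dn = rng.nextInt(DATANODES);
>             }
>             copies.incrementAndGet(dn);
>           }
>         }
>       });
>     }
>     pool.shutdown();
>     pool.awaitTermination(1, TimeUnit.MINUTES);
>     int[] result = new int[DATANODES];
>     for (int i = 0; i < DATANODES; i++) {
>       result[i] = copies.get(i);
>     }
>     return result;
>   }
>
>   static void report(String label, int[] copies) {
>     int empty = 0, max = 0;
>     for (int c : copies) {
>       if (c == 0) empty++;
>       if (c > max) max = c;
>     }
>     System.out.printf("%s: DNs with 0 copies = %d, max copies on one DN = %d%n",
>         label, empty, max);
>   }
>
>   public static void main(String[] args) throws InterruptedException {
>     report("shared Random        ", simulate(false));
>     report("correlated per-thread", simulate(true));
>   }
> }
> {code}
> With identically seeded per-thread generators, every thread picks the same
> sequence of DN indices, so a small subset of DNs absorbs all the copies
> while most receive none, which is the shape of the skew reported below.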
> Testing details:
> 1 TB TeraGen on 638 slave nodes (virtual machines on 32 physical hosts),
> 256 MB block size. This results in about 6 "primary" blocks on each DN. With
> replication=3, there will be on average 12 more copies on each DN that are
> copies of blocks from other DNs. Because of the random selection of DNs,
> exactly 12 copies per DN are not expected, but I found that about 160 DNs (1/4 of
> all DNs!) received absolutely no copies, while one DN received over 100
> copies, and the elapsed time increased by about 3X from a pre-HDFS-6268
> distro. There was no pattern to which DNs didn't receive copies, nor was the
> set of such DNs repeatable run-to-run. In addition to the performance
> problem, there could be capacity problems due to one or a few DNs running out
> of space. Testing was done on CDH 5.0.0 (before) and CDH 5.1.2 (after), but I
> don't see a significant difference from the Apache Hadoop source in this
> regard. The workaround to recover the previous behavior is to set
> dfs.namenode.handler.count=1, but of course this has scaling implications for
> large clusters.
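> As a sanity check on those numbers: assuming placement were truly uniform
> and independent (which the real placement policy only approximates), the
> copy count on a given DN would be roughly Poisson with mean 12, so
> essentially no DN should end up empty. A back-of-envelope calculation (the
> class name is just for the demo):
> {code:java}
> public class EmptyDnOdds {
>   public static void main(String[] args) {
>     double mean = 12.0;   // expected non-primary copies per DN
>     int dataNodes = 638;
>     double pEmpty = Math.exp(-mean);   // P(a given DN gets 0 copies)
>     System.out.printf("P(one DN gets 0 copies) ~ %.2e%n", pEmpty);
>     System.out.printf("Expected empty DNs out of %d ~ %.4f%n",
>         dataNodes, dataNodes * pEmpty);
>     // ~6e-6 and ~0.004 respectively: seeing ~160 empty DNs is incompatible
>     // with a uniform random choice of DN.
>   }
> }
> {code}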
> I recommend that the ThreadLocalRandom part of HDFS-6268 be reverted until a
> better algorithm can be implemented and tested. Testing should include a
> case with many DNs and a small number of blocks on each.
> It should also be noted that even pre-HDFS-6268, the random DN-choice
> algorithm produces a rather non-uniform distribution of copies. This is not
> due to any bug, but purely a case of random distributions being much less
> uniform than one might intuitively expect. In the above case, pre-HDFS-6268
> yields something like a range of 3 to 25 block copies on each DN.
> Surprisingly, the performance penalty of this non-uniformity is not as big as
> might be expected (maybe only 10-20%), but HDFS should do better, and in any
> case the capacity issue remains. Round-robin choice of DN? Better awareness
> of which DNs currently store fewer blocks? It's not sufficient that the total
> number of blocks be similar on each DN at the end; at each point in time, no
> individual DN should receive a disproportionate number of blocks at once
> (which could be a danger of an RR algorithm).
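> The 3-to-25 spread is about what a balls-in-bins view predicts. A throwaway
> simulation (not HDFS code) of ~12 copies per DN spread uniformly at random
> over 638 DNs:
> {code:java}
> import java.util.Random;
>
> public class RandomSpreadDemo {
>   public static void main(String[] args) {
>     int dataNodes = 638;
>     int copies = dataNodes * 12;     // ~12 non-primary copies per DN on average
>     int[] perDn = new int[dataNodes];
>     Random rng = new Random();       // one well-behaved shared generator
>     for (int i = 0; i < copies; i++) {
>       perDn[rng.nextInt(dataNodes)]++;
>     }
>     int min = Integer.MAX_VALUE, max = 0;
>     for (int c : perDn) {
>       min = Math.min(min, c);
>       max = Math.max(max, c);
>     }
>     // Typically prints roughly min=3, max=25, matching the pre-HDFS-6268
>     // spread described above: uneven, but nothing like 0-to-100+.
>     System.out.printf("min = %d, max = %d copies on a single DN%n", min, max);
>   }
> }
> {code}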
> This JIRA should probably be limited to tracking the ThreadLocalRandom issue,
> with the random-choice non-uniformity tracked in a separate one.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)