[ https://issues.apache.org/jira/browse/HDFS-343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12747717#action_12747717 ]
Brian Bockelman commented on HDFS-343:
--------------------------------------

Been there, done that - bad idea. The following are the biggest bad effects:

1) You completely hammer new nodes. I've got a system that implements this policy, and it's difficult to bring up new nodes because they are immediately crushed by the onslaught of transfers.

2) If you don't place blocks randomly, you embed patterns in your system. For example, if a dataset is placed on HDFS and the policy puts it on a non-random subset of nodes, the likelihood of overloading those nodes later by analyzing the whole dataset at once is pretty high. A random placement policy spreads the dataset out much better, decreasing the correlations in the system.

This is from practical experience. Random placement is very, very good - especially for larger systems. If you want random placement plus balanced nodes, run the balancer to smooth things out asynchronously. If you absolutely want this behavior, check into the pluggable placement framework in the 0.21.x series (a sketch of the proposed weighting, and where it would plug in, follows the quoted issue below).

> Better Target selection for block replication
> ---------------------------------------------
>
>                 Key: HDFS-343
>                 URL: https://issues.apache.org/jira/browse/HDFS-343
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Enis Soztutar
>
> The block replication policy tends to balance the number of blocks on each datanode in the long run. However, on heterogeneous clusters with varying numbers of disks per node, nodes with one disk fill up quickly while nodes with 3 disks still have 60% free disk space. This also reduces the advantage of using more than one disk for parallel IO, since machines with multiple disks are not used as much.
>
> The javadoc of ReplicationTargetChooser reads:
>
> The replica placement strategy is that if the writer is on a datanode, the 1st replica is placed on the local machine, otherwise on a random datanode. The 2nd replica is placed on a datanode that is on a different rack. The 3rd replica is placed on a datanode which is on the same rack as the first replica.
>
> I think we should switch to a policy that balances the percentage of disk usage rather than the total block count among the datanodes. This can be done by defining each datanode's selection probability based on its percent disk usage. A formula like 1 - (percent_usage / 100) seems reasonable.
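To make the proposed selection rule concrete, here is a minimal sketch of usage-weighted target selection: each candidate datanode is picked with probability proportional to 1 - (percent_usage / 100), i.e. its free-space fraction. This is standalone, illustrative Java, not HDFS code - NodeUsage and pickTarget are hypothetical names, and real usage figures would come from the namenode's datanode reports.

{code:java}
import java.util.List;
import java.util.Random;

/**
 * Illustrative sketch: choose a replication target with probability
 * proportional to 1 - (percentUsage / 100). Hypothetical names, not
 * an HDFS API.
 */
public class UsageWeightedChooser {

  /** Minimal stand-in for a datanode's disk-usage statistics. */
  public static class NodeUsage {
    public final String name;
    public final double percentUsage; // 0.0 .. 100.0

    public NodeUsage(String name, double percentUsage) {
      this.name = name;
      this.percentUsage = percentUsage;
    }
  }

  private final Random random = new Random();

  /** Roulette-wheel selection over the nodes' free-space fractions. */
  public NodeUsage pickTarget(List<NodeUsage> candidates) {
    double totalWeight = 0.0;
    for (NodeUsage n : candidates) {
      totalWeight += 1.0 - (n.percentUsage / 100.0);
    }
    if (totalWeight <= 0.0) {
      return null; // every candidate is full (or the list is empty)
    }
    double r = random.nextDouble() * totalWeight; // r in [0, totalWeight)
    for (NodeUsage n : candidates) {
      r -= 1.0 - (n.percentUsage / 100.0);
      if (r < 0.0) {
        return n;
      }
    }
    return candidates.get(candidates.size() - 1); // floating-point edge case
  }
}
{code}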
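A toy driver for the hypothetical chooser above, showing how strongly the weighting favors an empty node (names and usage figures are made up):

{code:java}
import java.util.Arrays;
import java.util.List;

/** Toy run of the hypothetical UsageWeightedChooser sketched above. */
public class ChooserDemo {
  public static void main(String[] args) {
    UsageWeightedChooser chooser = new UsageWeightedChooser();
    List<UsageWeightedChooser.NodeUsage> nodes = Arrays.asList(
        new UsageWeightedChooser.NodeUsage("dn-old", 95.0),  // weight 0.05
        new UsageWeightedChooser.NodeUsage("dn-mid", 40.0),  // weight 0.60
        new UsageWeightedChooser.NodeUsage("dn-new",  0.0)); // weight 1.00

    int[] hits = new int[nodes.size()];
    for (int i = 0; i < 100000; i++) {
      hits[nodes.indexOf(chooser.pickTarget(nodes))]++;
    }

    // Expected shares are weight / 1.65: roughly 3%, 36% and 61%.
    // The empty "new" node absorbs most writes - exactly the
    // "hammer new nodes" effect from point 1 of the comment.
    for (int i = 0; i < nodes.size(); i++) {
      System.out.println(nodes.get(i).name + ": " + hits[i]);
    }
  }
}
{code}

If the weighted behavior is still wanted after that caveat, the natural home for it in the 0.21.x series is a custom subclass of the pluggable placement policy mentioned in the comment (selected, if memory serves, through the dfs.block.replicator.classname configuration key) rather than a patch to the default chooser.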