[ https://issues.apache.org/jira/browse/HDFS-1094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883632#action_12883632 ]
dhruba borthakur commented on HDFS-1094:
----------------------------------------

@Koji: we have files with a replication factor of 3. If a large number of datanodes fail at the same time, we do see missing blocks. Sometimes the datanode process on these machines fails to start even after repeated start-dfs.sh attempts; sometimes the entire machine fails to reboot. We then have to manually repair a few of those bad datanode machines and bring them back online. This fixes the "missing blocks" problem, but it is a manual and painful process.

> Intelligent block placement policy to decrease probability of block loss
> ------------------------------------------------------------------------
>
>                 Key: HDFS-1094
>                 URL: https://issues.apache.org/jira/browse/HDFS-1094
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: name-node
>            Reporter: dhruba borthakur
>            Assignee: Rodrigo Schmidt
>         Attachments: prob.pdf, prob.pdf
>
>
> The current HDFS implementation specifies that the first replica is local and
> the other two replicas are on two random nodes on a random remote rack.
> This means that if any three datanodes die together, there is a
> non-trivial probability of losing at least one block in the cluster. This
> JIRA is to discuss whether there is a better algorithm that can lower the
> probability of losing a block.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
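A rough back-of-the-envelope sketch of why this probability is non-trivial. Assuming each block's replicas land on a uniformly random 3-subset of nodes (a simplification that ignores HDFS's rack-awareness; the node and block counts below are illustrative, not from this JIRA): if 3 specific nodes die, a given block is lost with probability 1/C(n,3), and with many blocks the chance that at least one block is lost approaches certainty.

```python
from math import comb

def p_block_loss(n_nodes, n_blocks, r=3):
    """Probability that r simultaneous node failures destroy all r
    replicas of at least one block, under fully random placement.
    Simplifying assumptions: replicas of each block occupy a uniformly
    random r-subset of nodes, and placements of different blocks are
    independent (ignores rack-aware placement)."""
    p_one = 1 / comb(n_nodes, r)        # one given block sits entirely on the failed set
    return 1 - (1 - p_one) ** n_blocks  # at least one such block exists

# Illustrative numbers: 100 datanodes, 1 million blocks, replication 3.
# C(100, 3) = 161700, so each block is lost with probability ~6.2e-6,
# yet across a million blocks, losing at least one is nearly certain.
print(p_block_loss(100, 1_000_000))
```

This is exactly the effect the issue describes: random placement spreads replica sets over so many distinct node triples that almost any 3-node failure hits some block, whereas a placement policy that confines replicas to fewer node groups trades a higher per-failure loss count for a much lower probability that any given failure loses data.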