[ https://issues.apache.org/jira/browse/HDFS-1094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886823#action_12886823 ]

Rodrigo Schmidt commented on HDFS-1094:
---------------------------------------

@Joydeep: Using a term to describe the groups is a great idea. Just bear in 
mind that we are not dividing the cluster into exclusive groups. Each block has 
a limited set of nodes on which it can be replicated, but separate blocks 
may have different sets of nodes associated with them, and the intersection 
between these sets may be non-empty.
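
To make the overlap concrete, here is a toy example (the rack names and the 
assumption that the window extends forward from the first rack are mine, just 
for illustration):

{code}
R = 2, racks in ring order: r0 r1 r2 ... r9

block A: first replica on rack r3 -> candidate racks {r3, r4, r5}
block B: first replica on rack r4 -> candidate racks {r4, r5, r6}

The two candidate sets intersect on {r4, r5}, but neither contains the
other -- the "groups" are overlapping windows, not a partition.
{code}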

Here is a sketch of the algorithm to make things a little clearer. This is 
just a simplification; I am not describing several corner cases and 
reconfiguration scenarios.

{code}
Configuration parameters:
   - R: rack window, the maximum distance from the initial rack at which
     replicas may be placed
   - M: machine window, the number of consecutive machines within a rack
     that may hold replicas

Whenever network topology changes:
   - Sort racks into a logical ring, based on rack name
   - Sort nodes within each rack into logical rings, based on node names

For the first replica:
   - Write to the local machine, if possible; otherwise pick a random one

For the second replica:
   - Let r be the rack in which the first replica was placed
   - Let i be the index of the machine in r that keeps the first replica
   - Pick a random rack r2 that is within R racks of r
   - Pick a random machine m2 in r2 that is within the window [i, (i+M-1) % racksize]
   - Place the replica on m2

For the third replica:
   - Given the steps above, pick another random machine m3 in r2 within the
     same window used for m2
   - Make sure m2 != m3
   - Place the replica on m3

{code}
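
To make the sketch above concrete, here is a minimal Java version of the 
window-based selection for the second and third replicas. All names 
(WindowedPlacementPolicy, chooseSecondAndThird) are hypothetical, not the 
actual HDFS BlockPlacementPolicy API. It assumes racks and machines are 
already sorted into their logical rings, R >= 1 and M >= 2, and it picks the 
second rack at distance 1..R so that replicas 2 and 3 land off the first 
replica's rack:

{code}
import java.util.List;
import java.util.Random;

/**
 * Illustrative sketch of the windowed placement described above.
 * racks.get(k) is rack k's machines, already in logical ring order.
 * Class and method names are hypothetical.
 */
public class WindowedPlacementPolicy {
    private final int rackWindow;    // R: max rack distance for replicas 2 and 3
    private final int machineWindow; // M: machine window size within a rack
    private final Random random = new Random();

    public WindowedPlacementPolicy(int rackWindow, int machineWindow) {
        this.rackWindow = rackWindow;       // assumed >= 1
        this.machineWindow = machineWindow; // assumed >= 2
    }

    /** Picks machines for replicas 2 and 3, given where replica 1 landed. */
    public String[] chooseSecondAndThird(List<List<String>> racks,
                                         int firstRack, int firstMachine) {
        // Second replica's rack: distance 1..R from the first rack on the ring,
        // so it always lands off the first replica's rack.
        int r2 = (firstRack + 1 + random.nextInt(rackWindow)) % racks.size();
        List<String> machines = racks.get(r2);
        int rackSize = machines.size();

        // Effective window: [i, (i + M - 1) % rackSize], capped by rack size.
        int windowSize = Math.min(machineWindow, rackSize);

        // Second replica: a random machine inside the window.
        int m2 = (firstMachine + random.nextInt(windowSize)) % rackSize;

        // Third replica: same rack, same window, but a different machine
        // (possible only when the effective window holds more than one machine).
        int m3 = m2;
        while (m3 == m2 && windowSize > 1) {
            m3 = (firstMachine + random.nextInt(windowSize)) % rackSize;
        }
        return new String[] { machines.get(m2), machines.get(m3) };
    }
}
{code}

The important property is that, given where the first replica lands, the 
remaining replicas can only fall inside a bounded window, so the cluster uses 
far fewer distinct replica sets than uniformly random placement would.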

I hope this explanation helps clear up the confusion.


> Intelligent block placement policy to decrease probability of block loss
> ------------------------------------------------------------------------
>
>                 Key: HDFS-1094
>                 URL: https://issues.apache.org/jira/browse/HDFS-1094
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: name-node
>            Reporter: dhruba borthakur
>            Assignee: Rodrigo Schmidt
>         Attachments: prob.pdf, prob.pdf
>
>
> The current HDFS implementation specifies that the first replica is local and 
> the other two replicas are on any two random nodes on a random remote rack. 
> This means that if any three datanodes die together, then there is a 
> non-trivial probability of losing at least one block in the cluster. This 
> JIRA is to discuss if there is a better algorithm that can lower probability 
> of losing a block.
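
As a rough illustration of that "non-trivial probability", here is a hedged 
back-of-envelope calculation. It assumes each block's three replicas form a 
uniformly random node triple, which simplifies the actual policy (two replicas 
share a remote rack), and the cluster and block counts below are made up:

{code}
/**
 * Back-of-envelope estimate: if each block's 3 replicas were a uniformly
 * random triple out of N nodes, a given simultaneous 3-node failure wipes
 * a block with probability 1 / C(N,3). All numbers are hypothetical.
 */
public class BlockLossEstimate {
    public static void main(String[] args) {
        double nodes = 1000;          // N: datanodes in the cluster (assumed)
        double blocks = 100_000_000;  // B: blocks in the cluster (assumed)

        // C(N,3): number of distinct 3-node replica sets.
        double triples = nodes * (nodes - 1) * (nodes - 2) / 6.0;

        // Expected blocks losing all 3 replicas when 3 random nodes die.
        double expectedLost = blocks / triples;

        // P(at least one block lost) = 1 - (1 - 1/C(N,3))^B.
        double pLoss = 1 - Math.pow(1 - 1 / triples, blocks);

        System.out.printf("expected lost blocks: %.3f%n", expectedLost);
        System.out.printf("P(>=1 block lost):    %.3f%n", pLoss);
    }
}
{code}

With these numbers a 3-node failure loses about 0.6 blocks in expectation, 
i.e. roughly a 45% chance of losing at least one block. A windowed policy 
uses only a small fraction of all possible triples as replica sets, so most 
3-node failures destroy no block at all.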

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
