[ 
https://issues.apache.org/jira/browse/HDFS-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Shvachko updated HDFS-3368:
--------------------------------------

         Description: All replicas of a block can be removed if bad DataNodes 
come up and down during cluster restart resulting in data loss.  (was: All 
replicas of a block can be removed if bad DataNodes come up and down during 
cluter restart resulting in data loss.)
    Target Version/s: 0.22.1, 2.0.0, 3.0.0  (was: 3.0.0, 2.0.0, 0.22.1)

- A block b has 3 replicas initially located on DNs do1, do2, do3.
- At different times all three nodes malfunctioned and died, causing the 
replicas to be migrated to dn1, dn2, dn3.
- do1, do2, do3 were not added to the exclude list.
So when the cluster restarts, do1, do2, do3 are brought up along with dn1, dn2, 
dn3. 
- NN sees 6 replicas for block b and correctly decides to remove 3 of them.
{{BlockPlacementPolicyDefault.chooseReplicaToDelete()}} selects three targets 
to be deleted based on the free space remaining on the DNs deemed to possess 
replicas. 
dn1, dn2, dn3 are the most likely targets for replica deletion because 
they have been on the cluster longer than do1, do2, do3 and therefore are 
likely to have less free space.
- As expected, do1, do2, do3 malfunction again and go down shortly after 
reporting their blocks to NN.
- It will take 10 minutes for NN to recognize the fact that do1, do2, do3 are 
dead. By that time replicas will be removed from the good nodes, resulting in 
data loss.
This is a real story seen in production.
I verified that all major versions are affected.
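
The scenario above can be reproduced with a small simulation. This is only an 
illustrative sketch, not HDFS code: the DataNode class, the free-space figures 
and both helper functions are hypothetical, and the selection rule is reduced 
to "delete from the nodes with the least free space", ignoring rack placement 
and anything else the real policy may consider.

```python
from dataclasses import dataclass


@dataclass
class DataNode:
    name: str
    free_space_gb: int  # made-up figure, for illustration only
    healthy: bool       # False = node dies again shortly after its block report


def choose_replicas_to_delete(holders, replication):
    """Pick the excess replicas, preferring nodes with the least free space."""
    excess = len(holders) - replication
    if excess <= 0:
        return []
    return sorted(holders, key=lambda dn: dn.free_space_gb)[:excess]


def surviving_replicas(holders, replication):
    """Names of nodes still serving the block after deletion and node failures."""
    doomed = {dn.name for dn in choose_replicas_to_delete(holders, replication)}
    kept = [dn for dn in holders if dn.name not in doomed]
    # The bad nodes go down again before NN notices they are dead.
    return [dn.name for dn in kept if dn.healthy]


# do1..do3: bad nodes, nearly empty because they were out of the cluster.
old_nodes = [DataNode(f"do{i}", 900, healthy=False) for i in (1, 2, 3)]
# dn1..dn3: good nodes that have been filling up in the meantime.
new_nodes = [DataNode(f"dn{i}", 200 + 10 * i, healthy=True) for i in (1, 2, 3)]
holders = old_nodes + new_nodes

deleted = [dn.name for dn in choose_replicas_to_delete(holders, replication=3)]
survivors = surviving_replicas(holders, replication=3)
print("deleted from:", deleted)
print("survivors:", survivors)
```

With these assumed numbers the three replicas are deleted from dn1, dn2, dn3, 
and no replica survives once do1, do2, do3 go down again.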
                
> Missing blocks due to bad DataNodes comming up and down.
> --------------------------------------------------------
>
>                 Key: HDFS-3368
>                 URL: https://issues.apache.org/jira/browse/HDFS-3368
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.22.0, 1.0.0, 2.0.0, 3.0.0
>            Reporter: Konstantin Shvachko
>            Assignee: Konstantin Shvachko
>
> All replicas of a block can be removed if bad DataNodes come up and down 
> during cluster restart resulting in data loss.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
