[ https://issues.apache.org/jira/browse/HADOOP-572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sameer Paranjpye resolved HADOOP-572.
-------------------------------------

       Resolution: Fixed
    Fix Version/s: 0.8.0

> Chain reaction in a big cluster caused by simultaneous failure of only a few 
> data-nodes.
> ----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-572
>                 URL: https://issues.apache.org/jira/browse/HADOOP-572
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.6.2
>         Environment: Large dfs cluster
>            Reporter: Konstantin Shvachko
>         Assigned To: Sameer Paranjpye
>             Fix For: 0.8.0
>
>
> I've observed a cluster crash caused by the simultaneous failure of only
> 3 data-nodes.
> The crash is reproducible. In order to reproduce it you need a rather
> large cluster.
> To simplify calculations I'll consider a 600-node cluster as an example.
> The cluster should also contain a substantial amount of data:
> we will need at least 3 data-nodes containing 10,000+ blocks each.
> Now suppose that these 3 data-nodes fail at the same time, and the
> name-node starts replicating all missing blocks belonging to them.
> Based on experimental data, the name-node can replicate 50 blocks per
> second on average.
> This means it will take more than 10 minutes, which is the heartbeat
> expiration interval, to replicate all 30,000+ blocks.
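
(Quick check of the arithmetic above, assuming exactly 30,000 blocks and a
steady 50 blocks/s: 30,000 / 50 = 600 s = 10 min; with 10,000+ blocks per
node the total is somewhat longer, hence "more than 10 minutes".)
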
> With the 3-second heartbeat interval there are 600 / 3 = 200 heartbeats
> hitting the name-node every second.
> Under heavy replication load the name-node accepts about 50 heartbeats
> per second.
> So at most 3/4 of all heartbeats remain unserved.
> Each node SHOULD send 200 heartbeats during the 10-minute interval, and
> each time the probability of the heartbeat going unserved is 3/4 or less.
> So the probability of all 200 heartbeats failing is (3/4)^200 = 0 from a
> practical standpoint.
> IN FACT, since the current implementation sets the RPC timeout to
> 1 minute, a failed heartbeat takes 1 minute and 8 seconds to complete,
> and under these circumstances each data-node can send only about 9
> heartbeats during the 10-minute interval. Thus, the probability of all
> 9 of them failing is (3/4)^9 = 0.075, which means that we will lose
> about 45 nodes out of 600 by the end of the 10-minute interval.
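
Those figures are easy to verify. Here is a minimal back-of-the-envelope
sketch in plain Java (the class name is made up, and nothing below touches
Hadoop APIs; the ~68 s per failed heartbeat is taken from the report):

    // Sanity-check of the probabilities quoted above. All constants
    // come from the report; this is not Hadoop code.
    public class HeartbeatBackOfEnvelope {
        public static void main(String[] args) {
            int nodes = 600;              // cluster size
            int expirySec = 600;          // 10-minute heartbeat expiration
            double pMiss = 3.0 / 4.0;     // chance a heartbeat goes unserved

            // Ideal case: one heartbeat every 3 s -> 200 attempts per node.
            int ideal = expirySec / 3;
            System.out.printf("P(all %d miss) = %.2e%n",
                              ideal, Math.pow(pMiss, ideal));  // ~1e-25

            // Actual case: a failed heartbeat takes ~68 s because of the
            // 1-minute RPC timeout, so only ~9 attempts fit in the window.
            int actual = (int) Math.ceil(expirySec / 68.0);    // = 9
            double pExpire = Math.pow(pMiss, actual);          // ~0.075
            System.out.printf("P(all %d miss) = %.3f -> ~%.0f of %d nodes%n",
                              actual, pExpire, pExpire * nodes, nodes);
        }
    }

Running it prints ~1e-25 for the ideal case and 0.075 (about 45 of the
600 nodes) for the 9-attempt case, matching the numbers in the report.
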
> From this point on the name-node will be constantly replicating blocks
> and losing more nodes, and it becomes effectively dysfunctional.
> A map-reduce framework running on top of it makes things deteriorate
> even faster, because failing tasks and jobs try to remove files and
> re-create them, increasing the overall load on the name-node.
> I see at least 2 problems that contribute to the chain reaction
> described above.
> 1. A heartbeat failure takes too long (1 minute 8 seconds).
> 2. The name-node's synchronized operations should be more fine-grained.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
