[
https://issues.apache.org/jira/browse/HADOOP-572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12462059
]
Sameer Paranjpye commented on HADOOP-572:
-----------------------------------------
This issue has been addressed by the following fixes:
HADOOP-725
HADOOP-641
HADOOP-642
HADOOP-255
> Chain reaction in a big cluster caused by simultaneous failure of only a few
> data-nodes.
> ----------------------------------------------------------------------------------------
>
> Key: HADOOP-572
> URL: https://issues.apache.org/jira/browse/HADOOP-572
> Project: Hadoop
> Issue Type: Bug
> Components: dfs
> Affects Versions: 0.6.2
> Environment: Large dfs cluster
> Reporter: Konstantin Shvachko
> Assigned To: Sameer Paranjpye
>
> I've observed a cluster crash caused by simultaneous failure of only 3
> data-nodes.
> The crash is reproducable. In order to reproduce it you need a rather large
> cluster.
> To simplify calculations I'll consider a 600 node cluster as an example.
> The cluster should also contain a substantial amount of data.
> We will need at least 3 data-nodes containing 10,000+ blocks each.
> Now suppose that these 3 data-nodes fail at the same time, and the name-node
> started replicating all missing blocks belonging to the nodes.
> The name-node can replicate 50 blocks per second on average based on
> experimental data.
> Meaning, it will take more than 10 minutes, which is the heartbeat expiration
> interval,
> to replicates all 30,000+ blocks.
> With the 3 second heartbeat interval there are 600 / 3 = 200 heartbeats
> hitting the name-node every second.
> Under heavy replication load the name-node accepts about 50 heartbeats per
> second.
> So at most 3/4 of all heartbeats remain unserved.
> Each node SHOULD send 200 heartbeats during the 10 minute interval, and every
> time the probability
> of the heartbeat being unserved is 3/4 or less.
> So the probability of failing of all 200 heartbeats is (3/4) ** 200 = 0 from
> the practical standpoint.
> IN FACT since current implementation sets the rpc timeout to 1 minute, a
> failed heartbeat takes
> 1 minute and 8 seconds to complete, and under this circumstances each
> data-node can send only
> 9 heartbeats during the 10 minute interval. Thus, the probability of failing
> of all 9 of them is 0.075,
> which means that we will loose 45 nodes out of 600 at the end of the 10
> minute interval.
> From this point the name-node will be constantly replicating blocks and
> loosing more nodes, and
> becomes effectively dysfunctional.
> A map-reduce framework running on top of it makes things deteriorate even
> faster, because failing
> tasks and jobs are trying to remove files and re-create them again increasing
> the overall load on
> the name-node.
> I see at least 2 problems that contribute to the chain reaction described
> above.
> 1. A heartbeat failure takes too long (1'8").
> 2. Name-node synchronized operations should be fine-grained.
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira