[ https://issues.apache.org/jira/browse/HADOOP-572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sameer Paranjpye resolved HADOOP-572.
-------------------------------------

    Resolution: Fixed
 Fix Version/s: 0.8.0

> Chain reaction in a big cluster caused by simultaneous failure of only a few data-nodes.
> -----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-572
>                 URL: https://issues.apache.org/jira/browse/HADOOP-572
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.6.2
>         Environment: Large dfs cluster
>            Reporter: Konstantin Shvachko
>         Assigned To: Sameer Paranjpye
>             Fix For: 0.8.0
>
>
> I've observed a cluster crash caused by the simultaneous failure of only 3 data-nodes.
> The crash is reproducible. In order to reproduce it you need a rather large cluster.
> To simplify the calculations I'll consider a 600-node cluster as an example.
> The cluster should also contain a substantial amount of data.
> We will need at least 3 data-nodes containing 10,000+ blocks each.
> Now suppose that these 3 data-nodes fail at the same time, and the name-node
> starts replicating all missing blocks that belonged to those nodes.
> Based on experimental data, the name-node can replicate 50 blocks per second on average.
> This means it will take more than 10 minutes, which is the heartbeat expiration interval,
> to replicate all 30,000+ blocks.
> With the 3-second heartbeat interval there are 600 / 3 = 200 heartbeats
> hitting the name-node every second.
> Under heavy replication load the name-node accepts about 50 heartbeats per second,
> so as many as 3/4 of all heartbeats remain unserved.
> Each node SHOULD send 200 heartbeats during the 10-minute interval, and each time the
> probability of a heartbeat going unserved is 3/4 or less.
> So the probability that all 200 heartbeats fail is (3/4)^200 = 0 from a practical standpoint.
> IN FACT, since the current implementation sets the RPC timeout to 1 minute, a failed
> heartbeat takes 1 minute and 8 seconds to complete, and under these circumstances each
> data-node can send only 9 heartbeats during the 10-minute interval. Thus, the probability
> that all 9 of them fail is (3/4)^9 = 0.075, which means that we will lose 45 nodes out of
> 600 by the end of the 10-minute interval.
> From that point on the name-node is constantly replicating blocks and losing more nodes,
> and becomes effectively dysfunctional.
> A map-reduce framework running on top of it makes things deteriorate even faster, because
> failing tasks and jobs keep removing files and re-creating them, increasing the overall
> load on the name-node.
> I see at least 2 problems that contribute to the chain reaction described above.
> 1. A heartbeat failure takes too long (1'8").
> 2. Name-node synchronized operations should be fine-grained.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
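
The arithmetic in the description above can be checked with the back-of-the-envelope sketch below. It is only an illustration of the calculation under the example's stated assumptions (600 nodes, 3-second heartbeats, a 68-second failed-heartbeat cycle, roughly 50 served heartbeats per second); the class name and constants are made up for this sketch and are not part of the Hadoop code base.

    /**
     * Back-of-the-envelope check of the heartbeat-starvation arithmetic in
     * HADOOP-572. All names and constants are illustrative assumptions taken
     * from the example in the report, not Hadoop code.
     */
    public class HeartbeatStarvationEstimate {
        public static void main(String[] args) {
            int clusterSize          = 600;     // data-nodes in the example cluster
            int failedNodes          = 3;       // simultaneously failed data-nodes
            int blocksPerFailedNode  = 10000;   // blocks hosted on each failed node
            double blocksPerSecond   = 50.0;    // observed name-node replication rate
            double heartbeatInterval = 3.0;     // seconds between heartbeats per node
            double expiryInterval    = 600.0;   // heartbeat expiration interval (10 min)
            double servedPerSecond   = 50.0;    // heartbeats the loaded name-node can serve
            double failedHeartbeat   = 68.0;    // seconds a timed-out heartbeat RPC occupies

            // Replicating the missing blocks takes at least the whole expiration window.
            double replicationTime = failedNodes * blocksPerFailedNode / blocksPerSecond;
            System.out.printf("replication time: %.0f s (window: %.0f s)%n",
                              replicationTime, expiryInterval);

            // Fraction of heartbeats the name-node cannot serve under this load.
            double arrivalRate = clusterSize / heartbeatInterval;        // 200 per second
            double pUnserved   = 1.0 - servedPerSecond / arrivalRate;    // 0.75

            // With a 68 s cycle per failed heartbeat, only ~9 attempts fit in 10 minutes.
            int attempts = (int) Math.round(expiryInterval / failedHeartbeat);  // ~9

            // Probability that every attempt fails, and the expected number of
            // data-nodes the name-node declares dead as a result.
            double pAllFail     = Math.pow(pUnserved, attempts);         // ~0.075
            double expectedLost = clusterSize * pAllFail;                // ~45

            System.out.printf("P(unserved) = %.2f, attempts = %d, "
                            + "P(all fail) = %.3f, expected lost nodes = %.0f%n",
                              pUnserved, attempts, pAllFail, expectedLost);
        }
    }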