    [ http://issues.apache.org/jira/browse/HADOOP-572?page=comments#action_12439376 ]
Doug Cutting commented on HADOOP-572:
-------------------------------------

A namenode that drops 75% of its requests for 10 minutes is a problem.  I think 
the first thing to do is to control the replication rate, so that fewer than 50 
replications are attempted per second.  This is fairly simple to do, since the 
namenode controls the issuance of replication requests.  For example, it can 
limit the number of outstanding replications, which will effectively control 
the rate.
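
For illustration, here is a minimal sketch of the "limit the number of
outstanding replications" idea; the class and method names below are
hypothetical and not taken from the current namenode code:

    import java.util.concurrent.Semaphore;

    /**
     * Illustrative sketch only: bound the number of block replications the
     * namenode hands out at any one time, which indirectly bounds the
     * replication rate.
     */
    public class ReplicationThrottle {
      private final Semaphore slots;

      public ReplicationThrottle(int maxOutstanding) {
        this.slots = new Semaphore(maxOutstanding);
      }

      /** Try to claim a slot before scheduling a block replication; if none
       *  is free, the block simply stays on the pending list for later. */
      public boolean tryStartReplication() {
        return slots.tryAcquire();
      }

      /** Called when a datanode reports the replication completed (or it is
       *  declared failed), freeing the slot for the next pending block. */
      public void replicationFinished() {
        slots.release();
      }
    }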

Think of it this way: the namenode's observed current capacity is 200
heartbeats per second and 50 block replications per second.  We're attempting
more than 50 replications while still attempting 200 heartbeats, and many of
the heartbeats are failing to arrive in a timely manner (as are probably many
of the replication reports, but those are less critical).  Retrying heartbeats
sooner will only increase the load on the namenode, aggravating the problem.

The other thing to do is to limit the heartbeat traffic.  Currently, heartbeat
traffic is proportional to cluster size, which is not scalable.  As a simple
measure, we can make the heartbeat interval configurable.  Longer term, we can
make it adaptive.  Longer still, we could even consider inverting the control,
so that the namenode pings datanodes to check whether they're alive and hands
them work.
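
As a rough sketch of the configurable-interval piece (the property name and
default below are illustrative, assuming the usual Configuration accessors):

    import org.apache.hadoop.conf.Configuration;

    public class HeartbeatSettings {
      // Illustrative key and default; not necessarily what a patch would use.
      public static final String HEARTBEAT_INTERVAL_KEY = "dfs.heartbeat.interval";
      public static final long DEFAULT_HEARTBEAT_INTERVAL_SECONDS = 3;

      /** Heartbeat interval in seconds, taken from the configuration if set. */
      public static long getHeartbeatIntervalSeconds(Configuration conf) {
        return conf.getLong(HEARTBEAT_INTERVAL_KEY,
                            DEFAULT_HEARTBEAT_INTERVAL_SECONDS);
      }
    }

An adaptive scheme could later derive the interval from the number of
registered datanodes instead of a fixed value.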

Another long-term fix would of course be to improve the namenode's performance 
and lessen its bottlenecks, so that it can handle more requests per second.  
But no matter how much we do this, we still need to make sure that all request 
rates are limited, and do not increase linearly with cluster size.

> Chain reaction in a big cluster caused by simultaneous failure of only a few 
> data-nodes.
> ----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-572
>                 URL: http://issues.apache.org/jira/browse/HADOOP-572
>             Project: Hadoop
>          Issue Type: Bug
>    Affects Versions: 0.6.2
>         Environment: Large dfs cluster
>            Reporter: Konstantin Shvachko
>
> I've observed a cluster crash caused by the simultaneous failure of only 3
> data-nodes.
> The crash is reproducible. In order to reproduce it you need a rather large
> cluster.
> To simplify calculations I'll consider a 600-node cluster as an example.
> The cluster should also contain a substantial amount of data.
> We will need at least 3 data-nodes containing 10,000+ blocks each.
> Now suppose that these 3 data-nodes fail at the same time, and the name-node
> starts replicating all missing blocks belonging to those nodes.
> The name-node can replicate 50 blocks per second on average, based on
> experimental data.
> That means it will take more than 10 minutes, which is the heartbeat
> expiration interval, to replicate all 30,000+ blocks.
> With the 3 second heartbeat interval there are 600 / 3 = 200 heartbeats 
> hitting the name-node every second.
> Under heavy replication load the name-node accepts about 50 heartbeats per 
> second.
> So at most 3/4 of all heartbeats remain unserved.
> Each node SHOULD send 200 heartbeats during the 10-minute interval, and each
> time the probability of the heartbeat going unserved is 3/4 or less.
> So the probability of all 200 heartbeats failing is (3/4) ** 200, which is 0
> from a practical standpoint.
> IN FACT, since the current implementation sets the RPC timeout to 1 minute, a
> failed heartbeat takes 1 minute and 8 seconds to complete, and under these
> circumstances each data-node can send only 9 heartbeats during the 10-minute
> interval. Thus, the probability of all 9 of them failing is (3/4) ** 9 = 0.075,
> which means that we will lose about 45 nodes out of 600 by the end of the
> 10-minute interval.
> From this point on the name-node will be constantly replicating blocks and
> losing more nodes, and becomes effectively dysfunctional.
> A map-reduce framework running on top of it makes things deteriorate even
> faster, because failing tasks and jobs keep trying to remove files and
> re-create them, further increasing the overall load on the name-node.
> I see at least 2 problems that contribute to the chain reaction described 
> above.
> 1. A heartbeat failure takes too long (1'8").
> 2. Name-node synchronized operations should be fine-grained.
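
For what it's worth, the arithmetic in the quoted report checks out; here is a
tiny standalone check using the numbers stated above (75% drop rate, 9
heartbeats per node in the 10-minute window, 600 nodes):

    public class HeartbeatLossCheck {
      public static void main(String[] args) {
        double dropProbability = 0.75;   // 150 of 200 heartbeats/sec go unserved
        int heartbeatsInWindow = 9;      // per node, given the 1'8" failure path
        int clusterSize = 600;

        // Probability that every heartbeat from one datanode is dropped.
        double allFail = Math.pow(dropProbability, heartbeatsInWindow);
        double expectedLostNodes = clusterSize * allFail;

        System.out.printf("P(all %d heartbeats fail) = %.3f%n",
            heartbeatsInWindow, allFail);          // prints 0.075
        System.out.printf("Expected nodes declared dead: ~%.0f of %d%n",
            expectedLostNodes, clusterSize);       // prints ~45
      }
    }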

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira