[ 
https://issues.apache.org/jira/browse/HDFS-3703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425305#comment-13425305
 ] 

Jing Zhao commented on HDFS-3703:
---------------------------------

With respect to Suresh's proposal:
If the time elapsed since the namenode last processed a heartbeat from a 
datanode exceeds a certain threshold T, consider that datanode stale. 
For writes, do not use such stale datanodes (if possible). For reads, put such 
stale datanodes at the end of the list.
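
A minimal sketch of the staleness rule above, written in Python for brevity; it is not HDFS code. The names Datanode, is_stale, order_for_read and candidates_for_write are hypothetical, times are plain seconds, and the fallback to stale nodes for writes is only one possible reading of "if possible".

```python
from dataclasses import dataclass


@dataclass
class Datanode:
    name: str
    last_heartbeat: float  # time (seconds) the namenode last processed a heartbeat


def is_stale(node, now, threshold):
    """A node is stale when its last heartbeat is older than the threshold T."""
    return now - node.last_heartbeat > threshold


def order_for_read(replicas, now, threshold):
    """Keep the given order, but move stale replicas to the end of the list."""
    # sorted() is stable and False sorts before True, so fresh nodes stay first.
    return sorted(replicas, key=lambda n: is_stale(n, now, threshold))


def candidates_for_write(nodes, now, threshold, needed):
    """Prefer fresh nodes; use stale ones only if there are not enough fresh ones."""
    fresh = [n for n in nodes if not is_stale(n, now, threshold)]
    if len(fresh) >= needed:
        return fresh
    stale = [n for n in nodes if is_stale(n, now, threshold)]
    return fresh + stale  # assumption: stale nodes are a last resort, not excluded
```

For example, with T = 30s and a node whose last heartbeat is 65s old, that node is listed last for reads and is skipped for writes as long as enough fresh nodes remain.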

With this strategy, a small T for judging the stale state may create new 
hotspots on the cluster, so I propose calculating T as:
T = t_c + (number of nodes already marked as stale) / (total number of nodes) * 
(T_d - t_c),
where t_c is a constant initially set in the configuration, and T_d is 
the time after which a datanode is marked as dead (i.e., 10.5 min).

E.g., t_c can be set to 30s. When no or only a few nodes are marked as 
stale, T stays small enough to satisfy the HBase requirement. When a large 
number of nodes are marked as stale, e.g., close to the total number of 
nodes, T will be almost T_d (i.e., ~10 min), and the workload can still be 
distributed across all the live nodes.
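
The formula above can be sketched as a small function, written in Python for brevity. The function name and parameter names are illustrative only; the values 30s and 630s (10.5 min) are taken from the example in this comment.

```python
def stale_threshold(t_c, t_d, stale_nodes, total_nodes):
    """Return T = t_c + (stale_nodes / total_nodes) * (t_d - t_c).

    t_c and t_d must use the same time unit; the result uses that unit too.
    """
    if total_nodes <= 0:
        raise ValueError("total_nodes must be positive")
    if not 0 <= stale_nodes <= total_nodes:
        raise ValueError("stale_nodes must be between 0 and total_nodes")
    fraction = stale_nodes / total_nodes
    return t_c + fraction * (t_d - t_c)


# t_c = 30s, T_d = 630s (10.5 min), cluster of 100 nodes
print(stale_threshold(30, 630, 0, 100))    # 30.0  (no stale nodes: T = t_c)
print(stale_threshold(30, 630, 50, 100))   # 330.0 (half the nodes stale)
print(stale_threshold(30, 630, 100, 100))  # 630.0 (all stale: T = T_d)
```

T grows linearly with the fraction of stale nodes, so a cluster-wide problem (e.g., a network hiccup that delays most heartbeats) does not leave only a handful of nodes eligible for all reads and writes.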
                
> Decrease the datanode failure detection time
> --------------------------------------------
>
>                 Key: HDFS-3703
>                 URL: https://issues.apache.org/jira/browse/HDFS-3703
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: data-node, name-node
>    Affects Versions: 1.0.3, 2.0.0-alpha
>            Reporter: nkeywal
>            Assignee: Suresh Srinivas
>
> By default, if a box dies, the datanode will be marked as dead by the 
> namenode after 10:30 minutes. In the meantime, this datanode will still be 
> proposed by the namenode to write blocks or to read replicas. The same 
> happens if the datanode crashes: there are no shutdown hooks to tell the 
> namenode we're not there anymore.
> This is especially an issue with HBase. The HBase regionserver timeout for 
> production is often 30s. So with these configs, when a box dies HBase starts 
> to recover after 30s while, for 10 minutes, the namenode still considers the 
> blocks on that box available. Beyond the write errors, this will trigger a 
> lot of missed reads:
> - during the recovery, HBase needs to read the blocks used on the dead box 
> (the ones in the 'HBase Write-Ahead-Log')
> - after the recovery, reading these data blocks (the 'HBase region') will 
> fail 33% of the time with the default number of replicas, slowing down data 
> access, especially when the errors are socket timeouts (i.e. around 60s most 
> of the time). 
> Overall, it would be ideal if the HDFS timeouts could be set below the HBase 
> ones. As a side note, HBase relies on ZooKeeper to detect regionserver issues.


        
