[ 
https://issues.apache.org/jira/browse/HDFS-6022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jack Levin updated HDFS-6022:
-----------------------------

    Affects Version/s: 0.23.9
                       0.23.10
                       2.2.0
                       2.3.0
        Fix Version/s: 3.0.0

> Moving deadNodes from being thread local. Improving dead datanode handling in 
> DFSClient 
> ----------------------------------------------------------------------------------------
>
>                 Key: HDFS-6022
>                 URL: https://issues.apache.org/jira/browse/HDFS-6022
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs-client
>    Affects Versions: 0.23.9, 0.23.10, 2.2.0, 2.3.0
>            Reporter: Jack Levin
>              Labels: patch
>             Fix For: 3.0.0
>
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> This patch solves an issue of deadNodes list being thread local.  deadNodes 
> list is created by DFSClient when some problems with write/reading, or 
> contacting datanode exist.  The problem is that deadNodes is not visible to 
> other DFSInputStream threads, hence every DFSInputStream ends up building its 
> own deadNodes.  This affect performance of DFSClient to a large degree 
> especially when a datanode goes completely offline (there is a tcp connect 
> delay experienced by all DFSInputStream threads affecting performance of the 
> whole cluster).
> This patch moves deadNodes to be global in DFSClient class so that as soon as 
> a single DFSInputStream thread reports a dead datanode, all other 
> DFSInputStream threads are informed, negating the need to create their own 
> independent lists (concurrent Map really). 
> Further, a global deadNodes health check manager thread (DeadNodeVerifier) is 
> created to verify all dead datanodes every 5 seconds, and remove the same 
> list as soon as it is up.  That thread under normal conditions (deadNodes 
> empty) would be sleeping.  If deadNodes is not empty, the thread will attempt 
> to open tcp connection every 5 seconds to affected datanodes.
> This patch has a test (TestDFSClientDeadNodes) that is quite simple, since 
> the deadNodes creation is not affected by the patch, we only test datanode 
> removal from deadNodes by the health check manager thread.  Test will create 
> a file in dfs minicluster, read from the same file rapidly, cause datanode to 
> restart, and test is the health check manager thread does the right thing, 
> removing the alive datanode from the global deadNodes list.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

Reply via email to