[ https://issues.apache.org/jira/browse/HDFS-6022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jack Levin updated HDFS-6022:
-----------------------------
    Affects Version/s: 0.23.9
                       0.23.10
                       2.2.0
                       2.3.0
        Fix Version/s: 3.0.0

> Moving deadNodes from being thread local. Improving dead datanode handling in DFSClient
> ----------------------------------------------------------------------------------------
>
>                 Key: HDFS-6022
>                 URL: https://issues.apache.org/jira/browse/HDFS-6022
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs-client
>    Affects Versions: 0.23.9, 0.23.10, 2.2.0, 2.3.0
>            Reporter: Jack Levin
>              Labels: patch
>             Fix For: 3.0.0
>
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> This patch solves the issue of the deadNodes list being thread local. The
> deadNodes list is created by DFSClient when there are problems writing to,
> reading from, or contacting a datanode. The problem is that deadNodes is not
> visible to other DFSInputStream threads, hence every DFSInputStream ends up
> building its own deadNodes. This affects the performance of DFSClient to a
> large degree, especially when a datanode goes completely offline (all
> DFSInputStream threads experience a TCP connect delay, which affects the
> performance of the whole cluster).
> This patch moves deadNodes to be global in the DFSClient class, so that as
> soon as a single DFSInputStream thread reports a dead datanode, all other
> DFSInputStream threads are informed, removing the need for each of them to
> build its own independent list (really a concurrent Map).
> Further, a global deadNodes health check manager thread (DeadNodeVerifier) is
> created to verify all dead datanodes every 5 seconds and to remove a datanode
> from the list as soon as it is back up. Under normal conditions (deadNodes
> empty) that thread is sleeping. If deadNodes is not empty, the thread
> attempts to open a TCP connection to each affected datanode every 5 seconds.
> This patch has a test (TestDFSClientDeadNodes) that is quite simple: since
> deadNodes creation is not affected by the patch, we only test datanode
> removal from deadNodes by the health check manager thread. The test creates
> a file in a DFS minicluster, reads from the same file rapidly, causes a
> datanode to restart, and tests whether the health check manager thread does
> the right thing, removing the live datanode from the global deadNodes list.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
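
For readers who want a concrete picture of the mechanism the issue describes (one dead-node collection shared by all reader threads, plus a background thread that probes dead nodes over TCP and removes the ones that answer), here is a minimal sketch. It is not the code from the attached patch: the class name SharedDeadNodes and the methods reportDead, isDead, verifyOnce and startVerifier are hypothetical, and a concurrent set of socket addresses stands in for whatever key and value types the real concurrent Map uses. Only the deadNodes name, the DeadNodeVerifier thread name and the 5-second interval come from the issue text.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of a client-wide dead-node registry; not the HDFS-6022 patch. */
class SharedDeadNodes {
    // One collection per client, visible to every reader thread
    // (assumption: nodes are identified by their socket address).
    private final Set<InetSocketAddress> deadNodes = ConcurrentHashMap.newKeySet();
    private final int connectTimeoutMs;

    SharedDeadNodes(int connectTimeoutMs) {
        this.connectTimeoutMs = connectTimeoutMs;
    }

    /** Called by any reader thread that fails to reach a node. */
    void reportDead(InetSocketAddress node) {
        deadNodes.add(node);
    }

    /** Lets other reader threads skip a node without paying the connect delay. */
    boolean isDead(InetSocketAddress node) {
        return deadNodes.contains(node);
    }

    /** One verification pass: a node that accepts a TCP connection is removed. */
    void verifyOnce() {
        // Iteration over a ConcurrentHashMap key set is weakly consistent,
        // so removing entries during the loop is safe.
        for (InetSocketAddress node : deadNodes) {
            try (Socket probe = new Socket()) {
                probe.connect(node, connectTimeoutMs);
                deadNodes.remove(node);
            } catch (IOException stillUnreachable) {
                // Keep the node in the set and retry on the next pass.
            }
        }
    }

    /** Starts a daemon thread that runs a pass every intervalMs milliseconds. */
    Thread startVerifier(long intervalMs) {
        Thread verifier = new Thread(() -> {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    Thread.sleep(intervalMs);
                    verifyOnce(); // does nothing while the set is empty
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "DeadNodeVerifier");
        verifier.setDaemon(true);
        verifier.start();
        return verifier;
    }
}
```

Under these assumptions, a client would create one SharedDeadNodes instance, call startVerifier(5000) to match the 5-second cadence mentioned in the description, and have every input stream consult isDead before connecting to a datanode.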