[ 
https://issues.apache.org/jira/browse/HDFS-17867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

khazhen updated HDFS-17867:
---------------------------
    Description: 
h2. Background

     In BlockPlacementPolicyDefault, each DN in the cluster is selected with 
roughly equal probability. However, in our cluster, there are various types of 
DataNode machines with completely different hardware specifications.

      For example, some machines have more disks, higher-bandwidth NICs, 
higher-performance CPUs, etc., while some older machines are the opposite: 
their service capacity is much lower than that of the newer machines. 
Therefore, as the cluster load increases, these lower-performance machines 
quickly become bottlenecks, degrading the cluster's performance and even its 
availability (e.g. slow data nodes or PipelineRecovery failures).

The root cause of this problem is that we don't have a good method to achieve 
load balancing between data nodes.
h2. Solution

      To address this problem, we implemented a NetworkTopology that supports 
weighted random choice of DNs.

We can configure a weight value for each DN similar to how we configure racks. 
For clusters containing DNs with different hardware specifications, introducing 
this feature has several benefits:
 # Better load balancing between DNs. High-performance machines can handle more 
traffic, and the overall service capacity of the cluster will be improved.
 # Higher resource utilization.
 # Reduced overhead from Balancer. Typically, higher-performance machines mean 
more hard drives and larger capacity. If we configure weights according to 
capacity ratios, the amount of data that needs to be moved by Balancer will be 
significantly reduced. (Of course, Balancer is still needed for expansion 
scenarios.)
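
      To make the per-DN weight configuration mentioned above concrete, here is 
a purely illustrative sketch: the hostnames, weights, and mapping format below 
are invented for illustration, and the actual configuration mechanism is 
defined in the PR, not here.
{code:java}
// Illustrative only: per-DN weights as a simple hostname -> weight mapping,
// analogous to the hostname -> rack mapping. The real configuration keys and
// format are part of the PR and are not shown here.
import java.util.Map;

public class WeightMappingExample {
  public static void main(String[] args) {
    Map<String, Integer> dnWeights = Map.of(
        "dn1.example.com", 3,   // newer machine: more disks, faster NIC
        "dn2.example.com", 1,   // older, lower-capacity machine
        "dn3.example.com", 2);  // mid-range machine
    dnWeights.forEach((host, w) -> System.out.println(host + " -> weight " + w));
  }
}
{code}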

      Our production cluster has many different types of hardware 
specifications for DN machines, and some machines can have capacities up to 10 
times that of some older models. Additionally, some machines are co-deployed 
with many other computing services, causing them to immediately become slow 
nodes once traffic increases.

      After introducing this feature, we let independently deployed 
high-performance, large-capacity machines handle more traffic, and both the 
overall IO performance and availability of the cluster have been significantly 
improved.

      Our cluster's Hadoop version is still at 2.x, so we directly modified the 
NetworkTopology class to implement this feature. However, in the latest 
version, DFSNetworkTopology has been introduced as the default implementation. 
Therefore, I attempted to re-implement this feature based on 
DFSNetworkTopology. I will introduce the details next.
h2. Implementation

      Let's take a look at the chooseRandomWithStorageType method of 
DFSNetworkTopology. Suppose we have 3 DNs in the cluster: dn1 (/r1), dn2 (/r1), 
and dn3 (/r2). The topology tree looks like this:
{code:java}
/
  /r1
    /dn1
    /dn2
  /r2
    /dn3 {code}
      There are 3 core steps to choose a random DN from the root scope (a 
simplified sketch follows these steps):
1. Compute the number of available nodes under r1 and r2, which is [2, 1] in this case.
2. Perform a weighted random choice from [r1, r2] with weights [2, 1]; assume r1 is chosen.
3. Since r1 is a rack (an inner node), uniformly choose a DN from its children [dn1, dn2].
The probability of each of these three DNs being chosen is 1/3.
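
      As an aside, here is a minimal sketch of this three-step procedure under 
simplifying assumptions. It is not the actual chooseRandomWithStorageType code, 
which also handles storage types and excluded nodes/scopes; the class and 
method names below are invented for illustration.
{code:java}
// Simplified sketch of the rack-then-node selection described in steps 1-3.
// Not the actual DFSNetworkTopology code, just an illustration.
import java.util.List;
import java.util.Random;

public class ChooseRandomSketch {
  private final Random random = new Random();

  // racks is a list of racks, each rack being its list of available DNs.
  public <T> T chooseRandom(List<List<T>> racks) {
    // Step 1: count available DNs under each rack; step 2: weighted choice
    // among racks by that count (by drawing one index over all DNs).
    int total = 0;
    for (List<T> rack : racks) {
      total += rack.size();
    }
    int r = random.nextInt(total);
    for (List<T> rack : racks) {
      if (r < rack.size()) {
        // Step 3: uniform random choice inside the chosen rack.
        return rack.get(random.nextInt(rack.size()));
      }
      r -= rack.size();
    }
    throw new IllegalStateException("unreachable for non-empty racks");
  }
}
{code}
      With racks = [[dn1, dn2], [dn3]], each DN is returned with probability 
1/3, matching the uniform behavior described above.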

      Now we want to introduce a weighted random choice from [dn1, dn2, dn3] 
with weights [3, 1, 2]. A simple and straightforward solution is to add virtual 
nodes to the topology tree; the new topology tree looks like this:
{code:java}
/
  /r1
    /dn1'
    /dn1'
    /dn1'
    /dn2'
  /r2
    /dn3'
    /dn3' {code}
      The probability of each of these virtual nodes being chosen is 1/6. Since 
dn1 has 3 virtual nodes, the probability of choosing dn1 is 3/6 = 1/2, while 
dn2 and dn3 are chosen with probability 1/6 and 2/6 = 1/3 respectively.
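
      The same computation as a tiny illustrative snippet (the explicit 
expansion into copies is only conceptual; as explained next, the actual 
proposal avoids materializing virtual nodes):
{code:java}
// Conceptual illustration of the virtual-node idea: expand each DN into
// `weight` copies and pick uniformly from the expanded list.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class VirtualNodeSketch {
  public static void main(String[] args) {
    Map<String, Integer> weights = Map.of("dn1", 3, "dn2", 1, "dn3", 2);
    List<String> expanded = new ArrayList<>();
    weights.forEach((dn, w) -> {
      for (int i = 0; i < w; i++) {
        expanded.add(dn);                      // dn appears `w` times
      }
    });
    // Uniform pick over 6 virtual nodes:
    // P(dn1) = 3/6 = 1/2, P(dn2) = 1/6, P(dn3) = 2/6 = 1/3.
    String chosen = expanded.get(new Random().nextInt(expanded.size()));
    System.out.println("chosen: " + chosen);
  }
}
{code}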

      However, looking back at steps 1 through 3, steps 1 and 2 only care about 
the number of data nodes under an inner node. This means we don't need to 
actually add virtual nodes to the topology tree. Instead, we can introduce a 
new method getNodeCount(Node n), which accepts a node and returns the number of 
data nodes under it. In the existing DFSNetworkTopology class it would simply 
return the number of physical data nodes under n; we then add a new subclass of 
DFSNetworkTopology that overrides getNodeCount(Node n) to return the total 
weight of all data nodes under n (a minimal sketch follows).
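
      Here is a minimal sketch of this idea with deliberately simplified types. 
These classes and the getNodeCount signature below are stand-ins invented for 
illustration, not the actual DFSNetworkTopology API; the real change is in the 
PR.
{code:java}
// Hypothetical, simplified sketch: the base topology counts physical DNs under
// a node, and the weighted subclass returns the subtree's total weight instead.
import java.util.List;
import java.util.Map;

class BaseTopologySketch {
  // Stand-in for getNodeCount(Node n): number of physical DNs under a node.
  int getNodeCount(List<String> dnsUnderNode) {
    return dnsUnderNode.size();
  }
}

class WeightedTopologySketch extends BaseTopologySketch {
  private final Map<String, Integer> dnWeights;   // configured per-DN weights

  WeightedTopologySketch(Map<String, Integer> dnWeights) {
    this.dnWeights = dnWeights;
  }

  @Override
  int getNodeCount(List<String> dnsUnderNode) {
    // Total weight of all DNs under the node (unconfigured DNs default to 1).
    return dnsUnderNode.stream()
        .mapToInt(dn -> dnWeights.getOrDefault(dn, 1))
        .sum();
  }
}
{code}
      With weights {dn1=3, dn2=1, dn3=2}, getNodeCount would return 4 for r1 
and 2 for r2, so steps 1 and 2 above pick r1 with probability 4/6 and r2 with 
probability 2/6, without materializing any virtual nodes.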

      Step 3 needs to be modified as well: we should perform a weighted random 
choice from the child list rather than a uniform random choice (sketched below).
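
      A sketch of that weighted pick over a rack's child list (again 
illustrative; weightOf stands in for however the per-DN weight would be looked 
up):
{code:java}
// Illustrative weighted random pick from a child list (the modified step 3).
import java.util.List;
import java.util.Random;
import java.util.function.ToIntFunction;

public class WeightedChildPickSketch {
  private final Random random = new Random();

  public <T> T pick(List<T> children, ToIntFunction<T> weightOf) {
    int total = 0;
    for (T child : children) {
      total += weightOf.applyAsInt(child);
    }
    int r = random.nextInt(total);             // r in [0, total)
    for (T child : children) {
      r -= weightOf.applyAsInt(child);
      if (r < 0) {
        return child;                          // chosen with probability weight/total
      }
    }
    throw new IllegalStateException("unreachable for positive total weight");
  }
}
{code}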
h2. Difference from AvailableSpaceBlockPlacementPolicy

      AvailableSpaceBlockPlacementPolicy is useful when we add new nodes to the 
cluster: it makes the newly added nodes slightly more likely to be chosen than 
the old ones, so the cluster tends to become balanced over time. The real-time 
load of the newly added nodes won't change much.

      This feature, by contrast, focuses on real-time load balancing between 
data nodes; it is useful in clusters that contain many different types of data 
nodes.

      In addition, making node weights reconfigurable without restarting the 
NameNode is very useful: it allows us to quickly adjust weights based on the 
actual load of the cluster. I will propose that in a separate JIRA after this 
one is completed.

      I have submitted a PR. Suggestions and discussion are welcome.

> Implement a new NetworkTopology that supports weighted random choose
> --------------------------------------------------------------------
>
>                 Key: HDFS-17867
>                 URL: https://issues.apache.org/jira/browse/HDFS-17867
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: khazhen
>            Priority: Major
>


