[ https://issues.apache.org/jira/browse/HDFS-17867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

khazhen updated HDFS-17867:
---------------------------
    Description: 
h2. Background

    In BlockPlacementPolicyDefault, each DN in the cluster is selected with roughly equal probability. However, our cluster contains various types of DataNode machines with very different hardware specifications. For example, some machines have more disks, higher-bandwidth NICs, and higher-performance CPUs, while some older machines are the opposite: their service capacity is much lower than that of the newer machines. As the cluster load increases, these lower-performance machines quickly become bottlenecks, degrading the cluster's performance or even its availability (e.g., slow DataNodes or PipelineRecovery failures).
    The root cause is that we lack a good way to balance load across DataNodes.

h2. Solution
    To address this, we implemented a NetworkTopology that selects DNs based on configured weights.
    A weight can be configured for each DN, similar to how racks are configured. When choosing targets, the topology samples nodes according to those weights (an illustrative configuration sketch follows the list below). For clusters containing DNs with different hardware specifications, this feature has several benefits:

# Better load balancing between DNs. High-performance machines handle more traffic, which improves the overall service capacity of the cluster.
# Higher resource utilization.
# Reduced Balancer overhead. Higher-performance machines typically have more disks and larger capacity, so if weights are configured in proportion to capacity, the amount of data the Balancer needs to move is significantly reduced. (The Balancer is still needed for expansion scenarios.)
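    As a purely illustrative sketch of how per-DN weights might be expressed (the file name, format, and columns below are hypothetical, not the configuration format defined by this JIRA or its PR), weights could be listed alongside the existing rack mapping, e.g. one line per host with its rack and weight:
{code}
# Hypothetical mapping, illustrative only: host  rack  weight
dn1.example.com  /r1  3
dn2.example.com  /r1  1
dn3.example.com  /r2  2
{code}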

    Our production cluster has DN machines with many different hardware specifications; some machines have up to 10 times the capacity of older models. In addition, some machines are co-deployed with many other computing services, which makes them become slow nodes as soon as traffic increases. After introducing this feature, we let the independently deployed, high-performance, large-capacity machines handle more traffic, and both the overall I/O performance and the availability of the cluster have improved significantly.

    Our cluster is still on Hadoop 2.x, so we implemented this feature by modifying the NetworkTopology class directly. In the latest version, however, DFSNetworkTopology has been introduced as the default implementation, so I re-implemented the feature on top of DFSNetworkTopology. The details follow.

h2. Implementation
    Let's have a look at the chooseRandomWithStorageType method of DFSNetworkTopology. Consider a cluster with 3 DNs, dn1(/r1), dn2(/r1), and dn3(/r2). The topology tree looks like this:
    /
        /r1
            /dn1
            /dn2
        /r2
            /dn3
    There are 3 core steps to choose a random DN from the root scope:
    1. Compute the number of available nodes under r1 and r2, which is [2, 1] in this case.
    2. Perform a weighted random choice over [r1, r2] with weights [2, 1]; assume r1 is chosen.
    3. Since r1 is a rack (an inner node), choose a DN uniformly at random from its children [dn1, dn2].
    Each of the three DNs is chosen with probability 1/3.
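    To make step 2 concrete, here is a minimal standalone Java sketch of weighted random choice. It is illustrative only and not the actual DFSNetworkTopology code; the class and method names are made up.
{code:java}
import java.util.List;
import java.util.Random;

// Illustrative sketch: pick an index from a list of candidate subtrees,
// where each subtree's weight is the number of available datanodes under it
// ([2, 1] for [r1, r2] in the example above).
public final class WeightedChoice {
  private static final Random RAND = new Random();

  static int chooseWeightedIndex(List<Integer> weights) {
    int total = 0;
    for (int w : weights) {
      total += w;
    }
    // Draw a point in [0, total) and find the bucket it falls into.
    int point = RAND.nextInt(total);
    for (int i = 0; i < weights.size(); i++) {
      point -= weights.get(i);
      if (point < 0) {
        return i;
      }
    }
    throw new IllegalStateException("weights must contain a positive value");
  }

  public static void main(String[] args) {
    // Step 2 of the example: choose between r1 and r2 with weights [2, 1];
    // index 0 (r1) comes back with probability 2/3, index 1 (r2) with 1/3.
    int chosen = chooseWeightedIndex(List.of(2, 1));
    System.out.println(chosen == 0 ? "r1" : "r2");
  }
}
{code}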

    Now suppose we want a weighted random choice over [dn1, dn2, dn3] with weights [3, 1, 2]. A simple and straightforward solution is to add virtual nodes to the topology tree, so that it looks like this:
    /
        /r1
            /dn1'
            /dn1'
            /dn1'
            /dn2'
        /r2
            /dn3'
            /dn3'
    Each virtual node is chosen with probability 1/6. dn1 has 3 virtual nodes, so dn1 is chosen with probability 1/2, dn2 with probability 1/6, and dn3 with probability 1/3.
    However, looking back at steps 1 through 3, steps 1 and 2 only care about the number of DataNodes under each inner node. This means we do not actually need to add virtual nodes to the topology tree. Instead, we can introduce a new method getNodeCount(Node n) that takes a node and returns the number of DataNodes under it. In the existing DFSNetworkTopology class it simply returns the number of physical DataNodes under n; a new subclass of DFSNetworkTopology then overrides getNodeCount(Node n) to return the total weight of all DataNodes under n.
    Step 3 needs to be modified as well: we perform a weighted random choice over the child list rather than a uniform random choice.
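    The following standalone Java sketch shows the shape of this idea using simplified, hypothetical types (it does not use the real org.apache.hadoop.net classes, and the actual PR may differ): a base topology counts physical DataNodes under a subtree, and a weighted subclass overrides getNodeCount to return the subtree's total weight, so the existing weighted choice over inner nodes automatically follows the configured weights.
{code:java}
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WeightedTopologySketch {

  // Minimal stand-in for a topology inner node: a rack holding datanode names.
  static class InnerNode {
    final String name;
    final List<String> datanodes;
    InnerNode(String name, List<String> datanodes) {
      this.name = name;
      this.datanodes = datanodes;
    }
  }

  // Current behaviour: every datanode under a subtree counts as 1.
  static class BaseTopology {
    int getNodeCount(InnerNode n) {
      return n.datanodes.size();
    }
  }

  // Proposed behaviour: the "count" of a subtree is the sum of its datanodes'
  // configured weights, so steps 1 and 2 need no other change.
  static class WeightedTopology extends BaseTopology {
    private final Map<String, Integer> weights; // dn name -> configured weight

    WeightedTopology(Map<String, Integer> weights) {
      this.weights = weights;
    }

    @Override
    int getNodeCount(InnerNode n) {
      int total = 0;
      for (String dn : n.datanodes) {
        total += weights.getOrDefault(dn, 1); // unweighted DNs default to 1
      }
      return total;
    }
  }

  public static void main(String[] args) {
    InnerNode r1 = new InnerNode("/r1", List.of("dn1", "dn2"));
    InnerNode r2 = new InnerNode("/r2", List.of("dn3"));

    Map<String, Integer> weights = new HashMap<>();
    weights.put("dn1", 3);
    weights.put("dn2", 1);
    weights.put("dn3", 2);

    WeightedTopology topo = new WeightedTopology(weights);
    // Prints "4 2": r1 and r2 are now weighted 4/6 and 2/6, matching the
    // virtual-node construction above.
    System.out.println(topo.getNodeCount(r1) + " " + topo.getNodeCount(r2));
  }
}
{code}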

h2. Difference with AvailableSpaceBlockPlacementPolicy
    AvailableSpaceBlockPlacementPolicy is useful when new nodes are added to the cluster: newly added nodes are chosen with slightly higher probability than the old ones, so the cluster trends toward balance over time, but the real-time load of the newly added nodes does not change much.
    This feature instead focuses on real-time load balancing between DataNodes, which is useful in clusters with many different types of DataNodes.

    In addition, making node weights reconfigurable without restarting the NameNode is very useful: it lets us quickly adjust weights based on the actual load of the cluster. I will introduce that in a separate JIRA after this one is completed.
    I have submitted a PR. Suggestions and discussion are welcome.


> Implement a new NetworkTopology that supports weighted random choose
> --------------------------------------------------------------------
>
>                 Key: HDFS-17867
>                 URL: https://issues.apache.org/jira/browse/HDFS-17867
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: khazhen
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
