Hello,

For my project, I am analyzing an algorithm to balance disk usage across 
thousands of storage nodes spread over multiple availability zones.

Let’s say:
Availability zone 1
Disk usage for data of customer 1 is 70%
Disk usage for data of customer 2 is 10%

Availability zone 2
Disk usage for data of customer 1 is 30%
Disk usage for data of customer 2 is 90%

and so forth…

Clearly, in the above example, customer 1's data has much higher locality in 
AZ1 compared to AZ2. Similarly, customer 2's data has much higher locality in 
AZ2 compared to AZ1.

In an ideal world, the customers' data would look something like this:


Availability zone 1
Disk usage for data of customer 1 is 50%
Disk usage for data of customer 2 is 50%

Availability zone 2
Disk usage for data of customer 1 is 50%
Disk usage for data of customer 2 is 50%
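
To make the target concrete, here is how I am quantifying the imbalance at 
the moment. This is a rough Python sketch; the "skew" metric is my own 
formulation, not anything taken from HDFS:

# Disk usage percentages from the example above.
usage = {
    "az1": {"customer1": 70, "customer2": 10},
    "az2": {"customer1": 30, "customer2": 90},
}

def skew(usage, customer):
    """Spread between a customer's highest and lowest disk usage
    across availability zones; 0 means perfectly balanced."""
    values = [per_az[customer] for per_az in usage.values()]
    return max(values) - min(values)

for c in ("customer1", "customer2"):
    print(c, "skew:", skew(usage, c))  # prints 40 and 80

In the ideal state above, both skews would be 0.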


The HDFS Balancer looks related; however, I have some questions:

1. Why does the algorithm try to pair an over-utilized node with an 
under-utilized one, instead of moving every node toward the average? (My 
rough mental model is sketched after these questions.)
(https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.5/data-storage/content/step_2__storage_group_pairing.html)

2. Where can I find more algorithmic details of how the pairing happens?

3. Is this the only balancing algorithm supported by HDFS?
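
For question 1, here is my current mental model of the pairing step, pieced 
together from the linked page. It is only a sketch; the function name and 
the threshold value are my assumptions, not the actual HDFS implementation:

# My rough understanding of the pairing idea: nodes outside a
# threshold band around the cluster average are paired
# over-utilized -> under-utilized. Names and the 10% threshold
# are my assumptions, not HDFS's actual code.
def pair_nodes(node_usage, threshold=10.0):
    """node_usage maps node name -> disk usage percent.
    Returns (source, target) pairs to move blocks between."""
    avg = sum(node_usage.values()) / len(node_usage)
    over = sorted((n for n, u in node_usage.items() if u > avg + threshold),
                  key=node_usage.get, reverse=True)  # most over-utilized first
    under = sorted((n for n, u in node_usage.items() if u < avg - threshold),
                   key=node_usage.get)               # most under-utilized first
    # Nodes already inside the band are left alone, which I assume is
    # why not every node ends up at exactly the cluster average.
    return list(zip(over, under))

print(pair_nodes({"n1": 85, "n2": 50, "n3": 15, "n4": 48}))  # [('n1', 'n3')]

If this is roughly right, the threshold band would explain why pairing the 
extremes is cheaper than forcing every node to the exact average, but I would 
like to confirm that.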

Thanks
