Hello,

For my project, I am analyzing an algorithm to balance disk usage across thousands of storage nodes spread over multiple availability zones.
Let's say:

Availability zone 1
- Disk usage for data of customer 1 is 70%
- Disk usage for data of customer 2 is 10%

Availability zone 2
- Disk usage for data of customer 1 is 30%
- Disk usage for data of customer 2 is 90%

and so forth.

Clearly, in the example above, customer 1's data has much higher locality in AZ1 than in AZ2, and likewise customer 2's data has much higher locality in AZ2 than in AZ1. In an ideal world, the customers' data would look something like this:

Availability zone 1
- Disk usage for data of customer 1 is 50%
- Disk usage for data of customer 2 is 50%

Availability zone 2
- Disk usage for data of customer 1 is 50%
- Disk usage for data of customer 2 is 50%

The HDFS Balancer looks related; however, I have some questions:

1. Why does the algorithm try to pair an over-utilized node with an under-utilized one, instead of moving every node toward the cluster-average utilization (see the sketch below for what I mean)? (https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.5/data-storage/content/step_2__storage_group_pairing.html)
2. Where can I find more algorithmic details of how the pairing happens?
3. Is this the only balancing algorithm supported by HDFS?
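To make question 1 concrete, here is a minimal sketch of the alternative I had in mind: compute the mean utilization and plan transfers so every node converges to it. This is plain Python, not HDFS code; the node names, usage numbers, and the `threshold` parameter are all made up for illustration.

```python
# A naive "balance everything to the mean" sketch -- NOT the HDFS Balancer's
# algorithm, just the alternative I have in mind in question 1.
# Node names, usages, and the threshold are made up for illustration.

def plan_moves(usage, threshold=1.0):
    """Plan transfers so every node ends within `threshold` percentage
    points of the mean utilization."""
    mean = sum(usage.values()) / len(usage)
    # Nodes above the mean are sources of data; nodes below it are sinks.
    surplus = {n: u - mean for n, u in usage.items() if u - mean > threshold}
    deficit = {n: mean - u for n, u in usage.items() if mean - u > threshold}
    moves = []
    for src in sorted(surplus, key=surplus.get, reverse=True):
        for dst in sorted(deficit, key=deficit.get, reverse=True):
            amount = min(surplus[src], deficit[dst])
            if amount > 0:
                moves.append((src, dst, amount))
                surplus[src] -= amount
                deficit[dst] -= amount
    return mean, moves

if __name__ == "__main__":
    # Utilization (%) per node; numbers loosely mirror the example above.
    usage = {"node-a": 80.0, "node-b": 60.0, "node-c": 10.0}
    mean, moves = plan_moves(usage)
    print(f"target utilization: {mean:.1f}%")
    for src, dst, pct in moves:
        print(f"move {pct:.1f} points of data from {src} to {dst}")
```

On these toy numbers this plans two transfers into the emptiest node; what I don't yet understand is what the over-/under-utilized pairing described in the linked doc buys over this kind of direct convergence to the mean.

Thanks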
