Re: Details about cluster balancing
Thanks Ayush!

> On 15-Nov-2023, at 10:59 PM, Ayush Saxena wrote:
>
> Hi Akash,
> You can read about the Balancer here:
> https://apache.github.io/hadoop/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Balancer
> HADOOP-1652 (https://issues.apache.org/jira/browse/HADOOP-1652) has
> some details about it as well, including some attached docs that you
> can read.
> For the code, you can start exploring here:
> https://github.com/apache/hadoop/blob/rel/release-3.3.6/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/balancer/Balancer.java#L473-L479
>
> -Ayush
Re: Details about cluster balancing
Hi Akash,

You can read about the Balancer here:
https://apache.github.io/hadoop/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Balancer

HADOOP-1652 (https://issues.apache.org/jira/browse/HADOOP-1652) has some details about it as well, including some attached docs that you can read.

For the code, you can start exploring here:
https://github.com/apache/hadoop/blob/rel/release-3.3.6/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/balancer/Balancer.java#L473-L479

-Ayush

On Sun, 5 Nov 2023 at 22:33, Akash Jain wrote:
> HDFS Balancer looks related, however I have some questions:
>
> 1. Why does the algorithm try to pair an over-utilized node with an
> under-utilized one instead of having every node hold the average amount of data?
> (https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.5/data-storage/content/step_2__storage_group_pairing.html)
>
> 2. Where can I find more algorithmic details of how the pairing happens?
>
> 3. Is this the only balancing algorithm supported by HDFS?
>
> Thanks
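To make the classification and pairing idea from the linked pages a little more concrete, here is a minimal Python sketch. It is not the code in Balancer.java: the function names `classify` and `candidate_pairs`, the sample node names, and the pairing order are assumptions for illustration, and the cluster average is taken as a plain mean of per-node percentages, which only holds if every node has the same capacity.

```python
def classify(utilization, threshold=10.0):
    """Group nodes by how far their utilization (percent) is from the average.

    Assumption: all nodes have equal capacity, so the cluster average is the
    plain mean of the per-node percentages.
    """
    avg = sum(utilization.values()) / len(utilization)
    groups = {"over": [], "above_avg": [], "below_avg": [], "under": []}
    for node, used in sorted(utilization.items()):
        diff = used - avg
        if diff > threshold:
            groups["over"].append(node)        # more than threshold above average
        elif diff > 0:
            groups["above_avg"].append(node)   # above average, within threshold
        elif diff >= -threshold:
            groups["below_avg"].append(node)   # at or below average, within threshold
        else:
            groups["under"].append(node)       # more than threshold below average
    return avg, groups


def candidate_pairs(groups):
    """List (source, target) candidates in an assumed priority order:
    over -> under, then over -> below_avg, then above_avg -> under."""
    order = [("over", "under"), ("over", "below_avg"), ("above_avg", "under")]
    return [(s, t) for src, dst in order for s in groups[src] for t in groups[dst]]


if __name__ == "__main__":
    sample = {"dn1": 90, "dn2": 55, "dn3": 45, "dn4": 10}  # hypothetical nodes
    avg, groups = classify(sample, threshold=10.0)
    print(avg, groups)
    print(candidate_pairs(groups))
```

Nodes within the threshold of the average are treated as balanced enough, so in this sketch they only take part in a move when it helps a node outside the threshold.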
Details about cluster balancing
Hello,

For my project, I am analyzing an algorithm to balance the disk usage across thousands of storage nodes across different availability zones.

Let’s say

Availability zone 1
Disk usage for data of customer 1 is 70%
Disk usage for data of customer 2 is 10%

Availability zone 2
Disk usage for data of customer 1 is 30%
Disk usage for data of customer 2 is 90%

and so forth…

Clearly, in the above example, the data of customer 1 has much higher data locality in AZ1 than in AZ2. Similarly, the data of customer 2 has higher data locality in AZ2 than in AZ1.

In an ideal world, the data of the customers would look something like this:

Availability zone 1
Disk usage for data of customer 1 is 50%
Disk usage for data of customer 2 is 50%

Availability zone 2
Disk usage for data of customer 1 is 50%
Disk usage for data of customer 2 is 50%

HDFS Balancer looks related, however I have some questions:

1. Why does the algorithm try to pair an over-utilized node with an under-utilized one instead of having every node hold the average amount of data?
(https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.5/data-storage/content/step_2__storage_group_pairing.html)

2. Where can I find more algorithmic details of how the pairing happens?

3. Is this the only balancing algorithm supported by HDFS?

Thanks
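For the numbers in the example above, the amount that would have to move to reach the 50/50 target can be worked out with a small greedy sketch in Python. This is only an illustration of the arithmetic, not the HDFS Balancer: the function name `plan_moves` is hypothetical, and it assumes the percentages of one customer across zones are directly comparable (equal capacity per zone).

```python
def plan_moves(usage):
    """Return (source, target, amount) moves that bring every zone to the mean.

    usage maps a zone name to the percentage for one customer in that zone.
    Zones above the mean give up their surplus to zones below the mean.
    """
    mean = sum(usage.values()) / len(usage)
    # Largest surplus and largest deficit first.
    surplus = sorted(([z, u - mean] for z, u in usage.items() if u > mean),
                     key=lambda item: -item[1])
    deficit = sorted(([z, mean - u] for z, u in usage.items() if u < mean),
                     key=lambda item: -item[1])
    moves = []
    i = j = 0
    while i < len(surplus) and j < len(deficit):
        amount = min(surplus[i][1], deficit[j][1])
        moves.append((surplus[i][0], deficit[j][0], amount))
        surplus[i][1] -= amount
        deficit[j][1] -= amount
        if surplus[i][1] <= 1e-9:   # this source has nothing left to give
            i += 1
        if deficit[j][1] <= 1e-9:   # this target has reached the mean
            j += 1
    return moves


if __name__ == "__main__":
    print(plan_moves({"AZ1": 70, "AZ2": 30}))  # customer 1
    print(plan_moves({"AZ1": 10, "AZ2": 90}))  # customer 2
```

In this sketch, "every zone holding the average" is the goal and pairing a surplus zone with a deficit zone is simply the way data gets moved toward that goal.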