Re: Details about cluster balancing

2023-11-27 Thread Akash Jain

Thanks Ayush!


-
To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
For additional commands, e-mail: user-h...@hadoop.apache.org

Re: Details about cluster balancing

2023-11-15 Thread Ayush Saxena

Hi Akash,
You can read about the Balancer here:
https://apache.github.io/hadoop/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Balancer
HADOOP-1652 (https://issues.apache.org/jira/browse/HADOOP-1652) has
some more details, and there are docs attached to it that you can
read.
For the code, you can start exploring here:
https://github.com/apache/hadoop/blob/rel/release-3.3.6/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/balancer/Balancer.java#L473-L479

-Ayush



Details about cluster balancing

2023-11-05 Thread Akash Jain

Hello,

For my project, I am analyzing an algorithm to balance disk usage across 
thousands of storage nodes spread over different availability zones.

Let’s say 
Availability zone 1
Disk usage for data of customer 1 is 70%
Disk usage for data of customer 2 is 10%

Availability zone 2
Disk usage for data of customer 1 is 30%
Disk usage for data of customer 2 is 90%

and so forth…

Clearly, in the above example, customer 1's data has much higher data locality 
in AZ1 than in AZ2. Similarly, customer 2's data has more data locality in AZ2 
than in AZ1.

In an ideal world, the customers' data would look something like this:


Availability zone 1
Disk usage for data of customer 1 is 50%
Disk usage for data of customer 2 is 50%

Availability zone 2
Disk usage for data of customer 1 is 50%
Disk usage for data of customer 2 is 50%
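
The numbers above can be checked with a short sketch. It assumes that both 
availability zones have equal capacity, so the target for each customer is 
simply the mean of that customer's usage across zones (this is not stated 
explicitly above). The dictionary layout and the name rebalance_targets are 
made up for illustration.

```python
# Hypothetical sketch: per-customer disk usage (%) in each availability zone,
# copied from the example above. Equal AZ capacity is an assumption.
usage = {
    "AZ1": {"customer1": 70, "customer2": 10},
    "AZ2": {"customer1": 30, "customer2": 90},
}


def rebalance_targets(usage):
    """Return (mean usage per customer, per-AZ change needed to reach it)."""
    zones = list(usage)
    customers = sorted({c for per_zone in usage.values() for c in per_zone})
    mean = {
        c: sum(usage[z].get(c, 0) for z in zones) / len(zones)
        for c in customers
    }
    # Positive delta: the AZ should receive data; negative: it should shed data.
    delta = {
        z: {c: mean[c] - usage[z].get(c, 0) for c in customers}
        for z in zones
    }
    return mean, delta


mean, delta = rebalance_targets(usage)
print(mean)
print(delta)
```

Under these assumptions the mean is 50% for both customers, and the deltas 
show which zone would give up data and which would receive it.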


The HDFS Balancer looks related; however, I have some questions:

1. Why does the algorithm try to pair an over-utilized node with an 
under-utilized one instead of having every node hold the average amount of data?
(https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.5/data-storage/content/step_2__storage_group_pairing.html)

2. Where can I find more algorithmic details of how the pairing happens?

3. Is this the only balancing algorithm supported by HDFS?

Thanks
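
To make question 1 concrete, here is a rough sketch of the kind of pairing 
being asked about. It is not the HDFS Balancer's code. It only assumes, 
following the description linked in question 1, that nodes are classified 
against the cluster average using a threshold and that over-utilized nodes 
are then paired with under-utilized ones. The function names, the 10-point 
threshold, the greedy ordering, and the node utilizations are all 
illustrative assumptions.

```python
def classify(utilization, threshold=10.0):
    """Split nodes into over- and under-utilized relative to the average.

    Returns the average plus two dicts mapping node -> distance from the
    average. Nodes within `threshold` of the average appear in neither.
    """
    avg = sum(utilization.values()) / len(utilization)
    over = {n: u - avg for n, u in utilization.items() if u > avg + threshold}
    under = {n: avg - u for n, u in utilization.items() if u < avg - threshold}
    return avg, over, under


def pair(over, under):
    """Greedily pair the most loaded sources with the emptiest targets."""
    moves = []
    sources = sorted(over.items(), key=lambda kv: -kv[1])
    targets = sorted(under.items(), key=lambda kv: -kv[1])
    i = j = 0
    while i < len(sources) and j < len(targets):
        (src, surplus), (dst, deficit) = sources[i], targets[j]
        amount = min(surplus, deficit)
        moves.append((src, dst, amount))
        sources[i] = (src, surplus - amount)
        targets[j] = (dst, deficit - amount)
        # At least one side is now exactly zero, so the loop always advances.
        if sources[i][1] == 0:
            i += 1
        if targets[j][1] == 0:
            j += 1
    return moves


# Made-up utilizations (%) for four nodes of equal capacity.
nodes = {"n1": 90, "n2": 70, "n3": 30, "n4": 10}
avg, over, under = classify(nodes)
moves = pair(over, under)
print(avg, moves)
```

In this sketch, nodes that are already within the threshold of the average 
are left alone, so less data moves than if every node were forced to hold 
exactly the average.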
