[ 
https://issues.apache.org/jira/browse/BOOKKEEPER-950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15556101#comment-15556101
 ] 

Rithin Shetty edited comment on BOOKKEEPER-950 at 10/7/16 8:03 PM:
-------------------------------------------------------------------

We create a new ensemble placement policy called 
CapacityBalancedPlacementPolicy. This is how it would work.

1) All the bookies report the total amount of free space they have to the 
metadata server (ZooKeeper) on a periodic basis, say every 15 or 30 minutes. On 
ZooKeeper, this info can be stored in the bookie's znode at 
/ledger/available/<bookie>, which is currently unused.
2) All BookKeeper clients periodically retrieve the free-disk-space info of all 
the bookies from ZooKeeper and cache it in memory.
3) Every time a new ledger needs to be created, the ensemble is chosen based on 
the free disk space on the bookies: bookies with more free disk space will be 
chosen more often than those with less. DefaultEnsemblePlacementPolicy chooses 
the bookies for an ensemble randomly; the CapacityBalancedPlacementPolicy 
algorithm will give preference to bookies that have more free space.
Suppose there are 6 bookies: B1, B2, B3, B4, B5, and B6, and the free space on 
each of them is 100GB, 100GB, 200GB, 200GB, 300GB, and 100GB respectively, for 
a total of 1TB. With the new algorithm the probability of picking B1, B2, or B6 
is 0.1 each, for B3 and B4 it is 0.2, and for B5 it is 0.3. With 
DefaultEnsemblePlacementPolicy the probability for each of the bookies is 1/6 
==> ~0.167.
4) This ensures that bookies with more free disk space are selected more often, 
balancing disk usage over time. This will allow us to work with mixed disk 
configurations and should leave fewer read-only (RO) bookies.
5) Operations like auto recovery could cause imbalance because during recovery 
the data is not placed by the BookKeeper client; the bookies decide where to 
copy it themselves. We could make the bookies check usage on the other bookies, 
similar to the client, and balance it out.
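The weighted selection in step 3 could be sketched as follows. This is a minimal illustration, not BookKeeper code: the class name CapacityWeightedSelector and its methods are hypothetical, and the real policy would need to plug into the EnsemblePlacementPolicy interface and handle rack awareness, quarantined bookies, etc.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;

// Hypothetical sketch: pick bookies with probability proportional to the
// free-space numbers the clients cached from ZooKeeper.
public class CapacityWeightedSelector {
    private final List<String> bookies;
    private final long[] cumulative; // prefix sums of free-space weights
    private final long total;
    private final Random random = new Random();

    public CapacityWeightedSelector(Map<String, Long> freeSpaceGb) {
        bookies = new ArrayList<>(freeSpaceGb.keySet());
        cumulative = new long[bookies.size()];
        long sum = 0;
        for (int i = 0; i < bookies.size(); i++) {
            sum += freeSpaceGb.get(bookies.get(i));
            cumulative[i] = sum;
        }
        total = sum;
    }

    /** Picks one bookie; probability is proportional to its free space. */
    public String pickOne() {
        long r = (long) (random.nextDouble() * total); // r in [0, total)
        int idx = Arrays.binarySearch(cumulative, r + 1);
        if (idx < 0) idx = -idx - 1; // insertion point = first prefix sum > r
        return bookies.get(idx);
    }

    /** Picks an ensemble of distinct bookies by simple rejection sampling. */
    public List<String> pickEnsemble(int ensembleSize) {
        Set<String> picked = new LinkedHashSet<>();
        while (picked.size() < ensembleSize) {
            picked.add(pickOne());
        }
        return new ArrayList<>(picked);
    }
}
```

With the six bookies from the example above, B5 (300GB of 1TB total) would be drawn roughly three times as often as B1 over many ledger creations.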

The more free disk space a node has, the more write load it will be subjected 
to. Perhaps we can put an upper bound on the probability of selecting a single 
bookie, say 0.3 or less, to prevent all new ledgers from being created on it.
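One way to apply such a cap is to clamp any normalized weight above the bound and redistribute the excess proportionally among the remaining bookies. Again a hypothetical sketch, assuming a maxProb knob that is not part of any existing BookKeeper config:

```java
// Hypothetical sketch of capping per-bookie selection probability.
// Names (capProbabilities, maxProb) are illustrative only.
public class ProbabilityCapper {

    /**
     * Normalizes free-space weights into probabilities, clamps any probability
     * above maxProb, and redistributes the excess proportionally among the
     * uncapped bookies. If maxProb * n < 1, everyone ends up at the cap and
     * the result cannot sum to 1; callers would need to reject such configs.
     */
    public static double[] capProbabilities(double[] weights, double maxProb) {
        int n = weights.length;
        double total = 0;
        for (double w : weights) total += w;
        double[] p = new double[n];
        for (int i = 0; i < n; i++) p[i] = weights[i] / total;

        boolean[] capped = new boolean[n];
        boolean changed = true;
        while (changed) { // redistribution may push new bookies over the cap
            changed = false;
            double excess = 0;
            for (int i = 0; i < n; i++) {
                if (!capped[i] && p[i] > maxProb) {
                    excess += p[i] - maxProb;
                    p[i] = maxProb;
                    capped[i] = true;
                    changed = true;
                }
            }
            double uncappedSum = 0;
            for (int i = 0; i < n; i++) if (!capped[i]) uncappedSum += p[i];
            if (excess > 0 && uncappedSum > 0) {
                for (int i = 0; i < n; i++) {
                    if (!capped[i]) p[i] += excess * p[i] / uncappedSum;
                }
            }
        }
        return p;
    }
}
```

For the example cluster with a cap of 0.25, B5 drops from 0.3 to 0.25 and the 0.05 excess is spread across the other five bookies in proportion to their own weights.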



> Ledger placement policy to accommodate different storage capacity of bookies
> ---------------------------------------------------------------------------
>
>                 Key: BOOKKEEPER-950
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-950
>             Project: Bookkeeper
>          Issue Type: New Feature
>            Reporter: Rithin Shetty
>            Assignee: Rithin Shetty
>             Fix For: 4.5.0
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> In our environment, in Salesforce, we are likely to have bookie nodes with 
> different storage capacity: some will have 1TB others might have 3TB. Also, 
> our ledgers are likely going to be long lived. The current ledger placement 
> policy selects the bookies randomly leading to uniform distribution. This 
> would cause some of bookies to reach high utilization while the rest would be 
> underutilized. We need a new ledger placement policy that has higher 
> probability of selecting bookies with higher free disk space than the ones 
> with lower disk free space.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
