TakaHiR07 opened a new issue, #4247:
URL: https://github.com/apache/bookkeeper/issues/4247

   ### BP
   
   This is the master ticket for tracking BP-65 :
   Proposal PR - https://github.com/apache/bookkeeper/pull/4246
   
   ### Motivation
   
   One of our clusters have 255 bookies, and we find that bookie's write 
pressure is very unbalance. 
   
   Usually there are several bookies write latency too high, which cause the 
message publish latency also too high in pulsar broker.
   
   Currently, bookie have quarantine mechanism to deal with this case. 
   
   - when broker select ensemble for a ledger, it uses RackAwarePlacement and 
random select strategy to decide which bookie should be written.
   - If bookie write error achieve n times, the bookie would be quarantined in 
broker.
   - To avoid too many broker quarantine a bookie at the same time, we can 
define quarantineRatio to do quarantine randomly.
   
   However, this mechanism is not good enough to avoid bookie high write latency
   
   1. Since we select bookie ensemble randomly, ramdom strategy is hard to 
achieve good performance if we don't have too many bookies or too many ledgers 
to select.
   2. It is hard to define appropriate n time write error, and quarantineRatio. 
Because it is hard to map bookie write pressure to these configs.
   3. It is hard to define how long we quarantine a bookie. Because we don't 
know which time bookie's write pressure has already dropped. Now we can only 
set the quarantineTime by config.
   4. Now we can only quarantine a bookie when it already has write-pressure, 
but we can't avoid the write-pressure problem occur in advance.
   
   
   ### Proposed Changes
   
   To solve this write pressure problem, we propose to introduce bookie load 
balance mechanism, which is supplement of current quarantine mechanism. 
   When we choose ensemble for ledger, if we have load information of all 
bookies, we can prefer to select the low-load bookie as ensemble. 
   And we can avoid to write into the high-load bookie, which make the bookie 
perform worse and cause high write latency. 
   Therefore, with the bookie load information, we can better avoid high write 
latency problem occur
   
   We notice that bookie already has DiskWeightBasedPlacement mechanism, which 
is similar to load balance. We just need to enhance this mechanism,
   replace it to LoadWeightBasedPlacement.
   
   The proposed changes involves:
   1. Implement BaseMetricMonitor in bookie server, which would collect bookie 
load information periodically
   2. bookie client continue to use getBookieInfo restApi to acquire load 
information from each bookie.
   3. modify the implementation of RackawareEnsemblePlacementPolicyImpl, 
support select ensemble by LoadWeightBasedPlacement. Since 
LoadWeightBasedPlacement is an enhancement of DiskWeightBasedPlacement,  it 
would cover the DiskWeightBasedPlacement if feature enable.
   
   
   
   #### BaseMetricMonitor
   
   Now BaseMetricMonitor would collect multiple load metrics, including journal 
IOUtil, ledger IOUtil, bookie write bytes per second, cpu usage, free disk 
space, total disk space.
   Then we can define the bookie load pressure by these metrics.
   Actually for our cluster, bookie load pressure is mainly influenced by 
journal IOUtil, because we use HDD as journal disk and 1 journal disk is 
responsible for 3 bookie.
   
   BaseMetricMonitor would collect the metrics per second by default. But we 
find that some metrics is jittering so much. 
   So it is necessary to smooth the collected metrics, by calculating average 
value between 3 seconds. 
   This 3 second can be modified by config 
baseMetricMonitorMetricSlideWindowSize
   
   If one bookie contains multiple disks, we calculate the average value.
   
   
   #### modification of getBookieInfo restApi
   
   bookie client continue to use getBookieInfo restApi to acquire load 
information from each bookie. 
   That means if we enable LoadWeightBasedPlacement, the restApi would contain 
more information.
   - Previous: getBookieInfo contains totalDiskUsage and freeDiskUsage
   - 
   ```
   message GetBookieInfoResponse {
       required StatusCode status = 1;
       optional int64 totalDiskCapacity = 2;
       optional int64 freeDiskSpace = 3;
   }
   ```
   
   - Current: getBookieInfo contains totalDiskUsage, freeDiskUsage, and the 
other load information
   ```
   message GetBookieInfoResponse {
       required StatusCode status = 1;
       optional int64 totalDiskCapacity = 2;
       optional int64 freeDiskSpace = 3;
       optional int32 journalIoUtil = 4;
       optional int32 ledgerIoUtil = 5;
       optional int32 cpuUsedRate = 6;
       optional int64 writeBytePerSecond = 7;
   }
   ```
   
   GetBookieInfoResponse in BookkeeperProtocol would be changed. 
   If we disable LoadWeightBasedPlacement or the restApi is error because of 
timeout or throwing exception, the load information would be -1.
   
   And we have tested the pressure of this restApi bringing to cluster. Such as 
for our cluster, with more than 20 brokers and more than 200 bookies, 
   the pressure of restApi is still acceptable.
   
   #### modification of RackawareEnsemblePlacementPolicyImpl
   
   Implement a new strategy to select bookies for LoadWeightBasedPlacement.
   
   The target is :
   - make write pressure more balance on all bookies. (Usually the more 
throughput, the higher write pressure)
   - avoid some bookies occur high write latency problem. (Usually the IOUtil 
become higher, write latency would be also higher)
   
   So the designed strategy is :
   - use writeBytesPerSecond as load weight. And use roulette wheel selection, 
define the bookie selection probability by its load weight. The higher load 
weight, the less probability bookie to be selected.
   - Selection filter the high load bookie, whose journal IOUtil is higher than 
threshold. If bookie is already high load, we should not continue to write 
entry on it. Once its IOUtil decrease, bookie become writable. 
   
   To avoid a corner case that so many bookies being filtered, we add config 
`lowLoadBookieRatio`. Default if more than half of bookies are filtered, 
fallback to randomly selection.
   
   Notice that many bookie clients would do ensemble selections separately, the 
probability of each bookie should not differ too much, or it would cause write 
incline problem. 
   So we have probability smooth operation in roulette wheel selection.
   
   
   ### compatibility
   
   LoadWeightBasedPlacement is a supplemental feature, ledger replication must 
obey the RackAwarePolicy/RegionAwarePolicy firstly, 
   and then try to obey LoadWeightBasedPlacement. We can disable the feature by 
configuration.
   
   Because GetBookieInfo protocol has been changed, this restApi would get 
error if the version of bookie-server and bookie-client is not the same one.
   
   
   ### Performance
   
   We have applied LoadWeightBasedPlacement to our production clusters. And the 
high write latency problem no matter happen.
   
   
![企业微信截图_91bcf3d9-251f-47a4-9ea5-666e06bb9115](https://github.com/apache/bookkeeper/assets/13505225/fa80d1fa-9021-4fd6-a758-dfb97d4969d9)
   
![企业微信截图_7ea1190b-3c69-4d5a-967e-5c75dd62b1b0](https://github.com/apache/bookkeeper/assets/13505225/49cfc21d-3bad-4ce7-bd3e-266bc82b3e1f)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to