empiredan commented on code in PR #101:
URL:
https://github.com/apache/incubator-pegasus-website/pull/101#discussion_r2032754906
##########
_docs/zh/administration/rebalance.md:
##########
@@ -22,15 +22,14 @@ permalink: administration/rebalance
* 【readable but unwritable】: 分片可读但是不可写。指的是只剩下了一个primary,两个secondary副本全部丢失
* 【readable and writable but unhealthy】: 分片可读可写,但仍旧不健康。指的是三副本里面少了一个secondary
* 【dead】: partition的所有副本全不可用了,又称之为DDD状态。
-
{:class="img-responsive"}
当通过pegasus shell来查看集群、表以及分片的状态时,会经常看到对分片健康情况的整体统计或单个描述。譬如通过`ls
-d`命令,可以看到各个表处于不同健康状况的partition的个数,包括这些:
* fully_healthy:完全健康。
* unhealthy:不完全健康。
* write_unhealthy:不可写,包括上面的readable but unwritable和dead。
* read_unhealthy:不可读,包括上面的unreadable和dead。
-
+
Review Comment:
Please drop the leading spaces.
##########
_docs/zh/administration/rebalance.md:
##########
@@ -325,6 +324,64 @@ Pegasus提供了一些控制参数给些过程可以提供更精细的控制:
不过有部分脚本的逻辑依赖小米的[minos部署系统](https://github.com/XiaoMi/minos)。这里希望大家可以帮助pegasus,
可以支持更多的部署系统。
+## 集群级别负载均衡
+1. 上面的描述都是以表为单位进行负载均衡的,即当一个集群中每张表在replica server上是均衡的,meta server就认为整个集群是均衡的。
+2. 然而,在部分场景下,特别是集群replica server节点数较多,集群中存在大量**小分片**表时,即使每张表是均衡的,整个集群也不是均衡的。
+3. 从2.3.0版本开始,Pegasus支持集群级别的负载均衡,在保障不改变表均衡的情况下让整个集群的replica个数均衡。
+
+其使用方式和上面描述方法相同,并且支持上述所有的命令。
+如果需要使用集群级别负载均衡,需要修改以下配置:
+```
+[[meta_server]]
+ balance_cluster = true // 默认为false
+```
+
+
## 设计篇
+在当前Pegasus balancer的实现中,meta server会定期对所有replica
server节点的replica情况做评估,当其认为replica在节点分布不均衡时,会将相应replica进行迁移。
+在balancer生成决策过程中需要考虑的因素有:
+- 对于任意表,其partition在节点上的分布要均衡,这其中包括如下几个方面:
+ - 某个partition的三个副本不能全部落在一个节点上
+ - Primary的数量要均摊
+ - Secondary的数量也要均摊
+- 如果发现Primary分配不均衡时,首先考虑的策略应该是对Primary进行角色切换,而不是直接就进行数据拷贝
+- 不仅要考虑节点间的负载均衡,也要尽量保证节点内各个磁盘的replica个数是均衡的
+
+### Move_Primary
+当primary分布不均衡时,首先考虑的策略是对primary进行角色切换,也就是说,需要寻找到一条路径,将primary从“多方”迁移到“少方”。将迁移的primary数量作为流量,很自然地我们就想到了Ford-Fulkerson,即:
+1. 寻找一条从source到sink的增广路径
+2. 按照增广路径修改各边权重,形成残差网络
+3. 在残差网络中继续步骤1,直到找不到增广路径为止。
+
+但是我们又不能直接套用Ford-Fulkerson。原因在于第2步中,按照Ford-Fulkerson,增广路上的一条权重为x的边意味着从A流向B的primary的个数为x,此时形成残差网络中,该边的权重需要减去x,然而其反向边也同时增加x(反向边的作用用于提供一个调整的机会,因为之前形成的增广路径很有可能不是最大流,该反向边用于调整此前形成的增广路径,具体参考Ford-Fulkerson算法)。但是在我们的模型中,反向边增加x是不合理的,例如,对于Partition[Primary:
A, Secondary: (B, C)],Primary从A向B流动,最终使得Partition成为[Primary: B, Secondary: (A,
C)],这时意味着:
+1. A到B的流量减少
+2. A到C的流量减少
+3. B到A的流量增加
+4. B到C的流量增加
+
+这显然与Ford-Fulkerson的残差网络的反向边的权重变化是不同的。 所以我们将算法修改如下:
+1. 按照当前的partition分布生成图结构,并根据Ford-Fulkerson算法,找到一条增广路径
+2. 根据找到的增广路径,构造primary角色切换的决策动作。并在集群中执行该动作,生成新的partition分布
+3. 根据新的partition分布,迭代步骤1,一直到不能找到增广路径
+
+从上面可以看出,该算法主要是对第2步进行了修改,并非像Ford-Fulkerson算法那样简单的进行边权重修改。
+
+NOTE:我们在执行Ford-Fulkerson进行primary迁移的时候,是针对单个表的,也就是说构造网络、执行角色切换都是针对单个表的,当要对多个表进行迁移,则只要循环对所有表各执行上述流程就可以了
Review Comment:
```suggestion
NOTE:我们在执行Ford-Fulkerson进行primary迁移的时候,是针对单个表的,也就是说构造网络、执行角色切换都是针对单个表的,当要对多个表进行迁移,则只要循环对所有表各执行上述流程就可以了。
```
##########
_docs/en/administration/rebalance.md:
##########
@@ -2,4 +2,380 @@
permalink: administration/rebalance
---
-TRANSLATING
+This document mainly introduces the concepts, usage, and design of rebalance
in Pegasus.
+
+## Concept Section
+In Pegasus, rebalance mainly includes the following aspects:
+1. If a certain partition replica does not meet the requirement of one primary
and two secondaries, a node needs to be selected to complete the missing
replica. This process in Pegasus is called `cure`.
+2. After all partitions meet the requirement of one primary and two
secondaries, the number of replicas on each replica server in the cluster is
adjusted to try to keep the number of replicas served by each machine at a
similar level. This process in Pegasus is called `balance`.
Review Comment:
```suggestion
2. After all partitions have 3 replicas each, all replicas need to be
distributed evenly across the replica servers. This process in Pegasus is
called `balance`.
```
##########
_docs/en/administration/rebalance.md:
##########
@@ -2,4 +2,380 @@
permalink: administration/rebalance
---
-TRANSLATING
+This document mainly introduces the concepts, usage, and design of rebalance
in Pegasus.
+
+## Concept Section
+In Pegasus, rebalance mainly includes the following aspects:
+1. If a certain partition replica does not meet the requirement of one primary
and two secondaries, a node needs to be selected to complete the missing
replica. This process in Pegasus is called `cure`.
+2. After all partitions meet the requirement of one primary and two
secondaries, the number of replicas on each replica server in the cluster is
adjusted to try to keep the number of replicas served by each machine at a
similar level. This process in Pegasus is called `balance`.
+3. If a replica server has multiple disks mounted and provided to Pegasus for
use through the configuration file `data_dirs`, the replica server should try
to keep the number of replicas on each disk at a similar level.
Review Comment:
```suggestion
3. If a replica server has multiple disks mounted and provided to Pegasus
through the configuration file `data_dirs`, the replica server should try to
keep the number of replicas on each disk at a similar level.
```
##########
_docs/en/administration/rebalance.md:
##########
@@ -2,4 +2,380 @@
permalink: administration/rebalance
---
-TRANSLATING
+This document mainly introduces the concepts, usage, and design of rebalance
in Pegasus.
+
+## Concept Section
+In Pegasus, rebalance mainly includes the following aspects:
+1. If a certain partition replica does not meet the requirement of one primary
and two secondaries, a node needs to be selected to complete the missing
replica. This process in Pegasus is called `cure`.
+2. After all partitions meet the requirement of one primary and two
secondaries, the number of replicas on each replica server in the cluster is
adjusted to try to keep the number of replicas served by each machine at a
similar level. This process in Pegasus is called `balance`.
+3. If a replica server has multiple disks mounted and provided to Pegasus for
use through the configuration file `data_dirs`, the replica server should try
to keep the number of replicas on each disk at a similar level.
+
+Based on these points, Pegasus has introduced some concepts to conveniently
describe these situations:
+1. The Health Status of Partition
+ Pegasus has defined several health statuses for partitions:
+ * 【fully healthy】: Healthy, fully meeting the requirement of one primary
and two secondaries.
+ * 【unreadable】: The partition is unreadable. It means that the partition
lacks a primary, but there are one or two secondaries.
+ * 【readable but unwritable】: The partition is readable but not writable. It
means that only one primary remains, and both secondary replicas are lost.
+ * 【readable and writable but unhealthy】: The partition is both readable and
writable, but still not healthy. It means that one secondary is missing among
the three replicas.
+ * 【dead】: All replicas of the partition are unavailable, also known as the
DDD state.
+{:class="img-responsive"}
+
+ When checking the status of the cluster, tables, and partitions through the
Pegasus shell, you will often see the overall statistics or individual
descriptions of the health conditions of the partitions. For example, by using
the ls -d command, you can see the number of partitions in different health
conditions for each table, including the following:
+ * fully_healthy: Completely healthy.
+ * unhealthy: Not completely healthy.
+ * write_unhealthy: Unwritable, including the above-mentioned readable but
unwritable and dead states.
+ * read_unhealthy: Unreadable, including the above-mentioned unreadable and
dead states.
+2. The Operating Level of the Meta Server
+ The operating level of the meta server determines the extent to which the
meta server will manage the entire distributed system.
+ The most commonly used operating levels include:
+ * blind: Under this operating level, the meta_server rejects any operation
that may modify the status of the metadata. This level is generally used when
migrating Zookeeper.
+ * steady: Under this operating level, the meta server only performs cure,
that is, it only processes unhealthy partitions.
+ * lively: Under this operating level, once all partitions have become
healthy, the meta server will attempt to perform balance to adjust the number
of replicas on each machine.
+
+## Operation Section
+
+### Observing the System Status
+
+You can observe the partition status of the system through the Pegasus shell
client:
+
+1. nodes -d
+
+ It can be used to observe the number of replicas on each node in the system:
+
+ ```
+ >>> nodes -d
+ address status replica_count primary_count
secondary_count
+ 10.132.5.1:32801 ALIVE 54 18
36
+ 10.132.5.2:32801 ALIVE 54 18
36
+ 10.132.5.3:32801 ALIVE 54 18
36
+ 10.132.5.5:32801 ALIVE 54 18
36
+ ```
+
+ If the number of partitions on each node varies significantly, you can use
the command "set_meta_level lively" to make adjustments.
+
+2. app <table_name> -d
+
+ It can be used to view the distribution of all partitions of a certain
table: you can observe the composition of each partition, as well as a summary
of how many partitions of this table each node serves.
+
+ ```
+ >>> app temp -d
+ [Parameters]
+ app_name: temp
+ detailed: true
+
+ [Result]
+ app_name : temp
+ app_id : 14
+ partition_count : 8
+ max_replica_count : 3
+ details :
+ pidx ballot replica_count primary
secondaries
+ 0 22344 3/3 10.132.5.2:32801
[10.132.5.3:32801,10.132.5.5:32801]
+ 1 20525 3/3 10.132.5.3:32801
[10.132.5.2:32801,10.132.5.5:32801]
+ 2 19539 3/3 10.132.5.1:32801
[10.132.5.3:32801,10.132.5.5:32801]
+ 3 18819 3/3 10.132.5.5:32801
[10.132.5.3:32801,10.132.5.1:32801]
+ 4 18275 3/3 10.132.5.5:32801
[10.132.5.2:32801,10.132.5.1:32801]
+ 5 18079 3/3 10.132.5.3:32801
[10.132.5.2:32801,10.132.5.1:32801]
+ 6 17913 3/3 10.132.5.2:32801
[10.132.5.1:32801,10.132.5.5:32801]
+ 7 17692 3/3 10.132.5.1:32801
[10.132.5.3:32801,10.132.5.2:32801]
+
+ node primary secondary total
+ 10.132.5.1:32801 2 4 6
+ 10.132.5.2:32801 2 4 6
+ 10.132.5.3:32801 2 4 6
+ 10.132.5.5:32801 2 4 6
+ 8 16 24
+
+ fully_healthy_partition_count : 8
+ unhealthy_partition_count : 0
+ write_unhealthy_partition_count : 0
+ read_unhealthy_partition_count : 0
+
+ list app temp succeed
+ ```
+
+3. server_stat
+
+ It can be used to observe some current monitoring data of each replica
server. If you want to analyze the balance degree of the traffic, you should
focus on observing the QPS and latency of each operation. For the nodes with
obviously abnormal data values (showing a large difference from other nodes),
it is necessary to check whether the number of partitions is unevenly
distributed, or whether there is a read-write hot spot for a certain partition.
+
+ ```
+ >>> server_stat -t replica-server
+ COMMAND: server-stat
+
+ CALL [replica-server] [10.132.5.1:32801] succeed:
manual_compact_enqueue_count=0, manual_compact_running_count=0,
closing_replica_count=0, disk_available_max_ratio=88,
disk_available_min_ratio=78, disk_available_total_ratio=85,
disk_capacity_total(MB)=8378920, opening_replica_count=0,
serving_replica_count=54, commit_throughput=0, learning_count=0,
shared_log_size(MB)=4, memused_res(MB)=2499, memused_virt(MB)=4724,
get_p99(ns)=0, get_qps=0, multi_get_p99(ns)=0, multi_get_qps=0,
multi_put_p99(ns)=0, multi_put_qps=0, put_p99(ns)=0, put_qps=0
+ CALL [replica-server] [10.132.5.2:32801] succeed:
manual_compact_enqueue_count=0, manual_compact_running_count=0,
closing_replica_count=0, disk_available_max_ratio=88,
disk_available_min_ratio=79, disk_available_total_ratio=86,
disk_capacity_total(MB)=8378920, opening_replica_count=0,
serving_replica_count=54, commit_throughput=0, learning_count=0,
shared_log_size(MB)=4, memused_res(MB)=2521, memused_virt(MB)=4733,
get_p99(ns)=0, get_qps=0, multi_get_p99(ns)=0, multi_get_qps=0,
multi_put_p99(ns)=0, multi_put_qps=0, put_p99(ns)=0, put_qps=0
+ CALL [replica-server] [10.132.5.3:32801] succeed:
manual_compact_enqueue_count=0, manual_compact_running_count=0,
closing_replica_count=0, disk_available_max_ratio=90,
disk_available_min_ratio=78, disk_available_total_ratio=85,
disk_capacity_total(MB)=8378920, opening_replica_count=0,
serving_replica_count=54, commit_throughput=0, learning_count=0,
shared_log_size(MB)=4, memused_res(MB)=2489, memused_virt(MB)=4723,
get_p99(ns)=0, get_qps=0, multi_get_p99(ns)=0, multi_get_qps=0,
multi_put_p99(ns)=0, multi_put_qps=0, put_p99(ns)=0, put_qps=0
+ CALL [replica-server] [10.132.5.5:32801] succeed:
manual_compact_enqueue_count=0, manual_compact_running_count=0,
closing_replica_count=0, disk_available_max_ratio=88,
disk_available_min_ratio=82, disk_available_total_ratio=85,
disk_capacity_total(MB)=8378920, opening_replica_count=0,
serving_replica_count=54, commit_throughput=0, learning_count=0,
shared_log_size(MB)=4, memused_res(MB)=2494, memused_virt(MB)=4678,
get_p99(ns)=0, get_qps=0, multi_get_p99(ns)=0, multi_get_qps=0,
multi_put_p99(ns)=0, multi_put_qps=0, put_p99(ns)=0, put_qps=0
+
+ Succeed count: 4
+ Failed count: 0
+ ```
+
+4. app_stat -a <app_name>
+
+ It can be used to observe the statistical information of each partition in
a certain table. For the partitions with obviously abnormal data values,
attention should be paid to whether there is a partition hot spot.
+
+ ```
+ >>> app_stat -a temp
+ pidx GET MULTI_GET PUT MULTI_PUT DEL
MULTI_DEL INCR CAS SCAN expired filtered
abnormal storage_mb file_count
+ 0 0 0 0 0 0
0 0 0 0 0 0
0 0 3
+ 1 0 0 0 0 0
0 0 0 0 0 0
0 0 1
+ 2 0 0 0 0 0
0 0 0 0 0 0
0 0 4
+ 3 0 0 0 0 0
0 0 0 0 0 0
0 0 2
+ 4 0 0 0 0 0
0 0 0 0 0 0
0 0 3
+ 5 0 0 0 0 0
0 0 0 0 0 0
0 0 2
+ 6 0 0 0 0 0
0 0 0 0 0 0
0 0 1
+ 7 0 0 0 0 0
0 0 0 0 0 0
0 0 3
+ 0 0 0 0 0
0 0 0 0 0 0
0 0 19
+ ```
+
+### Controlling the Load Balancing of the Cluster
+
+Pegasus provides the following commands to control the load balancing of the
cluster:
+
+1. set_meta_level
+
+ This command is used to control the operating level of the meta, and the
following levels are supported:
+ * freezed: The meta server will stop the cure work for unhealthy partitions. It is generally used when many nodes are crashing or the cluster is extremely unstable. In addition, if the number of alive nodes in the cluster drops below a certain count or percentage (controlled by the configuration items `min_live_node_count_for_unfreeze` and `node_live_percentage_threshold_for_update`), the meta server will automatically change to the freezed state and wait for manual intervention.
+ * steady: The default level of the meta server. It only performs the cure operation and does not perform the balance operation.
+ * lively: The meta server will adjust the number of replicas to strive for balance.
+
+ You can use either `cluster_info` or `get_meta_level` to check the current
operating level of the cluster.
+
+ Some Suggestions for Adjustment:
+ * First, use the `nodes -d` command in the shell to check whether the cluster
is balanced, and then make adjustments when it is unbalanced. Usually, after
the following situations occur, it is necessary to enable the lively mode for
adjustment:
+ * When a new table is created, the number of replicas may be uneven at
this time.
+ * When nodes are launched, taken offline, or upgraded in the cluster, the
number of replicas may also be uneven.
+ * When a node crashes and some replicas are migrated to other nodes.
+ * The adjustment process will trigger replica migration, which will affect
the availability of the cluster. Although the impact is not significant, if the
requirement for availability is very high and the adjustment demand is not
urgent, it is recommended to make the adjustment during the **low-peak period**.
+ * After the adjustment is completed, reset the level to the steady state by
using the `set_meta_level steady` command to avoid unnecessary replica
migration during normal times and reduce cluster jitter.
+ * Pegasus also provides some commands for fine-grained control of balance.
Please refer to [Advanced Options for Load
Balancing](#advanced-options-for-load-balancing).
+
+2. balance
+
+ The balance command is used to manually send commands for replica
migration. The supported migration types are:
+ * move_pri: Swap the primary and secondary of a certain partition
(essentially in two steps: 1. Downgrade the "from" node; 2. Upgrade the "to"
node. If the meta server goes down after the first step is completed, the new
meta server will not continue with the second step, and the move_pri command
can be regarded as interrupted).
+ * copy_pri: Migrate the primary of a certain partition to a new node.
+ * copy_sec: Migrate the secondary of a certain partition to a new node.
+
+ **Note that when using these commands, ensure that the meta server is in
the steady state; otherwise, the commands will not take effect.**
+
+ Please refer to the following examples (irrelevant outputs have been
deleted):
+ ```
+ >>> get_meta_level
+ current meta level is fl_steady
+
+ >>> app temp -d
+ pidx ballot replica_count primary
secondaries
+ 0 3 3/3 10.231.58.233:34803
[10.231.58.233:34802,10.231.58.233:34801]
+
+ list app temp succeed
+
+ >>> balance -g 1.0 -p move_pri -f 10.231.58.233:34803 -t 10.231.58.233:34802
+ send balance proposal result: ERR_OK
+
+ >>> app temp -d
+ pidx ballot replica_count primary
secondaries
+ 0 5 3/3 10.231.58.233:34802
[10.231.58.233:34801,10.231.58.233:34803]
+ list app temp succeed
+ ```
+
+3. propose
+
+ The propose command is used to send lower-level primitive commands for replica adjustment, mainly including the following types:
+ * assign_primary: Assign the primary of a certain partition to a specific machine
+ * upgrade_to_primary: Upgrade a secondary of a certain partition to the primary
+ * add_secondary: Add a secondary for a certain partition
+ * upgrade_to_secondary: Upgrade a learner under a certain partition to a secondary
+ * downgrade_to_secondary: Downgrade the primary under a certain partition to a secondary
+ * downgrade_to_inactive: Downgrade the primary/secondary under a certain partition to an inactive state
+ * remove: Remove a certain replica under a certain partition
+
+ ```
+ >>> app temp -d
+ pidx ballot replica_count primary
secondaries
+ 0 5 3/3 10.231.58.233:34802
[10.231.58.233:34801,10.231.58.233:34803]
+ list app temp succeed
+ >>> propose -g 1.0 -p downgrade_to_inactive -t 10.231.58.233:34802 -n
10.231.58.233:34801
+ send proposal response: ERR_OK
+ >>> app temp -d
+ pidx ballot replica_count primary
secondaries
+ 0 7 3/3 10.231.58.233:34802
[10.231.58.233:34803,10.231.58.233:34801]
+ list app temp succeed
+ ```
+
+ In the above example, the propose command aims to downgrade
10.231.58.233:34801. Therefore, this command needs to be sent to the primary
(10.231.58.233:34802) of the partition, which will carry out the actual
downgrade of that replica. Note that this reflects the design concept of
the Pegasus system: **The meta server is responsible for managing the primary,
and the primary is responsible for managing other replicas under the
partition**.
+
+ In the above example, there may not be an obvious sign that
10.231.58.233:34801 has been downgraded. This is due to the existence of the
system's cure function, which will quickly repair an unhealthy partition. You
can confirm that the command has taken effect by observing the changes in the
ballot.
+
+ Under normal circumstances, you shouldn't need to use the propose command.
+
+### Advanced Options for Load Balancing
+
+The meta server provides some more fine-grained parameters for load balancing
control. These parameters are adjusted through the remote_command command:
+
+#### Using the `help` command to view all available remote_command options
+
+```
+>>> remote_command -l 127.0.0.1:34601 help
+COMMAND: help
+
+CALL [user-specified] [127.0.0.1:34601] succeed: help|Help|h|H [command] -
display help information
+repeat|Repeat|r|R interval_seconds max_count command - execute command
periodically
+...
+meta.lb.assign_delay_ms [num | DEFAULT]
+meta.lb.assign_secondary_black_list [<ip:port,ip:port,ip:port>|clear]
+meta.lb.balancer_in_turn <true|false>
+meta.lb.only_primary_balancer <true|false>
+meta.lb.only_move_primary <true|false>
+meta.lb.add_secondary_enable_flow_control <true|false>
+meta.lb.add_secondary_max_count_for_one_node [num | DEFAULT]
+...
+
+Succeed count: 1
+Failed count: 0
+```
+
+[remote_command](https://github.com/apache/incubator-pegasus/blob/master/src/utils/command_manager.h)
is a feature of Pegasus that allows a server to register commands, which the
command line can then invoke through RPC. Here,
we use the `help` command to access the meta server leader and obtain all the
commands supported on the meta server. In the example, all irrelevant lines
have been omitted, leaving only all the commands related to load balancing that
start with "meta.lb".
+
+Since the documentation may lag behind the code, it may not cover all of the
meta server's current load balancing (lb) control commands. To obtain the
latest command list, run the `help` command against the latest code.
+
+#### assign_delay_ms
+
+The `assign_delay_ms` is used to control **how long we should delay before
selecting a new secondary when a partition lacks one**. The reason for this is
that the disconnection of a replica may be temporary. If a new secondary is
selected without providing a certain buffer period, it may lead to a huge
amount of data copying.
+
+```
+>>> remote_command -t meta-server meta.lb.assign_delay_ms
+COMMAND: meta.lb.assign_delay_ms
+CALL [meta-server] [127.0.0.1:34601] succeed: 300000
+CALL [meta-server] [127.0.0.1:34602] succeed: unknown command
'meta.lb.assign_delay_ms'
+CALL [meta-server] [127.0.0.1:34603] succeed: unknown command
'meta.lb.assign_delay_ms'
+Succeed count: 3
+Failed count: 0
+>>> remote_command -t meta-server meta.lb.assign_delay_ms 10
+COMMAND: meta.lb.assign_delay_ms 10
+CALL [meta-server] [127.0.0.1:34601] succeed: OK
+CALL [meta-server] [127.0.0.1:34602] succeed: unknown command
'meta.lb.assign_delay_ms'
+CALL [meta-server] [127.0.0.1:34603] succeed: unknown command
'meta.lb.assign_delay_ms'
+Succeed count: 3
+Failed count: 0
+>>> remote_command -t meta-server meta.lb.assign_delay_ms
+COMMAND: meta.lb.assign_delay_ms
+CALL [meta-server] [127.0.0.1:34601] succeed: 10
+CALL [meta-server] [127.0.0.1:34602] succeed: unknown command
'meta.lb.assign_delay_ms'
+CALL [meta-server] [127.0.0.1:34603] succeed: unknown command
'meta.lb.assign_delay_ms'
+Succeed count: 3
+Failed count: 0
+```
+
+As shown in the example, executing the command without a parameter returns the
currently configured value, while passing a parameter sets the expected new
value.
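+
+Per the `help` output above, `meta.lb.assign_delay_ms` accepts `[num | DEFAULT]`, so the value can presumably also be restored to its default (a sketch; the reply should mirror the `OK` response shown above):
+```
+>>> remote_command -t meta-server meta.lb.assign_delay_ms DEFAULT
+```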
+
+#### assign_secondary_black_list
+
+This command is used to set **the blacklist for adding secondaries**. It is
extremely useful when taking a batch of nodes offline from the cluster.
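+
+For example (a sketch following the `[<ip:port,ip:port,ip:port>|clear]` syntax from the `help` output above; the addresses below are only placeholders), the blacklist could be set before taking nodes offline and cleared afterwards:
+```
+>>> remote_command -t meta-server meta.lb.assign_secondary_black_list 10.132.5.1:32801,10.132.5.2:32801
+>>> remote_command -t meta-server meta.lb.assign_secondary_black_list clear
+```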
+
+#### Flow Control during the Addition of Secondaries
+
+Sometimes, the load balancing decision algorithm may require adding
quite a few secondary replicas on one machine. For example:
+* The crash of one or more nodes will cause normal nodes to accept a large
number of partitions instantaneously.
+* When a new node is added, a large number of replicas may flood in.
+
+However, when executing these decisions to add replicas, we should avoid
adding a large number of secondaries at the same moment, because:
+* Adding a secondary basically involves data copying. If too many are added at
once, normal reads and writes may be affected.
+* The total bandwidth is limited. If multiple tasks of adding replicas are
sharing this bandwidth, then the execution time of each task will be prolonged.
As a result, the system will be in a state where **a large number of replicas
are unhealthy for a long time**, increasing the risk of instability.
+
+So, Pegasus uses two commands to support flow control:
+1. meta.lb.add_secondary_enable_flow_control: It indicates whether to enable
the flow control feature.
+2. meta.lb.add_secondary_max_count_for_one_node: It represents the number of
add_secondary actions that can be executed simultaneously for each node.
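+
+A typical way to apply them (a sketch based on the `help` output above; `10` is only an illustrative value):
+```
+>>> remote_command -t meta-server meta.lb.add_secondary_enable_flow_control true
+>>> remote_command -t meta-server meta.lb.add_secondary_max_count_for_one_node 10
+```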
+
+#### Fine-grained Control of the Balancer
+
+In the current implementation of Pegasus, the balancer process can be roughly
summarized in four points:
+1. Try to achieve the balance of primaries as much as possible through role
swapping.
+2. If it is not possible to make the primaries evenly distributed in step 1,
achieve the balance of primaries by copying data.
+3. After step 2 is completed, achieve the balance of secondaries by copying
data.
+4. Perform the actions in steps 1-2-3 for each table separately.
+
+Pegasus provides some control parameters for this process, enabling more
fine-grained control:
+* meta.lb.only_primary_balancer: For each table, only steps 1 and 2 are
carried out (reducing the data copying caused by copying secondaries).
+* meta.lb.only_move_primary: For each table, when adjusting the primary, only
consider method 1 (reducing the data copying caused by copying primaries).
+* meta.lb.balancer_in_turn: The balancer for each table is executed
sequentially instead of in parallel (used for debugging and observing system
behavior).
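+
+For instance (a sketch based on the `help` output above), to make a balancing round adjust only primaries, and only through role switching (i.e. without copying data):
+```
+>>> remote_command -t meta-server meta.lb.only_primary_balancer true
+>>> remote_command -t meta-server meta.lb.only_move_primary true
+>>> set_meta_level lively
+```
+Once `nodes -d` shows an even primary distribution, switch back with `set_meta_level steady`.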
+
+### Usage Examples of Some Commands
+
+By combining the above load balancing primitives, Pegasus provides some
scripts to execute operations such as rolling upgrades and node offline
processes, for example:
+
+1.
[scripts/migrate_node.sh](https://github.com/apache/incubator-pegasus/blob/master/scripts/migrate_node.sh)
+
+ This script is used to move all the primaries served by a certain node away
from it.
+
+2.
[scripts/pegasus_rolling_update.sh](https://github.com/apache/incubator-pegasus/blob/master/scripts/pegasus_rolling_update.sh)
+
+ It is used to perform an online rolling upgrade on the nodes in the cluster.
+
+3.
[scripts/pegasus_offline_node_list.sh](https://github.com/apache/incubator-pegasus/blob/master/scripts/pegasus_offline_node_list.sh)
+
+ It is used to take a batch of nodes offline.
+
+However, the logic of some of the scripts depends on Xiaomi's [Minos
Deployment System](https://github.com/XiaoMi/minos). We hope the community can
help Pegasus support more deployment systems.
+
+## Cluster-level Load Balancing
+1. The above descriptions all perform load balancing on a per-table basis.
That is, when each table in a cluster is balanced on the replica servers, the
meta server considers the entire cluster to be balanced.
+2. However, in some scenarios, especially when there are a large number of
replica server nodes in the cluster and there are a large number of tables with
**small partitions** in the cluster, even if each table is balanced, the entire
cluster is not balanced.
+3. Starting from version 2.3.0, Pegasus supports cluster-level load balancing,
ensuring that the number of replicas in the entire cluster is balanced without
changing the balance of the tables.
+
+Its usage is the same as described above, and it supports all the commands
mentioned above.
+To use cluster-level load balancing, modify the following configuration:
+```
+[[meta_server]]
+ balance_cluster = true // default is false
+```
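+
+The operational workflow is unchanged (a sketch, assuming `balance_cluster` has taken effect on the meta servers): trigger a round of balancing and switch back to steady once `nodes -d` shows an even replica count across the cluster:
+```
+>>> set_meta_level lively
+>>> nodes -d
+>>> set_meta_level steady
+```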
+
+## Design Section
+n the current implementation of the Pegasus balancer, the meta server will
regularly evaluate the replica situation of all replica server nodes. When it
deems that the replicas are unevenly distributed across nodes, it will migrate
the corresponding replicas.
Review Comment:
```suggestion
In the current implementation of the Pegasus balancer, the meta server will
regularly evaluate the replica distribution across all replica servers. When it
deems that the replicas are unevenly distributed across nodes, it will migrate
the corresponding replicas.
```
##########
_docs/en/administration/rebalance.md:
##########
@@ -2,4 +2,380 @@
permalink: administration/rebalance
---
-TRANSLATING
+This document mainly introduces the concepts, usage, and design of rebalance
in Pegasus.
+
+## Concept Section
+In Pegasus, rebalance mainly includes the following aspects:
+1. If a certain partition replica does not meet the requirement of one primary
and two secondaries, a node needs to be selected to complete the missing
replica. This process in Pegasus is called `cure`.
+2. After all partitions meet the requirement of one primary and two
secondaries, the number of replicas on each replica server in the cluster is
adjusted to try to keep the number of replicas served by each machine at a
similar level. This process in Pegasus is called `balance`.
+3. If a replica server has multiple disks mounted and provided to Pegasus for
use through the configuration file `data_dirs`, the replica server should try
to keep the number of replicas on each disk at a similar level.
+
+Based on these points, Pegasus has introduced some concepts to conveniently
describe these situations:
+1. The Health Status of Partition
+ Pegasus has defined several health statuses for partitions:
+ * 【fully healthy】: Healthy, fully meeting the requirement of one primary
and two secondaries.
+ * 【unreadable】: The partition is unreadable. It means that the partition
lacks a primary, but there is one or two secondaries.
+ * 【readable but unwritable】: The partition is readable but not writable. It
means that only one primary remains, and both secondary replicas are lost.
+ * 【readable and writable but unhealthy】: The partition is both readable and
writable, but still not healthy. It means that one secondary is missing among
the three replicas.
+ * 【dead】: All replicas of the partition are unavailable, also known as the
DDD state.
+{:class="img-responsive"}
+
+ When checking the status of the cluster, tables, and partitions through the
Pegasus shell, you will often see the overall statistics or individual
descriptions of the health conditions of the partitions. For example, by using
the ls -d command, you can see the number of partitions in different health
conditions for each table, including the following:
Review Comment:
```suggestion
When checking the status of the cluster, tables, and partitions through
the Pegasus shell, you will often see the overall statistics or individual
descriptions of the health conditions of the partitions. For example, by using
the `ls -d` command, you can see the number of partitions in different health
conditions for each table, including the following:
```
##########
_docs/en/administration/rebalance.md:
##########
@@ -2,4 +2,380 @@
permalink: administration/rebalance
---
-TRANSLATING
+This document mainly introduces the concepts, usage, and design of rebalance
in Pegasus.
+
+## Concept Section
+In Pegasus, rebalance mainly includes the following aspects:
+1. If a certain partition replica does not meet the requirement of one primary
and two secondaries, a node needs to be selected to complete the missing
replica. This process in Pegasus is called `cure`.
Review Comment:
```suggestion
1. If a partition has less than 3 replicas (1 primary and 2 secondaries), a
node needs to be selected to complete the missing replicas. This process in
Pegasus is called `cure`.
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]