This is an automated email from the ASF dual-hosted git repository.
wangdan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-pegasus-website.git
The following commit(s) were added to refs/heads/master by this push:
new a006ee56 Update rolling docs (#71)
a006ee56 is described below
commit a006ee562533fbfff896f317a2f4cd587f7bbcbc
Author: Yingchun Lai <[email protected]>
AuthorDate: Thu Feb 1 14:10:24 2024 +0800
Update rolling docs (#71)
---
_data/en/translate.yml | 2 +-
_data/zh/translate.yml | 2 +-
_docs/en/administration/rolling-update.md | 109 +++++++++++++++++++++++++++++-
_docs/en/administration/scale-in-out.md | 2 +-
_docs/zh/administration/administration.md | 2 +-
_docs/zh/administration/bad-disk.md | 2 +-
_docs/zh/administration/rolling-update.md | 92 ++++++++++++-------------
7 files changed, 156 insertions(+), 55 deletions(-)
diff --git a/_data/en/translate.yml b/_data/en/translate.yml
index abdab8df..f9e1be32 100644
--- a/_data/en/translate.yml
+++ b/_data/en/translate.yml
@@ -48,7 +48,7 @@ title_deployment: "Deployment"
title_config: "Configurations"
title_rebalance: "Rebalance"
title_monitoring: "Monitoring"
-title_rolling-update: "Rolling-Update"
+title_rolling-update: "Rolling Restart and Upgrade"
title_scale-in-out: "Scale-in and Scale-out"
title_resource-management: "Resource Management"
title_cold-backup: "Cold Backup"
diff --git a/_data/zh/translate.yml b/_data/zh/translate.yml
index f2aab23b..25ac85da 100644
--- a/_data/zh/translate.yml
+++ b/_data/zh/translate.yml
@@ -48,7 +48,7 @@ title_deployment: "集群部署"
title_config: "配置说明"
title_rebalance: "负载均衡"
title_monitoring: "可视化监控"
-title_rolling-update: "集群升级"
+title_rolling-update: "集群重启和升级"
title_scale-in-out: "集群扩容缩容"
title_resource-management: "资源管理"
title_cold-backup: "冷备份"
diff --git a/_docs/en/administration/rolling-update.md
b/_docs/en/administration/rolling-update.md
index 93818691..0ffa9b70 100644
--- a/_docs/en/administration/rolling-update.md
+++ b/_docs/en/administration/rolling-update.md
@@ -2,4 +2,111 @@
permalink: administration/rolling-update
---
-TRANSLATING
+# Design goals
+
+When upgrading the Pegasus server version or persistently modifying the
configuration, it is necessary to restart the cluster. For distributed
clusters, the commonly used restart method is **Rolling Restart**, which means
restarting servers one by one without stopping cluster service.
+
+> The following document assumes that the number of replicas of tables in the
Pegasus cluster is 3.
+
+The key goal of a cluster restart is to keep the service running continuously and minimize the impact on availability. During the restart, the following factors can affect service availability:
+* After the Replica Server process is killed, the replicas served by that process cannot provide service:
+ * For primary replicas: since primary replicas directly serve client reads and writes, killing the process inevitably affects reads and writes, and service cannot recover until the Meta Server reassigns new primary replicas. The Meta Server maintains the liveness of Replica Servers through beacons, and the latency of the Failure Detector depends on the configuration parameter `fd_grace_seconds`, which defaults to 10 seconds, wh [...]
+ * For secondary replicas: since secondary replicas do not serve reads, they theoretically have no impact on reads. However, killing one affects writes, because the PacificA consistency protocol requires all replicas to be written successfully before a write operation can be committed. After the process is killed, the primary replica will find during a write that the secondary replica has been lost, and then notify the Meta Server to kick it out. After the `reconfiguration` stage, [...]
+* Restarting the Meta Server: the impact on availability is almost negligible. After the client first retrieves the service node information for each partition from the Meta Server, it caches that information locally and usually does not need to query the Meta Server again, so a short disconnection during a Meta Server restart has little impact on clients. However, considering that the Meta Server needs to maintain beacons wit [...]
+* Restarting the Collector: this has no impact on availability. However, availability metrics are collected by the Collector, so the metrics data may be slightly affected.
+
+Therefore, the following points help maintain availability during a cluster restart:
+* Restart only one process at a time, and restart the next process only after the current one has restarted and fully recovered to serving state. Because:
+ * If the cluster has not recovered to a fully healthy state after a process restart, and some partitions still have only one primary and one secondary replica, then killing another Replica Server process is likely to leave some partitions with only one primary replica, making them unable to serve writes.
+ * Waiting for all partitions in the cluster to recover three replicas before restarting the next process also reduces the risk of data loss.
+* Proactively migrate replicas instead of migrating them passively, so that Failure Detector latency does not affect availability. Because:
+ * Passive migration must wait for the Failure Detector to detect that the Replica Server is lost, while proactive migration moves the primary replicas served by the server to other servers before killing the process. This `reconfiguration` procedure is fast and typically completes in under 1 second.
+* Where possible, manually downgrade the secondary replicas served by the Replica Server before killing the process. Because:
+ * This triggers the `reconfiguration` proactively rather than passively on write failures, further reducing the impact on availability.
+* Minimize the recovery workload during a process restart to shorten the restart time.
+ * A Replica Server needs to replay WAL logs to recover data on restart. If it is killed directly, the amount of data to replay may be large. If the flush of memtables to disk is actively triggered before the kill, however, the amount of data to replay on restart is greatly reduced and the restart becomes much faster. The time required to restart the entire cluster is reduced accordingly.
+* Minimize unnecessary data transmission between servers, to avoid the availability impact of high CPU, network IO, and disk IO load while transmitting data.
+ * After a Replica Server is killed, some partitions enter the `1 primary + 1 secondary` state. If the Meta Server immediately supplemented replicas on other Replica Servers, it would cause a large amount of cross-server data transmission, increasing CPU, network IO, and disk IO load and affecting cluster stability. Pegasus's solution is to allow the `1 primary + 1 secondary` state for a period of time, providing a maintenance window for the restarted Replica Server. If i [...]
+
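Two of the delays discussed above bound the restart plan: the Failure Detector grace period and the replica-supplement delay. A minimal shell sketch of the arithmetic (the values are the defaults quoted in this document; the variable names are illustrative, not Pegasus configuration keys):

```shell
#!/bin/sh
# fd_grace_seconds: worst-case delay before the Meta Server declares a
# Replica Server dead and reassigns primaries (default 10 seconds).
FD_GRACE_SECONDS=10
# replica_assign_delay_ms_for_dropouts: how long a partition may stay in the
# "1 primary + 1 secondary" state before replicas are supplemented on other
# servers (default 5 minutes).
ASSIGN_DELAY_MS=300000

echo "passive primary failover: up to ${FD_GRACE_SECONDS}s of read/write impact"
echo "maintenance window per Replica Server: $((ASSIGN_DELAY_MS / 60000)) min"
```

The proactive steps below exist precisely to avoid paying the first delay, while the second delay is what the restart script temporarily raises (via `meta.lb.assign_delay_ms`) to cover a node's downtime.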
+# Restart steps
+
+## High availability restart steps
+
+* If it is an upgrade, please prepare new server packages and configuration
files first
+* Use shell tools to set the meta level of the cluster to `steady`, turn off
[load balancing](rebalance), and avoid unnecessary replica migration
+ ```
+ >>> set_meta_level steady
+ ```
+* Use shell tools to set the maintenance window of a single Replica Server
+ ```
+ >>> remote_command -t meta-server meta.lb.assign_delay_ms $value
+ ```
+ `value` is the maintenance window of a single Replica Server: the delay before the Meta Server starts supplementing replicas to other servers after it discovers that the Replica Server is lost. For example, set it to `3600000` (1 hour).
+* Restart the Replica Server processes one by one. Steps to restart a single Replica Server:
+ * Use shell tools to send [remote commands](remote-commands#meta-server) to
Meta Server, temporarily disable `add_secondary` operations:
+ ```
+ >>> remote_command -t meta-server
meta.lb.add_secondary_max_count_for_one_node 0
+ ```
+ * Use `migrate_node` command to transfer all primary replicas on the Replica
Server to other servers:
+ ```bash
+ $ ./run.sh migrate_node -c $meta_list -n $node -t run
+ ```
+ Use the shell tool's `nodes -d` command to check the replicas served by the node, and wait for the number of **primary** replicas to drop to 0. If it stays nonzero for a long time, execute the command again.
+ * Use `downgrade_node` command to downgrade all secondary replicas on the
Replica Server to `INACTIVE`:
+ ```bash
+ $ ./run.sh downgrade_node -c $meta_list -n $node -t run
+ ```
+ Use the shell tool's `nodes -d` command to check the replicas served by the node, and wait for the number of **secondary** replicas to drop to 0. If it stays nonzero for a long time, execute the command again.
+ * Use shell tools to send a remote command to the Replica Server to close
all replicas and trigger flush operations:
+ ```
+ >>> remote_command -l $node replica.kill_partition
+ ```
Wait about 1 minute for the data to finish flushing to disk.
+ * If it is an upgrade, replace the package and configuration file
+ * Restart the Replica Server process
+ * Use shell tools to send [remote commands](remote-commands#meta-server) to the Meta Server to re-enable `add_secondary` operations so that replicas are supplemented quickly:
+ ```
+ >>> remote_command -t meta-server
meta.lb.add_secondary_max_count_for_one_node 100
+ ```
+ * Use the shell tool's `ls -d` command to check the cluster status, and wait for all partitions to fully recover health
+ * Continue with the next Replica Server
+* Restart the Meta Server processes one by one. Steps to restart a single Meta Server:
+ * If it is an upgrade, replace the package and configuration file
+ * Restart the Meta Server process
+ * Wait for more than 30 seconds to ensure the continuity of beacons between
Meta Server and Replica Servers
+ * Continue with the next Meta Server
+* Restart the Collector process:
+ * If it is an upgrade, replace the package and configuration file
+ * Restart the Collector process
+* Reset configurations
+ * Reset the configurations modified in the above steps using shell tools:
+ ```
+ >>> remote_command -t meta-server
meta.lb.add_secondary_max_count_for_one_node DEFAULT
+ >>> remote_command -t meta-server meta.lb.assign_delay_ms DEFAULT
+ ```
+
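The per-node procedure above can be condensed into a dry-run shell skeleton. This is only a sketch of the command sequence, not the maintained script: `$RUN` defaults to `echo`, so it merely prints the commands it would run; the `remote_command` lines would in reality be entered through the Pegasus shell tool, and the addresses are placeholders.

```shell
#!/bin/sh
# Dry-run by default: set RUN= (empty) only if you really want to execute.
RUN="${RUN:-echo}"

restart_replica_server() {  # usage: restart_replica_server <meta_list> <node>
    meta_list="$1"; node="$2"
    # 1. Temporarily disable replica supplementation while draining this node.
    $RUN remote_command -t meta-server meta.lb.add_secondary_max_count_for_one_node 0
    # 2. Move primaries off the node, then downgrade its secondaries to INACTIVE.
    $RUN ./run.sh migrate_node -c "$meta_list" -n "$node" -t run
    $RUN ./run.sh downgrade_node -c "$meta_list" -n "$node" -t run
    # 3. Close all replicas to flush memtables, shortening WAL replay on restart.
    $RUN remote_command -l "$node" replica.kill_partition
    # 4. (Replace the package/config if upgrading, then restart the process.)
    # 5. Re-enable add_secondary so the node's replicas are rebuilt quickly.
    $RUN remote_command -t meta-server meta.lb.add_secondary_max_count_for_one_node 100
}

# Dry-run example: prints the command sequence for one node.
restart_replica_server "127.0.0.1:34601,127.0.0.1:34602" "127.0.0.1:34801"
```

Between steps, the real workflow also polls `nodes -d` and `ls -d` as described above; the linked `pegasus_rolling_update.sh` adds those checks plus retries and error handling.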
+## Simplified restart steps
+
+If the availability requirement is not high, the restart steps can be
simplified as follows:
+* If it is an upgrade, please prepare new server packages and configuration
files first
+* Use shell tools to set the meta level of the cluster to `steady`, turn off
[load balancing](rebalance), and avoid unnecessary replica migration
+ ```
+ >>> set_meta_level steady
+ ```
+* Restart the Replica Server processes one by one. Steps to restart a single Replica Server:
+ * If it is an upgrade, replace the package and configuration file
+ * Restart the Replica Server process
+ * Use the shell tool's `ls -d` command to check the cluster status, and wait for all partitions to fully recover health
+ * Continue with the next Replica Server
+* Restart the Meta Server processes one by one. Steps to restart a single Meta Server:
+ * If it is an upgrade, replace the package and configuration file
+ * Restart the Meta Server process
+ * Wait for more than 30 seconds to ensure the continuity of beacons between
Meta Server and Replica Servers
+ * Continue with the next Meta Server
+* Restart the Collector process:
+ * If it is an upgrade, replace the package and configuration file
+ * Restart the Collector process
+
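In both variants, the step worth automating is waiting for the cluster to become fully healthy between node restarts. A hypothetical polling helper (the `unhealthy_partition_count=0` summary line is an assumed format derived from `ls -d` output, not the shell tool's literal output; `LS_CMD` is overridable so the sketch can be dry-run):

```shell
#!/bin/sh
# LS_CMD should produce a one-line health summary derived from `ls -d`;
# the default stub simulates an already-healthy cluster.
LS_CMD="${LS_CMD:-echo unhealthy_partition_count=0}"

wait_until_healthy() {
    # Poll until every partition is fully healthy before touching the next node.
    while ! $LS_CMD | grep -q 'unhealthy_partition_count=0'; do
        sleep 10
    done
    echo "cluster healthy"
}

# With the default stub this returns immediately and prints "cluster healthy".
wait_until_healthy
```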
+# Restart script
+
+A reference script based on [Minos](https://github.com/XiaoMi/minos) and the [High availability restart steps](#high-availability-restart-steps) is available at [scripts/pegasus_rolling_update.sh](https://github.com/apache/incubator-pegasus/blob/master/scripts/pegasus_rolling_update.sh).
diff --git a/_docs/en/administration/scale-in-out.md
b/_docs/en/administration/scale-in-out.md
index fdd3c216..a04b06a1 100644
--- a/_docs/en/administration/scale-in-out.md
+++ b/_docs/en/administration/scale-in-out.md
@@ -2,7 +2,7 @@
permalink: administration/scale-in-out
---
-# Design goal
+# Design goals
When the storage capacity of the cluster is insufficient or the read/write
throughput is too high, it is necessary to scale out the capacity by adding
more nodes. On the contrary, scaling in can be achieved by reducing the number
of nodes.
diff --git a/_docs/zh/administration/administration.md
b/_docs/zh/administration/administration.md
index b53c15e6..e47bc648 100644
--- a/_docs/zh/administration/administration.md
+++ b/_docs/zh/administration/administration.md
@@ -15,7 +15,7 @@ Pegasus 不仅仅只提供简单的 key value 存储接口,我们还基于稳
如果有机器发生持久性的故障,你也可以参照 [集群扩容缩容](scale-in-out) 剔除这个坏节点。
如果是机器的某个SSD盘出故障,可以参照 [坏盘检修](bad-disk) 剔除这个坏盘。
-如果需要升级集群,请参照 [集群升级](rolling-update)。
+如果需要重启或升级集群,请参照 [集群重启](rolling-update)。
集群运行过程中,你需要时刻关注资源(磁盘、内存、网络)的使用情况,并及时做出运维调整,请参照 [资源管理](resource-management)。
diff --git a/_docs/zh/administration/bad-disk.md
b/_docs/zh/administration/bad-disk.md
index 04780df0..a55765ed 100644
--- a/_docs/zh/administration/bad-disk.md
+++ b/_docs/zh/administration/bad-disk.md
@@ -32,7 +32,7 @@ Pegasus 支持磁盘黑名单,如果你要下线某块磁盘,首先要把它
## 重启节点
-在你标注好坏盘名单后,你可以通过 [高可用升级](rolling-update) 单独重启对应节点的服务进程。
+在你标注好坏盘名单后,你可以通过 [高可用重启](rolling-update#高可用重启) 单独重启对应节点的服务进程。
通常你在程序日志中能够看到下列记录,表示黑名单内的磁盘的确被忽略了:
```log
diff --git a/_docs/zh/administration/rolling-update.md
b/_docs/zh/administration/rolling-update.md
index 7db7039b..35ef0a7b 100644
--- a/_docs/zh/administration/rolling-update.md
+++ b/_docs/zh/administration/rolling-update.md
@@ -4,83 +4,80 @@ permalink: administration/rolling-update
# 功能目标
-当需要升级 server 版本或者持久化修改配置时,都需要对集群进行重启。对于分布式集群来说,常用的重启方法是滚动重启
(Rolling-Restart),即不停止集群服务,而对 server 逐个进行重启。
+当需要升级 Pegasus server 版本或者持久化修改配置时,都需要对集群进行重启。对于分布式集群来说,常用的重启方法是滚动重启
(Rolling-Restart),即不停止集群服务,而对 server 逐个进行重启。
> 以下文档假定 Pegasus 集群中表的副本数为 3。
-集群重启的重要目标是不停服,并且对可用性的影响降至最低。为了达到这个目标,我们先梳理在重启过程会影响可用性的点:
-* replica server 进程被 kill 后,该进程服务的 replica 无法提供服务:
- * 对于 primary replica:因为直接向客户端提供读写服务,所以进程被 kill 后肯定会影响读写,需要等 meta server
重新分派新的 primary replica 后才能恢复。meta server 通过心跳感知 replica server 的存活状态,failure
detection 的时间延迟取决于配置参数 `fd_grace_seconds`,默认为 10 秒,即最多需要经过 10 秒,meta server
才能知道 replica server 宕机了,然后重新分派新的 primary replica。
- * 对于 secondary replica:由于不服务读,所以理论上对读无影响。但是会影响写,因为 PacificA
一致性协议要求所有副本都写成功,写操作才能提交。进程被 kill 后,primary replica 在执行写操作过程中会发现该 secondary
replica 已失联,然后通知 meta server 将其踢出,经过 `reconfiguration`
阶段后变成一主一备,继续提供写服务。对于在该切换过程中尚未完成的写操作,即使有 `reconciliation`
阶段重新执行,但客户端可能已经超时,这对可用性是有一定影响的。但是这个影响相对较小,因为 `reconfiguration` 的速度是比较快的,通常能在 1
秒内完成。
-* 重启 meta server:重启 meta server 对可用性的影响几乎可以忽略不计。因为客户端首次从 meta server 获取到各
partition 的服务节点信息后,会在本地缓存该信息,通常不需要再次向 meta server 查询,因此 meta server
重启过程中的短暂失联对客户端基本没有影响。不过考虑到 meta server 需要与 replica server 维持心跳,所以要避免连续 kill
meta server 进程,造成 replica server 心跳失联的风险。
-* 重启 collector:重启 collector 对可用性没有影响。但是可用性统计是在 collector 上进行的,所以可能会对 metrics
数据有轻微影响。
+集群重启的重要目标是不停服,并且对可用性的影响降至最低。在重启过程中,影响服务可用性的有如下几点:
+* Replica Server 进程被 kill 后,该进程服务的 replica 无法提供服务:
+ * 对于 primary replica:因为 primary replica 直接向客户端提供读写服务,所以进程被 kill 后肯定会影响读写,需要等
Meta Server 重新分派新的 primary replica 后才能恢复。Meta Server 通过心跳维护 Replica Server
的存活状态,Failure Detector 的时间延迟取决于配置参数 `fd_grace_seconds`,默认为 10 秒,即最多需要经过 10
秒,Meta Server 才能知道 Replica Server 宕机了,然后重新分派新的 primary replica。
+ * 对于 secondary replica:由于 secondary replica 不服务读,所以理论上对读无影响。但是会影响写,因为 PacificA 一致性协议要求所有副本都写成功,写操作才能提交。进程被 kill 后,primary replica 在执行写操作过程中会发现该 secondary replica 已失联,然后通知 Meta Server 将其踢出,经过 `reconfiguration` 阶段后变成一主一备,继续提供写服务。对于在该切换过程中尚未完成的写操作,即使有 `reconciliation` 阶段重新执行,但客户端可能已经超时,这对可用性是有一定影响的。但是这个影响相对较小,因为 `reconfiguration` 的速度是比较快的,通常能在 1 秒内完成。
+* 重启 Meta Server:重启 Meta Server 对可用性的影响几乎可以忽略不计。因为客户端首次从 Meta Server 获取到各
partition 的服务节点信息后,会在本地缓存该信息,通常不需要再次向 Meta Server 查询,因此 Meta Server
重启过程中的短暂失联对客户端基本没有影响。不过考虑到 Meta Server 需要与 Replica Server 维持心跳,所以要避免长时间停止 Meta
Server 进程,造成 Replica Server 失联。
+* 重启 Collector:重启 Collector 对可用性没有影响。但是可用性统计是在 Collector 上进行的,所以可能会对 metrics
数据有轻微影响。
-因此,在集群重启过程要提高可用性,需要考虑如下几点:
+因此,可以考虑如下几点来保持集群重启过程中的可用性:
* 一次只能重启一个进程,且在该进程重启并完全恢复进入服务状态后,才能重启下一个进程。因为:
- * 如果重启一个进程后,集群没有恢复到完全健康状态,有的 partition 还只有一主一备,这时如果再 kill 一个 replica server
进程,很可能进入只有一主的状态,从而无法提供写服务。
+ * 如果重启一个进程后,集群没有恢复到完全健康状态,有的 partition 还只有一主一备,这时如果再 kill 一个 Replica Server
进程,很可能进入只有一主的状态,从而无法提供写服务。
* 等待集群所有 partition 都恢复三副本后再重启下一个进程,也能降低数据丢失的风险。
-* 尽量主动迁移 replica,而不是被动迁移 replica,避免 failure detection 的延迟影响可用性。因为:
- * 被动迁移需要等待 failure detection 来感知节点失联,而主动迁移就是在 kill 掉 replica server
之前,先将这个进程服务的 primary replica 都迁移到其他节点上,这个 `reconfiguration` 过程是很快的,基本 1 秒以内完成。
-* 尽量在 kill 掉 replica server 之前,将该进程服务的 secondary replica 手动降级。因为:
- * 将 `reconfiguration` 过程由 “写失败时的被动触发” 变为 “主动触发”,进一步降低对可用性的影响。
+* 尽量主动迁移 replica,而不是被动迁移 replica,避免 Failure Detector 的延迟影响可用性。因为:
+ * 被动迁移需要等待 Failure Detector 来感知节点失联,而主动迁移就是在 kill 掉 Replica Server
之前,先将这个进程服务的 primary replica 都迁移到其他节点上,这个 `reconfiguration` 过程是很快的,基本 1 秒以内完成。
+* 尽量在 kill 掉 Replica Server 之前,将该进程服务的 secondary replica 手动降级。因为:
+ * 将 `reconfiguration` 过程由写失败时的被动触发变为主动触发,进一步降低对可用性的影响。
* 尽量减少进程重启时恢复过程的工作量,以缩短进程重启时间。
- * replica server 在重启时需要 replay log 来恢复数据。如果直接 kill 掉,则需要 replay
的数据量可能很大。但是如果在 kill 之前,先主动触发 memtable 的 flush 操作,让内存数据持久化到磁盘,在重启时需要 replay
的数据量就会大大减少,重启时间会缩短很多,而整个集群重启所需的时间也能大大缩短。
+ * Replica Server 在重启时需要 replay WAL log 来恢复数据。如果直接 kill 掉,则需要 replay
的数据量可能很大。但是如果在 kill 之前,先主动触发 memtable 的 flush 操作,让内存数据持久化到磁盘,在重启时需要 replay
的数据量就会大大减少,重启时间会缩短很多,而整个集群重启所需的时间也能大大缩短。
* 尽量减少不必要的节点间数据拷贝,避免因为增加 CPU、网络 IO、磁盘 IO 的负载带来的可用性影响。
- * replica server 挂掉后,部分 partition 进入一主一备的状态。如果 meta server 立即在其他 replica
server 上补充副本,会带来大量的跨节点数据拷贝,增加 CPU、网络 IO、磁盘 IO 负载压力,影响集群稳定性。Pegasus
解决这个问题的办法是,允许在一段时间内维持一主一备状态,给重启的 replica server 一个维护窗口。如果长时间没有恢复,才会在新的 replica
server 上补充副本。这样兼顾了数据的安全性和集群的稳定性。可以通过配置参数 `replica_assign_delay_ms_for_dropouts`
控制等待时间,默认为 5 分钟。
+ * Replica Server 挂掉后,部分 partition 进入一主一备的状态。如果 Meta Server 立即在其他 Replica
Server 上补充副本,会带来大量的跨节点数据拷贝,增加 CPU、网络 IO、磁盘 IO 负载压力,影响集群稳定性。Pegasus
解决这个问题的办法是,允许在一段时间内维持一主一备状态,给重启的 Replica Server 一个维护窗口。如果长时间没有恢复,才会在其他的 Replica
Server 上补充副本。这样兼顾了数据的完整性和集群的稳定性。可以通过配置参数 `replica_assign_delay_ms_for_dropouts`
控制等待时间,默认为 5 分钟。
# 重启流程
## 高可用重启
-流程如下:
* 如果是升级,请先准备好新的 server 程序包和配置文件
* 使用 shell 工具将集群的 meta level 设置为 `steady`,关闭 [负载均衡功能](rebalance),避免不必要的
replica 迁移
```
>>> set_meta_level steady
```
-* 使用 shell 工具将集群的 meta level 设置为 `steady`,关闭 [负载均衡功能](rebalance),避免不必要的
replica 迁移
+* 使用 shell 工具设置单 Replica Server 的维护时间
```
>>> remote_command -t meta-server meta.lb.assign_delay_ms $value
```
- 其中 `value` 可理解为 replcia server 的维护时间,即为 meta server 发现 replica server
失联后,到其他节点补充副本的触发时间。例如配置为 `3600000`。
-* 重启 replica server 进程,采用逐个重启的策略。重启单个 replica server:
- * 通过 shell 工具向 meta server 发送 [远程命令](remote-commands#meta-server),临时禁掉
`add_secondary` 操作:
+ 其中 `value` 为 Meta Server 发现 Replica Server 失联后,到其他节点补充副本的触发时间。例如配置为
`3600000`。
+* 重启 Replica Server 进程,采用逐个重启的策略。重启单个 Replica Server:
+ * 通过 shell 工具向 Meta Server 发送 [远程命令](remote-commands#meta-server),临时禁掉
`add_secondary` 操作:
```
>>> remote_command -t meta-server
meta.lb.add_secondary_max_count_for_one_node 0
```
- * 通过 migrate_node 命令,将 replica server 上的 primary replica 都转移到其他节点:
+ * 通过 migrate_node 命令,将 Replica Server 上的 primary replica 都转移到其他节点:
```bash
$ ./run.sh migrate_node -c $meta_list -n $node -t run
```
- 通过 shell 工具的 `nodes -d` 命令查看该节点服务的 replica 情况,等待 primary replica 的个数变为
0。如果长时间不变为 0,请重新执行该命令。
- * 通过 downgrade_node 命令,将 replica server 上的 secondary replica 都降级为 `INACTIVE`:
+ 通过 shell 工具的 `nodes -d` 命令查看该节点服务的 replica 情况,等待 **primary** replica 的个数变为
0。如果长时间不变为 0,请重新执行该命令。
+ * 通过 downgrade_node 命令,将 Replica Server 上的 secondary replica 都降级为 `INACTIVE`:
```bash
$ ./run.sh downgrade_node -c $meta_list -n $node -t run
```
- 通过 shell 工具的 `nodes -d` 命令查看该节点的服务 replica 情况,等待 secondary replica 的个数变为
0。如果长时间不变为 0,请重新执行该命令。
- * 通过 shell 工具向 replica server 发送远程命令,将所有 replica 都关闭,以触发 flush 操作,将数据都刷到磁盘:
+ 通过 shell 工具的 `nodes -d` 命令查看该节点的服务 replica 情况,等待 **secondary** replica
的个数变为 0。如果长时间不变为 0,请重新执行该命令。
+ * 通过 shell 工具向 Replica Server 发送远程命令,将所有 replica 都关闭,以触发 flush 操作:
```
>>> remote_command -l $node replica.kill_partition
```
等待大约 1 分钟,让数据刷到磁盘完成。
* 如果是升级操作,则替换程序包和配置文件
- * 重启 replica server 进程
- * 通过 shell 工具向 meta server 发送 [远程命令](remote-commands#meta-server),开启
`add_secondary` 操作,让其快速补充副本:
+ * 重启 Replica Server 进程
+ * 通过 shell 工具向 Meta Server 发送 [远程命令](remote-commands#meta-server),开启
`add_secondary` 操作,让其快速补充副本:
```
>>> remote_command -t meta-server
meta.lb.add_secondary_max_count_for_one_node 100
```
* 使用 shell 工具的 `ls -d` 命令查看集群状态,等待所有 partition 都完全恢复健康
- * 继续操作下一个 replica server
-* 重启 meta server 进程,采用逐个重启的策略。重启单个 meta server:
- * kill 掉 meta server 进程
+ * 继续操作下一个 Replica Server
+* 重启 Meta Server 进程,采用逐个重启的策略。重启单个 Meta Server:
* 如果是升级操作,替换程序包和配置文件
- * 重启 meta server 进程
- * 等待 30 秒以上,保证 meta server 与 replica server 心跳的连续性
- * 继续操作下一个 meta server
-* 重启 collector 进程:
- * kill 掉 collector 进程
+ * 重启 Meta Server 进程
+ * 等待 30 秒以上,保证 Meta Server 与 Replica Server 心跳的连续性
+ * 继续操作下一个 Meta Server
+* 重启 Collector 进程:
* 如果是升级操作,替换程序包和配置文件
- * 重启 collector 进程
+ * 重启 Collector 进程
* 重置参数
* 通过 shell 工具重置以上步骤修改过的参数:
```
@@ -96,22 +93,19 @@ permalink: administration/rolling-update
```
>>> set_meta_level steady
```
-* 重启 replica server 进程,采用逐个重启的策略。重启单个 replica server:
- * kill 掉 replica server 进程
- * 如果是升级操作,替换程序包和配置文件
- * 重启 replica server 进程
+* 重启 Replica Server 进程,采用逐个重启的策略。重启单个 Replica Server:
+ * 如果是升级操作,则替换程序包和配置文件
+ * 重启 Replica Server 进程
* 使用 shell 工具的 `ls -d` 命令查看集群状态,等待所有 partition 都完全恢复健康
- * 继续操作下一个 replica server
-* 重启 meta server 进程,采用逐个重启的策略。重启单个 meta server:
- * kill 掉 meta server 进程
+ * 继续操作下一个 Replica Server
+* 重启 Meta Server 进程,采用逐个重启的策略。重启单个 Meta Server:
* 如果是升级操作,替换程序包和配置文件
- * 重启 meta server 进程
- * 等待 30 秒以上,保证 meta server 与 replica server 心跳的连续性
- * 继续操作下一个 meta server
-* 重启 collector 进程:
- * kill 掉 collector 进程
+ * 重启 Meta Server 进程
+ * 等待 30 秒以上,保证 Meta Server 与 Replica Server 心跳的连续性
+ * 继续操作下一个 Meta Server
+* 重启 Collector 进程:
* 如果是升级操作,替换程序包和配置文件
- * 重启 collector 进程
+ * 重启 Collector 进程
# 重启脚本
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]