lengyuexuexuan opened a new issue, #2213:
URL: https://github.com/apache/incubator-pegasus/issues/2213
## Feature Request
**Is your feature request related to a problem? Please describe:**
To further enhance the stability of the Pegasus service and prevent AZ-level
failures, the currently recommended solution is to deploy a backup cluster in
another availability zone (AZ) and use the duplication feature to synchronize
data to the backup cluster in real time.
However, the duplication approach incurs high costs, which do not meet the
needs of some businesses. An alternative is therefore to deploy the three
replicas of Pegasus across different AZs. While this may introduce some
additional read and write latency, it improves disaster recovery capabilities
at the AZ level without increasing costs.
**Describe the feature you'd like:**
1. Each shard's three replicas must be distributed across different
availability zones (AZs), ensuring at least one replica in each AZ.
2. This rule must still hold after operations such as creating new tables,
cure, and balance.
3. The feature can be dynamically enabled or disabled, with the setting
persisted in ZooKeeper (ZK).
4. The system should support placing all primary shards in a designated
primary AZ, with a dynamic toggle to enable or disable this behavior and the
ability to change the primary AZ.
5. The multi-AZ distribution can be automatically disabled based on the
node-count ratio between the primary and backup AZs.
**The specific implementation of this method at Xiaomi**
1. Methods for Identifying the Availability Zone (AZ) of Each Node
1. Modify the startup script so that when a replica server starts, the script
retrieves the AZ and writes it to the `<run_dir>/replica` directory with the
filename `az.info`:
```
# az.info
{"az":"ak"}
```
2. The replica server then sends the AZ information to the meta server via
the RPC_CM_CONFIG_SYNC request (see the sketch below).
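As a minimal sketch, the replica side could parse the az.info file shown
above before filling the config-sync request; load_az_info is an illustrative
helper here, not the actual Pegasus code:
```
#include <fstream>
#include <sstream>
#include <string>

// Read <run_dir>/replica/az.info and extract the "az" value.
std::string load_az_info(const std::string &run_dir)
{
    std::ifstream in(run_dir + "/replica/az.info");
    if (!in) {
        return ""; // no az.info: treat the node as AZ-unaware
    }
    std::stringstream buf;
    buf << in.rdbuf();
    const std::string content = buf.str(); // e.g. {"az":"ak"}

    // Naive extraction of the "az" value; a real implementation would use
    // a proper JSON parser.
    const std::string key = "\"az\":\"";
    auto begin = content.find(key);
    if (begin == std::string::npos) {
        return "";
    }
    begin += key.size();
    auto end = content.find('"', begin);
    return end == std::string::npos ? "" : content.substr(begin, end - begin);
}
```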
2. Relevant Configuration
1. All configurations can be dynamically modified and persisted to
ZooKeeper.
2. The options are as follows:
```
enable_cross_az = true
az_list = c3,ak
primary_az = c3
enable_primary_not_cross_az = false
az_nodes_ratio = 2
```
3. The purpose of these parameters is as follows (a flag-declaration sketch
follows this list):
1. enable_cross_az is the global switch.
2. When cross-AZ is enabled, only nodes that belong to the AZs listed in
az_list take part in cure and balance operations.
3. primary_az specifies the primary AZ.
4. When enable_primary_not_cross_az is set to true, primary replicas are
placed only on nodes within the primary AZ.
5. az_nodes_ratio guards against a significant disparity in the number of
nodes between AZs: if the node counts diverge too much, enforcing the
cross-AZ rule would distribute replicas unevenly and put excessive pressure
on the nodes in the smaller AZ, so the multi-AZ distribution is automatically
disabled when the node-count ratio exceeds this value. With two AZs, a ratio
of 2 is recommended; with three AZs, 1.5.
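As a sketch of how these options could be declared, the flag macros from
utils/flags.h in the Pegasus codebase allow tagging a flag FT_MUTABLE so it
can be changed at runtime; the section name, defaults, and descriptions below
are illustrative, and persisting the values to ZooKeeper would be handled
separately:
```
#include "utils/flags.h"

DSN_DEFINE_bool(meta_server, enable_cross_az, false,
                "global switch for the multi-AZ replica distribution");
DSN_TAG_VARIABLE(enable_cross_az, FT_MUTABLE);

DSN_DEFINE_string(meta_server, az_list, "",
                  "comma-separated AZs whose nodes join cure/balance");
DSN_TAG_VARIABLE(az_list, FT_MUTABLE);

DSN_DEFINE_string(meta_server, primary_az, "", "the primary AZ");
DSN_TAG_VARIABLE(primary_az, FT_MUTABLE);

DSN_DEFINE_bool(meta_server, enable_primary_not_cross_az, false,
                "place primary replicas only in the primary AZ");
DSN_TAG_VARIABLE(enable_primary_not_cross_az, FT_MUTABLE);

DSN_DEFINE_double(meta_server, az_nodes_ratio, 2.0,
                  "disable the multi-AZ distribution when the node-count "
                  "ratio between AZs exceeds this value");
DSN_TAG_VARIABLE(az_nodes_ratio, FT_MUTABLE);
```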
3. cure
1. missing_primary (a simplified sketch follows this sub-list)
1. enable_primary_not_cross_az == true
1. When the secondary list is empty (e.g., the table has just been
created), the meta server picks the node in the primary AZ that serves the
fewest primaries as the target node.
2. When the secondary list isn't empty, the meta server should prioritize
a secondary within the primary AZ as the new primary.
3. When last_drops isn't empty, the behavior should remain the same as
before.
2. enable_primary_not_cross_az == false
1. The behavior should remain the same as before.
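Here is a simplified sketch of the missing_primary choice when
enable_primary_not_cross_az == true; the node_info type and function name are
illustrative, not part of the Pegasus codebase:
```
#include <string>
#include <vector>

struct node_info
{
    std::string addr;
    std::string az;
    int primary_count; // primaries currently served by this node
};

// Pick the new primary for a partition that has lost its primary.
std::string pick_new_primary(const std::vector<node_info> &secondaries,
                             const std::vector<node_info> &primary_az_nodes,
                             const std::string &primary_az)
{
    // Case 2: secondaries exist; prefer one inside the primary AZ.
    for (const auto &s : secondaries) {
        if (s.az == primary_az) {
            return s.addr;
        }
    }
    if (!secondaries.empty()) {
        // No secondary lives in the primary AZ; keep the pre-existing
        // behavior (e.g. promote one of the remaining secondaries).
        return secondaries.front().addr;
    }
    // Case 1: table creation, no secondaries yet; choose the primary-AZ
    // node that currently serves the fewest primaries.
    const node_info *best = nullptr;
    for (const auto &n : primary_az_nodes) {
        if (best == nullptr || n.primary_count < best->primary_count) {
            best = &n;
        }
    }
    return best != nullptr ? best->addr : "";
}
```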
2. missing_secondary (sketched below)
1. First, determine which AZ the new secondary replica should be placed in.
2. Continue to prioritize the nodes in last_drops, but add a check that the
chosen node matches the expected AZ.
3. If no node in last_drops meets the requirements, pick the node with the
fewest partitions in the target AZ as the target node.
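A simplified sketch of the missing_secondary decision above (illustrative
types and names, not the actual Pegasus code):
```
#include <map>
#include <string>
#include <vector>

struct node_info
{
    std::string addr;
    std::string az;
    int partition_count; // partitions currently served by this node
};

// Choose the target node for a missing secondary.
std::string pick_secondary_target(const std::vector<std::string> &replica_azs,
                                  const std::vector<std::string> &az_list,
                                  const std::vector<node_info> &last_drops,
                                  const std::vector<node_info> &alive_nodes)
{
    if (az_list.empty()) {
        return "";
    }
    // 1. Decide the target AZ: the listed AZ holding the fewest replicas
    //    of this partition.
    std::map<std::string, int> az_load;
    for (const auto &az : az_list) {
        az_load[az] = 0;
    }
    for (const auto &az : replica_azs) {
        az_load[az]++;
    }
    std::string target_az = az_list.front();
    for (const auto &az : az_list) {
        if (az_load[az] < az_load[target_az]) {
            target_az = az;
        }
    }
    // 2. Prefer a node from last_drops, but only if it is in the target AZ.
    for (const auto &n : last_drops) {
        if (n.az == target_az) {
            return n.addr;
        }
    }
    // 3. Otherwise take the node in the target AZ with the fewest partitions.
    const node_info *best = nullptr;
    for (const auto &n : alive_nodes) {
        if (n.az == target_az &&
            (best == nullptr || n.partition_count < best->partition_count)) {
            best = &n;
        }
    }
    return best != nullptr ? best->addr : "";
}
```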
4. balance
1. If enable_primary_not_cross_az == true
1. First, move all primary shards to the primary AZ (using
move_primary / copy_primary).
2. Perform primary balance within the primary AZ (using Ford-Fulkerson and
copy_primary).
3. Perform secondary balance (the multi-AZ check is sketched below)
1. Check all replicas. If a partition does not meet the multi-AZ
distribution requirement, perform copy_secondary; during the copy_secondary
process, the original secondary balance conditions must also be satisfied.
2. Once all partitions meet the multi-AZ distribution requirement, perform
the previous secondary balance, with one restriction: if copying the
secondary of a replica would change the multi-AZ distribution state of that
partition, skip the operation.
2. If enable_primary_not_cross_az == false
1. The primary balance process remains the same as before.
2. The secondary balance process is the same as described above.
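The multi-AZ distribution check used by the secondary balance steps could
look like the following sketch (illustrative, assuming the "at least one
replica per listed AZ" rule from the feature description):
```
#include <set>
#include <string>
#include <vector>

// A partition satisfies the multi-AZ rule when every AZ in az_list holds
// at least one of its replicas.
bool satisfies_cross_az(const std::vector<std::string> &replica_azs,
                        const std::vector<std::string> &az_list)
{
    const std::set<std::string> present(replica_azs.begin(), replica_azs.end());
    for (const auto &az : az_list) {
        if (present.count(az) == 0) {
            return false; // this AZ holds no replica of the partition
        }
    }
    return true;
}
```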
5. test
1. With enable_primary_not_cross_az == true, all primary replicas are in the
primary AZ (C3), and every partition has replicas present in both C3 and C4.

2. With enable_primary_not_cross_az == false, every partition has replicas
present in both C3 and C4.

3. The impact on read and write latency
1. read: if the client reads from a primary in another AZ, the read latency
incurs one additional cross-AZ RTT.
2. write
1. If the client and the primary are in the same AZ, the write latency
incurs one additional cross-AZ RTT (primary -> secondary, one RTT).
2. If the client and the primary are in different AZs, the write latency
incurs two additional cross-AZ RTTs (client -> primary, one RTT;
primary -> secondary, one RTT).
**The feature will soon be launched at Xiaomi. If anyone needs more detailed
design plans or test records, feel free to contact me. Thanks.**