lengyuexuexuan opened a new issue, #2213:
URL: https://github.com/apache/incubator-pegasus/issues/2213

   ## Feature Request
   
   **Is your feature request related to a problem? Please describe:**
   To further enhance the stability of the Pegasus service and guard against AZ-level failures, the currently recommended solution is to deploy a backup cluster in another availability zone (AZ) and use the duplication feature to synchronize data to it in real time.
   
   However, duplication incurs high costs, which some businesses cannot accept. An alternative is to spread the three replicas of each Pegasus partition across different AZs. While this introduces some additional read and write latency, it provides disaster recovery at the AZ level without increasing costs.
   
   **Describe the feature you'd like:**
   1. Each shard's three replicas must be distributed across different availability zones (AZs), ensuring at least one replica in each AZ.
   2. After operations such as creating new tables, cure, and balance, the above rule should still hold.
   3. The feature can be dynamically enabled or disabled, with the setting persisted in ZooKeeper (ZK).
   4. The system should support placing all primary shards in a designated primary AZ, with a dynamic toggle to enable or disable this behavior and the ability to change the primary AZ.
   5. Multi-AZ distribution can be automatically disabled based on the node-count ratio between the primary and backup AZs.
   
   **The specific implementation at Xiaomi**
   1. How the availability zone (AZ) of each node is identified
       1. Modify the startup script so that, when a replica server starts, the script retrieves the node's AZ and writes it to the `<run_dir>/replica` directory as a file named `az.info`:
       ```
       # az.info
       {"az":"ak"}
       ```
       2. The replica then reports the AZ to the meta server via the RPC_CM_CONFIG_SYNC request, as sketched below.
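       A minimal sketch of the replica-side half of this mechanism, assuming a plain string scan instead of a JSON library; the helper name `load_az_info` is illustrative, not the actual Pegasus code:
       ```cpp
       // Read and parse <run_dir>/replica/az.info at replica startup.
       // (Illustrative sketch; names are hypothetical.)
       #include <fstream>
       #include <sstream>
       #include <string>

       // Returns the AZ name from a one-line file like {"az":"ak"},
       // or an empty string if the file is missing or malformed.
       std::string load_az_info(const std::string &path)
       {
           std::ifstream in(path);
           if (!in) {
               return "";
           }
           std::stringstream buf;
           buf << in.rdbuf();
           const std::string content = buf.str();

           // Extract the value of the "az" key by plain string scanning
           // so the sketch stays dependency-free.
           const std::string key = "\"az\":\"";
           const auto pos = content.find(key);
           if (pos == std::string::npos) {
               return "";
           }
           const auto start = pos + key.size();
           const auto end = content.find('"', start);
           if (end == std::string::npos) {
               return "";
           }
           return content.substr(start, end - start);
       }

       // The returned AZ would then ride along in the replica's periodic
       // RPC_CM_CONFIG_SYNC request, so the meta server learns the AZ of
       // every node.
       ```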
   2. Relevant configuration
       1. All configurations can be dynamically modified and are persisted to ZooKeeper.
       2. ```
          enable_cross_az = true
          az_list = c3,ak
          primary_az = c3
          enable_primary_not_cross_az = false
          az_nodes_ratio = 2
          ```
       3. The purpose of these parameters:
           1. enable_cross_az is the global switch.
           2. When cross-AZ is enabled, only nodes that belong to the AZs listed in az_list can take part in cure and balance operations.
           3. primary_az specifies the primary AZ.
           4. When enable_primary_not_cross_az is set to true, primary replicas are placed only on nodes within the primary AZ.
           5. az_nodes_ratio guards against a large disparity in node counts between AZs: if the counts are too uneven, enforcing the cross-AZ rule would concentrate replicas on the smaller AZ's nodes and overload them, so multi-AZ distribution is automatically disabled. With two AZs a ratio of 2 is recommended; with three AZs, 1.5. A sketch of this check follows.
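       A minimal sketch of the ratio check, assuming the auto-disable behavior from feature point 5; the function name `cross_az_allowed` is hypothetical:
       ```cpp
       // Decide whether cross-AZ placement should stay enabled, given the
       // node count of each AZ in az_list and the az_nodes_ratio threshold.
       // (Illustrative sketch; not the actual Pegasus meta-server code.)
       #include <algorithm>
       #include <vector>

       // Returns true when node counts are balanced enough: the largest AZ
       // holds at most `ratio` times the nodes of the smallest AZ. When it
       // returns false, the meta server would fall back to the normal
       // single-AZ placement logic.
       bool cross_az_allowed(const std::vector<int> &nodes_per_az, double ratio)
       {
           if (nodes_per_az.size() < 2) {
               return false; // cross-AZ placement needs at least two AZs
           }
           const auto [min_it, max_it] =
               std::minmax_element(nodes_per_az.begin(), nodes_per_az.end());
           if (*min_it == 0) {
               return false; // an empty AZ cannot host its share of replicas
           }
           return static_cast<double>(*max_it) <= ratio * (*min_it);
       }

       // Example: two AZs with 10 and 18 nodes, az_nodes_ratio = 2:
       //   cross_az_allowed({10, 18}, 2.0) -> true  (18 <= 2 * 10)
       // Two AZs with 10 and 25 nodes:
       //   cross_az_allowed({10, 25}, 2.0) -> false (25 >  2 * 10)
       ```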
   3. cure
       1. missing_primary
           1. enable_primary_not_cross_az == true
               1. When the secondary list is empty (e.g., right after table creation), the meta server picks the node in the primary AZ that currently serves the fewest primaries as the target node.
               2. When the secondary list isn't empty, the meta server prioritizes a secondary within the primary AZ as the new primary.
               3. When last_drops isn't empty, the behavior remains the same as before.
           2. enable_primary_not_cross_az == false
               1. The behavior remains the same as before.
       2. missing_secondary (a sketch of this selection follows the list)
           1. First, determine which AZ the new secondary should be placed in.
           2. Continue to prioritize nodes in last_drops, but add a check that the chosen node is in the expected AZ.
           3. If no node in last_drops meets the requirement, pick the node with the fewest partitions in the target AZ as the target node.
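       A minimal sketch of the missing_secondary selection above; all type and helper names are hypothetical, not the actual Pegasus meta-server API:
       ```cpp
       #include <string>
       #include <vector>

       struct node_info
       {
           std::string address;
           std::string az;           // reported via RPC_CM_CONFIG_SYNC
           int partition_count = 0;  // replicas currently served
       };

       // `target_az` is the AZ chosen in step 1 (the AZ this partition is
       // missing a replica in); `last_drops` is ordered by preference.
       const node_info *pick_secondary(const std::string &target_az,
                                       const std::vector<node_info> &last_drops,
                                       const std::vector<node_info> &alive_nodes)
       {
           // Step 2: keep the old preference for last_drops, but only
           // accept a dropped node that sits in the expected AZ.
           for (const node_info &n : last_drops) {
               if (n.az == target_az) {
                   return &n;
               }
           }
           // Step 3: otherwise take the least-loaded alive node in target_az.
           const node_info *best = nullptr;
           for (const node_info &n : alive_nodes) {
               if (n.az != target_az) {
                   continue;
               }
               if (best == nullptr || n.partition_count < best->partition_count) {
                   best = &n;
               }
           }
           return best; // nullptr if target_az has no usable node
       }
       ```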
   4. balance
       1. If enable_primary_not_cross_az == true
           1. First, move all primaries to the primary AZ (using move_primary / copy_primary).
           2. Perform primary balance within the primary AZ (using the Ford-Fulkerson-based balancer and copy_primary).
           3. Perform secondary balance (the multi-AZ check used here is sketched after the list)
               1. Check all replicas; for any partition that does not meet the multi-AZ distribution requirement, perform copy_secondary. During the copy_secondary, the original secondary balance conditions must still be satisfied.
               2. Once all partitions meet the multi-AZ distribution requirement, run the previous secondary balance, with one restriction: if copying a secondary would change the partition's multi-AZ distribution state, skip that operation.
       2. If enable_primary_not_cross_az == false
           1. The primary balance process remains the same as before.
           2. The secondary balance process is the same as described above.
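       A minimal sketch of the multi-AZ distribution predicate used as the guard in secondary balance; the names are hypothetical:
       ```cpp
       #include <set>
       #include <string>
       #include <vector>

       // A partition satisfies the multi-AZ distribution requirement when
       // its replicas cover every AZ in az_list.
       bool satisfies_multi_az(const std::vector<std::string> &replica_azs,
                               const std::vector<std::string> &az_list)
       {
           const std::set<std::string> covered(replica_azs.begin(),
                                               replica_azs.end());
           for (const std::string &az : az_list) {
               if (covered.count(az) == 0) {
                   return false;
               }
           }
           return true;
       }

       // During secondary balance, a copy_secondary is skipped when it
       // would make satisfies_multi_az() flip from true to false for the
       // partition it touches (the restriction in step 3.2 above).
       ```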
   5. test
       1. If enable_primary_not_cross_az == true, all primaries are in the primary AZ (C3), and every partition has replicas in both C3 and C4.
   
![Image](https://github.com/user-attachments/assets/1a9aae63-b23d-4c7a-b770-fb12b4fe339a)
       2. If enable_primary_not_cross_az == false, every partition still has replicas in both C3 and C4.
   
![Image](https://github.com/user-attachments/assets/efef155e-99a5-4d36-9d12-4d65de312f4d)
       3. Impact on read and write latency (a worked example follows)
           1. read: reading from a primary in another AZ adds one cross-AZ RTT.
           2. write
               1. If the client and the primary are in the same AZ, the write adds one cross-AZ RTT (the primary must wait for a secondary in the other AZ).
               2. If the client and the primary are in different AZs, the write adds two cross-AZ RTTs (client -> primary is one RTT, primary -> secondary is another).
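           As a worked example with assumed numbers (intra-AZ RTT about 0.2 ms, cross-AZ RTT about 1 ms, server processing ignored): a write from a same-AZ client normally costs about 0.2 + 0.2 = 0.4 ms (client -> primary, then primary -> secondary, both intra-AZ); with a cross-AZ secondary it becomes about 0.2 + 1 = 1.2 ms. A client in the other AZ pays roughly 1 + 1 = 2 ms. The RTT figures here are illustrative assumptions, not measurements.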
   
   
   **The feature will soon be launched at Xiaomi. 
   If anyone needs more detailed design plans or test records, feel free to 
contact me. 
   Thanks.**

