empiredan opened a new issue, #2146:
URL: https://github.com/apache/incubator-pegasus/issues/2146
First, I created a new duplication; the related logs on the primary meta
server of the source cluster are as follows:
```
D2024-10-22 10:37:21.642 (1729564641642000641 0088)
meta.meta_state0.0102000500000411:
meta_duplication_service.cpp:183:add_duplication(): add duplication for
app(zgy_test_dup), remote cluster name is target_cluster, remote app name is
wetry_db.zgy_test_dup, remote replica count is 3
D2024-10-22 10:37:21.644 (1729564641644158230 0088)
meta.meta_state0.01030000000000dd:
meta_duplication_service.cpp:324:operator()(): [a27d1729564641] add duplication
successfully [app_name: zgy_test_dup, remote_cluster_name: target_cluster,
remote_app_name: wetry_db.zgy_test_dup]
D2024-10-22 10:37:21.644 (1729564641644182306 0088)
meta.meta_state0.01030000000000dd: duplication_info.cpp:172:persist_status():
[a27d1729564641] change duplication status from DS_INIT to DS_PREPARE
successfully [app_id: 27]
W2024-10-22 10:37:21.827 (1729564641827424045 0088)
meta.meta_state0.01020003000003dd:
meta_duplication_service.cpp:172:all_checkpoint_has_prepared(): replica
checkpoint still running: 0/4
D2024-10-22 10:37:21.827 (1729564641827543128 0088)
meta.meta_state0.01020003000003dd:
duplication_info.cpp:203:report_progress_if_time_up(): duplication report:
{"create_ts":"2024-10-22
10:37:21","dupid":1729564641,"fail_mode":"FAIL_SLOW","progress":{"0":-1,"1":-1,"2":-1,"3":-1},"remote":"target_cluster","remote_app_name":"wetry_db.zgy_test_dup","remote_replica_count":3,"status":"DS_PREPARE"}
```
However, it seems that the duplication got stuck in `DS_PREPARE` and never
advanced to the next status. To find the reason, the following error logs on
the primary meta server of the source cluster were examined:
```
E2024-10-22 10:38:06.829 (1729564686829455037 0088)
meta.meta_state0.01030000000000e6:
meta_duplication_service.cpp:502:operator()(): create follower
app[target_cluster.wetry_db.zgy_test_dup] to trigger duplicate checkpoint
failed: duplication_status = DS_PREPARE, create_err = ERR_TIMEOUT, update_err =
ERR_NO_NEED_OPERATE
......
E2024-10-22 10:38:11.833 (1729564691833345364 0088)
meta.meta_state0.01030000000000e8:
meta_duplication_service.cpp:502:operator()(): create follower
app[target_cluster.wetry_db.zgy_test_dup] to trigger duplicate checkpoint
failed: duplication_status = DS_PREPARE, create_err = ERR_APP_EXIST, update_err
= ERR_NO_NEED_OPERATE
......
E2024-10-22 13:23:02.220 (1729574582220478204 0088)
meta.meta_state0.01030000000008a2:
meta_duplication_service.cpp:502:operator()(): create follower
app[target_cluster.wetry_db.zgy_test_dup] to trigger duplicate checkpoint
failed: duplication_status = DS_PREPARE, create_err = ERR_APP_EXIST, update_err
= ERR_NO_NEED_OPERATE
```
An error occurred while creating the table on the follower cluster, and the
primary meta server of the source cluster did not receive a successful response.
The subsequent `ERR_APP_EXIST` means the table had in fact been created on the
follower cluster, so the source cluster could not send the same create-table
request again.
On the follower cluster, the ID of the created table is `4672`, and the
primary meta server of the target cluster sent a `CT_ASSIGN_PRIMARY` proposal
to the primary replica of each partition of the table:
```
D2024-10-22 13:22:56.98 (1729574576098497750 2263d3)
meta.meta_state0.01010000000009b9: server_state.cpp:1381:send_proposal(): send
proposal replication::config_type::CT_ASSIGN_PRIMARY for gpid(4672.0), ballot =
0, target = 10.2.131.174:8171, node = 10.2.131.174:8171
D2024-10-22 13:22:56.98 (1729574576098581429 2263d3)
meta.meta_state0.01010000000009b9: server_state.cpp:1381:send_proposal(): send
proposal replication::config_type::CT_ASSIGN_PRIMARY for gpid(4672.1), ballot =
0, target = 10.2.131.173:8171, node = 10.2.131.173:8171
D2024-10-22 13:22:56.98 (1729574576098612769 2263d3)
meta.meta_state0.01010000000009b9: server_state.cpp:1381:send_proposal(): send
proposal replication::config_type::CT_ASSIGN_PRIMARY for gpid(4672.2), ballot =
0, target = 10.2.131.172:8171, node = 10.2.131.172:8171
D2024-10-22 13:22:56.98 (1729574576098655070 2263d3)
meta.meta_state0.01010000000009b9: server_state.cpp:1381:send_proposal(): send
proposal replication::config_type::CT_ASSIGN_PRIMARY for gpid(4672.3), ballot =
0, target = 10.2.131.170:8171, node = 10.2.131.170:8171
```
The replica server in charge of the primary replica also received the
proposal. However, it failed to connect to the replica server of the source
cluster to fetch the checkpoint data (in fact, a security policy on the source
cluster drops packets coming from other clusters):
```
D2024-10-22 13:22:56.98 (1729574576098611769 2d5346)
replica.replica9.020053150cc06bae: replica_stub.cpp:1028:on_config_proposal():
[email protected]:8171: received config proposal
replication::config_type::CT_ASSIGN_PRIMARY for
10.2.131.173:8171
D2024-10-22 13:22:56.99 (1729574576099024706 2d5371)
replica.rep_long2.02040009060b292b: fs_manager.cpp:273:allocate_dir():
10.2.131.173:8171: put pid(4672.1) to dir(data0), which has 0 replicas of
current app, 10762 replicas totally
D2024-10-22 13:22:56.99 (1729574576099271963 2d5371)
replica.rep_long2.02040009060b292b:
replica_follower.cpp:83:duplicate_checkpoint(): [[email protected]:8171]
start duplicate master[source_cluster.zgy_test_dup] checkpoint
D2024-10-22 13:22:56.99 (1729574576099314314 2d5323)
replica.default6.020600020000babc:
replica_follower.cpp:109:async_duplicate_checkpoint_from_master_replica():
[[email protected]:8171] query master[source_cluster.zgy_test_dup] replica
configuration
......
E2024-10-22 13:23:01.99 (1729574581099441055 2d5324)
replica.default7.020100060717fc1e:
replica_follower.cpp:138:update_master_replica_config():
[[email protected]:8171] query master[source_cluster.zgy_test_dup] config
failed: ERR_TIMEOUT
E2024-10-22 13:23:01.99 (1729574581099538649 2d5371)
replica.rep_long2.02040009060b292b: replica_stub.cpp:2320:new_replica():
[email protected]:8171: try to duplicate replica checkpoint failed,
error(ERR_TRY_AGAIN) and please check previous detail error log
D2024-10-22 13:23:01.99 (1729574581099608162 2d5371)
replica.rep_long2.02040009060b292b: replica.cpp:546:close():
[email protected]:8171: replica closed, time_used = 0ms
D2024-10-22 13:23:01.99 (1729574581099614989 2d5371)
replica.rep_long2.02040009060b292b: replica.cpp:546:close():
[email protected]:8171: replica closed, time_used = 0ms
D2024-10-22 13:23:01.99 (1729574581099665868 2d5371)
replica.rep_long2.02040009060b292b: fs_manager.cpp:295:remove_replica():
10.2.131.173:8171: remove gpid(4672.1) from dir(data0)
D2024-10-22 13:23:01.99 (1729574581099673613 2d5371)
replica.rep_long2.02040009060b292b: replica_stub.cpp:2259:open_replica():
[email protected]:8171: open replica failed, erase from opening replicas
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]