[ceph-users] multisite replication issue with Quincy
We have encountered replication issues in our multisite setup with Quincy v17.2.3. Our Ceph clusters are brand new; we tore the clusters down and re-deployed fresh Quincy ones before running this test. In our environment we have 3 RGW nodes per site; each node runs 2 instances for client traffic and 1 instance dedicated to replication.

The test was done using cosbench with the following settings:
- 10 rgw users
- 3000 buckets per user
- write only
- 6 object sizes with the following distribution: 1k: 17%, 2k: 48%, 3k: 14%, 4k: 5%, 1M: 13%, 8M: 3%
- trying to write 10 million objects per object-size bucket per user, to avoid writing to the same objects
- no multipart uploads involved

The test ran for about 2 hours, roughly from 22:50 on 9/14 to 01:00 on 9/15. After that, the replication tail continued for roughly another 4 hours, until 4:50 on 9/15, with gradually decreasing replication traffic. Then the replication stopped, and nothing has been going on in the clusters since. While verifying the replication status, we found several issues.

1. The sync status shows the clusters are not fully synced, yet all replication traffic has stopped and nothing is going on in the clusters.

Secondary zone:

          realm 8a98f19f-db58-4c09-bde6-ac89560d79b0 (prod-realm)
      zonegroup e041ea69-1e0b-4ad7-92f2-74b20aa3edf3 (prod-zonegroup)
           zone 1dadcf12-f44c-4940-8acc-9623a48b829e (prod-zone-tt)
  metadata sync syncing
                full sync: 0/64 shards
                incremental sync: 64/64 shards
                metadata is caught up with master
      data sync source: b68a526a-ffaa-4058-9903-6e7c6eac35bb (prod-zone-pw)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 2 shards
                        behind shards: [40,74]

Why did the replication stop even though the clusters are still not in sync?

2. We can see some buckets are not fully synced, and we are able to identify some missing objects in our secondary zone. Here is an example bucket. This is its sync status in the secondary zone.
          realm 8a98f19f-db58-4c09-bde6-ac89560d79b0 (prod-realm)
      zonegroup e041ea69-1e0b-4ad7-92f2-74b20aa3edf3 (prod-zonegroup)
           zone 1dadcf12-f44c-4940-8acc-9623a48b829e (prod-zone-tt)
         bucket :mixed-5wrks-dev-4k-thisisbcstestload004178[b68a526a-ffaa-4058-9903-6e7c6eac35bb.89152.78]
    source zone b68a526a-ffaa-4058-9903-6e7c6eac35bb (prod-zone-pw)
  source bucket :mixed-5wrks-dev-4k-thisisbcstestload004178[b68a526a-ffaa-4058-9903-6e7c6eac35bb.89152.78]
                full sync: 0/101 shards
                incremental sync: 100/101 shards
                bucket is behind on 1 shards
                behind shards: [78]

3. As the sync status above shows, the behind shard for this example bucket is not in the list of behind shards in the system sync status. Why is that?

4. The data sync status for these behind shards doesn't list any "pending_buckets" or "recovering_buckets". An example:

{
    "shard_id": 74,
    "marker": {
        "status": "incremental-sync",
        "marker": "0003:03381964",
        "next_step_marker": "",
        "total_entries": 0,
        "pos": 0,
        "timestamp": "2022-09-15T00:00:08.718840Z"
    },
    "pending_buckets": [],
    "recovering_buckets": []
}

Shouldn't the not-yet-in-sync buckets be listed here?

5. The sync status of the primary zone differs from that of the secondary zone, with different groups of behind shards. The same holds for the sync status of the same bucket in each zone. Is this legitimate? Please see item 1 for the sync status of the secondary zone and item 6 for the primary zone.

6.
Why does the primary zone have behind shards at all, given that replication goes from the primary to the secondary?

Primary zone:

          realm 8a98f19f-db58-4c09-bde6-ac89560d79b0 (prod-realm)
      zonegroup e041ea69-1e0b-4ad7-92f2-74b20aa3edf3 (prod-zonegroup)
           zone b68a526a-ffaa-4058-9903-6e7c6eac35bb (prod-zone-pw)
  metadata sync no sync (zone is master)
      data sync source: 1dadcf12-f44c-4940-8acc-9623a48b829e (prod-zone-tt)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 30 shards
                        behind shards: [6,7,26,28,29,37,47,52,55,56,61,67,68,69,74,79,82,91,95,99,101,104,106,111,112,121,122,123,126,127]

7. We have in-sync buckets that show the correct sync status in the secondary zone but still show behind shards in the primary. Why is that?

Secondary zone:

          realm 8a98f19f-db58-4c09-bde6-ac89560d79b0 (prod-realm)
      zonegroup e041ea69-1e0b-4ad7-92f2-74b20aa3edf3 (prod-zonegroup)
           zone 1dadcf12-f44c-4940-8acc-9623a48b829e (prod-zone-tt)
         bucket :mixed-5wrks-dev-4k-thisisbcstestload008167[b68a
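As a side note: the behind-shard lists reported by the two zones can be compared mechanically rather than by eye. This is a minimal Python sketch, not a radosgw-admin feature; `behind_shards` is our own illustrative helper, and the sample strings are copied from the status dumps in this thread.

```python
import re

def behind_shards(sync_status_text: str) -> set[int]:
    """Extract the shard ids from a 'behind shards: [...]' line of
    `radosgw-admin sync status` output (empty set if fully caught up)."""
    m = re.search(r"behind shards:\s*\[([\d,\s]*)\]", sync_status_text)
    if not m or not m.group(1).strip():
        return set()
    return {int(s) for s in m.group(1).split(",")}

# Shard lists taken from the status dumps above.
secondary = behind_shards("data is behind on 2 shards\nbehind shards: [40,74]")
primary = behind_shards(
    "data is behind on 30 shards\n"
    "behind shards: [6,7,26,28,29,37,47,52,55,56,61,67,68,69,74,79,82,"
    "91,95,99,101,104,106,111,112,121,122,123,126,127]"
)

print(sorted(secondary & primary))  # shards both zones report as behind -> [74]
print(sorted(secondary - primary))  # shards only the secondary reports -> [40]
```

Only shard 74 shows up on both sides; shard 40 is reported behind by the secondary alone, which is part of why the differing lists puzzle us.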
[ceph-users] Re: multisite replication issue with Quincy
...
client.rgw  advanced  rgw_lc_max_worker            3
client.rgw  advanced  rgw_lc_max_wp_worker         3
client.rgw  advanced  rgw_lifecycle_work_time      00:00-23:59  *
client.rgw  basic     rgw_max_concurrent_requests  2048

Multisite cluster settings: 2 clusters, each with 3 mons, 4 osds, and 2 rgw nodes; each rgw node runs 2 client-traffic RGW instances and 2 replication RGW instances.

Testing tool: cosbench

Reproduce steps:
1. Create 2 VM clusters, one for the primary site and one for the secondary site.
2. Deploy the 17.2.3 GA version or the 17.2.4 GA version to both sites.
3. Set up custom configs on the mons of both clusters.
4. On the primary site, create 10 rgw users for the cosbench tests, and set the max-buckets of each user to 10,000.
5. Run a cosbench workload to create 30,000 buckets for the 10 rgw users and generate 10 minutes of write-only traffic.
6. Run a cosbench workload to create another 30,000 buckets and generate 4 hours of write-only traffic.
7. We observed "behind shards" in the sync status after the 4-hour cosbench test, and the replication didn't catch up over time.

Cluster status:

1) Primary site:

$ sudo radosgw-admin sync status
          realm 53f4e30b-53eb-4f15-bd64-83fa1c0d5a81 (dev-realm)
      zonegroup fc33abf4-8d5a-4646-a127-483db4447840 (dev-zonegroup)
           zone 5a7692dd-eebc-4e96-b776-774004b37ea9 (dev-zone-bcc-master)
  metadata sync no sync (zone is master)
      data sync source: 0a828e9c-17f0-4a3e-a0a8-c7a408c0699c (dev-zone-bcc-secondary)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 3 shards
                        behind shards: [18,48,64]

$ sudo radosgw-admin data sync status --shard-id=48 --source-zone=dev-zone-bcc-secondary
{
    "shard_id": 48,
    "marker": {
        "status": "incremental-sync",
        "marker": "0001:1013",
        "next_step_marker": "",
        "total_entries": 0,
        "pos": 0,
        "timestamp": "2022-10-03T05:56:20.319563Z"
    },
    "pending_buckets": [],
    "recovering_buckets": []
}

2) Secondary site:

$ sudo radosgw-admin sync status
          realm 53f4e30b-53eb-4f15-bd64-83fa1c0d5a81 (dev-realm)
      zonegroup fc33abf4-8d5a-4646-a127-483db4447840 (dev-zonegroup)
           zone 0a828e9c-17f0-4a3e-a0a8-c7a408c0699c (dev-zone-bcc-secondary)
  metadata sync syncing
                full sync: 0/64 shards
                incremental sync: 64/64 shards
                metadata is caught up with master
      data sync source: 5a7692dd-eebc-4e96-b776-774004b37ea9 (dev-zone-bcc-master)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 1 shards
                        behind shards: [31]

$ sudo radosgw-admin data sync status --shard-id=31 --source-zone=dev-zone-bcc-master
{
    "shard_id": 31,
    "marker": {
        "status": "incremental-sync",
        "marker": "0001:0512",
        "next_step_marker": "",
        "total_entries": 0,
        "pos": 0,
        "timestamp": "2022-10-03T05:56:03.944817Z"
    },
    "pending_buckets": [],
    "recovering_buckets": []
}

Some error/fail log lines we observed:

1) Primary site:

2022-10-02T23:15:12.482-0400 7fbf6a819700  1 req 8882223748441190067 0.00115s op->ERRORHANDLER: err_no=-2002 new_err_no=-2002
...
2022-10-03T01:55:10.084-0400 7fbcb02a5700 -1 req 17130738504357154573 0.068001039s s3:put_obj int rgw::cls::fifo::{anonymous}::push_part(const DoutPrefixProvider*, librados::v14_2_0::IoCtx&, const string&, std::string_view, std::deque, uint64_t, optional_yield):160 fifo::op::PUSH_PART failed r=-34 tid=10345
2022-10-03T01:55:10.084-0400 7fbcb02a5700 -1 req 17130738504357154573 0.068001039s s3:put_obj int rgw::cls::fifo::FIFO::push_entries(const DoutPrefixProvider*, const std::deque&, uint64_t, optional_yield):1102 push_part failed: r=-34 tid=10345
...
2022-10-03T03:08:00.503-0400 7fc00496e700 -1 rgw rados thread: void rgw::cls::fifo::Trimmer::handle(const DoutPrefixProvider*, rgw::cls::fifo::Completion::Ptr&&, int):1858 trim failed: r=-5 tid=14844
...

2) Secondary site:

...
2022-10-02T23:15:50.279-0400 7f679a2ce700  1 req 16201632253829371026 0.00102s op->ERRORHANDLER: err_no=-2002 new_err_no=-2002
...

We did a bucket sync run on a broken bucket, but nothing happened and the bucket still didn't sync.

$ sudo radosgw-admin bucket sync run --bucket=jjm-4hr-test-1k-thisisbcstestload0011007 --source-zone=dev-zone-bcc-secondary

From: jane.dev@gmail.com  At: 10/04/22 18:57:12 UTC-4:00  To: Jane Zhu (BLOOMBERG/ 120 PARK)  Subject: Fwd: [ceph
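For anyone else reading these logs: the small negative `r=` values in the rgw::cls::fifo errors are negated Linux errno codes and can be decoded with Python's errno module. The `err_no=-2002`, on the other hand, is not an errno; as far as we can tell it is one of RGW's internal error codes (defined in rgw_common.h, in the 2000+ range), so it is not decoded here.

```python
import errno

# Decode the negated errno values seen in the FIFO log lines above.
for r in (-34, -5):
    print(f"r={r} -> {errno.errorcode[-r]}")
# r=-34 -> ERANGE
# r=-5 -> EIO
```

So the failed push_part calls return ERANGE (result out of range) and the failed trim returns EIO (I/O error).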