[ceph-users] multisite replication issue with Quincy

2022-09-15 Thread Jane Zhu
We have encountered replication issues in our multisite settings with
Quincy v17.2.3.

Our Ceph clusters are brand new; we tore down the old clusters and re-deployed
fresh Quincy ones before running this test.
In our environment we have 3 RGW nodes per site; each node runs 2 instances
for client traffic and 1 instance dedicated to replication.

Our test was done using cosbench with the following settings:
- 10 rgw users
- 3000 buckets per user
- write only
- 6 different object sizes with the following distribution (a rough
average-size calculation is sketched after this list):
1k: 17%
2k: 48%
3k: 14%
4k: 5%
1M: 13%
8M: 3%
- targeting 10 million objects per object-size bucket per user, to avoid
writing to the same objects
- no multipart uploads involved
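
A rough, back-of-the-envelope calculation of the average object size implied by
the mix above (treating 1M and 8M as 1024 KB and 8192 KB):

$ awk 'BEGIN { avg = 0.17*1 + 0.48*2 + 0.14*3 + 0.05*4 + 0.13*1024 + 0.03*8192;
               printf "average object size ~= %.0f KB\n", avg }'
average object size ~= 381 KB

So most objects are small, but by bytes the workload is dominated by the 1M and
8M objects.
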
The test ran for about 2 hours, roughly from 22:50 on 9/14 to 1:00 on 9/15.
After that, the replication tail continued for roughly another 4 hours, until
4:50 on 9/15, with gradually decreasing replication traffic. Then the
replication stopped and nothing has been going on in the clusters since.

While verifying the replication status, we found several issues.
1. The sync status shows the clusters are not fully synced. However, all
replication traffic has stopped and nothing is going on in the clusters.
Secondary zone:

          realm 8a98f19f-db58-4c09-bde6-ac89560d79b0 (prod-realm)
      zonegroup e041ea69-1e0b-4ad7-92f2-74b20aa3edf3 (prod-zonegroup)
           zone 1dadcf12-f44c-4940-8acc-9623a48b829e (prod-zone-tt)
  metadata sync syncing
                full sync: 0/64 shards
                incremental sync: 64/64 shards
                metadata is caught up with master
      data sync source: b68a526a-ffaa-4058-9903-6e7c6eac35bb (prod-zone-pw)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 2 shards
                        behind shards: [40,74]


Why did the replication stop even though the clusters are still not in sync?
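
For reference, the per-shard output shown in item 4 below was gathered roughly
along these lines (a sketch using the shard IDs and source zone reported above):

$ sudo radosgw-admin data sync status --shard-id=40 --source-zone=prod-zone-pw
$ sudo radosgw-admin data sync status --shard-id=74 --source-zone=prod-zone-pw
# any recorded sync failures would show up here
$ sudo radosgw-admin sync error list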

2. We can see that some buckets are not fully synced, and we were able to
identify some missing objects in our secondary zone (roughly along the lines
sketched after the status output below).
Here is an example bucket and its sync status in the secondary zone.

          realm 8a98f19f-db58-4c09-bde6-ac89560d79b0 (prod-realm)
      zonegroup e041ea69-1e0b-4ad7-92f2-74b20aa3edf3 (prod-zonegroup)
           zone 1dadcf12-f44c-4940-8acc-9623a48b829e (prod-zone-tt)
         bucket :mixed-5wrks-dev-4k-thisisbcstestload004178[b68a526a-ffaa-4058-9903-6e7c6eac35bb.89152.78])

    source zone b68a526a-ffaa-4058-9903-6e7c6eac35bb (prod-zone-pw)
  source bucket :mixed-5wrks-dev-4k-thisisbcstestload004178[b68a526a-ffaa-4058-9903-6e7c6eac35bb.89152.78])
                full sync: 0/101 shards
                incremental sync: 100/101 shards
                bucket is behind on 1 shards
                behind shards: [78]
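
For completeness, a sketch of how missing objects like these can be spotted:
list the example bucket from each zone and diff the object names. This assumes
jq is available and that the bucket index entries expose a top-level "name"
field; run the first listing against the primary cluster and the second against
the secondary.

$ sudo radosgw-admin bucket list --bucket=mixed-5wrks-dev-4k-thisisbcstestload004178 --max-entries=10000000 | jq -r '.[].name' | sort > objects.primary.txt
$ sudo radosgw-admin bucket list --bucket=mixed-5wrks-dev-4k-thisisbcstestload004178 --max-entries=10000000 | jq -r '.[].name' | sort > objects.secondary.txt
$ diff objects.primary.txt objects.secondary.txt | grep '^<'   # present in primary, missing in secondary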

3. As the above sync status shows, the behind shard for this example bucket
(78) is not in the list of behind shards reported by the overall sync status
([40,74]). Why is that?

4. Data sync status for these behind shards doesn't list any
"pending_buckets" or "recovering_buckets".
An example:

{
    "shard_id": 74,
    "marker": {
        "status": "incremental-sync",
        "marker": "0003:03381964",
        "next_step_marker": "",
        "total_entries": 0,
        "pos": 0,
        "timestamp": "2022-09-15T00:00:08.718840Z"
    },
    "pending_buckets": [],
    "recovering_buckets": []
}


Shouldn't the not-yet-in-sync buckets be listed here?
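
One further check we can sketch for such a shard: compare the source zone's
datalog position for shard 74 with the sync marker above. This assumes the
per-shard entries printed by "datalog status" line up by shard index.

# on the primary (source) zone
$ sudo radosgw-admin datalog status | jq '.[74]'
# on the secondary zone, for comparison
$ sudo radosgw-admin data sync status --shard-id=74 --source-zone=prod-zone-pw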

5. The sync status of the primary zone differs from that of the secondary
zone, with different sets of behind shards, and the same is true for the sync
status of the same bucket. Is this expected? See item 1 for the secondary
zone's sync status and item 6 for the primary zone's.

6. Why does the primary zone have behind shards at all, given that the
replication goes from the primary to the secondary?
Primary Zone:

          realm 8a98f19f-db58-4c09-bde6-ac89560d79b0 (prod-realm)
      zonegroup e041ea69-1e0b-4ad7-92f2-74b20aa3edf3 (prod-zonegroup)
           zone b68a526a-ffaa-4058-9903-6e7c6eac35bb (prod-zone-pw)
  metadata sync no sync (zone is master)
      data sync source: 1dadcf12-f44c-4940-8acc-9623a48b829e (prod-zone-tt)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 30 shards
                        behind shards: [6,7,26,28,29,37,47,52,55,56,61,67,68,69,74,79,82,91,95,99,101,104,106,111,112,121,122,123,126,127]

7. We have in-sync buckets that show a correct sync status in the secondary
zone but still show behind shards in the primary. Why is that?
Secondary Zone:

          realm 8a98f19f-db58-4c09-bde6-ac89560d79b0 (prod-realm)
      zonegroup e041ea69-1e0b-4ad7-92f2-74b20aa3edf3 (prod-zonegroup)
           zone 1dadcf12-f44c-4940-8acc-9623a48b829e (prod-zone-tt)
         bucket :mixed-5wrks-dev-4k-thisisbcstestload008167[b68a

[ceph-users] Re: multisite replication issue with Quincy

2022-10-04 Thread Jane Zhu (BLOOMBERG/ 120 PARK)

Custom configs applied to client.rgw (as referenced in step 3 below):

client.rgw   advanced   rgw_lc_max_worker             3
client.rgw   advanced   rgw_lc_max_wp_worker          3
client.rgw   advanced   rgw_lifecycle_work_time       00:00-23:59   *
client.rgw   basic      rgw_max_concurrent_requests   2048


Multisite cluster settings:

2 clusters, each with 3 mons, 4 OSDs, and 2 RGW nodes.
Each RGW node runs 2 client-traffic RGW instances and 2 replication RGW instances.

Testing tool: cosbench

Reproduce steps:

1. Create 2 VM clusters, one for the primary site and one for the secondary site.
2. Deploy the 17.2.3 GA version or the 17.2.4 GA version to both sites.
3. Set up the custom configs listed above on the mons of both clusters (see the
sketch after these steps).
4. On the primary site, create 10 rgw users for the cosbench tests, and set the
max-buckets of each user to 10,000.
5. Run a cosbench workload to create 30,000 buckets for the 10 rgw users and
generate 10 minutes of write-only traffic.
6. Run a cosbench workload to create another 30,000 buckets and generate 4
hours of write-only traffic.
7. We observed "behind shards" in the sync status after the 4-hour cosbench
test, and the replication didn't catch up over time.
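
A sketch of step 3: the custom configs from the table at the top of this
message could be applied on the mons roughly like this (values are the ones
listed; the exact mechanism we used may differ):

$ sudo ceph config set client.rgw rgw_lc_max_worker 3
$ sudo ceph config set client.rgw rgw_lc_max_wp_worker 3
$ sudo ceph config set client.rgw rgw_lifecycle_work_time 00:00-23:59
$ sudo ceph config set client.rgw rgw_max_concurrent_requests 2048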

Cluster status:

1) Primary site:
$ sudo radosgw-admin sync status
          realm 53f4e30b-53eb-4f15-bd64-83fa1c0d5a81 (dev-realm)
      zonegroup fc33abf4-8d5a-4646-a127-483db4447840 (dev-zonegroup)
           zone 5a7692dd-eebc-4e96-b776-774004b37ea9 (dev-zone-bcc-master)
  metadata sync no sync (zone is master)
      data sync source: 0a828e9c-17f0-4a3e-a0a8-c7a408c0699c (dev-zone-bcc-secondary)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 3 shards
                        behind shards: [18,48,64]

$ sudo radosgw-admin data sync status --shard-id=48 --source-zone=dev-zone-bcc-secondary
{
    "shard_id": 48,
    "marker": {
        "status": "incremental-sync",
        "marker": "0001:1013",
        "next_step_marker": "",
        "total_entries": 0,
        "pos": 0,
        "timestamp": "2022-10-03T05:56:20.319563Z"
    },
    "pending_buckets": [],
    "recovering_buckets": []
}

2) Secondary site:
$ sudo radosgw-admin sync status
          realm 53f4e30b-53eb-4f15-bd64-83fa1c0d5a81 (dev-realm)
      zonegroup fc33abf4-8d5a-4646-a127-483db4447840 (dev-zonegroup)
           zone 0a828e9c-17f0-4a3e-a0a8-c7a408c0699c (dev-zone-bcc-secondary)
  metadata sync syncing
                full sync: 0/64 shards
                incremental sync: 64/64 shards
                metadata is caught up with master
      data sync source: 5a7692dd-eebc-4e96-b776-774004b37ea9 (dev-zone-bcc-master)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 1 shards
                        behind shards: [31]
$ sudo radosgw-admin data sync status --shard-id=31 --source-zone=dev-zone-bcc-master
{
    "shard_id": 31,
    "marker": {
        "status": "incremental-sync",
        "marker": "0001:0512",
        "next_step_marker": "",
        "total_entries": 0,
        "pos": 0,
        "timestamp": "2022-10-03T05:56:03.944817Z"
    },
    "pending_buckets": [],
    "recovering_buckets": []
}

Some error/fail log lines we observed:

1) Primary site
2022-10-02T23:15:12.482-0400 7fbf6a819700  1 req 8882223748441190067 
0.00115s op->ERRORHANDLER: err_no=-2002 new_err_no=-2002
…
2022-10-03T01:55:10.084-0400 7fbcb02a5700 -1 req 17130738504357154573 
0.068001039s s3:put_obj int rgw::cls::fifo::{anonymous}::push_part(const 
DoutPrefixProvider*, librados::v14_2_0::IoCtx&, const string&, 
std::string_view, std::deque, uint64_t, 
optional_yield):160 fifo::op::PUSH_PART failed r=-34 tid=10345
2022-10-03T01:55:10.084-0400 7fbcb02a5700 -1 req 17130738504357154573 
0.068001039s s3:put_obj int rgw::cls::fifo::FIFO::push_entries(const 
DoutPrefixProvider*, const std::deque&, uint64_t, 
optional_yield):1102 push_part failed: r=-34 tid=10345
…
2022-10-03T03:08:00.503-0400 7fc00496e700 -1 rgw rados thread: void 
rgw::cls::fifo::Trimmer::handle(const DoutPrefixProvider*, 
rgw::cls::fifo::Completion::Ptr&&, int):1858 trim 
failed: r=-5 tid=14844
...

2) Secondary site
...
2022-10-02T23:15:50.279-0400 7f679a2ce700  1 req 16201632253829371026 
0.00102s op->ERRORHANDLER: err_no=-2002 new_err_no=-2002
...
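
A note on the codes above: r=-34 is ERANGE and r=-5 is EIO (standard Linux
errno values), while err_no=-2002 is an RGW-internal error code rather than an
errno. A quick way to tally these across the RGW logs (the log path is an
assumption about our deployment):

$ grep -hoE 'failed:? r=-[0-9]+|err_no=-[0-9]+' /var/log/ceph/ceph-client.rgw.*.log | sort | uniq -c | sort -rn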

We did a bucket sync run on a broken bucket, but nothing happened and the 
bucket still didn't sync.
$ sudo radosgw-admin bucket sync run --bucket=jjm-4hr-test-1k-thisisbcstestload0011007 --source-zone=dev-zone-bcc-secondary
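
A sketch of follow-up checks that apply to the same bucket (the debug flags are
the standard Ceph ones; where the verbose output ends up is our assumption):

$ sudo radosgw-admin bucket sync status --bucket=jjm-4hr-test-1k-thisisbcstestload0011007
$ sudo radosgw-admin sync error list
$ sudo radosgw-admin bucket sync run --bucket=jjm-4hr-test-1k-thisisbcstestload0011007 --source-zone=dev-zone-bcc-secondary --debug-rgw=20 --debug-ms=1 2>&1 | tee bucket-sync-run.log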

