[ceph-users] Re: [rgw multisite] Perpetual behind

2023-06-20 Thread kchheda3
As per the tracker, the fix was merged into Quincy and is available in 17.2.6 (per the release notes), so you might want to upgrade your cluster and re-run your tests.
Note that the existing issue will not go away after upgrading to 17.2.6; you will have to manually re-sync the buckets that are out of sync!
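For reference, a rough sketch of what that per-bucket re-sync can look like (the bucket name is just a placeholder, and the exact workflow may differ in your setup):

# On the zone that is behind, check how far a specific bucket is lagging
radosgw-admin bucket sync status --bucket=<bucket-name>
# Reset the bucket's sync status so a full sync is redone for just that
# bucket; the running RGWs (or a manual "bucket sync run") should then
# pick it up and catch it back up
radosgw-admin bucket sync init --bucket=<bucket-name>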


[ceph-users] Re: [rgw multisite] Perpetual behind

2023-06-20 Thread kchheda3
Hi Yixin,
We faced a similar issue; this is the tracker that has all the details: https://tracker.ceph.com/issues/57562


[ceph-users] Re: [rgw multisite] Perpetual behind

2023-06-18 Thread Richard Bade
Hi Yixin,
One place I start when trying to figure this out is the sync error logs. You may have already looked here:
sudo radosgw-admin sync error list --rgw-zone={zone_name}
If there's a lot in there, you can trim it to a specific date so you can see whether errors are still occurring:
sudo radosgw-admin sync error trim --end-date="2023-06-16 03:00:00" --rgw-zone={zone_name}
There's a log for both sides of the sync, so make sure you check both of your zones.
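If the list is long, something like this can make it easier to read (run it on each cluster, since each zone keeps its own error log; the jq filter is only a guess at the JSON layout, so adjust the field names to whatever your version actually prints):

# Pull out just the per-entry details from this zone's error log
sudo radosgw-admin sync error list --rgw-zone={zone_name} \
  | jq '.[].entries[] | {name, timestamp, info}'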

The next thing I try is re-running a full sync, metadata and then data:
sudo radosgw-admin metadata sync init --rgw-zone=zone1 --source_zone=zone2
sudo radosgw-admin metadata sync init --rgw-zone=zone2 --source_zone=zone1

sudo radosgw-admin data sync init --rgw-zone=zone1 --source_zone=zone2
sudo radosgw-admin data sync init --rgw-zone=zone2 --source_zone=zone1

You need to restart all the rgw processes to get this to start. Obviously, if you have a massive amount of data you won't want to re-run a full data sync.
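For the restart step, it's something along these lines (the service and unit names below are only examples; they depend on how your RGWs were deployed):

# cephadm/orchestrator deployments: restart the rgw service on each cluster
# (replace rgw.zone1 with whatever "ceph orch ls" reports for your service)
sudo ceph orch restart rgw.zone1
# packaged/manual deployments: restart the per-host systemd units instead
# (the instance name after the @ varies with how the daemon was set up)
sudo systemctl restart ceph-radosgw@rgw.$(hostname -s)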

Lastly, I had this stuck sync happen for me with an old cluster that
had explicit placement in the buckets. I think this was because the
pool name was different in each of my zones, so the explicit placement
couldn't find anywhere to put the data and the sync never finished.
It might be worth checking for this situation, as there was also a
recent thread on the mailing list where someone had explicit placement
causing sync issues.
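If you want to check for that, something like this should work (the bucket name is a placeholder, and the exact JSON layout can vary between versions):

# Non-empty pool names under explicit_placement mean the bucket bypasses
# the normal placement rules, which breaks sync if the pools differ per zone
BUCKET=mybucket
BUCKET_ID=$(radosgw-admin bucket stats --bucket="$BUCKET" | jq -r '.id')
radosgw-admin metadata get "bucket.instance:${BUCKET}:${BUCKET_ID}" \
  | jq '.data.bucket_info.bucket.explicit_placement'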

I hope that helps you track down the issue.
Rich

On Sat, 17 Jun 2023 at 08:41, Yixin Jin  wrote:
>
> Hi ceph gurus,
>
> I am experimenting with rgw multisite sync feature using Quincy release 
> (17.2.5). I am using the zone-level sync, not bucket-level sync policy. 
> During my experiment, somehow my setup got into a situation that it doesn't 
> seem to get out of. One zone is perpetually behind the other, although there 
> is no ongoing client request.
>
> Here is the output of my "sync status":
>
> root@mon1-z1:~# radosgw-admin sync status
>   realm f90e4356-3aa7-46eb-a6b7-117dfa4607c4 (test-realm)
>   zonegroup a5f23c9c-0640-41f2-956f-a8523eccecb3 (zg)
>zone bbe3e2a1-bdba-4977-affb-80596a6fe2b9 (z1)
>   metadata sync no sync (zone is master)
>   data sync source: 9645a68b-012e-4889-bf24-096e7478f786 (z2)
> syncing
> full sync: 0/128 shards
> incremental sync: 128/128 shards
> data is behind on 14 shards
> behind shards: [56,61,63,107,108,109,110,111,112,113,114,115,116,117]
>
>
> It stays behind forever while rgw is almost completely idle (1% of CPU).
>
> Any suggestion on how to drill deeper to see what happened?
>
> Thanks,
> Yixin


[ceph-users] Re: [rgw multisite] Perpetual behind

2023-06-17 Thread Alexander E. Patrakov
On Sat, Jun 17, 2023 at 4:41 AM Yixin Jin  wrote:
>
> Hi ceph gurus,
>
> I am experimenting with rgw multisite sync feature using Quincy release 
> (17.2.5). I am using the zone-level sync, not bucket-level sync policy. 
> During my experiment, somehow my setup got into a situation that it doesn't 
> seem to get out of. One zone is perpetually behind the other, although there 
> is no ongoing client request.
>
> Here is the output of my "sync status":
>
> root@mon1-z1:~# radosgw-admin sync status
>   realm f90e4356-3aa7-46eb-a6b7-117dfa4607c4 (test-realm)
>   zonegroup a5f23c9c-0640-41f2-956f-a8523eccecb3 (zg)
>zone bbe3e2a1-bdba-4977-affb-80596a6fe2b9 (z1)
>   metadata sync no sync (zone is master)
>   data sync source: 9645a68b-012e-4889-bf24-096e7478f786 (z2)
> syncing
> full sync: 0/128 shards
> incremental sync: 128/128 shards
> data is behind on 14 shards
> behind shards: [56,61,63,107,108,109,110,111,112,113,114,115,116,117]
>
>
> It stays behind forever while rgw is almost completely idle (1% of CPU).
>
> Any suggestion on how to drill deeper to see what happened?

Hello!

I have no idea what has happened, but it would be helpful if you could
confirm the latency between the two clusters. In other words, please
don't expect the sync between, e.g., Germany and Singapore to catch up
fast; it will be limited by the amount of data that can be synced in
one request and the hard-coded maximum number of requests in flight.

In Reef, there are new tunables that help on high-latency links:
rgw_data_sync_spawn_window, rgw_bucket_sync_spawn_window.
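If you do end up on Reef and want to try them, the usual way would be something like this (the values below are arbitrary examples, not recommendations):

# A larger spawn window allows more sync operations in flight, which helps
# hide round-trip latency on long-haul links at the cost of more concurrent
# requests; restart the RGW daemons afterwards so the new values take effect
ceph config set client.rgw rgw_data_sync_spawn_window 64
ceph config set client.rgw rgw_bucket_sync_spawn_window 40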

-- 
Alexander E. Patrakov