Hello,

We have a freshly installed multisite setup spanning 3 geographic locations, 
running Octopus upgraded from 15.2.5 to 15.2.7.
Each DC has 6 OSD nodes and 3 mon/mgr/rgw nodes, all SSD, with every 3 SSDs 
sharing 1 NVMe for journaling. Each zone is backed by 3 RGWs, one on each 
mon/mgr node.
The goal is to replicate 2 (currently) big buckets in the zonegroup, but 
replication only works if I disable and re-enable the bucket sync.
By big buckets I mean: one bucket is presharded with 9,000 shards (9 billion 
objects), and the second bucket, the one I am detailing in this ticket, with 
24,000 shards (24 billion objects).
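
As a rough sanity check on the shard counts above, this sketch works out the average number of objects per index shard if each bucket reaches the object count it was presharded for. The dictionary labels and the helper name are only illustrative, not real bucket names.

```python
# Objects-per-shard arithmetic, using only the figures quoted in this mail.
# The keys below are illustrative labels, not the real bucket names.
buckets = {
    "first bucket": {"shards": 9_000, "objects": 9_000_000_000},
    "second bucket": {"shards": 24_000, "objects": 24_000_000_000},
}


def objects_per_shard(objects: int, shards: int) -> float:
    """Average number of objects each index shard would hold."""
    return objects / shards


per_shard = {
    name: objects_per_shard(b["objects"], b["shards"])
    for name, b in buckets.items()
}

for name, value in per_shard.items():
    print(f"{name}: {value:,.0f} objects per shard")
```

Both buckets come out at about one million objects per shard.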

Once sync has picked up the objects (not all of them, only the ones that were 
on the source site at the time sync was enabled), it slows down a lot, from 
100,000 objects and 10 GB per 15 minutes to 50 objects per 4 hours.
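
To put the slowdown in perspective, here is a small sketch that converts the two figures above into objects per second and compares them. It uses only the numbers quoted in this mail; the variable names are illustrative.

```python
# Sync throughput before and after the slowdown, from the figures in this mail.
fast_objects, fast_minutes = 100_000, 15   # 100,000 objects per 15 minutes
slow_objects, slow_minutes = 50, 4 * 60    # 50 objects per 4 hours

fast_rate = fast_objects / (fast_minutes * 60)   # objects per second
slow_rate = slow_objects / (slow_minutes * 60)   # objects per second
slowdown = fast_rate / slow_rate                 # how many times slower

print(f"fast: {fast_rate:.1f} obj/s")
print(f"slow: {slow_rate:.4f} obj/s")
print(f"slowdown factor: {slowdown:,.0f}x")
```

That is a drop from roughly 111 objects per second to well under one object per minute.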
Once it has synchronized after the disable/enable, it maxes out the NVMe/SSD 
drives on the OSD nodes with some operation that I cannot identify. Let me show 
you the symptoms below.

Let me summarize as much as I can.

We have 1 realm containing 1 zonegroup (please help me check whether the sync 
policies are OK), and this zonegroup contains 1 cluster in the US, 1 in Hong 
Kong (master) and 1 in Singapore.

Here are the realm, zonegroup and zone definitions: 
https://pastebin.com/raw/pu66tqcf

Let me show you one disable/enable operation, where I disabled sync for the 
pix-bucket on the HKG master site and then re-enabled it.
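
For readers who want to reproduce such a cycle: it is normally done with `radosgw-admin bucket sync disable` and `radosgw-admin bucket sync enable`. The sketch below only builds and prints those commands by default. The helper names and the `dry_run` flag are hypothetical, and `pix-bucket` is the bucket name as written in this mail, which may not be the exact name on the cluster.

```python
import subprocess


def bucket_sync_commands(bucket: str) -> list[list[str]]:
    """Build the radosgw-admin calls for one disable/enable cycle."""
    return [
        ["radosgw-admin", "bucket", "sync", "disable", f"--bucket={bucket}"],
        ["radosgw-admin", "bucket", "sync", "enable", f"--bucket={bucket}"],
        # Check the per-bucket sync state afterwards.
        ["radosgw-admin", "bucket", "sync", "status", f"--bucket={bucket}"],
    ]


def run_cycle(bucket: str, dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in bucket_sync_commands(bucket):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# Dry run: prints the three commands without touching the cluster.
run_cycle("pix-bucket")
```

Calling `run_cycle("pix-bucket", dry_run=False)` on a node with admin credentials would actually execute the commands.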

In this screenshot: https://i.ibb.co/WNC0gNQ/6nodes6day.png
the highlighted area is when the data sync is running after the disable/enable. 
You can see almost no activity there. You can also see the periods when sync is 
not running; the green and yellow are the NVMe drives (RocksDB+WAL). The 
screenshot shows the SSD/NVMe disk utilization of the 6 Singapore nodes. On the 
first node there is no green or yellow in the last hours, because I reinstalled 
all the OSDs on that node so that they do not use NVMe.

In the first of the following screenshots you can see the HKG object usage, 
where the user is uploading the objects. The second screenshot is the SGP one, 
where the highlighted area is the disable/enable operation.
HKG, where the user uploads: https://i.ibb.co/vj2VFYP/pixhkg6d.png
SGP, where the sync happened: https://i.ibb.co/w41rmQT/pixsgp6d.png

Here are some troubleshooting logs: bucket sync status, cluster sync status, 
reshard list (whose entries might be left over from previous testing) and sync 
error list:

https://pastebin.com/raw/TdwiZFC1

The issue might be very similar to this issue:
https://tracker.ceph.com/issues/21591

How should I move forward, and which additional logs can I provide to help with 
troubleshooting?

Thank you in advance

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io