[ceph-users] Multisite stuck data shard recovery after bucket deletion
Hello,

I'm testing multisite sync on reef 18.2.2, deployed with cephadm on Ubuntu 22.04. Right now I'm testing a symmetrical sync policy that makes a backup to a read-only zone. My sync policy allows replication and I enable it per bucket via put-bucket-replication.

My multisite setup fails at a seemingly basic operation. My test looks like this:

1. create a bucket
2. upload some data to the bucket
3. wait for replication to copy some of the data
4. run `rclone purge` on the bucket in the master zone while replication is still in progress, so all data and the bucket itself are deleted

I've tested this against a normal secondary zone and an archive zone. It seems the bucket is deleted so quickly that replication gets stuck: the bucket is gone from both zones, but the data sync shard still tries to replicate it.

Example of a recovering shard:

{
    "shard_id": 100,
    "marker": {
        "status": "full-sync",
        "marker": "",
        "next_step_marker": "",
        "total_entries": 0,
        "pos": 0,
        "timestamp": "0.00"
    },
    "pending_buckets": [
        "bucket6:58642236-4f66-46f5-b863-1d6a8667c4c3.61059.5:9",
        "bucket6:58642236-4f66-46f5-b863-1d6a8667c4c3.61059.7:9"
    ],
    "recovering_buckets": [
        "bucket6:58642236-4f66-46f5-b863-1d6a8667c4c3.61059.7:9[0]"
    ],
    "current_time": "2024-07-17T13:23:11Z"
}

In this case there are two pending buckets because I've reused the bucket name. The only semi-automatic workaround I've found is to recreate a bucket with the same name and wait for the recovering shards to disappear.

Is there any way to make Ceph clean up these stuck shards automatically?

Best regards
Adam Prycki
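Roughly, the repro as shell commands (the endpoint URL, rclone remote, bucket name, zone name and shard id below are placeholders from my setup, so treat this as a sketch rather than an exact recipe):

    # create the bucket on the master zone and enable replication on it
    aws --endpoint-url http://master-rgw:8080 s3 mb s3://bucket6
    aws --endpoint-url http://master-rgw:8080 s3api put-bucket-replication \
        --bucket bucket6 --replication-configuration file://replication.json

    # upload some data, then delete everything while sync is still catching up
    rclone copy ./testdata master:bucket6
    rclone purge master:bucket6

    # on the secondary zone, watch the data sync shards; the stuck one shows up here
    radosgw-admin sync status
    radosgw-admin data sync status --source-zone=master --shard-id=100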
[ceph-users] Re: Separated multisite sync and user traffic, doable?
Hi,

as far as I know these endpoints are only used for multisite replication purposes. You can set just one endpoint pointing to an haproxy with multiple RGWs behind it.

You can also create separate RGWs with the sync thread disabled to serve the real users; that could make them more responsive. Look up rgw_run_sync_thread.

Also, I would avoid running a heavy RGW workload on the monitor machines. They can be sensitive to network load.

Best regards
Adam Prycki

On 14.06.2024 at 04:44, Szabo, Istvan (Agoda) wrote:
> Hi,
>
> Could that cause any issue if the endpoints defined in the zonegroups
> are not in the endpoint list behind haproxy? The question is mainly
> about the role of the endpoint servers in the zonegroup list. Is their
> role sync only, or something else as well?
>
> This would be the scenario, could it work?
>
> * I have 3 mon/mgr servers and 15 OSD nodes.
> * The RGWs on the mon/mgr nodes would be in the zonegroup definition
>   like this:
>
>   "zones": [
>       {
>           "id": "61c9sdf40-fdsd-4sdd-9rty9-ed56jda41817",
>           "name": "dc",
>           "endpoints": [
>               "http://mon1:8080",
>               "http://mon2:8080",
>               "http://mon3:8080"
>           ],
>
> * However, for user traffic I'd use an haproxy endpoint with the RGWs
>   on the 15 OSD nodes (one per OSD node).
>
> Ty
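If you go the route of dedicated user-facing RGWs, a minimal sketch of disabling the sync thread on them, assuming a cephadm deployment where the user-facing daemons belong to a service called rgw.users and the sync daemons to rgw.sync (both names are just examples, and the config section has to match how your daemons are actually named):

    # user-facing gateways: only serve S3 traffic, don't participate in sync
    ceph config set client.rgw.users rgw_run_sync_thread false

    # the gateways listed as zonegroup endpoints keep the default (sync enabled)
    ceph config set client.rgw.sync rgw_run_sync_thread true

    # check what a given section resolves to
    ceph config get client.rgw.users rgw_run_sync_thread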
[ceph-users] Slow osd ops on large arm cluster
Hello,

we are having issues with slow ops on our large ARM HPC Ceph cluster. The cluster runs 18.2.0 on Ubuntu 20.04. MONs, MGRs and MDSs had to be moved to Intel servers because of the poor single-core performance of our ARM servers.

Our main CephFS data pool sits on 54 servers in 9 racks with 1458 HDDs in total (OSDs without block.db on SSD). The CephFS data pool is configured as an erasure-coded pool with k=6 m=2 and rack as the failure domain. The pool has about 16k PGs with an average of ~90 PGs per OSD.

We have had good experience with EC CephFS on a 3.5 times smaller Intel Ceph cluster, but this ARM deployment is becoming problematic. We started experiencing issues when one of the users started to generate sequential RW traffic at about 5 GiB/s. A single OSD with slow ops was enough to create a laggy PG and crash the application generating this traffic. We've even had a case where an OSD with slow ops was lagged for 6 hours and required a manual restart. Now we are experiencing slow ops even at much lower, read-only traffic of ~400 MiB/s.

Here is an example of slow ops on an OSD:

{
    "ops": [
        {
            "description": "osd_op(client.255949991.0:92728602 4.d22s0 4:44b3390a:::1000b640ddc.039b:head [read 3633152~8192] snapc 0=[] ondisk+read+known_if_redirected e1117246)",
            "initiated_at": "2024-07-08T10:19:58.469537+",
            "age": 507.242936848,
            "duration": 507.2429885483,
            "type_data": {
                "flag_point": "started",
                "client_info": {
                    "client": "client.255949991",
                    "client_addr": "x.x.x.x:0/887459214",
                    "tid": 92728602
                },
                "events": [
                    { "event": "initiated", "time": "2024-07-08T10:19:58.469537+", "duration": 0 },
                    { "event": "throttled", "time": "2024-07-08T10:19:58.469537+", "duration": 0 },
                    { "event": "header_read", "time": "2024-07-08T10:19:58.469535+", "duration": 4294967295.981 },
                    { "event": "all_read", "time": "2024-07-08T10:19:58.469571+", "duration": 3.5859e-05 },
                    { "event": "dispatched", "time": "2024-07-08T10:19:58.469573+", "duration": 2.08e-06 },
                    { "event": "queued_for_pg", "time": "2024-07-08T10:19:58.469586+", "duration": 1.27210001e-05 },
                    { "event": "reached_pg", "time": "2024-07-08T10:19:58.485132+", "duration": 0.0155460489 },
                    { "event": "started", "time": "2024-07-08T10:19:58.485147+", "duration": 1.5161e-05 }
                ]
            }
        },

The HDD backing this OSD is not busy. The ARM cores on these servers are slow, but no process reaches full 100% core usage.

I think we may have the same issue as the one described here:
https://www.mail-archive.com/ceph-users@ceph.io/msg13273.html

I've tried to reduce osd_pool_default_read_lease_ratio from 0.8 to 0.2 and osd_heartbeat_grace from 20 to 10; together that should lower read_lease_interval from 16 to 2 seconds, but it didn't help. We still see a lot of slow ops.

Could you give me tips on what I could tune to fix this issue? Could this be an issue with a large number of EC PGs on a large cluster with weak CPUs?

Best regards
Adam Prycki
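In case it matters, the changes were applied along these lines (a sketch; read_lease_interval isn't set directly, it's derived from the other two options):

    # read lease = osd_heartbeat_grace * osd_pool_default_read_lease_ratio
    # defaults: 20 * 0.8 = 16s; after the change: 10 * 0.2 = 2s
    ceph config set osd osd_pool_default_read_lease_ratio 0.2
    ceph config set osd osd_heartbeat_grace 10

    # verify what the OSDs actually picked up
    ceph config get osd osd_pool_default_read_lease_ratio
    ceph config get osd osd_heartbeat_grace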
[ceph-users] Re: Huge amounts of objects orphaned by lifecycle policy.
Hi Casey,

I did use the full `radosgw-admin gc process --include-all`. I know about that background GC delay. Is running `radosgw-admin gc process --include-all` from a terminal any different from the gc process running in the background? I wonder if I should use it while trying to recreate this issue.

Best regards
Adam Prycki

On 2024-06-27 20:58, Casey Bodley wrote:
> hi Adam,
>
> On Thu, Jun 27, 2024 at 4:41 AM Adam Prycki wrote:
>> Hello,
>>
>> I have a question: do people use RGW lifecycle policies in production?
>> I had big hopes for this technology, but in practice it seems to be
>> very unreliable.
>>
>> Recently I've been testing different pool layouts and using a
>> lifecycle policy to move data between them. Once I checked for
>> orphaned objects I discovered that my pools were full of them. One
>> pool was over 1/3 orphans by volume. The orphaned objects belonged to
>> data that had been moved by lifecycle.
>>
>> Yesterday I decided to recreate one of the pools holding 3 TiB of
>> data. All 3 TiB was located in a single directory of one of the
>> buckets. I created a lifecycle rule which should move it all to the
>> STANDARD pool and ran radosgw-admin lc process --bucket. After the
>> lifecycle finished executing, the original pool still contained 1 TiB
>> of data. Removing the objects listed in the rgw-orphan-list output
>> reduced the pool to 65 GiB and 17k objects.
>>
>> The 17k rados __shadow objects seem to belong to S3 objects which were
>> not moved by lifecycle. I tried running the lifecycle from
>> radosgw-admin, but it seems to be unable to move them; s3cmd info
>> shows that they still report the old storage class. The filenames
>> don't contain special characters other than spaces. I have directories
>> with sequentially named objects, and some of them cannot be moved by
>> lifecycle.
>>
>> Deleting all the objects from the original 3 TiB dataset also doesn't
>> help. After running GC and the orphan-finding tool there are still
>> 1.2k rados objects which should have been deleted but are not
>> considered orphans.
>
> i assume you used `radosgw-admin gc process` here - can you confirm
> whether you added the --include-all option? without that option,
> garbage collection won't delete objects newer than
> rgw_gc_obj_min_wait=2hours in case they're still being read. it sounds
> like these rados objects may still be in the gc queue, which could
> explain why they aren't considered orphans
>
>> I've been testing on 18.2.2.
>>
>> Best regards
>> Adam Prycki
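If it helps with recreating the issue, this is roughly how I'd check whether the leftover rados objects are still sitting in the GC queue rather than being true orphans (a sketch; behaviour as I understand it on reef):

    # list everything queued for garbage collection, including entries that
    # haven't reached rgw_gc_obj_min_wait (default 2 hours) yet
    radosgw-admin gc list --include-all

    # force a GC pass over the whole queue regardless of the wait time
    radosgw-admin gc process --include-all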
[ceph-users] Huge amounts of objects orphaned by lifecycle policy.
Hello,

I have a question: do people use RGW lifecycle policies in production? I had big hopes for this technology, but in practice it seems to be very unreliable.

Recently I've been testing different pool layouts and using a lifecycle policy to move data between them. Once I checked for orphaned objects I discovered that my pools were full of them. One pool was over 1/3 orphans by volume. The orphaned objects belonged to data that had been moved by lifecycle.

Yesterday I decided to recreate one of the pools holding 3 TiB of data. All 3 TiB was located in a single directory of one of the buckets. I created a lifecycle rule which should move it all to the STANDARD pool and ran radosgw-admin lc process --bucket. After the lifecycle finished executing, the original pool still contained 1 TiB of data. Removing the objects listed in the rgw-orphan-list output reduced the pool to 65 GiB and 17k objects.

The 17k rados __shadow objects seem to belong to S3 objects which were not moved by lifecycle. I tried running the lifecycle from radosgw-admin, but it seems to be unable to move them; s3cmd info shows that they still report the old storage class. The filenames don't contain special characters other than spaces. I have directories with sequentially named objects, and some of them cannot be moved by lifecycle.

Deleting all the objects from the original 3 TiB dataset also doesn't help. After running GC and the orphan-finding tool there are still 1.2k rados objects which should have been deleted but are not considered orphans.

I've been testing on 18.2.2.

Best regards
Adam Prycki
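For context, the lifecycle rule was just a plain transition to another storage class, something along these lines (the bucket name, prefix, file name and transition delay are made up for illustration; the real rule targeted the directory holding the 3 TiB):

    # lc.json - transition everything under "dir1/" to the STANDARD storage class
    {
      "Rules": [
        {
          "ID": "move-to-standard",
          "Status": "Enabled",
          "Filter": { "Prefix": "dir1/" },
          "Transitions": [ { "Days": 1, "StorageClass": "STANDARD" } ]
        }
      ]
    }

    # apply it and run lifecycle for the bucket right away
    aws --endpoint-url http://rgw:8080 s3api put-bucket-lifecycle-configuration \
        --bucket testbucket --lifecycle-configuration file://lc.json
    radosgw-admin lc process --bucket=testbucket
    radosgw-admin lc list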