[ceph-users] Multisite stuck data shard recovery after bucket deletion

2024-07-17 Thread Adam Prycki

Hello,

I'm testing multisite sync on reef 18.2.2, cephadm and ubuntu 22.04.

Right now I'm testing a symmetrical sync policy that makes a backup to a 
read-only zone.
My sync policy allows replication, and I enable it per bucket via 
put-bucket-replication.
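
For reference, bucket-level replication is enabled through the standard S3 
replication API. A minimal sketch with the AWS CLI, assuming an 'rgw' 
profile pointing at the master zone and that RGW accepts an empty Role 
(both are assumptions, not my exact setup):

    cat > replication.json <<'EOF'
    {
      "Role": "",
      "Rules": [
        {
          "ID": "sync-all",
          "Status": "Enabled",
          "Prefix": "",
          "Destination": { "Bucket": "arn:aws:s3:::bucket6" }
        }
      ]
    }
    EOF
    aws --profile rgw s3api put-bucket-replication --bucket bucket6 \
        --replication-configuration file://replication.json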



My multisite setup fails at a seemingly basic operation.

My test looks like this:
1. create bucket
2. upload some data to bucket
3. wait for replication to copy some of the data
4. run `rclone purge` on the bucket in the master zone while replication 
is still in progress; all data and the bucket itself are deleted 
(sketched below).
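
A minimal sketch of step 4, assuming an rclone remote named 'masterzone' 
that points at the master zone RGW (the remote name is an assumption):

    # delete every object and then the bucket itself, while multisite
    # replication is still catching up
    rclone purge masterzone:bucket6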


I've tested this on normal secondary zone and archive zone.

It seems that the bucket is deleted so quickly that replication gets stuck.
The buckets are gone from both zones, but the data sync shards still try 
to replicate them.


Example of a recovering shard:

{
    "shard_id": 100,
    "marker": {
        "status": "full-sync",
        "marker": "",
        "next_step_marker": "",
        "total_entries": 0,
        "pos": 0,
        "timestamp": "0.00"
    },
    "pending_buckets": [
        "bucket6:58642236-4f66-46f5-b863-1d6a8667c4c3.61059.5:9",
        "bucket6:58642236-4f66-46f5-b863-1d6a8667c4c3.61059.7:9"
    ],
    "recovering_buckets": [
        "bucket6:58642236-4f66-46f5-b863-1d6a8667c4c3.61059.7:9[0]"
    ],
    "current_time": "2024-07-17T13:23:11Z"
}
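
(This per-shard view should be obtainable with something like the 
following, run on the zone that is pulling the data; I'm not certain of 
the exact flags, so treat it as a sketch:

    radosgw-admin data sync status --source-zone=<source-zone> --shard-id=100
)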

In this case there are 2 pending buckets because I've reused the bucket 
name.


The only semi-automatic solution I've found is to recreate the bucket 
with the same name and wait for the recovering shards to disappear (see 
the sketch below).
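
A sketch of that workaround, assuming s3cmd is configured against the 
master zone and bucket6 is the affected bucket name:

    # recreate the bucket under the same name so the stuck shard can finish
    s3cmd mb s3://bucket6

    # wait until the recovering shards disappear from the sync status,
    # then remove the bucket again
    radosgw-admin sync status
    s3cmd rb s3://bucket6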


Is there any way to make Ceph clean up these stuck shards automatically?

Best regards
Adam Prycki


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Separated multisite sync and user traffic, doable?

2024-07-17 Thread Adam Prycki

Hi,

as far as I know, these endpoints are only used for multisite replication. 
You can set just one endpoint pointing to an haproxy instance with 
multiple RGWs behind it (see the sketch below).
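
A sketch of pointing the replication endpoints at a single haproxy 
frontend; the zonegroup name and the haproxy address are assumptions:

    radosgw-admin zonegroup modify --rgw-zonegroup=default \
        --endpoints=http://haproxy.example.com:8080
    radosgw-admin zone modify --rgw-zone=dc \
        --endpoints=http://haproxy.example.com:8080
    radosgw-admin period update --commit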


You can create separate RGWs with the sync thread disabled to serve the 
real users; it could make them more responsive. Look up 
rgw_run_sync_thread (sketch below).
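
A sketch of disabling the sync thread only on the client-facing gateways; 
the config section name is an assumption and depends on how the RGW 
daemons are named:

    # client-facing RGWs: do not run multisite sync in these daemons
    ceph config set client.rgw.clientgw rgw_run_sync_thread false

    # the dedicated sync RGWs keep the default (rgw_run_sync_thread=true)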


Also, I would avoid running a heavy RGW workload on the monitor machines. 
They can be sensitive to network load.


Best regards
Adam Prycki

On 14.06.2024 at 04:44, Szabo, Istvan (Agoda) wrote:

Hi,

Could it cause any issue if the endpoints defined in the zonegroup are not 
in the endpoint list behind haproxy?
The question is mainly about the role of the endpoint servers in the 
zonegroup list: is their role sync only, or something else as well?

This would be the scenario, could it work?

   * I have 3 mon/mgr servers and 15 OSD nodes
   * The RGWs on the mon/mgr nodes would be in the zonegroup definition 
like this:
   "zones": [
 {
   "id": "61c9sdf40-fdsd-4sdd-9rty9-ed56jda41817",
   "name": "dc",
   "endpoints": [
 "http://mon1:8080;,
 "http://mon2:8080;,
 "http://mon3:8080;
   ],


   * However, for user traffic I'd use an haproxy endpoint in front of the 
15 OSD-node RGWs (one RGW per OSD node).

Ty


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Slow osd ops on large arm cluster

2024-07-08 Thread Adam Prycki

Hello,

We are having issues with slow ops on our large ARM HPC Ceph cluster.

The cluster runs 18.2.0 on Ubuntu 20.04.
MONs, MGRs and MDSs had to be moved to Intel servers because of the poor 
single-core performance of our ARM servers.
Our main CephFS data pool sits on 54 servers in 9 racks with 1458 HDDs in 
total (OSDs without block.db on SSD).
The CephFS data pool is configured as an erasure-coded pool with k=6 m=2 
and a rack-level failure domain. The pool has about 16k PGs, with an 
average of ~90 PGs per OSD.
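
A rough sketch of such a layout (the profile and pool names are 
placeholders, not our exact commands):

    # EC profile: 6 data + 2 coding chunks, one chunk per rack
    ceph osd erasure-code-profile set ec62_rack k=6 m=2 crush-failure-domain=rack

    # CephFS data pool with ~16k PGs, EC overwrites enabled for CephFS use
    ceph osd pool create cephfs_data_ec 16384 16384 erasure ec62_rack
    ceph osd pool set cephfs_data_ec allow_ec_overwrites true
    ceph fs add_data_pool <fsname> cephfs_data_ec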


We have had good experience with EC CephFS on an Intel Ceph cluster 3.5 
times smaller, but this ARM deployment is becoming problematic. We 
started experiencing issues when one of the users began generating 
sequential read/write traffic at about 5 GiB/s. A single OSD with slow 
ops was enough to create a laggy PG and crash the application generating 
this traffic.
We've even had a case where an OSD with slow ops was lagged for 6 hours 
and required a manual restart.


Now we are experiencing slow ops even at much lower, read-only traffic of 
~400 MiB/s.


Here is an example of a slow op on an OSD:
{
    "ops": [
        {
            "description": "osd_op(client.255949991.0:92728602 4.d22s0 4:44b3390a:::1000b640ddc.039b:head [read 3633152~8192] snapc 0=[] ondisk+read+known_if_redirected e1117246)",
            "initiated_at": "2024-07-08T10:19:58.469537+",
            "age": 507.242936848,
            "duration": 507.2429885483,
            "type_data": {
                "flag_point": "started",
                "client_info": {
                    "client": "client.255949991",
                    "client_addr": "x.x.x.x:0/887459214",
                    "tid": 92728602
                },
                "events": [
                    {
                        "event": "initiated",
                        "time": "2024-07-08T10:19:58.469537+",
                        "duration": 0
                    },
                    {
                        "event": "throttled",
                        "time": "2024-07-08T10:19:58.469537+",
                        "duration": 0
                    },
                    {
                        "event": "header_read",
                        "time": "2024-07-08T10:19:58.469535+",
                        "duration": 4294967295.981
                    },
                    {
                        "event": "all_read",
                        "time": "2024-07-08T10:19:58.469571+",
                        "duration": 3.5859e-05
                    },
                    {
                        "event": "dispatched",
                        "time": "2024-07-08T10:19:58.469573+",
                        "duration": 2.08e-06
                    },
                    {
                        "event": "queued_for_pg",
                        "time": "2024-07-08T10:19:58.469586+",
                        "duration": 1.27210001e-05
                    },
                    {
                        "event": "reached_pg",
                        "time": "2024-07-08T10:19:58.485132+",
                        "duration": 0.0155460489
                    },
                    {
                        "event": "started",
                        "time": "2024-07-08T10:19:58.485147+",
                        "duration": 1.5161e-05
                    }
                ]
            }
        },
The HDD backing this OSD is not busy. The ARM cores on these servers are 
slow, but no process reaches 100% usage of a single core.
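
For reference, a sketch of how this kind of op dump can be pulled from a 
daemon (osd.123 is a placeholder):

    # ops currently tracked by the OSD, including slow ones
    ceph tell osd.123 dump_ops_in_flight

    # recently completed slow ops, useful after the fact
    ceph tell osd.123 dump_historic_slow_ops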


I think we may have the same issue as the one described here: 
https://www.mail-archive.com/ceph-users@ceph.io/msg13273.html


I've tried reducing osd_pool_default_read_lease_ratio from 0.8 to 0.2 and 
osd_heartbeat_grace from 20 to 10.
That should lower the read_lease_interval from 16 to 2 seconds, but it 
didn't help; we still see a lot of slow ops.
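
A sketch of those changes, assuming they were applied cluster-wide via 
the config database:

    # read_lease_interval = osd_heartbeat_grace * osd_pool_default_read_lease_ratio
    # defaults: 20 * 0.8 = 16s; after the change: 10 * 0.2 = 2s
    ceph config set global osd_pool_default_read_lease_ratio 0.2
    ceph config set global osd_heartbeat_grace 10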


Could you give me tips on what I could tune to fix this issue?

Could this be an issue with a large number of EC PGs on a large cluster 
with weak CPUs?


Best regards
Adam Prycki
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Huge amounts of objects orphaned by lifecycle policy.

2024-06-27 Thread Adam Prycki

Hi Casey,

I did use the full `radosgw-admin gc process --include-all`.
I know about the background gc delay.

Is running `radosgw-admin gc process --include-all` from the terminal any 
different from the gc processing that runs in the background? I wonder if 
I should use it while trying to recreate this issue.
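
For anyone trying to reproduce this, a sketch of how the gc queue can be 
inspected and flushed (standard radosgw-admin subcommands):

    # list entries still waiting in the gc queue, including not-yet-expired ones
    radosgw-admin gc list --include-all

    # process the whole queue immediately, ignoring rgw_gc_obj_min_wait
    radosgw-admin gc process --include-all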


Best regards
Adam Prycki

On 2024-06-27 20:58, Casey Bodley wrote:

hi Adam,

On Thu, Jun 27, 2024 at 4:41 AM Adam Prycki  wrote:


Hello,

I have a question. Do people use rgw lifecycle policies in production?
I had big hopes for this technology, but in practice it seems to be very
unreliable.

Recently I've been testing different pool layouts and using a lifecycle
policy to move data between them. Once I checked for orphaned objects,
I discovered that my pools were full of them. One pool was over 1/3
orphans by volume. The orphaned objects belonged to data that had been
moved by lifecycle.

Yesterday I decided to recreate one of the pools holding 3 TiB of data.
All 3 TiB was located in a single directory of some buckets. I created a
lifecycle policy which should move it all to the STANDARD pool and ran
radosgw-admin lc process --bucket. After the lifecycle finished executing,
the pool still contained 1 TiB of data. Removing the objects listed in the
rgw-orphan-list output reduced the pool to 65 GiB and 17k objects.

The 17k rados __shadow objects seem to belong to s3 objects which were
not moved by the lifecycle. I tried running the lifecycle from
radosgw-admin, but it seems to be unable to move them. s3cmd info shows
that they still report the old storage class. The filenames don't contain
special characters other than spaces. I have directories with sequentially
named objects, and some of them cannot be moved by the lifecycle.

Deleting all the objects from the original 3 TiB dataset also doesn't help.
After running gc and the orphan-finding tool there are still 1.2k rados
objects which should have been deleted but are not considered orphans.


i assume you used `radosgw-admin gc process` here - can you confirm
whether you added the --include-all option? without that option,
garbage collection won't delete objects newer than
rgw_gc_obj_min_wait=2hours in case they're still being read. it sounds
like these rados objects may still be in the gc queue, which could
explain why they aren't considered orphans



I've been testing on 18.2.2.

Best regards
Adam Prycki
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Huge amounts of objects orphaned by lifecycle policy.

2024-06-27 Thread Adam Prycki

Hello,

I have a question. Do people use rgw lifecycle policies in production?
I had big hopes for this technology, but in practice it seems to be very 
unreliable.


Recently I've been testing different pool layouts and using a lifecycle 
policy to move data between them. Once I checked for orphaned objects, 
I discovered that my pools were full of them. One pool was over 1/3 
orphans by volume. The orphaned objects belonged to data that had been 
moved by lifecycle.


Yesterday I decided to recreate one of the pools holding 3 TiB of data. 
All 3 TiB was located in a single directory of some buckets. I created a 
lifecycle policy which should move it all to the STANDARD pool and ran 
radosgw-admin lc process --bucket. After the lifecycle finished executing, 
the pool still contained 1 TiB of data. Removing the objects listed in the 
rgw-orphan-list output reduced the pool to 65 GiB and 17k objects.
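
A sketch of the kind of rule and commands involved; the bucket name and 
prefix are placeholders, not my exact setup:

    # lifecycle rule (set via put-bucket-lifecycle-configuration) that
    # transitions objects under a prefix to the STANDARD storage class:
    #   { "Rules": [ { "ID": "move-to-standard", "Status": "Enabled",
    #                  "Filter": { "Prefix": "testdir/" },
    #                  "Transitions": [ { "Days": 1, "StorageClass": "STANDARD" } ] } ] }

    # force lifecycle processing for a single bucket
    radosgw-admin lc process --bucket=testbucket

    # check per-bucket lifecycle status afterwards
    radosgw-admin lc list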


The 17k rados __shadow objects seem to belong to s3 objects which were 
not moved by the lifecycle. I tried running the lifecycle from 
radosgw-admin, but it seems to be unable to move them. s3cmd info shows 
that they still report the old storage class. The filenames don't contain 
special characters other than spaces. I have directories with sequentially 
named objects, and some of them cannot be moved by the lifecycle.


Deleting all the objects from the original 3 TiB dataset also doesn't 
help. After running gc and the orphan-finding tool there are still 1.2k 
rados objects which should have been deleted but are not considered 
orphans.
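
For completeness, a sketch of the orphan check; the data pool name is an 
assumption:

    # flush the gc queue first so pending deletes are not reported as orphans
    radosgw-admin gc process --include-all

    # list rados objects that no longer belong to any s3 object
    rgw-orphan-list default.rgw.buckets.data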


I've been testing on 18.2.2.

Best regards
Adam Prycki


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io