[ceph-users] RGWs offline after upgrade to Nautilus

2023-07-20 Thread Ben . Zieglmeier
Hello,

We have an RGW cluster that was recently upgraded from 12.2.11 to 14.2.22. The 
upgrade went mostly fine, though now several of our RGWs will not start. One 
RGW is working fine, the rest will not initialize. They are in a crash loop. 
This is part of a multisite configuration, and is currently not the master 
zone. Current master zone is running 14.2.22. These are the only two zones in 
the zonegroup. After turning debug up to 20, these are the log snippets between 
each crash:
```
2023-07-20 14:29:56.371 7fd8dec40900 20 RGWRados::pool_iterate: got periods.1b6e1a93-98ba-4378-bc5c-d36cd5542f11.52
2023-07-20 14:29:56.371 7fd8dec40900 20 RGWRados::pool_iterate: got periods.1b6e1a93-98ba-4378-bc5c-d36cd5542f11.54
2023-07-20 14:29:56.371 7fd8dec40900 20 RGWRados::pool_iterate: got realms_names.
2023-07-20 14:29:56.371 7fd8dec40900 20 RGWRados::pool_iterate: got
2023-07-20 14:29:56.371 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.371 7fd8dec40900 20 rados_obj.operate() r=-2 bl.length=0
2023-07-20 14:29:56.371 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.373 7fd8dec40900 20 rados_obj.operate() r=-2 bl.length=0
2023-07-20 14:29:56.373 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.373 7fd8dec40900 20 rados_obj.operate() r=-2 bl.length=0
2023-07-20 14:29:56.373 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.373 7fd8dec40900 20 rados_obj.operate() r=0 bl.length=46
2023-07-20 14:29:56.373 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.373 7fd8dec40900 20 rados_obj.operate() r=0 bl.length=114
2023-07-20 14:29:56.373 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.373 7fd8dec40900 20 rados_obj.operate() r=0 bl.length=46
2023-07-20 14:29:56.373 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.374 7fd8dec40900 20 rados_obj.operate() r=0 bl.length=686
2023-07-20 14:29:56.374 7fd8dec40900 20 period zonegroup init ret 0
2023-07-20 14:29:56.374 7fd8dec40900 20 period zonegroup name
2023-07-20 14:29:56.374 7fd8dec40900 20 using current period zonegroup

2023-07-20 14:29:56.374 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.374 7fd8dec40900 20 rados_obj.operate() r=0 bl.length=46
2023-07-20 14:29:56.374 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.375 7fd8dec40900 20 rados_obj.operate() r=0 bl.length=903
2023-07-20 14:29:56.375 7fd8dec40900 10 Cannot find current period zone using local zone
2023-07-20 14:29:56.375 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.375 7fd8dec40900 20 rados_obj.operate() r=0 bl.length=903
2023-07-20 14:29:56.375 7fd8dec40900 20 zone
2023-07-20 14:29:56.375 7fd8dec40900 20 generating connection object for zone  id f10b465f-bf18-47d0-a51c-ca4f17118ee1
2023-07-20 14:34:56.198 7fd8cafe8700 -1 Initialization timeout, failed to initialize
```

I’ve checked all file permissions, filesystem free space, disabled selinux and 
firewalld, tried turning up the initialization timeout to 600, and tried 
removing all non-essential config from ceph.conf. All produce the same results. 
I would greatly appreciate any other ideas or insight.
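
In case it helps frame suggestions, the multisite checks I have available on 
the affected hosts are along these lines (angle-bracket values are 
placeholders, not our real names or endpoints):
```
radosgw-admin realm list
radosgw-admin period get                        # current period as committed on the master zone
radosgw-admin zonegroup get                     # should list both zones in the zonegroup
radosgw-admin zone get --rgw-zone=<our-secondary-zone>
# and, in case the local period copy is stale, pulling it from the master:
radosgw-admin period pull --url=http://<master-endpoint> \
    --access-key=<sync-user-access-key> --secret=<sync-user-secret>
```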

Thanks,
Ben
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [EXTERNAL] Re: Massive OMAP remediation

2023-04-27 Thread Ben . Zieglmeier
Hi Dan,

Thanks for the response. No I have not yet told the OSDs participating in that 
PG to compact. It was something I had thought about, but was somewhat concerned 
about what that might do, or what performance impact that might have (or if the 
OSD would come out alive on the other side). I think we may have found a less 
impactful way to trim these bilog entries: using `--start-marker` and 
`--end-marker` and simply looping, incrementing the marker values by 1000 each 
pass (roughly as sketched below). This is far less disruptive than running the 
command without those flags, where each invocation spent ~45 seconds just 
enumerating the bilog entries to trim, during which the lead OSD was nearly 
unresponsive. It took diving into the source 
code and the help of a few colleagues (as well as some trial and error on 
non-production systems) to figure out what values those arguments actually 
wanted. Thankfully I was able to get a listing of all OMAP keys for that object 
a couple weeks ago. I’m still not sure how comfortable I would be doing this to 
a bucket that was actually mission critical (this one contains non-critical 
data), but I think we may have a way forward to dislodge this large OMAP by 
trimming. Thanks again!
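
For the archive, the loop is roughly the sketch below; note that the marker 
values here are simplified placeholders, since (as mentioned) the real marker 
format took some source diving to work out:
```
#!/bin/bash
# Sketch only: the marker values below are simplified placeholders, not the
# actual bilog marker format we ended up needing.
BUCKET="<bucket-name>"     # placeholder
STEP=1000
TOTAL=360000000            # roughly the number of remaining bilog entries
for ((start = 0; start < TOTAL; start += STEP)); do
    end=$((start + STEP))
    radosgw-admin bilog trim \
        --bucket="$BUCKET" \
        --start-marker="$start" \
        --end-marker="$end"
    sleep 5                # give the lead OSD some breathing room between batches
done
```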

-Ben

From: Dan van der Ster 
Date: Wednesday, April 26, 2023 at 11:11 AM
To: Ben.Zieglmeier 
Cc: ceph-users@ceph.io 
Subject: [EXTERNAL] Re: [ceph-users] Massive OMAP remediation
Hi Ben,

Are you compacting the relevant osds periodically? `ceph tell osd.x
compact` (for the three osds holding the bilog) would help reshape the
rocksdb levels to at least perform better for a little while until the
next round of bilog trims.
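
Something like this (osd ids and the pool/object names are placeholders;
`ceph osd map` reports the PG and acting set for that index object):
```
# find the PG and acting osds for the oversized index object, then compact them
ceph osd map <index-pool> <bucket-index-object>
for id in 10 11 12; do        # placeholder osd ids from the acting set
    ceph tell osd.$id compact
done
```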

Otherwise, I have experience deleting ~50M object indices in one step
in the past, probably back in the luminous days IIRC. It will likely
lock up the relevant osds for a while as the omap is removed. If you
dare take that step, it might help to set nodown; that might prevent
other osds from flapping and creating more work.
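
i.e. wrap the deletion in something like:
```
# sketch: keep peers from marking the busy osds down while the omap is removed
ceph osd set nodown
# ... perform the index deletion here ...
ceph osd unset nodown
```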

Cheers, Dan

__
Clyso GmbH | 
https://www.clyso.com


On Tue, Apr 25, 2023 at 2:45 PM Ben.Zieglmeier
 wrote:
>
> Hi All,
>
> We have a RGW cluster running Luminous (12.2.11) that has one object with an 
> extremely large OMAP database in the index pool. Listomapkeys on the object 
> returned 390 Million keys to start. Through bilog trim commands, we’ve 
> whittled that down to about 360 Million. This is a bucket index for a 
> regrettably unsharded bucket. There are only about 37K objects actually in 
> the bucket, but through years of neglect, the bilog has grown completely out of 
> control. We’ve hit some major problems trying to deal with this particular 
> OMAP object. We just crashed 4 OSDs when a bilog trim caused enough churn to 
> knock one of the OSDs housing this PG out of the cluster temporarily. The OSD 
> disks are 6.4TB NVMe, but are split into 4 partitions, each housing their own 
> OSD daemon (collocated journal).
>
> We want to be rid of this large OMAP object, but are running out of options 
> to deal with it. Reshard outright does not seem like a viable option, as we 
> believe the deletion would deadlock OSDs and could cause impact. Continuing 
> to run `bilog trim` 1000 records at a time has been what we’ve done, but this 
> also seems to be creating impacts to performance/stability. We are seeking 
> options to remove this problematic object without creating additional 
> problems. It is quite likely this bucket is abandoned, so we could remove the 
> data, but I fear even the deletion of such a large OMAP could bring OSDs down 
> and cause potential for metadata loss (the other bucket indexes on that same 
> PG).
>
> Any insight available would be highly appreciated.
>
> Thanks.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Massive OMAP remediation

2023-04-25 Thread Ben . Zieglmeier
Hi All,

We have a RGW cluster running Luminous (12.2.11) that has one object with an 
extremely large OMAP database in the index pool. Listomapkeys on the object 
returned 390 Million keys to start. Through bilog trim commands, we’ve whittled 
that down to about 360 Million. This is a bucket index for a regrettably 
unsharded bucket. There are only about 37K objects actually in the bucket, but 
through years of neglect, the bilog has grown completely out of control. We’ve hit 
some major problems trying to deal with this particular OMAP object. We just 
crashed 4 OSDs when a bilog trim caused enough churn to knock one of the OSDs 
housing this PG out of the cluster temporarily. The OSD disks are 6.4TB NVMe, 
but are split into 4 partitions, each housing their own OSD daemon (collocated 
journal).
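
For reference, this is roughly how we have been measuring the problem object 
(the index pool name and bucket id below are placeholders, not our real ones):
```
# count the omap keys on the oversized bucket index object;
# bucket index objects are named .dir.<bucket-id> -- substitute the real id
rados -p <index-pool> listomapkeys .dir.<bucket-id> | wc -l
```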

We want to be rid of this large OMAP object, but are running out of options to 
deal with it. Reshard outright does not seem like a viable option, as we 
believe the deletion would deadlock OSDs and could cause impact. Continuing to 
run `bilog trim` 1000 records at a time has been what we’ve done, but this also 
seems to be creating impacts to performance/stability. We are seeking options 
to remove this problematic object without creating additional problems. It is 
quite likely this bucket is abandoned, so we could remove the data, but I fear 
even the deletion of such a large OMAP could bring OSDs down and cause 
potential for metadata loss (the other bucket indexes on that same PG).
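
For context, the reshard we are shying away from would be something along the 
lines of the sketch below (bucket name and shard count are placeholders); the 
part that worries us is the subsequent removal of the old unsharded index 
object, which is exactly the kind of large OMAP deletion described above:
```
# manual reshard (sketch only -- we have not run this)
radosgw-admin bucket reshard --bucket=<bucket> --num-shards=<N>
# ...followed by cleanup of the old index instance, i.e. the feared deletion
```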

Any insight available would be highly appreciated.

Thanks.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io