Casey,
Thanks.
I picked a few of the buckets in question and they have not been resharded
(num_shards has not changed since creation). However, radosgw-admin lc reshard
fix --bucket BUCKET did restore the lifecycle, and radosgw-admin lc process
--bucket BUCKET did start deleting things as expected on the slave side.
I will check one that did not process by hand to see if it has run by
tomorrow. I think I may still have to tune rgw_lc_max_worker,
rgw_lc_max_wp_worker, and rgw_lifecycle_work_time, but we will see.
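As a quick sketch of how the still-stuck buckets can be spotted from saved
`radosgw-admin lc list` output: the entry shape below mirrors what reef
prints, but the bucket names and marker IDs are invented for illustration.

```shell
# Normally the JSON would come from:  radosgw-admin lc list > /tmp/lc-list.json
# Here we use an inline sample (bucket names and marker IDs are made up).
cat > /tmp/lc-list.json <<'EOF'
[
    {
        "bucket": ":bucket1:c0ffee.4096.1",
        "started": "Thu, 01 Jan 1970 00:00:00 GMT",
        "status": "UNINITIAL"
    },
    {
        "bucket": ":bucket3:c0ffee.4096.2",
        "started": "Mon, 09 Dec 2024 08:00:00 GMT",
        "status": "COMPLETE"
    }
]
EOF

# Pull the bucket name out of each UNINITIAL entry (no jq dependency):
# grep the status line plus the two lines above it, then extract the name
# between the first pair of colons on the "bucket" line.
grep -B2 '"status": "UNINITIAL"' /tmp/lc-list.json \
    | sed -n 's/.*"bucket": ":\([^:]*\):.*/\1/p'
```

Each name it prints is a candidate for lc reshard fix / lc process by hand.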
-Chris
On Monday, December 9, 2024 at 08:17:24 AM MST, Casey Bodley
<[email protected]> wrote:
hi Chris,
https://docs.ceph.com/en/latest/radosgw/dynamicresharding/#lifecycle-fixes
may be relevant here. there's a `radosgw-admin lc reshard fix` command
that you can run on the secondary site to add buckets back to the lc
list. if you omit the --bucket argument, it should scan all buckets
and re-link everything with a lifecycle policy.
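A sketch of the triage: something like the following can diff the two zones'
lc lists to find buckets missing on the secondary (the file names and bucket
names are made up; the real input would be the bucket names pulled from
`radosgw-admin lc list` run in each zone).

```shell
# Compare bucket names extracted from each zone's `radosgw-admin lc list`
# output. The two files below are stand-ins with invented bucket names.
cat > /tmp/lc-master.txt <<'EOF'
bucket1
bucket2
bucket3
EOF
cat > /tmp/lc-secondary.txt <<'EOF'
bucket1
bucket3
EOF

# comm needs sorted input; -23 prints lines unique to the first file,
# i.e. buckets the secondary's lc list is missing.
sort -o /tmp/lc-master.txt /tmp/lc-master.txt
sort -o /tmp/lc-secondary.txt /tmp/lc-secondary.txt
comm -23 /tmp/lc-master.txt /tmp/lc-secondary.txt
```

Anything it prints is a candidate for `lc reshard fix --bucket` on the
secondary.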
On Fri, Dec 6, 2024 at 5:04 PM Christopher Durham <[email protected]> wrote:
>
>
> I have 18.2.4 on Rocky 9 Linux. This system has been updated from octopus ->
> pacific -> quincy (18.2.2) -> (el8 -> el9 reinstall of each server, with the
> ceph OSDs and mons surviving) -> reef (18.2.4) over several years.
>
> It appears that I have two probably related problems with lifecycle
> expiration in a multisite configuration.
> I have two zones, one on each side of a multisite. I recently discovered
> (about a month after the el9 and reef 18.2.4 updates) that lifecycle
> expiration was (mostly) not working on the secondary zone side. I had thought
> initially that there may be replication issues, but while there are
> replication issues on individual buckets that required me to full sync
> individual buckets, the majority of the issues are because lifecycle
> expiration is not working on the secondary side.
> The observation that caused me to think lifecycle is the issue is that, based
> on the lifecycle policy for a given bucket, all objects in that bucket should
> already be deleted. What we are seeing is that all objects have been deleted
> from the bucket on the master zone, but NONE of them have been deleted on the
> slave side. This may vary based on the date the objects were created across
> multiple lifecycle runs on the master side, but objects never get
> deleted/expired on the slave side.
> I tracked this down to one of two causes. Let's say, for a given bucket, bucket1:
>
> 1. radosgw-admin lc list on the master shows that the bucket completes its
> lifecycle processing periodically. But on the slave side, it shows:
> "started": "Thu, 01 Jan 1970 ...",
> "status": "UNINITIAL"
> If I run:
> radosgw-admin lc process --bucket bucket1
> that particular bucket flushes all of its expired objects (it takes a while).
> But as far as I can tell, at this point it never runs lifecycle again on the
> slave side.
>
> Now, let's say I have bucket2.
> 2. radosgw-admin lc list on the slave side does NOT show the bucket in the
> json output, yet the same command on the master side shows it!
>
> Given this, running
> radosgw-admin lc process --bucket bucket2
> causes C++ exceptions and the command crashes on the slave side (which makes
> sense, actually).
>
> Yet in this case if I do:
> aws --profile bucket2_owner s3api get-bucket-lifecycle-configuration --bucket
> bucket2
> it shows the lifecycle configuration for the bucket, regardless of whether I
> point the awscli at the master or the slave zone.
> In this case, if I redeploy the lifecycle with
> put-bucket-lifecycle-configuration on the master side, then the lifecycle
> status shows up in
> radosgw-admin lc list
> on the slave side (as well as on the master) as UNINITIAL, and the issue
> devolves to #1 above.
> Note that lifecycle expiration on the slave side does work for some number of
> buckets, but most remain in the UNINITIAL state, and others do not appear at
> all until I redeploy the lifecycle. The slave side is a lot more active in
> reading and writing.
>
> So, why would the bucket not show up in lc list on the slave side, where it
> had before (I can't say how long ago 'before' was)? How can I get it to
> automatically perform lifecycle on the slave side? Would this perhaps be
> related to
>
> rgw_lc_max_worker, rgw_lc_max_wp_worker, rgw_lifecycle_work_time?
> It appears that lifecycle processing is independent on each side, meaning
> that a lifecycle processing of bucket A on one side runs separately from
> lifecycle processing of bucket A on the other side, and as such an object may
> exist on one side for a time when it has been already deleted on the other
> side.
>
> How does rgw_lifecycle_work_time work? Does it mean that outside of the
> work_time window no new lifecycle processing starts, or that those in
> progress abort/stop?
> Either way, this may explain my observation of too many buckets staying in
> UNINITIAL when those that are processing have a lot of data to delete.
> And why is this last one rgw_lifecycle_work_time and not rgw_lc_work_time?
> Anyway, any help on these issues would be appreciated. Thanks
> -Chris
> _______________________________________________
> ceph-users mailing list -- [email protected]
> To unsubscribe send an email to [email protected]