Hi Ceph-Users,

I have a multisite Ceph cluster deployed in containers on 3 VMs per site (6 VMs 
total over 2 sites). Each VM runs a mon, osd, mgr, mds, and two rgw containers 
(regular and pubsub). It was installed with ceph-ansible.

One of the sites has been up for a few years, the other site has been recently 
re-installed and paired with the initial site. The initial site is using 
Nautilus (14.2.9), and the new site is on Octopus (15.2.13). (Side point - is this 
version combination valid?)

I've noticed that on the new site, pubsub is building a gigantic queue of 
objects (it's building faster than our product can acknowledge the events). I'm 
having a rough time trying to debug this/understand why the queue is building.

I currently have 450k objects stored in an S3 bucket that is mostly inactive 
(our test system backed by this cluster is off while we attempt to resolve 
this), synced between the two sites. The pubsub queue on the second site 
currently has 1.7M objects, and I've disabled the pubsub containers to prevent 
it building further.  As soon as I enable the pubsub containers again this 
starts building at an alarming rate.

What I've tried:

  *   Interacting with the pubsub REST API. I pulled all the events in the 
pubsub queue and did some analysis on them.
  *   Of the 1.7M events, there were 106k unique S3 objects referenced.
  *   The average S3 object had 13 pubsub events referring to it. This seems 
very odd given the inactivity of the data; I was expecting to find no duplicate 
entries here.
  *   The most mentioned S3 object was referred to 362 times (i.e. a single S3 
object had 362 pubsub OBJECT_CREATE events).
  *   All the mTimes are from 2020 (other than 35 in 2021) - the second site 
was only deployed this month.
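
For anyone wanting to reproduce this kind of analysis, here is a minimal sketch 
of the counting involved. It is not the exact script I used, and it assumes the 
events have already been pulled from the pubsub REST API and flattened into 
dicts. The field names "bucket", "key" and "mtime" (an ISO 8601 string) are 
assumptions made for illustration, not the real event schema.

```python
from collections import Counter


def summarize_events(events):
    """Summarize pubsub events per S3 object.

    `events` is an iterable of dicts with the hypothetical keys
    'bucket', 'key' and 'mtime' (ISO 8601 string, e.g. '2020-05-01T10:00:00Z').
    """
    events = list(events)

    # Number of events that refer to each (bucket, key) pair.
    per_object = Counter((e["bucket"], e["key"]) for e in events)

    # Number of events per mtime year (first four characters of the timestamp).
    per_year = Counter(e["mtime"][:4] for e in events)

    most_common = per_object.most_common(1)
    return {
        "total_events": len(events),
        "unique_objects": len(per_object),
        "mean_events_per_object": (
            len(events) / len(per_object) if per_object else 0.0
        ),
        "most_referenced": most_common[0] if most_common else None,
        "events_per_mtime_year": dict(per_year),
    }


# Tiny made-up sample, only to show the expected input shape.
sample = [
    {"bucket": "b", "key": "a.txt", "mtime": "2020-05-01T10:00:00Z"},
    {"bucket": "b", "key": "a.txt", "mtime": "2020-05-01T10:00:00Z"},
    {"bucket": "b", "key": "c.txt", "mtime": "2021-01-02T00:00:00Z"},
]
summary = summarize_events(sample)
```

On a mostly inactive bucket I would expect mean_events_per_object to be close 
to 1, which is why the numbers above look wrong to me.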

Does anyone have any suggestions as to why this is occurring?

Thanks,
Alex
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
