I'd like to say that it was something smart, but it was a bit of luck. I logged in on a hypervisor (we run OSDs and OpenStack hypervisors on the same hosts) to deal with another issue, and while checking the system I noticed that one of the OSDs was using a lot more CPU than the others. That made me think that the increased IOPS could be putting a strain on some of the OSDs without impacting the whole cluster, so I decided to increase pg_num to spread the operations across more OSDs, and it did the trick. The qlen metric went back to something similar to what we had before the problems started.
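For anyone hitting the same thing, the PG bump itself is only a couple of commands; a minimal sketch, assuming the index pool has the default name `default.rgw.buckets.index` (adjust for your zone/realm naming):

```shell
# Check the current PG count for the RGW index pool
ceph osd pool get default.rgw.buckets.index pg_num

# Double it (in our case 256 -> 512); since Nautilus, pgp_num
# follows pg_num automatically so the split starts on its own
ceph osd pool set default.rgw.buckets.index pg_num 512

# Watch the PGs split and the data rebalance
ceph -s
ceph osd pool get default.rgw.buckets.index pg_num
```

The split causes some data movement on the index pool, so it's worth doing outside peak hours if you can.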
We're going to look into adding CPU/RAM monitoring for all the OSDs next.

Gauvain

On Fri, Dec 22, 2023 at 2:58 PM Drew Weaver <drew.wea...@thenap.com> wrote:

> Can you say how you determined that this was a problem?
>
> -----Original Message-----
> From: Gauvain Pocentek <gauvainpocen...@gmail.com>
> Sent: Friday, December 22, 2023 8:09 AM
> To: ceph-users@ceph.io
> Subject: [ceph-users] Re: RGW requests piling up
>
> Hi again,
>
> It turns out that our rados cluster wasn't that happy: the rgw index pool
> wasn't able to handle the load. Scaling the PG number helped (256 to 512),
> and the RGW is back to normal behaviour.
>
> There is still a huge number of read IOPS on the index, and we'll try to
> figure out what's happening there.
>
> Gauvain
>
> On Thu, Dec 21, 2023 at 1:40 PM Gauvain Pocentek <gauvainpocen...@gmail.com>
> wrote:
>
> > Hello Ceph users,
> >
> > We've been having an issue with RGW for a couple of days and we would
> > appreciate some help, ideas, or guidance to figure out the issue.
> >
> > We run a multi-site setup which has been working pretty well so far.
> > We don't actually have data replication enabled yet, only metadata
> > replication. On the master region we've started to see requests piling
> > up in the rgw process, leading to very slow operations and failures
> > all over the place (clients time out before getting responses from
> > rgw). The workaround for now is to restart the rgw containers regularly.
> >
> > We made a mistake and forcefully deleted a bucket on a secondary
> > zone; this might be the trigger, but we are not sure.
> >
> > Other symptoms include:
> >
> > * Increased memory usage of the RGW processes (we bumped the container
> > limits from 4G to 48G to cater for that)
> > * Lots of read IOPS on the index pool (4 or 5 times more than
> > we were seeing before)
> > * The Prometheus ceph_rgw_qlen and ceph_rgw_qactive metrics (number of
> > active requests) seem to show that the number of concurrent requests
> > increases over time, although we don't see more requests coming in on
> > the load-balancer side.
> >
> > The current thought is that the RGW process doesn't close requests
> > properly, or that some requests just hang. After a restart of the
> > process things look OK, but the situation turns bad fairly quickly
> > (after 1 hour we start to see many timeouts).
> >
> > The rados cluster seems completely healthy; it is also used for rbd
> > volumes, and we haven't seen any degradation there.
> >
> > Has anyone experienced this kind of issue? Anything we should be
> > looking at?
> >
> > Thanks for your help!
> >
> > Gauvain
> >
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io