To conclude this story, we finally discovered that one of our users was
running a Prometheus exporter (s3_exporter) that constantly listed the
contents of their buckets, which contain millions of objects. That really
didn't play well with Ceph: two of these exporters were generating ~700k
read IOPS on the index pool, and managed to kill the RGWs (14 of them)
after a few hours.
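
For context, here's a rough Python/boto3 sketch of the kind of full listing
such an exporter runs on every scrape (the endpoint and bucket name are made
up, purely to illustrate the access pattern):

    import boto3

    # Hypothetical endpoint and bucket, just to show the access pattern.
    s3 = boto3.client("s3", endpoint_url="https://rgw.example.com")
    paginator = s3.get_paginator("list_objects_v2")

    total_objects = 0
    # Every page (up to 1000 keys) is another round trip served from the
    # bucket index, so millions of objects mean thousands of index reads
    # per scrape.
    for page in paginator.paginate(Bucket="some-huge-bucket"):
        total_objects += page.get("KeyCount", 0)
    print(total_objects)

Run two of those on a short scrape interval and the IOPS numbers above add
up quickly.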

I hope this can help someone in the future.

Gauvain

On Fri, Dec 22, 2023 at 3:09 PM Gauvain Pocentek <gauvainpocen...@gmail.com>
wrote:

> I'd like to say it was something smart, but it was a bit of luck.
>
> I logged in to a hypervisor (we run OSDs and OpenStack hypervisors on the
> same hosts) to deal with another issue, and while checking the system I
> noticed that one of the OSDs was using a lot more CPU than the others. That
> made me think that the increased IOPS could be putting a strain on some of
> the OSDs without impacting the whole cluster, so I decided to increase
> pg_num to spread the operations across more OSDs, and it did the trick. The
> qlen metric went back to something similar to what we had before the
> problems started.
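>
> For reference, the change itself is a one-liner; here's a minimal sketch
> (the pool name below is an assumption, use whatever your zone's index pool
> is actually called):
>
>     import subprocess
>
>     # The index pool name is an assumption -- adjust it to your zone.
>     # Doubling pg_num spreads the bucket index OMAP workload across more
>     # OSDs (on recent Ceph releases pgp_num follows along automatically).
>     subprocess.run(
>         ["ceph", "osd", "pool", "set",
>          "default.rgw.buckets.index", "pg_num", "512"],
>         check=True,
>     )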
>
> We're going to look into adding CPU/RAM monitoring for all the OSDs next.
>
> Gauvain
>
> On Fri, Dec 22, 2023 at 2:58 PM Drew Weaver <drew.wea...@thenap.com>
> wrote:
>
>> Can you say how you determined that this was a problem?
>>
>> -----Original Message-----
>> From: Gauvain Pocentek <gauvainpocen...@gmail.com>
>> Sent: Friday, December 22, 2023 8:09 AM
>> To: ceph-users@ceph.io
>> Subject: [ceph-users] Re: RGW requests piling up
>>
>> Hi again,
>>
>> It turns out that our rados cluster wasn't that happy after all: the rgw
>> index pool wasn't able to handle the load. Scaling the PG count (256 to
>> 512) helped, and the RGW is back to normal behaviour.
>>
>> There is still a huge number of read IOPS on the index pool, and we'll try to
>> figure out what's happening there.
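>>
>> In case it helps anyone watching the same thing: if you scrape the mgr
>> prometheus module, a rough sketch like the one below (the Prometheus URL is
>> made up, and it assumes that module's ceph_pool_rd / ceph_pool_metadata
>> metrics) prints the per-pool read op rate, which makes it easy to see which
>> pool is being hammered:
>>
>>     import requests
>>
>>     PROM = "http://prometheus.example.com:9090"  # made-up endpoint
>>
>>     # ceph_pool_rd is a per-pool read-op counter; joining it with
>>     # ceph_pool_metadata pulls in the human-readable pool name.
>>     query = (
>>         "rate(ceph_pool_rd[5m]) "
>>         "* on (pool_id) group_left(name) ceph_pool_metadata"
>>     )
>>     resp = requests.get(
>>         f"{PROM}/api/v1/query", params={"query": query}, timeout=10
>>     )
>>     for item in resp.json()["data"]["result"]:
>>         print(item["metric"].get("name"), item["value"][1])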
>>
>> Gauvain
>>
>> On Thu, Dec 21, 2023 at 1:40 PM Gauvain Pocentek <
>> gauvainpocen...@gmail.com>
>> wrote:
>>
>> > Hello Ceph users,
>> >
>> > We've been having an issue with RGW for a couple days and we would
>> > appreciate some help, ideas, or guidance to figure out the issue.
>> >
>> > We run a multi-site setup which has been working pretty well so far.
>> > We don't actually have data replication enabled yet, only metadata
>> > replication. In the master region we've started to see requests piling
>> > up in the rgw processes, leading to very slow operations and failures
>> > all over the place (clients time out before getting responses from
>> > rgw). The workaround for now is to restart the rgw containers regularly.
>> >
>> > We made a mistake and forcefully deleted a bucket on a secondary
>> > zone; this might be the trigger, but we are not sure.
>> >
>> > Other symptoms include:
>> >
>> > * Increased memory usage of the RGW processes (we bumped the container
>> > limits from 4G to 48G to cater for that)
>> > * Lots of read IOPS on the index pool (4 or 5 times what we were seeing
>> > before)
>> > * The Prometheus ceph_rgw_qlen and ceph_rgw_qactive metrics (queue length
>> > and number of active requests) seem to show that the number of concurrent
>> > requests increases over time, although we don't see more requests coming
>> > in on the load-balancer side.
>> >
>> > The current thought is that the RGW process doesn't close requests
>> > properly, or that some requests just hang. After a restart of the
>> > process things look OK, but the situation turns bad fairly quickly
>> > (after about an hour we start to see many timeouts).
>> >
>> > The rados cluster seems completely healthy; it is also used for rbd
>> > volumes, and we haven't seen any degradation there.
>> >
>> > Has anyone experienced that kind of issue? Anything we should be
>> > looking at?
>> >
>> > Thanks for your help!
>> >
>> > Gauvain
>> >