[ceph-users] Re: RGW requests piling up

2023-12-28 Thread Gauvain Pocentek
To conclude this story, we finally discovered that one of our users was
running a Prometheus exporter (s3_exporter) that constantly listed the
contents of their buckets, which contain millions of objects. That really
didn't play well with Ceph: two of these exporters were generating ~700k
read IOPS on the index pool, and managed to kill the RGWs (14 of them)
after a few hours.
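
For reference, the sketch below shows roughly what such an exporter does on
every scrape (a minimal boto3 illustration; the endpoint, credentials and
bucket name are placeholders, not taken from our setup). Each page of the
listing is another read against the bucket index, so a bucket with millions
of objects turns every scrape into thousands of index operations.

# Minimal sketch of a "list the whole bucket on every scrape" exporter.
# Endpoint, credentials and bucket name are hypothetical placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://rgw.example.com",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

def bucket_stats(bucket):
    """Walk the entire bucket; every page hits the RGW index pool."""
    total_objects = 0
    total_bytes = 0
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):  # ~1000 keys per page
        for obj in page.get("Contents", []):
            total_objects += 1
            total_bytes += obj["Size"]
    return total_objects, total_bytes

# Millions of objects means thousands of ListObjectsV2 calls per scrape;
# two exporters doing this continuously adds up very quickly.
print(bucket_stats("some-huge-bucket"))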

I hope this can help someone in the future.

Gauvain

On Fri, Dec 22, 2023 at 3:09 PM Gauvain Pocentek 
wrote:

> I'd like to say that it was something smart, but it was a bit of luck.
>
> I logged in to a hypervisor (we run OSDs and OpenStack hypervisors on the
> same hosts) to deal with another issue, and while checking the system I
> noticed that one of the OSDs was using a lot more CPU than the others. It
> made me think that the increased IOPS could put a strain on some of the
> OSDs without impacting the whole cluster, so I decided to increase pg_num
> to spread the operations across more OSDs, and it did the trick. The qlen
> metric went back to something similar to what we had before the problems
> started.
>
> We're going to look into adding CPU/RAM monitoring for all the OSDs next.
>
> Gauvain
>
> On Fri, Dec 22, 2023 at 2:58 PM Drew Weaver 
> wrote:
>
>> Can you say how you determined that this was a problem?
>>
>> -Original Message-
>> From: Gauvain Pocentek 
>> Sent: Friday, December 22, 2023 8:09 AM
>> To: ceph-users@ceph.io
>> Subject: [ceph-users] Re: RGW requests piling up
>>
>> Hi again,
>>
>> It turns out that our RADOS cluster wasn't that happy after all: the RGW
>> index pool wasn't able to handle the load. Scaling the PG count helped
>> (256 to 512), and RGW is back to normal behaviour.
>>
>> There is still a huge number of read IOPS on the index, and we'll try to
>> figure out what's happening there.
>>
>> Gauvain
>>
>> On Thu, Dec 21, 2023 at 1:40 PM Gauvain Pocentek <
>> gauvainpocen...@gmail.com>
>> wrote:
>>
>> > Hello Ceph users,
>> >
>> > We've been having an issue with RGW for a couple of days and we would
>> > appreciate some help, ideas, or guidance to figure it out.
>> >
>> > We run a multi-site setup which has been working pretty well so far.
>> > We don't actually have data replication enabled yet, only metadata
>> > replication. In the master region we've started to see requests piling
>> > up in the rgw processes, leading to very slow operations and failures
>> > all over the place (clients time out before getting responses from
>> > rgw). The workaround for now is to restart the rgw containers regularly.
>> >
>> > We made the mistake of forcefully deleting a bucket on a secondary
>> > zone; this might be the trigger, but we are not sure.
>> >
>> > Other symptoms include:
>> >
>> > * Increased memory usage of the RGW processes (we bumped the container
>> > limits from 4G to 48G to cater for that)
>> > * Lots of read IOPS on the index pool (4 or 5 times what we were
>> > seeing before)
>> > * The Prometheus ceph_rgw_qlen and ceph_rgw_qactive metrics (number of
>> > active requests) seem to show that the number of concurrent requests
>> > increases over time, although we don't see more requests coming in on
>> > the load-balancer side.
>> >
>> > The current thought is that the RGW process doesn't close requests
>> > properly, or that some requests just hang. After a restart of the
>> > process things look OK, but the situation turns bad fairly quickly
>> > (after 1 hour we start to see many timeouts).
>> >
>> > The RADOS cluster seems completely healthy; it is also used for RBD
>> > volumes, and we haven't seen any degradation there.
>> >
>> > Has anyone experienced that kind of issue? Anything we should be
>> > looking at?
>> >
>> > Thanks for your help!
>> >
>> > Gauvain
>> >
>>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW requests piling up

2023-12-22 Thread Gauvain Pocentek
I'd like to say that it was something smart, but it was a bit of luck.

I logged in to a hypervisor (we run OSDs and OpenStack hypervisors on the
same hosts) to deal with another issue, and while checking the system I
noticed that one of the OSDs was using a lot more CPU than the others. It
made me think that the increased IOPS could put a strain on some of the
OSDs without impacting the whole cluster, so I decided to increase pg_num
to spread the operations across more OSDs, and it did the trick. The qlen
metric went back to something similar to what we had before the problems
started.

We're going to look into adding CPU/RAM monitoring for all the OSDs next.
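
In the meantime, a quick check like the sketch below (illustrative only, and
assuming psutil is available on the hosts) is enough to spot an outlier
ceph-osd process on a given host; proper exporters are obviously the better
long-term answer.

# Per-host snapshot of CPU and RSS for every ceph-osd process.
# Rough illustrative sketch, not what we actually deploy.
import time
import psutil

osds = [p for p in psutil.process_iter(["name", "cmdline"])
        if p.info["name"] == "ceph-osd"]

# The first cpu_percent() call only primes the counters; measure over 5s.
for p in osds:
    p.cpu_percent(None)
time.sleep(5)

stats = []
for p in osds:
    stats.append((p.cpu_percent(None),
                  p.memory_info().rss / 2**20,
                  p.pid,
                  " ".join(p.info["cmdline"] or [])))

# Busiest OSDs first; one process standing out is what gave it away here.
for cpu, rss_mib, pid, cmdline in sorted(stats, reverse=True):
    print(f"pid={pid} cpu={cpu:5.1f}% rss={rss_mib:7.1f}MiB {cmdline}")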

Gauvain

On Fri, Dec 22, 2023 at 2:58 PM Drew Weaver  wrote:

> Can you say how you determined that this was a problem?
>
> -Original Message-
> From: Gauvain Pocentek 
> Sent: Friday, December 22, 2023 8:09 AM
> To: ceph-users@ceph.io
> Subject: [ceph-users] Re: RGW requests piling up
>
> Hi again,
>
> It turns out that our RADOS cluster wasn't that happy after all: the RGW
> index pool wasn't able to handle the load. Scaling the PG count helped
> (256 to 512), and RGW is back to normal behaviour.
>
> There is still a huge number of read IOPS on the index, and we'll try to
> figure out what's happening there.
>
> Gauvain
>
> On Thu, Dec 21, 2023 at 1:40 PM Gauvain Pocentek <
> gauvainpocen...@gmail.com>
> wrote:
>
> > Hello Ceph users,
> >
> > We've been having an issue with RGW for a couple of days and we would
> > appreciate some help, ideas, or guidance to figure it out.
> >
> > We run a multi-site setup which has been working pretty well so far.
> > We don't actually have data replication enabled yet, only metadata
> > replication. In the master region we've started to see requests piling
> > up in the rgw processes, leading to very slow operations and failures
> > all over the place (clients time out before getting responses from
> > rgw). The workaround for now is to restart the rgw containers regularly.
> >
> > We made the mistake of forcefully deleting a bucket on a secondary
> > zone; this might be the trigger, but we are not sure.
> >
> > Other symptoms include:
> >
> > * Increased memory usage of the RGW processes (we bumped the container
> > limits from 4G to 48G to cater for that)
> > * Lots of read IOPS on the index pool (4 or 5 times what we were
> > seeing before)
> > * The Prometheus ceph_rgw_qlen and ceph_rgw_qactive metrics (number of
> > active requests) seem to show that the number of concurrent requests
> > increases over time, although we don't see more requests coming in on
> > the load-balancer side.
> >
> > The current thought is that the RGW process doesn't close requests
> > properly, or that some requests just hang. After a restart of the
> > process things look OK, but the situation turns bad fairly quickly
> > (after 1 hour we start to see many timeouts).
> >
> > The RADOS cluster seems completely healthy; it is also used for RBD
> > volumes, and we haven't seen any degradation there.
> >
> > Has anyone experienced that kind of issue? Anything we should be
> > looking at?
> >
> > Thanks for your help!
> >
> > Gauvain
> >
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW requests piling up

2023-12-22 Thread Gauvain Pocentek
Hi again,

It turns out that our RADOS cluster wasn't that happy after all: the RGW
index pool wasn't able to handle the load. Scaling the PG count helped
(256 to 512), and RGW is back to normal behaviour.
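
For reference, the change itself boils down to the two ceph commands below,
shown wrapped in a small Python sketch. The pool name is only an assumption
(in a multi-site setup the index pool is usually named
<zone>.rgw.buckets.index, so check "ceph osd pool ls" first), and on clusters
with the PG autoscaler enabled you may need to adjust pg_num_min or the
autoscale mode so it doesn't undo the change.

# Sketch of the pg_num bump on the RGW index pool.
# POOL is an assumed default name; adjust it to the real index pool.
import subprocess

POOL = "default.rgw.buckets.index"

def ceph(*args):
    out = subprocess.run(["ceph", *args], check=True,
                         capture_output=True, text=True)
    return out.stdout.strip()

print(ceph("osd", "pool", "get", POOL, "pg_num"))   # e.g. "pg_num: 256"

# On recent releases pgp_num follows pg_num automatically.
ceph("osd", "pool", "set", POOL, "pg_num", "512")
print(ceph("osd", "pool", "get", POOL, "pg_num"))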

There is still a huge number of read IOPS on the index, and we'll try to
figure out what's happening there.

Gauvain

On Thu, Dec 21, 2023 at 1:40 PM Gauvain Pocentek 
wrote:

> Hello Ceph users,
>
> We've been having an issue with RGW for a couple of days and we would
> appreciate some help, ideas, or guidance to figure it out.
>
> We run a multi-site setup which has been working pretty well so far. We
> don't actually have data replication enabled yet, only metadata
> replication. In the master region we've started to see requests piling up
> in the rgw processes, leading to very slow operations and failures all
> over the place (clients time out before getting responses from rgw). The
> workaround for now is to restart the rgw containers regularly.
>
> We made the mistake of forcefully deleting a bucket on a secondary zone;
> this might be the trigger, but we are not sure.
>
> Other symptoms include:
>
> * Increased memory usage of the RGW processes (we bumped the container
> limits from 4G to 48G to cater for that)
> * Lots of read IOPS on the index pool (4 or 5 times what we were seeing
> before)
> * The Prometheus ceph_rgw_qlen and ceph_rgw_qactive metrics (number of
> active requests) seem to show that the number of concurrent requests
> increases over time, although we don't see more requests coming in on the
> load-balancer side.
>
> The current thought is that the RGW process doesn't close requests
> properly, or that some requests just hang. After a restart of the process
> things look OK, but the situation turns bad fairly quickly (after 1 hour
> we start to see many timeouts).
>
> The RADOS cluster seems completely healthy; it is also used for RBD
> volumes, and we haven't seen any degradation there.
>
> Has anyone experienced that kind of issue? Anything we should be looking
> at?
>
> Thanks for your help!
>
> Gauvain
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io