[ceph-users] Data recovery after resharding mishap

2024-08-18 Thread Gauvain Pocentek
Hello list,

We have made a mistake and dynamically resharded a bucket in a multi-site
RGW setup running Quincy (support for this was only added in Reef). So we
now have ~200 million objects still stored in the rados cluster, but
completely removed from the bucket index (basically Ceph has created a new
index for the bucket).

We would really like to recover these objects, but we are facing a few
issues with our ideas. Any help would be appreciated.

The main problem we face is that the rados objects contain binary data
where we expected JSON data. The RGW is configured to use zlib compression,
so we think that could be the reason (although using zlib to decompress
doesn't work). Has anyone already faced this and managed to recover the data
from the rados objects?
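
For anyone who wants to experiment along with us, this is the kind of thing
we are trying. It is only a sketch: the pool and object names are
placeholders, it assumes the payload is raw deflate without a zlib header
(which may not be the case), and RGW may well prepend its own compression
metadata to each chunk:

# grab one rados object of the bucket (placeholder names)
rados -p default.rgw.buckets.data get '<rados-object-name>' /tmp/part.z

# try raw deflate (wbits=-15) instead of the default zlib wrapper
python3 -c 'import zlib; open("/tmp/part.raw","wb").write(zlib.decompressobj(-15).decompress(open("/tmp/part.z","rb").read()))'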

Another idea we have is to inject the old data into the new bucket index.
This looks possible as the object marker hasn't changed. We also have
access to the old indexes/shards, so we could get all the omap key/value
pairs and inject them into the new index. Has anyone been mad enough to try
this by any chance?
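
Very roughly, the idea looks like the sketch below. Names are placeholders,
it completely glosses over binary omap keys/values (which a shell loop will
not handle correctly), the keys have to land on the shard RGW expects for
the new shard count, and a `radosgw-admin bucket check` / stats
recalculation would be needed afterwards:

POOL=default.rgw.buckets.index        # placeholder pool name
OLD='.dir.<marker>.<old-shard-id>'    # old index shard object
NEW='.dir.<marker>.<new-shard-id>'    # corresponding new shard object

rados -p "$POOL" listomapkeys "$OLD" | while IFS= read -r key; do
  rados -p "$POOL" getomapval "$OLD" "$key" /tmp/omap.val
  # setomapval reads the value from stdin when no value argument is given
  rados -p "$POOL" setomapval "$NEW" "$key" < /tmp/omap.val
done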

Any other idea to recover the data would help of course.

Thank you!

Gauvain
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW requests piling up

2023-12-28 Thread Gauvain Pocentek
To conclude this story, we finally discovered that one of our users was
using a Prometheus exporter (s3_exporter) that constantly listed the
contents of their buckets, which contain millions of objects. That really
didn't play well with Ceph: two of these exporters were generating ~700k
read IOPS on the index pool and managed to kill the RGWs (14 of them) after
a few hours.
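
If anyone wants to check for the same pattern, per-pool client I/O makes it
easy to spot; something along these lines (pool name is an example):

# shows client read/write op/s per pool; watch the RGW index pool
watch -n 5 'ceph osd pool stats default.rgw.buckets.index'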

I hope this can help someone in the future.

Gauvain

On Fri, Dec 22, 2023 at 3:09 PM Gauvain Pocentek 
wrote:

> I'd like to say that it was something smart but it was a bit of luck.
>
> I logged in on a hypervisor (we run OSDs and OpenStack hypervisors on the
> same hosts) to deal with another issue, and while checking the system I
> noticed that one of the OSDs was using a lot more CPU than the others. It
> made me think that the increased IOPS could put a strain on some of the
> OSDs without impacting the whole cluster, so I decided to increase pg_num to
> spread the operations to more OSDs, and it did the trick. The qlen metric
> went back to something similar to what we had before the problems started.
>
> We're going to look into adding CPU/RAM monitoring for all the OSDs next.
>
> Gauvain
>
> On Fri, Dec 22, 2023 at 2:58 PM Drew Weaver 
> wrote:
>
>> Can you say how you determined that this was a problem?
>>
>> -Original Message-
>> From: Gauvain Pocentek 
>> Sent: Friday, December 22, 2023 8:09 AM
>> To: ceph-users@ceph.io
>> Subject: [ceph-users] Re: RGW requests piling up
>>
>> Hi again,
>>
>> It turns out that our rados cluster wasn't that happy: the rgw index pool
>> wasn't able to handle the load. Scaling the PG number helped (256 to 512),
>> and the RGW is back to normal behaviour.
>>
>> There is still a huge number of read IOPS on the index, and we'll try to
>> figure out what's happening there.
>>
>> Gauvain
>>
>> On Thu, Dec 21, 2023 at 1:40 PM Gauvain Pocentek <
>> gauvainpocen...@gmail.com>
>> wrote:
>>
>> > Hello Ceph users,
>> >
>> > We've been having an issue with RGW for a couple days and we would
>> > appreciate some help, ideas, or guidance to figure out the issue.
>> >
>> > We run a multi-site setup which has been working pretty fine so far.
>> > We don't actually have data replication enabled yet, only metadata
>> > replication. On the master region we've started to see requests piling
>> > up in the rgw process, leading to very slow operations and failures
>> > all over the place (clients time out before getting responses from
>> > rgw). The workaround for now is to restart the rgw containers regularly.
>> >
>> > We've made a mistake and forcefully deleted a bucket on a secondary
>> > zone; this might be the trigger, but we are not sure.
>> >
>> > Other symptoms include:
>> >
>> > * Increased memory usage of the RGW processes (we bumped the container
>> > limits from 4G to 48G to cater for that)
>> > * Lots of read IOPS on the index pool (4 or 5 times more compared to
>> > what we were seeing before)
>> > * The prometheus ceph_rgw_qlen and ceph_rgw_qactive metrics (number of
>> > active requests) seem to show that the number of concurrent requests
>> > increases with time, although we don't see more requests coming in on
>> > the load-balancer side.
>> >
>> > The current thought is that the RGW process doesn't close the requests
>> > properly, or that some requests just hang. After a restart of the
>> > process things look OK but the situation turns bad fairly quickly
>> > (after 1 hour we start to see many timeouts).
>> >
>> > The rados cluster seems completely healthy, it is also used for rbd
>> > volumes, and we haven't seen any degradation there.
>> >
>> > Has anyone experienced that kind of issue? Anything we should be
>> > looking at?
>> >
>> > Thanks for your help!
>> >
>> > Gauvain
>> >
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
>> email to ceph-users-le...@ceph.io
>>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW requests piling up

2023-12-22 Thread Gauvain Pocentek
I'd like to say that it was something smart but it was a bit of luck.

I logged in on a hypervisor (we run OSDs and OpenStack hypervisors on the
same hosts) to deal with another issue, and while checking the system I
noticed that one of the OSDs was using a lot more CPU than the others. It
made me think that the increased IOPS could put a strain on some of the
OSDs without impacting the whole cluster, so I decided to increase pg_num to
spread the operations to more OSDs, and it did the trick. The qlen metric
went back to something similar to what we had before the problems started.

We're going to look into adding CPU/RAM monitoring for all the OSDs next.

Gauvain

On Fri, Dec 22, 2023 at 2:58 PM Drew Weaver  wrote:

> Can you say how you determined that this was a problem?
>
> -Original Message-----
> From: Gauvain Pocentek 
> Sent: Friday, December 22, 2023 8:09 AM
> To: ceph-users@ceph.io
> Subject: [ceph-users] Re: RGW requests piling up
>
> Hi again,
>
> It turns out that our rados cluster wasn't that happy: the rgw index pool
> wasn't able to handle the load. Scaling the PG number helped (256 to 512),
> and the RGW is back to normal behaviour.
>
> There is still a huge number of read IOPS on the index, and we'll try to
> figure out what's happening there.
>
> Gauvain
>
> On Thu, Dec 21, 2023 at 1:40 PM Gauvain Pocentek <
> gauvainpocen...@gmail.com>
> wrote:
>
> > Hello Ceph users,
> >
> > We've been having an issue with RGW for a couple days and we would
> > appreciate some help, ideas, or guidance to figure out the issue.
> >
> > We run a multi-site setup which has been working pretty fine so far.
> > We don't actually have data replication enabled yet, only metadata
> > replication. On the master region we've started to see requests piling
> > up in the rgw process, leading to very slow operations and failures
> > all over the place (clients time out before getting responses from
> > rgw). The workaround for now is to restart the rgw containers regularly.
> >
> > We've made a mistake and forcefully deleted a bucket on a secondary
> > zone; this might be the trigger, but we are not sure.
> >
> > Other symptoms include:
> >
> > * Increased memory usage of the RGW processes (we bumped the container
> > limits from 4G to 48G to cater for that)
> > * Lots of read IOPS on the index pool (4 or 5 times more compared to
> > what we were seeing before)
> > * The prometheus ceph_rgw_qlen and ceph_rgw_qactive metrics (number of
> > active requests) seem to show that the number of concurrent requests
> > increases with time, although we don't see more requests coming in on
> > the load-balancer side.
> >
> > The current thought is that the RGW process doesn't close the requests
> > properly, or that some requests just hang. After a restart of the
> > process things look OK but the situation turns bad fairly quickly
> > (after 1 hour we start to see many timeouts).
> >
> > The rados cluster seems completely healthy, it is also used for rbd
> > volumes, and we haven't seen any degradation there.
> >
> > Has anyone experienced that kind of issue? Anything we should be
> > looking at?
> >
> > Thanks for your help!
> >
> > Gauvain
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW requests piling up

2023-12-22 Thread Gauvain Pocentek
Hi again,

It turns out that our rados cluster wasn't that happy: the rgw index pool
wasn't able to handle the load. Scaling the PG number helped (256 to 512),
and the RGW is back to normal behaviour.
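
For reference, the change itself is just a pg_num bump on the index pool,
roughly like this (pool name is an example; if the autoscaler is enabled for
the pool it may undo the change, so it can be worth switching it off first):

ceph osd pool set default.rgw.buckets.index pg_autoscale_mode off
ceph osd pool set default.rgw.buckets.index pg_num 512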

There is still a huge number of read IOPS on the index, and we'll try to
figure out what's happening there.

Gauvain

On Thu, Dec 21, 2023 at 1:40 PM Gauvain Pocentek 
wrote:

> Hello Ceph users,
>
> We've been having an issue with RGW for a couple days and we would
> appreciate some help, ideas, or guidance to figure out the issue.
>
> We run a multi-site setup which has been working pretty fine so far. We
> don't actually have data replication enabled yet, only metadata
> replication. On the master region we've started to see requests piling up
> in the rgw process, leading to very slow operations and failures all over
> the place (clients time out before getting responses from rgw). The
> workaround for now is to restart the rgw containers regularly.
>
> We've made a mistake and forcefully deleted a bucket on a secondary zone;
> this might be the trigger, but we are not sure.
>
> Other symptoms include:
>
> * Increased memory usage of the RGW processes (we bumped the container
> limits from 4G to 48G to cater for that)
> * Lots of read IOPS on the index pool (4 or 5 times more compared to what
> we were seeing before)
> * The prometheus ceph_rgw_qlen and ceph_rgw_qactive metrics (number of
> active requests) seem to show that the number of concurrent requests
> increases with time, although we don't see more requests coming in on the
> load-balancer side.
>
> The current thought is that the RGW process doesn't close the requests
> properly, or that some requests just hang. After a restart of the process
> things look OK but the situation turns bad fairly quickly (after 1 hour we
> start to see many timeouts).
>
> The rados cluster seems completely healthy, it is also used for rbd
> volumes, and we haven't seen any degradation there.
>
> Has anyone experienced that kind of issue? Anything we should be looking
> at?
>
> Thanks for your help!
>
> Gauvain
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] RGW requests piling up

2023-12-21 Thread Gauvain Pocentek
Hello Ceph users,

We've been having an issue with RGW for a couple days and we would
appreciate some help, ideas, or guidance to figure out the issue.

We run a multi-site setup which has been working pretty fine so far. We
don't actually have data replication enabled yet, only metadata
replication. On the master region we've started to see requests piling up
in the rgw process, leading to very slow operations and failures all over
the place (clients time out before getting responses from rgw). The
workaround for now is to restart the rgw containers regularly.

We've made a mistake and forcefully deleted a bucket on a secondary zone;
this might be the trigger, but we are not sure.

Other symptoms include:

* Increased memory usage of the RGW processes (we bumped the container
limits from 4G to 48G to cater for that)
* Lots of read IOPS on the index pool (4 or 5 times more compared to what
we were seeing before)
* The prometheus ceph_rgw_qlen and ceph_rgw_qactive metrics (number of
active requests) seem to show that the number of concurrent requests
increases with time, although we don't see more requests coming in on the
load-balancer side.

The current thought is that the RGW process doesn't close the requests
properly, or that some requests just hang. After a restart of the process
things look OK but the situation turns bad fairly quickly (after 1 hour we
start to see many timeouts).

The rados cluster seems completely healthy, it is also used for rbd
volumes, and we haven't seen any degradation there.

Has anyone experienced that kind of issue? Anything we should be looking at?

Thanks for your help!

Gauvain
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Very slow backfilling/remapping of EC pool PGs

2023-03-21 Thread Gauvain Pocentek
On Tue, Mar 21, 2023 at 2:21 PM Clyso GmbH - Ceph Foundation Member <
joachim.kraftma...@clyso.com> wrote:

>
>
> https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#confval-osd_op_queue
>

Since this requires a restart, I went another way to speed up the recovery
of degraded PGs and avoid weirdness while restarting the OSDs. I've
increased the value of osd_mclock_max_capacity_iops_hdd to a ridiculous
number for spinning disks (6000). The effect is not magical but the
recovery went from 4 to 60 objects/s. Ceph should be back to normal in a
few hours.

I will change the osd_op_queue value once the cluster is stable.
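
For the record, both changes are plain config options, something like this
(the capacity override can be applied at runtime, while the scheduler switch
only takes effect after an OSD restart):

# override the measured IOPS capacity for all HDD OSDs
ceph config set osd osd_mclock_max_capacity_iops_hdd 6000

# later, switch back to the wpq scheduler to stop using mclock (restart needed)
ceph config set osd osd_op_queue wpq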

Thanks for the help, it's been really useful, and I know a little bit more
about Ceph :)

Gauvain



> ___
> Clyso GmbH - Ceph Foundation Member
>
> Am 21.03.23 um 12:51 schrieb Gauvain Pocentek:
>
> (adding back the list)
>
> On Tue, Mar 21, 2023 at 11:25 AM Joachim Kraftmayer <
> joachim.kraftma...@clyso.com> wrote:
>
>> I added the questions and answers below.
>>
>> ___
>> Best Regards,
>> Joachim Kraftmayer
>> CEO | Clyso GmbH
>>
>> Clyso GmbH
>> p: +49 89 21 55 23 91 2
>> a: Loristraße 8 | 80335 München | Germany
>> w: https://clyso.com | e: joachim.kraftma...@clyso.com
>>
>> We are hiring: https://www.clyso.com/jobs/
>> ---
>> CEO: Dipl. Inf. (FH) Joachim Kraftmayer
>> Unternehmenssitz: Utting am Ammersee
>> Handelsregister beim Amtsgericht: Augsburg
>> Handelsregister-Nummer: HRB 25866
>> USt. ID-Nr.: DE275430677
>>
>> Am 21.03.23 um 11:14 schrieb Gauvain Pocentek:
>>
>> Hi Joachim,
>>
>>
>> On Tue, Mar 21, 2023 at 10:13 AM Joachim Kraftmayer <
>> joachim.kraftma...@clyso.com> wrote:
>>
>>> Which Ceph version are you running, is mclock active?
>>>
>>>
>> We're using Quincy (17.2.5), upgraded step by step from Luminous if I
>> remember correctly.
>>
>> Did you recreate the OSDs? If yes, at which version?
>>
>
> I actually don't remember all the history, but I think we added the HDD
> nodes while running Pacific.
>
>
>
>>
>> mclock seems active, set to the high_client_ops profile. HDD OSDs have very
>> different settings for max capacity iops:
>>
>> osd.137basic osd_mclock_max_capacity_iops_hdd
>>  929.763899
>> osd.161basic osd_mclock_max_capacity_iops_hdd
>>  4754.250946
>> osd.222basic osd_mclock_max_capacity_iops_hdd
>>  540.016984
>> osd.281basic osd_mclock_max_capacity_iops_hdd
>>  1029.193945
>> osd.282basic osd_mclock_max_capacity_iops_hdd
>>  1061.762870
>> osd.283basic osd_mclock_max_capacity_iops_hdd
>>  462.984562
>>
>> We haven't set those explicitly, could they be the reason of the slow
>> recovery?
>>
>> I recommend disabling mclock for now, and yes, we have seen slow recovery
>> caused by mclock.
>>
>
> Stupid question: how do you do that? I've looked through the docs but
> could only find information about changing the settings.
>
>
>>
>>
>> Bonus question: does ceph set that itself?
>>
>> Yes, and if you have a setup with HDD + SSD (DB & WAL) the discovery does
>> not work in the right way.
>>
>
> Good to know!
>
>
> Gauvain
>
>
>>
>> Thanks!
>>
>> Gauvain
>>
>>
>>
>>
>>> Joachim
>>>
>>> ___
>>> Clyso GmbH - Ceph Foundation Member
>>>
>>> Am 21.03.23 um 06:53 schrieb Gauvain Pocentek:
>>> > Hello all,
>>> >
>>> > We have an EC (4+2) pool for RGW data, with HDDs + SSDs for WAL/DB.
>>> This
>>> > pool spans 9 servers, each with 12 disks of 16 TB. About 10 days ago we
>>> lost a
>>> > server and we've removed its OSDs from the cluster. Ceph has started to
>>> > remap and backfill as expected, but the process has been getting
>>> slower and
>>> > slower. Today the recovery rate is around 12 MiB/s and 10 objects/s.
>>> All
>>> > the remaining unclean PGs are backfilling:
>>> >
>>> >data:
>>> >  volumes: 1/1 healthy
>>> >  pools:   14 pools, 14497 pgs
>>> >  objects: 192.38M objects, 380 TiB
>>> >  usage:   764 TiB used, 1.3 PiB / 2.1 PiB avail
>>> >  pgs: 771559/1065561630 objects degraded (0.072%)
>>> >   1215899/1065561630 objects misplaced (0.114%)
>>> >   14428 active+clean
>>> >   50active+undersized+degraded+remapped+backfilling
>>> >   18active+remapped+backfilling
>>> >   1 active+clean+scrubbing+deep
>>> >
>>> > We've checked the health of the remaining servers, and everything looks
>>> > fine (CPU/RAM/network/disks).
>>> >
>>> > Any hints on what could be happening?
>>> >
>>> > Thank you,
>>> > Gauvain
>>> > ___
>>> > ceph-users mailing list -- ceph-users@ceph.io
>>> > To unsubscribe send an email to ceph-users-le...@ceph.io
>>>
>>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Very slow backfilling/remapping of EC pool PGs

2023-03-21 Thread Gauvain Pocentek
(adding back the list)

On Tue, Mar 21, 2023 at 11:25 AM Joachim Kraftmayer <
joachim.kraftma...@clyso.com> wrote:

> I added the questions and answers below.
>
> ___
> Best Regards,
> Joachim Kraftmayer
> CEO | Clyso GmbH
>
> Clyso GmbH
> p: +49 89 21 55 23 91 2
> a: Loristraße 8 | 80335 München | Germany
> w: https://clyso.com | e: joachim.kraftma...@clyso.com
>
> We are hiring: https://www.clyso.com/jobs/
> ---
> CEO: Dipl. Inf. (FH) Joachim Kraftmayer
> Unternehmenssitz: Utting am Ammersee
> Handelsregister beim Amtsgericht: Augsburg
> Handelsregister-Nummer: HRB 25866
> USt. ID-Nr.: DE275430677
>
> Am 21.03.23 um 11:14 schrieb Gauvain Pocentek:
>
> Hi Joachim,
>
>
> On Tue, Mar 21, 2023 at 10:13 AM Joachim Kraftmayer <
> joachim.kraftma...@clyso.com> wrote:
>
>> Which Ceph version are you running, is mclock active?
>>
>>
> We're using Quincy (17.2.5), upgraded step by step from Luminous if I
> remember correctly.
>
> Did you recreate the OSDs? If yes, at which version?
>

I actually don't remember all the history, but I think we added the HDD
nodes while running Pacific.



>
> mclock seems active, set to the high_client_ops profile. HDD OSDs have very
> different settings for max capacity iops:
>
> osd.137basic osd_mclock_max_capacity_iops_hdd
>  929.763899
> osd.161basic osd_mclock_max_capacity_iops_hdd
>  4754.250946
> osd.222basic osd_mclock_max_capacity_iops_hdd
>  540.016984
> osd.281basic osd_mclock_max_capacity_iops_hdd
>  1029.193945
> osd.282basic osd_mclock_max_capacity_iops_hdd
>  1061.762870
> osd.283basic osd_mclock_max_capacity_iops_hdd
>  462.984562
>
> We haven't set those explicitly, could they be the reason of the slow
> recovery?
>
> I recommend disabling mclock for now, and yes, we have seen slow recovery
> caused by mclock.
>

Stupid question: how do you do that? I've looked through the docs but could
only find information about changing the settings.


>
>
> Bonus question: does ceph set that itself?
>
> Yes, and if you have a setup with HDD + SSD (DB & WAL) the discovery does
> not work in the right way.
>

Good to know!


Gauvain


>
> Thanks!
>
> Gauvain
>
>
>
>
>> Joachim
>>
>> ___
>> Clyso GmbH - Ceph Foundation Member
>>
>> Am 21.03.23 um 06:53 schrieb Gauvain Pocentek:
>> > Hello all,
>> >
>> > We have an EC (4+2) pool for RGW data, with HDDs + SSDs for WAL/DB. This
>> > pool spans 9 servers, each with 12 disks of 16 TB. About 10 days ago we
>> lost a
>> > server and we've removed its OSDs from the cluster. Ceph has started to
>> > remap and backfill as expected, but the process has been getting slower
>> and
>> > slower. Today the recovery rate is around 12 MiB/s and 10 objects/s. All
>> > the remaining unclean PGs are backfilling:
>> >
>> >data:
>> >  volumes: 1/1 healthy
>> >  pools:   14 pools, 14497 pgs
>> >  objects: 192.38M objects, 380 TiB
>> >  usage:   764 TiB used, 1.3 PiB / 2.1 PiB avail
>> >  pgs: 771559/1065561630 objects degraded (0.072%)
>> >   1215899/1065561630 objects misplaced (0.114%)
>> >   14428 active+clean
>> >   50active+undersized+degraded+remapped+backfilling
>> >   18active+remapped+backfilling
>> >   1 active+clean+scrubbing+deep
>> >
>> > We've checked the health of the remaining servers, and everything looks
>> > fine (CPU/RAM/network/disks).
>> >
>> > Any hints on what could be happening?
>> >
>> > Thank you,
>> > Gauvain
>> > ___
>> > ceph-users mailing list -- ceph-users@ceph.io
>> > To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Very slow backfilling/remapping of EC pool PGs

2023-03-20 Thread Gauvain Pocentek
Hello all,

We have an EC (4+2) pool for RGW data, with HDDs + SSDs for WAL/DB. This
pool spans 9 servers, each with 12 disks of 16 TB. About 10 days ago we lost a
server and we've removed its OSDs from the cluster. Ceph has started to
remap and backfill as expected, but the process has been getting slower and
slower. Today the recovery rate is around 12 MiB/s and 10 objects/s. All
the remaining unclean PGs are backfilling:

  data:
volumes: 1/1 healthy
pools:   14 pools, 14497 pgs
objects: 192.38M objects, 380 TiB
usage:   764 TiB used, 1.3 PiB / 2.1 PiB avail
pgs: 771559/1065561630 objects degraded (0.072%)
 1215899/1065561630 objects misplaced (0.114%)
 14428 active+clean
 50active+undersized+degraded+remapped+backfilling
 18active+remapped+backfilling
 1 active+clean+scrubbing+deep

We've checked the health of the remaining servers, and everything looks
fine (CPU/RAM/network/disks).

Any hints on what could be happening?

Thank you,
Gauvain
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Limited set of permissions for an RGW user (S3)

2023-02-13 Thread Gauvain Pocentek
Hi list,

A little bit of background: we provide S3 buckets using RGW (running
quincy), but users are not allowed to manage their buckets, just read and
write objects in them. Buckets are created by an admin user, and read/write
permissions are given to end users using S3 bucket policies. We set the
users quota to 0 for everything to forbid them to create buckets. This is
not really scalable and a bit annoying for the users.
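
To illustrate the current workflow (bucket, user and endpoint names below
are made up), the per-bucket grant is a plain S3 bucket policy applied by
the admin user that owns the bucket:

cat > policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {"AWS": ["arn:aws:iam:::user/enduser"]},
    "Action": ["s3:ListBucket", "s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
    "Resource": ["arn:aws:s3:::examplebucket", "arn:aws:s3:::examplebucket/*"]
  }]
}
EOF
aws --endpoint-url https://rgw.example.com s3api put-bucket-policy \
    --bucket examplebucket --policy file://policy.json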

So we are trying to find a solution to allow users to create their own
buckets but with a limited set of APIs available (no policy change for
example).

The ceph doc says that policies cannot be applied on users, groups or roles
yet. Is there any other way to achieve this?

Any feedback will be appreciated.

Thanks!
Gauvain
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Slow OSD startup and slow ops

2022-10-17 Thread Gauvain Pocentek
Hello,

On Fri, Sep 30, 2022 at 8:12 AM Gauvain Pocentek 
wrote:

> Hi Stefan,
>
> Thanks for your feedback!
>
>
> On Thu, Sep 29, 2022 at 10:28 AM Stefan Kooman  wrote:
>
>> On 9/26/22 18:04, Gauvain Pocentek wrote:
>>
>> >
>> >
>> > We are running a Ceph Octopus (15.2.16) cluster with similar
>> > configuration. We have *a lot* of slow ops when starting OSDs. Also
>> > during peering. When the OSDs start they consume 100% CPU for up to
>> > ~ 10
>> > seconds, and after that consume 200% for a minute or more. During
>> that
>> > time the OSDs perform a compaction. You should be able to find this
>> in
>> > the OSD logs if it's the same in your case. After some time the OSDs are
>> > done
>> > initializing and starting the boot process. As soon as they boot up
>> and
>> > start peering the slow ops start to kick in. Lots of
>> "transitioning to
>> > Primary" and "transitioning to Stray" logging. Some time later the
>> OSD
>> > becomes "active". While the OSD is busy with peering it's also busy
>> > compacting. As I also see RocksDB compaction logging. So it might be
>> > due
>> > to RocksDB compactions impacting OSD performance while it's already
>> > busy
>> > becoming primary (and/or secondary/tertiary) for its PGs.
>> >
>> > We had norecover, nobackfill, norebalance active when booting the
>> OSDs.
>> >
>> > So, it might just take a long time to do RocksDB compaction. In this
>> > case it might be better to do all needed RocksDB compactions, and
>> then
>> > start booting. So, what might help is to set "ceph osd set noup".
>> This
>> > prevents the OSD from becoming active, then wait for the RocksDB
>> > compactions, and after that unset the flag.
>> >
>> > If you try this, please let me know how it goes.
>>
>> Last night we had storage switch maintenance. We turned off 2/3 of the
>> cluster and back on (one failure domain at a time). We used the "noup"
>> flag to prevent the OSDs from booting. Waited for ~ 10 minutes. That was
>> the time it took for the last OSD to finish its RocksDB compactions. At
>> that point we unset the "noup" flag and almost all OSDs came back
>> online instantly. This resulted in some slow ops, but ~ 30 times less
>> than before, and only for ~ 5 seconds. With a bit more planning you can
>> set the "noup" flag to individual OSDs. And then, in a loop with some
>> sleep, unset it per OSD. This might give less stress during peering.
>> This is however micro management. Ideally this "noup" step should not be
>> needed at all. The, maybe naive solution, would be to have the OSD
>> refrain itself from becoming active when it's in the bootup phase and
>> busy going through a whole batch of RocksDB compaction events. I'm
>> CC-ing Igor to see if he can comment on this.
>>
>> @Gauvain: Compared to your other clusters, does this cluster run more
>> Ceph services than the others? Your other clusters might
>> have *way* less OMAP/metadata than the cluster giving you issues.
>>
>
> This cluster runs the same services as other clusters.
>
> It looks like we are hitting this bug:
> https://tracker.ceph.com/issues/53729. There seems to be a lot of
> duplicated op logs (I'm still trying to understand what that really is),
> huge memory usage (which hasn't been a problem because of the size of our
> servers, we have a lot of RAM), and so far no way to clean that online with
> Pacific. This blog post explains very clearly how to check if you are
> impacted: https://www.clyso.com/blog/osds-with-unlimited-ram-growth/
>
> All the clusters seem to be impacted, but that specific one shows worse
> signs.
>
> We are now looking into the offline cleanup. We're taking a lot of
> precautions because this is a production cluster and the problems have
> already impacted users.
>

After more analysis and testing we are definitely hitting the oplog dups
bug. That is causing the very slow startup of OSDs and some instabilities
on the cluster.

Since the fix is not yet released for Pacific, we have started to manually
clean up the OSDs.

We have compiled ceph-objectstore-tool from the pacific branch of the git
repo to get the `--op trim-pg-log-dups` feature and we're now running this
on all the OSDs:

(mon) for i in norebalance norecover nobackfill; 
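
In outline, the procedure looks roughly like the following. This is a sketch
rather than the exact commands: paths assume default, non-containerized OSD
data directories, the OSD must be stopped while the tool runs, and the clyso
blog post linked earlier describes the procedure in more detail.

# on a mon: pause data movement while OSDs are taken down one at a time
for flag in norebalance norecover nobackfill; do ceph osd set "$flag"; done

# on each OSD host, one OSD at a time (OSD_ID is a placeholder)
systemctl stop ceph-osd@"$OSD_ID"
for pg in $(ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-"$OSD_ID" --op list-pgs); do
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-"$OSD_ID" \
      --pgid "$pg" --op trim-pg-log-dups
done
systemctl start ceph-osd@"$OSD_ID"

# once all OSDs have been processed, on a mon:
for flag in norebalance norecover nobackfill; do ceph osd unset "$flag"; done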

[ceph-users] Re: Slow OSD startup and slow ops

2022-09-29 Thread Gauvain Pocentek
Hi Stefan,

Thanks for your feedback!


On Thu, Sep 29, 2022 at 10:28 AM Stefan Kooman  wrote:

> On 9/26/22 18:04, Gauvain Pocentek wrote:
>
> >
> >
> > We are running a Ceph Octopus (15.2.16) cluster with similar
> > configuration. We have *a lot* of slow ops when starting OSDs. Also
> > during peering. When the OSDs start they consume 100% CPU for up to
> > ~ 10
> > seconds, and after that consume 200% for a minute or more. During
> that
> > time the OSDs perform a compaction. You should be able to find this
> in
> > the OSD logs if it's the same in your case. After some time the OSDs are
> > done
> > initializing and starting the boot process. As soon as they boot up
> and
> > start peering the slow ops start to kick in. Lot's of "transitioning
> to
> > Primary" and "transitioning to Stray" logging. Some time later the
> OSD
> > becomes "active". While the OSD is busy with peering it's also busy
> > compacting. As I also see RocksDB compaction logging. So it might be
> > due
> > to RocksDB compactions impacting OSD performance while it's already
> > busy
> > becoming primary (and/or secondary/tertiary) for its PGs.
> >
> > We had norecover, nobackfill, norebalance active when booting the
> OSDs.
> >
> > So, it might just take a long time to do RocksDB compaction. In this
> > case it might be better to do all needed RocksDB compactions, and
> then
> > start booting. So, what might help is to set "ceph osd set noup".
> This
> > prevents the OSD from becoming active, then wait for the RocksDB
> > compactions, and after that unset the flag.
> >
> > If you try this, please let me know how it goes.
>
> Last night we had storage switch maintenance. We turned off 2/3 of the
> cluster and back on (one failure domain at a time). We used the "noup"
> flag to prevent the OSDs from booting. Waited for ~ 10 minutes. That was
> the time it took for the last OSD to finish its RocksDB compactions. At
> that point we unset the "noup" flag and almost all OSDs came back
> online instantly. This resulted in some slow ops, but ~ 30 times less
> than before, and only for ~ 5 seconds. With a bit more planning you can
> set the "noup" flag to individual OSDs. And then, in a loop with some
> sleep, unset it per OSD. This might give less stress during peering.
> This is however micro management. Ideally this "noup" step should not be
> needed at all. The, maybe naive solution, would be to have the OSD
> refrain itself from becoming active when it's in the bootup phase and
> busy going through a whole batch of RocksDB compaction events. I'm
> CC-ing Igor to see if he can comment on this.
>
> @Gauvain: Compared to your other clusters, does this cluster run more
> Ceph services than the others? Your other clusters might
> have *way* less OMAP/metadata than the cluster giving you issues.
>

This cluster runs the same services as other clusters.

It looks like we are hitting this bug: https://tracker.ceph.com/issues/53729.
There seems to be a lot of duplicated op logs (I'm still trying to
understand what that really is), huge memory usage (which hasn't been a
problem because of the size of our servers, we have a lot of RAM), and so
far no way to clean that online with Pacific. This blog post explains very
clearly how to check if you are impacted:
https://www.clyso.com/blog/osds-with-unlimited-ram-growth/

All the clusters seem to be impacted, but that specific one shows worse
signs.

We are now looking into the offline cleanup. We're taking a lot of
precautions because this is a production cluster and the problems have
already impacted users.

Gauvain
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Slow OSD startup and slow ops

2022-09-26 Thread Gauvain Pocentek
Hello Stefan,

Thank you for your answers.

On Thu, Sep 22, 2022 at 5:54 PM Stefan Kooman  wrote:

> Hi,
>
> On 9/21/22 18:00, Gauvain Pocentek wrote:
> > Hello all,
> >
> > We are running several Ceph clusters and are facing an issue on one of
> > them, we would appreciate some input on the problems we're seeing.
> >
> > We run Ceph in containers on Centos Stream 8, and we deploy using
> > ceph-ansible. While upgrading ceph from 16.2.7 to 16.2.10, we noticed
> that
> > OSDs were taking a very long time to restart on one of the clusters.
> (Other
> > clusters were not impacted at all.)
>
> Are the other clusters of similar size?
>

We have at least one cluster that is roughly the same size. It has not been
upgraded yet but restarting the OSDs doesn't create any issues.



> The OSD startup was so slow sometimes
> > that we ended up having slow ops, with 1 or 2 pg stuck in a peering
> state.
> > We've interrupted the upgrade and the cluster runs fine now, although we
> > have seen 1 OSD flapping recently, having trouble coming back to life.
> >
> > We've checked a lot of things and read a lot of mails from this list, and
> > here are some info:
> >
> > * this cluster has RBD pools for OpenStack and RGW pools; everything is
> > replicated x 3, except the RGW data pool which is EC 4+2
> > * we haven't found any hardware related issues; we run fully on SSDs and
> > they are all in good shape, no network issue, RAM and CPU are available
> on
> > all OSD hosts
> > * bluestore with an LVM collocated setup
> > * we have seen the slow restart with almost all the OSDs we've upgraded
> > (100 out of 350)
> > * on restart the ceph-osd process runs at 100% CPU but we haven't seen
> > anything weird on the host
>
> Are the containers restricted to use a certain amount of CPU? Do the
> OSDs, after ~ 10-20 seconds increase their CPU usage to 200% (if so this
> is probably because of the rocksdb option max_background_compactions = 2).
>

This is actually a good point. We run the containers with --cpus=2. We also
had a couple of incidents where OSDs started to act up on nodes where VMs were
running CPU-intensive workloads (we have a hyperconverged setup with
OpenStack). So there's definitely something going on there.

I haven't had the opportunity to do a new restart to check more about the
CPU usage, but I hope to do that this week.


>
> > * no DB spillover
> > * we have other clusters with the same hardware, and we don't see
> problems
> > there
> >
> > The only thing that we found that looks suspicious is the number of op
> logs
> > for the PGs of the RGW index pool. `osd_max_pg_log_entries` is set to 10k
> > but `ceph pg dump` show PGs with more than 100k logs (the largest one
> has >
> > 400k logs).
> >
> > Could this be the reason for the slow startup of OSDs? If so is there a
> way
> > to trim these logs without too much impact on the cluster?
>
> Not sure. We have ~ 2K logs per PG.
>
> >
> > Let me know if additional info or logs are needed.
>
> Do you have a log of slow ops and osd logs?
>

I will get more logs when I restart an OSD this week. What log levels for
bluestore/rocksdb would you recommend?


>
> Do you have any non-standard configuration for the daemons? I.e. ceph
> daemon osd.$id config diff
>

Nothing non-standard.


>
> We are running a Ceph Octopus (15.2.16) cluster with similar
> configuration. We have *a lot* of slow ops when starting OSDs. Also
> during peering. When the OSDs start they consume 100% CPU for up to ~ 10
> seconds, and after that consume 200% for a minute or more. During that
> time the OSDs perform a compaction. You should be able to find this in
> the OSD logs if it's the same in your case. After some time the OSDs are done
> initializing and starting the boot process. As soon as they boot up and
> start peering the slow ops start to kick in. Lots of "transitioning to
> Primary" and "transitioning to Stray" logging. Some time later the OSD
> becomes "active". While the OSD is busy with peering it's also busy
> compacting. As I also see RocksDB compaction logging. So it might be due
> to RocksDB compactions impacting OSD performance while it's already busy
> becoming primary (and/or secondary/tertiary) for its PGs.
>
> We had norecover, nobackfill, norebalance active when booting the OSDs.
>
> So, it might just take a long time to do RocksDB compaction. In this
> case it might be better to do all needed RocksDB compactions, and then
> start booting. So, what might help is to set "ceph osd set noup". This
> prevents the OSD from becoming active, then wait for the RocksDB
> compactions, and after that unset the flag.
>
> If you try this, please let me know how it goes.
>

That sounds like a good thing to try, I'll keep you posted.
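
For anyone following along, the sequence Stefan describes boils down to
something like this (a sketch; add-noup/rm-noup can scope it to individual
OSDs, and osd.42 is just an example):

ceph osd set noup      # restarted OSDs stay down and don't start peering yet
# restart the OSDs, then watch the logs until the RocksDB compactions finish
ceph osd unset noup

# or, per OSD, to limit the blast radius:
ceph osd add-noup osd.42
ceph osd rm-noup osd.42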

Thanks again,
Gauvain
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Slow OSD startup and slow ops

2022-09-21 Thread Gauvain Pocentek
Hello all,

We are running several Ceph clusters and are facing an issue on one of
them, we would appreciate some input on the problems we're seeing.

We run Ceph in containers on Centos Stream 8, and we deploy using
ceph-ansible. While upgrading ceph from 16.2.7 to 16.2.10, we noticed that
OSDs were taking a very long time to restart on one of the clusters. (Other
clusters were not impacted at all.) The OSD startup was so slow sometimes
that we ended up having slow ops, with 1 or 2 pg stuck in a peering state.
We've interrupted the upgrade and the cluster runs fine now, although we
have seen 1 OSD flapping recently, having trouble coming back to life.

We've checked a lot of things and read a lot of mails from this list, and
here are some info:

* this cluster has RBD pools for OpenStack and RGW pools; everything is
replicated x 3, except the RGW data pool which is EC 4+2
* we haven't found any hardware related issues; we run fully on SSDs and
they are all in good shape, no network issue, RAM and CPU are available on
all OSD hosts
* bluestore with an LVM collocated setup
* we have seen the slow restart with almost all the OSDs we've upgraded
(100 out of 350)
* on restart the ceph-osd process runs at 100% CPU but we haven't seen
anything weird on the host
* no DB spillover
* we have other clusters with the same hardware, and we don't see problems
there

The only thing that we found that looks suspicious is the number of op logs
for the PGs of the RGW index pool. `osd_max_pg_log_entries` is set to 10k
but `ceph pg dump` show PGs with more than 100k logs (the largest one has >
400k logs).
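
For reference, this is roughly how the oversized logs show up (a sketch; the
JSON layout of `ceph pg dump` differs a bit between releases):

# list PGs whose log exceeds the 10k osd_max_pg_log_entries limit
ceph pg dump --format json 2>/dev/null \
  | jq -r '.pg_map.pg_stats[] | select(.log_size > 10000) | "\(.pgid) \(.log_size)"' \
  | sort -k2 -nr | head -20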

Could this be the reason for the slow startup of OSDs? If so is there a way
to trim these logs without too much impact on the cluster?

Let me know if additional info or logs are needed.

BR,
Gauvain
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io