[ceph-users] How to specify to only build ceph-radosgw package from source?

2023-05-31 Thread huy nguyen
Hi,

I usually install the SRPM and then build from ceph.spec like this:

rpmbuild -bb /root/rpmbuild/SPECS/ceph.spec --without ceph_test_package

But it takes a long time and produces many packages that I don't need. Is there 
a way to optimize this build process so that it only builds the packages I need, 
for example ceph-radosgw?
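
For reference, the closest workaround I can think of (an untested sketch; it skips 
RPM packaging entirely and assumes a standard ceph source layout with a "radosgw" 
cmake target) would be to prep the sources and build only that target:

rpmbuild -bp /root/rpmbuild/SPECS/ceph.spec
cd /root/rpmbuild/BUILD/ceph-*/
./do_cmake.sh -DCMAKE_BUILD_TYPE=RelWithDebInfo
cmake --build build --target radosgw -j"$(nproc)"

But that only produces the binary, not the ceph-radosgw RPM.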

Thanks.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Small RGW objects and RADOS 64KB minimun size

2023-05-31 Thread David Oganezov
Hey Josh!

Sorry for necroing this thread, but my team is currently running a Pacific 
cluster that was updated from Nautilus, and we are rebuilding hosts one by one 
to reclaim the space in the OSDs.
We might have missed it, but was the automated rolling format with cephadm 
eventually implemented?

Thanks!
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS metadata pool grows by two orders of magnitude while trimming (?) snapshots

2023-05-31 Thread Janek Bevendorff
Hi Dan,

Sorry, I meant Pacific. The version number was correct, the name wasn’t. ;-)

Yes, I have five active MDS and five hot standbys. Static pinning isn’t really 
an option for our directory structure, so we’re using ephemeral pins.
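
For reference, that is the usual distributed ephemeral pin xattr, set on the 
top-level directories (the path here is illustrative):

setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/storage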

Janek


> On 31. May 2023, at 18:44, Dan van der Ster  wrote:
> 
> Hi Janek,
> 
> A few questions and suggestions:
> - Do you have multi-active MDS? In my experience back in nautilus if
> something went wrong with mds export between mds's, the mds
> log/journal could grow unbounded like you observed until that export
> work was done. Static pinning could help if you are not using it
> already.
> - You definitely should disable the pg autoscaling on the mds metadata
> pool (and other pools imho) -- decide the correct number of PGs for
> your pools and leave it.
> - Which version are you running? You said nautilus but wrote 16.2.12
> which is pacific... If you're running nautilus v14 then I recommend
> disabling pg autoscaling completely -- IIRC it does not have a fix for
> the OSD memory growth "pg dup" issue which can occur during PG
> splitting/merging.
> 
> Cheers, Dan
> 
> __
> Clyso GmbH | https://www.clyso.com
> 
> 
> On Wed, May 31, 2023 at 4:03 AM Janek Bevendorff
>  wrote:
>> 
>> I checked our logs from yesterday, the PG scaling only started today,
>> perhaps triggered by the snapshot trimming. I disabled it, but it didn't
>> change anything.
>> 
>> What did change something was restarting the MDS one by one, which had
>> got far behind with trimming their caches and with a bunch of stuck ops.
>> After restarting them, the pool size decreased quickly to 600GiB. I
>> noticed the same behaviour yesterday, though yesterday it was more
>> extreme and restarting the MDS took about an hour and I had to increase
>> the heartbeat timeout. This time, it took only half a minute per MDS,
>> probably because it wasn't that extreme yet and I had reduced the
>> maximum cache size. Still looks like a bug to me.
>> 
>> 
>> On 31/05/2023 11:18, Janek Bevendorff wrote:
>>> Another thing I just noticed is that the auto-scaler is trying to
>>> scale the pool down to 128 PGs. That could also result in large
>>> fluctuations, but this big?? In any case, it looks like a bug to me.
>>> Whatever is happening here, there should be safeguards with regard to
>>> the pool's capacity.
>>> 
>>> Here's the current state of the pool in ceph osd pool ls detail:
>>> 
>>> pool 110 'cephfs.storage.meta' replicated size 4 min_size 3 crush_rule
>>> 5 object_hash rjenkins pg_num 495 pgp_num 471 pg_num_target 128
>>> pgp_num_target 128 autoscale_mode on last_change 1359013 lfor
>>> 0/1358620/1358618 flags hashpspool,nodelete stripe_width 0
>>> expected_num_objects 300 recovery_op_priority 5 recovery_priority
>>> 2 application cephfs
>>> 
>>> Janek
>>> 
>>> 
>>> On 31/05/2023 10:10, Janek Bevendorff wrote:
 Forgot to add: We are still on Nautilus (16.2.12).
 
 
 On 31/05/2023 09:53, Janek Bevendorff wrote:
> Hi,
> 
> Perhaps this is a known issue and I was simply too dumb to find it,
> but we are having problems with our CephFS metadata pool filling up
> over night.
> 
> Our cluster has a small SSD pool of around 15TB which hosts our
> CephFS metadata pool. Usually, that's more than enough. The normal
> size of the pool ranges between 200 and 800GiB (which is quite a lot
> of fluctuation already). Yesterday, we suddenly had the pool
> fill up entirely and the only way to fix it was to add more
> capacity. I increased the pool size to 18TB by adding more SSDs and
> could resolve the problem. After a couple of hours of reshuffling,
> the pool size finally went back to 230GiB.
> 
> But then we had another fill-up tonight to 7.6TiB. Luckily, I had
> adjusted the weights so that not all disks could fill up entirely
> like last time, so it ended there.
> 
> I wasn't really able to identify the problem yesterday, but under
> the more controllable scenario today, I could check the MDS logs at
> debug_mds=10 and to me it seems like the problem is caused by
> snapshot trimming. The logs contain a lot of snapshot-related
> messages for paths that haven't been touched in a long time. The
> messages all look something like this:
> 
> May 31 09:16:48 XXX ceph-mds[2947525]: 2023-05-31T09:16:48.292+0200
> 7f7ce1bd9700 10 mds.1.cache.ino(0x1000b3c3670) add_client_cap first
> cap, joining realm snaprealm(0x100 seq 1b1c lc 1b1b cr 1
> b1b cps 2 snaps={185f=snap(185f 0x100 'monthly_20221201'
> 2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x100
> 'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941
> 0x100 ...
> 
> May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.396+0200
> 7f0e6a6ca700 10 mds.0.cache | |__ 3 rep [dir
> 0x10218fe.10101* 

[ceph-users] Re: reef v18.1.0 QE Validation status

2023-05-31 Thread Yuri Weinstein
Casey

I will rerun rgw and we will see.
Stay tuned.

On Wed, May 31, 2023 at 10:27 AM Casey Bodley  wrote:
>
> On Tue, May 30, 2023 at 12:54 PM Yuri Weinstein  wrote:
> >
> > Details of this release are summarized here:
> >
> > https://tracker.ceph.com/issues/61515#note-1
> > Release Notes - TBD
> >
> > Seeking approvals/reviews for:
> >
> > rados - Neha, Radek, Travis, Ernesto, Adam King (we still have to
> > merge https://github.com/ceph/ceph/pull/51788 for
> > the core)
> > rgw - Casey
>
> the rgw suite had several new test_rgw_throttle.sh failures that i
> haven't seen before:
>
> qa/workunits/rgw/test_rgw_throttle.sh: line 3: ceph_test_rgw_throttle:
> command not found
>
> those only show up on rhel8 jobs, and none of your later reef runs fail this 
> way
>
> Yuri, is it possible that the suite-branch was mixed up somehow? the
> ceph "sha1: be098f4642e7d4bbdc3f418c5ad703e23d1e9fe0" didn't match the
> workunit "sha1: 4a02f3f496d9039326c49bf1fbe140388cd2f619"
>
> > fs - Venky
> > orch - Adam King
> > rbd - Ilya
> > krbd - Ilya
> > upgrade/octopus-x - deprecated
> > upgrade/pacific-x - known issues, Ilya, Laura?
> > upgrade/reef-p2p - N/A
> > clients upgrades - not run yet
> > powercycle - Brad
> > ceph-volume - in progress
> >
> > Please reply to this email with approval and/or trackers of known
> > issues/PRs to address them.
> >
> > gibba upgrade was done and will need to be done again this week.
> > LRC upgrade TBD
> >
> > TIA
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: reef v18.1.0 QE Validation status

2023-05-31 Thread Adam King
Orch approved. The orch/cephadm tests looked good and the orch/rook tests
are known to not work currently.

On Tue, May 30, 2023 at 12:54 PM Yuri Weinstein  wrote:

> Details of this release are summarized here:
>
> https://tracker.ceph.com/issues/61515#note-1
> Release Notes - TBD
>
> Seeking approvals/reviews for:
>
> rados - Neha, Radek, Travis, Ernesto, Adam King (we still have to
> merge https://github.com/ceph/ceph/pull/51788 for
> the core)
> rgw - Casey
> fs - Venky
> orch - Adam King
> rbd - Ilya
> krbd - Ilya
> upgrade/octopus-x - deprecated
> upgrade/pacific-x - known issues, Ilya, Laura?
> upgrade/reef-p2p - N/A
> clients upgrades - not run yet
> powercycle - Brad
> ceph-volume - in progress
>
> Please reply to this email with approval and/or trackers of known
> issues/PRs to address them.
>
> gibba upgrade was done and will need to be done again this week.
> LRC upgrade TBD
>
> TIA
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: reef v18.1.0 QE Validation status

2023-05-31 Thread Casey Bodley
On Tue, May 30, 2023 at 12:54 PM Yuri Weinstein  wrote:
>
> Details of this release are summarized here:
>
> https://tracker.ceph.com/issues/61515#note-1
> Release Notes - TBD
>
> Seeking approvals/reviews for:
>
> rados - Neha, Radek, Travis, Ernesto, Adam King (we still have to
> merge https://github.com/ceph/ceph/pull/51788 for
> the core)
> rgw - Casey

the rgw suite had several new test_rgw_throttle.sh failures that i
haven't seen before:

qa/workunits/rgw/test_rgw_throttle.sh: line 3: ceph_test_rgw_throttle:
command not found

those only show up on rhel8 jobs, and none of your later reef runs fail this way

Yuri, is it possible that the suite-branch was mixed up somehow? the
ceph "sha1: be098f4642e7d4bbdc3f418c5ad703e23d1e9fe0" didn't match the
workunit "sha1: 4a02f3f496d9039326c49bf1fbe140388cd2f619"

> fs - Venky
> orch - Adam King
> rbd - Ilya
> krbd - Ilya
> upgrade/octopus-x - deprecated
> upgrade/pacific-x - known issues, Ilya, Laura?
> upgrade/reef-p2p - N/A
> clients upgrades - not run yet
> powercycle - Brad
> ceph-volume - in progress
>
> Please reply to this email with approval and/or trackers of known
> issues/PRs to address them.
>
> gibba upgrade was done and will need to be done again this week.
> LRC upgrade TBD
>
> TIA
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] bucket notification retries

2023-05-31 Thread Yuval Lifshitz
Dear Community,
I would like to collect your feedback on this issue. This is a followup
from a discussion that started in the RGW refactoring meeting on 31-May-23
(thanks @Krunal Chheda  for bringing up this
topic!).

Currently persistent notifications are retried indefinitely.
The only limiting mechanism that exists is that all notifications to a
specific topic are stored in one RADOS object (of size 128MB).
Assuming notifications are ~1KB at most, this would give us at least 128K
notifications that can wait in the queue.
When the queue fills up (e.g. kafka broker is down for 20 minutes, we are
sending ~100 notifications per second) we start sending "slow down" replies
to the client, and in this case the S3 operation will not be performed.
This means that, for example, an outage of the kafka system would
eventually cause an outage of our service. Note that this may also be a
result of a misconfiguration of the kafka broker, or decommissioning of a
broker.

To avoid that, we propose several options:
* use a fifo instead of a queue. This would allow us to hold more than 128K
messages - survive longer broker outages, and at a higher message rate.
there should still probably be a limit set on the size of the fifo
* define maximum number of retries allowed for a notification
* define maximum time the notification may stay in the queue before it is
removed

We should probably start with these definitions done as topic attributes,
reflecting our delivery guarantees for this specific destination.
Will try to capture the results of the discussion in this tracker:
https://tracker.ceph.com/issues/61532
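
As a purely illustrative sketch (the retry attribute names below are proposals, 
not an existing RGW interface), such limits could be expressed at topic creation 
time:

aws --endpoint-url http://rgw.example.com sns create-topic --name mytopic \
    --attributes '{"push-endpoint": "kafka://broker:9092", "persistent": "true", "max_retries": "5", "retry_time_to_live": "600"}'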

Thanks,

Yuval
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS metadata pool grows by two orders of magnitude while trimming (?) snapshots

2023-05-31 Thread Dan van der Ster
Hi Janek,

A few questions and suggestions:
- Do you have multi-active MDS? In my experience back in nautilus if
something went wrong with mds export between mds's, the mds
log/journal could grow unbounded like you observed until that export
work was done. Static pinning could help if you are not using it
already.
- You definitely should disable the pg autoscaling on the mds metadata
pool (and other pools imho) -- decide the correct number of PGs for
your pools and leave it (see the example below).
- Which version are you running? You said nautilus but wrote 16.2.12
which is pacific... If you're running nautilus v14 then I recommend
disabling pg autoscaling completely -- IIRC it does not have a fix for
the OSD memory growth "pg dup" issue which can occur during PG
splitting/merging.
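
For example, using the metadata pool name from this thread:

ceph osd pool set cephfs.storage.meta pg_autoscale_mode off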

Cheers, Dan

__
Clyso GmbH | https://www.clyso.com


On Wed, May 31, 2023 at 4:03 AM Janek Bevendorff
 wrote:
>
> I checked our logs from yesterday, the PG scaling only started today,
> perhaps triggered by the snapshot trimming. I disabled it, but it didn't
> change anything.
>
> What did change something was restarting the MDS one by one, which had
> got far behind with trimming their caches and with a bunch of stuck ops.
> After restarting them, the pool size decreased quickly to 600GiB. I
> noticed the same behaviour yesterday, though yesterday it was more
> extreme and restarting the MDS took about an hour and I had to increase
> the heartbeat timeout. This time, it took only half a minute per MDS,
> probably because it wasn't that extreme yet and I had reduced the
> maximum cache size. Still looks like a bug to me.
>
>
> On 31/05/2023 11:18, Janek Bevendorff wrote:
> > Another thing I just noticed is that the auto-scaler is trying to
> > scale the pool down to 128 PGs. That could also result in large
> > fluctuations, but this big?? In any case, it looks like a bug to me.
> > Whatever is happening here, there should be safeguards with regard to
> > the pool's capacity.
> >
> > Here's the current state of the pool in ceph osd pool ls detail:
> >
> > pool 110 'cephfs.storage.meta' replicated size 4 min_size 3 crush_rule
> > 5 object_hash rjenkins pg_num 495 pgp_num 471 pg_num_target 128
> > pgp_num_target 128 autoscale_mode on last_change 1359013 lfor
> > 0/1358620/1358618 flags hashpspool,nodelete stripe_width 0
> > expected_num_objects 300 recovery_op_priority 5 recovery_priority
> > 2 application cephfs
> >
> > Janek
> >
> >
> > On 31/05/2023 10:10, Janek Bevendorff wrote:
> >> Forgot to add: We are still on Nautilus (16.2.12).
> >>
> >>
> >> On 31/05/2023 09:53, Janek Bevendorff wrote:
> >>> Hi,
> >>>
> >>> Perhaps this is a known issue and I was simply too dumb to find it,
> >>> but we are having problems with our CephFS metadata pool filling up
> >>> over night.
> >>>
> >>> Our cluster has a small SSD pool of around 15TB which hosts our
> >>> CephFS metadata pool. Usually, that's more than enough. The normal
> >>> size of the pool ranges between 200 and 800GiB (which is quite a lot
> >>> of fluctuation already). Yesterday, we suddenly had the pool
> >>> fill up entirely and the only way to fix it was to add more
> >>> capacity. I increased the pool size to 18TB by adding more SSDs and
> >>> could resolve the problem. After a couple of hours of reshuffling,
> >>> the pool size finally went back to 230GiB.
> >>>
> >>> But then we had another fill-up tonight to 7.6TiB. Luckily, I had
> >>> adjusted the weights so that not all disks could fill up entirely
> >>> like last time, so it ended there.
> >>>
> >>> I wasn't really able to identify the problem yesterday, but under
> >>> the more controllable scenario today, I could check the MDS logs at
> >>> debug_mds=10 and to me it seems like the problem is caused by
> >>> snapshot trimming. The logs contain a lot of snapshot-related
> >>> messages for paths that haven't been touched in a long time. The
> >>> messages all look something like this:
> >>>
> >>> May 31 09:16:48 XXX ceph-mds[2947525]: 2023-05-31T09:16:48.292+0200
> >>> 7f7ce1bd9700 10 mds.1.cache.ino(0x1000b3c3670) add_client_cap first
> >>> cap, joining realm snaprealm(0x100 seq 1b1c lc 1b1b cr 1
> >>> b1b cps 2 snaps={185f=snap(185f 0x100 'monthly_20221201'
> >>> 2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x100
> >>> 'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941
> >>> 0x100 ...
> >>>
> >>> May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.396+0200
> >>> 7f0e6a6ca700 10 mds.0.cache | |__ 3 rep [dir
> >>> 0x10218fe.10101* /storage/REDACTED/| ptrwaiter=0 request=0
> >>> child=0 frozen=0 subtree=1 replicated=0 dirty=0 waiter=0 authpin=0
> >>> tempexporting=0 0x5607759d9600]
> >>>
> >>> May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.452+0200
> >>> 7f0e6a6ca700 10 mds.0.cache | | | 4 rep [dir
> >>> 0x10ff904.10001010* /storage/REDACTED/| ptrwaiter=0
> >>> request=0 child=0 frozen=0 subtree=1 

[ceph-users] Re: BlueStore fragmentation woes

2023-05-31 Thread Stefan Kooman

On 5/31/23 16:15, Igor Fedotov wrote:


On 31/05/2023 15:26, Stefan Kooman wrote:

On 5/29/23 15:52, Igor Fedotov wrote:

Hi Stefan,

given that allocation probes include every allocation (including 
short 4K ones) your stats look pretty high indeed.


Although you omitted historic probes so it's hard to tell if there is 
negative trend in it..


I did not omit them. We (currently) don't store logs for longer than 7 
days. I will increase the interval in which the probes get created 
(every hour).



Allocation probe contains historic data on its own, e.g.

allocation stats probe 33: cnt: 8148921 frags: 10958186 size: 1704348508>
probe -1: 35168547,  46401246, 1199516209152
probe -3: 27275094,  35681802, 200121712640
probe -5: 34847167,  52539758, 271272230912
probe -9: 44291522,  60025613, 523997483008
probe -17: 10646313,  10646313, 155178434560

in the snippet above probes -1 through -17 are historic data 1 through 
17 days (or more correctly probe attempts) back.


Ah, I just blindly applied your suggested grep keywords, but then you 
don't get the historic probes, here we go:


ceph-osd.55.log.1.gz:2023-05-30T18:42:02.340+0200 7ffb856f8700  0 
bluestore(/var/lib/ceph/osd/ceph-55)  allocation stats probe 15: cnt: 
19940282 frags: 37474984 size: 331549941760
ceph-osd.55.log.1.gz:2023-05-30T18:42:02.340+0200 7ffb856f8700  0 
bluestore(/var/lib/ceph/osd/ceph-55)  probe -1: 19269005,  35807252, 
265695309824
ceph-osd.55.log.1.gz:2023-05-30T18:42:02.340+0200 7ffb856f8700  0 
bluestore(/var/lib/ceph/osd/ceph-55)  probe -3: 18281841,  33452239, 
321271795712
ceph-osd.55.log.1.gz:2023-05-30T18:42:02.340+0200 7ffb856f8700  0 
bluestore(/var/lib/ceph/osd/ceph-55)  probe -7: 21188454,  37298950, 
278389411840
ceph-osd.55.log.1.gz:2023-05-30T18:42:02.340+0200 7ffb856f8700  0 
bluestore(/var/lib/ceph/osd/ceph-55)  probe -15: 20079763,  34770424, 
306357551104
ceph-osd.55.log.1.gz:2023-05-30T18:42:02.340+0200 7ffb856f8700  0 
bluestore(/var/lib/ceph/osd/ceph-55)  probe -31: 0,  0, 0
ceph-osd.66.log.1.gz:2023-05-30T18:42:01.496+0200 7fce80741700  0 
bluestore(/var/lib/ceph/osd/ceph-66)  allocation stats probe 15: cnt: 
18130163 frags: 32547753 size: 286495289344
ceph-osd.66.log.1.gz:2023-05-30T18:42:01.496+0200 7fce80741700  0 
bluestore(/var/lib/ceph/osd/ceph-66)  probe -1: 16671297,  28781657, 
219492544512
ceph-osd.66.log.1.gz:2023-05-30T18:42:01.496+0200 7fce80741700  0 
bluestore(/var/lib/ceph/osd/ceph-66)  probe -3: 16781500,  31159184, 
306530078720
ceph-osd.66.log.1.gz:2023-05-30T18:42:01.496+0200 7fce80741700  0 
bluestore(/var/lib/ceph/osd/ceph-66)  probe -7: 18475332,  33264944, 
266053271552
ceph-osd.66.log.1.gz:2023-05-30T18:42:01.496+0200 7fce80741700  0 
bluestore(/var/lib/ceph/osd/ceph-66)  probe -15: 18799693,  32644833, 
270106509312
ceph-osd.66.log.1.gz:2023-05-30T18:42:01.496+0200 7fce80741700  0 
bluestore(/var/lib/ceph/osd/ceph-66)  probe -31: 0,  0, 0
ceph-osd.67.log.1.gz:2023-05-30T18:42:03.784+0200 7fd3e6ee9700  0 
bluestore(/var/lib/ceph/osd/ceph-67)  allocation stats probe 15: cnt: 
19073940 frags: 39349763 size: 350442409984
ceph-osd.67.log.1.gz:2023-05-30T18:42:03.784+0200 7fd3e6ee9700  0 
bluestore(/var/lib/ceph/osd/ceph-67)  probe -1: 19346357,  40814157, 
291156762624
ceph-osd.67.log.1.gz:2023-05-30T18:42:03.784+0200 7fd3e6ee9700  0 
bluestore(/var/lib/ceph/osd/ceph-67)  probe -3: 18880370,  33120073, 
329544183808
ceph-osd.67.log.1.gz:2023-05-30T18:42:03.784+0200 7fd3e6ee9700  0 
bluestore(/var/lib/ceph/osd/ceph-67)  probe -7: 18550989,  35236438, 
273069948928
ceph-osd.67.log.1.gz:2023-05-30T18:42:03.784+0200 7fd3e6ee9700  0 
bluestore(/var/lib/ceph/osd/ceph-67)  probe -15: 18664247,  36208229, 
327013150720
ceph-osd.67.log.1.gz:2023-05-30T18:42:03.784+0200 7fd3e6ee9700  0 
bluestore(/var/lib/ceph/osd/ceph-67)  probe -31: 0,  0, 0
ceph-osd.68.log.1.gz:2023-05-30T18:42:03.704+0200 7f6bb6c05700  0 
bluestore(/var/lib/ceph/osd/ceph-68)  allocation stats probe 15: cnt: 
24318407 frags: 42340467 size: 324867026944
ceph-osd.68.log.1.gz:2023-05-30T18:42:03.704+0200 7f6bb6c05700  0 
bluestore(/var/lib/ceph/osd/ceph-68)  probe -1: 24202548,  41934313, 
263141662720
ceph-osd.68.log.1.gz:2023-05-30T18:42:03.704+0200 7f6bb6c05700  0 
bluestore(/var/lib/ceph/osd/ceph-68)  probe -3: 24258584,  43985640, 
348764803072
ceph-osd.68.log.1.gz:2023-05-30T18:42:03.704+0200 7f6bb6c05700  0 
bluestore(/var/lib/ceph/osd/ceph-68)  probe -7: 25312482,  43035393, 
287792226304
ceph-osd.68.log.1.gz:2023-05-30T18:42:03.704+0200 7f6bb6c05700  0 
bluestore(/var/lib/ceph/osd/ceph-68)  probe -15: 25280732,  40994669, 
298337083392
ceph-osd.68.log.1.gz:2023-05-30T18:42:03.704+0200 7f6bb6c05700  0 
bluestore(/var/lib/ceph/osd/ceph-68)  probe -31: 0,  0, 0
ceph-osd.75.log.1.gz:2023-05-30T18:42:03.664+0200 7ff67f434700  0 
bluestore(/var/lib/ceph/osd/ceph-75)  allocation stats probe 15: cnt: 
21755123 frags: 42279513 size: 364872880128
ceph-osd.75.log.1.gz:2023-05-30T18:42:03.664+0200 7ff67f434700  0 

[ceph-users] Re: RGW versioned bucket index issues

2023-05-31 Thread Cory Snyder
I've proposed some new radosgw-admin commands for both identifying and fixing 
these leftover index entries in this open PR: 
https://github.com/ceph/ceph/pull/51700

Cory



From: Mark Nelson 
Sent: Wednesday, May 31, 2023 10:42 AM
To: ceph-users@ceph.io 
Subject: [ceph-users] Re: RGW versioned bucket index issues


Thank you Cory for this excellent write up!  A quick question: Is there
a simple method to find and more importantly fix the zombie index
entries and OLH objects?

I saw in https://tracker.ceph.com/issues/59663 that there was an example
using radosgw-admin to examine the lifecycle/marker/garbage collection
info, but that looks a little cumbersome?
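
For a rough first look, one can at least dump the raw index and eyeball it 
(inspection only, nothing here fixes anything; the bucket name is a placeholder):

radosgw-admin bi list --bucket=mybucket > bi.json

The entries carry a type field (plain / instance / olh), so the leftover olh and 
instance entries Cory describes should be visible there, though fixing them safely 
is another matter.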


Mark


On 5/31/23 05:16, Cory Snyder wrote:
> Hi all,
>
> I wanted to call attention to some RGW issues that we've observed on a
> Pacific cluster over the past several weeks. The problems relate to versioned
> buckets and index entries that can be left behind after transactions complete
> abnormally. The scenario is multi-faceted and we're still investigating some 
> of
> the details, but I wanted to provide a big-picture summary of what we've found
> so far. It looks like most of these issues should be reproducible on versions
> before and after Pacific as well. I'll enumerate the individual issues below:
>
> 1. PUT requests during reshard of versioned bucket fail with 404 and leave
> behind dark data
>
> Tracker: 
> https://tracker.ceph.com/issues/61359
>
> 2. When bucket index ops are cancelled it can leave behind zombie index 
> entries
>
> This one was merged a few months ago and did make the v16.2.13 release, 
> but
> in our case we had billions of extra index entries by the time that we had
> upgraded to the patched version.
>
> Tracker: 
> https://tracker.ceph.com/issues/58673
>
> 3. Issuing a delete for a key that already has a delete marker as the current
> version leaves behind index entries and OLH objects
>
> Note that the tracker's original description describes the problem a bit
> differently, but I've clarified the nature of the issue in a comment.
>
> Tracker: 
> https://tracker.ceph.com/issues/59663
>
> The extra index entries and OLH objects that are left behind due to these 
> sorts
> of issues are obviously annoying in regards to the fact that they 
> unnecessarily
> consume space, but we've found that they can also cause severe performance
> degradation for bucket listings, lifecycle processing, and other ops 
> indirectly
> due to higher osd latencies.
>
> The reason for the performance impact is that bucket listing calls must
> repeatedly perform additional OSD ops until they find the requisite number
> of entries to return. The OSD cls method for bucket listing also does its own
> internal iteration for the same purpose. Since these entries are invalid, they
> are skipped. In the case that we observed, where some of our bucket indexes 
> were
> filled with a sea of contiguous leftover entries, the process of continually
> iterating over and skipping invalid entries caused enormous read 
> amplification.
> I believe that the following tracker is describing symptoms that are related 
> to
> the same issue: 
> https://tracker.ceph.com/issues/59164.
>
> Note that this can also cause LC processing to repeatedly fail in cases where
> there are enough contiguous invalid entries, since the OSD cls code eventually
> gives up and returns an error that isn't handled.
>
> The severity of these issues likely varies greatly based upon client behavior.
> If anyone has experienced similar problems, we'd love to hear about the nature
> of how they've manifested for you so that we can be 

[ceph-users] Re: slow mds requests with random read test

2023-05-31 Thread Ben
Thank you Patrick for help.
The random write tests are performing well enough, though. I wonder why the read
test is so poor with the same configuration (resulting read bandwidth is about
15MB/s vs 400MB/s for writes), especially since the logged slow requests seem
unrelated to the test ops. I am thinking it is something with the cephfs
kernel client?

Any other thoughts?
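
For context, what I understand Patrick to be suggesting below is something along
these lines (a sketch only; rule and pool names are placeholders):

ceph osd crush rule create-replicated meta-ssd default host ssd
ceph osd pool set cephfs_metadata crush_rule meta-ssd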

Patrick Donnelly  于2023年5月31日周三 00:58写道:

> On Tue, May 30, 2023 at 8:42 AM Ben  wrote:
> >
> > Hi,
> >
> > We are performing couple performance tests on CephFS using fio. fio is
> run
> > in k8s pod and 3 pods will be up running mounting the same pvc to CephFS
> > volume. Here is command line for random read:
> > fio -direct=1 -iodepth=128 -rw=randread -ioengine=libaio -bs=4k -size=1G
> > -numjobs=5 -runtime=500 -group_reporting -directory=/tmp/cache
> > -name=Rand_Read_Testing_$BUILD_TIMESTAMP
> > The random read is performed very slow. Here is the cluster log from
> > dashboard:
> > [...]
> > Any suggestions on the problem?
>
> Your random read workload is too extreme for your cluster of OSDs.
> It's causing slow metadata ops for the MDS. To resolve this we would
> normally suggest allocating a set of OSDs on SSDs for use by the
> CephFS metadata pool to isolate the workloads.
>
> --
> Patrick Donnelly, Ph.D.
> He / Him / His
> Red Hat Partner Engineer
> IBM, Inc.
> GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [EXTERNAL] [Pacific] Admin keys no longer works I get access denied URGENT!!!

2023-05-31 Thread Beaman, Joshua
Greetings,

Try:

ceph -n mon. --keyring /var/lib/ceph/<fsid>/mon/<mon-host>/keyring auth 
get-or-create client.admin mon 'allow *' mds 'allow *' mgr 'allow *' osd 'allow 
*'
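
If that returns the key, you can also write it back out to the usual admin keyring
location, e.g. (same placeholder paths as above):

ceph -n mon. --keyring /var/lib/ceph/<fsid>/mon/<mon-host>/keyring auth get client.admin -o /etc/ceph/ceph.client.admin.keyring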

Thank you,
Josh Beaman

From: wodel youchi 
Date: Wednesday, May 31, 2023 at 5:39 AM
To: ceph-users@ceph.io 
Subject: [EXTERNAL] [ceph-users] [Pacific] Admin keys no longer works I get 
access denied URGENT!!!
Hi,

After a wrong manipulation, the admin key no longer works, it seems it has
been modified.

My cluster is built using containers.

When I execute ceph -s I get
[root@controllera ceph]# ceph -s
2023-05-31T11:33:20.940+0100 7ff7b2d13700 -1 monclient(hunting):
handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
2023-05-31T11:33:20.940+0100 7ff7b1d11700 -1 monclient(hunting):
handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
2023-05-31T11:33:20.940+0100 7ff7b2512700 -1 monclient(hunting):
handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
[errno 13] RADOS permission denied (error connecting to the cluster)

>From the log file I am getting :
May 31 11:03:02 controllera docker[214909]: debug
2023-05-31T11:03:02.714+0100 7fcfc0c91700  0 cephx server client.admin:
 unexpected key: req.key=5fea877f2a68548b expected_key=8c2074e03ffa449a

How can I recover the correct key?

Regards.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: all buckets mtime = "0.000000" after upgrade to 17.2.6

2023-05-31 Thread Casey Bodley
thanks for the report. this regression was already fixed in
https://tracker.ceph.com/issues/58932 and will be in the next quincy
point release

On Wed, May 31, 2023 at 10:46 AM  wrote:
>
> I was running on 17.2.5 since October, and just upgraded to 17.2.6, and now 
> the "mtime" property on all my buckets is 0.00.
>
> On all previous versions going back to Nautilus this wasn't an issue, and we 
> do like to have that value present. radosgw-admin has no quick way to get the 
> last object in the bucket.
>
> Here's my tracker submission:
> https://tracker.ceph.com/issues/61264#change-239348
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS corrupt (also RADOS-level copy?)

2023-05-31 Thread Janek Bevendorff
Forgot to say: As for your corrupt rank 0, you should check the logs 
with a higher debug level. Looks like you were less lucky than we were. 
Your journal position may be incorrect. This could be fixed by editing 
the journal header. You might also try to tell your MDS to skip corrupt 
entries. None of these operations are safe, though.



On 31/05/2023 16:41, Janek Bevendorff wrote:

Hi Jake,

Very interesting. This sounds very much like what we have been 
experiencing the last two days. We also had a sudden fill-up of the 
metadata pool, which repeated last night. See my question here: 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/7U27L27FHHPDYGA6VNNVWGLTXCGP7X23/


I also noticed that I couldn't dump the current journal using the 
cephfs-journal-tool, as it would eat up all my RAM (probably not 
surprising with a journal that seems to be filling up a 16TiB pool).


Note: I did NOT need to reset the journal (and you probably don't need 
to either). I did, however, have to add extra capacity and balance out 
the data. After an MDS restart, the pool quickly cleared out again. 
The first MDS restart took an hour or so and I had to increase the MDS 
lag timeout (mds_beacon_grace), otherwise the MONs kept killing the 
MDS during the resolve phase. I set it to 1600 to be on the safe side.


While your MDS are recovering, you may want to set debug_mds to 10 for 
one of your MDS and check the logs. My logs were being spammed with 
snapshot-related messages, but I cannot really make sense of them. 
Still hoping for a reply on the ML.


In any case, once you are recovered, I recommend you adjust the 
weights of some of your OSDs to be much lower than others as a 
temporary safeguard. This way, only some OSDs would fill up and 
trigger your FULL watermark should this thing repeat.


Janek


On 31/05/2023 16:13, Jake Grimmett wrote:

Dear All,

we are trying to recover from what we suspect is a corrupt MDS :(
and have been following the guide here:



Symptoms: MDS SSD pool (2TB) filled completely over the weekend, 
normally uses less than 400GB, resulting in MDS crash.


We added 4 x extra SSD to increase pool capacity to 3.5TB, however 
MDS did not recover


# ceph fs status
cephfs2 - 0 clients
===
RANK   STATE MDS ACTIVITY   DNS    INOS   DIRS   CAPS
 0 failed
 1    resolve  wilma-s3    8065   8063   8047  0
 2    resolve  wilma-s2 901k   802k  34.4k 0
  POOL TYPE USED  AVAIL
    mds_ssd  metadata  2296G  3566G
primary_fs_data    data   0   3566G
    ec82pool   data    2168T  3557T
STANDBY MDS
  wilma-s1
  wilma-s4

setting "ceph mds repaired 0" causes rank 0 to restart, and then 
immediately fail.


Following the disaster-recovery-experts guide, the first step we did 
was to export the MDS journals, e.g:


# cephfs-journal-tool --rank=cephfs2:0 journal export /root/backup.bin.0
journal is 9744716714163~658103700
wrote 658103700 bytes at offset 9744716714163 to /root/backup.bin.0

so far so good, however when we try to backup the final MDS the 
process consumes all available RAM (470GB) and needs to be killed 
after 14 minutes.


# cephfs-journal-tool --rank=cephfs2:2 journal export /root/backup.bin.2

similarly, "recover_dentries summary" consumes all RAM when applied 
to MDS 2

# cephfs-journal-tool --rank=cephfs2:2 event recover_dentries summary

We successfully ran "cephfs-journal-tool --rank=cephfs2:0 event 
recover_dentries summary" and "cephfs-journal-tool --rank=cephfs2:1 
event recover_dentries summary"


at this point, we tried to follow the instructions and make a RADOS 
level copy of the journal data, however the link in the docs doesn't 
explain how to do this and just points to 



At this point we are tempted to reset the journal on MDS 2, but 
wanted to get a feeling from others about how dangerous this could be?


We have a backup, but as there is 1.8PB of data, it's going to take a 
few weeks to restore


any ideas gratefully received.

Jake



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: BlueStore fragmentation woes

2023-05-31 Thread Mark Nelson


On 5/31/23 09:15, Igor Fedotov wrote:


On 31/05/2023 15:26, Stefan Kooman wrote:

On 5/29/23 15:52, Igor Fedotov wrote:

Hi Stefan,

given that allocation probes include every allocation (including 
short 4K ones) your stats look pretty high indeed.


Although you omitted historic probes so it's hard to tell if there 
is negative trend in it..


I did not omit them. We (currently) don't store logs for longer than 
7 days. I will increase the interval in which the probes get created 
(every hour).



Allocation probe contains historic data on its own, e.g.

allocation stats probe 33: cnt: 8148921 frags: 10958186 size: 1704348508>
probe -1: 35168547,  46401246, 1199516209152
probe -3: 27275094,  35681802, 200121712640
probe -5: 34847167,  52539758, 271272230912
probe -9: 44291522,  60025613, 523997483008
probe -17: 10646313,  10646313, 155178434560

in the snippet above probes -1 through -17 are historic data 1 through 
17 days (or more correctly probe attempts) back.


The major idea behind this representation is to try to visualize how 
allocation fragmentation evolved without the need for grep through all 
the logs.


From the info you shared it's unclear which records were for the 
current day and which were historic ones if any.


Hence no way to estimate the degradation over time.

Please note that probes are collected since OSD restart. Hence some 
historic records might be void if restart occurred not long ago.





As I mentioned in my reply to Hector one might want to make further 
investigation by e.g. building a histogram (chunk-size, num chunks) 
using the output from 'ceph tell osd.N bluestore allocator dump 
block' command and monitoring how it evolves over time. Script to 
build such a histogram still to be written. ;).


We started to investigate such a script. But when we issue a "ceph 
tell osd.N bluestore allocator dump block" on OSDs that are primary 
for three or more CephFS metadata PGs, that will cause a massive 
amount of slow ops (thousands), osd op tp threads will time out 
(2023-05-31T11:52:35.454+0200 7fee13285700  1 heartbeat_map 
is_healthy 'OSD::osd_op_tp thread 0x7fedf6fa5700' had timed out after 
15.00954s) and the OSD will reboot itself. This is true for SSD 
as well as NVMe OSDs. So it seems that the whole OSD is just busy 
processing this data, and production IO (client / rep ops) are just 
starved. Ideally this call would be asynchronous,processed in 
batches, and not hinder IO in any way. Should I open a tracker for this?


ah... this makes sense, good to know.. I knew that this dump might be 
huge but never heard of it causing such a drastic impact.. Perhaps it's 
really big this time or you're writing it to a slow device..


Unfortunately there is no simple enough way to process that in batches 
since we should collect a complete consistent snapshot made at a given 
point in time. Processing in batches would create potentially 
inconsistent chunks since allocation map is permanently updated by OSD 
which is processing regular user ops..


So for us this is not a suitable way of obtaining this data. The 
offline way of doing this, ceph-bluestore-tool --path 
/var/lib/ceph/osd/ceph-$id/ --allocator block free-dump > 
/root/osd.$id_free_dump did work and resulted in a 2.7 GiB file of 
JSON data. So that's quite a bit of data to process ...
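
As a first stab at the histogram, something like this might do (a rough sketch: it
assumes the free-dump JSON exposes extent records with a "length" field, so the jq
path may need adjusting, and it needs gawk for strtonum; it buckets extent sizes
into power-of-two bins):

jq -r '.. | objects | .length? // empty' /root/osd.$id_free_dump | gawk '
  { n = strtonum($1); b = 4096; while (b < n) b *= 2; hist[b]++ }
  END { for (s in hist) printf "%14d %d\n", s, hist[s] }' | sort -n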


Yeah, offline method is fine too. In fact Ceph codebase has a way to 
convert this JSON file to a binary format which might drastically 
improve processing time and save disk space.


The tool name is ceph_test_alloc_replay, it's primarily intended for 
dev purposes hence it's not very user-friendly. And I'm not sure it's 
included in regular ceph packages, perhaps you'll need to run it 
yourself.





As for Pacific release being a culprit - likely it is. But there 
were two major updates which could have the impact. Both came in the 
same PR (https://github.com/ceph/ceph/pull/34588):


1. 4K allocation unit for spinners


@Kevin: what drive types do you use in the clusters that are 
suffering from this problem? Did only HDD suffer from this after 
upgrading to Pacific?


2. Switch to avl/hybrid allocator.

Honestly I'd rather bet on 1.


We have no spinners. We have 4K alloc size since Luminous, bitmap 
since Luminous (12.2.13?). Not sure if we are suffering (more or 
less) on the 3 nodes that got provisioned / filled with hybrid 
allocator in use. We plan to do some experiments though: fill an OSD 
with PGs with bitmap allocator. At certain amount of PGs dump the 
free extents, until all PGs are present. Repeat this process with the 
same PGs on an OSD with hybrid allocator. My bet is on # 2 ;-).


Looking forward for the results... ;) Knowing internal design for both 
bitmap and hybrid allocator I'd be very surprised the latter one is 
worse in this regard...



Related to this, I was a little surprised to learn how the hybrid 
allocator works.  I figured we would do something like have a coarse 
grained implementation of one 

[ceph-users] Re: RGW versioned bucket index issues

2023-05-31 Thread Mark Nelson
Thank you Cory for this excellent write up!  A quick question: Is there 
a simple method to find and more importantly fix the zombie index 
entries and OLH objects?


I saw in https://tracker.ceph.com/issues/59663 that there was an example 
using radosgw-admin to examine the lifecycle/marker/garbage collection 
info, but that looks a little cumbersome?



Mark


On 5/31/23 05:16, Cory Snyder wrote:

Hi all,

I wanted to call attention to some RGW issues that we've observed on a
Pacific cluster over the past several weeks. The problems relate to versioned
buckets and index entries that can be left behind after transactions complete
abnormally. The scenario is multi-faceted and we're still investigating some of
the details, but I wanted to provide a big-picture summary of what we've found
so far. It looks like most of these issues should be reproducible on versions
before and after Pacific as well. I'll enumerate the individual issues below:

1. PUT requests during reshard of versioned bucket fail with 404 and leave
behind dark data

Tracker: https://tracker.ceph.com/issues/61359

2. When bucket index ops are cancelled it can leave behind zombie index entries

This one was merged a few months ago and did make the v16.2.13 release, but
in our case we had billions of extra index entries by the time that we had
upgraded to the patched version.

Tracker: https://tracker.ceph.com/issues/58673

3. Issuing a delete for a key that already has a delete marker as the current
version leaves behind index entries and OLH objects

Note that the tracker's original description describes the problem a bit
differently, but I've clarified the nature of the issue in a comment.

Tracker: https://tracker.ceph.com/issues/59663

The extra index entries and OLH objects that are left behind due to these sorts
of issues are obviously annoying in regards to the fact that they unnecessarily
consume space, but we've found that they can also cause severe performance
degradation for bucket listings, lifecycle processing, and other ops indirectly
due to higher osd latencies.

The reason for the performance impact is that bucket listing calls must
repeatedly perform additional OSD ops until they find the requisite number
of entries to return. The OSD cls method for bucket listing also does its own
internal iteration for the same purpose. Since these entries are invalid, they
are skipped. In the case that we observed, where some of our bucket indexes were
filled with a sea of contiguous leftover entries, the process of continually
iterating over and skipping invalid entries caused enormous read amplification.
I believe that the following tracker is describing symptoms that are related to
the same issue: https://tracker.ceph.com/issues/59164.

Note that this can also cause LC processing to repeatedly fail in cases where
there are enough contiguous invalid entries, since the OSD cls code eventually
gives up and returns an error that isn't handled.

The severity of these issues likely varies greatly based upon client behavior.
If anyone has experienced similar problems, we'd love to hear about the nature
of how they've manifested for you so that we can be more confident that we've
plugged all of the holes.

Thanks,

Cory Snyder
11:11 Systems
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


--
Best Regards,
Mark Nelson
Head of R&D (USA)

Clyso GmbH
p: +49 89 21552391 12
a: Loristraße 8 | 80335 München | Germany
w: https://clyso.com | e: mark.nel...@clyso.com

We are hiring: https://www.clyso.com/jobs/
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS corrupt (also RADOS-level copy?)

2023-05-31 Thread Janek Bevendorff

Hi Jake,

Very interesting. This sounds very much like what we have been 
experiencing the last two days. We also had a sudden fill-up of the 
metadata pool, which repeated last night. See my question here: 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/7U27L27FHHPDYGA6VNNVWGLTXCGP7X23/


I also noticed that I couldn't dump the current journal using the 
cephfs-journal-tool, as it would eat up all my RAM (probably not 
surprising with a journal that seems to be filling up a 16TiB pool).


Note: I did NOT need to reset the journal (and you probably don't need 
to either). I did, however, have to add extra capacity and balance out 
the data. After an MDS restart, the pool quickly cleared out again. The 
first MDS restart took an hour or so and I had to increase the MDS lag 
timeout (mds_beacon_grace), otherwise the MONs kept killing the MDS 
during the resolve phase. I set it to 1600 to be on the safe side.
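(For reference, we raised it with something like: ceph config set global 
mds_beacon_grace 1600 - a cluster-wide change, so remember to lower it again 
afterwards.)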


While your MDS are recovering, you may want to set debug_mds to 10 for 
one of your MDS and check the logs. My logs were being spammed with 
snapshot-related messages, but I cannot really make sense of them. Still 
hoping for a reply on the ML.
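(Something like ceph config set mds.<name> debug_mds 10 should do, and ceph config 
rm mds.<name> debug_mds to revert - the name placeholder being one of your MDS 
daemons.)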


In any case, once you are recovered, I recommend you adjust the weights 
of some of your OSDs to be much lower than others as a temporary 
safeguard. This way, only some OSDs would fill up and trigger your FULL 
watermark should this thing repeat.


Janek


On 31/05/2023 16:13, Jake Grimmett wrote:

Dear All,

we are trying to recover from what we suspect is a corrupt MDS :(
and have been following the guide here:



Symptoms: MDS SSD pool (2TB) filled completely over the weekend, 
normally uses less than 400GB, resulting in MDS crash.


We added 4 x extra SSD to increase pool capacity to 3.5TB, however MDS 
did not recover


# ceph fs status
cephfs2 - 0 clients
===
RANK   STATE MDS ACTIVITY   DNS    INOS   DIRS   CAPS
 0 failed
 1    resolve  wilma-s3    8065   8063   8047  0
 2    resolve  wilma-s2 901k   802k  34.4k 0
  POOL TYPE USED  AVAIL
    mds_ssd  metadata  2296G  3566G
primary_fs_data    data   0   3566G
    ec82pool   data    2168T  3557T
STANDBY MDS
  wilma-s1
  wilma-s4

setting "ceph mds repaired 0" causes rank 0 to restart, and then 
immediately fail.


Following the disaster-recovery-experts guide, the first step we did 
was to export the MDS journals, e.g:


# cephfs-journal-tool --rank=cephfs2:0 journal export /root/backup.bin.0
journal is 9744716714163~658103700
wrote 658103700 bytes at offset 9744716714163 to /root/backup.bin.0

so far so good, however when we try to backup the final MDS the 
process consumes all available RAM (470GB) and needs to be killed 
after 14 minutes.


# cephfs-journal-tool --rank=cephfs2:2 journal export /root/backup.bin.2

similarly, "recover_dentries summary" consumes all RAM when applied to 
MDS 2

# cephfs-journal-tool --rank=cephfs2:2 event recover_dentries summary

We successfully ran "cephfs-journal-tool --rank=cephfs2:0 event 
recover_dentries summary" and "cephfs-journal-tool --rank=cephfs2:1 
event recover_dentries summary"


at this point, we tried to follow the instructions and make a RADOS 
level copy of the journal data, however the link in the docs doesn't 
explain how to do this and just points to 



At this point we are tempted to reset the journal on MDS 2, but wanted 
to get a feeling from others about how dangerous this could be?


We have a backup, but as there is 1.8PB of data, it's going to take a 
few weeks to restore


any ideas gratefully received.

Jake



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] PGs incomplete - Data loss

2023-05-31 Thread Benno Wulf
Hi guys,
I've been awake for 36 hours trying to restore a broken Ceph pool (2 PGs incomplete).

My VMs are all broken. Some boot, some don't boot...

I also have 5 removed disks with data from that pool "in my hands" - don't ask...

So my question: is it possible to restore the data from these removed disks and "add" 
them to the others for healing?

Best regards
Ben
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] all buckets mtime = "0.000000" after upgrade to 17.2.6

2023-05-31 Thread alyarb
I was running on 17.2.5 since October, and just upgraded to 17.2.6, and now the 
"mtime" property on all my buckets is 0.00. 

On all previous versions going back to Nautilus this wasn't an issue, and we do 
like to have that value present. radosgw-admin has no quick way to get the last 
object in the bucket.

Here's my tracker submission:
https://tracker.ceph.com/issues/61264#change-239348
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Seeking feedback on Improving cephadm bootstrap process

2023-05-31 Thread Redouane Kachach
Hello all,

Thank you very much for your valuable feedback. I'd like to provide some
context and clarify certain points as there seems to be some confusion
regarding the objective of this discussion and how a cephadm initial
bootstrap works.

As you know, Ceph has the capability to run multiple clusters on the same
nodes, with certain limitations that I won't delve into in this discussion.
Each cluster has its own unique identifier (called fsid), which in fact is
a UUID generated by cephadm during cluster bootstrap or provided by the
user. Almost all cluster-related files, including cluster and daemon
configurations, systemd units, logs, etc., are specific to each cluster and
are stored in dedicated directories based on the fsid, such as
/var/lib/ceph/<fsid>/, /var/log/ceph/<fsid>/, /run/ceph/<fsid>/, and so on.
These directories ensure isolation between cluster files and daemons,
preventing any file or configuration sharing between clusters. Typically,
as a user, you need not concern yourself with the exact location of the
cluster files when deleting a cluster. For this purpose, cephadm provides a
dedicated command, "cephadm rm-cluster," (
https://docs.ceph.com/en/latest/cephadm/operations/#purging-a-cluster)
which handles the deletion of cluster files, removal of daemons, and so
forth. Importantly, this command uses the fsid to ensure the command's
safety in environments where multiple clusters coexist.
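
For reference, the usual form is something like (fsid being the cluster's id):

cephadm rm-cluster --fsid <fsid> --force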

That being clarified, I want to emphasize that this discussion does not
revolve around the workings or options provided by the "cephadm rm-cluster"
command. This command is the official method for deleting a cluster and is
employed in both upstream and production clusters. In case you have
suggestions for improving the user experience with this command we can
start a separate thread for that purpose.

Back to the original subject:

During the process of bootstrapping a new cluster with cephadm, in addition
to installing files in their respective locations, core ceph daemons such
as mgr and mon are started. If the bootstrap process succeeds we end up
with a minimal ceph cluster consisting only of the necessary files and
daemons. In case of bootstrap failure, a minimal, broken, non-functional
ceph cluster is created with no actual data (no OSDs), and potentially with
some daemons (mgr/mon) running on the current node. Retaining these files
and daemons provides no real benefit to the user apart from facilitating
the investigation of bugs or issues that may prevent the bootstrap process.
Even in such cases, once the investigation is complete and the issue is
resolved, the user must delete this cluster since it is useless and may
have active daemons listening on mon/mgr sockets, thereby obstructing the
creation of future clusters on the same node due to occupied mon/mgr ports.

The purpose of this email thread is to discuss how to address this
situation. Given that we have full control over the bootstrap process, we
can automatically clean up this broken cluster (or at least assist the user
in doing so). The proposed rollback options are: either an automatic
cleanup (option 2) or a manual cleanup (option 1) as mentioned in the
original email. The goal of this thread is to get some feedback about your
preference as a user and gather input on the additional information you
would like to receive regarding each option.

Side Note:

As a response to the question of why we don't use a mechanism like Rook's:
cephadm is a "binary" meant for bare-metal deployments.
Unlike Rook, which operates within the framework of a higher-level
orchestration system like k8s or Openshift, in the case of cephadm we have
no
daemon nor any other high level controller that can watch and fix a broken
installation. cephadm is the only binary needed (+ some minimal
dependencies)
to bootstrap a new ceph cluster.

Best Regards,
Redouane.


On Tue, May 30, 2023 at 10:30 AM Frank Schilder  wrote:

> Hi, I would like to second Nico's comment. What happened to the idea that
> a deployment tool should be idempotent? The most natural option would be:
>
> 1) start install -> something fails
> 2) fix problem
> 3) repeat exact same deploy command -> deployment picks up at current
> state (including cleaning up failed state markers) and tries to continue
> until next issue (go to 2)
>
> I'm not sure (meaning: its a terrible idea) if its a good idea to provide
> a single command to wipe a cluster. Just for the fat finger syndrome. This
> seems safe only if it would be possible to mark a cluster as production
> somehow (must be sticky, that is, cannot be unset), which prevents a
> cluster destroy command (or any too dangerous command) from executing. I
> understand the test case in the tracker, but having such test-case utils
> that can run on a production cluster and destroy everything seems a bit
> dangerous.
>
> I think destroying a cluster should be a manual and tedious process and
> figuring out how to do it should be part of the learning experience. 

[ceph-users] Re: MDS corrupt (also RADOS-level copy?)

2023-05-31 Thread Jake Grimmett

Dear All,

My apologies, I forgot to state we are using Quincy 17.2.6

thanks again,

Jake

root@wilma-s1 15:22 [~]: ceph -v
ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy 
(stable)



Dear All,

we are trying to recover from what we suspect is a corrupt MDS :(
and have been following the guide here:



Symptoms: MDS SSD pool (2TB) filled completely over the weekend, 
normally uses less than 400GB, resulting in MDS crash.


We added 4 x extra SSD to increase pool capacity to 3.5TB, however MDS 
did not recover


# ceph fs status
cephfs2 - 0 clients
===
RANK   STATE MDS ACTIVITY   DNS    INOS   DIRS   CAPS
  0 failed
  1    resolve  wilma-s3    8065   8063   8047  0
  2    resolve  wilma-s2 901k   802k  34.4k 0
   POOL TYPE USED  AVAIL
     mds_ssd  metadata  2296G  3566G
primary_fs_data    data   0   3566G
     ec82pool   data    2168T  3557T
STANDBY MDS
   wilma-s1
   wilma-s4

setting "ceph mds repaired 0" causes rank 0 to restart, and then 
immediately fail.


Following the disaster-recovery-experts guide, the first step we did was 
to export the MDS journals, e.g:


# cephfs-journal-tool --rank=cephfs2:0 journal export  /root/backup.bin.0
journal is 9744716714163~658103700
wrote 658103700 bytes at offset 9744716714163 to /root/backup.bin.0

so far so good, however when we try to backup the final MDS the process 
consumes all available RAM (470GB) and needs to be killed after 14 minutes.


# cephfs-journal-tool --rank=cephfs2:2 journal export  /root/backup.bin.2

similarly, "recover_dentries summary" consumes all RAM when applied to 
MDS 2

# cephfs-journal-tool --rank=cephfs2:2 event recover_dentries summary

We successfully ran "cephfs-journal-tool --rank=cephfs2:0 event 
recover_dentries summary" and "cephfs-journal-tool --rank=cephfs2:1 
event recover_dentries summary"


at this point, we tried to follow the instructions and make a RADOS 
level copy of the journal data, however the link in the docs doesn't 
explain how to do this and just points to 



At this point we are tempted to reset the journal on MDS 2, but wanted 
to get a feeling from others about how dangerous this could be?


We have a backup, but as there is 1.8PB of data, it's going to take a 
few weeks to restore


any ideas gratefully received.

Jake



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] MDS corrupt (also RADOS-level copy?)

2023-05-31 Thread Jake Grimmett

Dear All,

we are trying to recover from what we suspect is a corrupt MDS :(
and have been following the guide here:



Symptoms: MDS SSD pool (2TB) filled completely over the weekend, 
normally uses less than 400GB, resulting in MDS crash.


We added 4 extra SSDs to increase the pool capacity to 3.5TB; however, the MDS 
did not recover.


# ceph fs status
cephfs2 - 0 clients
===
RANK   STATE MDS ACTIVITY   DNSINOS   DIRS   CAPS
 0 failed
 1resolve  wilma-s38065   8063   8047  0
 2resolve  wilma-s2 901k   802k  34.4k 0
  POOL TYPE USED  AVAIL
mds_ssd  metadata  2296G  3566G
primary_fs_datadata   0   3566G
ec82pool   data2168T  3557T
STANDBY MDS
  wilma-s1
  wilma-s4

setting "ceph mds repaired 0" causes rank 0 to restart, and then 
immediately fail.


Following the disaster-recovery-experts guide, the first step we did was 
to export the MDS journals, e.g.:


# cephfs-journal-tool --rank=cephfs2:0 journal export  /root/backup.bin.0
journal is 9744716714163~658103700
wrote 658103700 bytes at offset 9744716714163 to /root/backup.bin.0

So far so good; however, when we try to back up the final MDS, the process 
consumes all available RAM (470GB) and needs to be killed after 14 minutes.


# cephfs-journal-tool --rank=cephfs2:2 journal export  /root/backup.bin.2

Similarly, "recover_dentries summary" consumes all RAM when applied to MDS 2
# cephfs-journal-tool --rank=cephfs2:2 event recover_dentries summary

We successfully ran "cephfs-journal-tool --rank=cephfs2:0 event 
recover_dentries summary" and "cephfs-journal-tool --rank=cephfs2:1 
event recover_dentries summary"


At this point, we tried to follow the instructions and make a RADOS 
level copy of the journal data, however the link in the docs doesn't 
explain how to do this and just points to 



At this point we are tempted to reset the journal on MDS 2, but wanted 
to get a feeling from others about how dangerous this could be?


We have a backup, but as there is 1.8PB of data, it's going to take a 
few weeks to restore


any ideas gratefully received.

Jake


--
Dr Jake Grimmett
Head Of Scientific Computing
MRC Laboratory of Molecular Biology
Francis Crick Avenue,
Cambridge CB2 0QH, UK.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] how to use ctdb_mutex_ceph_rados_helper

2023-05-31 Thread Angelo Höngens
Hey,

I have a test setup with a 3-node samba cluster. This cluster consists
of 3 VMs storing their locks on a replicated Gluster volume.

I want to switch to 2 physical smb-gateways for performance reasons
(not enough money for 3), and since the 2-node cluster can't get
quorum, I hope to switch to storing the CTDB lock in Ceph and hope
that will work reliably. (Any experiences with 2-node SMB clusters?)

I am looking into the ctdb rados helper:

[cluster]
recovery lock =
!/usr/lib/x86_64-linux-gnu/ctdb/ctdb_mutex_ceph_rados_helper ceph
client.tenant1 cephfs_metadata ctdb_lock

Now I do have a bit of experience with cephfs, rbd and rgw, but not
rados. How do I give the user client.tenant1 permissions?

We have a single cephfs, with 4 different tenants (departments). Each
department has their own samba cluster. We're using cephfs permissions
to limit the tenants to their own path (I hope).

example of ceph auth:

client.tenant1
key: *
caps: [mds] allow rws fsname=cephfs path=/tenant1
caps: [mon] allow r fsname=cephfs
caps: [osd] allow rw tag cephfs data=cephfs

If I try some stuff manually (without really knowing how to specify
objects or what that means), I get this permission denied error:

root@tenant1-1:~#
/usr/lib/x86_64-linux-gnu/ctdb/ctdb_mutex_ceph_rados_helper ceph
client.tenant1 cephfs_metadata tenant1/ctdb_lock 1
/usr/lib/x86_64-linux-gnu/ctdb/ctdb_mutex_ceph_rados_helper: Failed to
get lock on RADOS object 'tenant1/ctdb_lock' - (Operation not
permitted)
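
For reference, RADOS access is normally granted by extending the OSD caps 
with a grant for the pool holding the lock object (ceph auth caps replaces 
all caps, so the existing ones have to be restated); a hedged sketch, not 
verified for this setup:

ceph auth caps client.tenant1 \
  mds 'allow rws fsname=cephfs path=/tenant1' \
  mon 'allow r fsname=cephfs' \
  osd 'allow rw tag cephfs data=cephfs, allow rwx pool=cephfs_metadata'

A small dedicated pool for the lock object would avoid giving the client 
write access to the whole CephFS metadata pool.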

Angelo.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Important: RGW multisite bug may silently corrupt encrypted objects on replication

2023-05-31 Thread Casey Bodley
On Wed, May 31, 2023 at 7:24 AM Tobias Urdin  wrote:
>
> Hello Casey,
>
> Understood, thanks!
>
> That means that the original copy in the site that it was uploaded to is still
> safe as long as that copy is not removed, and no underlying changes below
> RadosGW in the Ceph storage could corrupt the original copy?

right, the original multipart upload remains intact and can be
decrypted successfully

as i noted above, take care not to delete or modify any replicas that
were corrupted. replication is bidirectional by default, so those
changes would sync back and delete/overwrite the original copy

>
> Best regards
> Tobias
>
> On 30 May 2023, at 14:48, Casey Bodley  wrote:
>
> On Tue, May 30, 2023 at 8:22 AM Tobias Urdin <tobias.ur...@binero.com> wrote:
>
> Hello Casey,
>
> Thanks for the information!
>
> Can you please confirm that this is only an issue when using 
> “rgw_crypt_default_encryption_key”
> config opt that says “testing only” in the documentation [1] to enable 
> encryption and not when using
> Barbican or Vault as KMS or using SSE-C with the S3 API?
>
> unfortunately, all flavors of server-side encryption (SSE-C, SSE-KMS,
> SSE-S3, and rgw_crypt_default_encryption_key) are affected by this
> bug, as they share the same encryption logic. the main difference is
> where they get the key
>
>
> [1] 
> https://docs.ceph.com/en/quincy/radosgw/encryption/#automatic-encryption-for-testing-only
>
> On 26 May 2023, at 22:45, Casey Bodley  wrote:
>
> Our downstream QE team recently observed an md5 mismatch of replicated
> objects when testing rgw's server-side encryption in multisite. This
> corruption is specific to s3 multipart uploads, and only affects the
> replicated copy - the original object remains intact. The bug likely
> affects Ceph releases all the way back to Luminous where server-side
> encryption was first introduced.
>
> To expand on the cause of this corruption: Encryption of multipart
> uploads requires special handling around the part boundaries, because
> each part is uploaded and encrypted separately. In multisite, objects
> are replicated in their encrypted form, and multipart uploads are
> replicated as a single part. As a result, the replicated copy loses
> its knowledge about the original part boundaries required to decrypt
> the data correctly.
>
> We don't have a fix yet, but we're tracking it in
> https://tracker.ceph.com/issues/46062. The fix will only modify the
> replication logic, so won't repair any objects that have already
> replicated incorrectly. We'll need to develop a radosgw-admin command
> to search for affected objects and reschedule their replication.
>
> In the meantime, I can only advise multisite users to avoid using
> encryption for multipart uploads. If you'd like to scan your cluster
> for existing encrypted multipart uploads, you can identify them with a
> s3 HeadObject request. The response would include a
> x-amz-server-side-encryption header, and the ETag header value (with
> the quotes removed) would be longer than 32 characters (multipart ETags
> are in the special form "<hash>-<number of parts>"). Take care not to delete the
> corrupted replicas, because an active-active multisite configuration
> would go on to delete the original copy.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to 
> ceph-users-le...@ceph.io
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Important: RGW multisite bug may silently corrupt encrypted objects on replication

2023-05-31 Thread Tobias Urdin
Hello Casey,

Understood, thanks!

That means that the original copy in the site that it was uploaded to is still
safe as long as that copy is not removed, and no underlying changes below
RadosGW in the Ceph storage could corrupt the original copy?

Best regards
Tobias

On 30 May 2023, at 14:48, Casey Bodley  wrote:

On Tue, May 30, 2023 at 8:22 AM Tobias Urdin <tobias.ur...@binero.com> wrote:

Hello Casey,

Thanks for the information!

Can you please confirm that this is only an issue when using 
“rgw_crypt_default_encryption_key”
config opt that says “testing only” in the documentation [1] to enable 
encryption and not when using
Barbican or Vault as KMS or using SSE-C with the S3 API?

unfortunately, all flavors of server-side encryption (SSE-C, SSE-KMS,
SSE-S3, and rgw_crypt_default_encryption_key) are affected by this
bug, as they share the same encryption logic. the main difference is
where they get the key


[1] 
https://docs.ceph.com/en/quincy/radosgw/encryption/#automatic-encryption-for-testing-only

On 26 May 2023, at 22:45, Casey Bodley  wrote:

Our downstream QE team recently observed an md5 mismatch of replicated
objects when testing rgw's server-side encryption in multisite. This
corruption is specific to s3 multipart uploads, and only affects the
replicated copy - the original object remains intact. The bug likely
affects Ceph releases all the way back to Luminous where server-side
encryption was first introduced.

To expand on the cause of this corruption: Encryption of multipart
uploads requires special handling around the part boundaries, because
each part is uploaded and encrypted separately. In multisite, objects
are replicated in their encrypted form, and multipart uploads are
replicated as a single part. As a result, the replicated copy loses
its knowledge about the original part boundaries required to decrypt
the data correctly.

We don't have a fix yet, but we're tracking it in
https://tracker.ceph.com/issues/46062. The fix will only modify the
replication logic, so won't repair any objects that have already
replicated incorrectly. We'll need to develop a radosgw-admin command
to search for affected objects and reschedule their replication.

In the meantime, I can only advise multisite users to avoid using
encryption for multipart uploads. If you'd like to scan your cluster
for existing encrypted multipart uploads, you can identify them with a
s3 HeadObject request. The response would include a
x-amz-server-side-encryption header, and the ETag header value (with
the quotes removed) would be longer than 32 characters (multipart ETags
are in the special form "<hash>-<number of parts>"). Take care not to delete the
corrupted replicas, because an active-active multisite configuration
would go on to delete the original copy.
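
A minimal sketch of such a check with the AWS CLI (bucket, key and endpoint 
are placeholders):

aws --endpoint-url https://rgw.example.com s3api head-object --bucket mybucket --key mykey
# an affected object shows a ServerSideEncryption (or SSE-C) field and an
# ETag of the form "<hash>-<parts>", i.e. longer than 32 hex characters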
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to 
ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS metadata pool grows by two orders of magnitude while trimming (?) snapshots

2023-05-31 Thread Janek Bevendorff
I checked our logs from yesterday, the PG scaling only started today, 
perhaps triggered by the snapshot trimming. I disabled it, but it didn't 
change anything.


What did change something was restarting the MDS one by one, which had 
got far behind with trimming their caches and with a bunch of stuck ops. 
After restarting them, the pool size decreased quickly to 600GiB. I 
noticed the same behaviour yesterday, though yesterday it was more 
extreme and restarting the MDS took about an hour and I had to increase 
the heartbeat timeout. This time, it took only half a minute per MDS, 
probably because it wasn't that extreme yet and I had reduced the 
maximum cache size. Still looks like a bug to me.
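
For reference, the cache limit mentioned here is the standard MDS option; a 
hedged example of capping it at 4 GiB per daemon:

ceph config set mds mds_cache_memory_limit 4294967296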



On 31/05/2023 11:18, Janek Bevendorff wrote:
Another thing I just noticed is that the auto-scaler is trying to 
scale the pool down to 128 PGs. That could also result in large 
fluctuations, but this big?? In any case, it looks like a bug to me. 
Whatever is happening here, there should be safeguards with regard to 
the pool's capacity.


Here's the current state of the pool in ceph osd pool ls detail:

pool 110 'cephfs.storage.meta' replicated size 4 min_size 3 crush_rule 
5 object_hash rjenkins pg_num 495 pgp_num 471 pg_num_target 128 
pgp_num_target 128 autoscale_mode on last_change 1359013 lfor 
0/1358620/1358618 flags hashpspool,nodelete stripe_width 0 
expected_num_objects 300 recovery_op_priority 5 recovery_priority 
2 application cephfs
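
For reference, pinning the PG count by switching the autoscaler off for 
this pool is a single setting, e.g.:

ceph osd pool set cephfs.storage.meta pg_autoscale_mode off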


Janek


On 31/05/2023 10:10, Janek Bevendorff wrote:

Forgot to add: We are still on Nautilus (16.2.12).


On 31/05/2023 09:53, Janek Bevendorff wrote:

Hi,

Perhaps this is a known issue and I was simply too dumb to find it, 
but we are having problems with our CephFS metadata pool filling up 
over night.


Our cluster has a small SSD pool of around 15TB which hosts our 
CephFS metadata pool. Usually, that's more than enough. The normal 
size of the pool ranges between 200 and 800GiB (which is quite a lot 
of fluctuation already). Yesterday, we suddenly had the pool 
fill up entirely and the only way to fix it was to add more 
capacity. I increased the pool size to 18TB by adding more SSDs and 
could resolve the problem. After a couple of hours of reshuffling, 
the pool size finally went back to 230GiB.


But then we had another fill-up tonight to 7.6TiB. Luckily, I had 
adjusted the weights so that not all disks could fill up entirely 
like last time, so it ended there.


I wasn't really able to identify the problem yesterday, but under 
the more controllable scenario today, I could check the MDS logs at 
debug_mds=10 and to me it seems like the problem is caused by 
snapshot trimming. The logs contain a lot of snapshot-related 
messages for paths that haven't been touched in a long time. The 
messages all look something like this:


May 31 09:16:48 XXX ceph-mds[2947525]: 2023-05-31T09:16:48.292+0200 
7f7ce1bd9700 10 mds.1.cache.ino(0x1000b3c3670) add_client_cap first 
cap, joining realm snaprealm(0x100 seq 1b1c lc 1b1b cr 1
b1b cps 2 snaps={185f=snap(185f 0x100 'monthly_20221201' 
2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x100 
'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941 
0x100 ...


May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.396+0200 
7f0e6a6ca700 10 mds.0.cache | |__ 3 rep [dir 
0x10218fe.10101* /storage/REDACTED/| ptrwaiter=0 request=0 
child=0 frozen=0 subtree=1 replicated=0 dirty=0 waiter=0 authpin=0 
tempexporting=0 0x5607759d9600]


May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.452+0200 
7f0e6a6ca700 10 mds.0.cache | | | 4 rep [dir 
0x10ff904.10001010* /storage/REDACTED/| ptrwaiter=0 
request=0 child=0 frozen=0 subtree=1 importing=0 replicated=0 
waiter=0 authpin=0 tempexporting=0 0x56034ed25200]


May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.716+0200 
7f0e6becd700 10 mds.0.server set_trace_dist snaprealm 
snaprealm(0x100 seq 1b1c lc 1b1b cr 1b1b cps 2 
snaps={185f=snap(185f 0x100 'monthly_20221201' 
2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x100 
'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941 
0x100 'monthly_20230201' 
2023-02-01T00:00:01.854059+0100),19a6=snap(19a6 0x100 
'monthly_20230301' 2023-03-01T00:00:01.215197+0100),1a24=snap(1a24 
0x100 'monthly_20230401'  ...) len=384


May 31 09:25:36 deltaweb055 ceph-mds[3268481]: 
2023-05-31T09:25:36.076+0200 7f0e6becd700 10 
mds.0.cache.ino(0x10004d74911) remove_client_cap last cap, leaving 
realm snaprealm(0x100 seq 1b1c lc 1b1b cr 1b1b cps 2 
snaps={185f=snap(185f 0x100 'monthly_20221201' 
2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x100 
'monthly_20230101'  ...)


The daily_*, monthly_* etc. names are the names of our regular 
snapshots.


I posted a larger log file snippet using ceph-post-file with the ID: 
da0eb93d-f340-4457-8a3f-434e8ef37d36


Is it possible that the MDS are trimming old snapshots without taking 
care not to fill up the entire metadata pool?

[ceph-users] Re: CEPH Version choice

2023-05-31 Thread Marc
Hi Frank,

Thanks! I have added this to my test environment todo

> 
> I uploaded all scripts and a rudimentary readme to
> https://github.com/frans42/cephfs-bench . I hope it is sufficient to get
> started. I'm afraid its very much tailored to our deployment and I can't
> make it fully configurable anytime soon. I hope it serves a purpose
> though - at least I discovered a few bugs with it.

I think I know where I should add something for creating and comparing hashes. 
For me, the integrity of el9 + Nautilus is most important. 
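
A minimal sketch of what such a check could look like, assuming a checksum 
manifest built over the benchmark tree (paths are placeholders):

# build a checksum manifest on the source tree
find /mnt/cephfs/bench -type f -print0 | xargs -0 sha256sum > /root/bench.sha256
# verify it later, or on another host mounting the same tree
sha256sum -c --quiet /root/bench.sha256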

> We actually kept the benchmark running through an upgrade from mimic to
> octopus. Was quite interesting to see how certain performance properties
> change with that. 

So you have stats that show the current performance of a host running Mimic 
and of another host running Octopus?

> This benchmark makes it possible to compare versions
> with live timings coming in.
> 

Do you do something with this time output, like storing it separately in 
Prometheus/InfluxDB? Or are all your statistics coming from what is being 
reported by Ceph itself?

PS: my el7/el9 time does not have a -f option.
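
For what it's worth, the -f format flag belongs to GNU time (/usr/bin/time) 
rather than the shell builtin; a hedged example:

/usr/bin/time -f '%e s elapsed' cp /src/file /dst/file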

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] [Pacific] Admin keys no longer works I get access denied URGENT!!!

2023-05-31 Thread wodel youchi
Hi,

After a wrong manipulation, the admin key no longer works; it seems it has
been modified.

My cluster is built using containers.

When I execute ceph -s I get
[root@controllera ceph]# ceph -s
2023-05-31T11:33:20.940+0100 7ff7b2d13700 -1 monclient(hunting):
handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
2023-05-31T11:33:20.940+0100 7ff7b1d11700 -1 monclient(hunting):
handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
2023-05-31T11:33:20.940+0100 7ff7b2512700 -1 monclient(hunting):
handle_auth_bad_method server allowed_methods [2] but i only support [2,1]
[errno 13] RADOS permission denied (error connecting to the cluster)

From the log file I am getting:
May 31 11:03:02 controllera docker[214909]: debug
2023-05-31T11:03:02.714+0100 7fcfc0c91700  0 cephx server client.admin:
 unexpected key: req.key=5fea877f2a68548b expected_key=8c2074e03ffa449a

How can I recover the correct key?
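
One hedged avenue, assuming the mon. keyring on a monitor host is still 
intact (paths differ per deployment; shown for a containerized mon):

# from inside (or exec'd into) a running mon container
ceph -n mon. -k /var/lib/ceph/mon/ceph-$(hostname -s)/keyring auth get client.admin
# the key printed there can be written back into /etc/ceph/ceph.client.admin.keyring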

Regards.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: BlueStore fragmentation woes

2023-05-31 Thread Igor Fedotov

Hi Kevin,

according to the shared probes there were no fragmented allocations 
(cnt = frags for all the probes), and the average allocation request is 
pretty large - more than 1.5 MB for the probes I checked.


So to me it looks like your disk fragmentation (at least for new 
allocations) is of little significance at the moment - it doesn't affect 
write requests.


As I mentioned before for further analysis you might want to run through 
the output from 'ceph tell osd.N bluestore allocator dump block' command.


This is my recent commit to build free space histogram from it: 
https://github.com/ceph/ceph/pull/51820


One can use this as an example and create a script to do the same (just 
to avoid all the tricks with building/upgrading Ceph binaries) or 
backport and build custom Ceph image.
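
A hedged shell sketch of the same idea, assuming the dump is JSON with an 
extents list carrying hex length fields (the exact key names may differ 
between releases) and GNU awk:

ceph tell osd.183 bluestore allocator dump block > /tmp/osd183.free.json
# bucket the free extent sizes by power of two
jq -r '.extents[].length' /tmp/osd183.free.json | \
  gawk '{ n = strtonum($1); b = 4096; while (b < n) b *= 2; hist[b]++ }
        END { for (b in hist) printf "%14d %d\n", b, hist[b] }' | sort -n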



Thanks,

Igor

On 31/05/2023 01:11, Fox, Kevin M wrote:

Ok, I restarted it May 25th, ~11:30, let it run over the long weekend and just 
checked on it. Data attached.

May 21 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: 
debug 2023-05-21T18:24:34.040+ 7f53603fc700  0 
bluestore(/var/lib/ceph/osd/ceph-183)  allocation stats probe 107: cnt: 17991 frags: 17991 size: 32016760832
May 21 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: 
debug 2023-05-21T18:24:34.040+ 7f53603fc700  0 
bluestore(/var/lib/ceph/osd/ceph-183)  probe -1: 20267,  20267, 39482425344
May 21 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: 
debug 2023-05-21T18:24:34.040+ 7f53603fc700  0 
bluestore(/var/lib/ceph/osd/ceph-183)  probe -3: 19737,  19737, 37299027968
May 21 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: 
debug 2023-05-21T18:24:34.040+ 7f53603fc700  0 
bluestore(/var/lib/ceph/osd/ceph-183)  probe -7: 18498,  18498, 32395558912
May 21 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: 
debug 2023-05-21T18:24:34.040+ 7f53603fc700  0 
bluestore(/var/lib/ceph/osd/ceph-183)  probe -11: 20373,  20373, 35302801408
May 21 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: 
debug 2023-05-21T18:24:34.040+ 7f53603fc700  0 
bluestore(/var/lib/ceph/osd/ceph-183)  probe -27: 19072,  19072, 33645854720
May 22 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: 
debug 2023-05-22T18:24:34.057+ 7f53603fc700  0 
bluestore(/var/lib/ceph/osd/ceph-183)  allocation stats probe 108: cnt: 24594 frags: 24594 size: 56951898112
May 22 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: 
debug 2023-05-22T18:24:34.057+ 7f53603fc700  0 
bluestore(/var/lib/ceph/osd/ceph-183)  probe -1: 17991,  17991, 32016760832
May 22 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: 
debug 2023-05-22T18:24:34.057+ 7f53603fc700  0 
bluestore(/var/lib/ceph/osd/ceph-183)  probe -2: 20267,  20267, 39482425344
May 22 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: 
debug 2023-05-22T18:24:34.057+ 7f53603fc700  0 
bluestore(/var/lib/ceph/osd/ceph-183)  probe -4: 19737,  19737, 37299027968
May 22 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: 
debug 2023-05-22T18:24:34.057+ 7f53603fc700  0 
bluestore(/var/lib/ceph/osd/ceph-183)  probe -12: 20373,  20373, 35302801408
May 22 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: 
debug 2023-05-22T18:24:34.057+ 7f53603fc700  0 
bluestore(/var/lib/ceph/osd/ceph-183)  probe -28: 19072,  19072, 33645854720
May 23 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: 
debug 2023-05-23T18:24:34.095+ 7f53603fc700  0 
bluestore(/var/lib/ceph/osd/ceph-183)  allocation stats probe 109: cnt: 24503 
frags: 24503 size: 58141900800
May 23 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: 
debug 2023-05-23T18:24:34.095+ 7f53603fc700  0 
bluestore(/var/lib/ceph/osd/ceph-183)  probe -1: 24594,  24594, 56951898112
May 23 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: 
debug 2023-05-23T18:24:34.095+ 7f53603fc700  0 
bluestore(/var/lib/ceph/osd/ceph-183)  probe -3: 20267,  20267, 39482425344
May 23 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: 
debug 2023-05-23T18:24:34.095+ 7f53603fc700  0 
bluestore(/var/lib/ceph/osd/ceph-183)  probe -5: 19737,  19737, 37299027968
May 23 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: 
debug 2023-05-23T18:24:34.095+ 7f53603fc700  0 
bluestore(/var/lib/ceph/osd/ceph-183)  probe -13: 20373,  20373, 35302801408
May 23 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: 
debug 2023-05-23T18:24:34.095+ 7f53603fc700  0 
bluestore(/var/lib/ceph/osd/ceph-183)  probe -29: 19072,  19072, 33645854720
May 24 11:24:34 cf8 ceph-4e4184f5-7733-453b-b72c-2b43422fd027-osd-183[2282674]: 
debug 2023-05-24T18:24:34.105+ 7f53603fc700  0 

[ceph-users] RGW versioned bucket index issues

2023-05-31 Thread Cory Snyder
Hi all,

I wanted to call attention to some RGW issues that we've observed on a
Pacific cluster over the past several weeks. The problems relate to versioned
buckets and index entries that can be left behind after transactions complete
abnormally. The scenario is multi-faceted and we're still investigating some of
the details, but I wanted to provide a big-picture summary of what we've found
so far. It looks like most of these issues should be reproducible on versions
before and after Pacific as well. I'll enumerate the individual issues below:

1. PUT requests during reshard of versioned bucket fail with 404 and leave
   behind dark data

   Tracker: https://tracker.ceph.com/issues/61359

2. When bucket index ops are cancelled it can leave behind zombie index entries

   This one was merged a few months ago and did make the v16.2.13 release, but
   in our case we had billions of extra index entries by the time that we had
   upgraded to the patched version.

   Tracker: https://tracker.ceph.com/issues/58673

3. Issuing a delete for a key that already has a delete marker as the current
   version leaves behind index entries and OLH objects

   Note that the tracker's original description describes the problem a bit
   differently, but I've clarified the nature of the issue in a comment.

   Tracker: https://tracker.ceph.com/issues/59663

The extra index entries and OLH objects that are left behind due to these sorts
of issues are obviously annoying in regards to the fact that they unnecessarily
consume space, but we've found that they can also cause severe performance
degradation for bucket listings, lifecycle processing, and other ops indirectly
due to higher osd latencies.

The reason for the performance impact is that bucket listing calls must
repeatedly perform additional OSD ops until they find the requisite number
of entries to return. The OSD cls method for bucket listing also does its own
internal iteration for the same purpose. Since these entries are invalid, they
are skipped. In the case that we observed, where some of our bucket indexes were
filled with a sea of contiguous leftover entries, the process of continually
iterating over and skipping invalid entries caused enormous read amplification.
I believe that the following tracker is describing symptoms that are related to
the same issue: https://tracker.ceph.com/issues/59164.
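
As a rough way to spot a bucket suffering from this, comparing the raw index 
entry count against the object count in the bucket stats can help; a hedged 
sketch (the bucket name is a placeholder, and listing the full index of a 
very large bucket is slow):

radosgw-admin bucket stats --bucket=mybucket | jq '.usage."rgw.main".num_objects'
radosgw-admin bi list --bucket=mybucket | jq 'length'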

Note that this can also cause LC processing to repeatedly fail in cases where
there are enough contiguous invalid entries, since the OSD cls code eventually
gives up and returns an error that isn't handled.

The severity of these issues likely varies greatly based upon client behavior. 
If anyone has experienced similar problems, we'd love to hear about the nature
of how they've manifested for you so that we can be more confident that we've
plugged all of the holes.

Thanks,

Cory Snyder
11:11 Systems
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS metadata pool grows by two orders of magnitude while trimming (?) snapshots

2023-05-31 Thread Janek Bevendorff
Another thing I just noticed is that the auto-scaler is trying to scale 
the pool down to 128 PGs. That could also result in large fluctuations, 
but this big?? In any case, it looks like a bug to me. Whatever is 
happening here, there should be safeguards with regard to the pool's 
capacity.


Here's the current state of the pool in ceph osd pool ls detail:

pool 110 'cephfs.storage.meta' replicated size 4 min_size 3 crush_rule 5 
object_hash rjenkins pg_num 495 pgp_num 471 pg_num_target 128 
pgp_num_target 128 autoscale_mode on last_change 1359013 lfor 
0/1358620/1358618 flags hashpspool,nodelete stripe_width 0 
expected_num_objects 300 recovery_op_priority 5 recovery_priority 2 
application cephfs


Janek


On 31/05/2023 10:10, Janek Bevendorff wrote:

Forgot to add: We are still on Nautilus (16.2.12).


On 31/05/2023 09:53, Janek Bevendorff wrote:

Hi,

Perhaps this is a known issue and I was simply too dumb to find it, 
but we are having problems with our CephFS metadata pool filling up 
over night.


Our cluster has a small SSD pool of around 15TB which hosts our 
CephFS metadata pool. Usually, that's more than enough. The normal 
size of the pool ranges between 200 and 800GiB (which is quite a lot 
of fluctuation already). Yesterday, we had suddenly had the pool fill 
up entirely and they only way to fix it was to add more capacity. I 
increased the pool size to 18TB by adding more SSDs and could resolve 
the problem. After a couple of hours of reshuffling, the pool size 
finally went back to 230GiB.


But then we had another fill-up tonight to 7.6TiB. Luckily, I had 
adjusted the weights so that not all disks could fill up entirely 
like last time, so it ended there.


I wasn't really able to identify the problem yesterday, but under the 
more controllable scenario today, I could check the MDS logs at 
debug_mds=10 and to me it seems like the problem is caused by 
snapshot trimming. The logs contain a lot of snapshot-related 
messages for paths that haven't been touched in a long time. The 
messages all look something like this:


May 31 09:16:48 XXX ceph-mds[2947525]: 2023-05-31T09:16:48.292+0200 
7f7ce1bd9700 10 mds.1.cache.ino(0x1000b3c3670) add_client_cap first 
cap, joining realm snaprealm(0x100 seq 1b1c lc 1b1b cr 1
b1b cps 2 snaps={185f=snap(185f 0x100 'monthly_20221201' 
2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x100 
'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941 
0x100 ...


May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.396+0200 
7f0e6a6ca700 10 mds.0.cache | |__ 3 rep [dir 
0x10218fe.10101* /storage/REDACTED/| ptrwaiter=0 request=0 
child=0 frozen=0 subtree=1 replicated=0 dirty=0 waiter=0 authpin=0 
tempexporting=0 0x5607759d9600]


May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.452+0200 
7f0e6a6ca700 10 mds.0.cache | | | 4 rep [dir 
0x10ff904.10001010* /storage/REDACTED/| ptrwaiter=0 request=0 
child=0 frozen=0 subtree=1 importing=0 replicated=0 waiter=0 
authpin=0 tempexporting=0 0x56034ed25200]


May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.716+0200 
7f0e6becd700 10 mds.0.server set_trace_dist snaprealm 
snaprealm(0x100 seq 1b1c lc 1b1b cr 1b1b cps 2 
snaps={185f=snap(185f 0x100 'monthly_20221201' 
2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x100 
'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941 
0x100 'monthly_20230201' 
2023-02-01T00:00:01.854059+0100),19a6=snap(19a6 0x100 
'monthly_20230301' 2023-03-01T00:00:01.215197+0100),1a24=snap(1a24 
0x100 'monthly_20230401'  ...) len=384


May 31 09:25:36 deltaweb055 ceph-mds[3268481]: 
2023-05-31T09:25:36.076+0200 7f0e6becd700 10 
mds.0.cache.ino(0x10004d74911) remove_client_cap last cap, leaving 
realm snaprealm(0x100 seq 1b1c lc 1b1b cr 1b1b cps 2 
snaps={185f=snap(185f 0x100 'monthly_20221201' 
2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x100 
'monthly_20230101'  ...)


The daily_*, monthly_* etc. names are the names of our regular snapshots.

I posted a larger log file snippet using ceph-post-file with the ID: 
da0eb93d-f340-4457-8a3f-434e8ef37d36


Is it possible that the MDS are trimming old snapshots without taking 
care not to fill up the entire metadata pool?


Cheers
Janek
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Seeking feedback on Improving cephadm bootstrap process

2023-05-31 Thread Patrick Begou
I'm a new Ceph user and I have some trouble with bootstrapping with 
cephadm: using Pacific or Quincy, no hard drives are detected by Ceph; 
using Octopus, all the hard drives are detected. As I do not know how to 
really clean up even a successful but non-functional install, each test 
requires a full reinstall of the node (it is a test node, so no problem 
except the time needed). A detailed (and working) cleanup (or uninstall) 
method (or command) for a Ceph deployment would be very helpful for a 
Ceph newbie.
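
For reference, the usual manual teardown of a test deployment is roughly the 
following; a hedged sketch (the fsid is a placeholder, the zap destroys data, 
and both steps have to be repeated on every host of the test cluster):

cephadm rm-cluster --force --fsid <fsid>
# wipe a disk that was previously prepared as an OSD; ceph-volume also tears
# down the LVM volumes it created
ceph-volume lvm zap --destroy /dev/sdX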


About how to do this: I'm using Proxmox for virtualization, and removing a 
VM via the web interface requires typing the ID of the VM again. Maybe 
Ceph could require the user to provide the cluster ID when running the 
command? Either in the command arguments (provided a new cluster always 
gets a different ID) or, as a double check, while the command is running.


Best regards,

Patrick

Le 30/05/2023 à 11:23, Frank Schilder a écrit :

What I have in mind is the case where the command is already in the shell history. A wrong 
history reference can execute a command with "--yes-i-really-mean-it" even though you 
really don't mean it. Been there. For an OSD this is maybe tolerable, but for an entire 
cluster ... not really. Some things need to be hard, to limit the blast radius of a typo 
(or an attacker).

For example, when issuing such a command the first time, the cluster could print a nonce 
that needs to be included in such a command to make it happen and which is only valid 
once for this exact command, so one actually needs to type something new every time to 
destroy stuff. An exception could be if a "safe-to-destroy" query for any 
daemon (pool etc.) returns true.

I would still not allow an entire cluster to be wiped with a single command. In 
a single step, only allow destroying what could be recovered in some way (there 
has to be some form of undo). And there should be notifications to all admins 
about what is going on to be able to catch malicious execution of destructive 
commands.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Nico Schottelius 
Sent: Tuesday, May 30, 2023 10:51 AM
To: Frank Schilder
Cc: Nico Schottelius; Redouane Kachach; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: Seeking feedback on Improving cephadm bootstrap 
process


Hey Frank,

in regards to destroying a cluster, I'd suggest to reuse the old
--yes-i-really-mean-it parameter, as it is already in use by ceph osd
destroy [0]. Then it doesn't matter whether it's prod or not, if you
really mean it ... ;-)

Best regards,

Nico

[0] https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/

Frank Schilder  writes:


Hi, I would like to second Nico's comment. What happened to the idea that a 
deployment tool should be idempotent? The most natural option would be:

1) start install -> something fails
2) fix problem
3) repeat exact same deploy command -> deployment picks up at current state 
(including cleaning up failed state markers) and tries to continue until next 
issue (go to 2)

I'm not sure (meaning: it's a terrible idea) if it's a good idea to
provide a single command to wipe a cluster. Just for the fat finger
syndrome. This seems safe only if it would be possible to mark a
cluster as production somehow (must be sticky, that is, cannot be
unset), which prevents a cluster destroy command (or any too dangerous
command) from executing. I understand the test case in the tracker,
but having such test-case utils that can run on a production cluster
and destroy everything seems a bit dangerous.

I think destroying a cluster should be a manual and tedious process
and figuring out how to do it should be part of the learning
experience. So my answer to "how do I start over" would be "go figure
it out, its an important lesson".

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Nico Schottelius 
Sent: Friday, May 26, 2023 10:40 PM
To: Redouane Kachach
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: Seeking feedback on Improving cephadm bootstrap 
process


Hello Redouane,

much appreciated kick-off for improving cephadm. I was wondering why
cephadm does not use a similar approach to rook in the sense of "repeat
until it is fixed?"

For the background, rook uses a controller that checks the state of the
cluster, the state of monitors, whether there are disks to be added,
etc. It periodically restarts the checks and when needed shifts
monitors, creates OSDs, etc.

My question is, why not have a daemon or checker subcommand of cephadm
that a) checks what the current cluster status is (i.e. cephadm
verify-cluster) and b) fixes the situation (i.e. cephadm 
verify-and-fix-cluster)?

I think that option would be much more beneficial than the other two
suggested ones.

Best regards,

Nico


--
Sustainable and modern Infrastructures by ungleich.ch

[ceph-users] Re: CephFS metadata pool grows by two orders of magnitude while trimming (?) snapshots

2023-05-31 Thread Janek Bevendorff

Forgot to add: We are still on Nautilus (16.2.12).


On 31/05/2023 09:53, Janek Bevendorff wrote:

Hi,

Perhaps this is a known issue and I was simply too dumb to find it, 
but we are having problems with our CephFS metadata pool filling up 
over night.


Our cluster has a small SSD pool of around 15TB which hosts our CephFS 
metadata pool. Usually, that's more than enough. The normal size of 
the pool ranges between 200 and 800GiB (which is quite a lot of 
fluctuation already). Yesterday, we suddenly had the pool fill up 
entirely and the only way to fix it was to add more capacity. I 
increased the pool size to 18TB by adding more SSDs and could resolve 
the problem. After a couple of hours of reshuffling, the pool size 
finally went back to 230GiB.


But then we had another fill-up tonight to 7.6TiB. Luckily, I had 
adjusted the weights so that not all disks could fill up entirely like 
last time, so it ended there.


I wasn't really able to identify the problem yesterday, but under the 
more controllable scenario today, I could check the MDS logs at 
debug_mds=10 and to me it seems like the problem is caused by snapshot 
trimming. The logs contain a lot of snapshot-related messages for 
paths that haven't been touched in a long time. The messages all look 
something like this:


May 31 09:16:48 XXX ceph-mds[2947525]: 2023-05-31T09:16:48.292+0200 
7f7ce1bd9700 10 mds.1.cache.ino(0x1000b3c3670) add_client_cap first 
cap, joining realm snaprealm(0x100 seq 1b1c lc 1b1b cr 1
b1b cps 2 snaps={185f=snap(185f 0x100 'monthly_20221201' 
2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x100 
'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941 
0x100 ...


May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.396+0200 
7f0e6a6ca700 10 mds.0.cache | |__ 3 rep [dir 
0x10218fe.10101* /storage/REDACTED/| ptrwaiter=0 request=0 
child=0 frozen=0 subtree=1 replicated=0 dirty=0 waiter=0 authpin=0 
tempexporting=0 0x5607759d9600]


May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.452+0200 
7f0e6a6ca700 10 mds.0.cache | | | 4 rep [dir 
0x10ff904.10001010* /storage/REDACTED/| ptrwaiter=0 request=0 
child=0 frozen=0 subtree=1 importing=0 replicated=0 waiter=0 authpin=0 
tempexporting=0 0x56034ed25200]


May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.716+0200 
7f0e6becd700 10 mds.0.server set_trace_dist snaprealm 
snaprealm(0x100 seq 1b1c lc 1b1b cr 1b1b cps 2 
snaps={185f=snap(185f 0x100 'monthly_20221201' 
2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x100 
'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941 
0x100 'monthly_20230201' 
2023-02-01T00:00:01.854059+0100),19a6=snap(19a6 0x100 
'monthly_20230301' 2023-03-01T00:00:01.215197+0100),1a24=snap(1a24 
0x100 'monthly_20230401'  ...) len=384


May 31 09:25:36 deltaweb055 ceph-mds[3268481]: 
2023-05-31T09:25:36.076+0200 7f0e6becd700 10 
mds.0.cache.ino(0x10004d74911) remove_client_cap last cap, leaving 
realm snaprealm(0x100 seq 1b1c lc 1b1b cr 1b1b cps 2 
snaps={185f=snap(185f 0x100 'monthly_20221201' 
2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x100 
'monthly_20230101'  ...)


The daily_*, monthly_* etc. names are the names of our regular snapshots.

I posted a larger log file snippet using ceph-post-file with the ID: 
da0eb93d-f340-4457-8a3f-434e8ef37d36


Is it possible that the MDS are trimming old snapshots without taking 
care not to fill up the entire metadata pool?


Cheers
Janek
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] CephFS metadata pool grows by two orders of magnitude while trimming (?) snapshots

2023-05-31 Thread Janek Bevendorff

Hi,

Perhaps this is a known issue and I was simply too dumb to find it, but 
we are having problems with our CephFS metadata pool filling up over night.


Our cluster has a small SSD pool of around 15TB which hosts our CephFS 
metadata pool. Usually, that's more than enough. The normal size of the 
pool ranges between 200 and 800GiB (which is quite a lot of fluctuation 
already). Yesterday, we had suddenly had the pool fill up entirely and 
they only way to fix it was to add more capacity. I increased the pool 
size to 18TB by adding more SSDs and could resolve the problem. After a 
couple of hours of reshuffling, the pool size finally went back to 230GiB.


But then we had another fill-up tonight to 7.6TiB. Luckily, I had 
adjusted the weights so that not all disks could fill up entirely like 
last time, so it ended there.


I wasn't really able to identify the problem yesterday, but under the 
more controllable scenario today, I could check the MDS logs at 
debug_mds=10 and to me it seems like the problem is caused by snapshot 
trimming. The logs contain a lot of snapshot-related messages for paths 
that haven't been touched in a long time. The messages all look 
something like this:


May 31 09:16:48 XXX ceph-mds[2947525]: 2023-05-31T09:16:48.292+0200 
7f7ce1bd9700 10 mds.1.cache.ino(0x1000b3c3670) add_client_cap first cap, 
joining realm snaprealm(0x100 seq 1b1c lc 1b1b cr 1
b1b cps 2 snaps={185f=snap(185f 0x100 'monthly_20221201' 
2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x100 
'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941 
0x100 ...


May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.396+0200 
7f0e6a6ca700 10 mds.0.cache | |__ 3 rep [dir 
0x10218fe.10101* /storage/REDACTED/| ptrwaiter=0 request=0 
child=0 frozen=0 subtree=1 replicated=0 dirty=0 waiter=0 authpin=0 
tempexporting=0 0x5607759d9600]


May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.452+0200 
7f0e6a6ca700 10 mds.0.cache | | | 4 rep [dir 
0x10ff904.10001010* /storage/REDACTED/| ptrwaiter=0 request=0 
child=0 frozen=0 subtree=1 importing=0 replicated=0 waiter=0 authpin=0 
tempexporting=0 0x56034ed25200]


May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.716+0200 
7f0e6becd700 10 mds.0.server set_trace_dist snaprealm 
snaprealm(0x100 seq 1b1c lc 1b1b cr 1b1b cps 2 
snaps={185f=snap(185f 0x100 'monthly_20221201' 
2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x100 
'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941 
0x100 'monthly_20230201' 
2023-02-01T00:00:01.854059+0100),19a6=snap(19a6 0x100 
'monthly_20230301' 2023-03-01T00:00:01.215197+0100),1a24=snap(1a24 
0x100 'monthly_20230401'  ...) len=384


May 31 09:25:36 deltaweb055 ceph-mds[3268481]: 
2023-05-31T09:25:36.076+0200 7f0e6becd700 10 
mds.0.cache.ino(0x10004d74911) remove_client_cap last cap, leaving realm 
snaprealm(0x100 seq 1b1c lc 1b1b cr 1b1b cps 2 
snaps={185f=snap(185f 0x100 'monthly_20221201' 
2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x100 
'monthly_20230101'  ...)


The daily_*, monthly_* etc. names are the names of our regular snapshots.

I posted a larger log file snippet using ceph-post-file with the ID: 
da0eb93d-f340-4457-8a3f-434e8ef37d36


Is it possible that the MDS are trimming old snapshots without taking 
care not to fill up the entire metadata pool?


Cheers
Janek
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io