[ceph-users] Re: Lousy recovery for mclock and reef

2024-05-25 Thread Zakhar Kirpichenko
Hi! Could you please elaborate what you meant by "adding another disc to the recovery process"? /Z On Sat, 25 May 2024, 22:49 Mazzystr, wrote: > Well this was an interesting journey through the bowels of Ceph. I have > about 6 hours into tweaking every setting imaginable just to circle back

[ceph-users] Re: Remove failed OSD

2024-05-04 Thread Zakhar Kirpichenko
I ended up manually cleaning up the OSD host, removing stale LVs and DM entries, and then purging the OSD with `ceph osd purge osd.19`. Looks like it's gone for good. /Z On Sat, 4 May 2024 at 08:29, Zakhar Kirpichenko wrote: > Hi! > > An OSD failed in our 16.2.15 cluster. I
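
For reference, the cleanup described above usually boils down to the commands below; the device path is a placeholder and must match the failed drive on the OSD host:
```
# Remove the OSD from the CRUSH map, auth database and OSD map in one step:
ceph osd purge osd.19 --yes-i-really-mean-it
# If stale LVs/DM entries remain on the host, zap the old device (placeholder path):
ceph orch device zap ceph02 /dev/sdX --force
```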

[ceph-users] Remove failed OSD

2024-05-03 Thread Zakhar Kirpichenko
Hi! An OSD failed in our 16.2.15 cluster. I prepared it for removal and ran `ceph orch daemon rm osd.19 --force`. Somehow that didn't work as expected, so now we still have osd.19 in the crush map: -10 122.66965 host ceph02 19 1.0 osd.19

[ceph-users] Re: [EXTERN] Re: Ceph 16.2.x mon compactions, disk writes

2024-04-16 Thread Zakhar Kirpichenko
> > Zitat von Eugen Block : > > > You can use the extra container arguments I pointed out a few months > > ago. Those work in my test clusters, although I haven’t enabled that > > in production yet. But it shouldn’t make a difference if it’s a test > > cluster or not.

[ceph-users] Re: [EXTERN] Re: Ceph 16.2.x mon compactions, disk writes

2024-04-16 Thread Zakhar Kirpichenko
ut issues? > Do you know if this works also with reef (we see massive writes as well > there)? > > Can you briefly tabulate the commands you used to persistently set the > compression options? > > Thanks so much, > >Dietmar > > > On 10/18/23 06:14, Zakhar K

[ceph-users] Re: cephadm: daemon osd.x on yyy is in error state

2024-04-06 Thread Zakhar Kirpichenko
Well, I've replaced the failed drives and that cleared the error. Arguably, it was a better solution :-) /Z On Sat, 6 Apr 2024 at 10:13, wrote: > did it help? Maybe you found a better solution? > ___ > ceph-users mailing list -- ceph-users@ceph.io >

[ceph-users] Re: Pacific 16.2.15 `osd noin`

2024-04-04 Thread Zakhar Kirpichenko
cluster as well, but of course you'd have to reweight > new OSDs manually. > > Regards, > Eugen > > Zitat von Zakhar Kirpichenko : > > > Any comments regarding `osd noin`, please? > > > > /Z > > > > On Tue, 2 Apr 2024 at 16:09, Zakhar Kirpiche

[ceph-users] Re: Pacific 16.2.15 `osd noin`

2024-04-04 Thread Zakhar Kirpichenko
Thanks, this is a good suggestion! /Z On Thu, 4 Apr 2024 at 10:29, Janne Johansson wrote: > Den tors 4 apr. 2024 kl 06:11 skrev Zakhar Kirpichenko : > > Any comments regarding `osd noin`, please? > > > > > > I'm adding a few OSDs to an existing cluster, the cluster i

[ceph-users] Re: Pacific 16.2.15 `osd noin`

2024-04-03 Thread Zakhar Kirpichenko
Any comments regarding `osd noin`, please? /Z On Tue, 2 Apr 2024 at 16:09, Zakhar Kirpichenko wrote: > Hi, > > I'm adding a few OSDs to an existing cluster, the cluster is running with > `osd noout,noin`: > > cluster: > id: 3f50555a-ae2a-11eb-a2fc-ffde

[ceph-users] Re: Replace block drives of combined NVME+HDD OSDs

2024-04-02 Thread Zakhar Kirpichenko
d drives > it should work. I also don't expect an impact on the rest of the OSDs > (except for backfilling, of course). > > Regards, > Eugen > > [1] https://docs.ceph.com/en/latest/cephadm/services/osd/#replacing-an-osd > > Zitat von Zakhar Kirpichenko : > > >

[ceph-users] Pacific 16.2.15 `osd noin`

2024-04-02 Thread Zakhar Kirpichenko
Hi, I'm adding a few OSDs to an existing cluster, the cluster is running with `osd noout,noin`: cluster: id: 3f50555a-ae2a-11eb-a2fc-ffde44714d86 health: HEALTH_WARN noout,noin flag(s) set Specifically `noin` is documented as "prevents booting OSDs from being marked
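
For context, a minimal sketch of working with these flags; the OSD id is a placeholder:
```
ceph osd dump | grep flags    # shows noout,noin while they are set
ceph osd unset noin           # allow newly booted OSDs to be marked "in" again
ceph osd in osd.42            # or mark a specific OSD in manually (placeholder id)
```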

[ceph-users] Replace block drives of combined NVME+HDD OSDs

2024-04-01 Thread Zakhar Kirpichenko
Hi, Unfortunately, some of our HDDs failed and we need to replace these drives which are parts of "combined" OSDs (DB/WAL on NVME, block storage on HDD). All OSDs are defined with a service definition similar to this one: ``` service_type: osd service_id: ceph02_combined_osd service_name:
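
A minimal sketch of the replacement flow for such a combined OSD, assuming the NVME DB/WAL device stays in place; the OSD id and device path are placeholders, and the leftover DB/WAL LV may also need to be zapped before redeployment:
```
ceph orch osd rm 19 --replace                  # drain and mark osd.19 "destroyed", keeping its id
ceph orch device zap ceph02 /dev/sdX --force   # wipe the replacement HDD once it is inserted
# with the service spec left managed, cephadm should then recreate osd.19 from the spec
```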

[ceph-users] cephadm: daemon osd.x on yyy is in error state

2024-03-30 Thread Zakhar Kirpichenko
Hi, A disk failed in our cephadm-managed 16.2.15 cluster, the affected OSD is down, out and stopped with cephadm, I also removed the failed drive from the host's service definition. The cluster has finished recovering but the following warning persists: [WRN] CEPHADM_FAILED_DAEMON: 1 failed

[ceph-users] Re: Upgraded 16.2.14 to 16.2.15

2024-03-05 Thread Zakhar Kirpichenko
se it's an option that has to be present *during* mon > startup, not *after* the startup when it can read the config store. > > Zitat von Zakhar Kirpichenko : > > > Hi Eugen, > > > > It is correct that I manually added the configuration, but not to the > > unit.run b

[ceph-users] Re: Upgraded 16.2.14 to 16.2.15

2024-03-05 Thread Zakhar Kirpichenko
true,bottommost_compression=kLZ4HCCompression,max_background_jobs=4,max_subcompactions=2' > > Regards, > Eugen > > Zitat von Zakhar Kirpichenko : > > > Hi, > > > > I have upgraded my test and production cephadm-managed clusters from > > 16.2.14 to 16.2.15. The u

[ceph-users] Upgraded 16.2.14 to 16.2.15

2024-03-04 Thread Zakhar Kirpichenko
Hi, I have upgraded my test and production cephadm-managed clusters from 16.2.14 to 16.2.15. The upgrade was smooth and completed without issues. There were a few things which I noticed after each upgrade: 1. RocksDB options, which I provided to each mon via their configuration files, got

[ceph-users] Re: v16.2.15 Pacific released

2024-03-04 Thread Zakhar Kirpichenko
This is great news! Many thanks! /Z On Mon, 4 Mar 2024 at 17:25, Yuri Weinstein wrote: > We're happy to announce the 15th, and expected to be the last, > backport release in the Pacific series. > > https://ceph.io/en/news/blog/2024/v16-2-15-pacific-released/ > > Notable Changes >

[ceph-users] What's up with 16.2.15?

2024-02-29 Thread Zakhar Kirpichenko
Hi, We randomly got several Pacific package updates to 16.2.15 available for Ubuntu 20.04. As far as I can see, 16.2.15 hasn't been released and there's been no release announcement. The updates seem to be no longer available. What's going on with 16.2.15? /Z

[ceph-users] Re: cephadm Failed to apply 1 service(s)

2024-02-16 Thread Zakhar Kirpichenko
and apply it: > > ceph orch apply -i new-drivegroup.yml > > Zitat von Zakhar Kirpichenko : > > > Many thanks for your response, Eugen! > > > > I tried to fail mgr twice, unfortunately that had no effect on the issue. > > Neither `cephadm ceph-volume inventory

[ceph-users] Re: cephadm Failed to apply 1 service(s)

2024-02-16 Thread Zakhar Kirpichenko
Answering my own question: I exported the spec, removed the failed drive and re-applied the spec again; the spec appears to have been updated correctly and the warning is gone. /Z On Fri, 16 Feb 2024 at 14:33, Zakhar Kirpichenko wrote: > Many thanks for your response, Eugen! > >
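
Roughly the sequence described above, as a sketch (the file name is arbitrary):
```
ceph orch ls osd --export > osd-spec.yml
# edit osd-spec.yml and remove the entry for the failed /dev/sde
ceph orch apply -i osd-spec.yml
ceph health detail    # the "Failed to apply 1 service(s)" warning should clear
```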

[ceph-users] Re: cephadm Failed to apply 1 service(s)

2024-02-16 Thread Zakhar Kirpichenko
orch ls --export > > Does it contain specific device paths or something? Does 'cephadm ls' > on that node show any traces of the previous OSD? > I'd probably try to check some things like > > cephadm ceph-volume inventory > ceph device ls-by-host > > Regards, > E

[ceph-users] cephadm Failed to apply 1 service(s)

2024-02-16 Thread Zakhar Kirpichenko
Hi, We had a physical drive malfunction in one of our Ceph OSD hosts managed by cephadm (Ceph 16.2.14). I have removed the drive from the system, and the kernel no longer sees it: ceph03 ~]# ls -al /dev/sde ls: cannot access '/dev/sde': No such file or directory I have removed the corresponding

[ceph-users] Re: pacific 16.2.15 QE validation status

2024-02-07 Thread Zakhar Kirpichenko
Indeed, it looks like it's been recently reopened. Thanks for this! /Z On Wed, 7 Feb 2024 at 15:43, David Orman wrote: > That tracker's last update indicates it's slated for inclusion. > > On Thu, Feb 1, 2024, at 10:47, Zakhar Kirpichenko wrote: > > Hi, > > > >

[ceph-users] Re: pacific 16.2.15 QE validation status

2024-02-01 Thread Zakhar Kirpichenko
Hi, Please consider not leaving this behind: https://github.com/ceph/ceph/pull/55109 It's a serious bug which potentially affects the stability of a whole node if the affected mgr is colocated with OSDs. The bug has been known for quite a while and really shouldn't be left unfixed. /Z On Thu, 1 Feb 2024

[ceph-users] Re: Ceph OSD reported Slow operations

2024-01-28 Thread Zakhar Kirpichenko
ere any zombie or unwanted process that make the ceph cluster busy or > the IOPS budget of the disk that makes the cluster busy? > > > On November 4, 2023 at 4:29 PM Zakhar Kirpichenko > wrote: > > You have an IOPS budget, i.e. how much I/O your spinners can deliver. > Space

[ceph-users] Re: Ceph 16.2.14: ceph-mgr getting oom-killed

2024-01-24 Thread Zakhar Kirpichenko
I have to say that not including a fix for a serious issue in the last minor release of Pacific is a rather odd decision. /Z On Thu, 25 Jan 2024 at 09:00, Konstantin Shalygin wrote: > Hi, > > The backport to pacific was rejected [1], you may switch to reef, when [2] > merged and released > >

[ceph-users] Re: Ceph 16.2.14: ceph-mgr getting oom-killed

2024-01-24 Thread Zakhar Kirpichenko
I found that quickly restarting the affected mgr every 2 days is an okay kludge. It takes less than a second to restart, and the mgr never grows to the dangerous sizes at which it randomly starts ballooning. /Z On Thu, 25 Jan 2024, 03:12 changzhi tan, <544463...@qq.com> wrote: > Is there any way to
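
A sketch of such a periodic-restart kludge as a cron entry; the mgr daemon name is the one mentioned elsewhere in these threads and should be replaced with the output of `ceph orch ps` on your cluster:
```
# /etc/cron.d/ceph-mgr-restart -- restart the affected mgr every 2 days at 04:00
0 4 */2 * * root /usr/bin/ceph orch daemon restart mgr.ceph01.vankui
```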

[ceph-users] Re: Ceph 16.2.14: ceph-mgr getting oom-killed

2023-12-18 Thread Zakhar Kirpichenko
thoughts? /Z On Mon, 11 Dec 2023 at 12:34, Zakhar Kirpichenko wrote: > Hi, > > Another update: after 2 more weeks the mgr process grew to ~1.5 GB, which > again was expected: > > mgr.ceph01.vankui ceph01 *:8443,9283 running (2w)102s ago 2y > 1519M-

[ceph-users] Re: Ceph 16.2.14: ceph-mgr getting oom-killed

2023-12-11 Thread Zakhar Kirpichenko
112M- 16.2.14 fc0182d6cda5 1c3d2d83b6df The cluster is healthy and operating normally, the mgr process is growing slowly. It's still unclear what caused the ballooning and OOM issue under very similar conditions. /Z On Sat, 25 Nov 2023 at 08:31, Zakhar Kirpichenko wrote: >

[ceph-users] Re: Ceph 16.2.14: osd crash, bdev() _aio_thread got r=-1 ((1) Operation not permitted)

2023-12-05 Thread Zakhar Kirpichenko
now. I've already checked the file descriptor numbers, the defaults already are very high and the usage is rather low. /Z On Wed, 6 Dec 2023 at 03:24, Tyler Stachecki wrote: > On Tue, Dec 5, 2023 at 10:13 AM Zakhar Kirpichenko > wrote: > > > > Any input from anyone? &g

[ceph-users] Re: Ceph 16.2.14: osd crash, bdev() _aio_thread got r=-1 ((1) Operation not permitted)

2023-12-05 Thread Zakhar Kirpichenko
Any input from anyone? /Z On Mon, 4 Dec 2023 at 12:52, Zakhar Kirpichenko wrote: > Hi, > > Just to reiterate, I'm referring to an OSD crash loop because of the > following error: > > "2023-12-03T04:00:36.686+ 7f08520e2700 -1 bdev(0x55f02a28a400 > /var/

[ceph-users] Re: Ceph 16.2.14: osd crash, bdev() _aio_thread got r=-1 ((1) Operation not permitted)

2023-12-04 Thread Zakhar Kirpichenko
ideas? /Z On Sun, 3 Dec 2023 at 16:09, Zakhar Kirpichenko wrote: > Thanks! The bug I referenced is the reason for the 1st OSD crash, but not > for the subsequent crashes. The reason for those is described where you > . I'm asking for help with that one. > > /Z > > On Sun, 3 Dec

[ceph-users] Re: Ceph 16.2.14: osd crash, bdev() _aio_thread got r=-1 ((1) Operation not permitted)

2023-12-03 Thread Zakhar Kirpichenko
Thanks! The bug I referenced is the reason for the 1st OSD crash, but not for the subsequent crashes. The reason for those is described where you . I'm asking for help with that one. /Z On Sun, 3 Dec 2023 at 15:31, Kai Stian Olstad wrote: > On Sun, Dec 03, 2023 at 06:53:08AM +0200, Zak

[ceph-users] Ceph 16.2.14: osd crash, bdev() _aio_thread got r=-1 ((1) Operation not permitted)

2023-12-02 Thread Zakhar Kirpichenko
Hi, One of our 16.2.14 cluster OSDs crashed again because of the dreaded https://tracker.ceph.com/issues/53906 bug. Usually an OSD, which crashed because of this bug, restarts within seconds and continues normal operation. This time it failed to restart and kept crashing: "assert_condition":

[ceph-users] Re: Ceph 16.2.14: ceph-mgr getting oom-killed

2023-11-24 Thread Zakhar Kirpichenko
days, which likely means that whatever triggers the issue happens randomly and quite suddenly. I'll continue monitoring the mgr and get back with more observations. /Z On Wed, 22 Nov 2023 at 16:33, Zakhar Kirpichenko wrote: > Thanks for this. This looks similar to what we're observing. Altho

[ceph-users] Re: cephadm vs ceph.conf

2023-11-23 Thread Zakhar Kirpichenko
Hi, Please note that there are cases where the use of ceph.conf inside a container is justified. For example, I was unable to set the monitor's mon_rocksdb_options by any means except for providing them in the monitor's own ceph.conf within the container; all other attempts to pass these settings were

[ceph-users] Re: Ceph 16.2.14: ceph-mgr getting oom-killed

2023-11-22 Thread Zakhar Kirpichenko
We use podman, could it > > be some docker restriction? > > > > Zitat von Zakhar Kirpichenko : > > > >> It's a 6-node cluster with 96 OSDs, not much I/O, mgr . Each node has > >> 384 > >> GB of RAM, each OSD has a memory target of 16 GB, about 100

[ceph-users] Re: Ceph 16.2.14: ceph-mgr getting oom-killed

2023-11-22 Thread Zakhar Kirpichenko
; > Zitat von Zakhar Kirpichenko : > > > It's a 6-node cluster with 96 OSDs, not much I/O, mgr . Each node has 384 > > GB of RAM, each OSD has a memory target of 16 GB, about 100 GB of memory, > > give or take, is available (mostly used by page cache) on each node > during &

[ceph-users] Re: Ceph 16.2.14: ceph-mgr getting oom-killed

2023-11-22 Thread Zakhar Kirpichenko
> COMMAND > 6077 ceph 20 0 6357560 4,522g 22316 S 12,00 1,797 > 57022:54 ceph-mgr > > In our own cluster (smaller than that and not really heavily used) the > mgr uses almost 2 GB. So those numbers you have seem relatively small. > > Zitat von Zakhar Kirpichenko

[ceph-users] Re: Ceph 16.2.14: ceph-mgr getting oom-killed

2023-11-22 Thread Zakhar Kirpichenko
2023 at 13:07, Eugen Block wrote: > I see these progress messages all the time, I don't think they cause > it, but I might be wrong. You can disable it just to rule that out. > > Zitat von Zakhar Kirpichenko : > > > Unfortunately, I don't have a full stack trace because ther

[ceph-users] Re: Ceph 16.2.14: ceph-mgr getting oom-killed

2023-11-22 Thread Zakhar Kirpichenko
ock wrote: > Do you have the full stack trace? The pastebin only contains the > "tcmalloc: large alloc" messages (same as in the tracker issue). Maybe > comment in the tracker issue directly since Radek asked for someone > with a similar problem in a newer release.

[ceph-users] Re: Ceph 16.2.14: ceph-mgr getting oom-killed

2023-11-21 Thread Zakhar Kirpichenko
n’t this quite similar? > > https://tracker.ceph.com/issues/45136 > > Zitat von Zakhar Kirpichenko : > > > Hi, > > > > I'm facing a rather new issue with our Ceph cluster: from time to time > > ceph-mgr on one of the two mgr nodes gets oom-killed after consum

[ceph-users] Ceph 16.2.14: ceph-mgr getting oom-killed

2023-11-21 Thread Zakhar Kirpichenko
Hi, I'm facing a rather new issue with our Ceph cluster: from time to time ceph-mgr on one of the two mgr nodes gets oom-killed after consuming over 100 GB RAM: [Nov21 15:02] tp_osd_tp invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0 [ +0.10]

[ceph-users] Re: [CEPH] OSD Memory Usage

2023-11-16 Thread Zakhar Kirpichenko
running (3d) 7m ago > 12d3039M4096M 17.2.6 90a2664234e1 16e04a1da987 > osd.34 sg-osd02 running (3d) 7m ago > 12d2434M4096M 17.2.6 90a2664234e1 014076e28182 > > > > btw as you said, I feel this value does

[ceph-users] Re: [CEPH] OSD Memory Usage

2023-11-15 Thread Zakhar Kirpichenko
out memory leak. A nice man, @Anthony D'Atri , > on this forum helped me to understand that it wont help to limit OSD usage. > > I set it to 1GB because I want to see how this option works. > > I will read and test with caches options. > > Nguyen Huu Khoi > > > On Thu

[ceph-users] Re: [CEPH] OSD Memory Usage

2023-11-15 Thread Zakhar Kirpichenko
Hi, osd_memory_target is a "target", i.e. an OSD makes an effort to consume no more than the specified amount of RAM, but won't consume less than required for its operation and caches, which have minimum values such as osd_memory_cache_min, bluestore_cache_size, bluestore_cache_size_hdd,
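
For illustration, the target can be set cluster-wide or overridden per OSD (values are in bytes; the ids below are placeholders):
```
ceph config set osd osd_memory_target 4294967296       # 4 GiB for all OSDs
ceph config set osd.34 osd_memory_target 8589934592    # override a single OSD
ceph config get osd.34 osd_memory_target
```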

[ceph-users] Re: Ceph OSD reported Slow operations

2023-11-08 Thread Zakhar Kirpichenko
Take hints from this: "544 pgs not deep-scrubbed in time". Your OSDs are unable to scrub their data in time, likely because they cannot cope with the client + scrubbing I/O. I.e. there's too much data on too few and too slow spindles. You can play with osd_deep_scrub_interval and increase the
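
As a sketch, relaxing the deep-scrub interval only eases the warning; it does not add IOPS:
```
# default osd_deep_scrub_interval is 604800 s (one week); stretch it to two weeks:
ceph config set osd osd_deep_scrub_interval 1209600
```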

[ceph-users] Re: Ceph OSD reported Slow operations

2023-11-06 Thread Zakhar Kirpichenko
; Is there any parameter in the ceph osd log and ceph mon log that gives me > the clue for the cluster business? > Is there any zombie or unwanted process that make the ceph cluster busy or > the IOPS budget of the disk that makes the cluster busy? > > > On November 4, 2023 at 4:29 P

[ceph-users] Re: Ceph OSD reported Slow operations

2023-11-04 Thread Zakhar Kirpichenko
orage of 1.6 TB free in each of my OSD, > that will not help in my IOPS issue right? > Please guide me > > On November 2, 2023 at 12:47 PM Zakhar Kirpichenko > wrote: > > >1. The calculated IOPS is for the rw operation right ? > > Total drive IOPS, read or write. Depending

[ceph-users] Re: Ceph OSD reported Slow operations

2023-11-02 Thread Zakhar Kirpichenko
om the output of ceph osd df tree that is count of > pgs(45/OSD) and use% (65 to 67%). Is that not significant? > Correct me if my queries are irrelevant > > > > On November 2, 2023 at 11:36 AM Zakhar Kirpichenko > wrote: > > Sure, it's 36 OSDs at 200 IOPS each (tops, like
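
A back-of-the-envelope version of that budget: 36 HDD OSDs × ~200 IOPS ≈ 7200 raw IOPS for the whole cluster; assuming 3x replication (an assumption, not stated in the thread), that leaves roughly 7200 / 3 ≈ 2400 sustained client write IOPS, with scrubbing and recovery competing for the same budget.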

[ceph-users] Re: Ceph OSD reported Slow operations

2023-11-02 Thread Zakhar Kirpichenko
busy and OSDs aren't coping. Also your nodes are not balanced. /Z On Thu, 2 Nov 2023 at 07:33, V A Prabha wrote: > Can you please elaborate your identifications and the statement . > > > On November 2, 2023 at 9:40 AM Zakhar Kirpichenko > wrote: > > I'm afraid you're s

[ceph-users] Re: Ceph OSD reported Slow operations

2023-11-01 Thread Zakhar Kirpichenko
I'm afraid you're simply hitting the I/O limits of your disks. /Z On Thu, 2 Nov 2023 at 03:40, V A Prabha wrote: > Hi Eugen > Please find the details below > > > root@meghdootctr1:/var/log/ceph# ceph -s > cluster: > id: c59da971-57d1-43bd-b2b7-865d392412a5 > health: HEALTH_WARN >

[ceph-users] Re: Ceph 16.2.14: pgmap updated every few seconds for no apparent reason

2023-10-25 Thread Zakhar Kirpichenko
tick_period 10 > > > > Regards, > > Eugen > > > > Zitat von Chris Palmer : > > > >> I have just checked 2 quincy 17.2.6 clusters, and I see exactly the > >> same. The pgmap version is bumping every two seconds (which ties in > >> with th

[ceph-users] Re: Ceph 16.2.14: OSDs randomly crash in bstore_kv_sync

2023-10-20 Thread Zakhar Kirpichenko
rst. > > Alternatively you might consider building updated code yourself and make > patched binaries on top of .14... > > > Thanks, > > Igor > > > On 20/10/2023 15:10, Zakhar Kirpichenko wrote: > > Thank you, Igor. > > It is somewhat disappointing

[ceph-users] Re: Ceph 16.2.14: OSDs randomly crash in bstore_kv_sync

2023-10-20 Thread Zakhar Kirpichenko
This should be coupled with enabling >> > 'level_compaction_dynamic_level_bytes' mode in RocksDB - there is pretty >> > good spec on applying this mode to BlueStore attached to >> > https://github.com/ceph/ceph/pull/37156. >> > >> > >> > T

[ceph-users] Re: Ceph 16.2.14: OSDs randomly crash in bstore_kv_sync

2023-10-20 Thread Zakhar Kirpichenko
fast" mode. This should be coupled with enabling > 'level_compaction_dynamic_level_bytes' mode in RocksDB - there is pretty > good spec on applying this mode to BlueStore attached to > https://github.com/ceph/ceph/pull/37156. > > > Thanks, > > Igor > On 20/10/2

[ceph-users] Re: Ceph 16.2.14: OSDs randomly crash in bstore_kv_sync

2023-10-19 Thread Zakhar Kirpichenko
16/10/2023 14:13, Zakhar Kirpichenko wrote: > > Many thanks, Igor. I found previously submitted bug reports and subscribed > to them. My understanding is that the issue is going to be fixed in the > next Pacific minor release. > > /Z > > On Mon, 16 Oct 2023 at 14:03, Igor

[ceph-users] Re: Ceph 16.2.14: pgmap updated every few seconds for no apparent reason

2023-10-19 Thread Zakhar Kirpichenko
ers are healthy with > > nothing apart from client IO happening. > > > > On 13/10/2023 12:09, Zakhar Kirpichenko wrote: > >> Hi, > >> > >> I am investigating excessive mon writes in our cluster and wondering > >> whether excessive pgmap updates could be t

[ceph-users] Re: Ceph 16.2.x mon compactions, disk writes

2023-10-18 Thread Zakhar Kirpichenko
> > since its a bit beyond of the scope of basic, could you please post the > complete ceph.conf config section for these changes for reference? > > Thanks! > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > _________

[ceph-users] Re: Ceph 16.2.x mon compactions, disk writes

2023-10-17 Thread Zakhar Kirpichenko
confirm > that compression works well for MONs too, compression could be enabled > by default as well. > > Regards, > Eugen > > https://tracker.ceph.com/issues/63229 > > Zitat von Zakhar Kirpichenko : > > > With the help of community members, I managed to enable RocksDB >

[ceph-users] Re: Ceph 16.2.14: how to set mon_rocksdb_options to enable RocksDB compression?

2023-10-17 Thread Zakhar Kirpichenko
with the config database and which require this > extra-entrypoint-argument. > > Thanks again, Mykola! > Eugen > > [1] > > https://docs.ceph.com/en/quincy/cephadm/services/#extra-entrypoint-arguments > > Zitat von Zakhar Kirpichenko : > > > Thanks for the sugg
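
A sketch of the extra-entrypoint-argument approach referenced above; the spec layout follows the linked cephadm documentation, the RocksDB option string is the one discussed in these threads (the options before `bottommost_compression` and the placement count are assumptions):
```
cat > mon-spec.yml <<'EOF'
service_type: mon
placement:
  count: 5
extra_entrypoint_args:
  - '--mon-rocksdb-options=compression=kLZ4Compression,level_compaction_dynamic_level_bytes=true,bottommost_compression=kLZ4HCCompression,max_background_jobs=4,max_subcompactions=2'
EOF
ceph orch apply -i mon-spec.yml
```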

[ceph-users] Re: Ceph 16.2.x mon compactions, disk writes

2023-10-16 Thread Zakhar Kirpichenko
adding compression to other monitors. /Z On Mon, 16 Oct 2023 at 14:57, Zakhar Kirpichenko wrote: > The issue persists, although to a lesser extent. Any comments from the > Ceph team please? > > /Z > > On Fri, 13 Oct 2023 at 20:51, Zakhar Kirpichenko wrote: > >> &

[ceph-users] Re: Ceph 16.2.14: how to set mon_rocksdb_options to enable RocksDB compression?

2023-10-16 Thread Zakhar Kirpichenko
set. The reason I think this is that rocksdb mount options are needed > _before_ the mon is able to access any of the centralized conf data, > which I believe is itself stored in rocksdb. > > Josh > > On Sun, Oct 15, 2023 at 10:29 PM Zakhar Kirpichenko > wrote: > > > > Out of curi

[ceph-users] Re: Ceph 16.2.x mon compactions, disk writes

2023-10-16 Thread Zakhar Kirpichenko
The issue persists, although to a lesser extent. Any comments from the Ceph team please? /Z On Fri, 13 Oct 2023 at 20:51, Zakhar Kirpichenko wrote: > > Some of it is transferable to RocksDB on mons nonetheless. > > Please point me to relevant Ceph documentation, i.e. a descri

[ceph-users] Re: Ceph 16.2.14: OSDs randomly crash in bstore_kv_sync

2023-10-16 Thread Zakhar Kirpichenko
e similar issue at: > > https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/YNJ35HXN4HXF4XWB6IOZ2RKXX7EQCEIY/ > > > Thanks, > > Igor > > On 16/10/2023 09:26, Zakhar Kirpichenko wrote: > > Hi, > > > > After upgrading to Ceph 16.2.14 w

[ceph-users] Re: Ceph 16.2.14: OSDs randomly crash in bstore_kv_sync

2023-10-16 Thread Zakhar Kirpichenko
ed: 17024548864 unmapped: 4164534272 heap: 21189083136 old mem: 13797582406 new mem: 13797582406 There's plenty of RAM in the system, about 120 GB free and used for cache. /Z On Mon, 16 Oct 2023 at 09:26, Zakhar Kirpichenko wrote: > Hi, > > After upgrading to Ceph 16.2.14 we had several OSD c

[ceph-users] Re: Ceph 16.2.14: OSDs randomly crash in bstore_kv_sync

2023-10-16 Thread Zakhar Kirpichenko
Not sure how it managed to screw up formatting, OSD configuration in a more readable form: https://pastebin.com/mrC6UdzN /Z On Mon, 16 Oct 2023 at 09:26, Zakhar Kirpichenko wrote: > Hi, > > After upgrading to Ceph 16.2.14 we had several OSD crashes > in bstore_kv_sync thread

[ceph-users] Ceph 16.2.14: OSDs randomly crash in bstore_kv_sync

2023-10-16 Thread Zakhar Kirpichenko
Hi, After upgrading to Ceph 16.2.14 we had several OSD crashes in bstore_kv_sync thread: "assert_thread_name": "bstore_kv_sync", "backtrace": [ "/lib64/libpthread.so.0(+0x12cf0) [0x7ff2f6750cf0]", "gsignal()", "abort()", "(ceph::__ceph_assert_fail(char

[ceph-users] Re: Ceph 16.2.14: how to set mon_rocksdb_options to enable RocksDB compression?

2023-10-15 Thread Zakhar Kirpichenko
it if someone from the Ceph team could please chip in and suggest a working way to enable RocksDB compression in Ceph monitors. /Z On Sat, 14 Oct 2023 at 19:16, Zakhar Kirpichenko wrote: > Thanks for your response, Josh. Our ceph.conf doesn't have anything but > the mon addresses, moder

[ceph-users] Re: Ceph 16.2.14: how to set mon_rocksdb_options to enable RocksDB compression?

2023-10-14 Thread Zakhar Kirpichenko
I wonder if mon settings like > this one won't actually apply the way you want because they're needed > before the mon has the ability to obtain configuration from, > effectively, itself. > > Josh > > On Sat, Oct 14, 2023 at 1:32 AM Zakhar Kirpichenko > wrote: >

[ceph-users] Re: Ceph 16.2.14: how to set mon_rocksdb_options to enable RocksDB compression?

2023-10-14 Thread Zakhar Kirpichenko
from anyone, please? /Z On Fri, 13 Oct 2023 at 23:01, Zakhar Kirpichenko wrote: > Hi, > > I'm still trying to fight large Ceph monitor writes. One option I > considered is enabling RocksDB compression, as our nodes have more than > sufficient RAM and CPU. Unfortunately

[ceph-users] Ceph 16.2.14: how to set mon_rocksdb_options to enable RocksDB compression?

2023-10-13 Thread Zakhar Kirpichenko
Hi, I'm still trying to fight large Ceph monitor writes. One option I considered is enabling RocksDB compression, as our nodes have more than sufficient RAM and CPU. Unfortunately, monitors seem to completely ignore the compression setting: I tried: - setting ceph config set mon.ceph05

[ceph-users] Re: Ceph 16.2.x mon compactions, disk writes

2023-10-13 Thread Zakhar Kirpichenko
; > > Please point me to such recommendations, if they're on docs.ceph.com I'll > get them updated. > > On Oct 13, 2023, at 13:34, Zakhar Kirpichenko wrote: > > Thank you, Anthony. As I explained to you earlier, the article you had > sent is about RocksDB tuning for Bluestore OSD

[ceph-users] Re: Ceph 16.2.x mon compactions, disk writes

2023-10-13 Thread Zakhar Kirpichenko
a client SKU and really not suited for > enterprise use. If you had the 1TB SKU you'd get much longer life, or you > could change the overprovisioning on the ones you have. > > On Oct 13, 2023, at 12:30, Zakhar Kirpichenko wrote: > > I would very much appreciate it if someone

[ceph-users] Re: Ceph 16.2.x mon compactions, disk writes

2023-10-13 Thread Zakhar Kirpichenko
would very much appreciate it if someone with a better understanding of monitor internals and use of RocksDB could please chip in. /Z On Wed, 11 Oct 2023 at 19:00, Zakhar Kirpichenko wrote: > Thank you, Frank. This confirms that monitors indeed do this, and > > Our boot drives in 3 systems

[ceph-users] Re: Please help collecting stats of Ceph monitor disk writes

2023-10-13 Thread Zakhar Kirpichenko
Thank you, Frank. Tbh, I think it doesn't matter if the number of manual compactions is for 24h or for a smaller period, as long as it's over a reasonable period of time, so that an average number of compactions per hour can be calculated. /Z On Fri, 13 Oct 2023 at 16:01, Frank Schilder wrote:

[ceph-users] Ceph 16.2.14: pgmap updated every few seconds for no apparent reason

2023-10-13 Thread Zakhar Kirpichenko
Hi, I am investigating excessive mon writes in our cluster and wondering whether excessive pgmap updates could be the culprit. Basically pgmap is updated every few seconds, sometimes over ten times per minute, in a healthy cluster with no OSD and/or PG changes: Oct 13 11:03:03 ceph03 bash[4019]:

[ceph-users] Please help collecting stats of Ceph monitor disk writes

2023-10-13 Thread Zakhar Kirpichenko
Hi! Further to my thread "Ceph 16.2.x mon compactions, disk writes" ( https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/XGCI2LFW5RH3GUOQFJ542ISCSZH3FRX2/) where we have established that Ceph monitors indeed write considerable amounts of data to disks, I would like to request fellow

[ceph-users] Re: Ceph 16.2.x mon compactions, disk writes

2023-10-11 Thread Zakhar Kirpichenko
very large and also provide extra > endurance with SSDs with good controllers. > > I also think the recommendations on the ceph docs deserve a reality check. > > Best regards, > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > >

[ceph-users] Re: Ceph 16.2.x mon compactions, disk writes

2023-10-11 Thread Zakhar Kirpichenko
"hundreds of GB per day"? I see similar stats as > Frank on different clusters with different client IO. > > Zitat von Zakhar Kirpichenko : > > > Sure, nothing unusual there: > > > > --- > > > > cluster: > > id: 3f50555a-ae2a-11eb

[ceph-users] Re: Ceph 16.2.x mon compactions, disk writes

2023-10-11 Thread Zakhar Kirpichenko
> Can you add some more details as requested by Frank? Which mgr modules > are enabled? What's the current 'ceph -s' output? > > > Is autoscaler running and doing stuff? > > Is balancer running and doing stuff? > > Is backfill going on? > > Is recovery going on? > >

[ceph-users] Re: Ceph 16.2.x mon compactions, disk writes

2023-10-11 Thread Zakhar Kirpichenko
an option to limit logging to the MON store? > > I don't recall at the moment, worth checking tough. > > Zitat von Zakhar Kirpichenko : > > > Thank you, Frank. > > > > The cluster is healthy, operating normally, nothing unusual is going on. > We >

[ceph-users] Re: Ceph 16.2.x mon compactions, disk writes

2023-10-11 Thread Zakhar Kirpichenko
n to limit logging to the MON store? > > For information to readers, we followed old recommendations from a Dell > white paper for building a ceph cluster and have a 1TB Raid10 array on 6x > write intensive SSDs for the MON stores. After 5 years we are below 10% > wear. Average size of the M

[ceph-users] Re: Ceph 16.2.x mon compactions, disk writes

2023-10-11 Thread Zakhar Kirpichenko
probably wouldn't change too much, only if you know what you're doing. > Maybe Igor can comment if some other tuning makes sense here. > > Regards, > Eugen > > Zitat von Zakhar Kirpichenko : > > > Any input from anyone, please? > > > > On Tue, 10 Oct 2023 at 09

[ceph-users] Re: Ceph 16.2.x mon compactions, disk writes

2023-10-11 Thread Zakhar Kirpichenko
Any input from anyone, please? On Tue, 10 Oct 2023 at 09:44, Zakhar Kirpichenko wrote: > Any input from anyone, please? > > It's another thing that seems to be rather poorly documented: it's unclear > what to expect, what 'normal' behavior should be, and what can be done > about

[ceph-users] Re: Ceph 16.2.x mon compactions, disk writes

2023-10-10 Thread Zakhar Kirpichenko
Any input from anyone, please? It's another thing that seems to be rather poorly documented: it's unclear what to expect, what 'normal' behavior should be, and what can be done about the huge amount of writes by monitors. /Z On Mon, 9 Oct 2023 at 12:40, Zakhar Kirpichenko wrote: >

[ceph-users] Re: Ceph 16.2.x excessive logging, how to reduce?

2023-10-09 Thread Zakhar Kirpichenko
Thanks for the suggestion. That pid belongs to the mon process. I.e. the monitor is logging all client connections and commands. /Z On Mon, 9 Oct 2023 at 14:24, Kai Stian Olstad wrote: > On 09.10.2023 10:05, Zakhar Kirpichenko wrote: > > I did try to play with various debug settings.

[ceph-users] Re: Ceph 16.2.x excessive logging, how to reduce?

2023-10-09 Thread Zakhar Kirpichenko
s with > > ceph daemon mon.a config show | grep debug_ | grep mgr > > > > ceph tell mon.* injectargs --$monk=0/0 > > > > > > > > Any input from anyone, please? > > > > > > This part of Ceph is

[ceph-users] Ceph 16.2.x mon compactions, disk writes

2023-10-09 Thread Zakhar Kirpichenko
Hi, Monitors in our 16.2.14 cluster appear to quite often run "manual compaction" tasks: debug 2023-10-09T09:30:53.888+ 7f48a329a700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1696843853892760, "job": 64225, "event": "flush_started", "num_memtables": 1, "num_entries": 715, "num_deletes": 251,

[ceph-users] Re: Ceph 16.2.x excessive logging, how to reduce?

2023-10-09 Thread Zakhar Kirpichenko
_ | grep mgr > > ceph tell mon.* injectargs --$monk=0/0 > > > > > Any input from anyone, please? > > > > This part of Ceph is very poorly documented. Perhaps there's a better > place > > to ask this question? Please let me know. > > > > /Z > &

[ceph-users] Re: Ceph 16.2.x excessive logging, how to reduce?

2023-10-09 Thread Zakhar Kirpichenko
Any input from anyone, please? This part of Ceph is very poorly documented. Perhaps there's a better place to ask this question? Please let me know. /Z On Sat, 7 Oct 2023 at 22:00, Zakhar Kirpichenko wrote: > Hi! > > I am still fighting excessive logging. I've reduced unnecessar

[ceph-users] Re: Ceph 16.2.x excessive logging, how to reduce?

2023-10-07 Thread Zakhar Kirpichenko
produces a significant part of the logging traffic. >> >> >> Thanks, >> >> Igor >> >> On 04/10/2023 20:51, Zakhar Kirpichenko wrote: >> > Any input from anyone, please? >> > >> > On Tue, 19 Sept 2023 at 09:01, Zakhar Kirpiche

[ceph-users] Re: Ceph 16.2.x excessive logging, how to reduce?

2023-10-04 Thread Zakhar Kirpichenko
? /Z On Wed, 4 Oct 2023 at 21:23, Igor Fedotov wrote: > Hi Zakhar, > > do reduce rocksdb logging verbosity you might want to set debug_rocksdb > to 3 (or 0). > > I presume it produces a significant part of the logging traffic. > > > Thanks, > > Igor > >
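
For reference, both the persistent and the runtime form of lowering that verbosity:
```
ceph config set mon debug_rocksdb 0/0              # persist via the config store
ceph tell mon.* injectargs '--debug_rocksdb=0/0'   # apply to running monitors immediately
```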

[ceph-users] Re: Ceph 16.2.x excessive logging, how to reduce?

2023-10-04 Thread Zakhar Kirpichenko
Any input from anyone, please? On Tue, 19 Sept 2023 at 09:01, Zakhar Kirpichenko wrote: > Hi, > > Our Ceph 16.2.x cluster managed by cephadm is logging a lot of very > detailed messages, Ceph logs alone on hosts with monitors and several OSDs > has already eaten through 50% o

[ceph-users] Re: 16.2.14: [progress WARNING root] complete: ev {UUID} does not exist

2023-09-29 Thread Zakhar Kirpichenko
Many thanks for the clarification! /Z On Fri, 29 Sept 2023 at 16:43, Tyler Stachecki wrote: > > > On Fri, Sep 29, 2023, 9:40 AM Zakhar Kirpichenko wrote: > >> Thanks for the suggestion, Tyler! Do you think switching the progress >> module off will have no material

[ceph-users] Re: 16.2.14: [progress WARNING root] complete: ev {UUID} does not exist

2023-09-29 Thread Zakhar Kirpichenko
Thanks for the suggestion, Tyler! Do you think switching the progress module off will have no material impact on the operation of the cluster? /Z On Fri, 29 Sept 2023 at 14:13, Tyler Stachecki wrote: > On Fri, Sep 29, 2023, 5:55 AM Zakhar Kirpichenko wrote: > >> Thank you, Eugen.
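
For reference, the progress module can be toggled at runtime, which is easy to revert if anything is missed:
```
ceph progress off    # stop tracking/reporting progress events
ceph progress on     # re-enable later if desired
```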

[ceph-users] Re: 16.2.14: [progress WARNING root] complete: ev {UUID} does not exist

2023-09-29 Thread Zakhar Kirpichenko
would have, so maybe > investigate first and then try just clearing it. Maybe a mgr failover > would do the same, not sure. > > Regards, > Eugen > > [1] > > https://github.com/ceph/ceph/blob/1d10b71792f3be8887a7631e69851ac2df3585af/src/pybind/mgr/progress/module.py#

[ceph-users] 16.2.14: [progress WARNING root] complete: ev {UUID} does not exist

2023-09-29 Thread Zakhar Kirpichenko
Hi, Mgr of my cluster logs this every few seconds: [progress WARNING root] complete: ev 7de5bb74-790b-4fda-8838-e4af4af18c62 does not exist [progress WARNING root] complete: ev fff93fce-b630-4141-81ee-19e7a3e61483 does not exist [progress WARNING root] complete: ev

[ceph-users] Re: Ceph 16.2.x excessive logging, how to reduce?

2023-09-19 Thread Zakhar Kirpichenko
Any input from anyone, please? On Tue, 19 Sept 2023 at 09:01, Zakhar Kirpichenko wrote: > Hi, > > Our Ceph 16.2.x cluster managed by cephadm is logging a lot of very > detailed messages, Ceph logs alone on hosts with monitors and several OSDs > has already eaten through 50% o

[ceph-users] Ceph 16.2.x excessive logging, how to reduce?

2023-09-19 Thread Zakhar Kirpichenko
Hi, Our Ceph 16.2.x cluster managed by cephadm is logging a lot of very detailed messages; Ceph logs alone on hosts with monitors and several OSDs have already eaten through 50% of the endurance of the flash system drives over a couple of years. Cluster logging settings are default, and it seems
