[ceph-users] Permanently ignore some warning classes

2023-01-11 Thread Nicola Mori
Dear Ceph users, my cluster is built with old hardware on a gigabit network, so I often experience warnings like OSD_SLOW_PING_TIME_BACK. These in turn trigger alert emails too often, forcing me to disable alerts, which is not sustainable. So my question is: is it possible to tell Ceph to ignor

[ceph-users] ceph orch cannot refresh

2023-01-16 Thread Nicola Mori
Dear Ceph users, after a host failure in my cluster (quincy 17.2.3 managed by cephadm) it seems that ceph orch got stuck somehow and cannot operate. For example, it seems that it has not been able to refresh the status of several services for about 20 hours: # ceph orch ls NAME
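A common first step when the orchestrator stops refreshing (a suggestion, not from the thread itself) is to restart the active mgr, which restarts the cephadm module, and then force a re-poll:

```shell
# Fail over to a standby mgr; the cephadm module restarts with it
ceph mgr fail
# Force cephadm to re-poll daemon and service state on all hosts
ceph orch ps --refresh
ceph orch ls --refresh
```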

[ceph-users] Re: Permanently ignore some warning classes

2023-02-04 Thread Nicola Mori
Thank you Reed. I tried your solution but it didn't work, the warning emails are arriving anyway. Two possible reasons: 1) I issued `ceph health mute OSD_SLOW_PING_TIME_BACK --sticky` while the warning was not active, so it had no effect 2) according to this (https://people.redhat.com/bhubbard/n

[ceph-users] Re: Permanently ignore some warning classes

2023-02-04 Thread Nicola Mori
Well, I guess the mute is now active: ``` # ceph health detail HEALTH_WARN 4 OSD(s) have spurious read errors; (muted: OSD_SLOW_PING_TIME_BACK OSD_SLOW_PING_TIME_FRONT) ``` but I still get emails from the alert module reporting about OSD_SLOW_PING_TIME_BACK/FRONT. Is this expected? _
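For reference, `ceph health mute` accepts an optional TTL, and mutes are normally cleared when the alert itself clears; `--sticky` keeps the mute across re-fires. A sketch using the codes from the thread (the 1-week TTL is illustrative):

```shell
# Apply while the alert is raised; --sticky keeps it muted even if the
# alert clears and comes back; the optional TTL auto-expires the mute.
ceph health mute OSD_SLOW_PING_TIME_BACK 1w --sticky
ceph health mute OSD_SLOW_PING_TIME_FRONT 1w --sticky
ceph health detail   # muted codes are listed under "(muted: ...)"
```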

[ceph-users] Re: Permanently ignore some warning classes

2023-02-07 Thread Nicola Mori
Yesterday I caught the cluster while OSD_SLOW_PING_TIME_FRONT was active: # ceph health detail HEALTH_WARN 4 slow ops, oldest one blocked for 9233 sec, daemons [mon.aka,mon.balin] have slow ops.; (muted: OSD_SLOW_PING_TIME_BACK OSD_SLOW_PING_TIME_FRONT) (MUTED, STICKY) [WRN] OSD_SLOW_PING_TIME_F

[ceph-users] Re: Permanently ignore some warning classes

2023-02-09 Thread Nicola Mori
I finally found the (hard) way to avoid receiving unwanted email alerts: I modified the alerts module so that one can specify a set of alert codes for which no notification is sent. If someone is interested I can share it, just let me know.

[ceph-users] Upgrade cephadm cluster

2023-02-20 Thread Nicola Mori
Dear Ceph users, my cephadm-managed cluster is currently based on 17.2.3. I see that 17.2.5 is available on quay.io, so I'd like to upgrade. I read the upgrade guide (https://docs.ceph.com/en/quincy/cephadm/upgrade/) and the "Potential problems" section is reassuringly short. Still I'm worried
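The cephadm upgrade itself boils down to a few commands; a hedged sketch for the versions mentioned in the thread:

```shell
# Check cluster health first; the upgrade refuses to start on an unhealthy cluster
ceph -s
# Start the rolling upgrade to the target release
ceph orch upgrade start --ceph-version 17.2.5
# Monitor progress; pause or abort if something looks wrong
ceph orch upgrade status
ceph orch upgrade pause    # resume with: ceph orch upgrade resume
ceph orch upgrade stop
```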

[ceph-users] Re: Upgrade cephadm cluster

2023-02-28 Thread Nicola Mori
So I decided to proceed and everything went very well, with the cluster remaining up and running during the whole process. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Excessive occupation of small OSDs

2023-03-30 Thread Nicola Mori
Dear Ceph users, my cluster is made up of 10 old machines, with uneven numbers of disks and disk sizes. Essentially I have just one big data pool (6+2 erasure code, with host failure domain) for which the available space is currently very poor (88 TB, of which 40 TB occupied, as repor

[ceph-users] Re: Excessive occupation of small OSDs

2023-03-30 Thread Nicola Mori
GB's in pairs, and make OSDs from the pairs; this would result in 1TB OSDs that might work better. Any cluster on this hardware is going to be duct tape and string, though. On Thu, Mar 30, 2023 at 10:35 AM Nicola Mori wrote: Dear Ceph users, my cluster is made up of 10 old machines, with un

[ceph-users] Re: Excessive occupation of small OSDs

2023-04-02 Thread Nicola Mori
the smallest host which is 8TB if I read the output correctly. You need to balance your hosts better by swapping drives. On Fri, 31 Mar 2023 at 03:34, Nicola Mori wrote: Dear Ceph users, my cluster is made up of 10 old machines, with uneven number of

[ceph-users] Increase timeout for marking osd down

2023-04-25 Thread Nicola Mori
Dear Ceph users, my cluster is made of very old machines on Gbit ethernet. I see that sometimes some OSDs are marked down due to slow networking, especially under heavy network load like during recovery. This causes problems, for example PGs keep being deactivated and activated as the OSDs are

[ceph-users] PGs stuck undersized and not scrubbed

2023-06-05 Thread Nicola Mori
Dear Ceph users, after an outage and recovery of one machine I have several PGs stuck in active+recovering+undersized+degraded+remapped. Furthermore, many PGs have not been (deep-)scrubbed in time. See below for status and health details. It's been like this for two days, with no recovery I/O

[ceph-users] Re: PGs stuck undersized and not scrubbed

2023-06-05 Thread Nicola Mori
Dear Wes, thank you for your suggestion! I restarted OSDs 57 and 79 and the recovery operations restarted as well. In the logs I found that a kernel issue had occurred for both of them, but they were not in an error state. They probably got stuck because of this. Thanks again for your help, Nicola s

[ceph-users] OSD stuck down

2023-06-12 Thread Nicola Mori
Dear Ceph users, after a host reboot one of the OSDs is now stuck down (and out). I tried several times to restart it and even to reboot the host, but it still remains down. # ceph -s cluster: id: b1029256-7bb3-11ec-a8ce-ac1f6b627b45 health: HEALTH_WARN 4 OSD(s) have

[ceph-users] Re: OSD stuck down

2023-06-15 Thread Nicola Mori
# ceph osd tree | grep 34 34 hdd 1.81940 osd.34 down 0 1.0 I really need help with this since I don't know what more to look at. Thanks in advance, Nicola On 13/06/23 08:35, Nicola Mori wrote: Dear Ceph users, after a host reboot one of the OSDs i

[ceph-users] Re: OSD stuck down

2023-06-15 Thread Nicola Mori
Hi Dario, I think the connectivity is OK: my cluster has just a public interface, and all of the other services on the same machine (OSDs and mgr) work flawlessly. In other words, I don't know what to look for in the network since all the other services wor

[ceph-users] Re: OSD stuck down

2023-06-15 Thread Nicola Mori
Hi Curt, I increased the debug level but the OSD daemon still doesn't log anything more than I already posted. dmesg does not report anything suspicious (the OSD disk has the very same messages as other disks for working OSDs), and smart is not very helpful: # smartctl -a /dev/sdf smartctl 7.1

[ceph-users] Re: OSD stuck down

2023-06-15 Thread Nicola Mori
I have been able to (sort-of) fix the problem by removing the problematic OSD, zapping the disk and starting a new OSD. The new OSD is backfilling, but now the problem is that some parts of Ceph are still waiting for the OSD removal, and the OSD (despite not running anymore on the host) is se

[ceph-users] Re: OSD stuck down

2023-06-16 Thread Nicola Mori
The OSD daemon finally disappeared without further intervention. I guess I should have had more patience and waited for the purge process to finish. Thanks to everybody who helped. Nicola On 15 June 2023 15:02:16 CEST, Nicola Mori wrote: > >I have been able to (sort-of) fix the prob

[ceph-users] OSD delete vs destroy vs purge

2023-08-09 Thread Nicola Mori
Dear Ceph users, I see that the OSD page of the Ceph dashboard offers three possibilities for "removing" an OSD: delete, destroy and purge. The delete operation has the possibility to flag the "Preserve OSD ID(s) for replacement." option. I searched for explanations of the differences between

[ceph-users] Re: OSD delete vs destroy vs purge

2023-08-19 Thread Nicola Mori
Thanks Eugen for the explanation. To summarize what I understood: - delete from the GUI simply does a drain+destroy; - destroy will preserve the OSD id so that it will be used by the next OSD created on that host; - purge will remove everything, and the next OSD that will be created will
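On the CLI the three operations summarized above look roughly like this (OSD id 12 is hypothetical):

```shell
# drain + destroy, keeping the OSD id free for the replacement disk
ceph orch osd rm 12 --replace
# destroy directly: marks the OSD destroyed but preserves its id
ceph osd destroy 12 --yes-i-really-mean-it
# purge: removes the OSD from the CRUSH map, its auth key, and the OSD map
ceph osd purge 12 --yes-i-really-mean-it
```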

[ceph-users] Awful new dashboard in Reef

2023-09-07 Thread Nicola Mori
Dear Ceph users, I just upgraded my cluster to Reef, and with the new version came also a revamped dashboard. Unfortunately the new dashboard is really awful to me: 1) it's no longer possible to see the status of the PGs: in the old dashboard it was very easy to see e.g. how many PGs were rec

[ceph-users] Re: Awful new dashboard in Reef

2023-09-07 Thread Nicola Mori
My cluster has 104 OSDs, so I don't think this can be a factor for the malfunctioning.

[ceph-users] Re: Awful new dashboard in Reef

2023-09-11 Thread Nicola Mori
Hi Nizam, many thanks for the tip. And sorry for the quite rude subject of my post; I really appreciate the dashboard revamp effort but I was frustrated by the malfunctioning and missing features. By the way, one of the things that really needs to be improved is the support for mobile devic

[ceph-users] Memory footprint of increased PG number

2023-11-08 Thread Nicola Mori
Dear Ceph users, I'm wondering how much an increase of the PG number would impact the memory footprint of the OSD daemons. In my cluster I currently have 512 PGs and I would like to increase this to 1024 to mitigate some disk occupancy issues, but having machines with low amounts of memory (down to 24 G

[ceph-users] Duplicated device IDs

2023-12-01 Thread Nicola Mori
Dear Ceph users, I am replacing some small disks on one of my hosts with bigger ones. I delete the OSD from the web UI, preserving the ID for replacement, then after the rebalancing is finished I change the disk and the cluster automatically re-creates the OSD with the same ID. Then I adjust t

[ceph-users] Help with deep scrub warnings

2024-03-04 Thread Nicola Mori
Dear Ceph users, in order to reduce the deep scrub load on my cluster I set the deep scrub interval to 2 weeks, and tuned other parameters as follows: # ceph config get osd osd_deep_scrub_interval 1209600.00 # ceph config get osd osd_scrub_sleep 0.10 # ceph config get osd osd_scrub_loa

[ceph-users] Re: Help with deep scrub warnings

2024-03-05 Thread Nicola Mori
Hi Anthony, thanks for the tips. I reset all the values but osd_deep_scrub_interval to their defaults as reported at https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/ : # ceph config set osd osd_scrub_sleep 0.0 # ceph config set osd osd_scrub_load_threshold 0.5 # ceph config
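One detail worth noting when only the interval is raised: the mons compute the "not deep-scrubbed in time" deadline from the interval plus a grace ratio, so the warning can still fire. A hedged sketch (parameter names as in the quincy/reef docs; the exact deadline formula may vary by release):

```shell
# Set the interval globally so mons and OSDs agree on the same value
ceph config set global osd_deep_scrub_interval 1209600   # 2 weeks
# The warning roughly fires after interval * (1 + this ratio)
ceph config get mon mon_warn_pg_not_deep_scrubbed_ratio  # default 0.75
```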

[ceph-users] Write issues on CephFS mounted with root_squash

2024-05-15 Thread Nicola Mori
Dear Ceph users, I'm trying to export a CephFS with the root_squash option. This is the client configuration: client.wizardfs_rootsquash key: caps: [mds] allow rw fsname=wizardfs root_squash caps: [mon] allow r fsname=wizardfs caps: [osd] allow rw t
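Caps like these can be generated with `ceph fs authorize`, which supports a root_squash flag in recent releases; a sketch using the names from the thread (mount point and options are illustrative):

```shell
# Create the client with root_squash applied at the filesystem root
ceph fs authorize wizardfs client.wizardfs_rootsquash / rw root_squash
# Mount as that client with the kernel driver
mount -t ceph :/ /mnt/wizardfs -o name=wizardfs_rootsquash,mds_namespace=wizardfs
```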

[ceph-users] Re: Write issues on CephFS mounted with root_squash

2024-05-15 Thread Nicola Mori
Thank you Bailey, I'll give it a try ASAP. By the way, is this issue with the kernel driver something that will be fixed at some point? If I'm correct the kernel driver has better performance than FUSE, so I'd like to use it. Cheers, Nicola

[ceph-users] Re: Write issues on CephFS mounted with root_squash

2024-05-16 Thread Nicola Mori
Thank you Kotresh! My cluster is currently on Reef 18.2.2, which should be the current version and which is affected. Will the fix be included in the next Reef release? Cheers, Nicola

[ceph-users] Ceph 19 Squid released?

2024-07-21 Thread Nicola Mori
Dear Ceph users, on quay.io I see available images for 19.1.0. Yet I can't find any public release announcement, and on this page: https://docs.ceph.com/en/latest/releases/ version 19 is still not mentioned at all. So what's going on? Nicola

[ceph-users] Pull failed on cluster upgrade

2024-08-05 Thread Nicola Mori
Dear Ceph users, during an upgrade from 18.2.2 to 18.2.4 the image pull from Dockerhub failed on one machine running a monitor daemon, while it succeeded on the previous ones. # ceph orch upgrade status { "target_image": "snack14/ceph-wizard@sha256:b1994328eb078778abdba0a17a7cf7b371e7d95

[ceph-users] Re: Pull failed on cluster upgrade

2024-08-06 Thread Nicola Mori
I think I found the problem. Setting the cephadm log level to debug and then watching the logs during the upgrade: ceph config set mgr mgr/cephadm/log_to_cluster_level debug ceph -W cephadm --watch-debug I found this line just before the error: ceph: stderr Fatal glibc error: CPU does no
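That error means the container's glibc was built for the x86-64-v2 microarchitecture level. Whether a host CPU qualifies can be checked like this (assuming the host runs glibc >= 2.33, whose loader reports psABI levels):

```shell
# glibc's dynamic loader lists the x86-64 levels the CPU supports
/lib64/ld-linux-x86-64.so.2 --help | grep -E 'x86-64-v[0-9]'
# x86-64-v2 requires, among others, SSE4.2 and POPCNT
grep -o -E 'sse4_2|popcnt' /proc/cpuinfo | sort -u
```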

[ceph-users] Re: Pull failed on cluster upgrade

2024-08-07 Thread Nicola Mori
Unfortunately I'm on bare metal, with very old hardware so I cannot do much. I'd try to build a Ceph image based on Rocky Linux 8 if I could get the Dockerfile of the current image to start with, but I've not been able to find it. Can you please help me with this? Cheers, Nicola

[ceph-users] Re: Pull failed on cluster upgrade

2024-08-07 Thread Nicola Mori
Thank you Konstantin, as was foreseeable this problem didn't hit just me. So I hope the build of images based on CentOS Stream 8 will be resumed. Otherwise I'll try to build it myself. Nicola

[ceph-users] Re: Pull failed on cluster upgrade

2024-08-21 Thread Nicola Mori
In the end I built an image based on Ubuntu 22.04, which does not mandate x86-64-v2. I installed the official Ceph packages and hacked here and there (e.g. it was necessary to set the uid and gid of the Ceph user and group identical to those used by the CentOS Stream 8 image to avoid messing

[ceph-users] Re: Pull failed on cluster upgrade

2024-08-21 Thread Nicola Mori
The upgrade ended successfully, but now the cluster reports this error: MDS_CLIENTS_BROKEN_ROOTSQUASH: 1 MDS report clients with broken root_squash implementation From what I understood this is due to a new feature meant to fix a bug in the root_squash implementation, and that will be relea

[ceph-users] Can't stop ceph-mgr from continuously logging to file

2022-01-07 Thread Nicola Mori
In my test cluster the ceph-mgr is continuously logging to file like this: 2022-01-07T16:24:09.890+ 7fc49f1cc700 0 log_channel(cluster) log [DBG] : pgmap v832: 1 pgs: 1 active+undersized; 0 B data, 5.9 MiB used, 16 GiB / 18 GiB avail 2022-01-07T16:24:11.890+ 7fc49f1cc700 0 log_channel

[ceph-users] PG count deviation alert on OSDs of high weight

2022-01-26 Thread Nicola Mori
I set up a test cluster (Pacific 16.2.7 deployed with cephadm) with several hdds of different sizes, 1.8 Tb and 3.6 TB; they have weight 1.8 and 3.6, respectively, with 2 pools (metadata+data for CephFS). I'm currently having a PG count varying from 177 to 182 for OSDs with small disks and from
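The deviation alert compares each OSD's PG count against the plain cluster average, but with CRUSH weighting the expected count scales with weight, so larger disks legitimately hold more PGs. Illustrative arithmetic with the weights from the post (PG counts are the observed ones, used as an example):

```shell
# a 3.6-weight OSD should carry about twice the PGs of a 1.8-weight one
pgs_small=180   # observed on the 1.8 TB disks
echo $(( pgs_small * 36 / 18 ))
```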

[ceph-users] File access issue with root_squashed fs client

2022-02-03 Thread Nicola Mori
with the kernel driver using this autofs configuration: ceph-test -fstype=ceph,name=wizardfs,noatime 172.16.253.2,172.16.253.1:/ I guess this could be a configuration problem but I can't figure out what I might be doing wrong here. So I'd greatly appr

[ceph-users] Clarifications about automatic PG scaling

2022-09-02 Thread Nicola Mori
Dear Ceph users, I'm setting up a cluster, at the moment I have 56 OSDs for a total available space of 109 TiB, and an erasure coded pool with a total occupancy of just 90 GB. The autoscale mode for the pool is set to "on", but I still have just 32 PGs. As far as I understand (admittedly not

[ceph-users] osd_memory_target for low-memory machines

2022-10-02 Thread Nicola Mori
Dear Ceph users, I put together a cluster by reusing some (very) old machines with low amounts of RAM, as low as 4 GB in the worst case. I'd need to set osd_memory_target properly to avoid going OOM, but it seems there is a lower limit preventing me from doing so consistently: 1) in the cluster
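For context: osd_memory_target has a documented hard floor (896 MiB in recent releases), and it is a best-effort cache-sizing target rather than a hard cap. A sketch of a per-host override (the host name is hypothetical):

```shell
# Per-host override for the low-RAM machine; 939524096 bytes = 896 MiB,
# the documented minimum for osd_memory_target
ceph config set osd/host:oldbox osd_memory_target 939524096
# Verify what a given daemon actually resolved
ceph config get osd.8 osd_memory_target
```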

[ceph-users] Re: osd_memory_target for low-memory machines

2022-10-02 Thread Nicola Mori
. Nicola On 02/10/22 17:21, Joseph Mundackal wrote: can you share `ceph daemon osd.8 config show` and `ceph config dump`? On Sun, Oct 2, 2022 at 5:10 AM Nicola Mori wrote: Dear Ceph users, I put together a cluster by reusing some (very) old machines

[ceph-users] Re: osd_memory_target for low-memory machines

2022-10-03 Thread Nicola Mori
host) but nevertheless the actual memory usage is far above the limit (568M vs 384M). How can this be? Am I overlooking something? On 02/10/22 19:16, Nicola Mori wrote: I attach two files with the requested info. Some more details: the cluster has been deployed with cephadm using Ceph 17.2.3

[ceph-users] Re: osd_memory_target for low-memory machines

2022-10-03 Thread Nicola Mori
if only temporarily. What you are setting is more or less like setting the autopilot engine level on an airplane at the cruising altitude, it does *not* control how much engine is needed for take-off, landings or emergencies. -- Nicola Mori, Ph.D. INFN sezione di Firenze Via Bruno Rossi 1, 50019 Se

[ceph-users] ceph tell setting ignored?

2022-10-05 Thread Nicola Mori
Dear Ceph users, I am trying to tune my cluster's recovery and backfill. On the web I found that I can set related tunables by e.g.: ceph tell osd.* injectargs --osd-recovery-sleep-hdd=0.0 --osd-max-backfills=8 --osd-recovery-max-active=8 --osd-recovery-max-single-start=4 but I cannot find

[ceph-users] Re: ceph tell setting ignored?

2022-10-05 Thread Nicola Mori
away when the daemon restarts. On Oct 5, 2022, at 6:10 AM, Nicola Mori wrote: Dear Ceph users, I am trying to tune my cluster's recovery and backfill. On the web I found that I can set related tunables by e.g.: ceph tell osd.* injectargs --osd-recovery-sleep-hdd=0.0 --osd-max-backfills=8 -

[ceph-users] Re: ceph tell setting ignored?

2022-10-05 Thread Nicola Mori
" } makes little sense to me. This means you have the mClock IO scheduler, and it gives back this value since you are meant to change the mClock priorities and not the number of backfills. Some more info at https://docs.ceph.com/en/quincy/rados/configuration/osd-config-ref/#dmclock-qos
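Under the mClock scheduler, recovery/backfill knobs are ignored or rejected unless an override flag is set; a hedged sketch (the override flag exists in later quincy/reef releases):

```shell
# Allow manual override of recovery/backfill limits under mClock
ceph config set osd osd_mclock_override_recovery_settings true
ceph config set osd osd_max_backfills 8
# Alternatively, just bias the scheduler toward recovery traffic
ceph config set osd osd_mclock_profile high_recovery_ops
```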

[ceph-users] Infinite backfill loop + number of pgp groups stuck at wrong value

2022-10-07 Thread Nicola Mori
Dear Ceph users, my cluster has been stuck for several days with some PGs backfilling. The number of misplaced objects slowly decreases down to 5%, and at that point jumps up again to about 7%, and so on. I found several possible reasons for this behavior. One is related to the balancer, which anyw

[ceph-users] Re: Infinite backfill loop + number of pgp groups stuck at wrong value

2022-10-07 Thread Nicola Mori
was updated to 126, and after a last bump it is now at 128 with a ~4% of misplaced objects currently decreasing. Sorry for the noise, Nicola On 07/10/22 09:15, Nicola Mori wrote: Dear Ceph users, my cluster is stuck since several days with some PG backfilling. The number of misplaced objects

[ceph-users] Re: Infinite backfill loop + number of pgp groups stuck at wrong value

2022-10-12 Thread Nicola Mori
Thank you Frank for the insight. I'd need to study the details of all of this a bit more, but for sure now I understand it a bit better. Nicola

[ceph-users] Understanding the total space in CephFS

2022-10-13 Thread Nicola Mori
Dear Ceph users, I'd need some help in understanding the total space in a CephFS. My cluster is currently built of 8 machines, the one with the smallest capacity has 8 TB of total disk space, and the total available raw space is 153 TB. I set up a 3x replicated metadata pool and a 6+2 erasure
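As a first-order check, a k+m erasure-coded pool can use at most k/(k+m) of the raw space, and with a host failure domain the usable total is further reduced by the smaller hosts and full-ratio reserves. Quick arithmetic with the numbers from the post:

```shell
raw=153; k=6; m=2
# upper bound on usable capacity, before imbalance and reserved headroom
echo "$(( raw * k / (k + m) )) TiB"
```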

[ceph-users] Re: Understanding the total space in CephFS

2022-10-13 Thread Nicola Mori
Hi Stefan, the cluster is built of several old machines, with different numbers of disks (from 8 to 16) and disk sizes (from 500 GB to 4 TB). After the PG increase it is still recovering: the number of PGP is at 213 and has to grow up to 256. The balancer status gives: { "active": true,

[ceph-users] Spam on /var/log/messages due to config leftover?

2022-10-16 Thread Nicola Mori
Dear Ceph users, on one of my nodes I see that the /var/log/messages is being spammed by these messages: Oct 16 12:51:11 bofur bash[2473311]: :::172.16.253.2 - - [16/Oct/2022:10:51:11] "GET /metrics HTTP/1.1" 200 - "" "Prometheus/2.33.4" Oct 16 12:51:12 bofur bash[2487821]: ts=2022-10-16T

[ceph-users] Re: Spam on /var/log/messages due to config leftover?

2022-10-17 Thread Nicola Mori
I solved the problem by stopping prometheus on the problematic host, then removing the folder containing the prometheus config and storage (/var/lib/ceph//prometheus.) and then restarting the alertmanager and node-exporter units. Nicola

[ceph-users] Missing OSD in up set

2022-11-02 Thread Nicola Mori
Dear Ceph users, I have one PG in my cluster that is constantly in active+clean+remapped state. From what I understand there might be a problem with the up set: # ceph pg map 3.5e osdmap e23638 pg 3.5e (3.5e) -> up [38,78,55,49,40,39,64,2147483647] acting [38,78,55,49,40,39,64,68] The last OSD

[ceph-users] Re: Missing OSD in up set

2022-11-03 Thread Nicola Mori
Hi Frank, I checked the first hypothesis, and I found something strange. This is the decompiled rule: rule wizard_data { id 1 type erasure step set_chooseleaf_tries 5 step set_choose_tries 100 step take default step chooseleaf indep 0 type host
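The usual loop for testing such a rule change offline before injecting it looks like this (file names are hypothetical):

```shell
# Extract and decompile the live CRUSH map
ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt      # edit set_choose_tries etc. here
# Recompile and verify the rule maps num-rep OSDs for every input
crushtool -c crush.txt -o crush-new.bin
crushtool -i crush-new.bin --test --rule 1 --num-rep 8 --show-bad-mappings
# Only inject once no bad mappings are reported
ceph osd setcrushmap -i crush-new.bin
```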

[ceph-users] Re: Missing OSD in up set

2022-11-03 Thread Nicola Mori
better once you have more host buckets to choose OSDs from. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Nicola Mori Sent: 03 November 2022 10:39:30 To: ceph-users Subject: [ceph-users] Re: Missing OSD in up

[ceph-users] Re: Missing OSD in up set

2022-11-03 Thread Nicola Mori
If I use set_choose_tries 100 and choose_total_tries 250 I get a lot of bad mappings with crushtool: # crushtool -i better-totaltries--crush.map --test --show-bad-mappings --rule 1 --num-rep 8 --min-x 1 --max-x 100 --show-choose-tries bad mapping rule 1 x 319 num_rep 8 result [43,40,58,69

[ceph-users] Re: Missing OSD in up set

2022-11-03 Thread Nicola Mori
Ok, I'd say I fixed it. I set both parameters to 250, recompiled the crush map and loaded it, and now the PG is in active+undersized+degraded+remapped+backfilling state and mapped as: # ceph pg map 3.5e osdmap e23741 pg 3.5e (3.5e) -> up [38,78,55,49,40,39,64,20] acting [38,78,55,49,40,39,64,2