Dear Ceph users,
my cluster is built from old hardware on a gigabit network, so I often
experience warnings like OSD_SLOW_PING_TIME_BACK. These in turn trigger
alert emails too often, forcing me to disable alerts, which is not
sustainable. So my question is: is it possible to tell Ceph to ignore
Dear Ceph users,
after a host failure in my cluster (Quincy 17.2.3, managed by cephadm) it
seems that ceph orch somehow got stuck and cannot operate. For
example, it seems that it cannot refresh the status of several services
for about 20 hours:
# ceph orch ls
NAME
Thank you Reed. I tried your solution but it didn't work, the warning emails
are arriving anyway. Two possible reasons:
1) I issued `ceph health mute OSD_SLOW_PING_TIME_BACK --sticky` while the
warning was not active, so it had no effect
2) according to this
(https://people.redhat.com/bhubbard/n
Well, I guess the mute is now active:
```
# ceph health detail
HEALTH_WARN 4 OSD(s) have spurious read errors; (muted: OSD_SLOW_PING_TIME_BACK
OSD_SLOW_PING_TIME_FRONT)
```
but I still get emails from the alert module reporting about
OSD_SLOW_PING_TIME_BACK/FRONT. Is this expected?
Yesterday I caught the cluster while OSD_SLOW_PING_TIME_FRONT was active:
# ceph health detail
HEALTH_WARN 4 slow ops, oldest one blocked for 9233 sec, daemons
[mon.aka,mon.balin] have slow ops.; (muted: OSD_SLOW_PING_TIME_BACK
OSD_SLOW_PING_TIME_FRONT)
(MUTED, STICKY) [WRN] OSD_SLOW_PING_TIME_F
I finally found the (hard) way to avoid receiving unwanted email alerts: I
modified the alerts module in order to be able to specify the set of alert
codes for which no notification is sent. If someone is interested I can share
it, just let me know.
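In the meantime, here is a minimal sketch of the idea (the function name and the shape of the health-check dict are illustrative, not the actual mgr alerts module API): filter out the unwanted health-check codes before composing the notification email.

```python
# Hypothetical sketch of filtering health checks before alerting; the
# real mgr 'alerts' module is structured differently.
IGNORED_CODES = {"OSD_SLOW_PING_TIME_BACK", "OSD_SLOW_PING_TIME_FRONT"}

def checks_to_notify(health_checks):
    """Return only the health checks that should trigger an email."""
    return {code: detail for code, detail in health_checks.items()
            if code not in IGNORED_CODES}

# Example: only OSD_NEARFULL survives the filter.
checks = {
    "OSD_SLOW_PING_TIME_BACK": {"severity": "HEALTH_WARN"},
    "OSD_NEARFULL": {"severity": "HEALTH_WARN"},
}
print(sorted(checks_to_notify(checks)))  # ['OSD_NEARFULL']
```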
Dear Ceph users,
my cephadm-managed cluster is currently based on 17.2.3. I see that
17.2.5 is available on quay.io, so I'd like to upgrade. I read the
upgrade guide (https://docs.ceph.com/en/quincy/cephadm/upgrade/) and the
"Potential problems" section is reassuringly short. Still I'm worried
So I decided to proceed and everything went very well, with the cluster
remaining up and running during the whole process.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
Dear Ceph users,
my cluster is made up of 10 old machines, with uneven numbers of disks and
disk sizes. Essentially I have just one big data pool (6+2 erasure code, with
host failure domain) for which I am currently experiencing very poor available
space (88 TB of which 40 TB occupied, as repor
GB's in pairs, and make OSDs from the
pairs; this would result in 1TB OSDs that might work better. Any
cluster on this hardware is going to be duct tape and string, though.
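To make the host-balance point concrete, here is a toy model (the host sizes are hypothetical, and the even-placement assumption is a simplification of what CRUSH actually does): with a k+m pool and host failure domain, every host ends up holding roughly the same amount of raw data, so the smallest host fills first while the larger ones sit partly empty.

```python
# Toy capacity model for an EC k+m pool with host failure domain.
k, m = 6, 2
hosts_tb = [8, 10, 10, 12, 12, 14, 16, 16, 18, 20]  # 10 hypothetical hosts

# With ~even shard placement each host stores ~1/len(hosts) of all raw
# data, so the pool is "full" when the smallest host is full.
raw_limit_tb = min(hosts_tb) * len(hosts_tb)   # raw data storable
usable_tb = raw_limit_tb * k / (k + m)         # minus EC overhead
wasted_tb = sum(hosts_tb) - raw_limit_tb       # raw capacity never reached
print(raw_limit_tb, usable_tb, wasted_tb)      # 80 60.0 56
```

Under this model, rebalancing drives across hosts (as suggested above) raises the minimum and directly recovers usable space.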
On Thu, Mar 30, 2023 at 10:35 AM Nicola Mori wrote:
the smallest host which is 8TB if I
read the output correctly. You need to balance your hosts better by
swapping drives.
On Fri, 31 Mar 2023 at 03:34, Nicola Mori <mailto:m...@fi.infn.it>> wrote:
Dear Ceph users,
my cluster is made of very old machines on a Gbit Ethernet. I see that
sometimes some OSDs are marked down due to slow networking, especially
under heavy network load like during recovery. This causes problems, for
example PGs keep being deactivated and activated as the OSDs are
Dear Ceph users,
after an outage and recovery of one machine I have several PGs stuck in
active+recovering+undersized+degraded+remapped. Furthermore, many PGs
have not been (deep-)scrubbed in time. See below for status and health
details.
It's been like this for two days, with no recovery I/O
Dear Wes,
thank you for your suggestion! I restarted OSDs 57 and 79 and the
recovery operations restarted as well. In the logs I found that both of
them had hit a kernel issue, but they were not in an error state.
Probably they got stuck because of this.
Thanks again for your help,
Nicola
Dear Ceph users,
after a host reboot one of the OSDs is now stuck down (and out). I tried
several times to restart it and even to reboot the host, but it still
remains down.
# ceph -s
cluster:
id: b1029256-7bb3-11ec-a8ce-ac1f6b627b45
health: HEALTH_WARN
4 OSD(s) have
# ceph osd tree | grep 34
34   hdd   1.81940   osd.34   down   0   1.0
I really need help with this since I don't know what more to look at.
Thanks in advance,
Nicola
On 13/06/23 08:35, Nicola Mori wrote:
Hi Dario,
I think the connectivity is OK: my cluster has just a public network,
and all of the other services on the same machine (OSDs and mgr) work
flawlessly. In other words, I don't know what to look for in the network
since all the other services work
Hi Curt,
I increased the debug level but still the OSD daemon doesn't log
anything more than I already posted. dmesg does not report anything
suspicious (the OSD's disk shows the very same messages as the disks of
working OSDs), and SMART is not very helpful:
# smartctl -a /dev/sdf
smartctl 7.1
I have been able to (sort-of) fix the problem by removing the
problematic OSD, zapping the disk and starting a new OSD. The new OSD is
backfilling, but now the problem is that some parts of Ceph are still
waiting for the OSD removal, and the OSD (despite not running anymore on
the host) is se
The OSD daemon finally disappeared without further intervention. I guess I
should have had more patience and waited for the purge process to finish.
Thanks to everybody who helped.
Nicola
Il 15 giugno 2023 15:02:16 CEST, Nicola Mori ha scritto:
Dear Ceph users,
I see that the OSD page of the Ceph dashboard offers three possibilities
for "removing" an OSD: delete, destroy and purge. The delete operation
has the possibility to flag the "Preserve OSD ID(s) for replacement."
option. I searched for explanations of the differences between
Thanks Eugen for the explanation. To summarize what I understood:
- delete from GUI simply does a drain+destroy;
- destroy will preserve the OSD id so that it will be used by the next
OSD that will be created on that host;
- purge will remove everything, and the next OSD that will be created
will
Dear Ceph users,
I just upgraded my cluster to Reef, and with the new version came also a
revamped dashboard. Unfortunately the new dashboard is really awful to me:
1) it's no longer possible to see the status of the PGs: in the old
dashboard it was very easy to see e.g. how many PGs were rec
My cluster has 104 OSDs, so I don't think this can be a factor for the
malfunctioning.
Hi Nizam,
many thanks for the tip. And sorry for the quite rude subject of my
post; I really appreciate the dashboard revamp effort, but I was
frustrated by the malfunctions and missing features. By the way,
one of the things that really needs to be improved is the support for
mobile devices
Dear Ceph user,
I'm wondering how much an increase in the number of PGs would impact the
memory usage of the OSD daemons. In my cluster I currently have 512 PGs and
I would like to increase this to 1024 to mitigate some disk occupancy issues,
but having machines with low amounts of memory (down to 24 G
Dear Ceph users,
I am replacing some small disks on one of my hosts with bigger ones. I
delete the OSD from the web UI, preserving the ID for replacement, then
after the rebalancing is finished I change the disk and the cluster
automatically re-creates the OSD with the same ID. Then I adjust t
Dear Ceph users,
in order to reduce the deep scrub load on my cluster I set the deep
scrub interval to 2 weeks, and tuned other parameters as follows:
# ceph config get osd osd_deep_scrub_interval
1209600.00
# ceph config get osd osd_scrub_sleep
0.10
# ceph config get osd osd_scrub_loa
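As a sanity check on the first value (osd_deep_scrub_interval is expressed in seconds), the configured interval indeed corresponds to the intended two weeks:

```python
# 1209600 seconds / 86400 seconds-per-day = 14 days.
deep_scrub_interval_s = 1_209_600
print(deep_scrub_interval_s / 86_400, "days")  # 14.0 days
```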
Hi Anthony,
thanks for the tips. I reset all the values but osd_deep_scrub_interval
to their defaults as reported at
https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/ :
# ceph config set osd osd_scrub_sleep 0.0
# ceph config set osd osd_scrub_load_threshold 0.5
# ceph config
Dear Ceph users,
I'm trying to export a CephFS with the root_squash option. This is the
client configuration:
client.wizardfs_rootsquash
key:
caps: [mds] allow rw fsname=wizardfs root_squash
caps: [mon] allow r fsname=wizardfs
caps: [osd] allow rw t
Thank you Bailey, I'll give it a try ASAP. By the way, is this issue
with the kernel driver something that will be fixed at some point? If
I'm correct the kernel driver has better performance than FUSE, so I'd
like to use it.
Cheers,
Nicola
Thank you Kotresh! My cluster is currently on Reef 18.2.2, which should
be the current version and which is affected. Will the fix be included
in the next Reef release?
Cheers,
Nicola
Dear Ceph users,
on quay.io I see available images for 19.1.0. Yet I can't find any
public release announcement, and on this page:
https://docs.ceph.com/en/latest/releases/
version 19 is still not mentioned at all. So what's going on?
Nicola
Dear Ceph users,
during an upgrade from 18.2.2 to 18.2.4 the image pull from Dockerhub
failed on one machine running a monitor daemon, while it succeeded on
the previous ones.
# ceph orch upgrade status
{
"target_image":
"snack14/ceph-wizard@sha256:b1994328eb078778abdba0a17a7cf7b371e7d95
I think I found the problem. Setting the cephadm log level to debug and
then watching the logs during the upgrade:
ceph config set mgr mgr/cephadm/log_to_cluster_level debug
ceph -W cephadm --watch-debug
I found this line just before the error:
ceph: stderr Fatal glibc error: CPU does not support x86-64-v2
Unfortunately I'm on bare metal, with very old hardware so I cannot do
much. I'd try to build a Ceph image based on Rocky Linux 8 if I could
get the Dockerfile of the current image to start with, but I've not been
able to find it. Can you please help me with this?
Cheers,
Nicola
Thank you Konstantin: as was foreseeable, this problem didn't hit just
me. So I hope the build of images based on CentOS Stream 8 will be
resumed. Otherwise I'll try to build myself.
Nicola
In the end I built up an image based on Ubuntu 22.04 which does not
mandate x86-64-v2. I installed the official Ceph packages and hacked
here and there (e.g. it was necessary to set the uid and gid of the Ceph
user and group identical to those used by the CentOS Stream 8 image to
avoid messing
The upgrade ended successfully, but now the cluster reports this error:
MDS_CLIENTS_BROKEN_ROOTSQUASH: 1 MDS report clients with broken
root_squash implementation
From what I understood this is due to a new feature meant to fix a bug
in the root_squash implementation, and that will be relea
In my test cluster the ceph-mgr is continuously logging to file like this:
2022-01-07T16:24:09.890+ 7fc49f1cc700 0 log_channel(cluster) log
[DBG] : pgmap v832: 1 pgs: 1 active+undersized; 0 B data, 5.9 MiB used,
16 GiB / 18 GiB avail
2022-01-07T16:24:11.890+ 7fc49f1cc700 0 log_channel
I set up a test cluster (Pacific 16.2.7 deployed with cephadm) with
several HDDs of different sizes, 1.8 TB and 3.6 TB; they have weight 1.8
and 3.6, respectively, with 2 pools (metadata+data for CephFS). I'm
currently having a PG count varying from 177 to 182 for OSDs with small
disks and from
with the kernel
driver using this autofs configuration:
ceph-test -fstype=ceph,name=wizardfs,noatime 172.16.253.2,172.16.253.1:/
I guess this could be a configuration problem but I can't figure out
what I might be doing wrong here. So I'd greatly appr
Dear Ceph users,
I'm setting up a cluster, at the moment I have 56 OSDs for a total
available space of 109 TiB, and an erasure coded pool with a total
occupancy of just 90 GB. The autoscale mode for the pool is set to "on",
but I still have just 32 PGs. As far as I understand (admittedly not
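My reading of why this happens, as a simplified sketch (the numbers and the target-PGs-per-OSD knob are illustrative assumptions): the autoscaler sizes a pool by its share of cluster capacity, so a nearly empty pool stays at its floor no matter how many OSDs exist.

```python
import math

# Simplified autoscaler-style sizing: target PGs scale with the pool's
# share of used capacity, rounded up to a power of two, with a floor.
osds = 56
target_pg_per_osd = 100        # assumed target knob
pool_used_tib, cluster_tib = 0.09, 109
shards_per_pg = 8              # 6+2 EC

ratio = pool_used_tib / cluster_tib
raw_target = ratio * target_pg_per_osd * osds / shards_per_pg
pg_num = max(32, 2 ** math.ceil(math.log2(max(raw_target, 1))))
print(round(raw_target, 3), pg_num)  # 0.578 32
```

With only ~90 GB in a ~109 TiB cluster the raw target is far below one PG, so the pool sits at the 32 PG floor until it fills up (or a target ratio is set).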
Dear Ceph users,
I put together a cluster by reusing some (very) old machines with low
amounts of RAM, as low as 4 GB for the worst case. I'd need to set
osd_memory_target properly to avoid going OOM, but it seems there is a
lower limit preventing me from doing so consistently:
1) in the cluster
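For context, this is the kind of budget I am trying to express (all numbers are assumptions for illustration): whatever RAM is left after the OS and other daemons, divided among the OSDs on the host.

```python
# Back-of-envelope osd_memory_target budget for a 4 GB host.
host_ram_mib = 4096
reserved_mib = 1536          # assumed OS + mon/mgr/other daemons
osds_on_host = 2
per_osd_mib = (host_ram_mib - reserved_mib) // osds_on_host
print(per_osd_mib, "MiB per OSD")  # 1280 MiB per OSD
```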
Nicola
On 02/10/22 17:21, Joseph Mundackal wrote:
can you share `ceph daemon osd.8 config show` and `ceph config dump`?
On Sun, Oct 2, 2022 at 5:10 AM Nicola Mori <mailto:m...@fi.infn.it>> wrote:
host) but nevertheless the actual memory usage is far above the
limit (568M vs 384M).
How can this be? Am I overlooking something?
On 02/10/22 19:16, Nicola Mori wrote:
I attach two files with the requested info. Some more details: the
cluster has been deployed with cephadm using Ceph 17.2.3
if only temporarily.
What you are setting is more or less like setting an airplane's
autopilot for cruising altitude: it does *not* control how much engine
power is needed for take-off, landings or emergencies.
--
Nicola Mori, Ph.D.
INFN sezione di Firenze
Via Bruno Rossi 1, 50019 Se
Dear Ceph users,
I am trying to tune my cluster's recovery and backfill. On the web I
found that I can set related tunables by e.g.:
ceph tell osd.* injectargs --osd-recovery-sleep-hdd=0.0
--osd-max-backfills=8 --osd-recovery-max-active=8
--osd-recovery-max-single-start=4
but I cannot find
away when the daemon restarts.
On Oct 5, 2022, at 6:10 AM, Nicola Mori wrote:
makes little sense to me.
This means you have the mClock IO scheduler, and it returns this value
because with mClock you are meant to change the mClock priorities rather
than the number of backfills.
Some more info at
https://docs.ceph.com/en/quincy/rados/configuration/osd-config-ref/#dmclock-qos
Dear Ceph users,
my cluster has been stuck for several days with some PGs backfilling. The
number of misplaced objects slowly decreases down to 5%, and at that
point jumps up again to about 7%, and so on. I found several possible
reasons for this behavior. One is related to the balancer, which anyw
was updated to 126, and after a last bump it is now at 128 with a
~4% of misplaced objects currently decreasing.
Sorry for the noise,
Nicola
On 07/10/22 09:15, Nicola Mori wrote:
Thank you Frank for the insight. I'd need to study the details of all of
this a bit more, but for sure now I understand it a bit better.
Nicola
Dear Ceph users,
I'd need some help in understanding the total space in a CephFS. My
cluster is currently built of 8 machines, the one with the smallest
capacity has 8 TB of total disk space, and the total available raw space
is 153 TB. I set up a 3x replicated metadata pool and a 6+2 erasure
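As an upper bound on what to expect (sizes taken from the paragraph above, the rest is straightforward arithmetic): a 6+2 EC data pool can never expose more than k/(k+m) of the raw capacity, and host imbalance only lowers that further.

```python
# Usable-space upper bound for a 6+2 EC pool on 153 TB raw capacity.
k, m = 6, 2
raw_tb = 153
usable_upper_bound_tb = raw_tb * k / (k + m)
print(round(usable_upper_bound_tb, 2), "TB")  # 114.75 TB
```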
Hi Stefan,
the cluster is built from several old machines, with different numbers
of disks (from 8 to 16) and disk sizes (from 500 GB to 4 TB). After the
PG increase it is still recovering: pgp_num is at 213 and has to grow
up to 256. The balancer status gives:
{
"active": true,
Dear Ceph users,
on one of my nodes I see that the /var/log/messages is being spammed by
these messages:
Oct 16 12:51:11 bofur bash[2473311]: :::172.16.253.2 - -
[16/Oct/2022:10:51:11] "GET /metrics HTTP/1.1" 200 - "" "Prometheus/2.33.4"
Oct 16 12:51:12 bofur bash[2487821]: ts=2022-10-16T
I solved the problem by stopping prometheus on the problematic host,
then removing the folder containing the prometheus config and storage
(/var/lib/ceph//prometheus.) and then restarting
the alertmanager and node-exporter units.
Nicola
Dear Ceph users,
I have one PG in my cluster that is constantly in active+clean+remapped
state. From what I understand there might be a problem with the up set:
# ceph pg map 3.5e
osdmap e23638 pg 3.5e (3.5e) -> up [38,78,55,49,40,39,64,2147483647]
acting [38,78,55,49,40,39,64,68]
The last OSD
Hi Frank, I checked the first hypothesis, and I found something strange.
This is the decompiled rule:
rule wizard_data {
id 1
type erasure
step set_chooseleaf_tries 5
step set_choose_tries 100
step take default
step chooseleaf indep 0 type host
better once you have more host buckets to choose OSDs from.
Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
From: Nicola Mori
Sent: 03 November 2022 10:39:30
To: ceph-users
Subject: [ceph-users] Re: Missing OSD in up
If I use set_choose_tries 100 and choose_total_tries 250 I get a lot of
bad mappings with crushtool;
# crushtool -i better-totaltries--crush.map --test --show-bad-mappings
--rule 1 --num-rep 8 --min-x 1 --max-x 100 --show-choose-tries
bad mapping rule 1 x 319 num_rep 8 result [43,40,58,69
Ok, I'd say I fixed it. I set both parameters to 250, recompiled the
crush map and loaded it, and now the PG is in
active+undersized+degraded+remapped+backfilling state and mapped as:
# ceph pg map 3.5e
osdmap e23741 pg 3.5e (3.5e) -> up [38,78,55,49,40,39,64,20] acting
[38,78,55,49,40,39,64,2