[ceph-users] Re: pool size ...

2022-10-16 Thread Eugen Block

Hi,

for a replicated pool there's a hard-coded limit of 10:

$ ceph osd pool set test-pool size 20
Error EINVAL: pool size must be between 1 and 10

And it seems reasonable to limit a replicated pool: that many replicas
increase cost and network traffic without adding much benefit.
For an erasure-coded pool the number of OSDs is basically the limit.
The largest pool size we have in a customer environment is 18 chunks
(k=7, m=11) across two datacenters (to sustain the loss of one DC) and it
works quite well. They don't have a huge load on the cluster though,
so those 18 chunks don't really hurt. But I don't know what the impact
would be on a heavily used cluster. On a different cluster with a much
higher load we have an EC pool with 9 chunks (k=4, m=5) and it also works
perfectly fine.
But what is your question aiming at? Usually you'd carefully plan your
resiliency requirements depending on the DCs/racks/hosts etc.
and choose a fitting EC profile or replicated size.
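
To illustrate, a minimal sketch of defining such a profile and creating a
pool from it (the profile and pool names are just placeholders, and the
pg_num is only an example value):

$ ceph osd erasure-code-profile set ec-k4-m5 k=4 m=5 crush-failure-domain=host
$ ceph osd pool create my-ec-pool 128 128 erasure ec-k4-m5

For a replicated pool you would instead set the size explicitly, e.g.:

$ ceph osd pool set test-pool size 3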


Regards,
Eugen

Quoting Christopher Durham:


Hi,
I've seen Dan's talk:
https://www.youtube.com/watch?v=0i7ew3XXb7Q
and other similar ones that talk about CLUSTER size.
But I see nothing (perhaps I have not looked hard enough) on any
recommendations regarding max POOL size.
So, are there any limitations on a given pool that has all OSDs of
the same type?
I know that this is vague, and may depend on device type, crush
rule, EC vs replicated, network bandwidth, etc. But if there are any
limitations, or experiences that have exposed limits you don't want
to go over, it would be nice to know.
Also, an anecdotal 'our biggest pool is X, and we don't have
problems', or 'pools over Y started to show problem Z', would be
great too.

Thanks

-Chris




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Spam on /var/log/messages due to config leftover?

2022-10-16 Thread Nicola Mori

Dear Ceph users,

on one of my nodes I see that /var/log/messages is being spammed with
messages like these:


Oct 16 12:51:11 bofur bash[2473311]: :::172.16.253.2 - - 
[16/Oct/2022:10:51:11] "GET /metrics HTTP/1.1" 200 - "" "Prometheus/2.33.4"
Oct 16 12:51:12 bofur bash[2487821]: ts=2022-10-16T10:51:12.324Z 
caller=manager.go:609 level=warn component="rule manager" group=pools 
msg="Evaluating rule failed" rule="alert: CephPoolGrowthWarning\nexpr: 
(predict_linear(ceph_pool_percent_used[2d], 3600 * 24 * 5) * on(pool_id) 
group_right()\n  ceph_pool_metadata) >= 95\nlabels:\n  oid: 
1.3.6.1.4.1.50495.1.2.1.9.2\n  severity: warning\n  type: 
ceph_default\nannotations:\n  description: |\nPool '{{ $labels.name 
}}' will be full in less than 5 days assuming the average fill-up rate 
of the past 48 hours.\n  summary: Pool growth rate may soon exceed it's 
capacity\n" err="found duplicate series for the match group 
{pool_id=\"1\"} on the left hand-side of the operation: 
[{instance=\"bofur.localdomain:9283\", job=\"ceph\", pool_id=\"1\"}, 
{instance=\"172.16.253.3:9283\", job=\"ceph\", 
pool_id=\"1\"}];many-to-many matching not allowed: matching labels must 
be unique on one side"


(sorry for the ugly formatting, but this is the original format). The other
nodes do not show the same behavior. I don't clearly understand the reason;
the only thing I noticed is that ceph_pool_metadata is mentioned in the
message: I had one such pool when experimenting with Ceph, before
deleting that fs and creating the production one. Currently I have only
these pools:


#  ceph osd lspools
1 .mgr
2 wizard_metadata
3 wizard_data

so I don't understand why ceph_pool_metadata is appearing in the logs.
Maybe the log spamming is due to some leftover in the configuration? I
tried stopping and restarting Prometheus: while the service is down the
spamming stops, but it resumes as soon as Prometheus is started again.
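
For reference, the duplicate series from the error can be listed by running
the query from the alert by hand, either in the Prometheus web UI or via its
HTTP API, roughly like this (assuming Prometheus listens on port 9095, the
default for a cephadm-deployed monitoring stack; adjust host and port as
needed):

$ curl -sG 'http://localhost:9095/api/v1/query' \
    --data-urlencode 'query=ceph_pool_metadata{pool_id="1"}'

This should show which instance labels export the conflicting series.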

Thanks in advance for any help,

Nicola
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pool size ...

2022-10-16 Thread Janne Johansson
> Hi,
> I've seen Dan's talk:
> https://www.youtube.com/watch?v=0i7ew3XXb7Q
> and other similar ones that talk about CLUSTER size.
> But I see nothing (perhaps I have not looked hard enough) on any
> recommendations regarding max POOL size.
> So, are there any limitations on a given pool that has all OSDs of the same
> type?
> I know that this is vague, and may depend on device type, crush rule, EC vs
> replicated, network bandwidth, etc. But if there are any limitations, or
> experiences that have exposed limits you don't want to go over, it would be
> nice to know.
> Also, an anecdotal 'our biggest pool is X, and we don't have problems', or
> 'pools over Y started to show problem Z', would be great too.

We got into trouble with a 12-node cluster holding 660M objects of
average size <180k on spin-only disks; it had trouble keeping
all PGs/objects in sync. Having SSD or NVMe for WAL/DB might have
worked out fine, as would perhaps having a lot more hosts.
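
For example, placing the DB/WAL on flash at OSD-creation time looks roughly
like this with ceph-volume (just a sketch, with placeholder device names; a
cephadm-managed cluster would express the same thing through an OSD service
spec instead):

$ ceph-volume lvm create --data /dev/sdb --block.db /dev/nvme0n1p1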

In older times, we built a large cluster made up of 250+ SMR drives on
cheap Atom CPUs; that one crashed and burned. SMR drives are fine and
good for certain types of usage, but "ceph repair and backfill" is
not one of them. So while it would work (and perform rather well for
its price) when all was OK in the cluster, each failed drive or other
outage would have tons of OSDs flap, time out and die during recovery,
because SMR is just poor at non-linear workloads. Just don't buy SMR
unless you treat the drives like tapes, writing large sequential IOs
against them linearly.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] 1 OSD laggy: log_latency_fn slow; heartbeat_map is_healthy had timed out after 15

2022-10-16 Thread Michel Jouvin

Hi,

We have a production cluster made of 12 OSD servers with 16 OSDs each
(all the same hardware) which has been running fine for 5 years (initially
installed with Luminous). It has been running Octopus (15.2.16) for 1 year
and was recently upgraded to 15.2.17 (1 week before the problem started,
but this doesn't seem to be linked to the upgrade). Since the beginning of
October, we have started to see PGs in state "active+laggy" and slow
requests, always related to the same OSD, and looking at its log, we
saw "log_latency_fn slow" messages. There was no disk error logged in
any system log file. Restarting the OSD didn't really help, but no
functional problems were seen.


Looking again at the problem over the last few days, we saw that the cluster
was in HEALTH_WARN state because several PGs had not been deep-scrubbed in
time. In the logs we also saw (but maybe we just missed them initially)
"heartbeat_map is_healthy 'OSD::osd_op_tp thread...' had timed out after
15" messages. This number increased day after day and is now almost 3
times the number of PGs hosted by the laggy OSD (despite hundreds of
deep scrubs running successfully; the cluster has 4297 PGs). It seems
that the list contains all PGs that have a replica on the laggy OSD (all
the pools are 3-replica, no EC). We confirmed that there is no
detected disk error in the system.


Today we restarted the server hosting this OSD, without much hope. It
didn't help, and the same OSD (and only this one) continues to have the
same problem. In addition to the messages mentioned, the admin socket
for this OSD became unresponsive: despite commands being executed (see
below), they did not return within a reasonable amount of time (several
minutes).


As the OSD's RocksDB had probably never been compacted, we decided to
compact the laggy OSD. Although "ceph tell osd.10 compact" never
returned (it was killed after a few hours, as the OSD had been marked
down for a few seconds), the compaction started, lasted ~5
hours... and completed successfully. But the only improvement seen
after the compaction was that the admin socket is now responsive
(though a bit slow). The messages about log_latency_fn and
heartbeat_map are still present (and frequent) and the deep scrubs are
still blocked.


We are looking for advice on what to do to fix this issue. We had in mind
to stop this OSD, zap it and reinstall it, but we are worried it may be
risky to do this with an OSD that has not been deep-scrubbed for a long
time. And we are sure there is a better solution! Understanding the
cause would be a much better approach!


Thanks in advance for any help. Best regards,

Michel



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 1 OSD laggy: log_latency_fn slow; heartbeat_map is_healthy had timed out after 15

2022-10-16 Thread Dan van der Ster
Hi Michel,

Are you sure there isn't a hardware problem with the disk? E.g. maybe you
have SCSI timeouts in dmesg or high %util in iostat?
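
For example, something along these lines (just a sketch; replace sdX with the
OSD's data device):

$ dmesg -T | grep -iE 'sdX|scsi|I/O error'
$ iostat -x 5

and watch the utilization and latency columns for that device.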

Anyway I don't think there's a big risk related to draining and stopping
the osd. Just consider this a disk failure, which can happen at any time
anyway.

Start by marking it out. If there are still too many slow requests or laggy
PGs, try setting primary affinity to zero.
And if that still doesn't work, I wouldn't hesitate to stop that sick OSD
so objects backfill from the replicas.
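
Roughly (a sketch, assuming the laggy OSD is osd.10 as in your compact
command, and that the OSDs run as plain systemd services; a cephadm
deployment would use "ceph orch daemon stop osd.10" instead):

$ ceph osd out 10
$ ceph osd primary-affinity osd.10 0
$ systemctl stop ceph-osd@10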

(We had a somewhat similar issue today, btw .. some brand of SSDs
occasionally hangs IO across a whole SCSI bus when failing. Stopping the
osd revives the rest of the disks on the box).

Cheers, Dan



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 1 OSD laggy: log_latency_fn slow; heartbeat_map is_healthy had timed out after 15

2022-10-16 Thread Michel Jouvin

Hi Dan,

Thanks for your quick answer. No, I checked: there is really nothing in
dmesg or /var/log/messages. We'll try to remove it either gracefully or
abruptly.


Cheers,

Michel



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 1 OSD laggy: log_latency_fn slow; heartbeat_map is_healthy had timed out after 15

2022-10-16 Thread Frank Schilder
A disk may be failing without smartctl or other tools showing anything. Does it 
have remapped sectors? I would just throw the disk out and get a new one.
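
For reference, a quick way to check for remapped/pending sectors, assuming
the OSD's data device is /dev/sdX:

$ smartctl -A /dev/sdX | grep -iE 'realloc|pending|uncorrect'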

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io