[ceph-users] How to configure something like osd_deep_scrub_min_interval?

2023-11-15 Thread Frank Schilder
d since 21 intervals (24h) 19.1322* 19.10f6*
  1 PGs not deep-scrubbed since 23 intervals (24h) 19.19cc*
  1 PGs not deep-scrubbed since 24 intervals (24h) 19.179f*

PGs marked with a * are on busy OSDs and not eligible for scrubbing.


The script (pasted here because attaching doesn't work):

# cat bin/scrub-report 
#!/bin/bash

# Compute last scrub interval count. Scrub interval 6h, deep-scrub interval 24h.
# Print how many PGs have not been (deep-)scrubbed since #intervals.

ceph -f json pg dump pgs 2>&1 > /root/.cache/ceph/pgs_dump.json
echo ""

T0="$(date +%s)"

scrub_info="$(jq --arg T0 "$T0" -rc '.pg_stats[] | [
.pgid,
(.last_scrub_stamp[:19]+"Z" | (($T0|tonumber) - 
fromdateiso8601)/(60*60*6)|ceil),
(.last_deep_scrub_stamp[:19]+"Z" | (($T0|tonumber) - 
fromdateiso8601)/(60*60*24)|ceil),
.state,
(.acting | join(" "))
] | @tsv
' /root/.cache/ceph/pgs_dump.json)"

# less <<<"$scrub_info"

# 1     2          3               4      5..NF
# pg_id scrub-ints deep-scrub-ints status acting[]
awk <<<"$scrub_info" '{
    for(i=5; i<=NF; ++i) pg_osds[$1]=pg_osds[$1] " " $i
    if($4 == "active+clean") {
        si_mx=si_mx<$2 ? $2 : si_mx
        dsi_mx=dsi_mx<$3 ? $3 : dsi_mx
        pg_sn[$2]++
        pg_sn_ids[$2]=pg_sn_ids[$2] " " $1
        pg_dsn[$3]++
        pg_dsn_ids[$3]=pg_dsn_ids[$3] " " $1
    } else if($4 ~ /scrubbing\+deep/) {
        deep_scrubbing[$3]++
        for(i=5; i<=NF; ++i) osd[$i]="busy"
    } else if($4 ~ /scrubbing/) {
        scrubbing[$2]++
        for(i=5; i<=NF; ++i) osd[$i]="busy"
    } else {
        unclean[$2]++
        unclean_d[$3]++
        si_mx=si_mx<$2 ? $2 : si_mx
        dsi_mx=dsi_mx<$3 ? $3 : dsi_mx
        pg_sn[$2]++
        pg_sn_ids[$2]=pg_sn_ids[$2] " " $1
        pg_dsn[$3]++
        pg_dsn_ids[$3]=pg_dsn_ids[$3] " " $1
        for(i=5; i<=NF; ++i) osd[$i]="busy"
    }
}
END {
    print "Scrub report:"
    for(si=1; si<=si_mx; ++si) {
        if(pg_sn[si]==0 && scrubbing[si]==0 && unclean[si]==0) continue;
        printf("%7d PGs not scrubbed since %2d intervals (6h)", pg_sn[si], si)
        if(scrubbing[si]) printf(" %d scrubbing", scrubbing[si])
        if(unclean[si])   printf(" %d unclean", unclean[si])
        if(pg_sn[si]<=5) {
            split(pg_sn_ids[si], pgs)
            osds_busy=0
            for(pg in pgs) {
                split(pg_osds[pgs[pg]], osds)
                for(o in osds) if(osd[osds[o]]=="busy") osds_busy=1
                if(osds_busy) printf(" %s*", pgs[pg])
                if(!osds_busy) printf(" %s", pgs[pg])
            }
        }
        printf("\n")
    }
    print ""
    print "Deep-scrub report:"
    for(dsi=1; dsi<=dsi_mx; ++dsi) {
        if(pg_dsn[dsi]==0 && deep_scrubbing[dsi]==0 && unclean_d[dsi]==0) continue;
        printf("%7d PGs not deep-scrubbed since %2d intervals (24h)", pg_dsn[dsi], dsi)
        if(deep_scrubbing[dsi]) printf(" %d scrubbing+deep", deep_scrubbing[dsi])
        if(unclean_d[dsi])  printf(" %d unclean", unclean_d[dsi])
        if(pg_dsn[dsi]<=5) {
            split(pg_dsn_ids[dsi], pgs)
            osds_busy=0
            for(pg in pgs) {
                split(pg_osds[pgs[pg]], osds)
                for(o in osds) if(osd[osds[o]]=="busy") osds_busy=1
                if(osds_busy) printf(" %s*", pgs[pg])
                if(!osds_busy) printf(" %s", pgs[pg])
            }
        }
        printf("\n")
    }
    print ""
    print "PGs marked with a * are on busy OSDs and not eligible for scrubbing."
}
'

Don't forget the last "'" when copy-pasting.

Thanks for any pointers.
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to configure something like osd_deep_scrub_min_interval?

2023-11-16 Thread Frank Schilder
ng this value to the time interval 
for which about 70% of PGs were scrubbed (leaving 30% eligible), the allocation 
of deep-scrub states is much much better. I expect both tails to get shorter 
and the overall deep-scrub load to go down as well. I hope to reach a state 
where I only need to issue a few deep-scrubs manually per day to get everything 
scrubbed within 1 week and deep-scrubbed within 3-4 weeks.
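
For reference, kicking off those manual deep-scrubs is just a matter of picking
the PGs with the oldest stamps from the report and issuing, for example (the PG
IDs here are just placeholders taken from the report below):

# deep-scrub the PGs with the oldest deep-scrub stamps
ceph pg deep-scrub 19.179f
ceph pg deep-scrub 19.19cc
# a plain scrub works the same way
ceph pg scrub 19.1322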

For now I will wait and see what effect the global settings have on the SSD pools 
and what the HDD pool converges to. This will need 1-2 months of observation and I 
will report back when significant changes show up.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________
From: Frank Schilder 
Sent: Wednesday, November 15, 2023 11:14 AM
To: ceph-users@ceph.io
Subject: [ceph-users] How to configure something like 
osd_deep_scrub_min_interval?

Hi folks,

I am fighting a bit with odd deep-scrub behavior on HDDs and discovered a 
likely cause of why the distribution of last_deep_scrub_stamps is so weird. I 
wrote a small script to extract a histogram of scrubs by "days not scrubbed" 
(more precisely, intervals not scrubbed; see code) to find out how (deep-) 
scrub times are distributed. Output below.

What I expected is along the lines that HDD-OSDs try to scrub every 1-3 days, 
while they try to deep-scrub every 7-14 days. In other words, OSDs that have 
been deep-scrubbed within the last 7 days would *never* be in scrubbing+deep 
state. However, what I see is completely different. There seems to be no 
distinction between scrub- and deep-scrub start times. This is really 
unexpected as nobody would try to deep-scrub HDDs every day. Weekly to 
bi-weekly is normal, specifically for large drives.

Is there a way to configure something like osd_deep_scrub_min_interval (no, I 
don't want to run cron jobs for scrubbing yet)? In the output below, I would 
like to be able to configure a minimum period of 1-2 weeks before the next 
deep-scrub happens. How can I do that?

The observed behavior is very unusual for RAID systems (if it's not a bug in the 
report script). With this behavior it's not surprising that people complain 
about "not deep-scrubbed in time" messages and too high deep-scrub IO load when 
such a large percentage of OSDs is needlessly deep-scrubbed again after only 
1-6 days.

Sample output:

# scrub-report
dumped pgs

Scrub report:
   4121 PGs not scrubbed since  1 intervals (6h)
   3831 PGs not scrubbed since  2 intervals (6h)
   4012 PGs not scrubbed since  3 intervals (6h)
   3986 PGs not scrubbed since  4 intervals (6h)
   2998 PGs not scrubbed since  5 intervals (6h)
   1488 PGs not scrubbed since  6 intervals (6h)
    909 PGs not scrubbed since  7 intervals (6h)
    771 PGs not scrubbed since  8 intervals (6h)
    582 PGs not scrubbed since  9 intervals (6h) 2 scrubbing
    431 PGs not scrubbed since 10 intervals (6h)
    333 PGs not scrubbed since 11 intervals (6h) 1 scrubbing
    265 PGs not scrubbed since 12 intervals (6h)
    195 PGs not scrubbed since 13 intervals (6h)
    116 PGs not scrubbed since 14 intervals (6h)
     78 PGs not scrubbed since 15 intervals (6h) 1 scrubbing
     72 PGs not scrubbed since 16 intervals (6h)
     37 PGs not scrubbed since 17 intervals (6h)
      5 PGs not scrubbed since 18 intervals (6h) 14.237* 19.5cd* 19.12cc* 19.1233* 14.40e*
     33 PGs not scrubbed since 20 intervals (6h)
     23 PGs not scrubbed since 21 intervals (6h)
     16 PGs not scrubbed since 22 intervals (6h)
     12 PGs not scrubbed since 23 intervals (6h)
      8 PGs not scrubbed since 24 intervals (6h)
      2 PGs not scrubbed since 25 intervals (6h) 19.eef* 19.bb3*
      4 PGs not scrubbed since 26 intervals (6h) 19.b4c* 19.10b8* 19.f13* 14.1ed*
      5 PGs not scrubbed since 27 intervals (6h) 19.43f* 19.231* 19.1dbe* 19.1788* 19.16c0*
      6 PGs not scrubbed since 28 intervals (6h)
      2 PGs not scrubbed since 30 intervals (6h) 19.10f6* 14.9d*
      3 PGs not scrubbed since 31 intervals (6h) 19.1322* 19.1318* 8.a*
      1 PGs not scrubbed since 32 intervals (6h) 19.133f*
      1 PGs not scrubbed since 33 intervals (6h) 19.1103*
      3 PGs not scrubbed since 36 intervals (6h) 19.19cc* 19.12f4* 19.248*
      1 PGs not scrubbed since 39 intervals (6h) 19.1984*
      1 PGs not scrubbed since 41 intervals (6h) 14.449*
      1 PGs not scrubbed since 44 intervals (6h) 19.179f*

Deep-scrub report:
   3723 PGs not deep-scrubbed since  1 intervals (24h)
   4621 PGs not deep-scrubbed since  2 intervals (24h) 8 scrubbing+deep
   3588 PGs not deep-scrubbed since  3 intervals (24h) 8 scrubbing+deep
   2929 PGs not deep-scrubbed since  4 intervals (24h) 3 scrubbing+deep
   1705 PGs not deep-scrubbed since  5 intervals (24h) 4 scrubbing+deep
   1904 PGs not deep-scrubbed since  6 intervals (24h) 5 scrubbing+deep
   1540 PGs not deep-scrubbed since  7 intervals (24h) 7 scrubbing+deep
   1304 PGs not deep-

[ceph-users] Re: How to use hardware

2023-11-20 Thread Frank Schilder
Hi Simon,

we are using something similar for ceph-fs. For a backup system your setup can 
work, depending on how you back up. While HDD pools have poor IOP/s 
performance, they are very good for streaming workloads. If you are using 
something like Borg backup that writes huge files sequentially, an HDD back-end 
should be OK.

Here are some things to consider and try out:

1. You really need to get a bunch of enterprise SSDs with power loss protection 
for the FS meta data pool (disable write cache if enabled, this will disable 
volatile write cache and switch to protected caching). We are using (formerly 
Intel) 1.8T SATA drives that we subdivide into 4 OSDs each to raise 
performance. Place the meta-data pool and the primary data pool on these disks. 
Create a secondary data pool on the HDDs and assign it to the root *before* 
creating anything on the FS (see the recommended 3-pool layout for ceph file 
systems in the docs). I would not even consider running this without SSDs. 1 
such SSD per host is the minimum, 2 is better. If Borg or whatever can make use 
of a small fast storage directory, assign a sub-dir of the root to the primary 
data pool.
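
As a rough sketch of that layout assignment (pool names, file system name and 
paths below are just placeholders, adjust to your setup):

# the metadata pool and primary data pool live on the SSDs (set at fs creation);
# the bulk HDD EC pool is added as a secondary data pool
ceph fs add_data_pool cephfs cephfs_data_hdd_ec
# point the FS root at the HDD EC pool *before* creating anything on the FS
setfattr -n ceph.dir.layout.pool -v cephfs_data_hdd_ec /mnt/cephfs
# optionally assign a fast sub-directory to the primary (SSD) data pool
setfattr -n ceph.dir.layout.pool -v cephfs_data_ssd /mnt/cephfs/fast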

2. Calculate with sufficient extra disk space. As long as utilization stays 
below 60-70% bluestore will try to make large object writes sequential, which 
is really important for HDDs. On our cluster we currently have 40% utilization 
and I get full HDD bandwidth out for large sequential reads/writes. Make sure 
your backup application makes large sequential IO requests.

3. As Anthony said, add RAM. You should go for 512G on 50 HDD-nodes. You can 
run the MDS daemons on the OSD nodes. Set a reasonable cache limit and use 
ephemeral pinning. Depending on the CPUs you are using, 48 cores can be plenty. 
The latest generation of Intel Xeon Scalable Processors is so efficient with ceph 
that 1 HT per HDD is more than enough.

4. 3 MON+MGR nodes are sufficient. You can do something else with the remaining 
2 nodes. Of course, you can use them as additional MON+MGR nodes. We also use 5 
and it improves maintainability a lot.

Something more exotic if you have time:

5. To improve sequential performance further, you can experiment with larger 
min_alloc_sizes for OSDs (set at OSD creation time; you will need to scrap and 
re-deploy the cluster to test different values). Every HDD has a preferred 
IO-size for which random IO achieves nearly the same bandwidth as sequential 
writes. (But see 7.)

6. On your set-up you will probably go for a 4+2 EC data pool on HDD. With 
object size 4M the max. chunk size per OSD will be 1M. For many HDDs this is 
the preferred IO size (usually between 256K-1M). (But see 7.)

7. Important: large min_alloc_sizes are only good if your workload *never* 
modifies files, but only replaces them. A bit like a pool without EC overwrite 
enabled. The implementation of EC overwrites has a "feature" that can lead to 
massive allocation amplification. If your backup workload does modifications to 
files instead of adding new+deleting old, do *not* experiment with options 
5.-7. Instead, use the default and make sure you have sufficient unused 
capacity to increase the chances for large bluestore writes (keep utilization 
below 60-70% and just buy extra disks). A workload with large min_alloc_sizes 
has to be S3-like, only upload, download and delete are allowed.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Anthony D'Atri 
Sent: Saturday, November 18, 2023 3:24 PM
To: Simon Kepp
Cc: Albert Shih; ceph-users@ceph.io
Subject: [ceph-users] Re: How to use hardware

Common motivations for this strategy include the lure of unit economics and RUs.

Often ultra dense servers can’t fill racks anyway due to power and weight 
limits.

Here the osd_memory_target would have to be severely reduced to avoid 
oomkilling. Assuming the OSDs are top-load LFF HDDs with expanders, the HBA 
will be a bottleneck as well. I've suffered similar systems for RGW. All the 
clever juggling in the world could not override the math, and the solution was 
QLC.

“We can lose 4 servers”

Do you realize that your data would then be unavailable? When you lose even 
one, you will not be able to restore redundancy and your OSDs likely will 
oomkill.

If you’re running CephFS, how are you provisioning fast OSDs for the metadata 
pool?  Are the CPUs high-clock for MDS responsiveness?

Even given the caveats this seems like a recipe for at best disappointment.

At the very least add RAM.  8GB per OSD plus ample for other daemons.  Better 
would be 3x normal additional hosts for the others.

> On Nov 17, 2023, at 8:33 PM, Simon Kepp  wrote:
>
> I know that your question is regarding the service servers, but may I ask,
> why you are planning to place so many OSDs ( 300) on so few OSD hosts( 6)
> (= 50 OSDs per node)?
> This is possible to do, but sounds like the nodes were desi

[ceph-users] Re: mds slow request with “failed to authpin, subtree is being exported"

2023-11-22 Thread Frank Schilder
There are some unhandled race conditions in the MDS cluster in rare 
circumstances.

We had this issue with mimic and octopus and it went away after manually 
pinning sub-dirs to MDS ranks; see 
https://docs.ceph.com/en/nautilus/cephfs/multimds/?highlight=dir%20pin#manually-pinning-directory-trees-to-a-particular-rank.

This has the added advantage that one can bypass the internal load-balancer, 
which was horrible for our workloads. I have a related post about ephemeral 
pinning on this list from one or two years ago. You should be able to find it. Short 
story: after manually pinning all user directories to ranks, all our problems 
disappeared and performance improved a lot. MDS load dropped from 130% average 
to 10-20%. So did memory consumption and cache recycling.
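
For reference, a minimal sketch of the pinning itself (paths and ranks are 
placeholders): manual pins go on the user directories, while ephemeral 
distributed pinning spreads their children over the active ranks automatically:

# manual pinning of top-level user directories to MDS ranks
setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/users/alice
setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/users/bob
# or ephemeral distributed pinning of everything below /users
setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/users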

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eugen Block 
Sent: Wednesday, November 22, 2023 12:30 PM
To: ceph-users@ceph.io
Subject: [ceph-users]  Re: mds slow request with “failed to authpin, subtree is 
being exported"

Hi,

we've seen this a year ago in a Nautilus cluster with multi-active MDS
as well. It turned up only once within several years and we decided
not to look too closely at that time. How often do you see it? Is it
reproducible? In that case I'd recommend creating a tracker issue.

Regards,
Eugen

Zitat von zxcs :

> HI, Experts,
>
> we are using cephfs 16.2.* with multi-active MDS, and
> recently, we have two nodes mounted with ceph-fuse due to the old OS
> on those systems.
>
> and one node runs a python script with `glob.glob(path)`, while
> another client does a `cp` operation on the same path.
>
> then we see some log about `mds slow request`, and logs complain
> “failed to authpin, subtree is being exported"
>
> then need to restart mds,
>
>
> our question is: is there a deadlock? how can we avoid this,
> and how can we fix it without restarting the MDS (it will influence other users)?
>
>
> Thanks a ton!
>
>
> xz
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Full cluster outage when ECONNREFUSED is triggered

2023-11-24 Thread Frank Schilder
Hi Denis,

I would agree with you that a single misconfigured host should not take out 
healthy hosts under any circumstances. I'm not sure if your incident is 
actually covered by the devs' comments; it is quite possible that you observed 
an unintended side effect that is a bug in handling the connection error. I 
think the intention is to quickly shut down the OSDs that receive connection 
refused (where timeouts are not required), not other OSDs.

A bug report with tracker seems warranted.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Denis Krienbühl 
Sent: Friday, November 24, 2023 9:01 AM
To: ceph-users
Subject: [ceph-users] Full cluster outage when ECONNREFUSED is triggered

Hi

We’ve recently had a serious outage at work, after a host had a network problem:

- We rebooted a single host in a cluster of fifteen hosts across three racks.
- The single host had a bad network configuration after booting, causing it to 
send some packets to the wrong network.
- One network still worked and offered a connection to the mons.
- The other network connection was bad. Packets were refused, not dropped.
- Due to osd_fast_fail_on_connection_refused=true, the broken host forced the 
mons to take all other OSDs down (immediate failure).
- Only after shutting down the faulty host was it possible to restart the downed 
OSDs and restore the cluster.

We have since solved the problem by removing the default route that caused the 
packets to end up in the wrong network, where they were summarily rejected by a 
firewall. That is, we made sure that packets would be dropped in the future, 
not rejected.
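
For readers who want to reproduce the difference, a minimal iptables sketch 
(assuming the default OSD port range; these are not the exact rules we had in 
place):

# REJECT actively answers (ICMP port unreachable or TCP RST) -> the peer sees
# ECONNREFUSED immediately
iptables -A INPUT -p tcp --dport 6800:7300 -j REJECT
# DROP discards silently -> the peer only sees timeouts and the normal
# heartbeat grace / down-reporting logic applies
iptables -A INPUT -p tcp --dport 6800:7300 -j DROP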

Still, I figured I’ll send this experience of ours to this mailing list, as 
this seems to be something others might encounter as well.

In the following PR, that introduced osd_fast_fail_on_connection_refused, 
there’s this description:

> This changeset adds additional handler (handle_refused()) to the dispatchers
> and code that detects when connection attempt fails with ECONNREFUSED error
> (connection refused) which is a clear indication that host is alive, but
> daemon isn't, so daemons can instantly mark the other side as undoubtly
> downed without the need for grace timer.

And this comment:

> As for flapping, we discussed it on ceph-devel ml
> and came to conclusion that it requires either broken firewall or network
> configuration to cause this, and these are more serious issues that should
> be resolved first before worrying about OSDs flapping (either way, flapping
> OSDs could be good for getting someone's attention).

https://github.com/ceph/ceph/pull/8558

It has left us wondering if these are the right assumptions. An ECONNREFUSED 
condition can bring down a whole cluster, and I wonder if there should be some 
kind of safe-guard to ensure that this is avoided. One badly configured host 
should generally not be able to do that, and if the packets are dropped, instead 
of refused, the cluster notices that the OSD down reports come only from one 
host, and acts accordingly.

What do you think? Does this warrant a change in Ceph? I’m happy to provide 
details and create a ticket.

Cheers,

Denis
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Full cluster outage when ECONNREFUSED is triggered

2023-11-24 Thread Frank Schilder
Hi Denis.

>  The mon then propagates that failure, without taking any other reports into 
> consideration:

Exactly. I cannot imagine that this change of behavior is intended. The configs 
on OSD down reporting ought to be honored in any failure situation. Since you 
already investigated the relevant code lines, please update/create the tracker 
with your findings. Hope a dev looks at this.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Denis Krienbühl 
Sent: Friday, November 24, 2023 12:04 PM
To: Burkhard Linke
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: Full cluster outage when ECONNREFUSED is triggered


> On 24 Nov 2023, at 11:49, Burkhard Linke 
>  wrote:
>
> This should not be the case in the reported situation unless setting
> osd_fast_fail_on_connection_refused=true
> (https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#confval-osd_fast_fail_on_connection_refused)
> changes this behaviour.


In our tests it does change the behavior. Usually the mons take 
mon_osd_reporter_subtree_level and mon_osd_min_down_reporters into account. In 
our tests, this is the case if an OSD heartbeat is dropped and the OSD is still 
able to talk to the mons.

However, if the OSD heartbeat is rejected, in our case because of an unrelated 
firewall change, the OSD sends an immediate failure to the mon:
https://github.com/ceph/ceph/blob/febfdd83a7838338033486826ef1fc9a5e8d588e/src/osd/OSD.cc#L6434


The mon then propagates that failure, without taking any other reports into 
consideration:

https://github.com/ceph/ceph/blob/febfdd83a7838338033486826ef1fc9a5e8d588e/src/mon/OSDMonitor.cc#L3367

This is fine when a single OSD goes down and everything else is okay. It then 
has the intended effect of getting rid of the OSD fast. The assumption 
presumably being: If a host can answer with a rejection to the OSD heartbeat, 
it is only the OSD that is affected.

In our case however, a network change caused rejections from an entirely 
different host (a gateway), while a network path to the mons was still 
available. In this case, Ceph does not apply the safe-guards it usually does.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Full cluster outage when ECONNREFUSED is triggered

2023-11-24 Thread Frank Schilder
Hi Denis,

I have to ask a clarifying question. If I understand the intent of 
osd_fast_fail_on_connection_refused correctly, an OSD that receives a 
connection_refused should get marked down fast to avoid unnecessarily long wait 
times. And *only* OSDs that receive connection refused.

In your case, did booting up the server actually create a network route for all 
other OSDs to the wrong network as well? In other words, did it act as a 
gateway and all OSDs received connection refused messages and not just the ones 
on the critical host? If so, your observation would be expected. If not, then 
there is something wrong with the down reporting that should be looked at.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: Friday, November 24, 2023 1:20 PM
To: Denis Krienbühl; Burkhard Linke
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: Full cluster outage when ECONNREFUSED is triggered

Hi Denis.

>  The mon then propagates that failure, without taking any other reports into 
> consideration:

Exactly. I cannot imagine that this change of behavior is intended. The configs 
on OSD down reporting ought to be honored in any failure situation. Since you 
already investigated the relevant code lines, please update/create the tracker 
with your findings. Hope a dev looks at this.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Denis Krienbühl 
Sent: Friday, November 24, 2023 12:04 PM
To: Burkhard Linke
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: Full cluster outage when ECONNREFUSED is triggered


> On 24 Nov 2023, at 11:49, Burkhard Linke 
>  wrote:
>
> This should not be the case in the reported situation unless setting
> osd_fast_fail_on_connection_refused=true
> (https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#confval-osd_fast_fail_on_connection_refused)
> changes this behaviour.


In our tests it does change the behavior. Usually the mons take 
mon_osd_reporter_subtree_level and mon_osd_min_down_reporters into account. In 
our tests, this is the case if an OSD heartbeat is dropped and the OSD is still 
able to talk to the mons.

However, if the OSD heartbeat is rejected, in our case because of an unrelated 
firewall change, the OSD sends an immediate failure to the mon:
https://github.com/ceph/ceph/blob/febfdd83a7838338033486826ef1fc9a5e8d588e/src/osd/OSD.cc#L6434


The mon then propagates that failure, without taking any other reports into 
consideration:

https://github.com/ceph/ceph/blob/febfdd83a7838338033486826ef1fc9a5e8d588e/src/mon/OSDMonitor.cc#L3367

This is fine when a single OSD goes down and everything else is okay. It then 
has the intended effect of getting rid of the OSD fast. The assumption 
presumably being: If a host can answer with a rejection to the OSD heartbeat, 
it is only the OSD that is affected.

In our case however, a network change caused rejections from an entirely 
different host (a gateway), while a network path to the mons was still 
available. In this case, Ceph does not apply the safe-guards it usually does.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph fs (meta) data inconsistent

2023-11-24 Thread Frank Schilder
Hi Xiubo,

thanks for the update. I will test your scripts in our system next week. 
Something important: running both scripts on a single client will not produce a 
difference. You need 2 clients. The inconsistency is between clients, not on 
the same client. For example:

Setup: host1 and host2 with a kclient mount to a cephfs under /mnt/kcephfs

Test 1
- on host1: execute shutil.copy2
- execute ls -l /mnt/kcephfs/ on host1 and host2: same result

Test 2
- on host1: shutil.copy
- execute ls -l /mnt/kcephfs/ on host1 and host2: file size=0 on host 2 while 
correct on host 1

Your scripts only show output of one host, but the inconsistency requires two 
hosts for observation. The stat information is updated on host1, but not 
synchronized to host2 in the second test. In case you can't reproduce that, I 
will append results from our system to the case.
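
A minimal sketch of the two-host sequence I mean (host names, mount point and 
file names are placeholders):

# on host1:
python3 -c 'import shutil; shutil.copy("/mnt/kcephfs/src.dat", "/mnt/kcephfs/dst.dat")'
ls -l /mnt/kcephfs/dst.dat               # size is correct here
# on host2 (the second kclient):
ssh host2 'ls -l /mnt/kcephfs/dst.dat'   # size shows up as 0 here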

Also it would be important to know the python and libc versions. We observe 
this only for newer versions of both.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Xiubo Li 
Sent: Thursday, November 23, 2023 3:47 AM
To: Frank Schilder; Gregory Farnum
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Re: ceph fs (meta) data inconsistent

I just raised one tracker to follow this:
https://tracker.ceph.com/issues/63510

Thanks

- Xiubo
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph fs (meta) data inconsistent

2023-12-01 Thread Frank Schilder
Hi Xiubo,

I uploaded a test script with session output showing the issue. When I look at 
your scripts, I can't see the stat-check on the second host anywhere. Hence, I 
don't really know what you are trying to compare.

If you want me to run your test scripts on our system for comparison, please 
include the part executed on the second host explicitly in an ssh-command. 
Running your scripts alone in their current form will not reproduce the issue.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Xiubo Li 
Sent: Monday, November 27, 2023 3:59 AM
To: Frank Schilder; Gregory Farnum
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Re: ceph fs (meta) data inconsistent


On 11/24/23 21:37, Frank Schilder wrote:
> Hi Xiubo,
>
> thanks for the update. I will test your scripts in our system next week. 
> Something important: running both scripts on a single client will not produce 
> a difference. You need 2 clients. The inconsistency is between clients, not 
> on the same client. For example:

Frank,

Yeah, I did this with 2 different kclients.

Thanks

> Setup: host1 and host2 with a kclient mount to a cephfs under /mnt/kcephfs
>
> Test 1
> - on host1: execute shutil.copy2
> - execute ls -l /mnt/kcephfs/ on host1 and host2: same result
>
> Test 2
> - on host1: shutil.copy
> - execute ls -l /mnt/kcephfs/ on host1 and host2: file size=0 on host 2 while 
> correct on host 1
>
> Your scripts only show output of one host, but the inconsistency requires two 
> hosts for observation. The stat information is updated on host1, but not 
> synchronized to host2 in the second test. In case you can't reproduce that, I 
> will append results from our system to the case.
>
> Also it would be important to know the python and libc versions. We observe 
> this only for newer versions of both.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ____
> From: Xiubo Li 
> Sent: Thursday, November 23, 2023 3:47 AM
> To: Frank Schilder; Gregory Farnum
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] Re: ceph fs (meta) data inconsistent
>
> I just raised one tracker to follow this:
> https://tracker.ceph.com/issues/63510
>
> Thanks
>
> - Xiubo
>
>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [ext] CephFS pool not releasing space after data deletion

2023-12-02 Thread Frank Schilder
Hi Mathias,

have you made any progress on this? Did the capacity become available 
eventually?

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Kuhring, Mathias 
Sent: Friday, October 27, 2023 3:52 PM
To: ceph-users@ceph.io; Frank Schilder
Subject: Re: [ext] [ceph-users] CephFS pool not releasing space after data 
deletion

Dear ceph users,

We are wondering if this might be the same issue as with this bug:
https://tracker.ceph.com/issues/52581

Except that we seem to have snapshots dangling on the old pool,
while the bug report has snapshots dangling on the new pool.
But maybe it's both?

I mean, once the global root layout was pointed to a new pool,
the new pool became responsible for snapshotting, at least of new data, right?
What about data which is overwritten? Is there a conflict of responsibility?

We do have similar listings of snaps with "ceph osd pool ls detail", I
think:

0|0[root@osd-1 ~]# ceph osd pool ls detail | grep -B 1 removed_snaps_queue
pool 1 'cephfs_data' replicated size 3 min_size 2 crush_rule 1
object_hash rjenkins pg_num 115 pgp_num 107 pg_num_target 32
pgp_num_target 32 autoscale_mode on last_change 803558 lfor
0/803250/803248 flags hashpspool,selfmanaged_snaps stripe_width 0
expected_num_objects 1 application cephfs
 removed_snaps_queue
[3541~1,36e4~1,379f~2,3862~1,3876~1,387d~1,388b~1,389a~1,38a6~1,38bc~1,3993~1,3999~1,39a0~1,39a7~1,39ae~1,39b5~3,39be~1,39c5~1,39cc~1]
--
pool 3 'hdd_ec' erasure profile hdd_ec size 3 min_size 2 crush_rule 3
object_hash rjenkins pg_num 2048 pgp_num 2048 autoscale_mode off
last_change 803558 lfor 0/87229/87229 flags
hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 8192 application
cephfs
 removed_snaps_queue
[3541~1,36e4~1,379f~2,3862~1,3876~1,387d~1,388b~1,389a~1,38a6~1,38bc~1,3993~1,3999~1,39a0~1,39a7~1,39ae~1,39b5~3,39be~1,39c5~1,39cc~1]
--
pool 20 'hdd_ec_8_2_pool' erasure profile hdd_ec_8_2_profile size 10
min_size 9 crush_rule 5 object_hash rjenkins pg_num 8192 pgp_num 8192
autoscale_mode off last_change 803558 lfor 0/0/681917 flags
hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 32768
application cephfs
 removed_snaps_queue
[3541~1,36e4~1,379f~2,3862~1,3876~1,387d~1,388b~1,389a~1,38a6~1,38bc~1,3993~1,3999~1,39a0~1,39a7~1,39ae~1,39b5~3,39be~1,39c5~1,39cc~1]


Here, pool hdd_ec_8_2_pool is the one we recently assigned to the root
layout.
Pool hdd_ec is the one which was assigned before and which won't release
space (at least as far as I know).

Is this removed_snaps_queue the same as removed_snaps in the bug issue
(i.e. the label was renamed)?
And is it normal that all queues list the same info or should this be
different per pool?
Might this be related to pools now sharing responsibility over some 
snaps due to layout changes?

And for the big question:
How can I actually trigger/speed up the removal of those snaps?
I find the removed_snaps/removed_snaps_queue mentioned a few times in
the user list,
but never with a conclusive answer on how to deal with them.
And the only mentions in the docs are just change logs.

I also looked into and started cephfs stray scrubbing:
https://docs.ceph.com/en/latest/cephfs/scrub/#evaluate-strays-using-recursive-scrub
But according to the status output, no scrubbing is actually active.
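
For reference, what I ran is roughly the following (the file system name is a 
placeholder, commands as per the linked documentation):

ceph tell mds.cephfs:0 scrub start ~mdsdir recursive
ceph tell mds.cephfs:0 scrub status    # reports no active scrub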

I would appreciate any further ideas. Thanks a lot.

Best Wishes,
Mathias

On 10/23/2023 12:42 PM, Kuhring, Mathias wrote:
> Dear Ceph users,
>
> Our CephFS is not releasing/freeing up space after deleting hundreds of
> terabytes of data.
> By now, this drives us into a "nearfull" osd/pool situation and thus
> throttles IO.
>
> We are on ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5)
> quincy (stable).
>
> Recently, we moved a bunch of data to a new pool with better EC.
> This was done by adding a new EC pool to the FS.
> Then assigning the FS root to the new EC pool via the directory layout xattr
> (so all new data is written to the new pool).
> And finally copying old data to new folders.
>
> I swapped the data as follows to remain the old directory structures.
> I also made snapshots for validation purposes.
>
> So basically:
> cp -r mymount/mydata/ mymount/new/ # this creates copy on new pool
> mkdir mymount/mydata/.snap/tovalidate
> mkdir mymount/new/mydata/.snap/tovalidate
> mv mymount/mydata/ mymount/old/
> mv mymount/new/mydata mymount/
>
> I could see the increase of data in the new pool as expected (ceph df).
> I compared the snapshots with hashdeep to make sure the new data is alright.
>
> Then I went ahead deleting the old data, basically:
> rmdir mymount/old/mydata/.snap/* # this also included a bunch of other
> older snapshots
> rm -r mymount/old/mydata
>
> At first we had a bunch of PGs with snaptrim/snaptrim_wa

[ceph-users] ceph df reports incorrect stats

2023-12-06 Thread Frank Schilder
 1074.27673  host ceph-09
 -23 1075.67920  host ceph-10
 -15 1067.16492  host ceph-11
 -25 1080.21912  host ceph-12
 -83 1061.17480  host ceph-13
 -85 1047.70276  host ceph-14
 -87 1079.02820  host ceph-15
-136 1012.55048  host ceph-16
-139 1073.61475  host ceph-17
-261 1125.57202  host ceph-23
-262 1054.32227  host ceph-24
-148  885.49133  datacenter MultiSite
 -65   86.16304  host ceph-04
 -67  101.50623  host ceph-05
 -69  104.85805  host ceph-06
 -71   96.39923  host ceph-07
 -81   97.54230  host ceph-18
 -94   98.48271  host ceph-19
  -4   97.20181  host ceph-20
 -64   99.77657  host ceph-21
 -66  103.56137  host ceph-22
 -49  885.49133  datacenter ServerRoom
 -55  885.49133  room SR-113
 -65   86.16304  host ceph-04
 -67  101.50623  host ceph-05
 -69  104.85805  host ceph-06
 -71   96.39923  host ceph-07
 -81   97.54230  host ceph-18
 -94   98.48271  host ceph-19
  -4   97.20181  host ceph-20
 -64   99.77657  host ceph-21
 -66  103.56137  host ceph-22
  -1  0  root default

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: EC Profiles & DR

2023-12-06 Thread Frank Schilder
Hi,

the post linked in the previous message is a good source for different 
approaches.

To provide some first-hand experience, I was operating a pool with a 6+2 EC 
profile on 4 hosts for a while (until we got more hosts) and the "subdivide a 
physical host into 2 crush-buckets" approach is actually working best (I 
basically tried all the approaches described in the linked post and they all 
had pitfalls).

Procedure is more or less:

- add a second (logical) host bucket for each physical host by suffixing the host 
name with "-B" (ceph osd crush add-bucket HOSTNAME-B host)
- move half the OSDs per host to this new host bucket (ceph osd crush move 
osd.ID host=HOSTNAME-B)
- make this location persist across reboots of the OSDs (ceph config set osd.ID 
crush_location "host=HOSTNAME-B"), as sketched below

This will allow you to move OSDs back easily when you get more hosts and can 
afford the recommended 1 shard per host. It will also show which and where OSDs 
are moved to with a simple "ceph config dump | grep crush_location". Best of 
all, you don't have to fiddle around with crush maps and hope they do what you 
want. Just use failure domain host and you are good. No more than 2 host 
buckets per physical host means no more than 2 shards per physical host with 
default placement rules.

I was operating this set-up with min_size=6 and feeling bad about it due to the 
reduced maintainability (risk of data loss during maintenance). It's not great 
really, but sometimes there is no way around it. I was happy when I got the 
extra hosts.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Curt 
Sent: Wednesday, December 6, 2023 3:56 PM
To: Patrick Begou
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: EC Profiles & DR

Hi Patrick,

Yes K and M are chunks, but the default crush map is a chunk per host,
which is probably the best way to do it, but I'm no expert. I'm not sure
why you would want to do a crush map with 2 chunks per host and min size 4
as it's just asking for trouble at some point, in my opinion. Anyway,
take a look at this post if your interested in doing 2 chunks per host it
will give you an idea of crushmap setup,
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/NB3M22GNAC7VNWW7YBVYTH6TBZOYLTWA/
.

Regards,
Curt
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph fs (meta) data inconsistent

2023-12-08 Thread Frank Schilder
Hi Xiubo,

I will update the case. I'm afraid this will have to wait a little bit though. 
I'm too occupied for a while and also don't have a test cluster that would help 
speed things up. I will update you, please keep the tracker open.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Xiubo Li 
Sent: Tuesday, December 5, 2023 1:58 AM
To: Frank Schilder; Gregory Farnum
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Re: ceph fs (meta) data inconsistent

Frank,

By using your script I still couldn't reproduce it. Locally my python
version is 3.9.16, and I didn't have other VMs to test other python
versions.

Could you check the tracker to provide the debug logs ?

Thanks

- Xiubo

On 12/1/23 21:08, Frank Schilder wrote:
> Hi Xiubo,
>
> I uploaded a test script with session output showing the issue. When I look 
> at your scripts, I can't see the stat-check on the second host anywhere. 
> Hence, I don't really know what you are trying to compare.
>
> If you want me to run your test scripts on our system for comparison, please 
> include the part executed on the second host explicitly in an ssh-command. 
> Running your scripts alone in their current form will not reproduce the issue.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ____
> From: Xiubo Li 
> Sent: Monday, November 27, 2023 3:59 AM
> To: Frank Schilder; Gregory Farnum
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] Re: ceph fs (meta) data inconsistent
>
>
> On 11/24/23 21:37, Frank Schilder wrote:
>> Hi Xiubo,
>>
>> thanks for the update. I will test your scripts in our system next week. 
>> Something important: running both scripts on a single client will not 
>> produce a difference. You need 2 clients. The inconsistency is between 
>> clients, not on the same client. For example:
> Frank,
>
> Yeah, I did this with 2 different kclients.
>
> Thanks
>
>> Setup: host1 and host2 with a kclient mount to a cephfs under /mnt/kcephfs
>>
>> Test 1
>> - on host1: execute shutil.copy2
>> - execute ls -l /mnt/kcephfs/ on host1 and host2: same result
>>
>> Test 2
>> - on host1: shutil.copy
>> - execute ls -l /mnt/kcephfs/ on host1 and host2: file size=0 on host 2 
>> while correct on host 1
>>
>> Your scripts only show output of one host, but the inconsistency requires 
>> two hosts for observation. The stat information is updated on host1, but not 
>> synchronized to host2 in the second test. In case you can't reproduce that, 
>> I will append results from our system to the case.
>>
>> Also it would be important to know the python and libc versions. We observe 
>> this only for newer versions of both.
>>
>> Best regards,
>> =
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> 
>> From: Xiubo Li 
>> Sent: Thursday, November 23, 2023 3:47 AM
>> To: Frank Schilder; Gregory Farnum
>> Cc: ceph-users@ceph.io
>> Subject: Re: [ceph-users] Re: ceph fs (meta) data inconsistent
>>
>> I just raised one tracker to follow this:
>> https://tracker.ceph.com/issues/63510
>>
>> Thanks
>>
>> - Xiubo
>>
>>
>
>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to configure something like osd_deep_scrub_min_interval?

2023-12-12 Thread Frank Schilder
ep-scrubbed since  4 intervals ( 96h) [1 scrubbing]
  31% 619 PGs not deep-scrubbed since  5 intervals (120h)
  39% 660 PGs not deep-scrubbed since  6 intervals (144h)
  47% 726 PGs not deep-scrubbed since  7 intervals (168h) [1 scrubbing]
  57% 743 PGs not deep-scrubbed since  8 intervals (192h)
  65% 727 PGs not deep-scrubbed since  9 intervals (216h) [1 scrubbing]
  73% 656 PGs not deep-scrubbed since 10 intervals (240h)
  75% 107 PGs not deep-scrubbed since 11 intervals (264h)
  82% 626 PGs not deep-scrubbed since 12 intervals (288h)
  90% 588 PGs not deep-scrubbed since 13 intervals (312h)
  94% 388 PGs not deep-scrubbed since 14 intervals (336h) [1 scrubbing]
  96% 129 PGs not deep-scrubbed since 15 intervals (360h) 2 scrubbing+deep
  98% 207 PGs not deep-scrubbed since 16 intervals (384h) 2 scrubbing+deep
  99%  79 PGs not deep-scrubbed since 17 intervals (408h) 1 scrubbing+deep
 100%  10 PGs not deep-scrubbed since 18 intervals (432h) 2 scrubbing+deep
 8192 PGs out of 8192 reported, 0 missing, 7 scrubbing+deep, 0 unclean.

con-fs2-data2  scrub_min_interval=66h  (11i/84%/625PGs÷i)
con-fs2-data2  scrub_max_interval=168h  (7d)
con-fs2-data2  deep_scrub_interval=336h  (14d/~89%/~520PGs÷d)
osd.338  osd_scrub_interval_randomize_ratio=0.363636  scrubs start after: 
66h..90h
osd.338  osd_deep_scrub_randomize_ratio=0.00
osd.338  osd_max_scrubs=1
osd.338  osd_scrub_backoff_ratio=0.931900  rec. this pool: .9319 (class hdd, 
size 11)
mon.ceph-01  mon_warn_pg_not_scrubbed_ratio=0.50  warn: 10.5d (42.0i)
mon.ceph-01  mon_warn_pg_not_deep_scrubbed_ratio=0.75  warn: 24.5d

Best regards, merry Christmas and a happy new year to everyone!
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: increasing number of (deep) scrubs

2023-12-12 Thread Frank Schilder
Hi all,

if you follow this thread, please see the update in "How to configure something 
like osd_deep_scrub_min_interval?" 
(https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/YUHWQCDAKP5MPU6ODTXUSKT7RVPERBJF/).
 I found out how to tune the scrub machine and I posted a quick update in the 
other thread, because the solution was not to increase the number of scrubs, 
but to tune parameters.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

____
From: Frank Schilder
Sent: Monday, January 9, 2023 9:14 AM
To: Dan van der Ster
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] increasing number of (deep) scrubs

Hi Dan,

thanks for your answer. I don't have a problem with increasing osd_max_scrubs 
(=1 at the moment) as such. I would simply prefer a somewhat finer grained way 
of controlling scrubbing than just doubling or tripling it right away.

Some more info. These 2 pools are data pools for a large FS. Unfortunately, we 
have a large percentage of small files, which is a pain for recovery and 
seemingly also for deep scrubbing. Our OSDs are about 25% used and I had to 
increase the warning interval already to 2 weeks. With all the warning grace 
parameters this means that we manage to deep scrub everything about every 
month. I need to plan for 75% utilisation and a 3 months period is a bit far on 
the risky side.

Our data is to a large percentage cold data. Client reads will not do the check 
for us, we need to combat bit-rot pro-actively.

The reasons I'm interested in parameters initiating more scrubs while also 
converting more scrubs into deep scrubs are, that

1) scrubs seem to complete very fast. I almost never catch a PG in state 
"scrubbing", I usually only see "deep scrubbing".

2) I suspect the low deep-scrub count is due to a low number of deep-scrubs 
scheduled and not due to conflicting per-OSD deep scrub reservations. With the 
OSD count we have and the distribution over 12 servers I would expect at least 
a peak of 50% OSDs being active in scrubbing instead of the 25% peak I'm seeing 
now. It ought to be possible to schedule more PGs for deep scrub than actually 
are.

3) Every OSD having only 1 deep scrub active seems to have no measurable impact 
on user IO. If I could just get more PGs scheduled with 1 deep scrub per OSD it 
would already help a lot. Once this is working, I can eventually increase 
osd_max_scrubs when the OSDs fill up. For now I would just like that (deep) 
scrub scheduling looks a bit harder and schedules more eligible PGs per time 
unit.

If we can get deep scrubbing up to an average of 42PGs completing per hour with 
keeping osd_max_scrubs=1 to maintain current IO impact, we should be able to 
complete a deep scrub with 75% full OSDs in about 30 days. This is the current 
tail-time with 25% utilisation. I believe currently a deep scrub of a PG in 
these pools takes 2-3 hours. Its just a gut feeling from some repair and 
deep-scrub commands, I would need to check logs for more precise info.

Increasing osd_max_scrubs would then be a further and not the only option to 
push for more deep scrubbing. My expectation would be that values of 2-3 are 
fine due to the increasingly higher percentage of cold data for which no 
interference with client IO will happen.

Hope that makes sense and there is a way beyond bumping osd_max_scrubs to 
increase the number of scheduled and executed deep scrubs.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Dan van der Ster 
Sent: 05 January 2023 15:36
To: Frank Schilder
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] increasing number of (deep) scrubs

Hi Frank,

What is your current osd_max_scrubs, and why don't you want to increase it?
With 8+2, 8+3 pools each scrub is occupying the scrub slot on 10 or 11
OSDs, so at a minimum it could take 3-4x the amount of time to scrub
the data than if those were replicated pools.
If you want the scrub to complete in time, you need to increase the
amount of scrub slots accordingly.

On the other hand, IMHO the 1-week deadline for deep scrubs is often
much too ambitious for large clusters -- increasing the scrub
intervals is one solution, or I find it simpler to increase
mon_warn_pg_not_scrubbed_ratio and mon_warn_pg_not_deep_scrubbed_ratio
until you find a ratio that works for your cluster.
Of course, all of this can impact detection of bit-rot, which anyway
can be covered by client reads if most data is accessed periodically.
But if the cluster is mostly idle or objects are generally not read,
then it would be preferable to increase slots osd_max_scrubs.

Cheers, Dan


On Tue, Jan 3, 2023 at 2:30 AM Frank Schilder  wrote:
>
> Hi all,
>
> we are using 16T and 18T spinning drives as OSDs and I'm observing that they 
> are not scrubbed as often as I would like. 

[ceph-users] Re: increasing number of (deep) scrubs

2023-12-13 Thread Frank Schilder
Yes, octopus. -- Frank
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Szabo, Istvan (Agoda) 
Sent: Wednesday, December 13, 2023 6:13 AM
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: increasing number of (deep) scrubs

Hi,

You are on octopus right?


Istvan Szabo
Staff Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com<mailto:istvan.sz...@agoda.com>
---




From: Frank Schilder 
Sent: Tuesday, December 12, 2023 7:33 PM
To: ceph-users@ceph.io 
Subject: [ceph-users] Re: increasing number of (deep) scrubs



Hi all,

if you follow this thread, please see the update in "How to configure something 
like osd_deep_scrub_min_interval?" 
(https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/YUHWQCDAKP5MPU6ODTXUSKT7RVPERBJF/).
 I found out how to tune the scrub machine and I posted a quick update in the 
other thread, because the solution was not to increase the number of scrubs, 
but to tune parameters.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

____
From: Frank Schilder
Sent: Monday, January 9, 2023 9:14 AM
To: Dan van der Ster
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] increasing number of (deep) scrubs

Hi Dan,

thanks for your answer. I don't have a problem with increasing osd_max_scrubs 
(=1 at the moment) as such. I would simply prefer a somewhat finer grained way 
of controlling scrubbing than just doubling or tripling it right away.

Some more info. These 2 pools are data pools for a large FS. Unfortunately, we 
have a large percentage of small files, which is a pain for recovery and 
seemingly also for deep scrubbing. Our OSDs are about 25% used and I had to 
increase the warning interval already to 2 weeks. With all the warning grace 
parameters this means that we manage to deep scrub everything about every 
month. I need to plan for 75% utilisation and a 3 months period is a bit far on 
the risky side.

Our data is to a large percentage cold data. Client reads will not do the check 
for us, we need to combat bit-rot pro-actively.

The reasons I'm interested in parameters initiating more scrubs while also 
converting more scrubs into deep scrubs are, that

1) scrubs seem to complete very fast. I almost never catch a PG in state 
"scrubbing", I usually only see "deep scrubbing".

2) I suspect the low deep-scrub count is due to a low number of deep-scrubs 
scheduled and not due to conflicting per-OSD deep scrub reservations. With the 
OSD count we have and the distribution over 12 servers I would expect at least 
a peak of 50% OSDs being active in scrubbing instead of the 25% peak I'm seeing 
now. It ought to be possible to schedule more PGs for deep scrub than actually 
are.

3) Every OSD having only 1 deep scrub active seems to have no measurable impact 
on user IO. If I could just get more PGs scheduled with 1 deep scrub per OSD it 
would already help a lot. Once this is working, I can eventually increase 
osd_max_scrubs when the OSDs fill up. For now I would just like that (deep) 
scrub scheduling looks a bit harder and schedules more eligible PGs per time 
unit.

If we can get deep scrubbing up to an average of 42PGs completing per hour with 
keeping osd_max_scrubs=1 to maintain current IO impact, we should be able to 
complete a deep scrub with 75% full OSDs in about 30 days. This is the current 
tail-time with 25% utilisation. I believe currently a deep scrub of a PG in 
these pools takes 2-3 hours. Its just a gut feeling from some repair and 
deep-scrub commands, I would need to check logs for more precise info.

Increasing osd_max_scrubs would then be a further and not the only option to 
push for more deep scrubbing. My expectation would be that values of 2-3 are 
fine due to the increasingly higher percentage of cold data for which no 
interference with client IO will happen.

Hope that makes sense and there is a way beyond bumping osd_max_scrubs to 
increase the number of scheduled and executed deep scrubs.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Dan van der Ster 
Sent: 05 January 2023 15:36
To: Frank Schilder
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] increasing number of (deep) scrubs

Hi Frank,

What is your current osd_max_scrubs, and why don't you want to increase it?
With 8+2, 8+3 pools each scrub is occupying the scrub slot on 10 or 11
OSDs, so at a minimum it could take 3-4x the amount of time to scrub
the data than if those were replicated pools.
If you want the 

[ceph-users] Re: How to configure something like osd_deep_scrub_min_interval?

2023-12-13 Thread Frank Schilder
Hi all,

since there seems to be some interest, here some additional notes.

1) The script is tested on octopus. It seems that there was a change in the 
output of ceph commands used and it might need some tweaking to get it to work 
on other versions.

2) If you want to give my findings a shot, you can do so in a gradual way. The 
most important change is setting osd_deep_scrub_randomize_ratio=0 (with 
osd_max_scrubs=1). This makes osd_deep_scrub_interval work exactly like the 
requested osd_deep_scrub_min_interval setting: PGs with a deep-scrub stamp 
younger than osd_deep_scrub_interval will *not* be deep-scrubbed. This is the 
one change to test; all other settings have less impact. The script will not 
report some numbers at the end, but the histogram will be correct. Let it run a 
few deep-scrub-interval rounds until the histogram evens out.
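
A minimal sketch of the corresponding config commands (the deep-scrub interval 
value is only an example, pick what fits your cluster):

ceph config set osd osd_deep_scrub_randomize_ratio 0
ceph config set osd osd_max_scrubs 1
# with the randomize ratio at 0, this acts as the minimum deep-scrub interval
ceph config set osd osd_deep_scrub_interval 1209600   # 14 days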

If you start your test after using osd_max_scrubs>1 for a while - as I did - you 
will need a lot of patience and might need to mute some scrub warnings for a 
while.

3) The changes are mostly relevant for large HDDs that take a long time to 
deep-scrub (many small objects). The overall load reduction, however, is useful 
in general.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to configure something like osd_deep_scrub_min_interval?

2023-12-15 Thread Frank Schilder
Hi all,

another quick update: please use this link to download the script: 
https://github.com/frans42/ceph-goodies/blob/main/scripts/pool-scrub-report

The one I sent originally does not follow the latest version.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to configure something like osd_deep_scrub_min_interval?

2024-01-09 Thread Frank Schilder
Quick answers:


  *   ... osd_deep_scrub_randomize_ratio ... but not on Octopus: is it still a 
valid parameter?

Yes, this parameter exists and can be used to prevent premature deep-scrubs. 
The effect is dramatic.


  *   ... essentially by playing with osd_scrub_min_interval,...

The main parameter is actually osd_deep_scrub_randomize_ratio, all other 
parameters have less effect in terms of scrub load. osd_scrub_min_interval is 
the second most important parameter and needs increasing for large 
SATA-/NL-SAS-HDDs. For sufficiently fast drives the default of 24h is good 
(although might be a bit aggressive/paranoid).


  *   Another small question: you opt for osd_max_scrubs=1 just to make sure
your I/O is not adversely affected by scrubbing, or is there a more
profound reason for that?

Well, not affecting user-IO too much is a quite profound reason and many admins 
try to avoid scrubbing at all when users are on the system. It makes IO 
somewhat unpredictable and can trigger user complaints.

However, there is another profound reason: for HDDs it increases deep-scrub 
load (that is, interference with user IO) a lot while it actually slows down 
the deep-scrubbing. HDDs can't handle the implied random IO of concurrent 
deep-scrubs well. On my system I saw that with osd_max_scrubs=2 the scrub time 
for a PG increased to a bit more than double. In other words: more scrub load, 
less scrub progress = useless, do not do this.

I plan to document the script a bit more and am waiting for some deep-scrub 
histograms to converge to equilibrium. This takes months for our large pools, 
but I would like to have the numbers as an example of how it should look.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Fulvio Galeazzi
Sent: Monday, January 8, 2024 4:21 PM
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: How to configure something like 
osd_deep_scrub_min_interval?

Hallo Frank,
just found this post, thank you! I have also been puzzled/struggling
with scrub/deep-scrub and found your post very useful: will give this a
try, soon.

One thing, first: I am using Octopus, too, but I cannot find any
documentation about osd_deep_scrub_randomize_ratio. I do see that in
past releases, but not on Octopus: is it still a valid parameter?

Let me check whether I understood your procedure: you optimize scrub
time distribution essentially by playing with osd_scrub_min_interval,
thus "forcing" the automated algorithm to preferentially select
older-scrubbed PGs, am I correct?

Another small question: you opt for osd_max_scrubs=1 just to make sure
your I/O is not adversely affected by scrubbing, or is there a more
profound reason for that?

   Thanks!

Fulvio

On 12/13/23 13:36, Frank Schilder wrote:
> Hi all,
>
> since there seems to be some interest, here some additional notes.
>
> 1) The script is tested on octopus. It seems that there was a change in the 
> output of ceph commands used and it might need some tweaking to get it to 
> work on other versions.
>
> 2) If you want to give my findings a shot, you can do so in a gradual way. 
> The most important change is setting osd_deep_scrub_randomize_ratio=0 (with 
> osd_max_scrubs=1), this will make osd_deep_scrub_interval work exactly as the 
> requested osd_deep_scrub_min_interval setting, PGs with a deep-scrub stamp 
> younger than osd_deep_scrub_interval will *not* be deep-scrubbed. This is the 
> one change to test, all other settings have less impact. The script will not 
> report some numbers at the end, but the histogram will be correct. Let it run 
> a few deep-scrub-interval rounds until the histogram is evened out.
>
> If you start your test after using osd_max_scrubs>1 for a while -as I did - 
> you will need a lot of patience and might need to mute some scrub warnings 
> for a while.
>
> 3) The changes are mostly relevant for large HDDs that take a long time to 
> deep-scrub (many small objects). The overall load reduction, however, is 
> useful in general.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

--
Fulvio Galeazzi
GARR-Net Department
tel.: +39-334-6533-250
skype: fgaleazzi70
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Rack outage test failing when nodes get integrated again

2024-01-11 Thread Frank Schilder
Hi Steve,

I also observed that setting mon_osd_reporter_subtree_level to anything other 
than host leads to incorrect behavior.

In our case, I actually observed the opposite. I had 
mon_osd_reporter_subtree_level=datacenter (we have 3 DCs in the crush tree). 
After cutting off a single host with ifdown (also a network cut-off, albeit not 
via firewall rules), I observed that not all OSDs on that host were marked down 
(neither was the host), leading to blocked IO. I didn't wait for very long 
(only a few minutes, less than 5), because it's a production system. I also 
didn't find the time to file a tracker issue. I observed this with mimic, but 
since you report it for Pacific I'm pretty sure it affects all versions.

My guess is that this is not part of the CI testing, at least not in a way that 
covers network cut-off.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Steve Baker 
Sent: Thursday, January 11, 2024 8:45 AM
To: ceph-users@ceph.io
Subject: [ceph-users] Rack outage test failing when nodes get integrated again

Hi, we're currently testing a ceph (v 16.2.14) cluster, 3 mon nodes, 6 osd
nodes each with 8 NVMe SSD OSDs, distributed over 3 racks. Daemons are deployed
in containers with cephadm / podman. We have 2 pools on it, one with 3x
replication and min_size=2, one with an EC (k3m3). With 1 mon node and 2
osd nodes in each rack, the crush rules are configured in a way (for the 3x
pool chooseleaf_firstn rack, for the ec pool choose_indep 3 rack /
chooseleaf_indep 2 host) so that a full rack can go down while the cluster
stays accessible for client operations. Other options we have set are
mon_osd_down_out_subtree_limit=host so that in case of a host/rack outage
the cluster does not automatically start to backfill, but will continue to
run in a degraded state until human interaction comes to fix it. Also we
set mon_osd_reporter_subtree_level=rack.

We tested - while under (synthetic test-)client load - what happens if we
take a full rack (one mon node and 2 osd nodes) out of the cluster. We did
that using iptables to block the nodes of the rack from other nodes of the
cluster (global and cluster network), as well as from the clients. As
expected, the remainder of the cluster continues to run in a degraded state
without starting any backfilling or recovery processes. All client requests
get served while the rack is out.

But then a strange thing happens when we take the rack (1mon node, 2 osd
nodes) back into the cluster again by deleting all firewall rules with
iptables -F at once. Some osds get integrated in the cluster again
immediately but some others remain in state "down" for exactly 10 minutes.
These osds that stay down for the 10 minutes are in a state where they
still seem to not be able to reach other osd nodes (see heartbeat_check
logs below). After these 10 minutes have passed, these osds come up as well
but then at exactly that time, many PGs get stuck in peering state and
other osds that were all the time in the cluster get slow requests and the
cluster blocks client traffic (I think it's just the PGs stuck in peering
soaking all the client threads). Then, exactly 45 minutes after the nodes
of the rack were made reachable by iptables -F again, the situation
recovers, peering succeeds and client load gets handled again.

We have repeated this test several times and it's always exactly the same
10 min "down interval" and a 45 min affected client requests. When we
integrate the nodes into the cluster again one after another with a delay
of some minutes inbetween, this does not happen at all. I wonder what's
happening there. It must be some kind of split-brain situation having to do
with blocking the nodes using iptables but not rebooting them completely.
The 10 min and 45 min intervals I described occur every time. For the 10
minutes, some osds stay down after the hosts got integrated again. It's not
all of the 16 osds from the 2 osd hosts that got integrated again but just
some of them. Which ones varies randomly. Sometimes it's also only just
one. We also observed that the longer the hosts were out of the cluster, the
more osds are affected. Then even after they get up again after 10 minutes,
it takes another 45 minutes until the stuck peering situation resolves.
Also during these 45 minutes, we see slow ops on osds that remained in
the cluster.


See here some OSD logs that get written after the reintegration:


2024-01-04T08:25:03.856+ 7f369132b700 -1 monclient:
_check_auth_rotating possible clock skew, rotating keys expired way too
early (before 2024-01-04T07:25:03.860426+)
2024-01-04T08:25:06.556+ 7f3682882700  0 log_channel(cluster) log [WRN]
: Monitor daemon marked osd.0 down, but it is still running
2024-01-04T08:25

[ceph-users] Re: 3 DC with 4+5 EC not quite working

2024-01-12 Thread Frank Schilder
Is it maybe this here: 
https://docs.ceph.com/en/reef/rados/troubleshooting/troubleshooting-pg/#crush-gives-up-too-soon

I always have to tweak the num-tries parameters.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Torkil Svensgaard 
Sent: Friday, January 12, 2024 10:17 AM
To: Frédéric Nass
Cc: ceph-users@ceph.io; Ruben Vestergaard
Subject: [ceph-users] Re: 3 DC with 4+5 EC not quite working



On 12-01-2024 09:35, Frédéric Nass wrote:
>
> Hello Torkil,

Hi Frédéric

> We're using the same ec scheme than yours with k=5 and m=4 over 3 DCs with 
> the below rule:
>
>
> rule ec54 {
>  id 3
>  type erasure
>  min_size 3
>  max_size 9
>  step set_chooseleaf_tries 5
>  step set_choose_tries 100
>  step take default class hdd
>  step choose indep 0 type datacenter
>  step chooseleaf indep 3 type host
>  step emit
> }
>
> Works fine. The only difference I see with your EC rule is the fact that we 
> set min_size and max_size but I doubt this has anything to do with your 
> situation.

Great, thanks. I wonder if we might need to tweak the min_size. I think
I tried lowering it to no avail and then set it back to 5 after editing
the crush rule.

> Since the cluster still complains about "Pool cephfs.hdd.data has 1024 
> placement groups, should have 2048", did you run "ceph osd pool set 
> cephfs.hdd.data pgp_num 2048" right after running "ceph osd pool set 
> cephfs.hdd.data pg_num 2048"? [1]
>
> Might be that the pool still has 1024 PGs.

Hmm coming from RHCS we didn't do this as:

"
RHCS 4.x and 5.x does not require the pgp_num value to be set. This will
be done by ceph-mgr automatically. For RHCS 4.x and 5.x, only the pg_num
is required to be incremented for the necessary pools.
"

So I only did "ceph osd pool set cephfs.hdd.data pg_num 2048" and let
the mgr handle the rest. I had a watch running to see how it went and
the pool was up to something like 1922 PGs when it got stuck.

As I read the documentation[1] this shouldn't get us stuck like we did,
but we would have to set the pgp_num eventually to get it to rebalance?

Mvh.

Torkil

[1]
https://docs.ceph.com/en/quincy/rados/operations/placement-groups/#setting-the-number-of-pgs

>
>
> Regards,
> Frédéric.
>
> [1] 
> https://docs.ceph.com/en/mimic/rados/operations/placement-groups/#set-the-number-of-placement-groups
>
>
>
> -Original Message-
>
> From: Torkil 
> To: ceph-users 
> Cc: Ruben 
> Sent: Friday, January 12, 2024 09:00 CET
> Subject: [ceph-users] 3 DC with 4+5 EC not quite working
>
> We are looking to create a 3 datacenter 4+5 erasure coded pool but can't
> quite get it to work. Ceph version 17.2.7. These are the hosts (there
> will eventually be 6 hdd hosts in each datacenter):
>
> -33 886.00842 datacenter 714
> -7 209.93135 host ceph-hdd1
>
> -69 69.86389 host ceph-flash1
> -6 188.09579 host ceph-hdd2
>
> -3 233.57649 host ceph-hdd3
>
> -12 184.54091 host ceph-hdd4
> -34 824.47168 datacenter DCN
> -73 69.86389 host ceph-flash2
> -2 201.78067 host ceph-hdd5
>
> -81 288.26501 host ceph-hdd6
>
> -31 264.56207 host ceph-hdd7
>
> -36 1284.48621 datacenter TBA
> -77 69.86389 host ceph-flash3
> -21 190.83224 host ceph-hdd8
>
> -29 199.08838 host ceph-hdd9
>
> -11 193.85382 host ceph-hdd10
>
> -9 237.28154 host ceph-hdd11
>
> -26 187.19536 host ceph-hdd12
>
> -4 206.37102 host ceph-hdd13
>
> We did this:
>
> ceph osd erasure-code-profile set DRCMR_k4m5_datacenter_hdd
> plugin=jerasure k=4 m=5 technique=reed_sol_van crush-root=default
> crush-failure-domain=datacenter crush-device-class=hdd
>
> ceph osd pool create cephfs.hdd.data erasure DRCMR_k4m5_datacenter_hdd
> ceph osd pool set cephfs.hdd.data allow_ec_overwrites true
> ceph osd pool set cephfs.hdd.data pg_autoscale_mode warn
>
> Didn't quite work:
>
> "
> [WARN] PG_AVAILABILITY: Reduced data availability: 1 pg inactive, 1 pg
> incomplete
> pg 33.0 is creating+incomplete, acting
> [104,219,NONE,NONE,NONE,41,NONE,NONE,NONE] (reducing pool
> cephfs.hdd.data min_size from 5 may help; search ceph.com/docs for
> 'incomplete')
> "
>
> I then manually changed the crush rule from this:
>
> "
> rule cephfs.hdd.data {
> id 7
> type erasure
> step set_chooseleaf_tries 5
> step set_choose_tries 100
> step take default class hdd
> step chooseleaf indep 0 type datacenter
> step emit
> }
> "
>
> To this:
>
> "
> rule cephfs.hdd.data {
> id 7
> type erasure
> step set_ch

[ceph-users] Re: Recomand number of k and m erasure code

2024-01-15 Thread Frank Schilder
I would like to add here a detail that is often overlooked: maintainability 
under degraded conditions.

For production systems I would recommend using EC profiles with at least m=3. 
The reason is that if you have a longer problem with a node that is down and 
m=2, it is not possible to do any maintenance on the system without losing 
write access. Don't trust what users claim they are willing to tolerate - at 
least get it in writing. Once a problem occurs they will be at your doorstep no 
matter what they said before.

Similarly, when doing a longer maintenance task with m=2, any disk failure 
during maintenance will imply losing write access.

Having m=3 or larger allows for 2 (or larger) numbers of hosts/OSDs being 
unavailable simultaneously while service is fully operational. That can be a 
life saver in many situations.

An additional reason for larger m is systematic failures of drives if your 
vendor doesn't mix drives from different batches and factories. If a batch has 
a systematic production error, failures are no longer statistically 
independent. In such a situation, if one drive fails, the likelihood that more 
drives fail at the same time is very high. Having a larger number of parity 
shards increases the chances of recovering from such events.

For similar reasons I would recommend to deploy 5 MONs instead of 3. My life 
got so much better after having the extra redundancy.

As some background, in our situation we experience(d) somewhat heavy 
maintenance operations including modifying/updating ceph nodes (hardware, not 
software), exchanging racks, switches, cooling, power etc. This required longer 
downtime and/or moving of servers, and moving the ceph hardware was the easiest 
compared with other systems due to the extra redundancy bits in it. We had no 
service outages during such operations.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Anthony D'Atri 
Sent: Saturday, January 13, 2024 5:36 PM
To: Phong Tran Thanh
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: Recomand number of k and m erasure code

There are nuances, but in general the higher the sum of m+k, the lower the 
performance, because *every* operation has to hit that many drives, which is 
especially impactful with HDDs.  So there’s a tradeoff between storage 
efficiency and performance.  And as you’ve seen, larger parity groups 
especially mean slower recovery/backfill.

There’s also a modest benefit to choosing values of m and k that have small 
prime factors, but I wouldn’t worry too much about that.


You can find EC efficiency tables on the net:


https://docs.netapp.com/us-en/storagegrid-116/ilm/what-erasure-coding-schemes-are.html


I should really add a table to the docs, making a note to do that.

There’s a nice calculator at the OSNEXUS site:

https://www.osnexus.com/ceph-designer


The overhead factor is (k+m) / k

So for a 4,2 profile, that’s 6 / 4 == 1.5

For 6,2, 8 / 6 = 1.33

For 10,2, 12 / 10 = 1.2

and so forth.  As k increases, the incremental efficiency gain sees diminishing 
returns, but performance continues to decrease.

Think of m as the number of copies you can lose without losing data, and m-1 as 
the number you can lose / have down and still have data *available*.

I also suggest that the number of failure domains — in your cases this means 
OSD nodes — be *at least* k+m+1, so in your case you want k+m to be at most 9.

With RBD and many CephFS implementations, we mostly have relatively large RADOS 
objects that are striped over many OSDs.

When using RGW especially, one should attend to average and median S3 object 
size.  There’s an analysis of the potential for space amplification in the docs 
so I won’t repeat it here in detail. This sheet 
https://docs.google.com/spreadsheets/d/1rpGfScgG-GLoIGMJWDixEkqs-On9w8nAUToPQjN8bDI/edit#gid=358760253
 visually demonstrates this.

Basically, for an RGW bucket pool — or for a CephFS data pool storing unusually 
small objects — if you have a lot of S3 objects in the multiples of KB size, 
you waste a significant fraction of underlying storage.  This is exacerbated by 
EC, and the larger the sum of k+m, the more waste.

When people ask me about replication vs EC and EC profile, the first question I 
ask is what they’re storing.  When EC isn’t a non-starter, I tend to recommend 
4,2 as a profile until / unless someone has specific needs and can understand 
the tradeoffs. This lets you store ~~ 2x the data of 3x replication while not 
going overboard on the performance hit.

If you care about your data, do not set m=1.

If you need to survive the loss of many drives, say if your cluster is across 
multiple buildings or sites, choose a larger value of k.  There are people 
running profiles like 4,6 because they have unusual and specific needs.




> On Jan 13, 2024, at 10:32 AM, Phong Tran Thanh  wrote:
>
> Hi ceph user!
>
> I need t

[ceph-users] Re: Performance impact of Heterogeneous environment

2024-01-18 Thread Frank Schilder
For multi- vs. single-OSD per flash drive decision the following test might be 
useful:

We found dramatic improvements using multiple OSDs per flash drive with octopus 
*if* the bottleneck is the kv_sync_thread. Apparently, each OSD has only one, 
and this thread effectively serializes otherwise async IO when it is saturated.

There was a dev discussion about having more kv_sync_threads per OSD daemon by 
splitting up rocks-dbs for PGs, but I don't know if this ever materialized.

My guess is that for good NVMe drives it is possible that a single 
kv_sync_thread can saturate the device and there will be no advantage of having 
more OSDs/device. On not so good drives (SATA/SAS flash) multi-OSD deployments 
usually are better, because the on-disk controller requires concurrency to 
saturate the drive. It's not possible to saturate usual SAS-/SATA-SSDs with 
iodepth=1.

With good NVME drives I have seen fio-tests with direct IO saturate the drive 
with 4K random IO and iodepth=1. You need enough PCI-lanes per drive for that 
and I could imagine that here 1 OSD/drive is sufficient. For such drives, 
storage access quickly becomes CPU bound, so some benchmarking taking all 
system properties into account is required. If you are already CPU bound (too 
many NVMe drives per core, many standard servers with 24+ NVMe drives have that 
property) there is no point adding extra CPU load with more OSD daemons.
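
For reference, a hedged sketch of such a queue-depth-1 test (random reads only, 
so it is non-destructive on a raw device; device path and runtime are 
placeholders):

fio --name=qd1-randread --filename=/dev/nvme0n1 --ioengine=libaio \
    --direct=1 --rw=randread --bs=4k --iodepth=1 --numjobs=1 \
    --time_based --runtime=60 --group_reporting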

Don't just look at single disks, look at the whole system.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Bailey Allison 
Sent: Thursday, January 18, 2024 12:36 AM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Performance impact of Heterogeneous environment

+1 to this, great article and great research. Something we've been keeping a 
very close eye on ourselves.

Overall we've mostly settled on the old keep-it-simple-stupid methodology with 
good results, especially as the benefits have become smaller the more recent 
your ceph version is. We have been rocking single OSD/NVMe, but as always 
everything is workload dependent and there is sometimes a need for doubling up 😊

Regards,

Bailey


> -Original Message-
> From: Maged Mokhtar 
> Sent: January 17, 2024 4:59 PM
> To: Mark Nelson ; ceph-users@ceph.io
> Subject: [ceph-users] Re: Performance impact of Heterogeneous
> environment
>
> Very informative article you did Mark.
>
> IMHO if you find yourself with very high per-OSD core count, it may be logical
> to just pack/add more nvmes per host, you'd be getting the best price per
> performance and capacity.
>
> /Maged
>
>
> On 17/01/2024 22:00, Mark Nelson wrote:
> > It's a little tricky.  In the upstream lab we don't strictly see an
> > IOPS or average latency advantage with heavy parallelism by running
> > muliple OSDs per NVMe drive until per-OSD core counts get very high.
> > There does seem to be a fairly consistent tail latency advantage even
> > at moderately low core counts however.  Results are here:
> >
> > https://ceph.io/en/news/blog/2023/reef-osds-per-nvme/
> >
> > Specifically for jitter, there is probably an advantage to using 2
> > cores per OSD unless you are very CPU starved, but how much that
> > actually helps in practice for a typical production workload is
> > questionable imho.  You do pay some overhead for running 2 OSDs per
> > NVMe as well.
> >
> >
> > Mark
> >
> >
> > On 1/17/24 12:24, Anthony D'Atri wrote:
> >> Conventional wisdom is that with recent Ceph releases there is no
> >> longer a clear advantage to this.
> >>
> >>> On Jan 17, 2024, at 11:56, Peter Sabaini  wrote:
> >>>
> >>> One thing that I've heard people do but haven't done personally with
> >>> fast NVMes (not familiar with the IronWolf so not sure if they
> >>> qualify) is partition them up so that they run more than one OSD
> >>> (say 2 to 4) on a single NVMe to better utilize the NVMe bandwidth.
> >>> See
> >>> https://ceph.com/community/bluestore-default-vs-tuned-
> performance-co
> >>> mparison/
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> >> email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email
> to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Adding OSD's results in slow ops, inactive PG's

2024-01-18 Thread Frank Schilder
Hi, maybe this is related. On a system with many disks I also had aio problems 
causing OSDs to hang. Here it was the kernel parameter fs.aio-max-nr that was 
way too low by default. I bumped it to fs.aio-max-nr = 1048576 (sysctl/tuned) 
and OSDs came up right away.
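
A minimal sketch of applying that (the file name is arbitrary; on tuned-managed 
systems put it into the tuned profile instead):

# apply immediately
sysctl -w fs.aio-max-nr=1048576
# persist across reboots
echo "fs.aio-max-nr = 1048576" > /etc/sysctl.d/90-ceph-aio.conf
sysctl --system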

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eugen Block 
Sent: Thursday, January 18, 2024 9:46 AM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Adding OSD's results in slow ops, inactive PG's

I'm glad to hear (or read) that it worked for you as well. :-)

Zitat von Torkil Svensgaard :

> On 18/01/2024 09:30, Eugen Block wrote:
>> Hi,
>>
>>> [ceph: root@lazy /]# ceph-conf --show-config | egrep
>>> osd_max_pg_per_osd_hard_ratio
>>> osd_max_pg_per_osd_hard_ratio = 3.00
>>
>> I don't think this is the right tool, it says:
>>
>>> --show-config-value    Print the corresponding ceph.conf value
>>> that matches the specified key.
>>> Also searches
>>> global defaults.
>>
>> I suggest to query the daemon directly:
>>
>> storage01:~ # ceph config set osd osd_max_pg_per_osd_hard_ratio 5
>>
>> storage01:~ # ceph tell osd.0 config get osd_max_pg_per_osd_hard_ratio
>> {
>> "osd_max_pg_per_osd_hard_ratio": "5.00"
>> }
>
> Copy that, verified to be 5 now.
>
>>> Daemons are running but those last OSDs won't come online.
>>> I've tried upping bdev_aio_max_queue_depth but it didn't seem to
>>> make a difference.
>>
>> I don't have any good idea for that right now except what you
>> already tried. Which values for bdev_aio_max_queue_depth have you
>> tried?
>
> The previous value was 1024, I bumped it to 4096.
>
> A couple of the OSDs seemingly stuck on the aio thing has now come
> to, so I went ahead and added the rest. Some of them came in right
> away, some are stuck on the aio thing. Hopefully they will recover
> eventually.
>
> Thanks you again for the osd_max_pg_per_osd_hard_ratio suggestion,
> that seems to have solved the core issue =)
>
> Mvh.
>
> Torkil
>
>>
>> Zitat von Torkil Svensgaard :
>>
>>> On 18/01/2024 07:48, Eugen Block wrote:
>>>> Hi,
>>>>
>>>>>  -3281> 2024-01-17T14:57:54.611+ 7f2c6f7ef540  0 osd.431
>>>>> 2154828 load_pgs opened 750 pgs <---
>>>>
>>>> I'd say that's close enough to what I suspected. ;-) Not sure why
>>>> the "maybe_wait_for_max_pg" message isn't there but I'd give it a
>>>> try with a higher osd_max_pg_per_osd_hard_ratio.
>>>
>>> Might have helped, not quite sure.
>>>
>>> I've set these since I wasn't sure which one was the right one?:
>>>
>>> "
>>> ceph config dump | grep osd_max_pg_per_osd_hard_ratio
>>> globaladvanced  osd_max_pg_per_osd_hard_ratio
>>> 5.00 osd   advanced
>>> osd_max_pg_per_osd_hard_ratio5.00
>>> "
>>>
>>> Restarted MONs and MGRs. Still getting this with ceph-conf though:
>>>
>>> "
>>> [ceph: root@lazy /]# ceph-conf --show-config | egrep
>>> osd_max_pg_per_osd_hard_ratio
>>> osd_max_pg_per_osd_hard_ratio = 3.00
>>> "
>>>
>>> I re-added a couple small SSD OSDs and they came in just fine. I
>>> then added a couple HDD OSDs and they also came in after a bit of
>>> aio_submit spam. I added a couple more and have now been looking
>>> at this for 40 minutes:
>>>
>>>
>>> "
>>> ...
>>>
>>> 2024-01-18T07:42:01.789+ 7f734fa04700 -1 bdev(0x56295d586400
>>> /var/lib/ceph/osd/ceph-436/block) aio_submit retries 10
>>> 2024-01-18T07:42:01.808+ 7f734fa04700 -1 bdev(0x56295d586400
>>> /var/lib/ceph/osd/ceph-436/block) aio_submit retries 4
>>> 2024-01-18T07:42:01.819+ 7f735d1b8700 -1 bdev(0x56295d586400
>>> /var/lib/ceph/osd/ceph-436/block) aio_submit retries 82
>>> 2024-01-18T07:42:07.499+ 7f734fa04700 -1 bdev(0x56295d586400
>>> /var/lib/ceph/osd/ceph-436/block) aio_submit retries 6
>>> 2024-01-18T07:42:07.542+ 7f734fa04700 -1 bdev(0x56295d586400
>>> /var/lib/ceph/osd/ceph-436/block) aio_submit retries 8
>>> 2024-01-18T07:42:07.554+ 7f735d1b8700 -1 bdev(0x56295d586400
>>> /var/lib/ceph/osd/ceph-436/block) aio_submi

[ceph-users] Re: Degraded PGs on EC pool when marking an OSD out

2024-01-22 Thread Frank Schilder
You seem to have a problem with your crush rule(s):

14.3d ... [18,17,16,3,1,0,NONE,NONE,12]

If you really just took out 1 OSD, having 2xNONE in the acting set indicates 
that your crush rule can't find valid mappings. You might need to tune crush 
tunables: 
https://docs.ceph.com/en/reef/rados/troubleshooting/troubleshooting-pg/?highlight=crush%20gives%20up#troubleshooting-pgs

It is possible that your low OSD count causes the "crush gives up too soon" 
issue. You might also consider using a crush rule that places exactly 3 shards 
per host (examples were in posts just last week). Otherwise, it is not 
guaranteed that "... data remains available if a whole host goes down ..." 
because you might have 4 chunks on one of the hosts and fall below min_size 
(the failure domain of your crush rule for the EC profiles is OSD).

To test whether your crush rules can generate valid mappings, you can pull the 
osdmap of your cluster and use osdmaptool to experiment with it without risk of 
destroying anything. It allows you to try different crush rules and failure 
scenarios off-line, but on real cluster meta-data.
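
A hedged sketch of such an off-line experiment (pool and OSD ids are taken from 
this thread as placeholders; check "osdmaptool --help" for the exact options of 
your release):

# grab the current osdmap
ceph osd getmap -o /tmp/osdmap
# dump the computed up/acting sets for pool 14
osdmaptool /tmp/osdmap --test-map-pgs-dump --pool 14
# repeat with osd.14 marked out to see how the mappings change
osdmaptool /tmp/osdmap --mark-out 14 --test-map-pgs-dump --pool 14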

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Hector Martin 
Sent: Friday, January 19, 2024 10:12 AM
To: ceph-users@ceph.io
Subject: [ceph-users] Degraded PGs on EC pool when marking an OSD out

I'm having a bit of a weird issue with cluster rebalances with a new EC
pool. I have a 3-machine cluster, each machine with 4 HDD OSDs (+1 SSD).
Until now I've been using an erasure coded k=5 m=3 pool for most of my
data. I've recently started to migrate to a k=5 m=4 pool, so I can
configure the CRUSH rule to guarantee that data remains available if a
whole host goes down (3 chunks per host, 9 total). I also moved the 5,3
pool to this setup, although by nature I know its PGs will become
inactive if a host goes down (need at least k+1 OSDs to be up).

I've only just started migrating data to the 5,4 pool, but I've noticed
that any time I trigger any kind of backfilling (e.g. take one OSD out),
a bunch of PGs in the 5,4 pool become degraded (instead of just
misplaced/backfilling). This always seems to happen on that pool only,
and the object count is a significant fraction of the total pool object
count (it's not just "a few recently written objects while PGs were
repeering" or anything like that, I know about that effect).

Here are the pools:

pool 13 'cephfs2_data_hec5.3' erasure profile ec5.3 size 8 min_size 6
crush_rule 7 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode
warn last_change 14133 lfor 0/11307/11305 flags
hashpspool,ec_overwrites,bulk stripe_width 20480 application cephfs
pool 14 'cephfs2_data_hec5.4' erasure profile ec5.4 size 9 min_size 6
crush_rule 7 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode
warn last_change 14509 lfor 0/0/14234 flags
hashpspool,ec_overwrites,bulk stripe_width 20480 application cephfs

EC profiles:

# ceph osd erasure-code-profile get ec5.3
crush-device-class=
crush-failure-domain=osd
crush-root=default
jerasure-per-chunk-alignment=false
k=5
m=3
plugin=jerasure
technique=reed_sol_van
w=8

# ceph osd erasure-code-profile get ec5.4
crush-device-class=
crush-failure-domain=osd
crush-root=default
jerasure-per-chunk-alignment=false
k=5
m=4
plugin=jerasure
technique=reed_sol_van
w=8

They both use the same CRUSH rule, which is designed to select 9 OSDs
balanced across the hosts (of which only 8 slots get used for the older
5,3 pool):

rule hdd-ec-x3 {
id 7
type erasure
step set_chooseleaf_tries 5
step set_choose_tries 100
step take default class hdd
step choose indep 3 type host
step choose indep 3 type osd
step emit
}

If I take out an OSD (14), I get something like this:

health: HEALTH_WARN
Degraded data redundancy: 37631/120155160 objects degraded
(0.031%), 38 pgs degraded

All the degraded PGs are in the 5,4 pool, and the total object count is
around 50k, so this is *most* of the data in the pool becoming degraded
just because I marked an OSD out (without stopping it). If I mark the
OSD in again, the degraded state goes away.

Example degraded PGs:

# ceph pg dump | grep degraded
dumped all
14.3c812   0   838  00
119250277580   0  1088 0  1088
active+recovery_wait+undersized+degraded+remapped
2024-01-19T18:06:41.786745+0900 15440'1088 15486:10772
[18,17,16,1,3,2,11,13,12]  18[18,17,16,1,3,2,11,NONE,12]
 18  14537'432  2024-01-12T11:25:54.168048+0900
0'0  2024-01-08T15:18:21.654679+0900  02
 periodic scrub scheduled @ 2024-01-21T08:00:23.572904+0900
  2410
14.3d772   0  1602  00
11303280223 

[ceph-users] List contents of stray buckets with octopus

2024-01-24 Thread Frank Schilder
Hi all,

I need to list the contents of the stray buckets on one of our MDSes. The MDS 
reports 772674 stray entries. However, if I dump its cache and grep for stray I 
get only 216 hits.

How can I get to the contents of the stray buckets?

Please note that Octopus is still hit by https://tracker.ceph.com/issues/57059, 
so a "dump tree" will not work. In addition, I clearly don't just need the 
entries in cache, I need a listing of everything. How can I get that? I'm 
willing to run rados commands and pipe through ceph-dencoder if necessary.
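
For what it's worth, a hedged sketch of a rados-level listing (assuming rank 0, 
whose stray directories are inodes 0x600-0x609; their dirfrag objects live in 
the CephFS metadata pool - called cephfs_metadata here as a placeholder - and 
the dentries are omap keys):

for i in 600 601 602 603 604 605 606 607 608 609; do
    echo "== stray dir ${i} =="
    rados -p cephfs_metadata listomapkeys "${i}.00000000"
done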

Thanks and best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Degraded PGs on EC pool when marking an OSD out

2024-01-24 Thread Frank Schilder
Hi,

Hector also claims that he observed an incomplete acting set after *adding* an 
OSD. Assuming that the cluster was health OK before that, that should not 
happen in theory. In practice this was observed with certain definitions of 
crush maps. There is, for example, the issue with "choose" and "chooseleaf" not 
doing the same thing in situations they should. Another one was that spurious 
(temporary) allocations of PGs could exceed hard limits without being obvious 
or reported at all. Without seeing the crush maps its hard to tell what is 
going on. With just 3 hosts and 4 OSDs per hosts the cluster might be hitting 
corner cases with such a wide EC profile.

Having the osdmap of the cluster in normal conditions would allow one to 
simulate OSD downs and ups off-line, and one might gain insight into why crush 
fails to compute 
a complete acting set (yes, I'm not talking about the up set, I was always 
talking about the acting set). There might also be an issue with the 
PG-/OSD-map logs tracking the full history of the PGs in question.

A possible way to test is to issue a re-peer command after all peering has 
finished on a PG with an incomplete acting set, to see if this resolves the PG. 
If so, there is a temporary condition that prevents the PGs from becoming clean 
when going through the standard peering procedure.
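
A hedged example, using one of the PG ids from this thread:

ceph pg repeer 14.3d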

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eugen Block 
Sent: Wednesday, January 24, 2024 9:45 AM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Degraded PGs on EC pool when marking an OSD out

Hi,

this topic pops up every now and then, and although I don't have
definitive proof for my assumptions I still stand with them. ;-)
As the docs [2] already state, it's expected that PGs become degraded
after some sort of failure (setting an OSD "out" falls into that
category IMO):

> It is normal for placement groups to enter “degraded” or “peering”
> states after a component failure. Normally, these states reflect the
> expected progression through the failure recovery process. However,
> a placement group that stays in one of these states for a long time
> might be an indication of a larger problem.

And you report that your PGs do not stay in that state but eventually
recover. My understanding is as follows:
PGs have to be recreated on different hosts/OSDs after setting an OSD
"out". During this transition (peering) the PGs are degraded until the
newly assigned OSD have noticed their new responsibility (I'm not
familiar with the actual data flow). The degraded state then clears as
long as the out OSD is up (its PGs are active). If you stop that OSD
("down") the PGs become and stay degraded until they have been fully
recreated on different hosts/OSDs. Not sure what impacts the duration
until the degraded state clears, but in my small test cluster (similar
osd tree as yours) the degraded state clears after a few seconds only,
but I only have a few (almost empty) PGs in the EC test pool.

I guess a comment from the devs couldn't hurt to clear this up.

[2]
https://docs.ceph.com/en/reef/rados/troubleshooting/troubleshooting-pg/#stuck-placement-groups

Zitat von Hector Martin :

> On 2024/01/22 19:06, Frank Schilder wrote:
>> You seem to have a problem with your crush rule(s):
>>
>> 14.3d ... [18,17,16,3,1,0,NONE,NONE,12]
>>
>> If you really just took out 1 OSD, having 2xNONE in the acting set
>> indicates that your crush rule can't find valid mappings. You might
>> need to tune crush tunables:
>> https://docs.ceph.com/en/reef/rados/troubleshooting/troubleshooting-pg/?highlight=crush%20gives%20up#troubleshooting-pgs
>
> Look closely: that's the *acting* (second column) OSD set, not the *up*
> (first column) OSD set. It's supposed to be the *previous* set of OSDs
> assigned to that PG, but inexplicably some OSDs just "fall off" when the
> PGs get remapped around.
>
> Simply waiting lets the data recover. At no point are any of my PGs
> actually missing OSDs according to the current cluster state, and CRUSH
> always finds a valid mapping. Rather the problem is that the *previous*
> set of OSDs just loses some entries some for some reason.
>
> The same problem happens when I *add* an OSD to the cluster. For
> example, right now, osd.15 is out. This is the state of one pg:
>
> 14.3d   1044   0 0  00
> 157307567310   0  1630 0  1630
> active+clean  2024-01-22T20:15:46.684066+0900 15550'1630
> 15550:16184  [18,17,16,3,1,0,11,14,12]  18
> [18,17,16,3,1,0,11,14,12]  18 15550'1629
> 2024-01-22T20:15:46.683491+0900  0'0
> 2024-01-08T15:18:21.654679+0900  0  

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-26 Thread Frank Schilder
Hi, this message is one of those that are often spurious. I don't recall in 
which thread/PR/tracker I read it, but the story was something like this:

If an MDS gets under memory pressure it will request dentry items back from 
*all* clients, not just the active ones or the ones holding many of them. If 
you have a client that is below the min-threshold for dentries (it's one of the 
client/MDS tuning options), it will not respond. This client will be flagged as 
not responding, which is a false positive.

I believe the devs are working on a fix to get rid of these spurious warnings. 
There is a "bug/feature" in the MDS that does not clear this warning flag for 
inactive clients. Hence, the message lingers and never disappears. I usually 
clear it with an "echo 3 > /proc/sys/vm/drop_caches" on the client. However, 
except for being annoying in the dashboard, it has no performance or other 
negative impact.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eugen Block 
Sent: Friday, January 26, 2024 10:05 AM
To: Özkan Göksu
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: 1 clients failing to respond to cache pressure 
(quincy:17.2.6)

Performance for small files is more about IOPS rather than throughput,
and the IOPS in your fio tests look okay to me. What you could try is
to split the PGs to get around 150 or 200 PGs per OSD. You're
currently at around 60 according to the ceph osd df output. Before you
do that, can you share 'ceph pg ls-by-pool cephfs.ud-data.data |
head'? I don't need the whole output, just to see how many objects
each PG has. We had a case once where that helped, but it was an older
cluster and the pool was backed by HDDs and separate rocksDB on SSDs.
So this might not be the solution here, but it could improve things as
well.


Zitat von Özkan Göksu :

> Every user has a 1x subvolume and I only have 1 pool.
> At the beginning we were using each subvolume for ldap home directory +
> user data.
> When a user logins any docker on any host, it was using the cluster for
> home and the for user related data, we was have second directory in the
> same subvolume.
> Time to time users were feeling a very slow home environment and after a
> month it became almost impossible to use home. VNC sessions became
> unresponsive and slow etc.
>
> 2 weeks ago, I had to migrate home to a ZFS storage and now the overall
> performance is better for only user_data without home.
> But still the performance is not good enough as I expected because of the
> problems related to MDS.
> The usage is low but allocation is high and Cpu usage is high. You saw the
> IO Op/s, it's nothing but allocation is high.
>
> I develop a fio benchmark script and I run the script on 4x test server at
> the same time, the results are below:
> Script:
> https://github.com/ozkangoksu/benchmark/blob/8f5df87997864c25ef32447e02fcd41fda0d2a67/iobench.sh
>
> https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-01.txt
> https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-02.txt
> https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-03.txt
> https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-04.txt
>
> While running benchmark, I take sample values for each type of iobench run.
>
> Seq Write benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
> client:   70 MiB/s rd, 762 MiB/s wr, 337 op/s rd, 24.41k op/s wr
> client:   60 MiB/s rd, 551 MiB/s wr, 303 op/s rd, 35.12k op/s wr
> client:   13 MiB/s rd, 161 MiB/s wr, 101 op/s rd, 41.30k op/s wr
>
> Seq Read benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
> client:   1.6 GiB/s rd, 219 KiB/s wr, 28.76k op/s rd, 89 op/s wr
> client:   370 MiB/s rd, 475 KiB/s wr, 90.38k op/s rd, 89 op/s wr
>
> Rand Write benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
> client:   63 MiB/s rd, 1.5 GiB/s wr, 8.77k op/s rd, 5.50k op/s wr
> client:   14 MiB/s rd, 1.8 GiB/s wr, 81 op/s rd, 13.86k op/s wr
> client:   6.6 MiB/s rd, 1.2 GiB/s wr, 61 op/s rd, 30.13k op/s wr
>
> Rand Read benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
> client:   317 MiB/s rd, 841 MiB/s wr, 426 op/s rd, 10.98k op/s wr
> client:   2.8 GiB/s rd, 882 MiB/s wr, 25.68k op/s rd, 291 op/s wr
> client:   4.0 GiB/s rd, 226 MiB/s wr, 89.63k op/s rd, 124 op/s wr
> client:   2.4 GiB/s rd, 295 KiB/s wr, 197.86k op/s rd, 20 op/s wr
>
> It seems I only have problems with the 4K,8K,16K other sector sizes.
>
>
>
>
> Eugen Block , 25 Oca 2024 Per, 19:06 tarihinde şunu yazdı:
>
>> I understand that your MDS shows a high CPU usage, but other than that
>> what is y

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-27 Thread Frank Schilder
Hi Özkan,

> ... The client is actually at idle mode and there is no reason to fail at 
> all. ...

if you re-read my message, you will notice that I wrote that

- its not the client failing, its a false positive error flag that
- is not cleared for idle clients.

You seem to encounter exactly this situation and a simple

echo 3 > /proc/sys/vm/drop_caches

would probably have cleared the warning. There is nothing wrong with your 
client; it's an issue with the client-MDS communication protocol that is 
probably still under review. You will encounter these warnings every now and 
then until it's fixed.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 6 pgs not deep-scrubbed in time

2024-01-29 Thread Frank Schilder
Hi Michel,

are your OSDs HDD or SSD? If they are HDD, it's possible that they can't handle 
the deep-scrub load with default settings. In that case, have a look at this 
post 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/YUHWQCDAKP5MPU6ODTXUSKT7RVPERBJF/
 for some basic tuning info and a script to check your scrub stamp distribution.

You should also get rid of slow/failing ones. Look at smartctl output and throw 
out disks with remapped sectors, uncorrectable r/w errors or unusually many 
corrected read-write-errors (assuming you have disks with ECC).
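
A hedged example of what to look for (device path is a placeholder; exact 
attribute names vary between drive vendors):

smartctl -a /dev/sdX | egrep -i \
    'Reallocated_Sector|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error'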

A basic calculation for deep-scrub is as follows: max number of PGs that can be 
scrubbed at the same time: A=#OSDs/replication factor (rounded down). Take the 
B=deep-scrub times from the OSD logs (grep for deep-scrub) in minutes. Your 
pool can deep-scrub at a max A*24*(60/B) PGs per day. For reasonable operations 
you should not do more than 50% of that. With that you can calculate how many 
days it needs to deep-scrub your PGs.
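
A hedged worked example with the numbers from this thread (48 OSDs, replication 
factor 3, 385 PGs; the 120-minute deep-scrub time is purely an assumption, take 
yours from the OSD logs):

A = floor(48 / 3) = 16 PGs deep-scrubbed concurrently
B = 120 min per PG deep-scrub (assumed)
max rate = A*24*(60/B) = 16*24*0.5 = 192 PGs/day
at 50% of that: ~96 PGs/day, so 385 PGs need roughly 4 days per round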

A usual reason for slow deep-scrub progress is too few PGs. With replication 
factor 3 and 48 OSDs you have a PG budget of ca. 3200 (ca 200/OSD) but use only 
385. You should consider increasing the PG count for pools with lots of data. 
This should already relax the situation somewhat. Then do the calc above and 
tune deep-scrub times per pool such that they match with disk performance.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Michel Niyoyita 
Sent: Monday, January 29, 2024 7:42 AM
To: E Taka
Cc: ceph-users
Subject: [ceph-users] Re: 6 pgs not deep-scrubbed in time

Now they are increasing. On Friday I tried deep-scrubbing manually and it
completed successfully, but on Monday morning I found that the count had
increased to 37. Is it best to deep-scrub manually while we are using the
cluster? If not, what is the best way to address this?

Best Regards.

Michel

 ceph -s
  cluster:
id: cb0caedc-eb5b-42d1-a34f-96facfda8c27
health: HEALTH_WARN
37 pgs not deep-scrubbed in time

  services:
mon: 3 daemons, quorum ceph-mon1,ceph-mon2,ceph-mon3 (age 11M)
mgr: ceph-mon2(active, since 11M), standbys: ceph-mon3, ceph-mon1
osd: 48 osds: 48 up (since 11M), 48 in (since 11M)
rgw: 6 daemons active (6 hosts, 1 zones)

  data:
pools:   10 pools, 385 pgs
objects: 6.00M objects, 23 TiB
usage:   151 TiB used, 282 TiB / 433 TiB avail
pgs: 381 active+clean
 4   active+clean+scrubbing+deep

  io:
client:   265 MiB/s rd, 786 MiB/s wr, 3.87k op/s rd, 699 op/s wr

On Sun, Jan 28, 2024 at 6:14 PM E Taka <0eta...@gmail.com> wrote:

> 22 is more often there than the others. Other operations may be blocked
> because a deep-scrub has not finished yet. I would remove OSD 22, just to
> be sure about this: ceph orch osd rm osd.22
>
> If this does not help, just add it again.
>
> On Fri, Jan 26, 2024 at 08:05, Michel Niyoyita <
> mico...@gmail.com> wrote:
>
>> It seems these are different OSDs, as shown here. How have you managed to
>> sort this out?
>>
>> ceph pg dump | grep -F 6.78
>> dumped all
>> 6.78   44268   0 0  00
>> 1786796401180   0  10099 10099
>>  active+clean  2024-01-26T03:51:26.781438+0200  107547'115445304
>> 107547:225274427  [12,36,37]  12  [12,36,37]  12
>> 106977'114532385  2024-01-24T08:37:53.597331+0200  101161'109078277
>> 2024-01-11T16:07:54.875746+0200  0
>> root@ceph-osd3:~# ceph pg dump | grep -F 6.60
>> dumped all
>> 6.60   9   0 0  00
>> 179484338742  716  36  10097 10097
>>  active+clean  2024-01-26T03:50:44.579831+0200  107547'153238805
>> 107547:287193139   [32,5,29]  32   [32,5,29]  32
>> 107231'152689835  2024-01-25T02:34:01.849966+0200  102171'147920798
>> 2024-01-13T19:44:26.922000+0200  0
>> 6.3a   44807   0 0  00
>> 1809690056940   0  10093 10093
>>  active+clean  2024-01-26T03:53:28.837685+0200  107547'114765984
>> 107547:238170093  [22,13,11]  22  [22,13,11]  22
>> 106945'113739877  2024-01-24T04:10:17.224982+0200  102863'109559444
>> 2024-01-15T05:31:36.606478+0200  0
>> root@ceph-osd3:~# ceph pg dump | grep -F 6.5c
>> 6.5c   44277   0 0  00
>> 1787649782300   0  10051 10051
>>  active+clean  2024-01-26T03:55:23.339584+0200  107547'126480090
>> 107547:264432655  [22,37,30]

[ceph-users] Re: 6 pgs not deep-scrubbed in time

2024-01-29 Thread Frank Schilder
Setting osd_max_scrubs = 2 for HDD OSDs was a mistake I made. The result was 
that PGs needed a bit more than twice as long to deep-scrub. Net effect: high 
scrub load, much less user IO and, last but not least, the "not deep-scrubbed 
in time" problem got worse, because (2+eps)/2 > 1.

For spinners, a consideration of the actually available drive performance is 
required, plus a few more things like PG count, distribution etc.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Wesley Dillingham 
Sent: Monday, January 29, 2024 7:14 PM
To: Michel Niyoyita
Cc: Josh Baergen; E Taka; ceph-users
Subject: [ceph-users] Re: 6 pgs not deep-scrubbed in time

Respond back with "ceph versions" output

If your sole goal is to eliminate the not scrubbed in time errors you can
increase the aggressiveness of scrubbing by setting:
osd_max_scrubs = 2

The default in pacific is 1.

If you are going to start tinkering manually with the pg_num you will want
to turn off the pg autoscaler on the pools you are touching.
Reducing the size of your PGs may make sense and help with scrubbing, but if
the pool has a lot of data it will take a long, long time to finish.





Respectfully,

*Wes Dillingham*
w...@wesdillingham.com
LinkedIn <http://www.linkedin.com/in/wesleydillingham>


On Mon, Jan 29, 2024 at 10:08 AM Michel Niyoyita  wrote:

> I am running ceph pacific , version 16 , ubuntu 20 OS , deployed using
> ceph-ansible.
>
> Michel
>
> On Mon, Jan 29, 2024 at 4:47 PM Josh Baergen 
> wrote:
>
> > Make sure you're on a fairly recent version of Ceph before doing this,
> > though.
> >
> > Josh
> >
> > On Mon, Jan 29, 2024 at 5:05 AM Janne Johansson 
> > wrote:
> > >
> > > On Mon, Jan 29, 2024 at 12:58, Michel Niyoyita  > wrote:
> > > >
> > > > Thank you Frank ,
> > > >
> > > > All disks are HDDs . Would like to know if I can increase the number
> > of PGs
> > > > live in production without a negative impact to the cluster. if yes
> > which
> > > > commands to use .
> > >
> > > Yes. "ceph osd pool set  pg_num "
> > > where the number usually should be a power of two that leads to a
> > > number of PGs per OSD between 100-200.
> > >
> > > --
> > > May the most significant bit of your life be positive.
> > > ___
> > > ceph-users mailing list -- ceph-users@ceph.io
> > > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 6 pgs not deep-scrubbed in time

2024-01-29 Thread Frank Schilder
You will have to look at the output of "ceph df" and make a decision to balance 
"objects per PG" and "GB per PG". Increase the PG count most for the pools with 
the worst of these two numbers, such that things balance out as much as 
possible. If you have pools that see significantly more user-IO than others, 
prioritise these.
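
A hedged sketch for pulling those numbers per pool (JSON field names as in 
octopus-era releases, adjust if your version differs; divide objects and bytes 
by the pool's pg_num from the second command to get the per-PG figures):

ceph -f json df | jq -r '.pools[] | [.name, .stats.objects, .stats.stored] | @tsv'
ceph osd pool ls detail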

You will have to find out for your specific cluster, we can only give general 
guidelines. Make changes, run benchmarks, re-evaluate. Take the time for it. 
The better you know your cluster and your users, the better the end result will 
be.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Michel Niyoyita 
Sent: Monday, January 29, 2024 2:04 PM
To: Janne Johansson
Cc: Frank Schilder; E Taka; ceph-users
Subject: Re: [ceph-users] Re: 6 pgs not deep-scrubbed in time

This is how it is set; if you suggest making some changes, please advise.

Thank you.


ceph osd pool ls detail
pool 1 'device_health_metrics' replicated size 3 min_size 2 crush_rule 0 
object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 1407 
flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application 
mgr_devicehealth
pool 2 '.rgw.root' replicated size 3 min_size 2 crush_rule 0 object_hash 
rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 1393 flags 
hashpspool stripe_width 0 application rgw
pool 3 'default.rgw.log' replicated size 3 min_size 2 crush_rule 0 object_hash 
rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 1394 flags 
hashpspool stripe_width 0 application rgw
pool 4 'default.rgw.control' replicated size 3 min_size 2 crush_rule 0 
object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 1395 
flags hashpspool stripe_width 0 application rgw
pool 5 'default.rgw.meta' replicated size 3 min_size 2 crush_rule 0 object_hash 
rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 1396 flags 
hashpspool stripe_width 0 pg_autoscale_bias 4 application rgw
pool 6 'volumes' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins 
pg_num 128 pgp_num 128 autoscale_mode on last_change 108802 lfor 0/0/14812 
flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
removed_snaps_queue 
[22d7~3,11561~2,11571~1,11573~1c,11594~6,1159b~f,115b0~1,115b3~1,115c3~1,115f3~1,115f5~e,11613~6,1161f~c,11637~1b,11660~1,11663~2,11673~1,116d1~c,116f5~10,11721~c]
pool 7 'images' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins 
pg_num 32 pgp_num 32 autoscale_mode on last_change 94609 flags 
hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 8 'backups' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins 
pg_num 32 pgp_num 32 autoscale_mode on last_change 1399 flags hashpspool 
stripe_width 0 application rbd
pool 9 'vms' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins 
pg_num 32 pgp_num 32 autoscale_mode on last_change 108783 lfor 0/561/559 flags 
hashpspool,selfmanaged_snaps stripe_width 0 application rbd
removed_snaps_queue [3fa~1,3fc~3,400~1,402~1]
pool 10 'testbench' replicated size 3 min_size 2 crush_rule 0 object_hash 
rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 20931 lfor 
0/20931/20929 flags hashpspool stripe_width 0


On Mon, Jan 29, 2024 at 2:09 PM Michel Niyoyita 
mailto:mico...@gmail.com>> wrote:
Thank you Janne ,

No need to set flags like "ceph osd set nodeep-scrub"?

Thank you

On Mon, Jan 29, 2024 at 2:04 PM Janne Johansson 
mailto:icepic...@gmail.com>> wrote:
On Mon, Jan 29, 2024 at 12:58, Michel Niyoyita 
mailto:mico...@gmail.com>> wrote:
>
> Thank you Frank ,
>
> All disks are HDDs . Would like to know if I can increase the number of PGs
> live in production without a negative impact to the cluster. if yes which
> commands to use .

Yes. "ceph osd pool set  pg_num "
where the number usually should be a power of two that leads to a
number of PGs per OSD between 100-200.

--
May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Performance improvement suggestion

2024-03-04 Thread Frank Schilder
Hi all, coming late to the party but want to ship in as well with some 
experience.

The problem of tail latencies of individual OSDs is a real pain for any 
redundant storage system. However, there is a way to deal with this in an 
elegant way when using large replication factors. The idea is to use the 
counterpart of the "fast read" option that exists for EC pools and:

1) make this option available to replicated pools as well (is on the road map 
as far as I know), but also
2) implement an option "fast write" for all pool types.

Fast write enabled would mean that the primary OSD sends #size copies to the 
entire active set (including itself) in parallel and sends an ACK to the client 
as soon as min_size ACKs have been received from the peers (including itself). 
In this way, one can tolerate (size-min_size) slow(er) OSDs (slow for whatever 
reason) without suffering performance penalties immediately (only after too 
many requests started piling up, which will show as a slow requests warning).

I have fast read enabled on all EC pools. This does increase the 
cluster-internal network traffic, which is nowadays absolutely no problem (in 
the good old 1G times it potentially would be). In return, the read latencies 
on the client side are lower and much more predictable. In effect, the user 
experience improved dramatically.
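
A hedged example of enabling it (pool name is a placeholder; the second command 
only changes the default for newly created EC pools):

ceph osd pool set my_ec_pool fast_read 1
ceph config set global osd_pool_default_ec_fast_read true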

I would really wish for such an option to be added, as we use wide replication 
profiles (rep-(4,2) and EC(8+3), each with 2 "spare" OSDs), and exploiting large 
replication factors (more precisely, large (size-min_size)) to mitigate the 
impact of slow OSDs would be awesome. It would also add some incentive to stop 
the ridiculous size=2 min_size=1 habit, because one gets an extra gain from 
replication on top of redundancy.

In the long run, the ceph write path should try to deal with a-priori known 
different-latency connections (fast local ACK with async remote completion, was 
asked for a couple of times), for example, for stretched clusters where one has 
an internal connection for the local part and external connections for the 
remote parts. It would be great to have similar ways of mitigating some 
penalties of the slow write paths to remote sites.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Peter Grandi 
Sent: Wednesday, February 21, 2024 1:10 PM
To: list Linux fs Ceph
Subject: [ceph-users] Re: Performance improvement suggestion

> 1. Write object A from client.
> 2. Fsync to primary device completes.
> 3. Ack to client.
> 4. Writes sent to replicas.
[...]

As mentioned in the discussion this proposal is the opposite of
what the current policy, is, which is to wait for all replicas
to be written before writes are acknowledged to the client:

https://github.com/ceph/ceph/blob/main/doc/architecture.rst

   "After identifying the target placement group, the client
   writes the object to the identified placement group's primary
   OSD. The primary OSD then [...] confirms that the object was
   stored successfully in the secondary and tertiary OSDs, and
   reports to the client that the object was stored
   successfully."

A more revolutionary option would be for 'librados' to write in
parallel to all the "active set" OSDs and report this to the
primary, but that would greatly increase client-Ceph traffic,
while the current logic increases traffic only among OSDs.

> So I think that to maintain any semblance of reliability,
> you'd need to at least wait for a commit ack from the first
> replica (i.e. min_size=2).

Perhaps it could be similar to 'k'+'m' for EC, that is 'k'
synchronous (write completes to the client only when at least 'k'
replicas, including the primary, have been committed) and 'm'
asynchronous, instead of 'k' being just 1 or 2.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Performance improvement suggestion

2024-03-04 Thread Frank Schilder
>>> Fast write enabled would mean that the primary OSD sends #size copies to the
>>> entire active set (including itself) in parallel and sends an ACK to the
>>> client as soon as min_size ACKs have been received from the peers (including
>>> itself). In this way, one can tolerate (size-min_size) slow(er) OSDs (slow
>>> for whatever reason) without suffering performance penalties immediately
>>> (only after too many requests started piling up, which will show as a slow
>>> requests warning).
>>>
>> What happens if an error occurs on the slowest osd after the min_size 
>> ACK has already been sent to the client?
>>
>This should not be different from what exists today... unless, of course, the
>error happens on the local/primary osd

Can this be addressed with reasonable effort? I don't expect this to be a 
quick-fix and it should be tested. However, beating the tail-latency statistics 
with the extra redundancy should be worth it. I observe fluctuations of 
latencies: OSDs become randomly slow for whatever reason for short time 
intervals and then return to normal.

A reason for this could be DB compaction. I think during compaction latency 
tends to spike.

A fast-write option would effectively remove the impact of this.
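
If someone wants to test that correlation, a rough sketch (osd.12 is an example ID):

# trigger a manual RocksDB compaction on one OSD
ceph tell osd.12 compact
# watch commit/apply latencies across the OSDs while it runs
ceph osd perf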

Best regards and thanks for considering this!
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Have a problem with haproxy/keepalived/ganesha/docker

2024-04-16 Thread Frank Schilder
Question about HA here: I understood the documentation of the fuse NFS client 
to mean that the connection state of all NFS clients is stored on ceph in rados 
objects and that, if using a floating IP, the NFS clients should just recover from a 
short network timeout.

Not sure if this is what should happen with this specific HA set-up in the 
original request, but a fail-over of the NFS server ought to be handled 
gracefully by starting a new one up with the IP of the down one. Or not?

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eugen Block 
Sent: Tuesday, April 16, 2024 11:24 AM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Have a problem with haproxy/keepalived/ganesha/docker

Ah, okay, thanks for the hint. In that case what I see is expected.

Zitat von Robert Sander :

> Hi,
>
> On 16.04.24 10:49, Eugen Block wrote:
>>
>> I believe I can confirm your suspicion, I have a test cluster on
>> Reef 18.2.1 and deployed nfs without HAProxy but with keepalived [1].
>> Stopping the active NFS daemon doesn't trigger anything, the MGR
>> notices that it's stopped at some point, but nothing else seems to
>> happen.
>
> There is currently no failover for NFS.
>
> The ingress service (haproxy + keepalived) that cephadm deploys for
> an NFS cluster does not have a health check configured. Haproxy does
> not notice if a backend NFS server dies. This does not matter as
> there is no failover and the NFS client cannot be "load balanced" to
> another backend NFS server.
>
> There is no use to configure an ingress service currently without failover.
>
> The NFS clients have to remount the NFS share in case of their
> current NFS server dies anyway.
>
> Regards
> --
> Robert Sander
> Heinlein Consulting GmbH
> Schwedter Str. 8/9b, 10119 Berlin
>
> http://www.heinlein-support.de
>
> Tel: 030 / 405051-43
> Fax: 030 / 405051-19
>
> Zwangsangaben lt. §35a GmbHG:
> HRB 220009 B / Amtsgericht Berlin-Charlottenburg,
> Geschäftsführer: Peer Heinlein -- Sitz: Berlin
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] (deep-)scrubs blocked by backfill

2024-04-17 Thread Frank Schilder
Hi all, I have a technical question about scrub scheduling. I replaced a disk 
and it is back-filling slowly. We have set osd_scrub_during_recovery = true and 
still observe that scrub times continuously increase (number of PGs not 
scrubbed in time is continuously increasing). Investigating the situation it 
looks like any OSD that has a PG in states "backfill_wait" or "backfilling" is 
preventing scrubs to be scheduled on PGs it is a member of. However, it seems 
it is not quite like that.

On the one hand I have never seen a PG in a state like 
"active+scrubbing+remapped+backfilling", so backfilling PGs at least never seem 
to scrub. On the other hand, it seems like more PGs are scrubbed than would be 
eligible if *all* OSDs with a remapped PG on them refused scrubs. It looks 
like something in between "only OSDs with a backfilling PG block requests for 
scrub reservations" and "all OSDs with a PG in states backfilling or 
backfill_wait block requests for scrub reservations". Does the position in the 
backfill reservation queue play a role?

If anyone has insight into how scrub reservations are granted and when not in 
the situation of an OSD backfilling that would be great. My naive 
interpretation of "osd_scrub_during_recovery = true" was that scrubs proceed as 
if no backfill was going on. This, however, is clearly not the case. Having an 
answer to my question above would help me a lot to get an idea when things will 
go back to normal.
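
In case it helps to reproduce what I'm seeing, this is roughly how I look at the 
state; a sketch, assuming a release where "ceph pg ls" accepts state filters and 
the option was set via the config database:

# PGs currently backfilling, waiting for backfill, or scrubbing
ceph pg ls backfilling
ceph pg ls backfill_wait
ceph pg ls scrubbing
# confirm how the flag is set
ceph config get osd osd_scrub_during_recovery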

Thanks a lot and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Latest Doco Out Of Date?

2024-04-24 Thread Frank Schilder
Hi Eugen,

I would ask for a slight change here:

> If a client already has a capability for file-system name a and path
> dir1, running fs authorize again for FS name a but path dir2,
> instead of modifying the capabilities client already holds, a new
> cap for dir2 will be granted

The formulation "a new cap for dir2 will be granted" is very misleading. I 
would also read it as that the new cap is in addition to the already existing 
cap. I tried to modify caps with fs authorize as well in the past, because it 
will set caps using pool tags and the docu sounded like it will allow to modify 
caps. In my case, I got the same error and thought that its implementation is 
buggy and did it with the authtool.

To be honest, when I look at the command sequence

ceph fs authorize a client.x /dir1 rw
ceph fs authorize a client.x /dir2 rw

and it goes through without error, I would expect the client to have both 
permissions as a result - no matter what the documentation says. There is no 
"revoke caps" instruction anywhere. Revoking caps in this way is a really 
dangerous side effect and telling people to read the documentation about a 
command that should follow how other linux tools manage permissions is not the 
best answer. There is something called parallelism in software engineering and 
this command line syntax violates it in a highly unintuitive way. The 
intuition of the syntax clearly is that it *adds* capabilities; it's incremental.

A command like this should follow how existing linux tools work so that context 
switching will be easier for admins. Here, the choice of the term "authorize" 
seems to be unfortunate. A more explicit command that follows setfacl a bit could be

ceph fs caps set a client.x /dir1 rw
ceph fs caps modify a client.x /dir2 rw

or even more parallel

ceph fs setcaps a client.x /dir1 rw
ceph fs setcaps -m a client.x /dir2 rw

Such parallel syntax will not only avoid the reported confusion but also make 
it possible to implement a modify operation in the future without breaking 
stuff. And you can save time on the documentation, because it works like other 
stuff.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eugen Block 
Sent: Wednesday, April 24, 2024 9:02 AM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Latest Doco Out Of Date?

Hi,

I believe the docs [2] are okay, running 'ceph fs authorize' will
overwrite the existing caps, it will not add more caps to the client:

> Capabilities can be modified by running fs authorize only in the
> case when read/write permissions must be changed.

> If a client already has a capability for file-system name a and path
> dir1, running fs authorize again for FS name a but path dir2,
> instead of modifying the capabilities client already holds, a new
> cap for dir2 will be granted

To add more caps you'll need to use the 'ceph auth caps' command, for example:

quincy-1:~ # ceph fs authorize cephfs client.usera /dir1 rw
[client.usera]
 key = AQDOrShmk6XhGxAAwz07ngr0JtPSID06RH8lAw==

quincy-1:~ # ceph auth get client.usera
[client.usera]
 key = AQDOrShmk6XhGxAAwz07ngr0JtPSID06RH8lAw==
 caps mds = "allow rw fsname=cephfs path=/dir1"
 caps mon = "allow r fsname=cephfs"
 caps osd = "allow rw tag cephfs data=cephfs"

quincy-1:~ # ceph auth caps client.usera mds 'allow rw fsname=cephfs
path=/dir1, allow rw fsname=cephfs path=/dir2' mon 'allow r
fsname=cephfs' osd 'allow rw tag cephfs data=cephfs'
updated caps for client.usera

quincy-1:~ # ceph auth get client.usera
[client.usera]
 key = AQDOrShmk6XhGxAAwz07ngr0JtPSID06RH8lAw==
 caps mds = "allow rw fsname=cephfs path=/dir1, allow rw
fsname=cephfs path=/dir2"
 caps mon = "allow r fsname=cephfs"
 caps osd = "allow rw tag cephfs data=cephfs"

Note that I don't actually have these directories in that cephfs, it's
just to demonstrate, so you'll need to make sure your caps actually
work.

Thanks,
Eugen

[2]
https://docs.ceph.com/en/latest/cephfs/client-auth/#changing-rw-permissions-in-caps


Zitat von Zac Dover :

> It's in my list of ongoing initiatives. I'll stay up late tonight
> and ask Venky directly what's going on in this instance.
>
> Sometime later today, I'll create an issue tracking bug and I'll
> send it to you for review. Make sure that I haven't misrepresented
> this issue.
>
> Zac
>
> On Wednesday, April 24th, 2024 at 2:10 PM, duluxoz  wrote:
>
>> Hi Zac,
>>
>> Any movement on this? We really need to come up with an
>> answer/solution - thanks
>>
>> Dulux-Oz
>>
>> On 19/04/2024 18:03, duluxoz wrote:
>>
>>> Cool!

[ceph-users] Re: Remove an OSD with hardware issue caused rgw 503

2024-04-30 Thread Frank Schilder
Hi all,

I second Eugen's recommendation. We have a cluster with large HDD OSDs where 
the following timings are found:

- drain an OSD: 2 weeks.
- down an OSD and let cluster recover: 6 hours.

The drain OSD procedure is - in my experience - a complete waste of time, 
actually puts your cluster at higher risk of a second failure (it's not 
guaranteed that the bad PG(s) is/are drained first) and also screws up all 
sorts of internal operations like scrub etc. for an unnecessarily long time. The 
recovery procedure is much faster, because it uses all-to-all recovery while 
drain is limited to no more than max_backfills PGs at a time and your broken 
disk sits much longer in the cluster.

On SSDs the "down OSD"-method shows a similar speed-up factor.
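
For reference, the "down OSD" method is essentially this (osd.12 is an example 
ID; the stop command depends on your deployment type):

# stop the OSD daemon immediately
systemctl stop ceph-osd@12        # or: ceph orch daemon stop osd.12
# mark it out so recovery starts right away
ceph osd out 12
# wait for recovery to complete (watch ceph -s), then remove it
ceph osd purge 12 --yes-i-really-mean-it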

As a safety measure, don't destroy the OSD right away; wait for recovery to 
complete and only then destroy the OSD and throw away the disk. In case an 
error occurs during recovery, you can almost always still export PGs from a 
failed disk and inject them back into the cluster. This, however, requires 
taking disks out as soon as they show problems and before they fail hard. Keep a 
little bit of lifetime left to have a chance to recover data. Look at the ddrescue 
manual to see why it is important to stop IO to a failing disk as soon as possible.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eugen Block 
Sent: Saturday, April 27, 2024 10:29 AM
To: Mary Zhang
Cc: ceph-users@ceph.io; Wesley Dillingham
Subject: [ceph-users] Re: Remove an OSD with hardware issue caused rgw 503

If the rest of the cluster is healthy and your resiliency is
configured properly, for example to sustain the loss of one or more
hosts at a time, you don’t need to worry about a single disk. Just
take it out and remove it (forcefully) so it doesn’t have any clients
anymore. Ceph will immediately assign different primary OSDs and your
clients will be happy again. ;-)

Zitat von Mary Zhang :

> Thank you Wesley for the clear explanation between the 2 methods!
> The tracker issue you mentioned https://tracker.ceph.com/issues/44400 talks
> about primary-affinity. Could primary-affinity help remove an OSD with
> hardware issue from the cluster gracefully?
>
> Thanks,
> Mary
>
>
> On Fri, Apr 26, 2024 at 8:43 AM Wesley Dillingham 
> wrote:
>
>> What you want to do is to stop the OSD (and all its copies of data it
>> contains) by stopping the OSD service immediately. The downside of this
>> approach is it causes the PGs on that OSD to be degraded. But the upside is
>> the OSD which has bad hardware is immediately no  longer participating in
>> any client IO (the source of your RGW 503s). In this situation the PGs go
>> into degraded+backfilling
>>
>> The alternative method is to keep the failing OSD up and in the cluster
>> but slowly migrate the data off of it, this would be a long drawn out
>> period of time in which the failing disk would continue to serve client
>> reads and also facilitate backfill but you wouldnt take a copy of the data
>> out of the cluster and cause degraded PGs. In this scenario the PGs would
>> be remapped+backfilling
>>
>> I tried to find a way to have your cake and eat it to in relation to this
>> "predicament" in this tracker issue: https://tracker.ceph.com/issues/44400
>> but it was deemed "wont fix".
>>
>> Respectfully,
>>
>> *Wes Dillingham*
>> LinkedIn <http://www.linkedin.com/in/wesleydillingham>
>> w...@wesdillingham.com
>>
>>
>>
>>
>> On Fri, Apr 26, 2024 at 11:25 AM Mary Zhang 
>> wrote:
>>
>>> Thank you Eugen for your warm help!
>>>
>>> I'm trying to understand the difference between 2 methods.
>>> For method 1, or "ceph orch osd rm osd_id", OSD Service — Ceph
>>> Documentation
>>> <https://docs.ceph.com/en/latest/cephadm/services/osd/#remove-an-osd>
>>> says
>>> it involves 2 steps:
>>>
>>>1.
>>>
>>>evacuating all placement groups (PGs) from the OSD
>>>2.
>>>
>>>removing the PG-free OSD from the cluster
>>>
>>> For method 2, or the procedure you recommended, Adding/Removing OSDs —
>>> Ceph
>>> Documentation
>>> <
>>> https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/#removing-osds-manual
>>> >
>>> says
>>> "After the OSD has been taken out of the cluster, Ceph begins rebalancing
>>> the cluster by migrating placement groups out of the OSD that was removed.
>>> "
>>>
>>> What's the difference bet

[ceph-users] Re: Remove an OSD with hardware issue caused rgw 503

2024-04-30 Thread Frank Schilder
I think you are panicking way too much. Chances are that you will never need 
that command, so don't get worked up over an old post.

Just follow what I wrote and, in the extremely rare case that recovery does not 
complete due to missing information, send an e-mail to this list and state that 
you still have the disk of the down OSD. Someone will send you the 
export/import commands within a short time.
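
For the keywords, the procedure is built around ceph-objectstore-tool; a rough 
sketch only, the OSD IDs, PG and paths are made-up examples and both OSDs 
involved must be stopped while running it:

# on the host with the failed OSD: export the missing PG
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 --op export --pgid 19.1a --file /tmp/19.1a.export
# on a surviving OSD: import it, then start that OSD again
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-34 --op import --file /tmp/19.1a.export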

So stop worrying and just administer your cluster with common storage admin 
sense.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Mary Zhang 
Sent: Tuesday, April 30, 2024 5:00 PM
To: Frank Schilder
Cc: Eugen Block; ceph-users@ceph.io; Wesley Dillingham
Subject: Re: [ceph-users] Re: Remove an OSD with hardware issue caused rgw 503

Thank you Frank for sharing such valuable experience! I really appreciate it.
We observe similar timelines: it took more than 1 week to drain our OSD.
Regarding export PGs from failed disk and inject it back to the cluster, do you 
have any documentations? I find this online Ceph.io — Incomplete PGs -- OH 
MY!<https://ceph.io/en/news/blog/2015/incomplete-pgs-oh-my/>, but not sure 
whether it's the standard process.

Thanks,
Mary

On Tue, Apr 30, 2024 at 3:27 AM Frank Schilder 
mailto:fr...@dtu.dk>> wrote:
Hi all,

I second Eugen's recommendation. We have a cluster with large HDD OSDs where 
the following timings are found:

- drain an OSD: 2 weeks.
- down an OSD and let cluster recover: 6 hours.

The drain OSD procedure is - in my experience - a complete waste of time, 
actually puts your cluster at higher risk of a second failure (its not 
guaranteed that the bad PG(s) is/are drained first) and also screws up all 
sorts of internal operations like scrub etc for an unnecessarily long time. The 
recovery procedure is much faster, because it uses all-to-all recovery while 
drain is limited to no more than max_backfills PGs at a time and your broken 
disk sits much longer in the cluster.

On SSDs the "down OSD"-method shows a similar speed-up factor.

For a security measure, don't destroy the OSD right away, wait for recovery to 
complete and only then destroy the OSD and throw away the disk. In case an 
error occurs during recovery, you can almost always still export PGs from a 
failed disk and inject it back into the cluster. This, however, requires to 
take disks out as soon as they show problems and before they fail hard. Keep a 
little bit of life time to have a chance to recover data. Look at the manual of 
ddrescue why it is important to stop IO from a failing disk as soon as possible.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eugen Block mailto:ebl...@nde.ag>>
Sent: Saturday, April 27, 2024 10:29 AM
To: Mary Zhang
Cc: ceph-users@ceph.io<mailto:ceph-users@ceph.io>; Wesley Dillingham
Subject: [ceph-users] Re: Remove an OSD with hardware issue caused rgw 503

If the rest of the cluster is healthy and your resiliency is
configured properly, for example to sustain the loss of one or more
hosts at a time, you don’t need to worry about a single disk. Just
take it out and remove it (forcefully) so it doesn’t have any clients
anymore. Ceph will immediately assign different primary OSDs and your
clients will be happy again. ;-)

Zitat von Mary Zhang mailto:maryzhang0...@gmail.com>>:

> Thank you Wesley for the clear explanation between the 2 methods!
> The tracker issue you mentioned https://tracker.ceph.com/issues/44400 talks
> about primary-affinity. Could primary-affinity help remove an OSD with
> hardware issue from the cluster gracefully?
>
> Thanks,
> Mary
>
>
> On Fri, Apr 26, 2024 at 8:43 AM Wesley Dillingham 
> mailto:w...@wesdillingham.com>>
> wrote:
>
>> What you want to do is to stop the OSD (and all its copies of data it
>> contains) by stopping the OSD service immediately. The downside of this
>> approach is it causes the PGs on that OSD to be degraded. But the upside is
>> the OSD which has bad hardware is immediately no  longer participating in
>> any client IO (the source of your RGW 503s). In this situation the PGs go
>> into degraded+backfilling
>>
>> The alternative method is to keep the failing OSD up and in the cluster
>> but slowly migrate the data off of it, this would be a long drawn out
>> period of time in which the failing disk would continue to serve client
>> reads and also facilitate backfill but you wouldnt take a copy of the data
>> out of the cluster and cause degraded PGs. In this scenario the PGs would
>> be remapped+backfilling
>>
>> I tried to find a way to have your cake and eat it to in relation to this
>> "predicament" in this tracker issue: https://tracker.

[ceph-users] Re: Please discuss about Slow Peering

2024-05-16 Thread Frank Schilder
This is a long shot: if you are using octopus, you might be hit by this 
pglog-dup problem: https://docs.clyso.com/blog/osds-with-unlimited-ram-growth/. 
They don't mention slow peering explicitly in the blog, but it's also a 
consequence, because the up+acting OSDs need to go through the PG_log during 
peering.

We are also using octopus and I'm not sure if we have ever seen slow ops caused 
by peering alone. It usually happens when a disk cannot handle load under 
peering. We have, unfortunately, disks that show random latency spikes 
(firmware update pending). You can try to monitor OPS latencies for your drives 
when peering and look for something that sticks out. People on this list were 
reporting quite bad results for certain infamous NVMe brands. If you state your 
model numbers, someone else might recognize it.
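
A simple way to watch this while PGs peer (a sketch; the interval is an example):

ceph osd perf     # commit/apply latencies per OSD
iostat -xm 2      # per-device latency/utilization on the OSD host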

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: 서민우 
Sent: Thursday, May 16, 2024 7:39 AM
To: ceph-users@ceph.io
Subject: [ceph-users] Please discuss about Slow Peering

Env:
- OS: Ubuntu 20.04
- Ceph Version: Octopus 15.0.0.1
- OSD Disk: 2.9TB NVMe
- BlockStorage (Replication 3)

Symptom:
- Peering when OSD's node up is very slow. Peering speed varies from PG to
PG, and some PG may even take 10 seconds. But, there is no log for 10
seconds.
- I checked the effect of client VM's. Actually, Slow queries of mysql
occur at the same time.

There are Ceph OSD logs of both Best and Worst.

Best Peering Case (0.5 Seconds)
2024-04-11T15:32:44.693+0900 7f108b522700  1 osd.7 pg_epoch: 27368 pg[6.8]
state: transitioning to Primary
2024-04-11T15:32:45.165+0900 7f108f52a700  1 osd.7 pg_epoch: 27371 pg[6.8]
state: Peering, affected_by_map, going to Reset
2024-04-11T15:32:45.165+0900 7f108f52a700  1 osd.7 pg_epoch: 27371 pg[6.8]
start_peering_interval up [7,6,11] -> [6,11], acting [7,6,11] -> [6,11],
acting_primary 7 -> 6, up_primary 7 -> 6, role 0 -> -1, features acting
2024-04-11T15:32:45.165+0900 7f108f52a700  1 osd.7 pg_epoch: 27377 pg[6.8]
state: transitioning to Primary
2024-04-11T15:32:45.165+0900 7f108f52a700  1 osd.7 pg_epoch: 27377 pg[6.8]
start_peering_interval up [6,11] -> [7,6,11], acting [6,11] -> [7,6,11],
acting_primary 6 -> 7, up_primary 6 -> 7, role -1 -> 0, features acting

Worst Peering Case (11.6 Seconds)
2024-04-11T15:32:45.169+0900 7f108b522700  1 osd.7 pg_epoch: 27377 pg[30.20]
state: transitioning to Stray
2024-04-11T15:32:45.169+0900 7f108b522700  1 osd.7 pg_epoch: 27377 pg[30.20]
start_peering_interval up [0,1] -> [0,7,1], acting [0,1] -> [0,7,1],
acting_primary 0 -> 0, up_primary 0 -> 0, role -1 -> 1, features acting
2024-04-11T15:32:46.173+0900 7f108b522700  1 osd.7 pg_epoch: 27378 pg[30.20]
state: transitioning to Stray
2024-04-11T15:32:46.173+0900 7f108b522700  1 osd.7 pg_epoch: 27378 pg[30.20]
start_peering_interval up [0,7,1] -> [0,7,1], acting [0,7,1] -> [0,1],
acting_primary 0 -> 0, up_primary 0 -> 0, role 1 -> -1, features acting
2024-04-11T15:32:57.794+0900 7f108b522700  1 osd.7 pg_epoch: 27390 pg[30.20]
state: transitioning to Stray
2024-04-11T15:32:57.794+0900 7f108b522700  1 osd.7 pg_epoch: 27390 pg[30.20]
start_peering_interval up [0,7,1] -> [0,7,1], acting [0,1] -> [0,7,1],
acting_primary 0 -> 0, up_primary 0 -> 0, role -1 -> 1, features acting

*I wish to know about*
- Why some PG's take 10 seconds until Peering finishes.
- Why Ceph log is quiet during peering.
- Is this symptom intended in Ceph.

*And please give some advice,*
- Is there any way to improve peering speed?
- Or, Is there a way to not affect the client when peering occurs?

P.S
- I checked the symptoms in the following environments.
-> Octopus Version, Reef Version, Cephadm, Ceph-Ansible
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Please discuss about Slow Peering

2024-05-21 Thread Frank Schilder
We are using the read-intensive kioxia drives (octopus cluster) in RBD pools 
and are very happy with them. I don't think it's the drives.

The last possibility I could think of is CPU. We run 4 OSDs per 1.92TB Kioxia 
drive to utilize their performance (single OSD per disk doesn't cut it at all) 
and have 2x16-core Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz per server. 
During normal operations the CPU is only lightly loaded. During peering this 
load peaks at or above 100%. If not enough CPU power is available, peering will 
be hit very badly. What you could check is:

- number of cores: at least 1 HT per OSD, better 1 core per OSD.
- cstates disabled: we run with virtualization-performance profile and the CPU 
is basically always at all-core boost (3.2GHz)
- sufficient RAM: we run these OSDs with 6G memory limit, that's 24G per disk! 
Still, the servers have 50% OSD RAM utilisation and 50% buffers, so there is 
enough for fast peak allocations during peering.
- check vm.min_free_kbytes: the default is way too low for OSD hosts, we use 
vm.min_free_kbytes=4194304 (4G), this can have latency impact for network 
connections
- swap disabled: disable swap on OSD hosts
- sysctl network tuning: check that your network parameters are appropriate for 
your network cards; the kernel defaults are still tuned for 1G connections. There are 
great tuning guides online; here are some of our settings for 10G NICs:

# Increase autotuning TCP buffer limits
# 10G fiber/64MB buffers (67108864)
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
net.core.rmem_default = 67108864
net.core.wmem_default = 67108864
net.core.optmem_max = 40960
net.ipv4.tcp_rmem = 22500   218450 67108864
net.ipv4.tcp_wmem = 22500  81920 67108864

- last check: are you using WPQ or MCLOCK? The mclock scheduler still has 
serious issues and switching to WPQ might help.
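
Checking and, if needed, switching the scheduler is a one-liner (a sketch; the 
OSDs need a restart to pick up the change):

ceph config get osd osd_op_queue
ceph config set osd osd_op_queue wpq
# then restart the OSDs, e.g. one host at a time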

If none of these help, I'm out of ideas. For us the Kioxia drives work like a 
charm, its the pool that is easiest to manage and maintain with super-fast 
recovery and really good sustained performance.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: 서민우 
Sent: Tuesday, May 21, 2024 11:25 AM
To: Anthony D'Atri
Cc: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Please discuss about Slow Peering

We used the "kioxia kcd6xvul3t20" model.
Any infamous information of this Model?

On Fri, May 17, 2024 at 2:58 AM, Anthony D'Atri 
mailto:anthony.da...@gmail.com>> wrote:
If using jumbo frames, also ensure that they're consistently enabled on all OS 
instances and network devices.

> On May 16, 2024, at 09:30, Frank Schilder mailto:fr...@dtu.dk>> 
> wrote:
>
> This is a long shot: if you are using octopus, you might be hit by this 
> pglog-dup problem: 
> https://docs.clyso.com/blog/osds-with-unlimited-ram-growth/. They don't 
> mention slow peering explicitly in the blog, but its also a consequence 
> because the up+acting OSDs need to go through the PG_log during peering.
>
> We are also using octopus and I'm not sure if we have ever seen slow ops 
> caused by peering alone. It usually happens when a disk cannot handle load 
> under peering. We have, unfortunately, disks that show random latency spikes 
> (firmware update pending). You can try to monitor OPS latencies for your 
> drives when peering and look for something that sticks out. People on this 
> list were reporting quite bad results for certain infamous NVMe brands. If 
> you state your model numbers, someone else might recognize it.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: 서민우 mailto:smw940...@gmail.com>>
> Sent: Thursday, May 16, 2024 7:39 AM
> To: ceph-users@ceph.io<mailto:ceph-users@ceph.io>
> Subject: [ceph-users] Please discuss about Slow Peering
>
> Env:
> - OS: Ubuntu 20.04
> - Ceph Version: Octopus 15.0.0.1
> - OSD Disk: 2.9TB NVMe
> - BlockStorage (Replication 3)
>
> Symptom:
> - Peering when OSD's node up is very slow. Peering speed varies from PG to
> PG, and some PG may even take 10 seconds. But, there is no log for 10
> seconds.
> - I checked the effect of client VM's. Actually, Slow queries of mysql
> occur at the same time.
>
> There are Ceph OSD logs of both Best and Worst.
>
> Best Peering Case (0.5 Seconds)
> 2024-04-11T15:32:44.693+0900 7f108b522700  1 osd.7 pg_epoch: 27368 pg[6.8]
> state: transitioning to Primary
> 2024-04-11T15:32:45.165+0900 7f108f52a700  1 osd.7 pg_epoch: 27371 pg[6.8]
> state: Peering, affected_by_map, going to Reset
> 2024-04-11T15:32:45.165+0900 7f108f52a700  1 osd.7 pg_epoch: 27371 pg[6.8]
> start_peering_interval up [7,6,11] -> [6,11], acting [7,

[ceph-users] Re: Please discuss about Slow Peering

2024-05-21 Thread Frank Schilder
> Not with the most recent Ceph releases.

Actually, this depends. If it's SSDs whose IOPS profit from higher iodepth, 
it is very likely to improve performance, because to this day each OSD has only 
one kv_sync_thread and this is typically the bottleneck under heavy IOPS load. 
Having 2-4 kv_sync_threads per SSD, meaning 2-4 OSDs per disk, helps a lot 
if this thread is saturated.

For NVMes this is usually not required.
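
If one does want to split a device into multiple OSDs, that happens at deploy 
time; a sketch for a plain ceph-volume deployment (the device path is an example; 
with cephadm, the osds_per_device field in an OSD service spec achieves the same):

ceph-volume lvm batch --osds-per-device 2 /dev/nvme0n1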

The question still remains: do you have enough CPU? If you have 13 disks with 4 
OSDs each, you will need a core count of at least 50-ish per host. Newer OSDs 
might be able to utilize even more on fast disks. You will also need 4 times 
the RAM.

> I suspect your PGs are too few though.

In addition, on these drives you should aim for 150-200 PGs per OSD (another 
reason to go x4 OSDs - x4 PGs per drive). We have 198 PGs/OSD on average and 
this helps a lot with IO, recovery, everything.
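
If the pool already exists, getting there is something like this (pool name and 
pg_num are examples; disable the autoscaler first or it will shrink the pool again):

ceph osd pool set rbd-nvme pg_autoscale_mode off
ceph osd pool set rbd-nvme pg_num 2048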

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Anthony D'Atri 
Sent: Tuesday, May 21, 2024 3:06 PM
To: 서민우
Cc: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Please discuss about Slow Peering



I have additional questions,
We use 13 disk (3.2TB NVMe) per server and allocate one OSD to each disk. In 
other words 1 Node has 13 osds.
Do you think this is inefficient?
Is it better to create more OSD by creating LV on the disk?

Not with the most recent Ceph releases.  I suspect your PGs are too few though.




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: dkim on this mailing list

2024-05-21 Thread Frank Schilder
Hi Marc,

in case you are working on the list server, at least for me the situation seems 
to have improved no more than 2-3 hours ago. My own e-mails to the list now 
pass.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Marc 
Sent: Tuesday, May 21, 2024 5:34 PM
To: ceph-users@ceph.io
Subject: [ceph-users] dkim on this mailing list

Just to confirm if I am messing up my mailserver configs. But currently all 
messages from this mailing list should generate a dkim pass status?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How network latency affects ceph performance really with NVME only storage?

2024-05-22 Thread Frank Schilder
Hi Stefan,

can you provide a link to or copy of the contents of the tuned-profile so 
others can also profit from it?

Thanks!
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Stefan Bauer 
Sent: Wednesday, May 22, 2024 10:51 AM
To: Anthony D'Atri; ceph-users@ceph.io
Subject: [ceph-users] Re: How network latency affects ceph performance really 
with NVME only storage?

Hi Anthony and others,

thank you for your reply.  To be honest, I'm not even looking for a
solution, I just wanted to ask if latency affects the performance at all
in my case and how others handle this ;)

One of our partners delivered a solution with a latency-optimized
profile for tuned-daemon. Now the latency is much better:

apt install tuned

tuned-adm profile network-latency

# ping 10.1.4.13
PING 10.1.4.13 (10.1.4.13) 56(84) bytes of data.
64 bytes from 10.1.4.13: icmp_seq=1 ttl=64 time=0.047 ms
64 bytes from 10.1.4.13: icmp_seq=2 ttl=64 time=0.028 ms
64 bytes from 10.1.4.13: icmp_seq=3 ttl=64 time=0.025 ms
64 bytes from 10.1.4.13: icmp_seq=4 ttl=64 time=0.020 ms
64 bytes from 10.1.4.13: icmp_seq=5 ttl=64 time=0.023 ms
64 bytes from 10.1.4.13: icmp_seq=6 ttl=64 time=0.026 ms
64 bytes from 10.1.4.13: icmp_seq=7 ttl=64 time=0.024 ms
64 bytes from 10.1.4.13: icmp_seq=8 ttl=64 time=0.023 ms
64 bytes from 10.1.4.13: icmp_seq=9 ttl=64 time=0.033 ms
64 bytes from 10.1.4.13: icmp_seq=10 ttl=64 time=0.021 ms
^C
--- 10.1.4.13 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9001ms
rtt min/avg/max/mdev = 0.020/0.027/0.047/0.007 ms

Am 21.05.24 um 15:08 schrieb Anthony D'Atri:
> Check the netmask on your interfaces, is it possible that you're sending 
> inter-node traffic up and back down needlessly?
>
>> On May 21, 2024, at 06:02, Stefan Bauer  wrote:
>>
>> Dear Users,
>>
>> i recently setup a new ceph 3 node cluster. Network is meshed between all 
>> nodes (2 x 25G with DAC).
>> Storage is flash only (Kioxia 3.2 TBBiCS FLASH 3D TLC, KCMYXVUG3T20)
>>
>> The latency with ping tests between the nodes shows:
>>
>> # ping 10.1.3.13
>> PING 10.1.3.13 (10.1.3.13) 56(84) bytes of data.
>> 64 bytes from 10.1.3.13: icmp_seq=1 ttl=64 time=0.145 ms
>> 64 bytes from 10.1.3.13: icmp_seq=2 ttl=64 time=0.180 ms
>> 64 bytes from 10.1.3.13: icmp_seq=3 ttl=64 time=0.180 ms
>> 64 bytes from 10.1.3.13: icmp_seq=4 ttl=64 time=0.115 ms
>> 64 bytes from 10.1.3.13: icmp_seq=5 ttl=64 time=0.110 ms
>> 64 bytes from 10.1.3.13: icmp_seq=6 ttl=64 time=0.120 ms
>> 64 bytes from 10.1.3.13: icmp_seq=7 ttl=64 time=0.124 ms
>> 64 bytes from 10.1.3.13: icmp_seq=8 ttl=64 time=0.140 ms
>> 64 bytes from 10.1.3.13: icmp_seq=9 ttl=64 time=0.127 ms
>> 64 bytes from 10.1.3.13: icmp_seq=10 ttl=64 time=0.143 ms
>> 64 bytes from 10.1.3.13: icmp_seq=11 ttl=64 time=0.129 ms
>> --- 10.1.3.13 ping statistics ---
>> 11 packets transmitted, 11 received, 0% packet loss, time 10242ms
>> rtt min/avg/max/mdev = 0.110/0.137/0.180/0.022 ms
>>
>>
>> On another cluster i have much better values, with 10G SFP+ and fibre-cables:
>>
>> 64 bytes from large-ipv6-ip: icmp_seq=42 ttl=64 time=0.081 ms
>> 64 bytes from large-ipv6-ip: icmp_seq=43 ttl=64 time=0.078 ms
>> 64 bytes from large-ipv6-ip: icmp_seq=44 ttl=64 time=0.084 ms
>> 64 bytes from large-ipv6-ip: icmp_seq=45 ttl=64 time=0.075 ms
>> 64 bytes from large-ipv6-ip: icmp_seq=46 ttl=64 time=0.071 ms
>> 64 bytes from large-ipv6-ip: icmp_seq=47 ttl=64 time=0.081 ms
>> 64 bytes from large-ipv6-ip: icmp_seq=48 ttl=64 time=0.074 ms
>> 64 bytes from large-ipv6-ip: icmp_seq=49 ttl=64 time=0.085 ms
>> 64 bytes from large-ipv6-ip: icmp_seq=50 ttl=64 time=0.077 ms
>> 64 bytes from large-ipv6-ip: icmp_seq=51 ttl=64 time=0.080 ms
>> 64 bytes from large-ipv6-ip: icmp_seq=52 ttl=64 time=0.084 ms
>> 64 bytes from large-ipv6-ip: icmp_seq=53 ttl=64 time=0.084 ms
>> ^C
>> --- long-ipv6-ip ping statistics ---
>> 53 packets transmitted, 53 received, 0% packet loss, time 53260ms
>> rtt min/avg/max/mdev = 0.071/0.082/0.111/0.006 ms
>>
>> If i want best performance, does the latency difference matter at all? 
>> Should i change DAC to SFP-transceivers wwith fibre-cables to improve 
>> overall ceph performance or is this nitpicking?
>>
>> Thanks a lot.
>>
>> Stefan
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io

--
Kind regards

Stefan Bauer
Schulstraße 5
83308 Trostberg
0179-1194767
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How network latency affects ceph performance really with NVME only storage?

2024-05-22 Thread Frank Schilder
Hi Stefan,

ahh OK, misunderstood your e-mail. It sounded like it was a custom profile, not 
a standard one shipped with tuned.

Thanks for the clarification!
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Stefan Bauer 
Sent: Wednesday, May 22, 2024 12:44 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: How network latency affects ceph performance really 
with NVME only storage?

Hi Frank,

it's pretty straightforward. Just follow the steps:

apt install tuned

tuned-adm profile network-latency

According to [1]:

network-latency
A server profile focused on lowering network latency.
This profile favors performance over power savings by setting
|intel_pstate| and |min_perf_pct=100|. It disables transparent huge
pages, and automatic NUMA balancing. It also uses *cpupower* to set
the |performance| cpufreq governor, and requests a
/|cpu_dma_latency|/ value of |1|. It also sets /|busy_read|/ and
/|busy_poll|/ times to |50| μs, and /|tcp_fastopen|/ to |3|.

[1]
https://access.redhat.com/documentation/de-de/red_hat_enterprise_linux/7/html/performance_tuning_guide/sect-red_hat_enterprise_linux-performance_tuning_guide-tool_reference-tuned_adm

Cheers.

Stefan

Am 22.05.24 um 12:18 schrieb Frank Schilder:
> Hi Stefan,
>
> can you provide a link to or copy of the contents of the tuned-profile so 
> others can also profit from it?
>
> Thanks!
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Stefan Bauer
> Sent: Wednesday, May 22, 2024 10:51 AM
> To: Anthony D'Atri;ceph-users@ceph.io
> Subject: [ceph-users] Re: How network latency affects ceph performance really 
> with NVME only storage?
>
> Hi Anthony and others,
>
> thank you for your reply.  To be honest, I'm not even looking for a
> solution, i just wanted to ask if latency affects the performance at all
> in my case and how others handle this ;)
>
> One of our partners delivered a solution with a latency-optimized
> profile for tuned-daemon. Now the latency is much better:
>
> apt install tuned
>
> tuned-adm profile network-latency
>
> # ping 10.1.4.13
> PING 10.1.4.13 (10.1.4.13) 56(84) bytes of data.
> 64 bytes from 10.1.4.13: icmp_seq=1 ttl=64 time=0.047 ms
> 64 bytes from 10.1.4.13: icmp_seq=2 ttl=64 time=0.028 ms
> 64 bytes from 10.1.4.13: icmp_seq=3 ttl=64 time=0.025 ms
> 64 bytes from 10.1.4.13: icmp_seq=4 ttl=64 time=0.020 ms
> 64 bytes from 10.1.4.13: icmp_seq=5 ttl=64 time=0.023 ms
> 64 bytes from 10.1.4.13: icmp_seq=6 ttl=64 time=0.026 ms
> 64 bytes from 10.1.4.13: icmp_seq=7 ttl=64 time=0.024 ms
> 64 bytes from 10.1.4.13: icmp_seq=8 ttl=64 time=0.023 ms
> 64 bytes from 10.1.4.13: icmp_seq=9 ttl=64 time=0.033 ms
> 64 bytes from 10.1.4.13: icmp_seq=10 ttl=64 time=0.021 ms
> ^C
> --- 10.1.4.13 ping statistics ---
> 10 packets transmitted, 10 received, 0% packet loss, time 9001ms
> rtt min/avg/max/mdev = 0.020/0.027/0.047/0.007 ms
>
> Am 21.05.24 um 15:08 schrieb Anthony D'Atri:
>> Check the netmask on your interfaces, is it possible that you're sending 
>> inter-node traffic up and back down needlessly?
>>
>>> On May 21, 2024, at 06:02, Stefan Bauer  wrote:
>>>
>>> Dear Users,
>>>
>>> i recently setup a new ceph 3 node cluster. Network is meshed between all 
>>> nodes (2 x 25G with DAC).
>>> Storage is flash only (Kioxia 3.2 TBBiCS FLASH 3D TLC, KCMYXVUG3T20)
>>>
>>> The latency with ping tests between the nodes shows:
>>>
>>> # ping 10.1.3.13
>>> PING 10.1.3.13 (10.1.3.13) 56(84) bytes of data.
>>> 64 bytes from 10.1.3.13: icmp_seq=1 ttl=64 time=0.145 ms
>>> 64 bytes from 10.1.3.13: icmp_seq=2 ttl=64 time=0.180 ms
>>> 64 bytes from 10.1.3.13: icmp_seq=3 ttl=64 time=0.180 ms
>>> 64 bytes from 10.1.3.13: icmp_seq=4 ttl=64 time=0.115 ms
>>> 64 bytes from 10.1.3.13: icmp_seq=5 ttl=64 time=0.110 ms
>>> 64 bytes from 10.1.3.13: icmp_seq=6 ttl=64 time=0.120 ms
>>> 64 bytes from 10.1.3.13: icmp_seq=7 ttl=64 time=0.124 ms
>>> 64 bytes from 10.1.3.13: icmp_seq=8 ttl=64 time=0.140 ms
>>> 64 bytes from 10.1.3.13: icmp_seq=9 ttl=64 time=0.127 ms
>>> 64 bytes from 10.1.3.13: icmp_seq=10 ttl=64 time=0.143 ms
>>> 64 bytes from 10.1.3.13: icmp_seq=11 ttl=64 time=0.129 ms
>>> --- 10.1.3.13 ping statistics ---
>>> 11 packets transmitted, 11 received, 0% packet loss, time 10242ms
>>> rtt min/avg/max/mdev = 0.110/0.137/0.180/0.022 ms
>>>
>>>
>>> On another cluster i have much better values, with 10G SFP+ and 
>>> fibre-cables:
>>>

[ceph-users] Re: unknown PGs after adding hosts in different subtree

2024-05-23 Thread Frank Schilder
Hi Eugen,

I'm afraid the description of your observation is a bit out of causal order and 
this might be the reason for the few replies. To add some structure to when 
exactly what happened, let me lay out what I did and did not understand:

Before adding the hosts you have situation

1)
default
  DCA
host A1 ... AN
  DCB
host B1 ... BM

Now you add K+L hosts, they go into the default root and we have situation

2)
default
  host C1 ... CK, D1 ... DL
  DCA
host A1 ... AN
  DCB
host B1 ... BM

As a last step, you move the hosts to their final locations and we arrive at 
situation

3)
default
  DCA
host A1 ... AN, C1 ... CK
  DCB
host B1 ... BM, D1 ... DL

Please correct if this is wrong. Assuming it's correct, I conclude the following.

Now, from your description it is not clear to me on which of the transitions 
1->2 or 2->3 you observe
- peering and/or
- unknown PGs.

We use a somewhat similar procedure except that we have a second root (separate 
disjoint tree) for new hosts/OSDs. However, in terms of peering it is the same 
and if everything is configured correctly I would expect this to happen (this 
is what happens when we add OSDs/hosts):

transition 1->2: hosts get added: no peering, no remapped objects, nothing, 
just new OSDs doing nothing
transition 2->3: hosts get moved: peering starts and remapped objects appear, 
all PGs active+clean

Unknown PGs should not occur (maybe only temporarily when the primary changes 
or the PG is slow to respond/report status??). The crush bug with too few 
set_choose_tries is observed if one has *just enough hosts* for the EC profile 
and should not be observed if all PGs are active+clean and one *adds hosts*. 
Persistent unknown PGs can (to my understanding, does unknown mean "has no 
primary"?) only occur if the number of PGs changes (autoscaler messing 
around??) because all PGs were active+clean before. The crush bug leads to 
incomplete PGs, so PGs can go incomplete but they should always have an acting 
primary.

This is assuming no OSDs went down/out during the process.

Can you please check if my interpretation is correct and describe at which step 
exactly things start diverging from my expectations.
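
If it helps, a quick way to list the unknown PGs together with their acting sets 
is something like this (a jq sketch over the pg dump json):

ceph -f json pg dump pgs 2>/dev/null | jq -r '.pg_stats[] | select(.state | contains("unknown")) | [.pgid, .state, (.acting|join(" "))] | @tsv'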

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eugen Block 
Sent: Thursday, May 23, 2024 12:05 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: unknown PGs after adding hosts in different subtree

Hi again,

I'm still wondering if I misunderstand some of the ceph concepts.
Let's assume the choose_tries value is too low and ceph can't find
enough OSDs for the remapping. I would expect that there are some PG
chunks in remapping state or unknown or whatever, but why would it
affect the otherwise healthy cluster in such a way?
Even if ceph doesn't know where to put some of the chunks, I wouldn't
expect inactive PGs and have a service interruption.
What am I missing here?

Thanks,
Eugen

Zitat von Eugen Block :

> Thanks, Konstantin.
> It's been a while since I was last bitten by the choose_tries being
> too low... Unfortunately, I won't be able to verify that... But I'll
> definitely keep that in mind, or least I'll try to. :-D
>
> Thanks!
>
> Zitat von Konstantin Shalygin :
>
>> Hi Eugen
>>
>>> On 21 May 2024, at 15:26, Eugen Block  wrote:
>>>
>>> step set_choose_tries 100
>>
>> I think you should try to increase set_choose_tries to 200
>> Last year we had an Pacific EC 8+2 deployment of 10 racks. And even
>> with 50 hosts, the value of 100 not worked for us
>>
>>
>> k


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: unknown PGs after adding hosts in different subtree

2024-05-23 Thread Frank Schilder
Hi Eugen,

thanks for this clarification. Yes, with the observations you describe for 
transition 1->2, something is very wrong. Nothing should happen. Unfortunately, 
I'm going to be on holidays and, generally, don't have too much time. If they 
can afford to share the osdmap (ceph osd getmap -o file), I could also take a 
look at some point.

I don't think it has to do with set_choose_tries, there is likely something 
else screwed up badly. There should simply not be any remapping going on at 
this stage. Just for fun, you should be able to produce a clean crushmap from 
scratch with a similar or the same tree and check if you see the same problems.

Using the full osdmap with osdmaptool allows reproducing the exact mappings used 
in the cluster, and it encodes other important information as well. That's 
why I'm asking for this instead of just the crush map.
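
Concretely, I mean something like this (pool id and file names are examples):

ceph osd getmap -o osdmap.bin
# reproduce the exact PG->OSD mappings offline
osdmaptool osdmap.bin --test-map-pgs-dump --pool 19
# extract the crush map for crushtool experiments
osdmaptool osdmap.bin --export-crush crush.bin
crushtool -d crush.bin -o crush.txt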

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eugen Block 
Sent: Thursday, May 23, 2024 1:26 PM
To: Frank Schilder
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Re: unknown PGs after adding hosts in different 
subtree

Hi Frank,

thanks for chiming in here.

> Please correct if this is wrong. Assuming its correct, I conclude
> the following.

You assume correctly.

> Now, from your description it is not clear to me on which of the
> transitions 1->2 or 2->3 you observe
> - peering and/or
> - unknown PGs.

The unknown PGs were observed during/after 1 -> 2. All or almost all
PGs were reported as "remapped", I don't remember the exact number,
but it was more than 4k, and the largest pool has 4096 PGs. We didn't
see down OSDs at all.
Only after moving the hosts into their designated location (the DCs)
the unknown PGs cleared and the application resumed its operation.

I don't want to overload this thread but I asked for a copy of their
crushmap to play around a bit. I moved the new hosts out of the DCs
into the default root via 'crushtool --move ...', then running the
crushtool --test command

# crushtool -i crushmap --test --rule 1 --num-rep 18
--show-choose-tries [--show-bad-mappings] --show-utilization

results in a couple of issues:

- there are lots of bad mappings no matter how high the number for
set_choose_tries is set
- the show-utilization output shows 240 OSDs in usage (there were 240
OSDs before the expansion), but plenty of them have only 9 chunks
assigned:

rule 1 (rule-ec-k7m11), x = 0..1023, numrep = 18..18
rule 1 (rule-ec-k7m11) num_rep 18 result size == 0: 55/1024
rule 1 (rule-ec-k7m11) num_rep 18 result size == 9: 488/1024
rule 1 (rule-ec-k7m11) num_rep 18 result size == 18: 481/1024

And this reminds me of the inactive PGs we saw before I failed the
mgr, those inactive PGs showed only 9 chunks in the acting set. With
k=7 (and min_size=8) that should still be enough, we have successfully
tested disaster recovery with one entire DC down multiple times.

- with --show-mappings some lines contain an empty set like this:

CRUSH rule 1 x 22 []

And one more observation: with the currently active crushmap there are
no bad mappings at all when the hosts are in their designated location.
So there's definitely something wrong here, I just can't tell what it
is yet. I'll play a bit more with that crushmap...

Thanks!
Eugen


Zitat von Frank Schilder :

> Hi Eugen,
>
> I'm afraid the description of your observation breaks a bit with
> causality and this might be the reason for the few replies. To
> produce a bit more structure for when exactly what happened, let's
> look at what I did and didn't get:
>
> Before adding the hosts you have situation
>
> 1)
> default
>   DCA
> host A1 ... AN
>   DCB
> host B1 ... BM
>
> Now you add K+L hosts, they go into the default root and we have situation
>
> 2)
> default
>   host C1 ... CK, D1 ... DL
>   DCA
> host A1 ... AN
>   DCB
> host B1 ... BM
>
> As a last step, you move the hosts to their final locations and we
> arrive at situation
>
> 3)
> default
>   DCA
> host A1 ... AN, C1 ... CK
>   DCB
> host B1 ... BM, D1 ... DL
>
> Please correct if this is wrong. Assuming its correct, I conclude
> the following.
>
> Now, from your description it is not clear to me on which of the
> transitions 1->2 or 2->3 you observe
> - peering and/or
> - unknown PGs.
>
> We use a somewhat similar procedure except that we have a second
> root (separate disjoint tree) for new hosts/OSDs. However, in terms
> of peering it is the same and if everything is configured correctly
> I would expect this to happen (this is what happens when we add
> OSDs/hosts):
>
> transition 1->2: hosts get added: no peering, no remapped obj

[ceph-users] Re: does the RBD client block write when the Watcher times out?

2024-05-23 Thread Frank Schilder
Hi, we ran into the same issue and there is actually another use case: 
live-migration of VMs. This requires an RBD image being mapped to two clients 
simultaneously, so this is intentional. If multiple clients map an image in 
RW-mode, the ceph back-end will cycle the write lock between the clients to 
allow each of them to flush writes; this is intentional as well. The way to coordinate 
here is the job of the orchestrator. In this case specifically, it explicitly 
manages a write lock during live-migration such that writes occur in the 
correct order.

It's not a ceph job, it's an orchestration job. The rbd interface just provides 
the tools to do it; for example, you can attach information that helps you 
hunt down dead-looking clients and kill them properly before mapping an image 
somewhere else.
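
As an example of what I mean by the tools (the image spec is a placeholder):

rbd status rbd/vm-disk-1       # lists current watchers (client address)
rbd lock ls rbd/vm-disk-1      # advisory locks, if the client took one
rbd lock rm rbd/vm-disk-1 <lock-id> <locker>   # break a stale lock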

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Ilya Dryomov 
Sent: Thursday, May 23, 2024 2:05 PM
To: Yuma Ogami
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: does the RBD client block write when the Watcher 
times out?

On Thu, May 23, 2024 at 4:48 AM Yuma Ogami  wrote:
>
> Hello.
>
> I'm currently verifying the behavior of RBD on failure. I'm wondering
> about the consistency of RBD images after network failures. As a
> result of my investigation, I found that RBD sets a Watcher to RBD
> image if a client mounts this volume to prevent multiple mounts. In

Hi Yuma,

The watcher is created to watch for updates (technically, to listen to
notifications) on the RBD image, not to prevent multiple mounts.  RBD
allows the same image to be mapped multiple times on the same node or
on different nodes.

> addition, I found that if the client is isolated from the network for
> a long time, the Watcher is released. However, the client still mounts
> this image. In this situation, if another client can also mount this
> image and the image is writable from both clients, data corruption
> occurs. Could you tell me whether this is a realistic scenario?

Yes, this is a realistic scenario which can occur even if the client
isn't isolated from the network.  If the user does this, it's up to the
user to ensure that everything remains consistent.  One use case for
mapping the same image on multiple nodes is a clustered (also referred
to as a shared disk) filesystem, such as OCFS2.

Thanks,

Ilya
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: unknown PGs after adding hosts in different subtree

2024-05-23 Thread Frank Schilder
Hi Eugen,

I'm at home now. Could you please check for all the remapped PGs that they have no 
shards on the new OSDs, i.e. that it's just shuffling mappings around within the same 
set of OSDs under the rooms?
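
Something like this should show it at a glance (a jq sketch over the pg dump 
json; grep the output for the new OSD ids afterwards, it should come back empty):

ceph -f json pg dump pgs 2>/dev/null | jq -r '.pg_stats[] | select(.state | contains("remapped")) | [.pgid, (.up|join(" ")), (.acting|join(" "))] | @tsv'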

If this is the case, it is possible that this is partly intentional and partly 
buggy. The remapping is then probably intentional and the method I use with a 
disjoint tree for new hosts prevents such remappings initially (the crush code 
sees the new OSDs in the root, doesn't use them but their presence does change 
choice orders resulting in remapped PGs). However, the unknown PGs should 
clearly not occur.

I'm afraid that the peering code has quite a few bugs; I reported something at 
least similarly weird a long time ago: https://tracker.ceph.com/issues/56995 
and https://tracker.ceph.com/issues/46847. Might even be related. It looks like 
peering can lose track of PG members in certain situations (specifically after 
adding OSDs, until rebalancing has completed). In my case, I get degraded objects 
even though everything is obviously still around. Flipping between the 
crush-maps before/after the change re-discovers everything again.

Issue 46847 is long-standing and still unresolved. In case you need to file a 
tracker, please consider to refer to the two above as well as "might be 
related" if you deem that they might be related.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: unknown PGs after adding hosts in different subtree

2024-05-23 Thread Frank Schilder
Hi Eugen,

just to add another strange observation from long ago: 
https://www.spinics.net/lists/ceph-users/msg74655.html. I didn't see any 
reweights in your trees, so it's something else. However, there seem to be 
multiple issues with EC pools and peering.

I also want to clarify:

> If this is the case, it is possible that this is partly intentional and 
> partly buggy.

"Partly intentional" here means the code behaviour changes when you add OSDs to 
the root outside the rooms and this change is not considered a bug. It is 
clearly *not* expected as it means you cannot do maintenance on a pool living 
on a tree A without affecting pools on the same device class living on an 
unmodified subtree of A.

From a ceph user's point of view everything you observe looks buggy. I would 
really like to see a good explanation why the mappings in the subtree *should* 
change when adding OSDs above that subtree, as in your case, when the 
expectation for good reasons is that they don't. This would help devise 
clean procedures for adding hosts when you (and I) want to add OSDs first 
without any peering and then move OSDs into place, so that it happens separately 
from adding and not as a total mess with everything in parallel.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

____
From: Frank Schilder 
Sent: Thursday, May 23, 2024 6:32 PM
To: Eugen Block
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: unknown PGs after adding hosts in different subtree

Hi Eugen,

I'm at home now. Could you please check all the remapped PGs that they have no 
shards on the new OSDs, i.e. its just shuffling around mappings within the same 
set of OSDs under rooms?

If this is the case, it is possible that this is partly intentional and partly 
buggy. The remapping is then probably intentional and the method I use with a 
disjoint tree for new hosts prevents such remappings initially (the crush code 
sees the new OSDs in the root, doesn't use them but their presence does change 
choice orders resulting in remapped PGs). However, the unknown PGs should 
clearly not occur.

I'm afraid that the peering code has quite a few bugs, I reported something at 
least similarly weird a long time ago: https://tracker.ceph.com/issues/56995 
and https://tracker.ceph.com/issues/46847. Might even be related. It looks like 
peering can loose track of PG members in certain situations (specifically after 
adding OSDs until rebalancing completed). In my cases, I get degraded objects 
even though everything is obviously still around. Flipping between the 
crush-maps before/after the change re-discovers everything again.

Issue 46847 is long-standing and still unresolved. In case you need to file a 
tracker, please consider to refer to the two above as well as "might be 
related" if you deem that they might be related.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: unknown PGs after adding hosts in different subtree

2024-05-24 Thread Frank Schilder
Hi Eugen,

so it is partly "unexpectedly expected" and partly buggy. I really wish the 
crush implementation honoured a few obvious invariants. It is extremely 
counter-intuitive that mappings taken from a sub-set change even though both 
the sub-set and the mapping instructions themselves don't.

> - Use different root names

That's what we are doing and it works like a charm, also for draining OSDs.

> more specific crush rules.

I guess you mean use something like "step take DCA class hdd" instead of "step 
take default class hdd" as in:

rule rule-ec-k7m11 {
        id 1
        type erasure
        min_size 3
        max_size 18
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take DCA class hdd
        step chooseleaf indep 9 type host
        step take DCB class hdd
        step chooseleaf indep 9 type host
        step emit
}

According to the documentation, this should actually work and be almost 
equivalent to your crush rule. The difference here is that it will make sure 
that the first 9 shards come from DCA and the second 9 shards from DCB (it's an 
ordering). A side effect is that all primary OSDs will be in DCA if both DCs 
are up. I remember people asking for that as a feature in multi-DC set-ups, to 
have the primary OSDs in the DC with the lowest latency by default.

Can you give this crush rule a try and report back whether or not the behaviour 
when adding hosts changes?
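
In case it helps with testing, here is a rough sketch of how one could check 
offline with crushtool whether a map change (new rule, added hosts) alters 
existing mappings; rule id, num-rep and file names are just examples for the 
k7m11 case:

# dump and decompile the current crush map, prepare a modified copy
ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt          # edit crush.txt (add rule/hosts)
crushtool -c crush.txt -o crush-new.bin
# compare the mappings both maps produce for the same rule
crushtool -i crush.bin     --test --rule 1 --num-rep 18 --show-mappings > before.txt
crushtool -i crush-new.bin --test --rule 1 --num-rep 18 --show-mappings > after.txt
diff before.txt after.txt                    # no output = no remapping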

In case you have time, it would be great if you could collect information on 
(reproducing) the fatal peering problem. While remappings might be 
"unexpectedly expected", it is clearly a serious bug that incomplete and 
unknown PGs show up in the process of adding hosts at the root.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eugen Block 
Sent: Friday, May 24, 2024 2:51 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: unknown PGs after adding hosts in different subtree

I'm starting to think that the root cause of the remapping is just the fact
that the crush rule(s) contain(s) the "step take default" line:

  step take default class hdd

My interpretation is that crush simply tries to honor the rule:
consider everything underneath the "default" root, so PGs get remapped
if new hosts are added there (but not in their designated subtree
buckets). The effect (unknown PGs) is bad, but there are a couple of
options to avoid that:

- Use different root names and/or more specific crush rules.
- Use host spec file(s) to place new hosts directly where they belong.
- Set osd_crush_initial_weight = 0 to avoid remapping until everything
is where it's supposed to be, then reweight the OSDs.


Zitat von Eugen Block :

> Hi Frank,
>
> thanks for looking up those trackers. I haven't looked into them
> yet, I'll read your response in detail later, but I wanted to add
> some new observation:
>
> I added another root bucket (custom) to the osd tree:
>
> # ceph osd tree
> ID   CLASS  WEIGHT   TYPE NAME   STATUS  REWEIGHT  PRI-AFF
> -12   0  root custom
>  -1 0.27698  root default
>  -8 0.09399  room room1
>  -3 0.04700  host host1
>   7hdd  0.02299  osd.7   up   1.0  1.0
>  10hdd  0.02299  osd.10  up   1.0  1.0
> ...
>
> Then I tried this approach to add a new host directly to the
> non-default root:
>
> # cat host5.yaml
> service_type: host
> hostname: host5
> addr: 192.168.168.54
> location:
>   root: custom
> labels:
>- osd
>
> # ceph orch apply -i host5.yaml
>
> # ceph osd tree
> ID   CLASS  WEIGHT   TYPE NAME   STATUS  REWEIGHT  PRI-AFF
> -12 0.04678  root custom
> -23 0.04678  host host5
>   1hdd  0.02339  osd.1   up   1.0  1.0
>  13hdd  0.02339  osd.13  up   1.0  1.0
>  -1 0.27698  root default
>  -8 0.09399  room room1
>  -3 0.04700  host host1
>   7hdd  0.02299  osd.7   up   1.0  1.0
>  10hdd  0.02299  osd.10  up   1.0  1.0
> ...
>
> host5 is placed directly underneath the new custom root correctly,
> but not a single PG is marked "remapped"! So this is actually what I
> (or we) expected. I'm not sure yet what to make of it, but I'm
> leaning towards using this approach in the future and add hosts
> underneath a different root first, and then move it to its
> designated location.
>
> Just to validate again, I added host6 without a location spec, so
> it's placed underneath the default root again:
>
> # ceph osd tree
> ID   CLASS  

[ceph-users] Re: Documentation for meaning of "tag cephfs" in OSD caps

2024-06-11 Thread Frank Schilder
There is a tiny bit more to it. The idea is that, when adding a data pool, any 
cephfs client can access the new pool without having to update the caps. To 
this end, the fs caps must include two pieces of information: the application 
name "cephfs" and the file system name (ceph can have multiple file systems). 
Any cephfs-enabled pool with the correct file system name will be accessible to 
a properly authorized client of that file system without having to add that 
pool to the client caps explicitly, as was necessary in older versions.

The two pieces of information are provided like this:

application name cephfs: "tag cephfs"
file system name: "data=con-fs2"

One can check what is encoded for each pool using

ceph osd pool ls detail --format=json | jq '.[] | .pool_name, .application_metadata'

For a ceph-fs pool, it will look something like

"con-fs2-data2"
{
  "cephfs": {
"data": "con-fs2"
  }
}
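
These tags are normally set automatically when a pool is added to a file system 
(ceph fs add_data_pool). For completeness, a sketch of how the same metadata can 
be set by hand, using the pool and fs names from the example above:

# enable the cephfs application on the pool and record the fs name
ceph osd pool application enable con-fs2-data2 cephfs
ceph osd pool application set con-fs2-data2 cephfs data con-fs2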

As of today, this seems to be undocumented black magic, and you need to search 
very carefully to find ceph-users cases that discuss (issues with) these tags 
and thereby explain it as a side effect.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eugen Block 
Sent: Tuesday, June 11, 2024 2:14 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Documentation for meaning of "tag cephfs" in OSD caps

I assume it means that pools with an enabled application "cephfs" can
be targeted by specifying this tag instead of listing each pool
separately. Browsing through the code [1] seems to confirm that
(somehow, I'm not a dev):

> if (g.match.pool_tag.application == ng.match.pool_tag.application

But I agree, it's worth adding that to the docs.

[1]
https://github.com/ceph/ceph/blob/09e81319648dd504cfd94edfdd321c7163cefa98/src/osd/OSDCap.cc#L549

Zitat von Petr Bena :

> Hello
>
> In https://docs.ceph.com/en/latest/cephfs/client-auth/ we can find that
>
> ceph fs authorize cephfs_a client.foo / r /bar rw Results in
>
> client.foo
>   key: *key*
>   caps:  [mds]  allow  r,  allow  rw  path=/bar
>   caps:  [mon]  allow  r
>   caps:  [osd]  allow  rw  tag  cephfs  data=cephfs_a
>
>
> What is this "tag cephfs" thing? It seems like some undocumented
> black magic to me, since I can't find anything that documents it.
> Can someone explain how it works under the hood? What does it expand
> to? What does it limit and how?
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS metadata pool size

2024-06-12 Thread Frank Schilder
Hi, there seem to be replies missing from this list. For example, I can't find 
any messages that contain information that could lead to this conclusion:

> * pg_num too low (defaults are too low)
> * pg_num not a power of 2
> * pg_num != number of OSDs in the pool
> * balancer not enabled

It is horrible for other users trying to follow threads or learn from them if 
part of the communication is private. This thread is not the first occurrence; 
it seems to have become more frequent recently. Could posters please reply to 
the list instead of to individual users?

Thanks for your consideration.
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Anthony D'Atri 
Sent: Wednesday, June 12, 2024 2:53 PM
To: Eugen Block
Cc: Lars Köppel; ceph-users@ceph.io
Subject: [ceph-users] Re: CephFS metadata pool size

If you have:

* pg_num too low (defaults are too low)
* pg_num not a power of 2
* pg_num != number of OSDs in the pool
* balancer not enabled

any of those might result in imbalance.

> On Jun 12, 2024, at 07:33, Eugen Block  wrote:
>
> I don't have any good explanation at this point. Can you share some more 
> information like:
>
> ceph pg ls-by-pool 
> ceph osd df (for the relevant OSDs)
> ceph df
>
> Thanks,
> Eugen
>
> Zitat von Lars Köppel :
>
>> Since my last update the size of the largest OSD increased by 0.4 TiB while
>> the smallest one only increased by 0.1 TiB. How is this possible?
>>
>> Because the metadata pool reported to have only 900MB space left, I stopped
>> the hot-standby MDS. This gave me 8GB back but these filled up in the last
>> 2h.
>> I think I have to zap the next OSD because the filesystem is getting read
>> only...
>>
>> How is it possible that an OSD has over 1 TiB less data on it after a
>> rebuild? And how is it possible to have so different sizes of OSDs?
>>
>>
>> [image: ariadne.ai Logo] Lars Köppel
>> Developer
>> Email: lars.koep...@ariadne.ai
>> Phone: +49 6221 5993580 <+4962215993580>
>> ariadne.ai (Germany) GmbH
>> Häusserstraße 3, 69115 Heidelberg
>> Amtsgericht Mannheim, HRB 744040
>> Geschäftsführer: Dr. Fabian Svara
>> https://ariadne.ai
>>
>>
>> On Tue, Jun 11, 2024 at 3:47 PM Lars Köppel  wrote:
>>
>>> Only in warning mode. And there were no PG splits or merges in the last 2
>>> month.
>>>
>>>
>>> [image: ariadne.ai Logo] Lars Köppel
>>> Developer
>>> Email: lars.koep...@ariadne.ai
>>> Phone: +49 6221 5993580 <+4962215993580>
>>> ariadne.ai (Germany) GmbH
>>> Häusserstraße 3, 69115 Heidelberg
>>> Amtsgericht Mannheim, HRB 744040
>>> Geschäftsführer: Dr. Fabian Svara
>>> https://ariadne.ai
>>>
>>>
>>> On Tue, Jun 11, 2024 at 3:32 PM Eugen Block  wrote:
>>>
>>>> I don't think scrubs can cause this. Do you have autoscaler enabled?
>>>>
>>>> Zitat von Lars Köppel :
>>>>
>>>> > Hi,
>>>> >
>>>> > thank you for your response.
>>>> >
>>>> > I don't think this thread covers my problem, because the OSDs for the
>>>> > metadata pool fill up at different rates. So I would think this is no
>>>> > direct problem with the journal.
>>>> > Because we had earlier problems with the journal I changed some
>>>> > settings(see below). I already restarted all MDS multiple times but no
>>>> > change here.
>>>> >
>>>> > The health warnings regarding cache pressure resolve normally after a
>>>> > short period of time, when the heavy load on the client ends. Sometimes
>>>> it
>>>> > stays a bit longer because an rsync is running and copying data on the
>>>> > cluster(rsync is not good at releasing the caps).
>>>> >
>>>> > Could it be a problem if scrubs run most of the time in the background?
>>>> Can
>>>> > this block any other tasks or generate new data itself?
>>>> >
>>>> > Best regards,
>>>> > Lars
>>>> >
>>>> >
>>>> > global  basic mds_cache_memory_limit
>>>> > 17179869184
>>>> > global  advanced  mds_max_caps_per_client
>>>> >16384
>>>> > global  advanced
>>>> mds_recall_global_max_decay_threshold
>>>> >262144
>>>> > global   

[ceph-users] Can't comment on my own tracker item any more

2024-06-13 Thread Frank Schilder
Hi all,

I just received a notification about a bug I reported 4 years ago 
(https://tracker.ceph.com/issues/45253):

> Issue #45253 has been updated by Victoria Mackie.

I would like to leave a comment, but the comment function seems not to be 
available any more, even though I'm logged in and I'm listed as the author.

I can still edit the item itself, but I'm not able to leave comments.

Can someone please look into that?

Thanks!
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Can't comment on my own tracker item any more

2024-06-13 Thread Frank Schilder
OK, I can click on the little "quote" symbol and then a huge dialog opens that 
says "edit" but means "comment". Would it be possible to add the simple comment 
action again? Also, the fact that the quote action removes nested text makes it 
a little less useful than it could be. I had to copy the code example back by 
hand.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

____
From: Frank Schilder 
Sent: Thursday, June 13, 2024 11:40 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Can't comment on my own tracker item any more

Hi all,

I just received a notification about a bug I reported 4 years ago 
(https://tracker.ceph.com/issues/45253):

> Issue #45253 has been updated by Victoria Mackie.

I would like to leave a comment, but the comment function seems not to be 
available any more, even though I'm logged in and I'm listed as the author.

I can still edit the item itself, but I'm not able to leave comments.

Can someone please look into that?

Thanks!
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: deep scrubb and scrubb does get the job done

2024-06-13 Thread Frank Schilder
Yes, there is: 
https://github.com/frans42/ceph-goodies/blob/main/doc/TuningScrub.md

This is work in progress and a few details are missing, but it should help you 
find the right parameters. Note that this is tested on octopus with WPQ.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Manuel Oetiker 
Sent: Thursday, June 13, 2024 4:37 PM
To: ceph-users@ceph.io
Subject: [ceph-users] deep scrubb and scrubb does get the job done

Hi

our cluster has been in warning state for more than two weeks. We had to move 
some pools from ssd to hdd and it looked good ... but somehow scrubbing does 
not get its jobs done on the PGs:

* PG_NOT_DEEP_SCRUBBED : 171 pgs not deep-scrubbed in time
* PG_NOT_SCRUBBED : 132 pgs not scrubbed in time

Until the move the cluster was happy without any warnings...

There is no heavy load on the cluster, so I don't see why it cannot catch up 
with that...
Is there a way to find out why?

Thanks for any hint
Manuel


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: why not block gmail?

2024-06-17 Thread Frank Schilder
Could we at least stop approving requests from obvious spammers?

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eneko Lacunza 
Sent: Monday, June 17, 2024 9:18 AM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: why not block gmail?

Hi,

El 15/6/24 a las 11:49, Marc escribió:
> If you don't block gmail, gmail/google will never make an effort to clean up 
> their shit. I don't think people with a gmail.com will mind, because this is 
> free and get somewhere else a free account.
>
> tip: google does not really know what part of their infrastructure is sending 
> email so they use spf ~all. If you process gmail.com and force the -all 
> manually, you block mostly spam.
In May, 41 of 111 (non-spam) list messages came from gmail.com.

I think banning gmail.com will be an issue for the list, at least
short-term.

Applying SPF -all seems better, but not sure about how easy that would
be to implement... :)

Cheers

Eneko Lacunza
Zuzendari teknikoa | Director técnico
Binovo IT Human Project

Tel. +34 943 569 206 | https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun

https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pg deep-scrub control scheme

2024-06-27 Thread Frank Schilder
>  Is there a calculation formula that can be used to easily configure
> the scrub/deepscrub strategy?

There is: 
https://github.com/frans42/ceph-goodies/blob/main/doc/RecommendationsForScrub.md

Tested on Octopus with osd_op_queue=wpq, osd_op_queue_cut_off=high.
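
For orientation only, the kind of settings discussed there can be applied 
cluster-wide like this; the interval value below is a made-up example, not a 
recommendation for your cluster:

# queue settings (OSDs need a restart to pick these up)
ceph config set osd osd_op_queue wpq
ceph config set osd osd_op_queue_cut_off high
# keep one scrub per OSD and randomize the schedule a bit
ceph config set osd osd_max_scrubs 1
ceph config set osd osd_scrub_interval_randomize_ratio 0.5
ceph config set osd osd_deep_scrub_interval 1209600   # 14 days, example only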

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: David Yang 
Sent: Thursday, June 27, 2024 3:50 AM
To: Ceph Users
Subject: [ceph-users] pg deep-scrub control scheme

Hello everyone.

I have a cluster with 8321 PGs and recently I started to get "pgs not
deep-scrubbed in time" warnings.
The reason is that I reduced max_scrubs to avoid the impact of scrubbing on IO.

Here is my current scrub configuration:

~]# ceph tell osd.1 config show|grep scrub
"mds_max_scrub_ops_in_progress": "5",
"mon_scrub_inject_crc_mismatch": "0.00",
"mon_scrub_inject_missing_keys": "0.00",
"mon_scrub_interval": "86400",
"mon_scrub_max_keys": "100",
"mon_scrub_timeout": "300",
"mon_warn_pg_not_deep_scrubbed_ratio": "0.80",
"mon_warn_pg_not_scrubbed_ratio": "0.50",
"osd_debug_deep_scrub_sleep": "0.00",
"osd_deep_scrub_interval": "1296000.00",
"osd_deep_scrub_keys": "1024",
"osd_deep_scrub_large_omap_object_key_threshold": "20",
"osd_deep_scrub_large_omap_object_value_sum_threshold": "1073741824",
"osd_deep_scrub_randomize_ratio": "0.08",
"osd_deep_scrub_stride": "131072",
"osd_deep_scrub_update_digest_min_age": "7200",
"osd_max_scrubs": "1",
"osd_requested_scrub_priority": "120",
"osd_scrub_auto_repair": "false",
"osd_scrub_auto_repair_num_errors": "5",
"osd_scrub_backoff_ratio": "0.66",
"osd_scrub_begin_hour": "0",
"osd_scrub_begin_week_day": "0",
"osd_scrub_chunk_max": "25",
"osd_scrub_chunk_min": "5",
"osd_scrub_cost": "52428800",
"osd_scrub_during_recovery": "false",
"osd_scrub_end_hour": "0",
"osd_scrub_end_week_day": "0",
"osd_scrub_extended_sleep": "0.00",
"osd_scrub_interval_randomize_ratio": "0.50",
"osd_scrub_invalid_stats": "true",
"osd_scrub_load_threshold": "0.50",
"osd_scrub_max_interval": "1296000.00",
"osd_scrub_max_preemptions": "5",
"osd_scrub_min_interval": "259200.00",
"osd_scrub_priority": "5",
"osd_scrub_sleep": "0.00",

I am currently trying to adjust the interval of scrub.

Is there a calculation formula that can be used to easily configure
the scrub/deepscrub strategy?

At present, there is only the adjustment of individual values, and
then it is a long wait, and there may be no progress in the end.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pg deep-scrub control scheme

2024-06-27 Thread Frank Schilder
Sorry, the entry point is actually 
https://github.com/frans42/ceph-goodies/blob/main/doc/TuningScrub.md
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: Thursday, June 27, 2024 9:02 AM
To: David Yang; Ceph Users
Subject: Re: [ceph-users] pg deep-scrub control scheme

>  Is there a calculation formula that can be used to easily configure
> the scrub/deepscrub strategy?

There is: 
https://github.com/frans42/ceph-goodies/blob/main/doc/RecommendationsForScrub.md

Tested on Octopus with osd_op_queue=wpq, osd_op_queue_cut_off=high.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: David Yang 
Sent: Thursday, June 27, 2024 3:50 AM
To: Ceph Users
Subject: [ceph-users] pg deep-scrub control scheme

Hello everyone.

I have a cluster with 8321 pgs and recently I started to get pg not
deep-scrub warnings.
The reason is that I reduced max_scrub to avoid the impact of scrub on IO.

Here is my current scrub configuration:

~]# ceph tell osd.1 config show|grep scrub
"mds_max_scrub_ops_in_progress": "5",
"mon_scrub_inject_crc_mismatch": "0.00",
"mon_scrub_inject_missing_keys": "0.00",
"mon_scrub_interval": "86400",
"mon_scrub_max_keys": "100",
"mon_scrub_timeout": "300",
"mon_warn_pg_not_deep_scrubbed_ratio": "0.80",
"mon_warn_pg_not_scrubbed_ratio": "0.50",
"osd_debug_deep_scrub_sleep": "0.00",
"osd_deep_scrub_interval": "1296000.00",
"osd_deep_scrub_keys": "1024",
"osd_deep_scrub_large_omap_object_key_threshold": "20",
"osd_deep_scrub_large_omap_object_value_sum_threshold": "1073741824",
"osd_deep_scrub_randomize_ratio": "0.08",
"osd_deep_scrub_stride": "131072",
"osd_deep_scrub_update_digest_min_age": "7200",
"osd_max_scrubs": "1",
"osd_requested_scrub_priority": "120",
"osd_scrub_auto_repair": "false",
"osd_scrub_auto_repair_num_errors": "5",
"osd_scrub_backoff_ratio": "0.66",
"osd_scrub_begin_hour": "0",
"osd_scrub_begin_week_day": "0",
"osd_scrub_chunk_max": "25",
"osd_scrub_chunk_min": "5",
"osd_scrub_cost": "52428800",
"osd_scrub_during_recovery": "false",
"osd_scrub_end_hour": "0",
"osd_scrub_end_week_day": "0",
"osd_scrub_extended_sleep": "0.00",
"osd_scrub_interval_randomize_ratio": "0.50",
"osd_scrub_invalid_stats": "true",
"osd_scrub_load_threshold": "0.50",
"osd_scrub_max_interval": "1296000.00",
"osd_scrub_max_preemptions": "5",
"osd_scrub_min_interval": "259200.00",
"osd_scrub_priority": "5",
"osd_scrub_sleep": "0.00",

I am currently trying to adjust the interval of scrub.

Is there a calculation formula that can be used to easily configure
the scrub/deepscrub strategy?

At present, there is only the adjustment of individual values, and
then it is a long wait, and there may be no progress in the end.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Ceph tracker broken?

2024-07-01 Thread Frank Schilder
Hi all, hopefully someone on this list can help me out. I recently started to 
receive unsolicited e-mail from the ceph tracker and also certain merge/pull 
requests. The latest one is:

[CephFS - Bug #66763] (New) qa: revert commit to unblock snap-schedule 
testing

I have nothing to do with that and I have not subscribed to this tracker item 
(https://tracker.ceph.com/issues/66763) either. Yet, I receive unrequested 
updates.

Could someone please take a look and try to find out what the problem is?

Thanks a lot!
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph tracker broken?

2024-07-01 Thread Frank Schilder
Hi Gregory,

thanks a lot! I hope that sets things back to normal.

It seems possible that my account got assigned to a project by accident. Not 
sure if you (or I myself) can find out about that. I'm not a dev and should not 
be on any projects, but some of my tickets were picked up as "low-hanging 
fruit" and that's when it started. I got added to some related PRs and maybe, 
on this occasion, to a lot more by accident.

Thanks for your help!
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Gregory Farnum 
Sent: Monday, July 1, 2024 8:38 PM
To: Frank Schilder
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Ceph tracker broken?

You currently have "Email notifications" set to "For any event on all
my projects". I believe that's the firehose setting, so I've gone
ahead and changed it to "Only for things I watch or I'm involved in".
I'm unaware of any reason that would have been changed on the back
end, though there were some upgrades recently. It's also possible you
got assigned to a new group or somehow joined some of the projects
(I'm not well-versed in all the terminology there).
-Greg

On Sun, Jun 30, 2024 at 10:35 PM Frank Schilder  wrote:
>
> Hi all, hopefully someone on this list can help me out. I recently started to 
> receive unsolicited e-mail from the ceph tracker and also certain merge/pull 
> requests. The latest one is:
>
> [CephFS - Bug #66763] (New) qa: revert commit to unblock snap-schedule 
> testing
>
> I have nothing to do with that and I have not subscribed to this tracker item 
> (https://tracker.ceph.com/issues/66763) eithrt. Yet, I receive unrequested 
> updates.
>
> Could someone please take a look and try to find out what the problem is?
>
> Thanks a lot!
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Repurposing some Dell R750s for Ceph

2024-07-11 Thread Frank Schilder
Hi Drew,

as far as I know, Dell's drive bays for RAID controllers are not the same as 
the drive bays for CPU-attached disks. In particular, I don't think they have 
that config for 3.5" drive bays, and your description sounds a lot like that's 
what you have. Are you trying to go from 16x2.5" HDD to something like 24xNVMe?

Maybe you could provide a bit more information here, like (links to) the wiring 
diagrams you mentioned? From the description I cannot entirely deduce what 
exactly you have and where you want to go.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Drew Weaver 
Sent: Thursday, July 11, 2024 3:16 PM
To: 'ceph-users@ceph.io'
Subject: [ceph-users] Repurposing some Dell R750s for Ceph

Hello,

We would like to repurpose some Dell PowerEdge R750s for a Ceph cluster.

Currently the servers have one H755N RAID controller for each 8 drives. (2 
total)

I have been asking their technical support what needs to happen in order for us 
to just rip out those RAID controllers and cable the backplane directly to the 
motherboard/PCIe lanes, and they haven't been super enthusiastic about helping 
me. I get it, just buy another 50 servers, right? No big deal.

I have the diagrams that show how each configuration should be connected, I 
think I just need the right cable(s), my question is has anyone done this work 
before and was it worth it?

Also bonus if anyone has an R750 that has the drives directly connected to the 
backplane and can find the part number of the cable that connects the backplane 
to the motherboard I would greatly appreciate that part number. My sales guys 
are "having a hard time locating it".

Thanks,
-Drew

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to specify id on newly created OSD with Ceph Orchestrator

2024-07-23 Thread Frank Schilder
> Why would you want to do that?

You would want to do that to have minimal data movement, that is, limit the 
wear on disks to the absolutely necessary minimum. If you replace a disk and 
re-deploy the OSD with the same ID on the same host with the same device class, 
only the PGs that land on this OSD will move. If you assign the free IDs 
randomly, a lot more data movement will occur as the entire crush map will be 
affected and many more PGs become remapped.

Try it on a test cluster. Down an OSD and destroy it. See how many PGs are 
remapped. When recovery has finished, purge the OSD ID. Another huge round of 
data movement will happen. This second part will happen a second time when 
adding the disk back, and it is unnecessary if you plan to replace the disk 
with a (from ceph's point of view) identical disk model.

That's why the "destroyed" state for OSDs is there, it keeps the OSD IDs place 
in the crush map, prevents large data movement on OSD ID removal, can be 
re-used and then again prevents large data movement on adding (two times the 
second movement I mentioned above).

There was also a bug when OSD IDs are not sequential (there are holes in the ID 
list). This is fixed, but better safe than sorry.
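
For completeness, a rough sketch of the replacement flow that keeps the ID; the 
OSD id and device path are placeholders, and with cephadm the --replace flag of 
"ceph orch osd rm" achieves the same:

ceph osd destroy 42 --yes-i-really-mean-it     # keeps the ID and crush position
# physically replace the disk, then re-create the OSD with the same ID
ceph-volume lvm create --osd-id 42 --data /dev/sdX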

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Iztok Gregori 
Sent: Tuesday, July 23, 2024 9:10 AM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: How to specify id on newly created OSD with Ceph 
Orchestrator

On 23/07/24 08:41, Robert Sander wrote:
> On 7/23/24 08:24, Iztok Gregori wrote:
>
>> Am I missing something obvious or with Ceph orchestrator there are non
>> way to specify an id during the OSD creation?
>
> Why would you want to do that?

For me there wasn't a "real need", I could imagine a scenario in which
you want to have a specific osd id range allocated to a specific host,
but in my case it is just a curiosity.

With "ceph-volume" is possible to specify an id during the OSD creation,
but with ceph-orch I didn't find the equivalent option and I'm curious
if I've missed something.


Cheers
Iztok




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] snaptrim not making progress

2024-07-29 Thread Frank Schilder
Hi all,

our cluster is octopus latest. We seem to have a problem with snaptrim. On a 
pool for HDD RBD images I observed today that all PGs are either in state 
snaptrim or snaptrim_wait. It looks like the snaptrim process does not actually 
make any progress. There is no CPU activity by these OSDs indicating they would 
do snaptrimming (usually they would at least use 50% CPU as shown in top). I 
also don't see anything in the OSD logs.

For our VMs we run daily snapshot rotation and snaptrim usually finishes within 
a few minutes. We had a VM with disks on that pool cause an error due to a 
hanging virsh domfsfreeze command. This, however, is routine; we see it 
happening every now and then without any follow-up issues. I'm wondering now if 
we might have hit a race for the first time. Is there anything on an RBD image 
or pool that could block snaptrim from starting or progressing?
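
In case someone else runs into this: a quick way to check whether the snaptrim 
queues shrink at all (assuming the snaptrimq_len field is present in your 
release) is something like:

# list PGs in snaptrim-related states with their queued snap count
ceph -f json pg dump pgs 2>/dev/null | jq -r '
  .pg_stats[] | select(.state | test("snaptrim")) |
  [.pgid, .state, .snaptrimq_len] | @tsv' | sort -k3 -rn | head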

Thanks for any pointers!
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] 0 slow ops message stuck for down+out OSD

2024-07-29 Thread Frank Schilder
Hi all,

we had a failing disk (with slow ops) and I shut down the OSD. Its status is 
down+out. However, I still see this message stuck in the output of ceph status 
and ceph health detail:

0 slow ops, oldest one blocked for 70 sec, osd.183 has slow ops

I believe there was a case about that some time ago, but I can't find it. How 
can I get rid of this stuck warning? Our cluster is octopus latest.

Thanks and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 0 slow ops message stuck for down+out OSD

2024-07-29 Thread Frank Schilder
Very funny, it was actually me who made this case some time ago: 
https://www.mail-archive.com/ceph-users@ceph.io/msg10095.html

I will look into what we did last time.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: Monday, July 29, 2024 10:30 AM
To: ceph-users@ceph.io
Subject: [ceph-users] 0 slow ops message stuck for down+out OSD

Hi all,

we had a failing disk (with slow ops) and I shut down the OSD. Its status is 
down+out. However, I still see this message stuck in the output of ceph status 
and ceph health detail:

0 slow ops, oldest one blocked for 70 sec, osd.183 has slow ops

I believe there was a case about that some time ago, but I can't find it. How 
can I get rid of this stuck warning? Our cluster is octopus latest.

Thanks and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: snaptrim not making progress

2024-07-29 Thread Frank Schilder
Some additional info: my best bet is that the stuck snaptrim has to do with 
image one-427. Not sure if this is a useful clue: the VM has 2 images and one 
of them has an exclusive lock while the other doesn't. Both images are in use, 
though, and having a lock is the standard situation. Here is some output of rbd 
commands; both images reported are attached to the same VM:

# rbd ls -l -p sr-rbd-meta-one | grep -e NAME -e "-42[67]" | sed -e "s/  */  /g"
NAME  SIZE  PARENT  FMT  PROT  LOCK
one-426  200  GiB  2  excl
one-426@714  200  GiB  2  yes
one-426@721  200  GiB  2  yes
one-426@727  200  GiB  2  yes
one-426@734  200  GiB  2  yes
one-426@739  200  GiB  2  yes
one-426@740  200  GiB  2  yes
one-426@741  200  GiB  2  yes
one-426@742  200  GiB  2  yes
one-426@743  200  GiB  2  yes
one-426@744  200  GiB  2  yes
one-426@745  200  GiB  2  yes
one-426@746  200  GiB  2  yes
one-426@747  200  GiB  2  yes
one-427  40  GiB  2
one-427@1177  40  GiB  2  yes
one-427@1184  40  GiB  2  yes
one-427@1190  40  GiB  2  yes
one-427@1197  40  GiB  2  yes
one-427@1202  40  GiB  2  yes
one-427@1203  40  GiB  2  yes
one-427@1204  40  GiB  2  yes
one-427@1205  40  GiB  2  yes
one-427@1206  40  GiB  2  yes
one-427@1207  40  GiB  2  yes
one-427@1208  40  GiB  2  yes
one-427@1209  40  GiB  2  yes
one-427@1210  40  GiB  2  yes

# rbd lock ls sr-rbd-meta-one/one-426
There is 1 exclusive lock on this image.
LockerIDAddress
client.370497673  auto 140579044354336  192.168.48.11:0/652417924
# rbd lock ls sr-rbd-meta-one/one-427
# no output

# rbd status sr-rbd-meta-one/one-426
Watchers:
watcher=192.168.48.11:0/652417924 client.370497673 
cookie=140579044354336
# rbd status sr-rbd-meta-one/one-427
Watchers:
watcher=192.168.48.11:0/2422413806 client.370420944 
cookie=140578306156832

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

____
From: Frank Schilder 
Sent: Monday, July 29, 2024 10:24 AM
To: ceph-users@ceph.io
Subject: [ceph-users] snaptrim not making progress

Hi all,

our cluster is octopus latest. We seem to have a problem with snaptrim. On a 
pool for HDD RBD images I observed today that all PGs are either in state 
snaptrim or snaptrim_wait. It looks like the snaptrim process does not actually 
make any progress. There is no CPU activity by these OSDs indicating they would 
do snaptrimming (usually they would at least use 50% CPU as shown in top). I 
also don't see anything in the OSD logs.

For our VMs we run daily snapshot rotation and snaptrim usually finishes within 
a few minutes. We had a VM with disks on that pool cause an error due to a 
hanging virsh domfsfreeze command. This, however, is routine, we see this 
happening every now and then without any follow-up issues. I'm wondering now if 
we might have hit a race for the first time. Is there anything on an RBD image 
or pool that could block snaptrim from starting or progressing?

Thanks for any pointers!
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: snaptrim not making progress

2024-07-29 Thread Frank Schilder
Update: snaptrim has started doing something. I see now the count of PGs that 
are in active+clean (without snaptrim[-wait]) increasing.

I wonder if this started after taking an OSD out of the cluster; see also the 
thread "0 slow ops message stuck for down+out OSD" 
(https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/GXJDYQA5KYWVWF334URUZS3ARXEQ5ROJ/).
 The OSD was seemingly fine until shut-down, that is, there were no warnings 
and user IO seemed to progress without problems.

I did shut it down at some point due to slow-ops warnings, and I had a SMART 
warning about it as well. However, this was at 9:30am while the snaptrim had 
been hanging since 3am. Is there any event on an OSD/disk that can cause 
snaptrim to stall while no health issue is detected/reported?

Thanks for any pointers!
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

____
From: Frank Schilder 
Sent: Monday, July 29, 2024 11:04 AM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: snaptrim not making progress

Some additional info: my best bet is that the stuck snaptrim has to do with 
image one-427. Not sure if this is a useful clue, the VM has 2 images and one 
of these has an exclusive lock while the other doesn't. Both images are in use 
though and having a lock is the standard situation. Here some output of rbd 
commands, both images reported are attached to the same VM:

# rbd ls -l -p sr-rbd-meta-one | grep -e NAME -e "-42[67]" | sed -e "s/  */  /g"
NAME  SIZE  PARENT  FMT  PROT  LOCK
one-426  200  GiB  2  excl
one-426@714  200  GiB  2  yes
one-426@721  200  GiB  2  yes
one-426@727  200  GiB  2  yes
one-426@734  200  GiB  2  yes
one-426@739  200  GiB  2  yes
one-426@740  200  GiB  2  yes
one-426@741  200  GiB  2  yes
one-426@742  200  GiB  2  yes
one-426@743  200  GiB  2  yes
one-426@744  200  GiB  2  yes
one-426@745  200  GiB  2  yes
one-426@746  200  GiB  2  yes
one-426@747  200  GiB  2  yes
one-427  40  GiB  2
one-427@1177  40  GiB  2  yes
one-427@1184  40  GiB  2  yes
one-427@1190  40  GiB  2  yes
one-427@1197  40  GiB  2  yes
one-427@1202  40  GiB  2  yes
one-427@1203  40  GiB  2  yes
one-427@1204  40  GiB  2  yes
one-427@1205  40  GiB  2  yes
one-427@1206  40  GiB  2  yes
one-427@1207  40  GiB  2  yes
one-427@1208  40  GiB  2  yes
one-427@1209  40  GiB  2  yes
one-427@1210  40  GiB  2  yes

# rbd lock ls sr-rbd-meta-one/one-426
There is 1 exclusive lock on this image.
LockerIDAddress
client.370497673  auto 140579044354336  192.168.48.11:0/652417924
# rbd lock ls sr-rbd-meta-one/one-427
# no output

# rbd status sr-rbd-meta-one/one-426
Watchers:
watcher=192.168.48.11:0/652417924 client.370497673 
cookie=140579044354336
# rbd status sr-rbd-meta-one/one-427
Watchers:
watcher=192.168.48.11:0/2422413806 client.370420944 
cookie=140578306156832

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

____
From: Frank Schilder 
Sent: Monday, July 29, 2024 10:24 AM
To: ceph-users@ceph.io
Subject: [ceph-users] snaptrim not making progress

Hi all,

our cluster is octopus latest. We seem to have a problem with snaptrim. On a 
pool for HDD RBD images I observed today that all PGs are either in state 
snaptrim or snaptrim_wait. It looks like the snaptrim process does not actually 
make any progress. There is no CPU activity by these OSDs indicating they would 
do snaptrimming (usually they would at least use 50% CPU as shown in top). I 
also don't see anything in the OSD logs.

For our VMs we run daily snapshot rotation and snaptrim usually finishes within 
a few minutes. We had a VM with disks on that pool cause an error due to a 
hanging virsh domfsfreeze command. This, however, is routine, we see this 
happening every now and then without any follow-up issues. I'm wondering now if 
we might have hit a race for the first time. Is there anything on an RBD image 
or pool that could block snaptrim from starting or progressing?

Thanks for any pointers!
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 0 slow ops message stuck for down+out OSD

2024-07-29 Thread Frank Schilder
> Hi, would a mgr restart fix that?

It did! The one thing we didn't try last time. We thought the message was stuck 
in the MONs.

Thanks!
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eugen Block 
Sent: Monday, July 29, 2024 4:15 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: 0 slow ops message stuck for down+out OSD

Hi, would a mgr restart fix that?

Zitat von Frank Schilder :

> Very funny, it was actually me who made this case some time ago:
> https://www.mail-archive.com/ceph-users@ceph.io/msg10095.html
>
> I will look into what we did last time.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ____
> From: Frank Schilder 
> Sent: Monday, July 29, 2024 10:30 AM
> To: ceph-users@ceph.io
> Subject: [ceph-users] 0 slow ops message stuck for down+out OSD
>
> Hi all,
>
> we had a failing disk (with slow ops) and I shut down the OSD. Its
> status down+out. However, I still see this message stuck in the
> output of ceph status and ceph health detail:
>
> 0 slow ops, oldest one blocked for 70 sec, osd.183 has slow ops
>
> I believe there was a case about that some time ago, but I can't
> find it. How can I get rid of this stuck warning? Our cluster is
> octopus latest.
>
> Thanks and best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Bluestore issue using 18.2.2

2024-08-14 Thread Frank Schilder
Hi Eugen,

isn't every shard/replica on every OSD read and written with a checksum? Even 
if only the primary holds a checksum, it should be possible to identify the 
damaged shard/replica during deep-scrub (even for replication 1).

Apart from that, it is unusual to see a virtual disk have read errors. If it's 
some kind of pass-through mapping, there is probably something incorrectly 
configured with a write cache. Still, this would only be a problem if the VM 
dies unexpectedly. There is something off with the setup (unless the underlying 
hardware device for the VDs does actually have damage).
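
For finding the damaged copy, a sketch of what I would look at, using one of the 
PG ids from your list:

# show per-shard errors recorded by the last deep-scrub of an inconsistent PG
rados list-inconsistent-obj 2.3 --format=json-pretty | \
  jq '.inconsistents[] | {object, errors, union_shard_errors, shards}'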

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eugen Block 
Sent: Wednesday, August 14, 2024 9:05 AM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Bluestore issue using 18.2.2

Hi,

it looks like you're using size 2 pool(s); I strongly advise increasing
that to 3 (and min_size = 2). Although it's unclear why the PGs get
damaged, the repair of a PG with only two replicas is difficult: which
one is the correct one? So to avoid that, avoid pools with size 2,
except for tests and data you don't care about.
If you want to use the current situation to learn, you could try to
inspect the PGs with the ceph-objectstore-tool and find out which
replica is the correct one, export it and then inject it into the OSD.
But this can be tricky, of course.
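
Just to spell out the size change (the pool name is a placeholder):

ceph osd pool set mypool size 3
ceph osd pool set mypool min_size 2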

Regards,
Eugen

Zitat von Marianne Spiller :

> Hi,
>
> I am trying to gather experience on a Ceph STAGE cluster; it
> consists of virtual machines - which is not perfect, I know. The VMs
> are running Debian 12 and podman-4.3.1. There is practically no load
> on this Ceph - there is just one client using the storage, and it
> makes no noise. So this is what happened:
>
> * "During data consistency checks (scrub), at least one PG has been
> flagged as being damaged or inconsistent."
> * so I listed them (["2.3","2.58"])
> * and tried to repair ("ceph pg repair 2.3", "ceph pg repair 2.58")
> * they both went well (resulting in "pgs: 129 active+clean"), but
> the cluster keeped its "HEALTH_WARN" state ("Too many repaired reads
> on 1 OSDs")
> * so I googled for this message; and the only thing I found was to
> restart the OSD to get rid of this message and - more important -
> the cluster WARN state ("ceph orch daemon restart osd.3")
> * after the restart, my cluster was still in WARN state - and
> complained about "2 PGs has been flagged as being damaged or
> inconsistent" - but other PGs on other OSDs
> * I "ceph pg repair"ed them, too, and the cluster's state was WARN
> afterwards, again ("Too many repaired reads on 1 OSDs")
> * when I restarted the OSD ("ceph orch daemon restart osd.2"), the
> crash occured; Ceph marked this OSD "down" and "out" and suspected a
> hardware issue, while the OSD HDDs in fact are QEMU "harddisks"
> * I can't judge whether it's a serious bug or just due to my
> non-optimal STAGE setup, so I'll attach the gzipped log of osd.2
>
> I need help to understand what happened and how to prevent it in the
> future. What ist this "Too many repaired reads" and how to deal with
> it?
>
> Thanks a lot for reading,
>   Marianne


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Identify laggy PGs

2024-08-15 Thread Frank Schilder
The current ceph recommendation is to use between 100-200 PGs/OSD. Therefore, a 
large PG is a PG that has more data than 0.5-1% of the disk capacity and you 
should split PGs for the relevant pool.

A huge PG is a PG for which deep-scrub takes much longer than 20min on HDD and 
4-5min on SSD.

Average deep-scrub times (time it takes to deep-scrub) are actually a very good 
way of judging if PGs are too large. These times roughly correlate with the 
time it takes to copy a PG.

On SSDs we aim for 200+PGs/OSD and for HDDs for 150PGs/OSD. For very large HDD 
disks (>=16TB) we consider raising this to 300PGs/OSD due to excessively long 
deep-scrub times per PG.
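
To get a quick overview, something along these lines works (pool id 19 is a 
placeholder; the json layout is the same one the pg dump scripts on this list 
already use):

# PG sizes in GiB for one pool, largest first
ceph -f json pg dump pgs 2>/dev/null | jq -r '
  .pg_stats[] | select(.pgid | startswith("19.")) |
  [.pgid, (.stat_sum.num_bytes/1024/1024/1024|floor)] | @tsv' |
  sort -k2 -rn | head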

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Szabo, Istvan (Agoda) 
Sent: Wednesday, August 14, 2024 12:00 PM
To: Eugen Block; ceph-users@ceph.io
Subject: [ceph-users] Re: Identify laggy PGs

Just out of curiosity I've checked my PG size, which is around 150 GB. When are 
we talking about big PGs?

From: Eugen Block 
Sent: Wednesday, August 14, 2024 2:23 PM
To: ceph-users@ceph.io 
Subject: [ceph-users] Re: Identify laggy PGs

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


Hi,

how big are those PGs? If they're huge and are deep-scrubbed, for
example, that can cause significant delays. I usually look at 'ceph pg
ls-by-pool {pool}' and the "BYTES" column.

Zitat von Boris :

> Hi,
>
> currently we encouter laggy PGs and I would like to find out what is
> causing it.
> I suspect it might be one or more failing OSDs. We had flapping OSDs and I
> synced one out, which helped with the flapping, but it doesn't help with
> the laggy ones.
>
> Any tooling to identify or count PG performance and map that to OSDs?
>
>
> --
> Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im
> groüen Saal.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


This message is confidential and is for the sole use of the intended 
recipient(s). It may also be privileged or otherwise protected by copyright or 
other legal rules. If you have received it by mistake please let us know by 
reply email and delete it from your system. It is prohibited to copy this 
message or disclose its content to anyone. Any confidentiality or privilege is 
not waived or lost by any mistaken delivery or unauthorized disclosure of the 
message. All messages sent to and from Agoda may be monitored to ensure 
compliance with company policies, to protect the company's interests and to 
remove potential malware. Electronic messages may be intercepted, amended, lost 
or deleted, or contain viruses.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Is there a way to make Cephfs kernel client to write data to ceph osd smoothly with buffer io

2020-11-12 Thread Frank Schilder
Yes, that's right. It would be nice if there was a mount option to have such 
parameters adjusted on a per-file system basis. I should mention that I 
observed a significant performance improvement for HDD throughput of the local 
disk as well when adjusting these parameters for ceph.

This is largely due to the "too much memory" problem on big servers. The kernel 
defaults are suitable for machines with 4-8G of RAM. Any enterprise server will 
beat that, with the consequence of insanely large amounts of dirty buffers, 
leading to panicked buffer flushing that overloads, in particular, network file 
systems (there is a nice article by SUSE: 
https://www.suse.com/support/kb/doc/?id=17857). Adjusting these parameters 
to play nice with ceph might actually improve overall performance as a side 
effect. I would give it a go.
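
As an illustration only (the numbers are placeholders, not a recommendation), 
such a tuning could be persisted like this; note that setting the *_bytes 
variants zeroes the corresponding *_ratio settings:

# example values; tune against your own workload
cat > /etc/sysctl.d/90-writeback.conf <<'EOF'
vm.dirty_background_bytes = 268435456   # 256 MiB: start background writeback early
vm.dirty_bytes = 1073741824             # 1 GiB: hard limit before writers block
EOF
sysctl -p /etc/sysctl.d/90-writeback.conf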

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Sage Meng 
Sent: 12 November 2020 16:00:08
To: Frank Schilder
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Is there a way to make Cephfs kernel client to write 
data to ceph osd smoothly with buffer io

vm.dirty_bytes and vm.dirty_background_bytes are both system-wide control 
parameters; adjusting them will influence all jobs on the system. It would be 
better to have a Ceph-specific way to make the transfer smoother.

Frank Schilder  wrote on Wednesday, November 11, 2020 at 3:28 PM:
These kernel parameters influence the flushing of data, and also performance:

vm.dirty_bytes
vm.dirty_background_bytes

Smaller vm.dirty_background_bytes will make the transfer more smooth and the 
ceph cluster will like that. However, it reduces the chances of merge 
operations in cache and the ceph cluster will not like that. The tuning is 
heavily workload dependent. Test with realistic workloads and a reasonably 
large spectrum of values. I got good results by tuning down 
vm.dirty_background_bytes just to the point when it reduced client performance 
of copying large files.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Sage Meng 
Sent: 06 November 2020 13:45:53
To: ceph-users@ceph.io
Subject: [ceph-users] Is there a way to make Cephfs kernel client to write data 
to ceph osd smoothly with buffer io

Hi All,

  The CephFS kernel client is influenced by the kernel page cache when we write
data to it; the outgoing data burst will be huge when the OS starts flushing
the page cache. So is there a way to make the CephFS kernel client write data
to the Ceph OSDs smoothly when buffered IO is used?
___
ceph-users mailing list -- ceph-users@ceph.io<mailto:ceph-users@ceph.io>
To unsubscribe send an email to 
ceph-users-le...@ceph.io<mailto:ceph-users-le...@ceph.io>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: which of cpu frequency and number of threads servers osd better?

2020-11-13 Thread Frank Schilder
I think this depends on the type of backing disk. We use the following CPUs:

Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz
Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz
Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz

My experience is that an HDD OSD hardly gets to 100% of 1 hyper-thread load 
even under heavy recovery/rebalance operations on 8+2 and 6+2 EC pools with 
compression set to aggressive. The CPU is mostly doing wait-IO, that is, the 
disk is the real bottleneck, not the processor power. With SSDs I have seen 
2 HT at 100% and 2 more at 50% each. I guess NVMe might be more demanding.

A server with 12 HDD and 1 SSD should be fine with a modern CPU with 8 cores. 
16 threads sounds like an 8 core CPU. The 2nd generation Intel® Xeon® Silver 
4209T with 8 cores should easily handle that (single socket system). We have 
the 16-core Intel silver in a dual socket system currently connected to 5HDD 
and 7SSD and I did a rebalance operation yesterday. The CPU user load did not 
exceed 2%; it can handle OSD processes easily. The server is dimensioned to run 
up to 12HDD and 14SSD OSDs (Dell R740xd2). As far as I can tell, the CPU 
configuration is overpowered for that.

Just for info, we use ganglia to record node utilisation. I use 1-year records 
and pick peak loads I observed for dimensioning the CPUs. These records include 
some very heavy recovery periods.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Tony Liu 
Sent: 13 November 2020 04:57:53
To: Nathan Fish
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: which of cpu frequency and number of threads servers 
osd better?

Thanks Nathan!
Tony
> -Original Message-
> From: Nathan Fish 
> Sent: Thursday, November 12, 2020 7:43 PM
> To: Tony Liu 
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] which of cpu frequency and number of threads
> servers osd better?
>
> From what I've seen, OSD daemons tend to bottleneck on the first 2
> threads, while getting some use out of another 2. So 32 threads at 3.0
> would be a lot better. Note that you may get better performance
> splitting off some of that SSD for block.db partitions or at least
> block.wal for the HDDs.
>
> On Thu, Nov 12, 2020 at 9:57 PM Tony Liu  wrote:
> >
> > Hi,
> >
> > For example, 16 threads with 3.2GHz and 32 threads with 3.0GHz, which
> > makes 11 OSDs (10x12TB HDD and 1x960GB SSD) with better performance?
> >
> >
> > Thanks!
> > Tony
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> > email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: which of cpu frequency and number of threads servers osd better?

2020-11-13 Thread Frank Schilder
> If each OSD requires 4T

Nobody said that. What was said is HDD=1T,  SSD=3T. It depends on the drive 
type!

The %-utilisation information is just from top, observed during heavy load. It 
does not show how the kernel schedules things on physical Ts. So, 2x50% 
utilisation could run on the same HT. I don't know how the OSDs are organised 
into threads; I'm just stating observations from real life (mimic cluster). So, 
for an SSD OSD I have seen a maximum of 4 threads in R state, two with 100% and 
two with 50% CPU, a load that fits on 3HT.

So, real life says 1HT per HDD and 3HT per SSD plus a bit for kernel and 
networking and you are set - based on worst-case performance monitoring I have 
seen in 2 years. Note that this is worst-case load. The average load is much 
lower.

A 16 core machine is totally overpowered. Assuming 1C=2HT, I count 
(2*3+8*1)/2=7 or (1*3+10*1)/2=6.5. So an 8 core CPU should do in either case. A 
10 core CPU might be better, but 16C is a waste of money.

I should mention that these estimates apply to Intel CPUs (x86_64 
architectures). Other architectures might not provide the same cycle efficiency.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Tony Liu 
Sent: 13 November 2020 08:32:55
To: Frank Schilder; Nathan Fish
Cc: ceph-users@ceph.io
Subject: RE: [ceph-users] Re: which of cpu frequency and number of threads 
servers osd better?

You all mentioned first 2T and another 2T. Could you give more
details how OSD works with multi-thread, or share the link if
it's already documented somewhere?

Is it always 4T, or start with 1T and grow up to 4T? Is it max 4T?
Does each T run different job or just multiple instances of the
same job? Does disk type affect how T works, like 1T is good enough
for HDD while 4T is required for SSD?

If I change my plan to 2 SSD OSDs and 8 HDD OSDs (with 1 SSD for
WAL and DB). If each OSD requires 4T, then 16C/32T 3.0GHz could
be a better choice, because it provides sufficient Ts?
If SSD OSD requires 4T and HDD OSD only requires 1T, then 8C/16T
3.2GHz would be better, because it provides sufficient Ts as well
as stronger computing?

Thanks!
Tony
> -Original Message-
> From: Frank Schilder 
> Sent: Thursday, November 12, 2020 10:59 PM
> To: Tony Liu ; Nathan Fish 
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] Re: which of cpu frequency and number of
> threads servers osd better?
>
> I think this depends on the type of backing disk. We use the following
> CPUs:
>
> Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz
> Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz
> Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz
>
> My experience is, that a HDD OSD hardly gets to 100% of 1 hyper thread
> load even under heavy recovery/rebalance operations on 8+2 and 6+2 EC
> pools with compression set to aggressive. The CPU is mostly doing wait-
> IO, that is, the disk is the real bottle neck, not the processor power.
> With SSDs I have seen 2HT at 100% and 2 more at 50% each. I guess NVMe
> might be more demanding.
>
> A server with 12 HDD and 1 SSD should be fine with a modern CPU with 8
> cores. 16 threads sounds like an 8 core CPU. The 2nd generation Intel®
> Xeon® Silver 4209T with 8 cores should easily handle that (single socket
> system). We have the 16-core Intel silver in a dual socket system
> currently connected to 5HDD and 7SSD and I did a rebalance operation
> yesterday. The CPU user load did not exceed 2%, it can handle OSD
> processes easily. The server is dimensioned to run up to 12HDD and 14SSD
> OSDs (Dell R740xd2). As far as I can tell, the CPU configuration is
> overpowered for that.
>
> Just for info, we use ganglia to record node utilisation. I use 1-year
> records and pick peak loads I observed for dimensioning the CPUs. These
> records include some very heavy recovery periods.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Tony Liu 
> Sent: 13 November 2020 04:57:53
> To: Nathan Fish
> Cc: ceph-users@ceph.io
> Subject: [ceph-users] Re: which of cpu frequency and number of threads
> servers osd better?
>
> Thanks Nathan!
> Tony
> > -Original Message-
> > From: Nathan Fish 
> > Sent: Thursday, November 12, 2020 7:43 PM
> > To: Tony Liu 
> > Cc: ceph-users@ceph.io
> > Subject: Re: [ceph-users] which of cpu frequency and number of threads
> > servers osd better?
> >
> > From what I've seen, OSD daemons tend to bottleneck on the first 2
> > threads, while getting some use out of another 2. So 32 threads at 3.0
> > would be a lot better. Note that you may get better performance
> > splitting off some of that SSD for bl

[ceph-users] Re: which of cpu frequency and number of threads servers osd better?

2020-11-14 Thread Frank Schilder
Yeah, I forgot to mention this. Our HDD OSDs are the simplest set-up: WAL, DB,
BLOCK all collocated on the HDD. My plan for the future is to use dm-cache for 
LVM OSDs instead of WAL/DB device. Then I might also see some more CPU 
utilisation with small-file I/O. From the question and the suggested per-server 
disk configuration I assumed that the target config would be everything 
collocated on HDD.
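
As a rough sketch of what that could look like with LVM (names like ceph-vg0 and
osd0-cache are made-up placeholders; a ceph-volume OSD has its own ceph-<uuid>
VG and osd-block-<uuid> LV, and older LVM versions need --cachepool instead of
--cachevol):

# add the NVMe to the OSD's volume group and carve out a cache LV on it
pvcreate /dev/nvme0n1
vgextend ceph-vg0 /dev/nvme0n1
lvcreate -n osd0-cache -L 500G ceph-vg0 /dev/nvme0n1
# attach it as dm-cache to the OSD's block LV (writethrough by default)
lvconvert --type cache --cachevol osd0-cache ceph-vg0/osd-block-0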

By the way, is there any documentation on "osd op num threads per shard" 
somewhere? Can you post a link?

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Nathan Fish 
Sent: 13 November 2020 21:07:20
To: Frank Schilder
Cc: Tony Liu; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: which of cpu frequency and number of threads 
servers osd better?

We have 12TB HDD OSDs with 32GiB of (Optane) NVMe for block.db, used
for cephfs_data pools, and NVMe-only OSDs used for cephfs_data pools.
The NVMe DB about doubled our random IO performance - a great
investment - doubling max CPU load as a result. We had to turn up "osd
op num threads per shard hdd" from 1 to 2. (2 is the default for
SSDs). This didn't noticeably improve performance, but without it,
OSDs under max load would sometimes fail to respond to heartbeats. So
with the load that we have - millions of mostly small files on CephFS
- I wouldn't go below 2 real cores per OSD. But this may be a fringe
workload.
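
For reference, on releases with the central config store that setting can be
changed roughly like this (a sketch; the OSDs typically need a restart before
the new shard/thread layout takes effect, and on older releases it goes into
ceph.conf instead):

ceph config set osd osd_op_num_threads_per_shard_hdd 2
systemctl restart ceph-osd.target          # per OSD host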

On Fri, Nov 13, 2020 at 3:36 AM Frank Schilder  wrote:
>
> > If each OSD requires 4T
>
> Nobody said that. What was said is HDD=1T,  SSD=3T. It depends on the drive 
> type!
>
> The %-utilisation information is just from top observed during heavy load. It 
> does not show how the kernel schedules things on physical Ts. So, 2x50% 
> utilisation could run on the same HT. I don't know how the OSDs are organised 
> into threads, I'm just stating observations from real life (mimic cluster). 
> So, for an SSD OSD I have seen a maximum of 4 threads in R state, two with 
> 100% and two with 50% CPU, a load that fits on 3HT.
>
> So, real life says 1HT per HDD and 3HT per SSD plus a bit for kernel and 
> networking and you are set - based on worst-case performance monitoring I 
> have seen in 2 years. Note that this is worst-case load. The average load is 
> much lower.
>
> A 16 core machine is totally overpowered. Assuming 1C=2HT, I count 
> (2*3+8*1)/2=7 or (1*3+10*1)/2=6.5. So an 8 core CPU should do in either case. 
> A 10 core CPU might be better, but 16C is a waste of money.
>
> I should mention that these estimates apply to Intel CPUs (x86_64 
> architectures). Other architectures might not provide the same cycle 
> efficiency.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Tony Liu 
> Sent: 13 November 2020 08:32:55
> To: Frank Schilder; Nathan Fish
> Cc: ceph-users@ceph.io
> Subject: RE: [ceph-users] Re: which of cpu frequency and number of threads 
> servers osd better?
>
> You all mentioned first 2T and another 2T. Could you give more
> details how OSD works with multi-thread, or share the link if
> it's already documented somewhere?
>
> Is it always 4T, or start with 1T and grow up to 4T? Is it max 4T?
> Does each T run different job or just multiple instances of the
> same job? Does disk type affect how T works, like 1T is good enough
> for HDD while 4T is required for SSD?
>
> If I change my plan to 2 SSD OSDs and 8 HDD OSDs (with 1 SSD for
> WAL and DB). If each OSD requires 4T, then 16C/32T 3.0GHz could
> be a better choice, because it provides sufficient Ts?
> If SSD OSD requires 4T and HDD OSD only requires 1T, then 8C/16T
> 3.2GHz would be better, because it provides sufficient Ts as well
> as stronger computing?
>
> Thanks!
> Tony
> > -Original Message-
> > From: Frank Schilder 
> > Sent: Thursday, November 12, 2020 10:59 PM
> > To: Tony Liu ; Nathan Fish 
> > Cc: ceph-users@ceph.io
> > Subject: Re: [ceph-users] Re: which of cpu frequency and number of
> > threads servers osd better?
> >
> > I think this depends on the type of backing disk. We use the following
> > CPUs:
> >
> > Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz
> > Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz
> > Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz
> >
> > My experience is, that a HDD OSD hardly gets to 100% of 1 hyper thread
> > load even under heavy recovery/rebalance operations on 8+2 and 6+2 EC
> > pools with compression set to aggressive. The CPU is mostly doing wait-
> > IO, that is, the disk is the real bottle neck, not the processor power.
> > With

[ceph-users] Re: which of cpu frequency and number of threads servers osd better?

2020-11-14 Thread Frank Schilder
I found this here: 
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html/configuration_guide/osd_configuration_reference#operations
 . Nothing in the ceph docs. It would be interesting to know what a shard is 
and what it does. Can anyone shed a bit of light on this?
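
In the meantime, the values an OSD is actually running with can be read off the
admin socket or the config store (a sketch; osd.0 is just an example id):

ceph daemon osd.0 config show | grep -E 'osd_op_num_(shards|threads_per_shard)'
ceph config show osd.0 | grep osd_op_num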

Thanks!
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: 14 November 2020 10:48:57
To: Nathan Fish
Cc: Tony Liu; ceph-users@ceph.io
Subject: [ceph-users] Re: which of cpu frequency and number of threads servers 
osd better?

Yeah, I forgot to mention this. Our HDD OSDs are the simplest set-up: WAL, DB,
BLOCK all collocated on the HDD. My plan for the future is to use dm-cache for 
LVM OSDs instead of WAL/DB device. Then I might also see some more CPU 
utilisation with small-file I/O. From the question and the suggested per-server 
disk configuration I assumed that the target config would be everything 
collocated on HDD.

By the way, is there any documentation on "osd op num threads per shard" 
somewhere? Can you post a link?

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Nathan Fish 
Sent: 13 November 2020 21:07:20
To: Frank Schilder
Cc: Tony Liu; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: which of cpu frequency and number of threads 
servers osd better?

We have 12TB HDD OSDs with 32GiB of (Optane) NVMe for block.db, used
for cephfs_data pools, and NVMe-only OSDs used for cephfs_data pools.
The NVMe DB about doubled our random IO performance - a great
investment - doubling max CPU load as a result. We had to turn up "osd
op num threads per shard hdd" from 1 to 2. (2 is the default for
SSDs). This didn't noticeably improve performance, but without it,
OSDs under max load would sometimes fail to respond to heartbeats. So
with the load that we have - millions of mostly small files on CephFS
- I wouldn't go below 2 real cores per OSD. But this may be a fringe
workload.

On Fri, Nov 13, 2020 at 3:36 AM Frank Schilder  wrote:
>
> > If each OSD requires 4T
>
> Nobody said that. What was said is HDD=1T,  SSD=3T. It depends on the drive 
> type!
>
> The %-utilisation information is just from top observed during heavy load. It 
> does not show how the kernel schedules things on physical Ts. So, 2x50% 
> utilisation could run on the same HT. I don't know how the OSDs are organised 
> into threads, I'm just stating observations from real life (mimic cluster). 
> So, for an SSD OSD I have seen a maximum of 4 threads in R state, two with 
> 100% and two with 50% CPU, a load that fits on 3HT.
>
> So, real life says 1HT per HDD and 3HT per SSD plus a bit for kernel and 
> networking and you are set - based on worst-case performance monitoring I 
> have seen in 2 years. Note that this is worst-case load. The average load is 
> much lower.
>
> A 16 core machine is totally overpowered. Assuming 1C=2HT, I count 
> (2*3+8*1)/2=7 or (1*3+10*1)/2=6.5. So an 8 core CPU should do in either case. 
> A 10 core CPU might be better, but 16C is a waste of money.
>
> I should mention that these estimates apply to Intel CPUs (x86_64 
> architectures). Other architectures might not provide the same cycle 
> efficiency.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Tony Liu 
> Sent: 13 November 2020 08:32:55
> To: Frank Schilder; Nathan Fish
> Cc: ceph-users@ceph.io
> Subject: RE: [ceph-users] Re: which of cpu frequency and number of threads 
> servers osd better?
>
> You all mentioned first 2T and another 2T. Could you give more
> details how OSD works with multi-thread, or share the link if
> it's already documented somewhere?
>
> Is it always 4T, or start with 1T and grow up to 4T? Is it max 4T?
> Does each T run different job or just multiple instances of the
> same job? Does disk type affect how T works, like 1T is good enough
> for HDD while 4T is required for SSD?
>
> If I change my plan to 2 SSD OSDs and 8 HDD OSDs (with 1 SSD for
> WAL and DB). If each OSD requires 4T, then 16C/32T 3.0GHz could
> be a better choice, because it provides sufficient Ts?
> If SSD OSD requires 4T and HDD OSD only requires 1T, then 8C/16T
> 3.2GHz would be better, because it provides sufficient Ts as well
> as stronger computing?
>
> Thanks!
> Tony
> > -Original Message-
> > From: Frank Schilder 
> > Sent: Thursday, November 12, 2020 10:59 PM
> > To: Tony Liu ; Nathan Fish 
> > Cc: ceph-users@ceph.io
> > Subject: Re: [ceph-users] Re: which of cpu frequency and number of
> > threads servers osd better?
> >
> >

[ceph-users] Re: which of cpu frequency and number of threads servers osd better?

2020-11-14 Thread Frank Schilder
My plan is to use at least 500GB NVMe per HDD OSD. I have not started that yet, 
but there are threads of other people sharing their experience. If you go 
beyond 300GB per OSD, apparently the WAL/DB options cannot really use the extra 
capacity. With dm-cache or the like you would additionally start holding hot 
data in cache.

Ideally, I can split a 4TB or even a 8TB NVMe over 6 OSDs.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Anthony D'Atri 
Sent: 14 November 2020 10:57:57
To: Frank Schilder
Subject: Re: [ceph-users] Re: which of cpu frequency and number of threads 
servers osd better?

Guten Tag.

> My plan for the future is to use dm-cache for LVM OSDs instead of WAL/DB 
> device.

Do you have any insights into the benefits of that approach instead of WAL/DB, 
and of dm-cache vs bcache vs dm-writecache vs … ?  And any for sizing the cache 
device and handling failures?  Presumably the DB will be active enough that it 
will persist in the cache, so sizing should be at a minimum what is needed to hold 2
copies of the DB to accommodate compaction?

I have an existing RGW cluster on HDDs that utilizes a cache tier; the high 
water mark is set fairly low so that it doesn’t fill up, something that 
apparently happened last Christmas.  I’ve been wanting to get a feel for OSD 
cache as an alternative to deprecated and fussy cache tiering, as well as 
something like a Varnish cache on RGW load balancers to short-circuit small
requests.

— Anthony


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: which of cpu frequency and number of threads servers osd better?

2020-11-16 Thread Frank Schilder
We are starting to use 18TB spindles, have loads of cold data and only a thin 
layer of hot data. One 4/8TB NVMe drive as a cache in front of 6x18TB will 
provide close to or even matching SSD performance for the hot data at a 
reasonable extra cost per TB storage. My plan is to wait for 1-2 more years for 
prices for PCI-NVMe to drop and then start using this method. The second 
advantage is, that one can continue to deploy collocated HDD OSDs as WAL/DB 
will certainly land and stay in cache. The cache can be added to existing OSDs 
without redeployment. In addition, dm-cache uses a hit count method for 
computing promotion to cache, which works very different from promotion to ceph 
cache pools. Dm-cache can afford that due to its local nature. In particular, 
it doesn't promote on just 1 access, which means that a weekly or monthly 
backup will not flush the entire cache every time.

All SSD pools for this data (ceph-fs in EC pool on HDD) will be unaffordable to 
us for a long time. Not to mention that these large SSDs are almost certainly 
QLC, which have much less sustained throughput compared with the 18TB He-drives 
(they have higher IOP/s though, which is not so relevant for our FS use 
workloads). The cache method will provide at least the additional IOP/s that 
WAL/DB devices would, but due to its size also data caching. We need to go
NVMe, because the servers we plan to use (R740xd2) provide the largest capacity 
configuration with 24xHDD+4xPCI NVMe. You can either choose 2 extra drives or 4 
PCI NVMe, but not both. So, NVMe cannot be exchanged by fast SSDs as they would 
eat drive slots.

There were a few threads over the past 1-2 years where people dropped in some 
of these observations and I just took note of it. It is used in production 
already and from what I got people are happy with it. Much easier than WAL/DB 
partitions plus all the sizing problems for L0/L1/... are sorted trivially. 
With the size of NVMe growing rapidly beyond what WAL/DB devices can utilize 
and since LVM is the new OSD device, using LVM dm-cache seems to be the way 
forward for me.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Anthony D'Atri 
Sent: 16 November 2020 03:00:38
To: Frank Schilder
Subject: Re: [ceph-users] Re: which of cpu frequency and number of threads 
servers osd better?

Thanks.  I’m curious how the economics for that compare with just using all 
SSDs:

* HDDs are cheaper
* But colo SSDs are operationally simpler
* And depending on configuration you can provision a cheaper HBA


> On Nov 14, 2020, at 2:04 AM, Frank Schilder  wrote:
>
> My plan is to use at least 500GB NVMe per HDD OSD. I have not started that 
> yet, but there are threads of other people sharing their experience. If you 
> go beyond 300GB per OSD, apparently the WAL/DB options cannot really use the 
> extra capacity. With dm-cache or the like you would additionally start 
> holding hot data in cache.
>
> Ideally, I can split a 4TB or even a 8TB NVMe over 6 OSDs.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Anthony D'Atri 
> Sent: 14 November 2020 10:57:57
> To: Frank Schilder
> Subject: Re: [ceph-users] Re: which of cpu frequency and number of threads 
> servers osd better?
>
> Guten Tag.
>
>> My plan for the future is to use dm-cache for LVM OSDs instead of WAL/DB 
>> device.
>
> Do you have any insights into the benefits of that approach instead of 
> WAL/DB, and of dm-cache vs bcache vs dm-writecache vs … ?  And any for sizing 
> the cache device and handling failures?  Presumably the DB will be active 
> enough that it will persist in the cache, so sizing should be at a minimum 
> that to hold 2 copies of the DB to accomodate compaction?
>
> I have an existing RGW cluster on HDDs that utilizes a cache tier; the high 
> water mark is set fairly low so that it doesn’t fill up, something that 
> apparently happened last Christmas.  I’ve been wanting to get a feel for OSD 
> cache as an alternative to deprecated and fussy cache tiering, as well as 
> something like a Varnish cache on RGW load balancers to short-circult small 
> requests.
>
> — Anthony
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: (Ceph Octopus) Repairing a neglected Ceph cluster - Degraded Data Reduncancy, all PGs degraded, undersized, not scrubbed in time

2020-11-16 Thread Frank Schilder
To throw in my 5 cents: choosing m in k+m EC replication is not arbitrary, and
the argument that someone with a larger m could always dismiss any lower m as
wrong does not hold either.

Why are people recommending m>=2 for production (or R>=3 replicas)?

Its very simple. What is forgotten below is maintenance. Whenever you do 
maintenance on ceph, there will be longer episodes of degraded redundancy as 
OSDs are down. However, on production storage systems, writes *always* need to 
go to redundant storage. Hence, minimum redundancy under maintenance is the 
keyword here.

With m=1 (R=2) one could never do any maintenance without down time as shutting 
down just 1 OSD would imply writes to non-redundant storage, which in turn 
would mean data loss in case a disk dies during maintenance.

Basically, with m parity shards you can do maintenance on m-1 failure domains 
at the same time without downtime or non-redundant writes. With R copies you 
can do maintenance on R-2 failure domains without downtime.

If your SLAs require higher minimum redundancy at all times, m (R) need to be 
large enough to allow maintenance unless you do downtime. However, the latter 
would be odd, because one of the key features of ceph is its ability to 
provides infinite uptime while hardware gets renewed all the time.
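
To illustrate the maintenance argument with commands (a sketch; profile, pool
name and PG count are made-up examples): an 8+2 pool with min_size=k+1 keeps
acknowledged writes redundant while one failure domain is down, and pauses I/O
rather than writing without redundancy if a second one fails.

ceph osd erasure-code-profile set ec-8-2 k=8 m=2 crush-failure-domain=host
ceph osd pool create ec.data 256 256 erasure ec-8-2
ceph osd pool set ec.data min_size 9       # k+1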

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Hans van den Bogert 
Sent: 16 November 2020 12:59:31
Cc: ceph-users
Subject: [ceph-users] Re: (Ceph Octopus) Repairing a neglected Ceph cluster - 
Degraded Data Reduncancy, all PGs degraded, undersized, not scrubbed in time

I think we're deviating from the original thread quite a bit and I would
never argue that in a production environment with plenty OSDs you should
go for R=2 or K+1, so my example cluster which happens to be 2+1 is a
bit unlucky.

However I'm interested in the following

On 11/16/20 11:31 AM, Janne Johansson wrote:
 > So while one could always say "one more drive is better than your
 > amount", there are people losing data with repl=2 or K+1 because some
 > more normal operation was in flight and _then_ a single surprise
 > happens.  So you can have a weird reboot, causing those PGs needing
 > backfill later, and if one of the uptodate hosts have any single
 > surprise during the recovery, the cluster will lack some of the current
 > data even if two disks were never down at the same time.

I'm not sure I follow, from a logical perspective they *are* down at the
same time right? In your scenario 1 up-to-date  replica was left, but
even that had a surprise. Okay well that's the risk you take with R=2,
but it's not intrinsically different than R=3.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD memory leak?

2020-11-16 Thread Frank Schilder
Dear all,

I collected memory allocation data over a period of 2 months; see the graphs 
here: <https://imgur.com/a/R0q6nzP>. I need to revise my statement about 
accelerated growth. The new graphs indicate that we are looking at linear 
growth, that is, probably a small memory leak in a regularly called function. I 
think the snippets of the heap stats and memory profiling output below should 
give a clue about where to look.

Osd 195 is using about 2.1GB more than it should, the memory limit is 2GB:

osd.195 tcmalloc heap stats:
MALLOC: 4555926984 ( 4344.9 MiB) Bytes in use by application
MALLOC: +0 (0.0 MiB) Bytes in page heap freelist
MALLOC: +288132120 (  274.8 MiB) Bytes in central cache freelist
MALLOC: + 12879104 (   12.3 MiB) Bytes in transfer cache freelist
MALLOC: + 20619552 (   19.7 MiB) Bytes in thread cache freelists
MALLOC: + 33292288 (   31.8 MiB) Bytes in malloc metadata
MALLOC:   
MALLOC: =   4910850048 ( 4683.4 MiB) Actual memory used (physical + swap)
MALLOC: +865198080 (  825.1 MiB) Bytes released to OS (aka unmapped)
MALLOC:   
MALLOC: =   5776048128 ( 5508.5 MiB) Virtual address space used
MALLOC:
MALLOC: 470779  Spans in use
MALLOC: 35  Thread heaps in use
MALLOC:   8192  Tcmalloc page size

Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.
{
"error": "(0) Success",
"success": true
}

It looks like the vast majority of this leak occurs in "ceph::decode"; see the 
top of the heap profiler allocation stats:

Total: 4011.6 MB
  1567.9  39.1%  39.1%   1815.3  45.3% ceph::decode
   457.7  11.4%  50.5%457.7  11.4% rocksdb::BlockFetcher::ReadBlockContents
   269.7   6.7%  57.2%269.7   6.7% std::vector::_M_default_append
   256.0   6.4%  63.6%256.0   6.4% rocksdb::Arena::AllocateNewBlock
   243.9   6.1%  69.7%243.9   6.1% std::_Rb_tree::_M_emplace_hint_unique
   184.1   4.6%  74.3%184.1   4.6% CrushWrapper::get_leaves
   174.6   4.4%  78.6%174.6   4.4% ceph::buffer::create_aligned_in_mempool
   170.3   4.2%  82.9%170.3   4.2% ceph::buffer::malformed_input::what
   125.2   3.1%  86.0%191.4   4.8% PGLog::IndexedLog::add
   101.1   2.5%  88.5%101.1   2.5% CrushWrapper::decode_crush_bucket

Does this already help? If not, I collected 126GB of data from the heap 
profiler.

It would be great if this leak could be closed. It would be enough to extend 
the uptime of an OSD to cover usual maintenance windows.

By the way, increasing the cache_min value helped a lot. The OSD kept a healthy 
amount of ONODE items in cache despite the leak. Users noticed the improvement.
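
For reference, the heap numbers above come from the tcmalloc admin commands, and
the cache_min value mentioned at the end presumably refers to
osd_memory_cache_min (treat that option name as an assumption); roughly:

ceph tell osd.195 heap stats               # the tcmalloc summary quoted above
ceph tell osd.195 heap release             # hand freed memory back to the OS
# example value only (1 GiB); keeps a minimum onode/bluestore cache:
ceph config set osd osd_memory_cache_min 1073741824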

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

____
From: Frank Schilder 
Sent: 31 August 2020 19:50:57
To: Mark Nelson; Dan van der Ster; ceph-users
Subject: [ceph-users] Re: OSD memory leak?

Looks like the image attachment got removed. Please find it here: 
https://imgur.com/a/3tabzCN

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

____
From: Frank Schilder 
Sent: 31 August 2020 14:42
To: Mark Nelson; Dan van der Ster; ceph-users
Subject: [ceph-users] Re: OSD memory leak?

Hi Dan and Mark,

sorry, took a bit longer. I uploaded a new archive containing files with the 
following format 
(https://files.dtu.dk/u/jb0uS6U9LlCfvS5L/heap_profiling-2020-08-31.tgz?l - 
valid 60 days):

- osd.195.profile.*.heap - raw heap dump file
- osd.195.profile.*.heap.txt - output of conversion with --text
- osd.195.profile.*.heap-base0001.txt - output of conversion with --text 
against first dump as base
- osd.195.*.heap_stats - output of ceph daemon osd.195 heap stats, every hour
- osd.195.*.mempools - output of ceph daemon osd.195 dump_mempools, every hour
- osd.195.*.perf - output of ceph daemon osd.195 perf dump, every hour, 
counters are reset

Only for the last couple of days are converted files included, post-conversion 
of everything simply takes too long.

Please find also attached a recording of memory usage on one of the relevant 
OSD nodes. I marked restarts of all OSDs/the host with vertical red lines. What 
is worrying is the self-amplifying nature of the leak. ts not a linear process, 
it looks at least quadratic if not exponential. What we are looking for is, 
given the comparably short uptime, probably still in the lower percentages with 
increasing rate. The OSDs just started to overrun their limit:

top - 14:38:49 up 155 days, 19:17,  1 user,  load average: 5.99, 4.59, 4.59
Tasks: 684 total,   1 running, 293 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.9 us,  0.9 sy,  0.0 ni, 89.6 

[ceph-users] MGR restart loop

2020-11-17 Thread Frank Schilder
Dear cephers,

I have a problem with MGR daemons, ceph version mimic-13.2.8. I'm trying to do 
maintenance on our MON/MGR servers and am through with 2 out of 3. I have MON 
and MGR collocated on a host, 3 hosts in total. So far, procedure was to stop 
the daemons on the server and do the maintenance. Now I'm stuck at the last
server, because MGR fail-over does not work. The remaining MGR instances go 
into a restart loop.

In an attempt to mitigate this, I stopped all but 1 MGR on a node that is done 
with maintenance. Everything fine. However, as soon as I stop the last MON I 
need to do maintenance on, the last remaining MGR goes into a restart loop all 
by itself. As far as I can see, the MGR does actually not restart, it just gets 
thrown out of the cluster. Here is a ceph status before stopping mon.ceph-01:

[root@ceph-01 ~]# ceph status
  cluster:
id: xxx
health: HEALTH_WARN
1 pools nearfull
 
  services:
mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
mgr: ceph-03(active)
mds: con-fs2-1/1/1 up  {0=ceph-08=up:active}, 1 up:standby-replay
osd: 302 osds: 281 up, 281 in
 
  data:
pools:   11 pools, 3215 pgs
objects: 334.1 M objects, 689 TiB
usage:   877 TiB used, 1.1 PiB / 1.9 PiB avail
pgs: 3208 active+clean
 7 active+clean+scrubbing+deep

As soon as I stop mon.ceph-01, all hell breaks loose. Note that mgr.ceph-03 is 
collocated with mon.ceph-03 and we have quorum between mon.ceph-02 and 
mon.ceph-03. Here ceph status snapshots after shutting down mon.ceph-01:

[root@ceph-01 ~]# ceph status
  cluster:
id: xxx
health: HEALTH_WARN
1 pools nearfull
1/3 mons down, quorum ceph-02,ceph-03
 
  services:
mon: 3 daemons, quorum ceph-02,ceph-03, out of quorum: ceph-01
mgr: ceph-03(active)
mds: con-fs2-1/1/1 up  {0=ceph-08=up:active}, 1 up:standby-replay
osd: 302 osds: 281 up, 281 in
 
  data:
pools:   11 pools, 3215 pgs
objects: 334.1 M objects, 689 TiB
usage:   877 TiB used, 1.1 PiB / 1.9 PiB avail
pgs: 3207 active+clean
 8 active+clean+scrubbing+deep

[root@ceph-01 ~]# ceph status
  cluster:
id: xxx
health: HEALTH_WARN
no active mgr
1 pools nearfull
1/3 mons down, quorum ceph-02,ceph-03
 
  services:
mon: 3 daemons, quorum ceph-02,ceph-03, out of quorum: ceph-01
mgr: no daemons active
mds: con-fs2-1/1/1 up  {0=ceph-08=up:active}, 1 up:standby-replay
osd: 302 osds: 281 up, 281 in
 
  data:
pools:   11 pools, 3215 pgs
objects: 334.1 M objects, 689 TiB
usage:   877 TiB used, 1.1 PiB / 1.9 PiB avail
pgs: 3207 active+clean
 8 active+clean+scrubbing+deep

[root@ceph-01 ~]# ceph status
  cluster:
id: xxx
health: HEALTH_WARN
1 pools nearfull
1/3 mons down, quorum ceph-02,ceph-03
 
  services:
mon: 3 daemons, quorum ceph-02,ceph-03, out of quorum: ceph-01
mgr: ceph-03(active, starting)
mds: con-fs2-1/1/1 up  {0=ceph-08=up:active}, 1 up:standby-replay
osd: 302 osds: 281 up, 281 in
 
  data:
pools:   11 pools, 3215 pgs
objects: 334.1 M objects, 689 TiB
usage:   877 TiB used, 1.1 PiB / 1.9 PiB avail
pgs: 3207 active+clean
 8 active+clean+scrubbing+deep

It is cycling through these 3 states and I couldn't find a reason why. The node 
ceph-01 is not special in any way.

Any hint would be greatly appreciated.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MGR restart loop

2020-11-17 Thread Frank Schilder
Addition: This happens only when I stop mon.ceph-01, I can stop any other MON 
daemon without problems. I checked network connectivity and all hosts can see 
all other hosts.

I already increased mon_mgr_beacon_grace to a huge value due to another bug a 
long time ago:

global advanced mon_mgr_beacon_grace 86400
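
For reference, that is the config-store view; the value was set and can be
checked roughly like this:

ceph config set global mon_mgr_beacon_grace 86400
ceph config dump | grep mon_mgr_beacon_grace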

This restart cycle seems to have another reason. The log contains this line 
just before the MGR goes out:

Nov 17 16:10:10 ceph-03 journal: 2020-11-17 16:10:10.179 7f7c544ea700  1 mgr 
send_beacon active
Nov 17 16:10:10 ceph-03 journal: 2020-11-17 16:10:10.193 7f7c544ea700  0 
log_channel(cluster) log [DBG] : pgmap v4: 3215 pgs: 3208 active+clean, 7 
active+clean+scrubbing+deep; 689 TiB data, 877 TiB used, 1.1 PiB / 1.9 PiB avail
Nov 17 16:10:10 ceph-02 journal: debug 2020-11-17 16:10:10.270 7f7bc2363700  0 
log_channel(cluster) log [INF] : Manager daemon ceph-03 is unresponsive.  No 
standby daemons available.
Nov 17 16:10:10 ceph-02 journal: debug 2020-11-17 16:10:10.270 7f7bc2363700  0 
log_channel(cluster) log [WRN] : Health check failed: no active mgr (MGR_DOWN)
Nov 17 16:10:10 ceph-02 journal: debug 2020-11-17 16:10:10.313 7f756700  0 
log_channel(cluster) log [DBG] : mgrmap e1330: no daemons active
Nov 17 16:10:10 ceph-03 journal: 2020-11-17 16:10:10.340 7f7c57cf1700 -1 mgr 
handle_mgr_map I was active but no longer am
Nov 17 16:10:10 ceph-03 journal: 2020-11-17 16:10:10.340 7f7c57cf1700  1 mgr 
respawn  e: '/usr/bin/ceph-mgr'

The beacon has been sent. Why does it not arrive at the MONs? There is only 
little load right now.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

____
From: Frank Schilder 
Sent: 17 November 2020 16:25:36
To: ceph-users@ceph.io
Subject: [ceph-users] MGR restart loop

Dear cephers,

I have a problem with MGR daemons, ceph version mimic-13.2.8. I'm trying to do 
maintenance on our MON/MGR servers and am through with 2 out of 3. I have MON 
and MGR collocated on a host, 3 hosts in total. So far, procedure was to stop 
the daemons on the server and do the maintenance. Now I'm stuck at the last
server, because MGR fail-over does not work. The remaining MGR instances go 
into a restart loop.

In an attempt to mitigate this, I stopped all but 1 MGR on a node that is done 
with maintenance. Everything fine. However, as soon as I stop the last MON I 
need to do maintenance on, the last remaining MGR goes into a restart loop all 
by itself. As far as I can see, the MGR does actually not restart, it just gets 
thrown out of the cluster. Here is a ceph status before stopping mon.ceph-01:

[root@ceph-01 ~]# ceph status
  cluster:
id: xxx
health: HEALTH_WARN
1 pools nearfull

  services:
mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
mgr: ceph-03(active)
mds: con-fs2-1/1/1 up  {0=ceph-08=up:active}, 1 up:standby-replay
osd: 302 osds: 281 up, 281 in

  data:
pools:   11 pools, 3215 pgs
objects: 334.1 M objects, 689 TiB
usage:   877 TiB used, 1.1 PiB / 1.9 PiB avail
pgs: 3208 active+clean
 7 active+clean+scrubbing+deep

As soon as I stop mon.ceph-01, all hell breaks loose. Note that mgr.ceph-03 is 
collocated with mon.ceph-03 and we have quorum between mon.ceph-02 and 
mon.ceph-03. Here ceph status snapshots after shutting down mon.ceph-01:

[root@ceph-01 ~]# ceph status
  cluster:
id: xxx
health: HEALTH_WARN
1 pools nearfull
1/3 mons down, quorum ceph-02,ceph-03

  services:
mon: 3 daemons, quorum ceph-02,ceph-03, out of quorum: ceph-01
mgr: ceph-03(active)
mds: con-fs2-1/1/1 up  {0=ceph-08=up:active}, 1 up:standby-replay
osd: 302 osds: 281 up, 281 in

  data:
pools:   11 pools, 3215 pgs
objects: 334.1 M objects, 689 TiB
usage:   877 TiB used, 1.1 PiB / 1.9 PiB avail
pgs: 3207 active+clean
 8 active+clean+scrubbing+deep

[root@ceph-01 ~]# ceph status
  cluster:
id: xxx
health: HEALTH_WARN
no active mgr
1 pools nearfull
1/3 mons down, quorum ceph-02,ceph-03

  services:
mon: 3 daemons, quorum ceph-02,ceph-03, out of quorum: ceph-01
mgr: no daemons active
mds: con-fs2-1/1/1 up  {0=ceph-08=up:active}, 1 up:standby-replay
osd: 302 osds: 281 up, 281 in

  data:
pools:   11 pools, 3215 pgs
objects: 334.1 M objects, 689 TiB
usage:   877 TiB used, 1.1 PiB / 1.9 PiB avail
pgs: 3207 active+clean
 8 active+clean+scrubbing+deep

[root@ceph-01 ~]# ceph status
  cluster:
id: xxx
health: HEALTH_WARN
1 pools nearfull
1/3 mons down, quorum ceph-02,ceph-03

  services:
mon: 3 daemons, quorum ceph-02,ceph-03, out of quorum: ceph-01
mgr: ceph-03(active, starting)
mds: con-fs2-1/1/1 up  {0=ceph-08=up:active}, 1 up:standby-replay
osd: 302 osds: 281 up, 281 in

  data:
pools:   11 pools, 

[ceph-users] Re: Ceph EC PG calculation

2020-11-18 Thread Frank Schilder
Roughly speaking, if you have N OSDs, a replication factor of R and aim for P 
PGs/OSD on average, you can assign (N*P)/R PGs to the pool.

Example: 4+2 EC has replication 6. There are 36 OSDs. If you want to place, 
say,  50 PGs per OSD, you can assign

(36*50)/6=300 PGs

to the EC pool. You may pick a close power of 2 if you wish and then calculate 
how many PGs will be placed on each OSD on average. For example, we choose 256 
PGs, then

256*6/36 = 42.7 PGs per OSD will be added.
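
The same arithmetic as a small shell sketch (numbers from the example above):

N=36; P=50; R=6                  # OSDs, target PGs/OSD, replication (k+m for EC)
echo $(( N * P / R ))            # -> 300 PG budget for the pool
awk -v pgs=256 -v r=6 -v n=36 'BEGIN { printf "%.1f PGs/OSD\n", pgs*r/n }'  # -> 42.7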

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Szabo, Istvan (Agoda) 
Sent: 18 November 2020 04:58:38
To: ceph-users@ceph.io
Subject: [ceph-users] Ceph EC PG calculation

Hi,

I have this error:
I have 36 osd and get this:
Error ERANGE:  pg_num 4096 size 6 would mean 25011 total pgs, which exceeds max 
10500 (mon_max_pg_per_osd 250 * num_in_osds 42)

If I want to calculate the max pg in my server, how it works if I have EC pool?

I have 4:2 data EC pool, and the others are replicated.

These are the pools:
pool 1 'device_health_metrics' replicated size 3 min_size 2 crush_rule 2 
object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode warn last_change 597 
flags hashpspool stripe_width 0 pg_num_min 1 application mgr_devicehealth
pool 2 '.rgw.root' replicated size 3 min_size 2 crush_rule 2 object_hash 
rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 598 flags 
hashpspool stripe_width 0 application rgw
pool 6 'sin.rgw.log' replicated size 3 min_size 2 crush_rule 2 object_hash 
rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 599 flags 
hashpspool stripe_width 0 application rgw
pool 7 'sin.rgw.control' replicated size 3 min_size 2 crush_rule 2 object_hash 
rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 600 flags 
hashpspool stripe_width 0 application rgw
pool 8 'sin.rgw.meta' replicated size 3 min_size 2 crush_rule 1 object_hash 
rjenkins pg_num 8 pgp_num 8 autoscale_mode warn last_change 601 lfor 0/393/391 
flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 8 application rgw
pool 10 'sin.rgw.buckets.index' replicated size 3 min_size 2 crush_rule 1 
object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode warn last_change 602 
lfor 0/529/527 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 8 
application rgw
pool 11 'sin.rgw.buckets.data.old' replicated size 3 min_size 2 crush_rule 0 
object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 603 
flags hashpspool stripe_width 0 application rgw
pool 12 'sin.rgw.buckets.data' erasure profile data-ec size 6 min_size 5 
crush_rule 3 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn 
last_change 604 flags hashpspool,ec_overwrites stripe_width 16384 application 
rgw

So how I can calculate the pgs?

This is my osd tree:
ID   CLASS  WEIGHT TYPE NAME STATUS  REWEIGHT  PRI-AFF
-1 534.38354  root default
-5  89.06392  host cephosd-6s01
36   nvme1.74660  osd.36up   1.0  1.0
  0ssd   14.55289  osd.0 up   1.0  1.0
  8ssd   14.55289  osd.8 up   1.0  1.0
15ssd   14.55289  osd.15up   1.0  1.0
18ssd   14.55289  osd.18up   1.0  1.0
24ssd   14.55289  osd.24up   1.0  1.0
30ssd   14.55289  osd.30up   1.0  1.0
-3  89.06392  host cephosd-6s02
37   nvme1.74660  osd.37up   1.0  1.0
  1ssd   14.55289  osd.1 up   1.0  1.0
11ssd   14.55289  osd.11up   1.0  1.0
17ssd   14.55289  osd.17up   1.0  1.0
23ssd   14.55289  osd.23up   1.0  1.0
28ssd   14.55289  osd.28up   1.0  1.0
35ssd   14.55289  osd.35up   1.0  1.0
-11  89.06392  host cephosd-6s03
41   nvme1.74660  osd.41up   1.0  1.0
  2ssd   14.55289  osd.2 up   1.0  1.0
  6ssd   14.55289  osd.6 up   1.0  1.0
13ssd   14.55289  osd.13up   1.0  1.0
19ssd   14.55289  osd.19up   1.0  1.0
26ssd   14.55289  osd.26up   1.0  1.0
32ssd   14.55289  osd.32up   1.0  1.0
-13  89.06392  host cephosd-6s04
38   nvme1.74660  osd.38up   1.0  1.0
  5ssd   14.55289  osd.5 up   1.0  1.0
  7ssd   14.55289  osd.7 

[ceph-users] Re: Ceph EC PG calculation

2020-11-18 Thread Frank Schilder
It's the same formula. A k-times replicated pool has replication factor R. With
the formula I stated below, you can compute the entire PG budget depending on 
what your PG target per OSD is. I'm afraid you will have to do that yourself.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Szabo, Istvan (Agoda) 
Sent: 18 November 2020 09:21:50
To: Frank Schilder; ceph-users@ceph.io
Subject: RE: Ceph EC PG calculation

Hi,

Thank you Frank.

And how does this then affect the non-EC pools? Because they will use the same
device classes, which is SSD.
So I'd calculate with 100 PG/OSD, because this will grow.
If I calculate with EC it will be 512. But I still have many replicated pools 😊

Or just leave the autoscaler in warn mode and act when it instructs?

To be honest I just want to be sure my setup is correct, or whether I missed
something or did something wrong.


-----Original Message-----
From: Frank Schilder 
Sent: Wednesday, November 18, 2020 3:11 PM
To: Szabo, Istvan (Agoda) ; ceph-users@ceph.io
Subject: Re: Ceph EC PG calculation

Email received from outside the company. If in doubt don't click links nor open 
attachments!


Roughly speaking, if you have N OSDs, a replication factor of R and aim for P 
PGs/OSD on average, you can assign (N*P)/R PGs to the pool.

Example: 4+2 EC has replication 6. There are 36 OSDs. If you want to place, 
say,  50 PGs per OSD, you can assign

(36*50)/6=300 PGs

to the EC pool. You may pick a close power of 2 if you wish and then calculate 
how many PGs will be placed on each OSD on average. For example, we choose 256 
PGs, then

256*6/36 = 42.7 PGs per OSD will be added.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Szabo, Istvan (Agoda) 
Sent: 18 November 2020 04:58:38
To: ceph-users@ceph.io
Subject: [ceph-users] Ceph EC PG calculation

Hi,

I have this error:
I have 36 osd and get this:
Error ERANGE:  pg_num 4096 size 6 would mean 25011 total pgs, which exceeds max 
10500 (mon_max_pg_per_osd 250 * num_in_osds 42)

If I want to calculate the max pg in my server, how it works if I have EC pool?

I have 4:2 data EC pool, and the others are replicated.

These are the pools:
pool 1 'device_health_metrics' replicated size 3 min_size 2 crush_rule 2 
object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode warn last_change 597 
flags hashpspool stripe_width 0 pg_num_min 1 application mgr_devicehealth pool 
2 '.rgw.root' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins 
pg_num 32 pgp_num 32 autoscale_mode warn last_change 598 flags hashpspool 
stripe_width 0 application rgw pool 6 'sin.rgw.log' replicated size 3 min_size 
2 crush_rule 2 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn 
last_change 599 flags hashpspool stripe_width 0 application rgw pool 7 
'sin.rgw.control' replicated size 3 min_size 2 crush_rule 2 object_hash 
rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 600 flags 
hashpspool stripe_width 0 application rgw pool 8 'sin.rgw.meta' replicated size 
3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 8 pgp_num 8 
autoscale_mode warn last_change 601 lfor 0/393/391 flags hashpspool 
stripe_width 0 pg_autoscale_bias 4 pg_num_min 8 application rgw pool 10 
'sin.rgw.buckets.index' replicated size 3 min_size 2 crush_rule 1 object_hash 
rjenkins pg_num 8 pgp_num 8 autoscale_mode warn last_change 602 lfor 0/529/527 
flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 8 application 
rgw pool 11 'sin.rgw.buckets.data.old' replicated size 3 min_size 2 crush_rule 
0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 603 
flags hashpspool stripe_width 0 application rgw pool 12 'sin.rgw.buckets.data' 
erasure profile data-ec size 6 min_size 5 crush_rule 3 object_hash rjenkins 
pg_num 32 pgp_num 32 autoscale_mode warn last_change 604 flags 
hashpspool,ec_overwrites stripe_width 16384 application rgw

So how I can calculate the pgs?

This is my osd tree:
ID   CLASS  WEIGHT TYPE NAME STATUS  REWEIGHT  PRI-AFF
-1 534.38354  root default
-5  89.06392  host cephosd-6s01
36   nvme1.74660  osd.36up   1.0  1.0
  0ssd   14.55289  osd.0 up   1.0  1.0
  8ssd   14.55289  osd.8 up   1.0  1.0
15ssd   14.55289  osd.15up   1.0  1.0
18ssd   14.55289  osd.18up   1.0  1.0
24ssd   14.55289  osd.24up   1.0  1.0
30ssd   14.55289  osd.30up   1.0  1.0
-3  89.06392  host cephosd-6s02
37   nvme1.74660  osd.37up   1.0  1.0

[ceph-users] Re: Ceph EC PG calculation

2020-11-18 Thread Frank Schilder
> It's the same formula. A k-times replicated pool has replication factor R. With
> With the formula I stated below, you can compute the entire PG budget 
> depending
> on what your PG target per OSD is. I'm afraid you will have to do that 
> yourself.

Sorry, I meant a k-times replicated pool has replication factor R=k.
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________
From: Frank Schilder 
Sent: 18 November 2020 09:25:46
To: Szabo, Istvan (Agoda); ceph-users@ceph.io
Subject: [ceph-users] Re: Ceph EC PG calculation

It's the same formula. A k-times replicated pool has replication factor R. With
the formula I stated below, you can compute the entire PG budget depending on 
what your PG target per OSD is. I'm afraid you will have to do that yourself.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Szabo, Istvan (Agoda) 
Sent: 18 November 2020 09:21:50
To: Frank Schilder; ceph-users@ceph.io
Subject: RE: Ceph EC PG calculation

Hi,

Thank you Frank.

And after how this affect the non EC pools? Because they will use the same 
device classes, which is SSD.
So I'd calculate with 100PG/osd, because this will grow.
If I calculate with EC it will be 512. But still have many replicated pools 😊

Or just let the autoscaler in warn and do when it instruct.

To be honest I just want to be sure my setup is correct or I miss something or 
did something wrong.


-----Original Message-----
From: Frank Schilder 
Sent: Wednesday, November 18, 2020 3:11 PM
To: Szabo, Istvan (Agoda) ; ceph-users@ceph.io
Subject: Re: Ceph EC PG calculation

Email received from outside the company. If in doubt don't click links nor open 
attachments!


Roughly speaking, if you have N OSDs, a replication factor of R and aim for P 
PGs/OSD on average, you can assign (N*P)/R PGs to the pool.

Example: 4+2 EC has replication 6. There are 36 OSDs. If you want to place, 
say,  50 PGs per OSD, you can assign

(36*50)/6=300 PGs

to the EC pool. You may pick a close power of 2 if you wish and then calculate 
how many PGs will be placed on each OSD on average. For example, we choose 256 
PGs, then

256*6/36 = 42.7 PGs per OSD will be added.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Szabo, Istvan (Agoda) 
Sent: 18 November 2020 04:58:38
To: ceph-users@ceph.io
Subject: [ceph-users] Ceph EC PG calculation

Hi,

I have this error:
I have 36 osd and get this:
Error ERANGE:  pg_num 4096 size 6 would mean 25011 total pgs, which exceeds max 
10500 (mon_max_pg_per_osd 250 * num_in_osds 42)

If I want to calculate the max pg in my server, how it works if I have EC pool?

I have 4:2 data EC pool, and the others are replicated.

These are the pools:
pool 1 'device_health_metrics' replicated size 3 min_size 2 crush_rule 2 
object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode warn last_change 597 
flags hashpspool stripe_width 0 pg_num_min 1 application mgr_devicehealth pool 
2 '.rgw.root' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins 
pg_num 32 pgp_num 32 autoscale_mode warn last_change 598 flags hashpspool 
stripe_width 0 application rgw pool 6 'sin.rgw.log' replicated size 3 min_size 
2 crush_rule 2 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn 
last_change 599 flags hashpspool stripe_width 0 application rgw pool 7 
'sin.rgw.control' replicated size 3 min_size 2 crush_rule 2 object_hash 
rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 600 flags 
hashpspool stripe_width 0 application rgw pool 8 'sin.rgw.meta' replicated size 
3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 8 pgp_num 8 
autoscale_mode warn last_change 601 lfor 0/393/391 flags hashpspool 
stripe_width 0 pg_autoscale_bias 4 pg_num_min 8 application rgw pool 10 
'sin.rgw.buckets.index' replicated size 3 min_size 2 crush_rule 1 object_hash 
rjenkins pg_num 8 pgp_num 8 autoscale_mode warn last_change 602 lfor 0/529/527 
flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 8 application 
rgw pool 11 'sin.rgw.buckets.data.old' replicated size 3 min_size 2 crush_rule 
0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 603 
flags hashpspool stripe_width 0 application rgw pool 12 'sin.rgw.buckets.data' 
erasure profile data-ec size 6 min_size 5 crush_rule 3 object_hash rjenkins 
pg_num 32 pgp_num 32 autoscale_mode warn last_change 604 flags 
hashpspool,ec_overwrites stripe_width 16384 application rgw

So how I can calculate the pgs?

This is my osd tree:
ID   CLASS  WEIGHT TYPE NAME STATUS  REWEIGHT  PRI-AFF
-1 534.38354  root default
-5  89.06392  host cephosd-6s01
36   nvme1.74660  osd.36 

[ceph-users] MONs unresponsive for excessive amount of time

2020-11-18 Thread Frank Schilder
Hi all,

one of our MONs was down for maintenance for ca. 45 minutes. After this time I 
started it up again and it joined the cluster.

Unfortunately, things did not go as expected. The MON sub-cluster became 
unresponsive for a bit more than 10 minutes. Admin commands would hang, even if 
issued directly to a specific monitor via "ceph tell mon.xxx". In addition, our 
MDS lost connection to the MONs and reported a laggy connection. Consequently, 
all ceph fs access was frozen for a bit more than 10 minutes as well.

From the little I could get out with "ceph daemon mon.xxx mon_status" I could
see that the restarted MON was in state "synchronizing" (or similar, it's from
memory) while the other mons were in quorum.

Our cluster is mimic-13.2.8. Somehow, this observation does not fit together
with the intended HA of the MON cluster; there should not be any stall at all.

My questions: Why do the MONs become unresponsive for such a long time? What 
are the MONs doing during this time frame? Are there any config options I 
should look at? Are there any log messages I should hunt for?
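
For anyone hitting the same thing, these are the kinds of places to look (a
sketch; the mon name is an example, and the mon_sync_max_payload_size tweak is
a suggestion seen in other list threads, not verified here):

ceph daemon mon.ceph-01 mon_status         # shows "synchronizing" and sync progress
grep -E 'sync|probing|election' /var/log/ceph/ceph-mon.ceph-01.log
ceph config set mon mon_sync_max_payload_size 4096   # reported to speed up mon re-sync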

Any hint is appreciated.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MGR restart loop

2020-11-19 Thread Frank Schilder
Hi all,

there seems to be a bug in how beacon time-outs are computed. After waiting for 
a full time-out period of 86400s=24h, the problem disappeared. It looks like 
received beacons are only counted properly after a MON was up for the grace 
period. I have no other explanation.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: 17 November 2020 17:05:21
To: ceph-users@ceph.io
Subject: [ceph-users] Re: MGR restart loop

Addition: This happens only when I stop mon.ceph-01, I can stop any other MON 
daemon without problems. I checked network connectivity and all hosts can see 
all other hosts.

I already increased mon_mgr_beacon_grace to a huge value due to another bug a 
long time ago:

global advanced mon_mgr_beacon_grace 86400

This restart cycle seems to have another reason. The log contains this line 
just before the MGR goes out:

Nov 17 16:10:10 ceph-03 journal: 2020-11-17 16:10:10.179 7f7c544ea700  1 mgr 
send_beacon active
Nov 17 16:10:10 ceph-03 journal: 2020-11-17 16:10:10.193 7f7c544ea700  0 
log_channel(cluster) log [DBG] : pgmap v4: 3215 pgs: 3208 active+clean, 7 
active+clean+scrubbing+deep; 689 TiB data, 877 TiB used, 1.1 PiB / 1.9 PiB avail
Nov 17 16:10:10 ceph-02 journal: debug 2020-11-17 16:10:10.270 7f7bc2363700  0 
log_channel(cluster) log [INF] : Manager daemon ceph-03 is unresponsive.  No 
standby daemons available.
Nov 17 16:10:10 ceph-02 journal: debug 2020-11-17 16:10:10.270 7f7bc2363700  0 
log_channel(cluster) log [WRN] : Health check failed: no active mgr (MGR_DOWN)
Nov 17 16:10:10 ceph-02 journal: debug 2020-11-17 16:10:10.313 7f756700  0 
log_channel(cluster) log [DBG] : mgrmap e1330: no daemons active
Nov 17 16:10:10 ceph-03 journal: 2020-11-17 16:10:10.340 7f7c57cf1700 -1 mgr 
handle_mgr_map I was active but no longer am
Nov 17 16:10:10 ceph-03 journal: 2020-11-17 16:10:10.340 7f7c57cf1700  1 mgr 
respawn  e: '/usr/bin/ceph-mgr'

The beacon has been sent. Why does it not arrive at the MONs? There is only 
little load right now.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

____
From: Frank Schilder 
Sent: 17 November 2020 16:25:36
To: ceph-users@ceph.io
Subject: [ceph-users] MGR restart loop

Dear cephers,

I have a problem with MGR daemons, ceph version mimic-13.2.8. I'm trying to do 
maintenance on our MON/MGR servers and am through with 2 out of 3. I have MON 
and MGR collocated on a host, 3 hosts in total. So far, procedure was to stop 
the daemons on the server and do the maintenance. Now I'm stuck at the last
server, because MGR fail-over does not work. The remaining MGR instances go 
into a restart loop.

In an attempt to mitigate this, I stopped all but 1 MGR on a node that is done 
with maintenance. Everything fine. However, as soon as I stop the last MON I 
need to do maintenance on, the last remaining MGR goes into a restart loop all 
by itself. As far as I can see, the MGR does actually not restart, it just gets 
thrown out of the cluster. Here is a ceph status before stopping mon.ceph-01:

[root@ceph-01 ~]# ceph status
  cluster:
id: xxx
health: HEALTH_WARN
1 pools nearfull

  services:
mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
mgr: ceph-03(active)
mds: con-fs2-1/1/1 up  {0=ceph-08=up:active}, 1 up:standby-replay
osd: 302 osds: 281 up, 281 in

  data:
pools:   11 pools, 3215 pgs
objects: 334.1 M objects, 689 TiB
usage:   877 TiB used, 1.1 PiB / 1.9 PiB avail
pgs: 3208 active+clean
 7 active+clean+scrubbing+deep

As soon as I stop mon.ceph-01, all hell breaks loose. Note that mgr.ceph-03 is 
collocated with mon.ceph-03 and we have quorum between mon.ceph-02 and 
mon.ceph-03. Here ceph status snapshots after shutting down mon.ceph-01:

[root@ceph-01 ~]# ceph status
  cluster:
id: xxx
health: HEALTH_WARN
1 pools nearfull
1/3 mons down, quorum ceph-02,ceph-03

  services:
mon: 3 daemons, quorum ceph-02,ceph-03, out of quorum: ceph-01
mgr: ceph-03(active)
mds: con-fs2-1/1/1 up  {0=ceph-08=up:active}, 1 up:standby-replay
osd: 302 osds: 281 up, 281 in

  data:
pools:   11 pools, 3215 pgs
objects: 334.1 M objects, 689 TiB
usage:   877 TiB used, 1.1 PiB / 1.9 PiB avail
pgs: 3207 active+clean
 8 active+clean+scrubbing+deep

[root@ceph-01 ~]# ceph status
  cluster:
id: xxx
health: HEALTH_WARN
no active mgr
1 pools nearfull
1/3 mons down, quorum ceph-02,ceph-03

  services:
mon: 3 daemons, quorum ceph-02,ceph-03, out of quorum: ceph-01
mgr: no daemons active
mds: con-fs2-1/1/1 up  {0=ceph-08=up:active}, 1 up:standby-replay
osd: 302 osds: 281 up, 281 in

  data:
pools:   11 pools, 3215 pgs
objects: 334.1 M objects, 689 TiB
usage:

[ceph-users] Re: newbie Cephfs auth permissions issues

2020-11-19 Thread Frank Schilder
That's a known issue. You probably did "enable application cephfs" on the 
pools. This prevents the metadata tag from being applied correctly. If you google for
your problem, you will find threads on this with fixes. There was at least one 
this year.

Also, you could just start from scratch one more time and follow the 
instructions but ignore the enable application part.
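
From memory of those threads, the repair amounts to setting the missing
application metadata on the pools by hand, roughly like this (pool names and
<fsname> are placeholders; double-check against the thread you find):

ceph osd pool application get cephfs_data
ceph osd pool application set cephfs_data cephfs data <fsname>
ceph osd pool application set cephfs_metadata cephfs metadata <fsname>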

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Jonathan D. Proulx 
Sent: 19 November 2020 15:33:06
To: ceph-users
Subject: [ceph-users] newbie Cephfs auth permissions issues

Hi All,

I've been using ceph block and object storage for years but just
wandering into cephfs now (Nautilus all servers on 14.2.9 ).

I created small data and metadata pools, a new filesystem and used:

ceph fs authorize  client. / rw

creating two new users to mount it, both can one using fuse (14.2.9)
and one using kernel client (Ubuntu 20.04 kernel 5.4.0-53).

so far so good, but then it gets "weird": I can perform metadata
operations like "mkdir" and "touch" but not actually write any data:

testy-mctestface% touch /mnt/cephfs/boo
testy-mctestface% echo foo > /mnt/cephfs/boo
echo: write error: operation not permitted

auth caps look good to me, but seem most likely to be wrong:

root@ceph-mon0:/ # ceph auth get client.client0
exported keyring for client.client0
[client.client0]
key = 
caps mds = "allow rw"
caps mon = "allow r"
caps osd = "allow rw tag cephfs data="

is "data" hear supposed to be  or ? Presumably
it's fsname since that what the "fa authorize" put there and it should
know...

can anyone see what I'm doing wrong here?

Thanks,
-Jon
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: The serious side-effect of rbd cache setting

2020-11-20 Thread Frank Schilder
Do you have test results for the same test without caching?

I have seen periodic stalls in any RBD IOP/s benchmark on ceph. The benchmarks 
create IO requests much faster than OSDs can handle them. At some point all 
queues run full and you start seeing slow ops on OSDs.

I would also prefer if IO activity was more steady and not so bursty, but for 
some reason IO client throttling is pushed to the clients instead of the 
internal OPS queueing system (ceph is collaborative, meaning a rogue 
un-collaborative client can screw it up for everyone).

If you know what your IO stack can handle without stalls, you can use libvirt 
QOS settings to limit clients with reasonable peak-load and steady-load 
settings.
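
For illustration, such a limit can be set per disk with virsh (a sketch; domain
and device names as well as the numbers are placeholders to be tuned to what
your cluster sustains):

virsh blkdeviotune myvm vda \
    --total-iops-sec 2000 \
    --total-iops-sec-max 5000 \
    --total-iops-sec-max-length 60 \
    --live --config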

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: norman 
Sent: 20 November 2020 13:40:18
To: ceph-users
Subject: [ceph-users] The serious side-effect of rbd cache setting

Hi All,

We're testing the rbd cache setting for openstack(Ceph 14.2.5 Bluestore
3-replica), and an odd problem found:

1. Setting librbd cache

[client]

rbd cache = true

rbd cache size = 16777216

rbd cache max dirty = 12582912

rbd cache target dirty = 8388608

rbd cache max dirty age = 1

rbd cache writethrough until flush = true

2. Running rbd bench

rbd -c /etc/ceph/ceph.conf \
 -k /etc/ceph/keyring2 \
 -n client.rbd-openstack-002 bench \
 --io-size 4K \
 --io-threads 1 \
 --io-pattern seq \
 --io-type read \
 --io-total 100G \
 openstack-volumes/image-you-can-drop-me
3. Start another test

rbd -c /etc/ceph/ceph.conf \

 -k /etc/ceph/keyring2 \

 -n client.rbd-openstack-002 bench \

 --io-size 4K \

 --io-threads 1 \

 --io-pattern rand \

 --io-type write \

 --io-total 100G \

 openstack-volumes/image-you-can-drop-me2

Running for minutes, I found the read test almost hung for a while:

   69    152069   2375.21   9728858.72
   70    153627   2104.63   8620569.93
   71    155748   1956.04   8011953.10
   72    157665   1945.84   7970177.24
   73    159661   1947.64   7977549.44
   74    161522   1890.45   7743277.01
   75    163583   1991.04   8155301.58
   76    165791   2008.44   8226566.26
   77    168433   2153.43   8820438.66
   78    170269   2121.43   8689377.16
   79    172511   2197.62   9001467.33
   80    174845   2252.22   9225091.00
   81    177089   2259.42   9254579.83
   82    179675   2248.22   9208708.30
   83    182053   2356.61   9652679.11
   84    185087   2515.00  10301433.50
   99    185345    550.16   2253434.96
  101    185346    407.76   1670187.73
  102    185348    282.44   1156878.38
  103    185350    162.34    664931.53
  104    185353     12.86     52681.27
  105    185357      1.93      7916.89
  106    185361      2.74     11235.38
  107    185367      3.27     13379.95
  108    185375      5.08     20794.43
  109    185384      6.93     28365.91
  110    185403      9.19     37650.06
  111    185438     17.47     71544.17
  128    185467      4.94     20243.53
  129    185468      4.45     18210.82
  131    185469      3.89     15928.44
  132    185493      4.09     16764.16
  133    185529      4.16     17037.21
  134    185578     18.64     76329.67
  135    185631     27.78    113768.65

Why did this happen? This is unacceptable read performance.



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: The serious side-effect of rbd cache setting

2020-11-20 Thread Frank Schilder
Hmm, so maybe your hardware is good enough that the cache is actually not helping? 
This is not unheard of. I don't really see any improvement from caching to 
begin with. On the other hand, a synthetic benchmark is not really a test that 
exercises the strengths of a cache (in particular, write merges will probably not 
occur). It would probably make more sense to run real VMs with real workloads 
for a while and monitor latencies etc. over a longer period of time.

Other than that, I only see that the cache size is quite small. You do 100G of 
random operations on a 16M cache; the default is 32M. I would not expect 
anything interesting from this ratio. A cache only makes sense if you have a lot 
of cache hits.

In addition, the max_dirty and target_dirty values are really high 
percentage-wise. This could lead to a lot of operations being deferred for too 
long and result in a cache flush blocking IO.

A larger cache size, smaller dirty targets and a benchmark that simulates a 
realistic workload might be worth investigating.
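
Just to illustrate the ratio I have in mind (the numbers are examples, not a
recommendation):

[client]
rbd cache = true
rbd cache size = 67108864         # 64 MiB total cache
rbd cache max dirty = 16777216    # at most 16 MiB dirty
rbd cache target dirty = 8388608  # start flushing at 8 MiB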

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: norman 
Sent: 20 November 2020 13:58:27
To: Frank Schilder
Cc: ceph-users
Subject: Re: [ceph-users] The serious side-effect of rbd cache setting

With rbd cache = false, running the same two tests, the read IOPS are
stable (this is a new cluster with no other load):

  SEC     OPS   OPS/SEC     BYTES/SEC
  109  274471   2319.41    9500308.72
  110  276846   2380.81    9751782.65
  111  278969   2431.40    9959023.39
  112  280924   2287.21    9368428.23
  113  282886   2227.82    9125145.62
  114  286130   2331.61    9550275.83
  115  289693   2569.19   10523406.25
  116  293161   2838.17   11625140.61
  117  296484   3111.75   12745715.04
  118  300068   3436.12   14074349.33
  119  302424   3258.53   13346958.90
  120  304442   2949.56   12081397.86
  121  306988   2765.18   11326156.91
  122  309867   2676.38   10962461.69
  123  312475   2481.20   10162987.53
  124  314957   2506.40   10266198.33
  125  317124   2536.19   10388249.19
  126  320239   2649.98   10854336.06
  127  323243   2674.98   10956727.73
  128  326688   2842.37   11642342.34
  129  328855   2779.37   11384315.33
  130  331414   2857.77   11705415.59
  131  333811   2714.18       7277.84
  132  336164   2583.99   10584022.02
  133  338664   2395.01    9809941.00
  134  341417   2512.20   10289953.14
  135  344409   2598.79   10644637.88
  136  347112   2659.98   10895292.68
  137  349486   2664.18   10912494.47
  138  351921   2651.18   10859250.80
  139  354592   2634.79   10792081.86
  140  357559   2629.79   10771603.52

On 20/11/2020 8:50 PM, Frank Schilder wrote:
> Do you have test results for the same test without caching?
>
> I have seen periodic stalls in any RBD IOP/s benchmark on ceph. The 
> benchmarks create IO requests much faster than OSDs can handle them. At some 
> point all queues run full and you start seeing slow ops on OSDs.
>
> I would also prefer if IO activity was more steady and not so bursty, but for 
> some reason IO client throttling is pushed to the clients instead of the 
> internal OPS queueing system (ceph is collaborative, meaning a rogue 
> un-collaborative client can screw it up for everyone).
>
> If you know what your IO stack can handle without stalls, you can use libvirt 
> QOS settings to limit clients with reasonable peak-load and steady-load 
> settings.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: norman 
> Sent: 20 November 2020 13:40:18
> To: ceph-users
> Subject: [ceph-users] The serious side-effect of rbd cache setting
>
> Hi All,
>
> We're testing the rbd cache setting for openstack(Ceph 14.2.5 Bluestore
> 3-replica), and an odd problem found:
>
> 1. Setting librbd cache
>
> [client]
>
> rbd cache = true
>
> rbd cache size = 16777216
>
> rbd cache max dirty = 12582912
>
> rbd cache target dirty = 8388608
>
> rbd cache max dirty age = 1
>
> rbd cache writethrough until flush = true
>
> 2. Running rbd bench
>
> rbd -c /etc/ceph/ceph.conf \
>   -k /etc/ceph/keyring2 \
>   -n client.rbd-openstack-002 bench \
>   --io-size 4K \
>   --io-threads 1 \
>   --io-pattern seq \
>   --io-type read \
>   --io-total 100G \
>   openstack-volumes/image-you-can-drop-me
> 3. Start another test
>
> rbd -c /etc/ceph/ceph.conf \
>
>   -k /etc/ceph/keyring2 \
>
>   -n client.rbd-openstack-002 bench \
>
>   --io-size 4K \
>
>   --io-threads 1 \
>
>   --io-pattern rand \
>
>   --io-type write \
>
>   --io-total 100G \
>
>   

[ceph-users] Re: multiple OSD crash, unfound objects

2020-11-22 Thread Frank Schilder
Dear Michael,

I was also wondering whether deleting the broken pool could clean up 
everything. The difficulty is that while migrating a pool to new devices is 
easy via a crush rule change, migrating data between pools is not so easy. In 
particular, if you can't afford downtime.

In case you can afford some downtime, it might be possible to migrate fast by 
creating a new pool and using the pool copy command to migrate the data (rados 
cppool ...). It's important that the FS is shut down (no MDS active) during this 
copy process. After the copy, one could either rename the pools to have the copy 
match the fs data pool name, or change the data pool at the top level 
directory. You might need to set some pool meta data by hand, notably, the fs 
tag.
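
Roughly along these lines (untested sketch, fs and pool names are placeholders;
try it on a throw-away fs first):

ceph fs set myfs down true                    # stop the MDS cleanly
rados cppool mydata mydata-new                # copy all objects
ceph osd pool rename mydata mydata-old
ceph osd pool rename mydata-new mydata
ceph osd pool application enable mydata cephfs
ceph osd pool application set mydata cephfs data myfs   # the fs tag
ceph fs set myfs down false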

Having said that, I have no idea how a ceph fs reacts if presented with a 
replacement data pool. Although I don't believe that meta data contains the 
pool IDs, I cannot exclude that complication. The copy pool variant should be 
tested with an isolated FS first.

The other option is what you describe: create a new data pool, place the fs root 
on this pool and copy every file onto itself. This should also do the 
trick. However, with this method you will not be able to get rid of the broken 
pool. After the copy, you could, however, reduce the number of PGs to below the 
unhealthy one and the broken PG(s) might get deleted cleanly. Then you still 
have a surplus pool, but at least all PGs are clean.
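
As a sketch (fs/pool names and the mount point are placeholders):

ceph fs add_data_pool myfs mydata-new
setfattr -n ceph.dir.layout.pool -v mydata-new /mnt/myfs   # new files go to the new pool
# rewrite every file so its objects move, for example:
find /mnt/myfs -type f -exec sh -c 'cp -a "$1" "$1.tmp" && mv "$1.tmp" "$1"' _ {} \;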

I hope one of these will work. Please post your experience here.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Michael Thomas 
Sent: 22 November 2020 18:29:16
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects

On 10/23/20 3:07 AM, Frank Schilder wrote:
> Hi Michael.
>
>> I still don't see any traffic to the pool, though I'm also unsure how much 
>> traffic is to be expected.
>
> Probably not much. If ceph df shows that the pool contains some objects, I 
> guess that's sorted.
>
> That osdmaptool crashes indicates that your cluster runs with corrupted 
> internal data. I tested your crush map and you should get complete PGs for 
> the fs data pool. That you don't and that osdmaptool crashes points at a 
> corruption of internal data. I'm afraid this is the point where you need 
> support from ceph developers and should file a tracker report 
> (https://tracker.ceph.com/projects/ceph/issues). A short description of the 
> origin of the situation with the osdmaptool output and a reference to this 
> thread linked in should be sufficient. Please post a link to the ticket here.

https://tracker.ceph.com/issues/48059

> In parallel, you should probably open a new thread focussed on the osd map 
> corruption. Maybe there are low-level commands to repair it.

Will do.

> You should wait with trying to clean up the unfound objects until this is 
> resolved. Not sure about adding further storage either. To me, this sounds 
> quite serious.

Another approach that I'm considering is to create a new pool using the
same set of OSDs, adding it to the set of cephfs data pools, and
migrating the data from the "broken" pool to the new pool.

I have some additional unused storage that I could add to this new pool,
if I can figure out the right crush rules to make sure they don't get
used for the "broken" pool too.
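
What I had in mind is something like this (untested; OSD ids and the class name
are invented, and it assumes the broken pool's crush rule is restricted to the
default hdd device class):

# give the temporary OSDs their own device class so the broken pool's rule never picks them
ceph osd crush rm-device-class osd.120 osd.121
ceph osd crush set-device-class hddtmp osd.120 osd.121
# verify which rule the broken pool actually uses
ceph osd pool get <broken-pool> crush_rule
ceph osd crush rule dump <rule-name>

How to then let the new pool span both the hdd and the hddtmp class is the part
I haven't worked out yet.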

--Mike
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: multiple OSD crash, unfound objects

2020-11-22 Thread Frank Schilder
Dear Michael,

yes, your plan will work if the temporary space requirement can be addressed. 
Good luck!

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Michael Thomas 
Sent: 22 November 2020 20:14:09
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects

Hi Frank,

 From my understanding, with my current filesystem layout, I should be
able to remove the "broken" pool once the data has been moved off of it.
  This is because the "broken" pool is not the default data pool.
According to the documentation[1]:

   fs rm_data_pool  

"This command removes the specified pool from the list of data pools for
the file system. If any files have layouts for the removed data pool,
the file data will become unavailable. The default data pool (when
creating the file system) cannot be removed."

My default data pool (triply replicated on SSD) is still healthy.  The
"broken" pool is EC on HDD, and while it holds a majority of the
filesystem data (~400TB), it is not the root of the filesystem.

My plan would be:

* Create a new data pool matching the "broken" pool
* Create a parallel directory tree matching the directories that are
mapped to the "broken" pool.  eg Broken: /ceph/frames/..., New:
/ceph/frames.new/...
* Use 'setfattr -n ceph.dir.layout.pool' on this parallel directory tree
to map the content to the new data pool
* Use parallel+rsync to copy data from the broken pool to the new pool.
* After each directory gets filled in the new pool, mv/rename the old
and new directories so that users start accessing the data from the new
pool.
* Delete data from the renamed old pool directories as they are
replaced, to keep the OSDs from filling up
* After all data is moved off of the old pool (verified by checking
ceph.dir.layout.pool and ceph.file.layout.pool on all files in the fs,
as well as rados ls, ceph df), remove the pool from the fs.

This is effectively the same strategy I used when moving frequently 
accessed directories from the EC pool to a replicated SSD pool, except
that in the previous situation I didn't need to remove any pools at the
end.  It's time consuming, because every file on the "broken" pool needs
to be copied, but it minimizes downtime.  Being able to add some
temporary new OSDs to the new pool (but not the "broken" pool) would
reduce some pressure of filling up the OSDs.  If the old and new pools
use the same crush rule, would disabling backfilling+rebalancing keep
the OSDs from being used in the old pool until the old pool is deleted
(with the exception of the occasional new file)?
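
The flags I have in mind are the cluster-wide ones, e.g.:

ceph osd set nobackfill
ceph osd set norebalance
# ...and once the old pool is gone:
ceph osd unset nobackfill
ceph osd unset norebalance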

--Mike
[1]https://docs.ceph.com/en/latest/cephfs/administration/#file-systems



On 11/22/20 12:19 PM, Frank Schilder wrote:
> Dear Michael,
>
> I was also wondering whether deleting the broken pool could clean up 
> everything. The difficulty is, that while migrating a pool to new devices is 
> easy via a crush rule change, migrating data between pools is not so easy. In 
> particular, if you can't afford downtime.
>
> In case you can afford some downtime, it might be possible to migrate fast by 
> creating a new pool and use the pool copy command to migrate the data (rados 
> cppool ...). Its important that the FS is shutdown (no MDS active) during 
> this copy process. After copy, one could either rename the pools to have the 
> copy match the fs data pool name, or change the data pool at the top level 
> directory. You might need to set some pool meta data by hand, notably, the fs 
> tag.
>
> Having said that, I have no idea how a ceph fs reacts if presented with a 
> replacement data pool. Although I don't believe that meta data contains the 
> pool IDs, I cannot exclude that complication. The copy pool variant should be 
> tested with an isolated FS first.
>
> The other option is what you describe, create a new data pool, make the fs 
> root placed on this pool and copy every file onto itself. This should also do 
> the trick. However, with this method you will not be able to get rid of the 
> broken pool. After the copy, you could, however, reduce the number of PGs to 
> below the unhealthy one and the broken PG(s) might get deleted cleanly. Then 
> you still have a surplus pool, but at least all PGs are clean.
>
> I hope one of these will work. Please post your experience here.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________
> From: Michael Thomas 
> Sent: 22 November 2020 18:29:16
> To: Frank Schilder; ceph-users@ceph.io
> Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects
>
> On 10/23/20 3:07 AM, Frank Schilder wrote:
>> 

[ceph-users] PGs undersized for no reason?

2020-11-23 Thread Frank Schilder
Hi all,

I'm upgrading ceph from mimic 13.2.8 to 13.2.10 and made a strange observation. When 
restarting OSDs on the new version, the PGs come back as undersized. They are 
missing 1 OSD and I get a lot of objects degraded/misplaced.

I have only the noout flag set.

Can anyone help me out why the PGs don't peer until they are all complete?
Is there a flag I can set to get complete PGs before starting backfill/recovery?

Ceph is currently rebuilding objects even though all data should still be 
there. Hence, the update takes an unreasonable amount of time now and I 
remember that with the update from 13.2.2 to 13.2.8 PGs came back complete 
really fast. There was no such extended period with incomplete PGs and degraded 
redundancy.

Thanks and best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: PGs undersized for no reason?

2020-11-23 Thread Frank Schilder
Found it. OSDs came up in the wrong root.
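
For the record, pinning the location in ceph.conf should prevent OSDs from
moving themselves on restart; a sketch (root/host names are examples):

[osd]
osd crush update on start = false
# or pin the location explicitly per OSD, e.g.:
# [osd.123]
# crush location = root=ssd host=node07-ssd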
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: 23 November 2020 12:46:32
To: ceph-users@ceph.io
Subject: [ceph-users] PGs undersized for no reason?

Hi all,

I'm upgrading ceph mimic 13.2.8 to 13.2.10 and make a strange observation. When 
restarting OSDs on the new version, the PGs come back as undersized. They are 
missing 1 OSD and I get a lot of objects degraded/misplaced.

I have only the noout flag set.

Can anyone help me out why the PGs don't peer until they are all complete?
Is there a flag I can set to get complete PGs before starting backfill/recovery?

Ceph is currently rebuilding objects even though all data should still be 
there. Hence, the update takes an unreasonable amount of time now and I 
remember that with the update from 13.2.2 to 13.2.8 PGs came back complete 
really fast. There was no such extended period with incomplete PGs and degraded 
redundancy.

Thanks and best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cephfs snapshots and previous version

2020-11-24 Thread Frank Schilder
We made the same observation and found out that for CentOS8 there are extra 
modules for samba that provide vfs modules for certain storage systems (search 
for all available package names containing samba and they show up in the list). 
One is available and supports gluster fs. The corresponding package for ceph is 
missing. My best bet is that this is part of RedHat enterprise storage and 
left out deliberately. We compiled SAMBA from source without problems and 
everything is then present.
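
A share definition then looks roughly like this (just a sketch; share name,
path and the ceph user are placeholders):

[cephfs]
    path = /shares
    vfs objects = ceph_snapshots ceph
    kernel share modes = no
    ceph:config_file = /etc/ceph/ceph.conf
    ceph:user_id = samba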

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Oliver Weinmann 
Sent: 23 November 2020 23:19:55
To: ceph-users
Subject: [ceph-users] Cephfs snapshots and previous version

Today I played with a samba gateway and cephfs. I couldn’t get previous 
versions displayed on a Windows client and found very little info on the net 
on how to accomplish this. It seems that I need a vfs module called 
ceph_snapshots. It’s not included in the latest samba version on CentOS 8. 
Through this I also noticed that there is no vfs ceph module. Are these modules 
not stable and therefore not included in CentOS 8? I can compile them, but I 
would like to know why they are not included. And one more question: are there 
any plans to add samba gateway support to cephadm?

Best regards,
Oliver

Von meinem iPhone gesendet
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Documentation of older Ceph version not accessible anymore on docs.ceph.com

2020-11-24 Thread Frank Schilder
Older versions are available here:

https://web.archive.org/web/20191226012841/https://docs.ceph.com/docs/mimic/

I'm actually also a bit unhappy about older versions missing. Mimic is not end 
of life and a lot of people still use luminous. Since there are such dramatic 
differences between interfaces, the old docs should not just disappear.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Dan Mick 
Sent: 24 November 2020 01:53:29
To: Martin Palma
Cc: ceph-users
Subject: [ceph-users] Re: Documentation of older Ceph version not accessible 
anymore on docs.ceph.com

I don't know the answer to that.

On 11/23/2020 6:59 AM, Martin Palma wrote:
> Hi Dan,
>
> yes I noticed but now only "latest", "octopus" and "nautilus" are
> offered to be viewed. For older versions I had to go directly to
> github.
>
> Also simply switching the URL from
> "https://docs.ceph.com/en/nautilus/"; to
> "https://docs.ceph.com/en/luminous/"; will not work any more.
>
> Is it planned to make the documentation of the older version available
> again through doc.ceph.com?
>
> Best,
> Martin
>
> On Sat, Nov 21, 2020 at 2:11 AM Dan Mick  wrote:
>>
>> On 11/14/2020 10:56 AM, Martin Palma wrote:
>>> Hello,
>>>
>>> maybe I missed the announcement but why is the documentation of the
>>> older ceph version not accessible anymore on docs.ceph.com
>>
>> It's changed UI because we're hosting them on readthedocs.com now.  See
>> the dropdown in the lower right corner.
>>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: replace osd with Octopus

2020-11-28 Thread Frank Schilder
Hi all,

maybe a further alternative.

With our support contract I get exact replacements. I found out that doing an 
off-line copy of a still readable OSD with ddrescue speeds things up 
dramatically and avoids extended periods of degraded PGs.

Situation and what I did:

I had a disk with repeated deep scrub errors and checking with smartctl I could 
see that it started remapping sectors. This showed up as PG scrub error. I 
initiated a full deep scrub of the disk and run PG repair on every PG that was 
marked as having errors. This way, ceph rewrites the broken object and the disk 
writes it to a remapped, that is, healthy sector. Doing this a couple of times 
will leave you with a disk that is 100% readable.
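
In commands, roughly (the OSD and PG ids are examples):

ceph osd deep-scrub osd.57                # deep-scrub all PGs on the suspect OSD
ceph health detail | grep inconsistent    # lists the PGs with scrub errors
ceph pg repair 19.1a2                     # repair each of them, rewriting the broken objects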

I then shut the OSD down. This led to recovery IO as expected and after less 
than 2 hours everything was rebuilt to full redundancy (it was probably faster, 
I only checked after 2 hours). Recovery from single disk fail is very fast due 
to all-to-all rebuild.

In the meantime, I did a full disk copy with ddrescue to a large file system 
space I have on a copy station. Took 16h for a 12TB drive. Right after this, 
the replacement arrived and I copied the image back. Another 16h.

After this, I simply inserted the new disk with the 5 days old OSD copy and 
brought it up (there was a weekend in between). Almost all objects on the drive 
were still up-to-date and after just 30 minutes all PGs were active and clean. 
Nothing remapped or misplaced any more.
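
In commands, roughly what this looked like (OSD id, device and paths are
examples):

systemctl stop ceph-osd@57           # cluster rebuilds redundancy in the background
ddrescue -f /dev/sdk /backup/osd-57.img /backup/osd-57.map   # old disk -> image, ~16h for 12TB
# physically swap the drive, then write the image onto the replacement:
ddrescue -f /backup/osd-57.img /dev/sdk /backup/osd-57-restore.map
systemctl start ceph-osd@57          # the OSD only has to catch up on recent changes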

For comparison, I once added a single drive and it took 2 weeks for the 
affected PGs to be active+clean again. The off-line copy can use much more 
aggressive and effective IO to a single drive than ceph rebalancing ever would.

For single-disk exchange on our service contract I will probably continue with 
the ddrescue method even though it requires manual action.

For the future I plan to adopt a different strategy to utilize the all-to-all 
copy capability of ceph. Exchanging single disks seems not to be a good way to 
run ceph. I will rather have a larger number of disks act as hot spares. For 
example, having enough capacity that one can tolerate losing 10% of all disks 
before replacing anything. Adding a large number of disks is overall more 
effective as it will basically take the same time to get back to health OK as 
exchanging a single disk.

With my timings, this "replace many disks, not single ones" approach will amortise 
once at least 5-6 drives have failed and are down+out. It will also limit writes to degraded 
PGs to the shortest interval possible.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Anthony D'Atri 
Sent: 28 November 2020 05:55:06
To: Tony Liu
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: replace osd with Octopus

>>
>
> Here is the context.
> https://docs.ceph.com/en/latest/mgr/orchestrator/#replace-an-osd
>
> When disk is broken,
> 1) orch osd rm  --replace [--force]
> 2) Replace disk.
> 3) ceph orch apply osd -i 
>
> Step #1 marks OSD "destroyed". I assume it has the same effect as
> "ceph osd destroy". And that keeps OSD "in", no PG remapping and
> cluster is in "degrade" state.
>
> After step #3, OSD will be "up" and "in", data will be recovered
> back to new disk. Is that right?

Yes.

> Is cluster "degrade" or "healthy" during such recovery?

It will be degraded, because there are fewer copies of some data available than 
during normal operation.  Clients will continue to access all data.

> For another option, the difference is no "--replace" in step #1.
> 1) orch osd rm  [--force]
> 2) Replace disk.
> 3) ceph orch apply osd -i 
>
> Step #1 evacuates PGs from OSD and removes it from cluster.
> If disk is broken or OSD daemon is down, is this evacuation still
> going to work?

Yes, of course — broken drives are the typical reason for removing OSDs.

> Is it going to take a while if there is lots data on this disk?

Yes, depending on what “a while” means to you, the size of the cluster, whether 
the pool is replicated or EC, and whether these are HDDs or SSDs.

> After step #3, PGs will be rebalanced/remapped again when new OSD
> joins the cluster.
>
> I think, to replace with the same disk model, option #1 is preferred,
> to replace with different disk model, it needs to be option #2.

I haven’t tried it under Octopus, but I don’t think this is strictly true.  If 
you replace it with a different model that is approximately the same size, 
everything will be fine.  Through Luminous and I think Nautilus at least, if 
you `destroy` and replace with a larger drive, the CRUSH weight of the OSD will 
still reflect that of the old drive.  You could then run `ceph osd crush 
reweight` after deploying to adjust the size.  You could record the

[ceph-users] Re: replace osd with Octopus

2020-12-02 Thread Frank Schilder
> A dummy question, what's this all-to-all rebuild/copy?
> Is that PG remapping when the broken disk is taken out?

- all-to-all: every OSD sends/receives objects to/from every other OSD
- one-to-all: one OSD sends objects to all other OSDs
- all-to-one: all other OSDs send objects to one OSD

All-to-all happens if one disk fails and all other OSDs rebuild the missing 
data. This is very fast.

One-to-all happens when you evacuate a single disk, for example, by setting its 
weight to 0. This is very slow. It is faster to just fail the disk and let the 
data rebuild, however, with the drawback of temporarily reduced redundancy.

All-to-one happens when you add a single disk and all other OSDs send it its 
data. This is also very slow and there is no short-cut.

Conclusion: design work flows that utilize the all-to-all capability of ceph as 
much as possible. For example, plan the cluster such that single-disk 
operations can be avoided.
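
In commands (the OSD id is an example):

# one-to-all (slow): drain the disk while it stays up
ceph osd crush reweight osd.17 0
# all-to-all (fast, temporarily degraded): fail it and let all OSDs rebuild
systemctl stop ceph-osd@17
ceph osd out 17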

> In your case, does "shut the OSD down" mark OSD "out"?
> "rebuilt to full redundancy" took 2 hours (I assume there was
> PG remapping.)? What's the disk size?

If you stop an OSD, it will be down and 5 minutes later marked out (auto-out). 
These time-outs can be configured. Size was 12TB (10.7TiB). They are NL-SAS drives.
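
The relevant time-out is mon_osd_down_out_interval, for example:

ceph config set mon mon_osd_down_out_interval 300   # seconds before a down OSD is marked out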

> Regarding to your future plan relying on all-to-all copy,
> "with large amount of hot spares", I assume you mean large
> amount of spare spaces? What do you do when a disk fails?
> Just take it out and let the cluster heal itself by remapping
> PGs from failed disk to spare spaces?

Hot spares means that you deploy 5-10% more disks than you need to provide the 
requested capacity (hot means they are already part of the cluster, otherwise 
they would be called cold spares). Then, if a single disk fails, you do 
nothing, because you still have excess capacity. Only after all the 5-10% extra 
disks have failed will 5-10% disks be added again as new. In fact, I would plan 
it such that this replacement falls together with the next capacity extension. 
Then, you simply do nothing when a disk fails - except maybe taking it out and 
requesting a replacement if your contract provides that (put it on a shelf 
until next cluster extension).

Doubling the number of OSDs in a storage extension operation will practically 
result in all-to-all data movement. It's theoretically half-to-half, but more 
than 50% of objects are usually misplaced and there will be movement between 
the original set of OSDs as well. In any case, getting such a large number of 
disks involved that only need to be filled up to 50% of the previous capacity 
will be much more efficient (in administrator workload/salary) than doing 
single-disk replacements or tiny extensions.

Ceph is fun if it's big enough :)

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

____
From: Tony Liu 
Sent: 02 December 2020 05:48:10
To: Frank Schilder; Anthony D'Atri
Cc: ceph-users@ceph.io
Subject: RE: [ceph-users] Re: replace osd with Octopus

Hi Frank,

A dummy question, what's this all-to-all rebuild/copy?
Is that PG remapping when the broken disk is taken out?

In your case, does "shut the OSD down" mark OSD "out"?
"rebuilt to full redundancy" took 2 hours (I assume there was
PG remapping.)? What's the disk size?

Regarding to your future plan relying on all-to-all copy,
"with large amount of hot spares", I assume you mean large
amount of spare spaces? What do you do when a disk fails?
Just take it out and let the cluster heal itself by remapping
PGs from failed disk to spare spaces?


Thanks!
Tony
> -Original Message-
> From: Frank Schilder 
> Sent: Saturday, November 28, 2020 12:42 AM
> To: Anthony D'Atri ; Tony Liu
> 
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] Re: replace osd with Octopus
>
> Hi all,
>
> maybe a further alternative.
>
> With our support contract I get exact replacements. I found out that
> doing an off-line copy of a still readable OSD with ddrescue speeds
> things up dramatically and avoids extended periods of degraded PGs.
>
> Situation and what I did:
>
> I had a disk with repeated deep scrub errors and checking with smartctl
> I could see that it started remapping sectors. This showed up as PG
> scrub error. I initiated a full deep scrub of the disk and run PG repair
> on every PG that was marked as having errors. This way, ceph rewrites
> the broken object and the disk writes it to a remapped, that is, healthy
> sector. Doing this a couple of times will leave you with a disk that is
> 100% readable.
>
> I then shut the OSD down. This lead to recovery IO as expected and after
> less than 2 hours everything was rebuilt to full redundancy (it was
> probably faster, I only check

[ceph-users] Re: replace osd with Octopus

2020-12-02 Thread Frank Schilder
> I must be missing something seriously:)

Yes. And I think it's time that you actually try it out instead of writing ever 
longer e-mails.

If you re-read the e-mail correspondence carefully, you should notice that your 
follow-up questions have been answered already.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Tony Liu 
Sent: 02 December 2020 19:00:18
To: Frank Schilder; Anthony D'Atri
Cc: ceph-users@ceph.io
Subject: RE: [ceph-users] Re: replace osd with Octopus

> > A dummy question, what's this all-to-all rebuild/copy?
> > Is that PG remapping when the broken disk is taken out?
>
> - all-to-all: every OSD sends/receives objects to/from every other OSD
> - one-to-all: one OSD sends objects to all other OSDs
> - all-to-one: all other OSDs send objects to one OSD
>
> All-to-all happens if one disk fails and all other OSDs rebuild the
> missing data. This is very fast.

No matter whether it is "up" or "down", the PG mapping remains while the OSD is "in",
and PGs will be remapped when the OSD is "out". Is that correct?

Since "other OSDs rebuild the missing data", there must be PG
remapping, and the failed disk must be "out", either manually or automatically.
Right?

> One-to-all happens when you evacuate a single disk, for example, by
> setting its weight to 0. This is very slow. It is faster to just fail
> the disk and let the data rebuild, however, with the drawback of
> temporarily reduced redundancy.

From what I see, the difference between "evacuating a single disk" and
"a disk fails" is the cluster state. When evacuating a single disk,
the cluster is healthy because all replicas are available.
When a disk fails, the cluster is degraded because one replica is
missing. In terms of PG remapping, it happens either way. I see the
same copying happening in the background: PGs on the failed/evacuated disk are
copied to other disks. If that's true, why is there such a dramatic timing
difference between those two cases?

Given my above understanding, all-to-all is no different from
one-to-all. In either case, the PGs of one disk are remapped to others.

I must be missing something seriously:)

> All-to-one happens when you add a single disk and all other OSDs send it
> its data. This is also very slow and there is no short-cut.

Adding a new disk will cause PGs to be rebalanced. It will take time.
But when replacing a disk (the OSD stays "in"), since the PG mapping
remains, there is no rebalance/remapping, just copying data back.

> Conclusion: design work flows that utilize the all-to-all capability of
> ceph as much as possible. For example, plan the cluster such that
> single-disk operations can be avoided.
>
> > In your case, does "shut the OSD down" mark OSD "out"?
> > "rebuilt to full redundancy" took 2 hours (I assume there was PG
> > remapping.)? What's the disk size?
>
> If you stop an OSD, it will be down and 5 minutes later marked out
> (auto-out). These time-outs can be configured. Size was 12TB (10.7TiB).
> Its NL-SAS drives.
>
> > Regarding to your future plan relying on all-to-all copy, "with large
> > amount of hot spares", I assume you mean large amount of spare spaces?
> > What do you do when a disk fails?
> > Just take it out and let the cluster heal itself by remapping PGs from
> > failed disk to spare spaces?
>
> Hot spares means that you deploy 5-10% more disks than you need to
> provide the requested capacity (hot means they are already part of the
> cluster, otherwise they would be called cold spares). Then, if a single
> disk fails, you do nothing, because you still have excess capacity. Only
> after all the 5-10% extra disks have failed will 5-10% disks be added
> again as new. In fact, I would plan it such that this replacement falls
> together with the next capacity extension. Then, you simply do nothing
> when a disk fails - except maybe taking it out and requesting a
> replacement if your contract provides that (put it on a shelf until next
> cluster extension).

Is a hot spare disk "in" the cluster and allocated PGs?
If yes, what's the difference between a hot spare disk and a normal disk?

My understanding is to just keep the cluster capacity under a reasonable
threshold, to accommodate the failure of one or a couple of disks. Since the
cluster will heal itself, there is no rush to replace the disk when a failure
happens. And when replacing the disk, it will be the same as adding
a new disk. This is my original option #2. I was just not sure about
how much time the cluster would take to heal itself. Based on your
experience, it's pretty fast, a couple of hours to rebuild 10T of data.

> Doubling the number of OSDs in a storage extension operation will
> pra

[ceph-users] Increase number of objects in flight during recovery

2020-12-03 Thread Frank Schilder
Hi all,

I have the opposite problem to the one discussed in "slow down keys/s in recovery". I 
need to increase the number of objects in flight during rebalance. All remapped 
PGs are already in state backfilling, but it looks like no more than 8 
objects/sec are transferred per PG at a time. The pool sits on 
high-performance SSDs and could easily handle a transfer of 100 or more 
objects/sec simultaneously. Is there any way to increase the number of 
transfers/sec or the number of simultaneous transfers? Increasing the options 
osd_max_backfills and osd_recovery_max_active has no effect.
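
For reference, these are the knobs I tried or know of (the values only
illustrate the syntax; as said, the first two had no effect on the per-PG rate
here):

ceph tell osd.* injectargs '--osd_max_backfills 8 --osd_recovery_max_active 8'
ceph tell osd.* injectargs '--osd_recovery_max_single_start 4 --osd_recovery_sleep_ssd 0'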

Background: The pool in question (con-fs2-meta2) is the default data pool of a 
ceph fs, which stores exclusively the kind of meta data that goes into this 
pool. Storage consumption is reported as 0, but the number of objects is huge:

NAME           ID  USED     %USED  MAX AVAIL    OBJECTS
con-fs2-meta1  12  216 MiB   0.02    933 GiB       1335
con-fs2-meta2  13      0 B   0       933 GiB  118389897
con-fs2-data   14  698 TiB  72.15    270 TiB  286826739

Unfortunately, there were no recommendations on dimensioning PG numbers for 
this pool, so I used the same PG count for con-fs2-meta1 and con-fs2-meta2. In 
hindsight, this was potentially a bad idea, the meta2 pool should have a much 
higher PG count or a much more aggressive recovery policy.

I now need to rebalance PGs on meta2 and it is going way too slow compared with 
the performance of the SSDs it is located on. In a way, I would like to keep 
the PG count where it is, but increase the recovery rate for this pool by a 
factor of 10. Please let me know what options I have.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

