[ceph-users] Re: Pacific bluestore_volume_selection_policy

2024-01-08 Thread Eugen Block

Hi,

I just did the same in my lab environment and the config got applied  
to the daemon after a restart:


pacific:~ # ceph tell osd.0 config show | grep  
bluestore_volume_selection_policy

"bluestore_volume_selection_policy": "rocksdb_original",

This is also a (tiny single-node) cluster running 16.2.14. Maybe there
is a typo or something in your loop? Have you tried setting it for one
OSD only to see whether it starts with the config applied?
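
For reference, the whole sequence for a single OSD would look roughly like
this (a sketch, assuming a package-based OSD; on a cephadm deployment the
restart would be 'ceph orch daemon restart osd.0' instead of systemctl):

ceph config set osd.0 bluestore_volume_selection_policy rocksdb_original
systemctl restart ceph-osd@0
ceph tell osd.0 config show | grep bluestore_volume_selection_policy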



Zitat von Reed Dier :

After ~3 uneventful weeks following the upgrade from 15.2.17 to 16.2.14
I've started seeing OSD crashes with "cur >= fnode.size" and "cur >=
p.length", which seem to be resolved in the next point release for
pacific later this month, but until then, I'd love to keep the OSDs
from flapping.


$ for crash in $(ceph crash ls | grep osd | awk '{print $1}') ; do  
ceph crash info $crash | egrep "(assert_condition|crash_id)" ; done

"assert_condition": "cur >= fnode.size",
"crash_id":  
"2024-01-03T09:07:55.698213Z_348af2d3-d4a7-4c27-9f71-70e6dc7c1af7",

"assert_condition": "cur >= p.length",
"crash_id":  
"2024-01-03T14:21:55.794692Z_4557c416-ffca-4165-aa91-d63698d41454",

"assert_condition": "cur >= fnode.size",
"crash_id":  
"2024-01-03T22:53:43.010010Z_15dc2b2a-30fb-4355-84b9-2f9560f08ea7",

"assert_condition": "cur >= p.length",
"crash_id":  
"2024-01-04T02:34:34.408976Z_2954a2c2-25d2-478e-92ad-d79c42d3ba43",

"assert_condition": "cur2 >= p.length",
"crash_id":  
"2024-01-04T21:57:07.100877Z_12f89c2c-4209-4f5a-b243-f0445ba629d2",

"assert_condition": "cur >= p.length",
"crash_id":  
"2024-01-05T00:35:08.561753Z_a189d967-ab02-4c61-bf68-1229222fd259",

"assert_condition": "cur >= fnode.size",
"crash_id":  
"2024-01-05T04:11:48.625086Z_a598cbaf-2c4f-4824-9939-1271eeba13ea",

"assert_condition": "cur >= p.length",
"crash_id":  
"2024-01-05T13:49:34.911210Z_953e38b9-8ae4-4cfe-8f22-d4b7cdf65cea",

"assert_condition": "cur >= p.length",
"crash_id":  
"2024-01-05T13:54:25.732770Z_4924b1c0-309c-4471-8c5d-c3aaea49166c",

"assert_condition": "cur >= p.length",
"crash_id":  
"2024-01-05T16:35:16.485416Z_0bca3d2a-2451-4275-a049-a65c58c1aff1”,


As noted in  
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/YNJ35HXN4HXF4XWB6IOZ2RKXX7EQCEIY/  




You can apparently work around the issue by setting
'bluestore_volume_selection_policy' config parameter to rocksdb_original.


However, after trying to set that parameter with `ceph config set  
osd.$osd bluestore_volume_selection_policy rocksdb_original` it  
doesn’t appear to set?


$ ceph config show-with-defaults osd.0  | grep  
bluestore_volume_selection_policy

bluestore_volume_selection_policy   use_some_extra



$ ceph config set osd.0 bluestore_volume_selection_policy rocksdb_original
$ ceph config show osd.0  | grep bluestore_volume_selection_policy
bluestore_volume_selection_policy   use_some_extra   
  default mon


This, I assume, should reflect the new setting; however, it still
shows the default "use_some_extra" value.


But then this seems to imply that the config is set?

$ ceph config dump | grep bluestore_volume_selection_policy
osd.0    dev    bluestore_volume_selection_policy   rocksdb_original   *

[snip]
osd.9    dev    bluestore_volume_selection_policy   rocksdb_original   *


Does this need to be set in ceph.conf or is there another setting  
that also needs to be set?
Even after bouncing the OSD daemon, `ceph config show` still reports
"use_some_extra".


I'd appreciate any help anyone can offer to point me in the right direction
to bridge the gap between now and the next point release.


Thanks,
Reed
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Network Flapping Causing Slow Ops and Freezing VMs

2024-01-08 Thread Eugen Block

Hi,

just to get a better understanding, when you write

Although the OSDs were correctly marked as down in the monitor, slow  
ops persisted until we resolved the network issue.


do you mean that the MONs marked the OSDs as down (temporarily) or did
you do that? Because if the OSDs "flap" they would also mark
themselves "up" all the time; this should be reflected in the OSD
logs as something like "wrongly marked me down". Can you confirm that
the daemons were still up and logged the "wrongly marked me down"
messages?
In some cases the "nodown" flag can prevent flapping OSDs, but since
you actually had a network issue it wouldn't really help here. I would
probably have set the noout flag and stopped the OSD daemons on the
affected node until the issue was resolved.
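
In practice that would be something like this (a sketch; OSD ids are
placeholders, and on a cephadm host you would stop the corresponding
services instead):

ceph osd set noout
systemctl stop ceph-osd@<id>      # repeat for each OSD on the affected node
# ... resolve the network issue ...
systemctl start ceph-osd@<id>
ceph osd unset noout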


Regards,
Eugen

Zitat von mahnoosh shahidi :


Hi all,

I hope this message finds you well. We recently encountered an issue on one
of our OSD servers, leading to network flapping and subsequently causing
significant performance degradation across our entire cluster. Although the
OSDs were correctly marked as down in the monitor, slow ops persisted until
we resolved the network issue. This incident resulted in a major
disruption, especially affecting VMs with mapped RBD images, leading to
their freezing.

In light of this, I have two key questions for the community:

1. Why did slow ops persist even after marking the affected server as down
in the monitor?

2. Are there any recommended configurations for OSD suicide or OSD down
reports that could help us better handle similar network-related issues in
the future?

Best Regards,
Mahnoosh
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] ceph -s: wrong host count

2024-01-08 Thread Jan Kasprzak
Hello, Ceph users!

I have recently noticed that when I reboot a single ceph node,
ceph -s reports "5 hosts down" instead of one. The following
is captured during reboot of a node with two OSDs:

health: HEALTH_WARN
noout flag(s) set
2 osds down
5 hosts (2 osds) down
[...]
mon: 3 daemons, quorum mon1,mon3,mon2 (age 8h)
mgr: mon2(active, since 2d), standbys: mon3, mon1
osd: 34 osds: 32 up (since 2m), 34 in (since 4M)
 flags noout
rgw: 1 daemon active (1 hosts, 1 zones)

After the node successfully reboots, ceph -s reports "HEALTH OK"
and of course no OSDs and no hosts are reported as being down.

Does anybody else see this as well? This is Ceph 18.2.1, but I think
I have seen this on Ceph 17 as well.

Thanks,

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| https://www.fi.muni.cz/~kas/  GPG: 4096R/A45477D5 |
We all agree on the necessity of compromise. We just can't agree on
when it's necessary to compromise. --Larry Wall
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph -s: wrong host count

2024-01-08 Thread Eugen Block

Hi,

you probably have empty OSD nodes in your crush tree. Can you send the  
output of 'ceph osd tree'?


Thanks,
Eugen

Zitat von Jan Kasprzak :


Hello, Ceph users!

I have recently noticed that when I reboot a single ceph node,
ceph -s reports "5 hosts down" instead of one. The following
is captured during reboot of a node with two OSDs:

health: HEALTH_WARN
noout flag(s) set
2 osds down
5 hosts (2 osds) down
[...]
mon: 3 daemons, quorum mon1,mon3,mon2 (age 8h)
mgr: mon2(active, since 2d), standbys: mon3, mon1
osd: 34 osds: 32 up (since 2m), 34 in (since 4M)
 flags noout
rgw: 1 daemon active (1 hosts, 1 zones)

After the node successfully reboots, ceph -s reports "HEALTH OK"
and of course no OSDs and no hosts are reported as being down.

Does anybody else see this as well? This is Ceph 18.2.1, but I think
I have seen this on Ceph 17 as well.

Thanks,

-Yenya

--
| Jan "Yenya" Kasprzak  |
| https://www.fi.muni.cz/~kas/  GPG: 4096R/A45477D5 |
We all agree on the necessity of compromise. We just can't agree on
when it's necessary to compromise. --Larry Wall
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] osd_mclock_max_capacity_iops_hdd in Reef

2024-01-08 Thread Luis Domingues
Hi all,

We are testing migrations from a cluster running Pacific to Reef. In Pacific we
needed to tweak osd_mclock_max_capacity_iops_hdd to get decent performance out
of our cluster.

But in Reef it looks like changing the value of
osd_mclock_max_capacity_iops_hdd does not impact cluster performance. Did
osd_mclock_max_capacity_iops_hdd become useless?

I did not find anything about it in the changelogs, but I could have missed
something.

Luis Domingues
Proton AG
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph -s: wrong host count

2024-01-08 Thread Jan Kasprzak
Hi Eugen,

Eugen Block wrote:
: you probably have empty OSD nodes in your crush tree. Can you send
: the output of 'ceph osd tree'?

You are right, there were 4 hosts in the crush tree, which I removed
from the cluster and repurposed a while ago. I have edited the CRUSH
map to remove the hosts and other empty buckets, and now my cluster
is rebalancing.
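
For the archives: I believe the same cleanup can also be done from the CLI,
without decompiling the CRUSH map, roughly like this (run once per empty
host/bucket; the name is a placeholder):

ceph osd crush rm <empty-bucket-name>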

Thanks for the hint!

-Yenya


-- 
| Jan "Yenya" Kasprzak  |
| https://www.fi.muni.cz/~kas/  GPG: 4096R/A45477D5 |
We all agree on the necessity of compromise. We just can't agree on
when it's necessary to compromise. --Larry Wall
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Stuck in upgrade process to reef

2024-01-08 Thread Igor Fedotov

Hi Jan,

indeed, the fsck logs for the OSDs other than osd.0 look good, so it would be
interesting to see the OSD startup logs for them. Preferably for multiple
(e.g. 3-4) OSDs, to get the pattern.


Original upgrade log(s) would be nice to see as well.

You might want to use Google Drive or any other publicly available file 
sharing site for that.



Thanks,

Igor

On 05/01/2024 10:25, Jan Marek wrote:

Hi Igor,

I've tried to start only osd.1, which seems to be fsck'd OK, but
it crashed :-(

I searched the logs and found that I have logs from 22.12.2023,
when I did the upgrade (I have logging set to journald).

Would you be interested in those logs? The file is 30 MB in
bzip2 format; how can I share it with you?

It also contains the crash log from starting osd.1, but I can cut
that part out and send it to the list...

Sincerely
Jan Marek

On Thu, Jan 04, 2024 at 02:43:48 CET, Jan Marek wrote:

Hi Igor,

I've run this one-liner:

for i in {0..12}; do
  export CEPH_ARGS="--log-file osd.${i}.log --debug-bluestore 5/20"
  ceph-bluestore-tool --path /var/lib/ceph/2c565e24-7850-47dc-a751-a6357cbbaf2a/osd.${i} --command fsck
done

On osd.0 it crashed very quickly, on osd.1 it is still working.

I've sent those logs in one e-mail.

But!

I've tried to list the disk devices in the monitor view, and I got a
very interesting screenshot - some parts I've emphasized with red
rectangles.

I've got a JSON from syslog, which was part of a cephadm call, and
there it seems to be correct (to my eyes).

Could this be related to the problem?

Sincerely
Jan Marek

On Thu, Jan 04, 2024 at 12:32:47 CET, Igor Fedotov wrote:

Hi Jan,

may I see the fsck logs from all the failing OSDs to see the pattern? IIUC
the full node is suffering from the issue, right?


Thanks,

Igor

On 1/2/2024 10:53 AM, Jan Marek wrote:

Hello once again,

I've tried this:

export CEPH_ARGS="--log-file /tmp/osd.0.log --debug-bluestore 5/20"
ceph-bluestore-tool --path 
/var/lib/ceph/2c565e24-7850-47dc-a751-a6357cbbaf2a/osd.0 --command fsck

And I'm sending the /tmp/osd.0.log file attached.

Sincerely
Jan Marek

On Sun, Dec 31, 2023 at 12:38:13 CET, Igor Fedotov wrote:

Hi Jan,

this doesn't look like RocksDB corruption but rather like some BlueStore
metadata inconsistency. Also, the assertion backtrace in the new log looks
completely different from the original one. So in an attempt to find any
systematic pattern I'd suggest running fsck with verbose logging for every
failing OSD. Relevant command line:

CEPH_ARGS="--log-file osd.N.log --debug-bluestore 5/20" bin/ceph-bluestore-tool --path <path-to-osd> --command fsck

This is unlikely to fix anything; it's rather a way to collect logs to get
better insight.


Additionally, you might want to run a similar fsck for a couple of healthy OSDs
- I'm curious whether it succeeds, as I have a feeling that the problem with the
crashing OSDs had been hidden before the upgrade and was revealed rather than
caused by it.


Thanks,

Igor

On 12/29/2023 3:28 PM, Jan Marek wrote:

Hello Igor,

I'm attaching a part of syslog creating while starting OSD.0.

Many thanks for help.

Sincerely
Jan Marek

On Wed, Dec 27, 2023 at 04:42:56 CET, Igor Fedotov wrote:

Hi Jan,

IIUC the attached log is for ceph-kvstore-tool, right?

Can you please share full OSD startup log as well?


Thanks,

Igor

On 12/27/2023 4:30 PM, Jan Marek wrote:

Hello,

I have a problem with my ceph cluster (3x mon nodes, 6x osd nodes; every
osd node has 12 rotational disks and one NVMe device for the
bluestore DB). Ceph is installed by the ceph orchestrator and the
OSDs use bluefs storage.

I've started the upgrade process from version 17.2.6 to 18.2.1 by
invoking:

ceph orch upgrade start --ceph-version 18.2.1

After the upgrade of the mon and mgr processes the orchestrator tried to
upgrade the first OSD node, but its OSDs keep falling down.

I've stopped the upgrade process, but I have 1 osd node
completely down.

After the upgrade I got some error messages and found
/var/lib/ceph/crash directories; I attach to this message the
files which I found there.

Please, can you advise what I can do now? It seems that rocksdb
is either incompatible or corrupted :-(

Thanks in advance.

Sincerely
Jan Marek

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx


--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
___
ceph-users mailing list -- ceph-users@ceph.io

[ceph-users] Re: osd_mclock_max_capacity_iops_hdd in Reef

2024-01-08 Thread Sridhar Seshasayee
Hi Luis,


> We are testing migrations from a cluster running Pacific to Reef. In
> pacific we needed to tweak osd_mclock_max_capacity_iops_hdd to have decent
> performances of ou cluster.
>

It would be helpful to know the procedure you are employing for the
migration.


>
> But in reef it looks like changing the value of
> osd_mclock_max_capacity_iops_hdd does not impact cluster performances. Did
> osd_mclock_max_capacity_iops_hdd became useless?
>

"osd_mclock_max_capacity_iops_hdd" is still valid in Reef as long as it
accurately represents the capability of the underlying OSD device for the
intended workload.

Between Pacific and Reef many improvements to the mClock feature have been
made. An important change relates to the automatic determination of cost
per I/O which is now tied to the sequential and random IOPS capability of
the underlying device of an OSD. As long as
"osd_mclock_max_capacity_iops_hdd" and
"osd_mclock_max_sequential_bandwidth_hdd" represent a fairly accurate
capability of the backing OSD device, the performance should be along
expected lines. Changing the "osd_mclock_max_capacity_iops_hdd" to a value
that is beyond the capability of the device will obviously not yield any
improvement.
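
For completeness, a sketch of how those two values can be inspected and, if
needed, overridden for a single OSD (osd.0 and the numbers below are purely
illustrative):

ceph config show-with-defaults osd.0 | grep osd_mclock_max_capacity_iops_hdd
ceph config show-with-defaults osd.0 | grep osd_mclock_max_sequential_bandwidth_hdd
ceph config set osd.0 osd_mclock_max_capacity_iops_hdd 450
ceph config set osd.0 osd_mclock_max_sequential_bandwidth_hdd 157286400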

If the above parameters are representative of the capability of the backing
OSD device and you still see lower than expected performance, then it could
be some other issue that needs looking into.
-Sridhar
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: osd_mclock_max_capacity_iops_hdd in Reef

2024-01-08 Thread Luis Domingues
Hi Sridhar. Thanks for your reply:


> > We are testing migrations from a cluster running Pacific to Reef. In
> > pacific we needed to tweak osd_mclock_max_capacity_iops_hdd to have decent
> > performances of ou cluster.
>
> It would be helpful to know the procedure you are employing for the
> migration.

For now we run some benchmarks on a fairly small dev/test cluster. It has been 
deployed using cephadm and updated with cephadm from Pacific to Reef.

What we observed is that with Pacific, by tweaking
osd_mclock_max_capacity_iops_hdd, we can go from around 200MB/s of writes up
to 600MB/s of writes on the balanced profile.
But with Reef, changing osd_mclock_max_capacity_iops_hdd does not change the
cluster's performance much (or if it does, the differences are small enough
that I did not see them).

That being said, the out-of-the-box performance of Reef is what we expect of
our cluster (around 600MB/s), while with Pacific we needed to manually tweak
osd_mclock_max_capacity_iops_hdd to get the expected performance. So there is
definitely a big improvement there.

What made me think that this option was maybe not used anymore is that during
the deployment of Pacific, each OSD pushed its own
osd_mclock_max_capacity_iops_hdd, but deploying Reef did not. We did not see
any values for the OSDs in the ceph config db.

In conclusion we could say, at least based on our pre-update tests, that mClock
seems to behave a lot better in Reef than in Pacific.

Luis Domingues
Proton AG


On Monday, 8 January 2024 at 12:29, Sridhar Seshasayee  
wrote:


> Hi Luis,
> 
> > We are testing migrations from a cluster running Pacific to Reef. In
> > pacific we needed to tweak osd_mclock_max_capacity_iops_hdd to have decent
> > performances of ou cluster.
> 
> 
> It would be helpful to know the procedure you are employing for the
> migration.
> 
> > But in reef it looks like changing the value of
> > osd_mclock_max_capacity_iops_hdd does not impact cluster performances. Did
> > osd_mclock_max_capacity_iops_hdd became useless?
> 
> 
> "osd_mclock_max_capacity_iops_hdd" is still valid in Reef as long as it
> accurately represents the capability of the underlying OSD device for the
> intended workload.
> 
> Between Pacific and Reef many improvements to the mClock feature have been
> made. An important change relates to the automatic determination of cost
> per I/O which is now tied to the sequential and random IOPS capability of
> the underlying device of an OSD. As long as
> "osd_mclock_max_capacity_iops_hdd" and
> "osd_mclock_max_sequential_bandwidth_hdd" represent a fairly accurate
> capability of the backing OSD device, the performance should be along
> expected lines. Changing the "osd_mclock_max_capacity_iops_hdd" to a value
> that is beyond the capability of the device will obviously not yield any
> improvement.
> 
> If the above parameters are representative of the capability of the backing
> OSD device and you still see lower than expected performance, then it could
> be some other issue that needs looking into.
> -Sridhar
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: osd_mclock_max_capacity_iops_hdd in Reef

2024-01-08 Thread Sridhar Seshasayee
Hi Luis,

What we observed, is that with Pacific, tweaking
> osd_mclock_max_capacity_iops_hdd, we can go from arround 200MB/s of writes
> up to 600MB/s of writes, on balanced profile.
> But with Reef, changing osd_mclock_max_capacity_iops_hdd does not change a
> lot the performances of the cluster. (Or if it does, they are small enough
> so I did not see them).
>

The above probably indicates that the default values for
osd_mclock_max_capacity_iops_hdd are close enough to the actual capability
of the backing device.


> That been said, the performances of Reef "out of the box" are what we
> expect of our cluster (arround 600MB/s), while with Pacific we needed to
> tweak manually osd_mclock_max_capacity_iops_hdd to get the expected
> performances. So there is definitely a big improvement there.
>

This is good feedback. One of our goals was to achieve a hands-free
configuration of mClock and fine tune only when necessary.


>
> What made me think that this option was maybe not used anymore, during the
> deploy of Pacific, each OSD pushes its own
> osd_mclock_max_capacity_iops_hdd, but deploying Reef not. We did not see
> any values for the OSDs in the ceph config db.
>

The fact that you don't see any values in the config db indicates that the
default values are in effect. We added a fallback mechanism to use the
default values in case the benchmark test during OSD boot-up returned
unrealistic values. Please see
https://docs.ceph.com/en/reef/rados/configuration/mclock-config-ref/#mitigation-of-unrealistic-osd-capacity-from-automated-test
for more details and awareness around this. In your case, the configuration
may be left as is since the defaults are giving you the expected
performance.
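
If you want to double-check what an individual OSD is actually running with
(defaults included), something like this should show it:

ceph tell osd.0 config get osd_mclock_max_capacity_iops_hdd
ceph tell osd.0 config get osd_mclock_max_sequential_bandwidth_hdd
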
-Sridhar
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to configure something like osd_deep_scrub_min_interval?

2024-01-08 Thread Fulvio Galeazzi

Hallo Frank,
	just found this post, thank you! I have also been puzzled/struggling 
with scrub/deep-scrub and found your post very useful: will give this a 
try, soon.


One thing, first: I am using Octopus, too, but I cannot find any 
documentation about osd_deep_scrub_randomize_ratio. I do see that in 
past releases, but not on Octopus: is it still a valid parameter?


Let me check whether I understood your procedure: you optimize scrub 
time distribution essentially by playing with osd_scrub_min_interval, 
thus "forcing" the automated algorithm to preferentially select 
older-scrubbed PGs, am I correct?


Another small question: you opt for osd_max_scrubs=1 just to make sure 
your I/O is not adversely affected by scrubbing, or is there a more 
profound reason for that?
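
For my own notes, I assume applying your suggested starting point would look
roughly like the following (untested on my side; option names and availability
may differ between releases):

ceph config set osd osd_max_scrubs 1
ceph config set osd osd_deep_scrub_randomize_ratio 0
ceph config set osd osd_deep_scrub_interval 1209600   # e.g. two weeks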


  Thanks!

Fulvio

On 12/13/23 13:36, Frank Schilder wrote:

Hi all,

since there seems to be some interest, here some additional notes.

1) The script is tested on octopus. It seems that there was a change in the 
output of ceph commands used and it might need some tweaking to get it to work 
on other versions.

2) If you want to give my findings a shot, you can do so in a gradual way. The
most important change is setting osd_deep_scrub_randomize_ratio=0 (with
osd_max_scrubs=1); this will make osd_deep_scrub_interval work exactly like the
requested osd_deep_scrub_min_interval setting: PGs with a deep-scrub stamp
younger than osd_deep_scrub_interval will *not* be deep-scrubbed. This is the
one change to test; all other settings have less impact. The script will not
report some numbers at the end, but the histogram will be correct. Let it run
for a few deep-scrub-interval rounds until the histogram has evened out.

If you start your test after using osd_max_scrubs>1 for a while - as I did -
you will need a lot of patience and might need to mute some scrub warnings for
a while.

3) The changes are mostly relevant for large HDDs that take a long time to 
deep-scrub (many small objects). The overall load reduction, however, is useful 
in general.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


--
Fulvio Galeazzi
GARR-Net Department
tel.: +39-334-6533-250
skype: fgaleazzi70


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Pacific bluestore_volume_selection_policy

2024-01-08 Thread Reed Dier
I ended up setting it in ceph.conf which appears to have worked (as far as I 
can tell).

> [osd]
> bluestore_volume_selection_policy = rocksdb_original

> $ ceph config show osd.0  | grep bluestore_volume_selection_policy
> bluestore_volume_selection_policy   rocksdb_original  file
>   (mon[rocksdb_original])

So far so good…

Reed

> On Jan 8, 2024, at 2:04 AM, Eugen Block  wrote:
> 
> Hi,
> 
> I just did the same in my lab environment and the config got applied to the 
> daemon after a restart:
> 
> pacific:~ # ceph tell osd.0 config show | grep 
> bluestore_volume_selection_policy
>"bluestore_volume_selection_policy": "rocksdb_original",
> 
> This is also a (tiny single-node) cluster running 16.2.14. Maybe you have 
> some typo or something while doing the loop? Have you tried to set it for one 
> OSD only and see if it starts with the config set?
> 
> 
> Zitat von Reed Dier <reed.d...@focusvq.com>:
> 
>> After ~3 uneventful weeks after upgrading from 15.2.17 to 16.2.14 I’ve 
>> started seeing OSD crashes with "cur >= fnode.size” and "cur >= p.length”, 
>> which seems to be resolved in the next point release for pacific later this 
>> month, but until then, I’d love to keep the OSDs from flapping.
>> 
>>> $ for crash in $(ceph crash ls | grep osd | awk '{print $1}') ; do ceph 
>>> crash info $crash | egrep "(assert_condition|crash_id)" ; done
>>>"assert_condition": "cur >= fnode.size",
>>>"crash_id": 
>>> "2024-01-03T09:07:55.698213Z_348af2d3-d4a7-4c27-9f71-70e6dc7c1af7",
>>>"assert_condition": "cur >= p.length",
>>>"crash_id": 
>>> "2024-01-03T14:21:55.794692Z_4557c416-ffca-4165-aa91-d63698d41454",
>>>"assert_condition": "cur >= fnode.size",
>>>"crash_id": 
>>> "2024-01-03T22:53:43.010010Z_15dc2b2a-30fb-4355-84b9-2f9560f08ea7",
>>>"assert_condition": "cur >= p.length",
>>>"crash_id": 
>>> "2024-01-04T02:34:34.408976Z_2954a2c2-25d2-478e-92ad-d79c42d3ba43",
>>>"assert_condition": "cur2 >= p.length",
>>>"crash_id": 
>>> "2024-01-04T21:57:07.100877Z_12f89c2c-4209-4f5a-b243-f0445ba629d2",
>>>"assert_condition": "cur >= p.length",
>>>"crash_id": 
>>> "2024-01-05T00:35:08.561753Z_a189d967-ab02-4c61-bf68-1229222fd259",
>>>"assert_condition": "cur >= fnode.size",
>>>"crash_id": 
>>> "2024-01-05T04:11:48.625086Z_a598cbaf-2c4f-4824-9939-1271eeba13ea",
>>>"assert_condition": "cur >= p.length",
>>>"crash_id": 
>>> "2024-01-05T13:49:34.911210Z_953e38b9-8ae4-4cfe-8f22-d4b7cdf65cea",
>>>"assert_condition": "cur >= p.length",
>>>"crash_id": 
>>> "2024-01-05T13:54:25.732770Z_4924b1c0-309c-4471-8c5d-c3aaea49166c",
>>>"assert_condition": "cur >= p.length",
>>>"crash_id": 
>>> "2024-01-05T16:35:16.485416Z_0bca3d2a-2451-4275-a049-a65c58c1aff1”,
>> 
>> As noted in 
>> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/YNJ35HXN4HXF4XWB6IOZ2RKXX7EQCEIY/
>> 
>>> You can apparently work around the issue by setting
>>> 'bluestore_volume_selection_policy' config parameter to rocksdb_original.
>> 
>> However, after trying to set that parameter with `ceph config set osd.$osd 
>> bluestore_volume_selection_policy rocksdb_original` it doesn’t appear to set?
>> 
>>> $ ceph config show-with-defaults osd.0  | grep 
>>> bluestore_volume_selection_policy
>>> bluestore_volume_selection_policy   use_some_extra
>> 
>>> $ ceph config set osd.0 bluestore_volume_selection_policy rocksdb_original
>>> $ ceph config show osd.0  | grep bluestore_volume_selection_policy
>>> bluestore_volume_selection_policy   use_some_extra
>>> default mon
>> 
>> This, I assume, should reflect the new setting, however it still shows the 
>> default “use_some_extra” value.
>> 
>> But then this seems to imply that the config is set?
>>> $ ceph config dump | grep bluestore_volume_selection_policy
>>>    osd.0    dev    bluestore_volume_selection_policy   rocksdb_original   *
>>> [snip]
>>>    osd.9    dev    bluestore_volume_selection_policy   rocksdb_original   *
>> 
>> Does this need to be set in ceph.conf or is there another setting that also 
>> needs to be set?
>> Even after bouncing the OSD daemon, `ceph config show` still reports 
>> “use_some_extra"
>> 
>> Appreciate any help they can offer to point me towards to bridge the gap 
>> between now and the next point release.
>> 
>> Thanks,
>> Reed
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io 
>> To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Re: cephadm bootstrap on 3 network clusters

2024-01-08 Thread Patrick Begou

Hi Sebastian

as you say "more than 3 public networks", did you manage to get Ceph daemons
listening on multiple public interfaces?
I'm looking for such a possibility, as the daemons seem to be bound to one
interface only, but I cannot find any how-to.


Thanks

Patrick

Le 03/01/2024 à 21:31, Sebastian a écrit :

Hi,
check the routing table and default gateway and fix them if necessary.
Use an IP address instead of a DNS name.

I have a more complicated situation :D
I have more than 3 public networks and cluster networks…

BR,
Sebastian


On Jan 3, 2024, at 16:40, Luis Domingues  wrote:



Why? The public network should not have any restrictions between the
Ceph nodes. Same with the cluster network.

Internal policies and network rules.

Luis Domingues
Proton AG


On Wednesday, 3 January 2024 at 16:15, Robert Sander 
 wrote:



Hi Luis,

On 1/3/24 16:12, Luis Domingues wrote:


My issue is that mon1 cannot connect via SSH to itself using the pub network, and
bootstrap fails at the end when cephadm tries to add mon1 to the list of hosts.


Why? The public network should not have any restrictions between the
Ceph nodes. Same with the cluster network.

Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin

https://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Amtsgericht Berlin-Charlottenburg - HRB 220009 B
Geschäftsführer: Peer Heinlein - Sitz: Berlin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Network Flapping Causing Slow Ops and Freezing VMs

2024-01-08 Thread mahnoosh shahidi
Hi Eugen

Yes, the OSDs were marked as down by the mons and there was "wrongly marked as
down" in the logs, but the OSDs were down all the time. Actually I was looking
for a fast-fail procedure for this kind of situation, because any manual action
would take time and can cause major incidents.

Best Regards,
Mahnoosh

On Mon, 8 Jan 2024, 11:47 Eugen Block,  wrote:

> Hi,
>
> just to get a better understanding, when you write
>
> > Although the OSDs were correctly marked as down in the monitor, slow
> > ops persisted until we resolved the network issue.
>
> do you mean that the MONs marked the OSDs as down (temporarily) or did
> you do that? Because if the OSDs "flap" they would also mark
> themselves "up" all the time, this should be reflected in the OSD
> logs, something like "wrongly marked me down". Can you confirm that
> the daemons were still up and logged the "wrongly marked me down"
> messages?
> In some cases the "nodown" flag can prevent flapping OSDs, but since
> you actually had a network issue it wouldn't really help here. I would
> probably have set the noout flag and stop the OSD daemons on the
> affected node until the issue was resolved.
>
> Regards,
> Eugen
>
> Zitat von mahnoosh shahidi :
>
> > Hi all,
> >
> > I hope this message finds you well. We recently encountered an issue on
> one
> > of our OSD servers, leading to network flapping and subsequently causing
> > significant performance degradation across our entire cluster. Although
> the
> > OSDs were correctly marked as down in the monitor, slow ops persisted
> until
> > we resolved the network issue. This incident resulted in a major
> > disruption, especially affecting VMs with mapped RBD images, leading to
> > their freezing.
> >
> > In light of this, I have two key questions for the community:
> >
> > 1. Why did slow ops persist even after marking the affected server as
> down
> > in the monitor?
> >
> > 2.Are there any recommended configurations for OSD suicide or OSD down
> > reports that could help us better handle similar network-related issues
> in
> > the future?
> >
> > Best Regards,
> > Mahnoosh
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Network Flapping Causing Slow Ops and Freezing VMs

2024-01-08 Thread Eugen Block
You didn't mention which ceph version you're running; assuming that
it's managed by cephadm, you could put the host in maintenance mode [1],
which stops all services and then adds the noout flag for that host
to prevent unnecessary recovery.
Once the maintenance is done, exit maintenance mode and the
services should start again. Note that all ceph services on that host
would be stopped, so MONs too.
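
A sketch of the commands (hostname is a placeholder):

ceph orch host maintenance enter <hostname>
# ... resolve the network issue ...
ceph orch host maintenance exit <hostname>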


[1] https://docs.ceph.com/en/latest/cephadm/host-management/

Zitat von mahnoosh shahidi :


Hi Eugen

Yes osds were marked as down by mons and there was "wrongly marked as down"
in the logs but the osds were down all the time. Actually I was looking for
a fast fail procedure for these kind of situation cause any manual action
would take time and can causes major incidents.

Best Regards,
Mahnoosh

On Mon, 8 Jan 2024, 11:47 Eugen Block,  wrote:


Hi,

just to get a better understanding, when you write

> Although the OSDs were correctly marked as down in the monitor, slow
> ops persisted until we resolved the network issue.

do you mean that the MONs marked the OSDs as down (temporarily) or did
you do that? Because if the OSDs "flap" they would also mark
themselves "up" all the time, this should be reflected in the OSD
logs, something like "wrongly marked me down". Can you confirm that
the daemons were still up and logged the "wrongly marked me down"
messages?
In some cases the "nodown" flag can prevent flapping OSDs, but since
you actually had a network issue it wouldn't really help here. I would
probably have set the noout flag and stop the OSD daemons on the
affected node until the issue was resolved.

Regards,
Eugen

Zitat von mahnoosh shahidi :

> Hi all,
>
> I hope this message finds you well. We recently encountered an issue on
one
> of our OSD servers, leading to network flapping and subsequently causing
> significant performance degradation across our entire cluster. Although
the
> OSDs were correctly marked as down in the monitor, slow ops persisted
until
> we resolved the network issue. This incident resulted in a major
> disruption, especially affecting VMs with mapped RBD images, leading to
> their freezing.
>
> In light of this, I have two key questions for the community:
>
> 1. Why did slow ops persist even after marking the affected server as
down
> in the monitor?
>
> 2.Are there any recommended configurations for OSD suicide or OSD down
> reports that could help us better handle similar network-related issues
in
> the future?
>
> Best Regards,
> Mahnoosh
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Radosgw not syncing files/folders with slashes in object name

2024-01-08 Thread Matt Dunavant
Hi all,

I need some help troubleshooting a strange issue: my two relatively newly set up
ceph clusters (17.2.6) are configured for replication, and I can get files with
plain names (example.txt for example) to sync, but anything with a slash or a
folder-style name (folder1/folder2/example.txt for example) won't sync over.
I'm not sure why this would be the case, as I'm pretty sure slashes are allowed
in object names (https://docs.ceph.com/en/latest/radosgw/layout/). Any ideas, or
something obvious I'm missing? Sync status looks normal, and I have tested this
with a variety of new and old buckets; the behavior always stays the same:
nothing with a slash syncs, but everything without does.
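
(For reference, "sync status looks normal" refers to the overall output of
`radosgw-admin sync status`; I assume the next step would be per-bucket checks
such as `radosgw-admin bucket sync status --bucket=<bucket>` and
`radosgw-admin sync error list`, which I still need to dig into.)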

Thanks in advance,
-Matt Dunavant
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: rbd persistent cache configuration

2024-01-08 Thread Peter
 rbd --version
ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: rbd persistent cache configuration

2024-01-08 Thread Ilya Dryomov
On Mon, Jan 8, 2024 at 10:43 PM Peter  wrote:
>
>  rbd --version
> ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus 
> (stable)

Hi Peter,

The PWL cache was introduced in Pacific (16.2.z).
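
For anyone on 16.2.z or later who wants to try it, I believe the basic per-pool
client configuration looks roughly like this (pool name and cache path are
placeholders; please check the image-cache documentation for the exact option
set and defaults):

rbd config pool set <pool> rbd_plugins pwl_cache
rbd config pool set <pool> rbd_persistent_cache_mode ssd
rbd config pool set <pool> rbd_persistent_cache_path /mnt/pwl-cache
rbd config pool set <pool> rbd_persistent_cache_size 1073741824   # 1 GiB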

Thanks,

Ilya
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: REST API Endpoint Failure - Request For Where To Look To Resolve

2024-01-08 Thread Nizamudeen A
Hi,

Niz is totally fine ;) and good to hear the issue is resolved.

Regards,

On Sun, Jan 7, 2024 at 9:05 AM duluxoz  wrote:

> Hi Niz (may I call you "Niz"?)
>
> So, with the info you provided I was able to find what the issue was in
> the logs (now I know where the darn things are!) and so we have resolved
> our problem - a mis-configured port number, obvious when you think about
> it - and so I'd like to thank you once again for all of your patience and
> help.
>
> Cheers
>
> Dulux-oz
> On 05/01/2024 20:39, Nizamudeen A wrote:
>
> ah sorry for that. Outside the cephadm shell, if you do cephadm ls | grep
> "mgr.", that should give you the mgr container name. It should look
> something like this
> [root@ceph-node-00 ~]# cephadm ls | grep "mgr."
> "name": "mgr.ceph-node-00.aoxbdg",
> "systemd_unit":
> "ceph-e877a630-abaa-11ee-b7ce-52540097c...@mgr.ceph-node-00.aoxbdg"
> ,
> "service_name": "mgr",
>
> and you can use that name to see the logs.
>
> On Fri, Jan 5, 2024 at 3:04 PM duluxoz  wrote:
>
>> Yeah, that's what I meant when I said I'm new to podman and containers -
>> so, stupid Q: what is the "typical" name for a given container, e.g. if the
>> server is "node1", is the management container "mgr.node1" or something
>> similar?
>>
>> And thanks for the help - I really *do* appreciate it.  :-)
>> On 05/01/2024 20:30, Nizamudeen A wrote:
>>
>> ah yeah, its usually inside the container so you'll need to check the mgr
>> container for the logs.
>> cephadm logs -n <daemon-name>
>>
>> also cephadm has
>> its own log channel which can be used to get the logs.
>>
>> https://docs.ceph.com/en/quincy/cephadm/operations/#watching-cephadm-log-messages
>>
>> On Fri, Jan 5, 2024 at 2:54 PM duluxoz  wrote:
>>
>>> Yeap, can do - are the relevant logs in the "usual" place or buried
>>> somewhere inside some sort of container (typically)?  :-)
>>> On 05/01/2024 20:14, Nizamudeen A wrote:
>>>
>>> no, the error message is not clear enough to deduce an error. could you
>>> perhaps share the mgr logs at the time of the error? It could have some
>>> tracebacks
>>> which can give more info to debug it further.
>>>
>>> Regards,
>>>
>>> On Fri, Jan 5, 2024 at 2:00 PM duluxoz  wrote:
>>>
 Hi Nizam,

 Yeap, done all that - we're now at the point of creating the iSCSI
 Target(s) for the gateway (via the Dashboard and/or the CLI: see the error
 message in the OP) - any ideas?  :-)

 Cheers

 Dulux-Oz
 On 05/01/2024 19:10, Nizamudeen A wrote:

 Hi,

 You can find the APIs associated with the iscsi here:
 https://docs.ceph.com/en/reef/mgr/ceph_api/#iscsi

 and if you create iscsi service through dashboard or cephadm, it should
 add the iscsi gateways to the dashboard.
 you can view them by issuing *ceph dashboard iscsi-gateway-list* and
 you can add or remove gateways manually by

 ceph dashboard iscsi-gateway-add -i <service-url-file> [<gateway-name>]
 ceph dashboard iscsi-gateway-rm <gateway-name>

 which you can find the documentation here:
 https://docs.ceph.com/en/quincy/mgr/dashboard/#enabling-iscsi-management

 Regards,
 Nizam




 On Fri, Jan 5, 2024 at 12:53 PM duluxoz  wrote:

> Hi All,
>
> A little help please.
>
> TL/DR: Please help with error message:
> ~~~
> REST API failure, code : 500
> Unable to access the configuration object
> Unable to contact the local API endpoint (https://localhost:5000/api)
> ~~~
>
> The Issue
> 
>
> I've been through the documentation and can't find what I'm looking
> for
> - possibly because I'm not really sure what it is I *am* looking for,
> so
> if someone can point me in the right direction I would really
> appreciate it.
>
> I get the above error message when I run the `gwcli` command from
> inside
> a cephadm shell.
>
> What I'm trying to do is set up a set of iSCSI Gateways in our Ceph-Reef
> 18.2.1 Cluster (yes, I know it's being deprecated as of Nov 22 - or
> whatever). We recently migrated / upgraded from a manual install of
> Quincy to a CephAdm install of Reef - everything went AOK *except* for
> the iSCSI Gateways. So we tore them down and then rebuilt them as per
> the latest documentation. So now we've got 3 gateways as per the Service
> page of the Dashboard and I'm trying to create the targets.
>
> I tried via the Dashboard but had errors, so instead I went in to do it
> via gwcli and hit the above error (which I now believe to be the cause of
> the GUI creation errors I encountered).
>
> I have absolutely no experience with podman or containers in general,
> and can't work out how to fix the issue. So I'm requesting some help -
> not to solve the problem for me, but to point me in the right
> direction
> to solve it myself.  :-)
>
> So, anyone?
>
> Cheers
> Dul