[ceph-users] Re: [DOC] Openstack with RBD DOC update?
Hi, so there currently is a section on how to configure nova [0], but it refers to the client-side ceph.conf, not the rbd details in nova.conf, as Ilya already pointed out. I'll just add what I have in one of my test clusters in the [libvirt] section of the nova.conf (we use it identically in our production clusters):

[libvirt]
virt_type = kvm
live_migration_uri = "qemu+ssh://%s/system"
live_migration_flag = VIR_MIGRATE_UNDEFINE_SOURCE, VIR_MIGRATE_PEER2PEER, VIR_MIGRATE_LIVE
cpu_mode = host-passthrough
disk_cachemodes = network=writeback
images_type = rbd
images_rbd_pool = vms
images_rbd_ceph_conf = /etc/ceph/ceph.conf
rbd_user = cinder
rbd_secret_uuid =

Maybe leave out the non-rbd config options to only have a minimal conf in the docs? It is common to have the cinder user configured for nova as well because it requires access to both ephemeral disks and persistent volumes (just mentioning that in case it's not commonly known). And this permission topic brings me to a thread [1] Christian Rohmann brought up on the openstack-discuss mailing list. If it's not the right place to bring this up, please ignore this section. There have been changes regarding glance permissions and the (openstack) docs are not consistent anymore, maybe someone from the ceph team could assist and get them consistent again? CC'ed Christian here as well. The ceph docs don't mention any other permissions than for the images pool, so the question is: e) Instead of trial and error on the "rados_*"-prefixed objects required, maybe it makes sense to have someone from Ceph look into this to define which caps are actually required to allow for list_children on RBD images with children in other pools? @Christian: regarding auth caps this was the main question, right? Thanks, Eugen

[0] https://docs.ceph.com/en/latest/rbd/rbd-openstack/#configuring-nova
[1] https://lists.openstack.org/archives/list/openstack-disc...@lists.openstack.org/message/JVZHT4O45ZBMDEMLE7W6JFH5KXD3SL7F/
[2] https://docs.ceph.com/en/latest/rbd/rbd-openstack/#setup-ceph-client-authentication

Zitat von Zac Dover : You guys can just respond here and I’ll add your responses to the docs. Zac Sent from [Proton Mail](https://proton.me/mail/home) for iOS On Thu, Jan 25, 2024 at 05:52, Ilya Dryomov wrote: On Wed, Jan 24, 2024 at 7:31 PM Eugen Block wrote: We do like the separation of nova pools as well, and we also heavily use ephemeral disks instead of boot-from-volume instances. One of the reasons being that you can't detach a root volume from an instances. It helps in specific maintenance cases, so +1 for keeping it in the docs. So it seems like instead of dropping mentions of vms pool, we should expand "Configuring Nova" section where it says In order to boot virtual machines directly from Ceph volumes, you must configure the ephemeral backend for Nova. with appropriate steps and /etc/nova/nova.conf snippet. I'm guessing images_type = rbd images_rbd_pool = vms images_rbd_ceph_conf = /etc/ceph/ceph.conf at a minimum? Zitat or Eugen, do you want to suggest a precise edit based on your working configuration for Zac to incorporate or perhaps even open a PR directly? Thanks, Ilya Zitat von Erik McCormick : > On Wed, Jan 24, 2024 at 10:02 AM Murilo Morais > wrote: > >> Good afternoon everybody! >> >> I have a question regarding the documentation... I was reviewing it and >> realized that the "vms" pool is not being used anywhere in the configs. 
>> >> The first mention of this pool was in commit 2eab1c1 and, in e9b13fa, the >> configuration section of nova.conf was removed, but the pool configuration >> remained there. >> >> Would it be correct to ignore all mentions of this pool (I don't see any >> use for it)? If so, it would be interesting to update the documentation. >> >> https://docs.ceph.com/en/latest/rbd/rbd-openstack/#create-a-pool > > > The use of that "vms" pool is for Nova to directly store "ephemeral" disks > in ceph instead of on local disk. It used to be described in the Ceph doc, > but seems to no longer be there. It's still in the Redhat version [1] > however. Wouldn't it be better to put that back instead of removing the > creation of the vms pool from the docs? Maybe there's a good reason we only > want to boot instances into volumes now, but I'm not aware of it. > > [1] - Section 3.4.3 of > https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html-single/ceph_block_device_to_openstack_guide/index > > -Erik > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing
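For completeness, the client authentication section of the docs [2] currently suggests caps along these lines for the cinder user (pool names are the doc examples, adjust to your naming; this is only the documented baseline, not an answer to the list_children/caps question above):

  ceph auth get-or-create client.cinder \
      mon 'profile rbd' \
      osd 'profile rbd pool=volumes, profile rbd pool=vms, profile rbd pool=images' \
      mgr 'profile rbd pool=volumes, profile rbd pool=vms'

Since nova reuses this client for the vms pool, the same key then covers both ephemeral disks and persistent volumes.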
[ceph-users] Re: Questions about the CRUSH details
On Thu, 25 Jan 2024 at 03:05, Henry lol wrote: > > Do you mean object location (osds) is initially calculated only using its > name and crushmap, > and then the result is reprocessed with the map of the PGs? > > and I'm still skeptical about computation on the client-side. > is it possible to obtain object location without computation on the client > because ceph-mon already updates that information to PG map? The client should not need to contact the mon for each object access, and every client can't have a complete list of millions of objects in the cluster, so it does client-side computations. The mon connection will more or less only require new updates if/when OSDs change weight or go in/out. This way, clients can run on "autopilot" even if all mons are down, as long as OSD states don't change. -- May the most significant bit of your life be positive. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
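If you want to see that computation in action, the CLI can print the mapping a client would compute for any object name (pool and object name below are just examples):

  ceph osd map rbd some-object-name

which shows the PG the name hashes to plus the up and acting OSD sets derived from the current osdmap and crush rules.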
[ceph-users] Re: Ceph 16.2.14: ceph-mgr getting oom-killed
I have to say that not including a fix for a serious issue into the last minor release of Pacific is a rather odd decision. /Z On Thu, 25 Jan 2024 at 09:00, Konstantin Shalygin wrote: > Hi, > > The backport to pacific was rejected [1], you may switch to reef, when [2] > merged and released > > > [1] https://github.com/ceph/ceph/pull/55109 > [2] https://github.com/ceph/ceph/pull/55110 > > k > Sent from my iPhone > > > On Jan 25, 2024, at 04:12, changzhi tan <544463...@qq.com> wrote: > > > > Is there any way to solve this problem?thanks > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Ceph 16.2.14: ceph-mgr getting oom-killed
Hi, The backport to pacific was rejected [1], you may switch to reef, when [2] merged and released [1] https://github.com/ceph/ceph/pull/55109 [2] https://github.com/ceph/ceph/pull/55110 k Sent from my iPhone > On Jan 25, 2024, at 04:12, changzhi tan <544463...@qq.com> wrote: > > Is there any way to solve this problem?thanks ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Ceph 16.2.14: ceph-mgr getting oom-killed
I found that quickly restarting the affected mgr every 2 days is an okay kludge. It takes less than a second to restart, and never grows to dangerous sizes which is when it randomly starts ballooning. /Z On Thu, 25 Jan 2024, 03:12 changzhi tan, <544463...@qq.com> wrote: > Is there any way to solve this problem?thanks > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
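If anyone wants to automate that kludge until a fix lands, a cron entry along these lines should do it (the mgr daemon name is a made-up placeholder from a cephadm cluster; `ceph mgr fail` is an alternative that simply forces a failover to a standby mgr):

  # /etc/cron.d/ceph-mgr-restart -- example only, adjust daemon name and schedule
  0 3 */2 * * root ceph orch daemon restart mgr.host1.abcdef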
[ceph-users] Re: Questions about the CRUSH details
Do you mean object location (osds) is initially calculated only using its name and crushmap, and then the result is reprocessed with the map of the PGs? and I'm still skeptical about computation on the client-side. is it possible to obtain object location without computation on the client because ceph-mon already updates that information to PG map? 2024년 1월 25일 (목) 오전 3:08, David C. 님이 작성: > Hi, > > The client calculates the location (PG) of an object from its name and the > crushmap. > This is what makes it possible to parallelize the flows directly from the > client. > > The client also has the map of the PGs which are relocated to other OSDs > (upmap, temp, etc.) > > > Cordialement, > > *David CASIER* > > > > > Le mer. 24 janv. 2024 à 17:49, Henry lol a > écrit : > >> Hello, I'm new to ceph and sorry in advance for the naive questions. >> >> 1. >> As far as I know, CRUSH utilizes the cluster map consisting of the PG >> map and others. >> I don't understand why CRUSH computation is required on client-side, >> even though PG-to-OSDs mapping can be acquired from the PG map. >> >> 2. >> how does the client get a valid(old) OSD set when the PG is being >> remapped to a new ODS set which CRUSH returns? >> >> thanks. >> ___ >> ceph-users mailing list -- ceph-users@ceph.io >> To unsubscribe send an email to ceph-users-le...@ceph.io >> > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Ceph 16.2.14: ceph-mgr getting oom-killed
Is there any way to solve this problem?thanks ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: [DOC] Openstack with RBD DOC update?
You guys can just respond here and I’ll add your responses to the docs. Zac Sent from [Proton Mail](https://proton.me/mail/home) for iOS On Thu, Jan 25, 2024 at 05:52, Ilya Dryomov <[idryo...@gmail.com](mailto:On Thu, Jan 25, 2024 at 05:52, Ilya Dryomov < wrote: > On Wed, Jan 24, 2024 at 7:31 PM Eugen Block wrote: >> >> We do like the separation of nova pools as well, and we also heavily >> use ephemeral disks instead of boot-from-volume instances. One of the >> reasons being that you can't detach a root volume from an instances. >> It helps in specific maintenance cases, so +1 for keeping it in the >> docs. > > So it seems like instead of dropping mentions of vms pool, we should > expand "Configuring Nova" section where it says > > In order to boot virtual machines directly from Ceph volumes, you > must configure the ephemeral backend for Nova. > > with appropriate steps and /etc/nova/nova.conf snippet. I'm guessing > > images_type = rbd > images_rbd_pool = vms > images_rbd_ceph_conf = /etc/ceph/ceph.conf > > at a minimum? > > Zitat or Eugen, do you want to suggest a precise edit based on your > working configuration for Zac to incorporate or perhaps even open a PR > directly? > > Thanks, > > Ilya > >> >> Zitat von Erik McCormick : >> >> > On Wed, Jan 24, 2024 at 10:02 AM Murilo Morais >> > wrote: >> > >> >> Good afternoon everybody! >> >> >> >> I have a question regarding the documentation... I was reviewing it and >> >> realized that the "vms" pool is not being used anywhere in the configs. >> >> >> >> The first mention of this pool was in commit 2eab1c1 and, in e9b13fa, the >> >> configuration section of nova.conf was removed, but the pool configuration >> >> remained there. >> >> >> >> Would it be correct to ignore all mentions of this pool (I don't see any >> >> use for it)? If so, it would be interesting to update the documentation. >> >> >> >> https://docs.ceph.com/en/latest/rbd/rbd-openstack/#create-a-pool >> > >> > >> > The use of that "vms" pool is for Nova to directly store "ephemeral" disks >> > in ceph instead of on local disk. It used to be described in the Ceph doc, >> > but seems to no longer be there. It's still in the Redhat version [1] >> > however. Wouldn't it be better to put that back instead of removing the >> > creation of the vms pool from the docs? Maybe there's a good reason we only >> > want to boot instances into volumes now, but I'm not aware of it. >> > >> > [1] - Section 3.4.3 of >> > https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html-single/ceph_block_device_to_openstack_guide/index >> > >> > -Erik >> > ___ >> > ceph-users mailing list -- ceph-users@ceph.io >> > To unsubscribe send an email to ceph-users-le...@ceph.io >> >> >> ___ >> ceph-users mailing list -- ceph-users@ceph.io >> To unsubscribe send an email to ceph-users-le...@ceph.io > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: [DOC] Openstack with RBD DOC update?
On Wed, Jan 24, 2024 at 8:52 PM Ilya Dryomov wrote: > > On Wed, Jan 24, 2024 at 7:31 PM Eugen Block wrote: > > > > We do like the separation of nova pools as well, and we also heavily > > use ephemeral disks instead of boot-from-volume instances. One of the > > reasons being that you can't detach a root volume from an instances. > > It helps in specific maintenance cases, so +1 for keeping it in the > > docs. > > So it seems like instead of dropping mentions of vms pool, we should > expand "Configuring Nova" section where it says > > In order to boot virtual machines directly from Ceph volumes, you > must configure the ephemeral backend for Nova. > > with appropriate steps and /etc/nova/nova.conf snippet. I'm guessing > > images_type = rbd > images_rbd_pool = vms > images_rbd_ceph_conf = /etc/ceph/ceph.conf > > at a minimum? > > Zitat or Eugen, do you want to suggest a precise edit based on your Apologies, autocomplete fail... I meant Erik or Eugen of course. Ilya ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: [DOC] Openstack with RBD DOC update?
On Wed, Jan 24, 2024 at 7:31 PM Eugen Block wrote: > > We do like the separation of nova pools as well, and we also heavily > use ephemeral disks instead of boot-from-volume instances. One of the > reasons being that you can't detach a root volume from an instances. > It helps in specific maintenance cases, so +1 for keeping it in the > docs. So it seems like instead of dropping mentions of vms pool, we should expand "Configuring Nova" section where it says In order to boot virtual machines directly from Ceph volumes, you must configure the ephemeral backend for Nova. with appropriate steps and /etc/nova/nova.conf snippet. I'm guessing images_type = rbd images_rbd_pool = vms images_rbd_ceph_conf = /etc/ceph/ceph.conf at a minimum? Zitat or Eugen, do you want to suggest a precise edit based on your working configuration for Zac to incorporate or perhaps even open a PR directly? Thanks, Ilya > > Zitat von Erik McCormick : > > > On Wed, Jan 24, 2024 at 10:02 AM Murilo Morais > > wrote: > > > >> Good afternoon everybody! > >> > >> I have a question regarding the documentation... I was reviewing it and > >> realized that the "vms" pool is not being used anywhere in the configs. > >> > >> The first mention of this pool was in commit 2eab1c1 and, in e9b13fa, the > >> configuration section of nova.conf was removed, but the pool configuration > >> remained there. > >> > >> Would it be correct to ignore all mentions of this pool (I don't see any > >> use for it)? If so, it would be interesting to update the documentation. > >> > >> https://docs.ceph.com/en/latest/rbd/rbd-openstack/#create-a-pool > > > > > > The use of that "vms" pool is for Nova to directly store "ephemeral" disks > > in ceph instead of on local disk. It used to be described in the Ceph doc, > > but seems to no longer be there. It's still in the Redhat version [1] > > however. Wouldn't it be better to put that back instead of removing the > > creation of the vms pool from the docs? Maybe there's a good reason we only > > want to boot instances into volumes now, but I'm not aware of it. > > > > [1] - Section 3.4.3 of > > https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html-single/ceph_block_device_to_openstack_guide/index > > > > -Erik > > ___ > > ceph-users mailing list -- ceph-users@ceph.io > > To unsubscribe send an email to ceph-users-le...@ceph.io > > > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: [DOC] Openstack with RBD DOC update?
We do like the separation of nova pools as well, and we also heavily use ephemeral disks instead of boot-from-volume instances. One of the reasons being that you can't detach a root volume from an instances. It helps in specific maintenance cases, so +1 for keeping it in the docs. Zitat von Erik McCormick : On Wed, Jan 24, 2024 at 10:02 AM Murilo Morais wrote: Good afternoon everybody! I have a question regarding the documentation... I was reviewing it and realized that the "vms" pool is not being used anywhere in the configs. The first mention of this pool was in commit 2eab1c1 and, in e9b13fa, the configuration section of nova.conf was removed, but the pool configuration remained there. Would it be correct to ignore all mentions of this pool (I don't see any use for it)? If so, it would be interesting to update the documentation. https://docs.ceph.com/en/latest/rbd/rbd-openstack/#create-a-pool The use of that "vms" pool is for Nova to directly store "ephemeral" disks in ceph instead of on local disk. It used to be described in the Ceph doc, but seems to no longer be there. It's still in the Redhat version [1] however. Wouldn't it be better to put that back instead of removing the creation of the vms pool from the docs? Maybe there's a good reason we only want to boot instances into volumes now, but I'm not aware of it. [1] - Section 3.4.3 of https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html-single/ceph_block_device_to_openstack_guide/index -Erik ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Stupid question about ceph fs volume
Hi everyone, Stupid question about ceph fs volume create: how can I specify the metadata pool and the data pool? I was able to create a cephfs «manually» with something like

  ceph fs new vo cephfs_metadata cephfs_data

but as I understand the documentation, with this method I need to deploy the MDS myself, and the «new» way to do it is to use ceph fs volume. But with ceph fs volume I didn't find any documentation on how to set the metadata/data pools, and I also didn't find any way to change the pools after the volume has been created. Thanks -- Albert SHIH 嶺 France Heure locale/Local time: mer. 24 janv. 2024 19:24:23 CET ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
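For reference, the «manual» route mentioned above looks roughly like this on a cephadm cluster (fs and pool names follow the example in the question, the placement count is arbitrary; treat it as a sketch):

  ceph osd pool create cephfs_metadata
  ceph osd pool create cephfs_data
  ceph fs new vo cephfs_metadata cephfs_data
  ceph orch apply mds vo --placement=2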
[ceph-users] Re: Questions about the CRUSH details
Hi, The client calculates the location (PG) of an object from its name and the crushmap. This is what makes it possible to parallelize the flows directly from the client. The client also has the map of the PGs which are relocated to other OSDs (upmap, temp, etc.) Cordialement, *David CASIER* Le mer. 24 janv. 2024 à 17:49, Henry lol a écrit : > Hello, I'm new to ceph and sorry in advance for the naive questions. > > 1. > As far as I know, CRUSH utilizes the cluster map consisting of the PG > map and others. > I don't understand why CRUSH computation is required on client-side, > even though PG-to-OSDs mapping can be acquired from the PG map. > > 2. > how does the client get a valid(old) OSD set when the PG is being > remapped to a new ODS set which CRUSH returns? > > thanks. > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: [DOC] Openstack with RBD DOC update?
On Wed, Jan 24, 2024 at 10:02 AM Murilo Morais wrote: > Good afternoon everybody! > > I have a question regarding the documentation... I was reviewing it and > realized that the "vms" pool is not being used anywhere in the configs. > > The first mention of this pool was in commit 2eab1c1 and, in e9b13fa, the > configuration section of nova.conf was removed, but the pool configuration > remained there. > > Would it be correct to ignore all mentions of this pool (I don't see any > use for it)? If so, it would be interesting to update the documentation. > > https://docs.ceph.com/en/latest/rbd/rbd-openstack/#create-a-pool The use of that "vms" pool is for Nova to directly store "ephemeral" disks in ceph instead of on local disk. It used to be described in the Ceph doc, but seems to no longer be there. It's still in the Redhat version [1] however. Wouldn't it be better to put that back instead of removing the creation of the vms pool from the docs? Maybe there's a good reason we only want to boot instances into volumes now, but I'm not aware of it. [1] - Section 3.4.3 of https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html-single/ceph_block_device_to_openstack_guide/index -Erik ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Questions about the CRUSH details
Hello, I'm new to ceph and sorry in advance for the naive questions. 1. As far as I know, CRUSH utilizes the cluster map consisting of the PG map and others. I don't understand why CRUSH computation is required on client-side, even though PG-to-OSDs mapping can be acquired from the PG map. 2. how does the client get a valid(old) OSD set when the PG is being remapped to a new ODS set which CRUSH returns? thanks. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] CLT meeting notes January 24th 2024
- Build/package PRs - who to best review these?
  - Example: https://github.com/ceph/ceph/pull/55218
  - Idea: create a GitHub team specifically for these types of PRs https://github.com/orgs/ceph/teams
  - Laura will try to organize people for the group
- Pacific 16.2.15 status
  - Handful of PRs left in 16.2.15 tag https://github.com/ceph/ceph/pulls?q=is%3Apr+is%3Aopen+milestone%3Av16.2.15 that still need to be tested and merged
  - Yuri will begin testing RC after that
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: [DOC] Openstack with RBD DOC update?
Murilo, I'm looking into it. Zac Dover Upstream Documentation Ceph Foundation On Thursday, January 25th, 2024 at 1:01 AM, Murilo Morais wrote: > > > Good afternoon everybody! > > I have a question regarding the documentation... I was reviewing it and > realized that the "vms" pool is not being used anywhere in the configs. > > The first mention of this pool was in commit 2eab1c1 and, in e9b13fa, the > configuration section of nova.conf was removed, but the pool configuration > remained there. > > Would it be correct to ignore all mentions of this pool (I don't see any > use for it)? If so, it would be interesting to update the documentation. > > https://docs.ceph.com/en/latest/rbd/rbd-openstack/#create-a-pool > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] [DOC] Openstack with RBD DOC update?
Good afternoon everybody! I have a question regarding the documentation... I was reviewing it and realized that the "vms" pool is not being used anywhere in the configs. The first mention of this pool was in commit 2eab1c1 and, in e9b13fa, the configuration section of nova.conf was removed, but the pool configuration remained there. Would it be correct to ignore all mentions of this pool (I don't see any use for it)? If so, it would be interesting to update the documentation. https://docs.ceph.com/en/latest/rbd/rbd-openstack/#create-a-pool ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Degraded PGs on EC pool when marking an OSD out
Hi, Hector also claims that he observed an incomplete acting set after *adding* an OSD. Assuming that the cluster was health OK before that, this should not happen in theory. In practice it was observed with certain definitions of crush maps. There is, for example, the issue with "choose" and "chooseleaf" not doing the same thing in situations where they should. Another one was that spurious (temporary) allocations of PGs could exceed hard limits without being obvious or reported at all. Without seeing the crush maps it's hard to tell what is going on. With just 3 hosts and 4 OSDs per host the cluster might be hitting corner cases with such a wide EC profile. Having the osdmap of the cluster in normal conditions would allow simulating OSD downs and ups off-line, and one might gain insight into why crush fails to compute a complete acting set (yes, I'm not talking about the up set, I was always talking about the acting set). There might also be an issue with the PG-/OSD-map logs tracking the full history of the PGs in question. A possible way to test is to issue a re-peer command after all peering has finished on a PG with an incomplete acting set, to see if this resolves the PG. If so, there is a temporary condition that prevents the PGs from becoming clean when going through the standard peering procedure. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Eugen Block Sent: Wednesday, January 24, 2024 9:45 AM To: ceph-users@ceph.io Subject: [ceph-users] Re: Degraded PGs on EC pool when marking an OSD out Hi, this topic pops up every now and then, and although I don't have definitive proof for my assumptions I still stand with them. ;-) As the docs [2] already state, it's expected that PGs become degraded after some sort of failure (setting an OSD "out" falls into that category IMO): > It is normal for placement groups to enter “degraded” or “peering” > states after a component failure. Normally, these states reflect the > expected progression through the failure recovery process. However, > a placement group that stays in one of these states for a long time > might be an indication of a larger problem. And you report that your PGs do not stay in that state but eventually recover. My understanding is as follows: PGs have to be recreated on different hosts/OSDs after setting an OSD "out". During this transition (peering) the PGs are degraded until the newly assigned OSD have noticed their new responsibility (I'm not familiar with the actual data flow). The degraded state then clears as long as the out OSD is up (its PGs are active). If you stop that OSD ("down") the PGs become and stay degraded until they have been fully recreated on different hosts/OSDs. Not sure what impacts the duration until the degraded state clears, but in my small test cluster (similar osd tree as yours) the degraded state clears after a few seconds only, but I only have a few (almost empty) PGs in the EC test pool. I guess a comment from the devs couldn't hurt to clear this up. [2] https://docs.ceph.com/en/reef/rados/troubleshooting/troubleshooting-pg/#stuck-placement-groups Zitat von Hector Martin : On 2024/01/22 19:06, Frank Schilder wrote: You seem to have a problem with your crush rule(s): 14.3d ... [18,17,16,3,1,0,NONE,NONE,12] If you really just took out 1 OSD, having 2xNONE in the acting set indicates that your crush rule can't find valid mappings. 
You might >> need to tune crush tunables: >> https://docs.ceph.com/en/reef/rados/troubleshooting/troubleshooting-pg/?highlight=crush%20gives%20up#troubleshooting-pgs > > Look closely: that's the *acting* (second column) OSD set, not the *up* > (first column) OSD set. It's supposed to be the *previous* set of OSDs > assigned to that PG, but inexplicably some OSDs just "fall off" when the > PGs get remapped around. > > Simply waiting lets the data recover. At no point are any of my PGs > actually missing OSDs according to the current cluster state, and CRUSH > always finds a valid mapping. Rather the problem is that the *previous* > set of OSDs just loses some entries some for some reason. > > The same problem happens when I *add* an OSD to the cluster. For > example, right now, osd.15 is out. This is the state of one pg: > > 14.3d 1044 0 0 00 > 157307567310 0 1630 0 1630 > active+clean 2024-01-22T20:15:46.684066+0900 15550'1630 > 15550:16184 [18,17,16,3,1,0,11,14,12] 18 > [18,17,16,3,1,0,11,14,12] 18 15550'1629 > 2024-01-22T20:15:46.683491+0900 0'0 > 2024-01-08T15:18:21.654679+0900 02 > periodic scrub scheduled @ 2024-01-31T07:34:27.297723+0900 > 10430 > > Note the OSD list ([18,17,16,3,1,0,11,14,12]) > > Then I bring osd.15 in and: > > 14.3d 1044 0 1077 0
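A sketch of the off-line simulation and the re-peer test suggested above (pool id and pg id are simply the ones from this thread; osdmaptool options can differ slightly between releases):

  ceph osd getmap -o /tmp/osdmap
  osdmaptool /tmp/osdmap --test-map-pgs-dump --pool 14
  # on the live cluster, re-trigger peering for a suspect PG:
  ceph pg repeer 14.3d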
[ceph-users] cephx client key rotation
Hi, this question has come up once in the past[0] afaict, but it was kind of inconclusive so I'm taking the liberty of bringing it up again. I'm looking into implementing a key rotation scheme for Ceph client keys. As it potentially takes some non-zero amount of time to update key material there might be a situation where keys have changed on the MON side but, still one of N clients might not have updated key material and try to auth with an obsolete key which naturally would fail. It would be great if we could have two keys active for an entity at the same time, but aiui that's not really possible, is that right? I'm wondering about ceph auth get-or-create-pending. Per the docs a pending key would become active on first use, so that if one of N clients uses it, this still leaves room for another client to race. What do people do to deal with this situation? [0] https://ceph-users.ceph.narkive.com/ObSMdmxX/rotating-cephx-keys ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
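For context, the naive rotation flow I have in mind looks something like this (entity name and caps are made-up placeholders, and I haven't verified the import semantics, so treat it as a sketch). The race sits between the import and the last of the N clients receiving the new keyring:

  # generate a fresh key locally, carrying the entity's caps in the keyring
  ceph-authtool --create-keyring new.keyring --gen-key -n client.myapp \
      --cap mon 'profile rbd' --cap osd 'profile rbd pool=rbd'
  # replace the key on the cluster side
  ceph auth import -i new.keyring
  # distribute new.keyring to all clients; any client still presenting
  # the old key after this point fails to authenticate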
[ceph-users] Re: Cephadm orchestrator and special label _admin in 17.2.7
On 23.01.2024 18:19, Albert Shih wrote:
> Just like to know if it's a very bad idea to do a rsync of /etc/ceph from the «_admin» server to the other ceph cluster servers. I in fact add something like
>
> for host in `cat /usr/local/etc/ceph_list_noeuds.txt`
> do
>   /usr/bin/rsync -av /etc/ceph/ceph* $host:/etc/ceph/
> done
>
> in a cronjob

Why not just add the _admin label to the host and let Ceph do the job? You can also run this to get the ceph.conf copied to all hosts:

ceph config set mgr mgr/cephadm/manage_etc_ceph_ceph_conf true

Anyway, I don't see any problem with rsyncing it, it's just ceph.conf and the admin key. -- Kai Stian Olstad ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
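For reference, the label route is a single command (hostname is a placeholder), after which cephadm keeps /etc/ceph/ceph.conf and the admin keyring on that host up to date:

  ceph orch host label add <hostname> _admin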
[ceph-users] Re: Throughput metrics missing iwhen updating Ceph Quincy to Reef
Hi, Confirmed that this happens to me as well. After upgrading from 18.2.0 to 18.2.1 OSD metrics like: ceph_osd_op_* are missing from ceph-mgr. The Grafana dashboard also doesn't display all graphs correctly. ceph-dashboard/Ceph - Cluster : Capacity used, Cluster I/O, OSD Capacity Utilization, PGs per OSD curl http://localhost:9283/metrics | grep -i ceph_osd_op % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 38317 100 38317 0 0 9.8M 0 --:--:-- --:--:-- --:--:-- 12.1M Before the upgrading to reef 18.2.1 I could get all the metrics. Martin On 18/01/2024 12:32, Jose Vicente wrote: Hi, After upgrading from Quincy to Reef the ceph-mgr daemon is not throwing some throughput OSD metrics like: ceph_osd_op_* curl http://localhost:9283/metrics | grep -i ceph_osd_op % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 295k 100 295k 0 0 144M 0 --:--:-- --:--:-- --:--:-- 144M However I can get other metrics like: # curl http://localhost:9283/metrics | grep -i ceph_osd_apply # HELP ceph_osd_apply_latency_ms OSD stat apply_latency_ms # TYPE ceph_osd_apply_latency_ms gauge ceph_osd_apply_latency_ms{ceph_daemon="osd.275"} 152.0 ceph_osd_apply_latency_ms{ceph_daemon="osd.274"} 102.0 ... Before the upgrading to reef (from quincy) I I could get all the metrics. MGR module prometheus is enabled. Rocky Linux release 8.8 (Green Obsidian) ceph version 18.2.1 (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable) # netstat -nap | grep 9283 tcp 0 0 127.0.0.1:53834 127.0.0.1:9283 ESTABLISHED 3561/prometheus tcp6 0 0 :::9283 :::* LISTEN 804985/ceph-mgr Thanks, Jose C. ___ ceph-users mailing list --ceph-users@ceph.io To unsubscribe send an email toceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
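One thing that might be worth checking (an assumption on my side, not verified against this cluster): in Reef the prometheus mgr module no longer exports the per-daemon perf counters by default, they are meant to come from ceph-exporter instead, and the old behaviour can reportedly be re-enabled with:

  ceph config set mgr mgr/prometheus/exclude_perf_counters false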
[ceph-users] List contents of stray buckets with octopus
Hi all, I need to list the contents of the stray buckets on one of our MDSes. The MDS reports 772674 stray entries. However, if I dump its cache and grep for stray I get only 216 hits. How can I get to the contents of the stray buckets? Please note that Octopus is still hit by https://tracker.ceph.com/issues/57059 so a "dump tree" will not work. In addition, I clearly don't just need the entries in cache, I need a listing of everything. How can I get that? I'm willing to run rados commands and pipe through ceph-dencoder if necessary. Thanks and best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
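In case it helps: the stray directories are the MDS inodes 0x600 to 0x609 (for rank 0), so assuming the metadata pool is called cephfs_metadata, something like this should list the dentry names held in the stray buckets (untested sketch, adjust pool name):

  for i in 600 601 602 603 604 605 606 607 608 609; do
      rados -p cephfs_metadata listomapkeys ${i}.00000000
  done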
[ceph-users] Re: Scrubbing?
> [...] After a few days, I have on our OSD nodes around 90MB/s > read and 70MB/s write while 'ceph -s' have client io as > 2,5MB/s read and 50MB/s write. [...] This is one of my pet-peeves: that a storage system must have capacity (principally IOPS) to handle both a maintenance workload and a user workload, and since the former often involves whole-storage or whole-metadata operations it can be quite heavy, especially in the case of Ceph where rebalancing and scrubbing and checking should be fairly frequent to detect and correct inconsistencies. > Is this activity OK? [...] Indeed. Some "clever" people "save money" by "rightsizing" their storage so it cannot run at the same time the maintenance and the user workload, and so turn off the maintenance workload, because they "feel lucky" I guess, but I do not recommend that. :-). I have seen more than one Ceph cluster that did not have the capacity even to run *just* the maintenance workload. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: How many pool for cephfs
Le 24/01/2024 à 10:33:45+0100, Robert Sander a écrit Hi, > > On 1/24/24 10:08, Albert Shih wrote: > > > 99.99% because I'm newbie with ceph and don't understand clearly how > > the autorisation work with cephfs ;-) > > I strongly recommend you to ask for a expierenced Ceph consultant that helps > you design and setup your storage cluster. I known I'm working on (meaning I'm waiting my administration to do «what need to be done)... > > It looks like you try to make design decisions that will heavily influence > performance of the system. I'm well aware > > > If I say 20-30 it's because I currently have on my classic ZFS/NFS server > > around 25 «datasets» exported to various server. > > The next question is how would the "consumers" access the filesystem: Via > NFS or mounted directly. Even with the second option you can separate client > access via CephX keys as David already wrote. The separate client key would be more than enough for us. > > > Ok. I got for my ceph cluster two set of servers, first set are for > > services (mgr,mon,etc.) with ssd and don't currently run any osd (but still > > have 2 ssd not used), I also got a second set of server with HDD and 2 SSD. > > The data pool will be on > > the second set (with HDD). Where should I run the MDS and on which osd ? > > Do you intend to use the Ceph cluster only for archival storage? Mostly yes. > Hwo large is your second set of Ceph nodes, how many HDDs in each? Do you Huge ;-) I got 6 ceph server with ... 60 HDD. (I know, I know it's not ideal) > intend to use the SSDs for the OSDs' RocksDB? RocksDB ? no... > Where do you plan to store the metadata pools for CephFS? They should be That's exactly the question... My cluster are : 5 server with «small» ssd for service (each got 2 ssd no currently used) 6 server with «huge» HDD for data (each got 2 ssd no currently used) so for my cephfs metadata I can put them on my 5 servers for services (but that's mean the mds running on those 5 servers) or should I use the ssd on the 6 server who hold the OSD for data Thanks. Regards -- Albert SHIH 嶺 France Heure locale/Local time: mer. 24 janv. 2024 10:48:11 CET ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
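Side note: wherever the SSDs end up as OSDs, the metadata pool can be pinned to the ssd device class with a dedicated crush rule, roughly like this (rule and pool names are placeholders):

  ceph osd crush rule create-replicated replicated-ssd default host ssd
  ceph osd pool set cephfs_metadata crush_rule replicated-ssd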
[ceph-users] Re: How many pool for cephfs
Hi, On 1/24/24 10:08, Albert Shih wrote:
> 99.99% because I'm newbie with ceph and don't understand clearly how the autorisation work with cephfs ;-)
I strongly recommend you ask an experienced Ceph consultant to help you design and set up your storage cluster. It looks like you are trying to make design decisions that will heavily influence the performance of the system.
> If I say 20-30 it's because I currently have on my classic ZFS/NFS server around 25 «datasets» exported to various server.
The next question is how the "consumers" would access the filesystem: via NFS or mounted directly. Even with the second option you can separate client access via CephX keys as David already wrote.
> Ok. I got for my ceph cluster two set of servers, first set are for services (mgr,mon,etc.) with ssd and don't currently run any osd (but still have 2 ssd not used), I also got a second set of server with HDD and 2 SSD. The data pool will be on the second set (with HDD). Where should I run the MDS and on which osd ?
Do you intend to use the Ceph cluster only for archival storage? How large is your second set of Ceph nodes, how many HDDs in each? Do you intend to use the SSDs for the OSDs' RocksDB? Where do you plan to store the metadata pools for CephFS? They should be stored on fast media.
Regards -- Robert Sander Heinlein Consulting GmbH Schwedter Str. 8/9b, 10119 Berlin https://www.heinlein-support.de Tel: 030 / 405051-43 Fax: 030 / 405051-19 Amtsgericht Berlin-Charlottenburg - HRB 220009 B Geschäftsführer: Peer Heinlein - Sitz: Berlin ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: How many pool for cephfs
On 24/01/2024 at 10:23:20+0100, David C. wrote:
Hi,
> In this scenario, it is more consistent to work with subvolumes.
Ok. I will do that.
> Regarding security, you can use namespaces to isolate access at the OSD level.
Hum... I currently have no idea what you just said, but that's OK ;-)
> What Robert emphasizes is that creating pools dynamically is not without effect on the number of PGs and (therefore) on the architecture (PG per OSD, balancer, pg autoscaling, etc.)
Ok. No worries, I didn't know it was possible ;-)
Regards. JAS -- Albert SHIH 嶺 France Heure locale/Local time: mer. 24 janv. 2024 10:31:44 CET ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: How many pool for cephfs
Hi Albert, In this scenario, it is more consistent to work with subvolumes. Regarding security, you can use namespaces to isolate access at the OSD level. What Robert emphasizes is that creating pools dynamically is not without effect on the number of PGs and (therefore) on the architecture (PG per OSD, balancer, pg autoscaling, etc.) Cordialement, *David CASIER* Le mer. 24 janv. 2024 à 10:10, Albert Shih a écrit : > Le 24/01/2024 à 09:45:56+0100, Robert Sander a écrit > Hi > > > > > On 1/24/24 09:40, Albert Shih wrote: > > > > > Knowing I got two class of osd (hdd and ssd), and I have a need of ~ > 20/30 > > > cephfs (currently and that number will increase with time). > > > > Why do you need 20 - 30 separate CephFS instances? > > 99.99% because I'm newbie with ceph and don't understand clearly how > the autorisation work with cephfs ;-) > > If I say 20-30 it's because I currently have on my classic ZFS/NFS server > around 25 «datasets» exported to various server. > > But because you question I understand I can put many export «inside» one > cephfs. > > > > and put all my cephfs inside two of them. Or should I create for each > > > cephfs a couple of pool metadata/data ? > > > > Each CephFS instance needs their own pools, at least two (data + > metadata) > > per instance. And each CephFS needs at least one MDS running, better > with an > > additional cold or even hot standby MDS. > > Ok. I got for my ceph cluster two set of servers, first set are for > services (mgr,mon,etc.) with ssd and don't currently run any osd (but still > have 2 ssd not used), I also got a second set of server with HDD and 2 > SSD. The data pool will be on > the second set (with HDD). Where should I run the MDS and on which osd ? > > > > > > Il will also need to have ceph S3 storage, same question, should I > have a > > > designated pool for S3 storage or can/should I use the same > > > cephfs_data_replicated/erasure pool ? > > > > No, S3 needs its own pools. It cannot re-use CephFS pools. > > Ok thanks. > > Regards > -- > Albert SHIH 嶺 > France > Heure locale/Local time: > mer. 24 janv. 2024 09:55:26 CET > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
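To make the isolation point a bit more concrete: a per-tenant client can be restricted either by path, e.g.

  ceph fs authorize cephfs client.project1 /project1 rw

or at the OSD level by adding a namespace to its caps, e.g. osd 'allow rw pool=cephfs_data namespace=project1' (fs, pool, path and client names here are placeholders, not a recommendation for this setup).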
[ceph-users] Re: How many pool for cephfs
Le 24/01/2024 à 09:45:56+0100, Robert Sander a écrit Hi > > On 1/24/24 09:40, Albert Shih wrote: > > > Knowing I got two class of osd (hdd and ssd), and I have a need of ~ 20/30 > > cephfs (currently and that number will increase with time). > > Why do you need 20 - 30 separate CephFS instances? 99.99% because I'm newbie with ceph and don't understand clearly how the autorisation work with cephfs ;-) If I say 20-30 it's because I currently have on my classic ZFS/NFS server around 25 «datasets» exported to various server. But because you question I understand I can put many export «inside» one cephfs. > > and put all my cephfs inside two of them. Or should I create for each > > cephfs a couple of pool metadata/data ? > > Each CephFS instance needs their own pools, at least two (data + metadata) > per instance. And each CephFS needs at least one MDS running, better with an > additional cold or even hot standby MDS. Ok. I got for my ceph cluster two set of servers, first set are for services (mgr,mon,etc.) with ssd and don't currently run any osd (but still have 2 ssd not used), I also got a second set of server with HDD and 2 SSD. The data pool will be on the second set (with HDD). Where should I run the MDS and on which osd ? > > > Il will also need to have ceph S3 storage, same question, should I have a > > designated pool for S3 storage or can/should I use the same > > cephfs_data_replicated/erasure pool ? > > No, S3 needs its own pools. It cannot re-use CephFS pools. Ok thanks. Regards -- Albert SHIH 嶺 France Heure locale/Local time: mer. 24 janv. 2024 09:55:26 CET ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: How many pool for cephfs
Hi, On 1/24/24 09:40, Albert Shih wrote: Knowing I got two class of osd (hdd and ssd), and I have a need of ~ 20/30 cephfs (currently and that number will increase with time). Why do you need 20 - 30 separate CephFS instances? and put all my cephfs inside two of them. Or should I create for each cephfs a couple of pool metadata/data ? Each CephFS instance needs their own pools, at least two (data + metadata) per instance. And each CephFS needs at least one MDS running, better with an additional cold or even hot standby MDS. Il will also need to have ceph S3 storage, same question, should I have a designated pool for S3 storage or can/should I use the same cephfs_data_replicated/erasure pool ? No, S3 needs its own pools. It cannot re-use CephFS pools. Regards -- Robert Sander Heinlein Consulting GmbH Schwedter Str. 8/9b, 10119 Berlin https://www.heinlein-support.de Tel: 030 / 405051-43 Fax: 030 / 405051-19 Amtsgericht Berlin-Charlottenburg - HRB 220009 B Geschäftsführer: Peer Heinlein - Sitz: Berlin ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Degraded PGs on EC pool when marking an OSD out
Hi, this topic pops up every now and then, and although I don't have definitive proof for my assumptions I still stand with them. ;-) As the docs [2] already state, it's expected that PGs become degraded after some sort of failure (setting an OSD "out" falls into that category IMO): It is normal for placement groups to enter “degraded” or “peering” states after a component failure. Normally, these states reflect the expected progression through the failure recovery process. However, a placement group that stays in one of these states for a long time might be an indication of a larger problem. And you report that your PGs do not stay in that state but eventually recover. My understanding is as follows: PGs have to be recreated on different hosts/OSDs after setting an OSD "out". During this transition (peering) the PGs are degraded until the newly assigned OSD have noticed their new responsibility (I'm not familiar with the actual data flow). The degraded state then clears as long as the out OSD is up (its PGs are active). If you stop that OSD ("down") the PGs become and stay degraded until they have been fully recreated on different hosts/OSDs. Not sure what impacts the duration until the degraded state clears, but in my small test cluster (similar osd tree as yours) the degraded state clears after a few seconds only, but I only have a few (almost empty) PGs in the EC test pool. I guess a comment from the devs couldn't hurt to clear this up. [2] https://docs.ceph.com/en/reef/rados/troubleshooting/troubleshooting-pg/#stuck-placement-groups Zitat von Hector Martin : On 2024/01/22 19:06, Frank Schilder wrote: You seem to have a problem with your crush rule(s): 14.3d ... [18,17,16,3,1,0,NONE,NONE,12] If you really just took out 1 OSD, having 2xNONE in the acting set indicates that your crush rule can't find valid mappings. You might need to tune crush tunables: https://docs.ceph.com/en/reef/rados/troubleshooting/troubleshooting-pg/?highlight=crush%20gives%20up#troubleshooting-pgs Look closely: that's the *acting* (second column) OSD set, not the *up* (first column) OSD set. It's supposed to be the *previous* set of OSDs assigned to that PG, but inexplicably some OSDs just "fall off" when the PGs get remapped around. Simply waiting lets the data recover. At no point are any of my PGs actually missing OSDs according to the current cluster state, and CRUSH always finds a valid mapping. Rather the problem is that the *previous* set of OSDs just loses some entries some for some reason. The same problem happens when I *add* an OSD to the cluster. For example, right now, osd.15 is out. 
This is the state of one pg: 14.3d 1044 0 0 00 157307567310 0 1630 0 1630 active+clean 2024-01-22T20:15:46.684066+0900 15550'1630 15550:16184 [18,17,16,3,1,0,11,14,12] 18 [18,17,16,3,1,0,11,14,12] 18 15550'1629 2024-01-22T20:15:46.683491+0900 0'0 2024-01-08T15:18:21.654679+0900 02 periodic scrub scheduled @ 2024-01-31T07:34:27.297723+0900 10430 Note the OSD list ([18,17,16,3,1,0,11,14,12]) Then I bring osd.15 in and: 14.3d 1044 0 1077 00 157307567310 0 1630 0 1630 active+recovery_wait+undersized+degraded+remapped 2024-01-22T22:52:22.700096+0900 15550'1630 15554:16163 [15,17,16,3,1,0,11,14,12] 15[NONE,17,16,3,1,0,11,14,12] 17 15550'1629 2024-01-22T20:15:46.683491+0900 0'0 2024-01-08T15:18:21.654679+0900 02 periodic scrub scheduled @ 2024-01-31T02:31:53.342289+0900 10430 So somehow osd.18 "vanished" from the acting list ([NONE,17,16,3,1,0,11,14,12]) as it is being replaced by 15 in the new up list ([15,17,16,3,1,0,11,14,12]). The data is in osd.18, but somehow Ceph forgot. It is possible that your low OSD count causes the "crush gives up too soon" issue. You might also consider to use a crush rule that places exactly 3 shards per host (examples were in posts just last week). Otherwise, it is not guaranteed that "... data remains available if a whole host goes down ..." because you might have 4 chunks on one of the hosts and fall below min_size (the failure domain of your crush rule for the EC profiles is OSD). That should be what my CRUSH rule does. It picks 3 hosts then picks 3 OSDs per host (IIUC). And oddly enough everything works for the other EC pool even though it shares the same CRUSH rule (just ignoring one OSD from it). To test if your crush rules can generate valid mappings, you can pull the osdmap of your cluster and use osdmaptool to experiment with it without risk of destroying anything. It allows you to try different crush rules and failure scenarios on off-line but real cluster
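For reference, a rule that picks exactly 3 hosts and then 3 OSDs per host looks roughly like this in a decompiled crushmap (rule name and id are placeholders, and the tries settings are just the commonly suggested values):

  rule ec_3x3 {
      id 2
      type erasure
      step set_chooseleaf_tries 5
      step set_choose_tries 100
      step take default
      step choose indep 3 type host
      step chooseleaf indep 3 type osd
      step emit
  }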
[ceph-users] How many pool for cephfs
Hi everyone, I'd like to know how many pools I should create for multiple cephfs? Knowing I got two classes of osd (hdd and ssd), and I have a need of ~ 20/30 cephfs (currently, and that number will increase with time). Should I create

  one cephfs_metadata_replicated
  one cephfs_data_replicated
  a few cephfs_data_erasure_coding (depending on k/m)

and put all my cephfs inside two of them? Or should I create for each cephfs a couple of pools metadata/data? I will also need to have ceph S3 storage, same question: should I have a designated pool for S3 storage or can/should I use the same cephfs_data_replicated/erasure pool? Regards -- Albert SHIH 嶺 France Heure locale/Local time: mer. 24 janv. 2024 09:33:09 CET ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io