[ceph-users] Re: [DOC] Openstack with RBD DOC update?

2024-01-24 Thread Eugen Block

Hi,

so there currently is a section on how to configure nova [0], but it
refers to the client-side ceph.conf, not the rbd details in nova.conf
as Ilya already pointed out. I'll just add what I have in the [libvirt]
section of nova.conf in one of my test clusters (we use it identically
in our production clusters):


[libvirt]
virt_type = kvm
live_migration_uri = "qemu+ssh://%s/system"
live_migration_flag = VIR_MIGRATE_UNDEFINE_SOURCE, VIR_MIGRATE_PEER2PEER, VIR_MIGRATE_LIVE

cpu_mode = host-passthrough
disk_cachemodes = network=writeback
images_type = rbd
images_rbd_pool = vms
images_rbd_ceph_conf = /etc/ceph/ceph.conf
rbd_user = cinder
rbd_secret_uuid = 

Maybe leave out the non-rbd config options so that the docs only show a
minimal conf? It is common to configure the cinder user for nova as
well, because nova requires access to both ephemeral disks and
persistent volumes (just mentioning that in case it's not commonly
known).
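
For comparison, a minimal rbd-only variant of the above (assuming the
cinder keyring and its libvirt secret are already in place on the
compute nodes, as per the docs) would be something like:

[libvirt]
images_type = rbd
images_rbd_pool = vms
images_rbd_ceph_conf = /etc/ceph/ceph.conf
rbd_user = cinder
rbd_secret_uuid = <uuid of the libvirt secret>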


And this permission topic brings me to a thread [1] Christian Rohmann  
brought up in the openstack-discuss mailing list. If it's not the  
right place to bring this up, please ignore this section.
There have been changes regarding glance permissions and the  
(openstack) docs are not consistent anymore, maybe someone from the  
ceph team could assist and get them consistent again? CC'ed Christian  
here as well.
The ceph docs [2] don't mention any permissions other than those for the
images pool, so the question is:


e) Instead of trial and error on the required "rados_*"-prefixed
objects, maybe it makes sense to have someone from Ceph look into
this to define which caps are actually required to allow for
list_children on RBD images with children in other pools?


@Christian: regarding auth caps this was the main question, right?
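
For context, the caps that the ceph docs [2] currently give the glance
client cover only the images pool, roughly:

ceph auth get-or-create client.glance \
  mon 'profile rbd' \
  osd 'profile rbd pool=images' \
  mgr 'profile rbd pool=images'

Whether (and how) these need to be extended so that list_children works
for clones living in other pools is exactly the open question above.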

Thanks,
Eugen

[0] https://docs.ceph.com/en/latest/rbd/rbd-openstack/#configuring-nova
[1] https://lists.openstack.org/archives/list/openstack-disc...@lists.openstack.org/message/JVZHT4O45ZBMDEMLE7W6JFH5KXD3SL7F/
[2] https://docs.ceph.com/en/latest/rbd/rbd-openstack/#setup-ceph-client-authentication


Zitat von Zac Dover :


You guys can just respond here and I’ll add your responses to the docs.

Zac

Sent from [Proton Mail](https://proton.me/mail/home) for iOS

On Thu, Jan 25, 2024 at 05:52, Ilya Dryomov <idryo...@gmail.com> wrote:



On Wed, Jan 24, 2024 at 7:31 PM Eugen Block  wrote:


We do like the separation of nova pools as well, and we also heavily
use ephemeral disks instead of boot-from-volume instances. One of the
reasons being that you can't detach a root volume from an instances.
It helps in specific maintenance cases, so +1 for keeping it in the
docs.


So it seems like instead of dropping mentions of vms pool, we should
expand "Configuring Nova" section where it says

In order to boot virtual machines directly from Ceph volumes, you
must configure the ephemeral backend for Nova.

with appropriate steps and /etc/nova/nova.conf snippet. I'm guessing

images_type = rbd
images_rbd_pool = vms
images_rbd_ceph_conf = /etc/ceph/ceph.conf

at a minimum?

Zitat or Eugen, do you want to suggest a precise edit based on your
working configuration for Zac to incorporate or perhaps even open a PR
directly?

Thanks,

Ilya



Zitat von Erik McCormick :

> On Wed, Jan 24, 2024 at 10:02 AM Murilo Morais 
> wrote:
>
>> Good afternoon everybody!
>>
>> I have a question regarding the documentation... I was reviewing it and
>> realized that the "vms" pool is not being used anywhere in the configs.
>>
>> The first mention of this pool was in commit 2eab1c1 and, in e9b13fa, the
>> configuration section of nova.conf was removed, but the pool configuration
>> remained there.
>>
>> Would it be correct to ignore all mentions of this pool (I don't see any
>> use for it)? If so, it would be interesting to update the documentation.
>>
>> https://docs.ceph.com/en/latest/rbd/rbd-openstack/#create-a-pool
>
>
> The use of that "vms" pool is for Nova to directly store "ephemeral" disks
> in ceph instead of on local disk. It used to be described in the Ceph doc,
> but seems to no longer be there. It's still in the Redhat version [1]
> however. Wouldn't it be better to put that back instead of removing the
> creation of the vms pool from the docs? Maybe there's a good reason we only
> want to boot instances into volumes now, but I'm not aware of it.
>
> [1] - Section 3.4.3 of
> https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html-single/ceph_block_device_to_openstack_guide/index
>
> -Erik
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Re: Questions about the CRUSH details

2024-01-24 Thread Janne Johansson
Den tors 25 jan. 2024 kl 03:05 skrev Henry lol :
>
> Do you mean object location (osds) is initially calculated only using its
> name and crushmap,
> and then the result is reprocessed with the map of the PGs?
>
> and I'm still skeptical about computation on the client-side.
> is it possible to obtain object location without computation on the client
> because ceph-mon already updates that information to PG map?

The client should not need to contact the mon for each object access,
and no client could hold a complete list of the millions of objects in
the cluster, so it does client-side computations.

The mon connection will more or less only deliver new updates if/when
OSDs change weight or go in/out. This way, clients can run on
"autopilot" even if all mons are down, as long as OSD states don't
change.
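
As a side note, the result of that client-side computation can be
checked at any time with (pool and object names are just examples; the
object doesn't even have to exist):

# prints the PG the object hashes to and the current up/acting OSD set
ceph osd map rbd some-object-name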

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph 16.2.14: ceph-mgr getting oom-killed

2024-01-24 Thread Zakhar Kirpichenko
I have to say that not including a fix for a serious issue in the last
minor release of Pacific is a rather odd decision.

/Z

On Thu, 25 Jan 2024 at 09:00, Konstantin Shalygin  wrote:

> Hi,
>
> The backport to pacific was rejected [1], you may switch to reef, when [2]
> merged and released
>
>
> [1] https://github.com/ceph/ceph/pull/55109
> [2] https://github.com/ceph/ceph/pull/55110
>
> k
> Sent from my iPhone
>
> > On Jan 25, 2024, at 04:12, changzhi tan <544463...@qq.com> wrote:
> >
> > Is there any way to solve this problem?thanks
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph 16.2.14: ceph-mgr getting oom-killed

2024-01-24 Thread Konstantin Shalygin
Hi,

The backport to pacific was rejected [1], you may switch to reef, when [2] 
merged and released


[1] https://github.com/ceph/ceph/pull/55109
[2] https://github.com/ceph/ceph/pull/55110

k
Sent from my iPhone

> On Jan 25, 2024, at 04:12, changzhi tan <544463...@qq.com> wrote:
> 
> Is there any way to solve this problem?thanks
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph 16.2.14: ceph-mgr getting oom-killed

2024-01-24 Thread Zakhar Kirpichenko
I found that restarting the affected mgr every 2 days is an okay
kludge. The restart takes less than a second, and this way the mgr never
grows to a dangerous size when it randomly starts ballooning.
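
If anyone wants to automate the same kludge, a rough sketch (assuming a
cephadm deployment and a made-up daemon name mgr.ceph01.abcdef) could be
a cron entry like:

# /etc/cron.d/ceph-mgr-restart: restart the mgr every 2 days at 03:00
0 3 */2 * * root /usr/bin/ceph orch daemon restart mgr.ceph01.abcdef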

/Z

On Thu, 25 Jan 2024, 03:12 changzhi tan, <544463...@qq.com> wrote:

> Is there any way to solve this problem?thanks
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Questions about the CRUSH details

2024-01-24 Thread Henry lol
Do you mean object location (osds) is initially calculated only using its
name and crushmap,
and then the result is reprocessed with the map of the PGs?

And I'm still skeptical about computation on the client side.
Is it possible to obtain the object location without computation on the client,
since ceph-mon already keeps that information up to date in the PG map?

On Thu, Jan 25, 2024 at 3:08 AM, David C. wrote:

> Hi,
>
> The client calculates the location (PG) of an object from its name and the
> crushmap.
> This is what makes it possible to parallelize the flows directly from the
> client.
>
> The client also has the map of the PGs which are relocated to other OSDs
> (upmap, temp, etc.)
> 
>
> Cordialement,
>
> *David CASIER*
> 
>
>
>
> Le mer. 24 janv. 2024 à 17:49, Henry lol  a
> écrit :
>
>> Hello, I'm new to ceph and sorry in advance for the naive questions.
>>
>> 1.
>> As far as I know, CRUSH utilizes the cluster map consisting of the PG
>> map and others.
>> I don't understand why CRUSH computation is required on client-side,
>> even though PG-to-OSDs mapping can be acquired from the PG map.
>>
>> 2.
>> how does the client get a valid(old) OSD set when the PG is being
>> remapped to a new ODS set which CRUSH returns?
>>
>> thanks.
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph 16.2.14: ceph-mgr getting oom-killed

2024-01-24 Thread changzhi tan
Is there any way to solve this problem?thanks
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [DOC] Openstack with RBD DOC update?

2024-01-24 Thread Zac Dover
You guys can just respond here and I’ll add your responses to the docs.

Zac

Sent from [Proton Mail](https://proton.me/mail/home) for iOS

On Thu, Jan 25, 2024 at 05:52, Ilya Dryomov <idryo...@gmail.com> wrote:

> On Wed, Jan 24, 2024 at 7:31 PM Eugen Block  wrote:
>>
>> We do like the separation of nova pools as well, and we also heavily
>> use ephemeral disks instead of boot-from-volume instances. One of the
>> reasons being that you can't detach a root volume from an instances.
>> It helps in specific maintenance cases, so +1 for keeping it in the
>> docs.
>
> So it seems like instead of dropping mentions of vms pool, we should
> expand "Configuring Nova" section where it says
>
> In order to boot virtual machines directly from Ceph volumes, you
> must configure the ephemeral backend for Nova.
>
> with appropriate steps and /etc/nova/nova.conf snippet. I'm guessing
>
> images_type = rbd
> images_rbd_pool = vms
> images_rbd_ceph_conf = /etc/ceph/ceph.conf
>
> at a minimum?
>
> Zitat or Eugen, do you want to suggest a precise edit based on your
> working configuration for Zac to incorporate or perhaps even open a PR
> directly?
>
> Thanks,
>
> Ilya
>
>>
>> Zitat von Erik McCormick :
>>
>> > On Wed, Jan 24, 2024 at 10:02 AM Murilo Morais 
>> > wrote:
>> >
>> >> Good afternoon everybody!
>> >>
>> >> I have a question regarding the documentation... I was reviewing it and
>> >> realized that the "vms" pool is not being used anywhere in the configs.
>> >>
>> >> The first mention of this pool was in commit 2eab1c1 and, in e9b13fa, the
>> >> configuration section of nova.conf was removed, but the pool configuration
>> >> remained there.
>> >>
>> >> Would it be correct to ignore all mentions of this pool (I don't see any
>> >> use for it)? If so, it would be interesting to update the documentation.
>> >>
>> >> https://docs.ceph.com/en/latest/rbd/rbd-openstack/#create-a-pool
>> >
>> >
>> > The use of that "vms" pool is for Nova to directly store "ephemeral" disks
>> > in ceph instead of on local disk. It used to be described in the Ceph doc,
>> > but seems to no longer be there. It's still in the Redhat version [1]
>> > however. Wouldn't it be better to put that back instead of removing the
>> > creation of the vms pool from the docs? Maybe there's a good reason we only
>> > want to boot instances into volumes now, but I'm not aware of it.
>> >
>> > [1] - Section 3.4.3 of
>> > https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html-single/ceph_block_device_to_openstack_guide/index
>> >
>> > -Erik
>> > ___
>> > ceph-users mailing list -- ceph-users@ceph.io
>> > To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>>
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [DOC] Openstack with RBD DOC update?

2024-01-24 Thread Ilya Dryomov
On Wed, Jan 24, 2024 at 8:52 PM Ilya Dryomov  wrote:
>
> On Wed, Jan 24, 2024 at 7:31 PM Eugen Block  wrote:
> >
> > We do like the separation of nova pools as well, and we also heavily
> > use ephemeral disks instead of boot-from-volume instances. One of the
> > reasons being that you can't detach a root volume from an instances.
> > It helps in specific maintenance cases, so +1 for keeping it in the
> > docs.
>
> So it seems like instead of dropping mentions of vms pool, we should
> expand "Configuring Nova" section where it says
>
> In order to boot virtual machines directly from Ceph volumes, you
> must configure the ephemeral backend for Nova.
>
> with appropriate steps and /etc/nova/nova.conf snippet.  I'm guessing
>
> images_type = rbd
> images_rbd_pool = vms
> images_rbd_ceph_conf = /etc/ceph/ceph.conf
>
> at a minimum?
>
> Zitat or Eugen, do you want to suggest a precise edit based on your

Apologies, autocomplete fail...  I meant Erik or Eugen of course.

Ilya
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [DOC] Openstack with RBD DOC update?

2024-01-24 Thread Ilya Dryomov
On Wed, Jan 24, 2024 at 7:31 PM Eugen Block  wrote:
>
> We do like the separation of nova pools as well, and we also heavily
> use ephemeral disks instead of boot-from-volume instances. One of the
> reasons being that you can't detach a root volume from an instances.
> It helps in specific maintenance cases, so +1 for keeping it in the
> docs.

So it seems like instead of dropping mentions of vms pool, we should
expand "Configuring Nova" section where it says

In order to boot virtual machines directly from Ceph volumes, you
must configure the ephemeral backend for Nova.

with appropriate steps and /etc/nova/nova.conf snippet.  I'm guessing

images_type = rbd
images_rbd_pool = vms
images_rbd_ceph_conf = /etc/ceph/ceph.conf

at a minimum?

Zitat or Eugen, do you want to suggest a precise edit based on your
working configuration for Zac to incorporate or perhaps even open a PR
directly?

Thanks,

Ilya

>
> Zitat von Erik McCormick :
>
> > On Wed, Jan 24, 2024 at 10:02 AM Murilo Morais 
> > wrote:
> >
> >>  Good afternoon everybody!
> >>
> >> I have a question regarding the documentation... I was reviewing it and
> >> realized that the "vms" pool is not being used anywhere in the configs.
> >>
> >> The first mention of this pool was in commit 2eab1c1 and, in e9b13fa, the
> >> configuration section of nova.conf was removed, but the pool configuration
> >> remained there.
> >>
> >> Would it be correct to ignore all mentions of this pool (I don't see any
> >> use for it)? If so, it would be interesting to update the documentation.
> >>
> >> https://docs.ceph.com/en/latest/rbd/rbd-openstack/#create-a-pool
> >
> >
> > The use of that "vms" pool is for Nova to directly store "ephemeral" disks
> > in ceph instead of on local disk. It used to be described in the Ceph doc,
> > but seems to no longer be there. It's still in the Redhat version [1]
> > however. Wouldn't it be better to put that back instead of removing the
> > creation of the vms pool from the docs? Maybe there's a good reason we only
> > want to boot instances into volumes now, but I'm not aware of it.
> >
> > [1] - Section 3.4.3 of
> > https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html-single/ceph_block_device_to_openstack_guide/index
> >
> > -Erik
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [DOC] Openstack with RBD DOC update?

2024-01-24 Thread Eugen Block
We do like the separation of nova pools as well, and we also heavily  
use ephemeral disks instead of boot-from-volume instances. One of the  
reasons being that you can't detach a root volume from an instance.
It helps in specific maintenance cases, so +1 for keeping it in the  
docs.


Zitat von Erik McCormick :


On Wed, Jan 24, 2024 at 10:02 AM Murilo Morais 
wrote:


 Good afternoon everybody!

I have a question regarding the documentation... I was reviewing it and
realized that the "vms" pool is not being used anywhere in the configs.

The first mention of this pool was in commit 2eab1c1 and, in e9b13fa, the
configuration section of nova.conf was removed, but the pool configuration
remained there.

Would it be correct to ignore all mentions of this pool (I don't see any
use for it)? If so, it would be interesting to update the documentation.

https://docs.ceph.com/en/latest/rbd/rbd-openstack/#create-a-pool



The use of that "vms" pool is for Nova to directly store "ephemeral" disks
in ceph instead of on local disk. It used to be described in the Ceph doc,
but seems to no longer be there. It's still in the Redhat version [1]
however. Wouldn't it be better to put that back instead of removing the
creation of the vms pool from the docs? Maybe there's a good reason we only
want to boot instances into volumes now, but I'm not aware of it.

[1] - Section 3.4.3 of
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html-single/ceph_block_device_to_openstack_guide/index

-Erik
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Stupid question about ceph fs volume

2024-01-24 Thread Albert Shih
Hi everyone,

Stupid question about 

  ceph fs volume create

how can I specify the metadata pool and the data pool ? 

I was able to create a cephfs «manually» with something like 

  ceph fs new vo cephfs_metadata cephfs_data

but as I understand the documentation, with this method I need to deploy
the mds, and the «new» way to do it is to use ceph fs volume. 

But with ceph fs volume I didn't find any documentation on how to set the
metadata/data pools.

I also didn't find any way to change the pools after the volume has been
created.
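
For reference, the «manual» route mentioned above, where the pools are
chosen explicitly and the MDS daemons are deployed afterwards (here with
the cephadm orchestrator, pool and fs names taken from the example),
looks roughly like this:

ceph osd pool create cephfs_metadata
ceph osd pool create cephfs_data
ceph fs new vo cephfs_metadata cephfs_data
# deploy two MDS daemons for the new filesystem
ceph orch apply mds vo --placement=2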

Thanks

-- 
Albert SHIH 嶺 
France
Heure locale/Local time:
mer. 24 janv. 2024 19:24:23 CET
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Questions about the CRUSH details

2024-01-24 Thread David C.
Hi,

The client calculates the location (PG) of an object from its name and the
crushmap.
This is what makes it possible to parallelize the flows directly from the
client.

The client also has the map of the PGs which are relocated to other OSDs
(upmap, temp, etc.)


Regards,

*David CASIER*




On Wed, Jan 24, 2024 at 17:49, Henry lol wrote:

> Hello, I'm new to ceph and sorry in advance for the naive questions.
>
> 1.
> As far as I know, CRUSH utilizes the cluster map consisting of the PG
> map and others.
> I don't understand why CRUSH computation is required on client-side,
> even though PG-to-OSDs mapping can be acquired from the PG map.
>
> 2.
> how does the client get a valid(old) OSD set when the PG is being
> remapped to a new ODS set which CRUSH returns?
>
> thanks.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [DOC] Openstack with RBD DOC update?

2024-01-24 Thread Erik McCormick
On Wed, Jan 24, 2024 at 10:02 AM Murilo Morais 
wrote:

>  Good afternoon everybody!
>
> I have a question regarding the documentation... I was reviewing it and
> realized that the "vms" pool is not being used anywhere in the configs.
>
> The first mention of this pool was in commit 2eab1c1 and, in e9b13fa, the
> configuration section of nova.conf was removed, but the pool configuration
> remained there.
>
> Would it be correct to ignore all mentions of this pool (I don't see any
> use for it)? If so, it would be interesting to update the documentation.
>
> https://docs.ceph.com/en/latest/rbd/rbd-openstack/#create-a-pool


The use of that "vms" pool is for Nova to directly store "ephemeral" disks
in ceph instead of on local disk. It used to be described in the Ceph doc,
but seems to no longer be there. It's still in the Redhat version [1]
however. Wouldn't it be better to put that back instead of removing the
creation of the vms pool from the docs? Maybe there's a good reason we only
want to boot instances into volumes now, but I'm not aware of it.

[1] - Section 3.4.3 of
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html-single/ceph_block_device_to_openstack_guide/index

-Erik
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Questions about the CRUSH details

2024-01-24 Thread Henry lol
Hello, I'm new to ceph and sorry in advance for the naive questions.

1.
As far as I know, CRUSH utilizes the cluster map consisting of the PG
map and others.
I don't understand why CRUSH computation is required on client-side,
even though PG-to-OSDs mapping can be acquired from the PG map.

2.
how does the client get a valid (old) OSD set when the PG is being
remapped to a new OSD set which CRUSH returns?

thanks.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] CLT meeting notes January 24th 2024

2024-01-24 Thread Adam King
   - Build/package PRs - who is best to review these?


   - Example: https://github.com/ceph/ceph/pull/55218


   - Idea: create a GitHub team specifically for these types of PRs
   https://github.com/orgs/ceph/teams


   - Laura will try to organize people for the group


   - Pacific 16.2.15 status


   - Handful of PRs left in 16.2.15 tag
   https://github.com/ceph/ceph/pulls?q=is%3Apr+is%3Aopen+milestone%3Av16.2.15
that still need to be tested and merged


   - Yuri will begin testing RC after that
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [DOC] Openstack with RBD DOC update?

2024-01-24 Thread Zac Dover
Murilo,

I'm looking into it.

Zac Dover
Upstream Documentation
Ceph Foundation




On Thursday, January 25th, 2024 at 1:01 AM, Murilo Morais 
 wrote:

> 
> 
> Good afternoon everybody!
> 
> I have a question regarding the documentation... I was reviewing it and
> realized that the "vms" pool is not being used anywhere in the configs.
> 
> The first mention of this pool was in commit 2eab1c1 and, in e9b13fa, the
> configuration section of nova.conf was removed, but the pool configuration
> remained there.
> 
> Would it be correct to ignore all mentions of this pool (I don't see any
> use for it)? If so, it would be interesting to update the documentation.
> 
> https://docs.ceph.com/en/latest/rbd/rbd-openstack/#create-a-pool
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] [DOC] Openstack with RBD DOC update?

2024-01-24 Thread Murilo Morais
 Good afternoon everybody!

I have a question regarding the documentation... I was reviewing it and
realized that the "vms" pool is not being used anywhere in the configs.

The first mention of this pool was in commit 2eab1c1 and, in e9b13fa, the
configuration section of nova.conf was removed, but the pool configuration
remained there.

Would it be correct to ignore all mentions of this pool (I don't see any
use for it)? If so, it would be interesting to update the documentation.

https://docs.ceph.com/en/latest/rbd/rbd-openstack/#create-a-pool
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Degraded PGs on EC pool when marking an OSD out

2024-01-24 Thread Frank Schilder
Hi,

Hector also claims that he observed an incomplete acting set after *adding* an 
OSD. Assuming that the cluster was HEALTH_OK before that, this should not 
happen in theory. In practice it has been observed with certain definitions of 
crush maps. There is, for example, the issue with "choose" and "chooseleaf" not 
doing the same thing in situations where they should. Another one was that 
spurious (temporary) allocations of PGs could exceed hard limits without being 
obvious or reported at all. Without seeing the crush maps it's hard to tell 
what is going on. With just 3 hosts and 4 OSDs per host the cluster might be 
hitting corner cases with such a wide EC profile.

Having the osdmap of the cluster in normal conditions would make it possible to 
simulate OSD downs and ups off-line, and one might gain insight into why crush 
fails to compute a complete acting set (yes, I'm not talking about the up set, 
I was always talking about the acting set). There might also be an issue with 
the PG-/OSD-map logs tracking the full history of the PGs in question.

A possible way to test is to issue a re-peer command after all peering has 
finished on a PG with an incomplete acting set, to see if this resolves the PG. 
If so, there is a temporary condition that prevents the PGs from becoming clean 
when going through the standard peering procedure.
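
For completeness, both checks can be done with standard tools; a rough
sketch, using pg 14.3d and osd.15 from the earlier messages:

# grab the current osdmap and test mappings off-line, nothing is changed
ceph osd getmap -o om
osdmaptool om --test-map-pg 14.3d
# same, but with osd.15 marked out in the off-line copy
osdmaptool om --mark-out 15 --test-map-pg 14.3d
# ask a PG with an incomplete acting set to go through peering again
ceph pg repeer 14.3d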

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eugen Block 
Sent: Wednesday, January 24, 2024 9:45 AM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Degraded PGs on EC pool when marking an OSD out

Hi,

this topic pops up every now and then, and although I don't have
definitive proof for my assumptions I still stand with them. ;-)
As the docs [2] already state, it's expected that PGs become degraded
after some sort of failure (setting an OSD "out" falls into that
category IMO):

> It is normal for placement groups to enter “degraded” or “peering”
> states after a component failure. Normally, these states reflect the
> expected progression through the failure recovery process. However,
> a placement group that stays in one of these states for a long time
> might be an indication of a larger problem.

And you report that your PGs do not stay in that state but eventually
recover. My understanding is as follows:
PGs have to be recreated on different hosts/OSDs after setting an OSD
"out". During this transition (peering) the PGs are degraded until the
newly assigned OSD have noticed their new responsibility (I'm not
familiar with the actual data flow). The degraded state then clears as
long as the out OSD is up (its PGs are active). If you stop that OSD
("down") the PGs become and stay degraded until they have been fully
recreated on different hosts/OSDs. Not sure what impacts the duration
until the degraded state clears, but in my small test cluster (similar
osd tree as yours) the degraded state clears after a few seconds only,
but I only have a few (almost empty) PGs in the EC test pool.

I guess a comment from the devs couldn't hurt to clear this up.

[2]
https://docs.ceph.com/en/reef/rados/troubleshooting/troubleshooting-pg/#stuck-placement-groups

Zitat von Hector Martin :

> On 2024/01/22 19:06, Frank Schilder wrote:
>> You seem to have a problem with your crush rule(s):
>>
>> 14.3d ... [18,17,16,3,1,0,NONE,NONE,12]
>>
>> If you really just took out 1 OSD, having 2xNONE in the acting set
>> indicates that your crush rule can't find valid mappings. You might
>> need to tune crush tunables:
>> https://docs.ceph.com/en/reef/rados/troubleshooting/troubleshooting-pg/?highlight=crush%20gives%20up#troubleshooting-pgs
>
> Look closely: that's the *acting* (second column) OSD set, not the *up*
> (first column) OSD set. It's supposed to be the *previous* set of OSDs
> assigned to that PG, but inexplicably some OSDs just "fall off" when the
> PGs get remapped around.
>
> Simply waiting lets the data recover. At no point are any of my PGs
> actually missing OSDs according to the current cluster state, and CRUSH
> always finds a valid mapping. Rather the problem is that the *previous*
> set of OSDs just loses some entries some for some reason.
>
> The same problem happens when I *add* an OSD to the cluster. For
> example, right now, osd.15 is out. This is the state of one pg:
>
> 14.3d   1044   0 0  00
> 157307567310   0  1630 0  1630
> active+clean  2024-01-22T20:15:46.684066+0900 15550'1630
> 15550:16184  [18,17,16,3,1,0,11,14,12]  18
> [18,17,16,3,1,0,11,14,12]  18 15550'1629
> 2024-01-22T20:15:46.683491+0900  0'0
> 2024-01-08T15:18:21.654679+0900  02
> periodic scrub scheduled @ 2024-01-31T07:34:27.297723+0900
> 10430
>
> Note the OSD list ([18,17,16,3,1,0,11,14,12])
>
> Then I bring osd.15 in and:
>
> 14.3d   1044   0  1077  0 

[ceph-users] cephx client key rotation

2024-01-24 Thread Peter Sabaini
Hi,

this question has come up once in the past[0] afaict, but it was kind of 
inconclusive so I'm taking the liberty of bringing it up again.

I'm looking into implementing a key rotation scheme for Ceph client keys. As it 
potentially takes some non-zero amount of time to update key material, there 
might be a situation where keys have changed on the MON side but one of N 
clients has not yet updated its key material and tries to auth with an obsolete 
key, which naturally would fail.

It would be great if we could have two keys active for an entity at the same 
time, but aiui that's not really possible, is that right?

I'm wondering about ceph auth get-or-create-pending. Per the docs a pending key 
would become active on first use, so that if one of N clients uses it, this 
still leaves room for another client to race.

What do people do to deal with this situation?


[0] https://ceph-users.ceph.narkive.com/ObSMdmxX/rotating-cephx-keys
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cephadm orchestrator and special label _admin in 17.2.7

2024-01-24 Thread Kai Stian Olstad

On 23.01.2024 18:19, Albert Shih wrote:
Just like to know if it's a very bad idea to do a rsync of /etc/ceph from
the «_admin» server to the other ceph cluster servers.

I in fact add something like

for host in `cat /usr/local/etc/ceph_list_noeuds.txt`
do
  /usr/bin/rsync -av /etc/ceph/ceph* $host:/etc/ceph/
done

in a cronjob


Why not just add the _admin label to the host and let Ceph do the job?

You can also run this to get the ceph.conf copied to all host
ceph config set mgr/cephadm/manage_etc_ceph_ceph_conf true

Anyway, I don't see any problem with rsyncing it, it's just ceph.conf and 
the admin key.
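
For reference, the label approach is a one-liner per host (hostname is a
placeholder):

ceph orch host label add ceph-node-01 _admin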



--
Kai Stian Olstad
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Throughput metrics missing iwhen updating Ceph Quincy to Reef

2024-01-24 Thread Martin

Hi,

Confirmed that this happens to me as well.
After upgrading from 18.2.0 to 18.2.1 OSD metrics like: ceph_osd_op_* 
are missing from ceph-mgr.


The Grafana dashboard also doesn't display all graphs correctly.

ceph-dashboard/Ceph - Cluster : Capacity used, Cluster I/O, OSD Capacity 
Utilization, PGs per OSD


curl http://localhost:9283/metrics | grep -i ceph_osd_op
  % Total    % Received % Xferd  Average Speed   Time Time Time  
Current

 Dload  Upload   Total Spent    Left  Speed
100 38317  100 38317    0 0   9.8M  0 --:--:-- --:--:-- --:--:-- 
12.1M


Before upgrading to reef 18.2.1 I could get all the metrics.

Martin

On 18/01/2024 12:32, Jose Vicente wrote:

Hi,
After upgrading from Quincy to Reef the ceph-mgr daemon is not 
throwing some throughput OSD metrics like: ceph_osd_op_*

curl http://localhost:9283/metrics | grep -i ceph_osd_op
  % Total    % Received % Xferd  Average Speed   Time  Time     Time 
 Current
                                 Dload  Upload   Total Spent    Left 
 Speed
100  295k  100  295k    0     0   144M      0 --:--:-- --:--:-- 
--:--:--  144M

However I can get other metrics like:
# curl http://localhost:9283/metrics | grep -i ceph_osd_apply
# HELP ceph_osd_apply_latency_ms OSD stat apply_latency_ms
# TYPE ceph_osd_apply_latency_ms gauge
ceph_osd_apply_latency_ms{ceph_daemon="osd.275"} 152.0
ceph_osd_apply_latency_ms{ceph_daemon="osd.274"} 102.0
...
Before upgrading to reef (from quincy) I could get all the 
metrics. MGR module prometheus is enabled.

Rocky Linux release 8.8 (Green Obsidian)
ceph version 18.2.1 (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef 
(stable)

# netstat -nap | grep 9283
tcp        0      0 127.0.0.1:53834         127.0.0.1:9283     
 ESTABLISHED 3561/prometheus
tcp6       0      0 :::9283                 :::*      LISTEN     
 804985/ceph-mgr

Thanks,
Jose C.

___
ceph-users mailing list --ceph-users@ceph.io
To unsubscribe send an email toceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] List contents of stray buckets with octopus

2024-01-24 Thread Frank Schilder
Hi all,

I need to list the contents of the stray buckets on one of our MDSes. The MDS 
reports 772674 stray entries. However, if I dump its cache and grep for stray I 
get only 216 hits.

How can I get to the contents of the stray buckets?

Please note that Octopus is still hit by https://tracker.ceph.com/issues/57059 
so a "dump tree" will not work. In addition, I clearly don't just need the 
entries in cache, I need a listing of everything. How can I get that? I'm 
willing to run rados commands and pipe through ceph-dencoder if necessary.
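
One possible angle, assuming the usual MDS layout (the stray directories
of rank 0 are inodes 0x600..0x609, whose dirfrag objects live in the
CephFS metadata pool, here assumed to be called cephfs_metadata), would
be to list their omap keys directly:

# each omap key should correspond to one entry in that stray bucket
for i in 0 1 2 3 4 5 6 7 8 9; do
  rados -p cephfs_metadata listomapkeys 60${i}.00000000
done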

Thanks and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Scrubbing?

2024-01-24 Thread Peter Grandi
> [...] After a few days, I have on our OSD nodes around 90MB/s
> read and 70MB/s write while 'ceph -s' have client io as
> 2,5MB/s read and 50MB/s write. [...]

This is one of my pet-peeves: that a storage system must have
capacity (principally IOPS) to handle both a maintenance
workload and a user workload, and since the former often
involves whole-storage or whole-metadata operations it can be
quite heavy, especially in the case of Ceph where rebalancing
and scrubbing and checking should be fairly frequent to detect
and correct inconsistencies.

> Is this activity OK? [...]

Indeed. Some "clever" people "save money" by "rightsizing" their
storage so it cannot run at the same time the maintenance and
the user workload, and so turn off the maintenance workload,
because they "feel lucky" I guess, but I do not recommend that.
:-). I have seen more than one Ceph cluster that did not have
the capacity even to run *just* the maintenance workload.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How many pool for cephfs

2024-01-24 Thread Albert Shih
On 24/01/2024 at 10:33:45+0100, Robert Sander wrote:
Hi, 

> 
> On 1/24/24 10:08, Albert Shih wrote:
> 
> > 99.99% because I'm newbie with ceph and don't understand clearly how
> > the autorisation work with cephfs ;-)
> 
> I strongly recommend you to ask for a expierenced Ceph consultant that helps
> you design and setup your storage cluster.

I know, I'm working on it (meaning I'm waiting for my administration to do
«what needs to be done»)...

> 
> It looks like you try to make design decisions that will heavily influence
> performance of the system.

I'm well aware

> 
> > If I say 20-30 it's because I currently have on my classic ZFS/NFS server
> > around 25 «datasets» exported to various server.
> 
> The next question is how would the "consumers" access the filesystem: Via
> NFS or mounted directly. Even with the second option you can separate client
> access via CephX keys as David already wrote.

The separate client key would be more than enough for us. 

> 
> > Ok. I got for my ceph cluster two set of servers, first set are for
> > services (mgr,mon,etc.) with ssd and don't currently run any osd (but still
> > have 2 ssd not used), I also got a second set of server with HDD and 2 SSD. 
> > The data pool will be on
> > the second set (with HDD). Where should I run the MDS and on which osd ?
> 
> Do you intend to use the Ceph cluster only for archival storage?

Mostly yes. 

> Hwo large is your second set of Ceph nodes, how many HDDs in each? Do you

Huge ;-) 

I got 6 ceph servers with ... 60 HDDs. (I know, I know, it's not ideal)

> intend to use the SSDs for the OSDs' RocksDB?

RocksDB ? no...

> Where do you plan to store the metadata pools for CephFS? They should be

That's exactly the question...

My cluster is:

  5 servers with «small» ssd for services (each has 2 ssd not currently used)
  6 servers with «huge» HDD for data (each has 2 ssd not currently used)

so for my cephfs metadata I can put them on my 5 service servers (but that
means the mds would run on those 5 servers), or should I use the ssd on the
6 servers that hold the OSDs for data?
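
For what it's worth, wherever the SSD OSDs end up, a metadata pool is
usually tied to them with a device-class crush rule, roughly (rule and
pool names are only examples):

ceph osd crush rule create-replicated replicated-ssd default host ssd
ceph osd pool set cephfs_metadata crush_rule replicated-ssd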

Thanks.

Regards
-- 
Albert SHIH 嶺 
France
Heure locale/Local time:
mer. 24 janv. 2024 10:48:11 CET
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How many pool for cephfs

2024-01-24 Thread Robert Sander

Hi,

On 1/24/24 10:08, Albert Shih wrote:


99.99% because I'm newbie with ceph and don't understand clearly how
the autorisation work with cephfs ;-)


I strongly recommend that you ask an experienced Ceph consultant to 
help you design and set up your storage cluster.


It looks like you are trying to make design decisions that will heavily 
influence the performance of the system.



If I say 20-30 it's because I currently have on my classic ZFS/NFS server
around 25 «datasets» exported to various server.


The next question is how would the "consumers" access the filesystem: 
Via NFS or mounted directly. Even with the second option you can 
separate client access via CephX keys as David already wrote.



Ok. I got for my ceph cluster two set of servers, first set are for
services (mgr,mon,etc.) with ssd and don't currently run any osd (but still
have 2 ssd not used), I also got a second set of server with HDD and 2 SSD. The 
data pool will be on
the second set (with HDD). Where should I run the MDS and on which osd ?


Do you intend to use the Ceph cluster only for archival storage?
How large is your second set of Ceph nodes, how many HDDs in each? Do 
you intend to use the SSDs for the OSDs' RocksDB?
Where do you plan to store the metadata pools for CephFS? They should be 
stored on fast media.


Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin

https://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Amtsgericht Berlin-Charlottenburg - HRB 220009 B
Geschäftsführer: Peer Heinlein - Sitz: Berlin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How many pool for cephfs

2024-01-24 Thread Albert Shih
On 24/01/2024 at 10:23:20+0100, David C. wrote:
Hi, 

> 
> In this scenario, it is more consistent to work with subvolumes.

Ok. I will do that. 

> 
> Regarding security, you can use namespaces to isolate access at the OSD level.

Hum... I currently have no idea what you just said but that's OK ;-)

> 
> What Robert emphasizes is that creating pools dynamically is not without 
> effect
> on the number of PGs and (therefore) on the architecture (PG per OSD, 
> balancer,
> pg autoscaling, etc.)

Ok. No worries... I didn't know it was possible ;-)

Regards.

JAS
 
-- 
Albert SHIH 嶺 
France
Heure locale/Local time:
mer. 24 janv. 2024 10:31:44 CET
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How many pool for cephfs

2024-01-24 Thread David C.
Hi Albert,

In this scenario, it is more consistent to work with subvolumes.

Regarding security, you can use namespaces to isolate access at the OSD
level.

What Robert emphasizes is that creating pools dynamically is not without
effect on the number of PGs and (therefore) on the architecture (PG per
OSD, balancer, pg autoscaling, etc.)
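
To illustrate the subvolume idea, a per-«dataset» subvolume with its own
path-restricted cephx key could be created roughly like this (fs, group
and client names are made up):

ceph fs subvolumegroup create cephfs archive
ceph fs subvolume create cephfs dataset01 --group_name archive
# restrict the new client key to the path of that subvolume
ceph fs authorize cephfs client.dataset01 \
  "$(ceph fs subvolume getpath cephfs dataset01 --group_name archive)" rw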


Regards,

*David CASIER*




On Wed, Jan 24, 2024 at 10:10, Albert Shih wrote:

> Le 24/01/2024 à 09:45:56+0100, Robert Sander a écrit
> Hi
>
> >
> > On 1/24/24 09:40, Albert Shih wrote:
> >
> > > Knowing I got two class of osd (hdd and ssd), and I have a need of ~
> 20/30
> > > cephfs (currently and that number will increase with time).
> >
> > Why do you need 20 - 30 separate CephFS instances?
>
> 99.99% because I'm newbie with ceph and don't understand clearly how
> the autorisation work with cephfs ;-)
>
> If I say 20-30 it's because I currently have on my classic ZFS/NFS server
> around 25 «datasets» exported to various server.
>
> But because you question I understand I can put many export «inside» one
> cephfs.
>
> > > and put all my cephfs inside two of them. Or should I create for each
> > > cephfs a couple of pool metadata/data ?
> >
> > Each CephFS instance needs their own pools, at least two (data +
> metadata)
> > per instance. And each CephFS needs at least one MDS running, better
> with an
> > additional cold or even hot standby MDS.
>
> Ok. I got for my ceph cluster two set of servers, first set are for
> services (mgr,mon,etc.) with ssd and don't currently run any osd (but still
> have 2 ssd not used), I also got a second set of server with HDD and 2
> SSD. The data pool will be on
> the second set (with HDD). Where should I run the MDS and on which osd ?
>
> >
> > > Il will also need to have ceph S3 storage, same question, should I
> have a
> > > designated pool for S3 storage or can/should I use the same
> > > cephfs_data_replicated/erasure pool ?
> >
> > No, S3 needs its own pools. It cannot re-use CephFS pools.
>
> Ok thanks.
>
> Regards
> --
> Albert SHIH 嶺 
> France
> Heure locale/Local time:
> mer. 24 janv. 2024 09:55:26 CET
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How many pool for cephfs

2024-01-24 Thread Albert Shih
On 24/01/2024 at 09:45:56+0100, Robert Sander wrote:
Hi

> 
> On 1/24/24 09:40, Albert Shih wrote:
> 
> > Knowing I got two class of osd (hdd and ssd), and I have a need of ~ 20/30
> > cephfs (currently and that number will increase with time).
> 
> Why do you need 20 - 30 separate CephFS instances?

99.99% because I'm a newbie with ceph and don't clearly understand how 
the authorisation works with cephfs ;-)

If I say 20-30 it's because I currently have on my classic ZFS/NFS server
around 25 «datasets» exported to various servers. 

But because of your question I understand that I can put many exports
«inside» one cephfs. 

> > and put all my cephfs inside two of them. Or should I create for each
> > cephfs a couple of pool metadata/data ?
> 
> Each CephFS instance needs their own pools, at least two (data + metadata)
> per instance. And each CephFS needs at least one MDS running, better with an
> additional cold or even hot standby MDS.

Ok. For my ceph cluster I got two sets of servers: the first set is for
services (mgr, mon, etc.) with ssd and doesn't currently run any osd (but still
has 2 unused ssd); I also got a second set of servers with HDD and 2 SSD. The
data pool will be on the second set (with HDD). Where should I run the MDS,
and on which osd? 

> 
> > Il will also need to have ceph S3 storage, same question, should I have a
> > designated pool for S3 storage or can/should I use the same
> > cephfs_data_replicated/erasure pool ?
> 
> No, S3 needs its own pools. It cannot re-use CephFS pools.

Ok thanks. 

Regards
-- 
Albert SHIH 嶺 
France
Heure locale/Local time:
mer. 24 janv. 2024 09:55:26 CET
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How many pool for cephfs

2024-01-24 Thread Robert Sander

Hi,

On 1/24/24 09:40, Albert Shih wrote:


Knowing I got two class of osd (hdd and ssd), and I have a need of ~ 20/30
cephfs (currently and that number will increase with time).


Why do you need 20 - 30 separate CephFS instances?


and put all my cephfs inside two of them. Or should I create for each
cephfs a couple of pool metadata/data ?


Each CephFS instance needs its own pools, at least two (data + 
metadata) per instance. And each CephFS needs at least one MDS running, 
better with an additional cold or even hot standby MDS.



Il will also need to have ceph S3 storage, same question, should I have a
designated pool for S3 storage or can/should I use the same
cephfs_data_replicated/erasure pool ?


No, S3 needs its own pools. It cannot re-use CephFS pools.

Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin

https://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Amtsgericht Berlin-Charlottenburg - HRB 220009 B
Geschäftsführer: Peer Heinlein - Sitz: Berlin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Degraded PGs on EC pool when marking an OSD out

2024-01-24 Thread Eugen Block

Hi,

this topic pops up every now and then, and although I don't have  
definitive proof for my assumptions I still stand with them. ;-)
As the docs [2] already state, it's expected that PGs become degraded  
after some sort of failure (setting an OSD "out" falls into that  
category IMO):


It is normal for placement groups to enter “degraded” or “peering”  
states after a component failure. Normally, these states reflect the  
expected progression through the failure recovery process. However,  
a placement group that stays in one of these states for a long time  
might be an indication of a larger problem.


And you report that your PGs do not stay in that state but eventually  
recover. My understanding is as follows:
PGs have to be recreated on different hosts/OSDs after setting an OSD  
"out". During this transition (peering) the PGs are degraded until the  
newly assigned OSDs have noticed their new responsibility (I'm not  
familiar with the actual data flow). The degraded state then clears as  
long as the out OSD is up (its PGs are active). If you stop that OSD  
("down") the PGs become and stay degraded until they have been fully  
recreated on different hosts/OSDs. Not sure what impacts the duration  
until the degraded state clears, but in my small test cluster (similar  
osd tree as yours) the degraded state clears after a few seconds only,  
but I only have a few (almost empty) PGs in the EC test pool.


I guess a comment from the devs couldn't hurt to clear this up.

[2]  
https://docs.ceph.com/en/reef/rados/troubleshooting/troubleshooting-pg/#stuck-placement-groups


Zitat von Hector Martin :


On 2024/01/22 19:06, Frank Schilder wrote:

You seem to have a problem with your crush rule(s):

14.3d ... [18,17,16,3,1,0,NONE,NONE,12]

If you really just took out 1 OSD, having 2xNONE in the acting set  
indicates that your crush rule can't find valid mappings. You might  
need to tune crush tunables:  
https://docs.ceph.com/en/reef/rados/troubleshooting/troubleshooting-pg/?highlight=crush%20gives%20up#troubleshooting-pgs


Look closely: that's the *acting* (second column) OSD set, not the *up*
(first column) OSD set. It's supposed to be the *previous* set of OSDs
assigned to that PG, but inexplicably some OSDs just "fall off" when the
PGs get remapped around.

Simply waiting lets the data recover. At no point are any of my PGs
actually missing OSDs according to the current cluster state, and CRUSH
always finds a valid mapping. Rather the problem is that the *previous*
set of OSDs just loses some entries some for some reason.

The same problem happens when I *add* an OSD to the cluster. For
example, right now, osd.15 is out. This is the state of one pg:

14.3d   1044   0 0  00
157307567310   0  1630 0  1630
active+clean  2024-01-22T20:15:46.684066+0900 15550'1630
15550:16184  [18,17,16,3,1,0,11,14,12]  18
[18,17,16,3,1,0,11,14,12]  18 15550'1629
2024-01-22T20:15:46.683491+0900  0'0
2024-01-08T15:18:21.654679+0900  02
periodic scrub scheduled @ 2024-01-31T07:34:27.297723+0900
10430

Note the OSD list ([18,17,16,3,1,0,11,14,12])

Then I bring osd.15 in and:

14.3d   1044   0  1077  00
157307567310   0  1630 0  1630
active+recovery_wait+undersized+degraded+remapped
2024-01-22T22:52:22.700096+0900 15550'1630 15554:16163
[15,17,16,3,1,0,11,14,12]  15[NONE,17,16,3,1,0,11,14,12]
 17 15550'1629  2024-01-22T20:15:46.683491+0900
0'0  2024-01-08T15:18:21.654679+0900  02
 periodic scrub scheduled @ 2024-01-31T02:31:53.342289+0900
 10430

So somehow osd.18 "vanished" from the acting list
([NONE,17,16,3,1,0,11,14,12]) as it is being replaced by 15 in the new
up list ([15,17,16,3,1,0,11,14,12]). The data is in osd.18, but somehow
Ceph forgot.



It is possible that your low OSD count causes the "crush gives up  
too soon" issue. You might also consider to use a crush rule that  
places exactly 3 shards per host (examples were in posts just last  
week). Otherwise, it is not guaranteed that "... data remains  
available if a whole host goes down ..." because you might have 4  
chunks on one of the hosts and fall below min_size (the failure  
domain of your crush rule for the EC profiles is OSD).


That should be what my CRUSH rule does. It picks 3 hosts then picks 3
OSDs per host (IIUC). And oddly enough everything works for the other EC
pool even though it shares the same CRUSH rule (just ignoring one OSD
from it).

To test if your crush rules can generate valid mappings, you can  
pull the osdmap of your cluster and use osdmaptool to experiment  
with it without risk of destroying anything. It allows you to try  
different crush rules and failure scenarios on off-line but real  
cluster 

[ceph-users] How many pool for cephfs

2024-01-24 Thread Albert Shih
Hi everyone, 

I'd like to know how many pools I should create for multiple cephfs.

Knowing I got two classes of osd (hdd and ssd), and I need ~ 20/30
cephfs (currently, and that number will increase with time). 

Should I create 

  one cephfs_metadata_replicated
  one cephfs_data_replicated
  a few cephfs_data_erasure_coding (depending on k/m) 

and put all my cephfs inside two of them? Or should I create a couple of
pools (metadata/data) for each cephfs? 

I will also need to have ceph S3 storage, same question, should I have a
designated pool for S3 storage or can/should I use the same
cephfs_data_replicated/erasure pool ? 
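
For what it's worth, the pool layout sketched above would be created
along these lines (names, k/m and device classes are only examples; the
default data pool stays replicated and the EC pool is attached as an
additional data pool):

ceph osd pool create cephfs_metadata_replicated
ceph osd pool create cephfs_data_replicated
ceph osd erasure-code-profile set ec42-hdd k=4 m=2 crush-failure-domain=host crush-device-class=hdd
ceph osd pool create cephfs_data_ec erasure ec42-hdd
ceph osd pool set cephfs_data_ec allow_ec_overwrites true
ceph fs new myfs cephfs_metadata_replicated cephfs_data_replicated
ceph fs add_data_pool myfs cephfs_data_ec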

Regards

-- 
Albert SHIH 嶺 
France
Heure locale/Local time:
mer. 24 janv. 2024 09:33:09 CET
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io