[ceph-users] Re: cephadm custom crush location hooks

2024-05-03 Thread Wyll Ingersoll


Yeah, now that you mention it, I recall figuring that out also at some point. I 
think I did it originally when I was debugging the problem without the 
container.


From: Eugen Block 
Sent: Friday, May 3, 2024 8:37 AM
To: Wyll Ingersoll 
Cc: ceph-users@ceph.io 
Subject: Re: [ceph-users] cephadm custom crush location hooks

Hm, I wonder why the symlink is required, the OSDs map / to /rootfs
anyway (excerpt of unit.run file):

-v /:/rootfs

So I removed the symlink and just added /rootfs to the crush location hook:

ceph config set osd.0 crush_location_hook
/rootfs/usr/local/bin/custom_crush_location

After OSD restart the OSD finds its correct location. So I actually
only need to update the location path, nothing else, it seems.
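
For anyone following along, the full sequence on my end was roughly this (osd.0 as the example, hook path as above; the exact restart command may differ in your setup):

ceph config set osd.0 crush_location_hook /rootfs/usr/local/bin/custom_crush_location
ceph orch daemon restart osd.0     # the hook is only evaluated when the OSD starts
ceph osd find 0                    # shows the crush_location the OSD registered
ceph osd tree                      # confirm the OSD sits under the expected bucket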

Zitat von Eugen Block :

> I found your (open) tracker issue:
>
> https://tracker.ceph.com/issues/53562
>
> Your workaround works great, I tried it in a test cluster
> successfully. I will adopt it to our production cluster as well.
>
> Thanks!
> Eugen
>
> Zitat von Eugen Block :
>
>> Thank you very much for the quick response! I will take a look
>> first thing tomorrow and try that in a test cluster. But I agree,
>> it would be helpful to have a way with cephadm to apply these hooks
>> without these workarounds. I'll check if there's a tracker issue
>> for that, and create one if necessary.
>>
>> Thanks!
>> Eugen
>>
>> Zitat von Wyll Ingersoll :
>>
>>> I've found the crush location hook script code to be problematic
>>> in the containerized/cephadm world.
>>>
>>> Our workaround is to place the script in a common place on each
>>> OSD node, such as /etc/crush/crushhook.sh, and then make a link
>>> from /rootfs -> /, and set the configuration value so that the
>>> path to the hook script starts with /rootfs.  The container that
>>> the OSDs run in has access to /rootfs and this hack allows them to
>>> all view the crush script without having to manually modify unit
>>> files.
>>>
>>> For example:
>>>
>>> 1. put crushhook script on the host OS in /etc/crush/crushhook.sh
>>> 2. make a link on the host os:   $ cd /; sudo ln -s / /rootfs
>>> 3. ceph config set osd crush_location_hook /rootfs/etc/crush/crushhook.sh
>>>
>>>
>>> The containers see "/rootfs" and will then be able to access your
>>> script.  Be aware though that if your script requires any sort of
>>> elevated access, it may fail because the hook runs as ceph:ceph in
>>> a minimal container so not all functions are available.  I had to
>>> add lots of debug output and logging in mine (it's rather
>>> complicated) to figure out what was going on when it was running.
>>>
>>> I would love to see the "crush_location_hook" script be something
>>> that can be stored in the config entirely instead of as a link,
>>> similar to how the ssl certificates for RGW or the dashboard are
>>> stored (ceph config-key set ...).   The current situation is not
>>> ideal.
>>>
>>>
>>>
>>>
>>> 
>>> From: Eugen Block 
>>> Sent: Thursday, May 2, 2024 10:23 AM
>>> To: ceph-users@ceph.io 
>>> Subject: [ceph-users] cephadm custom crush location hooks
>>>
>>> Hi,
>>>
>>> we've been using custom crush location hooks for some OSDs [1] for
>>> years. Since we moved to cephadm, we always have to manually edit the
>>> unit.run file of those OSDs because the path to the script is not
>>> mapped into the containers. I don't want to define custom location
>>> hooks for all OSDs globally in the OSD spec, even if those are limited
>>> to two hosts only in our case. But I'm not aware of a method to target
>>> only specific OSDs to have some files mapped into the container [2].
>>> Is my assumption correct that we'll have to live with the manual
>>> intervention until we reorganize our osd tree? Or did I miss something?
>>>
>>> Thanks!
>>> Eugen
>>>
>>> [1]
>>> https://docs.ceph.com/en/latest/rados/operations/crush-map/#custom-location-hooks
>>> [2]
>>> https://docs.ceph.com/en/latest/cephadm/services/#mounting-files-with-extra-container-arguments
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephadm custom crush location hooks

2024-05-03 Thread Wyll Ingersoll
Thank you!

From: Eugen Block 
Sent: Friday, May 3, 2024 6:46 AM
To: Wyll Ingersoll 
Cc: ceph-users@ceph.io 
Subject: Re: [ceph-users] cephadm custom crush location hooks

I found your (open) tracker issue:

https://tracker.ceph.com/issues/53562

Your workaround works great, I tried it in a test cluster
successfully. I will adopt it to our production cluster as well.

Thanks!
Eugen

Zitat von Eugen Block :

> Thank you very much for the quick response! I will take a look first
> thing tomorrow and try that in a test cluster. But I agree, it would
> be helpful to have a way with cephadm to apply these hooks without
> these workarounds. I'll check if there's a tracker issue for that,
> and create one if necessary.
>
> Thanks!
> Eugen
>
> Zitat von Wyll Ingersoll :
>
>> I've found the crush location hook script code to be problematic in
>> the containerized/cephadm world.
>>
>> Our workaround is to place the script in a common place on each OSD
>> node, such as /etc/crush/crushhook.sh, and then make a link from
>> /rootfs -> /, and set the configuration value so that the path to
>> the hook script starts with /rootfs.  The container that the OSDs
>> run in has access to /rootfs and this hack allows them to all view
>> the crush script without having to manually modify unit files.
>>
>> For example:
>>
>>  1. put crushhook script on the host OS in /etc/crush/crushhook.sh
>>  2. make a link on the host os:   $ cd /; sudo ln -s / /rootfs
>>  3. ceph config set osd crush_location_hook /rootfs/etc/crush/crushhook.sh
>>
>>
>> The containers see "/rootfs" and will then be able to access your
>> script.  Be aware though that if your script requires any sort of
>> elevated access, it may fail because the hook runs as ceph:ceph in
>> a minimal container so not all functions are available.  I had to
>> add lots of debug output and logging in mine (it's rather
>> complicated) to figure out what was going on when it was running.
>>
>> I would love to see the "crush_location_hook" script be something
>> that can be stored in the config entirely instead of as a link,
>> similar to how the ssl certificates for RGW or the dashboard are
>> stored (ceph config-key set ...).   The current situation is not
>> ideal.
>>
>>
>>
>>
>> 
>> From: Eugen Block 
>> Sent: Thursday, May 2, 2024 10:23 AM
>> To: ceph-users@ceph.io 
>> Subject: [ceph-users] cephadm custom crush location hooks
>>
>> Hi,
>>
>> we've been using custom crush location hooks for some OSDs [1] for
>> years. Since we moved to cephadm, we always have to manually edit the
>> unit.run file of those OSDs because the path to the script is not
>> mapped into the containers. I don't want to define custom location
>> hooks for all OSDs globally in the OSD spec, even if those are limited
>> to two hosts only in our case. But I'm not aware of a method to target
>> only specific OSDs to have some files mapped into the container [2].
>> Is my assumption correct that we'll have to live with the manual
>> intervention until we reorganize our osd tree? Or did I miss something?
>>
>> Thanks!
>> Eugen
>>
>> [1]
>> https://docs.ceph.com/en/latest/rados/operations/crush-map/#custom-location-hooks
>> [2]
>> https://docs.ceph.com/en/latest/cephadm/services/#mounting-files-with-extra-container-arguments
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephadm custom crush location hooks

2024-05-02 Thread Wyll Ingersoll



I've found the crush location hook script code to be problematic in the 
containerized/cephadm world.

Our workaround is to place the script in a common place on each OSD node, such 
as /etc/crush/crushhook.sh, and then make a link from /rootfs -> /, and set the 
configuration value so that the path to the hook script starts with /rootfs.  
The container that the OSDs run in has access to /rootfs and this hack allows 
them to all view the crush script without having to manually modify unit files.

For example:

  1. put crushhook script on the host OS in /etc/crush/crushhook.sh
  2. make a link on the host os:   $ cd /; sudo ln -s / /rootfs
  3. ceph config set osd crush_location_hook /rootfs/etc/crush/crushhook.sh


The containers see "/rootfs" and will then be able to access your script.  Be 
aware though that if your script requires any sort of elevated access, it may 
fail because the hook runs as ceph:ceph in a minimal container so not all 
functions are available.  I had to add lots of debug output and logging in mine 
(it's rather complicated) to figure out what was going on when it was running.
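
To make the above concrete, a stripped-down sketch of such a hook (the bucket names and host mapping here are invented; the only hard requirement is that the script prints a single crush location line on stdout and only writes its logs somewhere the ceph user can reach):

#!/bin/sh
# /etc/crush/crushhook.sh - minimal sketch of a custom crush location hook.
# Ceph invokes it with arguments along the lines of:
#   --cluster ceph --id <osd-id> --type osd
# and expects one line on stdout, e.g. "host=node01 rack=rack1 root=default".

LOG=/tmp/crushhook.log                      # somewhere writable by ceph:ceph inside the container
echo "$(date) args: $*" >> "$LOG" 2>/dev/null

HOST=$(hostname -s)                         # may need adjusting if the container reports a different hostname
case "$HOST" in
  node0[1-4]) RACK=rack1 ;;                 # invented mapping - replace with your topology
  *)          RACK=rack2 ;;
esac

echo "host=$HOST rack=$RACK root=default"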

I would love to see the "crush_location_hook" script be something that can be 
stored in the config entirely instead of as a link, similar to how the ssl 
certificates for RGW or the dashboard are stored (ceph config-key set ...).   
The current situation is not ideal.





From: Eugen Block 
Sent: Thursday, May 2, 2024 10:23 AM
To: ceph-users@ceph.io 
Subject: [ceph-users] cephadm custom crush location hooks

Hi,

we've been using custom crush location hooks for some OSDs [1] for
years. Since we moved to cephadm, we always have to manually edit the
unit.run file of those OSDs because the path to the script is not
mapped into the containers. I don't want to define custom location
hooks for all OSDs globally in the OSD spec, even if those are limited
to two hosts only in our case. But I'm not aware of a method to target
only specific OSDs to have some files mapped into the container [2].
Is my assumption correct that we'll have to live with the manual
intervention until we reorganize our osd tree? Or did I miss something?

Thanks!
Eugen

[1]
https://docs.ceph.com/en/latest/rados/operations/crush-map/#custom-location-hooks
[2]
https://docs.ceph.com/en/latest/cephadm/services/#mounting-files-with-extra-container-arguments
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] replacing storage server host (not drives)

2023-09-26 Thread Wyll Ingersoll


We have a storage node that is failing, but the disks themselves are not.  What 
is the recommended procedure for replacing the host itself without destroying 
the OSDs or losing data?

This cluster is running ceph 16.2.11 using ceph orchestrator with docker 
containers on Ubuntu 20.04 (focal).

thank you,
  Wyllys Ingersoll
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-mgr ssh connections left open

2023-07-20 Thread Wyll Ingersoll


Yes, it is ceph pacific 16.2.11.

Is this a known issue that is fixed in a more recent pacific update?  We're not 
ready to move to quincy yet.

thanks,
   Wyllys


From: John Mulligan 
Sent: Thursday, July 20, 2023 10:30 AM
To: ceph-users@ceph.io 
Cc: Wyll Ingersoll 
Subject: Re: [ceph-users] ceph-mgr ssh connections left open

On Tuesday, July 18, 2023 10:56:12 AM EDT Wyll Ingersoll wrote:
> Every night at midnight, our ceph-mgr daemons open up ssh connections to the
> other nodes and then leaves them open. Eventually they become zombies. I
> cannot figure out what module is causing this or how to turn it off.  If
> left unchecked over days/weeks, the zombie ssh connections just keep
> growing, the only way to clear them is to restart ceph-mgr services.
>
> Any idea what is causing this or how it can be disabled?
>
> Example:
>
>
> ceph 1350387 1350373  7 Jul17 ?01:19:39 /usr/bin/ceph-mgr -n
> mgr.mon03 -f --setuser ceph --setgroup ceph --default-log-to-file=false
> --default-log-to-stderr=true --default-log-stderr-prefix
>
> ceph 1350548 1350387  0 Jul17 ?00:00:01 ssh -C -F
> /tmp/cephadm-conf-d0khggdz -i /tmp/cephadm-identity-onf2msju -o
> ServerAliveInterval=7 -o ServerAliveCountMax=3 xxx@10.4.1.11 sudo python
>
> [...snip...]

Is this cluster on pacific?  The module in question is likely to be `cephadm`
but the cephadm ssh backend has been changed and the team assumes problems
like this no longer occur.

Hope that helps!


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] ceph-mgr ssh connections left open

2023-07-18 Thread Wyll Ingersoll
Every night at midnight, our ceph-mgr daemons open up ssh connections to the 
other nodes and then leaves them open. Eventually they become zombies.
I cannot figure out what module is causing this or how to turn it off.  If left 
unchecked over days/weeks, the zombie ssh connections just keep growing, the 
only way to clear them is to restart ceph-mgr services.

Any idea what is causing this or how it can be disabled?

Example:


ceph 1350387 1350373  7 Jul17 ?01:19:39 /usr/bin/ceph-mgr -n 
mgr.mon03 -f --setuser ceph --setgroup ceph --default-log-to-file=false 
--default-log-to-stderr=true --default-log-stderr-prefix

ceph 1350548 1350387  0 Jul17 ?00:00:01 ssh -C -F 
/tmp/cephadm-conf-d0khggdz -i /tmp/cephadm-identity-onf2msju -o 
ServerAliveInterval=7 -o ServerAliveCountMax=3 xxx@10.4.1.11 sudo python

ceph 1350549 1350387  0 Jul17 ?00:00:02 ssh -C -F 
/tmp/cephadm-conf-d0khggdz -i /tmp/cephadm-identity-onf2msju -o 
ServerAliveInterval=7 -o ServerAliveCountMax=3 xxx@10.4.1.41 sudo python

ceph 1350550 1350387  0 Jul17 ?00:00:01 ssh -C -F 
/tmp/cephadm-conf-d0khggdz -i /tmp/cephadm-identity-onf2msju -o 
ServerAliveInterval=7 -o ServerAliveCountMax=3 xxx@10.4.1.42 sudo python

ceph 1350551 1350387  0 Jul17 ?00:00:01 ssh -C -F 
/tmp/cephadm-conf-d0khggdz -i /tmp/cephadm-identity-onf2msju -o 
ServerAliveInterval=7 -o ServerAliveCountMax=3 xxx@10.4.1.22 sudo python

ceph 1350552 1350387  0 Jul17 ?00:00:01 ssh -C -F 
/tmp/cephadm-conf-d0khggdz -i /tmp/cephadm-identity-onf2msju -o 
ServerAliveInterval=7 -o ServerAliveCountMax=3 xxx@10.4.1.23 sudo python

root 1350553 902  0 Jul17 ?00:00:00 sshd: xxx [priv]

ceph 1350554 1350387  0 Jul17 ?00:00:01 ssh -C -F 
/tmp/cephadm-conf-d0khggdz -i /tmp/cephadm-identity-onf2msju -o 
ServerAliveInterval=7 -o ServerAliveCountMax=3 xxx@10.4.1.105 sudo pytho

ceph 1350556 1350387  0 Jul17 ?00:00:01 ssh -C -F 
/tmp/cephadm-conf-d0khggdz -i /tmp/cephadm-identity-onf2msju -o 
ServerAliveInterval=7 -o ServerAliveCountMax=3 xxx@10.4.1.21 sudo python

ceph 1350557 1350387  0 Jul17 ?00:00:01 ssh -C -F 
/tmp/cephadm-conf-d0khggdz -i /tmp/cephadm-identity-onf2msju -o 
ServerAliveInterval=7 -o ServerAliveCountMax=3 xxx@10.4.1.101 sudo pytho

ceph 1350559 1350387  0 Jul17 ?00:00:01 ssh -C -F 
/tmp/cephadm-conf-d0khggdz -i /tmp/cephadm-identity-onf2msju -o 
ServerAliveInterval=7 -o ServerAliveCountMax=3 xxx@10.4.1.102 sudo pytho




Our current list of ceph-mgr modules enabled and default is:


"always_on_modules": [

"balancer",

"crash",

"devicehealth",

"orchestrator",

"pg_autoscaler",

"progress",

"rbd_support",

"status",

"telemetry",

"volumes"

],

"enabled_modules": [

"cephadm",

"dashboard",

"diskprediction_local",

"nfs",

"prometheus",

"restful"

],


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Pacific dashboard: unable to get RGW information

2023-04-11 Thread Wyll Ingersoll
I have a similar issue with how the dashboard tries to access an SSL protected 
RGW service.  It doesn't use the correct name and doesn't allow for any way to 
override the RGW name that the dashboard uses.

https://tracker.ceph.com/issues/59111
(Bug #59111: dashboard should use rgw_dns_name when talking to rgw api - Dashboard)




From: Michel Jouvin 
Sent: Tuesday, April 11, 2023 4:19 PM
To: Ceph Users 
Subject: [ceph-users] Pacific dashboard: unable to get RGW information

Hi,

Our cluster is running Pacific 16.2.10. We have a problem using the
dashboard to display information about RGWs configured in the cluster.
When clicking on "Object Gateway", we get an error 500. Looking in the
mgr logs, I found that the problem is that the RGW is accessed by its IP
address rather than its name. As the RGW has SSL enabled, the
certificate cannot be matched against the IP address.

I digged into the configuration but I was not able to identify where an
IP address rather than a name was used (I checked in particular the
zonegroup parameters and names are used to define endpoints). Did I make
something wrong in the configuration or is it a know issue when using
SSL-enabled RGW?

Best regards,

Michel
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph NFS data - cannot read files, getattr returns NFS4ERR_PERM

2023-03-16 Thread Wyll Ingersoll

I agree it should be in the release notes or documentation; it took me 3 days to
track it down, searching for all kinds of combinations of "cephfs nfs"
and "ceph nfs permissions".
Perhaps just having this thread archived will make it easier for the next 
person to find the answer, though.


From: Eugen Block 
Sent: Thursday, March 16, 2023 10:30 AM
To: Wyll Ingersoll 
Cc: ceph-users@ceph.io 
Subject: Re: [ceph-users] Re: Ceph NFS data - cannot read files, getattr 
returns NFS4ERR_PERM

You found the right keywords yourself (application metadata), but I'm
glad it worked for you. I only found this tracker issue [2] which
fixes the behavior when issuing a "fs new" command, and it contains
the same workaround (set the application metadata). Maybe this should
be part of the (upgrade) release notes and the documentation so that
users with older clusters are aware that their cephfs pools might be
missing the correct metadata. Maybe it is already, I didn't find
anything yet.

[2] https://tracker.ceph.com/issues/43761

Zitat von Wyll Ingersoll :

> YES!!   That fixed it.
>
> I issued the following commands to update the application_metadata
> on the cephfs pools and now its working. THANK YOU!
>
> ceph osd pool application set cephfs_data cephfs data cephfs
> ceph osd pool application set cephfs_metadata cephfs data cephfs
>
> Now the application_metadata looks correct on both pools and I can
> read/write the data as expected.
>
> Is this an official ceph bug or only recorded in the SUSE bug db?
> It should be in ceph, IMO, since it is not SUSE specific.
>
>
> -Wyllys Ingersoll
>
> ____
> From: Eugen Block 
> Sent: Thursday, March 16, 2023 10:04 AM
> To: Wyll Ingersoll 
> Cc: ceph-users@ceph.io 
> Subject: Re: [ceph-users] Re: Ceph NFS data - cannot read files,
> getattr returns NFS4ERR_PERM
>
> It sounds a bit like this [1], doesn't it? Setting the application
> metadata is just:
>
> ceph osd pool application set  cephfs
>  cephfs
>
> [1] https://www.suse.com/support/kb/doc/?id=20812
>
> Zitat von Wyll Ingersoll :
>
>> Yes, with this last upgrade (pacific) we migrated to the
>> orchestrated model where everything is in containers. Previously, we
>> managed nfs-ganesha ourselves and exported shares using FSAL VFS
>> over /cephfs mounted on the NFS server.
>>
>> With orchestrated ceph managed NFS, ganesha runs in a container and
>> uses FSAL CEPH instead, which accesses cephfs data via libcephfs
>> instead of reading from the mounted local FS. I suspect that is part
>> of the problem and is related to getting the right permissions.
>>
>> Here is something I noticed.  On a separate cluster running same
>> release, the cephfs_data pool has the following application metadata:
>> "application_metadata": {
>> "cephfs": {
>> "metadata": "cephfs"
>> }
>> }
>>
>> But on the upgraded cluster the application_metadata on
>> cephfs_data/metadata pools looks like:
>> "application_metadata": {
>> "cephfs": {}
>> }
>>
>>
>> I'm wondering if that has something to do with the permission issues
>> because the caps use "tag data=cephfs" to grant the RW permission
>> for the OSDs.  How do I update the application_metadata on the
>> pools?  I don't see a subcommand in the rados utility, I'm hoping I
>> don't have to write something to do it myself.  Though I do see there
>> are APIs for updating it in librados, so I could write a short C
>> utility to make the change if necessary.
>>
>> thanks!
>>
>> 
>> From: Eugen Block 
>> Sent: Thursday, March 16, 2023 9:48 AM
>> To: Wyll Ingersoll 
>> Cc: ceph-users@ceph.io 
>> Subject: Re: [ceph-users] Re: Ceph NFS data - cannot read files,
>> getattr returns NFS4ERR_PERM
>>
>> That would have been my next question, if it had worked before. So the
>> only difference is the nfs-ganesha deployment and different (newer?)
>> clients than before? Unfortunately, I don't have any ganesha instance
>> running in any of my (test) clusters. Maybe someone else can chime in.
>>
>> Zitat von Wyll Ingersoll :
>>
>>> Nope, that didn't work.  I updated the caps to add "allow r path=/"
>>> to the mds, but it made no difference.  I restarted the nfs
>>> container and unmounted/mounted the share on the client.
>>>
>>> The caps now look like:
>>>    key = xxx
>>>

[ceph-users] Re: Ceph NFS data - cannot read files, getattr returns NFS4ERR_PERM

2023-03-16 Thread Wyll Ingersoll
YES!!   That fixed it.

I issued the following commands to update the application_metadata on the 
cephfs pools and now its working. THANK YOU!

ceph osd pool application set cephfs_data cephfs data cephfs
ceph osd pool application set cephfs_metadata cephfs data cephfs

Now the application_metadata looks correct on both pools and I can read/write 
the data as expected.
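
For anyone checking their own cluster, the result can be verified with something like:

ceph osd pool application get cephfs_data        # should now show the cephfs application entry
ceph osd pool application get cephfs_metadata
ceph osd pool ls detail | grep cephfs            # the "application cephfs" tag shows up here too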

Is this an official ceph bug or only recorded in the SUSE bug db?  It should be 
in ceph, IMO, since it is not SUSE specific.


-Wyllys Ingersoll


From: Eugen Block 
Sent: Thursday, March 16, 2023 10:04 AM
To: Wyll Ingersoll 
Cc: ceph-users@ceph.io 
Subject: Re: [ceph-users] Re: Ceph NFS data - cannot read files, getattr 
returns NFS4ERR_PERM

It sounds a bit like this [1], doesn't it? Setting the application
metadata is just:

ceph osd pool application set  cephfs
 cephfs

[1] https://www.suse.com/support/kb/doc/?id=20812

Zitat von Wyll Ingersoll :

> Yes, with this last upgrade (pacific) we migrated to the
> orchestrated model where everything is in containers. Previously, we
> managed nfs-ganesha ourselves and exported shares using FSAL VFS
> over /cephfs mounted on the NFS server.
>
> With orchestrated ceph managed NFS, ganesha runs in a container and
> uses FSAL CEPH instead, which accesses cephfs data via libcephfs
> instead of reading from the mounted local FS. I suspect that is part
> of the problem and is related to getting the right permissions.
>
> Here is something I noticed.  On a separate cluster running same
> release, the cephfs_data pool has the following application metadata:
> "application_metadata": {
> "cephfs": {
> "metadata": "cephfs"
> }
> }
>
> But on the upgraded cluster the application_metadata on
> cephfs_data/metadata pools looks like:
> "application_metadata": {
> "cephfs": {}
> }
>
>
> I'm wondering if that has something to do with the permission issues
> because the caps use "tag data=cephfs" to grant the RW permission
> for the OSDs.  How do I update the application_metadata on the
> pools?  I don't see a subcommand in the rados utility, I'm hoping I
> don't have to write something to do it myself.  Though I do see there
> are APIs for updating it in librados, so I could write a short C
> utility to make the change if necessary.
>
> thanks!
>
> 
> From: Eugen Block 
> Sent: Thursday, March 16, 2023 9:48 AM
> To: Wyll Ingersoll 
> Cc: ceph-users@ceph.io 
> Subject: Re: [ceph-users] Re: Ceph NFS data - cannot read files,
> getattr returns NFS4ERR_PERM
>
> That would have been my next question, if it had worked before. So the
> only difference is the nfs-ganesha deployment and different (newer?)
> clients than before? Unfortunately, I don't have any ganesha instance
> running in any of my (test) clusters. Maybe someone else can chime in.
>
> Zitat von Wyll Ingersoll :
>
>> Nope, that didn't work.  I updated the caps to add "allow r path=/"
>> to the mds, but it made no difference.  I restarted the nfs
>> container and unmounted/mounted the share on the client.
>>
>> The caps now look like:
>>    key = xxx
>>    caps mds = "allow rw path=/exports/nfs/foobar, allow r path=/"
>>    caps mon = "allow r"
>>    caps osd = "allow rw pool=.nfs namespace=cephfs, allow rw tag
>> cephfs data=cephfs"
>>
>>
>> This is really frustrating. We can mount the shares and get
>> directory listings, and even create directories and files (empty),
>> but cannot read or write any actual data.  This would seem to
>> indicate a permission problem writing to the cephfs data pool, but
>> we haven't tinkered with any of the caps or permissions.
>>
>> tcpdump shows lots of errors when trying to read a file from the share:
>> NFS reply xid 771352420 reply ok 96 getattr ERROR: Operation not permitted
>>
>> One thing to note - this is a system that has been around for years
>> and years and has been upgraded through many iterations of ceph.
>> The cephfs data/metadata pools were probably created on Hammer or
>> Luminous, though I'm not sure if that would matter or not.
>> Everything else operates correctly AFAIK.
>>
>> I may take this over to the Ganesha mailing list to see if they have
>> any ideas.
>>
>> thanks!
>>
>>
>>
>>
>> 
>> From: Eugen Block 
>> Sent: Thursday, March 16, 2023 3:33 AM
>> To: ceph-users@ceph.io 
>> Subject: [ceph-users] Re: Ceph NFS data - cannot read files, getattr returns NFS4ERR_PERM

[ceph-users] Re: Ceph NFS data - cannot read files, getattr returns NFS4ERR_PERM

2023-03-16 Thread Wyll Ingersoll
Yes, with this last upgrade (pacific) we migrated to the orchestrated model 
where everything is in containers. Previously, we managed nfs-ganesha ourselves 
and exported shares using FSAL VFS over /cephfs mounted on the NFS server.

With orchestrated ceph managed NFS, ganesha runs in a container and uses FSAL 
CEPH instead, which accesses cephfs data via libcephfs instead of reading from 
the mounted local FS. I suspect that is part of the problem and is related to 
getting the right permissions.

Here is something I noticed.  On a separate cluster running same release, the 
cephfs_data pool has the following application metadata:
"application_metadata": {
"cephfs": {
"metadata": "cephfs"
}
}

But on the upgraded cluster the application_metadata on cephfs_data/metadata 
pools looks like:
"application_metadata": {
"cephfs": {}
}


I'm wondering if that has something to do with the permission issues because 
the caps use "tag data=cephfs" to grant the RW permission for the OSDs.  How do 
I update the application_metadata on the pools?  I don't see a subcommand in 
the rados utility, I'm hoping I don't have to write something to do it myself.
Though I do see there are APIs for updating it in librados, so I could write a
short C utility to make the change if necessary.

thanks!

____
From: Eugen Block 
Sent: Thursday, March 16, 2023 9:48 AM
To: Wyll Ingersoll 
Cc: ceph-users@ceph.io 
Subject: Re: [ceph-users] Re: Ceph NFS data - cannot read files, getattr 
returns NFS4ERR_PERM

That would have been my next question, if it had worked before. So the
only difference is the nfs-ganesha deployment and different (newer?)
clients than before? Unfortunately, I don't have any ganesha instance
running in any of my (test) clusters. Maybe someone else can chime in.

Zitat von Wyll Ingersoll :

> Nope, that didn't work.  I updated the caps to add "allow r path=/"
> to the mds, but it made no difference.  I restarted the nfs
> container and unmounted/mounted the share on the client.
>
> The caps now look like:
>    key = xxx
>    caps mds = "allow rw path=/exports/nfs/foobar, allow r path=/"
>    caps mon = "allow r"
>    caps osd = "allow rw pool=.nfs namespace=cephfs, allow rw tag
> cephfs data=cephfs"
>
>
> This is really frustrating. We can mount the shares and get
> directory listings, and even create directories and files (empty),
> but cannot read or write any actual data.  This would seem to
> indicate a permission problem writing to the cephfs data pool, but
> we haven't tinkered with any of the caps or permissions.
>
> tcpdump shows lots of errors when trying to read a file from the share:
> NFS reply xid 771352420 reply ok 96 getattr ERROR: Operation not permitted
>
> One thing to note - this is a system that has been around for years
> and years and has been upgraded through many iterations of ceph.
> The cephfs data/metadata pools were probably created on Hammer or
> Luminous, though I'm not sure if that would matter or not.
> Everything else operates correctly AFAIK.
>
> I may take this over to the Ganesha mailing list to see if they have
> any ideas.
>
> thanks!
>
>
>
>
> 
> From: Eugen Block 
> Sent: Thursday, March 16, 2023 3:33 AM
> To: ceph-users@ceph.io 
> Subject: [ceph-users] Re: Ceph NFS data - cannot read files, getattr
> returns NFS4ERR_PERM
>
> Hi,
>
> we saw this on a Nautilus cluster when Clients were updated so we had
> to modify the client caps to allow read access for the "/" directory.
> There's an excerpt in the SUSE docs [1] for that:
>
>> If clients with path restriction are used, the MDS capabilities need
>> to include read access to the root directory.
>> The allow r path=/ part means that path-restricted clients are able
>> to see the root volume, but cannot write to it. This may be an issue
>> for use cases where complete isolation is a requirement.
>
> Can you update the caps and test again?
>
> Regards,
> Eugen
>
> [1] https://documentation.suse.com/ses/7.1/html/ses-all/cha-ceph-cephfs.html
>
> Zitat von Wyll Ingersoll :
>
>> ceph pacific 16.2.11 (cephadm managed)
>>
>> I have configured some NFS mounts from the ceph GUI from cephfs.  We
>> can mount the filesystems and view file/directory listings, but
>> cannot read any file data.
>> The permissions on the shares are RW.  We mount from the client
>> using "vers=4.1".
>>
>> Looking at debug logs from the container running nfs-ganesha, I see
>> the following errors when trying to 

[ceph-users] Re: Ceph NFS data - cannot read files, getattr returns NFS4ERR_PERM

2023-03-16 Thread Wyll Ingersoll

Nope, that didn't work.  I updated the caps to add "allow r path=/" to the mds, 
but it made no difference.  I restarted the nfs container and unmounted/mounted 
the share on the client.

The caps now look like:
   key = xxx
   caps mds = "allow rw path=/exports/nfs/foobar, allow r path=/"
   caps mon = "allow r"
   caps osd = "allow rw pool=.nfs namespace=cephfs, allow rw tag cephfs 
data=cephfs"


This is really frustrating. We can mount the shares and get directory listings, 
and even create directories and files (empty), but cannot read or write any 
actual data.  This would seem to indicate a permission problem writing to the 
cephfs data pool, but we haven't tinkered with any of the caps or permissions.

tcpdump shows lots of errors when trying to read a file from the share:
NFS reply xid 771352420 reply ok 96 getattr ERROR: Operation not permitted

One thing to note - this is a system that has been around for years and years 
and has been upgraded through many iterations of ceph.  The cephfs 
data/metadata pools were probably created on Hammer or Luminous, though I'm not 
sure if that would matter or not.  Everything else operates correctly AFAIK.

I may take this over to the Ganesha mailing list to see if they have any ideas.

thanks!





From: Eugen Block 
Sent: Thursday, March 16, 2023 3:33 AM
To: ceph-users@ceph.io 
Subject: [ceph-users] Re: Ceph NFS data - cannot read files, getattr returns 
NFS4ERR_PERM

Hi,

we saw this on a Nautilus cluster when Clients were updated so we had
to modify the client caps to allow read access for the "/" directory.
There's an excerpt in the SUSE docs [1] for that:

> If clients with path restriction are used, the MDS capabilities need
> to include read access to the root directory.
> The allow r path=/ part means that path-restricted clients are able
> to see the root volume, but cannot write to it. This may be an issue
> for use cases where complete isolation is a requirement.

Can you update the caps and test again?

Regards,
Eugen

[1] https://documentation.suse.com/ses/7.1/html/ses-all/cha-ceph-cephfs.html

Zitat von Wyll Ingersoll :

> ceph pacific 16.2.11 (cephadm managed)
>
> I have configured some NFS mounts from the ceph GUI from cephfs.  We
> can mount the filesystems and view file/directory listings, but
> cannot read any file data.
> The permissions on the shares are RW.  We mount from the client
> using "vers=4.1".
>
> Looking at debug logs from the container running nfs-ganesha, I see
> the following errors when trying to read a file's content:
> 15/03/2023 15:27:13 : epoch 6411e209 : gw01 : ganesha.nfsd-7[svc_8]
> complete_op :NFS4 :DEBUG :Status of OP_READ in position 2 =
> NFS4ERR_PERM, op response size is 7480 total response size is 7568
> 15/03/2023 15:27:13 : epoch 6411e209 : gw01 : ganesha.nfsd-7[svc_8]
> complete_nfs4_compound :NFS4 :DEBUG :End status = NFS4ERR_PERM
> lastindex = 3
>
>
> Also, watching the TCP traffic, I see errors in the NFS protocol
> corresponding to these messages:
> 11:44:43.745570 IP xxx.747 > gw01.nfs: Flags [P.], seq
> 24184536:24184748, ack 11409577, win 602, options [nop,nop,TS val
> 342245425 ecr 2683489461], length 212: NFS request xid 156024373 208
> getattr fh 0,1/53
> 11:44:43.745683 IP gw01.nfs > xxx.747: Flags [P.], seq
> 11409577:11409677, ack 24184748, win 3081, options [nop,nop,TS val
> 2683489461 ecr 342245425], length 100: NFS reply xid 156024373 reply
> ok 96 getattr ERROR: Operation not permitted
>
> So there appears to be a permissions problem where nfs-ganesha is
> not able to "getattr" on cephfs data.
>
> The export looks like this (read from rados):
> EXPORT {
> FSAL {
> name = "CEPH";
> user_id = "nfs.cephfs.7";
> filesystem = "cephfs";
> secret_access_key = "xxx";
> }
> export_id = 7;
> path = "/exports/nfs/foobar";
> pseudo = "/foobar";
> access_type = "RW";
> squash = "no_root_squash";
> attr_expiration_time = 0;
> security_label = false;
> protocols = 4;
> transports = "TCP";
> }
>
> ceph auth permissions for the nfs.cephfs.7 client:
> [client.nfs.cephfs.7]
>   key = xxx
>   caps mds = "allow rw path=/exports/nfs/foobar"
>   caps mon = "allow r"
>   caps osd = "allow rw pool=.nfs namespace=cephfs, allow rw tag
> cephfs data=cephfs"
>
>
> Any suggestions?
>
>
>
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Ceph NFS data - cannot read files, getattr returns NFS4ERR_PERM

2023-03-15 Thread Wyll Ingersoll

ceph pacific 16.2.11 (cephadm managed)

I have configured some NFS exports of cephfs from the ceph GUI.  We can mount 
the filesystems and view file/directory listings, but cannot read any file data.
The permissions on the shares are RW.  We mount from the client using 
"vers=4.1".

Looking at debug logs from the container running nfs-ganesha, I see the 
following errors when trying to read a file's content:
15/03/2023 15:27:13 : epoch 6411e209 : gw01 : ganesha.nfsd-7[svc_8] complete_op 
:NFS4 :DEBUG :Status of OP_READ in position 2 = NFS4ERR_PERM, op response size 
is 7480 total response size is 7568
15/03/2023 15:27:13 : epoch 6411e209 : gw01 : ganesha.nfsd-7[svc_8] 
complete_nfs4_compound :NFS4 :DEBUG :End status = NFS4ERR_PERM lastindex = 3


Also, watching the TCP traffic, I see errors in the NFS protocol corresponding 
to these messages:
11:44:43.745570 IP xxx.747 > gw01.nfs: Flags [P.], seq 24184536:24184748, ack 
11409577, win 602, options [nop,nop,TS val 342245425 ecr 2683489461], length 
212: NFS request xid 156024373 208 getattr fh 0,1/53
11:44:43.745683 IP gw01.nfs > xxx.747: Flags [P.], seq 11409577:11409677, ack 
24184748, win 3081, options [nop,nop,TS val 2683489461 ecr 342245425], length 
100: NFS reply xid 156024373 reply ok 96 getattr ERROR: Operation not permitted

So there appears to be a permissions problem where nfs-ganesha is not able to 
"getattr" on cephfs data.

The export looks like this (read from rados):
EXPORT {
    FSAL {
        name = "CEPH";
        user_id = "nfs.cephfs.7";
        filesystem = "cephfs";
        secret_access_key = "xxx";
    }
    export_id = 7;
    path = "/exports/nfs/foobar";
    pseudo = "/foobar";
    access_type = "RW";
    squash = "no_root_squash";
    attr_expiration_time = 0;
    security_label = false;
    protocols = 4;
    transports = "TCP";
}

ceph auth permissions for the nfs.cephfs.7 client:
[client.nfs.cephfs.7]
  key = xxx
  caps mds = "allow rw path=/exports/nfs/foobar"
  caps mon = "allow r"
  caps osd = "allow rw pool=.nfs namespace=cephfs, allow rw tag cephfs 
data=cephfs"


Any suggestions?





___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Dashboard for Object Servers using wrong hostname

2023-03-08 Thread Wyll Ingersoll


I have an orchestrated (cephadm) ceph cluster (16.2.11) with 2 radosgw services 
on 2 separate hosts without HA (i.e. no ingress/haproxy in front).  Both of the 
rgw servers use SSL and have a properly signed certificate. We can access them 
with standard S3 tools like s3cmd, cyberduck, etc.

The problem seems to be that the Ceph mgr dashboard fails to access the RGW 
API because it uses the shortname "gw01" instead of the FQDN "gw01.domain.com" 
when forming the S3 signature which makes the S3 signature check fail and we 
get the following error:

Error connecting to Object Gateway: RGW REST API failed request with status 
code 403 
(b'{"Code":"SignatureDoesNotMatch","RequestId":"tx0521ceca28974e94b-006408e'
 b'f93-454bbb4e-default","HostId":"454bbb4e-default-default"}')

It seems that the ceph mgr (which we have restarted several times) uses just 
the short hostname from the inventory and I don't see how to tell it to use the 
FQDN.  Neither is it possible to configure the RGW to listen on an alternate 
non-SSL port on the cluster private network since the service spec for RGW only 
allows setting the rgw_frontend_port and rgw_frontend_type, but not the full 
frontend spec (which would allow for multiple listeners).
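
For what it's worth, the name RGW itself expects is controlled by rgw_dns_name, so something like the below is what I would have expected to help - but as far as I can tell the dashboard does not consult it when building its own requests (hence this post). Treat it as a sketch, with <instance> and <service-name> standing in for whatever your rgw daemon and service are called:

ceph config set client.rgw.<instance> rgw_dns_name gw01.domain.com
ceph orch restart rgw.<service-name>    # restart the gateway so it picks up the change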

When we did have HA (haproxy) ingress configured, we ran into issues with the 
user clients getting lots of 503 errors due to some interaction between the RGW 
and the haproxy so we gave up on that config and now talk directly to the RGW 
over SSL which is working well.

Any suggestions?

thanks,
   Wyllys Ingersoll

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph orch osd spec questions

2023-01-18 Thread Wyll Ingersoll


In case anyone was wondering, I figured out the problem...

This nasty bug in Pacific 16.2.10   https://tracker.ceph.com/issues/56031  - I 
think it is fixed in the upcoming .11 release and in Quincy.

This bug causes the computation of the bluestore DB partition to be much 
smaller than it should be, so if you request a reasonable size which is smaller 
than the incorrectly computed maximum size, the DB creation will fail.

Our problem was that we added 3 new SSDs that were considered "unused" by the 
system, giving us a total of 8 (5 used, 3 unused).   When the orchestrator 
issues a "ceph-volume lvm batch" command, it passes 40 data devices and 8 db 
devices.  Normally, you would expect it to divide them into 5 slots per DB 
device (40/8).   But when it computes the size of the slots, that is where the 
problem occurs.

ceph-volume first sees the 3 unused devices as a group and incorrectly decides 
that the number of slots needed is 3 * 5 = 15, then divides the size of a single DB 
device by 15, making the max DB size 3x smaller than it should be.  If the 
code had also used the combined size of all of the devices in the group when computing 
the max size, it would have been fine, but it only accounts for the size of the 
1st DB device in the group, resulting in a size 3x smaller than it should be.

The workaround is to trick ceph into grouping all of the DB devices into unique 
groups of 1 by putting a minimal VG with a unique name on each of the unused 
SSDs so that when ceph-volume computes the sizing, it sees groups of 1 and thus 
doesn't multiply the number of slots incorrectly.   I used "vgcreate bug1 -s 1M 
/dev/xyz" to create a bogus VG on each of the unused SSDs, now I have properly 
sized DB devices on the new SSDs (the "bugX" VGs can then be removed once there 
are legitimate DB VGs on the device).
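
Spelled out, the workaround looks roughly like this on the affected host (device names are examples):

vgcreate bug1 -s 1M /dev/sdx     # tiny placeholder VG on each unused DB SSD
vgcreate bug2 -s 1M /dev/sdy
vgcreate bug3 -s 1M /dev/sdz

# ...re-apply the OSD spec / let the orchestrator create the new OSDs...

vgremove bug1 bug2 bug3          # only once legitimate DB VGs exist on those devices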

Question - Because our cluster was initially laid out using the buggy 
ceph-volume (16.2.10), we now have hundreds of DB devices that are far smaller 
than they should be (far less than the recommended 1-4% of the data device 
size).  Is it possible to resize the DB devices without destroying and 
recreating the OSD itself?

What are the implications of having bluestore DB devices that are far smaller 
than they should be?


thanks,
  Wyllys Ingersoll


________
From: Wyll Ingersoll 
Sent: Friday, January 13, 2023 4:35 PM
To: ceph-users@ceph.io 
Subject: [ceph-users] ceph orch osd spec questions


Ceph Pacific 16.2.9

We have a storage server with multiple 1.7TB SSDs dedicated to the bluestore DB 
usage.  The osd spec originally was misconfigured slightly and had set the 
"limit" parameter on the db_devices to 5 (there are 8 SSDs available) and did 
not specify a block_db_size.  ceph laid out the original 40 OSDs and put 8 DBs 
across 5 of the SSDs (because of limit param).  Ceph seems to have auto-sized 
the bluestore DB partitions to be about 45GB, which is far less than the 
recommended 1-4% (using 10TB drives).  How does ceph-volume determine the size 
of the bluestore DB/WAL partitions when it is not specified in the spec?

We updated the spec and specified a block_db_size of 300G and removed the 
"limit" value.  Now we can see in the cephadm.log that the ceph-volume command 
being issued is using the correct list of SSD devices (all 8) as options to the 
lvm batch (--db-devices ...), but it keeps failing to create the new OSD 
because we are asking for 300G and it thinks there is only 44G available even 
though the last 3 SSDs in the list are empty (1.7T).  So, it appears that 
somehow the orchestrator is ignoring the last 3 SSDs.  I have verified that 
these SSDs are wiped clean, have no partitions or LVM, and no label (sgdisk -Z, 
wipefs -a). They appear as available in the inventory and not locked or 
otherwise in use.

Also, the "db_slots" spec parameter is ignored in pacific due to a bug so there 
is no way to tell the orchestrator to use "block_db_slots". Adding it to the 
spec like "block_db_size" fails since it is not recognized.

Any help figuring out why these SSDs are being ignored would be much 
appreciated.

Our spec for this host looks like this:
---
spec:
  data_devices:
    rotational: 1
    size: '3TB:'
  db_devices:
    rotational: 0
    size: ':2T'
    vendor: 'SEAGATE'
  block_db_size: 300G
---

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] ceph orch osd spec questions

2023-01-13 Thread Wyll Ingersoll


Ceph Pacific 16.2.9

We have a storage server with multiple 1.7TB SSDs dedicated to the bluestore DB 
usage.  The osd spec originally was misconfigured slightly and had set the 
"limit" parameter on the db_devices to 5 (there are 8 SSDs available) and did 
not specify a block_db_size.  ceph laid out the original 40 OSDs and put 8 DBs 
across 5 of the SSDs (because of limit param).  Ceph seems to have auto-sized 
the bluestore DB partitions to be about 45GB, which is far less than the 
recommended 1-4% (using 10TB drives).  How does ceph-volume determine the size 
of the bluestore DB/WAL partitions when it is not specified in the spec?
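
As a rough point of reference, 1-4% of a 10TB data device works out to roughly 100-400GB per DB, so 45GB is clearly low. A couple of ways to see what ceph-volume actually created (a sketch; the metadata key names vary a bit between releases):

ceph osd metadata <osd-id> | grep -iE 'bluefs|db'    # reported db device / size for one OSD
cephadm ceph-volume lvm list                         # on the storage node: which DB LV each OSD uses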

We updated the spec and specified a block_db_size of 300G and removed the 
"limit" value.  Now we can see in the cephadm.log that the ceph-volume command 
being issued is using the correct list of SSD devices (all 8) as options to the 
lvm batch (--db-devices ...), but it keeps failing to create the new OSD 
because we are asking for 300G and it thinks there is only 44G available even 
though the last 3 SSDs in the list are empty (1.7T).  So, it appears that 
somehow the orchestrator is ignoring the last 3 SSDs.  I have verified that 
these SSDs are wiped clean, have no partitions or LVM, and no label (sgdisk -Z, 
wipefs -a). They appear as available in the inventory and not locked or 
otherwise in use.

Also, the "db_slots" spec parameter is ignored in pacific due to a bug so there 
is no way to tell the orchestrator to use "block_db_slots". Adding it to the 
spec like "block_db_size" fails since it is not recognized.

Any help figuring out why these SSDs are being ignored would be much 
appreciated.

Our spec for this host looks like this:
---
spec:
  data_devices:
    rotational: 1
    size: '3TB:'
  db_devices:
    rotational: 0
    size: ':2T'
    vendor: 'SEAGATE'
  block_db_size: 300G
---

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: adding OSD to orchestrated system, ignoring osd service spec.

2023-01-11 Thread Wyll Ingersoll


Not really, it's on an airgapped/secure network and I cannot copy-and-paste from 
it.  What are you looking for?  This cluster has 720 OSDs across 18 storage 
nodes.
I think we have identified the problem and it may not be a ceph issue, but need 
to investigate further.  It has something to do with the SSD devices that are 
being ignored - they are slightly different from the other ones.

From: Eugen Block 
Sent: Wednesday, January 11, 2023 3:27 AM
To: ceph-users@ceph.io 
Subject: [ceph-users] Re: adding OSD to orchestrated system, ignoring osd 
service spec.

Hi,

can you share the output of

storage01:~ # ceph orch ls osd

Thanks,
Eugen

Zitat von Wyll Ingersoll :

> When adding a new OSD to a ceph orchestrated system (16.2.9) on a
> storage node that has a specification profile that dictates which
> devices to use as the db_devices (SSDs), the newly added OSDs seem
> to be ignoring the db_devices (there are several available) and
> putting the data and db/wal on the same device.
>
> We installed the new disk (HDD) and then ran "ceph orch device zap
> /dev/xyz --force" to initialize the addition process.
> The OSDs that were added originally on that node were laid out
> correctly, but the new ones seem to be ignoring the OSD service spec.
>
> How can we make sure the new devices added are laid out correctly?
>
> thanks,
> Wyllys
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] adding OSD to orchestrated system, ignoring osd service spec.

2023-01-10 Thread Wyll Ingersoll


When adding a new OSD to a ceph orchestrated system (16.2.9) on a storage node 
that has a specification profile that dictates which devices to use as the 
db_devices (SSDs), the newly added OSDs seem to be ignoring the db_devices 
(there are several available) and putting the data and db/wal on the same 
device.

We installed the new disk (HDD) and then ran "ceph orch device zap /dev/xyz 
--force" to initialize the addition process.
The OSDs that were added originally on that node were laid out correctly, but 
the new ones seem to be ignoring the OSD service spec.

How can we make sure the new devices added are laid out correctly?

thanks,
Wyllys


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Removing OSDs - draining but never completes.

2023-01-10 Thread Wyll Ingersoll
Running ceph-pacific 16.2.9 using ceph orchestrator.

We made a mistake adding a disk to the cluster and immediately issued a command 
to remove it using "ceph orch osd rm ### --replace --force".

This OSD had no data on it at the time and was removed after just a few 
minutes.  "ceph orch osd rm status" shows that it is still "draining".
ceph osd df shows that the osd being removed has -1 PGs.

So - why is the simple act of removal taking so long and can we abort it and 
manually remove that osd somehow?

Note: the cluster is also doing a rebalance while this is going on, but the osd 
being removed never had any data and should not be affected by the rebalance.
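
For reference, the sequence we are considering to get rid of it by hand looks roughly like this (untested here as of this writing; double-check the OSD id before purging):

ceph orch osd rm stop <osd-id>                    # cancel the queued drain/removal
ceph orch daemon rm osd.<osd-id> --force          # remove the daemon from the host
ceph osd purge <osd-id> --yes-i-really-mean-it    # drop it from the crush map / osdmap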

thanks!

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] ceph orch osd rm - draining forever, shows -1 pgs

2023-01-09 Thread Wyll Ingersoll


Running ceph-pacific 16.2.9 using ceph orchestrator.

We made a mistake adding a disk to the cluster and immediately issued a command 
to remove it using "ceph orch osd rm ### --replace --force".

This OSD had no data on it at the time and was removed after just a few minutes.  
"ceph orch osd rm status" shows that it is still "draining".
ceph osd df shows that the osd being removed has -1 PGs.

So - why is the simple act of removal taking so long and can we abort it and 
manually remove that osd somehow?

Note: the cluster is also doing a rebalance while this is going on, but the osd 
being removed never had any data and should not be affected by the rebalance.

thanks!

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OMAP data growth

2022-12-05 Thread Wyll Ingersoll


But why is OMAP data usage growing at a rate 10x the amount of the actual data 
being written to RGW?

From: Robert Sander 
Sent: Monday, December 5, 2022 3:06 AM
To: ceph-users@ceph.io 
Subject: [ceph-users] Re: OMAP data growth

Am 02.12.22 um 21:09 schrieb Wyll Ingersoll:

> *   What is causing the OMAP data consumption to grow so fast and can it
>     be trimmed/throttled?

S3 is a heavy user of OMAP data. RBD and CephFS not so much.

Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin

http://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Zwangsangaben lt. §35a GmbHG:
HRB 220009 B / Amtsgericht Berlin-Charlottenburg,
Geschäftsführer: Peer Heinlein -- Sitz: Berlin

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] OMAP data growth

2022-12-02 Thread Wyll Ingersoll


We have a large cluster (10PB) which is about 30% full at this point.  We 
recently fixed a configuration issue that then triggered the pg autoscaler to 
start moving around massive amounts of data (85% misplaced objects - about 7.5B 
objects).  The misplaced % is dropping slowly (about 10% each day), but the 
overall data usage is growing by about 300T/day even though the data being 
written by clients is well under 30T/day.

The issue was that we have both 3x replicated pools and a very large 
erasure-coded (8+4) data pool for RGW.  The autoscaler doesn't work if it sees what 
it thinks are overlapping roots ("default" vs "default~hdd" in the crush tree, 
even if both refer to the same OSDs, they have different ids: -1 vs -2).  We 
cleared that by setting the same root for both crush rules and then PG 
autoscaler kicked in and started doing its thing.
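
For anyone hitting the same symptom, the overlap is easy to spot (a rough sketch):

ceph osd pool autoscale-status                             # empty output plus mgr log warnings when roots overlap
ceph osd crush rule dump | grep -E 'rule_name|item_name'   # compare which root each rule takes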

The "ceph osd df" output shows the OMAP jumping significantly and our data 
availability is shrinking MUCH faster than we would expect based on the client 
usage.

Questions:

  *   What is causing the OMAP data consumption to grow so fast and can it be 
trimmed/throttled?
  *   Will the overhead data be cleaned up once the misplaced object counts 
drop to a much lower value?
  *   Would it do any good to disable the autoscaler at this point since the 
PGs have already started being moved?
  *   Any other recommendations to make this go smoother?

thanks!

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephadm node-exporter extra_container_args for textfile_collector

2022-10-28 Thread Wyll Ingersoll
Thanks!

I created: https://tracker.ceph.com/issues/57944
(Feature #57944: add option to allow for setting extra daemon args for containerized services - Orchestrator)




From: Adam King 
Sent: Friday, October 28, 2022 2:25 PM
To: Lee Carney 
Cc: Wyll Ingersoll ; ceph-users@ceph.io 

Subject: [ceph-users] Re: cephadm node-exporter extra_container_args for 
textfile_collector

We had actually considered adding an `extra_daemon_args` to be the
equivalent to `extra_container_args` but for the daemon itself rather than
a flag for the podman/docker run command. IIRC we thought it was a good
idea but nobody actually pushed to add it in then since (at the time) we
weren't aware of anyone asking for it. If you want to request the feature
you can create a new tracker issue under
https://tracker.ceph.com/projects/orchestrator/issues?set_filter=1_id=2
but
I think it's something that we'll be adding regardless. The tracker
existing might just raise priority a bit.

On Fri, Oct 28, 2022 at 12:42 PM Lee Carney 
wrote:

> Thanks for confirmation - I've ended up just adding the arguments onto the
> end https://github.com/prometheus/node_exporter/blob/master/Dockerfile#L12
> and creating a custom image.
>
>
> It would be nice to know how to get this request submitted as a feature
> request or bounty etc
>
> ____
> From: Wyll Ingersoll 
> Sent: 28 October 2022 15:19:17
> To: Lee Carney; ceph-users@ceph.io
> Subject: Re: cephadm node-exporter extra_container_args for
> textfile_collector
>
> I ran into the same issue - wanted to add the textfile.directory to the
> node_exporter using "extra_container_args" - and it failed just as you
> describe.  It appears that those args get applied to the container command
> (podman or docker) and not to the actual service in the container.  Not
> sure if that is intentional or not, but it would be nice to be able to add
> args to the process IN the container, especially for the node_exporter so
> that additional data can be collected.
>
>
> 
> From: Lee Carney 
> Sent: Thursday, October 27, 2022 10:27 AM
> To: ceph-users@ceph.io 
> Subject: [ceph-users] cephadm node-exporter extra_container_args for
> textfile_collector
>
> Has anyone had success in using cephadm to add extra_container_args onto
> the node-exporter config? For example changing the collector config.
>
>
> I am trying and failing using the following:
>
>
> 1. Create ne.yml
>
> service_type: node-exporter
> service_name: node-exporter
> placement:
>   host_pattern: '*'
> extra_container_args:
>   - --collector.textfile.directory=/var/lib/node_exporter/textfile_collector
>
>
> 2. cephadm shell --mount ne.yml:/var/lib/ceph/node-exporter/ne.yml
>
> 3. ceph orch apply -i '/var/lib/ceph/node-exporter/ne.yml'
>
> 4. Service will fail to start as the args have been applied at the
> beginning of the service config and not at the end..e.g.  cat
> /var/lib/ceph/a898358c-eeac-11ec-b707-0800279b70f1/node-exporter.node1-cceph1-vagrant-local1/unit.run
>
> /bin/podman run --rm --ipc=host --stop-signal=SIGTERM --net=host --init
> --name
> ceph-a898358c-eeac-11ec-b707-0800279b70f1-node-exporter-node1-cceph1-vagrant-local1
> --user 65534 --security-opt label=disable -d --log-driver journald
> --conmon-pidfile
> /run/ceph-a898358c-eeac-11ec-b707-0800279b70f1@node-exporter.node1-cceph1-vagrant-local1.service-pid
> --cidfile
> /run/ceph-a898358c-eeac-11ec-b707-0800279b70f1@node-exporter.node1-cceph1-vagrant-local1.service-cid
> --cgroups=split
> --collector.textfile.directory=/var/lib/node_exporter/textfile_collector -e
> CONTAINER_IMAGE=quay.io/prometheus/node-exporter:v1.3.1 -e NODE_NAME=
> node1-cceph1-vagrant-local1.dev-globalrelay.net -e
> CEPH_USE_RANDOM_NONCE=1 -v /proc:/host/proc:ro -v /sys:/host/sys:ro -v
> /:/rootfs:ro quay.io/prometheus/node-exporter:v1.3.1 --no-collector.timex
> --web.listen-address=:9100 --path.procfs=/host/proc --path.sysfs=/host/sys
> --path.rootfs=/rootfs
>
>
> Looking at the cephadm deployer I had expected the extra args to be added
> at the end:
> https://github.com/ceph/ceph/blob/c37bd103033bc0a9f05ec0e78cef7cbca5649eeb/src/cephadm/cephadm#L575
> and
> https://github.com/ceph/ceph/blob/c37bd103033bc0a9f05ec0e78cef7cbca5649eeb/src/cephadm/cephadm#L5761
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Re: cephadm node-exporter extra_container_args for textfile_collector

2022-10-28 Thread Wyll Ingersoll
I ran into the same issue - wanted to add the textfile.directory to the 
node_exporter using "extra_container_args" - and it failed just as you 
describe.  It appears that those args get applied to the container command 
(podman or docker) and not to the actual service in the container.  Not sure if 
that is intentional or not, but it would be nice to be able to add args to the 
process IN the container, especially for the node_exporter so that additional 
data can be collected.



From: Lee Carney 
Sent: Thursday, October 27, 2022 10:27 AM
To: ceph-users@ceph.io 
Subject: [ceph-users] cephadm node-exporter extra_container_args for 
textfile_collector

Has anyone had success in using cephadm to add extra_container_args onto the 
node-exporter config? For example changing the collector config.


I am trying and failing using the following:


1. Create ne.yml

service_type: node-exporter
service_name: node-exporter
placement:
  host_pattern: '*'
extra_container_args:
  - --collector.textfile.directory=/var/lib/node_exporter/textfile_collector


2. cephadm shell --mount ne.yml:/var/lib/ceph/node-exporter/ne.yml

3. ceph orch apply -i '/var/lib/ceph/node-exporter/ne.yml'

4. Service will fail to start as the args have been applied at the beginning of 
the service config and not at the end, e.g.  cat 
/var/lib/ceph/a898358c-eeac-11ec-b707-0800279b70f1/node-exporter.node1-cceph1-vagrant-local1/unit.run

/bin/podman run --rm --ipc=host --stop-signal=SIGTERM --net=host --init --name 
ceph-a898358c-eeac-11ec-b707-0800279b70f1-node-exporter-node1-cceph1-vagrant-local1
 --user 65534 --security-opt label=disable -d --log-driver journald 
--conmon-pidfile 
/run/ceph-a898358c-eeac-11ec-b707-0800279b70f1@node-exporter.node1-cceph1-vagrant-local1.service-pid
 --cidfile 
/run/ceph-a898358c-eeac-11ec-b707-0800279b70f1@node-exporter.node1-cceph1-vagrant-local1.service-cid
 --cgroups=split 
--collector.textfile.directory=/var/lib/node_exporter/textfile_collector -e 
CONTAINER_IMAGE=quay.io/prometheus/node-exporter:v1.3.1 -e 
NODE_NAME=node1-cceph1-vagrant-local1.dev-globalrelay.net -e 
CEPH_USE_RANDOM_NONCE=1 -v /proc:/host/proc:ro -v /sys:/host/sys:ro -v 
/:/rootfs:ro quay.io/prometheus/node-exporter:v1.3.1 --no-collector.timex 
--web.listen-address=:9100 --path.procfs=/host/proc --path.sysfs=/host/sys 
--path.rootfs=/rootfs


Looking at the cephadm deployer I had expected the extra args to be added at 
the end: 
https://github.com/ceph/ceph/blob/c37bd103033bc0a9f05ec0e78cef7cbca5649eeb/src/cephadm/cephadm#L575
 and 
https://github.com/ceph/ceph/blob/c37bd103033bc0a9f05ec0e78cef7cbca5649eeb/src/cephadm/cephadm#L5761


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: SMB and ceph question

2022-10-27 Thread Wyll Ingersoll


No - the recommendation is just to mount /cephfs using the kernel module and 
then share it via standard VFS module from Samba. Pretty simple.
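A minimal sketch of that setup, assuming a CephX client named "samba" and a share
rooted at /mnt/cephfs/share (all names here are placeholders):

# on the gateway host: kernel-mount the filesystem
mount -t ceph <mon-host>:/ /mnt/cephfs -o name=samba,secretfile=/etc/ceph/samba.secret

# /etc/samba/smb.conf excerpt: export a subdirectory with the stock VFS
[share]
    path = /mnt/cephfs/share
    read only = no
    browseable = yes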

From: Christophe BAILLON 
Sent: Thursday, October 27, 2022 4:08 PM
To: Wyll Ingersoll 
Cc: Eugen Block ; ceph-users 
Subject: Re: [ceph-users] Re: SMB and ceph question

Re

OK, I thought there was a module like Ganesha for NFS that could be installed directly 
on the cluster...

- Mail original -
> De: "Wyll Ingersoll" 
> À: "Eugen Block" , "ceph-users" 
> Envoyé: Jeudi 27 Octobre 2022 15:25:36
> Objet: [ceph-users] Re: SMB and ceph question

> I don't think there is anything particularly special about exposing /cephfs 
> (or
> subdirs thereof) over SMB with SAMBA.  We've done it for years over various
> releases of both Ceph and Samba.
> Basically, you create a NAS server host that mounts /cephfs and run Samba on
> that host.  You share whatever subdirectories you need to in the usual way.
> SMB clients mount from the Samba service and have no knowledge of the
> underlying storage.
>
>
> 
> From: Eugen Block 
> Sent: Thursday, October 27, 2022 5:40 AM
> To: ceph-users@ceph.io 
> Subject: [ceph-users] Re: SMB and ceph question
>
> Hi,
>
> the SUSE docs [1] are not that old, they apply for Ceph Pacific. Have
> you tried it yet?
> Maybe the upstream docs could adapt the SUSE docs, just an idea if
> there aren't any guides yet on docs.ceph.com.
>
> Regards,
> Eugen
>
> [1] https://documentation.suse.com/ses/7.1/single-html/ses-admin/#cha-ses-cifs
>
> Zitat von Christophe BAILLON :
>
>> Hello,
>>
>> For a side project, we need to expose CephFS data to legacy users
>> via SMB, but I can't find the official way to do that in the Ceph docs.
>> In the old SUSE docs I found a reference to ceph-samba, but I can't find any
>> information in the official Ceph docs.
>> We have a small dedicated cephadm cluster for this; can you help
>> me find the best way to deploy Samba on top?
>>
>> Regards
>>
>> --
>> Christophe BAILLON
>> Mobile :: +336 16 400 522
>> Work :: https://eyona.com
>> Twitter :: https://twitter.com/ctof
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

--
Christophe BAILLON
Mobile :: +336 16 400 522
Work :: https://eyona.com
Twitter :: https://twitter.com/ctof
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: SMB and ceph question

2022-10-27 Thread Wyll Ingersoll


I don't think there is anything particularly special about exposing /cephfs (or 
subdirs thereof) over SMB with SAMBA.  We've done it for years over various 
releases of both Ceph and Samba.
Basically, you create a NAS server host that mounts /cephfs and run Samba on 
that host.  You share whatever subdirectories you need to in the usual way.  
SMB clients mount from the Samba service and have no knowledge of the 
underlying storage.



From: Eugen Block 
Sent: Thursday, October 27, 2022 5:40 AM
To: ceph-users@ceph.io 
Subject: [ceph-users] Re: SMB and ceph question

Hi,

the SUSE docs [1] are not that old, they apply for Ceph Pacific. Have
you tried it yet?
Maybe the upstream docs could adapt the SUSE docs, just an idea if
there aren't any guides yet on docs.ceph.com.

Regards,
Eugen

[1] https://documentation.suse.com/ses/7.1/single-html/ses-admin/#cha-ses-cifs

Zitat von Christophe BAILLON :

> Hello,
>
> For a side project, we need to expose CephFS data to legacy users
> via SMB, but I can't find the official way to do that in the Ceph docs.
> In the old SUSE docs I found a reference to ceph-samba, but I can't find any
> information in the official Ceph docs.
> We have a small dedicated cephadm cluster for this; can you help
> me find the best way to deploy Samba on top?
>
> Regards
>
> --
> Christophe BAILLON
> Mobile :: +336 16 400 522
> Work :: https://eyona.com
> Twitter :: https://twitter.com/ctof
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Dashboard device health info missing

2022-10-24 Thread Wyll Ingersoll


Looking at the device health info for the OSDs in our cluster sometimes shows 
"No SMART data available".  This appears to only occur for SCSI type disks in 
our cluster. ATA disks have their full health SMART data displayed, but the 
non-ATA do not.

The actual SMART data (JSON formatted) is returned by the mgr, though because 
it is formatted differently I suspect the dashboard UI doesn't know how to 
interpret it.  This is misleading; at the least, it could display the "output" 
section for each device so that the viewer can interpret it even if the 
dashboard doesn't know how to.

Is this a known bug?   We would like to have SMART data for all of our devices 
if possible.
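
In the meantime, the raw health data the mgr collects can be pulled from the CLI; a
quick sketch (the device id is a placeholder):

ceph device ls                             # list devices and which daemons use them
ceph device get-health-metrics <devid>     # dump the stored SMART/JSON samples for one device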
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] radosgw networking

2022-10-20 Thread Wyll Ingersoll


What network does radosgw use when it reads/writes the objects to the cluster?

We have a high-speed cluster_network and want the radosgw to write data over 
that instead of the slower public_network if possible. Is that configurable?

thanks!
  Wyllys Ingersoll



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: monitoring drives

2022-10-14 Thread Wyll Ingersoll
This looks very useful.  Has anyone created a grafana dashboard that will 
display the collected data?



From: Konstantin Shalygin 
Sent: Friday, October 14, 2022 12:12 PM
To: John Petrini 
Cc: Marc ; Paul Mezzanini ; ceph-users 

Subject: [ceph-users] Re: monitoring drives

Hi,

You can get this metrics, even wear level, from official smartctl_exporter [1]

[1] https://github.com/prometheus-community/smartctl_exporter

k
Sent from my iPhone

> On 14 Oct 2022, at 17:12, John Petrini  wrote:
>
> We run a mix of Samsung and Intel SSD's, our solution was to write a
> script that parses the output of the Samsung SSD Toolkit and Intel
> ISDCT CLI tools respectively. In our case, we expose those metrics
> using node_exporter's textfile collector for ingestion by prometheus.
> It's mostly the same smart data but it helps identify some vendor
> specific smart metrics, namely SSD wear level, that we were unable to
> decipher from the raw smart data.
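
For reference, the textfile-collector pattern described above boils down to a small
script that writes a .prom file into the collector directory; a minimal sketch with
made-up metric names and paths:

# run from cron/systemd on each node; paths and metric names are examples
TEXTFILE_DIR=/var/lib/node_exporter/textfile_collector
TMP="$TEXTFILE_DIR/.ssd_health.prom.tmp"
echo 'ssd_wear_level_percent{device="sda"} 7' > "$TMP"
mv "$TMP" "$TEXTFILE_DIR/ssd_health.prom"   # rename in place so node_exporter never reads a partial file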
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: osds not bootstrapping: monclient: wait_auth_rotating timed out

2022-09-26 Thread Wyll Ingersoll


Yes, we restarted the primary mon and mgr services.  Still no luck.


From: Dhairya Parmar 
Sent: Monday, September 26, 2022 3:44 PM
To: Wyll Ingersoll 
Cc: ceph-users@ceph.io 
Subject: Re: [ceph-users] osds not bootstrapping: monclient: wait_auth_rotating 
timed out

Looking at the shared tracker, I can see people talking about restarting 
primary mon/mgr and getting this fixed at note-4 
(https://tracker.ceph.com/issues/17170#note-4) and note-8 
(https://tracker.ceph.com/issues/17170#note-8). Did you try that out?

On Tue, Sep 27, 2022 at 12:44 AM Wyll Ingersoll 
<wyllys.ingers...@keepertech.com> wrote:
Ceph Pacific (16.2.9) on a large cluster.  Approximately 60 (out of 700) osds 
fail to start and show an error:

monclient: wait_auth_rotating timed out after 300

We modified the "rotating_keys_bootstrap_timeout" from 30 to 300, but they 
still fail.  All nodes are time-synced with NTP and the skew has been verified 
to be < 1.0 seconds.
It looks a lot like this bug: https://tracker.ceph.com/issues/17170  which does 
not appear to be resolved yet.

Any other suggestions on how to get these OSDs to sync up with the cluster?


thanks!

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



--
Dhairya Parmar

He/Him/His

Associate Software Engineer, CephFS

Red Hat Inc. (https://www.redhat.com/)

dpar...@redhat.com

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] osds not bootstrapping: monclient: wait_auth_rotating timed out

2022-09-26 Thread Wyll Ingersoll
Ceph Pacific (16.2.9) on a large cluster.  Approximately 60 (out of 700) osds 
fail to start and show an error:

monclient: wait_auth_rotating timed out after 300

We modified the "rotating_keys_bootstrap_timeout" from 30 to 300, but they 
still fail.  All nodes are time-synced with NTP and the skew has been verified 
to be < 1.0 seconds.
It looks a lot like this bug: https://tracker.ceph.com/issues/17170  which does 
not appear to be resolved yet.

Any other suggestions on how to get these OSDs to sync up with the cluster?
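
For anyone hitting the same error, the checks described above amount to roughly this
(standard CLI; the timeout option is the one named in this thread):

ceph time-sync-status                                         # mon view of clock skew across the quorum
chronyc tracking                                              # per-node NTP offset
ceph config set global rotating_keys_bootstrap_timeout 300    # raised from the default of 30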


thanks!

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Balancer Distribution Help

2022-09-23 Thread Wyll Ingersoll
Understood, that was a typo on my part.

Definitely don't cancel-backfill after generating the moves from 
placementoptimizer.
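
Spelled out, the corrected sequence looks roughly like this (same tools as in the
thread; flag values are examples, not recommendations):

ceph osd set norebalance && ceph osd set norecover && ceph osd set nobackfill
ceph balancer off
pgremapper cancel-backfill --yes                    # run once, up front, to clear pending backfill
placementoptimizer.py balance --max-pg-moves 100 | tee upmap-moves
bash upmap-moves
# repeat only the last two commands until balanced, then unset the flags and
# re-enable the balancer; re-running cancel-backfill here would undo the moves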


From: Josh Baergen 
Sent: Friday, September 23, 2022 11:31 AM
To: Wyll Ingersoll 
Cc: Eugen Block ; ceph-users@ceph.io 
Subject: Re: [ceph-users] Re: Balancer Distribution Help

Hey Wyll,

> $ pgremapper cancel-backfill --yes   # to stop all pending operations
> $ placementoptimizer.py balance --max-pg-moves 100 | tee upmap-moves
> $ bash upmap-moves
>
> Repeat the above 3 steps until balance is achieved, then re-enable the 
> balancer and unset the "no" flags set earlier?

You don't want to run cancel-backfill after placementoptimizer,
otherwise it will undo the balancing backfill.

Josh
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Balancer Distribution Help

2022-09-23 Thread Wyll Ingersoll


When doing manual remapping/rebalancing with tools like pgremapper and 
placementoptimizer, what are the recommended settings for norebalance, 
norecover, nobackfill?
Should the balancer module be disabled if we are manually issuing the pg remap 
commands generated by those scripts so it doesn't interfere?

Something like this:

$ ceph osd set norebalance
$ ceph osd set norecover
$ ceph osd set nobackfill
$ ceph balancer off

$ pgremapper cancel-backfill --yes   # to stop all pending operations
$ placementoptimizer.py balance --max-pg-moves 100 | tee upmap-moves
$ bash upmap-moves

Repeat the above 3 steps until balance is achieved, then re-enable the balancer 
and unset the "no" flags set earlier?



From: Eugen Block 
Sent: Friday, September 23, 2022 2:21 AM
To: ceph-users@ceph.io 
Subject: [ceph-users] Re: Balancer Distribution Help

+1 for increasing PG numbers, those are quite low.

Zitat von Bailey Allison :

> Hi Reed,
>
> Just taking a quick glance at the Pastebin provided I have to say
> your cluster balance is already pretty damn good all things
> considered.
>
> We've seen the upmap balancer at its best in practice provide a
> deviation of about 10-20% across OSDs, which seems to be
> matching up on your cluster. As you add more nodes
> and OSDs of equal size to the cluster, and as the
> PG count increases, it can do a better and better job,
> but in practice about a 10% difference between OSDs is very normal.
>
> Something to note in the video provided is that they were using a
> cluster with 28PB of storage available, so who knows how many
> OSDs/nodes/PGs per pool, etc., their cluster has the luxury and
> ability to balance across.
>
> The only thing I can think to suggest is just increasing the PG
> count as you've already mentioned. The ideal setting is about 100
> PGs per OSD, and looking at your cluster both the SSDs and the
> smaller HDDs have only about 50 PGs per OSD.
>
> If you're able to get both of those devices to a closer to 100 PG
> per OSD ratio it should help a lot more with the balancing. More PGs
> means more places to distribute data.
>
> It will be tricky in that I am just noticing for the HDDs you have
> some hosts/chassis with 24 OSDs per and others with 6 HDDs per so
> getting the PG distribution more even for those will be challenging,
> but for the SSDs it should be quite simple to get those to be 100
> PGs per OSD.
>
> Taking a further look, although I will say that across the entire
> cluster the actual data stored is balanced well, there are a couple
> of OSDs where the OMAP/metadata is not balanced as well as the
> others.
>
> Since you are using EC pools for CephFS, note that OMAP data cannot be
> stored within EC pools, so it will all be stored in a replicated
> CephFS data pool, most likely your hdd_cephfs pool.
>
> Just something to keep in mind as not only is it important to make
> sure the data is balanced, but the OMAP data and metadata are
> balanced as well.
>
> Otherwise though I would recommended just trying to get your cluster
> to a point where each of the OSDs have roughly 100 PGs per OSD, or
> at least as close to this as you are able to given your clusters
> crush rulesets.
>
> This should then help the balancer spread the data across the
> cluster, but again unless I overlooked something your cluster
> already appears to be extremely well balanced.
>
> There is a PG calculator you can use online at:
>
> https://old.ceph.com/pgcalc/
>
> There is also a PG calc on the Redhat website but it requires a subscription.
>
> Both calculators are essentially the same but I have noticed the
> free one will round down the PGs and the Redhat one will round up
> the PGs.
>
> Regards,
>
> Bailey
>
> -Original Message-
> From: Reed Dier 
> Sent: September 22, 2022 4:48 PM
> To: ceph-users 
> Subject: [ceph-users] Balancer Distribution Help
>
> Hoping someone can point me to possible tunables that could
> hopefully better tighten my OSD distribution.
>
> Cluster is currently
>> "ceph version 15.2.16 (d46a73d6d0a67a79558054a3a5a72cb561724974)
>> octopus (stable)": 307
> With plans to begin moving to pacific before end of year, with a
> possible interim stop at octopus.17 on the way.
>
> Cluster was born on jewel, and is fully bluestore/straw2.
> The upmap balancer works/is working, but not to the degree that I
> believe it could/should work, which seems should be much closer to
> near perfect than what I’m seeing.
>
> https://imgur.com/a/lhtZswo  <-
> Histograms of my OSD distribution
>
> https://pastebin.com/raw/dk3fd4GH
>  <- pastebin of
> cluster/pool/crush relevant bits
>
> To put it succinctly, I’m hoping to get much tighter OSD
> distribution, but I’m not sure what knobs to try turning next, as
> the upmap balancer has gone as far as it can, and I end up playing
> “reweight the 

[ceph-users] Re: data usage growing despite data being written

2022-09-07 Thread Wyll Ingersoll
OK we did get the "pgremapper" installed and are starting to use it.
We got some warnings when trying to rebalance an entire host bucket, saying 
some of our pgs have mismatched lengths and then "nothing to do".
Switching to individual "drain" operations now to see if that helps.



From: Stefan Kooman 
Sent: Wednesday, September 7, 2022 11:34 AM
To: Wyll Ingersoll ; Gregory Farnum 

Cc: ceph-users@ceph.io 
Subject: Re: [ceph-users] Re: data usage growing despite data being written

On 9/7/22 16:38, Wyll Ingersoll wrote:
> I'm sure we probably have but I'm not sure what else to do.  We are desperate 
> to get data off of these 99%+ OSDs and the cluster by itself isn't doing it.
>
> The crushmap appears ok.  we have replicated pools and a large EC pool, all 
> are using host-based failure domains.  The new osds on the newly added hosts 
> are slowly filling, just not as much as we expected.
>
> We have far too many osds at 99%+ and they continue to fill up.  How do we 
> remove the excess OSDMap data, is it even possible?
>
> If we shouldn't be migrating PGs and we cannot remove data, what are our 
> options to get it to balance again and stop filling up with OSDMaps and other 
> internal ceph data?

Have you tried with the tools I mentioned before in the other thread? To
drain PGs from one OSD to another, you can use the pgremapper tool for
example [1].

Good to know: The *fullest* OSD determines how full the cluster is, and
how much space you have available. If OSDs get fuller and fuller ...
than it might appear that data is written, but in reality it's not. So
you can "fill" up a cluster without writing data to it.

Gr. Stefan

[1]: https://github.com/digitalocean/pgremapper/#drain

P.s. I guess you meant to say  "despite *no* data being written", right?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: data usage growing despite data being written

2022-09-07 Thread Wyll Ingersoll


We've been working with the "upmap_remapper.py" script from CERN, which is OK 
but not working well enough.
We cannot bring in the pgremapper currently, as it has some dependencies and 
would require special approval to bring into the environment.
We want to get it stable enough to use the "placementoptimizer" utility, but 
the epochs are changing too fast and it won't work right now.

From: Stefan Kooman 
Sent: Wednesday, September 7, 2022 11:34 AM
To: Wyll Ingersoll ; Gregory Farnum 

Cc: ceph-users@ceph.io 
Subject: Re: [ceph-users] Re: data usage growing despite data being written

On 9/7/22 16:38, Wyll Ingersoll wrote:
> I'm sure we probably have but I'm not sure what else to do.  We are desperate 
> to get data off of these 99%+ OSDs and the cluster by itself isn't doing it.
>
> The crushmap appears ok.  we have replicated pools and a large EC pool, all 
> are using host-based failure domains.  The new osds on the newly added hosts 
> are slowly filling, just not as much as we expected.
>
> We have far too many osds at 99%+ and they continue to fill up.  How do we 
> remove the excess OSDMap data, is it even possible?
>
> If we shouldn't be migrating PGs and we cannot remove data, what are our 
> options to get it to balance again and stop filling up with OSDMaps and other 
> internal ceph data?

Have you tried with the tools I mentioned before in the other thread? To
drain PGs from one OSD to another, you can use the pgremapper tool for
example [1].

Good to know: The *fullest* OSD determines how full the cluster is, and
how much space you have available. If OSDs get fuller and fuller ...
then it might appear that data is written, but in reality it's not. So
you can "fill" up a cluster without writing data to it.

Gr. Stefan

[1]: https://github.com/digitalocean/pgremapper/#drain

P.s. I guess you meant to say  "despite *no* data being written", right?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: data usage growing despite data being written

2022-09-07 Thread Wyll Ingersoll


Can we tweak the osdmap pruning parameters to be more aggressive about trimming 
those osdmaps?   Would that reduce data on the OSDs or only on the MON DB?
Looking at mon_min_osdmap_epochs (500) and mon_osdmap_full_prune_min (1).

Is there a way to find out how many osdmaps are currently being kept?
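
One way to check, assuming your release's mon report still exposes these fields:

ceph report 2>/dev/null | jq '.osdmap_first_committed, .osdmap_last_committed'
# the difference between the two is the number of osdmap epochs the mons retain
ceph config get mon mon_min_osdmap_epochs        # the floor the mons will not trim below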

From: Gregory Farnum 
Sent: Wednesday, September 7, 2022 10:58 AM
To: Wyll Ingersoll 
Cc: ceph-users@ceph.io 
Subject: Re: [ceph-users] data usage growing despite data being written

On Wed, Sep 7, 2022 at 7:38 AM Wyll Ingersoll
 wrote:
>
> I'm sure we probably have but I'm not sure what else to do.  We are desperate 
> to get data off of these 99%+ OSDs and the cluster by itself isn't doing it.
>
> The crushmap appears ok.  we have replicated pools and a large EC pool, all 
> are using host-based failure domains.  The new osds on the newly added hosts 
> are slowly filling, just not as much as we expected.
>
> We have far too many osds at 99%+ and they continue to fill up.  How do we 
> remove the excess OSDMap data, is it even possible?
>
> If we shouldn't be migrating PGs and we cannot remove data, what are our 
> options to get it to balance again and stop filling up with OSDMaps and other 
> internal ceph data?

Well, you can turn things off, figure out the proper mapping, and use
the ceph-objectstore-tool to migrate PGs to their proper destinations
(letting the cluster clean up the excess copies if you can afford to —
deleting things is always scary).
But I haven't had to help recover a death-looping cluster in around a
decade, so that's about all the options I can offer up.
-Greg


>
> thanks!
>
>
>
> 
> From: Gregory Farnum 
> Sent: Wednesday, September 7, 2022 10:01 AM
> To: Wyll Ingersoll 
> Cc: ceph-users@ceph.io 
> Subject: Re: [ceph-users] data usage growing despite data being written
>
> On Tue, Sep 6, 2022 at 2:08 PM Wyll Ingersoll
>  wrote:
> >
> >
> > Our cluster has not had any data written to it externally in several weeks, 
> > but yet the overall data usage has been growing.
> > Is this due to heavy recovery activity?  If so, what can be done (if 
> > anything) to reduce the data generated during recovery.
> >
> > We've been trying to move PGs away from high-usage OSDS (many over 99%), 
> > but it's like playing whack-a-mole, the cluster keeps sending new data to 
> > already overly full osds making further recovery nearly impossible.
>
> I may be missing something, but I think you've really slowed things
> down by continually migrating PGs around while the cluster is already
> unhealthy. It forces a lot of new OSDMap generation and general churn
> (which itself slows down data movement.)
>
> I'd also examine your crush map carefully, since it sounded like you'd
> added some new hosts and they weren't getting the data you expected
> them to. Perhaps there's some kind of imbalance (eg, they aren't in
> racks, and selecting those is part of your crush rule?).
> -Greg
>
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: data usage growing despite data being written

2022-09-07 Thread Wyll Ingersoll
I'm sure we probably have but I'm not sure what else to do.  We are desperate 
to get data off of these 99%+ OSDs and the cluster by itself isn't doing it.

The crushmap appears ok.  we have replicated pools and a large EC pool, all are 
using host-based failure domains.  The new osds on the newly added hosts are 
slowly filling, just not as much as we expected.

We have far too many osds at 99%+ and they continue to fill up.  How do we 
remove the excess OSDMap data, is it even possible?

If we shouldn't be migrating PGs and we cannot remove data, what are our 
options to get it to balance again and stop filling up with OSDMaps and other 
internal ceph data?

thanks!




From: Gregory Farnum 
Sent: Wednesday, September 7, 2022 10:01 AM
To: Wyll Ingersoll 
Cc: ceph-users@ceph.io 
Subject: Re: [ceph-users] data usage growing despite data being written

On Tue, Sep 6, 2022 at 2:08 PM Wyll Ingersoll
 wrote:
>
>
> Our cluster has not had any data written to it externally in several weeks, 
> but yet the overall data usage has been growing.
> Is this due to heavy recovery activity?  If so, what can be done (if 
> anything) to reduce the data generated during recovery.
>
> We've been trying to move PGs away from high-usage OSDS (many over 99%), but 
> it's like playing whack-a-mole, the cluster keeps sending new data to already 
> overly full osds making further recovery nearly impossible.

I may be missing something, but I think you've really slowed things
down by continually migrating PGs around while the cluster is already
unhealthy. It forces a lot of new OSDMap generation and general churn
(which itself slows down data movement.)

I'd also examine your crush map carefully, since it sounded like you'd
added some new hosts and they weren't getting the data you expected
them to. Perhaps there's some kind of imbalance (eg, they aren't in
racks, and selecting those is part of your crush rule?).
-Greg

>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] data usage growing despite data being written

2022-09-06 Thread Wyll Ingersoll


Our cluster has not had any data written to it externally in several weeks, but 
yet the overall data usage has been growing.
Is this due to heavy recovery activity?  If so, what can be done (if anything) 
to reduce the data generated during recovery.

We've been trying to move PGs away from high-usage OSDS (many over 99%), but 
it's like playing whack-a-mole, the cluster keeps sending new data to already 
overly full osds making further recovery nearly impossible.


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] More recovery pain

2022-09-01 Thread Wyll Ingersoll


We are in the middle of a massive recovery event and our monitor DBs keep 
exploding to the point that they fill their disk partition (800GB disk).   We 
cannot compact them because there is no room on the device for compaction to 
happen.  We cannot add another disk at this time either.  We destroyed and rebuilt 
one of them and now it won't come back into quorum.

What other parameters can we set to limit the growth of the monitor DB during 
recovery?
Are there tools to manually purge the LOGM messages from the monitor DB?

We disabled clog_to_monitors already.
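
For reference, the knobs and commands usually in play here (a sketch only; online
compaction still needs scratch space to run in):

ceph tell mon.<id> compact                        # trigger an online RocksDB compaction
ceph config set mon mon_compact_on_start true     # compact whenever a mon restarts
ceph config get mon mon_min_osdmap_epochs         # how many osdmaps the mons try to keep
ceph config get mon paxos_service_trim_min        # paxos trim thresholds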



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSDs growing beyond full ratio

2022-08-30 Thread Wyll Ingersoll
One of our OSDs eventually reached 100% capacity (in spite of the full ratio 
being 95%).  Now it is down and we cannot restart the osd process on it because 
there is not enough space on the device.

Is there a way to find PGs on that disk that can be safely removed without 
destroying data so we can bring it back online?  This is a bluestore OSD.

I don't understand how this overfilling issue is not already a bug that is 
getting attention; it seems very broken that an OSD can blow way past its 
full_ratio.
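
With the OSD down, its contents can still be inspected offline; a sketch with
ceph-objectstore-tool (the data path and PG id are examples, and anything removed
should be exported somewhere safe first):

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-NN --op list-pgs
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-NN --pgid 11.2f \
    --op export --file /mnt/spare/pg11.2f.export
# --op export-remove frees the space on the full OSD; the saved PG can later be
# brought back on another OSD with --op import if it turns out to be needed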



From: Wyll Ingersoll 
Sent: Monday, August 29, 2022 9:24 AM
To: Jarett ; ceph-users@ceph.io 
Subject: [ceph-users] Re: OSDs growing beyond full ratio


I would think so, but it isn't happening nearly fast enough.

It's literally been over 10 days with 40 new drives across 2 new servers and 
they barely have any PGs yet. A few, but not nearly enough to help with the 
imbalance.

From: Jarett 
Sent: Sunday, August 28, 2022 8:19 PM
To: Wyll Ingersoll ; ceph-users@ceph.io 

Subject: RE: [ceph-users] OSDs growing beyond full ratio


Isn’t rebalancing onto the empty OSDs default behavior?



From: Wyll Ingersoll <wyllys.ingers...@keepertech.com>
Sent: Sunday, August 28, 2022 10:31 AM
To: ceph-users@ceph.io
Subject: [ceph-users] OSDs growing beyond full ratio



We have a pacific cluster that is overly filled and is having major trouble 
recovering.  We are desperate for help in improving recovery speed.  We have 
modified all of the various recovery throttling parameters.



The full_ratio is 0.95 but we have several osds that continue to grow and are 
approaching 100% utilization.  They are reweighted to almost 0, but yet 
continue to grow.

Why is this happening?  I thought the cluster would stop writing to the osd 
when it was at above the full ratio.





We have added additional capacity to the cluster but the new OSDs are being 
used very very slowly.  The primary pool in the cluster is the RGW data pool 
which is a 12+4 EC pool using "host" placement rules across 18 hosts, 2 new 
hosts with 20x10TB osds each were recently added but they are only very very 
slowly being filled up.  I don't see how to force recovery on that particular 
pool.   From what I understand, we cannot modify the EC parameters without 
destroying the pool and we cannot offload that pool to any others because there 
is no other place to store the amount of data.





We have been running "ceph osd reweight-by-utilization"  periodically and it 
works for a while (a few hours) but then recovery and backfill IO numbers drop 
to negligible values.



The balancer module will not run because the current misplaced % is about 97%.



Would it be more effective to use the osdmaptool and generate a bunch of upmap 
commands to manually move data around, or keep trying to get 
reweight-by-utilization to work?



Any suggestions (other than deleting data which we cannot do at this point, the 
pools are not accessible) or adding more storage (we already did and it is not 
being utilized very heavily yet for some reason).









___

ceph-users mailing list -- ceph-users@ceph.io

To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSDs growing beyond full ratio

2022-08-30 Thread Wyll Ingersoll


Thanks, we may resort to that if we can't make progress in rebalancing things.

From: Dave Schulz 
Sent: Tuesday, August 30, 2022 11:18 AM
To: Wyll Ingersoll ; Josh Baergen 

Cc: ceph-users@ceph.io 
Subject: Re: [ceph-users] Re: OSDs growing beyond full ratio


Hi Wyll,

The only way I could get my OSDs to start dropping their utilization because of 
a similar "unable to access the fs" problem was to run "ceph osd crush reweight 
 0" on the full OSDs then wait while they start to empty and get below the 
full ratio.  Not this is different from ceph osd reweight (missing the word 
crush).  I know this goes against the documented best practices and I'm just 
relaying what worked for me recently.  I'm running 14.2.22 and I think you're 
Pacific which is 2 major versions newer.

In case it's important: We also had HDD with SSD for DB/WAL.


@Ceph gurus: Is a file in ceph assigned to a specific PG?  In my case it seems 
like a file that's close to the size of a single OSD gets moved from one OSD to 
the next filling it up and domino-ing around the cluster filling up OSDs.

Sincerely

-Dave


On 2022-08-30 8:04 a.m., Wyll Ingersoll wrote:



OSDs are bluestore on HDD with SSD for DB/WAL.  We already tuned the sleep_hdd 
to 0 and cranked up the max_backfills and recovery parameters to much higher 
values.



From: Josh Baergen <jbaer...@digitalocean.com>
Sent: Tuesday, August 30, 2022 9:46 AM
To: Wyll Ingersoll <wyllys.ingers...@keepertech.com>
Cc: Dave Schulz <dsch...@ucalgary.ca>; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: OSDs growing beyond full ratio

Hey Wyll,

I haven't been following this thread very closely so my apologies if
this has already been covered: Are the OSDs on HDDs or SSDs (or
hybrid)? If HDDs, you may want to look at decreasing
osd_recovery_sleep_hdd and increasing osd_max_backfills. YMMV, but
I've seen osd_recovery_sleep_hdd=0.01 and osd_max_backfills=6 work OK
on Bluestore HDDs. This would help speed up the data movements.

If it's a hybrid setup, I'm sure you could apply similar tweaks. Sleep
is already 0 for SSDs but you may be able to increase max_backfills
for some gains.

Josh

On Tue, Aug 30, 2022 at 7:31 AM Wyll Ingersoll
<wyllys.ingers...@keepertech.com> wrote:
>
>
> Yes, this cluster has both - a large cephfs FS (60TB) that is replicated 
> (2-copy) and a really large RGW data pool that is EC (12+4).  We cannot 
> currently delete any data from either of them because commands to access them 
> are not responsive.  The cephfs will not mount and radosgw-admin just hangs.
>
> We have several OSDs that are >99% full and keep approaching 100, even after 
> reweighting them to 0. There is no client activity in this cluster at this 
> point (its dead), but lots of rebalance and repairing going on) so data is 
> moving around.
>
> We are currently trying to use upmap commands to relocate PGs in to attempt 
> to balance things better and get it moving again, but progress is glacially 
> slow.
>
> ____
> From: Dave Schulz <dsch...@ucalgary.ca>
> Sent: Monday, August 29, 2022 10:42 PM
> To: Wyll Ingersoll <wyllys.ingers...@keepertech.com>; ceph-users@ceph.io
> Subject: Re: [ceph-users] Re: OSDs growing beyond full ratio
>
> Hi Wyll,
>
> Any chance you're using CephFS and have some really large files in the
> CephFS filesystem?  Erasure coding? I recently encountered a similar
> problem and as soon as the end-user deleted the really large files our
> problem became much more managable.
>
> I had issues reweighting OSDs too and in the end I changed the crush
> weights and had to chase them around every couple of days reweighting
> the OSDs >70% to zero and then setting them back to 12 when they were
> mostly empty (12TB spinning rust buckets).  Note that I'm really not
> recommending this course of action it's just the only option that seemed
> to have any effect.
>
> -Dave
>
> On 2022-08-29 3:00 p.m., Wyll Ingersoll wrote:
> >
> >
> >
> > Can anyone explain why OSDs (ceph pacific, bluestore osds) continue to grow 
> > well after they have exceeded the "full" level (95%) and is there any way 
> > to stop this?
> >
> > "The full_ratio is 0.95 but we have several osds that continue to grow and 
> > are approaching 100% utilization.  They are reweighted to almost 0, but yet 
> > continue to grow.
> > Why is this happening?  I thought the cluster would stop writing to the osd 
> > when it was at above the

[ceph-users] Re: OSDs growing beyond full ratio

2022-08-30 Thread Wyll Ingersoll


OSDs are bluestore on HDD with SSD for DB/WAL.  We already tuned the sleep_hdd 
to 0 and cranked up the max_backfills and recovery parameters to much higher 
values.
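
Concretely, that tuning corresponds to settings along these lines (the values are
the ones mentioned in this thread, not general recommendations):

ceph config set osd osd_recovery_sleep_hdd 0
ceph config set osd osd_max_backfills 6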



From: Josh Baergen 
Sent: Tuesday, August 30, 2022 9:46 AM
To: Wyll Ingersoll 
Cc: Dave Schulz ; ceph-users@ceph.io 
Subject: Re: [ceph-users] Re: OSDs growing beyond full ratio

Hey Wyll,

I haven't been following this thread very closely so my apologies if
this has already been covered: Are the OSDs on HDDs or SSDs (or
hybrid)? If HDDs, you may want to look at decreasing
osd_recovery_sleep_hdd and increasing osd_max_backfills. YMMV, but
I've seen osd_recovery_sleep_hdd=0.01 and osd_max_backfills=6 work OK
on Bluestore HDDs. This would help speed up the data movements.

If it's a hybrid setup, I'm sure you could apply similar tweaks. Sleep
is already 0 for SSDs but you may be able to increase max_backfills
for some gains.

Josh

On Tue, Aug 30, 2022 at 7:31 AM Wyll Ingersoll
 wrote:
>
>
> Yes, this cluster has both - a large cephfs FS (60TB) that is replicated 
> (2-copy) and a really large RGW data pool that is EC (12+4).  We cannot 
> currently delete any data from either of them because commands to access them 
> are not responsive.  The cephfs will not mount and radosgw-admin just hangs.
>
> We have several OSDs that are >99% full and keep approaching 100, even after 
> reweighting them to 0. There is no client activity in this cluster at this 
> point (its dead), but lots of rebalance and repairing going on) so data is 
> moving around.
>
> We are currently trying to use upmap commands to relocate PGs in to attempt 
> to balance things better and get it moving again, but progress is glacially 
> slow.
>
> 
> From: Dave Schulz 
> Sent: Monday, August 29, 2022 10:42 PM
> To: Wyll Ingersoll ; ceph-users@ceph.io 
> 
> Subject: Re: [ceph-users] Re: OSDs growing beyond full ratio
>
> Hi Wyll,
>
> Any chance you're using CephFS and have some really large files in the
> CephFS filesystem?  Erasure coding? I recently encountered a similar
> problem and as soon as the end-user deleted the really large files our
> problem became much more managable.
>
> I had issues reweighting OSDs too and in the end I changed the crush
> weights and had to chase them around every couple of days reweighting
> the OSDs >70% to zero and then setting them back to 12 when they were
> mostly empty (12TB spinning rust buckets).  Note that I'm really not
> recommending this course of action it's just the only option that seemed
> to have any effect.
>
> -Dave
>
> On 2022-08-29 3:00 p.m., Wyll Ingersoll wrote:
> >
> >
> >
> > Can anyone explain why OSDs (ceph pacific, bluestore osds) continue to grow 
> > well after they have exceeded the "full" level (95%) and is there any way 
> > to stop this?
> >
> > "The full_ratio is 0.95 but we have several osds that continue to grow and 
> > are approaching 100% utilization.  They are reweighted to almost 0, but yet 
> > continue to grow.
> > Why is this happening?  I thought the cluster would stop writing to the osd 
> > when it was at above the full ratio."
> >
> > thanks...
> >
> > 
> > From: Wyll Ingersoll 
> > Sent: Monday, August 29, 2022 9:24 AM
> > To: Jarett ; ceph-users@ceph.io 
> > Subject: [ceph-users] Re: OSDs growing beyond full ratio
> >
> >
> > I would think so, but it isn't happening nearly fast enough.
> >
> > It's literally been over 10 days with 40 new drives across 2 new servers 
> > and they barely have any PGs yet. A few, but not nearly enough to help with 
> > the imbalance.
> > 
> > From: Jarett 
> > Sent: Sunday, August 28, 2022 8:19 PM
> > To: Wyll Ingersoll ; ceph-users@ceph.io 
> > 
> > Subject: RE: [ceph-users] OSDs growing beyond full ratio
> >
> >
> > Isn’t rebalancing onto the empty OSDs default behavior?
> >
> >
> >
> > From: Wyll Ingersoll <wyllys.ingers...@keepertech.com>
> > Sent: Sunday, August 28, 2022 10:31 AM
> > To: ceph-users@ceph.io
> > Subject: [ceph-users] OSDs growing beyond full ratio
> >
> >
> >
> > We have a pacific cluster that is overly filled and is having major trouble 
> > recovering.  We are desperate for help in improving recovery speed.  We 
> > have modified all of the various recovery throttling parameters.
> >
> >
> >
> > The full_ratio is 0.95 but we have several osds that continue to grow and 
> > are approa

[ceph-users] Re: OSDs growing beyond full ratio

2022-08-30 Thread Wyll Ingersoll

Yes, this cluster has both - a large cephfs FS (60TB) that is replicated 
(2-copy) and a really large RGW data pool that is EC (12+4).  We cannot 
currently delete any data from either of them because commands to access them 
are not responsive.  The cephfs will not mount and radosgw-admin just hangs.

We have several OSDs that are >99% full and keep approaching 100, even after 
reweighting them to 0. There is no client activity in this cluster at this 
point (its dead), but lots of rebalance and repairing going on) so data is 
moving around.

We are currently trying to use upmap commands to relocate PGs in an attempt to 
balance things better and get it moving again, but progress is glacially slow.
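
For context, the upmap commands in question are of this form (the PG id and OSD ids
are placeholders):

ceph osd pg-upmap-items 11.2f 121 204     # remap PG 11.2f's copy from osd.121 to osd.204
# pairs can be chained: ceph osd pg-upmap-items <pgid> <from> <to> [<from> <to> ...]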


From: Dave Schulz 
Sent: Monday, August 29, 2022 10:42 PM
To: Wyll Ingersoll ; ceph-users@ceph.io 

Subject: Re: [ceph-users] Re: OSDs growing beyond full ratio

Hi Wyll,

Any chance you're using CephFS and have some really large files in the
CephFS filesystem?  Erasure coding? I recently encountered a similar
problem and as soon as the end-user deleted the really large files our
problem became much more managable.

I had issues reweighting OSDs too and in the end I changed the crush
weights and had to chase them around every couple of days reweighting
the OSDs >70% to zero and then setting them back to 12 when they were
mostly empty (12TB spinning rust buckets).  Note that I'm really not
recommending this course of action it's just the only option that seemed
to have any effect.

-Dave

On 2022-08-29 3:00 p.m., Wyll Ingersoll wrote:
>
>
>
> Can anyone explain why OSDs (ceph pacific, bluestore osds) continue to grow 
> well after they have exceeded the "full" level (95%) and is there any way to 
> stop this?
>
> "The full_ratio is 0.95 but we have several osds that continue to grow and 
> are approaching 100% utilization.  They are reweighted to almost 0, but yet 
> continue to grow.
> Why is this happening?  I thought the cluster would stop writing to the osd 
> when it was at above the full ratio."
>
> thanks...
>
> 
> From: Wyll Ingersoll 
> Sent: Monday, August 29, 2022 9:24 AM
> To: Jarett ; ceph-users@ceph.io 
> Subject: [ceph-users] Re: OSDs growing beyond full ratio
>
>
> I would think so, but it isn't happening nearly fast enough.
>
> It's literally been over 10 days with 40 new drives across 2 new servers and 
> they barely have any PGs yet. A few, but not nearly enough to help with the 
> imbalance.
> 
> From: Jarett 
> Sent: Sunday, August 28, 2022 8:19 PM
> To: Wyll Ingersoll ; ceph-users@ceph.io 
> 
> Subject: RE: [ceph-users] OSDs growing beyond full ratio
>
>
> Isn’t rebalancing onto the empty OSDs default behavior?
>
>
>
> From: Wyll Ingersoll <wyllys.ingers...@keepertech.com>
> Sent: Sunday, August 28, 2022 10:31 AM
> To: ceph-users@ceph.io
> Subject: [ceph-users] OSDs growing beyond full ratio
>
>
>
> We have a pacific cluster that is overly filled and is having major trouble 
> recovering.  We are desperate for help in improving recovery speed.  We have 
> modified all of the various recovery throttling parameters.
>
>
>
> The full_ratio is 0.95 but we have several osds that continue to grow and are 
> approaching 100% utilization.  They are reweighted to almost 0, but yet 
> continue to grow.
>
> Why is this happening?  I thought the cluster would stop writing to the osd 
> when it was at above the full ratio.
>
>
>
>
>
> We have added additional capacity to the cluster but the new OSDs are being 
> used very very slowly.  The primary pool in the cluster is the RGW data pool 
> which is a 12+4 EC pool using "host" placement rules across 18 hosts, 2 new 
> hosts with 20x10TB osds each were recently added but they are only very very 
> slowly being filled up.  I don't see how to force recovery on that particular 
> pool.   From what I understand, we cannot modify the EC parameters without 
> destroying the pool and we cannot offload that pool to any others because 
> there is no other place to store the amount of data.
>
>
>
>
>
> We have been running "ceph osd reweight-by-utilization"  periodically and it 
> works for a while (a few hours) but then recovery and backfill IO numbers 
> drop to negligible values.
>
>
>
> The balancer module will not run because the current misplaced % is about 97%.
>
>
>
> Would it be more effective to use the osdmaptool and generate a bunch of upmap 
> commands to manually move data around, or keep trying to get 
> reweight-by-utilization to work?
>
>
>
> Any suggestions (other th

[ceph-users] Re: OSDs growing beyond full ratio

2022-08-29 Thread Wyll Ingersoll


Can anyone explain why OSDs (ceph pacific, bluestore osds) continue to grow 
well after they have exceeded the "full" level (95%) and is there any way to 
stop this?

"The full_ratio is 0.95 but we have several osds that continue to grow and are 
approaching 100% utilization.  They are reweighted to almost 0, but yet 
continue to grow.
Why is this happening?  I thought the cluster would stop writing to the osd 
when it was at above the full ratio."

thanks...

____
From: Wyll Ingersoll 
Sent: Monday, August 29, 2022 9:24 AM
To: Jarett ; ceph-users@ceph.io 
Subject: [ceph-users] Re: OSDs growing beyond full ratio


I would think so, but it isn't happening nearly fast enough.

It's literally been over 10 days with 40 new drives across 2 new servers and 
they barely have any PGs yet. A few, but not nearly enough to help with the 
imbalance.

From: Jarett 
Sent: Sunday, August 28, 2022 8:19 PM
To: Wyll Ingersoll ; ceph-users@ceph.io 

Subject: RE: [ceph-users] OSDs growing beyond full ratio


Isn’t rebalancing onto the empty OSDs default behavior?



From: Wyll Ingersoll <wyllys.ingers...@keepertech.com>
Sent: Sunday, August 28, 2022 10:31 AM
To: ceph-users@ceph.io
Subject: [ceph-users] OSDs growing beyond full ratio



We have a pacific cluster that is overly filled and is having major trouble 
recovering.  We are desperate for help in improving recovery speed.  We have 
modified all of the various recovery throttling parameters.



The full_ratio is 0.95 but we have several osds that continue to grow and are 
approaching 100% utilization.  They are reweighted to almost 0, but yet 
continue to grow.

Why is this happening?  I thought the cluster would stop writing to the osd 
when it was at above the full ratio.





We have added additional capacity to the cluster but the new OSDs are being 
used very very slowly.  The primary pool in the cluster is the RGW data pool 
which is a 12+4 EC pool using "host" placement rules across 18 hosts, 2 new 
hosts with 20x10TB osds each were recently added but they are only very very 
slowly being filled up.  I don't see how to force recovery on that particular 
pool.   From what I understand, we cannot modify the EC parameters without 
destroying the pool and we cannot offload that pool to any others because there 
is no other place to store the amount of data.





We have been running "ceph osd reweight-by-utilization"  periodically and it 
works for a while (a few hours) but then recovery and backfill IO numbers drop 
to negligible values.



The balancer module will not run because the current misplaced % is about 97%.



Would it be more effective to use the osdmaptool and generate a bunch of upmap 
commands to manually move data around, or keep trying to get 
reweight-by-utilization to work?



Any suggestions (other than deleting data which we cannot do at this point, the 
pools are not accessible) or adding more storage (we already did and it is not 
being utilized very heavily yet for some reason).









___

ceph-users mailing list -- ceph-users@ceph.io

To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSDs growing beyond full ratio

2022-08-29 Thread Wyll Ingersoll
Thank You!

I will see about trying these out, probably using your suggestion of several 
iterations with #1 and then #3.
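
For the record, the usual way those scripts are driven, per their READMEs (review the
generated commands before piping anything to sh):

./upmap-remapped.py | sh                                        # [1]: pins remapped PGs back to their current OSDs via upmaps
./placementoptimizer.py balance --max-pg-moves 50 | tee moves   # [3]: then review and run:  bash moves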



From: Stefan Kooman 
Sent: Monday, August 29, 2022 1:38 AM
To: Wyll Ingersoll ; ceph-users@ceph.io 

Subject: Re: [ceph-users] OSDs growing beyond full ratio

On 8/28/22 17:30, Wyll Ingersoll wrote:
> We have a pacific cluster that is overly filled and is having major trouble 
> recovering.  We are desperate for help in improving recovery speed.  We have 
> modified all of the various recovery throttling parameters.
>
> The full_ratio is 0.95 but we have several osds that continue to grow and are 
> approaching 100% utilization.  They are reweighted to almost 0, but yet 
> continue to grow.
> Why is this happening?  I thought the cluster would stop writing to the osd 
> when it was at above the full ratio.
>
>
> We have added additional capacity to the cluster but the new OSDs are being 
> used very very slowly.  The primary pool in the cluster is the RGW data pool 
> which is a 12+4 EC pool using "host" placement rules across 18 hosts, 2 new 
> hosts with 20x10TB osds each were recently added but they are only very very 
> slowly being filled up.  I don't see how to force recovery on that particular 
> pool.   From what I understand, we cannot modify the EC parameters without 
> destroying the pool and we cannot offload that pool to any others because 
> there is no other place to store the amount of data.
>
>
> We have been running "ceph osd reweight-by-utilization"  periodically and it 
> works for a while (a few hours) but then recovery and backfill IO numbers 
> drop to negligible values.
>
> The balancer module will not run because the current misplaced % is about 97%.
>
> Would it be more effective to use the osdmaptool and generate a bunch of upmap 
> commands to manually move data around, or keep trying to get 
> reweight-by-utilization to work?

I would use the script: upmap-remapped.py [1] to get your cluster
healthy again, and after that pgremapper [2] to drain PGs from the full
OSDs. At a certain point (usage) you might want to let the Ceph balancer
do it's thing. But from experience I can tell that Jonas Jelten
ceph-balancer script is currently doing a way better job [3]. Search the
list for the use / usage of the scripts (or use a search engine). With
upmaps you have more control on where PGs should go. You might want to
skip step [2] and directly try ceph-balancer [3].

Gr. Stefan

[1]:
https://gitlab.cern.ch/ceph/ceph-scripts/blob/master/tools/upmap/upmap-remapped.py
[2]: https://github.com/digitalocean/pgremapper/
[3]: https://github.com/TheJJ/ceph-balancer
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSDs growing beyond full ratio

2022-08-29 Thread Wyll Ingersoll


I would think so, but it isn't happening nearly fast enough.

It's literally been over 10 days with 40 new drives across 2 new servers and 
they barely have any PGs yet. A few, but not nearly enough to help with the 
imbalance.

From: Jarett 
Sent: Sunday, August 28, 2022 8:19 PM
To: Wyll Ingersoll ; ceph-users@ceph.io 

Subject: RE: [ceph-users] OSDs growing beyond full ratio


Isn’t rebalancing onto the empty OSDs default behavior?



From: Wyll Ingersoll <wyllys.ingers...@keepertech.com>
Sent: Sunday, August 28, 2022 10:31 AM
To: ceph-users@ceph.io
Subject: [ceph-users] OSDs growing beyond full ratio



We have a pacific cluster that is overly filled and is having major trouble 
recovering.  We are desperate for help in improving recovery speed.  We have 
modified all of the various recovery throttling parameters.



The full_ratio is 0.95 but we have several osds that continue to grow and are 
approaching 100% utilization.  They are reweighted to almost 0, but yet 
continue to grow.

Why is this happening?  I thought the cluster would stop writing to the osd 
when it was at above the full ratio.





We have added additional capacity to the cluster but the new OSDs are being 
used very very slowly.  The primary pool in the cluster is the RGW data pool 
which is a 12+4 EC pool using "host" placement rules across 18 hosts, 2 new 
hosts with 20x10TB osds each were recently added but they are only very very 
slowly being filled up.  I don't see how to force recovery on that particular 
pool.   From what I understand, we cannot modify the EC parameters without 
destroying the pool and we cannot offload that pool to any others because there 
is no other place to store the amount of data.





We have been running "ceph osd reweight-by-utilization"  periodically and it 
works for a while (a few hours) but then recovery and backfill IO numbers drop 
to negligible values.



The balancer module will not run because the current misplaced % is about 97%.



Would it be more effective to use the osdmaptool and generate a bunch of upmap 
commands to manually move data around, or keep trying to get 
reweight-by-utilization to work?



Any suggestions (other than deleting data which we cannot do at this point, the 
pools are not accessible) or adding more storage (we already did and it is not 
being utilized very heavily yet for some reason).









___

ceph-users mailing list -- ceph-users@ceph.io

To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] OSDs growing beyond full ratio

2022-08-28 Thread Wyll Ingersoll
We have a pacific cluster that is overly filled and is having major trouble 
recovering.  We are desperate for help in improving recovery speed.  We have 
modified all of the various recovery throttling parameters.

The full_ratio is 0.95 but we have several osds that continue to grow and are 
approaching 100% utilization.  They are reweighted to almost 0, yet they
continue to grow.
Why is this happening?  I thought the cluster would stop writing to the osd
once it was above the full ratio.


We have added additional capacity to the cluster but the new OSDs are being 
used very very slowly.  The primary pool in the cluster is the RGW data pool 
which is a 12+4 EC pool using "host" placement rules across 18 hosts. Two new
hosts with 20x10TB osds each were recently added, but they are only very
slowly being filled up.  I don't see how to force recovery on that particular 
pool.   From what I understand, we cannot modify the EC parameters without 
destroying the pool and we cannot offload that pool to any others because there 
is no other place to store the amount of data.


We have been running "ceph osd reweight-by-utilization"  periodically and it 
works for a while (a few hours) but then recovery and backfill IO numbers drop 
to negligible values.

The balancer module will not run because the current misplaced % is about 97%.

Would it be more effective to use osdmaptool to generate a bunch of upmap
commands and manually move data around, or to keep trying to get
reweight-by-utilization to work?
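
If it helps, a minimal sketch of generating upmaps with osdmaptool (the pool
name and the limits here are illustrative, not taken from your cluster):

ceph osd getmap -o osd.map
osdmaptool osd.map --upmap upmaps.sh \
  --upmap-pool default.rgw.buckets.data --upmap-max 100 --upmap-deviation 1
# review the generated "ceph osd pg-upmap-items ..." lines before applying
bash upmaps.sh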

Any suggestions, other than deleting data (which we cannot do at this point;
the pools are not accessible) or adding more storage (we already did, and it is
not yet being utilized very heavily for some reason)?




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: backfillfull osd - but it is only at 68% capacity

2022-08-25 Thread Wyll Ingersoll


This was seen today in Pacific 16.2.9.


From: Stefan Kooman 
Sent: Thursday, August 25, 2022 3:17 PM
To: Eugen Block ; Wyll Ingersoll 

Cc: ceph-users@ceph.io 
Subject: Re: [ceph-users] Re: backfillfull osd - but it is only at 68% capacity

On 8/25/22 20:56, Eugen Block wrote:
> Hi,
>
> I’ve seen this many times in older clusters, mostly Nautilus (can’t say
> much about Octopus or later). Apparently the root cause hasn’t been
> fixed yet, but it should resolve after the recovery has finished.

What version of Ceph? I have seen this before a couple of times. It might be
this bug, which has since been fixed: https://tracker.ceph.com/issues/39555. I
have not seen it after that.

Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: backfillfull osd - but it is only at 68% capacity

2022-08-25 Thread Wyll Ingersoll


That problem seems to have cleared up.  We are in the middle of a massive 
rebalancing effort for a 700 OSD, 10PB cluster that is wildly out of whack 
(because it got too full), and we occasionally see strange numbers reported.



From: Eugen Block 
Sent: Thursday, August 25, 2022 2:56 PM
To: Wyll Ingersoll 
Cc: ceph-users@ceph.io 
Subject: Re: [ceph-users] backfillfull osd - but it is only at 68% capacity

Hi,

I’ve seen this many times in older clusters, mostly Nautilus (can’t
say much about Octopus or later). Apparently the root cause hasn’t
been fixed yet, but it should resolve after the recovery has finished.

Zitat von Wyll Ingersoll :

> My cluster (ceph pacific) is complaining about one of the OSD being
> backfillfull:
>
> [WRN] OSD_BACKFILLFULL: 1 backfillfull osd(s)
>
> osd.31 is backfill full
>
> backfillfull ratios:
>
> full_ratio 0.95
>
> backfillfull_ratio 0.9
>
> nearfull_ratio 0.85
>
> ceph osd df shows:
>
> ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
> 31  hdd    5.55899  1.0       5.6 TiB  3.8 TiB  3.7 TiB  411 MiB  6.7 GiB  1.8 TiB  68.13  0.92   83  up
>
> So, why does the cluster think that osd.31 is backfillfull if its
> only at 68% capacity?
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] backfillfull osd - but it is only at 68% capacity

2022-08-25 Thread Wyll Ingersoll


My cluster (ceph pacific) is complaining about one of the OSD being 
backfillfull:

[WRN] OSD_BACKFILLFULL: 1 backfillfull osd(s)

osd.31 is backfill full

backfillfull ratios:

full_ratio 0.95

backfillfull_ratio 0.9

nearfull_ratio 0.85

ceph osd df shows:

ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
31  hdd    5.55899  1.0       5.6 TiB  3.8 TiB  3.7 TiB  411 MiB  6.7 GiB  1.8 TiB  68.13  0.92   83  up

So, why does the cluster think that osd.31 is backfillfull if its only at 68% 
capacity?
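
For what it's worth, a few commands to cross-check what the cluster thinks the
ratios and per-OSD usage are (osd id taken from the output above; the ratio
value in the last line is only an example):

ceph osd dump | grep -E 'full_ratio|backfillfull_ratio|nearfull_ratio'
ceph osd df tree | grep -E 'ID|osd\.31'
ceph health detail | grep -i backfillfull
# if the flag is clearly stale, the threshold can be nudged temporarily
ceph osd set-backfillfull-ratio 0.91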


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] rgw.meta pool df reporting 16EiB

2022-08-23 Thread Wyll Ingersoll


We have a large Pacific cluster (680 osd,  ~9.6PB ) - primarily it is used as 
an RGW object store.  The default.rgw.meta pool is reporting strange numbers:

POOL              ID  PGS  STORED  OBJECTS  USED   %USED  MAX AVAIL
default.rgw.meta   4   32  16EiB   64       11MiB  100    0

Why would the "Stored" value show 16EiB (which is the maximum possible for 
ceph)?   These numbers seem wildly wrong. Is there a way to diagnose what's 
wrong in this pool and fix it without losing data?  The data pool has 3PB that 
we cannot lose.
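
A couple of hedged ways to cross-check what the pool actually holds (the greps
are deliberately crude):

rados df | grep default.rgw.meta
rados -p default.rgw.meta ls | wc -l      # compare with the OBJECTS column above
ceph df detail --format json-pretty | grep -A 20 '"default.rgw.meta"'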
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Full cluster, new OSDS not being used

2022-08-23 Thread Wyll Ingersoll
We did this, but oddly enough it is showing the movement of PGs away from the
new, underutilized OSDs instead of TO them, as we would expect.

From: Wesley Dillingham 
Sent: Tuesday, August 23, 2022 2:13 PM
To: Wyll Ingersoll 
Cc: ceph-users@ceph.io 
Subject: Re: [ceph-users] Full cluster, new OSDS not being used

https://docs.ceph.com/en/pacific/rados/operations/upmap/

Respectfully,

Wes Dillingham
w...@wesdillingham.com<mailto:w...@wesdillingham.com>
LinkedIn<http://www.linkedin.com/in/wesleydillingham>


On Tue, Aug 23, 2022 at 1:45 PM Wyll Ingersoll 
mailto:wyllys.ingers...@keepertech.com>> wrote:
Thank you - we have increased backfill settings, but can you elaborate on 
"injecting upmaps" ?

From: Wesley Dillingham mailto:w...@wesdillingham.com>>
Sent: Tuesday, August 23, 2022 1:44 PM
To: Wyll Ingersoll 
mailto:wyllys.ingers...@keepertech.com>>
Cc: ceph-users@ceph.io<mailto:ceph-users@ceph.io> 
mailto:ceph-users@ceph.io>>
Subject: Re: [ceph-users] Full cluster, new OSDS not being used

In that case I would say your options are to make use of injecting upmaps to 
move data off the full osds or to increase the backfill throttle settings to 
make things move faster.
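
A minimal sketch of what injecting upmaps looks like in practice (the pg id and
osd numbers are made up; pick PGs that currently sit on the full OSDs and a
destination OSD with free space, and note that upmap needs
set-require-min-compat-client luminous or later):

# list PGs currently on a full OSD, e.g. osd.12
ceph pg ls-by-osd 12 | head
# pin a mapping so the shard/replica on osd.12 moves to osd.140
ceph osd pg-upmap-items 11.1f 12 140
# the pin can be removed later
ceph osd rm-pg-upmap-items 11.1f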

Respectfully,

Wes Dillingham
w...@wesdillingham.com<mailto:w...@wesdillingham.com>
LinkedIn<http://www.linkedin.com/in/wesleydillingham>


On Tue, Aug 23, 2022 at 1:28 PM Wyll Ingersoll 
mailto:wyllys.ingers...@keepertech.com>> wrote:
Unfortunately, I cannot. The system in question is in a secure location and I 
don't have direct access to it.  The person on site runs the commands I send 
them, and the osd tree is correct as far as we can tell. The new hosts and osds 
are in the right place in the tree and have proper weights.  One small 
difference is that the new osds have a class ("hdd"), whereas MOST of the 
pre-existing osds do not have a class designation; this is a cluster that has 
grown and been upgraded over several releases of ceph. Currently it is running 
pacific 16.2.9.  However, removing the class designation on one of the new osds 
did not make any difference, so I don't think that is the issue.

The cluster is slowly recovering, but our new OSDs are very lightly used at 
this point, only a few PGs have been assigned to them, though more than zero 
and the number does appear to be slowly (very slowly) growing so recovery is 
happening but very very slowly.





From: Wesley Dillingham mailto:w...@wesdillingham.com>>
Sent: Tuesday, August 23, 2022 1:18 PM
To: Wyll Ingersoll 
mailto:wyllys.ingers...@keepertech.com>>
Cc: ceph-users@ceph.io<mailto:ceph-users@ceph.io> 
mailto:ceph-users@ceph.io>>
Subject: Re: [ceph-users] Full cluster, new OSDS not being used

Can you please send the output of "ceph osd tree"

Respectfully,

Wes Dillingham
w...@wesdillingham.com<mailto:w...@wesdillingham.com>
LinkedIn<http://www.linkedin.com/in/wesleydillingham>


On Tue, Aug 23, 2022 at 10:53 AM Wyll Ingersoll 
mailto:wyllys.ingers...@keepertech.com>> wrote:

We have a large cluster with many osds that are at their nearfull or full 
ratio limit and are thus having problems rebalancing.
We added 2 more storage nodes, each with 20 additional drives, to give the 
cluster room to rebalance.  However, for the past few days, the new OSDs are 
NOT being used and the cluster remains stuck and is not improving.

The crush map is correct, the new hosts and osds are at the correct location, 
but don't seem to be getting used.

Any idea how we can force the full or backfillfull OSDs to start unloading 
their pgs to the newly added ones?

thanks!
___
ceph-users mailing list -- ceph-users@ceph.io<mailto:ceph-users@ceph.io>
To unsubscribe send an email to 
ceph-users-le...@ceph.io<mailto:ceph-users-le...@ceph.io>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Full cluster, new OSDS not being used

2022-08-23 Thread Wyll Ingersoll
Thank you - we have increased backfill settings, but can you elaborate on 
"injecting upmaps" ?

From: Wesley Dillingham 
Sent: Tuesday, August 23, 2022 1:44 PM
To: Wyll Ingersoll 
Cc: ceph-users@ceph.io 
Subject: Re: [ceph-users] Full cluster, new OSDS not being used

In that case I would say your options are to make use of injecting upmaps to 
move data off the full osds or to increase the backfill throttle settings to 
make things move faster.

Respectfully,

Wes Dillingham
w...@wesdillingham.com<mailto:w...@wesdillingham.com>
LinkedIn<http://www.linkedin.com/in/wesleydillingham>


On Tue, Aug 23, 2022 at 1:28 PM Wyll Ingersoll 
mailto:wyllys.ingers...@keepertech.com>> wrote:
Unfortunately, I cannot. The system in question is in a secure location and I 
don't have direct access to it.  The person on site runs the commands I send 
them, and the osd tree is correct as far as we can tell. The new hosts and osds 
are in the right place in the tree and have proper weights.  One small 
difference is that the new osds have a class ("hdd"), whereas MOST of the 
pre-existing osds do not have a class designation; this is a cluster that has 
grown and been upgraded over several releases of ceph. Currently it is running 
pacific 16.2.9.  However, removing the class designation on one of the new osds 
did not make any difference, so I don't think that is the issue.

The cluster is slowly recovering, but our new OSDs are very lightly used at 
this point, only a few PGs have been assigned to them, though more than zero 
and the number does appear to be slowly (very slowly) growing so recovery is 
happening but very very slowly.





From: Wesley Dillingham mailto:w...@wesdillingham.com>>
Sent: Tuesday, August 23, 2022 1:18 PM
To: Wyll Ingersoll 
mailto:wyllys.ingers...@keepertech.com>>
Cc: ceph-users@ceph.io<mailto:ceph-users@ceph.io> 
mailto:ceph-users@ceph.io>>
Subject: Re: [ceph-users] Full cluster, new OSDS not being used

Can you please send the output of "ceph osd tree"

Respectfully,

Wes Dillingham
w...@wesdillingham.com<mailto:w...@wesdillingham.com>
LinkedIn<http://www.linkedin.com/in/wesleydillingham>


On Tue, Aug 23, 2022 at 10:53 AM Wyll Ingersoll 
mailto:wyllys.ingers...@keepertech.com>> wrote:

We have a large cluster with many osds that are at their nearfull or full 
ratio limit and are thus having problems rebalancing.
We added 2 more storage nodes, each with 20 additional drives, to give the 
cluster room to rebalance.  However, for the past few days, the new OSDs are 
NOT being used and the cluster remains stuck and is not improving.

The crush map is correct, the new hosts and osds are at the correct location, 
but don't seem to be getting used.

Any idea how we can force the full or backfillfull OSDs to start unloading 
their pgs to the newly added ones?

thanks!
___
ceph-users mailing list -- ceph-users@ceph.io<mailto:ceph-users@ceph.io>
To unsubscribe send an email to 
ceph-users-le...@ceph.io<mailto:ceph-users-le...@ceph.io>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Full cluster, new OSDS not being used

2022-08-23 Thread Wyll Ingersoll
Unfortunately, I cannot. The system in question is in a secure location and I 
don't have direct access to it.  The person on site runs the commands I send 
them, and the osd tree is correct as far as we can tell. The new hosts and osds 
are in the right place in the tree and have proper weights.  One small 
difference is that the new osds have a class ("hdd"), whereas MOST of the 
pre-existing osds do not have a class designation; this is a cluster that has 
grown and been upgraded over several releases of ceph. Currently it is running 
pacific 16.2.9.  However, removing the class designation on one of the new osds 
did not make any difference, so I don't think that is the issue.

The cluster is slowly recovering, but our new OSDs are very lightly used at 
this point, only a few PGs have been assigned to them, though more than zero 
and the number does appear to be slowly (very slowly) growing so recovery is 
happening but very very slowly.





From: Wesley Dillingham 
Sent: Tuesday, August 23, 2022 1:18 PM
To: Wyll Ingersoll 
Cc: ceph-users@ceph.io 
Subject: Re: [ceph-users] Full cluster, new OSDS not being used

Can you please send the output of "ceph osd tree"

Respectfully,

Wes Dillingham
w...@wesdillingham.com<mailto:w...@wesdillingham.com>
LinkedIn<http://www.linkedin.com/in/wesleydillingham>


On Tue, Aug 23, 2022 at 10:53 AM Wyll Ingersoll 
mailto:wyllys.ingers...@keepertech.com>> wrote:

We have a large cluster with many osds that are at their nearfull or full 
ratio limit and are thus having problems rebalancing.
We added 2 more storage nodes, each with 20 additional drives, to give the 
cluster room to rebalance.  However, for the past few days, the new OSDs are 
NOT being used and the cluster remains stuck and is not improving.

The crush map is correct, the new hosts and osds are at the correct location, 
but don't seem to be getting used.

Any idea how we can force the full or backfillfull OSDs to start unloading 
their pgs to the newly added ones?

thanks!
___
ceph-users mailing list -- ceph-users@ceph.io<mailto:ceph-users@ceph.io>
To unsubscribe send an email to 
ceph-users-le...@ceph.io<mailto:ceph-users-le...@ceph.io>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Full cluster, new OSDS not being used

2022-08-23 Thread Wyll Ingersoll


We have a large cluster with many osds that are at their nearfull or full 
ratio limit and are thus having problems rebalancing.
We added 2 more storage nodes, each with 20 additional drives, to give the 
cluster room to rebalance.  However, for the past few days, the new OSDs are 
NOT being used and the cluster remains stuck and is not improving.

The crush map is correct, the new hosts and osds are at the correct location, 
but don't seem to be getting used.

Any idea how we can force the full or backfillfull OSDs to start unloading 
their pgs to the newly added ones?

thanks!
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] ceph nfs-ganesha - Unable to mount Ceph cluster

2022-06-29 Thread Wyll Ingersoll


[ceph pacific 16.2.9]

When creating an NFS export using "ceph nfs export apply ... -i export.json" for 
a subdirectory of /cephfs, does the subdir that you wish to export need to be 
pre-created or will ceph (or ganesha) create it for you?

I'm trying to create a "/shared" directory in a cephfs tree and export it 
using a JSON spec file, but the nfs-ganesha log file shows errors because it 
cannot mount or create the desired directory in cephfs.  If I manually create 
the directory prior to applying the export spec, it does work.  But it seems 
that ganesha is trying to create it for me so I'm wondering how to make that 
work.




29/06/2022 16:14:54 : epoch 62bc67c3 : foobar : ganesha.nfsd-6[sigmgr] 
create_export :FSAL :CRIT :Unable to mount Ceph cluster for /shared.

29/06/2022 16:14:54 : epoch 62bc67c3 : foobar : ganesha.nfsd-6[sigmgr] 
mdcache_fsal_create_export :FSAL :MAJ :Failed to call create_export on 
underlying FSAL Ceph

29/06/2022 16:14:54 : epoch 62bc67c3 : foobar : ganesha.nfsd-6[sigmgr] 
fsal_cfg_commit :CONFIG :CRIT :Could not create export for (/shared) to 
(/shared)

The JSON spec used looks like:


{

  "export_id": 2,

  "transports": [ "TCP" ],

  "cluster_id": "ceph",

  "path": "/shared",

  "pseudo": "/shared",

  "protocols": [4],

  "access_type": "RW",

  "squash": "no_root_squash",

  "fsal": {

"name":  "CEPH",

"user_id": "nfs.ceph.2",

"fs_name": "cephfs"

  }

}
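
For reference, a minimal sketch of the manual pre-creation that does work here,
assuming the admin keyring is available to mount.ceph on the host, and using an
illustrative mon address and mount point:

sudo mkdir -p /mnt/cephfs
sudo mount -t ceph mon1:6789:/ /mnt/cephfs -o name=admin
sudo mkdir -p /mnt/cephfs/shared
sudo umount /mnt/cephfs
ceph nfs export apply ceph -i export.json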


thanks,
   Wyllys Ingersoll

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] calling ceph command from a crush_location_hook - fails to find sys.stdin.isatty()

2022-06-27 Thread Wyll Ingersoll


[ceph pacific 16.2.9]

I have a crush_location_hook script which is a small python3 script that 
figures out the correct root/chassis/host location for a particular OSD.  Our 
map has 2 roots, one for all-SSD devices and another for HDDs, hence the need 
for the location hook. Without it, the SSD devices end up in the wrong crush 
location.  Prior to the 16.2.9 release we weren't using the hook because of a 
bug that caused the OSDs to crash with it.  Now that we've upgraded to 16.2.9 
we want to use our location hook script again, but it fails in a 
different way.

The script works correctly when testing it standalone with the right 
parameters, but when it is called by the OSD process it fails: when the 
ceph command references 'sys.stdin.isatty()' (at line 538 in /usr/bin/ceph), the 
call breaks because sys.stdin is None.  I suspect this is because of how 
the OSD spawns the crush hook script, which then forks the ceph command.  
Somehow python (3.8) is not initializing the stdin, stdout, stderr members in 
the 'sys' module object.

Looking for guidance on how to get my location hook script to successfully use 
the "ceph" command to get the output of "ceph osd tree --format json"

thanks,
   Wyllys Ingersoll


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] runaway mon DB

2022-06-27 Thread Wyll Ingersoll

Running Ceph Pacific 16.2.7

We have a very large cluster with 3 monitors.  One of the monitor DBs is > 2x 
the size of the other 2 and is growing constantly (store.db fills up) and 
eventually fills up the /var partition on that server.  The monitor in question 
is not the leader.  The cluster itself is quite full, but currently we cannot 
remove any data due to its current mission requirements, so it is constantly 
in a state of rebalance and bumping up against the "toofull" limits.

How can we keep the monitor DB from growing so fast?
Why is it only on a secondary monitor not the primary?
Can we force a monitor to compact its DB while the system is actively 
repairing?
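
A couple of knobs that may be relevant here (the monitor name is illustrative):

# ask the problem mon to compact its store while running
ceph tell mon.mon2 compact
# or have it compact automatically on startup
ceph config set mon mon_compact_on_start true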

Thanks,
  Wyllys Ingersoll

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephfs client permission restrictions?

2022-06-23 Thread Wyll Ingersoll
Thanks for the explanation, that's what I suspected but needed the confirmation.

From: Gregory Farnum 
Sent: Thursday, June 23, 2022 11:22 AM
To: Wyll Ingersoll 
Cc: ceph-users@ceph.io 
Subject: Re: [ceph-users] cephfs client permission restrictions?

On Thu, Jun 23, 2022 at 8:18 AM Wyll Ingersoll
 wrote:
>
> Is it possible to craft a cephfs client authorization key that will allow the 
> client read/write access to a path within the FS, but NOT allow the client to 
> modify the permissions of that path?
> For example, allow RW access to /cephfs/foo (path=/foo) but prevent the 
> client from modifying permissions on /foo.

Cephx won't do this on its own: it enforces subtree-based access and
can restrict clients to acting as a specific (set of) uid/gids, but it
doesn't add extra stuff on top of that. (Modifying permissions is, you
know, a write.)

This is part of the standard Linux security model though, right? So
you can make somebody else the owner and give your restricted user
access via a group.
-Greg
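
A rough sketch of the group-ownership approach Greg describes (paths, names,
and ids are illustrative; the uid/gids part of the mds cap follows the CephFS
client-capabilities syntax and is worth double-checking for your release):

# on an admin mount: root owns the directory, writers get access via group bits
chown root:appwriters /mnt/cephfs/foo
chmod 2770 /mnt/cephfs/foo
# key restricted to the /foo subtree and to acting as uid 1001 / gid 2001,
# so the client cannot simply chown/chmod its way around the ownership above
ceph auth get-or-create client.app \
  mon 'allow r' \
  osd 'allow rw tag cephfs data=cephfs' \
  mds 'allow rw path=/foo uid=1001 gids=2001'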

>
> thanks,
>   Wyllys Ingersoll
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] cephfs client permission restrictions?

2022-06-23 Thread Wyll Ingersoll
Is it possible to craft a cephfs client authorization key that will allow the 
client read/write access to a path within the FS, but NOT allow the client to 
modify the permissions of that path?
For example, allow RW access to /cephfs/foo (path=/foo) but prevent the client 
from modifying permissions on /foo.

thanks,
  Wyllys Ingersoll

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: radosgw multisite sync - how to fix data behind shards?

2022-06-09 Thread Wyll Ingersoll

Running "object rewrite" on a couple of the objects in the bucket seems to have 
triggered the sync and now things appear ok.
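
For the record, a sketch of the rewrite commands in question (bucket and object
names are illustrative):

radosgw-admin object rewrite --bucket=bucket-1 --object=some-key
# or rewrite every object in the bucket
radosgw-admin bucket rewrite --bucket=bucket-1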


From: Szabo, Istvan (Agoda) 
Sent: Thursday, June 9, 2022 3:24 PM
To: Wyll Ingersoll 
Cc: ceph-users@ceph.io ; d...@ceph.io 
Subject: Re: [ceph-users] Re: radosgw multisite sync - how to fix data behind 
shards?

Try data sync init and restart the gateways; sometimes this has helped me.

If this doesn't help, turn the sync policy on the bucket off and on.

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com<mailto:istvan.sz...@agoda.com>
---

On 2022. Jun 9., at 20:48, Wyll Ingersoll  
wrote:

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


I ended up giving up after trying everything I could find in the forums and 
docs, deleted the problematic zone, and then re-added it back to the zonegroup 
and re-established the group sync policy for the bucket in question.  The 
sync-status is OK now, though the error list still shows a bunch of errors from 
yesterday that I cannot figure out how to clear ("sync error trim" doesn't do 
anything that I can tell).

My opinion is that multisite sync policy in the current Pacific release 
(16.2.9) is still very fragile and poorly documented as far as troubleshooting 
goes.  I'd love to see clear explanations of the various data and metadata 
operations - metadata, data, bucket, bilog, datalog.  It's hard to know where 
to start when things get into a bad state and the online resources are not 
helpful enough.

Another question: if a sync policy is defined on a bucket that already has some 
objects in it, what command should be used to force a sync operation based on 
the new policy? It seems that only objects added AFTER the policy is applied 
get replicated; pre-existing ones are not.


____
From: Wyll Ingersoll 
Sent: Thursday, June 9, 2022 9:35 AM
To: Amit Ghadge ; ceph-users@ceph.io ; 
d...@ceph.io 
Subject: [ceph-users] Re: radosgw multisite sync - how to fix data behind 
shards?

I think you mean "radosgw-admin sync error list", in which case there are 32 
shards, each with the same error.  I don't see errors in the master zone logs, 
so I'm not sure how to correct the situation.


   "shard_id": 31,
   "entries": [
   {
   "id": "1_1654722349.230688_62850.1",
   "section": "data",
   "name": 
"zone-1:a6ed5947-0ceb-407b-812f-347fab2ef62d.677322760.1:6",
   "timestamp": "2022-06-08T21:05:49.230688Z",
   "info": {
   "source_zone": "a6ed5947-0ceb-407b-812f-347fab2ef62d",
   "error_code": 125,
   "message": "failed to sync bucket instance: (125) Operation 
canceled"
   }
   }
   ]
   }





From: Amit Ghadge 
Sent: Wednesday, June 8, 2022 9:16 PM
To: Wyll Ingersoll 
Subject: Re: radosgw multisite sync - how to fix data behind shards?

check any error by running command radosgw-admin data sync error list


-AmitG


On Wed, Jun 8, 2022 at 2:44 PM Wyll Ingersoll 
mailto:wyllys.ingers...@keepertech.com>> wrote:

Seeking help from a radosgw expert...

I have a 3-zone multisite configuration (all running pacific 16.2.9) with 1 
bucket per zone and a couple of small objects in each bucket for testing 
purposes.
One of the secondary zones cannot seem to get into sync with the master; 
sync status reports:


 metadata sync syncing
   full sync: 0/64 shards
   incremental sync: 64/64 shards
   metadata is caught up with master
 data sync source: a6ed5947-0ceb-407b-812f-347fab2ef62d (zone-1)
   syncing
   full sync: 128/128 shards
   full sync: 66 buckets to sync
   incremental sync: 0/128 shards
   data is behind on 128 shards
   behind shards: 
[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127]


I have tried using "data sync init" and restarting the radosgw multiple times, 
but that does not seem to be helping in any way.

If I manually do "radosgw-admin data sync run --bucket b

[ceph-users] Re: radosgw multisite sync - how to fix data behind shards?

2022-06-09 Thread Wyll Ingersoll


I ended up giving up after trying everything I could find in the forums and 
docs, deleted the problematic zone, and then re-added it back to the zonegroup 
and re-established the group sync policy for the bucket in question.  The 
sync-status is OK now, though the error list still shows a bunch of errors from 
yesterday that I cannot figure out how to clear ("sync error trim" doesn't do 
anything that I can tell).

My opinion is that multisite sync policy in the current Pacific release 
(16.2.9) is still very fragile and poorly documented as far as troubleshooting 
goes.  I'd love to see clear explanations of the various data and metadata 
operations - metadata, data, bucket, bilog, datalog.  It's hard to know where 
to start when things get into a bad state and the online resources are not 
helpful enough.

Another question: if a sync policy is defined on a bucket that already has some 
objects in it, what command should be used to force a sync operation based on 
the new policy? It seems that only objects added AFTER the policy is applied 
get replicated; pre-existing ones are not.


____
From: Wyll Ingersoll 
Sent: Thursday, June 9, 2022 9:35 AM
To: Amit Ghadge ; ceph-users@ceph.io ; 
d...@ceph.io 
Subject: [ceph-users] Re: radosgw multisite sync - how to fix data behind 
shards?

I think you mean "radosgw-admin sync error list", in which case there are 32 
shards, each with the same error.  I don't see errors in the master zone logs, 
so I'm not sure how to correct the situation.


"shard_id": 31,
"entries": [
{
"id": "1_1654722349.230688_62850.1",
"section": "data",
"name": 
"zone-1:a6ed5947-0ceb-407b-812f-347fab2ef62d.677322760.1:6",
"timestamp": "2022-06-08T21:05:49.230688Z",
"info": {
"source_zone": "a6ed5947-0ceb-407b-812f-347fab2ef62d",
"error_code": 125,
"message": "failed to sync bucket instance: (125) Operation 
canceled"
}
}
]
}





From: Amit Ghadge 
Sent: Wednesday, June 8, 2022 9:16 PM
To: Wyll Ingersoll 
Subject: Re: radosgw multisite sync - how to fix data behind shards?

check any error by running command radosgw-admin data sync error list


-AmitG


On Wed, Jun 8, 2022 at 2:44 PM Wyll Ingersoll 
mailto:wyllys.ingers...@keepertech.com>> wrote:

Seeking help from a radosgw expert...

I have a 3-zone multisite configuration (all running pacific 16.2.9) with 1 
bucket per zone and a couple of small objects in each bucket for testing 
purposes.
One of the secondary zones cannot seem to get into sync with the master; 
sync status reports:


  metadata sync syncing
full sync: 0/64 shards
incremental sync: 64/64 shards
metadata is caught up with master
  data sync source: a6ed5947-0ceb-407b-812f-347fab2ef62d (zone-1)
syncing
full sync: 128/128 shards
full sync: 66 buckets to sync
incremental sync: 0/128 shards
data is behind on 128 shards
behind shards: 
[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127]


I have tried using "data sync init" and restarting the radosgw multiple times, 
but that does not seem to be helping in any way.

If I manually do "radosgw-admin data sync run --bucket bucket-1" - it just 
hangs forever and doesn't appear to do anything.  Checking the sync status 
never shows any improvement in the shards.

It is very hard to figure out what to do as there are several sync commands - 
bucket sync, data sync, metadata sync - and it is not clear what effect they 
have or how to properly run them when the syncing gets confused.

Any guidance on how to get out of this situation would be greatly appreciated.  
I've read lots of threads on various mailing list archives (via google search) 
and very few of them have any sort of resolution or recommendation that is 
confirmed to have fixed these sort of problems.


___
Dev mailing list -- d...@ceph.io<mailto:d...@ceph.io>
To unsubscribe send an email to dev-le...@ceph.io<mailto:dev-le...@ceph.io>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Re: radosgw multisite sync - how to fix data behind shards?

2022-06-09 Thread Wyll Ingersoll
I think you mean "radosgw-admin sync error list", in which case there are 32 
shards, each with the same error.  I don't see errors in the master zone logs, 
so I'm not sure how to correct the situation.


"shard_id": 31,
"entries": [
{
"id": "1_1654722349.230688_62850.1",
"section": "data",
"name": 
"zone-1:a6ed5947-0ceb-407b-812f-347fab2ef62d.677322760.1:6",
"timestamp": "2022-06-08T21:05:49.230688Z",
"info": {
"source_zone": "a6ed5947-0ceb-407b-812f-347fab2ef62d",
"error_code": 125,
"message": "failed to sync bucket instance: (125) Operation 
canceled"
}
}
]
}





From: Amit Ghadge 
Sent: Wednesday, June 8, 2022 9:16 PM
To: Wyll Ingersoll 
Subject: Re: radosgw multisite sync - how to fix data behind shards?

check any error by running command radosgw-admin data sync error list


-AmitG


On Wed, Jun 8, 2022 at 2:44 PM Wyll Ingersoll 
mailto:wyllys.ingers...@keepertech.com>> wrote:

Seeking help from a radosgw expert...

I have a 3-zone multisite configuration (all running pacific 16.2.9) with 1 
bucket per zone and a couple of small objects in each bucket for testing 
purposes.
One of the secondary zones cannot seem to get into sync with the master; 
sync status reports:


  metadata sync syncing
full sync: 0/64 shards
incremental sync: 64/64 shards
metadata is caught up with master
  data sync source: a6ed5947-0ceb-407b-812f-347fab2ef62d (zone-1)
syncing
full sync: 128/128 shards
full sync: 66 buckets to sync
incremental sync: 0/128 shards
data is behind on 128 shards
behind shards: 
[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127]


I have tried using "data sync init" and restarting the radosgw multiple times, 
but that does not seem to be helping in any way.

If I manually do "radosgw-admin data sync run --bucket bucket-1" - it just 
hangs forever and doesn't appear to do anything.  Checking the sync status 
never shows any improvement in the shards.

It is very hard to figure out what to do as there are several sync commands - 
bucket sync, data sync, metadata sync - and it is not clear what effect they 
have or how to properly run them when the syncing gets confused.

Any guidance on how to get out of this situation would be greatly appreciated.  
I've read lots of threads on various mailing list archives (via google search) 
and very few of them have any sort of resolution or recommendation that is 
confirmed to have fixed these sort of problems.


___
Dev mailing list -- d...@ceph.io<mailto:d...@ceph.io>
To unsubscribe send an email to dev-le...@ceph.io<mailto:dev-le...@ceph.io>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] radosgw multisite sync - how to fix data behind shards?

2022-06-08 Thread Wyll Ingersoll


Seeking help from a radosgw expert...

I have a 3-zone multisite configuration (all running pacific 16.2.9) with 1 
bucket per zone and a couple of small objects in each bucket for testing 
purposes.
One of the secondary zones cannot seem to get into sync with the master; 
sync status reports:


  metadata sync syncing
full sync: 0/64 shards
incremental sync: 64/64 shards
metadata is caught up with master
  data sync source: a6ed5947-0ceb-407b-812f-347fab2ef62d (zone-1)
syncing
full sync: 128/128 shards
full sync: 66 buckets to sync
incremental sync: 0/128 shards
data is behind on 128 shards
behind shards: 
[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127]


I have tried using "data sync init" and restarting the radosgw multiple times, 
but that does not seem to be helping in any way.

If I manually do "radosgw-admin data sync run --bucket bucket-1" - it just 
hangs forever and doesn't appear to do anything.  Checking the sync status 
never shows any improvement in the shards.

It is very hard to figure out what to do as there are several sync commands - 
bucket sync, data sync, metadata sync - and it is not clear what effect they 
have or how to properly run them when the syncing gets confused.
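
For anyone following along, these are the distinct status/inspection commands
that at least narrow down where things are stuck (zone and bucket names are
illustrative):

radosgw-admin sync status                              # overall per-zone view
radosgw-admin metadata sync status
radosgw-admin data sync status --source-zone=zone-1
radosgw-admin bucket sync status --bucket=bucket-1
radosgw-admin sync error list                          # per-shard error entries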

Any guidance on how to get out of this situation would be greatly appreciated.  
I've read lots of threads on various mailing list archives (via google search) 
and very few of them have any sort of resolution or recommendation that is 
confirmed to have fixed these sort of problems.


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: radosgw multisite sync /admin/log requests overloading system.

2022-06-03 Thread Wyll Ingersoll


Put another way - is there a way to throttle the metadata sync requests in a 
multisite cluster? They seem to be overwhelming the master zone rgw server, it 
constantly runs at 40%+ CPU and watching the logs it just appears to be a 
steady stream of /admin/log?type=metadata requests from the other zones.   Is 
this normal behavior?



From: Wyll Ingersoll 
Sent: Wednesday, June 1, 2022 11:57 AM
To: d...@ceph.io 
Subject: radosgw multisite sync /admin/log requests overloading system.


I have a simple multisite radosgw configuration setup for testing. There is 1 
realm, 1 zonegroup, and 2 separate clusters each with its own zone.  There is 1 
bucket with 1 object in it and no updates currently happening.  There is no 
group sync policy currently defined.

The problem I see is that the radosgw on the secondary zone is flooding the 
master zone with requests for /admin/log. The radosgw on the secondary is 
consuming roughly 50% of the CPU cycles. The master zone radosgw is equally 
active and is flooding the logs (at 1/5 level) with entries like this:

2022-06-01T11:45:06.719-0400 7ff415f8b700  1 == req done req=0x7ff5e02ed680 
op status=0 http_status=200 latency=0.00440s ==
2022-06-01T11:45:06.719-0400 7ff415f8b700  1 beast: 0x7ff5e02ed680: 10.15.1.40 
- syncuser [01/Jun/2022:11:45:06.715 -0400] "GET 
/admin/log?type=metadata=4=92e4fbd8-3429-4cc6-a9f4-6f756ba0c592=100&=3bc6efd6-a780-4cd1-9685-376e8b477756
 HTTP/1.1" 200 44 - - - latency=0.00440s
2022-06-01T11:45:06.719-0400 7ff446fed700  1 == req done req=0x7ff5e0572680 
op status=0 http_status=200 latency=0.00440s ==
2022-06-01T11:45:06.719-0400 7ff446fed700  1 beast: 0x7ff5e0572680: 10.15.1.40 
- syncuser [01/Jun/2022:11:45:06.715 -0400] "GET 
/admin/log?type=metadata=5=92e4fbd8-3429-4cc6-a9f4-6f756ba0c592=100&=3bc6efd6-a780-4cd1-9685-376e8b477756
 HTTP/1.1" 200 44 - - - latency=0.00440s


What is going on and how do I fix this?  The period on both zones is current 
and at the same epoch value.
Any ideas/suggestions?

thanks,
   Wyllys Ingersoll

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] radosgw multisite sync /admin/log requests overloading system.

2022-06-01 Thread Wyll Ingersoll
I have a simple multisite radosgw configuration setup for testing. There is 1 
realm, 1 zonegroup, and 2 separate clusters each with its own zone.  There is 1 
bucket with 1 object in it and no updates currently happening.  There is no 
group sync policy currently defined.

The problem I see is that the radosgw on the secondary zone is flooding the 
master zone with requests for /admin/log. The radosgw on the secondary is 
consuming roughly 50% of the CPU cycles. The master zone radosgw is equally 
active and is flooding the logs (at 1/5 level) with entries like this:

2022-06-01T11:45:06.719-0400 7ff415f8b700  1 == req done req=0x7ff5e02ed680 
op status=0 http_status=200 latency=0.00440s ==
2022-06-01T11:45:06.719-0400 7ff415f8b700  1 beast: 0x7ff5e02ed680: 10.15.1.40 
- syncuser [01/Jun/2022:11:45:06.715 -0400] "GET 
/admin/log?type=metadata=4=92e4fbd8-3429-4cc6-a9f4-6f756ba0c592=100&=3bc6efd6-a780-4cd1-9685-376e8b477756
 HTTP/1.1" 200 44 - - - latency=0.00440s
2022-06-01T11:45:06.719-0400 7ff446fed700  1 == req done req=0x7ff5e0572680 
op status=0 http_status=200 latency=0.00440s ==
2022-06-01T11:45:06.719-0400 7ff446fed700  1 beast: 0x7ff5e0572680: 10.15.1.40 
- syncuser [01/Jun/2022:11:45:06.715 -0400] "GET 
/admin/log?type=metadata=5=92e4fbd8-3429-4cc6-a9f4-6f756ba0c592=100&=3bc6efd6-a780-4cd1-9685-376e8b477756
 HTTP/1.1" 200 44 - - - latency=0.00440s


What is going on and how do I fix this?  The period on both zones is current 
and at the same epoch value.
Any ideas/suggestions?

thanks,
   Wyllys Ingersoll

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Adding 2nd RGW zone using cephadm - fail.

2022-05-31 Thread Wyll Ingersoll
Problem solved - 2 of the pools (zone-2.rgw.meta and zone-2.rgw.log) did not 
have the "rgw" application enabled.  Once that was fixed, it started working.


____
From: Wyll Ingersoll 
Sent: Tuesday, May 31, 2022 3:51 PM
To: ceph-users@ceph.io 
Subject: [ceph-users] Adding 2nd RGW zone using cephadm - fail.

I'm having trouble adding a secondary zone RGW using cephadm, running with ceph 
16.2.9.

The master realm, zonegroup, and zone are already configured and working on 
another cluster.
This is a new cluster configured with cephadm, everything is up and running but 
when I try to add an RGW and create a 2nd zone, the RGW always fails (errors 
below).

Prior to creating the container, I created the RGW pools for the new zone 
(zone-2.rgw.) and  pulled over the realm using "radosgw-admin realm pull ..." 
and that all works. I can see the zonegroup configuration.

Then I create the new zone specifying the correct realm, zonegroup, and the new 
zone name (zone-2) and get no errors, and I can view the new zonegroup 
configuration with radosgw-admin.  Then I update the period for the realm, 
successfully.

I then populate a spec file for creating the RGW that looks like:

---

service_type: rgw

service_id: zone-2

placement:

  hosts:

  - k4

spec:

  rgw_frontend_type: beast

  rgw_frontend_port: 4443

  rgw_realm: testrealm

  rgw_zone: zone-2

  ssl: true


The SSL certificate and key files are setup in the ceph config database and are 
ok.

When the orchestrator starts the new container with the RGW, it results in an 
"error" state.  Viewing the log I see the following errors:


May 31 15:36:23 k4 bash[3525798]: debug 2022-05-31T19:36:23.316+ 
7ff87ddbc5c0  0 rgw main: ERROR: failed reading data 
(obj=zone-2.rgw.log:bucket.sync-source-hints.), r=-1

May 31 15:36:23 k4 bash[3525798]: debug 2022-05-31T19:36:23.316+ 
7ff87ddbc5c0  0 rgw main: ERROR: failed to update sources index for bucket=:[]) 
r=-1

May 31 15:36:23 k4 bash[3525798]: debug 2022-05-31T19:36:23.316+ 
7ff87ddbc5c0  0 rgw main: ERROR: failed to initialize bucket sync policy 
handler: get_bucket_sync_hints() on bucket=-- returned r=-1

May 31 15:36:23 k4 bash[3525798]: debug 2022-05-31T19:36:23.316+ 
7ff87ddbc5c0 -1 rgw main: ERROR: could not initialize zone policy handler for 
zone=zone-1

May 31 15:36:23 k4 bash[3525798]: debug 2022-05-31T19:36:23.316+ 
7ff87ddbc5c0  0 rgw main: ERROR: failed to start notify service ((1) Operation 
not permitted

May 31 15:36:23 k4 bash[3525798]: debug 2022-05-31T19:36:23.316+ 
7ff87ddbc5c0  0 rgw main: ERROR: failed to init services (ret=(1) Operation not 
permitted)

May 31 15:36:23 k4 bash[3525798]: debug 2022-05-31T19:36:23.320+ 
7ff87ddbc5c0 -1 Couldn't init storage provider (RADOS)

May 31 15:36:24 k4 systemd[1]: 
ceph-a4cfb4a0-cbd6-11ec-b27f-c3fe3a7bc84c@rgw.keeper-2.k4.jugmpm.service: Main 
process exited, code=exited, status=5/NOTINSTALLED

May 31 15:36:24 k4 systemd[1]: 
ceph-a4cfb4a0-cbd6-11ec-b27f-c3fe3a7bc...@rgw.zone-2.k4.jugmpm.service: Failed 
with result 'exit-code'.
​



I don't know what step I am missing. I've setup secondary zones manually before 
with no problems, so I wonder if there's a bug in the orchestrator when it 
configures rgw or if I'm just missing a parameter somewhere.

thanks,
  Wyllys Ingersoll


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Adding 2nd RGW zone using cephadm - fail.

2022-05-31 Thread Wyll Ingersoll
I'm having trouble adding a secondary zone RGW using cephadm, running with ceph 
16.2.9.

The master realm, zonegroup, and zone are already configured and working on 
another cluster.
This is a new cluster configured with cephadm, everything is up and running but 
when I try to add an RGW and create a 2nd zone, the RGW always fails (errors 
below).

Prior to creating the container, I created the RGW pools for the new zone 
(zone-2.rgw.) and  pulled over the realm using "radosgw-admin realm pull ..." 
and that all works. I can see the zonegroup configuration.

Then I create the new zone specifying the correct realm, zonegroup, and the new 
zone name (zone-2) and get no errors, and I can view the new zonegroup 
configuration with radosgw-admin.  Then I update the period for the realm, 
successfully.

I then populate a spec file for creating the RGW that looks like:

---

service_type: rgw

service_id: zone-2

placement:

  hosts:

  - k4

spec:

  rgw_frontend_type: beast

  rgw_frontend_port: 4443

  rgw_realm: testrealm

  rgw_zone: zone-2

  ssl: true


The SSL certificate and key files are setup in the ceph config database and are 
ok.

When the orchestrator starts the new container with the RGW, it results in an 
"error" state.  Viewing the log I see the following errors:


May 31 15:36:23 k4 bash[3525798]: debug 2022-05-31T19:36:23.316+ 
7ff87ddbc5c0  0 rgw main: ERROR: failed reading data 
(obj=zone-2.rgw.log:bucket.sync-source-hints.), r=-1

May 31 15:36:23 k4 bash[3525798]: debug 2022-05-31T19:36:23.316+ 
7ff87ddbc5c0  0 rgw main: ERROR: failed to update sources index for bucket=:[]) 
r=-1

May 31 15:36:23 k4 bash[3525798]: debug 2022-05-31T19:36:23.316+ 
7ff87ddbc5c0  0 rgw main: ERROR: failed to initialize bucket sync policy 
handler: get_bucket_sync_hints() on bucket=-- returned r=-1

May 31 15:36:23 k4 bash[3525798]: debug 2022-05-31T19:36:23.316+ 
7ff87ddbc5c0 -1 rgw main: ERROR: could not initialize zone policy handler for 
zone=zone-1

May 31 15:36:23 k4 bash[3525798]: debug 2022-05-31T19:36:23.316+ 
7ff87ddbc5c0  0 rgw main: ERROR: failed to start notify service ((1) Operation 
not permitted

May 31 15:36:23 k4 bash[3525798]: debug 2022-05-31T19:36:23.316+ 
7ff87ddbc5c0  0 rgw main: ERROR: failed to init services (ret=(1) Operation not 
permitted)

May 31 15:36:23 k4 bash[3525798]: debug 2022-05-31T19:36:23.320+ 
7ff87ddbc5c0 -1 Couldn't init storage provider (RADOS)

May 31 15:36:24 k4 systemd[1]: 
ceph-a4cfb4a0-cbd6-11ec-b27f-c3fe3a7bc84c@rgw.keeper-2.k4.jugmpm.service: Main 
process exited, code=exited, status=5/NOTINSTALLED

May 31 15:36:24 k4 systemd[1]: 
ceph-a4cfb4a0-cbd6-11ec-b27f-c3fe3a7bc...@rgw.zone-2.k4.jugmpm.service: Failed 
with result 'exit-code'.
​



I don't know what step I am missing. I've setup secondary zones manually before 
with no problems, so I wonder if there's a bug in the orchestrator when it 
configures rgw or if I'm just missing a parameter somewhere.

thanks,
  Wyllys Ingersoll


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io