[ceph-users] Re: pg_num != pgp_num - and unable to change.

2023-07-06 Thread Anthony D'Atri
Indeed.  For clarity, this process is not the same as the pg_autoscaler.  It's 
real easy to conflate the two, along with the balancer module, so I like to 
call that out to reduce confusion.
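
For anyone wanting to check which of the three mechanisms is actually in play on a
cluster, a rough sketch (the pool name is only an illustration):

   ceph osd pool autoscale-status            # pg_autoscaler's view per pool
   ceph balancer status                      # balancer module state
   ceph osd pool ls detail | grep mypool     # pg_num/pgp_num vs *_target, which the mgr converges gradually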

> On Jul 6, 2023, at 18:01, Dan van der Ster  wrote:
> 
> Since nautilus, pgp_num (and pg_num) will be increased by the mgr
> automatically to reach your pg_num_target over time. (If you're a source
> code reader check DaemonServer::adjust_pgs for how this works).

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS snapshots: impact of moving data

2023-07-06 Thread Gregory Farnum
Moving files around within the namespace never changes the way the file
data is represented within RADOS. It’s just twiddling metadata bits. :)
-Greg

On Thu, Jul 6, 2023 at 3:26 PM Dan van der Ster 
wrote:

> Hi Mathias,
>
> Provided that both subdirs are within the same snap context (subdirs below
> where the .snap is created), I would assume that in the mv case, the space
> usage is not doubled: the snapshots point at the same inode and it is just
> linked at different places in the filesystem.
>
> However, if your cluster and livelihood depend on this being true, I
> suggest making a small test in a tiny empty cephfs, listing the rados pools
> before and after mv and snapshot operations to find out exactly which data
> objects are created.
>
> Cheers, Dan
>
> __
> Clyso GmbH | Ceph Support and Consulting | https://www.clyso.com
>
>
>
>
> On Thu, Jun 22, 2023 at 8:54 AM Kuhring, Mathias <
> mathias.kuhr...@bih-charite.de> wrote:
>
> > Dear Ceph community,
> >
> > We want to restructure (i.e. move around) a lot of data (hundreds of
> > terabytes) in our CephFS.
> > And now I was wondering what happens within snapshots when I move data
> > around within a snapshotted folder.
> > I.e. do I need to account for a lot of increased storage usage due to older
> > snapshots differing from the new restructured state?
> > In the end it is just metadata changes. Are the snapshots aware of this?
> >
> > Consider the following examples.
> >
> > Copying data:
> > Let's say I have a folder /test, with a file XYZ in sub-folder
> > /test/sub1 and an empty sub-folder /test/sub2.
> > I create snapshot snapA in /test/.snap, copy XYZ to sub-folder
> > /test/sub2, delete it from /test/sub1 and create another snapshot snapB.
> > I would have two snapshots each with distinct copies of XYZ, hence using
> > double the space in the FS:
> > /test/.snap/snapA/sub1/XYZ <-- copy 1
> > /test/.snap/snapA/sub2/
> > /test/.snap/snapB/sub1/
> > /test/.snap/snapB/sub2/XYZ <-- copy 2
> >
> > Moving data:
> > Let's assume the same structure.
> > But now after creating snapshot snapA, I move XYZ to sub-folder
> > /test/sub2 and then create the other snapshot snapB.
> > The directory tree will look the same. But how is this treated
> internally?
> > Once I move the data, will there be an actual copy created in snapA to
> > represent the old state?
> > Or will this remain the same data (like a link to the inode or so)?
> > And hence not double the storage used for that file.
> >
> > I couldn't find (or understand) anything related to this in the docs.
> > The closest seems to be the hard-link section here:
> > https://docs.ceph.com/en/quincy/dev/cephfs-snapshots/#hard-links
> > Which unfortunately goes a bit over my head.
> > So I'm not sure if this answers my question.
> >
> > Thank you all for your help. Appreciate it.
> >
> > Best Wishes,
> > Mathias Kuhring
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS snapshots: impact of moving data

2023-07-06 Thread Dan van der Ster
Hi Mathias,

Provided that both subdirs are within the same snap context (subdirs below
where the .snap is created), I would assume that in the mv case, the space
usage is not doubled: the snapshots point at the same inode and it is just
linked at different places in the filesystem.

However, if your cluster and livelihood depend on this being true, I
suggest making a small test in a tiny empty cephfs, listing the rados pools
before and after mv and snapshot operations to find out exactly which data
objects are created.
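
A minimal sketch of such a test, assuming a CephFS mount at /mnt/cephfs and
illustrative pool/file names:

   rados df                                       # note per-pool object counts and bytes
   mkdir /mnt/cephfs/test/.snap/snapA             # take a snapshot
   mv /mnt/cephfs/test/sub1/XYZ /mnt/cephfs/test/sub2/
   mkdir /mnt/cephfs/test/.snap/snapB             # take another snapshot
   rados df                                       # compare: mv alone should not duplicate data objects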

Cheers, Dan

__
Clyso GmbH | Ceph Support and Consulting | https://www.clyso.com




On Thu, Jun 22, 2023 at 8:54 AM Kuhring, Mathias <
mathias.kuhr...@bih-charite.de> wrote:

> Dear Ceph community,
>
> We want to restructure (i.e. move around) a lot of data (hundreds of
> terabytes) in our CephFS.
> And now I was wondering what happens within snapshots when I move data
> around within a snapshotted folder.
> I.e. do I need to account for a lot of increased storage usage due to older
> snapshots differing from the new restructured state?
> In the end it is just metadata changes. Are the snapshots aware of this?
>
> Consider the following examples.
>
> Copying data:
> Let's say I have a folder /test, with a file XYZ in sub-folder
> /test/sub1 and an empty sub-folder /test/sub2.
> I create snapshot snapA in /test/.snap, copy XYZ to sub-folder
> /test/sub2, delete it from /test/sub1 and create another snapshot snapB.
> I would have two snapshots each with distinct copies of XYZ, hence using
> double the space in the FS:
> /test/.snap/snapA/sub1/XYZ <-- copy 1
> /test/.snap/snapA/sub2/
> /test/.snap/snapB/sub1/
> /test/.snap/snapB/sub2/XYZ <-- copy 2
>
> Moving data:
> Let's assume the same structure.
> But now after creating snapshot snapA, I move XYZ to sub-folder
> /test/sub2 and then create the other snapshot snapB.
> The directory tree will look the same. But how is this treated internally?
> Once I move the data, will there be an actual copy created in snapA to
> represent the old state?
> Or will this remain the same data (like a link to the inode or so)?
> And hence not double the storage used for that file.
>
> I couldn't find (or understand) anything related to this in the docs.
> The closest seems to be the hard-link section here:
> https://docs.ceph.com/en/quincy/dev/cephfs-snapshots/#hard-links
> Which unfortunately goes a bit over my head.
> So I'm not sure if this answers my question.
>
> Thank you all for your help. Appreciate it.
>
> Best Wishes,
> Mathias Kuhring
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Quarterly (CQ) - Issue #1

2023-07-06 Thread Dan van der Ster
Thanks Zac!

I only see the txt attachment here. Where can we get the PDF A4 and letter
renderings?

Cheers, Dan

__
Clyso GmbH | Ceph Support and Consulting | https://www.clyso.com




On Mon, Jul 3, 2023 at 10:29 AM Zac Dover  wrote:

> The first issue of "Ceph Quarterly" is attached to this email. Ceph
> Quarterly (or "CQ") is an overview of the past three months of upstream
> Ceph development. We provide CQ in three formats: A4, letter, and plain
> text wrapped at 80 columns.
>
> Zac Dover
> Upstream Documentation
> Ceph Foundation
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cannot get backfill speed up

2023-07-06 Thread Dan van der Ster
Hi Jesper,

Indeed many users reported slow backfilling and recovery with the mclock
scheduler. This is supposed to be fixed in the latest quincy but clearly
something is still slowing things down.
Some clusters have better luck reverting to osd_op_queue = wpq.

(I'm hoping by proposing this someone who tuned mclock recently will chime
in with better advice).
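
A rough sketch of both options (values are illustrative, and osd_op_queue only
takes effect after the OSDs are restarted):

   ceph config set osd osd_op_queue wpq
   # then restart the OSDs for the queue change to apply
   # or, staying on mclock, prioritize recovery over client IO:
   ceph config set osd osd_mclock_profile high_recovery_ops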

Cheers, Dan

__
Clyso GmbH | Ceph Support and Consulting | https://www.clyso.com




On Wed, Jul 5, 2023 at 10:28 PM Jesper Krogh  wrote:

>
> Hi.
>
> Fresh cluster - but despite setting:
> jskr@dkcphhpcmgt028:/$ sudo ceph config show osd.0 | grep recovery_max_active_ssd
> osd_recovery_max_active_ssd    50     mon    default[20]
> jskr@dkcphhpcmgt028:/$ sudo ceph config show osd.0 | grep osd_max_backfills
> osd_max_backfills              100    mon    default[10]
>
> I still get
> jskr@dkcphhpcmgt028:/$ sudo ceph status
>cluster:
>  id: 5c384430-da91-11ed-af9c-c780a5227aff
>  health: HEALTH_OK
>
>services:
>  mon: 3 daemons, quorum dkcphhpcmgt031,dkcphhpcmgt029,dkcphhpcmgt028
> (age 16h)
>  mgr: dkcphhpcmgt031.afbgjx(active, since 33h), standbys:
> dkcphhpcmgt029.bnsegi, dkcphhpcmgt028.bxxkqd
>  mds: 2/2 daemons up, 1 standby
>  osd: 40 osds: 40 up (since 45h), 40 in (since 39h); 21 remapped pgs
>
>data:
>  volumes: 2/2 healthy
>  pools:   9 pools, 495 pgs
>  objects: 24.85M objects, 60 TiB
>  usage:   117 TiB used, 159 TiB / 276 TiB avail
>  pgs: 10655690/145764002 objects misplaced (7.310%)
>   474 active+clean
>   15  active+remapped+backfilling
>   6   active+remapped+backfill_wait
>
>io:
>  client:   0 B/s rd, 1.4 MiB/s wr, 0 op/s rd, 116 op/s wr
>  recovery: 328 MiB/s, 108 objects/s
>
>progress:
>  Global Recovery Event (9h)
>[==..] (remaining: 25m)
>
> With these numbers for the settings, I would expect to get more than 15 PGs
> actively backfilling... (and based on SSDs and a 2x25Gbit network, I can
> also spend more resources on recovery than 328 MiB/s).
>
> Thanks, .
>
> --
> Jesper Krogh
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pg_num != pgp_num - and unable to change.

2023-07-06 Thread Dan van der Ster
Hi Jesper,

> In earlier versions of ceph (without autoscaler) I have only experienced
> that setting pg_num and pgp_num took immediate effect?

That's correct -- in recent Ceph (since nautilus) you cannot manipulate
pgp_num directly anymore. There is a backdoor setting (set pgp_num_actual
...) but I don't really recommend that.

Since nautilus, pgp_num (and pg_num) will be increased by the mgr
automatically to reach your pg_num_target over time. (If you're a source
code reader check DaemonServer::adjust_pgs for how this works).

In short, the mgr is throttled by the target_max_misplaced_ratio, which
defaults to 5%.

So if you want to split more aggressively,
increase target_max_misplaced_ratio.
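
For example (0.10 is only an illustration; the default is 0.05):

   ceph config set mgr target_max_misplaced_ratio 0.10
   ceph osd pool ls detail | grep mypool    # watch pgp_num step toward pgp_num_target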

Cheers, Dan

__
Clyso GmbH | Ceph Support and Consulting | https://www.clyso.com



On Wed, Jul 5, 2023 at 9:41 PM Jesper Krogh  wrote:

> Hi.
>
> Fresh cluster - after a dance where the autoscaler did not work
> (returned blank) as described in the doc - I now seemingly have it
> working. It has bumped the target to something reasonable -- and is slowly
> incrementing pg_num and pgp_num by 2 over time (hope this is correct?)
>
> But .
> jskr@dkcphhpcmgt028:/$ sudo ceph osd pool ls detail | grep 62
> pool 22 'cephfs.archive.ec62data' erasure profile ecprof62 size 8
> min_size 7 crush_rule 3 object_hash rjenkins pg_num 150 pgp_num 22
> pg_num_target 512 pgp_num_target 512 autoscale_mode on last_change 9159
> lfor 0/0/9147 flags hashpspool,ec_overwrites,selfmanaged_snaps,bulk
> stripe_width 24576 pg_num_min 128 target_size_ratio 0.4 application
> cephfs
>
> pg_num = 150
> pgp_num = 22
>
> and setting pgp_num seemingly has zero effect on the system... not even
> with autoscaling set to off.
>
> jskr@dkcphhpcmgt028:/$ sudo ceph osd pool set cephfs.archive.ec62data
> pg_autoscale_mode off
> set pool 22 pg_autoscale_mode to off
> jskr@dkcphhpcmgt028:/$ sudo ceph osd pool set cephfs.archive.ec62data
> pgp_num 150
> set pool 22 pgp_num to 150
> jskr@dkcphhpcmgt028:/$ sudo ceph osd pool set cephfs.archive.ec62data
> pg_num_min 128
> set pool 22 pg_num_min to 128
> jskr@dkcphhpcmgt028:/$ sudo ceph osd pool set cephfs.archive.ec62data
> pg_num 150
> set pool 22 pg_num to 150
> jskr@dkcphhpcmgt028:/$ sudo ceph osd pool set cephfs.archive.ec62data
> pg_autoscale_mode on
> set pool 22 pg_autoscale_mode to on
> jskr@dkcphhpcmgt028:/$ sudo ceph progress
> PG autoscaler increasing pool 22 PGs from 150 to 512 (14s)
>  []
> jskr@dkcphhpcmgt028:/$ sudo ceph osd pool ls detail | grep 62
> pool 22 'cephfs.archive.ec62data' erasure profile ecprof62 size 8
> min_size 7 crush_rule 3 object_hash rjenkins pg_num 150 pgp_num 22
> pg_num_target 512 pgp_num_target 512 autoscale_mode on last_change 9159
> lfor 0/0/9147 flags hashpspool,ec_overwrites,selfmanaged_snaps,bulk
> stripe_width 24576 pg_num_min 128 target_size_ratio 0.4 application
> cephfs
>
> pgp_num != pg_num ?
>
> In earlier versions of ceph (without autoscaler) I have only experienced
> that setting pg_num and pgp_num took immediate effect?
>
> Jesper
>
> jskr@dkcphhpcmgt028:/$ sudo ceph version
> ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy
> (stable)
> jskr@dkcphhpcmgt028:/$ sudo ceph health
> HEALTH_OK
> jskr@dkcphhpcmgt028:/$ sudo ceph status
>cluster:
>  id: 5c384430-da91-11ed-af9c-c780a5227aff
>  health: HEALTH_OK
>
>services:
>  mon: 3 daemons, quorum dkcphhpcmgt031,dkcphhpcmgt029,dkcphhpcmgt028
> (age 15h)
>  mgr: dkcphhpcmgt031.afbgjx(active, since 32h), standbys:
> dkcphhpcmgt029.bnsegi, dkcphhpcmgt028.bxxkqd
>  mds: 2/2 daemons up, 1 standby
>  osd: 40 osds: 40 up (since 44h), 40 in (since 39h); 33 remapped pgs
>
>data:
>  volumes: 2/2 healthy
>  pools:   9 pools, 495 pgs
>  objects: 24.85M objects, 60 TiB
>  usage:   117 TiB used, 158 TiB / 276 TiB avail
>  pgs: 13494029/145763897 objects misplaced (9.257%)
>   462 active+clean
>   23  active+remapped+backfilling
>   10  active+remapped+backfill_wait
>
>io:
>  client:   0 B/s rd, 1.1 MiB/s wr, 0 op/s rd, 94 op/s wr
>  recovery: 705 MiB/s, 208 objects/s
>
>progress:
>
>
> --
> Jesper Krogh
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MON sync time depends on outage duration

2023-07-06 Thread Dan van der Ster
Hi Eugen!

Yes that sounds familiar from the luminous and mimic days.

Check this old thread:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/F3W2HXMYNF52E7LPIQEJFUTAD3I7QE25/
(that thread is truncated but I can tell you that it worked for Frank).
Also the even older referenced thread:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/M5ZKF7PTEO2OGDDY5L74EV4QS5SDCZTH/

The workaround for zillions of snapshot keys at that time was to use:
   ceph config set mon mon_sync_max_payload_size 4096

That said, that sync issue was supposed to be fixed by way of adding the
new option mon_sync_max_payload_keys, which has been around since nautilus.

So it could be in your case that the sync payload is just too small to
efficiently move 42 million osd_snap keys? Using debug_paxos and debug_mon
you should be able to understand what is taking so long, and tune
mon_sync_max_payload_size and mon_sync_max_payload_keys accordingly.
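
A sketch of the knobs involved (the values below are only starting points to
experiment with, not recommendations):

   ceph config set mon debug_mon 10
   ceph config set mon debug_paxos 10
   ceph config set mon mon_sync_max_payload_size 4096
   ceph config set mon mon_sync_max_payload_keys 4096
   # then stop one non-leader mon for >5 minutes to trigger a full sync and watch its log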

Good luck!

Dan

__
Clyso GmbH | Ceph Support and Consulting | https://www.clyso.com



On Thu, Jul 6, 2023 at 1:47 PM Eugen Block  wrote:

> Hi *,
>
> I'm investigating an interesting issue on two customer clusters (used
> for mirroring) I've not solved yet, but today we finally made some
> progress. Maybe someone has an idea where to look next, I'd appreciate
> any hints or comments.
> These are two (latest) Octopus clusters, main usage currently is RBD
> mirroring with snapshot mode (around 500 RBD images are synced every
> 30 minutes). They noticed very long startup times of MON daemons after
> reboot, times between 10 and 30 minutes (reboot time already
> subtracted). These delays are present on both sites. Today we got a
> maintenance window and started to check in more detail by just
> restarting the MON service (joins quorum within seconds), then
> stopping the MON service and wait a few minutes (still joins quorum
> within seconds). And then we stopped the service and waited for more
> than 5 minutes, simulating a reboot, and then we were able to
> reproduce it. The sync then takes around 15 minutes, we verified with
> other MONs as well. The MON store is around 2 GB of size (on HDD), I
> understand that the sync itself can take some time, but what is the
> threshold here? I tried to find a hint in the MON config, searching
> for timeouts with 300 seconds, there were only a few matches
> (mon_session_timeout is one of them), but I'm not sure if they can
> explain this behavior.
> Investigating the MON store (ceph-monstore-tool dump-keys) I noticed
> that there were more than 42 Million osd_snap keys, which is quite a
> lot and would explain the size of the MON store. But I'm also not sure
> if it's related to the long syncing process.
> Does that sound familiar to anyone?
>
> Thanks,
> Eugen
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Rook on bare-metal?

2023-07-06 Thread Travis Nielsen
Here are the answers to some of the questions. Happy to follow up with more
discussion in the Rook Slack, Discussions, or Issues.

Thanks!
Travis

On Thu, Jul 6, 2023 at 4:43 AM Anthony D'Atri  wrote:

> I’m also using Rook on BM.  I had never used K8s before, so that was the
> learning curve, e.g. translating the example YAML files into the Helm
> charts we needed, and the label / taint / toleration dance to fit the
> square peg of pinning services to round hole nodes.  We’re using Kubespray; I
> gather there are other ways of deploying K8s?
>
> Some things that could improve:
>
> * mgrs are limited to 2, apparently Sage previously said that was all
> anyone should need.  I would like to be able to deploy one for each mon.


Is there a specific need for 3? Or is it more of a habit/expectation?


> * The efficiency of `destroy`ing OSDs is not exploited, so replacing one
> involves more data shuffling than it otherwise might
>

There is a related design discussion in progress that will address the
replacement of OSDs to avoid the data reshuffling:
https://github.com/rook/rook/pull/12381


> * I’m specifying 3 RGWs but only getting 1 deployed, no idea why
> * Ingress / load balancer service for multiple RGWs seems to be manual
> * Bundled alerts are kind of noisy
>

Curious for more details on these three issues if you want to open issues.


> * I’m still unsure what Rook does dynamically, and what it only does at
> deployment time (we use ArgoCD).  I.e., if I make changes, what sticks and
> what’s trampled?
>

Changes to settings in the CRDs are intended to be applied to the running cluster.
If you see settings that are not applied when changed, agreed we should
track that and fix it, or at least document it.


> * How / if one can bake configuration (as in `ceph.conf` entries) into the
> YAML files vs manually running “ceph config”
>

ceph.conf settings can be applied through a configmap. See
https://rook.io/docs/rook/latest/Storage-Configuration/Advanced/ceph-configuration/#custom-csi-cephconf-settings
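
A minimal sketch, assuming the rook-ceph namespace and the usual override
configmap name (the setting shown is only an example):

   kubectl -n rook-ceph edit configmap rook-config-override
   # add ceph.conf-style entries under the "config" key, e.g.:
   #   config: |
   #     [global]
   #     osd_pool_default_size = 3
   # daemons pick up the change on their next restart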


> * What the sidecars within the pods are doing, if any of them can be
> disabled
>

Sidecars are needed for some of the pods (csi drivers and mgr) to provide
some functionality. They can't be disabled unless some feature is disabled.
For example, if two mgrs are running, the mgr sidecar is needed to watch
when the mgr failover occurs so the services can update to point to the
active mgr. Search this doc for "sidecar" for some more details on the mgr
sidecar.
https://rook.io/docs/rook/latest/CRDs/Cluster/ceph-cluster-crd/#cluster-wide-resources-configuration-settings


> * Requests / limits for various pods, especially when on dedicated nodes.
> Plan to experiment with disabling limits and setting
> `autotune_memory_target_ratio` and `osd_memory_target_autotune`
>

Where you have dedicated nodes, it can certainly be simpler to remove the
resource requests/limits, as long as you set those memory limits. Default
requests/limits are set by the helm chart, and they can admittedly be
challenging to tune since there are so many moving parts.


> * Documentation for how to do pod-specific configuration, i.e. setting the
> number of OSDs per node when it isn’t uniform.  A colleague helped me sort
> this out, but I’m enumerating each node - would like to be able to do so
> more concisely, perhaps with a default and overrides.
>

There are multiple ways to deal with OSD creation, depending on the
environment. Curious to follow up on what worked for you, or how this could
be improved in the docs.


>
> > On Jul 6, 2023, at 4:13 AM, Joachim Kraftmayer - ceph ambassador <
> joachim.kraftma...@clyso.com> wrote:
> >
> > Hello
> >
> > we have been following rook since 2018 and have had our experiences both
> on bare-metal and in the hyperscalers.
> > In the same way, we have been following cephadm from the beginning.
> >
> > Meanwhile, we have been using both in production for years and the
> decision of which orchestrator to use differs from project to project, e.g.,
> the features of both projects are not identical.
> >
> > Joachim
> >
> > ___
> > ceph ambassador DACH
> > ceph consultant since 2012
> >
> > Clyso GmbH - Premier Ceph Foundation Member
> >
> > https://www.clyso.com/
> >
> > Am 06.07.23 um 07:16 schrieb Nico Schottelius:
> >> Morning,
> >>
> >> we are running some ceph clusters with rook on bare metal and can very
> >> much recommend it. You should have proper k8s knowledge, knowing how to
> >> change objects such as configmaps or deployments, in case things go
> >> wrong.
> >>
> >> In regards to stability, the rook operator is written rather defensively,
> >> not changing monitors or the cluster if the quorum is not met and
> >> checking how the osd status is on removal/adding of osds.
> >>
> >> So TL;DR: very much usable and rather k8s native.
> >>
> >> BR,
> >>
> >> Nico
> >>
> >> 

[ceph-users] MON sync time depends on outage duration

2023-07-06 Thread Eugen Block

Hi *,

I'm investigating an interesting issue on two customer clusters (used  
for mirroring) I've not solved yet, but today we finally made some  
progress. Maybe someone has an idea where to look next, I'd appreciate  
any hints or comments.
These are two (latest) Octopus clusters, main usage currently is RBD  
mirroring with snapshot mode (around 500 RBD images are synced every  
30 minutes). They noticed very long startup times of MON daemons after  
reboot, times between 10 and 30 minutes (reboot time already  
subtracted). These delays are present on both sites. Today we got a  
maintenance window and started to check in more detail by just  
restarting the MON service (joins quorum within seconds), then  
stopping the MON service and waiting a few minutes (still joins quorum
within seconds). And then we stopped the service and waited for more  
than 5 minutes, simulating a reboot, and then we were able to  
reproduce it. The sync then takes around 15 minutes, we verified with  
other MONs as well. The MON store is around 2 GB of size (on HDD), I  
understand that the sync itself can take some time, but what is the  
threshold here? I tried to find a hint in the MON config, searching  
for timeouts with 300 seconds, there were only a few matches  
(mon_session_timeout is one of them), but I'm not sure if they can  
explain this behavior.
Investigating the MON store (ceph-monstore-tool dump-keys) I noticed  
that there were more than 42 Million osd_snap keys, which is quite a  
lot and would explain the size of the MON store. But I'm also not sure  
if it's related to the long syncing process.

Does that sound familiar to anyone?

Thanks,
Eugen
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RBD with PWL cache shows poor performance compared to cache device

2023-07-06 Thread Mark Nelson


On 7/6/23 06:02, Matthew Booth wrote:

On Wed, 5 Jul 2023 at 15:18, Mark Nelson  wrote:

I'm sort of amazed that it gave you symbols without the debuginfo
packages installed.  I'll need to figure out a way to prevent that.
Having said that, your new traces look more accurate to me.  The thing
that sticks out to me is the (slight?) amount of contention on the PWL
m_lock in dispatch_deferred_writes, update_root_scheduled_ops,
append_ops, append_sync_point(), etc.

I don't know if the contention around the m_lock is enough to cause an
increase in 99% tail latency from 1.4ms to 5.2ms, but it's the first
thing that jumps out at me.  There appears to be a large number of
threads (each tp_pwl thread, the io_context_pool threads, the qemu
thread, and the bstore_aio thread) that all appear to have potential to
contend on that lock.  You could try dropping the number of tp_pwl
threads from 4 to 1 and see if that changes anything.

Will do. Any idea how to do that? I don't see an obvious rbd config option.

Thanks for looking into this,
Matt


you thanked me too soon...it appears to be hard-coded in, so you'll have 
to do a custom build. :D


https://github.com/ceph/ceph/blob/main/src/librbd/cache/pwl/AbstractWriteLog.cc#L55-L56


Mark


--
Best Regards,
Mark Nelson
Head of R&D (USA)

Clyso GmbH
p: +49 89 21552391 12
a: Loristraße 8 | 80335 München | Germany
w: https://clyso.com | e: mark.nel...@clyso.com

We are hiring: https://www.clyso.com/jobs/
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RBD with PWL cache shows poor performance compared to cache device

2023-07-06 Thread Matthew Booth
On Wed, 5 Jul 2023 at 15:18, Mark Nelson  wrote:
> I'm sort of amazed that it gave you symbols without the debuginfo
> packages installed.  I'll need to figure out a way to prevent that.
> Having said that, your new traces look more accurate to me.  The thing
> that sticks out to me is the (slight?) amount of contention on the PWL
> m_lock in dispatch_deferred_writes, update_root_scheduled_ops,
> append_ops, append_sync_point(), etc.
>
> I don't know if the contention around the m_lock is enough to cause an
> increase in 99% tail latency from 1.4ms to 5.2ms, but it's the first
> thing that jumps out at me.  There appears to be a large number of
> threads (each tp_pwl thread, the io_context_pool threads, the qemu
> thread, and the bstore_aio thread) that all appear to have potential to
> contend on that lock.  You could try dropping the number of tp_pwl
> threads from 4 to 1 and see if that changes anything.

Will do. Any idea how to do that? I don't see an obvious rbd config option.

Thanks for looking into this,
Matt
-- 
Matthew Booth
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Rook on bare-metal?

2023-07-06 Thread Anthony D'Atri
I’m also using Rook on BM.  I had never used K8s before, so that was the 
learning curve, e.g. translating the example YAML files into the Helm charts we 
needed, and the label / taint / toleration dance to fit the square peg of 
pinning services to round hole nodes.  We’re using Kubespray; I gather there 
are other ways of deploying K8s?

Some things that could improve:

* mgrs are limited to 2, apparently Sage previously said that was all anyone 
should need.  I would like to be able to deploy one for each mon.
* The efficiency of `destroy`ing OSDs is not exploited, so replacing one 
involves more data shuffling than it otherwise might
* I’m specifying 3 RGWs but only getting 1 deployed, no idea why
* Ingress / load balancer service for multiple RGWs seems to be manual
* Bundled alerts are kind of noisy
* I’m still unsure what Rook does dynamically, and what it only does at 
deployment time (we use ArgoCD).  I.e., if I make changes, what sticks and 
what’s trampled?
* How / if one can bake configuration (as in `ceph.conf` entries) into the YAML 
files vs manually running “ceph config”
* What the sidecars within the pods are doing, if any of them can be disabled
* Requests / limits for various pods, especially when on dedicated nodes.  Plan 
to experiment with disabling limits and setting `autotune_memory_target_ratio` 
and `osd_memory_target_autotune`
* Documentation for how to do pod-specific configuration, i.e. setting the 
number of OSDs per node when it isn’t uniform.  A colleague helped me sort this 
out, but I’m enumerating each node - would like to be able to do so more 
concisely, perhaps with a default and overrides.

> On Jul 6, 2023, at 4:13 AM, Joachim Kraftmayer - ceph ambassador 
>  wrote:
> 
> Hello
> 
> we have been following rook since 2018 and have had our experiences both on 
> bare-metal and in the hyperscalers.
> In the same way, we have been following cephadm from the beginning.
> 
> Meanwhile, we have been using both in production for years and the decision 
> of which orchestrator to use differs from project to project, e.g., the features 
> of both projects are not identical.
> 
> Joachim
> 
> ___
> ceph ambassador DACH
> ceph consultant since 2012
> 
> Clyso GmbH - Premier Ceph Foundation Member
> 
> https://www.clyso.com/
> 
> Am 06.07.23 um 07:16 schrieb Nico Schottelius:
>> Morning,
>> 
>> we are running some ceph clusters with rook on bare metal and can very
>> much recommend it. You should have proper k8s knowledge, knowing how to
>> change objects such as configmaps or deployments, in case things go
>> wrong.
>> 
>> In regards to stability, the rook operator is written rather defensively,
>> not changing monitors or the cluster if the quorum is not met and
>> checking how the osd status is on removal/adding of osds.
>> 
>> So TL;DR: very much usable and rather k8s native.
>> 
>> BR,
>> 
>> Nico
>> 
>> zs...@tuta.io writes:
>> 
>>> Hello!
>>> 
>>> I am looking to simplify ceph management on bare-metal by deploying
>>> Rook onto kubernetes that has been deployed on bare metal (rke). I
>>> have used rook in a cloud environment but I have not used it on
>>> bare-metal. I am wondering if anyone here runs rook on bare-metal?
>>> Would you recommend it over cephadm or would you steer clear of it?
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>> 
>> --
>> Sustainable and modern Infrastructures by ungleich.ch
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW accessing real source IP address of a client (e.g. in S3 bucket policies)

2023-07-06 Thread Christian Rohmann

Hey Casey, all,

On 16/06/2023 17:00, Casey Bodley wrote:



But when applying a bucket policy with aws:SourceIp it seems to only work if I 
set the internal IP of the HAProxy instance, not the public IP of the client.
So the actual remote address is NOT used in my case.


Did I miss any config setting anywhere?


your 'rgw remote addr param' config looks right. with that same
config, i was able to set a bucket policy that denied access based on


I found the issue. Embarrassingly, it was simply a NAT hairpin applied to the 
traffic from the server I was testing with.
In short: even though I targeted the public IP of the HAProxy instance, 
the internal IP address of my test server was maintained as the source, since 
both machines are on the same network segment.
That is why I first thought the LB IP was applied to the policy, but not 
the actual public source IP of the client. In reality it was simply the 
private (RFC 1918) IP of the test machine that came in as the source.




Sorry for the noise and thanks for your help.

Christian


P.S. With IPv6, this would not have happened.



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Rook on bare-metal?

2023-07-06 Thread Joachim Kraftmayer - ceph ambassador

Hello

we have been following rook since 2018 and have had our experiences both 
on bare-metal and in the hyperscalers.

In the same way, we have been following cephadm from the beginning.

Meanwhile, we have been using both in production for years and the 
decision of which orchestrator to use differs from project to project; 
e.g., the features of both projects are not identical.


Joachim

___
ceph ambassador DACH
ceph consultant since 2012

Clyso GmbH - Premier Ceph Foundation Member

https://www.clyso.com/

Am 06.07.23 um 07:16 schrieb Nico Schottelius:

Morning,

we are running some ceph clusters with rook on bare metal and can very
much recommend it. You should have proper k8s knowledge, knowing how to
change objects such as configmaps or deployments, in case things go
wrong.

In regards to stability, the rook operator is written rather defensively,
not changing monitors or the cluster if the quorum is not met and
checking how the osd status is on removal/adding of osds.

So TL;DR: very much usable and rather k8s native.

BR,

Nico

zs...@tuta.io writes:


Hello!

I am looking to simplify ceph management on bare-metal by deploying
Rook onto kubernetes that has been deployed on bare metal (rke). I
have used rook in a cloud environment but I have not used it on
bare-metal. I am wondering if anyone here runs rook on bare-metal?
Would you recommend it over cephadm or would you steer clear of it?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


--
Sustainable and modern Infrastructures by ungleich.ch
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph quota qustion

2023-07-06 Thread Konstantin Shalygin
Hi,

These are incomplete multipart uploads, I guess; you should remove them first. I don't know 
how S3 Browser works with these entities.
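
If these are indeed leftover multipart parts, a sketch with the aws CLI (bucket
name and endpoint are assumptions) to list and abort them:

   aws --endpoint-url http://rgw.example.com s3api list-multipart-uploads --bucket mybucket
   aws --endpoint-url http://rgw.example.com s3api abort-multipart-upload \
       --bucket mybucket --key bigfile --upload-id <UploadId from the listing>
   # a lifecycle rule with AbortIncompleteMultipartUpload can also clean these up automatically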


k
Sent from my iPhone

> On 6 Jul 2023, at 07:57, sejun21@samsung.com wrote:
> 
> Hi, I'm contacting you with a question about quotas.
> 
> The situation is as follows.
> 
> 1. I set the user quota to 10M.
> 2. Using S3 Browser, I upload one 12M file.
> 3. The upload failed as I wished, but some objects remain in the pool (almost 
> 10M) and S3 Browser doesn't show the failed file.
> 
> I expected nothing to be left in Ceph. 
> 
> My question is: "Can the user or admin remove the remaining objects?"
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io