[ceph-users] Re: Fstab entry for mounting specific ceph fs?

2022-09-23 Thread Sagittarius-A Black Hole
Ah, I found it: mds_namespace IS, in this case, the name of the filesystem.
Why not call it filesystem name instead of namespace, a term that, as far
as I could find, is not defined in Ceph?
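
For anyone who finds this thread later, a complete fstab line of that form
would look roughly like the sketch below (minimal and untested, assuming a
filesystem named myfs; newer kernels also accept fs=myfs, with mds_namespace
kept as a legacy alias):

192.168.1.11,192.168.1.12,192.168.1.13:/ /media/ceph_fs ceph name=james_user,secretfile=/etc/ceph/secret.key,mds_namespace=myfs,_netdev 0 0

The filesystem type field ("ceph") and the two trailing fields have to be
present, and there must be no spaces inside the options list, otherwise fstab
parsing breaks.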

Thanks,

Daniel

On Fri, 23 Sept 2022 at 17:09, Sagittarius-A Black Hole
 wrote:
>
> Hi,
>
> thanks for the suggestion of the namespace. I'm trying to find any
> documentation over it, how do you set a name space for a filesystem /
> pool?
>
> Thanks,
>
> Daniel
>
> On Fri, 23 Sept 2022 at 16:01, Wesley Dillingham  
> wrote:
> >
> > Try adding mds_namespace option like so:
> >
> > 192.168.1.11,192.168.1.12,192.168.1.13:/ /media/ceph_fs/
> > name=james_user,secretfile=/etc/ceph/secret.key,mds_namespace=myfs
> >
> > On Fri, Sep 23, 2022 at 6:41 PM Sagittarius-A Black Hole 
> >  wrote:
> >>
> >> Hi,
> >>
> >> The below fstab entry works, so that is a given.
> >> But how do I specify which Ceph filesystem I want to mount in this fstab 
> >> format?
> >>
> >> 192.168.1.11,192.168.1.12,192.168.1.13:/ /media/ceph_fs/
> >> name=james_user, secretfile=/etc/ceph/secret.key
> >>
> >> I have tried different ways, but always get the error "source mount
> >> path was not specified"
> >> I can't find many examples of fstab ceph mounts unfortunately.
> >>
> >> Thanks,
> >>
> >> Daniel
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> > --
> >
> > Respectfully,
> >
> > Wes Dillingham
> > w...@wesdillingham.com
> > LinkedIn
>
>
>
> --
> Por sperto kaj lerno ne sufiĉas eterno.



-- 
Por sperto kaj lerno ne sufiĉas eterno.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Fstab entry for mounting specific ceph fs?

2022-09-23 Thread Sagittarius-A Black Hole
Hi,

Thanks for the suggestion of the namespace. I'm trying to find
documentation on it; how do you set a namespace for a filesystem /
pool?

Thanks,

Daniel

On Fri, 23 Sept 2022 at 16:01, Wesley Dillingham  wrote:
>
> Try adding mds_namespace option like so:
>
> 192.168.1.11,192.168.1.12,192.168.1.13:/ /media/ceph_fs/
> name=james_user,secretfile=/etc/ceph/secret.key,mds_namespace=myfs
>
> On Fri, Sep 23, 2022 at 6:41 PM Sagittarius-A Black Hole 
>  wrote:
>>
>> Hi,
>>
>> The below fstab entry works, so that is a given.
>> But how do I specify which Ceph filesystem I want to mount in this fstab 
>> format?
>>
>> 192.168.1.11,192.168.1.12,192.168.1.13:/ /media/ceph_fs/
>> name=james_user, secretfile=/etc/ceph/secret.key
>>
>> I have tried different ways, but always get the error "source mount
>> path was not specified"
>> I can't find many examples of fstab ceph mounts unfortunately.
>>
>> Thanks,
>>
>> Daniel
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>
> --
>
> Respectfully,
>
> Wes Dillingham
> w...@wesdillingham.com
> LinkedIn



-- 
Por sperto kaj lerno ne sufiĉas eterno.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Fstab entry for mounting specific ceph fs?

2022-09-23 Thread Sagittarius-A Black Hole
This is what I tried, following the link:

{name}@.{fs_name}=/ {mount}/{mountpoint} ceph
[mon_addr={ipaddress},secret=secretkey|secretfile=/path/to/secretfile

This does not work; it reports: "source mount path was not specified, unable
to parse mount source: -22".

Why are {mount} and {mountpoint} specified like this? It is just one
mount point, like /media/ceph_fs.
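
For comparison, with the placeholders substituted (hypothetical values: user
james_user, filesystem myfs; I believe the fsid before the dot may be left
empty), that line from the docs would look something like:

james_user@.myfs=/ /media/ceph_fs ceph mon_addr=192.168.1.11:6789/192.168.1.12:6789/192.168.1.13:6789,secretfile=/etc/ceph/secret.key,_netdev 0 0

The curly braces and the [ ... | ... ] bracket on that page are only notation
for "substitute your own value" / "pick one of the alternatives" and must not
appear in the actual fstab line. This new-style source string also seems to
need a fairly recent mount.ceph and kernel; on older clients the classic
mon-list form with mds_namespace= (or fs=) is the safer bet.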

Thanks,

Daniel

On Fri, 23 Sept 2022 at 16:16, Ramana Krisna Venkatesh Raja
 wrote:
>
> On Fri, Sep 23, 2022 at 6:41 PM Sagittarius-A Black Hole
>  wrote:
> >
> > Hi,
> >
> > The below fstab entry works, so that is a given.
> > But how do I specify which Ceph filesystem I want to mount in this fstab 
> > format?
> >
> > 192.168.1.11,192.168.1.12,192.168.1.13:/ /media/ceph_fs/
> > name=james_user, secretfile=/etc/ceph/secret.key
> >
> > I have tried different ways, but always get the error "source mount
> > path was not specified"
> > I can't find many examples of fstab ceph mounts unfortunately.
> >
>
> https://docs.ceph.com/en/quincy/cephfs/mount-using-kernel-driver/#persistent-mounts
>
>
> > Thanks,
> >
> > Daniel
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
>


-- 
Por sperto kaj lerno ne sufiĉas eterno.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Fstab entry for mounting specific ceph fs?

2022-09-23 Thread Ramana Krisna Venkatesh Raja
On Fri, Sep 23, 2022 at 6:41 PM Sagittarius-A Black Hole
 wrote:
>
> Hi,
>
> The below fstab entry works, so that is a given.
> But how do I specify which Ceph filesystem I want to mount in this fstab 
> format?
>
> 192.168.1.11,192.168.1.12,192.168.1.13:/ /media/ceph_fs/
> name=james_user, secretfile=/etc/ceph/secret.key
>
> I have tried different ways, but always get the error "source mount
> path was not specified"
> I can't find many examples of fstab ceph mounts unfortunately.
>

https://docs.ceph.com/en/quincy/cephfs/mount-using-kernel-driver/#persistent-mounts


> Thanks,
>
> Daniel
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Fstab entry for mounting specific ceph fs?

2022-09-23 Thread Wesley Dillingham
Try adding the mds_namespace option, like so:

192.168.1.11,192.168.1.12,192.168.1.13:/ /media/ceph_fs/
name=james_user,secretfile=/etc/ceph/secret.key,mds_namespace=myfs

On Fri, Sep 23, 2022 at 6:41 PM Sagittarius-A Black Hole <
nigrat...@gmail.com> wrote:

> Hi,
>
> The below fstab entry works, so that is a given.
> But how do I specify which Ceph filesystem I want to mount in this fstab
> format?
>
> 192.168.1.11,192.168.1.12,192.168.1.13:/ /media/ceph_fs/
> name=james_user, secretfile=/etc/ceph/secret.key
>
> I have tried different ways, but always get the error "source mount
> path was not specified"
> I can't find many examples of fstab ceph mounts unfortunately.
>
> Thanks,
>
> Daniel
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
-- 

Respectfully,

*Wes Dillingham*
w...@wesdillingham.com
LinkedIn 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Fstab entry for mounting specific ceph fs?

2022-09-23 Thread Sagittarius-A Black Hole
Hi,

The below fstab entry works, so that is a given.
But how do I specify which Ceph filesystem I want to mount in this fstab format?

192.168.1.11,192.168.1.12,192.168.1.13:/ /media/ceph_fs/
name=james_user, secretfile=/etc/ceph/secret.key

I have tried different ways, but always get the error "source mount
path was not specified".
Unfortunately, I can't find many examples of fstab Ceph mounts.

Thanks,

Daniel
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Freak issue every few weeks

2022-09-23 Thread J-P Methot
We just got a reply from Intel telling us that there's a new firmware 
coming out soon to fix an issue where S4510 and S4610 drives get IO 
timeouts that may lead to drive drops when under heavy load. This might 
very well be the source of our issue.


On 9/23/22 11:12, Stefan Kooman wrote:

On 9/23/22 15:22, J-P Methot wrote:

Thank you for your reply,

discard is not enabled in our configuration as it is mainly the 
default conf. Are you suggesting to enable it?


No. There is no consensus if enabling it is a good idea (depends on 
proper implementation among other things). From my experience on Intel 
S4610 (during LVM cleanup at OSD reprovisioning) it spends a lot of 
time discarding blocks (IIRC an order of magnitude more than Samsung 
PM883). So I doubt it would help. Altough it's hard to tell. Maybe it 
can do discards more often it does take less per operation, and might 
be less impactfull. But this is all speculation. It might be firmware 
related. Have all disks the same firmware? Are there disks that never 
experience this problem?


Gr. Stefan


--
Jean-Philippe Méthot
Senior Openstack system administrator
Administrateur système Openstack sénior
PlanetHoster inc.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Balancer Distribution Help

2022-09-23 Thread Wyll Ingersoll
Understood, that was a typo on my part.

Definitely don't cancel-backfill after generating the moves from
placementoptimizer.
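
So, for the archive, a hedged sketch of the corrected sequence (flag and tool
names as in the mails below; it assumes placementoptimizer.py and pgremapper
are on the PATH):

$ ceph osd set norebalance; ceph osd set norecover; ceph osd set nobackfill
$ ceph balancer off
$ pgremapper cancel-backfill --yes    # run once, up front, to freeze current mappings
$ ceph osd unset norebalance; ceph osd unset norecover; ceph osd unset nobackfill
# then iterate until the distribution is good enough:
$ placementoptimizer.py balance --max-pg-moves 100 | tee upmap-moves
$ bash upmap-moves
# wait for the resulting backfill to finish, then repeat the last two steps
$ ceph balancer on                    # optionally, once done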


From: Josh Baergen 
Sent: Friday, September 23, 2022 11:31 AM
To: Wyll Ingersoll 
Cc: Eugen Block ; ceph-users@ceph.io 
Subject: Re: [ceph-users] Re: Balancer Distribution Help

Hey Wyll,

> $ pgremapper cancel-backfill --yes   # to stop all pending operations
> $ placementoptimizer.py balance --max-pg-moves 100 | tee upmap-moves
> $ bash upmap-moves
>
> Repeat the above 3 steps until balance is achieved, then re-enable the 
> balancer and unset the "no" flags set earlier?

You don't want to run cancel-backfill after placementoptimizer,
otherwise it will undo the balancing backfill.

Josh
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Balancer Distribution Help

2022-09-23 Thread Josh Baergen
Hey Wyll,

> $ pgremapper cancel-backfill --yes   # to stop all pending operations
> $ placementoptimizer.py balance --max-pg-moves 100 | tee upmap-moves
> $ bash upmap-moves
>
> Repeat the above 3 steps until balance is achieved, then re-enable the 
> balancer and unset the "no" flags set earlier?

You don't want to run cancel-backfill after placementoptimizer,
otherwise it will undo the balancing backfill.

Josh
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Balancer Distribution Help

2022-09-23 Thread Stefan Kooman

On 9/23/22 17:05, Wyll Ingersoll wrote:


When doing manual remapping/rebalancing with tools like pgremapper and 
placementoptimizer, what are the recommended settings for norebalance, 
norecover, nobackfill?
Should the balancer module be disabled if we are manually issuing the pg remap 
commands generated by those scripts so it doesn't interfere?
Something like this:

$ ceph osd set norebalance
$ ceph osd set norecover
$ ceph osd set nobackfill
$ ceph balancer off

$ pgremapper cancel-backfill --yes   # to stop all pending operations
$ placementoptimizer.py balance --max-pg-moves 100 | tee upmap-moves
$ bash upmap-moves


Disabling the balancer is a good idea. After you are finished with 
placementoptimizer, it won't be able to do any work anyway, so you can 
safely turn it back on :-).


Setting the flags you suggested makes sense for the pgremapper phase. 
But as soon as everything is mapped back, you need to unset those, 
because you need to be able to move data around when optimizing. Otherwise it 
won't work (there might be a check on cluster state, not sure), or it will only 
start moving data once you unset those flags.


Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Freak issue every few weeks

2022-09-23 Thread Stefan Kooman

On 9/23/22 15:22, J-P Methot wrote:

Thank you for your reply,

discard is not enabled in our configuration as it is mainly the default 
conf. Are you suggesting to enable it?


No. There is no consensus on whether enabling it is a good idea (it depends on 
proper implementation, among other things). From my experience on Intel 
S4610 (during LVM cleanup at OSD reprovisioning) it spends a lot of time 
discarding blocks (IIRC an order of magnitude more than Samsung PM883). 
So I doubt it would help, although it's hard to tell. Maybe if it can do 
discards more often, each operation takes less time and might be less 
impactful. But this is all speculation. It might be firmware related. 
Do all disks have the same firmware? Are there disks that never experience 
this problem?
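
A hedged way to compare firmware revisions across drives (assuming smartmontools
is installed on the OSD hosts; the device glob is just an example):

for dev in /dev/sd?; do smartctl -i "$dev" | grep -E 'Device Model|Firmware Version'; done
# or cluster-wide, if the mgr devicehealth module is enabled:
ceph device ls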


Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Balancer Distribution Help

2022-09-23 Thread Wyll Ingersoll


When doing manual remapping/rebalancing with tools like pgremapper and 
placementoptimizer, what are the recommended settings for norebalance, 
norecover, nobackfill?
Should the balancer module be disabled if we are manually issuing the pg remap 
commands generated by those scripts so it doesn't interfere?

Something like this:

$ ceph osd set norebalance
$ ceph osd set norecover
$ ceph osd set nobackfill
$ ceph balancer off

$ pgremapper cancel-backfill --yes   # to stop all pending operations
$ placementoptimizer.py balance --max-pg-moves 100 | tee upmap-moves
$ bash upmap-moves

Repeat the above 3 steps until balance is achieved, then re-enable the balancer 
and unset the "no" flags set earlier?



From: Eugen Block 
Sent: Friday, September 23, 2022 2:21 AM
To: ceph-users@ceph.io 
Subject: [ceph-users] Re: Balancer Distribution Help

+1 for increasing PG numbers, those are quite low.

Zitat von Bailey Allison :

> Hi Reed,
>
> Just taking a quick glance at the Pastebin provided I have to say
> your cluster balance is already pretty damn good all things
> considered.
>
> We've seen the upmap balancer at it's best in practice provides a
> deviation of about 10-20% percent across OSDs which seems to be
> matching up on your cluster. It's something that as the more nodes
> and OSDs you add that are equal in size to the cluster, and as the
> PGs increase on the cluster it can do a better and better job of,
> but in practice about a 10% difference in OSDs is  very normal.
>
> Something to note in the video provided is that they were using a
> cluster with 28PB of storage available, so who knows how many
> OSDs/nodes/PGs per pool/etc., that their cluster has the luxury and
> ability to balance across.
>
> The only thing I can think to suggest is just increasing the PG
> count as you've already mentioned. The ideal setting is about 100
> PGs per OSD, and looking at your cluster both the SSDs and the
> smaller HDDs have only about 50 PGs per OSD.
>
> If you're able to get both of those devices to a closer to 100 PG
> per OSD ratio it should help a lot more with the balancing. More PGs
> means more places to distribute data.
>
> It will be tricky in that I am just noticing for the HDDs you have
> some hosts/chassis with 24 OSDs per and others with 6 HDDs per so
> getting the PG distribution more even for those will be challenging,
> but for the SSDs it should be quite simple to get those to be 100
> PGs per OSD.
>
> Just taking a further look it does appear on some OSDs although I
> will say across the entire cluster the actual data stored is
> balanced good, there are a couple of OSDs where the OMAP/metadata is
> not balanced as well as the others.
>
> Where you are using EC pools for CephFS, any OMAP data cannot be
> stored within EC so it will store all of that within a replication
> data cephfs pool, most likely your hdd_cephfs pool.
>
> Just something to keep in mind as not only is it important to make
> sure the data is balanced, but the OMAP data and metadata are
> balanced as well.
>
> Otherwise though I would recommended just trying to get your cluster
> to a point where each of the OSDs have roughly 100 PGs per OSD, or
> at least as close to this as you are able to given your clusters
> crush rulesets.
>
> This should then help the balancer spread the data across the
> cluster, but again unless I overlooked something your cluster
> already appears to be extremely well balanced.
>
> There is a PG calculator you can use online at:
>
> https://old.ceph.com/pgcalc/
>
> There is also a PG calc on the Redhat website but it requires a subscription.
>
> Both calculators are essentially the same but I have noticed the
> free one will round down the PGs and the Redhat one will round up
> the PGs.
>
> Regards,
>
> Bailey
>
> -Original Message-
> From: Reed Dier 
> Sent: September 22, 2022 4:48 PM
> To: ceph-users 
> Subject: [ceph-users] Balancer Distribution Help
>
> Hoping someone can point me to possible tunables that could
> hopefully better tighten my OSD distribution.
>
> Cluster is currently
>> "ceph version 15.2.16 (d46a73d6d0a67a79558054a3a5a72cb561724974)
>> octopus (stable)": 307
> With plans to begin moving to pacific before end of year, with a
> possible interim stop at octopus.17 on the way.
>
> Cluster was born on jewel, and is fully bluestore/straw2.
> The upmap balancer works/is working, but not to the degree that I
> believe it could/should work, which seems should be much closer to
> near perfect than what I’m seeing.
>
> https://imgur.com/a/lhtZswo  <-
> Histograms of my OSD distribution
>
> https://pastebin.com/raw/dk3fd4GH
>  <- pastebin of
> cluster/pool/crush relevant bits
>
> To put it succinctly, I’m hoping to get much tighter OSD
> distribution, but I’m not sure what knobs to try turning next, as
> the upmap balancer has gone as far as it can, and I end up playing
> “reweight the 

[ceph-users] Re: how to enable ceph fscache from kernel module

2022-09-23 Thread David Yang
I found some articles on the net where their ceph.ko depends on the
fscache module.


root@client:~# lsmod | grep ceph
ceph 376832 1
libceph 315392 1 ceph
fscache 65536 1 ceph
libcrc32c 16384 3 xfs,raid456,libceph
root@client:~# modinfo ceph
filename: /lib/modules/4.15.0-112-generic/kernel/fs/ceph/ceph.ko
license: GPL
description: Ceph filesystem for Linux
author: Patience Warnick 
author: Yehuda Sadeh 
author: Sage Weil 
alias: fs-ceph
srcversion: B2806F4EAACAC1E19EE7AFA
depends: libceph,fscache
retpoline: Y
intree: Y
name: ceph
vermagic: 4.15.0-112-generic SMP mod_unload
signat: PKCS#7
signer:
sig_key:
sig_hashalgo: md4
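
As far as I can tell, fscache support in ceph.ko is decided when the kernel is
built (CONFIG_CEPH_FSCACHE), so it cannot be switched on for an already-built
module. A hedged way to check on a distro kernel, and to use it if present
(mon address, client name and mount point below are made-up examples):

grep CEPH_FSCACHE /boot/config-$(uname -r)      # want CONFIG_CEPH_FSCACHE=y
grep '^CONFIG_FSCACHE' /boot/config-$(uname -r)
# if built in, add the "fsc" mount option and run cachefilesd on the client:
mount -t ceph mon1:6789:/ /mnt/cephfs -o name=client1,secretfile=/etc/ceph/secret.key,fsc

If CONFIG_CEPH_FSCACHE is not set (which seems to be the case for that elrepo
5.4 kernel), the only option is a kernel built with it enabled.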

David Yang  wrote on Fri, 23 Sep 2022 at 12:17:

> hi,
> I am using kernel client to mount cephFS filesystem on Centos8.2.
> But my ceph's kernel module does not contain fscache.
>
>
> [root@host ~]# uname -r
> 5.4.163-1.el8.elrepo.x86_64
> [root@host ~]# lsmod|grep ceph
> ceph 446464 0
> libceph 368640 1 ceph
> dns_resolver 16384 1 libceph
> libcrc32c 16384 2xfs, libceph
> [root@host ~]# modinfo ceph
> filename:
> /lib/modules/5.4.163-1.el8.elrepo.x86_64/kernel/fs/ceph/ceph.ko.xz
> license: GPL
> description: Ceph filesystem for Linux
> author: Patience Warnick 
> author: Yehuda Sadeh 
> author: Sage Weil 
> alias: fs-ceph
> srcversion: 0923A6EE91D4CE16BC32EA2
> depends: libceph
> retpoline: Y
> intree: Y
> name: ceph
> vermagic: 5.4.163-1.el8.elrepo.x86_64 SMP mod_unload modversions
>
>
> What should I do to enable fscache in the ceph module, thanks.
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Freak issue every few weeks

2022-09-23 Thread J-P Methot

Thank you for your reply,

discard is not enabled in our configuration, as it is mostly the default 
conf. Are you suggesting we enable it?


On 9/22/22 14:20, Stefan Kooman wrote:

Just guessing here: have you configured "discard":

bdev enable discard
bdev async discard

We've seen monitor slow ops when xfs was doing discard operations on 
the fs. Not sure if this could result in what you are seeing on OSDs.


--
Jean-Philippe Méthot
Senior Openstack system administrator
Administrateur système Openstack sénior
PlanetHoster inc.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Question about recovery priority

2022-09-23 Thread Josh Baergen
Hi Fulvio,

> leads to a much shorter and less detailed page, and I assumed Nautilus
> was far behind Quincy in managing this...

The only major change I'm aware of between Nautilus and Quincy is that
in Quincy the mClock scheduler is able to automatically tune up/down
backfill parameters to achieve better speed and/or balance with client
I/O. The reservation mechanics themselves are unchanged.

> Thanks for "pgremapper", will give it a try once I have finished current
> data movement: will it still work after I upgrade to Pacific?

We're not aware of any Pacific incompatibilities at this time (we've
tested it there and community members have used it against Pacific),
though the tool has most heavily been used on Luminous and Nautilus,
as the README implies.

> You are correct, it would be best to drain OSDs cleanly, and I see
> pgremapper has an option for this, great!

Despite its name, I don't usually recommend using the "drain" command
for draining a batch of OSDs. Confusing, I know! "Drain" is best used
when you intend to move the data back afterwards, and if you give it
multiple targets, it won't balance data across those targets. The
reason for this is that "drain" doesn't pay attention to the
CRUSH-preferred PG location or target fullness, and thus it can make
suboptimal placement choices.

For your use case, I would recommend downweighting the OSDs on the host
to 0.001 (can't be 0 - upmaps won't work) -> cancel-backfill (to map
data back to the host) -> undo-upmaps in a loop to optimally drain the
host.
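
Roughly, and heavily hedged (the exact pgremapper argument/flag syntax may
differ between versions - check `pgremapper undo-upmaps --help`; the OSD ids
are examples):

# 1) downweight every OSD on the host (not to 0, or upmaps stop working)
for id in 10 11 12; do ceph osd crush reweight osd.$id 0.001; done
# 2) pin the data back where it currently sits
pgremapper cancel-backfill --yes
# 3) then, in a loop, remove the pinning upmaps in controlled batches,
#    waiting for backfill between rounds, until the host is empty
pgremapper undo-upmaps 10 11 12 --yes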

Josh
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Changing daemon config at runtime: tell, injectargs, config set and their differences

2022-09-23 Thread Oliver Schmidt
Hi everyone,

while evaluating different config options at our Ceph cluster, I discovered 
that there are multiple ways to apply (ephemeral) config changes to specific 
running daemons. But even after researching docs and manpages, and doing some 
experiments, I fail to understand when to use which of the commands. Even 
worse, the behaviour of these commands appears to depend on the actual config 
option that is changed.

DISCLAIMER: I did my experiments on a Ceph Luminous cluster. While this clearly 
is a deprecated Ceph release, I am still interested in the subtleties of the 
individual commands even when you can only answer this for current Ceph 
releases. 


To my understanding, there are at least 4 commands to change ceph daemon config 
at runtime:

- `ceph tell <daemon.id> injectargs --option_name=value`
  - goes via monitor, mon instructs the daemon to do the injectargs
- `ceph daemon <daemon.id> config set "config option" value`
  - local daemons only, via adminsocket
  - also allows a `config get`
- `ceph tell <daemon.id> config set "option name" value`
- `ceph config set "config option" value`
  - via some central monitor config store
  - `config get` not yet implemented in Luminous

Is this overview correct, or do I have some misconceptions?

My experiments have shown that the behaviour of the commands differs depending 
on the config option changed:

Experiment 1: changing "osd max backfills" to 5 on daemon osd.6 and looking 
whether the value returned by `ceph daemon osd.6 config get "osd max backfills"` 
has changed.

the value has changed after:  `ceph tell injectargs`, `ceph tell config set`, 
`ceph daemon config set`
the value has not changed after: `ceph config set`

Experiment 2: changing "mon pg warn max object skew" to 20 on daemon 
mon.cartman09 and looking whether the value returned by `ceph daemon mon.cartman09 
config get "mon pg warn max object skew"` has changed.

the value has changed after:  `ceph tell injectargs`, `ceph daemon config set`
the value has not changed after: `ceph tell config set`, `ceph config set`

Even more confusing, in a cluster with a MANY_OBJECTS_PER_PG warning, only 
`ceph config set "mon pg warn max object skew" 20` resolved that warning, while 
a `ceph tell mon.\* injectargs --mon-pg-warn-max-object-skew=20` did not 
resolve the warning.


Can someone explain the subtle differences of these 4 commands to me? How does 
the central monitor config store relate to individual daemon configurations?
The full experiment logs can be found at the end of the mail.
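
For completeness, on Mimic or newer clusters the different layers can at least
be inspected side by side; a hedged sketch (daemon and option names taken from
my experiments, command names from current releases rather than Luminous):

ceph config get osd.6 osd_max_backfills         # what the central mon store would hand to osd.6
ceph config show osd.6 osd_max_backfills        # runtime value the daemon reports back via the mgr
ceph daemon osd.6 config get osd_max_backfills  # runtime value straight from the admin socket

Comparing those three after each kind of "set" should show which layer a
change actually landed in.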

Big thanks in advance

-- 
Oliver Schmidt · o...@flyingcircus.io · Systems Engineer
Flying Circus Internet Operations GmbH · http://flyingcircus.io
Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick





Experiment 1:

```
~ # ceph daemon osd.6 config get "osd max backfills"
{
"osd max backfills": "2"
}

~ # ceph tell osd.6 injectargs --osd-max-backfills=5
osd_max_backfills = '5' rocksdb_separate_wal_dir = 'true' (not observed, change 
may require restart)

~ # ceph daemon osd.6 config get "osd max backfills"
{
"osd max backfills": "5"
}

~ # 

~ # ceph daemon osd.6 config get "osd max backfills"
{
"osd max backfills": "2"
}

~ # ceph tell osd.6 config set "osd max backfills" 5
Set osd_max_backfills to 5

~ # ceph daemon osd.6 config get "osd max backfills"
{
"osd max backfills": "5"
}

~ # 

~ # ceph daemon osd.6 config get "osd max backfills"
{
"osd max backfills": "2"
}

~ # ceph daemon osd.6 config set "osd max backfills" 5
{
"success": "osd_max_backfills = '5' rocksdb_separate_wal_dir = 'true' (not 
observed, change may require restart) "
}

~ # ceph daemon osd.6 config get "osd max backfills"
{
"osd max backfills": "5"
}

~ # 

~ # ceph daemon osd.6 config get "osd max backfills"
{
"osd max backfills": "2"
}

~ # ceph config set "osd max backfills" 5
Set osd_max_backfills to 5

~ # ceph daemon osd.6 config get "osd max backfills"
{
"osd max backfills": "2"
}
```

===

Experiment 2:

```
root@cartman09 ~ # ceph daemon mon.cartman09 config get "mon pg warn max object 
skew"
{
"mon pg warn max object skew": "10.00"
}

root@cartman09 ~ # ceph tell mon.cartman09 injectargs 
--mon_pg_warn_max_object_skew=20
injectargs:mon_pg_warn_max_object_skew = '20.00' (not observed, change may 
require restart)

root@cartman09 ~ # ceph daemon mon.cartman09 config get "mon pg warn max object 
skew"
{
"mon pg warn max object skew": "20.00"
}

root@cartman09 ~ # 

root@cartman09 ~ # ceph daemon mon.cartman09 config get "mon pg warn max object 
skew"
{
"mon pg warn max object skew": "10.00"
}

root@cartman09 ~ # ceph tell mon.cartman09 config set "mon pg warn max object 
skew" 20
Set mon_pg_warn_max_object_skew to 20

root@cartman09 ~ # ceph daemon mon.cartman09 config get "mon pg warn max object 
skew"
{
"mon pg warn max object skew": "10.00"
}

root@cartman09 ~ # 

root@cartman09 ~ # ceph daemon 

[ceph-users] Why OSD could report spurious read errors.

2022-09-23 Thread Igor Fedotov

Hello All!

just to bring this knowledge to a wider audience...

Under some circumstances OSDs/clusters might report (and even suffer 
from) spurious disk read errors. The tracker comment re-posted below sheds 
light on the root cause. Many thanks to Canonical's folks for that.


Originally posted at: https://tracker.ceph.com/issues/22464#note-72

"

At Canonical we tracked down and solved the cause of this bug. Credit to 
my colleague Mauricio Faria de Oliveira for identifying and fixing the 
issue. We fixed this a little while ago but this bug never got updated 
with the details, so adding them for future travellers.


The true cause is a bug in the Linux MADV_FREE implementation, which was 
first introduced in Linux v4.5. It's a race condition between MADV_FREE 
and Direct I/O that is triggered under memory pressure.


Upstream kernel fix with very detailed analysis in the commit message is 
here:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6c8e2a256915a223f6289f651d6b926cd7135c9e

MADV_FREE is not directly used by Ceph so much as by tcmalloc. MADV_FREE 
was used by tcmalloc (gperftools) based on a compile-time detection. In 
2016 they then disabled use of MADV_FREE on linux because it was 
untested - released into v2.5.90 and v2.6+


Hence to hit this issue you needed to have a tcmalloc that was compiled 
on Linux v4.5+, running on Linux v4.5+ and before they intentionally 
disabled support for MADV_FREE. See this issue for details on disabling 
MADV_FREE:

https://github.com/gperftools/gperftools/issues/780

This was the case in Ubuntu Bionic 18.04 which shipped v2.5

Seems many moved on since but if you do experience this then upgrade to 
a kernel with the above fix:

mm: fix race between MADV_FREE reclaim and blkdev direct IO read

It typically manifests in two different ways, sometimes the checksum 
fails at the bluefs layer in which case newer Ceph versions added a 
retry on the read which often works around it since you don't hit the 
race twice. But you can also hit it in rocksdb which crashes the OSD.

"

--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Any disadvantage to go above the 100pg/osd or 4osd/disk?

2022-09-23 Thread Eugen Block
Well, if that issue occurs it will be at the beginning of the  
recovery, so you may not notice it until you get inactive PGs. We hit  
that limit when we rebuilt all OSDs on one server with many EC chunks.  
Setting osd_max_pg_per_osd_hard_ratio to 5 (default 3) helped avoid  
inactive PGs for all other nodes, so we leave it like that in case a  
server goes down unintentionally.
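
For reference, a hedged example of how those knobs can be set and inspected via
the central config store (the 5 is just the value that worked for us, not a
general recommendation):

$ ceph config set osd osd_max_pg_per_osd_hard_ratio 5   # default 3
$ ceph config get osd mon_max_pg_per_osd                # default 250 on recent releases
# an OSD refuses new PGs once it would exceed roughly
# mon_max_pg_per_osd * osd_max_pg_per_osd_hard_ratio placement groups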



Zitat von "Szabo, Istvan (Agoda)" :

Good to know thank you, so in that case during recovery it worth to  
increase those values right?


Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

-Original Message-
From: Eugen Block 
Sent: Friday, September 23, 2022 1:19 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Any disadvantage to go above the 100pg/osd  
or 4osd/disk?





Hi,

I can't speak from the developers perspective, but we discussed this  
just recently intenally and with a customer. We doubled the number  
of PGs on one of our customer's data pools from around 100 to 200  
PGs/OSD (HDDs with rocksDB on SSDs). We're still waiting for the  
final conclusion if the performance has increased or not, but it  
seems to work as expected. We probably would double it again if the  
PG size/objects per PG would affect the performance again. You just  
need to be aware of the mon_max_pg_per_osd and  
osd_max_pg_per_osd_hard_ratio configs in case of recovery. Otherwise  
we don't see any real issue with 200 or 400 PGs/OSD if the nodes can  
handle it.


Regards,
Eugen

Zitat von "Szabo, Istvan (Agoda)" :


Hi,

My question is, is there any technical limit to have 8osd/ssd and on
each of them 100pg if the memory and cpu resource available (8gb
memory/osd and 96vcore)?
The iops and bandwidth on the disks are very low so I don’t see any
issue to go with this.

In my cluster I’m using 15.3TB ssds. We have more than 2 billions of
objects in each of the 3 clusters.
The bottleneck is the pg/osd so last time when my serious issue solved
the solution was to bump the pg-s of the data pool the allowed maximum
with 4:2 ec.

I’m curious of the developers opinion also.

Thank you,
Istvan


___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an  
email to ceph-users-le...@ceph.io







___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Question about recovery priority

2022-09-23 Thread Fulvio Galeazzi

Hallo Josh thanks for your feedback!

On 9/22/22 14:44, Josh Baergen wrote:

Hi Fulvio,

https://docs.ceph.com/en/quincy/dev/osd_internals/backfill_reservation/
describes the prioritization and reservation mechanism used for
recovery and backfill. AIUI, unless a PG is below min_size, all
backfills for a given pool will be at the same priority.
force-recovery will modify the PG priority but doing so can have a
very delayed effect because a given backfill can be waiting behind a
bunch of other backfills that have acquired partial reservations,
which in turn are waiting behind other backfills that have partial
reservations, etc. etc. Once one is doing degraded backfill, they've
lost a lot of control over their system.


Yes I had found that page, which together with
https://docs.ceph.com/en/quincy/dev/osd_internals/recovery_reservation/
 explains the mechanism causing reservations to be waiting behind others...
However I am still on Nautilus and "sed -e 's/quincy/nautilus/' " 
leads to a much shorter and less detailed page, and I assumed Nautilus 
was far behind Quincy in managing this... in any case, I guess it's good 
to upgrade, and take advantage of software developments.



Rather than ripping out hosts like you did here, operators that want
to retain control will drain hosts without degradation.
https://github.com/digitalocean/pgremapper is one tool that can help
with this, though depending on the size of the system one can
sometimes simply downweight the host and then wait.


Thanks for "pgremapper", will give it a try once I have finished current 
data movement: will it still work after I upgrade to Pacific?


You are correct, it would be best to drain OSDs cleanly, and I see 
pgremapper has an option for this, great!
However, in my cluster (14 servers with ~20 disks each, ~3 PB raw space: 
cinder ~1PB, rgw~0.9PB) I see that draining (by reweighting to 0.) works 
nicely and predictably for replicated pools (1-2 days) but is terribly 
slow for my rgw 6+4 EC pool (>week): that's why I normally reweight up 
to some point and then rip 1 or 2 OSDs when I am fed up.
(By the way, the choice of 6+4 goes back a few years, and was picked 
primarily as a compromise between space lost to redundancy and 
resilience to failures, when the cluster was much smaller: I should run a 
few extensive tests and see whether it's worth trying a different m+n.)


  Thanks again!

Fulvio




Josh

On Thu, Sep 22, 2022 at 6:35 AM Fulvio Galeazzi  wrote:


Hallo all,
   taking advantage of the redundancy of my EC pool, I destroyed a
couple of servers in order to reinstall them with a new operating system.
I am on Nautilus (but will evolve soon to Pacific), and today I am
not in "emergency mode": this is just for my education.  :-)

"ceph pg dump" shows a couple pg's with 3 missing chunks, some other
with 2, several with 1 missing chunk: that's fine and expected.
Having looked at it for a while, I think I understand the recovery queue
is unique: there is no internal higher priority for 3-missing-chunks PGs
wrt 1-missing-chunk PGs, right?
I tried to issue "ceph pg force-recovery" on the few worst-degraded PGs
but, apparently, numbers of 3-missing 2-missing and 1-missing are going
down at the same relative speed.
 Is this expected? Can I do something to "guide" the process?

Thanks for your hints

 Fulvio

--
Fulvio Galeazzi
GARR-CSD Department
skype: fgaleazzi70
tel.: +39-334-6533-250
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


--
Fulvio Galeazzi
GARR-CSD Department
tel.: +39-334-6533-250
skype: fgaleazzi70
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Any disadvantage to go above the 100pg/osd or 4osd/disk?

2022-09-23 Thread Szabo, Istvan (Agoda)
Good to know, thank you. So in that case, is it worth increasing those values 
during recovery?

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

-Original Message-
From: Eugen Block 
Sent: Friday, September 23, 2022 1:19 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Any disadvantage to go above the 100pg/osd or 
4osd/disk?



Hi,

I can't speak from the developers perspective, but we discussed this just 
recently intenally and with a customer. We doubled the number of PGs on one of 
our customer's data pools from around 100 to 200 PGs/OSD (HDDs with rocksDB on 
SSDs). We're still waiting for the final conclusion if the performance has 
increased or not, but it seems to work as expected. We probably would double it 
again if the PG size/objects per PG would affect the performance again. You 
just need to be aware of the mon_max_pg_per_osd and 
osd_max_pg_per_osd_hard_ratio configs in case of recovery. Otherwise we don't 
see any real issue with 200 or 400 PGs/OSD if the nodes can handle it.

Regards,
Eugen

Zitat von "Szabo, Istvan (Agoda)" :

> Hi,
>
> My question is, is there any technical limit to have 8osd/ssd and on
> each of them 100pg if the memory and cpu resource available (8gb
> memory/osd and 96vcore)?
> The iops and bandwidth on the disks are very low so I don’t see any
> issue to go with this.
>
> In my cluster I’m using 15.3TB ssds. We have more than 2 billions of
> objects in each of the 3 clusters.
> The bottleneck is the pg/osd so last time when my serious issue solved
> the solution was to bump the pg-s of the data pool the allowed maximum
> with 4:2 ec.
>
> I’m curious of the developers opinion also.
>
> Thank you,
> Istvan
>
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to 
ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Balancer Distribution Help

2022-09-23 Thread Eugen Block

+1 for increasing PG numbers, those are quite low.
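
The usual rule of thumb behind the ~100 PGs/OSD target, as a hedged
back-of-the-envelope (the numbers below are made up for illustration only):

# pg_num per pool ~= (OSDs the pool spans * target PGs/OSD * pool's share of data)
#                    / (replica size, or k+m for EC), rounded to a power of two.
# e.g. 60 OSDs, target 100 PGs/OSD, a pool holding ~50% of the data, size 3:
#   60 * 100 * 0.5 / 3 = 1000  -> pg_num 1024
$ ceph osd pool set hdd_cephfs pg_num 1024   # hypothetical pool name and value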

Zitat von Bailey Allison :


Hi Reed,

Just taking a quick glance at the Pastebin provided I have to say  
your cluster balance is already pretty damn good all things  
considered.


We've seen the upmap balancer at its best in practice provide a  
deviation of about 10-20% across OSDs, which seems to be  
matching up on your cluster. As you add more nodes  
and OSDs of equal size to the cluster, and as the  
PG count increases, it can do a better and better job,  
but in practice about a 10% difference between OSDs is very normal.


Something to note in the video provided is that they were using a  
cluster with 28PB of storage available, so who knows how many  
OSDs/nodes/PGs per pool/etc., that their cluster has the luxury and  
ability to balance across.


The only thing I can think to suggest is just increasing the PG  
count as you've already mentioned. The ideal setting is about 100  
PGs per OSD, and looking at your cluster both the SSDs and the  
smaller HDDs have only about 50 PGs per OSD.


If you're able to get both of those devices to a closer to 100 PG  
per OSD ratio it should help a lot more with the balancing. More PGs  
means more places to distribute data.


It will be tricky in that I am just noticing for the HDDs you have  
some hosts/chassis with 24 OSDs per and others with 6 HDDs per so  
getting the PG distribution more even for those will be challenging,  
but for the SSDs it should be quite simple to get those to be 100  
PGs per OSD.


Taking a further look, although I will say that across the entire  
cluster the actual data stored is balanced well, there are a couple  
of OSDs where the OMAP/metadata is not balanced as well as the others.


Since you are using EC pools for CephFS, OMAP data cannot be  
stored within EC, so it will all be stored within a replicated  
CephFS data pool, most likely your hdd_cephfs pool.


Just something to keep in mind as not only is it important to make  
sure the data is balanced, but the OMAP data and metadata are  
balanced as well.


Otherwise, though, I would recommend just trying to get your cluster  
to a point where each of the OSDs has roughly 100 PGs, or  
at least as close to this as you are able to given your cluster's  
crush rulesets.


This should then help the balancer spread the data across the  
cluster, but again unless I overlooked something your cluster  
already appears to be extremely well balanced.


There is a PG calculator you can use online at:

https://old.ceph.com/pgcalc/

There is also a PG calc on the Redhat website but it requires a subscription.

Both calculators are essentially the same but I have noticed the  
free one will round down the PGs and the Redhat one will round up  
the PGs.


Regards,

Bailey

-Original Message-
From: Reed Dier 
Sent: September 22, 2022 4:48 PM
To: ceph-users 
Subject: [ceph-users] Balancer Distribution Help

Hoping someone can point me to possible tunables that could  
hopefully better tighten my OSD distribution.


Cluster is currently

"ceph version 15.2.16 (d46a73d6d0a67a79558054a3a5a72cb561724974)
octopus (stable)": 307
With plans to begin moving to pacific before end of year, with a  
possible interim stop at octopus.17 on the way.


Cluster was born on jewel, and is fully bluestore/straw2.
The upmap balancer works/is working, but not to the degree that I  
believe it could/should work, which seems should be much closer to  
near perfect than what I’m seeing.


https://imgur.com/a/lhtZswo  <-  
Histograms of my OSD distribution


https://pastebin.com/raw/dk3fd4GH  
 <- pastebin of  
cluster/pool/crush relevant bits


To put it succinctly, I’m hoping to get much tighter OSD  
distribution, but I’m not sure what knobs to try turning next, as  
the upmap balancer has gone as far as it can, and I end up playing  
“reweight the most full OSD whack-a-mole as OSD’s get nearful.”


My goal is obviously something akin to this perfect distribution  
like here: https://www.youtube.com/watch?v=niFNZN5EKvE=1353s  



I am looking to tweak the PG counts for a few pool.
Namely the ssd-radosobj has shrunk in size and needs far fewer PGs now.
Similarly hdd-cephfs shrunk in size as well and needs fewer PGs (as  
ceph health shows).
And on the flip side, ec*-cephfs likely need more PGs as they have  
grown in size.
However I was hoping to get more breathing room of free space on my  
most full OSDs before starting to do big PG expand/shrink.


I am assuming that my whacky mix of replicated vs multiple EC  
storage pools coupled with hybrid SSD+HDD pools is throwing off the  
balance more than if it was a more homogenous crush ruleset, but  
this is what exists and is what I’m working with.
Also, since it will look odd in the tree 

[ceph-users] Re: Any disadvantage to go above the 100pg/osd or 4osd/disk?

2022-09-23 Thread Eugen Block

Hi,

I can't speak from the developers' perspective, but we discussed this  
just recently internally and with a customer. We doubled the number of  
PGs on one of our customer's data pools from around 100 to 200 PGs/OSD  
(HDDs with RocksDB on SSDs). We're still waiting for the final  
conclusion on whether the performance has increased or not, but it seems to  
work as expected. We probably would double it again if the PG  
size/objects per PG affected the performance again. You just need  
to be aware of the mon_max_pg_per_osd and  
osd_max_pg_per_osd_hard_ratio configs in case of recovery. Otherwise  
we don't see any real issue with 200 or 400 PGs/OSD if the nodes can  
handle it.


Regards,
Eugen

Zitat von "Szabo, Istvan (Agoda)" :


Hi,

My question is, is there any technical limit to have 8osd/ssd and on  
each of them 100pg if the memory and cpu resource available (8gb  
memory/osd and 96vcore)?
The iops and bandwidth on the disks are very low so I don’t see any  
issue to go with this.


In my cluster I’m using 15.3TB ssds. We have more than 2 billions of  
objects in each of the 3 clusters.
The bottleneck is the pg/osd so last time when my serious issue  
solved the solution was to bump the pg-s of the data pool the  
allowed maximum with 4:2 ec.


I’m curious of the developers opinion also.

Thank you,
Istvan



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Balancer Distribution Help

2022-09-23 Thread Stefan Kooman

On 9/22/22 21:48, Reed Dier wrote:



Any tips or help would be greatly appreciated.


Try JJ's Ceph balancer [1]. In our case it turned out to be *way* more 
efficient than built-in balancer (faster conversion, less movements 
involved). And able to achieve a very good PG distribution and "reclaim" 
lot's of space. I Highly recommended it.


Gr. Stefan

[1]: https://github.com/TheJJ/ceph-balancer
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io