[ceph-users] Re: rados df vs ls

2022-07-14 Thread stuart.anderson


> On Jul 14, 2022, at 11:11 AM, Janne Johansson  wrote:
> 
> Den ons 13 juli 2022 kl 04:33 skrev stuart.anderson 
> :
>> 
>>> On Jul 6, 2022, at 10:30 AM, stuart.anderson  
>>> wrote:
>>> 
>>> I am wondering if it is safe to delete the following pool that rados ls 
>>> reports is empty, but rados df indicates has a few thousand objects?
>> 
>> Please excuse reposting, but as a Ceph newbie I would really appreciate 
>> advice from someone more knowledgeable on whether it is safe to delete a pool 
>> with an empty "rados ls" output even though "rados df" reports a few thousand 
>> objects (in a cluster with 500M objects)?
> 
> "Safe" depends more on if any client will be looking for these objects or not.

Since the only client I am interested in for this pool is CephFS, and 
find+getfattr doesn't report any references, I am going to go for it.
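
For reference, the check went roughly like this (a sketch rather than the exact 
commands; the mount point is an example):

# List which data pool each file's layout points at via the CephFS
# virtual xattrs, and look for any hits on the suspect pool
find /mnt/cephfs -type f \
    -exec getfattr --absolute-names -n ceph.file.layout.pool {} + 2>/dev/null \
    | grep -B1 'fs.data.user.hdd.ec'

# Directory layouts can also reference a pool
find /mnt/cephfs -type d \
    -exec getfattr --absolute-names -n ceph.dir.layout.pool {} + 2>/dev/null \
    | grep -B1 'fs.data.user.hdd.ec'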

> Still, small pools can be dumped to an output file with "rbd export"

Good idea. I used rados export instead of rbd export, and it produced the 
following trivially small 52-byte output file, much smaller than the 9 GiB 
reported by rados df and presumably just some header metadata:

[root@ceph-admin ~]# rados -p fs.data.user.hdd.ec export 
rados.fs.data.user.hdd.ec.export

[root@ceph-admin ~]# od -x rados.fs.data.user.hdd.ec.export 
000 ffce ffce 0002  0012  000a 
020 0101 000c  ffce 0a0a   
040  0101 000c  ffce 0b0b  
060  
064

[root@ceph-admin ~]# rados df
POOL_NAME                USED  OBJECTS ...
...
fs.data.user.hdd.ec   9.0 GiB     2319


> IIRC, so I would try that first, to a place with a decent amount of
> space so I have a backup.
> Then remove the pool and if no system or client notices for a
> reasonable amount of time, stop keeping the backup. If something
> needed the "hidden" objects, re-import the pool again.
> 
>> Is there another way to find the objects that rados df is counting?
> 
> See if there is an option to list "all" objects, or objects from all namespaces,
> or similar; that could turn up something.

Another good idea. However, rados ls still comes back empty even with the 
optional --all flag.
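
The invocations were essentially the following (exact flags may have varied):

rados -p fs.data.user.hdd.ec ls --all      # all namespaces
rados -p fs.data.user.hdd.ec -N "" ls      # just the default namespace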

Thanks.

---
ander...@ligo.caltech.edu



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: rados df vs ls

2022-07-14 Thread Janne Johansson
Den ons 13 juli 2022 kl 04:33 skrev stuart.anderson :
>
>
> > On Jul 6, 2022, at 10:30 AM, stuart.anderson  
> > wrote:
> >
> > I am wondering if it is safe to delete the following pool that rados ls 
> > reports is empty, but rados df indicates has a few thousand objects?
>
> Please excuse reposting, but as a Ceph newbie I would really appreciate 
> advice from someone more knowledgeable on whether it is safe to delete a pool with an 
> empty "rados ls" output even though "rados df" reports a few thousand 
> objects (in a cluster with 500M objects)?

"Safe" depends more on if any client will be looking for these objects or not.

Still, small pools can be dumped to an output file with "rbd export"
IIRC, so I would try that first, to a place with a decent amount of
space so I have a backup.
Then remove the pool and if no system or client notices for a
reasonable amount of time, stop keeping the backup. If something
needed the "hidden" objects, re-import the pool again.

> Is there another way to find the objects that rados df is counting?

See if there is an option to list "all" objects, or objects from all namespaces,
or similar; that could turn up something.

> Or is there something like "fsck" or "scrub" to resync the output of rados ls & df?

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephadm host maintenance

2022-07-14 Thread Kai Stian Olstad

On 14.07.2022 11:01, Steven Goodliff wrote:

If I get anywhere with
detecting whether the instance is the active manager and handling that in Ansible,
I will reply back here.


I use this

- command: ceph mgr stat
  register: r

- debug: msg={{ (r.stdout | from_json).active_name.split(".")[0] }}


This works because the first part of the instance name is the hostname.
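
An equivalent check straight from a shell, assuming jq is available:

# Hostname of the node currently running the active mgr
ceph mgr stat | jq -r '.active_name | split(".") | .[0]'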

--
Kai Stian Olstad
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: rbd iostat requires pool specified

2022-07-14 Thread Ilya Dryomov
On Wed, Jul 13, 2022 at 10:50 PM Reed Dier  wrote:
>
> Hoping this may be trivial to point me towards, but I typically keep a 
> background screen running `rbd perf image iostat` that shows all of the rbd 
> devices with IO, and how busy each disk may be at any given moment.
>
> Recently, after upgrading everything to the latest octopus release (15.2.16), it 
> no longer allows omitting the pool, which means I can't blend 
> all rbd pools together into a single view.
> How it used to appear:
> > NAME                  WR    RD   WR_BYTES    RD_BYTES     WR_LAT    RD_LAT
> > rbd-ssd/app1       322/s   0/s  5.6 MiB/s       0 B/s    2.28 ms   0.00 ns
> > rbd-ssd/app2       223/s   5/s  2.1 MiB/s   147 KiB/s    3.56 ms   1.12 ms
> > rbd-hybrid/app3     76/s   0/s   11 MiB/s       0 B/s   16.61 ms   0.00 ns
> > rbd-hybrid/app4     11/s   0/s  395 KiB/s       0 B/s   51.29 ms   0.00 ns
> > rbd-hybrid/app5      3/s   0/s   74 KiB/s       0 B/s  151.54 ms   0.00 ns
> > rbd-hybrid/app6      0/s   0/s   42 KiB/s       0 B/s   13.90 ms   0.00 ns
> > rbd-hybrid/app7      0/s   0/s  2.4 KiB/s       0 B/s    1.70 ms   0.00 ns
> >
> > NAME                  WR    RD   WR_BYTES    RD_BYTES     WR_LAT    RD_LAT
> > rbd-ssd/app1       483/s   0/s  7.3 MiB/s       0 B/s    2.17 ms   0.00 ns
> > rbd-ssd/app2       279/s   5/s  2.5 MiB/s    69 KiB/s    3.82 ms 516.30 us
> > rbd-hybrid/app3    147/s   0/s   10 MiB/s       0 B/s    8.59 ms   0.00 ns
> > rbd-hybrid/app6     10/s   0/s  425 KiB/s       0 B/s   75.79 ms   0.00 ns
> > rbd-hybrid/app8      0/s   0/s  2.4 KiB/s       0 B/s    1.85 ms   0.00 ns
>
>
> > $ uname -r && rbd --version && rbd perf image iostat
> > 5.4.0-107-generic
> > ceph version 15.2.16 (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)
> > rbd: mgr command failed: (2) No such file or directory: [errno 2] RADOS object not found (Pool 'rbd' not found)
>
> This is ubuntu 20.04, using packages rather than cephadm.
> I do not have a pool named `rbd` so that is correct, but I have a handful of 
> pools with the rbd application set.
>
> > $ for pool in rbd-{ssd,hybrid,ec82} ; do ceph osd pool application get $pool ; done
> > {
> >     "rbd": {}
> > }
> > {
> >     "rbd": {}
> > }
> > {
> >     "rbd": {}
> > }
>
> Looking at the help output, it doesn’t seem to imply that the `pool-spec` is 
> optional, and it won’t take wildcard globs like `rbd*` for the pool name.
>
> > $ rbd help perf image iostat
> > usage: rbd perf image iostat [--pool <pool>] [--namespace <namespace>]
> >                              [--iterations <iterations>] [--sort-by <sort-by>]
> >                              [--format <format>] [--pretty-format]
> >                              <pool-spec>
> >
> > Display image IO statistics.
> >
> > Positional arguments
> >   <pool-spec>                pool specification
> >                              (example: <pool-name>[/<namespace>])
> >
> > Optional arguments
> >   -p [ --pool ] arg          pool name
> >   --namespace arg            namespace name
> >   --iterations arg           iterations of metric collection [> 0]
> >   --sort-by arg (=write_ops) sort-by IO metric (write-ops, read-ops,
> >                              write-bytes, read-bytes, write-latency,
> >                              read-latency) [default: write-ops]
> >   --format arg               output format (plain, json, or xml) [default:
> >                              plain]
> >   --pretty-format            pretty formatting (json and xml)
>
> Setting a pool name to one of my rbd pools, either as pool-spec or -p/--pool, 
> works, but obviously only for that pool and not for *all* rbd pools, as it 
> functioned previously in what appears to have been 15.2.13.
> I didn't see a PR in the 15.2.14-16 release notes that seemed to mention 
> changes to rbd that would affect this, but I could have glossed over 
> something.

Hi Reed,

You have stumbled onto a regression:

https://tracker.ceph.com/issues/56561

I will address it ASAP and we should be able to include the fix in the
upcoming 15.2.17 release.
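
Until then, one possible stopgap is to iterate over the rbd pools explicitly; a
rough sketch using the pool names from your message:

# Not a perfect substitute for the old combined view, but it keeps
# per-image stats visible for every pool
while true; do
    for pool in rbd-ssd rbd-hybrid rbd-ec82; do
        rbd perf image iostat "$pool" --iterations 1
    done
done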

Thanks,

Ilya
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pacific doesn't defer small writes for pre-pacific hdd osds

2022-07-14 Thread Dan van der Ster
OK I recreated one OSD. It now has 4k min_alloc_size:

2022-07-14T10:52:58.382+0200 7fe5ec0aa200  1
bluestore(/var/lib/ceph/osd/ceph-0/) _open_super_meta min_alloc_size
0x1000

and I tested all these bluestore_prefer_deferred_size_hdd values:

4096: not deferred
4097: "_do_alloc_write deferring 0x1000 write via deferred"
65536: "_do_alloc_write deferring 0x1000 write via deferred"
65537: "_do_alloc_write deferring 0x1000 write via deferred"

With bluestore_prefer_deferred_size_hdd = 64k, I see that writes up to
0xf000 are deferred, e.g.:

 _do_alloc_write deferring 0xf000 write via deferred
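
For reference, each value was checked roughly like this (the log path and debug
level are assumptions, not the exact commands):

# Apply the candidate threshold at runtime -- no OSD restart needed
ceph config set osd bluestore_prefer_deferred_size_hdd 65537

# Raise bluestore logging on the test OSD and run the small-write probe
ceph tell osd.0 config set debug_bluestore 20
rados bench -p test 10 write -b 4096 -t 1

# Look for the deferral decision in that OSD's log
grep 'deferring' /var/log/ceph/ceph-osd.0.log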

Cheers, Dan

On Thu, Jul 14, 2022 at 9:37 AM Konstantin Shalygin  wrote:
>
> Dan, did you test redeploying one of your OSDs with the default pacific 
> bluestore_min_alloc_size_hdd (4096)?
> Would that also resolve this issue (i.e. a cluster is simply not affected when 
> all options are left at their defaults)?
>
>
> Thanks,
> k
>
> On 14 Jul 2022, at 08:43, Dan van der Ster  wrote:
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephadm host maintenance

2022-07-14 Thread Steven Goodliff


Thanks for the replies,


It feels to me that cephadm should handle this case, since it offers the 
maintenance function. Right now I have a simple version of a playbook that just 
sets noout, patches the OS, reboots, and unsets noout (similar to 
https://github.com/ceph/ceph-ansible/blob/main/infrastructure-playbooks/untested-by-ci/cluster-maintenance.yml
 ), and a different version that attempts host maintenance but fails on the 
instance that is running the mgr. If I get anywhere with detecting whether the 
instance is the active manager and handling that in Ansible, I will reply back here.
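
The noout variant boils down to something like this around the OS patching and
reboot (a simplified sketch of what the playbook shells out to):

ceph osd set noout
# ... patch the OS and reboot the host ...
# wait until the host's OSDs report up again, e.g. check:
ceph osd stat
ceph osd unset noout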


Cheers

Steven Goodliff




From: Robert Gallop 
Sent: 13 July 2022 16:55
To: Adam King
Cc: Steven Goodliff; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: cephadm host maintenance

This brings up a good follow-on: rebooting in general for OS patching.

I have not been leveraging the maintenance mode function, as I found it was 
really no different from just setting noout and doing the reboot.  I find that if 
the box is the active manager the failover happens quickly, painlessly and 
automatically.  All the OSDs just show as missing and come back once the box 
is back from the reboot.

Am I causing issues I may not be aware of?  How is everyone handling patching 
reboots?

The only place I'm careful is the active MDS nodes: since that failover does 
cause a period of no I/O for the mounted clients, I generally fail it 
manually, so I don't have to wait for the MDS to figure out an 
instance is gone and spin up a standby.

Any tips or techniques until there is a more holistic approach?

Thanks!


On Wed, Jul 13, 2022 at 9:49 AM Adam King 
mailto:adk...@redhat.com>> wrote:
Hello Steven,

Arguably, it should, but right now nothing is implemented to do so and
you'd have to manually run the "ceph mgr fail
node2-cobj2-atdev1-nvan.ghxlvw" before it would allow you to put the host
in maintenance. It's non-trivial from a technical point of view to have it
automatically do the switch as the cephadm instance is running on that
active mgr, so it will have to store somewhere that we wanted this host in
maintenance, fail over the mgr itself, then have the new cephadm instance
pick up that we wanted the host in maintenance and do so. Possible, but not
something anyone has had a chance to implement. FWIW, I do believe there
are also plans to eventually have a playbook for a rolling reboot or
something of the sort added to https://github.com/ceph/cephadm-ansible. But
for now, I think some sort of intervention to cause the fail over to happen
before running the maintenance enter command is necessary.
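
In other words, something along these lines per host for now (a sketch; the host
name is inferred from the mgr daemon name in that error message):

# If the host runs the active mgr, fail it over first
ceph mgr fail node2-cobj2-atdev1-nvan.ghxlvw

# Then put the host into maintenance, patch/reboot it, and bring it back
ceph orch host maintenance enter node2-cobj2-atdev1-nvan
# ... reboot the host ...
ceph orch host maintenance exit node2-cobj2-atdev1-nvan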

Regards,
 - Adam King

On Wed, Jul 13, 2022 at 11:02 AM Steven Goodliff <
steven.goodl...@globalrelay.net> wrote:

>
> Hi,
>
>
> I'm trying to reboot a ceph cluster one instance at a time by running in a
> Ansible playbook which basically runs
>
>
> cephadm shell ceph orch host maintenance enter   and then
> reboots the instance and exits the maintenance
>
>
> but i get
>
>
> ALERT: Cannot stop active Mgr daemon, Please switch active Mgrs with 'ceph
> mgr fail node2-cobj2-atdev1-nvan.ghxlvw'
>
>
> on one instance.  should cephadm handle the switch ?
>
>
> thanks
>
> Steven Goodliff
> Global Relay
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to 
> ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to 
ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pacific doesn't defer small writes for pre-pacific hdd osds

2022-07-14 Thread Konstantin Shalygin
Dan, did you test redeploying one of your OSDs with the default pacific 
bluestore_min_alloc_size_hdd (4096)?
Would that also resolve this issue (i.e. a cluster is simply not affected when 
all options are left at their defaults)?


Thanks,
k

> On 14 Jul 2022, at 08:43, Dan van der Ster  wrote:
> 

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pacific doesn't defer small writes for pre-pacific hdd osds

2022-07-14 Thread Zakhar Kirpichenko
Many thanks, Dan. Much appreciated!

/Z

On Thu, 14 Jul 2022 at 08:43, Dan van der Ster  wrote:

> Yes, that is correct. No need to restart the osds.
>
> .. Dan
>
>
> On Thu., Jul. 14, 2022, 07:04 Zakhar Kirpichenko, 
> wrote:
>
>> Hi!
>>
>> My apologies for butting in. Please confirm
>> that bluestore_prefer_deferred_size_hdd is a runtime option, which doesn't
>> require OSDs to be stopped or rebuilt?
>>
>> Best regards,
>> Zakhar
>>
>> On Tue, 12 Jul 2022 at 14:46, Dan van der Ster 
>> wrote:
>>
>>> Hi Igor,
>>>
>>> Thank you for the reply and information.
>>> I confirm that `ceph config set osd bluestore_prefer_deferred_size_hdd
>>> 65537` correctly defers writes in my clusters.
>>>
>>> Best regards,
>>>
>>> Dan
>>>
>>>
>>>
>>> On Tue, Jul 12, 2022 at 1:16 PM Igor Fedotov 
>>> wrote:
>>> >
>>> > Hi Dan,
>>> >
>>> > I can confirm this is a regression introduced by
>>> https://github.com/ceph/ceph/pull/42725.
>>> >
>>> > Indeed strict comparison is a key point in your specific case but
>>> generally  it looks like this piece of code needs more redesign to better
>>> handle fragmented allocations (and issue deferred write for every short
>>> enough fragment independently).
>>> >
>>> > So I'm looking for a way to improve that at the moment. Will fall back
>>> to the trivial comparison fix if I fail to find a better solution.
>>> >
>>> > Meanwhile you can adjust bluestore_prefer_deferred_size_hdd indeed, but I'd
>>> prefer not to raise it as high as 128K, to avoid too many writes being
>>> deferred (and hence overburdening the DB).
>>> >
>>> > IMO setting the parameter to 64K+1 should be fine.
>>> >
>>> >
>>> > Thanks,
>>> >
>>> > Igor
>>> >
>>> > On 7/7/2022 12:43 AM, Dan van der Ster wrote:
>>> >
>>> > Hi Igor and others,
>>> >
>>> > (apologies for html, but i want to share a plot ;) )
>>> >
>>> > We're upgrading clusters to v16.2.9 from v15.2.16, and our simple
>>> "rados bench -p test 10 write -b 4096 -t 1" latency probe showed something
>>> is very wrong with deferred writes in pacific.
>>> > Here is an example cluster, upgraded today:
>>> > [inline latency plot omitted from the list archive]
>>> > The OSDs are 12TB HDDs, formatted in nautilus with the default
>>> bluestore_min_alloc_size_hdd = 64kB, and each have a large flash block.db.
>>> >
>>> > I found that the performance issue is because 4kB writes are no longer
>>> deferred from those pre-pacific hdds to flash in pacific with the default
>>> config !!!
>>> > Here are example bench writes from both releases:
>>> https://pastebin.com/raw/m0yL1H9Z
>>> >
>>> > I worked out that the issue is fixed if I set
>>> bluestore_prefer_deferred_size_hdd = 128k (up from the 64k pacific default;
>>> note the default was 32k in octopus).
>>> >
>>> > I think this is related to the fixes in
>>> https://tracker.ceph.com/issues/52089 which landed in 16.2.6 --
>>> _do_alloc_write is comparing the prealloc size 0x10000 with
>>> bluestore_prefer_deferred_size_hdd (0x10000), and the "strictly less than"
>>> condition prevents deferred writes from ever happening.
>>> >
>>> > So I think this would impact anyone upgrading clusters with hdd/ssd
>>> mixed osds ... surely we must not be the only clusters impacted by this?!
>>> >
>>> > Should we increase the default bluestore_prefer_deferred_size_hdd up
>>> to 128kB or is there in fact a bug here?
>>> >
>>> > Best Regards,
>>> >
>>> > Dan
>>> >
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>>
>>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io