[ceph-users] Re: osd crash randomly

2022-10-24 Thread can zhu
The same osd crashed today:
 0> 2022-10-24T06:30:00.875+ 7f0bbf3bc700 -1 *** Caught signal
(Segmentation fault) **
 in thread 7f0bbf3bc700 thread_name:bstore_kv_final

 ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific
(stable)
 1: /lib64/libpthread.so.0(+0x12ce0) [0x7f0bcee65ce0]
 2: (BlueStore::Onode::put()+0x29f) [0x562c2ea293df]
 3: (std::_Rb_tree,
boost::intrusive_ptr,
std::_Identity >,
std::less >,
std::allocator >
>::_M_erase(std::_Rb_tree_node
>*)+0x31) [0x562c2eade4d1]
 4: (BlueStore::TransContext::~TransContext()+0x12f) [0x562c2eade7ff]
 5: (BlueStore::_txc_finish(BlueStore::TransContext*)+0x23e)
[0x562c2ea8969e]
 6: (BlueStore::_txc_state_proc(BlueStore::TransContext*)+0x257)
[0x562c2ea95d87]
 7: (BlueStore::_kv_finalize_thread()+0x54e) [0x562c2eaafdae]
 8: (BlueStore::KVFinalizeThread::entry()+0x11) [0x562c2eae3d71]
 9: /lib64/libpthread.so.0(+0x81ca) [0x7f0bcee5b1ca]
 10: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
to interpret this.

Tyler Stachecki wrote on Mon, Oct 24, 2022 at 11:24:

> On Sun, Oct 23, 2022 at 11:04 PM can zhu  wrote:
> >
> > crash info:
> >
> > {
> > "backtrace": [
> > "/lib64/libpthread.so.0(+0x12ce0) [0x7f82e87cece0]",
> > "(BlueStore::Onode::put()+0x1a3) [0x55bd21a422e3]",
> > "(std::_Hashtable > boost::intrusive_ptr >,
> > mempool::pool_allocator<(mempool::pool_index_t)4, std::pair > const, boost::intrusive_ptr > >,
> > std::__detail::_Select1st, std::equal_to,
> > std::hash, std::__detail::_Mod_range_hashing,
> > std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy,
> > std::__detail::_Hashtable_traits >::_M_erase(unsigned
> > long, std::__detail::_Hash_node_base*,
> > std::__detail::_Hash_node > boost::intrusive_ptr >, true>*)+0x68)
> [0x55bd21af7bc8]",
> > "(BlueStore::OnodeSpace::_remove(ghobject_t const&)+0x29b)
> > [0x55bd21a420eb]",
> > "(LruOnodeCacheShard::_trim_to(unsigned long)+0xeb)
> > [0x55bd21afce6b]",
> > "(BlueStore::OnodeSpace::add(ghobject_t const&,
> > boost::intrusive_ptr&)+0x49d) [0x55bd21a42dfd]",
> > "(BlueStore::Collection::get_onode(ghobject_t const&, bool,
> > bool)+0x46a) [0x55bd21a86e1a]",
> > "(BlueStore::_txc_add_transaction(BlueStore::TransContext*,
> > ceph::os::Transaction*)+0x1124) [0x55bd21aab2e4]",
> >
> >
> "(BlueStore::queue_transactions(boost::intrusive_ptr&,
> > std::vector
> > >&, boost::intrusive_ptr, ThreadPool::TPHandle*)+0x316)
> > [0x55bd21ac76d6]",
> > "(non-virtual thunk to
> > PrimaryLogPG::queue_transactions(std::vector > std::allocator >&,
> > boost::intrusive_ptr)+0x58) [0x55bd2170b878]",
> >
> > "(ReplicatedBackend::do_repop(boost::intrusive_ptr)+0xeb0)
> > [0x55bd218fdff0]",
> >
> >
> "(ReplicatedBackend::_handle_message(boost::intrusive_ptr)+0x267)
> > [0x55bd2190e357]",
> >
>  "(PGBackend::handle_message(boost::intrusive_ptr)+0x52)
> > [0x55bd2173ed52]",
> > "(PrimaryLogPG::do_request(boost::intrusive_ptr&,
> > ThreadPool::TPHandle&)+0x5de) [0x55bd216e268e]",
> > "(OSD::dequeue_op(boost::intrusive_ptr,
> > boost::intrusive_ptr, ThreadPool::TPHandle&)+0x309)
> > [0x55bd21569fc9]",
> > "(ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*,
> > boost::intrusive_ptr&, ThreadPool::TPHandle&)+0x68)
> [0x55bd217c8e78]",
> > "(OSD::ShardedOpWQ::_process(unsigned int,
> > ceph::heartbeat_handle_d*)+0xc28) [0x55bd215874c8]",
> > "(ShardedThreadPool::shardedthreadpool_worker(unsigned
> int)+0x5c4)
> > [0x55bd21c042a4]",
> > "(ShardedThreadPool::WorkThreadSharded::entry()+0x14)
> > [0x55bd21c07184]",
> > "/lib64/libpthread.so.0(+0x81ca) [0x7f82e87c41ca]",
> > "clone()"
> > ],
> > "ceph_version": "16.2.10",
> > "crash_id":
> > "2022-10-21T22:15:37.801458Z_ded9e80e-6c08-4502-8887-0438cdcd4a1c",
> > "entity_name": "osd.21",
> > "os_id": "centos",
> > "os_name": "CentOS Stream",
> > "os_version": "8",
> > "os_version_id": "8",
> > "process_name": "ceph-osd",
> > "stack_sig":
> > "a43cb3d3dcfcda8bffe97d30a1cb9c244ba20d595a2c8759a5cc8274781b0020",
> > "timestamp": "2022-10-21T22:15:37.801458Z",
> > "utsname_hostname": "node07",
> > "utsname_machine": "x86_64",
> > "utsname_release": "3.10.0-1160.45.1.el7.x86_64",
> > "utsname_sysname": "Linux",
> > "utsname_version": "#1 SMP Wed Oct 13 17:20:51 UTC 2021"
> > }
> >
> > message info:
> > [root@node07 ~]# dmesg -T | grep osd
> > [Sat Oct 22 06:18:46 2022] tp_osd_tp[23817]: segfault at 55bd0001 ip
> > 7f82e931cd6a sp 7f82c6592f08 error 4 in
> > libtcmalloc.so.4.5.3[7f82e92e1000+4d000]
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
> There is currently a known race condition in onode reference counting
> that affects all versions of Ceph [1][2]. Your backtrace is different
> from everything 

[ceph-users] Re: Understanding rbd objects, with snapshots

2022-10-24 Thread Chris Dunlop

Hi Maged,

Thanks for taking the time to go into a detailed explanation. It's 
certainly not as easy as working out the appropriate object to get via 
rados. As you suggest, I'll have to look into ceph-objectstore-tool and 
perhaps librados to get any further.


Thanks again,

Chris

On Mon, Oct 24, 2022 at 02:41:07PM +0200, Maged Mokhtar wrote:

On 18/10/2022 01:24, Chris Dunlop wrote:

Hi,

Is there anywhere that describes exactly how rbd data (including 
snapshots) are stored within a pool?


Hi Chris,

Snapshots are stored on the same OSD as the current object.
rbd snapshots are self-managed rather than rados pool-managed; the rbd 
client takes responsibility for passing the correct snapshot 
environment/context to the OSDs in i/o operations via librados.
First, to create a snapshot, the rbd client requests a unique snap id number 
from the mons.
This number and the snap name are persisted/stored in the rbd_header.xx object 
for the rbd image, and added to the list of previous snaps, if any.


When the rbd client writes to an rbd_data.xx rados object, it passes the list 
of snaps.
The OSD looks at the snap list and performs a lot of logic, like creating and 
copying the original data to the snap before writing if it did not have this 
snap before, or copying data if the snap did not have this offset/extent 
before, etc. The OSD keeps track of which snapshots it is storing for the 
object, their blob offsets/extents within the object, as well as their 
physical location on the OSD block device, all in the rocksdb database. The 
physical locations on the block device can be far apart, allocated by the 
allocator from free space on the device. You can use ceph-objectstore-tool on 
the OSD to examine a snapshot's location and get its data.



When reading, the rbd client passes the snap id to read, or the default id 
for head/current.
I do not believe you can use the rados get command on the rbd_data.xx objects, 
as you were doing, to get snapshot data, even if you specify the snapshot 
parameter to the command, as I think that works with rados pool snapshots and 
not self-managed ones.
As a user, if you want to access rbd snap data, you can rbd map the snap and 
read from it via kernel rbd.
If you want to fiddle with reading snapshots at the rados level on 
rbd_data.xx, you can write a librados app that first reads the snap id from 
rbd_header.xx based on the snap name and then passes this id in the read 
context to the librados read function.
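
For example, a rough sketch of the rbd-map route (pool, image and snap names 
below are placeholders):

```
# list snapshots of the image; the "id" field is the self-managed snap id
rbd snap ls <pool>/<image> --format json

# map the snapshot read-only via krbd and read raw data from it
dev=$(rbd map --read-only <pool>/<image>@<snap>)
dd if="$dev" bs=4M count=1 status=none | hexdump -C | head
rbd unmap "$dev"
```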


/Maged

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to remove remaining bucket index shard objects

2022-10-24 Thread 伊藤 祐司
Hi,

The large omap alert appears to have resolved itself last week, although I 
don't know the underlying reason.
When I got your email and tried to collect the data, I noticed that the alerts 
had stopped. OMAP usage was 0 bytes, as shown below. To make sure, I ran a deep 
scrub and waited for a while, but the alert has not recurred so far. Before the 
alerts stopped, the other team restarted the node where the OSDs and other 
modules were running, for maintenance, which may have had an impact. However, 
such reboots happen every week and have happened three times since the 
compaction, so the root cause remains uncertain. Since there is a possibility 
of recurrence, I will take a wait-and-see approach.

```
$ kubectl exec -n ceph-poc deploy/rook-ceph-tools -- ceph -s
  cluster:
    id:     49bd471e-84e6-412e-8ed0-75d7bc176657
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum b,d,f (age 4d)
    mgr: b(active, since 4d), standbys: a
    osd: 96 osds: 96 up (since 4d), 96 in (since 4d)
    rgw: 6 daemons active (6 hosts, 2 zones)

  data:
    pools:   16 pools, 4432 pgs
    objects: 10.19k objects, 34 GiB
    usage:   161 GiB used, 787 TiB / 787 TiB avail
    pgs:     4432 active+clean

  io:
    client:   3.1 KiB/s rd, 931 B/s wr, 3 op/s rd, 2 op/s wr

$ OSD_POOL=ceph-poc-object-store-ssd-index.rgw.buckets.index
$ (header="id used_mbytes used_objects omap_used_mbytes omap_used_keys"
>   echo "${header}"
>   echo "${header}" | tr '[[:alpha:]_' '-'
>   kubectl exec -n ceph-poc deploy/rook-ceph-tools -- ceph pg ls-by-pool 
> "${OSD_POOL}" --format=json | jq -r '.pg_stats |
>   sort_by(.stat_sum.num_bytes) | .[] | (.pgid, .stat_sum.num_bytes/1024/1024,
>   .stat_sum.num_objects, .stat_sum.num_omap_bytes/1024/1024,
>   .stat_sum.num_omap_keys)' | paste - - - - -) | column -t
id    used_mbytes  used_objects  omap_used_mbytes  omap_used_keys
--    ---      --
6.0   0            0             0                 0
6.1   0            0             0                 0
6.2   0            0             0                 0
6.3   0            0             0                 0
6.4   0            1             0                 0
6.5   0            1             0                 0
6.6   0            0             0                 0
6.7   0            0             0                 0
6.8   0            0             0                 0
6.9   0            0             0                 0
6.a   0            0             0                 0
6.b   0            0             0                 0
6.c   0            0             0                 0
6.d   0            0             0                 0
6.e   0            0             0                 0
6.f   0            1             0                 0
6.10  0            1             0                 0
6.11  0            0             0                 0
6.12  0            0             0                 0
6.13  0            0             0                 0
6.14  0            0             0                 0
6.15  0            0             0                 0
6.16  0            0             0                 0
6.17  0            0             0                 0
6.18  0            0             0                 0
6.19  0            1             0                 0
6.1a  0            1             0                 0
6.1b  0            0             0                 0
6.1c  0            0             0                 0
6.1d  0            0             0                 0
6.1e  0            1             0                 0
6.1f  0            0             0                 0
6.20  0            1             0                 0
6.21  0            0             0                 0
6.22  0            0             0                 0
6.23  0            0             0                 0
6.24  0            0             0                 0
6.25  0            0             0                 0
6.26  0            0             0                 0
6.27  0            1             0                 0
6.28  0            0             0                 0
6.29  0            0             0                 0
6.2a  0            1             0                 0
6.2b  0            0             0                 0
6.2c  0            0             0                 0
6.2d  0            0             0                 0
6.2e  0            0             0                 0
6.2f  0            0             0                 0
6.30  0            0             0                 0
6.31  0            1             0                 0
6.32  0            1             0                 0
6.33  0            0             0                 0
6.34  0            0             0                 0
6.35  0            0             0                 0
6.36  0            0             0                 0
6.37  0            0             0                 0
6.38  0            0             0                 0
6.39  0            0             0                 0
6.3a  0   

[ceph-users] Dashboard device health info missing

2022-10-24 Thread Wyll Ingersoll


Looking at the device health info for the OSDs in our cluster sometimes shows 
"No SMART data available". This appears to occur only for SCSI-type disks in 
our cluster; ATA disks have their full SMART health data displayed, but the 
non-ATA ones do not.

The actual SMART data (JSON formatted) is returned by the mgr, but because 
it is formatted differently I suspect the dashboard UI doesn't know how to 
interpret it. This is misleading; at the least it could display the "output" 
section for each device so that the viewer can interpret it even if the 
dashboard cannot.
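
For reference, the raw records the mgr holds can be pulled directly while the 
dashboard is lacking, e.g.:

ceph device ls                          # list device ids and which daemons use them
ceph device get-health-metrics <devid>  # dump the stored smartctl JSON for one device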

Is this a known bug?   We would like to have SMART data for all of our devices 
if possible.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS constant high write I/O to the metadata pool

2022-10-24 Thread Olli Rajala
I tried my luck and upgraded to 17.2.4 but unfortunately that didn't make
any difference here either.

I also looked again at all kinds of client op and request stats and whatnot,
which only made me even more certain that this io is not caused by any
clients.

What internal mds operation or mechanism could cause such high idle write
io? I've tried to fiddle a bit with some of the mds cache trim and memory
settings but I haven't noticed any effect there. Any pointers appreciated.

Cheers,
---
Olli Rajala - Lead TD
Anima Vitae Ltd.
www.anima.fi
---


On Mon, Oct 17, 2022 at 10:28 AM Olli Rajala  wrote:

> Hi Patrick,
>
> With "objecter_ops" did you mean "ceph tell mds.pve-core-1 ops" and/or
> "ceph tell mds.pve-core-1 objecter_requests"? Both these show very few
> requests/ops - many times just returning empty lists. I'm pretty sure that
> this I/O isn't generated by any clients - I've earlier tried to isolate
> this by shutting down all cephfs clients and this didn't have any
> noticeable effect.
>
> I tried to watch what is going on with that "perf dump" but to be honest
> all I can see is some numbers going up in the different sections :)
> ...don't have a clue what to focus on and how to interpret that.
>
> Here's a perf dump if you or anyone could make something out of that:
> https://gist.github.com/olliRJL/43c10173aafd82be22c080a9cd28e673
>
> Tnx!
> o.
>
> ---
> Olli Rajala - Lead TD
> Anima Vitae Ltd.
> www.anima.fi
> ---
>
>
> On Fri, Oct 14, 2022 at 8:32 PM Patrick Donnelly 
> wrote:
>
>> Hello Olli,
>>
>> On Thu, Oct 13, 2022 at 5:01 AM Olli Rajala  wrote:
>> >
>> > Hi,
>> >
>> > I'm seeing constant 25-50MB/s writes to the metadata pool even when all
>> > clients and the cluster is idling and in clean state. This surely can't
>> be
>> > normal?
>> >
>> > There's no apparent issues with the performance of the cluster but this
>> > write rate seems excessive and I don't know where to look for the
>> culprit.
>> >
>> > The setup is Ceph 16.2.9 running in hyperconverged 3 node core cluster
>> and
>> > 6 hdd osd nodes.
>> >
>> > Here's typical status when pretty much all clients are idling. Most of
>> that
>> > write bandwidth and maybe fifth of the write iops is hitting the
>> > metadata pool.
>> >
>> >
>> ---
>> > root@pve-core-1:~# ceph -s
>> >   cluster:
>> > id: 2088b4b1-8de1-44d4-956e-aa3d3afff77f
>> > health: HEALTH_OK
>> >
>> >   services:
>> > mon: 3 daemons, quorum pve-core-1,pve-core-2,pve-core-3 (age 2w)
>> > mgr: pve-core-1(active, since 4w), standbys: pve-core-2, pve-core-3
>> > mds: 1/1 daemons up, 2 standby
>> > osd: 48 osds: 48 up (since 5h), 48 in (since 4M)
>> >
>> >   data:
>> > volumes: 1/1 healthy
>> > pools:   10 pools, 625 pgs
>> > objects: 70.06M objects, 46 TiB
>> > usage:   95 TiB used, 182 TiB / 278 TiB avail
>> > pgs: 625 active+clean
>> >
>> >   io:
>> > client:   45 KiB/s rd, 38 MiB/s wr, 6 op/s rd, 287 op/s wr
>> >
>> ---
>> >
>> > Here's some daemonperf dump:
>> >
>> >
>> ---
>> > root@pve-core-1:~# ceph daemonperf mds.`hostname -s`
>> >
>> mds-
>> > --mds_cache--- --mds_log-- -mds_mem- ---mds_server---
>> mds_
>> > -objecter-- purg
>> > req  rlat fwd  inos caps exi  imi  hifc crev cgra ctru cfsa cfa  hcc
>> hccd
>> > hccr prcr|stry recy recd|subm evts segs repl|ino  dn  |hcr  hcs  hsr
>> cre
>> >  cat |sess|actv rd   wr   rdwr|purg|
>> >  4000  767k  78k   000161005
>> 5
>> >  37 |1.1k   00 | 17  3.7k 1340 |767k 767k| 40500
>> >  0 |110 |  42   210 |  2
>> >  5720  767k  78k   0003   16300   11
>>  11
>> >  0   17 |1.1k   00 | 45  3.7k 1370 |767k 767k| 57800
>> >  0 |110 |  02   280 |  4
>> >  5740  767k  78k   0004   34400   34
>>  33
>> >  2   26 |1.0k   00 |134  3.9k 1390 |767k 767k| 57   1300
>> >  0 |110 |  02  1120 | 19
>> >  6730  767k  78k   0006   32600   22
>>  22
>> >  0   32 |1.1k   00 | 78  3.9k 1410 |767k 768k| 67400
>> >  0 |110 |  02   560 |  2
>> >
>> ---
>> > Any ideas where to look at?
>>
>> Check the perf dump output of the mds:
>>
>> ceph tell mds.:0 perf dump
>>
>> over a period of time to identify what's going on. You can also look
>> at the objecter_ops (another tell 

[ceph-users] Re: Temporary shutdown of subcluster and cephfs

2022-10-24 Thread Patrick Donnelly
On Wed, Oct 19, 2022 at 7:54 AM Frank Schilder  wrote:
>
> Hi Dan,
>
> I know that "fs fail ..." is not ideal, but we will not have time for a clean 
> "fs down true" and wait for journal flush procedure to complete (on our 
> cluster this takes at least 20 minutes, which is way too long). My question 
> is more along the lines 'Is an "fs fail" destructive?'

It is not, but lingering clients will not be evicted automatically by
the MDS. If you can, unmount before doing `fs fail`.

A journal flush is not really necessary. You should only need to wait ~10
seconds after the last client unmounts to give the MDS time to write any
outstanding events out to its journal.

> , that is, will an FS come up again after
>
> - fs fail
> ...
> - fs set  joinable true

Yes.
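
So, roughly, the sequence would look like this (<fsname> is a placeholder, 
clients unmounted first):

# unmount CephFS on all clients, wait ~10s for the MDS to write out its journal
ceph fs fail <fsname>
# ... do the maintenance / shutdown ...
ceph fs set <fsname> joinable true
ceph fs status <fsname>   # wait for an MDS to go active again, then remount clients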

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Advice on balancing data across OSDs

2022-10-24 Thread Joseph Mundackal
Quick napkin math

For your 3-way replicated pool - e.g. pool 28 - you have 9.9 TiB across 256
PGs ~= 10137 GiB / 256 PGs ~= 39 GiB per PG.

For the 4+2 EC pool 51 - 32 TiB across 128 PGs ~= 32768 GiB / 128 PGs ~= 256
GiB per PG; with the 4+2 profile each PG is spread across 4 data chunks, so
~= 64 GiB per chunk on each OSD.

In my experience a lot of features in Ceph work well with replicated pools and
are a little buggy with EC pools. I am speculating that is why you
haven't gotten a warning for it.
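
A rough way to reproduce this napkin math from live data (assumes jq is 
available; "stored" from ceph df is the logical, pre-replication size):

ceph df --format json | jq -r '.pools[] | "\(.name) \(.stats.stored)"' |
while read -r pool stored; do
  pgs=$(ceph osd pool get "$pool" pg_num --format json | jq .pg_num)
  echo "$pool: $(( stored / 1024 / 1024 / 1024 / pgs )) GiB per PG"
done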

On Mon, Oct 24, 2022 at 10:41 AM Tim Bishop  wrote:

> Hi Joseph,
>
> Here's some of the larger pools. Notable the largest (pool 51, 32 TiB
> CephFS data) doesn't have the highest number of PGs.
>
> POOL   ID  PGS   STORED  OBJECTS USED  %USED  MAX AVAIL
> pool28 28  256  9.9 TiB2.61M   30 TiB  43.28 13 TiB
> pool29 29  256  9.5 TiB2.48M   28 TiB  42.13 13 TiB
> pool36 36  128  6.0 TiB1.58M   18 TiB  31.67 13 TiB
> pool39 39  256   20 TiB5.20M   60 TiB  60.37 13 TiB
> pool43 43   32  1.9 TiB  503.92k  5.7 TiB  12.79 13 TiB
> pool46 46   32  1.3 TiB  236.34k  3.9 TiB   9.14 13 TiB
> pool47 47  128  4.0 TiB1.04M   12 TiB  23.35 13 TiB
> pool51 51  128   32 TiB   32.47M   55 TiB  58.30 26 TiB
> pool53 53  128  3.3 TiB  864.88k  9.9 TiB  20.21 13 TiB
> pool57 57  128   14 TiB3.55M   21 TiB  34.80 26 TiB
>
> It does sounds like I need to increase that, but I had assume the
> autoscaler would have produced a warning if that was the case... it
> certainly has for some pools in the past, and I've adjusted as per its
> recommendation.
>
> Tim.
>
> On Mon, Oct 24, 2022 at 09:24:58AM -0400, Joseph Mundackal wrote:
> > Hi Tim,
> > You might want to check you pool utilization and see if there are
> > enough pg's in that pool. Higher GB per pg can result in this scenario.
> >
> > I am also assuming that you have the balancer module turn on (ceph
> balancer
> > status) should tell you that as well.
> >
> > If you have enough pgs in the bigger pools and the balancer module is on,
> > you shouldht have to manually reweight osd's.
> >
> > -Joseph
> >
> > On Mon, Oct 24, 2022 at 9:13 AM Tim Bishop 
> wrote:
> >
> > > Hi all,
> > >
> > > ceph version 16.2.9 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific
> > > (stable)
> > >
> > > We're having an issue with the spread of data across our OSDs. We have
> > > 108 OSDs in our cluster, all identical disk size, same number in each
> > > server, and the same number of servers in each rack. So I'd hoped we'd
> > > end up with a pretty balanced distribution of data across the disks.
> > > However, the fullest is at 85% full and the most empty is at 40% full.
> > >
> > > I've included the osd df output below, along with pool and crush rules.
> > >
> > > I've also looked at the reweight-by-utilization command which would
> > > apparently help:
> > >
> > > # ceph osd test-reweight-by-utilization
> > > moved 16 / 5715 (0.279965%)
> > > avg 52.9167
> > > stddev 7.20998 -> 7.15325 (expected baseline 7.24063)
> > > min osd.45 with 31 -> 31 pgs (0.585827 -> 0.585827 * mean)
> > > max osd.22 with 70 -> 68 pgs (1.32283 -> 1.28504 * mean)
> > >
> > > oload 120
> > > max_change 0.05
> > > max_change_osds 4
> > > average_utilization 0.6229
> > > overload_utilization 0.7474
> > > osd.22 weight 1. -> 0.9500
> > > osd.23 weight 1. -> 0.9500
> > > osd.53 weight 1. -> 0.9500
> > > osd.78 weight 1. -> 0.9500
> > > no change
> > >
> > > But I'd like to make sure I understand why the problem is occuring
> > > first so I can rule out a configuration issue, since it feels like the
> > > cluster shouldn't be getting in to this state in the first place.
> > >
> > > I have some suspicions that the number of PGs may be a bit low on some
> > > pools, but autoscale-status is set to "on" or "warn" for every pool, so
> > > it's happy with the current numbers. Does it play nice with CephFS?
> > >
> > > Thanks for any advice.
> > > Tim.
> > >
> > > ID   CLASS  WEIGHT   REWEIGHT  SIZE RAW USE  DATA OMAP META
> > >  AVAIL %USE   VAR   PGS  STATUS
> > >  22hdd  3.63199   1.0  3.6 TiB  3.1 TiB  3.1 TiB  450 MiB  7.6
> > > GiB   557 GiB  85.04  1.37   70  up
> > >  23hdd  3.63199   1.0  3.6 TiB  2.9 TiB  2.9 TiB  459 MiB  7.5
> > > GiB   759 GiB  79.64  1.28   64  up
> > >  53hdd  3.63199   1.0  3.6 TiB  2.8 TiB  2.8 TiB  703 MiB  8.0
> > > GiB   823 GiB  77.91  1.25   66  up
> > >  78hdd  3.63799   1.0  3.6 TiB  2.8 TiB  2.8 TiB  187 MiB  5.9
> > > GiB   851 GiB  77.15  1.24   61  up
> > >  26hdd  3.63199   1.0  3.6 TiB  2.8 TiB  2.8 TiB  432 MiB  7.7
> > > GiB   854 GiB  77.07  1.24   61  up
> > >  39hdd  3.63199   1.0  3.6 TiB  2.8 TiB  2.8 TiB  503 MiB  7.2
> > > GiB   874 GiB  76.55  1.23   65 

[ceph-users] Re: Failed to probe daemons or devices

2022-10-24 Thread Guillaume Abrioux
Hello Sake,

Could you share the output of the vgs / lvs commands?
Also, I would suggest you open a tracker [1].

Thanks!

[1] https://tracker.ceph.com/projects/ceph-volume

On Mon, 24 Oct 2022 at 10:51, Sake Paulusma  wrote:

> Last friday I upgrade the Ceph cluster from 17.2.3 to 17.2.5 with "ceph
> orch upgrade start --image
> localcontainerregistry.local.com:5000/ceph/ceph:v17.2.5-20221017". After
> sometime, an hour?, I've got a health warning: CEPHADM_REFRESH_FAILED:
> failed to probe daemons or devices. I'm using only Cephfs on the cluster
> and it's still working correctly.
> Checking the running services, everything is up and running; mon, osd and
> mds. But on the hosts running mon and mds services I get errors in the
> cephadm.log, see the loglines below.
>
> I look likes cephadm tries to start a container for checking something?
> What could be wrong?
>
>
> On mon nodes I got the following:
> 2022-10-24 10:31:43,880 7f179e5bfb80 DEBUG
> 
> cephadm ['gather-facts']
> 2022-10-24 10:31:44,333 7fc2d52b6b80 DEBUG
> 
> cephadm ['--image', '
> localcontainerregistry.local.com:5000/ceph/ceph@sha256:122436e2f1df0c803666c5591db4a9b6c9196a71b4d44c6bd5d18102509dfca0',
> 'ceph-volume', '--fsid', '8909ef90-22ea-11ed-8df1-0050569ee1b1', '--',
> 'inventory', '--format=json-pretty', '--filter-for-batch']
> 2022-10-24 10:31:44,663 7fc2d52b6b80 INFO Inferring config
> /var/lib/ceph/8909ef90-22ea-11ed-8df1-0050569ee1b1/mon.oqsoel24332/config
> 2022-10-24 10:31:44,663 7fc2d52b6b80 DEBUG Using specified fsid:
> 8909ef90-22ea-11ed-8df1-0050569ee1b1
> 2022-10-24 10:31:45,574 7fc2d52b6b80 INFO Non-zero exit code 1 from
> /bin/podman run --rm --ipc=host --stop-signal=SIGTERM
> --authfile=/etc/ceph/podman-auth.json --net=host --entrypoint
> /usr/sbin/ceph-volume --privileged --group-add=disk --init -e
> CONTAINER_IMAGE=
> localcontainerregistry.local.com:5000/ceph/ceph@sha256:122436e2f1df0c803666c5591db4a9b6c9196a71b4d44c6bd5d18102509dfca0
> -e NODE_NAME=monnode2.local.com -e CEPH_USE_RANDOM_NONCE=1 -e
> CEPH_VOLUME_SKIP_RESTORECON=yes -e CEPH_VOLUME_DEBUG=1 -v
> /var/run/ceph/8909ef90-22ea-11ed-8df1-0050569ee1b1:/var/run/ceph:z -v
> /var/log/ceph/8909ef90-22ea-11ed-8df1-0050569ee1b1:/var/log/ceph:z -v
> /var/lib/ceph/8909ef90-22ea-11ed-8df1-0050569ee1b1/crash:/var/lib/ceph/crash:z
> -v /run/systemd/journal:/run/systemd/journal -v /dev:/dev -v
> /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v
> /run/lock/lvm:/run/lock/lvm -v
> /var/lib/ceph/8909ef90-22ea-11ed-8df1-0050569ee1b1/selinux:/sys/fs/selinux:ro
> -v /:/rootfs -v /tmp/ceph-tmp31tx1iy2:/etc/ceph/ce
>  ph.conf:z
> localcontainerregistry.local.com:5000/ceph/ceph@sha256:122436e2f1df0c803666c5591db4a9b6c9196a71b4d44c6bd5d18102509dfca0
> inventory --format=json-pretty --filter-for-batch
> 2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr Traceback
> (most recent call last):
> 2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr File
> "/usr/sbin/ceph-volume", line 11, in 
> 2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr
> load_entry_point('ceph-volume==1.0.0', 'console_scripts', 'ceph-volume')()
> 2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr File
> "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 41, in __init__
> 2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr
> self.main(self.argv)
> 2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr File
> "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in
> newfunc
> 2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr return f(*a,
> **kw)
> 2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr File
> "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 153, in main
> 2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr
> terminal.dispatch(self.mapper, subcommand_args)
> 2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr File
> "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in
> dispatch
> 2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr
> instance.main()
> 2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr File
> "/usr/lib/python3.6/site-packages/ceph_volume/inventory/main.py", line 53,
> in main
> 2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr
> with_lsm=self.args.with_lsm))
> 2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr File
> "/usr/lib/python3.6/site-packages/ceph_volume/util/device.py", line 39, in
> __init__
> 2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr
> all_devices_vgs = lvm.get_all_devices_vgs()
> 2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr File
> "/usr/lib/python3.6/site-packages/ceph_volume/api/lvm.py", line 797, in
> get_all_devices_vgs
> 2022-10-24 10:31:45,575 7fc2d52b6b80 INFO 

[ceph-users] Re: ceph-ansible install failure

2022-10-24 Thread Guillaume Abrioux
Hi Zhongzhou,

I think most of the time this error means that a device was not wiped
correctly. Can you check that?
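
For example, something along these lines (destructive, so double-check the 
device path first; /dev/sdX is a placeholder):

# let ceph-volume tear down the LVM metadata and wipe the device
ceph-volume lvm zap --destroy /dev/sdX
# or do it by hand
wipefs -a /dev/sdX
sgdisk --zap-all /dev/sdX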

Thanks!

On Sat, 22 Oct 2022 at 01:01, Zhongzhou Cai  wrote:

> Hi folks,
>
> I'm trying to install ceph on GCE VMs (debian/ubuntu) with PD-SSDs using
> ceph-ansible image. The installation from clean has been good, but when I
> purged ceph cluster and tried to re-install, I saw the error:
>
> ```
>
> Running command: /usr/bin/ceph-osd --cluster ceph --osd-objectstore
> bluestore --mkfs -i 1 --monmap /var/lib/ceph/osd/ceph-1/activate.monmap
> --keyfile - --osd-data /var/lib/ceph/osd/ceph-1/ --osd-uuid
> 08d766d5-e843-4c65-9f4f-db7f0129b4e9 --setuser ceph --setgroup ceph
>
> stderr: 2022-10-21T22:07:58.447+ 7f71afead080 -1
> bluestore(/var/lib/ceph/osd/ceph-1/) _read_fsid unparsable uuid
>
> stderr: 2022-10-21T22:07:58.455+ 7f71afead080 -1 bluefs _replay 0x0:
> stop: uuid 8b1ce55d-10c1-a33d-1817-8a8427657694 != super.uuid
> 3d8aa673-00bd-473c-a725-06ac31c6b945, block dump:
>
> stderr:   6a bc c7 44 83 87 8b 1c  e5 5d 10 c1 a3 3d 18 17
> |j..D.]...=..|
>
> stderr: 0010  8a 84 27 65 76 94 bd 12  3c 11 4a c4 32 6c eb a4
> |..'ev...<.J.2l..|
>
> …
>
> stderr: 0ff0  2b 57 4e a4 ad da be cb  bf df 61 fc f7 ce 4a 14
> |+WN...a...J.|
>
> stderr: 1000
>
> stderr: 2022-10-21T22:07:58.987+ 7f71afead080 -1 rocksdb:
> verify_sharding unable to list column families: NotFound:
>
> stderr: 2022-10-21T22:07:58.987+ 7f71afead080 -1
> bluestore(/var/lib/ceph/osd/ceph-1/) _open_db erroring opening db:
>
> stderr: 2022-10-21T22:07:59.515+ 7f71afead080 -1 OSD::mkfs:
> ObjectStore::mkfs failed with error (5) Input/output error
>
> stderr: 2022-10-21T22:07:59.515+ 7f71afead080 -1 [0;31m ** ERROR: error
> creating empty object store in /var/lib/ceph/osd/ceph-1/: (5) Input/output
> error[0m
>
> --> Was unable to complete a new OSD, will rollback changes
> ```
>
> Can someone explain what uuid != super.uuid means? The issue seems not to
> happen when installing on a clean disk. Would it be related to the purging
> process not doing a good cleanup job? FWIW, I'm using
>
> https://github.com/ceph/ceph-ansible/blob/main/infrastructure-playbooks/purge-cluster.yml
> to purge the cluster.
>
> Thanks,
> Zhongzhou Cai
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


-- 

*Guillaume Abrioux*Senior Software Engineer
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Advice on balancing data across OSDs

2022-10-24 Thread Josh Baergen
Hi Tim,

Ah, it didn't sink in for me at first how many pools there were here.
I think you might be hitting the issue that the author of
https://github.com/TheJJ/ceph-balancer ran into, and thus their
balancer might help in this case.

Josh

On Mon, Oct 24, 2022 at 8:37 AM Tim Bishop  wrote:
>
> Hi Josh,
>
> On Mon, Oct 24, 2022 at 07:20:46AM -0600, Josh Baergen wrote:
> > > I've included the osd df output below, along with pool and crush rules.
> >
> > Looking at these, the balancer module should be taking care of this
> > imbalance automatically. What does "ceph balancer status" say?
>
> # ceph balancer status
> {
> "active": true,
> "last_optimize_duration": "0:00:00.038795",
> "last_optimize_started": "Mon Oct 24 15:35:43 2022",
> "mode": "upmap",
> "optimize_result": "Optimization plan created successfully",
> "plans": []
> }
>
> Looks healthy?
>
> This cluster is on pacific but has been upgraded through numerous
> previous releases, so it is possible some settings have been inherited
> and are not the same defaults as a new cluster.
>
> Tim.
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Advice on balancing data across OSDs

2022-10-24 Thread Tim Bishop
Hi Joseph,

Here are some of the larger pools. Notably, the largest (pool 51, 32 TiB of
CephFS data) doesn't have the highest number of PGs.

POOL   ID  PGS   STORED  OBJECTS USED  %USED  MAX AVAIL
pool28 28  256  9.9 TiB2.61M   30 TiB  43.28 13 TiB
pool29 29  256  9.5 TiB2.48M   28 TiB  42.13 13 TiB
pool36 36  128  6.0 TiB1.58M   18 TiB  31.67 13 TiB
pool39 39  256   20 TiB5.20M   60 TiB  60.37 13 TiB
pool43 43   32  1.9 TiB  503.92k  5.7 TiB  12.79 13 TiB
pool46 46   32  1.3 TiB  236.34k  3.9 TiB   9.14 13 TiB
pool47 47  128  4.0 TiB1.04M   12 TiB  23.35 13 TiB
pool51 51  128   32 TiB   32.47M   55 TiB  58.30 26 TiB
pool53 53  128  3.3 TiB  864.88k  9.9 TiB  20.21 13 TiB
pool57 57  128   14 TiB3.55M   21 TiB  34.80 26 TiB

It does sound like I need to increase that, but I had assumed the
autoscaler would have produced a warning if that were the case... it
certainly has for some pools in the past, and I've adjusted as per its
recommendations.

Tim.

On Mon, Oct 24, 2022 at 09:24:58AM -0400, Joseph Mundackal wrote:
> Hi Tim,
> You might want to check you pool utilization and see if there are
> enough pg's in that pool. Higher GB per pg can result in this scenario.
> 
> I am also assuming that you have the balancer module turn on (ceph balancer
> status) should tell you that as well.
> 
> If you have enough pgs in the bigger pools and the balancer module is on,
> you shouldht have to manually reweight osd's.
> 
> -Joseph
> 
> On Mon, Oct 24, 2022 at 9:13 AM Tim Bishop  wrote:
> 
> > Hi all,
> >
> > ceph version 16.2.9 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific
> > (stable)
> >
> > We're having an issue with the spread of data across our OSDs. We have
> > 108 OSDs in our cluster, all identical disk size, same number in each
> > server, and the same number of servers in each rack. So I'd hoped we'd
> > end up with a pretty balanced distribution of data across the disks.
> > However, the fullest is at 85% full and the most empty is at 40% full.
> >
> > I've included the osd df output below, along with pool and crush rules.
> >
> > I've also looked at the reweight-by-utilization command which would
> > apparently help:
> >
> > # ceph osd test-reweight-by-utilization
> > moved 16 / 5715 (0.279965%)
> > avg 52.9167
> > stddev 7.20998 -> 7.15325 (expected baseline 7.24063)
> > min osd.45 with 31 -> 31 pgs (0.585827 -> 0.585827 * mean)
> > max osd.22 with 70 -> 68 pgs (1.32283 -> 1.28504 * mean)
> >
> > oload 120
> > max_change 0.05
> > max_change_osds 4
> > average_utilization 0.6229
> > overload_utilization 0.7474
> > osd.22 weight 1. -> 0.9500
> > osd.23 weight 1. -> 0.9500
> > osd.53 weight 1. -> 0.9500
> > osd.78 weight 1. -> 0.9500
> > no change
> >
> > But I'd like to make sure I understand why the problem is occuring
> > first so I can rule out a configuration issue, since it feels like the
> > cluster shouldn't be getting in to this state in the first place.
> >
> > I have some suspicions that the number of PGs may be a bit low on some
> > pools, but autoscale-status is set to "on" or "warn" for every pool, so
> > it's happy with the current numbers. Does it play nice with CephFS?
> >
> > Thanks for any advice.
> > Tim.
> >
> > ID   CLASS  WEIGHT   REWEIGHT  SIZE RAW USE  DATA OMAP META
> >  AVAIL %USE   VAR   PGS  STATUS
> >  22hdd  3.63199   1.0  3.6 TiB  3.1 TiB  3.1 TiB  450 MiB  7.6
> > GiB   557 GiB  85.04  1.37   70  up
> >  23hdd  3.63199   1.0  3.6 TiB  2.9 TiB  2.9 TiB  459 MiB  7.5
> > GiB   759 GiB  79.64  1.28   64  up
> >  53hdd  3.63199   1.0  3.6 TiB  2.8 TiB  2.8 TiB  703 MiB  8.0
> > GiB   823 GiB  77.91  1.25   66  up
> >  78hdd  3.63799   1.0  3.6 TiB  2.8 TiB  2.8 TiB  187 MiB  5.9
> > GiB   851 GiB  77.15  1.24   61  up
> >  26hdd  3.63199   1.0  3.6 TiB  2.8 TiB  2.8 TiB  432 MiB  7.7
> > GiB   854 GiB  77.07  1.24   61  up
> >  39hdd  3.63199   1.0  3.6 TiB  2.8 TiB  2.8 TiB  503 MiB  7.2
> > GiB   874 GiB  76.55  1.23   65  up
> >  42hdd  3.63199   1.0  3.6 TiB  2.8 TiB  2.7 TiB  439 MiB  6.6
> > GiB   909 GiB  75.59  1.21   60  up
> > 101hdd  3.63820   1.0  3.6 TiB  2.7 TiB  2.7 TiB  306 MiB  7.1
> > GiB   913 GiB  75.50  1.21   61  up
> >  87hdd  3.63820   1.0  3.6 TiB  2.7 TiB  2.7 TiB  539 MiB  7.5
> > GiB   921 GiB  75.27  1.21   61  up
> >  59hdd  3.63799   1.0  3.6 TiB  2.7 TiB  2.7 TiB  721 MiB  7.9
> > GiB   957 GiB  74.30  1.19   64  up
> >  79hdd  3.63799   1.0  3.6 TiB  2.7 TiB  2.7 TiB  950 MiB  9.0
> > GiB   970 GiB  73.95  1.19   58  up
> >  34hdd  3.63199   1.0  3.6 TiB  2.7 TiB  2.7 TiB  202 MiB  6.0
> > GiB   974 GiB  73.85  1.19   57  up
> >  60

[ceph-users] Re: Advice on balancing data across OSDs

2022-10-24 Thread Tim Bishop
Hi Josh,

On Mon, Oct 24, 2022 at 07:20:46AM -0600, Josh Baergen wrote:
> > I've included the osd df output below, along with pool and crush rules.
> 
> Looking at these, the balancer module should be taking care of this
> imbalance automatically. What does "ceph balancer status" say?

# ceph balancer status
{
"active": true,
"last_optimize_duration": "0:00:00.038795",
"last_optimize_started": "Mon Oct 24 15:35:43 2022",
"mode": "upmap",
"optimize_result": "Optimization plan created successfully",
"plans": []
}

Looks healthy?

This cluster is on pacific but has been upgraded through numerous
previous releases, so it is possible some settings have been inherited
and are not the same defaults as a new cluster.

Tim.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Advice on balancing data across OSDs

2022-10-24 Thread Joseph Mundackal
Hi Tim,
You might want to check your pool utilization and see if there are
enough PGs in each pool. A high GB-per-PG ratio can result in this scenario.

I am also assuming that you have the balancer module turned on ("ceph balancer
status" should tell you that as well).

If you have enough PGs in the bigger pools and the balancer module is on,
you shouldn't have to manually reweight OSDs.

-Joseph

On Mon, Oct 24, 2022 at 9:13 AM Tim Bishop  wrote:

> Hi all,
>
> ceph version 16.2.9 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific
> (stable)
>
> We're having an issue with the spread of data across our OSDs. We have
> 108 OSDs in our cluster, all identical disk size, same number in each
> server, and the same number of servers in each rack. So I'd hoped we'd
> end up with a pretty balanced distribution of data across the disks.
> However, the fullest is at 85% full and the most empty is at 40% full.
>
> I've included the osd df output below, along with pool and crush rules.
>
> I've also looked at the reweight-by-utilization command which would
> apparently help:
>
> # ceph osd test-reweight-by-utilization
> moved 16 / 5715 (0.279965%)
> avg 52.9167
> stddev 7.20998 -> 7.15325 (expected baseline 7.24063)
> min osd.45 with 31 -> 31 pgs (0.585827 -> 0.585827 * mean)
> max osd.22 with 70 -> 68 pgs (1.32283 -> 1.28504 * mean)
>
> oload 120
> max_change 0.05
> max_change_osds 4
> average_utilization 0.6229
> overload_utilization 0.7474
> osd.22 weight 1. -> 0.9500
> osd.23 weight 1. -> 0.9500
> osd.53 weight 1. -> 0.9500
> osd.78 weight 1. -> 0.9500
> no change
>
> But I'd like to make sure I understand why the problem is occuring
> first so I can rule out a configuration issue, since it feels like the
> cluster shouldn't be getting in to this state in the first place.
>
> I have some suspicions that the number of PGs may be a bit low on some
> pools, but autoscale-status is set to "on" or "warn" for every pool, so
> it's happy with the current numbers. Does it play nice with CephFS?
>
> Thanks for any advice.
> Tim.
>
> ID   CLASS  WEIGHT   REWEIGHT  SIZE RAW USE  DATA OMAP META
>  AVAIL %USE   VAR   PGS  STATUS
>  22hdd  3.63199   1.0  3.6 TiB  3.1 TiB  3.1 TiB  450 MiB  7.6
> GiB   557 GiB  85.04  1.37   70  up
>  23hdd  3.63199   1.0  3.6 TiB  2.9 TiB  2.9 TiB  459 MiB  7.5
> GiB   759 GiB  79.64  1.28   64  up
>  53hdd  3.63199   1.0  3.6 TiB  2.8 TiB  2.8 TiB  703 MiB  8.0
> GiB   823 GiB  77.91  1.25   66  up
>  78hdd  3.63799   1.0  3.6 TiB  2.8 TiB  2.8 TiB  187 MiB  5.9
> GiB   851 GiB  77.15  1.24   61  up
>  26hdd  3.63199   1.0  3.6 TiB  2.8 TiB  2.8 TiB  432 MiB  7.7
> GiB   854 GiB  77.07  1.24   61  up
>  39hdd  3.63199   1.0  3.6 TiB  2.8 TiB  2.8 TiB  503 MiB  7.2
> GiB   874 GiB  76.55  1.23   65  up
>  42hdd  3.63199   1.0  3.6 TiB  2.8 TiB  2.7 TiB  439 MiB  6.6
> GiB   909 GiB  75.59  1.21   60  up
> 101hdd  3.63820   1.0  3.6 TiB  2.7 TiB  2.7 TiB  306 MiB  7.1
> GiB   913 GiB  75.50  1.21   61  up
>  87hdd  3.63820   1.0  3.6 TiB  2.7 TiB  2.7 TiB  539 MiB  7.5
> GiB   921 GiB  75.27  1.21   61  up
>  59hdd  3.63799   1.0  3.6 TiB  2.7 TiB  2.7 TiB  721 MiB  7.9
> GiB   957 GiB  74.30  1.19   64  up
>  79hdd  3.63799   1.0  3.6 TiB  2.7 TiB  2.7 TiB  950 MiB  9.0
> GiB   970 GiB  73.95  1.19   58  up
>  34hdd  3.63199   1.0  3.6 TiB  2.7 TiB  2.7 TiB  202 MiB  6.0
> GiB   974 GiB  73.85  1.19   57  up
>  60hdd  3.63799   1.0  3.6 TiB  2.7 TiB  2.6 TiB  668 MiB  7.2
> GiB  1009 GiB  72.91  1.17   59  up
>  18hdd  3.63199   1.0  3.6 TiB  2.6 TiB  2.6 TiB  453 MiB  6.5
> GiB  1021 GiB  72.59  1.17   60  up
>  74hdd  3.63799   1.0  3.6 TiB  2.6 TiB  2.6 TiB  693 MiB  7.5
> GiB   1.0 TiB  72.12  1.16   62  up
>  19hdd  3.63199   1.0  3.6 TiB  2.6 TiB  2.6 TiB  655 MiB  7.9
> GiB   1.0 TiB  71.71  1.15   63  up
>  69hdd  3.63799   1.0  3.6 TiB  2.6 TiB  2.6 TiB  445 MiB  6.2
> GiB   1.0 TiB  71.70  1.15   65  up
>  43hdd  3.63199   1.0  3.6 TiB  2.6 TiB  2.6 TiB  170 MiB  4.7
> GiB   1.0 TiB  71.62  1.15   63  up
>  97hdd  3.63820   1.0  3.6 TiB  2.6 TiB  2.6 TiB  276 MiB  5.7
> GiB   1.0 TiB  71.33  1.15   66  up
>  67hdd  3.63799   1.0  3.6 TiB  2.6 TiB  2.6 TiB  430 MiB  6.3
> GiB   1.0 TiB  71.22  1.14   54  up
>  68hdd  3.63799   1.0  3.6 TiB  2.6 TiB  2.6 TiB  419 MiB  6.6
> GiB   1.1 TiB  70.68  1.13   58  up
>  31hdd  3.63199   1.0  3.6 TiB  2.6 TiB  2.5 TiB  419 MiB  5.2
> GiB   1.1 TiB  70.16  1.13   63  up
>  48hdd  3.63199   1.0  3.6 TiB  2.6 TiB  2.5 TiB  211 MiB  5.0
> GiB   1.1 TiB  70.13  1.13   56  up
>  73hdd  3.63799   1.0  3.6 TiB  2.5 TiB  2.5 TiB  765 MiB  7.1
> GiB   1.1 TiB  69.52  1.12   57  up
>  98hdd  3.63820   1.0  3.6 TiB  2.5 TiB  2.5 TiB  552 MiB  

[ceph-users] Re: Advice on balancing data across OSDs

2022-10-24 Thread Anthony D'Atri
Hey, Tim.

Visualization helps you get a better sense of OSD fillage than a table of 
numbers. A Grafana panel works, or a quick script:

Grab this script from CERN:

https://gitlab.cern.ch/ceph/ceph-scripts/-/blob/master/tools/histogram.py 


Here’s a wrapper

#!/bin/bash

case $(uname -n) in
    *admin* ) echo "Be sure that CEPH_ARGS is set to the proper cluster or you may get unexpected results" ;;
    *) ;;
esac

if [ -z "$CEPH_ARGS" ]; then
    export CLUSTER=$(basename /etc/ceph/*conf | sed -e 's/.conf//')
    export CEPH_ARGS="--cluster $CLUSTER"
fi

case $(ceph -v | awk '{print $3}') in
    9*)  ceph $CEPH_ARGS osd df | egrep -v WEIGHT\|TOTAL\|MIN\|ID\|nan | sed -e 's/ssd//' -e 's/hdd//' | awk '{print 1, $7}'  | /opt/storage/ceph-scripts/tools/histogram.py -a -b 200 -m 0 -x 100 -p --no-mvsd ;;
    14*) ceph $CEPH_ARGS osd df | egrep -v WEIGHT\|TOTAL\|MIN\|ID\|nan | sed -e 's/ssd//' -e 's/hdd//' | awk '{print 1, $16}' | /opt/storage/ceph-scripts/tools/histogram.py -a -b 200 -m 0 -x 100 -p --no-mvsd ;;
    *)   echo "Update this script to handle this Ceph version" ; exit 1 ;;
esac



If you’re using Pacific, you probably don’t want to use 
reweight-by-utilization.  It works, but there are better ways.

Left to itself, OSD fillage will sorta look like a bell curve, with a few 
under-full and over-full outliers. This is the nature of the CRUSH 
hash/algorithm, which Sage has called “probabilistic”.

As Josh mentions, the ceph-mgr balancer module should be your go-to here.  With 
your uniform cluster it should do well; most likely it is turned off.

Note that this is not the same as the PG autoscaler; the two are often confused.
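
Roughly, to check and enable it (upmap mode needs all clients to be luminous 
or newer):

ceph balancer status
ceph osd set-require-min-compat-client luminous   # only if not already set
ceph balancer mode upmap
ceph balancer on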



> On Oct 24, 2022, at 09:11, Tim Bishop  wrote:
> 
> Hi all,
> 
> ceph version 16.2.9 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific 
> (stable)
> 
> We're having an issue with the spread of data across our OSDs. We have
> 108 OSDs in our cluster, all identical disk size, same number in each
> server, and the same number of servers in each rack. So I'd hoped we'd
> end up with a pretty balanced distribution of data across the disks.
> However, the fullest is at 85% full and the most empty is at 40% full.
> 
> I've included the osd df output below, along with pool and crush rules.
> 
> I've also looked at the reweight-by-utilization command which would
> apparently help:
> 
> # ceph osd test-reweight-by-utilization
> moved 16 / 5715 (0.279965%)
> avg 52.9167
> stddev 7.20998 -> 7.15325 (expected baseline 7.24063)
> min osd.45 with 31 -> 31 pgs (0.585827 -> 0.585827 * mean)
> max osd.22 with 70 -> 68 pgs (1.32283 -> 1.28504 * mean)
> 
> oload 120
> max_change 0.05
> max_change_osds 4
> average_utilization 0.6229
> overload_utilization 0.7474
> osd.22 weight 1. -> 0.9500
> osd.23 weight 1. -> 0.9500
> osd.53 weight 1. -> 0.9500
> osd.78 weight 1. -> 0.9500
> no change
> 
> But I'd like to make sure I understand why the problem is occuring
> first so I can rule out a configuration issue, since it feels like the
> cluster shouldn't be getting in to this state in the first place.
> 
> I have some suspicions that the number of PGs may be a bit low on some
> pools, but autoscale-status is set to "on" or "warn" for every pool, so
> it's happy with the current numbers. Does it play nice with CephFS?
> 
> Thanks for any advice.
> Tim.
> 
> ID   CLASS  WEIGHT   REWEIGHT  SIZE RAW USE  DATA OMAP META 
> AVAIL %USE   VAR   PGS  STATUS
> 22hdd  3.63199   1.0  3.6 TiB  3.1 TiB  3.1 TiB  450 MiB  7.6 GiB   
> 557 GiB  85.04  1.37   70  up
> 23hdd  3.63199   1.0  3.6 TiB  2.9 TiB  2.9 TiB  459 MiB  7.5 GiB   
> 759 GiB  79.64  1.28   64  up
> 53hdd  3.63199   1.0  3.6 TiB  2.8 TiB  2.8 TiB  703 MiB  8.0 GiB   
> 823 GiB  77.91  1.25   66  up
> 78hdd  3.63799   1.0  3.6 TiB  2.8 TiB  2.8 TiB  187 MiB  5.9 GiB   
> 851 GiB  77.15  1.24   61  up
> 26hdd  3.63199   1.0  3.6 TiB  2.8 TiB  2.8 TiB  432 MiB  7.7 GiB   
> 854 GiB  77.07  1.24   61  up
> 39hdd  3.63199   1.0  3.6 TiB  2.8 TiB  2.8 TiB  503 MiB  7.2 GiB   
> 874 GiB  76.55  1.23   65  up
> 42hdd  3.63199   1.0  3.6 TiB  2.8 TiB  2.7 TiB  439 MiB  6.6 GiB   
> 909 GiB  75.59  1.21   60  up
> 101hdd  3.63820   1.0  3.6 TiB  2.7 TiB  2.7 TiB  306 MiB  7.1 GiB   
> 913 GiB  75.50  1.21   61  up
> 87hdd  3.63820   1.0  3.6 TiB  2.7 TiB  2.7 TiB  539 MiB  7.5 GiB   
> 921 GiB  75.27  1.21   61  up
> 59hdd  3.63799   1.0  3.6 TiB  2.7 TiB  2.7 TiB  721 MiB  7.9 GiB   
> 957 GiB  74.30  1.19   64  up
> 79hdd  3.63799   1.0  3.6 TiB  2.7 TiB  2.7 TiB  950 MiB  9.0 GiB   
> 970 GiB  73.95  1.19   58  up
> 34hdd  3.63199   1.0  3.6 TiB  2.7 TiB  2.7 TiB  202 MiB  6.0 GiB   
> 974 GiB  73.85  1.19   57  up
> 60hdd  3.63799 

[ceph-users] Re: Advice on balancing data across OSDs

2022-10-24 Thread Josh Baergen
Hi Tim,

> I've included the osd df output below, along with pool and crush rules.

Looking at these, the balancer module should be taking care of this
imbalance automatically. What does "ceph balancer status" say?

Josh
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Advice on balancing data across OSDs

2022-10-24 Thread Tim Bishop
Hi all,

ceph version 16.2.9 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific (stable)

We're having an issue with the spread of data across our OSDs. We have
108 OSDs in our cluster, all identical disk size, same number in each
server, and the same number of servers in each rack. So I'd hoped we'd
end up with a pretty balanced distribution of data across the disks.
However, the fullest is at 85% full and the most empty is at 40% full.

I've included the osd df output below, along with pool and crush rules.

I've also looked at the reweight-by-utilization command which would
apparently help:

# ceph osd test-reweight-by-utilization
moved 16 / 5715 (0.279965%)
avg 52.9167
stddev 7.20998 -> 7.15325 (expected baseline 7.24063)
min osd.45 with 31 -> 31 pgs (0.585827 -> 0.585827 * mean)
max osd.22 with 70 -> 68 pgs (1.32283 -> 1.28504 * mean)

oload 120
max_change 0.05
max_change_osds 4
average_utilization 0.6229
overload_utilization 0.7474
osd.22 weight 1. -> 0.9500
osd.23 weight 1. -> 0.9500
osd.53 weight 1. -> 0.9500
osd.78 weight 1. -> 0.9500
no change

But I'd like to make sure I understand why the problem is occurring
first so I can rule out a configuration issue, since it feels like the
cluster shouldn't be getting into this state in the first place.

I have some suspicions that the number of PGs may be a bit low on some
pools, but autoscale-status is set to "on" or "warn" for every pool, so
it's happy with the current numbers. Does it play nice with CephFS?

Thanks for any advice.
Tim.

ID   CLASS  WEIGHT   REWEIGHT  SIZE RAW USE  DATA OMAP META 
AVAIL %USE   VAR   PGS  STATUS
 22hdd  3.63199   1.0  3.6 TiB  3.1 TiB  3.1 TiB  450 MiB  7.6 GiB   
557 GiB  85.04  1.37   70  up
 23hdd  3.63199   1.0  3.6 TiB  2.9 TiB  2.9 TiB  459 MiB  7.5 GiB   
759 GiB  79.64  1.28   64  up
 53hdd  3.63199   1.0  3.6 TiB  2.8 TiB  2.8 TiB  703 MiB  8.0 GiB   
823 GiB  77.91  1.25   66  up
 78hdd  3.63799   1.0  3.6 TiB  2.8 TiB  2.8 TiB  187 MiB  5.9 GiB   
851 GiB  77.15  1.24   61  up
 26hdd  3.63199   1.0  3.6 TiB  2.8 TiB  2.8 TiB  432 MiB  7.7 GiB   
854 GiB  77.07  1.24   61  up
 39hdd  3.63199   1.0  3.6 TiB  2.8 TiB  2.8 TiB  503 MiB  7.2 GiB   
874 GiB  76.55  1.23   65  up
 42hdd  3.63199   1.0  3.6 TiB  2.8 TiB  2.7 TiB  439 MiB  6.6 GiB   
909 GiB  75.59  1.21   60  up
101hdd  3.63820   1.0  3.6 TiB  2.7 TiB  2.7 TiB  306 MiB  7.1 GiB   
913 GiB  75.50  1.21   61  up
 87hdd  3.63820   1.0  3.6 TiB  2.7 TiB  2.7 TiB  539 MiB  7.5 GiB   
921 GiB  75.27  1.21   61  up
 59hdd  3.63799   1.0  3.6 TiB  2.7 TiB  2.7 TiB  721 MiB  7.9 GiB   
957 GiB  74.30  1.19   64  up
 79hdd  3.63799   1.0  3.6 TiB  2.7 TiB  2.7 TiB  950 MiB  9.0 GiB   
970 GiB  73.95  1.19   58  up
 34hdd  3.63199   1.0  3.6 TiB  2.7 TiB  2.7 TiB  202 MiB  6.0 GiB   
974 GiB  73.85  1.19   57  up
 60hdd  3.63799   1.0  3.6 TiB  2.7 TiB  2.6 TiB  668 MiB  7.2 GiB  
1009 GiB  72.91  1.17   59  up
 18hdd  3.63199   1.0  3.6 TiB  2.6 TiB  2.6 TiB  453 MiB  6.5 GiB  
1021 GiB  72.59  1.17   60  up
 74hdd  3.63799   1.0  3.6 TiB  2.6 TiB  2.6 TiB  693 MiB  7.5 GiB   
1.0 TiB  72.12  1.16   62  up
 19hdd  3.63199   1.0  3.6 TiB  2.6 TiB  2.6 TiB  655 MiB  7.9 GiB   
1.0 TiB  71.71  1.15   63  up
 69hdd  3.63799   1.0  3.6 TiB  2.6 TiB  2.6 TiB  445 MiB  6.2 GiB   
1.0 TiB  71.70  1.15   65  up
 43hdd  3.63199   1.0  3.6 TiB  2.6 TiB  2.6 TiB  170 MiB  4.7 GiB   
1.0 TiB  71.62  1.15   63  up
 97hdd  3.63820   1.0  3.6 TiB  2.6 TiB  2.6 TiB  276 MiB  5.7 GiB   
1.0 TiB  71.33  1.15   66  up
 67hdd  3.63799   1.0  3.6 TiB  2.6 TiB  2.6 TiB  430 MiB  6.3 GiB   
1.0 TiB  71.22  1.14   54  up
 68hdd  3.63799   1.0  3.6 TiB  2.6 TiB  2.6 TiB  419 MiB  6.6 GiB   
1.1 TiB  70.68  1.13   58  up
 31hdd  3.63199   1.0  3.6 TiB  2.6 TiB  2.5 TiB  419 MiB  5.2 GiB   
1.1 TiB  70.16  1.13   63  up
 48hdd  3.63199   1.0  3.6 TiB  2.6 TiB  2.5 TiB  211 MiB  5.0 GiB   
1.1 TiB  70.13  1.13   56  up
 73hdd  3.63799   1.0  3.6 TiB  2.5 TiB  2.5 TiB  765 MiB  7.1 GiB   
1.1 TiB  69.52  1.12   57  up
 98hdd  3.63820   1.0  3.6 TiB  2.5 TiB  2.5 TiB  552 MiB  7.1 GiB   
1.1 TiB  68.72  1.10   60  up
 58hdd  3.63799   1.0  3.6 TiB  2.5 TiB  2.5 TiB  427 MiB  6.3 GiB   
1.2 TiB  68.39  1.10   53  up
 14hdd  3.63199   1.0  3.6 TiB  2.5 TiB  2.5 TiB  409 MiB  6.0 GiB   
1.2 TiB  68.06  1.09   65  up
 47hdd  3.63199   1.0  3.6 TiB  2.5 TiB  2.5 TiB  166 MiB  5.5 GiB   
1.2 TiB  67.84  1.09   55  up
  9hdd  3.63199   1.0  3.6 TiB  2.5 TiB  2.5 TiB  419 MiB  5.9 GiB   
1.2 TiB  67.78  1.09   58  up
 90hdd  3.63820   1.0  3.6 TiB  2.5 TiB  2.5 TiB  277 MiB  6.3 GiB   
1.2 TiB  67.56  1.08   57   

[ceph-users] Re: Debug cluster warnings "CEPHADM_HOST_CHECK_FAILED", "CEPHADM_REFRESH_FAILED" etc

2022-10-24 Thread Martin Johansen
Hi, thank you. We replaced the domain of the service in the text before
reporting the issue. Sorry, I should have mentioned that.

admin.ceph.example.com was turned into admin.ceph. for privacy's sake.

Best Regards,

Martin Johansen

On Mon, Oct 24, 2022 at 2:53 PM Murilo Morais  wrote:

> Hello Martin.
>
> Apparently cephadm is not able to resolve to `admin.ceph.`, check
> /etc/hosts or your DNS, try to ping and check if the IPs in `ceph orch host
> ls` are pinged and there is no packet loss.
>
> Try according to the documentation:
>
> https://docs.ceph.com/en/quincy/cephadm/operations/#cephadm-host-check-failed
>
> On Mon, Oct 24, 2022 at 09:23, Martin Johansen wrote:
>
>> Hi, I deployed a Ceph cluster a week ago and have started experiencing
>> warnings. Any pointers as to how to further debug or fix it? Here is info
>> about the warnings:
>>
>> # ceph version
>> ceph version 17.2.4 (1353ed37dec8d74973edc3d5d5908c20ad5a7332) quincy
>> (stable)
>>
>> # ceph status
>>   cluster:
>> id: 
>> health: HEALTH_WARN
>> 1 hosts fail cephadm check
>>
>>   services:
>> mon:5 daemons, quorum admin.ceph.,mon,osd1,osd2,osd3
>> (age 79m)
>> mgr:admin.ceph..wvhmky(active, since 2h), standbys:
>> mon.jzfopv
>> osd:4 osds: 4 up (since 3h), 4 in (since 3h)
>> rbd-mirror: 2 daemons active (2 hosts)
>> rgw:5 daemons active (5 hosts, 1 zones)
>>
>>   data:
>> pools:   9 pools, 226 pgs
>> objects: 736 objects, 1.4 GiB
>> usage:   7.3 GiB used, 2.0 TiB / 2.1 TiB avail
>> pgs: 226 active+clean
>>
>>   io:
>> client:   36 KiB/s rd, 19 KiB/s wr, 35 op/s rd, 26 op/s wr
>>
>> # journalctl -u ceph-@mgr.admin.ceph..wvhmky | grep
>> "cephadm ERROR"
>> Oct 19 13:45:08 admin.ceph. bash[4445]: debug
>> 2022-10-19T11:45:08.163+ 7fa7afabc700  0 [cephadm ERROR cephadm.ssh]
>> Unable to write
>>
>> admin.ceph.:/var/lib/ceph/1ea33904-4bad-11ed-b842-2d991d088396/config/ceph.conf:
>> Unable to reach remote host admin.ceph..
>> Oct 19 13:45:08 admin.ceph. bash[4445]: debug
>> 2022-10-19T11:45:08.167+ 7fa7bb2d3700  0 [cephadm ERROR cephadm.utils]
>> executing refresh((['admin.ceph.', 'mon.ceph.',
>> 'osd1.ceph.', 'osd2.ceph.', 'osd3.ceph.'],))
>> failed.
>> Oct 19 21:16:37 admin.ceph. bash[4445]: debug
>> 2022-10-19T19:16:37.504+ 7fa7ba2d1700  0 [cephadm ERROR cephadm.utils]
>> executing refresh((['admin.ceph.', 'mon.ceph.',
>> 'osd1.ceph.', 'osd2.ceph.', 'osd3.ceph.'],))
>> failed.
>> Oct 21 14:00:52 admin.ceph. bash[4445]: debug
>> 2022-10-21T12:00:52.035+ 7fa7afabc700  0 [cephadm ERROR cephadm.ssh]
>> Unable to write
>>
>> admin.ceph.:/var/lib/ceph/1ea33904-4bad-11ed-b842-2d991d088396/config/ceph.conf:
>> Unable to reach remote host admin.ceph..
>> Oct 21 14:00:52 admin.ceph. bash[4445]: debug
>> 2022-10-21T12:00:52.047+ 7fa7bc2d5700  0 [cephadm ERROR cephadm.utils]
>> executing refresh((['admin.ceph.', 'mon.ceph.',
>> 'osd1.ceph.', 'osd2.ceph.', 'osd3.ceph.'],))
>> failed.
>> Oct 21 14:25:04 admin.ceph. bash[4445]: debug
>> 2022-10-21T12:25:03.994+ 7fa7bc2d5700  0 [cephadm ERROR cephadm.utils]
>> executing refresh((['admin.ceph.', 'mon.ceph.',
>> 'osd1.ceph.', 'osd2.ceph.', 'osd3.ceph.'],))
>> failed.
>> Oct 21 16:03:48 admin.ceph. bash[4445]: debug
>> 2022-10-21T14:03:48.320+ 7fa7ba2d1700  0 [cephadm ERROR cephadm.utils]
>> executing refresh((['admin.ceph.', 'mon.ceph.',
>> 'osd1.ceph.', 'osd2.ceph.', 'osd3.ceph.'],))
>> failed.
>> Oct 22 06:26:17 admin.ceph. bash[4445]: debug
>> 2022-10-22T04:26:17.051+ 7fa7afabc700  0 [cephadm ERROR cephadm.ssh]
>> Unable to write admin.ceph.:/etc/ceph/ceph.client.admin.keyring:
>> Unable to reach remote host admin.ceph..
>> Oct 22 06:26:17 admin.ceph. bash[4445]: debug
>> 2022-10-22T04:26:17.055+ 7fa7b8ace700  0 [cephadm ERROR cephadm.utils]
>> executing refresh((['admin.ceph.', 'mon.ceph.',
>> 'osd1.ceph.', 'osd2.ceph.', 'osd3.ceph.'],))
>> failed.
>> ... Continues to this day
>>
>> # journalctl -u ceph-@mgr.admin.ceph..wvhmky | grep
>> "auth: could not find secret_id"
>> Oct 19 16:52:48 admin.ceph. bash[4445]: debug
>> 2022-10-19T14:52:48.789+ 7fa7f3120700  0 auth: could not find
>> secret_id=123
>> Oct 19 16:52:48 admin.ceph. bash[4445]: debug
>> 2022-10-19T14:52:48.989+ 7fa7f3120700  0 auth: could not find
>> secret_id=123
>> Oct 19 16:52:49 admin.ceph. bash[4445]: debug
>> 2022-10-19T14:52:49.393+ 7fa7f3120700  0 auth: could not find
>> secret_id=123
>> Oct 19 16:52:50 admin.ceph. bash[4445]: debug
>> 2022-10-19T14:52:50.197+ 7fa7f3120700  0 auth: could not find
>> secret_id=123
>> ... Continues to this day
>>
>> # journalctl -u ceph-@mgr.admin.ceph..wvhmky | grep "Is
>> a
>> directory"
>> Oct 24 11:12:53 admin.ceph. bash[4445]:
>> orchestrator._interface.OrchestratorError: Command ['rm', '-f',
>> '/etc/ceph/ceph.client.admin.keyring'] failed. rm: cannot remove
>> '/etc/ceph/ceph.client.admin.keyring': Is a directory
>> ... Continues to this day

[ceph-users] Re: Debug cluster warnings "CEPHADM_HOST_CHECK_FAILED", "CEPHADM_REFRESH_FAILED" etc

2022-10-24 Thread Murilo Morais
Hello Martin.

Apparently cephadm is not able to resolve `admin.ceph.`. Check
/etc/hosts or your DNS, try pinging the hosts, and verify that the IPs shown
in `ceph orch host ls` answer without packet loss.

Also work through the steps in the documentation:
https://docs.ceph.com/en/quincy/cephadm/operations/#cephadm-host-check-failed
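
A rough sketch of the basic checks, run from the host with the active mgr;
the host name below is a placeholder for your actual FQDN:

# which hosts does cephadm know about, and which are marked Offline?
$ ceph orch host ls
# basic reachability for a suspect host
$ ping -c 3 mon.ceph.example.com
# let cephadm run its own host checks (SSH, podman, time sync, ...)
$ ceph cephadm check-host mon.ceph.example.com
# or run the check locally on the suspect host itself
$ cephadm check-host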

On Mon, 24 Oct 2022 at 09:23, Martin Johansen
wrote:

> Hi, I deployed a Ceph cluster a week ago and have started experiencing
> warnings. Any pointers as to how to further debug or fix it? Here is info
> about the warnings:
>
> # ceph version
> ceph version 17.2.4 (1353ed37dec8d74973edc3d5d5908c20ad5a7332) quincy
> (stable)
>
> # ceph status
>   cluster:
> id: 
> health: HEALTH_WARN
> 1 hosts fail cephadm check
>
>   services:
> mon:5 daemons, quorum admin.ceph.,mon,osd1,osd2,osd3
> (age 79m)
> mgr:admin.ceph..wvhmky(active, since 2h), standbys:
> mon.jzfopv
> osd:4 osds: 4 up (since 3h), 4 in (since 3h)
> rbd-mirror: 2 daemons active (2 hosts)
> rgw:5 daemons active (5 hosts, 1 zones)
>
>   data:
> pools:   9 pools, 226 pgs
> objects: 736 objects, 1.4 GiB
> usage:   7.3 GiB used, 2.0 TiB / 2.1 TiB avail
> pgs: 226 active+clean
>
>   io:
> client:   36 KiB/s rd, 19 KiB/s wr, 35 op/s rd, 26 op/s wr
>
> # journalctl -u ceph-@mgr.admin.ceph..wvhmky | grep
> "cephadm ERROR"
> Oct 19 13:45:08 admin.ceph. bash[4445]: debug
> 2022-10-19T11:45:08.163+ 7fa7afabc700  0 [cephadm ERROR cephadm.ssh]
> Unable to write
>
> admin.ceph.:/var/lib/ceph/1ea33904-4bad-11ed-b842-2d991d088396/config/ceph.conf:
> Unable to reach remote host admin.ceph..
> Oct 19 13:45:08 admin.ceph. bash[4445]: debug
> 2022-10-19T11:45:08.167+ 7fa7bb2d3700  0 [cephadm ERROR cephadm.utils]
> executing refresh((['admin.ceph.', 'mon.ceph.',
> 'osd1.ceph.', 'osd2.ceph.', 'osd3.ceph.'],))
> failed.
> Oct 19 21:16:37 admin.ceph. bash[4445]: debug
> 2022-10-19T19:16:37.504+ 7fa7ba2d1700  0 [cephadm ERROR cephadm.utils]
> executing refresh((['admin.ceph.', 'mon.ceph.',
> 'osd1.ceph.', 'osd2.ceph.', 'osd3.ceph.'],))
> failed.
> Oct 21 14:00:52 admin.ceph. bash[4445]: debug
> 2022-10-21T12:00:52.035+ 7fa7afabc700  0 [cephadm ERROR cephadm.ssh]
> Unable to write
>
> admin.ceph.:/var/lib/ceph/1ea33904-4bad-11ed-b842-2d991d088396/config/ceph.conf:
> Unable to reach remote host admin.ceph..
> Oct 21 14:00:52 admin.ceph. bash[4445]: debug
> 2022-10-21T12:00:52.047+ 7fa7bc2d5700  0 [cephadm ERROR cephadm.utils]
> executing refresh((['admin.ceph.', 'mon.ceph.',
> 'osd1.ceph.', 'osd2.ceph.', 'osd3.ceph.'],))
> failed.
> Oct 21 14:25:04 admin.ceph. bash[4445]: debug
> 2022-10-21T12:25:03.994+ 7fa7bc2d5700  0 [cephadm ERROR cephadm.utils]
> executing refresh((['admin.ceph.', 'mon.ceph.',
> 'osd1.ceph.', 'osd2.ceph.', 'osd3.ceph.'],))
> failed.
> Oct 21 16:03:48 admin.ceph. bash[4445]: debug
> 2022-10-21T14:03:48.320+ 7fa7ba2d1700  0 [cephadm ERROR cephadm.utils]
> executing refresh((['admin.ceph.', 'mon.ceph.',
> 'osd1.ceph.', 'osd2.ceph.', 'osd3.ceph.'],))
> failed.
> Oct 22 06:26:17 admin.ceph. bash[4445]: debug
> 2022-10-22T04:26:17.051+ 7fa7afabc700  0 [cephadm ERROR cephadm.ssh]
> Unable to write admin.ceph.:/etc/ceph/ceph.client.admin.keyring:
> Unable to reach remote host admin.ceph..
> Oct 22 06:26:17 admin.ceph. bash[4445]: debug
> 2022-10-22T04:26:17.055+ 7fa7b8ace700  0 [cephadm ERROR cephadm.utils]
> executing refresh((['admin.ceph.', 'mon.ceph.',
> 'osd1.ceph.', 'osd2.ceph.', 'osd3.ceph.'],))
> failed.
> ... Continues to this day
>
> # journalctl -u ceph-@mgr.admin.ceph..wvhmky | grep
> "auth: could not find secret_id"
> Oct 19 16:52:48 admin.ceph. bash[4445]: debug
> 2022-10-19T14:52:48.789+ 7fa7f3120700  0 auth: could not find
> secret_id=123
> Oct 19 16:52:48 admin.ceph. bash[4445]: debug
> 2022-10-19T14:52:48.989+ 7fa7f3120700  0 auth: could not find
> secret_id=123
> Oct 19 16:52:49 admin.ceph. bash[4445]: debug
> 2022-10-19T14:52:49.393+ 7fa7f3120700  0 auth: could not find
> secret_id=123
> Oct 19 16:52:50 admin.ceph. bash[4445]: debug
> 2022-10-19T14:52:50.197+ 7fa7f3120700  0 auth: could not find
> secret_id=123
> ... Continues to this day
>
> # journalctl -u ceph-@mgr.admin.ceph..wvhmky | grep "Is a
> directory"
> Oct 24 11:12:53 admin.ceph. bash[4445]:
> orchestrator._interface.OrchestratorError: Command ['rm', '-f',
> '/etc/ceph/ceph.client.admin.keyring'] failed. rm: cannot remove
> '/etc/ceph/ceph.client.admin.keyring': Is a directory
> ... Continues to this day
>
> # ceph orch host ls
> HOST  ADDRLABELS  STATUS
> admin.ceph.  10.0.0.  _admin
> mon.ceph.10.0.0.  mon Offline
> osd1.ceph.   10.0.0.  osd1
> osd2.ceph.   10.0.0.  osd2Offline
> osd3.ceph.   10.0.0.  osd3
> osd4.ceph.   10.0.0.  osd4Offline
> 6 hosts in cluster
>
> Logs:
>
> 10/24/22 2:19:41 PM
> [INF]
> Cluster is now healthy
>
> 10/24/22 2:19:41 PM
> [INF]
> 

[ceph-users] Re: Understanding rbd objects, with snapshots

2022-10-24 Thread Maged Mokhtar


On 18/10/2022 01:24, Chris Dunlop wrote:

Hi,

Is there anywhere that describes exactly how rbd data (including 
snapshots) are stored within a pool?


I can see broadly how an rbd stores its data in rados objects in the 
pool, although the object map is opaque. But once an rbd snap is 
created and new data written to the rbd, where is the old data 
associated with the snap?


And/or how can I access the data from an rbd snapshot directly, e.g. 
using rados?


And, how can an object map be interpreted, i.e. what is the format?

I don't know if the snaps documentation here:

https://docs.ceph.com/en/latest/dev/osd_internals/snaps/

...is related to rbd snaps. Perhaps rbd snaps are "self managed snaps" 
requiring the use of a "SnapContext", but the rados man page doesn't 
mention this at all, so it's unclear what's going on.


Perhaps rbd snapshots simply can't be accessed directly with the 
current tools (other than actually mapping a snapshot)?


See below for some test explorations...


Hi Chris,

Snapshots are stored on the same OSD as the current object.
rbd snapshots are self-managed rather than rados pool-managed: the rbd 
client takes responsibility for passing the correct snapshot 
context to the OSDs in i/o operations via librados.
To create a snapshot, the rbd client first requests a unique snap id 
from the mons.
This id and the snap name are persisted in the rbd_header.xx object 
for the rbd image, added to the list of previous snaps, if any.


When the rbd client writes to an rbd_data.xx rados object, it passes the 
list of snaps along with the write.
The OSD looks at that snap list and performs a fair amount of logic, such 
as creating a clone and copying the original data into it before writing 
if it has not seen this snap before, or copying data if the snap did not 
yet cover this offset/extent, etc. The OSD keeps track of which snapshots 
it is storing for the object, their offset/extent within the object, and 
their physical location on the OSD block device, all in the rocksdb 
database. The physical locations on the block device can be far apart, 
allocated by the allocator from free space on the device. You can use 
ceph-objectstore-tool on the OSD to examine a snapshot's location and get 
its data.



When reading, the rbd client passes the snap id it wants to read, or the 
default id for head/current.
I do not believe you can use the rados get command on rbd_data.xx as you 
were doing to get snapshot data, even if you specify the snapshot 
parameter to the command, as I think that works with rados pool snapshots 
and not self-managed ones.
As a user, if you want to access rbd snap data, you can rbd map the 
snapshot and read from it via kernel rbd.
If you want to fiddle with reading snapshots at the rados level on 
rbd_data.xx, you can write a librados app that first reads the snap id 
from rbd_header.xx based on the snap name and then passes this id in the 
context to the librados read function.


/Maged
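
A rough sketch of the practical options mentioned above, using Chris's test
image; the OSD data path and the rbd_data object suffix are placeholders, and
the objectstore-tool commands require the OSD to be stopped:

# snapshot the image and read the snapshot back through a read-only krbd map
$ rbd snap create test/test1@snap1
$ dev=$(rbd device map test/test1@snap1)
$ dd if=$dev bs=4K count=1 2>/dev/null | od -t x1
$ rbd device unmap $dev

# at the rados level, listsnaps shows an object's clones and their snap ids
$ rados -p test listsnaps rbd_data.08ceb039ff1c19.<object-no>

# on the OSD host (OSD stopped), list the object and its snapshot clones,
# then extract a clone's bytes using the JSON spec printed by --op list
$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op list \
      rbd_data.08ceb039ff1c19.<object-no>
$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
      '<json-object-spec-from-list>' get-bytes /tmp/clone.bin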





Cheers,

Chris

--
##
## create a test rbd within a test pool
##
$ ceph osd pool create test
$ rbd create --size 10M --object-size 1M "test/test1"
$ rbd info test/test1
rbd image 'test1':
    size 10 MiB in 10 objects
    order 20 (1 MiB objects)
    snapshot_count: 0
    id: 08ceb039ff1c19
    block_name_prefix: rbd_data.08ceb039ff1c19
    format: 2
    features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
    op_features:
    flags:
    create_timestamp: Tue Oct 18 09:35:49 2022
    access_timestamp: Tue Oct 18 09:35:49 2022
    modify_timestamp: Tue Oct 18 09:35:49 2022
$ rados -p test ls --all
    rbd_directory
    rbd_info
    rbd_object_map.08ceb039ff1c19
    rbd_header.08ceb039ff1c19
    rbd_id.test1
#
# "clean" object map - but no idea what the contents mean
#
$ rados -p test get rbd_object_map.08ceb039ff1c19 - | od -t x1 > 
/tmp/om.clean; cat /tmp/om.clean

000 0e 00 00 00 01 01 08 00 00 00 0a 00 00 00 00 00
020 00 00 00 00 00 0c 00 00 00 c6 44 f4 3a 01 00 00
040 00 00 00 00 00
045

##
## Write to the rbd
## - confirm data appears in rbd_data.xxx object
## - rbd_object_map changes
##
$ dev=$(rbd device map "test/test1"); declare -p dev
declare -- dev="/dev/rbd0"
$ printf '1' > $dev
#
# rdb_data object appears
#
$ rados -p test ls --all | sort
    rbd_data.08ceb039ff1c19.
    rbd_directory
    rbd_header.08ceb039ff1c19
    rbd_id.test1
    rbd_info
    rbd_object_map.08ceb039ff1c19
#
# new rbd_data contains our written data
#
$ rados -p test get rbd_data.08ceb039ff1c19. - | od -t x1
000 31 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
001
#
# rbd_object_map is updated
#
$ rados -p test get rbd_object_map.08ceb039ff1c19 - | od -t x1 > 
/tmp/om.head.1; cat /tmp/om.head.1

000 0e 00 00 00 01 01 08 00 00 00 0a 00 00 00 00 00
020 00 00 40 00 00 0c 

[ceph-users] Re: rgw multisite octopus - bucket can not be resharded after cancelling prior reshard process

2022-10-24 Thread Boris Behrens
Cheers again.
I am still stuck at this. Someone got an idea how to fix it?

Am Fr., 7. Okt. 2022 um 11:30 Uhr schrieb Boris Behrens :

> Hi,
> I just wanted to reshard a bucket but mistyped the number of shards. In a
> reflex I hit ctrl-c and waited. It looked like the resharding did not
> finish, so I cancelled it, and now the bucket is in this state.
> How can I fix it? It does not show up in the stale-instances list. It's also
> a multisite environment (we only sync metadata).
>
> $ radosgw-admin reshard status --bucket bucket
> [
> {
> "reshard_status": "not-resharding",
> "new_bucket_instance_id": "",
> "num_shards": -1
> }
> ]
>
> $ radosgw-admin bucket stats --bucket bucket
> {
> "bucket": "bucket",
> *"num_shards": 0,*
> ...
> *"id": "ff7a8b0c-07e6-463a-861b-78f0adeba8ad.2296333939.14",*
> "marker": "ff7a8b0c-07e6-463a-861b-78f0adeba8ad.2296333939.14",
> ...
> }
>
> $ radosgw-admin metadata get
> bucket.instance:bucket:ff7a8b0c-07e6-463a-861b-78f0adeba8ad.2368407345.1
> {
> "key":
> "bucket.instance:bucket:ff7a8b0c-07e6-463a-861b-78f0adeba8ad.2368407345.1",
> "ver": {
> "tag": "QndcbsKPFDjs6rYKKDHde9bM",
> "ver": 2
> },
> "mtime": "2022-10-07T07:16:49.231685Z",
> "data": {
> "bucket_info": {
> "bucket": {
> "name": "bucket",
> "marker":
> "ff7a8b0c-07e6-463a-861b-78f0adeba8ad.2296333939.14",
> *"bucket_id":
> "ff7a8b0c-07e6-463a-861b-78f0adeba8ad.2368407345.1",*
> ...
> },
> ...
> *"num_shards": 211,*
> ...
> },
> }
>
>
> Cheers
>  Boris
>
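
For reference, the stale-instance and reshard commands involved look roughly
like this (a sketch only; the bucket name is a placeholder, mind the multisite
caveats in the docs, and subcommand names can vary slightly between releases):

# any leftover bucket instances from the aborted reshard?
$ radosgw-admin reshard stale-instances list
$ radosgw-admin reshard stale-instances rm
# make sure nothing is still queued or running, then retry with the intended count
$ radosgw-admin reshard list
$ radosgw-admin reshard cancel --bucket bucket
$ radosgw-admin bucket reshard --bucket bucket --num-shards 211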


-- 
The self-help group "UTF-8 problems" is meeting this time, as an exception, in
the large hall.


[ceph-users] Debug cluster warnings "CEPHADM_HOST_CHECK_FAILED", "CEPHADM_REFRESH_FAILED" etc

2022-10-24 Thread Martin Johansen
Hi, I deployed a Ceph cluster a week ago and have started experiencing
warnings. Any pointers as to how to further debug or fix it? Here is info
about the warnings:

# ceph version
ceph version 17.2.4 (1353ed37dec8d74973edc3d5d5908c20ad5a7332) quincy
(stable)

# ceph status
  cluster:
id: 
health: HEALTH_WARN
1 hosts fail cephadm check

  services:
mon:5 daemons, quorum admin.ceph.,mon,osd1,osd2,osd3
(age 79m)
mgr:admin.ceph..wvhmky(active, since 2h), standbys:
mon.jzfopv
osd:4 osds: 4 up (since 3h), 4 in (since 3h)
rbd-mirror: 2 daemons active (2 hosts)
rgw:5 daemons active (5 hosts, 1 zones)

  data:
pools:   9 pools, 226 pgs
objects: 736 objects, 1.4 GiB
usage:   7.3 GiB used, 2.0 TiB / 2.1 TiB avail
pgs: 226 active+clean

  io:
client:   36 KiB/s rd, 19 KiB/s wr, 35 op/s rd, 26 op/s wr

# journalctl -u ceph-@mgr.admin.ceph..wvhmky | grep
"cephadm ERROR"
Oct 19 13:45:08 admin.ceph. bash[4445]: debug
2022-10-19T11:45:08.163+ 7fa7afabc700  0 [cephadm ERROR cephadm.ssh]
Unable to write
admin.ceph.:/var/lib/ceph/1ea33904-4bad-11ed-b842-2d991d088396/config/ceph.conf:
Unable to reach remote host admin.ceph..
Oct 19 13:45:08 admin.ceph. bash[4445]: debug
2022-10-19T11:45:08.167+ 7fa7bb2d3700  0 [cephadm ERROR cephadm.utils]
executing refresh((['admin.ceph.', 'mon.ceph.',
'osd1.ceph.', 'osd2.ceph.', 'osd3.ceph.'],))
failed.
Oct 19 21:16:37 admin.ceph. bash[4445]: debug
2022-10-19T19:16:37.504+ 7fa7ba2d1700  0 [cephadm ERROR cephadm.utils]
executing refresh((['admin.ceph.', 'mon.ceph.',
'osd1.ceph.', 'osd2.ceph.', 'osd3.ceph.'],))
failed.
Oct 21 14:00:52 admin.ceph. bash[4445]: debug
2022-10-21T12:00:52.035+ 7fa7afabc700  0 [cephadm ERROR cephadm.ssh]
Unable to write
admin.ceph.:/var/lib/ceph/1ea33904-4bad-11ed-b842-2d991d088396/config/ceph.conf:
Unable to reach remote host admin.ceph..
Oct 21 14:00:52 admin.ceph. bash[4445]: debug
2022-10-21T12:00:52.047+ 7fa7bc2d5700  0 [cephadm ERROR cephadm.utils]
executing refresh((['admin.ceph.', 'mon.ceph.',
'osd1.ceph.', 'osd2.ceph.', 'osd3.ceph.'],))
failed.
Oct 21 14:25:04 admin.ceph. bash[4445]: debug
2022-10-21T12:25:03.994+ 7fa7bc2d5700  0 [cephadm ERROR cephadm.utils]
executing refresh((['admin.ceph.', 'mon.ceph.',
'osd1.ceph.', 'osd2.ceph.', 'osd3.ceph.'],))
failed.
Oct 21 16:03:48 admin.ceph. bash[4445]: debug
2022-10-21T14:03:48.320+ 7fa7ba2d1700  0 [cephadm ERROR cephadm.utils]
executing refresh((['admin.ceph.', 'mon.ceph.',
'osd1.ceph.', 'osd2.ceph.', 'osd3.ceph.'],))
failed.
Oct 22 06:26:17 admin.ceph. bash[4445]: debug
2022-10-22T04:26:17.051+ 7fa7afabc700  0 [cephadm ERROR cephadm.ssh]
Unable to write admin.ceph.:/etc/ceph/ceph.client.admin.keyring:
Unable to reach remote host admin.ceph..
Oct 22 06:26:17 admin.ceph. bash[4445]: debug
2022-10-22T04:26:17.055+ 7fa7b8ace700  0 [cephadm ERROR cephadm.utils]
executing refresh((['admin.ceph.', 'mon.ceph.',
'osd1.ceph.', 'osd2.ceph.', 'osd3.ceph.'],))
failed.
... Continues to this day

# journalctl -u ceph-@mgr.admin.ceph..wvhmky | grep
"auth: could not find secret_id"
Oct 19 16:52:48 admin.ceph. bash[4445]: debug
2022-10-19T14:52:48.789+ 7fa7f3120700  0 auth: could not find
secret_id=123
Oct 19 16:52:48 admin.ceph. bash[4445]: debug
2022-10-19T14:52:48.989+ 7fa7f3120700  0 auth: could not find
secret_id=123
Oct 19 16:52:49 admin.ceph. bash[4445]: debug
2022-10-19T14:52:49.393+ 7fa7f3120700  0 auth: could not find
secret_id=123
Oct 19 16:52:50 admin.ceph. bash[4445]: debug
2022-10-19T14:52:50.197+ 7fa7f3120700  0 auth: could not find
secret_id=123
... Continues to this day

# journalctl -u ceph-@mgr.admin.ceph..wvhmky | grep "Is a
directory"
Oct 24 11:12:53 admin.ceph. bash[4445]:
orchestrator._interface.OrchestratorError: Command ['rm', '-f',
'/etc/ceph/ceph.client.admin.keyring'] failed. rm: cannot remove
'/etc/ceph/ceph.client.admin.keyring': Is a directory
... Continues to this day

# ceph orch host ls
HOST  ADDRLABELS  STATUS
admin.ceph.  10.0.0.  _admin
mon.ceph.10.0.0.  mon Offline
osd1.ceph.   10.0.0.  osd1
osd2.ceph.   10.0.0.  osd2Offline
osd3.ceph.   10.0.0.  osd3
osd4.ceph.   10.0.0.  osd4Offline
6 hosts in cluster

Logs:

10/24/22 2:19:41 PM
[INF]
Cluster is now healthy

10/24/22 2:19:41 PM
[INF]
Health check cleared: CEPHADM_REFRESH_FAILED (was: failed to probe daemons
or devices)

10/24/22 2:18:33 PM
[WRN]
Health check failed: failed to probe daemons or devices
(CEPHADM_REFRESH_FAILED)

10/24/22 2:15:24 PM
[INF]
Cluster is now healthy

10/24/22 2:15:24 PM
[INF]
Health check cleared: CEPHADM_REFRESH_FAILED (was: failed to probe daemons
or devices)

10/24/22 2:15:24 PM
[INF]
Health check cleared: CEPHADM_HOST_CHECK_FAILED (was: 1 hosts fail cephadm
check)

10/24/22 2:13:10 PM
[WRN]
Health check failed: failed to probe daemons or devices
(CEPHADM_REFRESH_FAILED)

10/24/22 2:11:55 PM
[WRN]
Health check failed: 1 

[ceph-users] MGR process regularly not responding

2022-10-24 Thread Gilles Mocellin

Hi,

In our Ceph Pacific clusters (16.2.10) (one for OpenStack and S3, two for 
backup on RBD and S3),
since the upgrade to Pacific the MGR regularly stops responding and is no 
longer seen in ceph status.

The process is still there.
Nothing in the MGR log, just no more logs.

Restarting the service makes it come back.

When all MGRs are down, we get a warning in ceph status, but not before.

I can't find a similar bug in the Tracker.

Does anyone else see this symptom?
Do you have a workaround or solution?
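
A minimal sketch of the restart/failover workaround described above (the
systemd unit name is a cephadm-style placeholder; adjust for your deployment):

# see which mgr is currently active and which standbys exist
$ ceph mgr stat
# force a failover to a standby instead of restarting the stuck daemon
$ ceph mgr fail
# or restart the daemon itself on its host
$ systemctl restart ceph-<fsid>@mgr.<hostname>.<id>.service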


[ceph-users] Failed to probe daemons or devices

2022-10-24 Thread Sake Paulusma
Last Friday I upgraded the Ceph cluster from 17.2.3 to 17.2.5 with "ceph orch 
upgrade start --image 
localcontainerregistry.local.com:5000/ceph/ceph:v17.2.5-20221017". After 
some time, an hour perhaps, I got a health warning: CEPHADM_REFRESH_FAILED: failed 
to probe daemons or devices. I'm only using CephFS on the cluster and it's 
still working correctly.
Checking the running services, everything is up and running: mon, osd and mds. 
But on the hosts running mon and mds services I get errors in cephadm.log; 
see the log lines below.

It looks like cephadm tries to start a container to check something. What 
could be wrong?
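
In case it helps to reproduce this outside the mgr, the failing step can be
run by hand on an affected host (a sketch; the traceback below points at
ceph-volume's LVM volume-group listing):

# run the same inventory call cephadm launches in a container
$ cephadm ceph-volume -- inventory --format=json-pretty
# compare with what LVM itself reports; an odd or partial VG entry here is
# the kind of thing get_all_devices_vgs() would trip over
$ vgs -o vg_name,pv_count,lv_count,vg_size,vg_free
$ pvs -o pv_name,vg_name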


On mon nodes I got the following:
2022-10-24 10:31:43,880 7f179e5bfb80 DEBUG 

cephadm ['gather-facts']
2022-10-24 10:31:44,333 7fc2d52b6b80 DEBUG 

cephadm ['--image', 
'localcontainerregistry.local.com:5000/ceph/ceph@sha256:122436e2f1df0c803666c5591db4a9b6c9196a71b4d44c6bd5d18102509dfca0',
 'ceph-volume', '--fsid', '8909ef90-22ea-11ed-8df1-0050569ee1b1', '--', 
'inventory', '--format=json-pretty', '--filter-for-batch']
2022-10-24 10:31:44,663 7fc2d52b6b80 INFO Inferring config 
/var/lib/ceph/8909ef90-22ea-11ed-8df1-0050569ee1b1/mon.oqsoel24332/config
2022-10-24 10:31:44,663 7fc2d52b6b80 DEBUG Using specified fsid: 
8909ef90-22ea-11ed-8df1-0050569ee1b1
2022-10-24 10:31:45,574 7fc2d52b6b80 INFO Non-zero exit code 1 from /bin/podman 
run --rm --ipc=host --stop-signal=SIGTERM --authfile=/etc/ceph/podman-auth.json 
--net=host --entrypoint /usr/sbin/ceph-volume --privileged --group-add=disk 
--init -e 
CONTAINER_IMAGE=localcontainerregistry.local.com:5000/ceph/ceph@sha256:122436e2f1df0c803666c5591db4a9b6c9196a71b4d44c6bd5d18102509dfca0
 -e NODE_NAME=monnode2.local.com -e CEPH_USE_RANDOM_NONCE=1 -e 
CEPH_VOLUME_SKIP_RESTORECON=yes -e CEPH_VOLUME_DEBUG=1 -v 
/var/run/ceph/8909ef90-22ea-11ed-8df1-0050569ee1b1:/var/run/ceph:z -v 
/var/log/ceph/8909ef90-22ea-11ed-8df1-0050569ee1b1:/var/log/ceph:z -v 
/var/lib/ceph/8909ef90-22ea-11ed-8df1-0050569ee1b1/crash:/var/lib/ceph/crash:z 
-v /run/systemd/journal:/run/systemd/journal -v /dev:/dev -v 
/run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v 
/run/lock/lvm:/run/lock/lvm -v 
/var/lib/ceph/8909ef90-22ea-11ed-8df1-0050569ee1b1/selinux:/sys/fs/selinux:ro 
-v /:/rootfs -v /tmp/ceph-tmp31tx1iy2:/etc/ceph/ce
 ph.conf:z 
localcontainerregistry.local.com:5000/ceph/ceph@sha256:122436e2f1df0c803666c5591db4a9b6c9196a71b4d44c6bd5d18102509dfca0
 inventory --format=json-pretty --filter-for-batch
2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr Traceback (most 
recent call last):
2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr File 
"/usr/sbin/ceph-volume", line 11, in 
2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr 
load_entry_point('ceph-volume==1.0.0', 'console_scripts', 'ceph-volume')()
2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr File 
"/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 41, in __init__
2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr 
self.main(self.argv)
2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr File 
"/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in 
newfunc
2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr return f(*a, **kw)
2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr File 
"/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 153, in main
2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr 
terminal.dispatch(self.mapper, subcommand_args)
2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr File 
"/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in 
dispatch
2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr instance.main()
2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr File 
"/usr/lib/python3.6/site-packages/ceph_volume/inventory/main.py", line 53, in 
main
2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr 
with_lsm=self.args.with_lsm))
2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr File 
"/usr/lib/python3.6/site-packages/ceph_volume/util/device.py", line 39, in 
__init__
2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr all_devices_vgs = 
lvm.get_all_devices_vgs()
2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr File 
"/usr/lib/python3.6/site-packages/ceph_volume/api/lvm.py", line 797, in 
get_all_devices_vgs
2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr return 
[VolumeGroup(**vg) for vg in vgs]
2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr File 
"/usr/lib/python3.6/site-packages/ceph_volume/api/lvm.py", line 797, in 

2022-10-24 10:31:45,575 7fc2d52b6b80 INFO /bin/podman: stderr return 
[VolumeGroup(**vg) for vg in vgs]
2022-10-24 10:31:45,575 7fc2d52b6b80 INFO