[ceph-users] Re: Ceph recovery network speed

2022-06-29 Thread Curt
On Wed, Jun 29, 2022 at 9:55 PM Stefan Kooman  wrote:

> On 6/29/22 19:34, Curt wrote:
> > Hi Stefan,
> >
> > Thank you, that definitely helped. I bumped it to 20% for now and that's
> > giving me around 124 PGs backfilling at 187 MiB/s, 47 Objects/s.  I'll
> > see how that runs and then increase it a bit more if the cluster handles
> > it ok.
> >
> > Do you think it's worth enabling scrubbing while backfilling?
>
> If the cluster can cope with the extra load, sure. If it slows down the
> backfilling to levels that are too slow ... temporarily disable it.
>
> Since
> > this is going to take a while. I do have 1 inconsistent PG that has now
> > become 10 as it splits.
>
> Hmm. Well, if it finds broken PGs, for sure pause backfilling (ceph osd
> set nobackfill) and have it handle this ASAP: ceph pg repair $pg.
> Something is wrong, and you want to have this fixed sooner rather than
> later.
>

 When I try to run a repair, nothing happens. If I try to list
inconsistent-obj, I get "No scrub information available for 12.12". If I tell
it to run a deep scrub, nothing. I'll set debug and see what I can find in
the logs.
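
For reference, this is roughly what I'm trying (a sketch; osd.28 being the primary
for pg 12.12 is taken from the earlier health detail output):

ceph pg deep-scrub 12.12
ceph pg repair 12.12
# raise logging on the primary OSD to see why nothing gets scheduled
ceph tell osd.28 config set debug_osd 10
# and lower it again afterwards
ceph tell osd.28 config set debug_osd 1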

>
> Not sure what hardware you have, but you might benefit from disabling
> write caches, see this link:
>
> https://docs.ceph.com/en/quincy/start/hardware-recommendations/#write-caches
>
> Gr. Stefan
>
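
For reference, disabling a drive's volatile write cache as described in the linked
docs looks roughly like this (a sketch; /dev/sdX is a placeholder, SAS/NVMe devices
need sdparm/nvme-cli instead, and the change is not persistent without a udev rule):

hdparm -W /dev/sdX     # show current write-cache state
hdparm -W 0 /dev/sdX   # disable the volatile write cache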


[ceph-users] Re: Ceph mon cannot join to cluster during upgrade

2022-06-29 Thread Iban Cabrillo
Hi Eugen,
   There are only ceph-mgr and ceph-mon on this node (working fine for years
with versions <14).

Jun 29 16:08:42 cephmon03 systemd: ceph-mon@cephmon03.service failed.
Jun 29 16:16:36 cephmon03 kernel: ceph-mon[7498]: segfault at 8 ip 7fa4c2e75ed7 sp 7ffee88e3730 error 4 in libceph-common.so.0[7fa4c2b97000+9b8000]
Jun 29 16:19:04 cephmon03 systemd: Reloading.
Jun 29 16:19:05 cephmon03 systemd: [/usr/lib/systemd/system/ceph-mgr@.service:15] Unknown lvalue 'LockPersonality' in section 'Service'
Jun 29 16:19:05 cephmon03 systemd: [/usr/lib/systemd/system/ceph-mgr@.service:18] Unknown lvalue 'MemoryDenyWriteExecute' in section 'Service'
Jun 29 16:19:05 cephmon03 systemd: [/usr/lib/systemd/system/ceph-mgr@.service:21] Unknown lvalue 'ProtectControlGroups' in section 'Service'
Jun 29 16:19:05 cephmon03 systemd: [/usr/lib/systemd/system/ceph-mgr@.service:23] Unknown lvalue 'ProtectKernelModules' in section 'Service'
Jun 29 16:19:05 cephmon03 systemd: [/usr/lib/systemd/system/ceph-mgr@.service:24] Unknown lvalue 'ProtectKernelTunables' in section 'Service'
Jun 29 16:19:05 cephmon03 systemd: [/usr/lib/systemd/system/ceph-mon@.service:19] Unknown lvalue 'LockPersonality' in section 'Service'
Jun 29 16:19:05 cephmon03 systemd: [/usr/lib/systemd/system/ceph-mon@.service:21] Unknown lvalue 'MemoryDenyWriteExecute' in section 'Service'
Jun 29 16:19:05 cephmon03 systemd: [/usr/lib/systemd/system/ceph-mon@.service:25] Unknown lvalue 'ProtectControlGroups' in section 'Service'
Jun 29 16:19:05 cephmon03 systemd: [/usr/lib/systemd/system/ceph-mon@.service:27] Unknown lvalue 'ProtectKernelModules' in section 'Service'
Jun 29 16:19:05 cephmon03 systemd: [/usr/lib/systemd/system/ceph-mon@.service:28] Unknown lvalue 'ProtectKernelTunables' in section 'Service'

I thought this was related to this bug https://tracker.ceph.com/issues/50997,
but the trick didn't work for me:

  #MemoryDenyWriteExecute=true
  MemoryDenyWriteExecute=false
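
For completeness, the same change can be applied with a systemd drop-in instead of
editing the packaged unit file (a sketch):

systemctl edit ceph-mon@.service
# add in the editor:
#   [Service]
#   MemoryDenyWriteExecute=false
systemctl daemon-reload
systemctl restart ceph-mon@cephmon03.service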

Running manually:

 /usr/bin/ceph-mon -f --cluster ceph --id cephmon03 --setuser ceph --setgroup ceph

I see the ceph-mon process start to consume the whole swap:

src/central_freelist.cc:333] tcmalloc: allocation failed 8192
*** Caught signal (Segmentation fault) **
 in thread 7fb892cfd1c0 thread_name:ceph-mon
 ceph version 14.2.22 (ca74598065096e6fcbd8433c8779a2be0c889351) nautilus (stable)
 1: (()+0xf630) [0x7fb886d2d630]
 2: (std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator const&)+0x59) [0x55ea4086f709]
 3: (std::string::_M_mutate(unsigned long, unsigned long, unsigned long)+0x6b) [0x55ea40870dcb]
 4: (std::string::assign(char const*, unsigned long)+0x55) [0x55ea40870fb5]
 5: (()+0x2aa5c) [0x7fb88949ea5c]
..
 The mons run on CentOS 7.5 machines.

Regards, I




[ceph-users] Re: Ceph recovery network speed

2022-06-29 Thread Curt
Hi Stefan,

Thank you, that definitely helped. I bumped it to 20% for now and that's
giving me around 124 PGs backfilling at 187 MiB/s, 47 Objects/s.  I'll see
how that runs and then increase it a bit more if the cluster handles it ok.

Do you think it's worth enabling scrubbing while backfilling, since this
is going to take a while?  I do have 1 inconsistent PG that has now become
10 as it splits.

ceph health detail
HEALTH_ERR 21 scrub errors; Possible data damage: 10 pgs inconsistent; 2
pgs not deep-scrubbed in time
[ERR] OSD_SCRUB_ERRORS: 21 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 10 pgs inconsistent
pg 12.12 is active+clean+inconsistent, acting [28,1,37,0]
pg 12.32 is active+clean+inconsistent, acting [37,3,14,22]
pg 12.52 is active+clean+inconsistent, acting [4,33,7,23]
pg 12.72 is active+remapped+inconsistent+backfilling, acting
[37,3,14,22]
pg 12.92 is active+remapped+inconsistent+backfilling, acting [28,1,37,0]
pg 12.b2 is active+remapped+inconsistent+backfilling, acting
[37,3,14,22]
pg 12.d2 is active+clean+inconsistent, acting [4,33,7,23]
pg 12.f2 is active+remapped+inconsistent+backfilling, acting
[37,3,14,22]
pg 12.112 is active+clean+inconsistent, acting [28,1,37,0]
pg 12.132 is active+clean+inconsistent, acting [37,3,14,22]
[WRN] PG_NOT_DEEP_SCRUBBED: 2 pgs not deep-scrubbed in time
pg 4.13 not deep-scrubbed since 2022-06-16T03:15:16.758943+
pg 7.1 not deep-scrubbed since 2022-06-16T20:51:12.211259+

Thanks,
Curt

On Wed, Jun 29, 2022 at 5:53 PM Stefan Kooman  wrote:

> On 6/29/22 15:14, Curt wrote:
>
>
> >
> > Hi Stefan,
> >
> > Good to know.  I see the default is .05 for misplaced_ratio.  What do
> > you recommend would be a safe number to increase it to?
>
> It depends. It might be safe to put it to 1. But I would slowly increase
> it, have the manager increase pgp_num and see how the cluster copes with
> the increased load. If you have hardly any client workload you might
> bump this ratio quite a bit. At some point you would need to increase
> osd max backfill to avoid having PGs waiting on backfill.
>
> Gr. Stefan
>
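
For reference, the knobs mentioned above can be adjusted roughly like this (a sketch;
the values are only illustrations, 0.2 matching the 20% mentioned earlier):

ceph config set mgr target_max_misplaced_ratio 0.2
ceph config set osd osd_max_backfills 3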


[ceph-users] Re: ceph nfs-ganesha - Unable to mount Ceph cluster

2022-06-29 Thread Robert Sander

On 29.06.22 at 18:23, Wyll Ingersoll wrote:

> If I manually create the directory prior to applying the export spec, it does work.

I think that's the way to go.

> But it seems that ganesha is trying to create it for me so I'm wondering how to make that work.


The orchestrator creates one cephx key per NFS export.

Ganesha gets a cephx key that is limited to the directory it should export.

It cannot create the directory itself because then it would need to have 
permissions in the directory above.
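
In practice that means creating the directory with a client that does have sufficient
permissions before applying the export, roughly like this (a sketch, assuming an admin
keyring and a kernel CephFS mount; paths match the /shared export discussed here):

mount -t ceph :/ /mnt/cephfs -o name=admin
mkdir -p /mnt/cephfs/shared
umount /mnt/cephfs
ceph nfs export apply ceph -i export.json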


Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin

http://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Zwangsangaben lt. §35a GmbHG:
HRB 220009 B / Amtsgericht Berlin-Charlottenburg,
Geschäftsführer: Peer Heinlein -- Sitz: Berlin


[ceph-users] ceph nfs-ganesha - Unable to mount Ceph cluster

2022-06-29 Thread Wyll Ingersoll


[ceph pacific 16.2.9]

When creating a NFS export using "ceph nfs export apply ... -i export.json" for 
a subdirectory of /cephfs, does the subdir that you wish to export need to be 
pre-created or will ceph (or ganesha) create it for you?

I'm trying to create an "/shared" directory in a cephfs tree and export it 
using a JSON spec file, but the nfs-ganesha log file shows errors because it 
cannot mount or create the desired directory in cephfs.  If I manually create 
the directory prior to applying the export spec, it does work.  But it seems 
that ganesha is trying to create it for me so I'm wondering how to make that 
work.




29/06/2022 16:14:54 : epoch 62bc67c3 : foobar : ganesha.nfsd-6[sigmgr] 
create_export :FSAL :CRIT :Unable to mount Ceph cluster for /shared.

29/06/2022 16:14:54 : epoch 62bc67c3 : foobar : ganesha.nfsd-6[sigmgr] 
mdcache_fsal_create_export :FSAL :MAJ :Failed to call create_export on 
underlying FSAL Ceph

29/06/2022 16:14:54 : epoch 62bc67c3 : foobar : ganesha.nfsd-6[sigmgr] 
fsal_cfg_commit :CONFIG :CRIT :Could not create export for (/shared) to 
(/shared)

The JSON spec used looks like:


{
  "export_id": 2,
  "transports": [ "TCP" ],
  "cluster_id": "ceph",
  "path": "/shared",
  "pseudo": "/shared",
  "protocols": [4],
  "access_type": "RW",
  "squash": "no_root_squash",
  "fsal": {
    "name": "CEPH",
    "user_id": "nfs.ceph.2",
    "fs_name": "cephfs"
  }
}


thanks,
   Wyllys Ingersoll



[ceph-users] Re: [ext] Re: cephadm orch thinks hosts are offline

2022-06-29 Thread Kuhring, Mathias
Hey all,

just want to note that I'm also looking for some kind of way to
restart/reset/refresh the orchestrator.
But in my case it's not the hosts but the services which are presumably 
wrongly reported and outdated:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/NHEVEM3ESJYXZ4LPJ24BBCK6NCG4QRHP/

I don't know if this can even be related.
But in case you find a solution, I'll just stick around here and check 
if I can apply it.

Best,
Mathias

On 6/27/2022 12:33 PM, Thomas Roth wrote:
> Hi Adam,
>
> no, this is the 'feature' where the reboot of a mgr host causes all
> known hosts to become unmanaged.
>
>
> > # lxbk0375 # ceph cephadm check-host lxbk0374 10.20.2.161
> > mgr.server reply reply (1) Operation not permitted check-host failed:
> > Host 'lxbk0374' not found. Use 'ceph orch host ls' to see all 
> managed hosts.
>
> In some email on this issue that I can't find atm, someone describes a
> workaround that allows restarting the entire orchestrator business.
> But that sounded risky.
>
> Regards
> Thomas
>
>
> On 23/06/2022 19.42, Adam King wrote:
>> Hi Thomas,
>>
>> What happens if you run "ceph cephadm check-host <hostname>" for one of the
>> hosts that is offline (and if that fails "ceph cephadm check-host
>> <hostname> <addr>")? Usually, the hosts get marked offline when
>> some ssh
>> connection to them fails. The check-host command will attempt a 
>> connection
>> and maybe let us see why it's failing, or, if there is no longer an 
>> issue
>> connecting to the host, should mark the host online again.
>>
>> Thanks,
>>    - Adam King
>>
>> On Thu, Jun 23, 2022 at 12:30 PM Thomas Roth  wrote:
>>
>>> Hi all,
>>>
>>> found this bug https://tracker.ceph.com/issues/51629  (Octopus 
>>> 15.2.13),
>>> reproduced it in Pacific and
>>> now again in Quincy:
>>> - new cluster
>>> - 3 mgr nodes
>>> - reboot active mgr node
> >>> - (only in Quincy:) standby mgr node takes over, rebooted node becomes
>>> standby
>>> - `ceph orch host ls` shows all hosts as `offline`
>>> - add a new host: not offline
>>>
>>> In my setup, hostnames and IPs are well known, thus
>>>
>>> # ceph orch host ls
>>> HOST  ADDR LABELS  STATUS
>>> lxbk0374  10.20.2.161  _admin  Offline
>>> lxbk0375  10.20.2.162  Offline
>>> lxbk0376  10.20.2.163  Offline
>>> lxbk0377  10.20.2.164  Offline
>>> lxbk0378  10.20.2.165  Offline
>>> lxfs416   10.20.2.178  Offline
>>> lxfs417   10.20.2.179  Offline
>>> lxfs418   10.20.2.222  Offline
>>> lxmds22   10.20.6.67
>>> lxmds23   10.20.6.72   Offline
>>> lxmds24   10.20.6.74   Offline
>>>
>>>
>>> (All lxbk are mon nodes, the first 3 are mgr, 'lxmds22' was added after
>>> the fatal reboot.)
>>>
>>>
>>> Does this matter at all?
>>> The old bug report is one year old, now with prio 'Low'. And some 
>>> people
>>> must have rebooted the one or
>>> other host in their clusters...
>>>
>>> There is a cephfs on our cluster, operations seem to be unaffected.
>>>
>>>
>>> Cheers
>>> Thomas
>>>
>>> -- 
>>> 
>>> Thomas Roth
>>> Department: Informationstechnologie
>>> Location: SB3 2.291
>>>
>>>
>>> GSI Helmholtzzentrum für Schwerionenforschung GmbH
>>> Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de
>>>
>>> Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
>>> Managing Directors / Geschäftsführung:
>>> Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock
>>> Chairman of the Supervisory Board / Vorsitzender des GSI-Aufsichtsrats:
>>> State Secretary / Staatssekretär Dr. Volkmar Dietz
>>>
>>>
>>
>
-- 
Mathias Kuhring

Dr. rer. nat.
Bioinformatician
HPC & Core Unit Bioinformatics
Berlin Institute of Health at Charité (BIH)

E-Mail:  mathias.kuhr...@bih-charite.de
Mobile: +49 172 3475576



[ceph-users] Re: Ceph mon cannot join to cluster during upgrade

2022-06-29 Thread Eugen Block
The log output you pasted suggests that the OOM killer is responsible
for the failure. Can you confirm that? Are there other services located on
that node that use too much RAM?


Quoting Iban Cabrillo:


Hi Guys,
I am in the upgrade process from Mimic to Nautilus.
The first step was to upgrade one cephmon, but after that this
cephmon cannot rejoin the cluster. I see this in the logs:


2022-06-29 15:54:48.200 7fd3d015f1c0 0 ceph version 14.2.22  
(ca74598065096e6fcbd8433c8779a2be0c889351) nautilus (stable),  
process ceph-mon, pid 6121

2022-06-29 15:54:48.206 7fd3d015f1c0 0 pidfile_write: ignore empty --pid-file
2022-06-29 15:54:48.339 7fd3d015f1c0 0 load: jerasure load: lrc load: isa


This machine is mon and mgr, and the mgr daemon is working fine after the upgrade.

In the log messages:

Jun 29 15:54:38 cephmon03 systemd: ceph-mon@cephmon03.service failed.
Jun 29 15:54:47 cephmon03 systemd: ceph-mon@cephmon03.service  
holdoff time over, scheduling restart.

Jun 29 15:54:47 cephmon03 systemd: Stopped Ceph cluster monitor daemon.
Jun 29 15:54:47 cephmon03 systemd: Started Ceph cluster monitor daemon.
Jun 29 15:56:43 cephmon03 kernel: pickup invoked oom-killer:  
gfp_mask=0x201da, order=0, oom_score_adj=0

Jun 29 15:56:43 cephmon03 kernel: pickup cpuset=/ mems_allowed=0
Jun 29 15:56:43 cephmon03 kernel: CPU: 1 PID: 1047 Comm: pickup Not  
tainted 3.10.0-957.5.1.el7.x86_64 #1

Jun 29 15:56:43 cephmon03 kernel: Call Trace:
Jun 29 15:56:43 cephmon03 kernel: [] dump_stack+0x19/0x1b
..


Any advice?
--
=
Ibán Cabrillo Bartolomé
Instituto de Fisica de Cantabria (IFCA-CSIC)
Santander, Spain
Tel: +34942200969/+34669930421
Responsable del Servicio de Computación Avanzada
==




[ceph-users] Ceph mon cannot join to cluster during upgrade

2022-06-29 Thread Iban Cabrillo
Hi Guys, 
I am in the upgrade process from Mimic to Nautilus.
The first step was to upgrade one cephmon, but after that this cephmon cannot
rejoin the cluster. I see this in the logs:

2022-06-29 15:54:48.200 7fd3d015f1c0 0 ceph version 14.2.22 
(ca74598065096e6fcbd8433c8779a2be0c889351) nautilus (stable), process ceph-mon, 
pid 6121 
2022-06-29 15:54:48.206 7fd3d015f1c0 0 pidfile_write: ignore empty --pid-file 
2022-06-29 15:54:48.339 7fd3d015f1c0 0 load: jerasure load: lrc load: isa 


This machine is mon and mgr, and the mgr daemon is working fine after the upgrade.

In the log messages:

Jun 29 15:54:38 cephmon03 systemd: ceph-mon@cephmon03.service failed. 
Jun 29 15:54:47 cephmon03 systemd: ceph-mon@cephmon03.service holdoff time 
over, scheduling restart. 
Jun 29 15:54:47 cephmon03 systemd: Stopped Ceph cluster monitor daemon. 
Jun 29 15:54:47 cephmon03 systemd: Started Ceph cluster monitor daemon. 
Jun 29 15:56:43 cephmon03 kernel: pickup invoked oom-killer: gfp_mask=0x201da, 
order=0, oom_score_adj=0 
Jun 29 15:56:43 cephmon03 kernel: pickup cpuset=/ mems_allowed=0 
Jun 29 15:56:43 cephmon03 kernel: CPU: 1 PID: 1047 Comm: pickup Not tainted 
3.10.0-957.5.1.el7.x86_64 #1 
Jun 29 15:56:43 cephmon03 kernel: Call Trace: 
Jun 29 15:56:43 cephmon03 kernel: [] dump_stack+0x19/0x1b 
.. 


Any advice?
-- 
= 
Ibán Cabrillo Bartolomé 
Instituto de Fisica de Cantabria (IFCA-CSIC) 
Santander, Spain 
Tel: +34942200969/+34669930421 
Responsable del Servicio de Computación Avanzada 
== 




[ceph-users] CephFS, ACLs, NFS and SMB

2022-06-29 Thread Robert Sander

Hi,

CephFS currently only supports POSIX ACLs.

These can be used when re-exporting the filesystem via Samba for SMB 
clients and via nfs-kernel-server for NFSv3 clients.


NFS-Ganesha in version 4.0 from Ceph 17 supports POSIX ACLs for the Ceph
FSAL, but only on the backend; the frontend still only uses NFSv4 ACLs,
as the NFSv3 POSIX ACL side channel is not available.


Are there any plans to harmonize this?

E.g. support NFSv4 ACLs (which are similar to NTFS ACLs AFAIK) natively
in CephFS, which could then be exported via NFS-Ganesha for NFSv4 and
via Samba for SMB.


Or is this too much to be asked as there are three projects involved?

Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin

http://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Zwangsangaben lt. §35a GmbHG:
HRB 220009 B / Amtsgericht Berlin-Charlottenburg,
Geschäftsführer: Peer Heinlein -- Sitz: Berlin


[ceph-users] Re: Ceph recovery network speed

2022-06-29 Thread Curt
On Wed, Jun 29, 2022 at 4:42 PM Stefan Kooman  wrote:

> On 6/29/22 11:21, Curt wrote:
> > On Wed, Jun 29, 2022 at 1:06 PM Frank Schilder  wrote:
> >
> >> Hi,
> >>
> >> did you wait for PG creation and peering to finish after setting pg_num
> >> and pgp_num? They should be right on the value you set and not lower.
> >>
> > Yes, only thing going on was backfill. It's still just slowly expanding
> pg
> > and pgp nums.   I even ran the set command again.  Here's the current
> info
> > ceph osd pool get EC-22-Pool all
> > size: 4
> > min_size: 3
> > pg_num: 226
> > pgp_num: 98
>
> This is coded in the mons and works like that from nautilus onwards:
>
> src/mon/OSDMonitor.cc
>
> ...
>  if (osdmap.require_osd_release < ceph_release_t::nautilus) {
>// pre-nautilus osdmap format; increase pg_num directly
>assert(n > (int)p.get_pg_num());
>// force pre-nautilus clients to resend their ops, since they
>// don't understand pg_num_target changes form a new interval
>p.last_force_op_resend_prenautilus = pending_inc.epoch;
>// force pre-luminous clients to resend their ops, since they
>// don't understand that split PGs now form a new interval.
>p.last_force_op_resend_preluminous = pending_inc.epoch;
>p.set_pg_num(n);
>  } else {
>// set targets; mgr will adjust pg_num_actual and pgp_num later.
>// make pgp_num track pg_num if it already matches.  if it is set
>// differently, leave it different and let the user control it
>// manually.
>if (p.get_pg_num_target() == p.get_pgp_num_target()) {
>  p.set_pgp_num_target(n);
>}
>p.set_pg_num_target(n);
>  }
> ...
>
> So, if pg_num and pgp_num are equal when pg_num is increased, the mgr
> will slowly change pgp_num to follow. If pgp_num is different (smaller, as it
> cannot be bigger than pg_num), it will not touch pgp_num.
>
> You might speed up this process by increasing "target_max_misplaced_ratio"
>
> Gr. Stefan
>

Hi Stefan,

Good to know.  I see the default is .05 for misplaced_ratio.  What do you
recommend would be a safe number to increase it to?

Thanks,
Curt


[ceph-users] Re: Best value for "mds_cache_memory_limit" for large (more than 10 Po) cephfs

2022-06-29 Thread Robert Gallop
I’d say one thing to keep in mind is the higher you have your cache, and
the more that is currently consumed, the LONGER it will take in the event
the reply has to take over…

While standby-replay does help to improve takeover times, it's not
significant if there are a lot of clients with a lot of open caps.

We are using a cache in the 40 GB range after ramping it up a bit at a time to
help with recalls.  But when I fail over now I'm looking at 1-3 minutes, with or
without standby-replay enabled.

Do some testing with failovers if you have the ability, to ensure that your
timings are OK; too big a cache can cause issues in that area, that I know of…
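
A simple way to test that is to fail the active MDS and time how long clients stall
(a sketch; rank 0 and a single filesystem assumed):

ceph fs status
ceph mds fail 0     # or <fsname>:0
ceph fs status      # watch the standby take over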

Robert

On Wed, Jun 29, 2022 at 6:54 AM Eugen Block  wrote:

> Hi,
>
> you can check how much your MDS is currently using:
>
> ceph daemon mds.<name> cache status
>
> Does it already scratch your limit? I usually start with lower values
> if it's difficult to determine how much it will actually use and
> increase it if necessary.
>
> Quoting Arnaud M:
>
> > Hello to everyone
> >
> > I have a ceph cluster currently serving cephfs.
> >
> > The size of the ceph filesystem is around 1 Po.
> > 1 Active mds and 1 Standby-replay
> > I do not have a lot of cephfs clients for now 5 but it may increase to 20
> > or 30.
> >
> > Here is some output
> >
> > Rank | State          | Daemon                | Activity     | Dentries | Inodes  | Dirs    | Caps
> > 0    | active         | ceph-g-ssd-4-2.mxwjvd | Reqs: 130 /s | 10.2 M   | 10.1 M  | 356.8 k | 707.6 k
> > 0-s  | standby-replay | ceph-g-ssd-4-1.ixqewp | Evts: 0 /s   | 156.5 k  | 127.7 k | 47.4 k  | 0
> >
> > It is working really well
> >
> > I plan to to increase this cephfs cluster up to 10 Po (for now) and even
> > more
> >
> > What would be the good value for "mds_cache_memory_limit" ? I have set it
> > to 80 Gb because I have enough ram on my server to do so.
> >
> > Was it a good idea ? Or is it counter-productive ?
> >
> > All the best
> >
> > Arnaud


[ceph-users] Re: Best value for "mds_cache_memory_limit" for large (more than 10 Po) cephfs

2022-06-29 Thread Eugen Block

Hi,

you can check how much your MDS is currently using:

ceph daemon mds.<name> cache status

Does it already scratch your limit? I usually start with lower values  
if it's difficult to determine how much it will actually use and  
increase it if necessary.
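
For example (a sketch; the daemon name is taken from your output below, and the
16 GiB value is only an illustration):

ceph daemon mds.ceph-g-ssd-4-2.mxwjvd cache status
ceph config get mds mds_cache_memory_limit
ceph config set mds mds_cache_memory_limit 17179869184   # 16 GiB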


Quoting Arnaud M:


Hello to everyone

I have a ceph cluster currently serving cephfs.

The size of the ceph filesystem is around 1 Po.
1 Active mds and 1 Standby-replay
I do not have a lot of cephfs clients for now (5), but it may increase to 20
or 30.

Here is some output

Rank | State          | Daemon                | Activity     | Dentries | Inodes  | Dirs    | Caps
0    | active         | ceph-g-ssd-4-2.mxwjvd | Reqs: 130 /s | 10.2 M   | 10.1 M  | 356.8 k | 707.6 k
0-s  | standby-replay | ceph-g-ssd-4-1.ixqewp | Evts: 0 /s   | 156.5 k  | 127.7 k | 47.4 k  | 0

It is working really well

I plan to increase this cephfs cluster up to 10 Po (for now) and even
more.

What would be a good value for "mds_cache_memory_limit"? I have set it
to 80 GB because I have enough RAM on my server to do so.

Was it a good idea ? Or is it counter-productive ?

All the best

Arnaud


[ceph-users] Re: cephadm orch thinks hosts are offline

2022-06-29 Thread Thomas Roth
Trying to resolve this, at first I tried to pause the cephadm processes ('ceph config-key set
mgr/cephadm/pause true'), which did not lead anywhere but loss of connectivity: how do you "resume"?
It does not exist anywhere in the documentation!
Actually, there are quite a few things in Ceph that you can switch on but not off, or switch off, but
not on - such as rebooting a mgr node ...
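
Presumably the inverse would be something like the following, though I have not
verified it:

ceph config-key rm mgr/cephadm/pause
# or, in releases that have it:
ceph orch resume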



In addition to the `ceph orch host ls` showing everything Offline, I thus 
managed to get also
> ceph -s
> id: 98e1e122-ebe3-11ec-b165-8208fe80
>health: HEALTH_WARN
>9 hosts fail cephadm check
>21 stray daemon(s) not managed by cephadm
>
>  services:
>mon: 5 daemons, quorum lxbk0374,lxbk0375,lxbk0376,lxbk0377,lxbk0378 (age 
6d)
>mgr: lxbk0375.qtgomh(active, since 6d), standbys: lxbk0376.jstndr, 
lxbk0374.hdvmvg
>mds: 1/1 daemons up, 11 standby
>osd: 24 osds: 24 up (since 5d), 24 in (since 5d)
>
>  data:
>volumes: 1/1 healthy
>pools:   3 pools, 641 pgs
>objects: 4.77k objects, 16 GiB
>usage:   50 GiB used, 909 TiB / 910 TiB avail
>pgs: 641 active+clean


The good thing is that neither ceph nor cephfs cares about the orchestrator thingy - everything keeps
working, it would seem ;-)



Finally, the workaround (or solution?):
Re-adding missing nodes is a bad idea in almost every system, but not in Ceph.

Go to lxbk0375 - since that is the active mgr, cf. above.

> ssh-copy-id -f -i /etc/ceph/ceph.pub root@lxbk0374
> ceph orch host add lxbk0374 10.20.2.161

-> 'ceph orch host ls' shows that node no longer Offline.
-> Repeat with all the other hosts, and everything looks fine also from the 
orch view.


My question: Did I miss this procedure in the manuals?


Cheers
Thomas

On 23/06/2022 18.29, Thomas Roth wrote:

Hi all,

found this bug https://tracker.ceph.com/issues/51629  (Octopus 15.2.13), reproduced it in Pacific and 
now again in Quincy:

- new cluster
- 3 mgr nodes
- reboot active mgr node
- (only in Quincy:) standby mgr node takes over, rebooted node becomes standby
- `ceph orch host ls` shows all hosts as `offline`
- add a new host: not offline

In my setup, hostnames and IPs are well known, thus

# ceph orch host ls
HOST  ADDR LABELS  STATUS
lxbk0374  10.20.2.161  _admin  Offline
lxbk0375  10.20.2.162  Offline
lxbk0376  10.20.2.163  Offline
lxbk0377  10.20.2.164  Offline
lxbk0378  10.20.2.165  Offline
lxfs416   10.20.2.178  Offline
lxfs417   10.20.2.179  Offline
lxfs418   10.20.2.222  Offline
lxmds22   10.20.6.67
lxmds23   10.20.6.72   Offline
lxmds24   10.20.6.74   Offline


(All lxbk are mon nodes, the first 3 are mgr, 'lxmds22' was added after the 
fatal reboot.)


Does this matter at all?
The old bug report is one year old, now with prio 'Low'. And some people must have rebooted the one or 
other host in their clusters...


There is a cephfs on our cluster, operations seem to be unaffected.


Cheers
Thomas



--

Thomas Roth
Department: Informationstechnologie
Location: SB3 2.291
Phone: +49-6159-71 1453  Fax: +49-6159-71 2986


GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de

Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
Managing Directors / Geschäftsführung:
Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock
Chairman of the Supervisory Board / Vorsitzender des GSI-Aufsichtsrats:
State Secretary / Staatssekretär Dr. Volkmar Dietz



[ceph-users] Re: Recommended number of mons in a cluster

2022-06-29 Thread Konstantin Shalygin
Hi

You can deploy 3+2 or 3+5 mons, not 3+1

k
Sent from my iPhone

> On 28 Jun 2022, at 21:39, Vladimir Brik  
> wrote:
> 
> Hello
> 
> I have a ceph cluster with 3 mon servers that resides at a facility that 
> experiences significant outages once or twice a year. Is it possible mons 
> will not be able to re-establish quorum or get corrupted after an outage if 
> all of them go down uncleanly at about the same time?
> 
> I could set up 4 (it doesn't make sense to add fewer than 4, right?) extra 
> mons running in VMs at a different facility. Is there a downside to running 7 
> mons? If a network partition occurs between the facilities can ceph be relied 
> upon to recover gracefully once connectivity is restored?
> 
> 
> Thanks,
> 
> Vlad


[ceph-users] Re: Difficulty with fixing an inconsistent PG/object

2022-06-29 Thread Konstantin Shalygin
Hi!

Just try to Google data_digest_mismatch_oi.
On the old mailing list archives there are a couple of threads with the same problem.
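
The workaround that usually comes up in those threads is to rewrite the affected
object so its object-info digest gets refreshed, and then repair the PG. A rough
sketch (placeholders, untested here):

rados -p <pool> get <object-name> /tmp/obj
rados -p <pool> put <object-name> /tmp/obj
ceph pg repair 37.189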


k
Sent from my iPhone

> On 29 Jun 2022, at 13:54, Lennart van Gijtenbeek | Routz 
>  wrote:
> 
> Hello Ceph community,
> 
> 
> I hope you could help me with an issue we are experiencing on our backup 
> cluster.
> 
> The Ceph version we are running here is 10.2.10 (Jewel), and we are using 
> Filestore.
> The PG is part of a replicated pool with size=2.
> 
> 
> Getting the following error:
> ```
> 
> root@cephmon0:~# ceph health detail
> HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
> pg 37.189 is active+clean+inconsistent, acting [144,170]
> 2 scrub errors
> ```
> 
> ```
> root@cephmon0:~# grep 37.189 /var/log/ceph/ceph.log
> 2022-06-29 11:11:27.782920 osd.144 10.129.160.22:6800/2810 7598 : cluster 
> [INF] osd.144 pg 37.189 Deep scrub errors, upgrading scrub to deep-scrub
> 2022-06-29 11:11:27.884628 osd.144 10.129.160.22:6800/2810 7599 : cluster 
> [INF] 37.189 deep-scrub starts
> 2022-06-29 11:13:07.124841 osd.144 10.129.160.22:6800/2810 7600 : cluster 
> [ERR] 37.189 shard 144: soid 37:9193d307:::isqPpJMKYY4.001e:head 
> data_digest 0x50007bd9 != data_digest 0x885fabcc from auth oi 
> 37:9193d307:::isqPpJMKYY4.001e:head(7211'173457 osd.71.0:397191 
> dirty|data_digest|omap_digest s 4194304 uv 39699 dd 885fabcc od  
> alloc_hint [0 0])
> 2022-06-29 11:13:07.124849 osd.144 10.129.160.22:6800/2810 7601 : cluster 
> [ERR] 37.189 shard 170: soid 37:9193d307:::isqPpJMKYY4.001e:head 
> data_digest 0x50007bd9 != data_digest 0x885fabcc from auth oi 
> 37:9193d307:::isqPpJMKYY4.001e:head(7211'173457 osd.71.0:397191 
> dirty|data_digest|omap_digest s 4194304 uv 39699 dd 885fabcc od  
> alloc_hint [0 0])
> 2022-06-29 11:13:07.124853 osd.144 10.129.160.22:6800/2810 7602 : cluster 
> [ERR] 37.189 soid 37:9193d307:::isqPpJMKYY4.001e:head: failed to 
> pick suitable auth object
> 2022-06-29 11:20:46.459906 osd.144 10.129.160.22:6800/2810 7603 : cluster 
> [ERR] 37.189 deep-scrub 2 errors
> ```
> 
> The PG has already been transferred from 2 other OSDs. That is, the same 
> error occurred when the PG was stored on two different OSDs. So it seems this 
> is not a disk issue. There seems to be something wrong with the object 
> "isqPpJMKYY4.001e".
> However, when looking at the md5sum for the object. On both OSDs, this is the 
> same.
> 
> 
> ```
> 
> root@ceph12:/var/lib/ceph/osd/ceph-144/current/37.189_head/DIR_9/DIR_8/DIR_9/DIR_C#
>  ls -l isqPpJMKYY4.001e__head_E0CBC989__25
> 
> -rw-r--r-- 1 ceph ceph 4194304 Jun  3 09:56 
> isqPpJMKYY4.001e__head_E0CBC989__25
> 
> root@ceph12:/var/lib/ceph/osd/ceph-144/current/37.189_head/DIR_9/DIR_8/DIR_9/DIR_C#
>  md5sum isqPpJMKYY4.001e__head_E0CBC989__25
> 96d702072cd441f2d0af60783e8db248  
> isqPpJMKYY4.001e__head_E0CBC989__25
> ```
> 
> ```
> root@ceph15:/var/lib/ceph/osd/ceph-170/current/37.189_head/DIR_9/DIR_8/DIR_9/DIR_C#
>  ls -l isqPpJMKYY4.001e__head_E0CBC989__25
> -rw-r--r-- 1 ceph ceph 4194304 Jun 23 16:41 
> isqPpJMKYY4.001e__head_E0CBC989__25
> 
> root@ceph15:/var/lib/ceph/osd/ceph-170/current/37.189_head/DIR_9/DIR_8/DIR_9/DIR_C#
>  md5sum isqPpJMKYY4.001e__head_E0CBC989__25
> 96d702072cd441f2d0af60783e8db248  
> isqPpJMKYY4.001e__head_E0CBC989__25
> ```
> 
> ```
> root@cephmon0:~# rados list-inconsistent-obj 37.189 --format=json-pretty
> {
>"epoch": 167653,
>"inconsistents": [
>{
>"object": {
>"name": "isqPpJMKYY4.001e",
>"nspace": "",
>"locator": "",
>"snap": "head",
>"version": 39699
>},
>"errors": [],
>"union_shard_errors": [
>"data_digest_mismatch_oi"
>],
>"selected_object_info": 
> "37:9193d307:::isqPpJMKYY4.001e:head(7211'173457 osd.71.0:397191 
> dirty|data_digest|omap_digest s 4194304 uv 39699 dd 885fabcc od  
> alloc_hint [0 0])",
>"shards": [
>{
>"osd": 144,
>"errors": [
>"data_digest_mismatch_oi"
>],
>"size": 4194304,
>"omap_digest": "0x",
>"data_digest": "0x50007bd9"
>},
>{
>"osd": 170,
>"errors": [
>"data_digest_mismatch_oi"
>],
>"size": 4194304,
>"omap_digest": "0x",
>"data_digest": "0x50007bd9"
>}
>]
>}
>]
> }
> ```
> 
> I don't understand why there is a "data_digest_mismatch_oi" error, since
> the checksums seem to match.
> 
> Does anyone have any

[ceph-users] Re: Ceph recovery network speed

2022-06-29 Thread Curt
On Wed, Jun 29, 2022 at 1:06 PM Frank Schilder  wrote:

> Hi,
>
> did you wait for PG creation and peering to finish after setting pg_num
> and pgp_num? They should be right on the value you set and not lower.
>
Yes, the only thing going on was backfill. It's still just slowly expanding pg
and pgp nums.  I even ran the set command again.  Here's the current info:
ceph osd pool get EC-22-Pool all
size: 4
min_size: 3
pg_num: 226
pgp_num: 98
crush_rule: EC-22-Pool
hashpspool: true
allow_ec_overwrites: true
nodelete: false
nopgchange: false
nosizechange: false
write_fadvise_dontneed: false
noscrub: false
nodeep-scrub: false
use_gmt_hitset: 1
erasure_code_profile: EC-22-Pro
fast_read: 0
pg_autoscale_mode: off
eio: false
bulk: false

>
> > How do you set the upmap balancer per pool?
>
> I'm afraid the answer is RTFM. I don't use it, but I believe to remember
> one could configure it for equi-distribution of PGs for each pool.
>
Ok, I'll dig around some more. I glanced at the balancer page and didn't
see it.


> Whenever you grow the cluster, you should make the same considerations
> again and select numbers of PG per pool depending on number of objects,
> capacity and performance.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Curt 
> Sent: 28 June 2022 16:33:24
> To: Frank Schilder
> Cc: Robert Gallop; ceph-users@ceph.io
> Subject: Re: [ceph-users] Re: Ceph recovery network speed
>
> Hi Frank,
>
> Thank you for the thorough breakdown. I have increased the pg_num and
> pgp_num to 1024 to start on the ec-22 pool. That is going to be my primary
> pool with the most data.  It looks like ceph slowly scales the pg up even
> with autoscaling off, since I see target_pg_num 2048, pg_num 199.
>
> root@cephmgr:/# ceph osd pool set EC-22-Pool pg_num 2048
> set pool 12 pg_num to 2048
> root@cephmgr:/# ceph osd pool set EC-22-Pool pgp_num 2048
> set pool 12 pgp_num to 2048
> root@cephmgr:/# ceph osd pool get EC-22-Pool all
> size: 4
> min_size: 3
> pg_num: 199
> pgp_num: 71
> crush_rule: EC-22-Pool
> hashpspool: true
> allow_ec_overwrites: true
> nodelete: false
> nopgchange: false
> nosizechange: false
> write_fadvise_dontneed: false
> noscrub: false
> nodeep-scrub: false
> use_gmt_hitset: 1
> erasure_code_profile: EC-22-Pro
> fast_read: 0
> pg_autoscale_mode: off
> eio: false
> bulk: false
>
> This cluster will be growing quite a bit over the next few months.  I am
> migrating data from their old Giant cluster to a new one; by the time I'm
> done it should be 16 hosts with about 400 TB of data. I'm guessing I'll have
> to increase pg again later when I start adding more servers to the cluster.
>
> I will look into if SSD's are an option.  How do you set the upmap
> balancer per pool?  Looking at ceph balancer status my mode is already
> upmap.
>
> Thanks again,
> Curt
>
> On Tue, Jun 28, 2022 at 1:23 AM Frank Schilder <fr...@dtu.dk> wrote:
> Hi Curt,
>
> looking at what you sent here, I believe you are the victim of "the law of
> large numbers really only holds for large numbers". In other words, the
> statistics of small samples is biting you. The PG numbers of your pools are
> so low that they lead to a very large imbalance of data- and IO placement.
> In other words, in your cluster a few OSDs receive the majority of IO
> requests and bottleneck the entire cluster.
>
> If I see this correctly, the PG num per drive varies from 14 to 40. That's
> an insane imbalance. Also, on your EC pool PG_num is 128 but PGP_num is
> only 48. The autoscaler is screwing it up for you. It will slowly increase
> the number of active PGs, causing continuous relocation of objects for a
> very long time.
>
> I think the recovery speed you see for 8 objects per second is not too bad
> considering that you have an HDD only cluster. The speed does not increase,
> because it is a small number of PGs sending data - a subset of the 32 you
> had before. In addition, due to the imbalance of PGs per OSD, only a small
> number of PGs will be able to send data. You will need patience to get out
> of this corner.
>
> The first thing I would do is look at which pools are important for your
> workload in the long run. I see 2 pools having a significant number of
> objects: EC-22-Pool and default.rgw.buckets.data. EC-22-Pool has about 40
> times the number of objects and bytes as default.rgw.buckets.data. I would
> scale both up in PG count with emphasis on EC-22-Pool.
>
> Your cluster can safely operate between 1100 and 2200 PGs with replication
> <=4. If you don't plan to create more large pools, a good choice of
> distributing this capacity might be
>
> EC-22-Pool: 1024 PGs (could be pushed up to 2048)
> default.rgw.buckets.data: 256 PGs
>
> That's towards the lower end of available PGs. Please make your own
> calculation and judgement.
>
> If you have settled on target numbers, change the pool sizes in one go,
> that is, set PG_num and PGP_num to the same 

[ceph-users] Orchestrator informations wrong and outdated

2022-06-29 Thread Kuhring, Mathias
Dear Ceph community,

we are in the curious situation that typical orchestrator queries 
provide wrong or outdated information about different services.
E.g. `ceph orch ls` will report wrong numbers on active services.
Or `ceph orch ps` reports many OSDs as "starting" and many services with 
an old version (15.2.14, but we are on 16.2.7).
Also the refresh times seem way off (capital M == months?).
However, the cluster is healthy (`ceph status` is happy).
And sample validation of affected services with systemctl also shows 
that they are up and ok.

We already tried the following without success:

a) re-registering cephadm as orchestrator backend
0|0[root@osd-1 ~]# ceph orch pause
0|0[root@osd-1 ~]# ceph orch set backend ''
0|0[root@osd-1 ~]# ceph mgr module disable cephadm
0|0[root@osd-1 ~]# ceph orch ls
Error ENOENT: No orchestrator configured (try `ceph orch set backend`)
0|0[root@osd-1 ~]# ceph mgr module enable cephadm
0|0[root@osd-1 ~]# ceph orch set backend 'cephadm'

b) a failover of the MGR (hoping it would restart/reset the orchestrator 
module)
0|0[root@osd-1 ~]# ceph status | grep mgr
     mgr:   osd-1(active, since 6m), standbys: osd-5.jcfyqe, 
osd-4.oylrhe, osd-3
0|0[root@osd-1 ~]# ceph mgr fail
0|0[root@osd-1 ~]# ceph status | grep mgr
     mgr:   osd-5.jcfyqe(active, since 7s), standbys: 
osd-4.oylrhe, osd-3, osd-1

Is there any other way to somehow reset the orchestrator 
information/connection?
I added different relevant outputs below.

I also went through the MGR logs and found an issue with querying the 
docker repos.
I attempted to upgrade the MGRs to 16.2.9 a few weeks ago due to a 
different bug.
But this upgrade never went through, apparently due to cephadm not being
able to pull the image.
Interestingly, I'm able to pull the image manually with docker pull. But 
cephadm is not.
I also get an error with `ceph orch upgrade ls` to check on available 
versions.
I'm not sure, if this is relevant to the orchestrator problem we have.
But to be safe, I also added the logs/output below.
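
For what it's worth, the image cephadm tries to pull can be checked and fetched
manually roughly like this (a sketch; the exact tag is an assumption):

ceph config get mgr container_image
cephadm --image quay.io/ceph/ceph:v16.2.9 pull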

Thank you for all your help!

Best Wishes,
Mathias


0|0[root@osd-1 ~]# ceph status
   cluster:
     id: 55633ec3-6c0c-4a02-990c-0f87e0f7a01f
     health: HEALTH_OK

   services:
     mon:   5 daemons, quorum osd-1,osd-2,osd-5,osd-4,osd-3 (age 
86m)
     mgr:   osd-5.jcfyqe(active, since 21m), standbys: 
osd-4.oylrhe, osd-3, osd-1
     mds:   1/1 daemons up, 1 standby
     osd:   270 osds: 270 up (since 13d), 270 in (since 5w)
     cephfs-mirror: 1 daemon active (1 hosts)
     rgw:   3 daemons active (3 hosts, 2 zones)

   data:
     volumes: 1/1 healthy
     pools:   17 pools, 6144 pgs
     objects: 692.54M objects, 1.2 PiB
     usage:   1.8 PiB used, 1.7 PiB / 3.5 PiB avail
     pgs: 6114 active+clean
  29   active+clean+scrubbing+deep
  1    active+clean+scrubbing

   io:
     client:   0 B/s rd, 421 MiB/s wr, 52 op/s rd, 240 op/s wr

0|0[root@osd-1 ~]# ceph orch ls
NAME   PORTS   RUNNING REFRESHED   
AGE  PLACEMENT
alertmanager   ?:9093,9094 0/1 -   
8M   count:1
cephfs-mirror  0/1 -   
5M   count:1
crash  2/6  7M ago  
4M   *
grafana    ?:3000  0/1 -   
8M   count:1
ingress.rgw.default    172.16.39.131:443,1967  0/2 -   
4M   osd-1
ingress.rgw.ext    172.16.39.132:443,1968  4/2  7M ago  
4M   osd-5
ingress.rgw.ext-website    172.16.39.133:443,1969  0/2 -   
4M   osd-3
mds.cephfs 2/2  9M ago  
4M   count-per-host:1;label:mds
mgr    5/5  9M ago  
9M   count:5
mon    5/5  9M ago  
9M   count:5
node-exporter  ?:9100  2/6  7M ago  
7w   *
osd.all-available-devices    0 -   
5w   *
osd.osd 54   
7M   label:osd
osd.unmanaged  180  9M ago  
-    
prometheus ?:9095  0/2 -   
8M   count:2
rgw.cubi   4/0  9M ago  
-    
rgw.default    ?:8100  2/1  7M ago  
4M   osd-1
rgw.ext    ?:8100  2/1  7M ago  
4M   osd-5
rgw.ext-website    ?:8200  0/1 -   
4M   osd-3

0|0[root@osd-1 ~]# ceph orch ps | grep starting | head -n 3
osd.0    osd-1 starting  -    
-    -    3072M     
osd.1    osd-2 starting  -    
-    -    3072M     
osd.10   osd-1 starting