[ceph-users] Re: Cluster healthy, but 16.2.7 osd daemon upgrade says its unsafe to stop them?

2022-05-26 Thread Wesley Dillingham
pool 13 'mathfs_metadata' replicated size 2 min_size 2 crush_rule 0
object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change

The problem is that you have size=2 and min_size=2 on this pool. I would
increase the size of this pool to 3 (and I would do the same for all of
your pools which are size=2). The ok-to-stop command is failing because
stopping any OSD serving these PGs would drop them below min_size, and
those PGs would then become inactive.
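
As a rough sketch (substituting each of your size=2 pools; mathfs_metadata
is the one blocking the upgrade, since PGs 13.a and 13.11 belong to pool 13):

$ ceph osd pool get mathfs_metadata size      # confirm the current replica count
$ ceph osd pool set mathfs_metadata size 3    # add a third replica

Once backfill completes, a PG can lose one OSD and still keep two copies, so
it stays at or above min_size=2 and ok-to-stop should pass.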

Respectfully,

*Wes Dillingham*
w...@wesdillingham.com
LinkedIn 


On Thu, May 26, 2022 at 2:22 PM Sarunas Burdulis 
wrote:

> On 5/26/22 14:09, Wesley Dillingham wrote:
> > What does "ceph osd pool ls detail" say?
>
> $ ceph osd pool ls detail
> pool 0 'rbd' replicated size 2 min_size 1 crush_rule 0 object_hash
> rjenkins pg_num 64 pgp_num 64 autoscale_mode on last_change 44740 flags
> hashpspool,selfmanaged_snaps stripe_width 0 application rbd
> pool 1 '.rgw.root' replicated size 2 min_size 1 crush_rule 0 object_hash
> rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 44740 lfor
> 0/0/31483 owner 18446744073709551615 flags hashpspool stripe_width 0
> application rgw
> pool 2 'default.rgw.control' replicated size 2 min_size 1 crush_rule 0
> object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change
> 44740 lfor 0/0/31469 owner 18446744073709551615 flags hashpspool
> stripe_width 0 application rgw
> pool 3 'default.rgw.data.root' replicated size 2 min_size 1 crush_rule 0
> object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change
> 44740 lfor 0/0/31471 owner 18446744073709551615 flags hashpspool
> stripe_width 0 application rgw
> pool 4 'default.rgw.gc' replicated size 2 min_size 1 crush_rule 0
> object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change
> 44740 lfor 0/0/31471 owner 18446744073709551615 flags hashpspool
> stripe_width 0 application rgw
> pool 5 'default.rgw.log' replicated size 2 min_size 1 crush_rule 0
> object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change
> 44740 lfor 0/0/31387 owner 18446744073709551615 flags hashpspool
> stripe_width 0 application rgw
> pool 6 'default.rgw.users.uid' replicated size 2 min_size 1 crush_rule 0
> object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change
> 44740 lfor 0/0/31387 flags hashpspool stripe_width 0 application rgw
> pool 12 'mathfs_data' replicated size 2 min_size 1 crush_rule 0
> object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change
> 44740 lfor 0/31370/31368 flags hashpspool stripe_width 0 application cephfs
> pool 13 'mathfs_metadata' replicated size 2 min_size 2 crush_rule 0
> object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change
> 44740 lfor 0/27164/27162 flags hashpspool stripe_width 0 application cephfs
> pool 15 'default.rgw.lc' replicated size 2 min_size 1 crush_rule 0
> object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change
> 44740 lfor 0/0/31374 flags hashpspool stripe_width 0 application rgw
> pool 21 'libvirt' replicated size 3 min_size 1 crush_rule 0 object_hash
> rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 56244 lfor
> 0/33144/33142 flags hashpspool,selfmanaged_snaps stripe_width 0
> application rbd
> pool 36 'monthly_archive_metadata' replicated size 2 min_size 1
> crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on
> last_change 45338 lfor 0/27845/27843 flags hashpspool stripe_width 0
> application cephfs
> pool 37 'monthly_archive_data' replicated size 2 min_size 1 crush_rule 0
> object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change
> 45334 lfor 0/44535/44533 flags hashpspool stripe_width 0 application cephfs
> pool 38 'device_health_metrics' replicated size 2 min_size 1 crush_rule
> 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change
> 56507 flags hashpspool stripe_width 0 pg_num_min 1 application
> mgr_devicehealth
> pool 41 'lensfun_metadata' replicated size 2 min_size 1 crush_rule 0
> object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change
> 54066 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16
> recovery_priority 5 application cephfs
> pool 42 'lensfun_data' replicated size 2 min_size 1 crush_rule 0
> object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change
> 54066 flags hashpspool stripe_width 0 application cephfs
>
>


[ceph-users] Re: Cluster healthy, but 16.2.7 osd daemon upgrade says its unsafe to stop them?

2022-05-26 Thread Sarunas Burdulis

On 5/26/22 14:09, Wesley Dillingham wrote:

What does "ceph osd pool ls detail" say?


$ ceph osd pool ls detail
pool 0 'rbd' replicated size 2 min_size 1 crush_rule 0 object_hash 
rjenkins pg_num 64 pgp_num 64 autoscale_mode on last_change 44740 flags 
hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 1 '.rgw.root' replicated size 2 min_size 1 crush_rule 0 object_hash 
rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 44740 lfor 
0/0/31483 owner 18446744073709551615 flags hashpspool stripe_width 0 
application rgw
pool 2 'default.rgw.control' replicated size 2 min_size 1 crush_rule 0 
object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 
44740 lfor 0/0/31469 owner 18446744073709551615 flags hashpspool 
stripe_width 0 application rgw
pool 3 'default.rgw.data.root' replicated size 2 min_size 1 crush_rule 0 
object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 
44740 lfor 0/0/31471 owner 18446744073709551615 flags hashpspool 
stripe_width 0 application rgw
pool 4 'default.rgw.gc' replicated size 2 min_size 1 crush_rule 0 
object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 
44740 lfor 0/0/31471 owner 18446744073709551615 flags hashpspool 
stripe_width 0 application rgw
pool 5 'default.rgw.log' replicated size 2 min_size 1 crush_rule 0 
object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 
44740 lfor 0/0/31387 owner 18446744073709551615 flags hashpspool 
stripe_width 0 application rgw
pool 6 'default.rgw.users.uid' replicated size 2 min_size 1 crush_rule 0 
object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 
44740 lfor 0/0/31387 flags hashpspool stripe_width 0 application rgw
pool 12 'mathfs_data' replicated size 2 min_size 1 crush_rule 0 
object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 
44740 lfor 0/31370/31368 flags hashpspool stripe_width 0 application cephfs
pool 13 'mathfs_metadata' replicated size 2 min_size 2 crush_rule 0 
object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 
44740 lfor 0/27164/27162 flags hashpspool stripe_width 0 application cephfs
pool 15 'default.rgw.lc' replicated size 2 min_size 1 crush_rule 0 
object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 
44740 lfor 0/0/31374 flags hashpspool stripe_width 0 application rgw
pool 21 'libvirt' replicated size 3 min_size 1 crush_rule 0 object_hash 
rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 56244 lfor 
0/33144/33142 flags hashpspool,selfmanaged_snaps stripe_width 0 
application rbd
pool 36 'monthly_archive_metadata' replicated size 2 min_size 1 
crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on 
last_change 45338 lfor 0/27845/27843 flags hashpspool stripe_width 0 
application cephfs
pool 37 'monthly_archive_data' replicated size 2 min_size 1 crush_rule 0 
object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 
45334 lfor 0/44535/44533 flags hashpspool stripe_width 0 application cephfs
pool 38 'device_health_metrics' replicated size 2 min_size 1 crush_rule 
0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 
56507 flags hashpspool stripe_width 0 pg_num_min 1 application 
mgr_devicehealth
pool 41 'lensfun_metadata' replicated size 2 min_size 1 crush_rule 0 
object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 
54066 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 
recovery_priority 5 application cephfs
pool 42 'lensfun_data' replicated size 2 min_size 1 crush_rule 0 
object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 
54066 flags hashpspool stripe_width 0 application cephfs





[ceph-users] Re: Cluster healthy, but 16.2.7 osd daemon upgrade says its unsafe to stop them?

2022-05-26 Thread Wesley Dillingham
What does "ceph osd pool ls detail" say?

Respectfully,

*Wes Dillingham*
w...@wesdillingham.com
LinkedIn 


On Thu, May 26, 2022 at 11:24 AM Sarunas Burdulis <
saru...@math.dartmouth.edu> wrote:

> Running
>
> `ceph osd ok-to-stop 0`
>
> shows:
>
> {"ok_to_stop":false,"osds":[1],
> "num_ok_pgs":25,"num_not_ok_pgs":2,
> "bad_become_inactive":["13.a","13.11"],
>
> "ok_become_degraded":["0.4","0.b","0.11","0.1a","0.1e","0.3c","2.5","2.10","3.19","3.1a","4.7","4.19","4.1e","6.10","12.1","12.6","15.9","21.17","21.18","36.8","36.13","41.7","41.1b","42.6","42.1a"]}
> Error EBUSY: unsafe to stop osd(s) at this time (2 PGs are or would
> become offline)
>
> What are “bad_become_inactive” PGs?
> What can be done to make the OSD “ok-to-stop” (or to override the check)?
>
> `ceph -s` still reports HEALTH_OK and all PGs active+clean.
>
> Upgrade to 16.2.8 still complains about non-stoppable OSDs and won't
> proceed.
>
> --
> Sarunas Burdulis
> Dartmouth Mathematics
> math.dartmouth.edu/~sarunas
>
> · https://useplaintext.email ·


[ceph-users] Re: Cluster healthy, but 16.2.7 osd daemon upgrade says its unsafe to stop them?

2022-05-26 Thread Sarunas Burdulis

Running

`ceph osd ok-to-stop 0`

shows:

{"ok_to_stop":false,"osds":[1],
"num_ok_pgs":25,"num_not_ok_pgs":2,
"bad_become_inactive":["13.a","13.11"],
"ok_become_degraded":["0.4","0.b","0.11","0.1a","0.1e","0.3c","2.5","2.10","3.19","3.1a","4.7","4.19","4.1e","6.10","12.1","12.6","15.9","21.17","21.18","36.8","36.13","41.7","41.1b","42.6","42.1a"]}
Error EBUSY: unsafe to stop osd(s) at this time (2 PGs are or would 
become offline)


What are “bad_become_inactive” PGs?
What can be done to make the OSD “ok-to-stop” (or to override the check)?

`ceph -s` still reports HEALTH_OK and all PGs active+clean.

Upgrade to 16.2.8 still complains about non-stoppable OSDs and won't 
proceed.


--
Sarunas Burdulis
Dartmouth Mathematics
math.dartmouth.edu/~sarunas

· https://useplaintext.email ·


[ceph-users] Re: 2 pools - 513 pgs 100.00% pgs unknown - working cluster

2022-05-26 Thread Eugen Block

First thing I would try is a mgr failover.
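
For example, something like this (assuming proxmox3 is the active mgr, as
your status output shows):

$ ceph mgr fail proxmox3

That makes a standby (proxmox2) take over; a fresh mgr will re-collect PG
stats from the OSDs and often clears this kind of stale/unknown pgmap.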

Zitat von Eneko Lacunza :


Hi all,

I'm trying to diagnose an issue in a tiny cluster that is showing the
following status:



root@proxmox3:~# ceph -s
  cluster:
    id: 80d78bb2-6be6-4dff-b41d-60d52e650016
    health: HEALTH_WARN
    1/3 mons down, quorum 0,proxmox3
    Reduced data availability: 513 pgs inactive

  services:
    mon: 3 daemons, quorum 0,proxmox3 (age 3h), out of quorum: 1
    mgr: proxmox3(active, since 16m), standbys: proxmox2
    osd: 12 osds: 8 up (since 3h), 8 in (since 3h)

  task status:

  data:
    pools:   2 pools, 513 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs: 100.000% pgs unknown
 513 unknown

Cluster has 3 nodes, each with 4 OSDs. One of the nodes was offline
for 3 weeks, and when we brought it back online, VMs stalled on disk I/O.


The node has been shut down again and we're trying to understand the
status; then we will try to diagnose the issue with the troubled node.


Currently VMs are working and can read RBD volumes, but there seems  
to be some kind of mgr issue (?) with stats.


There is no firewall on the nodes nor between the 3 nodes (all on
the same switch). Ping works on both the Ceph public and private
networks.


The MGR log shows this continuously:
2022-05-26T13:49:45.603+0200 7fb78ba3f700  0 auth: could not find  
secret_id=1892
2022-05-26T13:49:45.603+0200 7fb78ba3f700  0 cephx:  
verify_authorizer could not get service secret for service mgr  
secret_id=1892
2022-05-26T13:49:45.983+0200 7fb77a18d700  1 mgr.server send_report  
Not sending PG status to monitor yet, waiting for OSDs
2022-05-26T13:49:47.983+0200 7fb77a18d700  1 mgr.server send_report  
Not sending PG status to monitor yet, waiting for OSDs
2022-05-26T13:49:49.983+0200 7fb77a18d700  1 mgr.server send_report  
Not sending PG status to monitor yet, waiting for OSDs
2022-05-26T13:49:51.983+0200 7fb77a18d700  1 mgr.server send_report  
Giving up on OSDs that haven't reported yet, sending potentially  
incomplete PG state to mon
2022-05-26T13:49:51.983+0200 7fb77a18d700  0 log_channel(cluster)  
log [DBG] : pgmap v3: 513 pgs: 513 unknown; 0 B data, 0 B used, 0 B  
/ 0 B avail
2022-05-26T13:49:53.983+0200 7fb77a18d700  0 log_channel(cluster)  
log [DBG] : pgmap v4: 513 pgs: 513 unknown; 0 B data, 0 B used, 0 B  
/ 0 B avail
2022-05-26T13:49:55.983+0200 7fb77a18d700  0 log_channel(cluster)  
log [DBG] : pgmap v5: 513 pgs: 513 unknown; 0 B data, 0 B used, 0 B  
/ 0 B avail
2022-05-26T13:49:57.987+0200 7fb77a18d700  0 log_channel(cluster)  
log [DBG] : pgmap v6: 513 pgs: 513 unknown; 0 B data, 0 B used, 0 B  
/ 0 B avail
2022-05-26T13:49:58.403+0200 7fb78ba3f700  0 auth: could not find  
secret_id=1892
2022-05-26T13:49:58.403+0200 7fb78ba3f700  0 cephx:  
verify_authorizer could not get service secret for service mgr  
secret_id=1892


So it seems that the mgr is unable to contact the OSDs for stats, and
then reports bad info to the mon.


I see the following OSD ports open:
tcp    0  0 192.168.134.102:6800    0.0.0.0:*  LISTEN  2268/ceph-osd
tcp    0  0 192.168.133.102:6800    0.0.0.0:*  LISTEN  2268/ceph-osd
tcp    0  0 192.168.134.102:6801    0.0.0.0:*  LISTEN  2268/ceph-osd
tcp    0  0 192.168.133.102:6801    0.0.0.0:*  LISTEN  2268/ceph-osd
tcp    0  0 192.168.134.102:6802    0.0.0.0:*  LISTEN  2268/ceph-osd
tcp    0  0 192.168.133.102:6802    0.0.0.0:*  LISTEN  2268/ceph-osd
tcp    0  0 192.168.134.102:6803    0.0.0.0:*  LISTEN  2268/ceph-osd
tcp    0  0 192.168.133.102:6803    0.0.0.0:*  LISTEN  2268/ceph-osd
tcp    0  0 192.168.134.102:6804    0.0.0.0:*  LISTEN  2271/ceph-osd
tcp    0  0 192.168.133.102:6804    0.0.0.0:*  LISTEN  2271/ceph-osd
tcp    0  0 192.168.134.102:6805    0.0.0.0:*  LISTEN  2271/ceph-osd
tcp    0  0 192.168.133.102:6805    0.0.0.0:*  LISTEN  2271/ceph-osd
tcp    0  0 192.168.134.102:6806    0.0.0.0:*  LISTEN  2271/ceph-osd
tcp    0  0 192.168.133.102:6806    0.0.0.0:*  LISTEN  2271/ceph-osd
tcp    0  0 192.168.134.102:6807    0.0.0.0:*  LISTEN  2271/ceph-osd
tcp    0  0 192.168.133.102:6807    0.0.0.0:*  LISTEN  2271/ceph-osd
tcp    0  0 192.168.134.102:6808    0.0.0.0:*  LISTEN  2267/ceph-osd
tcp    0  0 192.168.133.102:6808    0.0.0.0:*  LISTEN  2267/ceph-osd
tcp    0  0 192.168.134.102:6809    0.0.0.0:*  LISTEN  2267/ceph-osd
tcp    0  0 192.168.133.102:6809    0.0.0.0:*  LISTEN  2267/ceph-osd
tcp    0  0 192.168.134.102:6810    0.0.0.0:*  LISTEN  2267/ceph-osd
tcp    0  0 192.168.133.102:6810    0.0.0.0:*  LISTEN  2267/ceph-osd
tcp    0  0 192.168.134.102:6811    0.0.0.0:*  LISTEN  2267/ceph-osd
tcp    0  0 192.168.133.102:6811    0.0.0.0:*  LISTEN

[ceph-users] Re: cannot assign requested address

2022-05-26 Thread Redouane Kachach Elhichou
Hello Dmitriy,

You have to provide a valid IP during bootstrap: --mon-ip <ip>

<ip> must be a valid IP from some interface on the current node.
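
For example (the address below is only a placeholder; pick one that
`ip -4 addr show` reports on the bootstrap node):

$ ip -4 addr show                           # list this node's IPv4 addresses
$ cephadm bootstrap --mon-ip 192.168.1.10   # replace with one of those addresses

The "Cannot assign requested address" error on port 3300 usually means the
given --mon-ip is not bound to any local interface, so the monitor cannot
listen on it.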

Regards,
Redouane.




On Thu, May 26, 2022 at 2:14 AM Dmitriy Trubov 
wrote:

> Hi,
>
> I'm trying to install ansible octopus with cephadm.
>
> Here is message I got:
>
> cephadm bootstrap --mon-ip 
> Verifying podman|docker is present...
> Verifying lvm2 is present...
> Verifying time synchronization is in place...
> Unit chronyd.service is enabled and running
> Repeating the final host check...
> podman (/bin/podman) version 3.3.1 is present
> systemctl is present
> lvcreate is present
> Unit chronyd.service is enabled and running
> Host looks OK
> Cluster fsid: 567fd526-dc86-11ec-886e-00505691f522
> Verifying IP  port 3300 ...
> ERROR: [Errno 99] Cannot assign requested address
>
> Port 3300 is free; I can bind to this port with an HTTP server.
>
> OS is centos8
> Any ideas?
>
> Best regards,
> Dmitriy
>
>
>